Merging SSX data in groups with xia2

The default behaviour of xia2.ssx / xia2.ssx_reduce is to merge all data into one merged MTZ file. However there are options to produce separate merged datasets based on metadata, which can be used for experiments such as dose series experiments. The most efficient way to process such data in xia2/DIALS is to integrate the data as standard and then use the features available in xia2.ssx_reduce to split the images at the point of merging.

Dose series - the dose_series_repeat option

For dose series data, the option dose_series_repeat= can be used to trigger merging into n groups based on the image number, e.g.:

xia2.ssx_reduce ../xia2-ssx/batch_*/integrated*.{expt,refl} dose_series_repeat=5

This covers the use case of an experiment where a repeat of n measurements are made at each location, before moving to the next location and repeating, thus creating a dataset where each block of n images form a dose series for a particular location/crystal (and the images are stored sequentially in this manner). Before merging, the image filepaths from the experiment files are inspected, and the experiments split accordingly based on their image index from the filepath (the formula for splitting is image-index modulo repeat = dose). For dose_series_repeat=5, the following directory structure would be created for the merging:

- data_reduction
    - merge
        - dose_1
        - dose_2
        - dose_3
        - dose_4
        - dose_5

with each dose folder containing a merged MTZ, the dials.merge output, as well as experiment and reflection files for the images for that particular dose. The experiment and reflection files can be used as input for subsequent merging jobs, for example with a specified resolution cutoff:

xia2.ssx_reduce steps=merge ../reduce/data_reduction/merge/dose_1/group*.{expt,refl} d_min=2.5

The experiment files can also be used to verify which images were split into which dose group.

Dose series - using a grouping.yml file

xia2.ssx also supports more generalised merging, to support the wide variety of experiments possible in serial crystallography, which can be specified using a YAML file with formalised definitions. An equivalent example to the above case is the example yaml file:

metadata:
  dose_point:                                      ## <- user-defined metadata name
    "path/to/example/image.h5" : "repeat=5"        ## <- the format here is image-file : value
grouping:
  merge_by:                                        ## <- indicator to xia2 that the following definitions are for merging
    values:
      - dose_point                                 ## <- reference to the user-defined metadata name above

In the grouping section, the specification is that the dose_point metadata item should be used for grouping. The metadata section specifies how the metadata is related to the image file, in this case a sequence that repeats every 5 images. To use this form of specifying the groupings, if the above were contained in the file grouping.yml, the command would be:

xia2.ssx_reduce ../xia2-ssx/batch_*/integrated*.{expt,refl} grouping=grouping.yml

Note that for grouping images with a file template, the general image template should be provided, with hashes replacing the image numbers, e.g. the ‘image-file’ specified in the YAML file would be “path/to/example/image_#####.cbf”.

The resulting directory structure is similar to above, with each grouping merged in a separate subfolder:

- data_reduction
    - merge
        - group_1
        - group_2
        - group_3
        - group_4
        - group_5

Generalised merge grouping on metadata

By formally defining the merge groupings with YAML file, one can generalise to more complicated groupings and options. An example use case is data in HDF5 format with a metadata array which can be used for grouping. The example below demonstates a few valid ways of specifying metadata:

metadata:
  dose:
    "path/to/example/image1.h5" : "path/to/example/image1.h5:/entry/dose" ## <- format is image-file: file:/path/to/metadata/array
    "path/to/example/image2.h5" : "meta2.h5:/entry/dose"                  ## <- metadata array does not need to be in the image file
    "path/to/example/image3.h5" : 0.0                                     ## <- all images at a dose value of 0.
grouping:
  merge_by:
    values:
      - dose
    tolerances:
      - 0.1

Note that for prcoessing a dataset containing images from multiple files, each file must have valid definition in the metadata section. The metadata for image1 is an array from the image file, however there is not a strict requirement for the metadata to be contained in the image file. As shown in the definition for image2, the metadata can be contained in a separate H5 file, the only requirement is that the length of the metadata array matches the number of images. The definition for image3 shows a case where the metadata is a constant value for that image file. Although not shown in this example, it is also possible to group by more than one metadata value, if they are specified in the values and metadata sections.

The more formalised definition of merge groupings is intended to support integration into automated processing infrastructure: experiment control software can write metadata into the image files and generate the grouping.yml to be input to xia2.ssx to correctly group the data in merging. It also facilities the integration of custom classification of images for merging into processing scripts.