torchref.io.datasets.collection module

Dataset collection for handling multiple crystallographic datasets.

This module provides the DatasetCollection class for managing multiple related ReflectionData objects, useful for joint refinement, MAD phasing, and time-series crystallography.

class torchref.io.datasets.collection.DatasetCollection(hkl=None, F=None, F_sigma=None, I=None, I_sigma=None, rfree_flags=None, resolution=None, bin_indices=None, outlier_flags=None, phase=None, fom=None, _centric_flags=None, E=None, E_squared=None, F_squared_corrected=None, U_aniso=None, radial_shell_indices=None, cell=None, spacegroup=None, device=<factory>, verbose=1, rfree_source=None, amplitude_source=None, intensity_source=None, phase_source=None, wilson_b=None, wilson_b_structure=None, wilson_b_solvent=None, wilson_k_sol=None, outlier_detection_params=None, _datasets=<factory>, _dataset_order=<factory>, _reference_dataset=None, _common_hkl=None, _cell=None, _spacegroup=None, _resolution=None, _scale_factors=<factory>)[source]

Bases: CrystalDataset

Container for multiple related crystal datasets.

All datasets share a common HKL set for efficient computation. Datasets are aligned to a reference dataset (by default, the first one added), and reflections missing from the other datasets are masked out.
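The alignment described above can be pictured with a minimal pure-Python sketch. The helper `align_to_reference` is hypothetical and is not part of torchref; it only illustrates the idea of reordering a dataset onto the reference HKL list and recording a mask for reflections that are absent. The actual implementation presumably works on tensors and may differ in detail.

```python
# Hypothetical helper, not part of torchref: aligns one dataset to a
# reference HKL list and reports which reference reflections it contains.
def align_to_reference(reference_hkl, hkl, values):
    """Return (aligned_values, mask) in reference order.

    mask[i] is False where reference reflection i is missing from the dataset;
    the corresponding aligned value is NaN as a placeholder.
    """
    lookup = {tuple(h): v for h, v in zip(hkl, values)}
    aligned, mask = [], []
    for h in reference_hkl:
        key = tuple(h)
        present = key in lookup
        mask.append(present)
        aligned.append(lookup[key] if present else float("nan"))
    return aligned, mask


reference_hkl = [(1, 0, 0), (0, 1, 0), (0, 0, 1), (1, 1, 0)]
derivative_hkl = [(0, 0, 1), (1, 0, 0), (1, 1, 0)]  # (0, 1, 0) was not measured
derivative_F = [30.0, 10.0, 40.0]

aligned_F, mask = align_to_reference(reference_hkl, derivative_hkl, derivative_F)
print(mask)  # [True, False, True, True]
```

Keeping every dataset in the same reflection order means per-reflection comparisons reduce to elementwise operations, with the mask excluding the missing entries.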

Parameters:
  • verbose (int, optional) – Verbosity level (0=silent, 1=normal, 2=debug). Default is 1.

  • device (str, optional) – Device for tensors ('cpu', 'cuda', etc.). Defaults to the currently configured device (device.current).

hkl

Common HKL set for all datasets.

Type:

torch.Tensor

n_datasets

Number of datasets in collection.

Type:

int

Examples

from torchref.io import DatasetCollection, ReflectionData

collection = DatasetCollection(device='cuda')

native = ReflectionData().load_mtz('native.mtz')
derivative = ReflectionData().load_mtz('derivative.mtz')

collection.add_dataset('native', native, set_as_reference=True)
collection.add_dataset('derivative', derivative)

for name, dataset in collection:
    print(f"{name}: {len(dataset)} reflections")

# Access by name
native_F = collection['native'].F
add_dataset(name, dataset, set_as_reference=False)[source]

Add a dataset to the collection.

Parameters:
  • name (str) – Identifier for this dataset.

  • dataset (ReflectionData) – The dataset to add.

  • set_as_reference (bool, optional) – If True, use this dataset’s HKL as the reference. Default is False, but the first dataset added automatically becomes the reference.

Returns:

Self, for method chaining.

Return type:

DatasetCollection

Raises:

ValueError – If a dataset with the same name already exists.

Examples

collection = DatasetCollection()
collection.add_dataset('native', native_data, set_as_reference=True)
collection.add_dataset('derivative', derivative_data)
property hkl: Tensor | None

Common HKL set for all datasets.

property datasets: Dict[str, ReflectionData]

Access all datasets as a dictionary.

property n_datasets: int

Number of datasets in collection.

property reference_dataset: str | None

Name of the reference dataset.

property spacegroup: str | None

Space group of the reference dataset.

__getitem__(name)[source]

Get dataset by name.

Parameters:

name (str) – Name of the dataset.

Returns:

The requested dataset.

Return type:

ReflectionData

Raises:

KeyError – If the dataset name is not found.

__iter__()[source]

Iterate over (name, dataset) pairs in order of addition.

Yields:

tuple of (str, ReflectionData) – Name and dataset for each dataset in collection.

__len__()[source]

Number of reflections in the common HKL set. This is not the number of datasets; use n_datasets for that.

__contains__(name)[source]

Check if dataset exists in collection.
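The container methods above behave slightly differently from a plain dict, which is easy to overlook. The sketch below uses a hypothetical stand-in class, `MiniCollection`, that mirrors only the documented semantics (it is not the torchref implementation, and plain lists stand in for ReflectionData): iteration yields (name, dataset) pairs, `len()` counts reflections rather than datasets, and `in` tests dataset names.

```python
class MiniCollection:
    """Hypothetical stand-in mirroring the documented container semantics."""

    def __init__(self, common_hkl):
        self._common_hkl = list(common_hkl)
        self._datasets = {}  # dicts preserve insertion order

    def add_dataset(self, name, dataset):
        if name in self._datasets:
            raise ValueError(f"Dataset '{name}' already exists")
        self._datasets[name] = dataset
        return self  # returning self allows method chaining

    @property
    def n_datasets(self):
        return len(self._datasets)

    def __getitem__(self, name):
        return self._datasets[name]  # raises KeyError for unknown names

    def __iter__(self):
        # Yields (name, dataset) pairs, unlike dict iteration, which yields keys
        return iter(self._datasets.items())

    def __len__(self):
        # Number of reflections in the common HKL set, not number of datasets
        return len(self._common_hkl)

    def __contains__(self, name):
        return name in self._datasets


collection = MiniCollection(common_hkl=[(1, 0, 0), (0, 1, 0), (0, 0, 1)])
collection.add_dataset("native", [10.0, 20.0, 30.0]).add_dataset(
    "derivative", [11.0, 19.0, 33.0]
)

names = [name for name, _ in collection]
print(names)                   # ['native', 'derivative']
print(len(collection))         # 3 reflections
print(collection.n_datasets)   # 2 datasets
print("native" in collection)  # True
```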

__call__(mask=True)[source]

Return the data for all datasets, scaled if scale factors have been set.

Parameters:

mask (bool, optional) – Whether to apply masking. Default is True.

Returns:

Dictionary mapping name to (hkl, F, F_sigma, rfree) tuples.

Return type:

dict
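As an illustration of how the returned dictionary might be consumed, the sketch below unpacks each (hkl, F, F_sigma, rfree) tuple and computes a mean F/sigma per dataset. The `data` dictionary is a hand-written stand-in for the result of calling the collection; it uses plain lists rather than tensors, and the values and the boolean encoding of the rfree flags are assumptions made for the example.

```python
# Stand-in for the dictionary returned by calling a collection.
# Plain lists replace tensors; all values are made up for illustration.
data = {
    "native": ([(1, 0, 0), (0, 1, 0)], [10.0, 20.0], [0.5, 1.0], [False, True]),
    "derivative": ([(1, 0, 0), (0, 1, 0)], [12.0, 18.0], [0.5, 2.0], [False, True]),
}

summary = {}
for name, (hkl, F, F_sigma, rfree) in data.items():
    # Mean signal-to-noise of the amplitudes in this dataset
    mean_signal = sum(f / s for f, s in zip(F, F_sigma)) / len(F)
    summary[name] = (len(hkl), mean_signal)

for name, (n_refl, mean_signal) in summary.items():
    print(f"{name}: {n_refl} reflections, mean F/sigma = {mean_signal:.1f}")
```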

scale()[source]

Scale all datasets to a common reference scale.

This method optimizes the scaling parameters of all non-reference datasets to minimize the mean squared error between their structure factors and those of the reference dataset. The optimization corrects for both overall scale differences and anisotropy. It uses the L-BFGS optimizer with strong Wolfe line search to refine the scaling parameters over multiple optimization steps.

Returns:

The collection instance, for method chaining.

Return type:

DatasetCollection

Raises:

ValueError – If no reference dataset has been set before calling this method, or if the collection contains only the reference dataset. At least two datasets are required.

Notes

The reference dataset must be set before calling this method (for example via add_dataset(..., set_as_reference=True)). Scaling parameters are optimized for all datasets except the reference.
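To give a feel for the objective being minimized, here is a deliberately simplified sketch: a single overall scale factor k fitted by least squares against the reference amplitudes. This is not the method described above. It ignores anisotropy, and because the one-parameter problem has a closed-form solution it does not need L-BFGS. The function name `overall_scale` and the sample values are hypothetical.

```python
def overall_scale(F_ref, F):
    """Least-squares scale k minimizing sum((F_ref - k * F) ** 2).

    Setting the derivative with respect to k to zero gives
    k = sum(F_ref * F) / sum(F * F).
    """
    numerator = sum(r * f for r, f in zip(F_ref, F))
    denominator = sum(f * f for f in F)
    return numerator / denominator


def mean_squared_error(F_ref, F_scaled):
    return sum((r - s) ** 2 for r, s in zip(F_ref, F_scaled)) / len(F_ref)


F_ref = [10.0, 20.0, 30.0]  # reference amplitudes (made-up values)
F_obs = [5.0, 10.0, 15.0]   # same reflections on a different scale

k = overall_scale(F_ref, F_obs)
F_scaled = [k * f for f in F_obs]

print(k)                                    # 2.0
print(mean_squared_error(F_ref, F_scaled))  # 0.0
```

An anisotropic correction adds direction-dependent parameters, so the problem no longer has such a simple closed form, which is one reason an iterative optimizer is a natural choice there.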

keys()[source]

Return list of dataset names.

values()[source]

Return list of datasets.

items()[source]

Return list of (name, dataset) tuples.

__init__(hkl=None, F=None, F_sigma=None, I=None, I_sigma=None, rfree_flags=None, resolution=None, bin_indices=None, outlier_flags=None, phase=None, fom=None, _centric_flags=None, E=None, E_squared=None, F_squared_corrected=None, U_aniso=None, radial_shell_indices=None, cell=None, spacegroup=None, device=<factory>, verbose=1, rfree_source=None, amplitude_source=None, intensity_source=None, phase_source=None, wilson_b=None, wilson_b_structure=None, wilson_b_solvent=None, wilson_k_sol=None, outlier_detection_params=None, _datasets=<factory>, _dataset_order=<factory>, _reference_dataset=None, _common_hkl=None, _cell=None, _spacegroup=None, _resolution=None, _scale_factors=<factory>)
get(name, default=None)[source]

Get dataset by name with default fallback.

__repr__()[source]

String representation of collection.