Configuration utilities

The fisseq_data_pipeline.utils.config module provides a Config class for managing pipeline configuration. It loads configuration values from YAML files, Python dictionaries, or other Config objects, and validates the keys of file-based and dictionary-based input against a default configuration.

Overview

  • Config: A wrapper around a validated configuration dictionary.
    • Loads from a path, a dictionary, or another Config, or falls back to the default config.yaml when given None.
    • Allows access via both attribute-style (cfg.feature_cols) and dictionary-style (cfg["feature_cols"]).
    • When loading from a path or a dictionary, fills in missing keys from the default configuration and removes keys that the default configuration does not define.

  • DEFAULT_CFG_PATH: The path to the default configuration YAML file that ships with the pipeline.

Example usage

from fisseq_data_pipeline.utils.config import Config

# Load default configuration
cfg = Config(None)

# Load from a YAML file
cfg = Config("my_config.yaml")

# Load from a Python dict
cfg = Config({"feature_cols": ["f1", "f2"], "_batch": "batch"})

# Load from an existing Config
cfg2 = Config(cfg)

# Access values
print(cfg.feature_cols)
print(cfg["_batch"])

Validation Behavior

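When initializing a Config from a path or a dictionary:

  • Keys not present in the default configuration are removed, and a warning is logged for each one.
  • Missing keys are filled with the default values from config.yaml, and a warning is logged for each one.

This ensures that the configuration always has exactly the keys defined by the pipeline defaults. Only keys are checked; values are not type-checked. A dictionary passed to Config is modified in place during validation, and a Config built from another Config reuses that object's data without re-validating it.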
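
The two rules above can be sketched as a standalone function. This is a minimal illustration rather than the module's own code path: the DEFAULTS dictionary and the verify_config name are hypothetical stand-ins for the contents of config.yaml and for the class's internal check, and the sketch returns a new dictionary instead of modifying its input.

```python
import logging

# Hypothetical defaults standing in for the contents of config.yaml.
DEFAULTS = {"feature_cols": ["f1", "f2"], "_batch": "batch"}


def verify_config(cfg_data: dict, defaults: dict = DEFAULTS) -> dict:
    """Return a copy of cfg_data limited to the default keys, with gaps filled."""
    verified = {}

    # Rule 1: drop keys that the defaults do not define.
    for key, value in cfg_data.items():
        if key in defaults:
            verified[key] = value
        else:
            logging.warning("Removing invalid config option %s", key)

    # Rule 2: fill in any key the caller did not provide.
    for key, default in defaults.items():
        if key not in verified:
            logging.warning("Key %s missing, using default %s", key, default)
            verified[key] = default

    return verified


user_cfg = {"feature_cols": ["a", "b"], "unknown_option": 1}
result = verify_config(user_cfg)
print(result)
```

Here unknown_option is dropped from the result and _batch is filled from the hypothetical defaults, while user_cfg itself is left unchanged.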

API Reference


fisseq_data_pipeline.utils.config.Config

A configuration object that wraps a dictionary of key-value pairs loaded from a provided path, dictionary, or another Config instance. If no configuration is provided, the default configuration file is used.

Parameters:
  • config (PathLike or dict or Config or None) –
    • If None, the default configuration file path is used.
    • If a dict, the dictionary is validated and used directly.
    • If a PathLike, the configuration is loaded from the YAML file.
    • If a Config, the underlying configuration data is reused.
Source code in src/fisseq_data_pipeline/utils/config.py
class Config:
    """
    A configuration object that wraps a dictionary of key-value pairs
    loaded from a provided path, dictionary, or another ``Config`` instance.
    If no configuration is provided, the default configuration file is used.

    Parameters
    ----------
    config : PathLike or dict or Config, optional
        - If ``None``, the default configuration file path is used.
        - If a ``dict``, the dictionary is validated and used directly.
        - If a ``PathLike``, the configuration is loaded from the YAML file.
        - If a ``Config``, the underlying configuration data is reused.
    """

    def __init__(self, config: Optional[PathLike | ConfigDict | "Config"]):
        if config is None:
            logging.info("No config provided, using default config")
            config = DEFAULT_CFG_PATH

        if isinstance(config, Config):
            data = config._data
        else:
            if isinstance(config, dict):
                data = config
            else:
                config = pathlib.Path(config)
                with config.open("r") as f:
                    data = yaml.safe_load(f)

            data = self._verify_config(data)

        logging.debug("Using config %s", config)
        self._data = data

    def __getattr__(self, name: str) -> Any:
        """Retrieve a configuration value as an attribute."""
        return self._data[name]

    def __getitem__(self, key: str) -> Any:
        """Retrieve a configuration value using dictionary-style indexing."""
        return self.__getattr__(key)

    def _verify_config(self, cfg_data: ConfigDict) -> ConfigDict:
        """
        Verify the provided configuration against the default configuration.
        Invalid keys are removed, and missing keys are filled with defaults.

        Parameters
        ----------
        cfg_data : dict
            The configuration data to verify.

        Returns
        -------
        dict
            The validated configuration dictionary with defaults applied.
        """
        with DEFAULT_CFG_PATH.open("r") as f:
            default_data = yaml.safe_load(f)

        for key in list(cfg_data.keys()):
            if key in default_data:
                continue

            logging.warning(
                "Removing invalid config option %s from provided config", key
            )
            del cfg_data[key]

        for key in list(default_data.keys()):
            if key in cfg_data:
                continue

            logging.warning(
                "Key %s not in provided config using default value of %s",
                key,
                default_data[key],
            )
            cfg_data[key] = default_data[key]

        return cfg_data

__getattr__(name)

Retrieve a configuration value as an attribute.

Source code in src/fisseq_data_pipeline/utils/config.py
def __getattr__(self, name: str) -> Any:
    """Retrieve a configuration value as an attribute."""
    return self._data[name]

__getitem__(key)

Retrieve a configuration value using dictionary-style indexing.

Source code in src/fisseq_data_pipeline/utils/config.py
def __getitem__(self, key: str) -> Any:
    """Retrieve a configuration value using dictionary-style indexing."""
    return self.__getattr__(key)
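
Because both access styles read from the same underlying dictionary, they can be illustrated with a small stand-in class. The MiniConfig name is hypothetical and the class is only a sketch. Unlike the listings above, it converts a missing key into AttributeError for attribute access; that is an assumption about desirable behavior (it lets hasattr work as usual), not something the module is documented to do.

```python
from typing import Any


class MiniConfig:
    """Hypothetical stand-in showing attribute- and dictionary-style access."""

    def __init__(self, data: dict):
        self._data = dict(data)

    def __getattr__(self, name: str) -> Any:
        # Python calls __getattr__ only when normal attribute lookup fails.
        # Guard against recursion in case _data itself has not been set yet.
        if name == "_data":
            raise AttributeError(name)
        try:
            return self._data[name]
        except KeyError:
            # Assumption: report missing keys as missing attributes.
            raise AttributeError(name) from None

    def __getitem__(self, key: str) -> Any:
        return self._data[key]


cfg = MiniConfig({"feature_cols": ["f1", "f2"], "_batch": "batch"})
print(cfg.feature_cols)
print(cfg["_batch"])
print(hasattr(cfg, "missing"))
```

With a __getattr__ that lets KeyError escape, hasattr(cfg, "missing") would raise instead of returning False, because hasattr only treats AttributeError as "not present".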

fisseq_data_pipeline.utils.config.DEFAULT_CFG_PATH (module attribute)

DEFAULT_CFG_PATH = pathlib.Path(__file__).parent.parent / 'config.yaml'
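
To make the path expression concrete, the sketch below applies the same .parent.parent steps to the module's source location as listed in this reference. PurePosixPath is used so the result does not depend on the local filesystem; the real value is built from __file__ and so depends on where the package is installed.

```python
from pathlib import PurePosixPath

# Source location of the module, as listed in this reference.
module_file = PurePosixPath("src/fisseq_data_pipeline/utils/config.py")

# Same expression as DEFAULT_CFG_PATH, with module_file in place of __file__.
default_cfg_path = module_file.parent.parent / "config.yaml"
print(default_cfg_path)
```

The first .parent gives the utils directory and the second gives the package directory, so the default config.yaml sits alongside utils rather than inside it.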