Normalize

The fisseq_data_pipeline.normalize module provides utilities for computing and applying z-score normalization to feature matrices.

Normalization is typically run as part of the FISSEQ pipeline, but these functions can also be used independently when you need to standardize feature values.


Overview

  • Normalizer: A dataclass container storing per-feature means and standard deviations, plus a flag indicating whether statistics were computed batch-wise.
  • fit_normalizer: Compute normalization statistics from a data_lf LazyFrame containing feature columns and _meta_batch / _meta_is_control metadata columns.
  • normalize: Apply z-score normalization to a data_lf LazyFrame using a fitted Normalizer.

Example usage

import polars as pl
from fisseq_data_pipeline.normalize import fit_normalizer, normalize

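# Combined data LazyFrame: features + metadata columns.
# Each batch has two control rows, so the per-batch sample standard
# deviation is defined when fitting on controls only.
data_lf = pl.DataFrame({
    "_meta_batch":      ["A", "A", "A", "B", "B", "B"],
    "_meta_is_control": [True, False, True, True, True, False],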
    "f1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "f2": [10.0, 20.0, 30.0, 40.0, 50.0, 60.0],
}).lazy()

# Fit batch-wise normalizer on control samples only
normalizer = fit_normalizer(data_lf, fit_batch_wise=True, fit_only_on_control=True)

# Apply normalization
normalized_lf = normalize(data_lf, normalizer)

# Save for later reuse
normalizer.save("normalizer.pkl")

Global normalization

# Fit a single global normalizer across all samples
normalizer = fit_normalizer(data_lf, fit_batch_wise=False)
normalized_lf = normalize(data_lf, normalizer)

Notes

  • Columns with zero or near-zero variance are automatically dropped from the fitted Normalizer to avoid division-by-zero during normalization.
  • normalize drops feature columns that are absent from the Normalizer (e.g. columns removed during fitting due to zero variance) and logs a warning with the number of dropped columns.
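  • Metadata columns (prefixed _meta) are preserved in the output of normalize. With a globally fitted Normalizer (is_batch_wise=False), the implementation shown in the source below overwrites _meta_batch with the constant 0.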
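
The zero-variance rule in the first note can be illustrated with a small plain-Python sketch. It is not the library's implementation: the helper name find_zero_variance and the nested-dict input layout are hypothetical, chosen only for illustration. The threshold mirrors the float32 machine epsilon that the source compares against, and a feature is flagged as soon as any single batch falls at or below it.

```python
import statistics

# float32 machine epsilon (2**-23), the same threshold as np.finfo(np.float32).eps
FLOAT32_EPS = 2.0 ** -23


def find_zero_variance(columns_by_batch):
    """Return feature names whose sample std is <= FLOAT32_EPS in any batch.

    columns_by_batch maps batch -> {feature name -> list of values}
    (a hypothetical layout; each list needs at least two values).
    """
    flagged = []
    feature_names = list(next(iter(columns_by_batch.values())).keys())
    for name in feature_names:
        for batch_values in columns_by_batch.values():
            if statistics.stdev(batch_values[name]) <= FLOAT32_EPS:
                flagged.append(name)
                break  # one offending batch is enough to drop the feature
    return flagged


batches = {
    "A": {"f1": [1.0, 2.0, 3.0], "f2": [5.0, 5.0, 5.0]},
    "B": {"f1": [4.0, 5.0, 6.0], "f2": [7.0, 8.0, 9.0]},
}
dropped = find_zero_variance(batches)
print(dropped)  # ['f2']
```

Here f2 is constant in batch A only, yet it is dropped for every batch, which matches the per-column (not per-batch) removal described above.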

API reference


fisseq_data_pipeline.normalize.Normalizer dataclass

Container object storing per-feature normalization statistics.

Attributes:
  • means (DataFrame) –

    A DataFrame of shape (n_batches, n_features) containing the mean value of each feature for each batch. When batch-wise normalization is not used, this has shape (1, n_features).

  • stds (DataFrame) –

    A DataFrame of shape (n_batches, n_features) containing the standard deviation of each feature for each batch. When batch-wise normalization is not used, this has shape (1, n_features).

  • is_batch_wise (bool) –

    Whether statistics were computed batch-wise (True) or globally (False).

Source code in src/fisseq_data_pipeline/normalize.py
@dataclasses.dataclass
class Normalizer:
    """
    Container object storing per-feature normalization statistics.

    Attributes
    ----------
    means : pl.DataFrame
        A DataFrame of shape (n_batches, n_features) containing the mean value
        of each feature for each batch. When batch-wise normalization is not
        used, this has shape (1, n_features).
    stds : pl.DataFrame
        A DataFrame of shape (n_batches, n_features) containing the standard
        deviation of each feature for each batch. When batch-wise normalization
        is not used, this has shape (1, n_features).
    is_batch_wise : bool
        Whether statistics were computed batch-wise (True) or globally (False).
    """

    means: pl.DataFrame
    stds: pl.DataFrame
    is_batch_wise: bool

    def save(self, save_path: PathLike) -> None:
        """
        Serialize and save the Normalizer object to disk using pickle.

        This method stores the fitted normalization statistics — per-feature
        means, standard deviations, and batch-wise configuration flag — as a
        single binary file that can later be reloaded to reproduce the same
        normalization behavior.

        Parameters
        ----------
        save_path : PathLike
            Destination file path for the serialized Normalizer object. The file
            is written in binary format (typically named ``normalizer.pkl``).

        Notes
        -----
        - The file can be reloaded using ``pickle.load(open(path, "rb"))``.
        - Only the Normalizer object and its attributes are serialized; any
        external references (e.g., LazyFrames) are not included.
        - The resulting file is Python-version dependent and not guaranteed
        to be portable across major interpreter versions.

        Examples
        --------
        >>> normalizer = fit_normalizer(data_lf)
        >>> normalizer.save("output/normalizer.pkl")

        To reload later:
        >>> with open("output/normalizer.pkl", "rb") as f:
        ...     normalizer = pickle.load(f)
        """
        with open(save_path, "wb") as f:
            pickle.dump(self, f)

save(save_path)

Serialize and save the Normalizer object to disk using pickle.

This method stores the fitted normalization statistics — per-feature means, standard deviations, and batch-wise configuration flag — as a single binary file that can later be reloaded to reproduce the same normalization behavior.

Parameters:
  • save_path (PathLike) –

    Destination file path for the serialized Normalizer object. The file is written in binary format (typically named normalizer.pkl).

Notes
  • The file can be reloaded using pickle.load(open(path, "rb")).
  • Only the Normalizer object and its attributes are serialized; any external references (e.g., LazyFrames) are not included.
  • The resulting file is Python-version dependent and not guaranteed to be portable across major interpreter versions.

Examples:

>>> normalizer = fit_normalizer(data_lf)
>>> normalizer.save("output/normalizer.pkl")

To reload later:

>>> with open("output/normalizer.pkl", "rb") as f:
...     normalizer = pickle.load(f)
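
When reloading, it can help to confirm that the file really contains the expected object type. The following is a minimal sketch under stated assumptions: load_checked is a hypothetical helper, not part of fisseq_data_pipeline, and the demonstration pickles a plain dict as a stand-in for a Normalizer. Because unpickling can execute arbitrary code, only load files from a trusted source.

```python
import os
import pickle
import tempfile


def load_checked(path, expected_type):
    """Unpickle the file at path and verify the type of the result.

    Hypothetical helper for illustration; raises TypeError on a mismatch.
    """
    with open(path, "rb") as f:
        obj = pickle.load(f)
    if not isinstance(obj, expected_type):
        raise TypeError(
            f"expected {expected_type.__name__}, got {type(obj).__name__}"
        )
    return obj


# Demonstration with a stand-in object (a dict instead of a Normalizer).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "stats.pkl")
    with open(demo_path, "wb") as f:
        pickle.dump({"f1": (2.0, 1.0)}, f)
    restored = load_checked(demo_path, dict)

print(restored)  # {'f1': (2.0, 1.0)}
```

With the real class, the call would look like load_checked("output/normalizer.pkl", Normalizer).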
Source code in src/fisseq_data_pipeline/normalize.py
def save(self, save_path: PathLike) -> None:
    """
    Serialize and save the Normalizer object to disk using pickle.

    This method stores the fitted normalization statistics — per-feature
    means, standard deviations, and batch-wise configuration flag — as a
    single binary file that can later be reloaded to reproduce the same
    normalization behavior.

    Parameters
    ----------
    save_path : PathLike
        Destination file path for the serialized Normalizer object. The file
        is written in binary format (typically named ``normalizer.pkl``).

    Notes
    -----
    - The file can be reloaded using ``pickle.load(open(path, "rb"))``.
    - Only the Normalizer object and its attributes are serialized; any
    external references (e.g., LazyFrames) are not included.
    - The resulting file is Python-version dependent and not guaranteed
    to be portable across major interpreter versions.

    Examples
    --------
    >>> normalizer = fit_normalizer(data_lf)
    >>> normalizer.save("output/normalizer.pkl")

    To reload later:
    >>> with open("output/normalizer.pkl", "rb") as f:
    ...     normalizer = pickle.load(f)
    """
    with open(save_path, "wb") as f:
        pickle.dump(self, f)

fisseq_data_pipeline.normalize.fit_normalizer(data_lf, fit_only_on_control=False, fit_batch_wise=True)

Compute per-feature mean and standard deviation statistics for z-score normalization.

The normalizer can operate in two modes:

1. Global normalization (fit_batch_wise=False): All samples are assigned to a single synthetic batch (_meta_batch = 0). A single mean and standard deviation are computed per feature.

2. Batch-wise normalization (fit_batch_wise=True): Means and standard deviations are computed independently for each batch defined by the existing _meta_batch column.

If fit_only_on_control=True, rows are filtered using the boolean _meta_is_control column before statistics are computed.

Columns with zero or near-zero variance are identified and dropped from the returned statistics to avoid division-by-zero during normalization.
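
The two fitting modes and the control filter described above can be sketched in plain Python. This is an illustration under stated assumptions, not the library's code: fit_stats is a hypothetical name, rows are plain dicts rather than a LazyFrame, and feature columns are assumed to be those not prefixed with _meta. The sample standard deviation (ddof=1) is used, which is also the default of Polars' std.

```python
import statistics
from collections import defaultdict


def fit_stats(rows, fit_only_on_control=False, fit_batch_wise=True):
    """Return {batch: {feature: (mean, std)}} for rows given as dicts."""
    if fit_only_on_control:
        # Keep only rows flagged as controls before computing statistics.
        rows = [r for r in rows if r["_meta_is_control"]]

    grouped = defaultdict(list)
    for r in rows:
        # Global mode: every row goes to the single synthetic batch 0.
        key = r["_meta_batch"] if fit_batch_wise else 0
        grouped[key].append(r)

    stats = {}
    for batch, members in grouped.items():
        features = [k for k in members[0] if not k.startswith("_meta")]
        stats[batch] = {}
        for f in features:
            values = [m[f] for m in members]
            stats[batch][f] = (statistics.mean(values), statistics.stdev(values))
    return stats


rows = [
    {"_meta_batch": "A", "_meta_is_control": True, "f1": 1.0},
    {"_meta_batch": "A", "_meta_is_control": True, "f1": 2.0},
    {"_meta_batch": "A", "_meta_is_control": False, "f1": 3.0},
    {"_meta_batch": "B", "_meta_is_control": True, "f1": 4.0},
    {"_meta_batch": "B", "_meta_is_control": True, "f1": 5.0},
    {"_meta_batch": "B", "_meta_is_control": False, "f1": 6.0},
]

batch_stats = fit_stats(rows)                          # one entry per batch
global_stats = fit_stats(rows, fit_batch_wise=False)   # single entry keyed 0
control_stats = fit_stats(rows, fit_only_on_control=True)
```

Note that each group needs at least two rows for the sample standard deviation to be defined; statistics.stdev raises an error for a single value.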

Parameters:
  • data_lf (LazyFrame) –

    A LazyFrame containing numerical feature columns, a _meta_batch column (required if fit_batch_wise=True), and a _meta_is_control column (required if fit_only_on_control=True).

  • fit_only_on_control (bool, default: False ) –

    Whether to compute statistics only from rows where _meta_is_control == True.

  • fit_batch_wise (bool, default: True ) –

    Whether to compute statistics separately per batch. If False, all samples are assigned to a single batch.

Returns:
  • Normalizer

    A dataclass holding means (per-batch feature means), stds (per-batch feature standard deviations), and is_batch_wise (boolean flag matching the fit_batch_wise argument).

Source code in src/fisseq_data_pipeline/normalize.py
def fit_normalizer(
    data_lf: pl.LazyFrame,
    fit_only_on_control: bool = False,
    fit_batch_wise: bool = True,
) -> Normalizer:
    """
    Compute per-feature mean and standard deviation statistics for
    z-score normalization.

    The normalizer can operate in two modes:

    **1. Global normalization (`fit_batch_wise=False`)**
       All samples are assigned to a single synthetic batch
       (`_meta_batch = 0`). A single mean and standard deviation
       are computed per feature.

    **2. Batch-wise normalization (`fit_batch_wise=True`)**
       Means and standard deviations are computed independently for
       each batch defined by the existing `_meta_batch` column.

    If `fit_only_on_control=True`, rows are filtered using the
    boolean `_meta_is_control` column before statistics are computed.

    Columns with zero or near-zero variance are identified and dropped
    from the returned statistics to avoid division-by-zero during
    normalization.

    Parameters
    ----------
    data_lf : pl.LazyFrame
        A LazyFrame containing:
          - numerical feature columns
          - a `_meta_batch` column (if `fit_batch_wise=True`)
          - optionally a `_meta_is_control` column
    fit_only_on_control : bool, default False
        Whether to compute statistics only from rows where
        `_meta_is_control == True`.
    fit_batch_wise : bool, default True
        Whether to compute statistics separately per batch. If False,
        all samples are assigned to a single batch.

    Returns
    -------
    Normalizer
        A dataclass holding:
          - `means` : per-batch feature means
          - `stds` : per-batch feature standard deviations
          - `is_batch_wise` : boolean flag matching input
    """
    if fit_only_on_control:
        logging.info("Adding query to filter for control samples")
        data_lf = data_lf.filter(pl.col("_meta_is_control"))

    # Handle batch column (batch-wise vs global): in global mode every row
    # is assigned to a single synthetic batch 0.
    if not fit_batch_wise:
        data_lf = data_lf.with_columns(pl.lit(0).alias("_meta_batch"))

    logging.info("Adding normalization queries")
    agg_exprs = []
    for c in get_feature_cols(data_lf, as_string=True):
        agg_exprs.append(pl.col(c).mean().alias(f"{c}_mean"))
        agg_exprs.append(pl.col(c).std().alias(f"{c}_std"))

    agg_df = data_lf.group_by("_meta_batch").agg(agg_exprs)
    mean_cols = [c for c in agg_df.columns if c.endswith("_mean")]
    std_cols = [c for c in agg_df.columns if c.endswith("_std")]

    logging.info("Computing feature means")
    means = agg_df.select(
        pl.col("_meta_batch"),
        *[pl.col(c).alias(c.removesuffix("_mean")) for c in mean_cols],
    ).collect()

    logging.info("Computing feature standard deviations")
    stds = agg_df.select(
        pl.col("_meta_batch"),
        *[pl.col(c).alias(c.removesuffix("_std")) for c in std_cols],
    ).collect()

    logging.info("Scanning for zero variance columns")
    zero_var_cols = [
        c
        for c in get_feature_cols(data_lf, as_string=True)
        if (stds[c] <= np.finfo(np.float32).eps).any()
    ]

    if len(zero_var_cols) != 0:
        logging.warning("Dropping %d zero-variance columns", len(zero_var_cols))
        means = means.select(pl.exclude(zero_var_cols))
        stds = stds.select(pl.exclude(zero_var_cols))

    logging.info("Normalization statistics computed successfully")
    return Normalizer(means=means, stds=stds, is_batch_wise=fit_batch_wise)

fisseq_data_pipeline.normalize.normalize(data_lf, normalizer)

Apply z-score normalization to a LazyFrame using precomputed statistics.

Each feature column f is transformed into:

z_f = (f - f_mean) / f_std

where f_mean and f_std are taken from the corresponding row of normalizer.means and normalizer.stds.

If the normalizer was fitted batch-wise, the _meta_batch column of data_lf determines which batch statistics to apply. If the normalizer was fitted globally (is_batch_wise=False), a dummy batch index 0 is applied to all rows.

Columns that were removed during fitting (e.g., zero-variance features) are automatically dropped from data_lf prior to normalization.
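
As a worked illustration of the formula and the per-batch lookup described above, here is a minimal plain-Python sketch. It is not the library's implementation (which joins the statistics tables onto the LazyFrame); zscore_rows and the {batch: {feature: (mean, std)}} layout are hypothetical.

```python
def zscore_rows(rows, stats, is_batch_wise=True):
    """Apply z = (f - f_mean) / f_std using the statistics of each row's batch."""
    normalized = []
    for r in rows:
        # A globally fitted normalizer stores its statistics under batch 0.
        batch = r["_meta_batch"] if is_batch_wise else 0
        new_row = dict(r)
        for feature, (mean, std) in stats[batch].items():
            new_row[feature] = (r[feature] - mean) / std
        normalized.append(new_row)
    return normalized


stats = {"A": {"f1": (2.0, 1.0)}, "B": {"f1": (5.0, 1.0)}}
rows = [
    {"_meta_batch": "A", "f1": 1.0},
    {"_meta_batch": "A", "f1": 3.0},
    {"_meta_batch": "B", "f1": 6.0},
]
normalized = zscore_rows(rows, stats)
print([r["f1"] for r in normalized])  # [-1.0, 1.0, 1.0]
```

The value 6.0 in batch B maps to 1.0 because it is compared against batch B's mean of 5.0, not against a pooled mean.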

Parameters:
  • data_lf (LazyFrame) –

    A LazyFrame containing numerical feature columns and a _meta_batch column. The _meta_batch column is not required when the normalizer was fitted globally.

  • normalizer (Normalizer) –

    The object returned by fit_normalizer, containing the per-feature mean and standard deviation tables.

Returns:
  • LazyFrame

    A LazyFrame where all feature columns have been z-score normalized. Columns not present in the normalizer (e.g., removed features) are excluded.
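
The column reconciliation behind this return value can be sketched with set operations. This is a hedged illustration: split_columns is a hypothetical helper, and it assumes that feature columns are exactly those not prefixed with _meta (the actual get_feature_cols helper is not shown on this page).

```python
def split_columns(data_columns, normalizer_columns):
    """Split data columns into kept features, dropped features, and metadata."""
    features = {c for c in data_columns if not c.startswith("_meta")}
    known = {c for c in normalizer_columns if not c.startswith("_meta")}
    kept = sorted(features & known)      # normalized and returned
    dropped = sorted(features - known)   # absent from the normalizer
    meta = [c for c in data_columns if c.startswith("_meta")]
    return kept, dropped, meta


kept, dropped, meta = split_columns(
    ["_meta_batch", "f1", "f2", "f3"],   # columns of the input data
    ["_meta_batch", "f1", "f3"],         # columns of normalizer.stds
)
print(kept, dropped, meta)  # ['f1', 'f3'] ['f2'] ['_meta_batch']
```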

Source code in src/fisseq_data_pipeline/normalize.py
def normalize(data_lf: pl.LazyFrame, normalizer: Normalizer) -> pl.LazyFrame:
    """
    Apply z-score normalization to a LazyFrame using precomputed statistics.

    Each feature column `f` is transformed into:

        z_f = (f - f_mean) / f_std

    where `f_mean` and `f_std` are taken from the corresponding row of
    `normalizer.means` and `normalizer.stds`.

    If the normalizer was fitted batch-wise, the `_meta_batch` column of
    `data_lf` determines which batch statistics to apply. If the normalizer
    was fitted globally (`is_batch_wise=False`), a dummy batch index 0 is
    applied to all rows.

    Columns that were removed during fitting (e.g., zero-variance features)
    are automatically dropped from `data_lf` prior to normalization.

    Parameters
    ----------
    data_lf : pl.LazyFrame
        A LazyFrame containing numerical feature columns and a `_meta_batch`
        column (or none if global normalization is desired).
    normalizer : Normalizer
        The object returned by :func:`fit_normalizer`, containing the
        per-feature mean and standard deviation tables.

    Returns
    -------
    pl.LazyFrame
        A LazyFrame where all feature columns have been z-score normalized.
        Columns not present in the normalizer (e.g., removed features) are
        excluded.
    """
    logging.info("Creating normalization query")
    if not normalizer.is_batch_wise:
        data_lf = data_lf.with_columns(pl.lit(0).alias("_meta_batch"))

    feature_cols = set(get_feature_cols(data_lf, as_string=True))
    norm_cols = set(get_feature_cols(normalizer.stds, as_string=True))
    bad_cols = feature_cols - norm_cols
    data_lf = data_lf.select(pl.exclude(bad_cols))

    if len(bad_cols) > 0:
        logging.warning(
            "Dropped %d columns from data_lf not present in normalizer",
            len(bad_cols),
        )

    feature_columns = feature_cols.intersection(norm_cols)
    for suffix, df in [("_mean", normalizer.means), ("_std", normalizer.stds)]:
        data_lf = data_lf.join(df.lazy(), on="_meta_batch", how="left", suffix=suffix)

    meta_columns = [c for c in data_lf.columns if c.startswith("_meta")]
    data_lf = data_lf.with_columns(
        [
            ((pl.col(c) - pl.col(f"{c}_mean")) / pl.col(f"{c}_std")).alias(c)
            for c in feature_columns
        ]
    ).select(list(feature_columns) + meta_columns)

    return data_lf