Normalize
The fisseq_data_pipeline.normalize module provides utilities for computing
and applying z-score normalization to feature matrices.
Normalization is typically run as part of the FISSEQ pipeline, but these
functions can also be used independently when you need to standardize feature
values.
Overview
- Normalizer: A dataclass container storing per-feature means and standard deviations, plus a flag indicating whether statistics were computed batch-wise.
- fit_normalizer: Compute normalization statistics from a data_lf LazyFrame containing feature columns and the _meta_batch / _meta_is_control metadata columns.
- normalize: Apply z-score normalization to a data_lf LazyFrame using a fitted Normalizer.
Example usage
import polars as pl
from fisseq_data_pipeline.normalize import fit_normalizer, normalize
# Combined data LazyFrame: features + metadata columns
data_lf = pl.DataFrame({
"_meta_batch": ["A", "A", "A", "B", "B", "B"],
# Two control rows per batch, so each batch has a defined sample standard deviation
"_meta_is_control": [True, True, False, True, True, False],
"f1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
"f2": [10.0, 20.0, 30.0, 40.0, 50.0, 60.0],
}).lazy()
# Fit batch-wise normalizer on control samples only
normalizer = fit_normalizer(data_lf, fit_batch_wise=True, fit_only_on_control=True)
# Apply normalization
normalized_lf = normalize(data_lf, normalizer)
# Save for later reuse
normalizer.save("normalizer.pkl")
Global normalization
# Fit a single global normalizer across all samples
normalizer = fit_normalizer(data_lf, fit_batch_wise=False)
normalized_lf = normalize(data_lf, normalizer)
Notes
- Columns with zero or near-zero variance are automatically dropped from the fitted Normalizer to avoid division by zero during normalization.
- normalize drops feature columns that are absent from the Normalizer (e.g. columns removed during fitting due to zero variance) and logs a warning when it does so.
- Metadata columns (prefixed _meta) are kept in the output of normalize. When the Normalizer was fitted globally, the _meta_batch column in the output is overwritten with the dummy value 0.
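The zero-variance rule in the first note can be illustrated without the pipeline itself. The following is a minimal plain-Python sketch, not the library's code: the `features` dict and the `EPS` constant are assumptions made for illustration (the source compares against the float32 machine epsilon, and drops a column if the condition holds in any batch).

```python
import statistics

# Hypothetical toy data: feature name -> values (a stand-in for the LazyFrame).
features = {
    "f1": [1.0, 2.0, 3.0],
    "f2": [5.0, 5.0, 5.0],  # constant column, so its standard deviation is 0
}

# Assumed threshold, approximately the float32 machine epsilon.
EPS = 1.1920929e-07

# Sample standard deviation per feature.
stds = {name: statistics.stdev(values) for name, values in features.items()}

# Keep only features whose spread is large enough to divide by safely.
kept = sorted(name for name, s in stds.items() if s > EPS)
dropped = sorted(name for name, s in stds.items() if s <= EPS)

print("kept:", kept)
print("dropped:", dropped)
```

Dividing by a standard deviation of zero would produce infinities or NaNs, which is why such columns are removed at fit time rather than handled during normalization.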
API reference
fisseq_data_pipeline.normalize.Normalizer
dataclass
Container object storing per-feature normalization statistics.
Attributes:
- means (DataFrame): A DataFrame with one row per batch, containing the _meta_batch key column and the mean value of each retained feature for that batch. When batch-wise normalization is not used, it has a single row.
- stds (DataFrame): A DataFrame with one row per batch, containing the _meta_batch key column and the standard deviation of each retained feature for that batch. When batch-wise normalization is not used, it has a single row.
- is_batch_wise (bool): Whether statistics were computed batch-wise (True) or globally (False).
Source code in src/fisseq_data_pipeline/normalize.py
@dataclasses.dataclass
class Normalizer:
    """
    Container object storing per-feature normalization statistics.

    Attributes
    ----------
    means : pl.DataFrame
        A DataFrame with one row per batch, containing the ``_meta_batch``
        key column and the mean value of each retained feature. When
        batch-wise normalization is not used, it has a single row.
    stds : pl.DataFrame
        A DataFrame with one row per batch, containing the ``_meta_batch``
        key column and the standard deviation of each retained feature. When
        batch-wise normalization is not used, it has a single row.
    is_batch_wise : bool
        Whether statistics were computed batch-wise (True) or globally (False).
    """

    means: pl.DataFrame
    stds: pl.DataFrame
    is_batch_wise: bool

    def save(self, save_path: PathLike) -> None:
        """
        Serialize and save the Normalizer object to disk using pickle.

        This method stores the fitted normalization statistics (per-feature
        means, standard deviations, and the batch-wise configuration flag) as
        a single binary file that can later be reloaded to reproduce the same
        normalization behavior.

        Parameters
        ----------
        save_path : PathLike
            Destination file path for the serialized Normalizer object. The file
            is written in binary format (typically named ``normalizer.pkl``).

        Notes
        -----
        - The file can be reloaded using ``pickle.load(open(path, "rb"))``.
        - Only the Normalizer object and its attributes are serialized; any
          external references (e.g., LazyFrames) are not included.
        - The resulting file is Python-version dependent and not guaranteed
          to be portable across major interpreter versions.

        Examples
        --------
        >>> normalizer = fit_normalizer(data_lf)
        >>> normalizer.save("output/normalizer.pkl")

        To reload later:

        >>> with open("output/normalizer.pkl", "rb") as f:
        ...     normalizer = pickle.load(f)
        """
        with open(save_path, "wb") as f:
            pickle.dump(self, f)
|
save(save_path)
Serialize and save the Normalizer object to disk using pickle.
This method stores the fitted normalization statistics — per-feature
means, standard deviations, and batch-wise configuration flag — as a
single binary file that can later be reloaded to reproduce the same
normalization behavior.
Parameters:
- save_path (PathLike): Destination file path for the serialized Normalizer object. The file is written in binary format (typically named normalizer.pkl).
Notes
- The file can be reloaded using
pickle.load(open(path, "rb")).
- Only the Normalizer object and its attributes are serialized; any
external references (e.g., LazyFrames) are not included.
- The resulting file is Python-version dependent and not guaranteed
to be portable across major interpreter versions.
Examples:
>>> normalizer = fit_normalizer(data_lf)
>>> normalizer.save("output/normalizer.pkl")
To reload later:
>>> with open("output/normalizer.pkl", "rb") as f:
... normalizer = pickle.load(f)
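The save-and-reload cycle can be tried in isolation. This is a minimal sketch that assumes a plain dict (`stats`, a hypothetical stand-in) in place of a fitted Normalizer, so it runs without the pipeline installed; the file handling mirrors what save does.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for a fitted Normalizer: plain Python data only.
stats = {
    "means": {"f1": 2.0, "f2": 20.0},
    "stds": {"f1": 1.0, "f2": 10.0},
    "is_batch_wise": False,
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "normalizer.pkl")

    # Write in binary mode, as save() does.
    with open(path, "wb") as f:
        pickle.dump(stats, f)

    # Read back inside a context manager so the file handle is closed promptly.
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == stats)
```

Unpickling executes code embedded in the file, so only load normalizer files from sources you trust.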
fisseq_data_pipeline.normalize.fit_normalizer(data_lf, fit_only_on_control=False, fit_batch_wise=True)
Compute per-feature mean and standard deviation statistics for
z-score normalization.
The normalizer can operate in two modes:
1. Global normalization (fit_batch_wise=False)
All samples are assigned to a single synthetic batch
(_meta_batch = 0). A single mean and standard deviation
are computed per feature.
2. Batch-wise normalization (fit_batch_wise=True)
Means and standard deviations are computed independently for
each batch defined by the existing _meta_batch column.
If fit_only_on_control=True, rows are filtered using the
boolean _meta_is_control column before statistics are computed.
Columns with zero or near-zero variance are identified and dropped
from the returned statistics to avoid division-by-zero during
normalization.
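The difference between the two modes can be sketched in plain Python. This is an illustration under stated assumptions, not the library's implementation: `rows` and `fit_stats` are hypothetical names, and the sample standard deviation (ddof=1) is used because that is the default of Polars' `std()`.

```python
import statistics
from collections import defaultdict

# Hypothetical rows: (batch label, value of a single feature).
rows = [("A", 1.0), ("A", 2.0), ("A", 3.0), ("B", 4.0), ("B", 5.0), ("B", 6.0)]


def fit_stats(rows, batch_wise):
    """Return {batch: (mean, sample std)}; key 0 is the single synthetic batch."""
    groups = defaultdict(list)
    for batch, value in rows:
        # In global mode every row is assigned to the same synthetic batch 0.
        groups[batch if batch_wise else 0].append(value)
    return {
        batch: (statistics.mean(values), statistics.stdev(values))
        for batch, values in groups.items()
    }


batch_stats = fit_stats(rows, batch_wise=True)    # one entry per batch
global_stats = fit_stats(rows, batch_wise=False)  # a single entry under key 0

print(batch_stats)
print(global_stats)
```

Note that `statistics.stdev` raises an error for a group with fewer than two values; the sample standard deviation is undefined in that case, so each batch used for fitting (after any control filtering) should contain at least two rows.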
Parameters:
- data_lf (LazyFrame): A LazyFrame containing:
  - numerical feature columns
  - a _meta_batch column (if fit_batch_wise=True)
  - a _meta_is_control column (required only if fit_only_on_control=True)
- fit_only_on_control (bool, default: False): Whether to compute statistics only from rows where _meta_is_control == True.
- fit_batch_wise (bool, default: True): Whether to compute statistics separately per batch. If False, all samples are assigned to a single batch.
Returns:
- Normalizer: A dataclass holding:
  - means: per-batch feature means
  - stds: per-batch feature standard deviations
  - is_batch_wise: boolean flag matching the fit_batch_wise argument
Source code in src/fisseq_data_pipeline/normalize.py
def fit_normalizer(
    data_lf: pl.LazyFrame,
    fit_only_on_control: bool = False,
    fit_batch_wise: bool = True,
) -> Normalizer:
    """
    Compute per-feature mean and standard deviation statistics for
    z-score normalization.

    The normalizer can operate in two modes:

    **1. Global normalization (`fit_batch_wise=False`)**
       All samples are assigned to a single synthetic batch
       (`_meta_batch = 0`). A single mean and standard deviation
       are computed per feature.

    **2. Batch-wise normalization (`fit_batch_wise=True`)**
       Means and standard deviations are computed independently for
       each batch defined by the existing `_meta_batch` column.

    If `fit_only_on_control=True`, rows are filtered using the
    boolean `_meta_is_control` column before statistics are computed.

    Columns with zero or near-zero variance are identified and dropped
    from the returned statistics to avoid division-by-zero during
    normalization.

    Parameters
    ----------
    data_lf : pl.LazyFrame
        A LazyFrame containing:
        - numerical feature columns
        - a `_meta_batch` column (if `fit_batch_wise=True`)
        - optionally a `_meta_is_control` column
    fit_only_on_control : bool, default False
        Whether to compute statistics only from rows where
        `_meta_is_control == True`.
    fit_batch_wise : bool, default True
        Whether to compute statistics separately per batch. If False,
        all samples are assigned to a single batch.

    Returns
    -------
    Normalizer
        A dataclass holding:
        - `means` : per-batch feature means
        - `stds` : per-batch feature standard deviations
        - `is_batch_wise` : boolean flag matching input
    """
    if fit_only_on_control:
        logging.info("Adding query to filter for control samples")
        data_lf = data_lf.filter(pl.col("_meta_is_control"))

    # Handle batch column (batch-wise vs global)
    if not fit_batch_wise:
        data_lf = data_lf.with_columns(pl.lit(0).alias("_meta_batch"))

    logging.info("Adding normalization queries")
    agg_exprs = []
    for c in get_feature_cols(data_lf, as_string=True):
        agg_exprs.append(pl.col(c).mean().alias(f"{c}_mean"))
        agg_exprs.append(pl.col(c).std().alias(f"{c}_std"))

    agg_df = data_lf.group_by("_meta_batch").agg(agg_exprs)
    mean_cols = [c for c in agg_df.columns if c.endswith("_mean")]
    std_cols = [c for c in agg_df.columns if c.endswith("_std")]

    logging.info("Computing feature means")
    means = agg_df.select(
        pl.col("_meta_batch"),
        *[pl.col(c).alias(c.removesuffix("_mean")) for c in mean_cols],
    ).collect()

    logging.info("Computing feature standard deviations")
    stds = agg_df.select(
        pl.col("_meta_batch"),
        *[pl.col(c).alias(c.removesuffix("_std")) for c in std_cols],
    ).collect()

    logging.info("Scanning for zero variance columns")
    # A feature is dropped if its std is at or below float32 epsilon in any batch
    zero_var_cols = [
        c
        for c in get_feature_cols(data_lf, as_string=True)
        if (stds[c] <= np.finfo(np.float32).eps).any()
    ]

    if len(zero_var_cols) != 0:
        logging.warning("Dropping %d zero-variance columns", len(zero_var_cols))
        means = means.select(pl.exclude(zero_var_cols))
        stds = stds.select(pl.exclude(zero_var_cols))

    logging.info("Normalization statistics computed successfully")
    return Normalizer(means=means, stds=stds, is_batch_wise=fit_batch_wise)
|
fisseq_data_pipeline.normalize.normalize(data_lf, normalizer)
Apply z-score normalization to a LazyFrame using precomputed statistics.
Each feature column f is transformed into:
z_f = (f - f_mean) / f_std
where f_mean and f_std are taken from the corresponding row of
normalizer.means and normalizer.stds.
If the normalizer was fitted batch-wise, the _meta_batch column of
data_lf determines which batch statistics to apply. If the normalizer
was fitted globally (is_batch_wise=False), a dummy batch index 0 is
applied to all rows.
Columns that were removed during fitting (e.g., zero-variance features)
are automatically dropped from data_lf prior to normalization.
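The transformation described above can be sketched in plain Python for a single feature. This is an illustration, not the library's code: `stats`, `rows`, and `zscore_rows` are hypothetical names. The source looks statistics up with a left join on _meta_batch, so a row whose batch has no fitted statistics ends up with null values; the sketch mimics that with `None`.

```python
# Hypothetical fitted statistics for one feature: batch -> (mean, std).
stats = {"A": (2.0, 1.0), "B": (5.0, 1.0)}

# Hypothetical rows: (batch label, raw feature value). Batch "C" was never fitted.
rows = [("A", 1.0), ("A", 3.0), ("B", 6.0), ("C", 7.0)]


def zscore_rows(rows, stats):
    """Apply (value - mean) / std using the statistics of each row's batch."""
    out = []
    for batch, value in rows:
        if batch not in stats:
            # No statistics for this batch: emulate the null a left join yields.
            out.append((batch, None))
            continue
        mean, std = stats[batch]
        out.append((batch, (value - mean) / std))
    return out


normalized = zscore_rows(rows, stats)
print(normalized)
```

For a globally fitted normalizer the same logic applies with every row mapped to the single batch key 0.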
Parameters:
- data_lf (LazyFrame): A LazyFrame containing numerical feature columns and a _meta_batch column. The _meta_batch column is not required when the normalizer was fitted globally.
- normalizer (Normalizer): The object returned by fit_normalizer, containing the per-feature mean and standard deviation tables.
Returns:
- LazyFrame: A LazyFrame where all feature columns have been z-score normalized. Columns not present in the normalizer (e.g., removed features) are excluded.
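Which columns end up in the result follows from simple set arithmetic on column names. A minimal sketch, using hypothetical column names:

```python
# Hypothetical column names, for illustration only.
data_features = {"f1", "f2", "f3"}   # feature columns found in the input
normalizer_features = {"f1", "f3"}   # features covered by the fitted statistics
meta_columns = ["_meta_batch", "_meta_is_control"]

# Features the normalizer has no statistics for are dropped (with a warning).
dropped = sorted(data_features - normalizer_features)

# Only features known to both sides are normalized.
normalized = sorted(data_features & normalizer_features)

# The result holds the normalized features followed by the metadata columns.
output_columns = normalized + meta_columns

print("dropped:", dropped)
print("output:", output_columns)
```

The sketch sorts the names to get a deterministic order. The source builds the feature list from a Python set, so the order of feature columns in the output should not be relied upon.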
Source code in src/fisseq_data_pipeline/normalize.py
def normalize(data_lf: pl.LazyFrame, normalizer: Normalizer) -> pl.LazyFrame:
    """
    Apply z-score normalization to a LazyFrame using precomputed statistics.

    Each feature column `f` is transformed into:

        z_f = (f - f_mean) / f_std

    where `f_mean` and `f_std` are taken from the corresponding row of
    `normalizer.means` and `normalizer.stds`.

    If the normalizer was fitted batch-wise, the `_meta_batch` column of
    `data_lf` determines which batch statistics to apply. If the normalizer
    was fitted globally (`is_batch_wise=False`), a dummy batch index 0 is
    applied to all rows.

    Columns that were removed during fitting (e.g., zero-variance features)
    are automatically dropped from `data_lf` prior to normalization.

    Parameters
    ----------
    data_lf : pl.LazyFrame
        A LazyFrame containing numerical feature columns and a `_meta_batch`
        column. The `_meta_batch` column is not required when the normalizer
        was fitted globally.
    normalizer : Normalizer
        The object returned by :func:`fit_normalizer`, containing the
        per-feature mean and standard deviation tables.

    Returns
    -------
    pl.LazyFrame
        A LazyFrame where all feature columns have been z-score normalized.
        Columns not present in the normalizer (e.g., removed features) are
        excluded.
    """
    logging.info("Creating normalization query")
    if not normalizer.is_batch_wise:
        # Global mode: every row joins against the single synthetic batch 0
        data_lf = data_lf.with_columns(pl.lit(0).alias("_meta_batch"))

    feature_cols = set(get_feature_cols(data_lf, as_string=True))
    norm_cols = set(get_feature_cols(normalizer.stds, as_string=True))

    # Drop features for which the normalizer holds no statistics
    bad_cols = feature_cols - norm_cols
    data_lf = data_lf.select(pl.exclude(bad_cols))
    if len(bad_cols) > 0:
        logging.warning(
            "Dropped %d columns from data_lf not present in normalizer",
            len(bad_cols),
        )

    feature_columns = feature_cols.intersection(norm_cols)

    # Attach per-batch statistics as `<feature>_mean` and `<feature>_std` columns
    for suffix, df in [("_mean", normalizer.means), ("_std", normalizer.stds)]:
        data_lf = data_lf.join(df.lazy(), on="_meta_batch", how="left", suffix=suffix)

    meta_columns = [c for c in data_lf.columns if c.startswith("_meta")]
    data_lf = data_lf.with_columns(
        [
            ((pl.col(c) - pl.col(f"{c}_mean")) / pl.col(f"{c}_std")).alias(c)
            for c in feature_columns
        ]
    ).select(list(feature_columns) + meta_columns)

    return data_lf
|