Harmonization utilities
The fisseq_data_pipeline.harmonize module provides utilities for batch-effect correction using ComBat via the neuroHarmonize package.
Harmonization is an essential step when combining data from multiple
experiments or sources, ensuring that technical variation (batch effects)
does not obscure biological signal.
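To make the idea of a batch effect concrete, the following is a minimal sketch of a location/scale adjustment on a single feature. It is a deliberate simplification and not the ComBat algorithm used by neuroHarmonize (there is no empirical Bayes shrinkage and no covariate handling). The function name adjust_batches and the sample values are hypothetical and are not part of fisseq_data_pipeline.

```python
from math import sqrt
from statistics import mean


def adjust_batches(values, batches):
    """Simplified batch adjustment for one feature (NOT ComBat).

    Each value is standardized within its own batch, then rescaled to the
    pooled within-batch spread and shifted to the overall mean.
    """
    grand_mean = mean(values)

    # Per-batch mean and (population) standard deviation.
    stats = {}
    for batch in set(batches):
        members = [v for v, b in zip(values, batches) if b == batch]
        batch_mean = mean(members)
        batch_sd = sqrt(mean([(v - batch_mean) ** 2 for v in members]))
        stats[batch] = (batch_mean, batch_sd)

    # Pooled within-batch standard deviation across all samples.
    pooled_sd = sqrt(
        mean([(v - stats[b][0]) ** 2 for v, b in zip(values, batches)])
    )

    adjusted = []
    for value, batch in zip(values, batches):
        batch_mean, batch_sd = stats[batch]
        scale = batch_sd or 1.0  # guard against a batch with zero spread
        adjusted.append((value - batch_mean) / scale * pooled_sd + grand_mean)
    return adjusted


# Batch 1 is shifted upward and more spread out than batch 0.
values = [1.0, 3.0, 10.0, 14.0]
batches = [0, 0, 1, 1]
adjusted = adjust_batches(values, batches)
```

After the adjustment both batches share the same mean and spread, so the remaining differences between samples are no longer driven by batch membership.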
Overview
- fit_harmonizer: Learn a ComBat-based harmonization model from feature and metadata DataFrames.
- harmonize: Apply a fitted harmonization model to adjust new feature matrices for batch effects.
Example usage
import polars as pl
from fisseq_data_pipeline.harmonize import fit_harmonizer, harmonize
# Example feature matrix
feature_df = pl.DataFrame({
"gene1": [1.0, 2.0, 3.0, 4.0],
"gene2": [2.0, 3.0, 4.0, 5.0],
})
# Example metadata with batch column
meta_data_df = pl.DataFrame({
"_batch": [0, 0, 1, 1],
"_is_control": [True, True, True, False],
})
# Fit harmonizer (on control samples only)
harmonizer = fit_harmonizer(feature_df, meta_data_df, fit_only_on_control=True)
# Apply harmonization to full dataset
harmonized_df = harmonize(feature_df, meta_data_df, harmonizer)
print(harmonized_df)
API Reference
fisseq_data_pipeline.harmonize.fit_harmonizer(feature_df, meta_data_df, fit_only_on_control=False)
Fit a ComBat-based harmonization model using neuroHarmonize.
Parameters:
- feature_df (DataFrame): Feature matrix with shape (n_samples, n_features), numeric only.
- meta_data_df (DataFrame): Metadata aligned with rows in feature_df. Must contain a _batch column indicating batch membership. If fit_only_on_control=True, must also contain a boolean _is_control column.
- fit_only_on_control (bool, default: False): If True, compute the harmonization model only from control samples (rows where _is_control is True).
Returns:
- Harmonizer: A fitted harmonization model dictionary returned by neuroHarmonize.harmonizationLearn.
Source code in src/fisseq_data_pipeline/harmonize.py
def fit_harmonizer(
    feature_df: pl.DataFrame,
    meta_data_df: pl.DataFrame,
    fit_only_on_control: bool = False,
) -> Harmonizer:
    """
    Fit a ComBat-based harmonization model using `neuroHarmonize`.

    Parameters
    ----------
    feature_df : pl.DataFrame
        Feature matrix with shape (n_samples, n_features), numeric only.
    meta_data_df : pl.DataFrame
        Metadata aligned with rows in `feature_df`. Must contain a `_batch`
        column indicating batch membership. If `fit_only_on_control=True`,
        must also contain a boolean `_is_control` column.
    fit_only_on_control : bool, default=False
        If True, compute the harmonization model only from control samples
        (rows where `_is_control` is True).

    Returns
    -------
    Harmonizer
        A fitted harmonization model dictionary returned by
        ``neuroHarmonize.harmonizationLearn``.
    """
    if fit_only_on_control:
        logging.info(
            "Filtering control samples, number of samples before filtering=%d",
            len(feature_df),
        )
        # Filter features first, while meta_data_df still holds the full mask.
        feature_df = feature_df.filter(meta_data_df.get_column("_is_control"))
        meta_data_df = meta_data_df.filter(meta_data_df.get_column("_is_control"))
        logging.info(
            "Filtering complete, remaining train set samples shape=%s",
            feature_df.shape,
        )

    logging.info("Fitting harmonizer")
    # neuroHarmonize expects the batch covariate in a column named "SITE".
    covar_df = meta_data_df.select(pl.col("_batch").alias("SITE")).to_pandas()
    model, _ = neuroHarmonize.harmonizationLearn(feature_df.to_numpy(), covar_df)
    logging.info("Done")
    return model
fisseq_data_pipeline.harmonize.harmonize(feature_df, meta_data_df, harmonizer)
Apply a fitted harmonization model to adjust features for batch effects.
Parameters:
- feature_df (DataFrame): Feature matrix to harmonize; shape (n_samples, n_features).
- meta_data_df (DataFrame): Metadata aligned with rows in feature_df. Must contain a _batch column indicating batch membership.
- harmonizer (Harmonizer): A fitted model dictionary produced by fit_harmonizer.
Returns:
- DataFrame: Harmonized feature matrix with the same shape and column names as the input feature_df.
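ComBat estimates its correction parameters per batch, so a fitted model has parameters only for the batches it saw during fitting. A simple pre-check, sketched below, is to compare the _batch values used for fitting against those in the data to be harmonized. The helper unseen_batches is hypothetical and not part of fisseq_data_pipeline; how neuroHarmonize itself reacts to an unseen batch is not covered here.

```python
def unseen_batches(fit_batches, apply_batches):
    """Hypothetical helper: batch labels present at apply time but not at fit time."""
    seen = set(fit_batches)
    return sorted(set(apply_batches) - seen)


# Batch labels from the metadata used for fitting and for applying.
fit_batches = [0, 0, 1, 1]
apply_batches = [0, 1, 2]

new_batches = unseen_batches(fit_batches, apply_batches)
if new_batches:
    print(f"Batches not seen during fitting: {new_batches}")
```

With polars metadata, the label lists can be obtained with meta_data_df.get_column("_batch").to_list().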
Source code in src/fisseq_data_pipeline/harmonize.py
def harmonize(
    feature_df: pl.DataFrame,
    meta_data_df: pl.DataFrame,
    harmonizer: Harmonizer,
) -> pl.DataFrame:
    """
    Apply a fitted harmonization model to adjust features for batch effects.

    Parameters
    ----------
    feature_df : pl.DataFrame
        Feature matrix to harmonize; shape (n_samples, n_features).
    meta_data_df : pl.DataFrame
        Metadata aligned with rows in `feature_df`. Must contain a `_batch`
        column indicating batch membership.
    harmonizer : Harmonizer
        A fitted model dictionary produced by ``fit_harmonizer``.

    Returns
    -------
    pl.DataFrame
        Harmonized feature matrix with the same shape and column names
        as the input `feature_df`.
    """
    logging.info("Setting up harmonization")
    # neuroHarmonize expects the batch covariate in a column named "SITE".
    covar_df = meta_data_df.select(pl.col("_batch").alias("SITE")).to_pandas()

    logging.info("Applying harmonizer")
    harmonized_matrix = neuroHarmonize.harmonizationApply(
        feature_df.to_numpy(), covar_df, harmonizer
    )

    logging.info("Copying data")
    # Rebuild a polars DataFrame with the original column names.
    feature_df = pl.DataFrame(harmonized_matrix, schema=feature_df.columns)
    logging.info("Done")
    return feature_df