Harmonization utilities

The fisseq_data_pipeline.harmonize module provides utilities for batch-effect correction using ComBat via the neuroHarmonize package.

Harmonization is an essential step when combining data from multiple experiments or sources: it reduces technical variation (batch effects) that could otherwise obscure biological signal.


Overview

  • fit_harmonizer: Learn a ComBat-based harmonization model from feature and metadata DataFrames.
  • harmonize: Apply a fitted harmonization model to adjust new feature matrices for batch effects.

Example usage

import polars as pl
from fisseq_data_pipeline.harmonize import fit_harmonizer, harmonize

# Example feature matrix
feature_df = pl.DataFrame({
    "gene1": [1.0, 2.0, 3.0, 4.0],
    "gene2": [2.0, 3.0, 4.0, 5.0],
})

# Example metadata with batch column
meta_data_df = pl.DataFrame({
    "_batch": [0, 0, 1, 1],
    "_is_control": [True, True, True, False],
})

# Fit harmonizer (on control samples only)
harmonizer = fit_harmonizer(feature_df, meta_data_df, fit_only_on_control=True)

# Apply harmonization to full dataset
harmonized_df = harmonize(feature_df, meta_data_df, harmonizer)

print(harmonized_df)
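
When fitting on controls only, it can help to check how many control rows each batch contributes before calling fit_harmonizer. The sketch below is a hypothetical helper, not part of fisseq_data_pipeline. It works on plain Python lists (for example, columns pulled out of the metadata with to_list()) and uses the same metadata values as the example above. It flags batches with fewer than two controls, since a single sample gives no sample variance estimate.

```python
def count_controls_per_batch(batches, is_control):
    """Count control rows in each batch (hypothetical helper for illustration)."""
    if len(batches) != len(is_control):
        raise ValueError("_batch and _is_control must have the same length")
    # Start every batch at zero so batches without controls are still reported
    counts = {batch: 0 for batch in batches}
    for batch, control in zip(batches, is_control):
        if control:
            counts[batch] += 1
    return counts


# Same metadata values as the example above
batches = [0, 0, 1, 1]
is_control = [True, True, True, False]

counts = count_controls_per_batch(batches, is_control)
# Batches with fewer than two controls cannot yield a sample variance
sparse = sorted(batch for batch, n in counts.items() if n < 2)

print(counts)  # {0: 2, 1: 1}
print(sparse)  # [1]
```

Here batch 1 contributes a single control, so a real dataset with this layout would likely need more control samples per batch for a stable fit.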

API Reference


fisseq_data_pipeline.harmonize.fit_harmonizer(feature_df, meta_data_df, fit_only_on_control=False)

Fit a ComBat-based harmonization model using neuroHarmonize.

Parameters:
  • feature_df (DataFrame) –

    Feature matrix with shape (n_samples, n_features), numeric only.

  • meta_data_df (DataFrame) –

    Metadata aligned with rows in feature_df. Must contain a _batch column indicating batch membership. If fit_only_on_control=True, must also contain a boolean _is_control column.

  • fit_only_on_control (bool, default: False) –

    If True, compute the harmonization model only from control samples (rows where _is_control is True).

Returns:
  • Harmonizer

    A fitted harmonization model dictionary returned by neuroHarmonize.harmonizationLearn.
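
To illustrate what fit_only_on_control changes, here is a deliberately simplified sketch for a single feature. It is not ComBat and not what neuroHarmonize computes: it estimates only a per-batch mean shift, with no scale term and no empirical Bayes shrinkage. The function names are hypothetical. The point is that the correction is estimated from control rows alone and then applied to every row, so differences between controls and non-controls within a batch are preserved.

```python
from statistics import mean


def fit_control_offsets(values, batches, is_control):
    """Per-batch offset of the control mean from the pooled control mean."""
    controls = [(v, b) for v, b, c in zip(values, batches, is_control) if c]
    pooled = mean(v for v, _ in controls)
    offsets = {}
    for batch in {b for _, b in controls}:
        offsets[batch] = mean(v for v, b in controls if b == batch) - pooled
    return offsets


def apply_offsets(values, batches, offsets):
    """Subtract each row's batch offset; applies to controls and non-controls."""
    return [v - offsets[b] for v, b in zip(values, batches)]


values = [1.0, 2.0, 11.0, 12.0, 13.0]
batches = [0, 0, 1, 1, 1]
is_control = [True, True, True, True, False]

offsets = fit_control_offsets(values, batches, is_control)
adjusted = apply_offsets(values, batches, offsets)

print(offsets)   # batch 0 is shifted by -5.0, batch 1 by 5.0
print(adjusted)  # [6.0, 7.0, 6.0, 7.0, 8.0]
```

After adjustment the controls of both batches share the same mean, while the non-control row in batch 1 keeps its distance from that batch's controls.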

Source code in src/fisseq_data_pipeline/harmonize.py
def fit_harmonizer(
    feature_df: pl.DataFrame,
    meta_data_df: pl.DataFrame,
    fit_only_on_control: bool = False,
) -> Harmonizer:
    """
    Fit a ComBat-based harmonization model using `neuroHarmonize`.

    Parameters
    ----------
    feature_df : pl.DataFrame
        Feature matrix with shape (n_samples, n_features), numeric only.
    meta_data_df : pl.DataFrame
        Metadata aligned with rows in `feature_df`. Must contain a `_batch`
        column indicating batch membership. If `fit_only_on_control=True`,
        must also contain a boolean `_is_control` column.
    fit_only_on_control : bool, default=False
        If True, compute the harmonization model only from control samples
        (rows where `_is_control` is True).

    Returns
    -------
    Harmonizer
        A fitted harmonization model dictionary returned by
        ``neuroHarmonize.harmonizationLearn``.
    """
    if fit_only_on_control:
        logging.info(
            "Filtering control samples, number of samples before filtering=%d",
            len(feature_df),
        )
        feature_df = feature_df.filter(meta_data_df.get_column("_is_control"))
        meta_data_df = meta_data_df.filter(meta_data_df.get_column("_is_control"))
        logging.info(
            "Filtering complete, remaining train set samples shape=%s",
            feature_df.shape,
        )

    logging.info("Fitting harmonizer")
    covar_df = meta_data_df.select(pl.col("_batch").alias("SITE")).to_pandas()
    model, _ = neuroHarmonize.harmonizationLearn(feature_df.to_numpy(), covar_df)
    logging.info("Done")

    return model

fisseq_data_pipeline.harmonize.harmonize(feature_df, meta_data_df, harmonizer)

Apply a fitted harmonization model to adjust features for batch effects.

Parameters:
  • feature_df (DataFrame) –

    Feature matrix to harmonize; shape (n_samples, n_features).

  • meta_data_df (DataFrame) –

    Metadata aligned with rows in feature_df. Must contain a _batch column indicating batch membership.

  • harmonizer (Harmonizer) –

    A fitted model dictionary produced by fit_harmonizer.

Returns:
  • DataFrame

    Harmonized feature matrix with the same shape and column names as the input feature_df.
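
A fitted model holds parameters only for the batches that were present when it was fitted, so data from a batch the model never saw cannot be corrected with it. This matters in particular after fitting with fit_only_on_control=True, where a batch without controls would be missing from the model. The sketch below is a hypothetical pre-check, not part of the module, that lists such batches from plain Python lists of batch labels.

```python
def unseen_batches(fit_batches, new_batches):
    """Return batch labels in new_batches that were absent at fit time."""
    return sorted(set(new_batches) - set(fit_batches))


# Batch labels of the rows used for fitting (e.g. control rows only)
fit_batches = [0, 0, 1]
# Batch labels of the rows about to be harmonized
new_batches = [0, 1, 1, 2]

missing = unseen_batches(fit_batches, new_batches)
print(missing)  # [2]
```

An empty list means every batch in the new data was seen during fitting.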

Source code in src/fisseq_data_pipeline/harmonize.py
def harmonize(
    feature_df: pl.DataFrame,
    meta_data_df: pl.DataFrame,
    harmonizer: Harmonizer,
) -> pl.DataFrame:
    """
    Apply a fitted harmonization model to adjust features for batch effects.

    Parameters
    ----------
    feature_df : pl.DataFrame
        Feature matrix to harmonize; shape (n_samples, n_features).
    meta_data_df : pl.DataFrame
        Metadata aligned with rows in `feature_df`. Must contain a `_batch`
        column indicating batch membership.
    harmonizer : Harmonizer
        A fitted model dictionary produced by ``fit_harmonizer``.

    Returns
    -------
    pl.DataFrame
        Harmonized feature matrix with the same shape and column names
        as the input `feature_df`.
    """
    logging.info("Setting up harmonization")
    covar_df = meta_data_df.select(pl.col("_batch").alias("SITE")).to_pandas()

    logging.info("Applying harmonizer")
    harmonized_matrix = neuroHarmonize.harmonizationApply(
        feature_df.to_numpy(), covar_df, harmonizer
    )
    logging.info("Copying data")
    feature_df = pl.DataFrame(harmonized_matrix, schema=feature_df.columns)
    logging.info("Done")

    return feature_df