Utility functions

The fisseq_data_pipeline.utils module provides helper functions for feature selection and dataset construction. These utilities are used internally by the pipeline but can also be reused in standalone scripts.

Overview

  • get_feature_selector: Build a Polars selector expression for feature columns based on the pipeline configuration.
  • get_data_lf: Construct a combined feature + metadata LazyFrame from a raw input LazyFrame.
  • get_feature_cols: Return column expressions (or names) for all non-metadata columns in a LazyFrame.
  • get_feature_lf: Return a LazyFrame containing only feature columns.

Environment variables

  • FISSEQ_PIPELINE_RAND_STATE: Random seed used for reproducible operations. Default: 42.
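
As a rough illustration of how a standalone script could honor the same variable, the sketch below reads it with a fallback of 42. The helper name get_rand_state is hypothetical, and treating the value as an integer is an assumption; how the module itself parses the variable is not shown on this page.

```python
import os
import random


def get_rand_state(default: int = 42) -> int:
    """Read FISSEQ_PIPELINE_RAND_STATE, falling back to the documented default.

    Hypothetical helper: assumes the variable, when set, holds an integer.
    """
    return int(os.environ.get("FISSEQ_PIPELINE_RAND_STATE", default))


# Seed a local generator so repeated runs draw the same values.
rng = random.Random(get_rand_state())
sample = [rng.random() for _ in range(3)]
```

Seeding a local random.Random instance rather than the global generator keeps the script's reproducibility independent of other libraries that also draw random numbers.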

Example Usage

import polars as pl
from fisseq_data_pipeline.config import Config
from fisseq_data_pipeline.utils import get_data_lf, get_feature_cols

# Example raw dataset
db_lf = pl.DataFrame({
    "gene1": [1.0, 2.0, 3.0, 4.0],
    "gene2": [5.0, 6.0, 7.0, 8.0],
    "batch": ["A", "A", "B", "B"],
    "label": ["X", "X", "Y", "Y"],
    "is_ctrl": [True, False, True, False],
}).lazy()

# Config (using dict for simplicity)
cfg = Config({
    "feature_cols": ["gene1", "gene2"],
    "batch_col_name": "batch",
    "label_col_name": "label",
    "control_sample_query": "is_ctrl = true",
})

# Build combined data LazyFrame
data_lf = get_data_lf(db_lf, cfg)

# Get feature column names
feature_names = get_feature_cols(data_lf, as_string=True)

API Reference


fisseq_data_pipeline.utils.get_feature_selector(data_df, config)

Build a Polars column selector for feature columns based on the config.

This utility interprets config.feature_cols and returns a Polars selector expression usable in .select() or .with_columns() calls.

Selection modes:
  • Regex: if feature_cols is a string, all column names matching the regex are selected.
  • Explicit list: if feature_cols is a list of strings, those columns are selected in the given order. Missing columns are ignored with a warning.

Parameters:
  • data_df (LazyFrame) –

    Input dataset containing feature columns.

  • config (Config) –

    Configuration object with a feature_cols attribute defining which columns to select (regex pattern or explicit list).

Returns:
  • PlSelector

    A Polars selector expression suitable for use in .select() calls.

Notes
  • Missing columns are ignored but logged as a warning.
  • Column order is preserved when using an explicit list.
Source code in src/fisseq_data_pipeline/utils.py
def get_feature_selector(data_df: pl.LazyFrame, config: Config) -> PlSelector:
    """
    Build a Polars column selector for feature columns based on the config.

    This utility interprets ``config.feature_cols`` and returns a Polars
    selector expression usable in ``.select()`` or ``.with_columns()`` calls.

    Selection modes:
      - **Regex**: if ``feature_cols`` is a string, all column names matching
        the regex are selected.
      - **Explicit list**: if ``feature_cols`` is a list of strings, those
        columns are selected in the given order. Missing columns are ignored
        with a warning.

    Parameters
    ----------
    data_df : pl.LazyFrame
        Input dataset containing feature columns.
    config : Config
        Configuration object with a ``feature_cols`` attribute defining which
        columns to select (regex pattern or explicit list).

    Returns
    -------
    PlSelector
        A Polars selector expression suitable for use in ``.select()`` calls.

    Notes
    -----
    - Missing columns are ignored but logged as a warning.
    - Column order is preserved when using an explicit list.
    """
    if isinstance(config.feature_cols, str):
        selector_type = "regex"
        selector = cs.matches(config.feature_cols)
    else:
        selector_type = "list"
        feature_cols = set(config.feature_cols)
        missing = feature_cols - set(data_df.columns)

        if len(missing) != 0:
            logging.warning(
                "Some columns are specified in the config but are not currently"
                " present in the dataframe, this can happen if columns are"
                " removed during data cleaning. The following columns will be"
                " ignored: %s",
                missing,
            )

        # Column order must be preserved
        not_missing = feature_cols - missing
        selector = pl.col(
            col for col in list(config.feature_cols) if col in not_missing
        )

    logging.debug("Using feature %s selector: %s", selector_type, config.feature_cols)
    return selector

fisseq_data_pipeline.utils.get_data_lf(db_lf, config, dtype=pl.Float32)

Construct a combined feature and metadata LazyFrame from a raw Polars LazyFrame.

This function builds a lazy query that:
  1. Adds a row index (_meta_sample_idx) for reproducible sample tracking.
  2. Selects the feature columns defined in config.feature_cols and casts them to dtype.
  3. Adds metadata columns:
     • _meta_batch from config.batch_col_name
     • _meta_label from config.label_col_name
     • _meta_is_control by evaluating config.control_sample_query as a SQL expression

No data is collected by this function; the caller decides when to materialize the returned LazyFrame.

Parameters:
  • db_lf (LazyFrame) –

    Input dataset containing both features and metadata.

  • config (Config) –

    Configuration object providing feature_cols, batch_col_name, label_col_name, and control_sample_query.

  • dtype (DataType, default: Float32 ) –

    Data type to cast feature columns to. Defaults to pl.Float32.

Returns:
  • LazyFrame

    A single LazyFrame containing the feature columns cast to dtype, followed by the metadata columns _meta_batch (batch labels), _meta_is_control (boolean control mask), _meta_sample_idx (row index), and _meta_label (class labels).

Notes
  • Feature and metadata columns live in the same frame, so they always share row ordering.
Source code in src/fisseq_data_pipeline/utils.py
def get_data_lf(
    db_lf: pl.LazyFrame,
    config: Config,
    dtype: pl.DataType = pl.Float32,
) -> pl.LazyFrame:
    """
    Construct a combined feature and metadata LazyFrame from a Polars LazyFrame.

    This function builds a lazy query that:
      1. Adds a row index (``_meta_sample_idx``) for reproducible sample tracking.
      2. Selects and casts feature columns defined in ``config.feature_cols``.
      3. Adds metadata columns:
         - ``_meta_batch`` from ``config.batch_col_name``
         - ``_meta_label`` from ``config.label_col_name``
         - ``_meta_is_control`` by evaluating ``config.control_sample_query``

    Parameters
    ----------
    db_lf : pl.LazyFrame
        Input dataset containing both features and metadata.
    config : Config
        Configuration object
    dtype : pl.DataType, optional
        Data type to cast feature columns to. Defaults to ``pl.Float32``.

    Returns
    -------
    pl.LazyFrame
        A single LazyFrame containing the feature columns cast to ``dtype``
        followed by the metadata columns:
          * ``_meta_batch``: batch labels
          * ``_meta_is_control``: boolean control mask
          * ``_meta_sample_idx``: row index
          * ``_meta_label``: class labels

    Notes
    -----
    - No data is collected here; the caller materializes the returned frame.
    - Feature and metadata columns share one frame and therefore one row order.
    """
    logging.info(
        "Starting get_data_lf for batch_col=%s, dtype=%s",
        config.batch_col_name,
        dtype,
    )

    # Attach row indices to preserve mapping later
    base = db_lf.with_row_index(name="_meta_sample_idx").cache()

    # Build feature selector
    feature_expr = get_feature_selector(base, config).cast(dtype=dtype)
    logging.debug("Feature selector resolved: %s", feature_expr)

    # Control mask expr
    logging.debug("Parsing control sample query: %s", config.control_sample_query)

    control_mask_expr = pl.sql_expr(config.control_sample_query).alias(
        "_meta_is_control"
    )
    batch_expr = pl.col(config.batch_col_name).alias("_meta_batch")
    label_expr = pl.col(config.label_col_name).alias("_meta_label")

    # Build the combined lazy query (no data is collected here)
    logging.info("Creating combined dataframe query")
    lf = base.with_columns(
        label_expr,
        batch_expr,
        control_mask_expr,
    ).select(
        feature_expr,
        pl.col("_meta_batch"),
        pl.col("_meta_is_control"),
        pl.col("_meta_sample_idx"),
        pl.col("_meta_label"),
    )

    return lf

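fisseq_data_pipeline.utils.get_feature_cols(data_lf, as_string=False)

Return every column of data_lf whose name does not start with _meta. By default the result is a list of pl.col expressions; with as_string=True it is a list of column names.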

Source code in src/fisseq_data_pipeline/utils.py
def get_feature_cols(
    data_lf: pl.LazyFrame, as_string: bool = False
) -> list[pl.Expr] | list[str]:
    wrapper = str if as_string else pl.col
    return [wrapper(c) for c in data_lf.columns if not c.startswith("_meta")]
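
The filter in the listing above is a plain prefix test on the column name, which can be reproduced without Polars. The sketch below uses a hypothetical helper, split_columns, and made-up column names to show one consequence: the test is startswith("_meta"), not startswith("_meta_"), so a feature named _metabolite_a would be treated as metadata.

```python
META_PREFIX = "_meta"  # prefix tested by get_feature_cols in the listing above


def split_columns(columns: list[str]) -> tuple[list[str], list[str]]:
    """Hypothetical helper: partition names into (features, metadata)."""
    features = [c for c in columns if not c.startswith(META_PREFIX)]
    metadata = [c for c in columns if c.startswith(META_PREFIX)]
    return features, metadata


features, metadata = split_columns(
    ["gene1", "_metabolite_a", "_meta_batch", "gene2"]
)
```

Avoiding feature names that begin with _meta sidesteps this edge case.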

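fisseq_data_pipeline.utils.get_feature_lf(data_lf)

Return a LazyFrame containing only the feature columns of data_lf, that is, the columns returned by get_feature_cols.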

Source code in src/fisseq_data_pipeline/utils.py
def get_feature_lf(data_lf: pl.LazyFrame) -> pl.LazyFrame:
    return data_lf.select(get_feature_cols(data_lf))