Data Cleaning Utilities
The fisseq_data_pipeline.filter module provides functions to clean and
filter feature/metadata tables prior to normalization and harmonization.
These utilities are invoked automatically in the pipeline, but can also be
used independently.
Overview
clean_data: Removes invalid rows/columns from feature and metadata tables while keeping them aligned.
drop_infrequent_pairs: Drops rows from rare (label, batch) groups according to a configurable threshold.
Environment variables
FISSEQ_PIPELINE_MIN_CLASS_MEMBERS
Minimum number of samples required per (label, batch) group when running drop_infrequent_pairs.
Default: 2.
Example:
# Require at least 5 samples per label–batch group
FISSEQ_PIPELINE_MIN_CLASS_MEMBERS=5 fisseq-data-pipeline validate ...
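The threshold can also be set from Python before the pipeline runs. The snippet below is a minimal sketch, not the package's own code: it assumes the variable is parsed as an integer from the process environment, and read_min_class_members is a hypothetical helper that only mirrors the documented default of 2. This page does not say whether the package reads the variable at import time or at call time, so setting it before importing fisseq_data_pipeline is the safer order.

```python
import os

ENV_VAR = "FISSEQ_PIPELINE_MIN_CLASS_MEMBERS"


def read_min_class_members(default: int = 2) -> int:
    """Hypothetical helper: read the threshold, falling back to the documented default."""
    raw = os.environ.get(ENV_VAR)
    if raw is None:
        return default
    # int() raises ValueError if the variable is set to a non-integer string
    return int(raw)


# Require at least 5 samples per (label, batch) group for this process
os.environ[ENV_VAR] = "5"
threshold = read_min_class_members()
```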
Example Usage
import polars as pl
from fisseq_data_pipeline.filter import clean_data, drop_infrequent_pairs
# Example feature matrix
feature_df = pl.DataFrame({
"f1": [1.0, 2.0, float("nan"), 4.0],
"f2": [5.0, 6.0, 7.0, 8.0],
})
# Example metadata with batch + label
meta_df = pl.DataFrame({
"_label": ["A", "A", "B", "B"],
"_batch": ["X", "Y", "X", "Y"],
})
# Clean non-finite and zero-variance columns/rows
feature_df, meta_df = clean_data(feature_df, meta_df)
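# Drop infrequent (label, batch) pairs. Each pair in this toy data occurs only once,
# so under the documented default threshold of 2 no rows are expected to remain.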
feature_df, meta_df = drop_infrequent_pairs(feature_df, meta_df)
API reference
fisseq_data_pipeline.filter.clean_data(feature_df, meta_data_df)
Clean feature and metadata tables and keep them row-aligned.
The cleaning pipeline performs the following passes:
1) Drop feature columns that are entirely non-finite.
2) Drop rows that contain any remaining non-finite value.
3) Drop feature columns with (near) zero variance.
All feature columns must be numeric.
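The passes listed above can be illustrated in plain Python. This is a sketch under stated assumptions, not the implementation in filter.py: tables are represented as dicts of lists rather than polars DataFrames, clean_sketch is a hypothetical name, and the var_tol cutoff for "near zero" variance is an assumed value, since this page does not state the tolerance the package uses.

```python
import math
from statistics import pvariance


def clean_sketch(features, metadata, var_tol=1e-12):
    """Illustrative only. features: dict[str, list[float]]; metadata: dict[str, list]."""
    # Pass 1: drop feature columns with no finite value at all
    features = {
        name: col for name, col in features.items()
        if any(math.isfinite(v) for v in col)
    }

    # Pass 2: keep only rows where every remaining feature is finite,
    # applying the same row selection to metadata so the tables stay aligned
    n_rows = len(next(iter(metadata.values())))
    keep = [
        i for i in range(n_rows)
        if all(math.isfinite(col[i]) for col in features.values())
    ]
    features = {name: [col[i] for i in keep] for name, col in features.items()}
    metadata = {name: [col[i] for i in keep] for name, col in metadata.items()}

    # Pass 3: drop columns whose variance is at or below the assumed tolerance
    features = {
        name: col for name, col in features.items()
        if len(col) > 1 and pvariance(col) > var_tol
    }
    return features, metadata


nan = float("nan")
raw_features = {
    "f1": [1.0, 2.0, nan, 4.0],   # one bad row
    "f2": [5.0, 6.0, 7.0, 8.0],
    "f3": [nan, nan, nan, nan],   # entirely non-finite
    "f4": [3.0, 3.0, 3.0, 3.0],   # zero variance
}
raw_meta = {"_label": ["A", "A", "B", "B"], "_batch": ["X", "Y", "X", "Y"]}

clean_features, clean_meta = clean_sketch(raw_features, raw_meta)
```

In this toy input, f3 goes in the first pass, the third row goes in the second pass (in both tables), and f4 goes in the third pass. The order matters: removing all-NaN columns first prevents a single dead column from wiping out every row in the second pass.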
Source code in src/fisseq_data_pipeline/filter.py
fisseq_data_pipeline.filter.drop_infrequent_pairs(feature_df, meta_data_df)
Remove rows belonging to rare (_label, _batch) groups.
Rows are grouped by the concatenation of _label and _batch.
Any group with a sample count less than FISSEQ_PIPELINE_MIN_CLASS_MEMBERS (default: 2) is dropped.
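The grouping rule can be sketched in plain Python as follows. This is an illustration of the documented behavior, not the package's code: drop_rare_pairs_sketch is a hypothetical name, tables are dicts of lists rather than polars DataFrames, and the groups are keyed by a (label, batch) tuple instead of a concatenated string, which yields the same groups for the purpose of counting.

```python
from collections import Counter


def drop_rare_pairs_sketch(features, metadata, min_members=2):
    """Illustrative only: keep rows whose (_label, _batch) group has enough members."""
    pairs = list(zip(metadata["_label"], metadata["_batch"]))
    counts = Counter(pairs)

    # A group is dropped when its count is strictly less than min_members
    keep = [i for i, pair in enumerate(pairs) if counts[pair] >= min_members]

    # Apply the same row selection to both tables so they stay aligned
    features = {name: [col[i] for i in keep] for name, col in features.items()}
    metadata = {name: [col[i] for i in keep] for name, col in metadata.items()}
    return features, metadata


toy_features = {"f1": [1.0, 2.0, 3.0, 4.0]}
toy_meta = {
    "_label": ["A", "A", "A", "B"],
    "_batch": ["X", "X", "Y", "Y"],
}

# (A, X) has two rows and survives; (A, Y) and (B, Y) have one row each and are removed
kept_features, kept_meta = drop_rare_pairs_sketch(toy_features, toy_meta)
```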
Source code in src/fisseq_data_pipeline/filter.py