Data Cleaning Utilities
The fisseq_data_pipeline.filter module provides functions to clean and
filter feature tables prior to normalization. These utilities are invoked
automatically in the pipeline, but can also be used independently.
Overview
- clean_data: Applies a configurable sequence of filtering stages to a LazyFrame. The default pipeline removes all-non-finite columns, then rows with any non-finite value.
- drop_cols_all_nonfinite: Drops columns where every value is NaN, +inf, or -inf.
- drop_rows_any_nonfinite: Drops rows containing any non-finite feature value.
Example Usage
import polars as pl
from fisseq_data_pipeline.filter import clean_data
# Example combined data LazyFrame (features + metadata)
data_lf = pl.DataFrame({
    "_meta_batch": ["A", "A", "B"],
    "_meta_label": ["X", "X", "Y"],
    "f1": [1.0, float("nan"), 3.0],
    "f2": [5.0, 6.0, float("inf")],
    "f3": [float("nan"), float("nan"), float("nan")],
}).lazy()
# Run default pipeline: drop all-non-finite columns, then rows with any non-finite value
cleaned_lf = clean_data(data_lf)
# Run only one stage
cleaned_lf = clean_data(data_lf, stages=["drop_cols_all_nonfinite"])
# Insert a custom filtering stage
def my_filter(lf: pl.LazyFrame) -> pl.LazyFrame:
    return lf.filter(pl.col("f1") > 0)
cleaned_lf = clean_data(data_lf, stages=["drop_cols_all_nonfinite", my_filter])
Notes
- Only columns not prefixed with _meta are treated as feature columns and considered during non-finite checks. Metadata columns are carried through unchanged.
- Unknown string stage names are skipped with a WARNING log message.
- Custom stages must accept and return a pl.LazyFrame.
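The feature/metadata split described in the first note can be illustrated with a small helper. This is only a sketch working on a plain list of column names, not the module's code: `split_columns` and `META_PREFIX` are hypothetical names introduced here, and the prefix value is taken from the note.

```python
META_PREFIX = "_meta"  # assumed prefix, taken from the note above


def split_columns(columns):
    """Partition column names into (feature, metadata) lists by prefix."""
    features = [c for c in columns if not c.startswith(META_PREFIX)]
    metadata = [c for c in columns if c.startswith(META_PREFIX)]
    return features, metadata


columns = ["_meta_batch", "_meta_label", "f1", "f2", "f3"]
features, metadata = split_columns(columns)
print(features)  # ['f1', 'f2', 'f3']
print(metadata)  # ['_meta_batch', '_meta_label']
```

Only the names in `features` would take part in non-finite checks; the names in `metadata` are passed through untouched.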
API reference
fisseq_data_pipeline.filter.clean_data(feature_lf, stages=['drop_cols_all_nonfinite', 'drop_rows_any_nonfinite'])
Apply a sequence of filtering stages to a LazyFrame.
Each stage may be specified by:
- A string matching a known stage name; or
- A Callable of type FilterFun receiving a LazyFrame and returning one.
Stages are executed in order, and each stage returns a transformed LazyFrame that becomes the input for the next stage.
Notes
- Invalid stage names are skipped with a warning.
- Custom filtering functions can be inserted as callables.
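The stage-resolution behaviour described above (named stages, callables, and skipped unknown names) can be sketched roughly as follows. This is an illustration under stated assumptions rather than the actual body of clean_data: `run_stages`, `KNOWN_STAGES`, and `drop_negative` are hypothetical names, and a plain list of floats stands in for a LazyFrame so the sketch runs without polars.

```python
import logging
from typing import Callable, Dict, List, Sequence, Union

logger = logging.getLogger(__name__)

# Stand-in for a LazyFrame in this sketch: a plain list of floats.
Frame = List[float]
FilterFun = Callable[[Frame], Frame]


def drop_negative(frame: Frame) -> Frame:
    """Toy named stage: keep only non-negative values."""
    return [x for x in frame if x >= 0]


# Hypothetical registry mapping stage names to functions.
KNOWN_STAGES: Dict[str, FilterFun] = {"drop_negative": drop_negative}


def run_stages(frame: Frame, stages: Sequence[Union[str, FilterFun]]) -> Frame:
    """Apply stages in order; the output of each feeds the next."""
    for stage in stages:
        if callable(stage):
            fn = stage
        elif stage in KNOWN_STAGES:
            fn = KNOWN_STAGES[stage]
        else:
            # Unknown names are skipped with a warning, not raised.
            logger.warning("Unknown stage %r, skipping", stage)
            continue
        frame = fn(frame)
    return frame


result = run_stages(
    [3.0, -1.0, 0.5],
    ["drop_negative", "no_such_stage", lambda f: [x * 2 for x in f]],
)
print(result)  # [6.0, 1.0]
```

Because unknown names are skipped rather than raised, a typo in a stage name silently removes that stage from the run, so checking the log output is worthwhile when a result looks under-filtered.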
Source code in src/fisseq_data_pipeline/filter.py
fisseq_data_pipeline.filter.drop_cols_all_nonfinite(data_lf)
Remove columns where every value is non-finite.
A column is dropped if all of its elements are one of:
- NaN
- +inf
- -inf
The check is performed eagerly, over the minimal number of rows needed to determine which columns contain at least one finite value. The returned LazyFrame excludes all-non-finite columns.
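One way to read the "minimal number of rows" idea is an early-exit scan: stop as soon as every feature column has shown a finite value. The plain-Python sketch below illustrates that reading on a list of row dicts. It is not the module's implementation; `cols_with_finite` and `is_finite` are hypothetical helpers, and treating None as non-finite is an assumption.

```python
import math


def is_finite(value):
    """True for ordinary numbers; False for None, NaN, +inf, -inf."""
    return value is not None and math.isfinite(value)


def cols_with_finite(rows, feature_cols):
    """Return the feature columns holding at least one finite value.

    Rows are scanned in order, and the scan stops once no column is
    still waiting for its first finite value.
    """
    pending = set(feature_cols)
    for row in rows:
        pending -= {c for c in pending if is_finite(row[c])}
        if not pending:
            break  # every column has been seen finite; no need to read on
    return [c for c in feature_cols if c not in pending]


nan, inf = float("nan"), float("inf")
rows = [
    {"f1": 1.0, "f2": 5.0, "f3": nan},
    {"f1": nan, "f2": 6.0, "f3": nan},
    {"f1": 3.0, "f2": inf, "f3": nan},
]
keep = cols_with_finite(rows, ["f1", "f2", "f3"])
print(keep)  # ['f1', 'f2']
```

In this data f3 never shows a finite value, so the scan has to read every row before concluding that the column can be dropped; the early exit only helps when all columns turn out to be usable.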
Source code in src/fisseq_data_pipeline/filter.py
fisseq_data_pipeline.filter.drop_rows_any_nonfinite(data_lf)
Remove rows containing any non-finite value in the feature columns.
A row is dropped if any feature column contains:
- NaN
- +inf
- -inf
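The row rule can be illustrated in plain Python on a list of row dicts. This is a sketch of the semantics only, not the module's polars code: `drop_rows_any_nonfinite_sketch` is a hypothetical name, the `_meta` prefix is taken from the notes earlier on this page, and treating None as non-finite is an assumption.

```python
import math

META_PREFIX = "_meta"  # metadata columns are ignored by the check


def drop_rows_any_nonfinite_sketch(rows):
    """Keep only rows whose feature values are all finite."""

    def row_ok(row):
        return all(
            value is not None and math.isfinite(value)
            for name, value in row.items()
            if not name.startswith(META_PREFIX)
        )

    return [row for row in rows if row_ok(row)]


rows = [
    {"_meta_batch": "A", "f1": 1.0, "f2": 5.0},
    {"_meta_batch": "A", "f1": float("nan"), "f2": 6.0},
    {"_meta_batch": "B", "f1": 3.0, "f2": float("inf")},
]
kept = drop_rows_any_nonfinite_sketch(rows)
print(kept)  # [{'_meta_batch': 'A', 'f1': 1.0, 'f2': 5.0}]
```

Applying this after all-non-finite columns have been removed matters: if an all-NaN column such as f3 from the usage example were still present, every row would fail the check.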
Source code in src/fisseq_data_pipeline/filter.py