Pipeline
The FISSEQ data pipeline exposes a small CLI via the entry point
fisseq-data-pipeline. Subcommands are provided by
Python Fire:
- run: production, single-pass run (clean, normalize, and write outputs).
- configure: write a default configuration file.
Quick start
# Run the full pipeline
fisseq-data-pipeline run \
--input_data_path data.parquet \
--config config.yaml \
--output_dir out
# Write a default config to the current directory
fisseq-data-pipeline configure
# Write to a custom location
fisseq-data-pipeline configure --output_path path/to/config.yaml
Logging
Log level is controlled via the FISSEQ_PIPELINE_LOG_LEVEL environment
variable (default: info).
FISSEQ_PIPELINE_LOG_LEVEL=debug fisseq-data-pipeline run \
--input_data_path data.parquet
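As a rough illustration of how an environment variable like this can be mapped to a Python logging level, here is a minimal sketch. The helper name resolve_log_level and the fallback to info for unrecognized values are assumptions made for illustration; the pipeline's own handling may differ.

```python
import logging
import os

ENV_VAR = "FISSEQ_PIPELINE_LOG_LEVEL"


def resolve_log_level(env=None):
    """Return a logging level read from ENV_VAR (hypothetical helper)."""
    env = os.environ if env is None else env
    # The documented default is "info" when the variable is unset.
    name = str(env.get(ENV_VAR, "info")).upper()
    level = getattr(logging, name, None)
    # Assumption: unrecognized names fall back to INFO rather than raising.
    return level if isinstance(level, int) else logging.INFO


if __name__ == "__main__":
    logging.basicConfig(level=resolve_log_level())
    logging.getLogger(__name__).debug("visible only when the level is debug")
```

Passing a plain dict as env makes the helper easy to exercise without touching the real environment.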
Command Interface
Run
fisseq_data_pipeline.pipeline.run(input_data_path, config=None, output_dir=None, eager_db_loading=False)
Run the full batch-correction pipeline.
This function performs a one-pass processing workflow that reads a full dataset, cleans invalid rows and columns, fits normalization statistics (optionally batch-wise and on control samples), applies normalization, and writes the resulting cleaned and normalized outputs to disk.
Production Pipeline steps
- Load and scan the input Parquet dataset into a Polars LazyFrame.
- Derive feature and metadata frames using configuration-specified column selections.
- Clean the dataset by removing:
  - Columns that contain only non-finite (NaN/inf) values.
  - Rows containing any non-finite feature values.
- Fit a batch-wise normalizer (computed from control samples only).
- Apply normalization to the full cleaned dataset.
- Write the cleaned metadata and features, the normalized features, and the fitted normalizer to disk.
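The cleaning and normalization steps above can be sketched in plain Python on a tiny in-memory example. This is not the pipeline's implementation (which works on Polars frames): the function names clean, fit, and normalize are hypothetical, the use of the population standard deviation and the replacement of a zero standard deviation by 1.0 are assumptions, and batches with no control rows are not handled.

```python
import math
from statistics import fmean, pstdev


def clean(rows):
    """Drop all-non-finite columns, then any row with a non-finite value left."""
    n_cols = len(rows[0])
    keep_cols = [j for j in range(n_cols)
                 if any(math.isfinite(r[j]) for r in rows)]
    kept_rows, kept_idx = [], []
    for i, r in enumerate(rows):
        vals = [r[j] for j in keep_cols]
        if all(math.isfinite(v) for v in vals):
            kept_rows.append(vals)
            kept_idx.append(i)
    return kept_rows, kept_idx


def fit(features, batches, is_control):
    """Per-batch, per-feature (mean, std) computed from control rows only."""
    stats = {}
    for b in set(batches):
        ctrl = [f for f, bb, c in zip(features, batches, is_control)
                if bb == b and c]
        cols = list(zip(*ctrl))
        # Assumption: population std; a zero std is replaced by 1.0.
        stats[b] = ([fmean(c) for c in cols],
                    [pstdev(c) or 1.0 for c in cols])
    return stats


def normalize(features, batches, stats):
    """Z-score every row using the statistics of its own batch."""
    out = []
    for f, b in zip(features, batches):
        mean, std = stats[b]
        out.append([(v - m) / s for v, m, s in zip(f, mean, std)])
    return out


nan = float("nan")
rows = [[1.0, nan, 10.0], [3.0, nan, 20.0], [nan, nan, 30.0], [5.0, nan, 40.0]]
batches = ["a", "a", "a", "a"]
is_control = [True, True, False, False]

features, kept = clean(rows)                  # column 1 and row 2 are dropped
batches = [batches[i] for i in kept]          # keep metadata aligned with rows
is_control = [is_control[i] for i in kept]
stats = fit(features, batches, is_control)    # statistics from controls only
normalized = normalize(features, batches, stats)
```

Fitting on control rows but applying to every row means non-control samples are expressed in units of control variability within their batch.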
Outputs
Written to output_dir:
- meta_data.parquet: cleaned metadata table.
- features.parquet: cleaned feature matrix.
- normalized.parquet: z-score normalized feature matrix.
- normalizer.pkl: serialized Normalizer object containing per-feature mean and standard deviation statistics.
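A quick way to confirm that a run produced everything listed above is to check the output directory for the four file names. The sketch below uses only the standard library; the helper name missing_outputs is hypothetical and not part of the package.

```python
import tempfile
from pathlib import Path

# File names as listed in the Outputs section.
EXPECTED_OUTPUTS = (
    "meta_data.parquet",
    "features.parquet",
    "normalized.parquet",
    "normalizer.pkl",
)


def missing_outputs(output_dir):
    """Return the expected output file names not present in output_dir."""
    root = Path(output_dir)
    return [name for name in EXPECTED_OUTPUTS if not (root / name).is_file()]


# Demo against a throwaway directory holding only two of the four files.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "features.parquet").touch()
    (Path(tmp) / "normalizer.pkl").touch()
    demo_missing = missing_outputs(tmp)
```

After a real run, calling missing_outputs("out") should return an empty list if all artifacts were written.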
CLI
Exposed via Fire at the fisseq-data-pipeline entry point, e.g.:
fisseq-data-pipeline run \
--input_data_path data.parquet \
--config config.yaml \
--output_dir out
Source code in src/fisseq_data_pipeline/pipeline.py
Configure
fisseq_data_pipeline.pipeline.configure(output_path=None)
Write a copy of the default configuration to output_path.
CLI
Exposed via Fire at the fisseq-data-pipeline entry point:
# Write config.yaml to CWD
fisseq-data-pipeline configure
# Write to a custom location
fisseq-data-pipeline configure --output_path path/to/config.yaml
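The path behavior shown in these commands (config.yaml in the current directory unless output_path is given) can be sketched as follows. The helper names resolve_config_path and write_config_copy are hypothetical, and the placeholder text stands in for the real default configuration, which is not reproduced here.

```python
import tempfile
from pathlib import Path


def resolve_config_path(output_path=None):
    """Use output_path if given, else config.yaml in the current directory."""
    if output_path is not None:
        return Path(output_path)
    return Path.cwd() / "config.yaml"


def write_config_copy(default_text, output_path=None):
    """Write default_text to the resolved path and return that path."""
    target = resolve_config_path(output_path)
    target.write_text(default_text)
    return target


# Placeholder only; the package ships its own default configuration.
PLACEHOLDER = "# default configuration would go here\n"

with tempfile.TemporaryDirectory() as tmp:
    written = write_config_copy(PLACEHOLDER, Path(tmp) / "config.yaml")
    round_trip = written.read_text()
```

This sketch does not create missing parent directories; whether the real command does is not stated in this documentation.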
Source code in src/fisseq_data_pipeline/pipeline.py