Build a Polars column selector for feature columns based on the config.
This utility interprets config.feature_cols and returns a Polars
selector expression usable in .select() or .with_columns() calls.
Selection modes:
- Regex: if feature_cols is a string, all column names matching
the regex are selected.
- Explicit list: if feature_cols is a list of strings, those
columns are selected in the given order. Missing columns are ignored
with a warning.
Parameters:
- data_df (LazyFrame): Input dataset containing feature columns.
- config (Config): Configuration object with a feature_cols attribute defining which columns to select (regex pattern or explicit list).
Returns:
- PlSelector: A Polars selector expression suitable for use in .select() calls.
Notes
- Missing columns are ignored but logged as a warning.
- Column order is preserved when using an explicit list.
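The behaviour described in these notes can be sketched in plain Python, independent of Polars. The helper name `keep_present_in_order` and the sample column names are hypothetical; the sketch only mirrors the idea of dropping missing names with a warning while keeping the requested order.

```python
import logging


def keep_present_in_order(requested, available):
    """Return the requested names found in `available`, in requested order."""
    available_set = set(available)
    missing = [name for name in requested if name not in available_set]
    if missing:
        # Missing names are reported, not treated as an error.
        logging.warning("Ignoring columns missing from the dataframe: %s", missing)
    return [name for name in requested if name in available_set]


# "dropped" is absent from the available columns, so it is skipped.
kept = keep_present_in_order(
    ["feat_b", "dropped", "feat_a"], ["id", "feat_a", "feat_b"]
)
print(kept)
```

Iterating over the requested list, and not over a set, is what keeps the output order stable.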
Source code in src/fisseq_data_pipeline/utils.py
def get_feature_selector(data_df: pl.LazyFrame, config: Config) -> PlSelector:
    """
    Build a Polars column selector for feature columns based on the config.

    This utility interprets ``config.feature_cols`` and returns a Polars
    selector expression usable in ``.select()`` or ``.with_columns()`` calls.

    Selection modes:

    - **Regex**: if ``feature_cols`` is a string, all column names matching
      the regex are selected.
    - **Explicit list**: if ``feature_cols`` is a list of strings, those
      columns are selected in the given order. Missing columns are ignored
      with a warning.

    Parameters
    ----------
    data_df : pl.LazyFrame
        Input dataset containing feature columns.
    config : Config
        Configuration object with a ``feature_cols`` attribute defining which
        columns to select (regex pattern or explicit list).

    Returns
    -------
    PlSelector
        A Polars selector expression suitable for use in ``.select()`` calls.

    Notes
    -----
    - Missing columns are ignored but logged as a warning.
    - Column order is preserved when using an explicit list.
    """
    if isinstance(config.feature_cols, str):
        selector_type = "regex"
        selector = cs.matches(config.feature_cols)
    else:
        selector_type = "list"
        feature_cols = set(config.feature_cols)
        missing = feature_cols - set(data_df.columns)
        if missing:
            logging.warning(
                "Some columns are specified in the config but are not currently"
                " present in the dataframe, this can happen if columns are"
                " removed during data cleaning. The following columns will be"
                " ignored: %s",
                missing,
            )
        # Column order must be preserved, so iterate over the configured
        # list and not over the set.
        not_missing = feature_cols - missing
        selector = pl.col(
            [col for col in config.feature_cols if col in not_missing]
        )
    logging.debug("Using feature %s selector: %s", selector_type, config.feature_cols)
    return selector