pipeline package
Submodules
pipeline.DataHandler module
pipeline.FeatureSelector module
pipeline.LOORFE module
- class pipeline.LOORFE.LeaveOneOutRecursiveFeatureElimination(data_handler, model_eval_class, num_classes=2, seed=None)
Bases: object
Implements Leave-One-Out Recursive Feature Elimination (LOO-RFE) for feature selection in machine learning models. The class uses cross-validation (CV) splits to build feature rankings that exclude the influence of the test subject, so that the evaluation of feature importance is not biased by the test subject’s data.
The LOO-RFE method iteratively eliminates features based on their impact on model performance. To keep each feature’s importance estimate free of the test subject’s influence, feature rankings are constructed only from subsets of data in which the test subject was excluded. These rankings are then used to identify the most relevant features for model prediction.
Parameters:
- data_handler: An object responsible for managing data operations such as scaling and batching, ensuring that data is appropriately preprocessed for model training and evaluation.
- model_eval_class: A class that provides methods for model compilation, fitting, and evaluation, designed to work with variable numbers of features and target classes.
- num_classes (int): The number of target classes in the dataset, defaulting to 2 for binary classification.
- seed (int, optional): A random seed to ensure reproducibility of results across runs.
Methods:
- get_keys_where_value_exists: Retrieves keys from a list of dictionaries where a specified value is present, aiding in the identification of CV splits relevant to each test subject.
- get_top_features: Extracts top features based on their occurrence and average position in ordered rankings, utilizing an unbiased ranking list compiled from CV splits excluding the test subject.
- evaluate: Conducts the LOO-RFE evaluation, training models on subsets of features and assessing their performance in an unbiased manner, guided by the principle of excluding the test subject’s data from feature ranking generation.
Because the test subject’s data never informs the feature rankings, the features are selected independently of the data later used to evaluate them. This is intended to improve the generalizability and reliability of the selected features in tasks that require careful feature selection and validation.
- evaluate(data, train_index, test_index, batch_size, num_selected_features, sample_num, subs_in_val_per_sample, selected_feature_indices, scaler_obj, save_path)
Performs the LOO-RFE evaluation process, training models on subsets of features and assessing their performance.
Parameters:
- data: The dataset containing features and labels.
- train_index: Indices for the training data.
- test_index: Indices for the test data.
- batch_size: The size of batches for training and testing.
- num_selected_features: The number of features to select in the current iteration.
- sample_num: Identifier for the current sample or iteration.
- subs_in_val_per_sample: Subsets involved in validation for each sample.
- selected_feature_indices: Indices of features selected in the current iteration.
- scaler_obj: An instance of a scaler for data normalization.
- save_path: Path where evaluation results and plots are saved.
Returns:
- A DataFrame containing evaluation results across different feature subsets.
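The idea of scoring a model on successive feature subsets can be illustrated with a minimal sketch. This is not the module's implementation: `evaluate_feature_subsets`, `score_fn`, and `toy_score` are hypothetical names, the shrinking top-k loop is an assumption about how the subsets are formed, and the scoring callback stands in for the scaling, batching, model fitting, and evaluation that the real method delegates to `data_handler` and `model_eval_class`. The sketch returns a list of dictionaries rather than a DataFrame to stay dependency-free.

```python
def evaluate_feature_subsets(ranked_features, score_fn, num_selected_features):
    """Score the top-k ranked features for k = num_selected_features down to 1.

    ranked_features: feature indices, most important first (assumed ordering).
    score_fn: hypothetical callback that trains and evaluates a model on the
        given feature subset and returns a single performance number.
    """
    results = []
    for k in range(num_selected_features, 0, -1):
        subset = list(ranked_features[:k])
        results.append(
            {"n_features": k, "features": subset, "score": score_fn(subset)}
        )
    return results


# Toy stand-in for model training: the fraction of the subset that is
# "informative". A real scorer would fit and evaluate a model instead.
informative = {3, 7}


def toy_score(subset):
    return sum(f in informative for f in subset) / len(subset)


results = evaluate_feature_subsets([3, 7, 1, 4], toy_score, num_selected_features=3)
for row in results:
    print(row["n_features"], row["features"], round(row["score"], 3))
```

Each row records the subset size, the features used, and the score, which mirrors the "evaluation results across different feature subsets" described for the return value.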
- get_keys_where_value_exists(dict_list, X)
Finds keys in a list of dictionaries where a specified value exists.
Parameters:
- dict_list (list of dict): The list of dictionaries to search.
- X (int): The value to search for within the dictionary values.
Returns:
- keys (list of str): A list of keys where the specified value is present among the values.
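A minimal sketch of this lookup follows, assuming each dictionary maps a key (for example a CV split name) to a collection of integer IDs. The function body and the sample data are illustrative assumptions, not the module's source.

```python
def get_keys_where_value_exists(dict_list, x):
    """Return every key whose value collection contains x, in encounter order."""
    keys = []
    for mapping in dict_list:
        for key, values in mapping.items():
            if x in values:
                keys.append(key)
    return keys


# Hypothetical data: each dict maps a split name to the IDs held in that split.
subs_in_val = [
    {"split_0": [1, 2]},
    {"split_1": [3, 4]},
    {"split_2": [2, 5]},
]

keys_for_subject_2 = get_keys_where_value_exists(subs_in_val, 2)
print(keys_for_subject_2)  # ['split_0', 'split_2']
```

If the value is not found in any dictionary, the sketch returns an empty list.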
- get_top_features(ordered_rankings, ascending=False, n_features=None)
Identifies top features based on their occurrence and average position in ordered rankings.
Parameters:
- ordered_rankings (list of lists): A list where each sublist represents feature rankings in an iteration.
- ascending (bool): Determines the sorting order of mean positions. Defaults to False.
- n_features (int, optional): Specifies the number of top features to select. Defaults to selecting all.
Returns:
- top_features (DataFrame): A DataFrame containing statistics of top features including frequency and position metrics.
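One way to aggregate rankings by occurrence and average position is sketched below. Several details are assumptions rather than documented behavior: the name `top_features_sketch`, the column names, sorting by frequency first and mean position second, and the use of zero-based positions. The sketch returns a list of dictionaries instead of a DataFrame so that it runs with the standard library only.

```python
from collections import defaultdict
from statistics import mean


def top_features_sketch(ordered_rankings, ascending=False, n_features=None):
    """Summarize how often and where each feature appears across rankings."""
    positions = defaultdict(list)
    for ranking in ordered_rankings:
        for pos, feature in enumerate(ranking):
            positions[feature].append(pos)

    rows = [
        {"feature": f, "frequency": len(p), "mean_position": mean(p)}
        for f, p in positions.items()
    ]
    # Most frequent first; ties are ordered by mean position, ascending or
    # descending depending on the flag (assumed tie-breaking rule).
    sign = 1 if ascending else -1
    rows.sort(key=lambda r: (-r["frequency"], sign * r["mean_position"]))
    return rows if n_features is None else rows[:n_features]


rankings = [["a", "b", "c"], ["b", "a", "c"], ["a", "c", "b"]]
for row in top_features_sketch(rankings, ascending=True, n_features=2):
    print(row["feature"], row["frequency"], round(row["mean_position"], 3))
```

Whether a low or a high mean position marks an important feature depends on how the rankings were produced, which is why the sort direction is left to the `ascending` flag.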