
Useful AI computing procedures

Environment setup

Dataframe manipulation

Read data
- Check column names
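
A minimal sketch of reading data and checking column names with pandas. The CSV text and the column names (`age`, `sex`, `score`) are made up for illustration; in practice `pd.read_csv` would be given a file path.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a real file on disk.
csv_text = "age,sex,score\n34,F,0.71\n51,M,0.42\n29,F,0.88\n"

# pd.read_csv accepts a file path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))

# Check column names as a plain Python list.
column_names = df.columns.tolist()
print(column_names)
```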
Deep copy features
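
A short sketch of a deep copy, assuming the features live in a pandas dataframe (the data here is invented). Changes to the copy should not propagate back to the original.

```python
import pandas as pd

# Illustrative feature table.
features = pd.DataFrame({"age": [34, 51, 29], "score": [0.71, 0.42, 0.88]})

# deep=True copies the underlying data, not just a reference to it.
features_copy = features.copy(deep=True)

# Modify the copy only; the original keeps its value.
features_copy.loc[0, "age"] = 99
```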
Slice and index a dataframe
- Select target columns
- Randomly select columns
- Drop a column
- Create balanced folds in cross-validation for a continuous target, where each bin/stratum includes a roughly equal number of samples
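
The slicing steps above can be sketched as follows. The data is synthetic and the column names (`f1`, `f2`, `f3`, `target`) are placeholders. For the balanced folds, one possible approach (an assumption here, not necessarily the one used originally) is to bin the continuous target into quantiles with `pd.qcut` and stratify on the bin labels with `StratifiedKFold`.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold

# Synthetic data with a continuous target.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(100, 4)), columns=["f1", "f2", "f3", "target"])

# Select target columns by name.
selected = df[["f1", "target"]]

# Randomly select two columns (axis=1 samples columns instead of rows).
random_cols = df.sample(n=2, axis=1, random_state=0)

# Drop a column; the original dataframe is left unchanged.
without_f3 = df.drop(columns=["f3"])

# Balanced folds for a continuous target:
# 1) cut the target into quantile bins of roughly equal size,
# 2) stratify the folds on the bin labels.
n_bins = 5
bins = pd.qcut(df["target"], q=n_bins, labels=False)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_bin_counts = []
for train_idx, val_idx in skf.split(df, bins):
    # Count how many validation samples fall into each bin for this fold.
    counts = np.bincount(bins.iloc[val_idx].to_numpy(), minlength=n_bins)
    fold_bin_counts.append(counts)
```

Each entry of `fold_bin_counts` should show a near-even spread across bins, which is the point of stratifying.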
Split data into training and validation sets before any fitting of preprocessing steps (e.g., imputation) to avoid train-test contamination
- Use train_test_split
- Randomly select rows
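
A minimal sketch of the split, using synthetic data and an arbitrary 80/20 ratio. The split happens first; any preprocessing would then be fitted on the training part only.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic features and target.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(50, 3)), columns=["f1", "f2", "f3"])
y = pd.Series(rng.normal(size=50), name="target")

# Hold out 20% of the rows for validation.
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Randomly select five rows from the training set.
row_sample = X_train.sample(n=5, random_state=0)
```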

Data exploration

Feature engineering

Encode categorical variables
- Which variables are categorical?
- Ordinal encoding
- One-hot encoding
- Should the year be kept as an integer or one-hot encoded? It depends. If the year has a meaningful numeric relationship with the target (e.g., earlier years might correlate with a higher frequency of a disease), keep it as an integer. If instead specific years have their own distinct relationships with the target, one-hot encoding is the better choice.
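
The encoding options above can be sketched with scikit-learn as follows. The dataframe and its columns (`size`, `color`, `year`) are invented for illustration, and treating every non-numeric column as categorical is a simplifying assumption.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Illustrative data: one ordered category, one unordered category, one year.
df = pd.DataFrame({
    "size": ["small", "large", "medium", "small"],
    "color": ["red", "blue", "red", "green"],
    "year": [2019, 2020, 2021, 2019],
})

# Which variables are categorical? Here: every non-numeric column.
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()

# Ordinal encoding: spell out the order so small < medium < large.
ordinal = OrdinalEncoder(categories=[["small", "medium", "large"]])
size_codes = ordinal.fit_transform(df[["size"]])

# One-hot encoding: one indicator column per category.
onehot = OneHotEncoder(handle_unknown="ignore")
color_onehot = onehot.fit_transform(df[["color"]]).toarray()

# Year, option 1: keep it as a numeric feature.
year_numeric = df[["year"]]

# Year, option 2: treat each year as its own category.
year_onehot = OneHotEncoder(handle_unknown="ignore").fit_transform(df[["year"]]).toarray()
```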
Manage missing values
- Impute missing values
- Drop missing values
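
A small sketch of both options on toy data. Mean imputation is only one possible strategy; the imputer is fitted on the training data and then reused on the validation data, in line with the contamination note above.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy training and validation sets with missing entries.
train = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, np.nan]})
valid = pd.DataFrame({"a": [np.nan], "b": [6.0]})

# Impute: learn column means on the training set only, then apply to both.
imputer = SimpleImputer(strategy="mean")
train_imputed = pd.DataFrame(imputer.fit_transform(train), columns=train.columns)
valid_imputed = pd.DataFrame(imputer.transform(valid), columns=valid.columns)

# Drop: remove rows (axis=0) or columns (axis=1) that contain any missing value.
rows_dropped = train.dropna(axis=0)
cols_dropped = train.dropna(axis=1)
```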
Normalize a feature
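
A minimal sketch, assuming "normalize" means min-max scaling to the range [0, 1]; standardization with `StandardScaler` would follow the same fit/transform pattern. The values are illustrative.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# One feature as a column vector (scikit-learn expects 2D input).
train_feature = np.array([[10.0], [20.0], [30.0]])
valid_feature = np.array([[25.0]])

# Learn min and max from the training data, then reuse them on validation data.
scaler = MinMaxScaler()
train_scaled = scaler.fit_transform(train_feature)
valid_scaled = scaler.transform(valid_feature)
```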

Build a model: define -- fit -- predict -- evaluate
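
The four steps can be sketched as below. The random forest, the synthetic data, and mean absolute error as the metric are all assumptions chosen for illustration; any scikit-learn estimator follows the same pattern.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic regression data: the target depends on the first two features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=200)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)

# Define
model = RandomForestRegressor(n_estimators=50, random_state=0)
# Fit
model.fit(X_train, y_train)
# Predict
preds = model.predict(X_valid)
# Evaluate
mae = mean_absolute_error(y_valid, preds)
print(f"Validation MAE: {mae:.3f}")
```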

Make a pipeline that bundles imputation, one-hot encoding, model training, and model evaluation

Pipeline
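
One way to bundle these steps is scikit-learn's `Pipeline` with a `ColumnTransformer`. This is a sketch under assumptions: the columns (`age`, `site`), the synthetic data, the median imputation, and the random forest are all illustrative choices.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic data: one numeric and one categorical feature.
rng = np.random.default_rng(0)
n = 120
X = pd.DataFrame({
    "age": rng.integers(20, 80, size=n).astype(float),
    "site": rng.choice(["A", "B", "C"], size=n),
})
y = 0.05 * X["age"] + (X["site"] == "B").astype(float) + rng.normal(scale=0.1, size=n)

# Inject missing values into the numeric column after computing the target.
X.loc[X.index[::10], "age"] = np.nan

# Preprocessing: impute the numeric column, one-hot encode the categorical one.
preprocessor = ColumnTransformer(transformers=[
    ("num", SimpleImputer(strategy="median"), ["age"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["site"]),
])

# Bundle preprocessing and the model so both are fitted on training data only.
pipeline = Pipeline(steps=[
    ("preprocess", preprocessor),
    ("model", RandomForestRegressor(n_estimators=50, random_state=0)),
])

X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)
pipeline.fit(X_train, y_train)
preds = pipeline.predict(X_valid)
mae = mean_absolute_error(y_valid, preds)
```

Because the imputer and encoder sit inside the pipeline, calling `fit` on the training split keeps validation data out of the preprocessing statistics.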

Fine-tune a model for better performance

Evaluate the impact of different arguments on model performance
Use cross-validation (CV) to evaluate the impact of different arguments on model performance; CV is well suited to small datasets, where the extra cost of fitting the model once per fold is modest
Define several models with varied arguments and select the model with the best performance
Add more relevant features
Use another model algorithm (e.g., a graph neural network to process connectome data)
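
The argument-comparison steps above can be sketched with `cross_val_score`. The model, the candidate values for `n_estimators`, and the synthetic data are assumptions for illustration; the same loop works for any argument of any scikit-learn estimator.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Small synthetic dataset, where 5-fold CV is cheap to run.
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 4))
y = X[:, 0] ** 2 + X[:, 1] + rng.normal(scale=0.1, size=150)

# Define several models that differ in one argument.
candidate_n_estimators = [10, 50, 100]

cv_mae = {}
for n_estimators in candidate_n_estimators:
    model = RandomForestRegressor(n_estimators=n_estimators, random_state=0)
    # scikit-learn returns negated MAE so that higher is always better.
    scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_absolute_error")
    cv_mae[n_estimators] = -scores.mean()

# Select the setting with the lowest average validation error.
best_n_estimators = min(cv_mae, key=cv_mae.get)
print(cv_mae, best_n_estimators)
```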