- What files are contained in the training and test folders?
- Create symbolic links to access a file from another location without duplicating it
- Suppress warning messages
- Read data
- Deep copy features
- Slice and index a dataframe
- Select target columns
- Randomly select columns
- Drop a column
- Create balanced folds in cross-validation for a continuous target by binning it, so that each bin/stratum includes a roughly equal number of samples
- Split data into training and validation sets before any fitting of preprocessing steps (e.g., imputation) to avoid train-test contamination
- Encode categorical variables
- Which variables are categorical?
- Ordinal encoding
- One-hot encoding
- Should the year be kept as an integer or one-hot encoded? It depends. If the target changes steadily with the year (e.g., earlier years might correlate with a higher frequency of a disease), keep it as an integer. If specific years have their own distinct relationship with the target, one-hot encoding is the better choice.
- Manage missing values
- Normalize a feature
- Train a decision tree regressor
- Train a decision tree classifier
- Train a random forest regressor
- Train a gradient boosting regressor
- Save model parameters and full model
- Evaluate the impact of different arguments on model performance
- Use cross-validation (CV) to evaluate the impact of different arguments on model performance; CV suits small datasets, where the extra computational cost is not a major concern
- Define several models with varied arguments and select the model with the best performance
- Add more relevant features
- Try a different model (e.g., a graph neural network to process the connectome)
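
Two of the steps above, balanced folds for a continuous target and fitting preprocessing only on the training portion, can be sketched together. This is a minimal illustration under stated assumptions, not the original notebook's code: the synthetic data, the column names (`f0` to `f4`), the number of bins, and the choice of median imputation with a shallow decision tree are all hypothetical stand-ins.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeRegressor

# Hypothetical stand-in data: 200 samples, 5 features, continuous target.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 5)), columns=[f"f{i}" for i in range(5)])
y = pd.Series(2.0 * X["f0"] + rng.normal(scale=0.5, size=200), name="target")
X.iloc[::10, 1] = np.nan  # inject missing values so there is something to impute

# Bin the continuous target into quantile-based strata of roughly equal size.
n_bins = 5  # assumed value; tune to the dataset size
strata = pd.qcut(y, q=n_bins, labels=False)

# Stratify on the bins, not on the raw continuous target.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_scores = []
fold_strata_counts = []
for train_idx, valid_idx in skf.split(X, strata):
    X_train, X_valid = X.iloc[train_idx], X.iloc[valid_idx]
    y_train, y_valid = y.iloc[train_idx], y.iloc[valid_idx]

    # The pipeline fits the imputer on the training fold only, so no
    # statistics from the validation fold leak into preprocessing.
    model = make_pipeline(
        SimpleImputer(strategy="median"),
        DecisionTreeRegressor(max_depth=3, random_state=0),
    )
    model.fit(X_train, y_train)
    fold_scores.append(mean_absolute_error(y_valid, model.predict(X_valid)))

    # Record how many validation samples fall into each stratum.
    fold_strata_counts.append(
        np.bincount(strata.iloc[valid_idx].to_numpy(), minlength=n_bins)
    )

mean_mae = float(np.mean(fold_scores))
print("MAE per fold:", np.round(fold_scores, 3))
print("Mean MAE:", round(mean_mae, 3))
print("Validation samples per stratum, per fold:")
print(np.vstack(fold_strata_counts))
```

Wrapping the imputer and the model in one pipeline is one way to keep the split-before-fit rule hard to break by accident; the same pattern extends to encoders and scalers.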