Key Learnings from Intro to ML Course in Kaggle Learn
- How to `train_test_split` the data
- Briefly discussed the concept of underfitting and overfitting (loss vs. model complexity curve)
- How to train a typical scikit-learn model like `DecisionTreeRegressor` or `RandomForestRegressor`
  - both need no scaling of continuous or discrete data
  - for sklearn, categorical data might have to be converted into encoded values
- After finding the best hyperparameters, one should train with them on the whole dataset
  - (so that the model will learn a bit more from the held-out data too)
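A minimal sketch of that end-to-end flow; the CSV path, feature names, and `Price` target are placeholders, not from the course notes:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error

df = pd.read_csv("housing.csv")                    # placeholder dataset
X = df[["Rooms", "Bathroom", "Landsize"]]          # placeholder features
y = df["Price"]                                    # placeholder target

# Hold out a validation set to estimate generalization error
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)

model = DecisionTreeRegressor(max_leaf_nodes=100, random_state=0)
model.fit(X_train, y_train)
print(mean_absolute_error(y_valid, model.predict(X_valid)))

# Once hyperparameters are chosen, refit on ALL the data so the model
# also learns from the held-out rows
final_model = DecisionTreeRegressor(max_leaf_nodes=100, random_state=0)
final_model.fit(X, y)
```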
Key Learnings from Intermediate ML Course in Kaggle Learn
Missing Value Treatment:
- Remove the null rows OR columns (by column, meaning the whole feature containing the missing values)
- Impute (by some strategy like mean, median, or a regression such as KNN)
- Impute + add a boolean variable for every imputed column (so as to make the model, hopefully, treat the imputed rows differently)
- Does removing missing values or imputing them help model accuracy more?
  - Opinion shared by the author: `SimpleImputer` works as effectively as a complex imputing algorithm when used inside sophisticated ML models
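A minimal sketch of the third strategy (impute + indicator columns), on toy frames standing in for a real train/validation split:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frames standing in for X_train / X_valid from a prior split
X_train = pd.DataFrame({"LotArea": [8450.0, np.nan, 11250.0],
                        "GarageCars": [2.0, 3.0, np.nan]})
X_valid = pd.DataFrame({"LotArea": [9600.0, np.nan],
                        "GarageCars": [np.nan, 1.0]})

# Boolean flag per imputed column, so the model can (hopefully)
# treat imputed rows differently
for c in [c for c in X_train.columns if X_train[c].isnull().any()]:
    X_train[c + "_was_missing"] = X_train[c].isnull()
    X_valid[c + "_was_missing"] = X_valid[c].isnull()

imputer = SimpleImputer(strategy="mean")   # fit on train, reuse on valid
X_train_imp = pd.DataFrame(imputer.fit_transform(X_train),
                           columns=X_train.columns)
X_valid_imp = pd.DataFrame(imputer.transform(X_valid),
                           columns=X_valid.columns)
```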
Categorical Column Treatment:
- `Drop categorical columns` (worst approach)
- `OrdinalEncoder`
- `OneHotEncoder` (in most cases, the best approach)
- Learnt the concept of "good_category_cols" and "bad_category_columns" (if a particular class appears only in the unseen dataset, the `handle_unknown` argument of `OneHotEncoder` can deal with it)
- Think twice before applying one-hot encoding because of "high cardinality columns"
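A minimal sketch of `handle_unknown` in action, on toy data where validation contains a category never seen in training (`sparse_output` assumes scikit-learn >= 1.2; older versions use `sparse=False`):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy data: "Dirt" appears only in the unseen/validation rows
X_train = pd.DataFrame({"Street": ["Pave", "Grvl", "Pave"]})
X_valid = pd.DataFrame({"Street": ["Pave", "Dirt"]})

# handle_unknown="ignore" encodes the unseen "Dirt" row as all zeros
# instead of raising an error at transform time
encoder = OneHotEncoder(handle_unknown="ignore", sparse_output=False)
oh_train = pd.DataFrame(encoder.fit_transform(X_train[["Street"]]),
                        columns=encoder.get_feature_names_out())
oh_valid = pd.DataFrame(encoder.transform(X_valid[["Street"]]),
                        columns=encoder.get_feature_names_out())
print(oh_valid)  # the "Dirt" row is all zeros
```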
Data Leakage
- An example of data leakage:
  - Fitting an imputer before `train_test_split`: the imputation statistics are then computed on rows that later land in the validation set, so the training step has effectively "seen" the validation data (sketch below)
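A minimal sketch of the leakage-free ordering on toy data: split first, then fit preprocessing on the training fold only:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

X = np.array([[1.0], [np.nan], [3.0], [4.0]])   # toy feature with a gap
y = np.array([0, 1, 0, 1])

# Split FIRST, then fit the imputer on the training fold only
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)
imputer = SimpleImputer(strategy="median")
X_train_imp = imputer.fit_transform(X_train)  # statistics from training rows only
X_valid_imp = imputer.transform(X_valid)      # reuse, never refit on validation
```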
Example 1 - Nike:
- Objective: How much shoelace material will be used?
- Situation: With the feature `Leather used this month`, the prediction accuracy is 98%+. Without this feature, the accuracy is just ~80%
- Is it data leakage? Depends!
  - ❌ `Leather used this month` is a bad feature if the number is populated during the month (which makes it unavailable at the time we need to predict the amount of shoelace material)
  - ✔️ `Leather used this month` is an okay feature if the number is determined at the beginning of the month (making it available at prediction time on unseen data)
Example 2 - Nike:
- Objective: How much shoelace material will be used?
- Situation: Can we use the feature `Leather ordered this month`?
- Is it data leakage? Most likely no, however...
  - ❌ If `Shoelaces ordered` (our target variable) is determined first and only then `Leather ordered` is planned, we won't have `Leather ordered` at the time of prediction on unseen data
  - ✔️ If `Leather ordered` is determined before `Shoelaces ordered`, then it is a useful feature
Example 3 - Cryptocurrency:
- Objective: Predicting tomorrow's crypto price with an error of <$1
- Situation: Are the following features susceptible to leakage?
  - Current price of the crypto
  - Change in the price of the crypto from 1 hour ago
  - Average price of the crypto over the last 24 hours
  - Macro-economic features
  - Tweets in the last 24 hours
- Is it data leakage? No, none of the features seem to cause leakage.
- However, a more useful target variable is `Change in price (pos/neg) the next day`. If this can be predicted consistently, then it is a useful model
Example 4 - Surgeon's Infection Rate Performance:
- Objective: How to predict whether a patient who has undergone surgery will get infected post-surgery?
- Situation: How can information about each surgeon's infection rate performance be carefully utilized while training?
  - The independent features should strictly be data points collected up until the surgery took place
  - The dependent variable (infected or not) should be a post-surgery measurement
- Is it data leakage? Depends on what features are used.
  - If a surgeon's infection rate is used as a feature while training the model (the one that predicts whether a patient will be infected post-surgery), and that rate is computed including the very patients being predicted, that will lead to data leakage
Key Learnings from Feature Engineering Course in Kaggle Learn
- Key topics of this course:
  - Mutual Information
  - Inventing new features (like `apparent temperature` = {air temperature + humidity + wind speed})
  - Segmentation features (using K-Means clustering)
  - Features based on the variance in the dataset (using Principal Component Analysis)
  - Encoding (high cardinality) categorical variables using `Target Encoding`
- Why feature engineering?
  - To improve model performance
  - To reduce computational complexity by combining many features into a few
  - To improve interpretability of results
- Wherever the model cannot identify a proper relationship between the dependent variable and a particular independent variable,
  - we can engineer/transform one or more of the independent variables
  - so as to let the model learn a better relationship between the engineered features and the dependent variable
  - E.g.: In `compressive_strength` prediction on `cement` data, a synthetic feature, the ratio of water to cement, helps (see the sketch below)
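A tiny hedged sketch of that synthetic ratio feature (toy numbers, not the actual course dataset):

```python
import pandas as pd

# Toy concrete data; "Cement" and "Water" follow the example above
concrete = pd.DataFrame({"Cement": [540.0, 332.5, 198.6],
                         "Water": [162.0, 228.0, 192.0]})

# Ratio feature: hard for many models to learn on their own
concrete["WtrCmtRatio"] = concrete["Water"] / concrete["Cement"]
```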
Mutual Information
- Mutual information is similar to correlation, but correlation only looks for linear relationships whereas mutual information can capture any kind of relationship
- `Mutual Information` describes the relationship between two variables in terms of uncertainty (or certainty)
  - For e.g.: Knowing the `ExteriorQuality` of a house (one of 4 values - Fair, Typical, Good and Excellent) can help one reduce uncertainty over `SalePrice`. The better the ExteriorQuality, the higher the SalePrice
- Typical values: if two variables have an MI score of 0.0, they are totally independent
- Mutual information is a logarithmic quantity, so it increases slowly
- Mutual information is a univariate metric; it can't detect interactions between features. Meaning, if multiple features together explain the dependent variable but not individually, then MI cannot determine that. Before deciding a feature is unimportant from its MI score, it's good to investigate any possible interaction effects
- Parallel read for MI-like metrics:
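A self-contained sketch of scoring features with `mutual_info_regression`; the toy target depends on `x1` nonlinearly (which plain correlation would miss) and not at all on `x2`:

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
X = pd.DataFrame({"x1": rng.uniform(-1, 1, 500),
                  "x2": rng.uniform(-1, 1, 500)})
y = X["x1"] ** 2 + 0.05 * rng.normal(size=500)  # nonlinear in x1 only

mi = mutual_info_regression(X, y, random_state=0)
print(pd.Series(mi, index=X.columns).sort_values(ascending=False))  # x1 >> x2
```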
Types of New Features:
- Mathematical transformations (ratio, log)
- Grouping-columns features:
  - `df['New_Group_Feature'] = df[list_of_boolean_features].sum(axis=1)`
  - `df['New_Group_Feature'] = df[list_of_numerical_features].gt(0).sum(axis=1)` (`gt` - greater than)
- Grouping `numerical` rows features: `customer['Avg_Income_by_State'] = customer.groupby('State')['Income'].transform('mean')`
- Grouping `categorical` columns features: `customer['StateFreq'] = customer.groupby('State')['State'].transform('count') / customer.State.count()`
- Split features: `df[['Type', 'Count']] = df['some_var'].str.split(" ", expand=True)`
- Combine features: `df['new_feature'] = df['var1'] + "_" + df['var2']`
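A self-contained toy demo of the row-group, frequency, split, and combine transforms above (all column names are made up):

```python
import pandas as pd

df = pd.DataFrame({"State": ["NY", "NY", "CA"],
                   "Income": [50, 70, 60],
                   "Policy": ["Special 1", "Basic 2", "Basic 1"]})

# Row-group feature: average income per state, broadcast back to every row
df["Avg_Income_by_State"] = df.groupby("State")["Income"].transform("mean")

# Frequency encoding of a categorical column
df["StateFreq"] = df.groupby("State")["State"].transform("count") / len(df)

# Split one string column into two features
df[["Type", "Count"]] = df["Policy"].str.split(" ", expand=True)

# Combine two features into an interaction feature
df["State_Type"] = df["State"] + "_" + df["Type"]
print(df)
```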
Useful Tips on Feature Engineering:
- Linear models learn sums and differences naturally
- Neural networks work better with scaled features
- Ratios are difficult for many models to learn, so they can yield better results when incorporated as additional features
- Tree models have no natural way of aggregating information across many features at once, so count features can help them
- Clustering as a `feature discovery` tool (add a categorical feature based on clustering a subset of features; see the sketch below)
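A minimal sketch of the clustering tip, on toy coordinates (the feature subset and `n_clusters` are assumptions):

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Toy coordinates; in practice pick a meaningful subset of features
rng = np.random.default_rng(0)
X = pd.DataFrame({"Latitude": rng.uniform(32, 42, 200),
                  "Longitude": rng.uniform(-124, -114, 200)})

scaled = StandardScaler().fit_transform(X[["Latitude", "Longitude"]])
kmeans = KMeans(n_clusters=6, n_init=10, random_state=0)
X["Cluster"] = kmeans.fit_predict(scaled).astype(str)  # categorical label feature
```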
Principal Component Analysis
- PCA is like partitioning the variation in the data
  - Instead of describing the data with the original features,
  - you do an orthogonal transformation of the features and compute "principal components"
  - which are used to explain the variation in the data.
- Converts the correlated variables into mutually orthogonal (uncorrelated) `principal components`
- Principal components can be more informative than the original features
- Advantages of PCA:
  - Dimensionality reduction
  - Anomaly detection
  - Boosting signal-to-noise ratio
  - Decorrelation
- PCA works only for numeric variables; works best on scaled data
- Pipeline for PCA: original_features -> scaled_features -> PCA_features -> MI_computed_on_PCA_features
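A minimal sketch of that pipeline on toy correlated features: scale, transform with PCA, then score the components with MI:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.feature_selection import mutual_info_regression

# Toy numeric data: f1 and f2 are strongly correlated
rng = np.random.default_rng(0)
a = rng.normal(size=300)
X_num = pd.DataFrame({"f1": a,
                      "f2": a + 0.1 * rng.normal(size=300),
                      "f3": rng.normal(size=300)})
y = a ** 2

X_scaled = StandardScaler().fit_transform(X_num)       # scale first
pca = PCA()
X_pca = pd.DataFrame(pca.fit_transform(X_scaled),
                     columns=["PC1", "PC2", "PC3"])
mi = mutual_info_regression(X_pca, y, random_state=0)  # MI on the components
print(pd.Series(mi, index=X_pca.columns))
```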
Target Encoding
- `Target Encoding`: a supervised feature engineering technique for encoding categorical variables using the target labels
- Target encoding basically assigns a number to each category, where the number is derived from the target variable:
  - `autos['target_encoded_make'] = autos.groupby('make')['Price'].transform('mean')`
- Disadvantages of target encoding:
  - Overfits for low-volume (rare) categories
  - What if there are missing values?
- Where is target encoding most suitable?
  - For high-cardinality features
  - For domain-motivated features (features that might score poorly on a feature-importance metric; target encoding can unearth their real usefulness)
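A hedged sketch of one common fix for the rare-category overfitting above: m-estimate smoothing, which blends each category's mean with the overall mean in proportion to its count. This is a hand-rolled pandas equivalent of what a library encoder like `category_encoders.MEstimateEncoder` does, and `m` is an assumed smoothing weight; ideally, fit the encoding on rows held out from model training:

```python
import pandas as pd

# Toy data standing in for the `autos` example above
autos = pd.DataFrame({"make": ["bmw", "bmw", "audi", "saab"],
                      "Price": [30000, 34000, 28000, 15000]})

m = 5.0                                   # smoothing weight (an assumption)
overall = autos["Price"].mean()
stats = autos.groupby("make")["Price"].agg(["mean", "count"])

# Rare makes get pulled toward the overall mean; frequent ones keep their own
smoothed = (stats["count"] * stats["mean"] + m * overall) / (stats["count"] + m)
autos["make_encoded"] = autos["make"].map(smoothed)
```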
ML Coding Tips
- sklearn:
  - `Pipeline`: bundles together `preprocessing` and `modeling` steps | makes the codebase easier for productionalizing (a sketch combining it with `ColumnTransformer` follows at the end of this section)
  - `ColumnTransformer`: bundles together different preprocessing steps
- sklearn classes used often:
  - Model
    - `from sklearn.tree import DecisionTreeRegressor` AND `from sklearn.ensemble import RandomForestRegressor`
    - `from sklearn.model_selection import train_test_split`
    - `from sklearn.metrics import mean_absolute_error`
    - `from sklearn.model_selection import cross_val_score`
      - `cross_val_score(my_pipeline, X, y, scoring='neg_mean_absolute_error')`
    - `from xgboost import XGBRegressor`
      - `n_estimators`: the number of estimators is the same as the number of cycles the data is processed by the model (typically `100-1000`)
      - `early_stopping_rounds`: early stopping halts the iterations when the validation score stops improving
      - `learning_rate`
      - `xgboost_model = XGBRegressor(n_estimators=500)`
      - `xgboost_model.fit(X_train, y_train, early_stopping_rounds=5, eval_set=[(X_valid, y_valid)], verbose=False)`
  - Preprocessing & Feature Engineering
    - `from sklearn.feature_selection import mutual_info_regression, mutual_info_classif`
    - `from sklearn.decomposition import PCA`
    - `from sklearn.impute import SimpleImputer, KNNImputer`
    - `from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder`
- Pandas:
  - `X = df.copy(); y = X.pop('TheDependentVariable')` # remove the dependent variable from X (the features) and save it in y
  - `df[encoded_colname], unique_values = df[colname].factorize()` # for converting a categorical column into encoded numbers using pandas
  - `df[list_of_oh_encoded_col_name_values] = pd.get_dummies(df[colname])` # for converting a categorical variable into one-hot-encoded columns using pandas
  - Exclude all categorical columns at once: `df = df.select_dtypes(exclude=['object'])`
  - Creating a new column just to show the model which rows of a particular column have null values:
    - `df[col + '__ismissing'] = df[col].isnull()`
  - Isolate all categorical columns: `object_cols = [col for col in df.columns if df[col].dtype == "object"]`
  - Segregate good and bad object columns (defined by the presence of "unknown"/new categories in the validation or test dataset):
    - `good_object_cols = [col for col in object_cols if set(X_valid[col]).issubset(set(X_train[col]))]`
    - `bad_object_cols = list(set(object_cols) - set(good_object_cols))`
  - Getting the number of unique entries (`cardinality`) across `object` (categorical) columns:
    - `num_of_uniques_in_object_cols = list(map(lambda col: df[col].nunique(), object_cols))`
    - `sorted(list(zip(object_cols, num_of_uniques_in_object_cols)), key=lambda x: x[1], reverse=True)`
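As referenced in the sklearn tips above, a minimal sketch bundling `ColumnTransformer` preprocessing and a model into one `Pipeline` (toy data; your column lists will differ):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Toy data; in practice X, y come from your dataset
X = pd.DataFrame({"Rooms": [3, np.nan, 4, 2, 5, 3],
                  "Street": ["Pave", "Grvl", "Pave", np.nan, "Grvl", "Pave"]})
y = pd.Series([210, 145, 250, 120, 160, 205])

# Different preprocessing per column group
categorical = Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                        ("onehot", OneHotEncoder(handle_unknown="ignore"))])
preprocessor = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), ["Rooms"]),
    ("cat", categorical, ["Street"]),
])

# Preprocessing + model as a single estimator
my_pipeline = Pipeline([
    ("preprocess", preprocessor),
    ("model", RandomForestRegressor(n_estimators=100, random_state=0)),
])

# cross_val_score refits the preprocessing inside every fold - no leakage
scores = -cross_val_score(my_pipeline, X, y, cv=3,
                          scoring="neg_mean_absolute_error")
print(scores.mean())
```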
Source:
- Kaggle.com/learn