Understanding Machine Learning Fundamentals - from a coder’s viewpoint

This blog is inspired by my notes on the Kaggle Learn courses on ML. It includes code snippets in Scikit-learn and Pandas to put the concepts into practice
Coding
ML
Python
Author

Senthil Kumar

Published

March 4, 2022


Key Learnings from Intro to ML Course in Kaggle Learn

  • How to train_test_split the data
  • Briefly, the concepts of underfitting and overfitting (the loss vs. model complexity curve)
  • How to train a typical scikit-learn model like DecisionTreeRegressor or RandomForestRegressor
    • neither needs scaling of continuous or discrete data;
    • for sklearn, categorical data might have to be converted into encoded values
  • After finding the best hyperparameters, one should retrain with them on the whole data
    • (so that the model learns a bit more from the held-out data too)
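The steps above can be sketched as follows; the toy data and the `max_leaf_nodes` value are made up for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error

# Toy regression data standing in for a real dataset
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 3))
y = X[:, 0] * 2 + X[:, 1] + rng.normal(0, 0.1, size=200)

# Hold out a validation set to estimate generalization error
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=1)

model = DecisionTreeRegressor(max_leaf_nodes=50, random_state=1)
model.fit(X_train, y_train)
mae = mean_absolute_error(y_valid, model.predict(X_valid))

# Once the hyperparameters are chosen, refit on ALL the data so the
# model also learns from the held-out rows
final_model = DecisionTreeRegressor(max_leaf_nodes=50, random_state=1)
final_model.fit(X, y)
```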

Key Learnings from Intermediary ML Course in Kaggle Learn

Missing Value Treatment:

  • Remove the null rows OR columns (dropping a column means dropping the whole feature containing the missing values)
  • Impute (by some strategy like mean, median, or a regression method such as KNN)
  • Impute + add a boolean variable for every imputed column (so that the model can hopefully treat the imputed rows differently)
  • Does removing missing values or imputing them help model accuracy more?
  • Opinion shared by the author: SimpleImputer works as effectively as a complex imputation algorithm when used inside sophisticated ML models
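A small sketch of the third strategy (impute plus a missingness flag), with made-up numbers; note that `SimpleImputer(add_indicator=True)` can also produce the flags for you:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Tiny frame with missing values (made-up numbers)
df = pd.DataFrame({"a": [1.0, np.nan, 3.0, 4.0],
                   "b": [10.0, 20.0, np.nan, 40.0]})

# Flag which rows were originally missing, per column
for col in ["a", "b"]:
    df[col + "_was_missing"] = df[col].isnull()

# Fill the gaps with the column mean
imputer = SimpleImputer(strategy="mean")
imputed = pd.DataFrame(imputer.fit_transform(df[["a", "b"]]),
                       columns=["a", "b"])
```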

Categorical Column Treatment:

  • Drop categorical columns (worst approach)
  • `OrdinalEncoder`
  • `OneHotEncoder` (in most cases, the best approach)
  • Learnt the concept of “good_category_cols” and “bad_category_cols”
    (if a particular class occurs only in the unseen dataset, the handle_unknown argument of `OneHotEncoder` can deal with it)
  • Think twice before applying one-hot encoding to “high cardinality columns”
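A sketch of the handle_unknown behaviour mentioned above: a category seen only in validation data ("green" here, invented) gets an all-zeros encoding instead of raising an error at transform time.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

X_train = pd.DataFrame({"color": ["red", "blue", "red"]})
X_valid = pd.DataFrame({"color": ["blue", "green"]})  # "green" is unseen

# handle_unknown="ignore" maps unseen categories to all-zero rows
enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(X_train)
encoded = enc.transform(X_valid).toarray()  # columns: blue, red
```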

Data Leakage

  • An example of data leakage:
    • Fitting an imputer before train_test_split: the imputation statistics are then computed from the validation rows too, so the training process has indirectly “seen” the validation data
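A leakage-safe sketch on synthetic data: split first, then let the imputer learn its statistics from the training rows only (a Pipeline enforces this ordering automatically, even inside cross-validation):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X[rng.random(X.shape) < 0.1] = np.nan  # sprinkle in missing values
y = rng.normal(size=100)

# Split FIRST; the imputer never sees the validation rows while fitting
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)

pipe = Pipeline([("impute", SimpleImputer(strategy="median")),
                 ("model", DecisionTreeRegressor(random_state=0))])
pipe.fit(X_train, y_train)      # imputation statistics from X_train only
preds = pipe.predict(X_valid)
```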

Example 1 - Nike:

  • Objective: How much shoelace material will be used?
  • Situation: With the feature Leather used this month, the prediction accuracy is 98%+. Without this feature, the accuracy is just ~80%
  • IsDataLeakage?: Depends!
    • ❌ Leather used this month is a bad feature if the number is populated during the month (which makes it unavailable at the time of predicting the amount of shoelace material needed)
    • ✔️ Leather used this month is an okay feature if the number is determined at the beginning of the month (making it available at prediction time on unseen data)

Example 2 - Nike:

  • Objective: How much shoelace material will be used?
  • Situation: Can we use the feature Leather order this month?
  • IsDataLeakage? Most likely no, however ...
    • ❌ If Shoelaces ordered (our target variable) is determined first and only then Leather Ordered is planned,
      then we won’t have Leather Ordered at the time of predicting on unseen data
    • ✔️ If Leather Ordered is determined before Shoelaces Ordered, then it is a useful feature

Example 3 - Cryptocurrency:

  • Objective: Predicting tomorrow’s crypto price with an error of <$1
  • Situation: Are the following features susceptible to leakage?
    • Current price of Crypto
    • Change in the price of crypto from 1 hour ago
    • Avg price of the crypto in the last 24 hours
    • Macro-economic Features
    • Tweets in the last 24 hours
  • IsDataLeakage? No, none of the features seem to cause leakage.
  • However, a more useful target variable is the Change in Price (positive/negative) for the next day. If that can be predicted consistently better than chance, the model is useful

Example 4 - Surgeon's Infection Rate Performance:

  • Objective: How to predict if a patient who has undergone a surgery will get infected post surgery?
  • Situation: How can information about each surgeon’s infection rate performance be carefully utilized while training?
    • The independent features are strictly data points collected until the surgery had taken place
    • The dependent variable - whether infected or not - should be post surgery measurement
  • IsDataLeakage? Depends on which features are used.
    • If a surgeon’s infection rate is used as a feature while training the model (that predicts whether a patient will be infected post surgery), that will lead to data leakage

Key Learnings from Feature Engineering Course in Kaggle Learn

  • Key Topics of this course:
    • Mutual Information
    • Inventing new features (like apparent temperature = {air temperature + humidity + wind speed})
    • Segmentation Features (using K-Means Clustering)
    • Variance in the Dataset based features (using Principal Component Analysis)
    • Encode (high cardinality) category variables using Target Encoding
  • Why Feature Engineering?
    • To improve model performance
    • To reduce computational complexity by combining many features into a few
    • To improve interpretability of results
  • Wherever the model cannot identify a proper relationship between the dependent variable and a particular independent variable,
    • we can engineer/transform one or more of the independent variables
    • so as to let the model learn a better relationship between the engineered features and the dependent variable
  • E.g.: In compressive_strength prediction on the concrete dataset, a synthetic feature - the ratio of Water to Cement - helps
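The concrete example can be sketched like this (the numbers below are invented, and the column names are assumptions):

```python
import pandas as pd

# Hypothetical slice of a concrete-strength dataset
concrete = pd.DataFrame({"Cement": [540.0, 332.5, 198.6],
                         "Water": [162.0, 228.0, 192.0]})

# The engineered ratio feature: easier for a model to exploit than
# the two raw columns separately
concrete["WaterCementRatio"] = concrete["Water"] / concrete["Cement"]
```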

Mutual Information

  • Mutual information is similar to correlation, but correlation captures only linear relationships whereas mutual information can capture any kind of relationship
  • Mutual information describes the relationship between two variables in terms of uncertainty (or certainty)
    • For e.g.: Knowing the ExteriorQuality of a house (one of 4 values - Fair, Typical, Good and Excellent) can help one reduce uncertainty over SalePrice. The better the ExteriorQuality, the higher the SalePrice
    • Typical values: if two variables have an MI score of 0.0, they are totally independent.
    • Mutual information is a logarithmic quantity, so it increases slowly
    • Mutual information is a univariate metric; MI can’t detect interactions between features. Meaning, if multiple features together make sense for a dependent variable but not independently, then MI cannot determine that. Before deciding a feature is unimportant from its MI score, it’s good to investigate any possible interaction effects
  • Parallel Read for MI like metrics:
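A sketch of the "any relationship" point on synthetic data: a purely quadratic dependence has near-zero linear correlation, yet a clearly higher MI score than a noise feature:

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
x_related = rng.uniform(-2, 2, 500)   # quadratic relationship with y
x_noise = rng.uniform(-2, 2, 500)     # unrelated to y
y = x_related ** 2 + rng.normal(0, 0.1, 500)

X = np.column_stack([x_related, x_noise])
mi = mutual_info_regression(X, y, random_state=0)
# mi[0] (quadratic feature) comes out much larger than mi[1] (noise)
```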

Types of New Features:

  • Mathematical transformations (ratio, log)
  • Grouping columns into features
    • df['New_Group_Feature'] = df[list_of_boolean_features].sum(axis=1)
    • df['New_Group_Feature'] = df[list_of_numerical_features].gt(0).sum(axis=1) (gt - greater than)
  • Grouping numerical rows into features - customer['Avg_Income_by_State'] = customer.groupby('State')['Income'].transform('mean')
  • Grouping categorical columns into features - customer['StateFreq'] = customer.groupby('State')['State'].transform('count') / customer.State.count()
  • Split features - df[['Type', 'Count']] = df['some_var'].str.split(" ", expand=True)
  • Combine features - df['new_feature'] = df['var1'] + "_" + df['var2']
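The group-transform snippets above, made runnable on a toy frame (values invented):

```python
import pandas as pd

customer = pd.DataFrame({"State": ["CA", "CA", "NY", "NY", "NY"],
                         "Income": [50.0, 70.0, 60.0, 80.0, 100.0]})

# Group statistic broadcast back to every row of its group
customer["Avg_Income_by_State"] = (
    customer.groupby("State")["Income"].transform("mean"))

# Frequency encoding of a categorical column
customer["StateFreq"] = (
    customer.groupby("State")["State"].transform("count") / len(customer))
```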

Useful Tips on Feature Engineering:

  • Linear models learn sums and differences naturally
  • Neural networks work better with scaled features
  • Ratios are difficult for many models to learn, so they can yield better results when incorporated as additional features
  • Tree models cannot naturally aggregate information across many features at once, so count features help them
  • Clustering as a feature discovery tool (add a categorical feature based on clustering a subset of features)
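The last tip can be sketched as follows: fit K-Means on a subset of features and use the cluster label as a new categorical feature (two synthetic blobs here):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two well-separated blobs standing in for, e.g., location coordinates
X = np.vstack([rng.normal(0, 0.5, size=(50, 2)),
               rng.normal(5, 0.5, size=(50, 2))])

# The cluster label becomes a categorical feature for a downstream model
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
cluster_feature = kmeans.fit_predict(X)
```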

Principal Component Analysis

  • PCA is like a partitioning of the variation in the data
  • Instead of describing the data with the original features, you apply an orthogonal transformation of the features and compute “principal components”, which are used to explain the variation in the data
  • It converts the correlated variables into mutually orthogonal (uncorrelated) principal components
  • Principal components can be more informative than the original features
  • Advantages of PCA:
    • Dimensionality reduction
    • Anomaly detection
    • Boosting the signal-to-noise ratio
    • Decorrelation
  • PCA works only for numeric variables; it works best on scaled data
  • Pipeline for PCA: original_features –> scaled_features –> PCA_features –> MI_computed_on_PCA_features
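The scale-then-PCA pipeline sketched on synthetic data; two of the three columns are strongly correlated, so the first component absorbs most of the variance:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
base = rng.normal(size=200)
X = np.column_stack([base,
                     base * 2 + rng.normal(0, 0.1, 200),  # correlated with base
                     rng.normal(size=200)])               # independent noise

# Pipeline from the note: scale first, then extract principal components
X_scaled = StandardScaler().fit_transform(X)
pca = PCA()
X_pca = pca.fit_transform(X_scaled)
explained = pca.explained_variance_ratio_   # sorted descending, sums to 1
```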

Target Encoding

  • Target encoding: a supervised feature engineering technique for encoding categorical variables using the target labels
  • Target encoding basically assigns a number to each category of a categorical variable, where the number is derived from the target variable - autos['target_encoded_make'] = autos.groupby('make')['Price'].transform('mean')
  • Disadvantages of target encoding:
    • It overfits for low-volume (rare) classes
    • Missing values need separate handling
  • Where is target encoding most suitable?
    • For high-cardinality features
    • For domain-motivated features (features that score poorly on a feature importance metric may reveal their real usefulness after target encoding)
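A sketch of mean target encoding plus a simple smoothing variant that guards against the rare-class overfitting mentioned above (toy data; the smoothing strength `m` is an assumed choice):

```python
import pandas as pd

autos = pd.DataFrame({"make": ["audi", "audi", "bmw", "bmw", "bmw", "fiat"],
                      "Price": [30.0, 34.0, 40.0, 42.0, 44.0, 12.0]})

# Plain mean encoding, as in the snippet above
autos["make_encoded"] = autos.groupby("make")["Price"].transform("mean")

# Smoothed encoding: rare classes get pulled toward the global mean
m = 2.0                                  # smoothing strength (assumed)
global_mean = autos["Price"].mean()
stats = autos.groupby("make")["Price"].agg(["mean", "count"])
smoothed = ((stats["count"] * stats["mean"] + m * global_mean)
            / (stats["count"] + m))
autos["make_smoothed"] = autos["make"].map(smoothed)
```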

ML Coding Tips

  • sklearn:
    • Pipeline: bundles together preprocessing and modeling steps | makes the codebase easier to productionalize
    • ColumnTransformer: Bundles together different preprocessing steps
    • Sklearn classes used often:
    • Model
      • from sklearn.tree import DecisionTreeRegressor AND from sklearn.ensemble import RandomForestRegressor
      • from sklearn.model_selection import train_test_split
      • from sklearn.metrics import mean_absolute_error
      • from sklearn.model_selection import cross_val_score
        • cross_val_score(my_pipeline, X, y, scoring='neg_mean_absolute_error')
      • from xgboost import XGBRegressor
        • n_estimators: the number of estimators, i.e., the number of boosting rounds the data goes through (typically 100-1000)
        • early_stopping_rounds: early stopping halts the iterations when the validation score stops improving
        • learning_rate
          • xgboost_model = XGBRegressor(n_estimators=500)
          • xgboost_model.fit(X_train, y_train, early_stopping_rounds=5, eval_set=[(X_valid, y_valid)], verbose=False)
    • Preprocessing & Feature Engineering
      • from sklearn.feature_selection import mutual_info_regression, mutual_info_classif
      • from sklearn.decomposition import PCA
      • from sklearn.impute import SimpleImputer, KNNImputer
      • from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
  • Pandas:
    • X = df.copy(); y = X.pop('TheDependentVariable') # remove the dependent variable from X (the features) and save it in y
    • df[encoded_colname], unique_values = df[colname].factorize() # for converting a categorical list of values into encoded numbers using pandas
    • df[list_of_oh_encoded_col_name_values] = pd.get_dummies(df[colname]) # for converting a categorical variable into a list of oh-encoded-values using pandas
    • Exclude all categorical columns at once:
      • df = df.select_dtypes(exclude=['object'])
    • Creating a new column to show the model which rows of a particular column have null values
      • df[col + '__ismissing'] = df[col].isnull()
    • Isolate all categorical columns:
      • object_cols = [col for col in df.columns if df[col].dtype == "object"]
    • Segregate good and bad object columns (defined by the presence of “unknown” or new categories in validation or test dataset)
      • good_object_cols = [col for col in object_cols if set(X_valid[col]).issubset(set(X_train[col]))]
      • bad_object_cols = list(set(object_cols) - set(good_object_cols))
    • Getting number of unique entries (cardinality) across object or categorical columns
      • num_of_uniques_in_object_cols = list(map(lambda col: df[col].nunique(), object_cols))
      • sorted(list(zip(object_cols, num_of_uniques_in_object_cols)), key=lambda x: x[1], reverse=True)
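Tying the sklearn and pandas tips together, a hedged end-to-end sketch on an invented frame: ColumnTransformer bundles the per-column preprocessing, Pipeline attaches the model, and cross_val_score evaluates with negated MAE:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Invented toy frame: one numeric column (with a gap), one categorical
df = pd.DataFrame({"rooms": [2.0, 3.0, np.nan, 4.0, 3.0, 5.0] * 5,
                   "city": ["A", "B", "A", "B", "A", "B"] * 5,
                   "price": [100, 150, 110, 200, 120, 260] * 5})
X = df.copy()
y = X.pop("price")   # the df.copy() / X.pop() idiom from the tips above

preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), ["rooms"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
])
my_pipeline = Pipeline([("prep", preprocess),
                        ("model", RandomForestRegressor(n_estimators=50,
                                                        random_state=0))])

# sklearn maximizes scores, hence the negated MAE; flip the sign back
scores = -cross_val_score(my_pipeline, X, y, cv=3,
                          scoring="neg_mean_absolute_error")
```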

Source:
- Kaggle.com/learn