AML class

class AutoMxL.__main__.AML(*args, target=None, **kwargs)[source]

Covers the complete pipeline of a classification project from a raw dataset to a deployable model.

AML is built as a class inherited from the pandas DataFrame. Each Machine Learning step corresponds to a method that can be called with default or custom parameters.

  • explore: explore the dataset and identify feature types
  • preprocess: clean and prepare data (optional: outlier processing)
  • select_features: feature selection (optional)
  • model_train_test: split AML into train/test sets to fit and apply models with random search. Returns the list of valid models (those that do not overfit) and the best one.

Deployment methods:

  • preprocess_apply: apply the fitted preprocessing transformations to a new dataset
  • select_features_apply: apply the fitted feature selection to a new dataset
  • model_predict: apply the fitted models to a new dataset

Notes :

  • Each method requires the previous one to have been applied (the current step is given by the “step” attribute)
  • The target has to be binary and encoded as int (1/0) (see the MLGB59.Start.Encode_Target module if you need help)
  • Please don’t name your target “target”
Parameters:
  • _obj (DataFrame) – Source Dataset
  • target (string) – target name
explore(verbose=False)[source]

Data exploration and feature type identification

Note: if you disagree with the automated identification, you can directly modify the d_features attribute

Creates self.d_features: dict {x : list of variable names}
  • date: date features
  • identifier: identifier features
  • verbatim: verbatim features
  • boolean: boolean features
  • categorical: categorical features
  • numerical: numerical features
  • NA: features that contain NA values
  • low_variance: list of the features with low variance and unique values
Parameters: verbose (boolean (Default False)) – Get logging information
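
As a rough illustration of the note above: d_features is a dict of lists, so a feature that was assigned to the wrong type can be moved by hand. The sketch below uses a plain dict; the column names and the move_feature helper are hypothetical, only the dictionary keys come from this page.

```python
# Hypothetical content of d_features after explore(); column names are made up.
d_features = {
    "date": ["signup_date"],
    "identifier": ["customer_id"],
    "verbatim": [],
    "boolean": ["is_active"],
    "categorical": ["region"],
    "numerical": ["age", "zip_code"],
    "NA": ["age"],
    "low_variance": [],
}


def move_feature(d_features, name, source, destination):
    """Move a feature name from one type list to another, in place."""
    d_features[source].remove(name)
    if name not in d_features[destination]:
        d_features[destination].append(name)


# zip_code was detected as numerical but is better treated as categorical.
move_feature(d_features, "zip_code", "numerical", "categorical")
```

On an AML instance the same change would be made on its d_features attribute before calling preprocess.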
model_predict(df, metric='F1', delta_auc=0.03, verbose=False)[source]

apply fitted models on a dataset

  • identifies valid models (auc(train) - auc(test) < delta_auc, 0.03 by default)
  • gets the best model with respect to a selected metric among the valid models
Parameters:
  • metric (string (Default : 'F1')) – objective metric
  • verbose (boolean (Default False)) – Get logging information
Returns:

  • dict – {model_index : {‘HP’, ‘probas’, ‘model’, ‘features_importance’, ‘train_metrics’, ‘metrics’, ‘output’}}
  • list – valid models indexes
  • int – best model index
  • DataFrame – models summary
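
The validity rule and the choice of the best model can be sketched as follows. This is a simplified stand-in, not the library's code: the select_models function and the flat 'train_auc' / 'test_auc' keys are assumptions made for the example, whereas the real output dict nests its scores under ‘train_metrics’ and ‘metrics’.

```python
def select_models(results, metric="F1", delta_auc=0.03):
    """Return (valid model indexes, best model index or None).

    `results` maps a model index to a dict with the hypothetical keys
    'train_auc', 'test_auc' and one entry per metric name.
    """
    # A model is valid when its train/test AUC gap stays below delta_auc.
    valid = [
        idx for idx, res in results.items()
        if res["train_auc"] - res["test_auc"] < delta_auc
    ]
    if not valid:
        return valid, None
    # Among the valid models, keep the one with the highest objective metric.
    best = max(valid, key=lambda idx: results[idx][metric])
    return valid, best


results = {
    0: {"train_auc": 0.95, "test_auc": 0.85, "F1": 0.80},  # gap of 0.10: overfits
    1: {"train_auc": 0.88, "test_auc": 0.87, "F1": 0.74},
    2: {"train_auc": 0.90, "test_auc": 0.88, "F1": 0.77},
}
valid, best = select_models(results)
```

Model 0 has the highest F1 but is discarded because its AUC gap exceeds the threshold, so the best model is picked among models 1 and 2 only.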

model_train(clf='XGBOOST', grid_param=None, top_bagging=False, n_comb=10, comb_seed=None, verbose=False)[source]

train models with random search

  • creates models with random hyper-parameters combinations from HP grid
  • fits models on self

Notes :

  • Available classifiers : Random Forest, XGBOOST
  • can enable bagging algo with top_bagging parameter
Parameters:
  • clf (string (Default : 'XGBOOST')) – classifier used for modeling
  • grid_param (dict) – random search grid {Hyperparameter name : values list}
  • top_bagging (boolean (Default : False)) – enable Bagging
  • n_comb (int (Default : 10)) – HP combination number
  • comb_seed (int (Default : None)) – random combination seed
  • verbose (boolean (Default False)) – Get logging information
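
To make the roles of grid_param, n_comb and comb_seed more concrete, here is a minimal sketch of drawing random combinations from a grid. The sample_combinations function and the hyper-parameter names are assumptions for the example; the library may sample differently (for instance without duplicates).

```python
import random


def sample_combinations(grid_param, n_comb=10, comb_seed=None):
    """Draw n_comb random hyper-parameter combinations from a grid."""
    rng = random.Random(comb_seed)  # a fixed seed makes the draw reproducible
    return [
        {name: rng.choice(values) for name, values in grid_param.items()}
        for _ in range(n_comb)
    ]


# Hypothetical grid; the hyper-parameter names are only examples.
grid_param = {
    "n_estimators": [50, 100, 200],
    "max_depth": [3, 5, 8],
    "learning_rate": [0.01, 0.1, 0.3],
}
combinations = sample_combinations(grid_param, n_comb=5, comb_seed=42)
```

Each element of combinations is one {hyper-parameter name : value} dict, that is, one candidate model to fit.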
model_train_test(clf='XGBOOST', grid_param=None, metric='F1', delta_auc=0.03, top_bagging=False, n_comb=10, comb_seed=None, verbose=False)[source]

train and test models with random search

  • creates models with random hyper-parameters combinations from HP grid
  • splits (random 80/20) train/test sets to fit/apply models
  • identifies valid models (auc(train) - auc(test) < delta_auc, 0.03 by default)
  • gets the best model with respect to a selected metric among the valid models

Notes :

  • Available classifiers : Random Forest, XGBOOST
  • can enable bagging algo with top_bagging parameter
Parameters:
  • clf (string (Default : 'XGBOOST')) – classifier used for modeling
  • grid_param (dict) – random search grid {Hyperparameter name : values list}
  • metric (string (Default : 'F1')) – objective metric
  • top_bagging (boolean (Default : False)) – enable Bagging
  • n_comb (int (Default : 10)) – HP combination number
  • comb_seed (int (Default : None)) – random combination seed
  • verbose (boolean (Default False)) – Get logging information
Returns:

  • dict – {model_index : {‘HP’, ‘probas’, ‘model’, ‘features_importance’, ‘train_metrics’, ‘metrics’, ‘output’}}
  • list – valid models indexes
  • int – best model index
  • DataFrame – models summary
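
The random 80/20 split mentioned above can be pictured with the sketch below, which shuffles row positions and cuts them in two. The train_test_indexes function and its seed argument are assumptions for the example; how the library actually draws the split is not described on this page.

```python
import random


def train_test_indexes(n_rows, test_ratio=0.2, seed=None):
    """Shuffle row positions and split them into train and test lists."""
    positions = list(range(n_rows))
    random.Random(seed).shuffle(positions)
    n_test = int(round(n_rows * test_ratio))
    # The first n_test shuffled positions form the test set, the rest the train set.
    return positions[n_test:], positions[:n_test]


train_idx, test_idx = train_test_indexes(100, seed=0)
```

Models are then fitted on the train rows and scored on both sets, which gives the two AUC values compared against delta_auc.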

preprocess(date_ref=None, process_outliers=False, cat_method='deep_encoder', verbose=False)[source]

Prepare the data before feeding it to the model :

  • remove low variance features
  • remove identifier and verbatim features
  • transform date features to timedelta
  • fill missing values
  • process categorical and boolean data (one-hot-encoding or Pytorch NN encoder)
  • replace outliers (optional)
Creates self.d_preprocess: dict {step : transformation}
  • remove: list of the features to remove
  • date: fitted DateEncoder object
  • NA: fitted NAEncoder object
  • categorical: fitted CategoricalEncoder object
  • outlier: fitted OutlierEncoder object
Parameters:
  • date_ref (string '%d/%m/%y' (Default : None)) – reference date used to compute the timedelta of date features. If None, today's date is used
  • process_outliers (boolean (Default : False)) – Enable outliers replacement
  • cat_method (string (Default : 'deep_encoder')) – Categorical features encoding method
  • verbose (boolean (Default False)) – Get logging information
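
As an illustration of the date_ref format, the sketch below turns a date into a number of days relative to a reference date given as '%d/%m/%y'. The days_since function, the unit (days) and the sign convention (reference minus date) are assumptions for the example, not the documented behaviour of DateEncoder.

```python
from datetime import datetime


def days_since(date_str, date_ref=None, fmt="%d/%m/%y"):
    """Number of days between a date and the reference date."""
    # Fall back to the current date when no reference date is given.
    ref = datetime.strptime(date_ref, fmt) if date_ref else datetime.now()
    return (ref - datetime.strptime(date_str, fmt)).days


delta = days_since("01/01/20", date_ref="31/01/20")
```

Passing an explicit date_ref keeps the transformation reproducible, whereas the default depends on the day the code is run.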
preprocess_apply(df, verbose=False)[source]

Apply preprocessing.

Requires the preprocess method to have been applied (so that all encoders are fitted).

Parameters:
  • df (DataFrame) – dataset to apply preprocessing on
  • verbose (boolean (Default False)) – Get logging information
Returns:

Preprocessed dataset

Return type:

DataFrame
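
The fit-then-apply idea behind preprocess and preprocess_apply can be sketched with a toy encoder: a value is learned on the training data, stored, and reused unchanged on new data. The MedianFiller class and the median strategy are assumptions for the example; the strategy used by the library's NAEncoder is not described on this page.

```python
from statistics import median


class MedianFiller:
    """Toy stand-in for a fitted encoder: fills None with the training median."""

    def fit(self, values):
        # Learn the fill value from the training data only.
        self.fill_value = median(v for v in values if v is not None)
        return self

    def transform(self, values):
        # Reuse the stored value; nothing is re-estimated on new data.
        return [self.fill_value if v is None else v for v in values]


# Fit on the training column, then apply the fitted value to new data.
filler = MedianFiller().fit([10, None, 30, 20])
new_column = filler.transform([None, 5])
```

Reusing the fitted objects is what keeps a new dataset consistent with the data the models were trained on.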

select_features(method='pca', verbose=False)[source]

Fit and apply feature selection (optional)

Parameters:
  • method (string (Default pca)) – method used to select features
  • verbose (boolean (Default False)) – Get logging information
select_features_apply(df, verbose=False)[source]

Apply feature selection.

Requires the select_features method to have been applied

Parameters:
  • df (DataFrame) – dataset to apply selection on
  • verbose (boolean (Default False)) – Get logging information
Returns:

Reduced dataset

Return type:

DataFrame