AML class

class AutoMxL.__main__.AML(*args, target=None, **kwargs)

Covers the complete pipeline of a classification project from a raw dataset to a deployable model.

AML is built as a class inherited from the pandas DataFrame. Each Machine Learning step corresponds to a method that can be called with default or user-supplied parameters.

explore: explore the dataset and identify feature types
preprocess: clean and prepare data (optional: outlier processing)
select_features: feature selection (optional)
model_train_test: split AML into train/test sets to fit and apply models with random search. Returns the list of valid models (without overfitting) and the best one.

Deployment methods:

preprocess_apply: apply the fitted preprocessing transformations to a new dataset
select_features_apply: apply the fitted feature selection to a new dataset
model_predict: apply fitted models to a new dataset

Notes :

Each method requires the previous one to have been applied (the current step is given by the “step” attribute)
The target has to be binary and encoded as int (1/0) (see the MLGB59.Start.Encode_Target module if you need help)
Do not name your target column “target”

Parameters:

_obj (DataFrame) – source dataset
target (string) – target name
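The binary 1/0 target requirement from the notes can be met with a small encoding step before the AML object is built. The following is a minimal sketch in plain Python, assuming a hypothetical helper `encode_target` and made-up labels; it is not the library's own encoder.

```python
def encode_target(values, positive_label):
    """Map a two-class label column to ints: 1 for positive_label, 0 otherwise.

    Hypothetical helper for illustration; not part of AutoMxL.
    """
    distinct = set(values)
    if len(distinct) != 2:
        raise ValueError(f"target must be binary, found {len(distinct)} classes")
    if positive_label not in distinct:
        raise ValueError(f"{positive_label!r} is not one of the target values")
    return [1 if v == positive_label else 0 for v in values]


labels = ["yes", "no", "no", "yes"]   # made-up raw target column
encoded = encode_target(labels, positive_label="yes")
```

Failing early on a non-binary column avoids discovering the problem only at the modelling step.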

explore(verbose=False)

Data exploration and feature type identification.

Note : if you disagree with the automated identification, you can directly modify the d_features attribute

Creates self.d_features : dict {feature type : list of variable names}

date: date features
identifier: identifier features
verbatim: verbatim features
boolean: boolean features
categorical: categorical features
numerical: numerical features
NA: features which contain NA values
low_variance: list of the features with low variance and unique values

Parameters:	verbose (boolean (Default False)) – Get logging information
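As the note for explore says, the automated identification can be overridden by editing d_features directly. The sketch below works on a plain dict with the keys listed above; the feature names and the `move_feature` helper are made up for illustration. With a real AML object the same operation would presumably target its `d_features` attribute.

```python
# Illustrative d_features content; the feature names are made up.
d_features = {
    "date": ["signup_date"],
    "identifier": ["customer_id"],
    "verbatim": [],
    "boolean": ["is_active"],
    "categorical": ["country"],
    "numerical": ["age", "income", "zip_code"],
    "NA": ["income"],
    "low_variance": [],
}


def move_feature(d_features, name, src, dst):
    """Reassign a feature from one type list to another, in place."""
    d_features[src].remove(name)   # raises ValueError if name is not in src
    d_features[dst].append(name)


# Suppose zip_code was detected as numerical but should be treated as categorical
move_feature(d_features, "zip_code", "numerical", "categorical")
```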

model_predict(df, metric='F1', delta_auc=0.03, verbose=False)

Apply fitted models to a dataset.

identifies valid models (auc(train) - auc(test) < delta_auc, 0.03 by default)
selects the best model among the valid models with respect to the chosen metric

Parameters:

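df (DataFrame) – dataset to apply the fitted models on
metric (string (Default : 'F1')) – objective metric
delta_auc (float (Default : 0.03)) – maximum auc(train) - auc(test) gap for a model to be considered valid
verbose (boolean (Default False)) – Get logging information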

Returns:

dict – {model_index : {‘HP’, ‘probas’, ‘model’, ‘features_importance’, ‘train_metrics’, ‘metrics’, ‘output’}}
list – valid models indexes
int – best model index
DataFrame – models summary
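The validity rule and best-model selection described for this method can be sketched in a few lines. The block assumes that the ‘train_metrics’ and ‘metrics’ entries are dicts holding an "AUC" value and the objective metric; that inner layout, the `select_models` helper, and the scores are assumptions for illustration, not the library's implementation.

```python
def select_models(results, metric="F1", delta_auc=0.03):
    """Return (valid model indexes, best model index or None).

    results: {model_index: {"train_metrics": {...}, "metrics": {...}}}
    A model is valid when auc(train) - auc(test) < delta_auc.
    """
    valid = [
        idx for idx, res in results.items()
        if res["train_metrics"]["AUC"] - res["metrics"]["AUC"] < delta_auc
    ]
    if not valid:
        return valid, None
    # Among valid models, keep the one with the highest objective metric
    best = max(valid, key=lambda idx: results[idx]["metrics"][metric])
    return valid, best


# Made-up scores for three models
results = {
    0: {"train_metrics": {"AUC": 0.95}, "metrics": {"AUC": 0.80, "F1": 0.70}},
    1: {"train_metrics": {"AUC": 0.84}, "metrics": {"AUC": 0.82, "F1": 0.61}},
    2: {"train_metrics": {"AUC": 0.86}, "metrics": {"AUC": 0.85, "F1": 0.66}},
}
valid, best = select_models(results)
```

Model 0 has the highest F1 but is discarded because its train/test AUC gap suggests overfitting, so the best model is picked among models 1 and 2 only.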

model_train(clf='XGBOOST', grid_param=None, top_bagging=False, n_comb=10, comb_seed=None, verbose=False)

train models with random search

creates models with random hyper-parameter combinations drawn from the HP grid
fits models on self

Notes :

Available classifiers : Random Forest, XGBOOST
bagging can be enabled with the top_bagging parameter

Parameters:

clf (string (Default : 'XGBOOST')) – classifier used for modeling
grid_param (dict) – random search grid {Hyperparameter name : values list}
top_bagging (boolean (Default : False)) – enable Bagging
n_comb (int (Default : 10)) – number of HP combinations
comb_seed (int (Default : None)) – random combination seed
verbose (boolean (Default False)) – Get logging information
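The roles of grid_param, n_comb and comb_seed can be illustrated with a small random-search sampler. This is a sketch under assumptions: the `sample_hp_combinations` helper, the grid contents, and the choice to sample without replacement from the full Cartesian product are illustrative and may differ from what the library does internally.

```python
import itertools
import random


def sample_hp_combinations(grid_param, n_comb=10, comb_seed=None):
    """Draw up to n_comb distinct hyper-parameter combinations from a grid."""
    names = list(grid_param)
    # Full Cartesian product of the grid values
    all_combs = [
        dict(zip(names, values))
        for values in itertools.product(*(grid_param[n] for n in names))
    ]
    rng = random.Random(comb_seed)   # a fixed seed makes the draw reproducible
    return rng.sample(all_combs, min(n_comb, len(all_combs)))


# Illustrative grid; names and values are assumptions
grid = {
    "n_estimators": [100, 200, 400],
    "max_depth": [3, 5, 7],
    "learning_rate": [0.05, 0.1],
}
combs = sample_hp_combinations(grid, n_comb=5, comb_seed=42)
```

Enumerating the whole product is fine for small grids like this one; for very large grids, drawing each hyper-parameter independently would avoid materialising every combination.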

model_train_test(clf='XGBOOST', grid_param=None, metric='F1', delta_auc=0.03, top_bagging=False, n_comb=10, comb_seed=None, verbose=False)

train and test models with random search

creates models with random hyper-parameter combinations drawn from the HP grid
splits the data (random 80/20) into train/test sets to fit and apply models
identifies valid models (auc(train) - auc(test) < delta_auc, 0.03 by default)
selects the best model among the valid models with respect to the chosen metric

Notes :

Available classifiers : Random Forest, XGBOOST
bagging can be enabled with the top_bagging parameter

Parameters:

clf (string (Default : 'XGBOOST')) – classifier used for modeling
grid_param (dict) – random search grid {Hyperparameter name : values list}
metric (string (Default : 'F1')) – objective metric
delta_auc (float (Default : 0.03)) – maximum auc(train) - auc(test) gap for a model to be considered valid
top_bagging (boolean (Default : False)) – enable Bagging
n_comb (int (Default : 10)) – number of HP combinations
comb_seed (int (Default : None)) – random combination seed
verbose (boolean (Default False)) – Get logging information

Returns:

dict – {model_index : {‘HP’, ‘probas’, ‘model’, ‘features_importance’, ‘train_metrics’, ‘metrics’, ‘output’}}
list – valid models indexes
int – best model index
DataFrame – models summary
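The random 80/20 train/test split mentioned for this method can be sketched with row positions only. The `train_test_split_indices` helper and its seed argument are assumptions for illustration; the library may split differently (for example with stratification).

```python
import random


def train_test_split_indices(n_rows, test_size=0.2, seed=None):
    """Shuffle row positions and split them into train and test index lists."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = int(round(n_rows * test_size))
    # First n_test shuffled positions go to test, the rest to train
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = train_test_split_indices(10, test_size=0.2, seed=0)
```

Every row ends up in exactly one of the two sets, so models are fitted on the train rows and scored on rows they have not seen.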

preprocess(date_ref=None, process_outliers=False, cat_method='deep_encoder', verbose=False)

Prepare the data before feeding it to the model :

remove low variance features

remove identifier and verbatim features

transform date features to timedelta

fill missing values

process categorical and boolean data (one-hot encoding or PyTorch NN encoder)

replace outliers (optional)

creates self.d_preprocess : dict {step : transformation}

remove: list of the features to remove

date: fitted DateEncoder object

NA: fitted NAEncoder object

categorical: fitted CategoricalEncoder object

outlier: fitted OutlierEncoder object

Parameters:

date_ref (string '%d/%m/%y' (Default : None)) – reference date used to compute the timedelta of date features. If None, today's date is used
process_outliers (boolean (Default : False)) – enable outlier replacement
cat_method (string (Default : 'deep_encoder')) – categorical features encoding method
verbose (boolean (Default False)) – Get logging information
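The date-to-timedelta step and the date_ref format can be illustrated with the standard library. This is a sketch under assumptions: the `days_since` helper, the unit (days) and the sign convention (reference date minus feature date) are illustrative choices, not confirmed behaviour of the fitted DateEncoder.

```python
from datetime import datetime


def days_since(date_values, date_ref=None, fmt="%d/%m/%y"):
    """Convert date strings to the number of days elapsed before the reference date."""
    # Fall back to today's date when no reference date is given
    ref = datetime.strptime(date_ref, fmt) if date_ref is not None else datetime.today()
    return [(ref - datetime.strptime(d, fmt)).days for d in date_values]


deltas = days_since(["01/01/20", "15/01/20"], date_ref="31/01/20")
```

Passing an explicit date_ref keeps the result reproducible, whereas the default depends on the day the code is run.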

preprocess_apply(df, verbose=False)

Apply preprocessing.

Requires the preprocess method to have been applied (so that all encoders are fitted).

Parameters:

df (DataFrame) – dataset to apply preprocessing on
verbose (boolean (Default False)) – Get logging information

Returns:	Preprocessed dataset
Return type:	DataFrame
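The reason preprocess must run before preprocess_apply is the usual fit/apply split: values learned on the training data are reused unchanged on new data. The toy class below illustrates the pattern for missing-value filling; `MedianFiller` and the use of the median are assumptions for illustration and do not describe the library's NAEncoder.

```python
from statistics import median


class MedianFiller:
    """Toy stand-in for a fitted encoder: learns medians, then fills missing values."""

    def fit(self, rows):
        # rows: list of dicts; None marks a missing value
        columns = {key for row in rows for key in row}
        self.medians_ = {
            col: median(r[col] for r in rows if r.get(col) is not None)
            for col in columns
        }
        return self

    def transform(self, rows):
        # Reuse the values learned in fit; never recompute them on new data
        return [
            {col: (row.get(col) if row.get(col) is not None else fill)
             for col, fill in self.medians_.items()}
            for row in rows
        ]


train = [{"age": 20}, {"age": 40}, {"age": None}]   # made-up training rows
new = [{"age": None}, {"age": 33}]                  # made-up new dataset
filler = MedianFiller().fit(train)
new_filled = filler.transform(new)
```

The missing age in the new dataset is filled with the median learned from the training rows, not with a statistic of the new data.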

select_features(method='pca', verbose=False)

Fit and apply feature selection (optional).

Parameters:

method (string (Default : 'pca')) – method used to select features
verbose (boolean (Default False)) – Get logging information

select_features_apply(df, verbose=False)

Apply feature selection.

Requires the select_features method to have been applied.

Parameters:

df (DataFrame) – dataset to apply selection on
verbose (boolean (Default False)) – Get logging information

Returns:	Reduced dataset
Return type:	DataFrame