AML class
-
class
AutoMxL.__main__.AML(*args, target=None, **kwargs) Covers the complete pipeline of a classification project, from a raw dataset to a deployable model.
AML is built as a class inherited from pandas DataFrame. Each Machine Learning step corresponds to a method that can be called with default or user-supplied parameters.
- explore: explore the dataset and identify feature types
- preprocess: clean and prepare data (optional: outlier processing)
- select_features: feature selection (optional)
- model_train_predict : splits AML into train/test sets to fit/apply models with random search. Returns the list of valid models (those without overfitting) and the best one.
Deployment methods:
- preprocess_apply : apply fitted preprocessing transformations to a new dataset
- select_features_apply : apply fitted feature selection to a new dataset
- model_apply : apply fitted models to a new dataset
Notes :
- Each method requires the previous one to have been applied (the current step is given by the “step” attribute)
- The target has to be binary and encoded as int (1/0) (see the MLGB59.Start.Encode_Target module if you need help)
- don’t call your target “target” please :>
Parameters: - _obj (DataFrame) – Source Dataset
- target (string) – target name
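The ordering rule in the notes above can be pictured with a small stand-alone sketch. It does not use AutoMxL: the class name `StepOrderSketch`, the step labels and the error type are assumptions made for illustration, and the real AML class may track its “step” attribute differently.

```python
class StepOrderSketch:
    """Toy stand-in for step bookkeeping (hypothetical, not the AML class)."""

    # Assumed step labels, in the order the methods must be called.
    ORDER = ["None", "explore", "preprocess"]

    def __init__(self):
        self.step = "None"  # no method applied yet

    def _advance(self, required, new):
        # Refuse to run a method if the required earlier step was not reached.
        if self.ORDER.index(self.step) < self.ORDER.index(required):
            raise RuntimeError(
                f"'{new}' requires step '{required}', current step is '{self.step}'"
            )
        self.step = new

    def explore(self):
        self._advance("None", "explore")

    def preprocess(self):
        self._advance("explore", "preprocess")


pipeline = StepOrderSketch()
try:
    pipeline.preprocess()  # called too early
    early_call_rejected = False
except RuntimeError:
    early_call_rejected = True

pipeline.explore()
pipeline.preprocess()  # accepted once explore has run
```

Keeping the current step on the object lets every method check its precondition cheaply before doing any work.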
-
explore(verbose=False) Data exploration and feature type identification.
Note : if you disagree with the automated identification, you can directly modify the d_features attribute
- Creates self.d_features : dict {feature type : list of variable names}
- date: date features
- identifier: identifier features
- verbatim: verbatim features
- boolean: boolean features
- categorical: categorical features
- numerical: numerical features
- NA: features that contain NA values
- low_variance: list of the features with low variance and unique values
Parameters: verbose (boolean (Default False)) – Get logging information
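As a rough picture of what a d_features-style dictionary looks like, the sketch below types a few columns with deliberately simple rules. The function name `identify_feature_types` and every heuristic in it are assumptions; the rules AutoMxL actually applies are not described in this section, and date, identifier and verbatim detection are left out.

```python
def identify_feature_types(columns):
    """columns: dict {feature name: list of values}; None marks a missing value."""
    d_features = {
        "boolean": [], "categorical": [], "numerical": [],
        "NA": [], "low_variance": [],
    }
    for name, values in columns.items():
        present = [v for v in values if v is not None]
        distinct = set(present)
        if len(present) < len(values):
            d_features["NA"].append(name)            # at least one missing value
        if len(distinct) <= 1:
            d_features["low_variance"].append(name)  # constant column
        elif len(distinct) == 2:
            d_features["boolean"].append(name)       # two distinct values
        elif all(isinstance(v, (int, float)) for v in present):
            d_features["numerical"].append(name)
        else:
            d_features["categorical"].append(name)
    return d_features


data = {
    "age": [23, 41, None, 35],
    "city": ["Paris", "Lyon", "Nice", "Paris"],
    "is_client": [1, 0, 1, 1],
    "country": ["FR", "FR", "FR", "FR"],
}
d_features = identify_feature_types(data)
```

Because the result is a plain dict of lists, correcting a misclassified feature amounts to moving its name from one list to another.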
-
model_predict(df, metric='F1', delta_auc=0.03, verbose=False) Apply fitted models to a dataset.
- identifies valid models (auc(train) - auc(test) < delta_auc, 0.03 by default)
- gets the best model with respect to the selected metric among the valid models
Parameters: - metric (string (Default : 'F1')) – objective metric
- verbose (boolean (Default False)) – Get logging information
Returns: - dict – {model_index : {‘HP’, ‘probas’, ‘model’, ‘features_importance’, ‘train_metrics’, ‘metrics’, ‘output’}}
- list – valid models indexes
- int – best model index
- DataFrame – models summary
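The validity filter and best-model choice described above can be sketched in a few lines. This is a minimal illustration, not the library's code: the function name `select_models`, the layout of the `results` dict and the sample scores are assumptions.

```python
def select_models(results, metric="F1", delta_auc=0.03):
    """results: dict {model_index: {"train_auc": float, "test_auc": float, metric: float}}."""
    # A model is treated as valid when its train/test AUC gap stays under delta_auc.
    valid = [
        idx for idx, r in results.items()
        if r["train_auc"] - r["test_auc"] < delta_auc
    ]
    # Among valid models, keep the one with the highest objective metric.
    best = max(valid, key=lambda idx: results[idx][metric]) if valid else None
    return valid, best


# Made-up scores for three fitted models.
results = {
    0: {"train_auc": 0.95, "test_auc": 0.85, "F1": 0.80},  # overfits: gap of 0.10
    1: {"train_auc": 0.88, "test_auc": 0.87, "F1": 0.74},
    2: {"train_auc": 0.90, "test_auc": 0.88, "F1": 0.77},
}
valid, best = select_models(results)
```

Model 0 has the highest F1 but is discarded first, so the overfitting check takes priority over the objective metric.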
-
model_train(clf='XGBOOST', grid_param=None, top_bagging=False, n_comb=10, comb_seed=None, verbose=False) Train models with random search.
- creates models with random hyperparameter combinations drawn from the HP grid
- fits the models on self
Notes :
- Available classifiers : Random Forest, XGBOOST
- Bagging can be enabled with the top_bagging parameter
Parameters: - clf (string (Default : 'XGBOOST')) – classifier used for modeling
- grid_param (dict) – random search grid {Hyperparameter name : values list}
- top_bagging (boolean (Default : False)) – enable Bagging
- n_comb (int (Default : 10)) – number of HP combinations
- comb_seed (int (Default : None)) – random combination seed
- verbose (boolean (Default False)) – Get logging information
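To make the roles of grid_param, n_comb and comb_seed concrete, here is a minimal random-search sampler. It is a sketch under assumptions: the function name `sample_hp_combinations` is hypothetical, the grid values are illustrative, and the library may draw its combinations differently (for example without repetition).

```python
import random


def sample_hp_combinations(grid_param, n_comb=10, comb_seed=None):
    """Draw n_comb random combinations from {hyperparameter name: list of values}."""
    rng = random.Random(comb_seed)  # a fixed seed makes the draw reproducible
    return [
        {hp: rng.choice(values) for hp, values in grid_param.items()}
        for _ in range(n_comb)
    ]


# Illustrative grid only; these values are not AutoMxL defaults.
grid_param = {
    "n_estimators": [50, 100, 200],
    "max_depth": [3, 5, 7],
    "learning_rate": [0.01, 0.1, 0.3],
}
combos = sample_hp_combinations(grid_param, n_comb=5, comb_seed=42)
```

Each combination would then configure one model, so n_comb directly sets how many models are fitted.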
-
model_train_test(clf='XGBOOST', grid_param=None, metric='F1', delta_auc=0.03, top_bagging=False, n_comb=10, comb_seed=None, verbose=False) Train and test models with random search.
- creates models with random hyperparameter combinations drawn from the HP grid
- splits the data into train/test sets (random 80/20) to fit/apply models
- identifies valid models (auc(train) - auc(test) < delta_auc, 0.03 by default)
- gets the best model with respect to the selected metric among the valid models
Notes :
- Available classifiers : Random Forest, XGBOOST
- Bagging can be enabled with the top_bagging parameter
Parameters: - clf (string (Default : 'XGBOOST')) – classifier used for modeling
- grid_param (dict) – random search grid {Hyperparameter name : values list}
- metric (string (Default : 'F1')) – objective metric
- top_bagging (boolean (Default : False)) – enable Bagging
- n_comb (int (Default : 10)) – number of HP combinations
- comb_seed (int (Default : None)) – random combination seed
- verbose (boolean (Default False)) – Get logging information
Returns: - dict – {model_index : {‘HP’, ‘probas’, ‘model’, ‘features_importance’, ‘train_metrics’, ‘metrics’, ‘output’}}
- list – valid models indexes
- int – best model index
- DataFrame – models summary
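The random 80/20 split mentioned in the steps above can be illustrated on row indices alone. This is a generic sketch, not the library's splitter: the function name `train_test_split_indices` and its `seed` argument are assumptions.

```python
import random


def train_test_split_indices(n_rows, test_size=0.2, seed=None):
    """Return (train_indices, test_indices) for a random split of n_rows rows."""
    rng = random.Random(seed)
    indices = list(range(n_rows))
    rng.shuffle(indices)                      # random row order
    n_test = int(round(n_rows * test_size))   # 20% of the rows by default
    return sorted(indices[n_test:]), sorted(indices[:n_test])


train_idx, test_idx = train_test_split_indices(10, seed=0)
```

Models are fitted on the train rows and scored on both parts, which is what makes the auc(train) versus auc(test) comparison possible.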
-
preprocess(date_ref=None, process_outliers=False, cat_method='deep_encoder', verbose=False) Prepare the data before feeding it to the model :
- remove low variance features
- remove identifier and verbatim features
- transform date features to timedelta
- fill missing values
- process categorical and boolean data (one-hot encoding or PyTorch NN encoder)
- replace outliers (optional)
- create self.d_preprocess : dict {step : transformation}
- remove: list of the features to remove
- date: fitted DateEncoder object
- NA: fitted NAEncoder object
- categorical: fitted CategoricalEncoder object
- outlier: fitted OutlierEncoder object
Parameters: - date_ref (string '%d/%m/%y' (Default : None)) – reference date used to compute the timedelta of date features. If None, today's date is used
- process_outliers (boolean (Default : False)) – Enable outliers replacement
- cat_method (string (Default : 'deep_encoder')) – Categorical features encoding method
- verbose (boolean (Default False)) – Get logging information
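The date step and the date_ref parameter can be illustrated with the standard library. This sketch assumes the timedelta is expressed in days and computed as reference date minus feature date; neither the unit nor the sign is stated in this section, and the function name `date_to_timedelta_days` is hypothetical.

```python
from datetime import datetime


def date_to_timedelta_days(dates, date_ref=None, fmt="%d/%m/%y"):
    """Convert date strings to a number of days relative to a reference date."""
    # Fall back to today's date when no reference date is given.
    ref = datetime.strptime(date_ref, fmt) if date_ref is not None else datetime.today()
    return [(ref - datetime.strptime(d, fmt)).days for d in dates]


days = date_to_timedelta_days(["01/01/20", "15/01/20"], date_ref="31/01/20")
# days == [30, 16]
```

Passing an explicit date_ref keeps the encoded values stable between runs, whereas the default depends on the day the code is executed.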
-
preprocess_apply(df, verbose=False) Apply the fitted preprocessing to a new dataset.
Requires the preprocess method to have been applied (so that all encoders are fitted).
Parameters: - df (DataFrame) – dataset to apply preprocessing on
- verbose (boolean (Default False)) – Get logging information
Returns: Preprocessed dataset
Return type: DataFrame
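The reason preprocess must run before preprocess_apply is the usual fit-then-apply pattern: statistics are learned once and then reused on new data. The sketch below shows the idea with a hypothetical `MedianFiller`; filling with the median is an assumption, since the strategy of the fitted NAEncoder is not described here.

```python
from statistics import median


class MedianFiller:
    """Toy encoder (hypothetical): learns a fill value, then reuses it."""

    def fit(self, values):
        # Learn the fill value from the non-missing training values only.
        self.fill_value_ = median(v for v in values if v is not None)
        return self

    def transform(self, values):
        # Reuse the stored value; nothing is recomputed from the new data.
        return [self.fill_value_ if v is None else v for v in values]


filler = MedianFiller().fit([1.0, 3.0, None, 5.0])  # "preprocess" time
new_data = filler.transform([None, 10.0])           # "preprocess_apply" time
# new_data == [3.0, 10.0]
```

Reusing the stored value means the new dataset is transformed exactly as the training data was, which is what a deployed model expects.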