Start

Load

Data_handling import functions :

  • get_delimiter : identify delimiter for a .csv/.txt file
  • import_data : import dataset file into a DataFrame
AutoMxL.Start.Load.get_delimiter(file)[source]

Identify the delimiter for a csv/txt file

Parameters:file (string) – Path and name of the file (Ex : “data/file.csv”)
Returns:identified delimiter
Return type:string
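
The delimiter detection described above can be approximated with the standard library. This is a minimal sketch, not the AutoMxL implementation: the helper name sniff_delimiter and the candidate delimiter set are assumptions, and it works on a text sample (for instance the first few kilobytes of the file) rather than on a path.

```python
import csv

def sniff_delimiter(sample, candidates=",;\t|"):
    """Guess the delimiter of csv/txt content from a text sample.

    Hypothetical helper: csv.Sniffer raises csv.Error when none of the
    candidate delimiters fits the sample consistently.
    """
    return csv.Sniffer().sniff(sample, delimiters=candidates).delimiter

# A small in-memory sample standing in for the head of "data/file.csv"
sample = "id;name;score\n1;ann;3.5\n2;bob;4.0\n"
delimiter = sniff_delimiter(sample)
```

The detected delimiter can then be passed to whatever reader loads the file.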
AutoMxL.Start.Load.import_data(file, index_col=None, verbose=False)[source]

Import dataset as a DataFrame (identify delimiter for txt and csv files)

Available files : .txt, .csv, .xlsx, .xls files

Parameters:
  • file (string) – Path and name of the file (Ex : “data/file.csv”). If the file is .csv or .txt, the delimiter is identified automatically
  • index_col (int, str, sequence of int / str, or False (Default None)) – Column(s) to use as the row labels of the DataFrame, either given as string name or column index. If a sequence of int / str is given, a MultiIndex is used.
  • verbose (boolean (Default False)) – Get logging information
Returns:

imported dataset

Return type:

DataFrame

Encode_Target

Target encoding functions :

  • category_to_target : create a target variable (1/0) from a selected category
  • range_to_target : create a target variable (1/0) from a selected range
AutoMxL.Start.Encode_Target.category_to_target(df, var, cat)[source]

Create a target variable (1/0) from a selected category

Parameters:
  • df (DataFrame) – input dataset
  • var (string) – variable containing the target category
  • cat (string) – target category
Returns:

  • DataFrame (modified dataset)
  • string (new target name (var+’_’+cat))
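
As an illustration of the encoding rule, the sketch below uses a list of dicts as a plain-Python stand-in for the DataFrame. It is not the library code; keeping the original column alongside the new target is an assumption.

```python
def category_to_target(rows, var, cat):
    """Add a 1/0 column named var + '_' + cat flagging rows where var == cat."""
    target = f"{var}_{cat}"
    # The original column is kept here (assumption; the library may differ).
    encoded = [{**row, target: int(row[var] == cat)} for row in rows]
    return encoded, target

rows = [{"color": "red"}, {"color": "blue"}, {"color": "red"}]
encoded, target = category_to_target(rows, "color", "red")
```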

AutoMxL.Start.Encode_Target.range_to_target(df, var, min=None, max=None, verbose=False)[source]

Create a target variable (1/0) from a selected range

Parameters:
  • df (DataFrame) – input dataset
  • var (string) – variable containing the target range
  • min (float) – lower limit. If None, no min
  • max (float) – upper limit. If None, no max
  • verbose (boolean (Default False)) – Get logging information
Returns:

  • DataFrame (modified dataset)
  • string (new target name (var+’_’+lower+’_’+upper))
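
A comparable sketch for the range rule, again on a list of dicts rather than a DataFrame. Whether the limits are inclusive is not stated in this reference, so the inclusive comparison below is an assumption, as are the parameter names lower and upper.

```python
def range_to_target(rows, var, lower=None, upper=None):
    """Add a 1/0 column flagging rows whose value lies within [lower, upper]."""
    target = f"{var}_{lower}_{upper}"

    def in_range(value):
        # A limit set to None is ignored, matching "If None, no min/max".
        above = lower is None or value >= lower
        below = upper is None or value <= upper
        return above and below

    encoded = [{**row, target: int(in_range(row[var]))} for row in rows]
    return encoded, target

rows = [{"age": 17}, {"age": 18}, {"age": 25}, {"age": 31}]
encoded, target = range_to_target(rows, "age", lower=18, upper=30)
```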

Explore

Explore

Global dataset information functions :

  • explore (func): identify variable types and give global information about the dataset (NA, low variance features)
  • low_variance_features (func): identify features with low variance
  • get_features_type (func): get all features per type
AutoMxL.Explore.Explore.explore(df, verbose=False)[source]

Identify variable types and give global information about the dataset

  • Variables type :
    • date
    • identifier
    • verbatim
    • boolean
    • categorical
    • numerical
  • variables containing NA values
  • low variance and unique values variables

See get_features_type function doc for type identification heuristics

Parameters:
  • df (DataFrame) – input dataset
  • verbose (boolean (Default False)) – Get logging information
Returns:

{x : variables names list }

  • date : date features
  • identifier : identifier features
  • verbatim : verbatim features
  • boolean : boolean features
  • categorical : categorical features
  • numerical : numerical features
  • NA : features that contain NA values
  • low_variance : list of the features with low variance

Return type:

dict

AutoMxL.Explore.Explore.get_features_type(df, l_var=None, th=0.95)[source]

Get all features per type :

  • date : try to apply to_datetime
  • identifier :
    • #(unique values)/#(total values) > threshold (default 0.95)
    • AND length is the same for all values (for non NA)
  • verbatim :
    • #(unique values)/#(total values) >= threshold (default 0.95)
    • AND length is NOT the same for all values (for non NA)
  • boolean : #(distinct values) = 2
  • categorical :
    • not a date
    • #(unique values)/#(total values) < threshold (default 0.95)
    • AND #(unique values)>2
    • AND for num values #(unique values)<30
  • numerical : others
Parameters:
  • df (DataFrame) – input dataset
  • l_var (list (Default : None)) – variable names
  • th (float (Default : 0.95)) – threshold used to identify identifiers/verbatims variables
Returns:

{ type : variables name list}

Return type:

dict
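
The heuristics listed above can be sketched on a single column given as a Python list. This is an illustration only: the order in which the tests are applied, the use of non-missing values as the denominator, and the fixed date format (standing in for to_datetime) are all assumptions, and feature_type is a hypothetical name.

```python
from datetime import datetime

def is_date(values, fmt="%Y-%m-%d"):
    """True when every value is a string that parses with fmt (assumed format)."""
    for value in values:
        if not isinstance(value, str):
            return False
        try:
            datetime.strptime(value, fmt)
        except ValueError:
            return False
    return True

def feature_type(values, th=0.95):
    """Classify one column following the documented heuristics."""
    present = [v for v in values if v is not None]
    if not present:
        return None
    n_unique = len(set(present))
    ratio = n_unique / len(present)
    same_length = len({len(str(v)) for v in present}) == 1
    is_num = all(isinstance(v, (int, float)) for v in present)
    if is_date(present):
        return "date"
    if n_unique == 2:
        return "boolean"
    if ratio > th and same_length:
        return "identifier"
    if ratio >= th and not same_length:
        return "verbatim"
    if ratio < th and n_unique > 2 and (not is_num or n_unique < 30):
        return "categorical"
    return "numerical"

columns = {
    "client_id": ["A001", "A002", "A003", "A004"],
    "comment": ["ok", "very good", "bad service", "fine"],
    "signup": ["2021-01-05", "2021-02-10"],
    "flag": ["y", "n", "y", "y"],
    "city": ["paris", "lyon", "paris", "nice", "lyon", "paris"],
    "amount": list(range(40)) + [0, 1, 2],
}
types = {name: feature_type(values) for name, values in columns.items()}
```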

AutoMxL.Explore.Explore.low_variance_features(df, var_list=None, threshold=0, rescale=True, verbose=False)[source]

Identify numerical features with low variance (< threshold). Features can be rescaled before the variance is computed.

Parameters:
  • df (DataFrame) – input DataFrame
  • var_list (list (default : None)) – names of the variables whose variance is checked. If None, all the numerical features
  • threshold (float (default : 0)) – variance threshold
  • rescale (bool (default : True)) – enable MinMaxScaler before computing variance
  • verbose (boolean (Default False)) – Get logging information
Returns:Names of the variables with low variance
Return type:list
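
A minimal sketch of the rescale-then-threshold idea, on a dict of lists instead of a DataFrame. The min-max rescaling is written by hand in place of MinMaxScaler, and the use of population variance is an assumption; the library may compute variance differently.

```python
from statistics import pvariance

def low_variance_features(columns, threshold, rescale=True):
    """Return names of columns whose variance is strictly below threshold."""
    flagged = []
    for name, values in columns.items():
        if rescale:
            # Min-max rescaling to [0, 1]; a constant column maps to all zeros.
            low, high = min(values), max(values)
            span = high - low
            values = [(v - low) / span if span else 0.0 for v in values]
        if pvariance(values) < threshold:
            flagged.append(name)
    return flagged

columns = {
    "almost_constant": [1.0] * 99 + [2.0],
    "spread": [0.0, 2.0, 4.0, 6.0, 8.0, 10.0, 1.0, 3.0, 5.0, 7.0],
}
flagged = low_variance_features(columns, threshold=0.05)
```

Note that with a strict comparison and a threshold of 0 no column can be flagged, since a variance is never negative.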

Features_Type

Variables type identification function

  • features_from_type (func): get all features for a selected type
  • is_date (func): test if a variable is a date
  • is_identifier (func): test if a variable is an identifier
  • is_verbatim (func): test if a variable is a verbatim
  • is_boolean (func): test if a variable is a boolean
  • is_categorical (func): test if a variable is a categorical one (with more than 2 categories)
AutoMxL.Explore.Features_Type.features_from_type(df, typ, l_var=None, th=0.95)[source]

Get features of a selected type :

  • date : try to apply to_datetime
  • identifier :
    • #(unique values)/#(total values) > threshold (default 0.95)
    • AND length is the same for all values (for non NA)
  • verbatim :
    • #(unique values)/#(total values) >= threshold (default 0.95)
    • AND length is NOT the same for all values (for non NA)
  • boolean : #(distinct values) = 2
  • categorical :
    • not a date
    • #(unique values)/#(total values) < threshold (default 0.95)
    • AND #(unique values)>2
    • AND for num values #(unique values)<30
Parameters:
  • df (DataFrame) – input dataset
  • typ (string) –

    selected type to get features:

    • ’date’
    • ’identifier’
    • ’verbatim’
    • ’boolean’
    • ’categorical’
  • l_var (list (Default : None)) – variables names. If None, all dataset columns
  • th (float (Default : 0.95)) – threshold used to identify identifiers/verbatims variables
Returns:

identified variables names

Return type:

list

AutoMxL.Explore.Features_Type.is_boolean(df, col)[source]

Test if a variable is a boolean.

  • #(distinct values) = 2
Parameters:
  • df (DataFrame) – input dataset
  • col (string) – variable name
Returns:

res – test result

Return type:

boolean

AutoMxL.Explore.Features_Type.is_categorical(df, col, th=0.95)[source]

Test if a variable is a categorical one (with more than 2 categories).

  • not a date
  • #(unique values)/#(total values) < threshold (default 0.95)
  • AND #(unique values)>2
  • AND for num values #(unique values)<30
Parameters:
  • df (DataFrame) – input dataset
  • col (string) – variable name
  • th (float (Default : 0.95)) – threshold
Returns:

res – test result

Return type:

boolean

AutoMxL.Explore.Features_Type.is_date(df, col)[source]

Test if a variable is a date.

Method : try to apply to_datetime

Parameters:
  • df (DataFrame) – input dataset
  • col (string) – variable name
Returns:

res – test result

Return type:

boolean

AutoMxL.Explore.Features_Type.is_identifier(df, col, th=0.95)[source]

Test if a variable is an identifier.

  • #(unique values)/#(total values) > threshold (default 0.95)
  • AND length is the same for all values (for non NA)
  • AND not date
Parameters:
  • df (DataFrame) – input dataset
  • col (string) – variable name
  • th (float (Default : 0.95)) – threshold rate
Returns:

res – test result

Return type:

boolean

AutoMxL.Explore.Features_Type.is_verbatim(df, col, th=0.95)[source]

Test if a variable is a verbatim.

  • #(unique values)/#(total values) >= threshold (default 0.95)
  • AND length is NOT the same for all values (for non NA)
Parameters:
  • df (DataFrame) – input dataset
  • col (string) – variable name
  • th (float (Default : 0.95)) – threshold rate
Returns:

res – test result

Return type:

boolean

Preprocessing

Missing_Values

Missing values handling functions :

  • NAEncoder (class): encoder that replaces missing values
  • fill_numerical (func): replace missing values for numerical features
  • fill_categorical (func): replace missing values for categorical features
  • get_NA_features (func): get features containing NA values
class AutoMxL.Preprocessing.Missing_Values.NAEncoder(replace_num_with='median', replace_cat_with='NR', track_num_NA=True)[source]

Missing values filling

Available methods to replace missing values

  • num : median/mean/zero
  • cat : ‘NR’
Parameters:
  • replace_num_with (string) – method used to replace numerical missing values
  • replace_cat_with (string) – method used to replace categorical missing values
fit(df, l_var, verbose=False)[source]

fit encoder

Parameters:
  • df (DataFrame) – input dataset
  • l_var (list) – features to encode. If None, all features
  • verbose (boolean (Default False)) – Get logging information
fit_transform(df, l_var=None, verbose=False)[source]

fit and transform dataset with encoder

Parameters:
  • df (DataFrame) – input dataset
  • l_var (list) – features to encode. If None, all features
  • verbose (boolean (Default False)) – Get logging information
transform(df, verbose=False)[source]

Transform dataset features using the encoder. Can be done only if the encoder has been fitted

Parameters:
  • df (DataFrame) – dataset to transform
  • verbose (boolean (Default False)) – Get logging information
AutoMxL.Preprocessing.Missing_Values.fill_categorical(df, l_var=None, method='NR', verbose=False)[source]

Fill missing values for selected/all categorical features.

Parameters:
  • df (DataFrame) – Input dataset
  • l_var (list (Default : None)) – list of the features to fill. If None, contains all the categorical features
  • method (string (Default : 'NR')) –

    Method used to fill the NA values :

    • NR : replace NA with ‘NR’
  • verbose (boolean (Default False)) – Get logging information
Returns:

Modified dataset

Return type:

DataFrame

AutoMxL.Preprocessing.Missing_Values.fill_numerical(df, l_var=None, method='median', track_num_NA=True, verbose=False)[source]

Fill missing values for selected/all numerical features. The track_num_NA parameter creates a variable that keeps track of missing values.

Available methods : replace with zero, median or mean (Default = median)

Parameters:
  • df (DataFrame) – Input dataset
  • l_var (list (Default : None)) – names of the features to fill. If None, all the numerical features
  • method (string (Default : 'median')) –

    Method used to fill the NA values :

    • zero : replace with zero
    • median : replace with median
    • mean : replace with mean
  • track_num_NA (boolean (Default : True)) – If True, create a boolean column to keep track of missing values
  • verbose (boolean (Default False)) – Get logging information
Returns:

Modified dataset

Return type:

DataFrame
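
The three filling methods and the tracking column can be sketched on one column held as a list, with None marking a missing value. This is an illustration under those assumptions, not the library code; the track_na name and the returned pair are hypothetical.

```python
from statistics import mean, median

# Each method maps the non-missing values to the replacement value.
FILLERS = {"zero": lambda present: 0.0, "median": median, "mean": mean}

def fill_numerical(values, method="median", track_na=True):
    """Replace None entries and optionally return a boolean missing-value flag."""
    present = [v for v in values if v is not None]
    fill_value = FILLERS[method](present)
    filled = [fill_value if v is None else v for v in values]
    flags = [v is None for v in values] if track_na else None
    return filled, flags

filled, flags = fill_numerical([1.0, None, 3.0, 10.0])
```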

AutoMxL.Preprocessing.Missing_Values.get_NA_features(df)[source]

identify features containing NA values

Parameters:df (DataFrame) – input dataset
Returns:features containing missing values
Return type:list

Categorical Data

Categorical features processing

  • CategoricalEncoder (class) : Encode categorical features
  • dummy_all_var (func) : get one hot encoded vector for each category of a categorical features list
  • get_embedded_cat (func) : get embedding representation with NN
  • mca (func) : to do
class AutoMxL.Preprocessing.Categorical.CategoricalEncoder(method='deep_encoder')[source]

Encode categorical features

Available encoding methods :

  • one hot encoding
  • deep_encoder : Build and train a Neural Network for the creation of embeddings for categorical variables.

(https://www.fast.ai/2018/04/29/categorical-embeddings/)

Default NN model parameters are stored in param_config.py file

Parameters:method (string (Default : deep_encoder)) – method used to get categorical encoding. Available methods : “one_hot”, “deep_encoder”
fit(df, l_var=None, target=None, verbose=False)[source]

Fit encoder on dataset following method

Parameters:
  • df (DataFrame) – input dataset
  • l_var (list (Default None)) – names of the variables to encode. If None, all the categorical and boolean features
  • target (string (Default None)) – name of the target for deep_encoder method
  • verbose (boolean (Default False)) – Get logging information
fit_transform(df, l_var=None, target=None, verbose=False)[source]

fit and transform dataset categorical features

Parameters:
  • df (DataFrame) – input dataset
  • l_var (list (Default None)) – names of the variables to encode. If None, all the categorical and boolean features
  • target (string (Default None)) – name of the target for deep_encoder method
  • verbose (boolean (Default False)) – Get logging information
Returns:

modified dataset

Return type:

DataFrame

transform(df, verbose=False)[source]

transform dataset categorical features using the encoder. Can be done only if encoder has been fitted

Parameters:
  • df (DataFrame) – dataset to transform
  • verbose (boolean (Default False)) – Get logging information
Returns:

modified dataset

Return type:

DataFrame

AutoMxL.Preprocessing.Categorical.dummy_all_var(df, var_list=None, prefix_list=None, keep=False, verbose=False)[source]

Get one hot encoded vector for selected/all categorical features

Parameters:
  • df (DataFrame) – Input dataset
  • var_list (list (Default : None)) – Names of the features to dummify. If None, all the categorical features
  • prefix_list (list (default : None)) – Prefix to add before new features name (prefix+’_’+cat). If None, prefix=variable name
  • keep (boolean (Default = False)) – If True, keep the original feature
  • verbose (boolean (Default False)) – Get logging information
Returns:

Modified dataset

Return type:

DataFrame
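
One hot encoding with the prefix+'_'+cat naming can be sketched as follows, on a list of dicts and for a single variable. The function name one_hot is hypothetical, and the reading of keep (True keeps the original column) is an assumption.

```python
def one_hot(rows, var, prefix=None, keep=False):
    """Add one 1/0 column per category of var, named prefix + '_' + category."""
    prefix = var if prefix is None else prefix
    categories = sorted({row[var] for row in rows})
    encoded = []
    for row in rows:
        # Drop the original column unless keep is True (assumed semantics).
        new_row = dict(row) if keep else {k: v for k, v in row.items() if k != var}
        for cat in categories:
            new_row[f"{prefix}_{cat}"] = int(row[var] == cat)
        encoded.append(new_row)
    return encoded

rows = [{"city": "paris", "x": 1}, {"city": "lyon", "x": 2}]
encoded = one_hot(rows, "city")
```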

AutoMxL.Preprocessing.Categorical.get_embedded_cat(df, var_list, target, batchsize, n_epochs, lr, verbose=False)[source]

Get embedded representation for categorical features using NN encoder

Parameters:
  • df (DataFrame) – input Dataset
  • var_list (list of strings) – features names
  • target (string) – target name
  • batchsize (int) – batch size for encoder training
  • n_epochs (int) – number of epoch for encoder training
  • lr (float) – encoder learning rate
  • verbose (boolean (Default False)) – Get logging information
Returns:

modified dataset

Return type:

DataFrame

Date Data

Date Features processing functions:

  • DateEncoder (class) : encode date features
  • all_to_date (func): detect dates from num/cat features and transform them to datetime format.
  • date_to_anc (func): transform datetime features to timedelta according to a ref date
class AutoMxL.Preprocessing.Date.DateEncoder(method='timedelta', date_ref=None)[source]

Encode date features

Available methods :

  • timedelta : compute time between date feature and parameter date_ref
Parameters:
  • method (string (Default : timedelta)) – method used to encode dates. Available methods : “timedelta”
  • date_ref (string '%d/%m/%y' (Default : None)) – Date to compute timedelta. If None, today's date
fit(df, l_var=None, verbose=False)[source]

fit encoder

Parameters:
  • df (DataFrame) – input dataset
  • l_var (list) – features to encode. If None, contains all features identified as dates (see Features_Type module)
  • verbose (boolean (Default False)) – Get logging information
fit_transform(df, l_var=None, verbose=False)[source]

fit and transform dataset with encoder

Parameters:
  • df (DataFrame) – input dataset
  • l_var (list) – features to encode. If None, all features identified as dates (see Features_Type module)
  • verbose (boolean (Default False)) – Get logging information
transform(df, verbose=False)[source]

transform dataset date features using the encoder. Can be done only if encoder has been fitted

Parameters:
  • df (DataFrame) – dataset to transform
  • verbose (boolean (Default False)) – Get logging information
AutoMxL.Preprocessing.Date.all_to_date(df, l_var=None, verbose=False)[source]

Detect dates from selected/all features and transform them to datetime format.

Parameters:
  • df (DataFrame) – Input dataset
  • l_var (list (Default : None)) – Names of the features. If None, all the features
  • verbose (boolean (Default False)) – Get logging information
Returns:

Modified dataset

Return type:

DataFrame

AutoMxL.Preprocessing.Date.date_to_anc(df, l_var=None, date_ref=None, verbose=False)[source]

Transform selected/all datetime features to timedelta according to a ref date

Parameters:
  • df (DataFrame) – Input dataset
  • l_var (list (Default : None)) – List of the features to analyze. If None, contains all the datetime features
  • date_ref (string '%d/%m/%y' (Default : None)) – Date to compute timedelta. If None, today's date
  • verbose (boolean (Default False)) – Get logging information
Returns:

  • DataFrame – Modified dataset
  • list – New timedelta features names
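
The timedelta encoding can be illustrated with the standard library. In this sketch the result is expressed in whole days and computed as reference date minus feature date; both choices, and the helper name days_since, are assumptions rather than documented behaviour.

```python
from datetime import datetime

def days_since(dates, date_ref=None, fmt="%d/%m/%y"):
    """Number of days between each datetime and a reference date.

    date_ref follows the documented '%d/%m/%y' format; None means now.
    """
    ref = datetime.now() if date_ref is None else datetime.strptime(date_ref, fmt)
    return [(ref - d).days for d in dates]

dates = [datetime(2020, 1, 1), datetime(2019, 12, 25)]
ages = days_since(dates, date_ref="31/01/20")
```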

Process Outliers

Outliers handling functions

  • OutliersEncoder (class) : identify and replace outliers
  • get_cat_outliers (func): identify categorical features containing outliers
  • get_num_outliers (func): identify numerical features containing outliers
  • replace_category (func): replace categories of a categorical variable
  • replace_extreme_values (func): replace extreme values
class AutoMxL.Preprocessing.Outliers.OutliersEncoder(cat_threshold=0.02, num_xstd=4)[source]

Identify and replace outliers for categorical and numerical features

  • num : x outlier <=> abs(x - mean) > xstd * var
  • cat : x outlier category <=> with frequency <x% (Default 2%)
Parameters:
  • cat_threshold (float (default 0.02)) – Minimum modality frequency
  • num_xstd (int (Default : 4)) – Variance gap coef
fit(df, l_var, verbose=False)[source]

Fit encoder

Parameters:
  • df (DataFrame) – input dataset
  • l_var (list) – features to encode. If None, all features
  • verbose (boolean (Default False)) – Get logging information
fit_transform(df, l_var=None, verbose=False)[source]

Fit and transform dataset with encoder

Parameters:
  • df (DataFrame) – input dataset
  • l_var (list) – features to encode. If None, all features
  • verbose (boolean (Default False)) – Get logging information
transform(df, verbose=False)[source]

Transform dataset features using the encoder. Can be done only if encoder has been fitted

Parameters:
  • df (DataFrame) – dataset to transform
  • verbose (boolean (Default False)) – Get logging information
AutoMxL.Preprocessing.Outliers.get_cat_outliers(df, l_var=None, threshold=0.05, verbose=False)[source]

Outliers detection for selected/all categorical features.

Method : Modalities with frequency <x% (Default 5%)

Parameters:
  • df (DataFrame) – Input dataset
  • l_var (list (Default : None)) – Names of the features. If None, all the categorical features
  • threshold (float (Default : 0.05)) – Minimum modality frequency
  • verbose (boolean (Default False)) – Get logging information
Returns:

{variable : list of categories considered as outliers}

Return type:

dict
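
The frequency rule can be sketched with collections.Counter on a dict of lists. This is an illustration, not the library code; leaving out variables that have no rare category is an assumption about the shape of the returned dict.

```python
from collections import Counter

def rare_categories(columns, threshold=0.05):
    """Map each column to its categories whose frequency is below threshold."""
    outliers = {}
    for name, values in columns.items():
        counts = Counter(values)
        rare = [cat for cat, n in counts.items() if n / len(values) < threshold]
        if rare:  # columns without rare categories are left out (assumption)
            outliers[name] = rare
    return outliers

columns = {
    "plan": ["basic"] * 60 + ["pro"] * 39 + ["legacy"],
    "region": ["north"] * 50 + ["south"] * 50,
}
outliers = rare_categories(columns)
```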

AutoMxL.Preprocessing.Outliers.get_num_outliers(df, l_var=None, xstd=3, verbose=False)[source]

Outliers detection for selected/all numerical features.

Method : x outlier <=> abs(x - mean) > xstd * var

Parameters:
  • df (DataFrame) – Input dataset
  • l_var (list (Default : None)) – Names of the features. If None, all the num features
  • xstd (int (Default : 3)) – Variance gap coef
  • verbose (boolean (Default False)) – Get logging information
Returns:

{variable : [lower_limit, upper_limit]}

Return type:

dict
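
A sketch of the numerical rule for one column. The documented formula writes the spread term as var; this example assumes it is the standard deviation, as the parameter name xstd suggests, and uses the population standard deviation. Swap in the variance if that matches the library.

```python
from statistics import mean, pstdev

def outlier_limits(values, xstd=3):
    """Return [lower_limit, upper_limit] as mean -/+ xstd * spread."""
    center = mean(values)
    spread = pstdev(values)  # assumed spread measure, see note above
    return [center - xstd * spread, center + xstd * spread]

values = [10.0] * 50 + [12.0] * 50 + [100.0]
lower, upper = outlier_limits(values)
outliers = [v for v in values if v < lower or v > upper]
```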

AutoMxL.Preprocessing.Outliers.replace_category(df, var, categories, replace_with='outliers', verbose=False)[source]

Replace categories of a categorical variable

Parameters:
  • df (DataFrame) – Input dataset
  • var (string) – variable to modify
  • categories (list(string)) – categories to replace
  • replace_with (string (Default : 'outliers')) – word to replace categories with
  • verbose (boolean (Default False)) – Get logging information
Returns:

Modified dataset

Return type:

DataFrame

AutoMxL.Preprocessing.Outliers.replace_extreme_values(df, var, lower_th=None, upper_th=None, verbose=False)[source]

Replace extreme values : > upper threshold or < lower threshold

Parameters:
  • df (DataFrame) – Input dataset
  • var (string) – variable to modify
  • lower_th (int/float (Default=None)) – lower threshold
  • upper_th (int/float (Default=None)) – upper threshold
  • verbose (boolean (Default False)) – Get logging information
Returns:

Modified dataset

Return type:

DataFrame

Features Selection

Features selection

  • select_features (func) : features selection following method
class AutoMxL.Select_Features.Select_Features.FeatSelector(method='pca')[source]

features selection following method

  • pca : use pca to reduce dataset dimensions
  • no_rescale_pca : use pca without rescaling data
Parameters:method (string (Default pca)) – method used to select features
fit(df, l_var=None, verbose=False)[source]

fit selector

Parameters:
  • df (DataFrame) – input dataset
  • l_var (list) – features to encode. If None, all features identified as numerical
  • verbose (boolean (Default False)) – Get logging information
fit_transform(df, l_var, verbose=False)[source]

fit and apply features selection

Parameters:
  • df (DataFrame) – input dataset
  • l_var (list) – features to encode. If None, all features identified as numerical
  • verbose (boolean (Default False)) – Get logging information
Returns:

modified dataset

Return type:

DataFrame

transform(df, verbose=False)[source]

apply features selection on a dataset

Parameters:
  • df (DataFrame) – dataset to transform
  • verbose (boolean (Default False)) – Get logging information
Returns:

modified dataset

Return type:

DataFrame

AutoMxL.Select_Features.Select_Features.select_features(df, target, method='pca', verbose=False)[source]

features selection following method

  • pca : use pca to reduce dataset dimensions
  • no_rescale_pca : use pca without rescaling data
Parameters:
  • df (DataFrame) – input dataset containing features
  • target (string) – target name
  • method (string (Default pca)) – method used to select features
  • verbose (boolean (Default False)) – Get logging information
Returns:

modified dataset

Return type:

DataFrame

Modelisation

Bagging

Bagging algorithm module :

  • Bagging (class) : generate new, more balanced training samples and train a model on each
  • create_sample (func) : generate a bagging sample
class AutoMxL.Modelisation.Bagging.Bagging(clf=RandomForestClassifier(n_estimators=100, max_leaf_nodes=100), n_sample=5, pos_sample_size=1.0, replace=True)[source]

Meta-algorithm designed to improve the stability and accuracy of ML classification/regression algorithms, or to handle an “imbalanced target distribution” issue.

Bagging generates n_sample new, more balanced training sets. A model is then fitted on each sample, and outputs are combined by averaging (for regression) or voting (for classification).

Available classifiers : Random Forest and XGBoost

Parameters:
  • clf (model (Default : RandomForestClassifier(n_estimators=100, max_leaf_nodes=100))) – Model fitted on the samples
  • n_sample (int (Default : 5)) – number of samples
  • pos_sample_size (int/float (Default : 1.0)) –

    Number/rate of target=1 observations in each sample (each sample is completed with three times as many target=0 observations)

    • if int : number of target=1
    • if float : rate of total target=1
  • replace (Boolean (Default : True)) – Enable sampling with replacement
  • list_model (list (Default : None)) – Fitted models (created with fit method)
bag_feature_importance(X)[source]

Get features importance of the model by averaging importance of models fitted on the samples

Parameters:X (DataFrame) – Input Dataset
Returns:{feature : importance}
Return type:dict
fit(df_train, target)[source]

Create bagging samples from a DataFrame and fit the model (self.clf) on each sample

Parameters:
  • df_train (DataFrame) – Training dataset
  • target (String) – Target name
Returns:

self.list_model – Fitted models

Return type:

list

get_params()[source]

Get bagging object parameters

Returns:{param : value}
Return type:dict
predict(df)[source]

Apply models fitted on sample to a dataset. Combine models by averaging the outputs (for regression) or voting (for classification)

Parameters:df (DataFrame) – Dataset to apply the model
Returns:
  • numpy.ndarray (float) – Averaged classification probabilities
  • numpy.ndarray (int) – Predictions for each observation
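
The combination step can be sketched as follows for the classification case: probabilities from the models fitted on each sample are averaged per observation, then thresholded. The 0.5 cut-off and the use of plain lists instead of numpy arrays are assumptions.

```python
def combine_predictions(probas_per_model, threshold=0.5):
    """Average per-observation probabilities across models, then threshold."""
    n_models = len(probas_per_model)
    # zip(*...) groups the probabilities of one observation across all models.
    averaged = [sum(obs) / n_models for obs in zip(*probas_per_model)]
    predictions = [int(p >= threshold) for p in averaged]
    return averaged, predictions

# One list of positive-class probabilities per fitted model (illustrative values)
probas_per_model = [
    [0.9, 0.2, 0.6],
    [0.7, 0.4, 0.3],
    [0.8, 0.1, 0.5],
]
averaged, predictions = combine_predictions(probas_per_model)
```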
AutoMxL.Modelisation.Bagging.create_sample(df, target, pos_target_nb, replace=False)[source]

Generate a DataFrame sample with selected number of target=1

Parameters:
  • df (DataFrame) – Input dataset
  • target (String) – Target name
  • pos_target_nb (int) – Number of target=1 observations in the sample
  • replace (Boolean (Default : False)) – If True, create samples with replacement
Returns:

sample dataset

Return type:

DataFrame
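
A sketch of the sampling step on a list of dicts. The ratio of three target=0 rows per target=1 row comes from the Bagging parameter description; applying it inside this function, and the neg_ratio and seed parameters, are assumptions made for the example.

```python
import random

def create_sample(rows, target, pos_target_nb, replace=False, neg_ratio=3, seed=None):
    """Draw pos_target_nb positives and neg_ratio times as many negatives."""
    rng = random.Random(seed)
    positives = [r for r in rows if r[target] == 1]
    negatives = [r for r in rows if r[target] == 0]
    # choices draws with replacement, sample draws without.
    draw = rng.choices if replace else rng.sample
    sample = draw(positives, k=pos_target_nb) + draw(negatives, k=neg_ratio * pos_target_nb)
    rng.shuffle(sample)
    return sample

rows = [{"id": i, "y": 1} for i in range(10)] + [{"id": i, "y": 0} for i in range(10, 100)]
sample = create_sample(rows, "y", pos_target_nb=5, seed=0)
```

Without replacement, the draw fails if the dataset holds fewer rows of a class than requested.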

Hyperoptimisation

HyperOpt class : model hyper-optimisation with random search

  • HyperOpt (class) : model hyper-optimisation with random search
class AutoMxL.Modelisation.HyperOpt.HyperOpt(classifier='RF', grid_param=None, n_param_comb=10, bagging=False, bagging_param={'n_sample': 5, 'pos_sample_size': 1.0, 'replace': False}, comb_seed=None)[source]

Model hyper-optimisation with random search :

  • From a hyper-parameters grid, creates random HPs combinations
  • train a model for each combination
  • apply the model
Parameters:
  • classifier (string (Default : 'RF')) – classifier for modelisation
  • grid_param (dict (Default : Default_RF_grid_param)) – HP grid
  • n_param_comb (int (Default : 10)) – number of HP combinations
  • bagging (Boolean (Default = False)) – use bagging method
  • bagging_param (dict) – bagging parameters (Default : default_bagging_param (Bagging module))
  • train_model_dict (dict, created with fit method) – {model_index : {‘HP’, ‘probas’, ‘model’, ‘features_importance’, ‘train_metrics’}}
  • bagging_object (Bagging) – bagging object
  • comb_seed (int) – seed for randomized HP combinations
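
Drawing random hyper-parameter combinations from a grid can be sketched as below. The function name and the grid content are illustrative assumptions; this version draws each value independently, so duplicate combinations are possible.

```python
import random

def random_combinations(grid_param, n_param_comb=10, comb_seed=None):
    """Draw n_param_comb random hyper-parameter combinations from a grid."""
    rng = random.Random(comb_seed)
    return [
        {name: rng.choice(values) for name, values in grid_param.items()}
        for _ in range(n_param_comb)
    ]

# Illustrative grid, not the library's default grid
grid_param = {"n_estimators": [50, 100, 200], "max_depth": [3, 5, None]}
combinations = random_combinations(grid_param, n_param_comb=3, comb_seed=0)
```

Passing the same comb_seed reproduces the same combinations, which is what makes runs comparable.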
fit(df_train, target, verbose=False)[source]

Fit a model for each HP combination

Parameters:
  • df_train (DataFrame) – Training dataset
  • target (string) – Target name
  • verbose (boolean (Default False)) – Get logging information
Returns:

self.train_model_dict (created with fit method) – {model_index : {‘HP’, ‘probas’, ‘model’, ‘features_importance’, ‘train_metrics’}}

Return type:

dict

get_best_model(d_model_info, metric='F1', delta_auc_th=0.03, verbose=False)[source]

Identify valid models according to delta auc (test/train). Get the best model with respect to a selected metric among the valid models

Parameters:
  • d_model_info (dict) – {model_index : {‘HP’, ‘probas’, ‘model’, ‘features_importance’, ‘train_metrics’, ‘metrics’, ‘output’}}
  • metric (string (Default : 'F1')) – Metric used to get the best model
  • delta_auc_th (float) – Threshold for valid models : abs(auc(train) - auc(test))
  • verbose (boolean (Default False)) – Get logging information
Returns:

  • int – Best model index
  • list – Valid model indexes
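
The selection logic can be sketched as follows. The inner layout of ‘train_metrics’ and ‘metrics’ is not described in this reference, so the 'AUC' and 'F1' keys below are hypothetical, as is treating a model as valid when the gap is strictly below the threshold.

```python
def get_best_model(d_model_info, metric="F1", delta_auc_th=0.03):
    """Keep models whose train/test AUC gap is small, then pick the best one."""
    valid = [
        idx
        for idx, info in d_model_info.items()
        # 'AUC' is a hypothetical key name for this sketch.
        if abs(info["train_metrics"]["AUC"] - info["metrics"]["AUC"]) < delta_auc_th
    ]
    if not valid:
        return None, valid
    best = max(valid, key=lambda idx: d_model_info[idx]["metrics"][metric])
    return best, valid

d_model_info = {
    0: {"train_metrics": {"AUC": 0.99}, "metrics": {"AUC": 0.80, "F1": 0.90}},
    1: {"train_metrics": {"AUC": 0.85}, "metrics": {"AUC": 0.84, "F1": 0.70}},
    2: {"train_metrics": {"AUC": 0.82}, "metrics": {"AUC": 0.81, "F1": 0.75}},
}
best, valid = get_best_model(d_model_info)
```

Model 0 has the highest F1 but is rejected because its train/test gap signals overfitting.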

get_params()[source]

Return HyperOpt object parameters

Returns:{param : value}
Return type:dict
model_res_to_df(d_model_infos, sort_metric='F1')[source]

Store models summary in DataFrame

Parameters:
  • d_model_infos (dict) – {model_index : {‘HP’, ‘probas’, ‘model’, ‘features_importance’, ‘train_metrics’, ‘metrics’, ‘output’}}
  • sort_metric (string (default = 'F1')) – metric to sort models (descending)
Returns:

model infos and metrics

Return type:

DataFrame

predict(df, target, delta_auc, verbose=False)[source]

Apply the models

Parameters:
  • df (DataFrame) – Dataset to apply the models
  • target (string) – Target name
  • delta_auc (float) – Threshold for valid models : abs(auc(train) - auc(test))
  • verbose (boolean (Default False)) – Get logging information
Returns:

{model_index : {‘HP’, ‘probas’, ‘model’, ‘features_importance’, ‘train_metrics’, ‘metrics’, ‘output’}}

Return type:

dict