Start¶
Load¶
Data_handling import functions :
- get_delimiter : identify delimiter for a .csv/.txt file
- import_data : import a dataset file into a DataFrame
-
AutoMxL.Start.Load.get_delimiter(file)[source]¶ Identify the delimiter for a csv/txt file
Parameters: file (string) – Path and name of the file (Ex : “data/file.csv”) Returns: identified delimiter Return type: string
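How the delimiter is identified is not described here. One plausible approach, given as a minimal sketch and not as the library's implementation, is to keep the candidate character that appears the same non-zero number of times on every line. The names `guess_delimiter` and `CANDIDATES` are hypothetical.

```python
# Hypothetical sketch: AutoMxL's own get_delimiter may work differently.
CANDIDATES = (",", ";", "\t", "|")


def guess_delimiter(lines, candidates=CANDIDATES):
    """Return the candidate found the same non-zero number of times on every line."""
    rows = [line for line in lines if line.strip()]  # ignore blank lines
    best, best_count = None, 0
    for delim in candidates:
        counts = {row.count(delim) for row in rows}
        if len(counts) == 1:  # the count is identical on all rows
            count = next(iter(counts))
            if count > best_count:
                best, best_count = delim, count
    return best


print(guess_delimiter(["id;name;age", "1;Ann;31", "2;Bob;45"]))
```

The function returns None when no candidate is used consistently, which leaves the decision to the caller.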
-
AutoMxL.Start.Load.import_data(file, index_col=None, verbose=False)[source]¶ Import dataset as a DataFrame (identify delimiter for txt and csv files)
Available files : .txt, .csv, .xlsx, .xls files
Parameters: - file (string) – Path and name of the file (Ex : “data/file.csv”). If the file is .csv or .txt, the delimiter is identified automatically
- index_col (int, str, sequence of int / str, or False (Default None)) – Column(s) to use as the row labels of the DataFrame, either given as string name or column index. If a sequence of int / str is given, a MultiIndex is used.
- verbose (boolean (Default False)) – Get logging information
Returns: imported dataset
Return type: DataFrame
Encode_Target¶
Target encoding functions :
- category_to_target : create a target variable (1/0) from a selected category
- range_to_target : create a target variable (1/0) from a selected range
-
AutoMxL.Start.Encode_Target.category_to_target(df, var, cat)[source]¶ Create a target variable (1/0) from a selected category
Parameters: - df (DataFrame) – input dataset
- var (string) – variable containing the target category
- cat (string) – target category
Returns: - DataFrame (modified dataset)
- string (new target name (var+’_’+cat))
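The encoding described above can be sketched without the library, on a list of dicts instead of a DataFrame. This is an illustration only: whether AutoMxL keeps or drops the original variable is not stated here, and the sketch keeps it.

```python
# Hypothetical sketch on a list of dicts; AutoMxL itself works on a DataFrame.
def category_to_target(rows, var, cat):
    """Add a 1/0 column named var + '_' + cat and return (rows, target_name)."""
    target_name = f"{var}_{cat}"
    new_rows = [{**row, target_name: int(row[var] == cat)} for row in rows]
    return new_rows, target_name


rows = [{"color": "red"}, {"color": "blue"}, {"color": "red"}]
encoded, name = category_to_target(rows, "color", "red")
print(name, [row[name] for row in encoded])
```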
-
AutoMxL.Start.Encode_Target.range_to_target(df, var, min=None, max=None, verbose=False)[source]¶ Create a target variable (1/0) from a selected range
Parameters: - df (DataFrame) – input dataset
- var (string) – variable containing the target range
- min (float) – lower limit. If None, no min
- max (float) – upper limit. If None, no max
- verbose (boolean (Default False)) – Get logging information
Returns: - DataFrame (modified dataset)
- string (new target name (var+’_’+lower+’_’+upper))
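A minimal sketch of the range encoding, again on a list of dicts. Two points are assumptions, since they are not stated here: both limits are treated as inclusive, and the limits are written into the target name as they are passed. The parameters are named `lower` and `upper` in the sketch to avoid shadowing Python's built-in `min` and `max`.

```python
# Hypothetical sketch; inclusive bounds are an assumption, not documented behaviour.
def range_to_target(rows, var, lower=None, upper=None):
    """Add a 1/0 column flagging values inside [lower, upper]."""
    target_name = f"{var}_{lower}_{upper}"

    def in_range(value):
        above = lower is None or value >= lower  # None means no lower limit
        below = upper is None or value <= upper  # None means no upper limit
        return int(above and below)

    new_rows = [{**row, target_name: in_range(row[var])} for row in rows]
    return new_rows, target_name


rows = [{"age": 10}, {"age": 25}, {"age": 40}]
encoded, name = range_to_target(rows, "age", lower=18, upper=30)
print(name, [row[name] for row in encoded])
```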
Explore¶
Explore¶
Global dataset information functions :
- explore (func): identify variable types and give global information about the dataset (NA, low variance features)
- low_variance_features (func): identify features with low variance
- get_features_type (func): get all features per type
-
AutoMxL.Explore.Explore.explore(df, verbose=False)[source]¶ Identify variable types and give global information about the dataset
- Variables type :
- date
- identifier
- verbatim
- boolean
- categorical
- numerical
- variables containing NA values
- low variance and unique values variables
See get_features_type function doc for type identification heuristics
Parameters: - df (DataFrame) – input dataset
- verbose (boolean (Default False)) – Get logging information
Returns: {x : variables names list }
- date : date features
- identifier : identifier features
- verbatim : verbatim features
- boolean : boolean features
- categorical : categorical features
- numerical : numerical features
- NA : features that contain NA values
- low_variance : features with low variance
Return type: dict
-
AutoMxL.Explore.Explore.get_features_type(df, l_var=None, th=0.95)[source]¶ Get all features per type :
- date : try to apply to_datetime
- identifier :
- #(unique values)/#(total values) > threshold (default 0.95)
- AND length is the same for all values (for non NA)
- verbatim :
- #(unique values)/#(total values) >= threshold (default 0.95)
- AND length is NOT the same for all values (for non NA)
- boolean : #(distinct values) = 2
- categorical :
- not a date
- #(unique values)/#(total values) < threshold (default 0.95)
- AND #(unique values)>2
- AND for num values #(unique values)<30
- numerical : others
Parameters: - df (DataFrame) – input dataset
- l_var (list (Default : None)) – variable names
- th (float (Default : 0.95)) – threshold used to identify identifiers/verbatims variables
Returns: { type : variables name list}
Return type: dict
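The heuristics listed above can be read as a decision sequence. The sketch below applies them to a single column given as a Python list, as an illustration rather than the library's code. It makes several assumptions: the date test is omitted, NA values are represented by None, the ratio is computed over non-NA values, and the identifier/verbatim tests are applied to string columns only. The name `infer_type` is hypothetical.

```python
# Hypothetical sketch of the documented heuristics (date detection omitted).
def infer_type(values, th=0.95):
    non_na = [v for v in values if v is not None]
    if not non_na:
        return None
    unique = set(non_na)
    ratio = len(unique) / len(non_na)  # #(unique values) / #(total values)
    same_length = len({len(str(v)) for v in non_na}) == 1
    is_num = all(isinstance(v, (int, float)) for v in non_na)

    if not is_num and ratio > th and same_length:
        return "identifier"
    if not is_num and ratio >= th and not same_length:
        return "verbatim"
    if len(unique) == 2:
        return "boolean"
    if ratio < th and len(unique) > 2 and (not is_num or len(unique) < 30):
        return "categorical"
    return "numerical"


print(infer_type(["A001", "B002", "C003", "D004"]))
print(infer_type(["red", "green", "blue"] * 10))
```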
-
AutoMxL.Explore.Explore.low_variance_features(df, var_list=None, threshold=0, rescale=True, verbose=False)[source]¶ Identify numerical features with low variance (< threshold). Features can be rescaled before the variance is computed.
Parameters: - df (DataFrame) – input DataFrame
- var_list (list (default : None)) – names of the variables whose variance is checked. If None, all the numerical features
- threshold (float (default : 0)) – variance threshold
- rescale (bool (default : True)) – enable MinMaxScaler before computing variance
- verbose (boolean (Default False)) – Get logging information
Returns: Names of the variables with low variance Return type: list
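A minimal sketch of the low variance check, using only the standard library. It is not the library's implementation: the min-max rescaling is written by hand instead of using MinMaxScaler, population variance is used, and the comparison is `<=` because a strict `<` with the default threshold of 0 would never select anything. All three choices are assumptions.

```python
from statistics import pvariance


def min_max_scale(values):
    """Rescale values to [0, 1]; a constant column becomes all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def low_variance_features(columns, threshold=0.0, rescale=True):
    """Return the names of the columns whose variance is <= threshold."""
    selected = []
    for name, values in columns.items():
        scaled = min_max_scale(values) if rescale else values
        if pvariance(scaled) <= threshold:
            selected.append(name)
    return selected


columns = {"const": [5, 5, 5], "x": [0, 1, 2]}
print(low_variance_features(columns))
```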
Features_Type¶
Variable type identification functions :
- features_from_type (func): get all features for a selected type
- is_date (func): test if a variable is a date
- is_identifier (func): test if a variable is an identifier
- is_verbatim (func): test if a variable is a verbatim
- is_boolean (func): test if a variable is a boolean
- is_categorical (func): test if a variable is a categorical one (with more than 2 categories)
-
AutoMxL.Explore.Features_Type.features_from_type(df, typ, l_var=None, th=0.95)[source]¶ Get features of a selected type :
- date : try to apply to_datetime
- identifier :
- #(unique values)/#(total values) > threshold (default 0.95)
- AND length is the same for all values (for non NA)
- verbatim :
- #(unique values)/#(total values) >= threshold (default 0.95)
- AND length is NOT the same for all values (for non NA)
- boolean : #(distinct values) = 2
- categorical :
- not a date
- #(unique values)/#(total values) < threshold (default 0.95)
- AND #(unique values)>2
- AND for num values #(unique values)<30
Parameters: - df (DataFrame) – input dataset
- typ (string) –
selected type to get features:
- ’date’
- ’identifier’
- ’verbatim’
- ’boolean’
- ’categorical’
- l_var (list (Default : None)) – variables names. If None, all dataset columns
- th (float (Default : 0.95)) – threshold used to identify identifiers/verbatims variables
Returns: identified variables names
Return type: list
-
AutoMxL.Explore.Features_Type.is_boolean(df, col)[source]¶ Test if a variable is a boolean.
- #(distinct values) = 2
Parameters: - df (DataFrame) – input dataset
- col (string) – variable name
Returns: res – test result
Return type: boolean
-
AutoMxL.Explore.Features_Type.is_categorical(df, col, th=0.95)[source]¶ Test if a variable is a categorical one (with more than 2 categories).
- not a date
- #(unique values)/#(total values) < threshold (default 0.95)
- AND #(unique values)>2
- AND for num values #(unique values)<30
Parameters: - df (DataFrame) – input dataset
- col (string) – variable name
- th (float (Default : 0.95)) – threshold
Returns: res – test result
Return type: boolean
-
AutoMxL.Explore.Features_Type.is_date(df, col)[source]¶ Test if a variable is a date.
Method : try to apply to_datetime
Parameters: - df (DataFrame) – input dataset
- col (string) – variable name
Returns: res – test result
Return type: boolean
-
AutoMxL.Explore.Features_Type.is_identifier(df, col, th=0.95)[source]¶ Test if a variable is an identifier.
- #(unique values)/#(total values) > threshold (default 0.95)
- AND length is the same for all values (for non NA)
- AND not date
Parameters: - df (DataFrame) – input dataset
- col (string) – variable name
- th (float (Default : 0.95)) – threshold rate
Returns: res – test result
Return type: boolean
-
AutoMxL.Explore.Features_Type.is_verbatim(df, col, th=0.95)[source]¶ Test if a variable is a verbatim.
- #(unique values)/#(total values) >= threshold (default 0.95)
- AND length is NOT the same for all values (for non NA)
Parameters: - df (DataFrame) – input dataset
- col (string) – variable name
- th (float (Default : 0.95)) – threshold rate
Returns: res – test result
Return type: boolean
Preprocessing¶
Missing_Values¶
Missing values handling functions :
- NAEncoder (class): encoder that replaces missing values
- fill_numerical (func): replace missing values for numerical features
- fill_categorical (func): replace missing values for categorical features
- get_NA_features (func): get features containing NA values
-
class
AutoMxL.Preprocessing.Missing_Values.NAEncoder(replace_num_with='median', replace_cat_with='NR', track_num_NA=True)[source]¶ Missing values filling
Available methods to replace missing values
- num : median/mean/zero
- cat : ‘NR’
Parameters: - replace_num_with (string) – method used to replace numerical missing values
- replace_cat_with (string) – method used to replace categorical missing values
-
fit(df, l_var, verbose=False)[source]¶ fit encoder
Parameters: - df (DataFrame) – input dataset
- l_var (list) – features to encode. If None, all features
- verbose (boolean (Default False)) – Get logging information
-
AutoMxL.Preprocessing.Missing_Values.fill_categorical(df, l_var=None, method='NR', verbose=False)[source]¶ Fill missing values for selected/all categorical features.
Parameters: - df (DataFrame) – Input dataset
- l_var (list (Default : None)) – list of the features to fill. If None, contains all the categorical features
- method (string (Default : 'NR')) –
Method used to fill the NA values :
- NR : replace NA with ‘NR’
- verbose (boolean (Default False)) – Get logging information
Returns: Modified dataset
Return type: DataFrame
-
AutoMxL.Preprocessing.Missing_Values.fill_numerical(df, l_var=None, method='median', track_num_NA=True, verbose=False)[source]¶ Fill missing values for selected/all numerical features. The track_num_NA parameter creates a variable that keeps track of missing values.
Available methods : replace with zero, median or mean (Default = median)
Parameters: - df (DataFrame) – Input dataset
- l_var (list (Default : None)) – names of the features to fill. If None, all the numerical features
- method (string (Default : 'median')) –
Method used to fill the NA values :
- zero : replace with zero
- median : replace with median
- mean : replace with mean
- track_num_NA (boolean (Default : True)) – If True, create a boolean column to keep track of missing values
- verbose (boolean (Default False)) – Get logging information
Returns: Modified dataset
Return type: DataFrame
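The three filling methods and the tracking column can be sketched on a single column given as a list, with None standing for a missing value. This is an illustration of the idea, not the library's code, and `FILLERS` is a hypothetical name.

```python
from statistics import mean, median

# Maps each documented method name to the function computing the fill value.
FILLERS = {"median": median, "mean": mean, "zero": lambda values: 0}


def fill_numerical(values, method="median"):
    """Return (filled values, 1/0 flags marking which entries were missing)."""
    observed = [v for v in values if v is not None]
    fill_value = FILLERS[method](observed)
    filled = [fill_value if v is None else v for v in values]
    was_missing = [int(v is None) for v in values]
    return filled, was_missing


print(fill_numerical([1, None, 3]))
print(fill_numerical([1, None, 3], method="zero"))
```

Keeping the flags next to the filled column lets a model distinguish an observed median from an imputed one.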
Categorical Data¶
Categorical features processing
- CategoricalEncoder (class) : Encode categorical features
- dummy_all_var (func) : get one hot encoded vector for each category of a categorical features list
- get_embedded_cat (func) : get embedding representation with NN
- mca (func) : to do
-
class
AutoMxL.Preprocessing.Categorical.CategoricalEncoder(method='deep_encoder')[source]¶ Encode categorical features
Available encoding methods :
- one_hot : one hot encoding
- deep_encoder : Build and train a Neural Network for the creation of embeddings for categorical variables.
(https://www.fast.ai/2018/04/29/categorical-embeddings/)
Default NN model parameters are stored in param_config.py file
Parameters: method (string (Default : deep_encoder)) – method used to get categorical encoding. Available methods : “one_hot”, “deep_encoder”
-
fit(df, l_var=None, target=None, verbose=False)[source]¶ Fit encoder on dataset following method
Parameters: - df (DataFrame) – input dataset
- l_var (list (Default None)) – names of the variables to encode. If None, all the categorical and boolean features
- target (string (Default None)) – name of the target for deep_encoder method
- verbose (boolean (Default False)) – Get logging information
-
fit_transform(df, l_var=None, target=None, verbose=False)[source]¶ fit and transform dataset categorical features
Parameters: - df (DataFrame) – input dataset
- l_var (list (Default None)) – names of the variables to encode. If None, all the categorical and boolean features
- target (string (Default None)) – name of the target for deep_encoder method
- verbose (boolean (Default False)) – Get logging information
Returns: modified dataset
Return type: DataFrame
-
AutoMxL.Preprocessing.Categorical.dummy_all_var(df, var_list=None, prefix_list=None, keep=False, verbose=False)[source]¶ Get one hot encoded vector for selected/all categorical features
Parameters: - df (DataFrame) – Input dataset
- var_list (list (Default : None)) – Names of the features to dummify. If None, all the categorical features
- prefix_list (list (default : None)) – Prefix to add before new features name (prefix+’_’+cat). If None, prefix=variable name
- keep (boolean (Default = False)) – If True, keep the original feature
- verbose (boolean (Default False)) – Get logging information
Returns: Modified dataset
Return type: DataFrame
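One hot encoding with the documented naming rule (prefix+’_’+cat) can be sketched for one variable on a list of dicts. This is an illustration under assumptions, not the library's code: the sketch reads `keep=True` as retaining the original column, and `dummy_var` is a hypothetical name.

```python
# Hypothetical single-variable sketch; AutoMxL's dummy_all_var works on a DataFrame.
def dummy_var(rows, var, prefix=None, keep=False):
    """Replace var by one 1/0 column per category, named prefix + '_' + category."""
    prefix = var if prefix is None else prefix
    categories = sorted({row[var] for row in rows})
    encoded = []
    for row in rows:
        # Assumption: keep=True retains the original column.
        new_row = dict(row) if keep else {k: v for k, v in row.items() if k != var}
        for cat in categories:
            new_row[f"{prefix}_{cat}"] = int(row[var] == cat)
        encoded.append(new_row)
    return encoded


print(dummy_var([{"c": "a"}, {"c": "b"}], "c"))
```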
-
AutoMxL.Preprocessing.Categorical.get_embedded_cat(df, var_list, target, batchsize, n_epochs, lr, verbose=False)[source]¶ Get embedded representation for categorical features using NN encoder
Parameters: - df (DataFrame) – input Dataset
- var_list (list of strings) – features names
- target (string) – target name
- batchsize (int) – batch size for encoder training
- n_epochs (int) – number of epoch for encoder training
- lr (float) – encoder learning rate
- verbose (boolean (Default False)) – Get logging information
Returns: modified dataset
Return type: DataFrame
Date Data¶
Date Features processing functions:
- DateEncoder (class) : encode date features
- all_to_date (func): detect dates from num/cat features and transform them to datetime format.
- date_to_anc (func): transform datetime features to timedelta according to a ref date
-
class
AutoMxL.Preprocessing.Date.DateEncoder(method='timedelta', date_ref=None)[source]¶ Encode date features
Available methods :
- timedelta : compute time between date feature and parameter date_ref
Parameters: - method (string (Default : timedelta)) – method used to encode dates Available methods : “timedelta”
- date_ref (string '%d/%m/%y' (Default : None)) – Reference date used to compute the timedelta. If None, today's date
-
fit(df, l_var=None, verbose=False)[source]¶ fit encoder
Parameters: - df (DataFrame) – input dataset
- l_var (list) – features to encode. If None, contains all features identified as dates (see Features_Type module)
- verbose (boolean (Default False)) – Get logging information
-
AutoMxL.Preprocessing.Date.all_to_date(df, l_var=None, verbose=False)[source]¶ Detect dates from selected/all features and transform them to datetime format.
Parameters: - df (DataFrame) – Input dataset
- l_var (list (Default : None)) – Names of the features. If None, all the features
- verbose (boolean (Default False)) – Get logging information
Returns: Modified dataset
Return type: DataFrame
-
AutoMxL.Preprocessing.Date.date_to_anc(df, l_var=None, date_ref=None, verbose=False)[source]¶ Transform selected/all datetime features to timedelta according to a ref date
Parameters: - df (DataFrame) – Input dataset
- l_var (list (Default : None)) – List of the features to analyze. If None, contains all the datetime features
- date_ref (string '%d/%m/%y' (Default : None)) – Reference date used to compute the timedelta. If None, today's date
- verbose (boolean (Default False)) – Get logging information
Returns: - DataFrame – Modified dataset
- list – New timedelta features names
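The timedelta encoding can be sketched with the standard library for a list of datetime values. The unit (days) and the sign (reference date minus feature date) are assumptions, since neither is stated here. Only the '%d/%m/%y' format of date_ref comes from the documentation above.

```python
from datetime import datetime


def date_to_anc(dates, date_ref=None):
    """Return the number of days between each date and the reference date."""
    # date_ref uses the documented '%d/%m/%y' format; None means now.
    ref = datetime.now() if date_ref is None else datetime.strptime(date_ref, "%d/%m/%y")
    return [(ref - d).days for d in dates]


print(date_to_anc([datetime(2021, 1, 1), datetime(2021, 1, 31)], date_ref="01/02/21"))
```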
Process Outliers¶
Outliers handling functions
- OutliersEncoder (class) : identify and replace outliers
- get_cat_outliers (func): identify categorical features containing outliers
- get_num_outliers (func): identify numerical features containing outliers
- replace_category (func): replace categories of a categorical variable
- replace_extreme_values (func): replace extreme values
-
class
AutoMxL.Preprocessing.Outliers.OutliersEncoder(cat_threshold=0.02, num_xstd=4)[source]¶ Identify and replace outliers for categorical and numerical features
- num : x outlier <=> abs(x - mean) > xstd * var
- cat : x outlier category <=> with frequency <x% (Default 2%)
Parameters: - cat_threshold (float (default 0.02)) – Minimum modality frequency
- num_xstd (int (Default : 4)) – Variance gap coef
-
fit(df, l_var, verbose=False)[source]¶ Fit encoder
Parameters: - df (DataFrame) – input dataset
- l_var (list) – features to encode. If None, all features
- verbose (boolean (Default False)) – Get logging information
-
AutoMxL.Preprocessing.Outliers.get_cat_outliers(df, l_var=None, threshold=0.05, verbose=False)[source]¶ Outliers detection for selected/all categorical features.
Method : Modalities with frequency <x% (Default 5%)
Parameters: - df (DataFrame) – Input dataset
- l_var (list (Default : None)) – Names of the features. If None, all the categorical features
- threshold (float (Default : 0.05)) – Minimum modality frequency
- verbose (boolean (Default False)) – Get logging information
Returns: {variable : list of categories considered as outliers}
Return type: dict
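The frequency rule for categorical outliers can be sketched for one column given as a list. It is an illustration of the rule above (frequency strictly below the threshold), not the library's implementation.

```python
from collections import Counter


def get_cat_outliers(values, threshold=0.05):
    """Return the categories whose relative frequency is below threshold."""
    counts = Counter(values)
    total = len(values)
    return sorted(cat for cat, n in counts.items() if n / total < threshold)


values = ["a"] * 97 + ["b"] * 3
print(get_cat_outliers(values))
```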
-
AutoMxL.Preprocessing.Outliers.get_num_outliers(df, l_var=None, xstd=3, verbose=False)[source]¶ Outliers detection for selected/all numerical features.
Method : x outlier <=> abs(x - mean) > xstd * var
Parameters: - df (DataFrame) – Input dataset
- l_var (list (Default : None)) – Names of the features. If None, all the num features
- xstd (int (Default : 3)) – Variance gap coef
- verbose (boolean (Default False)) – Get logging information
Returns: {variable : [lower_limit, upper_limit]}
Return type: dict
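A minimal sketch of the limits returned for one numerical column. The rule above is written with "var"; given the parameter name xstd, the sketch reads the spread term as the standard deviation, which is an assumption. Population standard deviation is used, also as an assumption.

```python
from statistics import mean, pstdev


def get_num_outlier_limits(values, xstd=3):
    """Return [lower_limit, upper_limit] = mean -/+ xstd * standard deviation."""
    center = mean(values)
    spread = pstdev(values)  # assumption: the doc's 'var' is read as the std
    return [center - xstd * spread, center + xstd * spread]


values = [0, 0, 0, 0, 10]
lower, upper = get_num_outlier_limits(values, xstd=1)
print(lower, upper, [v for v in values if v < lower or v > upper])
```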
-
AutoMxL.Preprocessing.Outliers.replace_category(df, var, categories, replace_with='outliers', verbose=False)[source]¶ Replace categories of a categorical variable
Parameters: - df (DataFrame) – Input dataset
- var (string) – variable to modify
- categories (list(string)) – categories to replace
- replace_with (string (Default : 'outliers')) – word to replace categories with
- verbose (boolean (Default False)) – Get logging information
Returns: Modified dataset
Return type: DataFrame
-
AutoMxL.Preprocessing.Outliers.replace_extreme_values(df, var, lower_th=None, upper_th=None, verbose=False)[source]¶ Replace extreme values : > upper threshold or < lower threshold
Parameters: - df (DataFrame) – Input dataset
- var (string) – variable to modify
- lower_th (int/float (Default=None)) – lower threshold
- upper_th (int/float (Default=None)) – upper threshold
- verbose (boolean (Default False)) – Get logging information
Returns: Modified dataset
Return type: DataFrame
Features Selection¶
Features selection
- select_features (func) : features selection following method
-
class
AutoMxL.Select_Features.Select_Features.FeatSelector(method='pca')[source]¶ features selection following method
- pca : use pca to reduce dataset dimensions
- no_rescale_pca : use pca without rescaling data
Parameters: method (string (Default pca)) – method used to select features
-
fit(df, l_var=None, verbose=False)[source]¶ fit selector
Parameters: - df (DataFrame) – input dataset
- l_var (list) – features to encode. If None, all features identified as numerical
- verbose (boolean (Default False)) – Get logging information
-
fit_transform(df, l_var, verbose=False)[source]¶ fit and apply features selection
Parameters: - df (DataFrame) – input dataset
- l_var (list) – features to encode. If None, all features identified as numerical
- verbose (boolean (Default False)) – Get logging information
Returns: modified dataset
Return type: DataFrame
-
AutoMxL.Select_Features.Select_Features.select_features(df, target, method='pca', verbose=False)[source]¶ features selection following method
- pca : use pca to reduce dataset dimensions
- no_rescale_pca : use pca without rescaling data
Parameters: - df (DataFrame) – input dataset containing features
- target (string) – target name
- method (string (Default pca)) – method used to select features
- verbose (boolean (Default False)) – Get logging information
Returns: modified dataset
Return type: DataFrame
Modelisation¶
Bagging¶
Bagging algorithm class. Methods :
- Bagging (class) : generate new, more balanced training sets and train a model on each
- create_sample (func) : generate a bagging sample
-
class
AutoMxL.Modelisation.Bagging.Bagging(clf=RandomForestClassifier(n_estimators=100, max_leaf_nodes=100), n_sample=5, pos_sample_size=1.0, replace=True)[source]¶ Meta-algorithm designed to improve the stability and accuracy of ML classification/regression algorithms, or to handle an imbalanced target distribution.
Bagging generates n_sample new, more balanced training sets. A model is then fitted on each sample and the outputs are combined by averaging (for regression) or voting (for classification).
Available classifiers : Random Forest and XGBOOST
Parameters: - clf (model (Default : RandomForestClassifier(n_estimators=100, max_leaf_nodes=100))) – Model fitted on the samples
- n_sample (int (Default : 5)) – number of samples
- pos_sample_size (int/float (Default : 1.0)) –
Number/rate of target=1 observations in each sample (each sample is completed with 3 times as many target=0 observations)
- if int : number of target=1
- if float : rate of total target=1
- replace (Boolean (Default : True)) – Enable sampling with replacement
- list_model (list (Default : None)) – Fitted models (created with fit method)
-
bag_feature_importance(X)[source]¶ Get features importance of the model by averaging importance of models fitted on the samples
Parameters: X (DataFrame) – Input Dataset Returns: {feature : importance} Return type: dict
-
fit(df_train, target)[source]¶ Create bagging samples from a DataFrame and fit the model (self.clf) on each sample
Parameters: - df_train (DataFrame) – Training dataset
- target (String) – Target name
Returns: self.list_model – Fitted models
Return type: list
-
predict(df)[source]¶ Apply models fitted on sample to a dataset. Combine models by averaging the outputs (for regression) or voting (for classification)
Parameters: df (DataFrame) – Dataset to apply the model Returns: - numpy.ndarray (float) – Averaged classification probabilities
- numpy.ndarray (int) – Predictions for each observation
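How the per-sample models are combined can be sketched with plain lists of probabilities, one list per fitted model. This is an illustration of averaging and majority voting, not the library's code. The 0.5 decision threshold and the name `combine_predictions` are assumptions.

```python
# Hypothetical sketch: probas_per_model[i][j] is model i's probability for row j.
def combine_predictions(probas_per_model, threshold=0.5):
    """Return (averaged probabilities, majority-vote predictions)."""
    n_models = len(probas_per_model)
    per_row = list(zip(*probas_per_model))  # regroup the outputs by observation
    averaged = [sum(row) / n_models for row in per_row]
    # A row is predicted 1 when more than half of the models vote 1.
    votes = [int(2 * sum(p >= threshold for p in row) > n_models) for row in per_row]
    return averaged, votes


probas = [[0.9, 0.2], [0.6, 0.4], [0.3, 0.1]]
print(combine_predictions(probas))
```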
-
AutoMxL.Modelisation.Bagging.create_sample(df, target, pos_target_nb, replace=False)[source]¶ Generate a DataFrame sample with selected number of target=1
Parameters: - df (DataFrame) – Input dataset
- target (String) – Target name
- pos_target_nb (int) – Number of target=1 observations in the sample
- replace (Boolean (Default : False)) – If True, create samples with replacement
Returns: sample dataset
Return type: DataFrame
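A minimal sketch of a balanced sample drawn from a list of dicts. The ratio of three target=0 rows per target=1 row is taken from the Bagging class description and is assumed to apply here. The `neg_ratio` and `seed` parameters are hypothetical additions that make the sketch configurable and reproducible.

```python
import random


def create_sample(rows, target, pos_target_nb, replace=False, neg_ratio=3, seed=None):
    """Draw pos_target_nb positives and neg_ratio times as many negatives."""
    rng = random.Random(seed)
    positives = [row for row in rows if row[target] == 1]
    negatives = [row for row in rows if row[target] == 0]
    # choices samples with replacement, sample without replacement.
    draw = rng.choices if replace else rng.sample
    return draw(positives, k=pos_target_nb) + draw(negatives, k=neg_ratio * pos_target_nb)


rows = [{"y": 1}] * 30 + [{"y": 0}] * 200
sample = create_sample(rows, "y", pos_target_nb=10, seed=0)
print(len(sample), sum(row["y"] for row in sample))
```

Without replacement the call raises ValueError when a class has fewer rows than requested, so sampling with replacement is the option to use for very rare targets.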
Hyperoptimisation¶
Hyper-optimisation class :
- HyperOpt (class) : model hyper-optimisation with random search
-
class
AutoMxL.Modelisation.HyperOpt.HyperOpt(classifier='RF', grid_param=None, n_param_comb=10, bagging=False, bagging_param={'n_sample': 5, 'pos_sample_size': 1.0, 'replace': False}, comb_seed=None)[source]¶ Model hyper-optimisation with random search :
- from a hyper-parameter grid, create random HP combinations
- train a model for each combination
- apply the models
Parameters: - classifier (string (Default : 'RF')) – classifier for modelisation
- grid_param (dict (Default : Default_RF_grid_param)) – HP grid
- n_param_comb (int (Default : 10)) – number of HP combinations
- bagging (Boolean (Default = False)) – use bagging method
- bagging_param (dict) – bagging parameters (Default : default_bagging_param (Bagging module))
- train_model_dict (dict, created with fit method) – {model_index : {‘HP’, ‘probas’, ‘model’, ‘features_importance’, ‘train_metrics’}}
- bagging_object (Bagging) – bagging object
- comb_seed (int) – seed for randomized HP combinations
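Drawing random HP combinations from a grid can be sketched with the standard library. This is an illustration of random search sampling, not the library's implementation: it enumerates the full grid and samples without replacement, which may differ from what HyperOpt does. The grid values are made up for the example.

```python
import itertools
import random


def random_combinations(grid_param, n_param_comb=10, comb_seed=None):
    """Return up to n_param_comb distinct HP combinations drawn from the grid."""
    names = sorted(grid_param)
    all_combs = [
        dict(zip(names, values))
        for values in itertools.product(*(grid_param[name] for name in names))
    ]
    rng = random.Random(comb_seed)  # the seed makes the draw reproducible
    return rng.sample(all_combs, min(n_param_comb, len(all_combs)))


grid = {"n_estimators": [50, 100], "max_depth": [3, 5, 7]}  # example values only
for comb in random_combinations(grid, n_param_comb=4, comb_seed=1):
    print(comb)
```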
-
fit(df_train, target, verbose=False)[source]¶ Fit a model for each HP combination
Parameters: - df_train (DataFrame) – Training dataset
- target (string) – Target name
- verbose (boolean (Default False)) – Get logging information
Returns: self.train_model_dict (created with fit method) – {model_index : {‘HP’, ‘probas’, ‘model’, ‘features_importance’, ‘train_metrics’}}
Return type: dict
-
get_best_model(d_model_info, metric='F1', delta_auc_th=0.03, verbose=False)[source]¶ Identify valid models according to delta auc (test/train), then get the best model with respect to a selected metric among the valid models
Parameters: - d_model_info (dict) – {model_index : {‘HP’, ‘probas’, ‘model’, ‘features_importance’, ‘train_metrics’, ‘metrics’, ‘output’}}
- metric (string (default = 'F1')) – Metric used to get the best model
- delta_auc_th (float) – Threshold for valid models : abs(auc(train) - auc(test))
- verbose (boolean (Default False)) – Get logging information
Returns: - int – Best model index
- list – Valid model indexes
-
model_res_to_df(d_model_infos, sort_metric='F1')[source]¶ Store models summary in DataFrame
Parameters: - d_model_infos (dict) – {model_index : {‘HP’, ‘probas’, ‘model’, ‘features_importance’, ‘train_metrics’, ‘metrics’, ‘output’}}
- sort_metric (string (default = 'F1')) – metric used to sort models (descending)
Returns: model infos and metrics
Return type: DataFrame
-
predict(df, target, delta_auc, verbose=False)[source]¶ Apply the models
Parameters: - df (DataFrame) – Dataset to apply the models
- target (string) – Target name
- delta_auc (float) – Threshold for valid models : abs(auc(train) - auc(test))
- verbose (boolean (Default False)) – Get logging information
Returns: {model_index : {‘HP’, ‘probas’, ‘model’, ‘features_importance’, ‘train_metrics’, ‘metrics’, ‘output’}}
Return type: dict