Welcome to pycaret’s documentation!

PyCaret is an open source, low-code machine learning library in Python that aims to reduce the hypothesis-to-insights cycle time in an ML experiment. It enables data scientists to perform end-to-end experiments quickly and efficiently. In comparison with other open source machine learning libraries, PyCaret is an alternate low-code library that can be used to perform complex machine learning tasks with only a few lines of code. PyCaret is essentially a Python wrapper around several machine learning libraries and frameworks such as scikit-learn, XGBoost, Microsoft LightGBM, spaCy and many more.

The design and simplicity of PyCaret is inspired by the emerging role of citizen data scientists, a term first used by Gartner. Citizen data scientists are power users who can perform both simple and moderately sophisticated analytical tasks that would previously have required more expertise. Seasoned data scientists are often difficult to find and expensive to hire, but citizen data scientists can be an effective way to mitigate this gap and address data-related challenges in a business setting.

PyCaret is simple, easy to use and deployment ready. All the steps performed in a ML experiment can be reproduced using a pipeline that is automatically developed and orchestrated in PyCaret as you progress through the experiment. A pipeline can be saved in a binary file format that is transferable across environments.

For more information on PyCaret, please visit our official website https://www.pycaret.org

Classification

pycaret.classification.setup(data, target, train_size=0.7, sampling=True, sample_estimator=None, categorical_features=None, categorical_imputation='constant', ordinal_features=None, high_cardinality_features=None, high_cardinality_method='frequency', numeric_features=None, numeric_imputation='mean', date_features=None, ignore_features=None, normalize=False, normalize_method='zscore', transformation=False, transformation_method='yeo-johnson', handle_unknown_categorical=True, unknown_categorical_method='least_frequent', pca=False, pca_method='linear', pca_components=None, ignore_low_variance=False, combine_rare_levels=False, rare_level_threshold=0.1, bin_numeric_features=None, remove_outliers=False, outliers_threshold=0.05, remove_multicollinearity=False, multicollinearity_threshold=0.9, remove_perfect_collinearity=False, create_clusters=False, cluster_iter=20, polynomial_features=False, polynomial_degree=2, trigonometry_features=False, polynomial_threshold=0.1, group_features=None, group_names=None, feature_selection=False, feature_selection_threshold=0.8, feature_selection_method='classic', feature_interaction=False, feature_ratio=False, interaction_threshold=0.01, fix_imbalance=False, fix_imbalance_method=None, data_split_shuffle=True, folds_shuffle=False, n_jobs=-1, use_gpu=False, html=True, session_id=None, log_experiment=False, experiment_name=None, log_plots=False, log_profile=False, log_data=False, silent=False, verbose=True, profile=False)

This function initializes the environment in pycaret and creates the transformation pipeline to prepare the data for modeling and deployment. setup() must be called before executing any other function in pycaret. It takes two mandatory parameters: data and the name of the target column.

All other parameters are optional.

Example

>>> from pycaret.datasets import get_data
>>> juice = get_data('juice')
>>> experiment_name = setup(data = juice,  target = 'Purchase')

‘juice’ is a pandas.DataFrame and ‘Purchase’ is the name of target column.

Parameters:
  • data (pandas.DataFrame) – Shape (n_samples, n_features) where n_samples is the number of samples and n_features is the number of features.
  • target (string) – Name of the target column to be passed in as a string. The target variable could be binary or multiclass. In case of a multiclass target, all estimators are wrapped with a OneVsRest classifier.
  • train_size (float, default = 0.7) – Size of the training set. By default, 70% of the data will be used for training and validation. The remaining data will be used for a test / hold-out set.
  • sampling (bool, default = True) – When the sample size exceeds 25,000 samples, pycaret will build a base estimator at various sample sizes from the original dataset. This will return a performance plot of AUC, Accuracy, Recall, Precision, Kappa and F1 values at various sample levels, which will assist in deciding the preferred sample size for modeling. The desired sample size must then be entered for training and validation in the pycaret environment. When the sample size entered is less than 1, the remaining dataset (1 - sample) is used for fitting the model only when finalize_model() is called.
  • sample_estimator (object, default = None) – If None, Logistic Regression is used by default.
  • categorical_features (string, default = None) – If the inferred data types are not correct, categorical_features can be used to overwrite the inferred type. If when running setup the type of ‘column1’ is inferred as numeric instead of categorical, then this parameter can be used to overwrite the type by passing categorical_features = [‘column1’].
  • categorical_imputation (string, default = 'constant') – If missing values are found in categorical features, they will be imputed with a constant ‘not_available’ value. The other available option is ‘mode’ which imputes the missing value using most frequent value in the training dataset.
  • ordinal_features (dictionary, default = None) – When the data contains ordinal features, they must be encoded differently using the ordinal_features param. If the data has a categorical variable with values of ‘low’, ‘medium’, ‘high’ and it is known that low < medium < high, then it can be passed as ordinal_features = { ‘column_name’ : [‘low’, ‘medium’, ‘high’] }. The list sequence must be in increasing order from lowest to highest.
  • high_cardinality_features (string, default = None) – When the data contains features with high cardinality, they can be compressed into fewer levels by passing them as a list of column names with high cardinality. Features are compressed using the method defined in the high_cardinality_method param.
  • high_cardinality_method (string, default = 'frequency') – When the method is set to ‘frequency’, it will replace the original value of the feature with its frequency distribution and convert the feature into numeric. The other available method is ‘clustering’, which performs clustering on the statistical attributes of the data and replaces the original value of the feature with the cluster label. The number of clusters is determined using a combination of Calinski-Harabasz and Silhouette criterion.
  • numeric_features (string, default = None) – If the inferred data types are not correct, numeric_features can be used to overwrite the inferred type. If when running setup the type of ‘column1’ is inferred as a categorical instead of numeric, then this parameter can be used to overwrite by passing numeric_features = [‘column1’].
  • numeric_imputation (string, default = 'mean') – If missing values are found in numeric features, they will be imputed with the mean value of the feature. The other available options are ‘median’ which imputes the value using the median value in the training dataset and ‘zero’ which replaces missing values with zeroes.
  • date_features (string, default = None) – If the data has a DateTime column that is not automatically detected when running setup, this parameter can be used by passing date_features = ‘date_column_name’. It can work with multiple date columns. Date columns are not used in modeling. Instead, feature extraction is performed and date columns are dropped from the dataset. If the date column includes a time stamp, features related to time will also be extracted.
  • ignore_features (string, default = None) – If any feature should be ignored for modeling, it can be passed to the param ignore_features. The ID and DateTime columns when inferred, are automatically set to ignore for modeling.
  • normalize (bool, default = False) – When set to True, the feature space is transformed using the method defined in the normalize_method param. Generally, linear algorithms perform better with normalized data; however, the results may vary and it is advised to run multiple experiments to evaluate the benefit of normalization (see the additional example at the end of this section).
  • normalize_method (string, default = 'zscore') –

    Defines the method to be used for normalization. By default, normalize method is set to ‘zscore’. The standard zscore is calculated as z = (x - u) / s. The other available options are:

    • ‘minmax’ : scales and translates each feature individually such that it is in the range of 0 - 1.
    • ‘maxabs’ : scales and translates each feature individually such that the maximal absolute value of each feature will be 1.0. It does not shift/center the data, and thus does not destroy any sparsity.
    • ‘robust’ : scales and translates each feature according to the interquartile range. When the dataset contains outliers, the robust scaler often gives better results.
  • transformation (bool, default = False) – When set to True, a power transformation is applied to make the data more normal / Gaussian-like. This is useful for modeling issues related to heteroscedasticity or other situations where normality is desired. The optimal parameter for stabilizing variance and minimizing skewness is estimated through maximum likelihood.
  • transformation_method (string, default = 'yeo-johnson') – Defines the method for transformation. By default, the transformation method is set to ‘yeo-johnson’. The other available option is ‘quantile’ transformation. Both transformations transform the feature set to follow a Gaussian-like or normal distribution. Note that the quantile transformer is non-linear and may distort linear correlations between variables measured at the same scale.
  • handle_unknown_categorical (bool, default = True) – When set to True, unknown categorical levels in new / unseen data are replaced by the most or least frequent level as learned in the training data. The method is defined under the unknown_categorical_method param.
  • unknown_categorical_method (string, default = 'least_frequent') – Method used to replace unknown categorical levels in unseen data. Method can be set to ‘least_frequent’ or ‘most_frequent’.
  • pca (bool, default = False) – When set to True, dimensionality reduction is applied to project the data into a lower dimensional space using the method defined in pca_method param. In supervised learning pca is generally performed when dealing with high feature space and memory is a constraint. Note that not all datasets can be decomposed efficiently using a linear PCA technique and that applying PCA may result in loss of information. As such, it is advised to run multiple experiments with different pca_methods to evaluate the impact.
  • pca_method (string, default = 'linear') –

    The ‘linear’ method performs Linear dimensionality reduction using Singular Value Decomposition. The other available options are:

    • ‘kernel’ : dimensionality reduction through the use of an RBF kernel.
    • ‘incremental’ : replacement for ‘linear’ pca when the dataset to be decomposed is too large to fit in memory.
  • pca_components (int/float, default = 0.99) – Number of components to keep. if pca_components is a float, it is treated as a target percentage for information retention. When pca_components is an integer it is treated as the number of features to be kept. pca_components must be strictly less than the original number of features in the dataset.
  • ignore_low_variance (bool, default = False) – When set to True, all categorical features with insignificant variances are removed from the dataset. The variance is calculated using the ratio of unique values to the number of samples, and the ratio of the most common value to the frequency of the second most common value.
  • combine_rare_levels (bool, default = False) – When set to True, all levels in categorical features below the threshold defined in the rare_level_threshold param are combined into a single level. There must be at least two levels under the threshold for this to take effect. rare_level_threshold represents the percentile distribution of level frequency. Generally, this technique is applied to limit a sparse matrix caused by high numbers of levels in categorical features.
  • rare_level_threshold (float, default = 0.1) – Percentile distribution below which rare categories are combined. Only comes into effect when combine_rare_levels is set to True.
  • bin_numeric_features (list, default = None) – When a list of numeric features is passed, they are transformed into categorical features using KMeans, where values in each bin have the same nearest center of a 1D k-means cluster. The number of clusters is determined based on the ‘sturges’ method. It is only optimal for Gaussian data and underestimates the number of bins for large non-Gaussian datasets.
  • remove_outliers (bool, default = False) – When set to True, outliers are removed from the training data using PCA linear dimensionality reduction with the Singular Value Decomposition technique.
  • outliers_threshold (float, default = 0.05) – The percentage / proportion of outliers in the dataset can be defined using the outliers_threshold param. By default, 0.05 is used which means 0.025 of the values on each side of the distribution’s tail are dropped from training data.
  • remove_multicollinearity (bool, default = False) – When set to True, the variables with inter-correlations higher than the threshold defined under the multicollinearity_threshold param are dropped. When two features are highly correlated with each other, the feature that is less correlated with the target variable is dropped.
  • multicollinearity_threshold (float, default = 0.9) – Threshold used for dropping the correlated features. Only comes into effect when remove_multicollinearity is set to True.
  • remove_perfect_collinearity (bool, default = False) – When set to True, perfect collinearity (features with correlation = 1) is removed from the dataset. When two features are 100% correlated, one of them is randomly dropped from the dataset.
  • create_clusters (bool, default = False) – When set to True, an additional feature is created where each instance is assigned to a cluster. The number of clusters is determined using a combination of Calinski-Harabasz and Silhouette criterion.
  • cluster_iter (int, default = 20) – Number of iterations used to create a cluster. Each iteration represents cluster size. Only comes into effect when create_clusters param is set to True.
  • polynomial_features (bool, default = False) – When set to True, new features are created based on all polynomial combinations that exist within the numeric features in a dataset to the degree defined in polynomial_degree param.
  • polynomial_degree (int, default = 2) – Degree of polynomial features. For example, if an input sample is two dimensional and of the form [a, b], the polynomial features with degree = 2 are: [1, a, b, a^2, ab, b^2].
  • trigonometry_features (bool, default = False) – When set to True, new features are created based on all trigonometric combinations that exist within the numeric features in a dataset to the degree defined in the polynomial_degree param.
  • polynomial_threshold (float, default = 0.1) – This is used to compress a sparse matrix of polynomial and trigonometric features. Polynomial and trigonometric features whose feature importance based on the combination of Random Forest, AdaBoost and Linear correlation falls within the percentile of the defined threshold are kept in the dataset. Remaining features are dropped before further processing.
  • group_features (list or list of list, default = None) – When a dataset contains features that have related characteristics, the group_features param can be used for statistical feature extraction. For example, if a dataset has numeric features that are related to each other (e.g. ‘Col1’, ‘Col2’, ‘Col3’), a list containing the column names can be passed under group_features to extract statistical information such as the mean, median, mode and standard deviation.
  • group_names (list, default = None) – When group_features is passed, a name for the group can be passed into the group_names param as a list containing strings. The length of the group_names list must be equal to the length of group_features. When the length doesn’t match or the name is not passed, new features are sequentially named such as group_1, group_2 etc.
  • feature_selection (bool, default = False) – When set to True, a subset of features is selected using a combination of various permutation importance techniques including Random Forest, AdaBoost and Linear correlation with the target variable. The size of the subset is dependent on the feature_selection_threshold param. Generally, this is used to constrain the feature space in order to improve efficiency in modeling. When polynomial_features and feature_interaction are used, it is highly recommended to define the feature_selection_threshold param with a lower value. The feature selection algorithm is ‘classic’ by default but could be ‘boruta’, which will lead PyCaret to use the Boruta selection algorithm.
  • feature_selection_threshold (float, default = 0.8) – Threshold used for feature selection (including newly created polynomial features). A higher value will result in a larger feature space. It is recommended to do multiple trials with different values of feature_selection_threshold, especially in cases where polynomial_features and feature_interaction are used. Setting a very low value may be efficient but could result in under-fitting.
  • feature_selection_method (str, default = 'classic') – Can be either ‘classic’ or ‘boruta’. Selects the algorithm responsible for choosing a subset of features. For the ‘classic’ selection method, PyCaret will use various permutation importance techniques. For the ‘boruta’ algorithm, PyCaret will create an instance of boosted trees model, which will iterate with permutation over all features and choose the best ones based on the distributions of feature importance.
  • feature_interaction (bool, default = False) – When set to True, it will create new features by interacting (a * b) for all numeric variables in the dataset including polynomial and trigonometric features (if created). This feature is not scalable and may not work as expected on datasets with large feature space.
  • feature_ratio (bool, default = False) – When set to True, it will create new features by calculating the ratios (a / b) of all numeric variables in the dataset. This feature is not scalable and may not work as expected on datasets with large feature space.
  • interaction_threshold (float, default = 0.01) – Similar to polynomial_threshold, it is used to compress a sparse matrix of newly created features through interaction. Features whose importance based on the combination of Random Forest, AdaBoost and Linear correlation falls within the percentile of the defined threshold are kept in the dataset. Remaining features are dropped before further processing.
  • fix_imbalance (bool, default = False) – When the dataset has an unequal distribution of the target class, it can be fixed using the fix_imbalance parameter. When set to True, SMOTE (Synthetic Minority Over-sampling Technique) is applied by default to create synthetic datapoints for the minority class.
  • fix_imbalance_method (obj, default = None) – When fix_imbalance is set to True and fix_imbalance_method is None, ‘smote’ is applied by default to oversample the minority class during cross validation. This parameter accepts any module from ‘imblearn’ that supports the ‘fit_resample’ method.
  • data_split_shuffle (bool, default = True) – If set to False, prevents shuffling of rows when splitting data.
  • folds_shuffle (bool, default = False) – If set to False, prevents shuffling of rows when using cross validation.
  • n_jobs (int, default = -1) – The number of jobs to run in parallel (for functions that support parallel processing); -1 means using all processors. To run all functions on a single processor, set n_jobs to None.
  • use_gpu (bool, default = False) – If set to True, algorithms that support GPU are trained using the GPU.
  • html (bool, default = True) – If set to False, prevents the runtime display of the monitor. This must be set to False when using an environment that doesn’t support HTML.
  • session_id (int, default = None) – If None, a random seed is generated and returned in the Information grid. The unique number is then distributed as a seed in all functions used during the experiment. This can be used for later reproducibility of the entire experiment.
  • log_experiment (bool, default = False) – When set to True, all metrics and parameters are logged on MLFlow server.
  • experiment_name (str, default = None) – Name of experiment for logging. When set to None, ‘clf’ is by default used as alias for the experiment name.
  • log_plots (bool, default = False) – When set to True, specific plots are logged in MLflow as a png file. By default, it is set to False.
  • log_profile (bool, default = False) – When set to True, data profile is also logged on MLflow as a html file. By default, it is set to False.
  • log_data (bool, default = False) – When set to True, train and test dataset are logged as csv.
  • silent (bool, default = False) – When set to True, confirmation of data types is not required. All preprocessing will be performed assuming automatically inferred data types. Not recommended for direct use except for established pipelines.
  • verbose (Boolean, default = True) – Information grid is not printed when verbose is set to False.
  • profile (bool, default = False) – If set to true, a data profile for Exploratory Data Analysis will be displayed in an interactive HTML report.
Returns:

  • info_grid – Information grid is printed.
  • environment – This function returns various outputs that are stored in variables as tuples. They are used by other functions in pycaret.
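
In addition to the minimal example above, preprocessing options can be combined in a single setup call. A minimal sketch (the parameter values shown are purely illustrative):

>>> experiment_name = setup(data = juice, target = 'Purchase',
...                         normalize = True, normalize_method = 'minmax',
...                         fix_imbalance = True, session_id = 123)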

pycaret.classification.compare_models(exclude=None, include=None, fold=10, round=4, sort='Accuracy', n_select=1, budget_time=0, turbo=True, verbose=True)

This function trains all the models available in the model library and scores them using Stratified Cross Validation. The output prints a score grid with Accuracy, AUC, Recall, Precision, F1, Kappa and MCC (averaged across folds), determined by the fold parameter.

This function returns the best model based on the metric defined in the sort parameter.

To select the top N models, use the n_select parameter, which is set to 1 by default. When n_select is greater than 1, a list of trained model objects is returned (see the last example below).

When turbo is set to True, ‘rbfsvm’, ‘gpc’ and ‘mlp’ are excluded due to their longer training times. By default, the turbo param is set to True.

Example

>>> from pycaret.datasets import get_data
>>> juice = get_data('juice')
>>> experiment_name = setup(data = juice,  target = 'Purchase')
>>> best_model = compare_models()

This will return the averaged score grid of all the models except ‘rbfsvm’, ‘gpc’ and ‘mlp’. When turbo param is set to False, all models including ‘rbfsvm’, ‘gpc’ and ‘mlp’ are used but this may result in longer training time.

>>> best_model = compare_models( exclude = [ 'knn', 'gbc' ] , turbo = False)

This will return a comparison of all models except K Nearest Neighbour and Gradient Boosting Classifier.

>>> best_model = compare_models( exclude = [ 'knn', 'gbc' ] , turbo = True)

This will return a comparison of all models except K Nearest Neighbour, Gradient Boosting Classifier, SVM (RBF), Gaussian Process Classifier and Multi Layer Perceptron.
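
To return the top N models instead of a single best model, the n_select and sort params can be combined. A minimal sketch (the values shown are illustrative):

>>> top3 = compare_models(n_select = 3, sort = 'AUC')

This will return a list of the three best models ranked by AUC.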

Parameters:
  • exclude (list of strings, default = None) – In order to omit certain models from the comparison, model IDs can be passed as a list of strings in the exclude param.
  • include (list of strings, default = None) – In order to run only certain models for the comparison, the model IDs can be passed as a list of strings in the include param.
  • fold (integer, default = 10) – Number of folds to be used in Kfold CV. Must be at least 2.
  • round (integer, default = 4) – Number of decimal places the metrics in the score grid will be rounded to.
  • sort (string, default = 'Accuracy') – The scoring measure specified is used for sorting the average score grid. Other options are ‘AUC’, ‘Recall’, ‘Precision’, ‘F1’, ‘Kappa’ and ‘MCC’.
  • n_select (int, default = 1) – Number of top_n models to return. Use a negative argument for bottom selection; for example, n_select = -3 means the bottom 3 models.
  • budget_time (int or float, default = 0) – If set above 0, will terminate execution of the function after budget_time minutes have passed and return results up to that point.
  • turbo (Boolean, default = True) – When turbo is set to True, it excludes estimators that have longer training time.
  • verbose (Boolean, default = True) – Score grid is not printed when verbose is set to False.
Returns:

A table containing the scores of the model across the kfolds. Scoring metrics used are Accuracy, AUC, Recall, Precision, F1, Kappa and MCC. Mean and standard deviation of the scores across the folds are also returned.

Return type:

score_grid

Warning

  • compare_models(), though attractive, might be time consuming with large datasets. By default, turbo is set to True, which excludes models that have longer training times. Changing the turbo parameter to False may result in very high training times with datasets where the number of samples exceeds 10,000.
  • If target variable is multiclass (more than 2 classes), AUC will be returned as zero (0.0)
pycaret.classification.create_model(estimator=None, ensemble=False, method=None, fold=10, round=4, cross_validation=True, verbose=True, system=True, **kwargs)

This function creates a model and scores it using Stratified Cross Validation. The output prints a score grid that shows Accuracy, AUC, Recall, Precision, F1, Kappa and MCC by fold (default = 10 Fold).

This function returns a trained model object.

setup() function must be called before using create_model()

Example

>>> from pycaret.datasets import get_data
>>> juice = get_data('juice')
>>> experiment_name = setup(data = juice,  target = 'Purchase')
>>> lr = create_model('lr')

This will create a trained Logistic Regression model.
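
Additional keyword arguments are passed through to the underlying estimator. A minimal sketch (fold and the pass-through max_depth value are illustrative; max_depth is assumed to be a valid parameter of the underlying scikit-learn Random Forest estimator):

>>> rf = create_model('rf', fold = 5, max_depth = 10)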

Parameters:
  • estimator (string / object, default = None) –

    Enter the ID of an estimator available in the model library or pass an untrained model object consistent with the fit / predict API to train and evaluate the model. All estimators support binary or multiclass problems. List of estimators in the model library (ID - Name):

    • ’lr’ - Logistic Regression
    • ’knn’ - K Nearest Neighbour
    • ’nb’ - Naive Bayes
    • ’dt’ - Decision Tree Classifier
    • ’svm’ - SVM - Linear Kernel
    • ’rbfsvm’ - SVM - Radial Kernel
    • ’gpc’ - Gaussian Process Classifier
    • ’mlp’ - Multi Layer Perceptron
    • ’ridge’ - Ridge Classifier
    • ’rf’ - Random Forest Classifier
    • ’qda’ - Quadratic Discriminant Analysis
    • ’ada’ - Ada Boost Classifier
    • ’gbc’ - Gradient Boosting Classifier
    • ’lda’ - Linear Discriminant Analysis
    • ’et’ - Extra Trees Classifier
    • ’xgboost’ - Extreme Gradient Boosting
    • ’lightgbm’ - Light Gradient Boosting
    • ’catboost’ - CatBoost Classifier
  • ensemble (Boolean, default = False) – When set to True, an ensemble of the estimator is created using the method defined in the method parameter.
  • method (String, 'Bagging' or 'Boosting', default = None.) – method must be defined when ensemble is set to True. Default method is set to None.
  • fold (integer, default = 10) – Number of folds to be used in Kfold CV. Must be at least 2.
  • round (integer, default = 4) – Number of decimal places the metrics in the score grid will be rounded to.
  • cross_validation (bool, default = True) – When cross_validation is set to False, the fold parameter is ignored and the model is trained on the entire training dataset. No metric evaluation is returned.
  • verbose (Boolean, default = True) – Score grid is not printed when verbose is set to False.
  • system (Boolean, default = True) – Must remain True at all times. Only to be changed by internal functions.
  • **kwargs – Additional keyword arguments to pass to the estimator.
Returns:

  • score_grid – A table containing the scores of the model across the kfolds. Scoring metrics used are Accuracy, AUC, Recall, Precision, F1, Kappa and MCC. Mean and standard deviation of the scores across the folds are highlighted in yellow.
  • model – trained model object

Warning

  • ‘svm’ and ‘ridge’ don’t support the predict_proba method. As such, AUC will be returned as zero (0.0)
  • If target variable is multiclass (more than 2 classes), AUC will be returned as zero (0.0)
  • ‘rbfsvm’ and ‘gpc’ use non-linear kernels and hence their fit time complexity is more than quadratic. These estimators are hard to scale on datasets with more than 10,000 samples.
pycaret.classification.tune_model(estimator=None, fold=10, round=4, n_iter=10, custom_grid=None, optimize='Accuracy', custom_scorer=None, choose_better=False, verbose=True)

This function tunes the hyperparameters of a model and scores it using Stratified Cross Validation. The output prints a score grid that shows Accuracy, AUC, Recall, Precision, F1, Kappa and MCC by fold (default = 10 folds).

This function returns a trained model object.

Example

>>> from pycaret.datasets import get_data
>>> juice = get_data('juice')
>>> experiment_name = setup(data = juice,  target = 'Purchase')
>>> xgboost = create_model('xgboost')
>>> tuned_xgboost = tune_model(xgboost)

This will tune the hyperparameters of Extreme Gradient Boosting Classifier.
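
A custom search grid can also be supplied through the custom_grid param. A minimal sketch (the grid values are illustrative):

>>> params = {'max_depth': [2, 4, 6, 8, 10]}
>>> dt = create_model('dt')
>>> tuned_dt = tune_model(dt, custom_grid = params)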

Parameters:
  • estimator (object, default = None) –
  • fold (integer, default = 10) – Number of folds to be used in Kfold CV. Must be at least 2.
  • round (integer, default = 4) – Number of decimal places the metrics in the score grid will be rounded to.
  • n_iter (integer, default = 10) – Number of iterations within the Random Grid Search. For every iteration, the model randomly selects one value from the pre-defined grid of hyperparameters.
  • custom_grid (dictionary, default = None) – To use custom hyperparameters for tuning, pass a dictionary with parameter names and the values to be iterated over. When set to None, the pre-defined tuning grid is used.
  • optimize (string, default = 'Accuracy') – Measure used to select the best model through hyperparameter tuning. The default scoring measure is ‘Accuracy’. Other measures include ‘AUC’, ‘Recall’, ‘Precision’, ‘F1’.
  • custom_scorer (object, default = None) – custom_scorer can be passed to tune the hyperparameters of the model. It must be created using sklearn.make_scorer.
  • choose_better (Boolean, default = False) – When set to True, the base estimator is returned when the performance doesn’t improve with tune_model. This guarantees the returned object performs at least as well as the base estimator created using create_model or the model returned by compare_models.
  • verbose (Boolean, default = True) – Score grid is not printed when verbose is set to False.
Returns:

  • score_grid – A table containing the scores of the model across the kfolds. Scoring metrics used are Accuracy, AUC, Recall, Precision, F1, Kappa and MCC. Mean and standard deviation of the scores across the folds are also returned.
  • model – Trained and tuned model object.

Warning

  • If target variable is multiclass (more than 2 classes), optimize param ‘AUC’ is not acceptable.
  • If target variable is multiclass (more than 2 classes), AUC will be returned as zero (0.0)
pycaret.classification.ensemble_model(estimator, method='Bagging', fold=10, n_estimators=10, round=4, choose_better=False, optimize='Accuracy', verbose=True)

This function ensembles the trained base estimator using the method defined in ‘method’ param (default = ‘Bagging’). The output prints a score grid that shows Accuracy, AUC, Recall, Precision, F1, Kappa and MCC by fold (default = 10 Fold).

This function returns a trained model object.

Model must be created using create_model() or tune_model().

Example

>>> from pycaret.datasets import get_data
>>> juice = get_data('juice')
>>> experiment_name = setup(data = juice,  target = 'Purchase')
>>> dt = create_model('dt')
>>> ensembled_dt = ensemble_model(dt)

This will return an ensembled Decision Tree model using ‘Bagging’.
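
The ensembling method and number of base estimators can be set explicitly. A minimal sketch (the n_estimators value is illustrative):

>>> boosted_dt = ensemble_model(dt, method = 'Boosting', n_estimators = 50)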

Parameters:
  • estimator (object, default = None) –
  • method (String, default = 'Bagging') – The Bagging method will create an ensemble meta-estimator that fits base classifiers each on random subsets of the original dataset. The other available method is ‘Boosting’, which will create a meta-estimator by fitting a classifier on the original dataset and then fitting additional copies of the classifier on the same dataset, but where the weights of incorrectly classified instances are adjusted such that subsequent classifiers focus more on difficult cases.
  • fold (integer, default = 10) – Number of folds to be used in Kfold CV. Must be at least 2.
  • n_estimators (integer, default = 10) – The number of base estimators in the ensemble. In case of perfect fit, the learning procedure is stopped early.
  • round (integer, default = 4) – Number of decimal places the metrics in the score grid will be rounded to.
  • choose_better (Boolean, default = False) – When set to True, the base estimator is returned when the metric doesn’t improve with ensemble_model. This guarantees the returned object performs at least as well as the base estimator created using create_model or the model returned by compare_models.
  • optimize (string, default = 'Accuracy') – Only used when choose_better is set to True. The optimize parameter is used to compare the ensembled model with the base estimator. Values accepted in the optimize parameter are ‘Accuracy’, ‘AUC’, ‘Recall’, ‘Precision’, ‘F1’, ‘Kappa’, ‘MCC’.
  • verbose (Boolean, default = True) – Score grid is not printed when verbose is set to False.
Returns:

  • score_grid – A table containing the scores of the model across the kfolds. Scoring metrics used are Accuracy, AUC, Recall, Precision, F1, Kappa and MCC. Mean and standard deviation of the scores across the folds are also returned.
  • model – Trained ensembled model object.

Warning

  • If target variable is multiclass (more than 2 classes), AUC will be returned as zero (0.0).
pycaret.classification.blend_models(estimator_list='All', fold=10, round=4, choose_better=False, optimize='Accuracy', method='hard', turbo=True, verbose=True)

This function creates a Soft Voting / Majority Rule classifier for all the estimators in the model library (excluding the few when turbo is True) or for specific trained estimators passed as a list in estimator_list param. It scores it using Stratified Cross Validation. The output prints a score grid that shows Accuracy, AUC, Recall, Precision, F1, Kappa and MCC by fold (default CV = 10 Folds).

This function returns a trained model object.

Example

>>> from pycaret.datasets import get_data
>>> juice = get_data('juice')
>>> experiment_name = setup(data = juice,  target = 'Purchase')
>>> blend_all = blend_models()

This will create a VotingClassifier for all models in the model library except for ‘rbfsvm’, ‘gpc’ and ‘mlp’.

For specific models, you can use:

>>> lr = create_model('lr')
>>> rf = create_model('rf')
>>> knn = create_model('knn')
>>> blend_three = blend_models(estimator_list = [lr,rf,knn])

This will create a VotingClassifier of lr, rf and knn.
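
Soft voting can be requested through the method param, provided every estimator in the list supports predict_proba (see the Warning below). A minimal sketch:

>>> blend_soft = blend_models(estimator_list = [lr, rf, knn], method = 'soft')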

Parameters:
  • estimator_list (string ('All') or list of object, default = 'All') –
  • fold (integer, default = 10) – Number of folds to be used in Kfold CV. Must be at least 2.
  • round (integer, default = 4) – Number of decimal places the metrics in the score grid will be rounded to.
  • choose_better (Boolean, default = False) – When set to True, the base estimator is returned when the metric doesn’t improve with blend_models. This guarantees the returned object performs at least as well as the base estimator created using create_model or the model returned by compare_models.
  • optimize (string, default = 'Accuracy') – Only used when choose_better is set to True. The optimize parameter is used to compare the blended model with the base estimator. Values accepted in the optimize parameter are ‘Accuracy’, ‘AUC’, ‘Recall’, ‘Precision’, ‘F1’, ‘Kappa’, ‘MCC’.
  • method (string, default = 'hard') – ‘hard’ uses predicted class labels for majority rule voting. ‘soft’ predicts the class label based on the argmax of the sums of the predicted probabilities, which is recommended for an ensemble of well-calibrated classifiers.
  • turbo (Boolean, default = True) – When turbo is set to True, it excludes estimators that use a radial kernel.
  • verbose (Boolean, default = True) – Score grid is not printed when verbose is set to False.
Returns:

  • score_grid – A table containing the scores of the model across the kfolds. Scoring metrics used are Accuracy, AUC, Recall, Precision, F1, Kappa and MCC. Mean and standard deviation of the scores across the folds are also returned.
  • model – Trained Voting Classifier model object.

Warning

  • When passing estimator_list with method set to ‘soft’, all the models in the estimator_list must support the predict_proba function. ‘svm’ and ‘ridge’ don’t support predict_proba and hence an exception will be raised.
  • When estimator_list is set to ‘All’ and method is forced to ‘soft’, estimators that don’t support the predict_proba function will be dropped from the estimator list.
  • If target variable is multiclass (more than 2 classes), AUC will be returned as zero (0.0).
pycaret.classification.stack_models(estimator_list, meta_model=None, fold=10, round=4, method='auto', restack=True, choose_better=False, optimize='Accuracy', verbose=True)

This function trains a meta model and scores it using Stratified Cross Validation. The predictions from the base level models passed in the estimator_list param are used as input features for the meta model. The restack parameter controls the ability to expose raw features to the meta model when set to True (default = True).

The output prints the score grid that shows Accuracy, AUC, Recall, Precision, F1, Kappa and MCC by fold (default = 10 Folds).

This function returns a trained model object.

Example

>>> from pycaret.datasets import get_data
>>> juice = get_data('juice')
>>> experiment_name = setup(data = juice,  target = 'Purchase')
>>> dt = create_model('dt')
>>> rf = create_model('rf')
>>> ada = create_model('ada')
>>> ridge = create_model('ridge')
>>> knn = create_model('knn')
>>> stacked_models = stack_models(estimator_list=[dt,rf,ada,ridge,knn])

This will create a meta model that will use the predictions of all the models provided in estimator_list param. By default, the meta model is Logistic Regression but can be changed with meta_model param.
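
A different meta model can be supplied through the meta_model param. A minimal sketch using an xgboost meta model:

>>> xgboost = create_model('xgboost')
>>> stacked_xgb = stack_models(estimator_list = [dt, rf, ada, ridge, knn], meta_model = xgboost)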

Parameters:
  • estimator_list (list of objects) –
  • meta_model (object, default = None) – If set to None, Logistic Regression is used as a meta model.
  • fold (integer, default = 10) – Number of folds to be used in Kfold CV. Must be at least 2.
  • round (integer, default = 4) – Number of decimal places the metrics in the score grid will be rounded to.
  • method (string, default = 'auto') –
    • if ‘auto’, it will try to invoke, for each estimator, ‘predict_proba’, ‘decision_function’ or ‘predict’ in that order.
    • otherwise, one of ‘predict_proba’, ‘decision_function’ or ‘predict’. If the method is not implemented by the estimator, it will raise an error.

  • restack (Boolean, default = True) – When restack is set to True, raw data will be exposed to the meta model when making predictions; when False, only the predicted labels or probabilities are passed to the meta model when making final predictions.
  • choose_better (Boolean, default = False) – When set to True, the base estimator is returned when the metric doesn’t improve with stack_models. This guarantees the returned object performs at least as well as the base estimator created using create_model or the model returned by compare_models.
  • optimize (string, default = 'Accuracy') – Only used when choose_better is set to True. The optimize parameter is used to compare the stacked model with the base estimator. Values accepted in the optimize parameter are ‘Accuracy’, ‘AUC’, ‘Recall’, ‘Precision’, ‘F1’, ‘Kappa’, ‘MCC’.
  • verbose (Boolean, default = True) – Score grid is not printed when verbose is set to False.
Returns:

  • score_grid – A table containing the scores of the model across the kfolds. Scoring metrics used are Accuracy, AUC, Recall, Precision, F1, Kappa and MCC. Mean and standard deviation of the scores across the folds are also returned.
  • model – Trained model object.

Warning

  • If target variable is multiclass (more than 2 classes), AUC will be returned as zero (0.0).
pycaret.classification.plot_model(estimator, plot='auc', scale=1, save=False, verbose=True, system=True)

This function takes a trained model object and returns a plot based on the test / hold-out set. The process may require the model to be re-trained in certain cases. See list of plots supported below.

Model must be created using create_model() or tune_model().

Example

>>> from pycaret.datasets import get_data
>>> juice = get_data('juice')
>>> experiment_name = setup(data = juice,  target = 'Purchase')
>>> lr = create_model('lr')
>>> plot_model(lr)

This will return an AUC plot of a trained Logistic Regression model.
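
Other plots from the list below can be requested through the plot param and saved to disk with save. A minimal sketch:

>>> plot_model(lr, plot = 'confusion_matrix')
>>> plot_model(lr, plot = 'feature', save = True)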

Parameters:
  • estimator (object, default = none) – A trained model object should be passed as an estimator.
  • plot (string, default = auc) –

    Enter abbreviation of type of plot. The current list of plots supported are (Plot - Name):

    • ’auc’ - Area Under the Curve
    • ’threshold’ - Discrimination Threshold
    • ’pr’ - Precision Recall Curve
    • ’confusion_matrix’ - Confusion Matrix
    • ’error’ - Class Prediction Error
    • ’class_report’ - Classification Report
    • ’boundary’ - Decision Boundary
    • ’rfe’ - Recursive Feature Selection
    • ’learning’ - Learning Curve
    • ’manifold’ - Manifold Learning
    • ’calibration’ - Calibration Curve
    • ’vc’ - Validation Curve
    • ’dimension’ - Dimension Learning
    • ’feature’ - Feature Importance
    • ’parameter’ - Model Hyperparameter
    • ’lift’ - Lift Curve
    • ’gain’ - Gain Chart
  • scale (float, default = 1) – The resolution scale of the figure.
  • save (Boolean, default = False) – When set to True, Plot is saved as a ‘png’ file in current working directory.
  • verbose (Boolean, default = True) – Progress bar not shown when verbose set to False.
  • system (Boolean, default = True) – Must remain True all times. Only to be changed by internal functions.
Returns:

Prints the visual plot.

Return type:

Visual_Plot

Warning

  • ‘svm’ and ‘ridge’ don’t support the predict_proba method. As such, AUC and calibration plots are not available for these estimators.
  • When the ‘max_features’ parameter of a trained model object is not equal to the number of samples in the training set, the ‘rfe’ plot is not available.
  • ‘calibration’, ‘threshold’, ‘manifold’ and ‘rfe’ plots are not available for multiclass problems.
pycaret.classification.evaluate_model(estimator)

This function displays a user interface for all of the available plots for a given estimator. It internally uses the plot_model() function.

Example

>>> from pycaret.datasets import get_data
>>> juice = get_data('juice')
>>> experiment_name = setup(data = juice,  target = 'Purchase')
>>> lr = create_model('lr')
>>> evaluate_model(lr)

This will display the User Interface for all of the plots for a given estimator.

Parameters:estimator (object, default = none) – A trained model object should be passed as an estimator.
Returns:Displays the user interface for plotting.
Return type:User_Interface
pycaret.classification.interpret_model(estimator, plot='summary', feature=None, observation=None, **kwargs)

This function takes a trained model object and returns an interpretation plot based on the test / hold-out set. It only supports tree based algorithms.

This function is implemented based on SHAP (SHapley Additive exPlanations), a unified approach to explain the output of any machine learning model. SHAP connects game theory with local explanations.

For more information : https://shap.readthedocs.io/en/latest/

Example

>>> from pycaret.datasets import get_data
>>> juice = get_data('juice')
>>> experiment_name = setup(data = juice,  target = 'Purchase')
>>> dt = create_model('dt')
>>> interpret_model(dt)

This will return a summary interpretation plot of the Decision Tree model.
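
The ‘reason’ plot can be used for sample-level interpretation by passing an observation index from the test / hold-out set. A minimal sketch (the observation index is illustrative):

>>> interpret_model(dt, plot = 'reason', observation = 10)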

Parameters:
  • estimator (object, default = none) – A trained tree based model object should be passed as an estimator.
  • plot (string, default = 'summary') – Other available options are ‘correlation’ and ‘reason’.
  • feature (string, default = None) – This parameter is only needed when plot = ‘correlation’. By default feature is set to None which means the first column of the dataset will be used as a variable. A feature parameter must be passed to change this.
  • observation (integer, default = None) – This parameter only comes into effect when plot is set to ‘reason’. If no observation number is provided, it will return an analysis of all observations with the option to select the feature on x and y axes through drop down interactivity. For analysis at the sample level, an observation parameter must be passed with the index value of the observation in test / hold-out set.
  • **kwargs – Additional keyword arguments to pass to the plot.
Returns:

Returns the visual plot. Returns the interactive JS plot when plot = ‘reason’.

Return type:

Visual_Plot

Warning

  • interpret_model doesn’t support multiclass problems.
pycaret.classification.calibrate_model(estimator, method='sigmoid', fold=10, round=4, verbose=True)

This function takes a trained estimator as input and performs probability calibration with sigmoid or isotonic regression. The output prints a score grid that shows Accuracy, AUC, Recall, Precision, F1, Kappa and MCC by fold (default = 10 Fold). The output of the original estimator and the calibrated estimator (created using this function) might not differ much. In order to see the calibration differences, use the ‘calibration’ plot in plot_model to see the difference before and after.

This function returns a trained model object.

Example

>>> from pycaret.datasets import get_data
>>> juice = get_data('juice')
>>> experiment_name = setup(data = juice,  target = 'Purchase')
>>> dt_boosted = create_model('dt', ensemble = True, method = 'Boosting')
>>> calibrated_dt = calibrate_model(dt_boosted)

This will return Calibrated Boosted Decision Tree Model.

Parameters:
  • estimator (object) –
  • method (string, default = 'sigmoid') – The method to use for calibration. Can be ‘sigmoid’, which corresponds to Platt’s method, or ‘isotonic’, which is a non-parametric approach. It is not advised to use isotonic calibration with too few calibration samples.
  • fold (integer, default = 10) – Number of folds to be used in Kfold CV. Must be at least 2.
  • round (integer, default = 4) – Number of decimal places the metrics in the score grid will be rounded to.
  • verbose (Boolean, default = True) – Score grid is not printed when verbose is set to False.
Returns:

  • score_grid – A table containing the scores of the model across the kfolds. Scoring metrics used are Accuracy, AUC, Recall, Precision, F1, Kappa and MCC. Mean and standard deviation of the scores across the folds are also returned.
  • model – trained and calibrated model object.

Warning

  • Avoid isotonic calibration with too few calibration samples (<1000) since it tends to overfit.
  • calibration plot not available for multiclass problems.
pycaret.classification.optimize_threshold(estimator, true_positive=0, true_negative=0, false_positive=0, false_negative=0)

This function optimizes the probability threshold for a trained model using a custom cost function that can be defined using a combination of True Positives, True Negatives, False Positives (also known as Type I error), and False Negatives (Type II error).

This function returns a plot of optimized cost as a function of probability threshold between 0 and 100.

Example

>>> from pycaret.datasets import get_data
>>> juice = get_data('juice')
>>> experiment_name = setup(data = juice,  target = 'Purchase')
>>> lr = create_model('lr')
>>> optimize_threshold(lr, true_negative = 10, false_negative = -100)

This will return a plot of optimized cost as a function of probability threshold.

Parameters:
  • estimator (object) – A trained model object should be passed as an estimator.
  • true_positive (int, default = 0) – Cost function or returns when prediction is true positive.
  • true_negative (int, default = 0) – Cost function or returns when prediction is true negative.
  • false_positive (int, default = 0) – Cost function or returns when prediction is false positive.
  • false_negative (int, default = 0) – Cost function or returns when prediction is false negative.
Returns:

Prints the visual plot.

Return type:

Visual_Plot

Warning

  • This function is not supported for multiclass problems.
pycaret.classification.predict_model(estimator, data=None, probability_threshold=None, encoded_labels=False, verbose=True)

This function is used to predict the label and probability score on a new dataset using a trained estimator. New unseen data can be passed to the data param as a pandas DataFrame. If data is not passed, the test / hold-out set separated at the time of setup() is used to generate predictions.

Example

>>> from pycaret.datasets import get_data
>>> juice = get_data('juice')
>>> experiment_name = setup(data = juice,  target = 'Purchase')
>>> lr = create_model('lr')
>>> lr_predictions_holdout = predict_model(lr)
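
To score new unseen data, a DataFrame with the same features used during training can be passed to the data param. A minimal sketch (new_data is a hypothetical pandas.DataFrame and the probability_threshold value is illustrative):

>>> new_predictions = predict_model(lr, data = new_data, probability_threshold = 0.3)
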
Parameters:
  • estimator (object, default = none) – A trained model object / pipeline should be passed as an estimator.
  • data (pandas.DataFrame) – Shape (n_samples, n_features) where n_samples is the number of samples and n_features is the number of features. All features used during training must be present in the new dataset.
  • probability_threshold (float, default = None) – Threshold used to convert probability values into binary outcome. By default the probability threshold for all binary classifiers is 0.5 (50%). This can be changed using probability_threshold param.
  • encoded_labels (Boolean, default = False) – If True, will return labels encoded as an integer.
  • verbose (Boolean, default = True) – Holdout score grid is not printed when verbose is set to False.
Returns:

  • Predictions – Predictions (Label and Score) column attached to the original dataset and returned as pandas dataframe.
  • score_grid – A table containing the scoring metrics on hold-out / test set.

Warning

  • The behavior of predict_model changed in version 2.1 without backward compatibility. As such, pipelines trained using version <= 2.0 may not work for inference with version >= 2.1. You can either retrain your models with a newer version or downgrade the version for inference.

pycaret.classification.finalize_model(estimator)

This function fits the estimator onto the complete dataset passed during the setup() stage. The purpose of this function is to prepare for final model deployment after experimentation.

Example

>>> from pycaret.datasets import get_data
>>> juice = get_data('juice')
>>> experiment_name = setup(data = juice,  target = 'Purchase')
>>> lr = create_model('lr')
>>> final_lr = finalize_model(lr)

This will return the final model object fitted to complete dataset.

Parameters:estimator (object, default = none) – A trained model object should be passed as an estimator.
Returns:Trained model object fitted on complete dataset.
Return type:model

Warning

  • If the model returned by finalize_model() is used on predict_model() without passing a new unseen dataset, then the information grid printed is misleading as the model is trained on the complete dataset including the test / hold-out sample. Once finalize_model() is used, the model is considered ready for deployment and should be used on new unseen datasets only.
pycaret.classification.deploy_model(model, model_name, platform, authentication)

This function deploys the transformation pipeline and trained model object for production use. The platform of deployment can be defined under the platform param along with the applicable authentication tokens which are passed as a dictionary to the authentication param.

Before deploying a model to AWS S3 (‘aws’), environment variables must be configured using the command line interface. To configure AWS environment variables, run aws configure on the command line. The following information is required and can be generated using the Identity and Access Management (IAM) portal of your AWS console account:

  • AWS Access Key ID
  • AWS Secret Access Key
  • Default Region Name (can be seen under Global settings on your AWS console)
  • Default output format (must be left blank)
>>> from pycaret.datasets import get_data
>>> juice = get_data('juice')
>>> experiment_name = setup(data = juice,  target = 'Purchase')
>>> lr = create_model('lr')
>>> deploy_model(model = lr, model_name = 'deploy_lr', platform = 'aws', authentication = {'bucket' : 'bucket-name'})

Before deploying a model to Google Cloud Platform (GCP), a project must be created using either the command line or the GCP console. Once the project is created, you must create a service account and download the service account key as a JSON file, which is then used to set the environment variable.

Learn more : https://cloud.google.com/docs/authentication/production

>>> from pycaret.datasets import get_data
>>> juice = get_data('juice')
>>> experiment_name = setup(data = juice,  target = 'Purchase')
>>> lr = create_model('lr')
>>> import os
>>> os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = 'c:/path-to-json-file.json'
>>> deploy_model(model = lr, model_name = 'deploy_lr', platform = 'gcp', authentication = {'project' : 'project-name', 'bucket' : 'bucket-name'})

Before deploying a model to Microsoft Azure, the environment variable for the connection string must be set. The connection string can be obtained from the ‘Access Keys’ section of your storage account in Azure.

>>> from pycaret.datasets import get_data
>>> juice = get_data('juice')
>>> experiment_name = setup(data = juice,  target = 'Purchase')
>>> lr = create_model('lr')
>>> import os
>>> os.environ['AZURE_STORAGE_CONNECTION_STRING'] = 'connection-string-here'
>>> deploy_model(model = lr, model_name = 'deploy_lr', platform = 'azure', authentication = {'container' : 'container-name'})
Parameters:
  • model (object) – A trained model object should be passed as an estimator.
  • model_name (string) – Name of model to be passed as a string.
  • platform (string) – Name of platform for deployment. Currently accepts: ‘aws’, ‘gcp’, ‘azure’
  • authentication (dict) –

    Dictionary of applicable authentication tokens.

    When platform = ‘aws’: {‘bucket’ : ‘name of bucket’}

    When platform = ‘gcp’: {‘project’: ‘name of project’, ‘bucket’ : ‘name of bucket’}

    When platform = ‘azure’: {‘container’: ‘name of container’}

Returns:

Return type:

Success_Message

Warning

  • This function uses file storage services to deploy the model on a cloud platform. As such, this is efficient for batch use. Where the production objective is to obtain predictions at an instance level, this may not be an efficient choice as it transmits the binary pickle file between your local python environment and the platform.
pycaret.classification.create_webservice(model, model_endopoint, api_key=True, pydantic_payload=None)

(In Preview)

This function deploys the transformation pipeline and trained model object as a REST API based on FastAPI that can run on localhost. It uses the model name as the path of the POST endpoint. The endpoint can be protected by an API key generated by pycaret and returned to the user.

create_webservice uses pydantic style input/output models.

Parameters:
  • model (object) – A trained model object should be passed as an estimator.
  • model_endopoint (string) – Name of model to be passed as a string.
  • api_key (bool, default = True) – Security for the API. If True, pycaret generates an API key and prints it in the console; otherwise the user can post data without a header, which is not safe if the application is exposed externally.
  • pydantic_payload (pydantic.main.ModelMetaclass, default = None) – Pycaret can automatically generate a schema for the input model, which helps prevent incorrect requests. The user can also provide their own pydantic model to be used as the input model.
Returns:

A dictionary with the API key as the key and a ready-to-run FastAPI application class as the value. If api_key is set to False, the dictionary key is set to ‘Not_exist’.

pycaret.classification.save_model(model, model_name, model_only=False, verbose=True)

This function saves the transformation pipeline and trained model object into the current active directory as a pickle file for later use.

Example

>>> from pycaret.datasets import get_data
>>> juice = get_data('juice')
>>> experiment_name = setup(data = juice,  target = 'Purchase')
>>> lr = create_model('lr')
>>> save_model(lr, 'lr_model_23122019')

This will save the transformation pipeline and model as a binary pickle file in the current active directory.

Parameters:
  • model (object, default = none) – A trained model object should be passed as an estimator.
  • model_name (string, default = none) – Name of pickle file to be passed as a string.
  • model_only (bool, default = False) – When set to True, only trained model object is saved and all the transformations are ignored.
  • verbose (Boolean, default = True) – Success message is not printed when verbose is set to False.
Returns:

Return type:

Success_Message

pycaret.classification.load_model(model_name, platform=None, authentication=None, verbose=True)

This function loads a previously saved transformation pipeline and model from the current active directory into the current python environment. Load object must be a pickle file.

Example

>>> saved_lr = load_model('lr_model_23122019')

This will load the previously saved model in saved_lr variable. The file must be in the current directory.
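
To load a model previously deployed to a cloud platform, pass the same platform and authentication values used in deploy_model (the bucket name below is a placeholder):

>>> saved_lr = load_model('deploy_lr', platform = 'aws', authentication = {'bucket' : 'bucket-name'})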

Parameters:
  • model_name (string, default = none) – Name of pickle file to be passed as a string.
  • platform (string, default = None) – Name of platform, if loading model from cloud. Currently available options are: ‘aws’, ‘gcp’, ‘azure’.
  • authentication (dict) –

    dictionary of applicable authentication tokens.

    When platform = ‘aws’: {‘bucket’ : ‘name of bucket’}

    When platform = ‘gcp’: {‘project’: ‘name of project’, ‘bucket’ : ‘name of bucket’}

    When platform = ‘azure’: {‘container’: ‘name of container’}

  • verbose (Boolean, default = True) – Success message is not printed when verbose is set to False.
Returns:

Return type:

Model Object

pycaret.classification.automl(optimize='Accuracy', use_holdout=False)

This function returns the best model out of all models created in current active environment based on metric defined in optimize parameter.

Parameters:
  • optimize (string, default = 'Accuracy') – Other values you can pass in optimize param are ‘AUC’, ‘Recall’, ‘Precision’, ‘F1’, ‘Kappa’, and ‘MCC’.
  • use_holdout (bool, default = False) – When set to True, metrics are evaluated on holdout set instead of CV.
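
Example (illustrative; assumes one or more models have already been created in the current experiment):

>>> best_model = automl(optimize = 'AUC', use_holdout = True)

This will return the model with the best hold-out AUC among all models created in the current session.
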
pycaret.classification.pull()

Returns latest displayed table.

Returns:Equivalent to get_config(‘display_container’)[-1]
Return type:pandas.DataFrame
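
For example, after any function that displays a score grid (illustrative sketch):

>>> dt = create_model('dt')
>>> dt_results = pull()

dt_results now holds the cross-validated score grid displayed by create_model as a pandas.DataFrame.
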
pycaret.classification.models(type=None)

Returns table of models available in model library.

Example

>>> all_models = models()

This will return pandas dataframe with all available models and their metadata.
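
To filter by model family using the type parameter described below, for example:

>>> tree_models = models(type = 'tree')

This will return only tree based models and their metadata.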

Parameters:type (string, default = None) –
  • linear : filters and only return linear models
  • tree : filters and only return tree based models
  • ensemble : filters and only return ensemble models
Returns:
Return type:pandas.DataFrame
pycaret.classification.get_logs(experiment_name=None, save=False)

Returns a table of experiment logs consisting of run details, parameters, metrics and tags.

Example

>>> logs = get_logs()

This will return pandas dataframe.

Parameters:
  • experiment_name (string, default = None) – When set to None current active run is used.
  • save (bool, default = False) – When set to True, csv file is saved in current directory.
Returns:

Return type:

pandas.DataFrame

pycaret.classification.get_config(variable)

This function is used to access global environment variables. Following variables can be accessed:

  • X: Transformed dataset (X)
  • y: Transformed dataset (y)
  • X_train: Transformed train dataset (X)
  • X_test: Transformed test/holdout dataset (X)
  • y_train: Transformed train dataset (y)
  • y_test: Transformed test/holdout dataset (y)
  • seed: random state set through session_id
  • prep_pipe: Transformation pipeline configured through setup
  • folds_shuffle_param: shuffle parameter used in Kfolds
  • n_jobs_param: n_jobs parameter used in model training
  • html_param: html_param configured through setup
  • create_model_container: results grid storage container
  • master_model_container: model storage container
  • display_container: results display container
  • exp_name_log: Name of experiment set through setup
  • logging_param: log_experiment param set through setup
  • log_plots_param: log_plots param set through setup
  • USI: Unique session ID parameter set through setup
  • fix_imbalance_param: fix_imbalance param set through setup
  • fix_imbalance_method_param: fix_imbalance_method param set through setup
  • data_before_preprocess: data before preprocessing
  • target_param: name of target variable
  • gpu_param: use_gpu param configured through setup

Example

>>> X_train = get_config('X_train')

This will return X_train transformed dataset.

Returns:
Return type:variable
pycaret.classification.set_config(variable, value)

This function is used to reset global environment variables. Following variables can be accessed:

  • X: Transformed dataset (X)
  • y: Transformed dataset (y)
  • X_train: Transformed train dataset (X)
  • X_test: Transformed test/holdout dataset (X)
  • y_train: Transformed train dataset (y)
  • y_test: Transformed test/holdout dataset (y)
  • seed: random state set through session_id
  • prep_pipe: Transformation pipeline configured through setup
  • folds_shuffle_param: shuffle parameter used in Kfolds
  • n_jobs_param: n_jobs parameter used in model training
  • html_param: html_param configured through setup
  • create_model_container: results grid storage container
  • master_model_container: model storage container
  • display_container: results display container
  • exp_name_log: Name of experiment set through setup
  • logging_param: log_experiment param set through setup
  • log_plots_param: log_plots param set through setup
  • USI: Unique session ID parameter set through setup
  • fix_imbalance_param: fix_imbalance param set through setup
  • fix_imbalance_method_param: fix_imbalance_method param set through setup
  • data_before_preprocess: data before preprocessing
  • target_param: name of target variable
  • gpu_param: use_gpu param configured through setup

Example

>>> set_config('seed', 123)

This will set the global seed to ‘123’.

pycaret.classification.get_system_logs()

Read and print ‘logs.log’ file from current active directory

Regression

pycaret.regression.setup(data, target, train_size=0.7, sampling=True, sample_estimator=None, categorical_features=None, categorical_imputation='constant', ordinal_features=None, high_cardinality_features=None, high_cardinality_method='frequency', numeric_features=None, numeric_imputation='mean', date_features=None, ignore_features=None, normalize=False, normalize_method='zscore', transformation=False, transformation_method='yeo-johnson', handle_unknown_categorical=True, unknown_categorical_method='least_frequent', pca=False, pca_method='linear', pca_components=None, ignore_low_variance=False, combine_rare_levels=False, rare_level_threshold=0.1, bin_numeric_features=None, remove_outliers=False, outliers_threshold=0.05, remove_multicollinearity=False, multicollinearity_threshold=0.9, remove_perfect_collinearity=False, create_clusters=False, cluster_iter=20, polynomial_features=False, polynomial_degree=2, trigonometry_features=False, polynomial_threshold=0.1, group_features=None, group_names=None, feature_selection=False, feature_selection_threshold=0.8, feature_selection_method='classic', feature_interaction=False, feature_ratio=False, interaction_threshold=0.01, transform_target=False, transform_target_method='box-cox', data_split_shuffle=True, folds_shuffle=False, n_jobs=-1, use_gpu=False, html=True, session_id=None, log_experiment=False, experiment_name=None, log_plots=False, log_profile=False, log_data=False, silent=False, verbose=True, profile=False)

This function initializes the environment in pycaret and creates the transformation pipeline to prepare the data for modeling and deployment. setup() must be called before executing any other function in pycaret. It takes two mandatory parameters: data (a pandas.DataFrame) and the name of the target column.

All other parameters are optional.

Example

>>> from pycaret.datasets import get_data
>>> boston = get_data('boston')
>>> experiment_name = setup(data = boston,  target = 'medv')

‘boston’ is a pandas.DataFrame and ‘medv’ is the name of target column.

Parameters:
  • data (pandas.DataFrame) – Shape (n_samples, n_features) where n_samples is the number of samples and n_features is the number of features.
  • target (string) – Name of target column to be passed in as string.
  • train_size (float, default = 0.7) – Size of the training set. By default, 70% of the data will be used for training and validation. The remaining data will be used for test / hold-out set.
  • sampling (bool, default = True) – When the sample size exceeds 25,000 samples, pycaret will build a base estimator at various sample sizes from the original dataset. This will return a performance plot of R2 values at various sample levels, that will assist in deciding the preferred sample size for modeling. The desired sample size must then be entered for training and validation in the pycaret environment. When sample_size entered is less than 1, the remaining dataset (1 - sample) is used for fitting the model only when finalize_model() is called.
  • sample_estimator (object, default = None) – If None, Linear Regression is used by default.
  • categorical_features (string, default = None) – If the inferred data types are not correct, categorical_features can be used to overwrite the inferred type. If when running setup the type of ‘column1’ is inferred as numeric instead of categorical, then this parameter can be used to overwrite the type by passing categorical_features = [‘column1’].
  • categorical_imputation (string, default = 'constant') – If missing values are found in categorical features, they will be imputed with a constant ‘not_available’ value. The other available option is ‘mode’ which imputes the missing value using most frequent value in the training dataset.
  • ordinal_features (dictionary, default = None) – When the data contains ordinal features, they must be encoded differently using the ordinal_features param. If the data has a categorical variable with values of ‘low’, ‘medium’, ‘high’ and it is known that low < medium < high, then it can be passed as ordinal_features = { ‘column_name’ : [‘low’, ‘medium’, ‘high’] }. The list sequence must be in increasing order from lowest to highest.
  • high_cardinality_features (string, default = None) – When the data contains features with high cardinality, they can be compressed into fewer levels by passing them as a list of column names with high cardinality. Features are compressed using the method defined in the high_cardinality_method param.
  • high_cardinality_method (string, default = 'frequency') – When method set to ‘frequency’ it will replace the original value of feature with the frequency distribution and convert the feature into numeric. Other available method is ‘clustering’ which performs the clustering on statistical attribute of data and replaces the original value of feature with cluster label. The number of clusters is determined using a combination of Calinski-Harabasz and Silhouette criterion.
  • numeric_features (string, default = None) – If the inferred data types are not correct, numeric_features can be used to overwrite the inferred type. If when running setup the type of ‘column1’ is inferred as a categorical instead of numeric, then this parameter can be used to overwrite by passing numeric_features = [‘column1’].
  • numeric_imputation (string, default = 'mean') – If missing values are found in numeric features, they will be imputed with the mean value of the feature. The other available options are ‘median’ which imputes the value using the median value in the training dataset and ‘zero’ which replaces missing values with zeroes.
  • date_features (string, default = None) – If the data has a DateTime column that is not automatically detected when running setup, this parameter can be used by passing date_features = ‘date_column_name’. It can work with multiple date columns. Date columns are not used in modeling. Instead, feature extraction is performed and date columns are dropped from the dataset. If the date column includes a time stamp, features related to time will also be extracted.
  • ignore_features (string, default = None) – If any feature should be ignored for modeling, it can be passed to the param ignore_features. The ID and DateTime columns when inferred, are automatically set to ignore for modeling.
  • normalize (bool, default = False) – When set to True, the feature space is transformed using the normalize_method param. Generally, linear algorithms perform better with normalized data; however, the results may vary and it is advised to run multiple experiments to evaluate the benefit of normalization.
  • normalize_method (string, default = 'zscore') –

    Defines the method to be used for normalization. By default, normalize method is set to ‘zscore’. The standard zscore is calculated as z = (x - u) / s. The other available options are:

    ’minmax’ : scales and translates each feature individually such that it is in the range of 0 - 1.

    ’maxabs’ : scales and translates each feature individually such that the maximal absolute value of each feature will be 1.0. It does not shift/center the data, and thus does not destroy any sparsity.

    ’robust’ : scales and translates each feature according to the Interquartile range. When the dataset contains outliers, robust scaler often gives better results.
  • transformation (bool, default = False) – When set to True, a power transformation is applied to make the data more normal / Gaussian-like. This is useful for modeling issues related to heteroscedasticity or other situations where normality is desired. The optimal parameter for stabilizing variance and minimizing skewness is estimated through maximum likelihood.
  • transformation_method (string, default = 'yeo-johnson') – Defines the method for transformation. By default, the transformation method is set to ‘yeo-johnson’. The other available option is ‘quantile’ transformation. Both transformations transform the feature set to follow a Gaussian-like or normal distribution. Note that the quantile transformer is non-linear and may distort linear correlations between variables measured at the same scale.
  • handle_unknown_categorical (bool, default = True) – When set to True, unknown categorical levels in new / unseen data are replaced by the most or least frequent level as learned in the training data. The method is defined under the unknown_categorical_method param.
  • unknown_categorical_method (string, default = 'least_frequent') – Method used to replace unknown categorical levels in unseen data. Method can be set to ‘least_frequent’ or ‘most_frequent’.
  • pca (bool, default = False) – When set to True, dimensionality reduction is applied to project the data into a lower dimensional space using the method defined in pca_method param. In supervised learning pca is generally performed when dealing with high feature space and memory is a constraint. Note that not all datasets can be decomposed efficiently using a linear PCA technique and that applying PCA may result in loss of information. As such, it is advised to run multiple experiments with different pca_methods to evaluate the impact.
  • pca_method (string, default = 'linear') –

    The ‘linear’ method performs Linear dimensionality reduction using Singular Value Decomposition. The other available options are:

    kernel : dimensionality reduction through the use of RBF kernel.

    incremental : replacement for ‘linear’ pca when the dataset to be decomposed is too large to fit in memory
  • pca_components (int/float, default = 0.99) – Number of components to keep. if pca_components is a float, it is treated as a target percentage for information retention. When pca_components is an integer it is treated as the number of features to be kept. pca_components must be strictly less than the original number of features in the dataset.
  • ignore_low_variance (bool, default = False) – When set to True, all categorical features with statistically insignificant variances are removed from the dataset. The variance is calculated using the ratio of unique values to the number of samples, and the ratio of the most common value to the frequency of the second most common value.
  • combine_rare_levels (bool, default = False) – When set to True, all levels in categorical features below the threshold defined in the rare_level_threshold param are combined together as a single level. There must be at least two levels under the threshold for this to take effect. rare_level_threshold represents the percentile distribution of level frequency. Generally, this technique is applied to limit a sparse matrix caused by high numbers of levels in categorical features.
  • rare_level_threshold (float, default = 0.1) – Percentile distribution below which rare categories are combined. Only comes into effect when combine_rare_levels is set to True.
  • bin_numeric_features (list, default = None) – When a list of numeric features is passed they are transformed into categorical features using KMeans, where values in each bin have the same nearest center of a 1D k-means cluster. The number of clusters are determined based on the ‘sturges’ method. It is only optimal for gaussian data and underestimates the number of bins for large non-gaussian datasets.
  • remove_outliers (bool, default = False) – When set to True, outliers from the training data are removed using the Singular Value Decomposition technique applied via PCA linear dimensionality reduction.
  • outliers_threshold (float, default = 0.05) – The percentage / proportion of outliers in the dataset can be defined using the outliers_threshold param. By default, 0.05 is used which means 0.025 of the values on each side of the distribution’s tail are dropped from training data.
  • remove_multicollinearity (bool, default = False) – When set to True, the variables with inter-correlations higher than the threshold defined under the multicollinearity_threshold param are dropped. When two features are highly correlated with each other, the feature that is less correlated with the target variable is dropped.
  • multicollinearity_threshold (float, default = 0.9) – Threshold used for dropping the correlated features. Only comes into effect when remove_multicollinearity is set to True.
  • remove_perfect_collinearity (bool, default = False) – When set to True, perfect collinearity (features with correlation = 1) is removed from the dataset. When two features are 100% correlated, one of them is randomly dropped from the dataset.
  • create_clusters (bool, default = False) – When set to True, an additional feature is created where each instance is assigned to a cluster. The number of clusters is determined using a combination of Calinski-Harabasz and Silhouette criterion.
  • cluster_iter (int, default = 20) – Number of iterations used to create a cluster. Each iteration represents cluster size. Only comes into effect when create_clusters param is set to True.
  • polynomial_features (bool, default = False) – When set to True, new features are created based on all polynomial combinations that exist within the numeric features in a dataset to the degree defined in polynomial_degree param.
  • polynomial_degree (int, default = 2) – Degree of polynomial features. For example, if an input sample is two dimensional and of the form [a, b], the polynomial features with degree = 2 are: [1, a, b, a^2, ab, b^2].
  • trigonometry_features (bool, default = False) – When set to True, new features are created based on all trigonometric combinations that exist within the numeric features in a dataset to the degree defined in the polynomial_degree param.
  • polynomial_threshold (float, default = 0.1) – This is used to compress a sparse matrix of polynomial and trigonometric features. Polynomial and trigonometric features whose feature importance based on the combination of Random Forest, AdaBoost and Linear correlation falls within the percentile of the defined threshold are kept in the dataset. Remaining features are dropped before further processing.
  • group_features (list or list of list, default = None) – When a dataset contains features that have related characteristics, the group_features param can be used for statistical feature extraction. For example, if a dataset has numeric features that are related with each other (i.e ‘Col1’, ‘Col2’, ‘Col3’), a list containing the column names can be passed under group_features to extract statistical information such as the mean, median, mode and standard deviation.
  • group_names (list, default = None) – When group_features is passed, a name of the group can be passed into the group_names param as a list containing strings. The length of a group_names list must equal to the length of group_features. When the length doesn’t match or the name is not passed, new features are sequentially named such as group_1, group_2 etc.
  • feature_selection (bool, default = False) – When set to True, a subset of features is selected using a combination of various permutation importance techniques including Random Forest, Adaboost and Linear correlation with the target variable. The size of the subset is dependent on the feature_selection_threshold param. Generally, this is used to constrain the feature space in order to improve efficiency in modeling. When polynomial_features and feature_interaction are used, it is highly recommended to define the feature_selection_threshold param with a lower value. The feature selection algorithm is ‘classic’ by default but can be set to ‘boruta’, which will lead PyCaret to use the Boruta selection algorithm.
  • feature_selection_threshold (float, default = 0.8) – Threshold used for feature selection (for newly created polynomial features). A higher value will result in a higher feature space. It is recommended to do multiple trials with different values of feature_selection_threshold, especially in cases where polynomial_features and feature_interaction are used. Setting a very low value may be efficient but could result in under-fitting.
  • feature_selection_method (str, default = 'classic') – Can be either ‘classic’ or ‘boruta’. Selects the algorithm responsible for choosing a subset of features. For the ‘classic’ selection method, PyCaret will use various permutation importance techniques. For the ‘boruta’ algorithm, PyCaret will create an instance of boosted trees model, which will iterate with permutation over all features and choose the best ones based on the distributions of feature importance.
  • feature_interaction (bool, default = False) – When set to True, it will create new features by interacting (a * b) for all numeric variables in the dataset including polynomial and trigonometric features (if created). This feature is not scalable and may not work as expected on datasets with large feature space.
  • feature_ratio (bool, default = False) – When set to True, it will create new features by calculating the ratios (a / b) of all numeric variables in the dataset. This feature is not scalable and may not work as expected on datasets with large feature space.
  • interaction_threshold (float, default = 0.01) – Similar to polynomial_threshold, it is used to compress a sparse matrix of newly created features through interaction. Features whose importance based on the combination of Random Forest, AdaBoost and Linear correlation falls within the percentile of the defined threshold are kept in the dataset. Remaining features are dropped before further processing.
  • transform_target (bool, default = False) – When set to True, target variable is transformed using the method defined in transform_target_method param. Target transformation is applied separately from feature transformations.
  • transform_target_method (string, default = 'box-cox') – ‘Box-cox’ and ‘yeo-johnson’ methods are supported. Box-Cox requires input data to be strictly positive, while Yeo-Johnson supports both positive or negative data. When transform_target_method is ‘box-cox’ and target variable contains negative values, method is internally forced to ‘yeo-johnson’ to avoid exceptions.
  • data_split_shuffle (bool, default = True) – If set to False, prevents shuffling of rows when splitting data.
  • folds_shuffle (bool, default = False) – If set to False, prevents shuffling of rows when using cross validation.
  • n_jobs (int, default = -1) – The number of jobs to run in parallel (for functions that support parallel processing). -1 means using all processors. To run all functions on a single processor set n_jobs to None.
  • use_gpu (bool, default = False) – If set to True, algorithms that support GPU are trained using GPU.
  • html (bool, default = True) – If set to False, prevents runtime display of monitor. This must be set to False when using an environment that doesn’t support HTML.
  • session_id (int, default = None) – If None, a random seed is generated and returned in the Information grid. The unique number is then distributed as a seed in all functions used during the experiment. This can be used for later reproducibility of the entire experiment.
  • log_experiment (bool, default = False) – When set to True, all metrics and parameters are logged on MLFlow server.
  • experiment_name (str, default = None) – Name of experiment for logging. When set to None, ‘reg’ is by default used as alias for the experiment name.
  • log_plots (bool, default = False) – When set to True, specific plots are logged in MLflow as a png file. By default, it is set to False.
  • log_profile (bool, default = False) – When set to True, data profile is also logged on MLflow as a html file. By default, it is set to False.
  • log_data (bool, default = False) – When set to True, train and test dataset are logged as csv.
  • silent (bool, default = False) – When set to True, confirmation of data types is not required. All preprocessing will be performed assuming automatically inferred data types. Not recommended for direct use except for established pipelines.
  • verbose (Boolean, default = True) – Information grid is not printed when verbose is set to False.
  • profile (bool, default = False) – If set to true, a data profile for Exploratory Data Analysis will be displayed in an interactive HTML report.
Returns:

  • info_grid – Information grid is printed.
  • environment – This function returns various outputs that are stored in variable as tuple. They are used by other functions in pycaret.

pycaret.regression.compare_models(exclude=None, include=None, fold=10, round=4, sort='R2', n_select=1, budget_time=0, turbo=True, verbose=True)

This function trains all the models available in the model library and scores them using Kfold Cross Validation. The output prints a score grid with MAE, MSE, RMSE, R2, RMSLE and MAPE (averaged across folds), determined by the fold parameter.

This function returns the best model based on metric defined in sort parameter.

To select the top N models, use the n_select parameter, which is set to 1 by default. When n_select is greater than 1, a list of trained model objects is returned.

When turbo is set to True (the default), ‘kr’, ‘ard’ and ‘mlp’ are excluded due to longer training times.

Example

>>> from pycaret.datasets import get_data
>>> boston = get_data('boston')
>>> experiment_name = setup(data = boston,  target = 'medv')
>>> best_model = compare_models()

This will return the averaged score grid of all models except ‘kr’, ‘ard’ and ‘mlp’. When turbo param is set to False, all models including ‘kr’, ‘ard’ and ‘mlp’ are used, but this may result in longer training times.

>>> best_model = compare_models(exclude = ['knn','gbr'], turbo = False)

This will return a comparison of all models except K Nearest Neighbour and Gradient Boosting Regressor.

>>> best_model = compare_models(exclude = ['knn','gbr'] , turbo = True)

This will return a comparison of all models except K Nearest Neighbour, Gradient Boosting Regressor, Kernel Ridge Regressor, Automatic Relevance Determinant and Multi Level Perceptron.
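
As a further illustrative sketch of the n_select and budget_time parameters described below, the top 3 models sorted by MAE can be returned and the search capped at 10 minutes:

>>> top3 = compare_models(n_select = 3, sort = 'MAE', budget_time = 10)

top3 is a list of the three best trained model objects.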

Parameters:
  • exclude (list of strings, default = None) – In order to omit certain models from the comparison model ID’s can be passed as a list of strings in exclude param.
  • include (list of strings, default = None) – In order to run only certain models for the comparison, the model ID’s can be passed as a list of strings in include param.
  • fold (integer, default = 10) – Number of folds to be used in Kfold CV. Must be at least 2.
  • round (integer, default = 4) – Number of decimal places the metrics in the score grid will be rounded to.
  • sort (string, default = 'R2') – The scoring measure specified is used for sorting the average score grid. Other options are ‘MAE’, ‘MSE’, ‘RMSE’, ‘RMSLE’ and ‘MAPE’.
  • n_select (int, default = 1) – Number of top_n models to return. Use a negative argument for bottom selection; for example, n_select = -3 means bottom 3 models.
  • budget_time (int or float, default = 0) – If set above 0, will terminate execution of the function after budget_time minutes have passed and return results up to that point.
  • turbo (Boolean, default = True) – When turbo is set to True, it excludes estimators that have longer training times.
  • verbose (Boolean, default = True) – Score grid is not printed when verbose is set to False.
Returns:

A table containing the scores of the model across the kfolds. Scoring metrics used are MAE, MSE, RMSE, R2, RMSLE and MAPE. Mean and standard deviation of the scores across the folds are also returned.

Return type:

score_grid

Warning

  • compare_models(), though attractive, might be time consuming with large datasets. By default turbo is set to True, which excludes models that have longer training times. Changing the turbo parameter to False may result in very high training times with datasets where the number of samples exceeds 10,000.
pycaret.regression.create_model(estimator=None, ensemble=False, method=None, fold=10, round=4, cross_validation=True, verbose=True, system=True, **kwargs)

This function creates a model and scores it using Kfold Cross Validation. The output prints a score grid that shows MAE, MSE, RMSE, RMSLE, R2 and MAPE by fold (default = 10 Fold).

This function returns a trained model object.

setup() function must be called before using create_model()

Example

>>> from pycaret.datasets import get_data
>>> boston = get_data('boston')
>>> experiment_name = setup(data = boston,  target = 'medv')
>>> lr = create_model('lr')

This will create a trained Linear Regression model.
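
Estimator-specific keyword arguments can also be passed through **kwargs, and cross validation can be turned off, as described under the parameters below. An illustrative sketch (the hyperparameter value is arbitrary):

>>> rf = create_model('rf', max_depth = 5, cross_validation = False)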

Parameters:
  • estimator (string / object, default = None) –

    Enter ID of the estimators available in the model library or pass an untrained model object consistent with the fit / predict API to train and evaluate it. List of estimators in model library (ID - Name):

    • ’lr’ - Linear Regression
    • ’lasso’ - Lasso Regression
    • ’ridge’ - Ridge Regression
    • ’en’ - Elastic Net
    • ’lar’ - Least Angle Regression
    • ’llar’ - Lasso Least Angle Regression
    • ’omp’ - Orthogonal Matching Pursuit
    • ’br’ - Bayesian Ridge
    • ’ard’ - Automatic Relevance Determination
    • ’par’ - Passive Aggressive Regressor
    • ’ransac’ - Random Sample Consensus
    • ’tr’ - TheilSen Regressor
    • ’huber’ - Huber Regressor
    • ’kr’ - Kernel Ridge
    • ’svm’ - Support Vector Machine
    • ’knn’ - K Neighbors Regressor
    • ’dt’ - Decision Tree
    • ’rf’ - Random Forest
    • ’et’ - Extra Trees Regressor
    • ’ada’ - AdaBoost Regressor
    • ’gbr’ - Gradient Boosting Regressor
    • ’mlp’ - Multi Level Perceptron
    • ’xgboost’ - Extreme Gradient Boosting
    • ’lightgbm’ - Light Gradient Boosting
    • ’catboost’ - CatBoost Regressor
  • ensemble (Boolean, default = False) – True would result in an ensemble of estimator using the method parameter defined.
  • method (String, 'Bagging' or 'Boosting', default = None.) – method must be defined when ensemble is set to True. Default method is set to None.
  • fold (integer, default = 10) – Number of folds to be used in Kfold CV. Must be at least 2.
  • round (integer, default = 4) – Number of decimal places the metrics in the score grid will be rounded to.
  • cross_validation (bool, default = True) – When cross_validation is set to False, the fold parameter is ignored and the model is trained on the entire training dataset. No metric evaluation is returned.
  • verbose (Boolean, default = True) – Score grid is not printed when verbose is set to False.
  • system (Boolean, default = True) – Must remain True at all times. Only to be changed by internal functions.
  • **kwargs – Additional keyword arguments to pass to the estimator.
Returns:

  • score_grid – A table containing the scores of the model across the kfolds. Scoring metrics used are MAE, MSE, RMSE, RMSLE, R2 and MAPE. Mean and standard deviation of the scores across the folds are also returned.
  • model – Trained model object.

pycaret.regression.tune_model(estimator, fold=10, round=4, n_iter=10, custom_grid=None, optimize='R2', custom_scorer=None, choose_better=False, verbose=True)

This function tunes the hyperparameters of a model and scores it using Kfold Cross Validation. The output prints the score grid that shows MAE, MSE, RMSE, R2, RMSLE and MAPE by fold (by default = 10 Folds).

This function returns a trained model object.

tune_model() accepts a trained model object for the estimator parameter, as returned by create_model().

Example

>>> from pycaret.datasets import get_data
>>> boston = get_data('boston')
>>> experiment_name = setup(data = boston,  target = 'medv')
>>> xgboost = create_model('xgboost')
>>> tuned_xgboost = tune_model(xgboost)

This will tune the hyperparameters of Extreme Gradient Boosting Regressor.
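
A custom search grid can also be supplied through the custom_grid parameter described below (the grid values here are arbitrary):

>>> params = {'max_depth': [3, 5, 7, 9]}
>>> tuned_xgboost = tune_model(xgboost, custom_grid = params, n_iter = 20)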

Parameters:
  • estimator (object, default = None) –
  • fold (integer, default = 10) – Number of folds to be used in Kfold CV. Must be at least 2.
  • round (integer, default = 4) – Number of decimal places the metrics in the score grid will be rounded to.
  • n_iter (integer, default = 10) – Number of iterations within the Random Grid Search. For every iteration, the model randomly selects one value from the pre-defined grid of hyperparameters.
  • custom_grid (dictionary, default = None) – To use custom hyperparameters for tuning pass a dictionary with parameter name and values to be iterated. When set to None it uses pre-defined tuning grid.
  • optimize (string, default = 'R2') – Measure used to select the best model through hyperparameter tuning. The default scoring measure is ‘R2’. Other measures include ‘MAE’, ‘MSE’, ‘RMSE’, ‘RMSLE’, ‘MAPE’. When using ‘RMSE’ or ‘RMSLE’ the base scorer is ‘MSE’ and when using ‘MAPE’ the base scorer is ‘MAE’.
  • custom_scorer (object, default = None) – custom_scorer can be passed to tune hyperparameters of the model. It must be created using sklearn.make_scorer.
  • choose_better (Boolean, default = False) – When set to True, the base estimator is returned when the metric doesn’t improve with tune_model. This guarantees the returned object performs at least as well as the base estimator created using create_model or the model returned by compare_models.
  • verbose (Boolean, default = True) – Score grid is not printed when verbose is set to False.
Returns:

  • score_grid – A table containing the scores of the model across the kfolds. Scoring metrics used are MAE, MSE, RMSE, R2, RMSLE and MAPE. Mean and standard deviation of the scores across the folds are also returned.
  • model – trained model object

Warning

  • The estimator parameter takes a trained model object. Passing a model ID string instead of a trained model object returns an error.
pycaret.regression.ensemble_model(estimator, method='Bagging', fold=10, n_estimators=10, round=4, choose_better=False, optimize='R2', verbose=True)

This function ensembles the trained base estimator using the method defined in ‘method’ param (default = ‘Bagging’). The output prints a score grid that shows MAE, MSE, RMSE, R2, RMSLE and MAPE by fold (default CV = 10 Folds).

This function returns a trained model object.

Model must be created using create_model() or tune_model().

Example

>>> from pycaret.datasets import get_data
>>> boston = get_data('boston')
>>> experiment_name = setup(data = boston,  target = 'medv')
>>> dt = create_model('dt')
>>> ensembled_dt = ensemble_model(dt)

This will return an ensembled Decision Tree model using ‘Bagging’.
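
The ‘Boosting’ method and the number of base estimators can be set explicitly, for example (illustrative values):

>>> boosted_dt = ensemble_model(dt, method = 'Boosting', n_estimators = 50)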

Parameters:
  • estimator (object, default = None) –
  • method (String, default = 'Bagging') – The Bagging method will create an ensemble meta-estimator that fits base regressors on random subsets of the original dataset. The other available method is ‘Boosting’, which fits a regressor on the original dataset and then fits additional copies of the regressor on the same dataset but where the weights of instances are adjusted according to the error of the current prediction. As such, subsequent regressors focus more on difficult cases.
  • fold (integer, default = 10) – Number of folds to be used in Kfold CV. Must be at least 2.
  • n_estimators (integer, default = 10) – The number of base estimators in the ensemble. In case of perfect fit, the learning procedure is stopped early.
  • round (integer, default = 4) – Number of decimal places the metrics in the score grid will be rounded to.
  • choose_better (Boolean, default = False) – When set to True, the base estimator is returned when the metric doesn’t improve with ensemble_model. This guarantees the returned object performs at least as well as the base estimator created using create_model or the model returned by compare_models.
  • optimize (string, default = 'R2') – Only used when choose_better is set to True. The optimize parameter is used to compare the ensembled model with the base estimator. Values accepted in the optimize parameter are ‘MAE’, ‘MSE’, ‘RMSE’, ‘R2’, ‘RMSLE’, ‘MAPE’.
  • verbose (Boolean, default = True) – Score grid is not printed when verbose is set to False.
Returns:

  • score_grid – A table containing the scores of the model across the kfolds. Scoring metrics used are MAE, MSE, RMSE, R2, RMSLE and MAPE. Mean and standard deviation of the scores across the folds are also returned.
  • model – Trained ensembled model object.

pycaret.regression.blend_models(estimator_list='All', fold=10, round=4, choose_better=False, optimize='R2', turbo=True, verbose=True)

This function creates an ensemble meta-estimator that fits base regressors on the whole dataset and then averages their predictions to form a final prediction. By default, this function will use all estimators in the model library (excluding the few estimators excluded when turbo is True) or a specific set of trained estimators passed as a list in the estimator_list param. The ensemble is scored using Kfold Cross Validation. The output prints a score grid that shows MAE, MSE, RMSE, R2, RMSLE and MAPE by fold (default = 10 Fold).

This function returns a trained model object.

Example

>>> from pycaret.datasets import get_data
>>> boston = get_data('boston')
>>> experiment_name = setup(data = boston,  target = 'medv')
>>> blend_all = blend_models()

This will result in VotingRegressor for all models in the library except ‘ard’, ‘kr’ and ‘mlp’.

For specific models, you can use:

>>> lr = create_model('lr')
>>> rf = create_model('rf')
>>> knn = create_model('knn')
>>> blend_three = blend_models(estimator_list = [lr,rf,knn])

This will create a VotingRegressor of lr, rf and knn.

Parameters:
  • estimator_list (string ('All') or list of objects, default = 'All') –
  • fold (integer, default = 10) – Number of folds to be used in Kfold CV. Must be at least 2.
  • round (integer, default = 4) – Number of decimal places the metrics in the score grid will be rounded to.
  • choose_better (Boolean, default = False) – When set to True, the base estimator is returned when the metric doesn’t improve with blend_models. This guarantees the returned object performs at least as well as the base estimator created using create_model or the model returned by compare_models.
  • optimize (string, default = 'R2') – Only used when choose_better is set to True. The optimize parameter is used to compare the blended model with the base estimator. Values accepted in the optimize parameter are ‘MAE’, ‘MSE’, ‘RMSE’, ‘R2’, ‘RMSLE’, ‘MAPE’.
  • turbo (Boolean, default = True) – When turbo is set to True, it excludes estimators that use a radial kernel.
  • verbose (Boolean, default = True) – Score grid is not printed when verbose is set to False.
Returns:

  • score_grid – A table containing the scores of the model across the kfolds. Scoring metrics used are MAE, MSE, RMSE, R2, RMSLE and MAPE. Mean and standard deviation of the scores across the folds are also returned.
  • model – Trained Voting Regressor model object.

pycaret.regression.stack_models(estimator_list, meta_model=None, fold=10, round=4, restack=True, choose_better=False, optimize='R2', verbose=True)

This function trains a meta model and scores it using Kfold Cross Validation. The predictions from the base level models passed in the estimator_list param are used as input features for the meta model. The restack parameter controls whether raw features are also exposed to the meta model when set to True (default = True).

The output prints a score grid that shows MAE, MSE, RMSE, R2, RMSLE and MAPE by fold (default = 10 Folds).

This function returns a trained model object.

Example

>>> from pycaret.datasets import get_data
>>> boston = get_data('boston')
>>> experiment_name = setup(data = boston,  target = 'medv')
>>> dt = create_model('dt')
>>> rf = create_model('rf')
>>> ada = create_model('ada')
>>> ridge = create_model('ridge')
>>> knn = create_model('knn')
>>> stacked_models = stack_models(estimator_list=[dt,rf,ada,ridge,knn])

This will create a meta model that will use the predictions of all the models provided in estimator_list param. By default, the meta model is Linear Regression but can be changed with meta_model param.
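
For example, to use the Ridge model created above as the meta model instead (illustrative):

>>> stacked_ridge = stack_models(estimator_list = [dt,rf,ada,knn], meta_model = ridge)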

Parameters:
  • estimator_list (list of object) –
  • meta_model (object, default = None) – If set to None, Linear Regression is used as a meta model.
  • fold (integer, default = 10) – Number of folds to be used in Kfold CV. Must be at least 2.
  • round (integer, default = 4) – Number of decimal places the metrics in the score grid will be rounded to.
  • restack (Boolean, default = True) – When restack is set to True, raw data will be exposed to the meta model when making predictions; when set to False, only the predicted labels are passed to the meta model when making final predictions.
  • choose_better (Boolean, default = False) – When set to True, the base estimator is returned when the metric doesn’t improve with stack_models. This guarantees the returned object performs at least as well as the base estimator created using create_model or the model returned by compare_models.
  • optimize (string, default = 'R2') – Only used when choose_better is set to True. The optimize parameter is used to compare the stacked model with the base estimator. Values accepted in the optimize parameter are ‘MAE’, ‘MSE’, ‘RMSE’, ‘R2’, ‘RMSLE’, ‘MAPE’.
  • verbose (Boolean, default = True) – Score grid is not printed when verbose is set to False.
Returns:

  • score_grid – A table containing the scores of the model across the kfolds. Scoring metrics used are MAE, MSE, RMSE, R2, RMSLE and MAPE. Mean and standard deviation of the scores across the folds are also returned.
  • model – Trained model object.

pycaret.regression.plot_model(estimator, plot='residuals', scale=1, save=False, verbose=True, system=True)

This function takes a trained model object and returns a plot based on the test / hold-out set. The process may require the model to be re-trained in certain cases. See list of plots supported below.

Model must be created using create_model() or tune_model().

Example

>>> from pycaret.datasets import get_data
>>> boston = get_data('boston')
>>> experiment_name = setup(data = boston,  target = 'medv')
>>> lr = create_model('lr')
>>> plot_model(lr)

This will return a residuals plot of a trained Linear Regression model.
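
Other plot types from the list below can be selected and optionally saved to disk, for example:

>>> plot_model(lr, plot = 'error', save = True)

This will save the Prediction Error plot as a png file in the current working directory.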

Parameters:
  • estimator (object, default = none) – A trained model object should be passed as an estimator.
  • plot (string, default = 'residuals') –

    Enter abbreviation of type of plot. The current list of plots supported are (Plot - Name):

    • ’residuals’ - Residuals Plot
    • ’error’ - Prediction Error Plot
    • ’cooks’ - Cooks Distance Plot
    • ’rfe’ - Recursive Feat. Selection
    • ’learning’ - Learning Curve
    • ’vc’ - Validation Curve
    • ’manifold’ - Manifold Learning
    • ’feature’ - Feature Importance
    • ’parameter’ - Model Hyperparameter
  • scale (float, default = 1) – The resolution scale of the figure.
  • save (Boolean, default = False) – When set to True, Plot is saved as a ‘png’ file in current working directory.
  • verbose (Boolean, default = True) – Progress bar is not shown when verbose is set to False.
  • system (Boolean, default = True) – Must remain True at all times. Only to be changed by internal functions.
Returns:

Prints the visual plot.

Return type:

Visual_Plot

pycaret.regression.evaluate_model(estimator)

This function displays a user interface for all of the available plots for a given estimator. It internally uses the plot_model() function.

Example

>>> from pycaret.datasets import get_data
>>> boston = get_data('boston')
>>> experiment_name = setup(data = boston,  target = 'medv')
>>> lr = create_model('lr')
>>> evaluate_model(lr)

This will display the User Interface for all of the plots for a given estimator.

Parameters:estimator (object, default = none) – A trained model object should be passed as an estimator.
Returns:Displays the user interface for plotting.
Return type:User_Interface
pycaret.regression.interpret_model(estimator, plot='summary', feature=None, observation=None, **kwargs)

This function takes a trained model object and returns an interpretation plot based on the test / hold-out set. It only supports tree based algorithms.

This function is implemented based on the SHAP (SHapley Additive exPlanations), which is a unified approach to explain the output of any machine learning model. SHAP connects game theory with local explanations.

For more information : https://shap.readthedocs.io/en/latest/

Example

>>> from pycaret.datasets import get_data
>>> boston = get_data('boston')
>>> experiment_name = setup(data = boston,  target = 'medv')
>>> dt = create_model('dt')
>>> interpret_model(dt)

This will return a summary interpretation plot of Decision Tree model.
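
To analyse a single prediction with the ‘reason’ plot described below (the observation index is illustrative):

>>> interpret_model(dt, plot = 'reason', observation = 10)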

Parameters:
  • estimator (object, default = none) – A trained tree based model object should be passed as an estimator.
  • plot (string, default = 'summary') – Other available options are ‘correlation’ and ‘reason’.
  • feature (string, default = None) – This parameter is only needed when plot = ‘correlation’. By default feature is set to None which means the first column of the dataset will be used as a variable. A feature parameter must be passed to change this.
  • observation (integer, default = None) – This parameter only comes into effect when plot is set to ‘reason’. If no observation number is provided, it will return an analysis of all observations with the option to select the feature on x and y axes through drop down interactivity. For analysis at the sample level, an observation parameter must be passed with the index value of the observation in test / hold-out set.
  • **kwargs – Additional keyword arguments to pass to the plot.
Returns:

Returns the visual plot. Returns the interactive JS plot when plot = ‘reason’.

Return type:

Visual_Plot

pycaret.regression.predict_model(estimator, data=None, round=4, verbose=True)

This function is used to predict target value on the new dataset using a trained estimator. New unseen data can be passed to data param as pandas.DataFrame. If data is not passed, the test / hold-out set separated at the time of setup() is used to generate predictions.

Example

>>> from pycaret.datasets import get_data
>>> boston = get_data('boston')
>>> experiment_name = setup(data = boston,  target = 'medv')
>>> lr = create_model('lr')
>>> lr_predictions_holdout = predict_model(lr)
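
To score new unseen data, pass a pandas.DataFrame with the same features used during training (new_data below is a hypothetical DataFrame):

>>> predictions = predict_model(lr, data = new_data)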

Parameters:
  • estimator (object, default = none) – A trained model object / pipeline should be passed as an estimator.
  • data (pandas.DataFrame) – shape (n_samples, n_features) where n_samples is the number of samples and n_features is the number of features. All features used during training must be present in the new dataset.
  • round (integer, default = 4) – Number of decimal places the predicted labels will be rounded to.
  • verbose (Boolean, default = True) – Holdout score grid is not printed when verbose is set to False.
Returns:

  • Predictions – Predictions (Label and Score) column attached to the original dataset and returned as pandas.DataFrame.
  • score grid – A table containing the scoring metrics on hold-out / test set.

Warning

  • The behavior of predict_model changed in version 2.1 without backward compatibility.

As such, pipelines trained using version <= 2.0 may not work for inference with version >= 2.1. You can either retrain your models with a newer version or downgrade the version for inference.

pycaret.regression.finalize_model(estimator)

This function fits the estimator onto the complete dataset passed during the setup() stage. The purpose of this function is to prepare for final model deployment after experimentation.

Example

>>> from pycaret.datasets import get_data
>>> boston = get_data('boston')
>>> experiment_name = setup(data = boston,  target = 'medv')
>>> lr = create_model('lr')
>>> final_lr = finalize_model(lr)

This will return the final model object fitted to complete dataset.

Parameters:estimator (object, default = none) – A trained model object should be passed as an estimator.
Returns:Trained model object fitted on complete dataset.
Return type:model

Warning

  • If the model returned by finalize_model() is used on predict_model() without passing a new unseen dataset, then the information grid printed is misleading as the model is trained on the complete dataset including the test / hold-out sample. Once finalize_model() is used, the model is considered ready for deployment and should be used on new unseen datasets only.
pycaret.regression.deploy_model(model, model_name, platform, authentication)

This function deploys the transformation pipeline and trained model object for production use. The platform of deployment can be defined under the platform param along with the applicable authentication tokens which are passed as a dictionary to the authentication param.

Before deploying a model to AWS S3 (‘aws’), environment variables must be configured using the command line interface. To configure AWS env. variables, type aws configure in your command line. The following information is required, which can be generated using the Identity and Access Management (IAM) portal of your AWS console account:

  • AWS Access Key ID
  • AWS Secret Key Access
  • Default Region Name (can be seen under Global settings on your AWS console)
  • Default output format (must be left blank)
>>> from pycaret.datasets import get_data
>>> boston = get_data('boston')
>>> experiment_name = setup(data = boston,  target = 'medv')
>>> lr = create_model('lr')
>>> deploy_model(model = lr, model_name = 'deploy_lr', platform = 'aws', authentication = {'bucket' : 'bucket-name'})

Before deploying a model to Google Cloud Platform (GCP), a project must be created using either the command line or the GCP console. Once the project is created, you must create a service account and download the service account key as a JSON file, which is then used to set the environment variable.

Learn more : https://cloud.google.com/docs/authentication/production

>>> from pycaret.datasets import get_data
>>> boston = get_data('boston')
>>> experiment_name = setup(data = boston,  target = 'medv')
>>> lr = create_model('lr')
>>> os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = 'c:/path-to-json-file.json'
>>> deploy_model(model = lr, model_name = 'deploy_lr', platform = 'gcp', authentication = {'project' : 'project-name', 'bucket' : 'bucket-name'})

Before deploying a model to Microsoft Azure, the environment variable for the connection string must be set. The connection string can be obtained from the ‘Access Keys’ section of your storage account in Azure.

>>> from pycaret.datasets import get_data
>>> boston = get_data('boston')
>>> experiment_name = setup(data = boston,  target = 'medv')
>>> lr = create_model('lr')
>>> os.environ['AZURE_STORAGE_CONNECTION_STRING'] = 'connection-string-here'
>>> deploy_model(model = lr, model_name = 'deploy_lr', platform = 'azure', authentication = {'container' : 'container-name'})
Parameters:
  • model (object) – A trained model object should be passed as an estimator.
  • model_name (string) – Name of model to be passed as a string.
  • platform (string) – Name of platform for deployment. Currently accepts: ‘aws’, ‘gcp’, ‘azure’
  • authentication (dict) –

    Dictionary of applicable authentication tokens.

    When platform = ‘aws’: {‘bucket’ : ‘name of bucket’}

    When platform = ‘gcp’: {‘project’: ‘name of project’, ‘bucket’ : ‘name of bucket’}

    When platform = ‘azure’: {‘container’: ‘name of container’}

Returns:

Return type:

Success_Message

Warning

  • This function uses file storage services to deploy the model on a cloud platform. As such, it is efficient for batch use. When the production objective is to obtain predictions at an instance level, this may not be an efficient choice as it transmits the binary pickle file between your local python environment and the platform.
pycaret.regression.save_model(model, model_name, model_only=False, verbose=True)

This function saves the transformation pipeline and trained model object into the current active directory as a pickle file for later use.

Example

>>> from pycaret.datasets import get_data
>>> boston = get_data('boston')
>>> experiment_name = setup(data = boston,  target = 'medv')
>>> lr = create_model('lr')
>>> save_model(lr, 'lr_model_23122019')

This will save the transformation pipeline and model as a binary pickle file in the current directory.

Parameters:
  • model (object, default = none) – A trained model object should be passed as an estimator.
  • model_name (string, default = none) – Name of pickle file to be passed as a string.
  • model_only (bool, default = False) – When set to True, only trained model object is saved and all the transformations are ignored.
  • verbose (Boolean, default = True) – Success message is not printed when verbose is set to False.
Returns:

Return type:

Success_Message

pycaret.regression.load_model(model_name, platform=None, authentication=None, verbose=True)

This function loads a previously saved transformation pipeline and model from the current active directory into the current python environment. Load object must be a pickle file.

Example

>>> saved_lr = load_model('lr_model_23122019')

This will load the previously saved model in saved_lr variable. The file must be in the current directory.

Parameters:
  • model_name (string, default = none) – Name of pickle file to be passed as a string.
  • platform (string, default = None) – Name of platform, if loading model from cloud. Currently available options are: ‘aws’, ‘gcp’, ‘azure’.
  • authentication (dict) –

    dictionary of applicable authentication tokens.

    When platform = ‘aws’: {‘bucket’ : ‘name of bucket’}

    When platform = ‘gcp’: {‘project’: ‘name of project’, ‘bucket’ : ‘name of bucket’}

    When platform = ‘azure’: {‘container’: ‘name of container’}

  • verbose (Boolean, default = True) – Success message is not printed when verbose is set to False.
Returns:

Return type:

Model Object

pycaret.regression.automl(optimize='R2', use_holdout=False)

This function returns the best model out of all models created in current active environment based on metric defined in optimize parameter.

Parameters:
  • optimize (string, default = 'R2') – Other values you can pass in optimize param are ‘MAE’, ‘MSE’, ‘RMSE’, ‘RMSLE’, and ‘MAPE’.
  • use_holdout (bool, default = False) – When set to True, metrics are evaluated on holdout set instead of CV.
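
Example

A minimal usage sketch (assumes models have already been created in the current session; the variable name best_model is illustrative):

>>> best_model = automl(optimize = 'MAE')

This is expected to return the model with the best MAE among all models created in the session.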
pycaret.regression.pull()

Returns the latest displayed table.
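
Example

A minimal usage sketch (the variable name results is illustrative):

>>> results = pull()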

Returns:Equivalent to get_config(‘display_container’)[-1]
Return type:pandas.DataFrame
pycaret.regression.models(type=None)

Returns table of models available in model library.

Example

>>> all_models = models()

This will return a pandas.DataFrame with all available models and their metadata.

Parameters:type (string, default = None) –
  • linear : filters and only return linear models
  • tree : filters and only return tree based models
  • ensemble : filters and only return ensemble models
Returns:
Return type:pandas.DataFrame
pycaret.regression.get_logs(experiment_name=None, save=False)

Returns a table with experiment logs consisting of run details, parameters, metrics and tags.

Example

>>> logs = get_logs()

This will return a pandas.DataFrame.

Parameters:
  • experiment_name (string, default = None) – When set to None, the current active run is used.
  • save (bool, default = False) – When set to True, csv file is saved in current directory.
Returns:

Return type:

pandas.DataFrame

pycaret.regression.get_config(variable)

This function is used to access global environment variables. Following variables can be accessed:

  • X: Transformed dataset (X)
  • y: Transformed dataset (y)
  • X_train: Transformed train dataset (X)
  • X_test: Transformed test/holdout dataset (X)
  • y_train: Transformed train dataset (y)
  • y_test: Transformed test/holdout dataset (y)
  • seed: random state set through session_id
  • prep_pipe: Transformation pipeline configured through setup
  • target_inverse_transformer: Target variable inverse transformer
  • folds_shuffle_param: shuffle parameter used in Kfolds
  • n_jobs_param: n_jobs parameter used in model training
  • html_param: html_param configured through setup
  • create_model_container: results grid storage container
  • master_model_container: model storage container
  • display_container: results display container
  • exp_name_log: Name of experiment set through setup
  • logging_param: log_experiment param set through setup
  • log_plots_param: log_plots param set through setup
  • USI: Unique session ID parameter set through setup
  • data_before_preprocess: data before preprocessing
  • target_param: name of target variable
  • gpu_param: use_gpu param configured through setup

Example

>>> X_train = get_config('X_train')

This will return X_train transformed dataset.

Returns:
Return type:variable
pycaret.regression.set_config(variable, value)

This function is used to reset global environment variables. Following variables can be accessed:

  • X: Transformed dataset (X)
  • y: Transformed dataset (y)
  • X_train: Transformed train dataset (X)
  • X_test: Transformed test/holdout dataset (X)
  • y_train: Transformed train dataset (y)
  • y_test: Transformed test/holdout dataset (y)
  • seed: random state set through session_id
  • prep_pipe: Transformation pipeline configured through setup
  • target_inverse_transformer: Target variable inverse transformer
  • folds_shuffle_param: shuffle parameter used in Kfolds
  • n_jobs_param: n_jobs parameter used in model training
  • html_param: html_param configured through setup
  • create_model_container: results grid storage container
  • master_model_container: model storage container
  • display_container: results display container
  • exp_name_log: Name of experiment set through setup
  • logging_param: log_experiment param set through setup
  • log_plots_param: log_plots param set through setup
  • USI: Unique session ID parameter set through setup
  • data_before_preprocess: data before preprocessing
  • target_param: name of target variable
  • gpu_param: use_gpu param configured through setup

Example

>>> set_config('seed', 123)

This will set the global seed to ‘123’.

pycaret.regression.get_system_logs()

Read and print ‘logs.log’ file from current active directory.
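
Example

A minimal usage sketch (the function takes no arguments):

>>> get_system_logs()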

Clustering

pycaret.clustering.setup(data, categorical_features=None, categorical_imputation='constant', ordinal_features=None, high_cardinality_features=None, numeric_features=None, numeric_imputation='mean', date_features=None, ignore_features=None, normalize=False, normalize_method='zscore', transformation=False, transformation_method='yeo-johnson', handle_unknown_categorical=True, unknown_categorical_method='least_frequent', pca=False, pca_method='linear', pca_components=None, ignore_low_variance=False, combine_rare_levels=False, rare_level_threshold=0.1, bin_numeric_features=None, remove_multicollinearity=False, multicollinearity_threshold=0.9, group_features=None, group_names=None, supervised=False, supervised_target=None, n_jobs=-1, html=True, session_id=None, log_experiment=False, experiment_name=None, log_plots=False, log_profile=False, log_data=False, silent=False, verbose=True, profile=False)

This function initializes the environment in pycaret. setup() must be called before executing any other function in pycaret. It takes one mandatory parameter: data.

Example

>>> from pycaret.datasets import get_data
>>> jewellery = get_data('jewellery')
>>> experiment_name = setup(data = jewellery, normalize = True)

‘jewellery’ is a pandas.DataFrame.

Parameters:
  • data (pandas.DataFrame) – Shape (n_samples, n_features) where n_samples is the number of samples and n_features is the number of features.
  • categorical_features (string, default = None) – If the inferred data types are not correct, categorical_features can be used to overwrite the inferred type. If when running setup the type of ‘column1’ is inferred as numeric instead of categorical, then this parameter can be used to overwrite the type by passing categorical_features = [‘column1’].
  • categorical_imputation (string, default = 'constant') – If missing values are found in categorical features, they will be imputed with a constant ‘not_available’ value. The other available option is ‘mode’ which imputes the missing value using most frequent value in the training dataset.
  • ordinal_features (dictionary, default = None) – When the data contains ordinal features, they must be encoded differently using the ordinal_features param. If the data has a categorical variable with values of ‘low’, ‘medium’, ‘high’ and it is known that low < medium < high, then it can be passed as ordinal_features = { ‘column_name’ : [‘low’, ‘medium’, ‘high’] }. The list sequence must be in increasing order from lowest to highest.
  • high_cardinality_features (string, default = None) – When the data contains features with high cardinality, they can be compressed into fewer levels by passing them as a list of column names with high cardinality. Features are compressed using frequency distribution. As such, the original features are replaced with the frequency distribution and converted into a numeric variable.
  • numeric_features (string, default = None) – If the inferred data types are not correct, numeric_features can be used to overwrite the inferred type. If when running setup the type of ‘column1’ is inferred as a categorical instead of numeric, then this parameter can be used to overwrite by passing numeric_features = [‘column1’].
  • numeric_imputation (string, default = 'mean') – If missing values are found in numeric features, they will be imputed with the mean value of the feature. The other available options are ‘median’ which imputes the value using the median value in the training dataset and ‘zero’ which replaces missing values with zeroes.
  • date_features (string, default = None) – If the data has a DateTime column that is not automatically detected when running setup, this parameter can be used by passing date_features = ‘date_column_name’. It can work with multiple date columns. Date columns are not used in modeling. Instead, feature extraction is performed and date columns are dropped from the dataset. If the date column includes a time stamp, features related to time will also be extracted.
  • ignore_features (string, default = None) – If any feature should be ignored for modeling, it can be passed to the param ignore_features. The ID and DateTime columns when inferred, are automatically set to ignore for modeling.
  • normalize (bool, default = False) – When set to True, the feature space is transformed using the normalize_method param. Generally, linear algorithms perform better with normalized data; however, the results may vary and it is advised to run multiple experiments to evaluate the benefit of normalization.
  • normalize_method (string, default = 'zscore') –

    Defines the method to be used for normalization. By default, normalize method is set to ‘zscore’. The standard zscore is calculated as z = (x - u) / s. The other available options are:

    ’minmax’ : scales and translates each feature individually such that it is in the range of 0 - 1.
    ’maxabs’ : scales and translates each feature individually such that the maximal absolute value of each feature will be 1.0. It does not shift/center the data, and thus does not destroy any sparsity.
    ’robust’ : scales and translates each feature according to the Interquartile range. When the dataset contains outliers, robust scaler often gives better results.
  • transformation (bool, default = False) – When set to True, a power transformation is applied to make the data more normal / Gaussian-like. This is useful for modeling issues related to heteroscedasticity or other situations where normality is desired. The optimal parameter for stabilizing variance and minimizing skewness is estimated through maximum likelihood.
  • transformation_method (string, default = 'yeo-johnson') – Defines the method for transformation. By default, the transformation method is set to ‘yeo-johnson’. The other available option is ‘quantile’ transformation. Both transformations transform the feature set to follow a Gaussian-like or normal distribution. Note that the quantile transformer is non-linear and may distort linear correlations between variables measured at the same scale.
  • handle_unknown_categorical (bool, default = True) – When set to True, unknown categorical levels in new / unseen data are replaced by the most or least frequent level as learned in the training data. The method is defined under the unknown_categorical_method param.
  • unknown_categorical_method (string, default = 'least_frequent') – Method used to replace unknown categorical levels in unseen data. Method can be set to ‘least_frequent’ or ‘most_frequent’.
  • pca (bool, default = False) – When set to True, dimensionality reduction is applied to project the data into a lower dimensional space using the method defined in pca_method param. In supervised learning pca is generally performed when dealing with high feature space and memory is a constraint. Note that not all datasets can be decomposed efficiently using a linear PCA technique and that applying PCA may result in loss of information. As such, it is advised to run multiple experiments with different pca_methods to evaluate the impact.
  • pca_method (string, default = 'linear') –

    The ‘linear’ method performs Linear dimensionality reduction using Singular Value Decomposition. The other available options are:

    kernel : dimensionality reduction through the use of RBF kernel.

    incremental : replacement for ‘linear’ pca when the dataset to be decomposed is too large to fit in memory.
  • pca_components (int/float, default = 0.99) – Number of components to keep. If pca_components is a float, it is treated as a target percentage for information retention. When pca_components is an integer it is treated as the number of features to be kept. pca_components must be strictly less than the original number of features in the dataset.
  • ignore_low_variance (bool, default = False) – When set to True, all categorical features with statistically insignificant variances are removed from the dataset. The variance is calculated using the ratio of unique values to the number of samples, and the ratio of the most common value to the frequency of the second most common value.
  • combine_rare_levels (bool, default = False) – When set to True, all levels in categorical features below the threshold defined in rare_level_threshold param are combined together as a single level. There must be at least two levels under the threshold for this to take effect. rare_level_threshold represents the percentile distribution of level frequency. Generally, this technique is applied to limit a sparse matrix caused by high numbers of levels in categorical features.
  • rare_level_threshold (float, default = 0.1) – Percentile distribution below which rare categories are combined. Only comes into effect when combine_rare_levels is set to True.
  • bin_numeric_features (list, default = None) – When a list of numeric features is passed they are transformed into categorical features using KMeans, where values in each bin have the same nearest center of a 1D k-means cluster. The number of clusters is determined based on the ‘sturges’ method. It is only optimal for gaussian data and underestimates the number of bins for large non-gaussian datasets.
  • remove_multicollinearity (bool, default = False) – When set to True, the variables with inter-correlations higher than the threshold defined under the multicollinearity_threshold param are dropped. When two features are highly correlated with each other, the feature with higher average correlation in the feature space is dropped.
  • multicollinearity_threshold (float, default = 0.9) – Threshold used for dropping the correlated features. Only comes into effect when remove_multicollinearity is set to True.
  • group_features (list or list of list, default = None) – When a dataset contains features that have related characteristics, the group_features param can be used for statistical feature extraction. For example, if a dataset has numeric features that are related with each other (i.e ‘Col1’, ‘Col2’, ‘Col3’), a list containing the column names can be passed under group_features to extract statistical information such as the mean, median, mode and standard deviation.
  • group_names (list, default = None) – When group_features is passed, a name of the group can be passed into the group_names param as a list containing strings. The length of a group_names list must be equal to the length of group_features. When the length doesn’t match or the name is not passed, new features are sequentially named such as group_1, group_2 etc.
  • supervised (bool, default = False) – When set to True, supervised_target column is ignored for transformation. This param is only for internal use.
  • supervised_target (string, default = None) – Name of supervised_target column that will be ignored for transformation. Only applicable when the tune_model() function is used. This param is only for internal use.
  • n_jobs (int, default = -1) – The number of jobs to run in parallel (for functions that support parallel processing). -1 means using all processors. To run all functions on a single processor, set n_jobs to None.
  • html (bool, default = True) – If set to False, prevents runtime display of monitor. This must be set to False when using an environment that doesn’t support HTML.
  • session_id (int, default = None) – If None, a random seed is generated and returned in the Information grid. The unique number is then distributed as a seed in all functions used during the experiment. This can be used for later reproducibility of the entire experiment.
  • log_experiment (bool, default = False) – When set to True, all metrics and parameters are logged on MLFlow server.
  • experiment_name (str, default = None) – Name of experiment for logging. When set to None, ‘clu’ is by default used as alias for the experiment name.
  • log_plots (bool, default = False) – When set to True, specific plots are logged in MLflow as a png file. By default, it is set to False.
  • log_profile (bool, default = False) – When set to True, data profile is also logged on MLflow as a html file. By default, it is set to False.
  • log_data (bool, default = False) – When set to True, train and test dataset are logged as csv.
  • silent (bool, default = False) – When set to True, confirmation of data types is not required. All preprocessing will be performed assuming automatically inferred data types. Not recommended for direct use except for established pipelines.
  • verbose (Boolean, default = True) – Information grid is not printed when verbose is set to False.
  • profile (bool, default = False) – If set to True, a data profile for Exploratory Data Analysis will be displayed in an interactive HTML report.
Returns:

  • info_grid – Information grid is printed.
  • environment – This function returns various outputs that are stored in variable as tuple. They are used by other functions in pycaret.

pycaret.clustering.create_model(model=None, num_clusters=None, ground_truth=None, verbose=True, system=True, **kwargs)

This function creates a model on the dataset passed as a data param during the setup stage. setup() function must be called before using create_model().

This function returns a trained model object.

Example

>>> from pycaret.datasets import get_data
>>> jewellery = get_data('jewellery')
>>> experiment_name = setup(data = jewellery, normalize = True)
>>> kmeans = create_model('kmeans')

This will return a trained K-Means clustering model.

Parameters:
  • model (string / object, default = None) –

    Enter ID of the models available in model library or pass an untrained model object consistent with fit / predict API to train and evaluate model. List of models available in model library (ID - Model):

    • ’kmeans’ - K-Means Clustering
    • ’ap’ - Affinity Propagation
    • ’meanshift’ - Mean shift Clustering
    • ’sc’ - Spectral Clustering
    • ’hclust’ - Agglomerative Clustering
    • ’dbscan’ - Density-Based Spatial Clustering
    • ’optics’ - OPTICS Clustering
    • ’birch’ - Birch Clustering
    • ’kmodes’ - K-Modes Clustering
  • num_clusters (int, default = None) – Number of clusters to be generated with the dataset. If None, num_clusters is set to 4.
  • ground_truth (string, default = None) – When ground_truth is provided, Homogeneity Score, Rand Index, and Completeness Score are evaluated and printed along with other metrics.
  • verbose (Boolean, default = True) – Status update is not printed when verbose is set to False.
  • system (Boolean, default = True) – Must remain True at all times. Only to be changed by internal functions.
  • **kwargs – Additional keyword arguments to pass to the estimator.
Returns:

  • score_grid – A table containing the Silhouette, Calinski-Harabasz, Davies-Bouldin, Homogeneity Score, Rand Index, and Completeness Score. Last 3 are only evaluated when ground_truth param is provided.
  • model – trained model object

Warning

  • num_clusters is not required for Affinity Propagation (‘ap’), Mean shift clustering (‘meanshift’), Density-Based Spatial Clustering (‘dbscan’) and OPTICS Clustering (‘optics’). The num_clusters param for these models is automatically determined.
  • When fit doesn’t converge in Affinity Propagation (‘ap’) model, all datapoints are labelled as -1.
  • Noisy samples are given the label -1, when using Density-Based Spatial (‘dbscan’) or OPTICS Clustering (‘optics’).
  • OPTICS (‘optics’) clustering may take longer training times on large datasets.
pycaret.clustering.assign_model(model, transformation=False, verbose=True)

This function assigns each of the data points in the dataset passed during the setup stage to one of the clusters using the trained model object passed as the model param. create_model() function must be called before using assign_model().

This function returns a pandas.DataFrame.

Example

>>> from pycaret.datasets import get_data
>>> jewellery = get_data('jewellery')
>>> experiment_name = setup(data = jewellery, normalize = True)
>>> kmeans = create_model('kmeans')
>>> kmeans_df = assign_model(kmeans)

This will return a pandas.DataFrame with inferred clusters using trained model.

Parameters:
  • model (trained model object, default = None) –
  • transformation (bool, default = False) – When set to True, assigned clusters are returned on transformed dataset instead of original dataset passed during setup().
  • verbose (Boolean, default = True) – Status update is not printed when verbose is set to False.
Returns:

Returns a DataFrame with assigned clusters using a trained model.

Return type:

pandas.DataFrame

pycaret.clustering.plot_model(model, plot='cluster', feature=None, label=False, scale=1, save=False, system=True)

This function takes a trained model object and returns a plot on the dataset passed during setup stage. This function internally calls assign_model before generating a plot.

Example

>>> from pycaret.datasets import get_data
>>> jewellery = get_data('jewellery')
>>> experiment_name = setup(data = jewellery, normalize = True)
>>> kmeans = create_model('kmeans')
>>> plot_model(kmeans)

This will return a cluster scatter plot (by default).

Parameters:
  • model (object, default = none) – A trained model object can be passed. Model must be created using create_model().
  • plot (string, default = 'cluster') –

    Enter abbreviation for type of plot. The current list of plots supported are (Plot - Name):

    • ’cluster’ - Cluster PCA Plot (2d)
    • ’tsne’ - Cluster t-SNE (3d)
    • ’elbow’ - Elbow Plot
    • ’silhouette’ - Silhouette Plot
    • ’distance’ - Distance Plot
    • ’distribution’ - Distribution Plot
  • feature (string, default = None) – Name of feature column for x-axis when plot = ‘distribution’. When plot is ‘cluster’ or ‘tsne’, the feature column is used as a hoverover tooltip and/or label when label is set to True. If no feature name is passed in ‘cluster’ or ‘tsne’, by default the first column of the dataset is chosen as the hoverover tooltip.
  • label (bool, default = False) – When set to True, data labels are shown in ‘cluster’ and ‘tsne’ plot.
  • scale (float, default = 1) – The resolution scale of the figure.
  • save (Boolean, default = False) – Plot is saved as png file in local directory when save parameter set to True.
  • system (Boolean, default = True) – Must remain True at all times. Only to be changed by internal functions.
Returns:

Prints the visual plot.

Return type:

Visual_Plot

pycaret.clustering.tune_model(model=None, supervised_target=None, estimator=None, optimize=None, custom_grid=None, fold=10, verbose=True)

This function tunes the num_clusters model parameter using a predefined grid with the objective of optimizing a supervised learning metric as defined in the optimize param. You can choose the supervised estimator from a large library available in pycaret. By default, supervised estimator is Linear.

This function returns the tuned model object.

Example

>>> from pycaret.datasets import get_data
>>> boston = get_data('boston')
>>> experiment_name = setup(data = boston, normalize = True)
>>> tuned_kmeans = tune_model(model = 'kmeans', supervised_target = 'medv')

This will return tuned K-Means Clustering Model.

Parameters:
  • model (string, default = None) –

    Enter ID of the models available in model library (ID - Name):

    • ’kmeans’ - K-Means Clustering
    • ’ap’ - Affinity Propagation
    • ’meanshift’ - Mean shift Clustering
    • ’sc’ - Spectral Clustering
    • ’hclust’ - Agglomerative Clustering
    • ’dbscan’ - Density-Based Spatial Clustering
    • ’optics’ - OPTICS Clustering
    • ’birch’ - Birch Clustering
    • ’kmodes’ - K-Modes Clustering
  • supervised_target (string) – Name of the target column for supervised learning.
  • estimator (string, default = None) –

    For Classification (ID - Name):

    • ’lr’ - Logistic Regression
    • ’knn’ - K Nearest Neighbour
    • ’nb’ - Naive Bayes
    • ’dt’ - Decision Tree Classifier
    • ’svm’ - SVM - Linear Kernel
    • ’rbfsvm’ - SVM - Radial Kernel
    • ’gpc’ - Gaussian Process Classifier
    • ’mlp’ - Multi Level Perceptron
    • ’ridge’ - Ridge Classifier
    • ’rf’ - Random Forest Classifier
    • ’qda’ - Quadratic Discriminant Analysis
    • ’ada’ - Ada Boost Classifier
    • ’gbc’ - Gradient Boosting Classifier
    • ’lda’ - Linear Discriminant Analysis
    • ’et’ - Extra Trees Classifier
    • ’xgboost’ - Extreme Gradient Boosting
    • ’lightgbm’ - Light Gradient Boosting
    • ’catboost’ - CatBoost Classifier

    For Regression (ID - Name):

    • ’lr’ - Linear Regression
    • ’lasso’ - Lasso Regression
    • ’ridge’ - Ridge Regression
    • ’en’ - Elastic Net
    • ’lar’ - Least Angle Regression
    • ’llar’ - Lasso Least Angle Regression
    • ’omp’ - Orthogonal Matching Pursuit
    • ’br’ - Bayesian Ridge
    • ’ard’ - Automatic Relevance Determ.
    • ’par’ - Passive Aggressive Regressor
    • ’ransac’ - Random Sample Consensus
    • ’tr’ - TheilSen Regressor
    • ’huber’ - Huber Regressor
    • ’kr’ - Kernel Ridge
    • ’svm’ - Support Vector Machine
    • ’knn’ - K Neighbors Regressor
    • ’dt’ - Decision Tree
    • ’rf’ - Random Forest
    • ’et’ - Extra Trees Regressor
    • ’ada’ - AdaBoost Regressor
    • ’gbr’ - Gradient Boosting
    • ’mlp’ - Multi Level Perceptron
    • ’xgboost’ - Extreme Gradient Boosting
    • ’lightgbm’ - Light Gradient Boosting
    • ’catboost’ - CatBoost Regressor

    If set to None, Linear / Logistic model is used by default.

  • optimize (string, default = None) –
    For Classification tasks:
    Accuracy, AUC, Recall, Precision, F1, Kappa
    For Regression tasks:
    MAE, MSE, RMSE, R2, RMSLE, MAPE
    If set to None, default is ‘Accuracy’ for classification and ‘R2’ for regression tasks.
  • custom_grid (list, default = None) – By default, a pre-defined number of clusters is iterated over to optimize the supervised objective. To overwrite default iteration, pass a list of num_clusters to iterate over in custom_grid param.
  • fold (integer, default = 10) – Number of folds to be used in Kfold CV. Must be at least 2.
  • verbose (Boolean, default = True) – Status update is not printed when verbose is set to False.
Returns:

  • Visual_Plot – Visual plot with num_clusters param on x-axis with metric to optimize on y-axis. Also, prints the best model metric.
  • model – trained model object with best num_clusters param.

Warning

  • Affinity Propagation, Mean shift clustering, Density-Based Spatial Clustering and OPTICS Clustering cannot be used in this function since they do not support the num_clusters param.
pycaret.clustering.predict_model(model, data)

This function is used to predict new data using a trained model. It requires a trained model object created using one of the functions in pycaret that returns a trained model object. New data must be passed to the data param as a DataFrame.

Example

>>> from pycaret.datasets import get_data
>>> jewellery = get_data('jewellery')
>>> experiment_name = setup(data = jewellery)
>>> kmeans = create_model('kmeans')
>>> kmeans_predictions = predict_model(model = kmeans, data = jewellery)
Parameters:
  • model (object, default = None) – A trained model object / pipeline should be passed as an estimator.
  • data (pandas.DataFrame) – Shape (n_samples, n_features) where n_samples is the number of samples and n_features is the number of features. All features used during training must be present in the new dataset.
Returns:

Information grid is printed when data is None.

Return type:

info_grid

Warning

  • Models that do not support ‘predict’ function cannot be used in predict_model().
  • The behavior of predict_model changed in version 2.1 without backward compatibility.

As such, pipelines trained using version <= 2.0 may not work for inference with version >= 2.1. You can either retrain your models with a newer version or downgrade the version for inference.

pycaret.clustering.deploy_model(model, model_name, platform, authentication)

This function deploys the transformation pipeline and trained model object for production use. The platform of deployment can be defined under the platform param along with the applicable authentication tokens which are passed as a dictionary to the authentication param.

Before deploying a model to an AWS S3 (‘aws’), environment variables must be configured using the command line interface. To configure AWS env. variables, type aws configure in your python command line. The following information is required which can be generated using the Identity and Access Management (IAM) portal of your AWS console account:

  • AWS Access Key ID
  • AWS Secret Key Access
  • Default Region Name (can be seen under Global settings on your AWS console)
  • Default output format (must be left blank)
>>> from pycaret.datasets import get_data
>>> jewellery = get_data('jewellery')
>>> experiment_name = setup(data = jewellery)
>>> kmeans = create_model('kmeans')
>>> deploy_model(model = kmeans, model_name = 'deploy_kmeans', platform = 'aws', authentication = {'bucket' : 'bucket-name'})

Before deploying a model to Google Cloud Platform (GCP), project must be created either using command line or GCP console. Once project is created, you must create a service account and download the service account key as a JSON file, which is then used to set environment variable.

Learn more : https://cloud.google.com/docs/authentication/production

>>> from pycaret.datasets import get_data
>>> jewellery = get_data('jewellery')
>>> experiment_name = setup(data = jewellery)
>>> kmeans = create_model('kmeans')
>>> os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = 'c:/path-to-json-file.json'
>>> deploy_model(model = kmeans, model_name = 'deploy_kmeans', platform = 'gcp', authentication = {'project' : 'project-name', 'bucket' : 'bucket-name'})

Before deploying a model to Microsoft Azure, environment variables for connection string must be set. Connection string can be obtained from ‘Access Keys’ of your storage account in Azure.

>>> from pycaret.datasets import get_data
>>> jewellery = get_data('jewellery')
>>> experiment_name = setup(data = jewellery)
>>> kmeans = create_model('kmeans')
>>> os.environ['AZURE_STORAGE_CONNECTION_STRING'] = 'connection-string-here'
>>> deploy_model(model = kmeans, model_name = 'deploy_kmeans', platform = 'azure', authentication = {'container' : 'container-name'})
Parameters:
  • model (object) – A trained model object should be passed as an estimator.
  • model_name (string) – Name of model to be passed as a string.
  • platform (string) – Name of platform for deployment. Currently accepts: ‘aws’, ‘gcp’, ‘azure’
  • authentication (dict) –

    Dictionary of applicable authentication tokens.

    When platform = ‘aws’: {‘bucket’ : ‘name of bucket’}

    When platform = ‘gcp’: {‘project’: ‘name of project’, ‘bucket’ : ‘name of bucket’}

    When platform = ‘azure’: {‘container’: ‘name of container’}

Returns:

Return type:

Success_Message

Warning

  • This function uses file storage services to deploy the model on a cloud platform. As such, this is efficient for batch use. Where the production objective is to obtain predictions at an instance level, this may not be an efficient choice as it transmits the binary pickle file between your local python environment and the platform.
pycaret.clustering.save_model(model, model_name, model_only=False, verbose=True)

This function saves the transformation pipeline and trained model object into the current active directory as a pickle file for later use.

Example

>>> from pycaret.datasets import get_data
>>> jewellery = get_data('jewellery')
>>> experiment_name = setup(data = jewellery, normalize = True)
>>> kmeans = create_model('kmeans')
>>> save_model(kmeans, 'kmeans_model_23122019')

This will save the transformation pipeline and model as a binary pickle file in the current directory.

Parameters:
  • model (object, default = none) – A trained model object should be passed.
  • model_name (string, default = none) – Name of pickle file to be passed as a string.
  • model_only (bool, default = False) – When set to True, only trained model object is saved and all the transformations are ignored.
  • verbose (bool, default = True) – When set to False, success message is not printed.
Returns:

Return type:

Success_Message

pycaret.clustering.load_model(model_name, platform=None, authentication=None, verbose=True)

This function loads a previously saved transformation pipeline and model from the current active directory into the current python environment. Load object must be a pickle file.

Example

>>> saved_lr = load_model('kmeans_model_23122019')

This will load the previously saved model in saved_lr variable. The file must be in the current directory.

Parameters:
  • model_name (string, default = none) – Name of pickle file to be passed as a string.
  • platform (string, default = None) – Name of platform, if loading model from cloud. Currently available options are: ‘aws’, ‘gcp’, ‘azure’.
  • authentication (dict) –

    dictionary of applicable authentication tokens.

    When platform = ‘aws’: {‘bucket’ : ‘name of bucket’}

    When platform = ‘gcp’: {‘project’: ‘name of project’, ‘bucket’ : ‘name of bucket’}

    When platform = ‘azure’: {‘container’: ‘name of container’}

  • verbose (Boolean, default = True) – Success message is not printed when verbose is set to False.
Returns:

Return type:

Model Object

pycaret.clustering.models()

Returns table of models available in model library.

Example

>>> all_models = models()

This will return a pandas.DataFrame with all available models and their metadata.

Returns:
Return type:pandas.DataFrame
pycaret.clustering.get_logs(experiment_name=None, save=False)

Returns a table with experiment logs consisting of run details, parameters, metrics and tags.

Example

>>> logs = get_logs()

This will return a pandas.DataFrame.

Parameters:
  • experiment_name (string, default = None) – When set to None, the current active run is used.
  • save (bool, default = False) – When set to True, csv file is saved in current directory.
Returns:

Return type:

pandas.DataFrame

pycaret.clustering.get_config(variable)

This function is used to access global environment variables. Following variables can be accessed:

  • X: Transformed dataset
  • data_: Original dataset
  • seed: random state set through session_id
  • prep_pipe: Transformation pipeline configured through setup
  • prep_param: prep_param configured through setup
  • n_jobs_param: n_jobs parameter used in model training
  • html_param: html_param configured through setup
  • exp_name_log: Name of experiment set through setup
  • logging_param: log_experiment param set through setup
  • log_plots_param: log_plots param set through setup
  • USI: Unique session ID parameter set through setup

Example

>>> X = get_config('X')

This will return transformed dataset.

Returns:
Return type:variable
pycaret.clustering.set_config(variable, value)

This function is used to reset global environment variables. Following variables can be accessed:

  • X: Transformed dataset
  • data_: Original dataset
  • seed: random state set through session_id
  • prep_pipe: Transformation pipeline configured through setup
  • prep_param: prep_param configured through setup
  • n_jobs_param: n_jobs parameter used in model training
  • html_param: html_param configured through setup
  • exp_name_log: Name of experiment set through setup
  • logging_param: log_experiment param set through setup
  • log_plots_param: log_plots param set through setup
  • USI: Unique session ID parameter set through setup

Example

>>> set_config('seed', 123)

This will set the global seed to ‘123’.

pycaret.clustering.get_system_logs()

Read and print ‘logs.log’ file from current active directory

pycaret.clustering.get_clusters(data, model=None, num_clusters=4, ignore_features=None, normalize=True, transformation=False, pca=False, pca_components=0.99, ignore_low_variance=False, combine_rare_levels=False, rare_level_threshold=0.1, remove_multicollinearity=False, multicollinearity_threshold=0.9, n_jobs=None)

This function is callable from any external environment without requiring setup() initialization.
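
Example

A minimal usage sketch based on the signature above (assumes ‘jewellery’ is available via get_data, as in the earlier examples; the variable name cluster_df is illustrative):

>>> from pycaret.datasets import get_data
>>> from pycaret.clustering import get_clusters
>>> jewellery = get_data('jewellery')
>>> cluster_df = get_clusters(data = jewellery, model = 'kmeans', num_clusters = 4)

This is expected to return the dataset with assigned cluster labels appended.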

Anomaly

pycaret.anomaly.setup(data, categorical_features=None, categorical_imputation='constant', ordinal_features=None, high_cardinality_features=None, numeric_features=None, numeric_imputation='mean', date_features=None, ignore_features=None, normalize=False, normalize_method='zscore', transformation=False, transformation_method='yeo-johnson', handle_unknown_categorical=True, unknown_categorical_method='least_frequent', pca=False, pca_method='linear', pca_components=None, ignore_low_variance=False, combine_rare_levels=False, rare_level_threshold=0.1, bin_numeric_features=None, remove_multicollinearity=False, multicollinearity_threshold=0.9, group_features=None, group_names=None, supervised=False, supervised_target=None, n_jobs=-1, html=True, session_id=None, log_experiment=False, experiment_name=None, log_plots=False, log_profile=False, log_data=False, silent=False, verbose=True, profile=False)

This function initializes the environment in pycaret. setup() must be called before executing any other function in pycaret. It takes one mandatory parameter: data.

Example

>>> from pycaret.datasets import get_data
>>> anomaly = get_data('anomaly')
>>> experiment_name = setup(data = anomaly, normalize = True)

‘anomaly’ is a pandas.DataFrame.

Parameters:
  • data (pandas.DataFrame) – Shape (n_samples, n_features) where n_samples is the number of samples and n_features is the number of features.
  • categorical_features (string, default = None) – If the inferred data types are not correct, categorical_features can be used to overwrite the inferred type. If when running setup the type of ‘column1’ is inferred as numeric instead of categorical, then this parameter can be used to overwrite the type by passing categorical_features = [‘column1’].
  • categorical_imputation (string, default = 'constant') – If missing values are found in categorical features, they will be imputed with a constant ‘not_available’ value. The other available option is ‘mode’ which imputes the missing value using most frequent value in the training dataset.
  • ordinal_features (dictionary, default = None) – When the data contains ordinal features, they must be encoded differently using the ordinal_features param. If the data has a categorical variable with values of ‘low’, ‘medium’, ‘high’ and it is known that low < medium < high, then it can be passed as ordinal_features = { ‘column_name’ : [‘low’, ‘medium’, ‘high’] }. The list sequence must be in increasing order from lowest to highest.
  • high_cardinality_features (string, default = None) – When the data contains features with high cardinality, they can be compressed into fewer levels by passing them as a list of column names with high cardinality. Features are compressed using frequency distribution. As such, the original features are replaced with the frequency distribution and converted into a numeric variable.
  • numeric_features (string, default = None) – If the inferred data types are not correct, numeric_features can be used to overwrite the inferred type. If when running setup the type of ‘column1’ is inferred as a categorical instead of numeric, then this parameter can be used to overwrite by passing numeric_features = [‘column1’].
  • numeric_imputation (string, default = 'mean') – If missing values are found in numeric features, they will be imputed with the mean value of the feature. The other available options are ‘median’ which imputes the value using the median value in the training dataset and ‘zero’ which replaces missing values with zeroes.
  • date_features (string, default = None) – If the data has a DateTime column that is not automatically detected when running setup, this parameter can be used by passing date_features = ‘date_column_name’. It can work with multiple date columns. Date columns are not used in modeling. Instead, feature extraction is performed and date columns are dropped from the dataset. If the date column includes a time stamp, features related to time will also be extracted.
  • ignore_features (string, default = None) – If any feature should be ignored for modeling, it can be passed to the param ignore_features. The ID and DateTime columns when inferred, are automatically set to ignore for modeling.
  • normalize (bool, default = False) – When set to True, the feature space is transformed using the normalize_method param. Generally, linear algorithms perform better with normalized data; however, the results may vary and it is advised to run multiple experiments to evaluate the benefit of normalization.
  • normalize_method (string, default = 'zscore') –

    Defines the method to be used for normalization. By default, normalize method is set to ‘zscore’. The standard zscore is calculated as z = (x - u) / s. The other available options are:

    ’minmax’ : scales and translates each feature individually such that it is in the range of 0 - 1.
    ’maxabs’ : scales and translates each feature individually such that the maximal absolute value of each feature will be 1.0. It does not shift/center the data, and thus does not destroy any sparsity.
    ’robust’ : scales and translates each feature according to the Interquartile range. When the dataset contains outliers, robust scaler often gives better results.
  • transformation (bool, default = False) – When set to True, a power transformation is applied to make the data more normal / Gaussian-like. This is useful for modeling issues related to heteroscedasticity or other situations where normality is desired. The optimal parameter for stabilizing variance and minimizing skewness is estimated through maximum likelihood.
  • transformation_method (string, default = 'yeo-johnson') – Defines the method for transformation. By default, the transformation method is set to ‘yeo-johnson’. The other available option is ‘quantile’ transformation. Both transformations transform the feature set to follow a Gaussian-like or normal distribution. Note that the quantile transformer is non-linear and may distort linear correlations between variables measured at the same scale.
  • handle_unknown_categorical (bool, default = True) – When set to True, unknown categorical levels in new / unseen data are replaced by the most or least frequent level as learned in the training data. The method is defined under the unknown_categorical_method param.
  • unknown_categorical_method (string, default = 'least_frequent') – Method used to replace unknown categorical levels in unseen data. Method can be set to ‘least_frequent’ or ‘most_frequent’.
  • pca (bool, default = False) – When set to True, dimensionality reduction is applied to project the data into a lower dimensional space using the method defined in pca_method param. In supervised learning pca is generally performed when dealing with high feature space and memory is a constraint. Note that not all datasets can be decomposed efficiently using a linear PCA technique and that applying PCA may result in loss of information. As such, it is advised to run multiple experiments with different pca_methods to evaluate the impact.
  • pca_method (string, default = 'linear') –

    The ‘linear’ method performs Linear dimensionality reduction using Singular Value Decomposition. The other available options are:

    kernel : dimensionality reduction through the use of RBF kernel.

    incremental : replacement for ‘linear’ pca when the dataset to be decomposed is too large to fit in memory.
  • pca_components (int/float, default = 0.99) – Number of components to keep. If pca_components is a float, it is treated as a target percentage for information retention. When pca_components is an integer it is treated as the number of features to be kept. pca_components must be strictly less than the original number of features in the dataset.
  • ignore_low_variance (bool, default = False) – When set to True, all categorical features with statistically insignificant variances are removed from the dataset. The variance is calculated using the ratio of unique values to the number of samples, and the ratio of the most common value to the frequency of the second most common value.
  • combine_rare_levels (bool, default = False) – When set to True, all levels in categorical features below the threshold defined in rare_level_threshold param are combined together as a single level. There must be at least two levels under the threshold for this to take effect. rare_level_threshold represents the percentile distribution of level frequency. Generally, this technique is applied to limit a sparse matrix caused by high numbers of levels in categorical features.
  • rare_level_threshold (float, default = 0.1) – Percentile distribution below which rare categories are combined. Only comes into effect when combine_rare_levels is set to True.
  • bin_numeric_features (list, default = None) – When a list of numeric features is passed they are transformed into categorical features using KMeans, where values in each bin have the same nearest center of a 1D k-means cluster. The number of clusters is determined based on the ‘sturges’ method. It is only optimal for gaussian data and underestimates the number of bins for large non-gaussian datasets.
  • remove_multicollinearity (bool, default = False) – When set to True, the variables with inter-correlations higher than the threshold defined under the multicollinearity_threshold param are dropped. When two features are highly correlated with each other, the feature with higher average correlation in the feature space is dropped.
  • multicollinearity_threshold (float, default = 0.9) – Threshold used for dropping the correlated features. Only comes into effect when remove_multicollinearity is set to True.
  • group_features (list or list of list, default = None) – When a dataset contains features that have related characteristics, the group_features param can be used for statistical feature extraction. For example, if a dataset has numeric features that are related with each other (i.e ‘Col1’, ‘Col2’, ‘Col3’), a list containing the column names can be passed under group_features to extract statistical information such as the mean, median, mode and standard deviation.
  • group_names (list, default = None) – When group_features is passed, a name of the group can be passed into the group_names param as a list containing strings. The length of a group_names list must be equal to the length of group_features. When the length doesn’t match or the name is not passed, new features are sequentially named such as group_1, group_2 etc.
  • supervised (bool, default = False) – When set to True, supervised_target column is ignored for transformation. This param is only for internal use.
  • supervised_target (string, default = None) – Name of supervised_target column that will be ignored for transformation. Only applicable when tune_model() function is used. This param is only for internal use.
  • n_jobs (int, default = -1) – The number of jobs to run in parallel (for functions that support parallel processing). -1 means using all processors. To run all functions on a single processor, set n_jobs to None.
  • html (bool, default = True) – If set to False, prevents runtime display of monitor. This must be set to False when using an environment that doesn’t support HTML.
  • session_id (int, default = None) – If None, a random seed is generated and returned in the Information grid. The unique number is then distributed as a seed in all functions used during the experiment. This can be used for later reproducibility of the entire experiment.
  • log_experiment (bool, default = False) – When set to True, all metrics and parameters are logged on MLFlow server.
  • experiment_name (str, default = None) – Name of experiment for logging. When set to None, ‘clf’ is by default used as alias for the experiment name.
  • log_plots (bool, default = False) – When set to True, specific plots are logged in MLflow as a png file. By default, it is set to False.
  • log_profile (bool, default = False) – When set to True, data profile is also logged on MLflow as a html file. By default, it is set to False.
  • silent (bool, default = False) – When set to True, confirmation of data types is not required. All preprocessing will be performed assuming automatically inferred data types. Not recommended for direct use except for established pipelines.
  • verbose (Boolean, default = True) – Information grid is not printed when verbose is set to False.
  • profile (bool, default = False) – If set to True, a data profile for Exploratory Data Analysis will be displayed in an interactive HTML report.
Returns:

  • info_grid – Information grid is printed.
  • environment – This function returns various outputs that are stored in variable as tuple. They are used by other functions in pycaret.

pycaret.anomaly.create_model(model=None, fraction=0.05, verbose=True, system=True, **kwargs)

This function creates a model on the dataset passed as a data param during the setup stage. setup() function must be called before using create_model().

This function returns a trained model object.

Example

>>> from pycaret.datasets import get_data
>>> anomaly = get_data('anomaly')
>>> experiment_name = setup(data = anomaly, normalize = True)
>>> knn = create_model('knn')

This will return trained k-Nearest Neighbors model.

Parameters:
  • model (string / object, default = None) –

    Enter ID of the models available in model library or pass an untrained model object consistent with fit / predict API to train and evaluate model. List of models available in model library (ID - Model):

    • ’abod’ - Angle-base Outlier Detection
    • ’cluster’ - Clustering-Based Local Outlier
    • ’cof’ - Connectivity-Based Outlier Factor
    • ’histogram’ - Histogram-based Outlier Detection
    • ’knn’ - k-Nearest Neighbors Detector
    • ’lof’ - Local Outlier Factor
    • ’svm’ - One-class SVM detector
    • ’pca’ - Principal Component Analysis
    • ’mcd’ - Minimum Covariance Determinant
    • ’sod’ - Subspace Outlier Detection
    • ’sos’ - Stochastic Outlier Selection
  • fraction (float, default = 0.05) – The percentage / proportion of outliers in the dataset.
  • verbose (Boolean, default = True) – Status update is not printed when verbose is set to False.
  • system (Boolean, default = True) – Must remain True at all times. Only to be changed by internal functions.
  • **kwargs – Additional keyword arguments to pass to the estimator.
Returns:

Trained model object.

Return type:

model
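
A sketch showing the fraction parameter described above (0.1 is an illustrative value):

>>> knn = create_model('knn', fraction = 0.1)

This will train the k-Nearest Neighbors detector assuming 10% of the dataset are outliers.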

pycaret.anomaly.assign_model(model, transformation=False, score=True, verbose=True)

This function flags each of the data points in the dataset passed during the setup stage as either an outlier or inlier (1 = outlier, 0 = inlier) using the trained model object passed as the model param. create_model() function must be called before using assign_model().

This function returns a dataframe with an Outlier flag (1 = outlier, 0 = inlier) and a decision score, when score is set to True.

Example

>>> from pycaret.datasets import get_data
>>> anomaly = get_data('anomaly')
>>> experiment_name = setup(data = anomaly, normalize = True)
>>> knn = create_model('knn')
>>> knn_df = assign_model(knn)

This will return a dataframe with inferred outliers using trained model.

Parameters:
  • model (trained model object, default = None) –
  • transformation (bool, default = False) – When set to True, assigned outliers are returned on transformed dataset instead of original dataset passed during setup().
  • score (Boolean, default = True) – The outlier scores of the training data. The higher, the more abnormal. Outliers tend to have higher scores. This value is available once the model is fitted. If set to False, it will only return the flag (1 = outlier, 0 = inlier).
  • verbose (Boolean, default = True) – Status update is not printed when verbose is set to False.
Returns:

Returns a dataframe with inferred outliers using a trained model.

Return type:

pandas.DataFrame

pycaret.anomaly.plot_model(model, plot='tsne', feature=None, scale=1, save=False, system=True)

This function takes a trained model object and returns a plot on the dataset passed during setup stage. This function internally calls assign_model before generating a plot.

Example

>>> from pycaret.datasets import get_data
>>> anomaly = get_data('anomaly')
>>> experiment_name = setup(data = anomaly, normalize = True)
>>> knn = create_model('knn')
>>> plot_model(knn)
Parameters:
  • model (object) – A trained model object can be passed. Model must be created using create_model().
  • plot (string, default = 'tsne') –

    Enter abbreviation of type of plot. The current list of plots supported are (Plot - Name):

    • ’tsne’ - t-SNE (3d) Dimension Plot
    • ’umap’ - UMAP Dimensionality Plot
  • feature (string, default = None) – Feature column is used as a hoverover tooltip. By default, the first column of the dataset is chosen as the hoverover tooltip when no feature is passed.
  • scale (float, default = 1) – The resolution scale of the figure.
  • save (Boolean, default = False) – Plot is saved as png file in local directory when save parameter set to True.
  • system (Boolean, default = True) – Must remain True at all times. Only to be changed by internal functions.
Returns:

Prints the visual plot.

Return type:

Visual_Plot

pycaret.anomaly.tune_model(model=None, supervised_target=None, method='drop', estimator=None, optimize=None, custom_grid=None, fold=10, verbose=True)

This function tunes the fraction parameter using a predefined grid with the objective of optimizing a supervised learning metric as defined in the optimize param. You can choose the supervised estimator from a large library available in pycaret. By default, supervised estimator is Linear.

This function returns the tuned model object.

Example

>>> from pycaret.datasets import get_data
>>> boston = get_data('boston')
>>> experiment_name = setup(data = boston, normalize = True)
>>> tuned_knn = tune_model(model = 'knn', supervised_target = 'medv')

This will return tuned k-Nearest Neighbors model.

Parameters:
  • model (string, default = None) –

    Enter ID of the models available in model library (ID - Model):

    • ’abod’ - Angle-base Outlier Detection
    • ’cluster’ - Clustering-Based Local Outlier
    • ’cof’ - Connectivity-Based Outlier Factor
    • ’histogram’ - Histogram-based Outlier Detection
    • ’knn’ - k-Nearest Neighbors Detector
    • ’lof’ - Local Outlier Factor
    • ’svm’ - One-class SVM detector
    • ’pca’ - Principal Component Analysis
    • ’mcd’ - Minimum Covariance Determinant
    • ’sod’ - Subspace Outlier Detection
    • ’sos’ - Stochastic Outlier Selection
  • supervised_target (string) – Name of the target column for supervised learning.
  • method (string, default = 'drop') – When method is set to ‘drop’, outlier rows are dropped from the training dataset of the supervised estimator; when method is set to ‘surrogate’, the decision function and label are used as features without dropping the outliers from the training dataset (see the sketch after the Returns section below).
  • estimator (string, default = None) –

    For Classification (ID - Name):

    • ’lr’ - Logistic Regression
    • ’knn’ - K Nearest Neighbour
    • ’nb’ - Naive Bayes
    • ’dt’ - Decision Tree Classifier
    • ’svm’ - SVM - Linear Kernel
    • ’rbfsvm’ - SVM - Radial Kernel
    • ’gpc’ - Gaussian Process Classifier
    • ’mlp’ - Multi Level Perceptron
    • ’ridge’ - Ridge Classifier
    • ’rf’ - Random Forest Classifier
    • ’qda’ - Quadratic Discriminant Analysis
    • ’ada’ - Ada Boost Classifier
    • ’gbc’ - Gradient Boosting Classifier
    • ’lda’ - Linear Discriminant Analysis
    • ’et’ - Extra Trees Classifier
    • ’xgboost’ - Extreme Gradient Boosting
    • ’lightgbm’ - Light Gradient Boosting
    • ’catboost’ - CatBoost Classifier

    For Regression (ID - Name):

    • ’lr’ - Linear Regression
    • ’lasso’ - Lasso Regression
    • ’ridge’ - Ridge Regression
    • ’en’ - Elastic Net
    • ’lar’ - Least Angle Regression
    • ’llar’ - Lasso Least Angle Regression
    • ’omp’ - Orthogonal Matching Pursuit
    • ’br’ - Bayesian Ridge
    • ’ard’ - Automatic Relevance Determ.
    • ’par’ - Passive Aggressive Regressor
    • ’ransac’ - Random Sample Consensus
    • ’tr’ - TheilSen Regressor
    • ’huber’ - Huber Regressor
    • ’kr’ - Kernel Ridge
    • ’svm’ - Support Vector Machine
    • ’knn’ - K Neighbors Regressor
    • ’dt’ - Decision Tree
    • ’rf’ - Random Forest
    • ’et’ - Extra Trees Regressor
    • ’ada’ - AdaBoost Regressor
    • ’gbr’ - Gradient Boosting
    • ’mlp’ - Multi Level Perceptron
    • ’xgboost’ - Extreme Gradient Boosting
    • ’lightgbm’ - Light Gradient Boosting
    • ’catboost’ - CatBoost Regressor

    If set to None, Linear model is used by default for both classification and regression tasks.

  • optimize (string, default = None) –
    For Classification tasks:
    Accuracy, AUC, Recall, Precision, F1, Kappa
    For Regression tasks:
    MAE, MSE, RMSE, R2, RMSLE, MAPE

    If set to None, default is ‘Accuracy’ for classification and ‘R2’ for regression tasks.

  • custom_grid (list, default = None) – By default, a pre-defined list of fraction values is iterated over to optimize the supervised objective. To overwrite the default iteration, pass a list of fraction values to iterate over in the custom_grid param.
  • fold (integer, default = 10) – Number of folds to be used in Kfold CV. Must be at least 2.
  • verbose (Boolean, default = True) – Status update is not printed when verbose is set to False.
Returns:

  • Visual_Plot – Visual plot with fraction param on x-axis with metric to optimize on y-axis. Also, prints the best model metric.
  • model – trained model object with best fraction param.
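
As an illustration of the method, estimator, optimize and custom_grid parameters described above, the following is a minimal sketch: the fraction grid values are arbitrary, and ‘medv’ is the regression target of the boston dataset loaded in the example above.

>>> from pycaret.datasets import get_data
>>> boston = get_data('boston')
>>> experiment_name = setup(data = boston, normalize = True)
>>> tuned_knn = tune_model(model = 'knn', supervised_target = 'medv', method = 'surrogate', estimator = 'rf', optimize = 'R2', custom_grid = [0.01, 0.05, 0.10])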

pycaret.anomaly.save_model(model, model_name, model_only=False, verbose=True)

This function saves the transformation pipeline and trained model object into the current active directory as a pickle file for later use.

Example

>>> from pycaret.datasets import get_data
>>> anomaly = get_data('anomaly')
>>> experiment_name = setup(data = anomaly, normalize = True)
>>> knn = create_model('knn')
>>> save_model(knn, 'knn_model_23122019')

This will save the transformation pipeline and model as a binary pickle file in the current directory.

Parameters:
  • model (object, default = none) – A trained model object should be passed.
  • model_name (string, default = none) – Name of pickle file to be passed as a string.
  • model_only (bool, default = False) – When set to True, only trained model object is saved and all the transformations are ignored.
  • verbose (bool, default = True) – When set to False, success message is not printed.
Returns:

Return type:

Success_Message

pycaret.anomaly.load_model(model_name, platform=None, authentication=None, verbose=True)

This function loads a previously saved transformation pipeline and model from the current active directory into the current python environment. Load object must be a pickle file.

Example

>>> saved_knn = load_model('knn_model_23122019')

This will load the previously saved model into the saved_knn variable. The file must be in the current directory.

Parameters:
  • model_name (string, default = none) – Name of pickle file to be passed as a string.
  • platform (string, default = None) – Name of platform, if loading model from cloud. Currently available options are: ‘aws’, ‘gcp’, ‘azure’.
  • authentication (dict) –

    dictionary of applicable authentication tokens.

    When platform = ‘aws’: {‘bucket’ : ‘name of bucket’}

    When platform = ‘gcp’: {‘project’: ‘name of project’, ‘bucket’ : ‘name of bucket’}

    When platform = ‘azure’: {‘container’: ‘name of container’}

  • verbose (Boolean, default = True) – Success message is not printed when verbose is set to False.
Returns:

Return type:

Model Object
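
When the pickle file was previously deployed to cloud storage with deploy_model(), the same call can load it by passing the platform and authentication params. A minimal sketch, where 'deploy_knn' and 'bucket-name' are placeholder names:

>>> saved_knn = load_model('deploy_knn', platform = 'aws', authentication = {'bucket' : 'bucket-name'})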

pycaret.anomaly.predict_model(model, data)

This function is used to predict new data using a trained model. It requires a trained model object created using one of the functions in pycaret that returns a trained model object. New data must be passed to the data param as a pandas.DataFrame.

Example

>>> from pycaret.datasets import get_data
>>> anomaly = get_data('anomaly')
>>> experiment_name = setup(data = anomaly)
>>> knn = create_model('knn')
>>> knn_predictions = predict_model(model = knn, data = anomaly)
Parameters:
  • model (object / string, default = None) – When model is passed as string, load_model() is called internally to load the pickle file from active directory or cloud platform when platform param is passed.
  • data (pandas.DataFrame) – Shape (n_samples, n_features) where n_samples is the number of samples and n_features is the number of features. All features used during training must be present in the new dataset.
Returns:

Information grid is printed when data is None.

Return type:

info_grid
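
Continuing the example above, the model can also be passed as a string, in which case load_model() is called internally to read the pickle file from the active directory before generating predictions. This sketch assumes a model was saved earlier under the name used in the save_model() example:

>>> knn_predictions = predict_model(model = 'knn_model_23122019', data = anomaly)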

Warning

  • The behavior of predict_model changed in version 2.1 without backward compatibility.

As such, pipelines trained using version <= 2.0 may not work for inference with version >= 2.1. You can either retrain your models with a newer version or downgrade the version for inference.

pycaret.anomaly.deploy_model(model, model_name, platform, authentication)

This function deploys the transformation pipeline and trained model object for production use. The platform of deployment can be defined under the platform param along with the applicable authentication tokens which are passed as a dictionary to the authentication param.

Before deploying a model to an AWS S3 bucket (‘aws’), environment variables must be configured using the command line interface. To configure the AWS environment variables, run aws configure in your terminal. The following information is required and can be generated using the Identity and Access Management (IAM) portal of your AWS console account:

  • AWS Access Key ID
  • AWS Secret Access Key
  • Default Region Name (can be seen under Global settings on your AWS console)
  • Default output format (must be left blank)
>>> from pycaret.datasets import get_data
>>> anomaly = get_data('anomaly')
>>> experiment_name = setup(data = anomaly)
>>> knn = create_model('knn')
>>> deploy_model(model = knn, model_name = 'deploy_knn', platform = 'aws', authentication = {'bucket' : 'bucket-name'})

Before deploying a model to Google Cloud Platform (GCP), a project must be created using either the command line or the GCP console. Once the project is created, you must create a service account and download the service account key as a JSON file, which is then used to set the environment variable, as shown in the example below.

Learn more : https://cloud.google.com/docs/authentication/production

>>> from pycaret.datasets import get_data
>>> anomaly = get_data('anomaly')
>>> experiment_name = setup(data = anomaly)
>>> knn = create_model('knn')
>>> os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = 'c:/path-to-json-file.json'
>>> deploy_model(model = knn, model_name = 'deploy_knn', platform = 'gcp', authentication = {'project' : 'project-name', 'bucket' : 'bucket-name'})

Before deploying a model to Microsoft Azure, environment variables for connection string must be set. Connection string can be obtained from ‘Access Keys’ of your storage account in Azure.

>>> from pycaret.datasets import get_data
>>> anomaly = get_data('anomaly')
>>> experiment_name = setup(data = anomaly)
>>> knn = create_model('knn')
>>> os.environ['AZURE_STORAGE_CONNECTION_STRING'] = 'connection-string-here'
>>> deploy_model(model = knn, model_name = 'deploy_knn', platform = 'azure', authentication = {'container' : 'container-name'})
Parameters:
  • model (object) – A trained model object should be passed as an estimator.
  • model_name (string) – Name of model to be passed as a string.
  • platform (string) – Name of platform for deployment. Currently accepts: ‘aws’, ‘gcp’, ‘azure’
  • authentication (dict) –

    Dictionary of applicable authentication tokens.

    When platform = ‘aws’: {‘bucket’ : ‘name of bucket’}

    When platform = ‘gcp’: {‘project’: ‘name of project’, ‘bucket’ : ‘name of bucket’}

    When platform = ‘azure’: {‘container’: ‘name of container’}

Returns:

Return type:

Success_Message

Warning

  • This function uses file storage services to deploy the model on a cloud platform. As such, it is efficient for batch use. Where the production objective is to obtain predictions at an instance level, this may not be an efficient choice, as it transmits the binary pickle file between your local python environment and the platform.
pycaret.anomaly.models()

Returns table of models available in model library.

Example

>>> all_models = models()

This will return pandas dataframe with all available models and their metadata.

Returns:

Return type:

pandas.DataFrame

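A minimal sketch of how the returned table can be used, assuming setup() has already been called and that the model IDs form the DataFrame index (as in pycaret 2.x): every available model can be created in a loop.

>>> all_models = models()
>>> trained = [create_model(model_id) for model_id in all_models.index]
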
pycaret.anomaly.get_logs(experiment_name=None, save=False)

Returns a table with experiment logs consisting of run details, parameters, metrics and tags.

Example

>>> logs = get_logs()

This will return pandas dataframe.

Parameters:
  • experiment_name (string, default = None) – When set to None, the current active run is used.
  • save (bool, default = False) – When set to True, csv file is saved in current directory.
Returns:

Return type:

pandas.DataFrame

pycaret.anomaly.get_config(variable)

This function is used to access global environment variables. The following variables can be accessed:

  • X: Transformed dataset
  • data_: Original dataset
  • seed: random state set through session_id
  • prep_pipe: Transformation pipeline configured through setup
  • prep_param: prep_param configured through setup
  • n_jobs_param: n_jobs parameter used in model training
  • html_param: html_param configured through setup
  • exp_name_log: Name of experiment set through setup
  • logging_param: log_experiment param set through setup
  • log_plots_param: log_plots param set through setup
  • USI: Unique session ID parameter set through setup

Example

>>> X = get_config('X')

This will return the transformed dataset.

Returns:

Return type:

variable

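A common use is retrieving the fitted preprocessing pipeline to apply the same transformations to new data outside of pycaret. The sketch below assumes that prep_pipe exposes a scikit-learn style transform method, as it does in the pycaret 2.x internals:

>>> from pycaret.datasets import get_data
>>> anomaly = get_data('anomaly')
>>> experiment_name = setup(data = anomaly, normalize = True)
>>> prep_pipe = get_config('prep_pipe')
>>> transformed = prep_pipe.transform(anomaly)
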
pycaret.anomaly.set_config(variable, value)

This function is used to reset global environment variables. The following variables can be changed:

  • X: Transformed dataset
  • data_: Original dataset
  • seed: random state set through session_id
  • prep_pipe: Transformation pipeline configured through setup
  • prep_param: prep_param configured through setup
  • n_jobs_param: n_jobs parameter used in model training
  • html_param: html_param configured through setup
  • exp_name_log: Name of experiment set through setup
  • logging_param: log_experiment param set through setup
  • log_plots_param: log_plots param set through setup
  • USI: Unique session ID parameter set through setup

Example

>>> set_config('seed', 123)

This will set the global seed to ‘123’.

pycaret.anomaly.get_system_logs()

Reads and prints the ‘logs.log’ file from the current active directory.

pycaret.anomaly.get_outliers(data, model=None, fraction=0.05, ignore_features=None, normalize=True, transformation=False, pca=False, pca_components=0.99, ignore_low_variance=False, combine_rare_levels=False, rare_level_threshold=0.1, remove_multicollinearity=False, multicollinearity_threshold=0.9, n_jobs=None)

Magic function to get outliers in Power Query / Power BI.
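
Although intended for Power Query / Power BI, the function can also be called from a regular Python session without running setup() first. A minimal sketch, where the ‘knn’ ID follows the model library listed earlier and the returned dataframe is expected to contain the original features plus outlier labels and scores:

>>> from pycaret.datasets import get_data
>>> from pycaret.anomaly import get_outliers
>>> anomaly = get_data('anomaly')
>>> outliers = get_outliers(data = anomaly, model = 'knn', fraction = 0.05)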

NLP

pycaret.nlp.setup(data, target=None, custom_stopwords=None, html=True, session_id=None, log_experiment=False, experiment_name=None, log_plots=False, log_data=False, verbose=True)

This function initializes the environment in pycaret. setup() must be called before executing any other function in pycaret. It takes one mandatory parameter: data, a pandas.Dataframe or an object of type list. If a pandas.Dataframe is passed, the target column containing text must be specified. When the data passed is of type list, no target parameter is required. All other parameters are optional. This module only supports the English language at this time.

Example

>>> from pycaret.datasets import get_data
>>> kiva = get_data('kiva')
>>> experiment_name = setup(data = kiva, target = 'en')

‘kiva’ is a pandas.Dataframe.

Parameters:
  • data (pandas.Dataframe or list) – pandas.Dataframe with shape (n_samples, n_features) where n_samples is the number of samples and n_features is the number of features, or object of type list with n length.
  • target (string) – If data is of type pandas.Dataframe, name of column containing text values must be passed as string.
  • custom_stopwords (list, default = None) – List containing custom stopwords.
  • html (bool, default = True) – If set to False, prevents the runtime display of the monitor. This must be set to False when using an environment that doesn't support HTML.
  • session_id (int, default = None) – If None, a random seed is generated and returned in the Information grid. The unique number is then distributed as a seed in all functions used during the experiment. This can be used for later reproducibility of the entire experiment.
  • log_experiment (bool, default = False) – When set to True, all metrics and parameters are logged on the MLFlow server.
  • experiment_name (str, default = None) – Name of experiment for logging. When set to None, ‘nlp’ is by default used as alias for the experiment name.
  • log_plots (bool, default = False) – When set to True, specific plots are logged in MLflow as a png file. By default, it is set to False.
  • log_data (bool, default = False) – When set to True, train and test dataset are logged as csv.
  • verbose (Boolean, default = True) – Information grid is not printed when verbose is set to False.
Returns:

  • info_grid – Information grid is printed.
  • environment – This function returns various outputs that are stored in variable as tuple. They are used by other functions in pycaret.

Warning

  • Some functionalities in pycaret.nlp require the English language models. The language models are not downloaded automatically when you install pycaret. You will have to download two models using your Anaconda Prompt or python command line interface. To download the models, please type the following in your command line:

    python -m spacy download en_core_web_sm
    python -m textblob.download_corpora

    Once downloaded, please restart your kernel and re-run the setup.

pycaret.nlp.create_model(model=None, multi_core=False, num_topics=None, verbose=True, system=True, **kwargs)

This function creates a model on the dataset passed as a data param during the setup stage. setup() function must be called before using create_model().

This function returns a trained model object.

Example

>>> from pycaret.datasets import get_data
>>> kiva = get_data('kiva')
>>> experiment_name = setup(data = kiva, target = 'en')
>>> lda = create_model('lda')

This will return trained Latent Dirichlet Allocation model.

Parameters:
  • model (string, default = None) –

    Enter ID of the model available in model library (ID - Model):

    • ’lda’ - Latent Dirichlet Allocation
    • ’lsi’ - Latent Semantic Indexing
    • ’hdp’ - Hierarchical Dirichlet Process
    • ’rp’ - Random Projections
    • ’nmf’ - Non-Negative Matrix Factorization
  • multi_core (Boolean, default = False) – True would utilize all CPU cores to parallelize and speed up model training. Only available for ‘lda’. For all other models, the multi_core parameter is ignored.
  • num_topics (integer, default = 4) – Number of topics to be created. If None, default is set to 4.
  • verbose (Boolean, default = True) – Status update is not printed when verbose is set to False.
  • system (Boolean, default = True) – Must remain True at all times. Only to be changed by internal functions.
  • **kwargs – Additional keyword arguments to pass to the estimator.
Returns:

Trained model object.

Return type:

model
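
The num_topics and **kwargs parameters can be combined to control the underlying estimator. In the sketch below, iterations is a gensim LDA argument passed through **kwargs and is not a pycaret parameter:

>>> lda_six = create_model('lda', num_topics = 6, multi_core = True, iterations = 100)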

pycaret.nlp.assign_model(model, verbose=True)

This function assigns each of the data points in the dataset passed during the setup stage to one of the topics using the trained model object passed as the model param. create_model() function must be called before using assign_model().

This function returns a pandas.Dataframe with topic weights, dominant topic and % of the dominant topic (where applicable).

Example

>>> from pycaret.datasets import get_data
>>> kiva = get_data('kiva')
>>> experiment_name = setup(data = kiva, target = 'en')
>>> lda = create_model('lda')
>>> lda_df = assign_model(lda)

This will return a pandas.Dataframe with inferred topics using trained model.

Parameters:
  • model (trained model object, default = None) –
  • verbose (Boolean, default = True) – Status update is not printed when verbose is set to False.
Returns:

Returns a DataFrame with inferred topics using trained model object.

Return type:

pandas.DataFrame

pycaret.nlp.plot_model(model=None, plot='frequency', topic_num=None, save=False, system=True)

This function takes a trained model object (optional) and returns a plot on the inferred dataset, by internally calling assign_model before generating the plot. If a model parameter is not passed, a plot on the entire dataset is returned instead of one at the topic level; as such, plot_model can be used with or without a model. When a trained model object is passed, the plot is based on the first topic, i.e. ‘Topic 0’. This can be changed using the topic_num param.

Example

>>> from pycaret.datasets import get_data
>>> kiva = get_data('kiva')
>>> experiment_name = setup(data = kiva, target = 'en')
>>> lda = create_model('lda')
>>> plot_model(lda, plot = 'frequency')

This will return a frequency plot on a trained Latent Dirichlet Allocation model for all documents in ‘Topic 0’. The topic number can be changed as follows:

>>> plot_model(lda, plot = 'frequency', topic_num = 'Topic 1')

This will now return a frequency plot on a trained LDA model for all documents inferred in ‘Topic 1’.

Alternatively, if following is used:

>>> plot_model(plot = 'frequency')

This will return frequency plot on the entire training corpus compiled during setup stage.

Parameters:
  • model (object, default = none) – A trained model object can be passed. Model must be created using create_model().
  • plot (string, default = 'frequency') –

    Enter abbreviation for type of plot. The current list of plots supported are (Name - Abbreviated String):

    • Word Token Frequency - ‘frequency’
    • Word Distribution Plot - ‘distribution’
    • Bigram Frequency Plot - ‘bigram’
    • Trigram Frequency Plot - ‘trigram’
    • Sentiment Polarity Plot - ‘sentiment’
    • Part of Speech Frequency - ‘pos’
    • t-SNE (3d) Dimension Plot - ‘tsne’
    • Topic Model (pyLDAvis) - ‘topic_model’
    • Topic Infer Distribution - ‘topic_distribution’
    • Wordcloud - ‘wordcloud’
    • UMAP Dimensionality Plot - ‘umap’
  • topic_num (string, default = None) – Topic number to be passed as a string. If set to None, default generation will be on ‘Topic 0’
  • save (Boolean, default = False) – Plot is saved as png file in local directory when save parameter set to True.
  • system (Boolean, default = True) – Must remain True all times. Only to be changed by internal functions.
Returns:

Prints the visual plot.

Return type:

Visual_Plot

Warning

  • ‘pos’ and ‘umap’ plots are not available at the model level. Hence the model parameter is ignored, and the result will always be based on the entire training corpus.
  • ‘topic_model’ plot is based on the pyLDAvis implementation. Hence it is not available for model = ‘lsi’, ‘rp’ and ‘nmf’.
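
To save several of the plots listed above for the same topic in one pass, the plot abbreviations can be iterated over with save = True. A minimal sketch continuing the lda example above (the selection of plots is arbitrary):

>>> for plot_type in ['frequency', 'bigram', 'sentiment', 'wordcloud']:
...     plot_model(lda, plot = plot_type, topic_num = 'Topic 1', save = True)
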
pycaret.nlp.tune_model(model=None, multi_core=False, supervised_target=None, estimator=None, optimize=None, custom_grid=None, auto_fe=True, fold=10, verbose=True)

This function tunes the num_topics model parameter using a predefined grid with the objective of optimizing a supervised learning metric as defined in the optimize param. You can choose the supervised estimator from a large library available in pycaret. By default, the supervised estimator is a linear / logistic model.

This function returns the tuned model object.

Example

>>> from pycaret.datasets import get_data
>>> kiva = get_data('kiva')
>>> experiment_name = setup(data = kiva, target = 'en')
>>> tuned_lda = tune_model(model = 'lda', supervised_target = 'status')

This will return a tuned Latent Dirichlet Allocation model.

Parameters:
  • model (string, default = None) –

    Enter ID of the models available in model library (ID - Model):

    • ’lda’ - Latent Dirichlet Allocation
    • ’lsi’ - Latent Semantic Indexing
    • ’hdp’ - Hierarchical Dirichlet Process
    • ’rp’ - Random Projections
    • ’nmf’ - Non-Negative Matrix Factorization
  • multi_core (Boolean, default = False) – True would utilize all CPU cores to parallelize and speed up model training. Only available for ‘lda’. For all other models, multi_core parameter is ignored.
  • supervised_target (string) – Name of the target column for supervised learning. If None, the model coherence value is used as the objective function.
  • estimator (string, default = None) –

    For Classification (ID - Name):

    • ’lr’ - Logistic Regression
    • ’knn’ - K Nearest Neighbour
    • ’nb’ - Naive Bayes
    • ’dt’ - Decision Tree Classifier
    • ’svm’ - SVM - Linear Kernel
    • ’rbfsvm’ - SVM - Radial Kernel
    • ’gpc’ - Gaussian Process Classifier
    • ’mlp’ - Multi Level Perceptron
    • ’ridge’ - Ridge Classifier
    • ’rf’ - Random Forest Classifier
    • ’qda’ - Quadratic Discriminant Analysis
    • ’ada’ - Ada Boost Classifier
    • ’gbc’ - Gradient Boosting Classifier
    • ’lda’ - Linear Discriminant Analysis
    • ’et’ - Extra Trees Classifier
    • ’xgboost’ - Extreme Gradient Boosting
    • ’lightgbm’ - Light Gradient Boosting
    • ’catboost’ - CatBoost Classifier

    For Regression (ID - Name):

    • ’lr’ - Linear Regression
    • ’lasso’ - Lasso Regression
    • ’ridge’ - Ridge Regression
    • ’en’ - Elastic Net
    • ’lar’ - Least Angle Regression
    • ’llar’ - Lasso Least Angle Regression
    • ’omp’ - Orthogonal Matching Pursuit
    • ’br’ - Bayesian Ridge
    • ’ard’ - Automatic Relevance Determ.
    • ’par’ - Passive Aggressive Regressor
    • ’ransac’ - Random Sample Consensus
    • ’tr’ - TheilSen Regressor
    • ’huber’ - Huber Regressor
    • ’kr’ - Kernel Ridge
    • ’svm’ - Support Vector Machine
    • ’knn’ - K Neighbors Regressor
    • ’dt’ - Decision Tree
    • ’rf’ - Random Forest
    • ’et’ - Extra Trees Regressor
    • ’ada’ - AdaBoost Regressor
    • ’gbr’ - Gradient Boosting
    • ’mlp’ - Multi Level Perceptron
    • ’xgboost’ - Extreme Gradient Boosting
    • ’lightgbm’ - Light Gradient Boosting
    • ’catboost’ - CatBoost Regressor

    If set to None, Linear / Logistic model is used by default.

  • optimize (string, default = None) –
    For Classification tasks:
    Accuracy, AUC, Recall, Precision, F1, Kappa
    For Regression tasks:
    MAE, MSE, RMSE, R2, RMSLE, MAPE

    If set to None, default is ‘Accuracy’ for classification and ‘R2’ for regression tasks.

  • custom_grid (list, default = None) – By default, a pre-defined number of topics is iterated over to optimize the supervised objective. To overwrite default iteration, pass a list of num_topics to iterate over in custom_grid param.
  • auto_fe (boolean, default = True) – Automatic text feature engineering. Only used when supervised_target is passed. When set to True, it will generate text-based features such as polarity, subjectivity and word counts to be used in supervised learning. Ignored when supervised_target is set to None.
  • fold (integer, default = 10) – Number of folds to be used in Kfold CV. Must be at least 2.
  • verbose (Boolean, default = True) – Status update is not printed when verbose is set to False.
Returns:

  • Visual_Plot – Visual plot with k number of topics on x-axis with metric to optimize on y-axis. Coherence is used when learning is unsupervised. Also, prints the best model metric.
  • model – trained model object with best K number of topics.

Warning

  • Random Projections (‘rp’) and Non-Negative Matrix Factorization (‘nmf’) are not available for unsupervised learning. An error is raised when ‘rp’ or ‘nmf’ is passed without a supervised_target.
  • Estimators using kernel-based methods such as Kernel Ridge Regressor, Automatic Relevance Determination, Gaussian Process Classifier, Radial Basis Support Vector Machine and Multi Level Perceptron may have longer training times.
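
Putting the tuning parameters together, the sketch below searches a custom num_topics grid while optimizing AUC of a LightGBM classifier trained on the supervised target. The grid values are illustrative, and ‘status’ is the target column used in the example above:

>>> tuned_lda = tune_model(model = 'lda', supervised_target = 'status', estimator = 'lightgbm', optimize = 'AUC', custom_grid = [2, 4, 8, 16])
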
pycaret.nlp.evaluate_model(model)

This function displays the user interface for all the available plots for a given model. It internally uses the plot_model() function.

Example

>>> from pycaret.datasets import get_data
>>> kiva = get_data('kiva')
>>> experiment_name = setup(data = kiva, target = 'en')
>>> lda = create_model('lda')
>>> evaluate_model(lda)

This will display the user interface for all of the plots for the given model.

Parameters:
  • model (object, default = none) – A trained model object should be passed.
Returns:

Displays the user interface for plotting.

Return type:

User_Interface
pycaret.nlp.save_model(model, model_name, verbose=True)

This function saves the trained model object into the current active directory as a pickle file for later use.

Example

>>> from pycaret.datasets import get_data
>>> kiva = get_data('kiva')
>>> experiment_name = setup(data = kiva, target = 'en')
>>> lda = create_model('lda')
>>> save_model(lda, 'lda_model_23122019')

This will save the model as a binary pickle file in the current directory.

Parameters:
  • model (object, default = none) – A trained model object should be passed.
  • model_name (string, default = none) – Name of pickle file to be passed as a string.
  • verbose (bool, default = True) – When set to False, success message is not printed.
Returns:

Return type:

Success_Message

pycaret.nlp.load_model(model_name, verbose=True)

This function loads a previously saved model from the current active directory into the current python environment. Load object must be a pickle file.

Example

>>> saved_lda = load_model('lda_model_23122019')

This will load the previously saved model into the saved_lda variable using the model_name param. The file must be in the current directory.

Parameters:
  • model_name (string, default = none) – Name of pickle file to be passed as a string.
  • verbose (bool, default = True) – When set to False, success message is not printed.
Returns:

Return type:

Success_Message

pycaret.nlp.models()

Returns table of models available in model library.

Example

>>> all_models = models()

This will return pandas.DataFrame with all available models and their metadata.

Returns:

Return type:

pandas.DataFrame

pycaret.nlp.get_logs(experiment_name=None, save=False)

Returns a table with experiment logs consisting of run details, parameters, metrics and tags.

Example

>>> logs = get_logs()

This will return pandas.DataFrame.

Parameters:
  • experiment_name (string, default = None) – When set to None, the current active run is used.
  • save (bool, default = False) – When set to True, csv file is saved in current directory.
Returns:

Return type:

pandas.DataFrame

pycaret.nlp.get_config(variable)

This function is used to access global environment variables. The following variables can be accessed:

  • text: Tokenized words as a list with length = # documents
  • data_: pandas.DataFrame containing text after all processing
  • corpus: List containing tuples of id to word mapping
  • id2word: gensim.corpora.dictionary.Dictionary
  • seed: random state set through session_id
  • target_: Name of column containing text. ‘en’ by default.
  • html_param: html_param configured through setup
  • exp_name_log: Name of experiment set through setup
  • logging_param: log_experiment param set through setup
  • log_plots_param: log_plots param set through setup
  • USI: Unique session ID parameter set through setup

Example

>>> text = get_config('text')

This will return the tokenized text as a list.

Returns:

Return type:

variable

pycaret.nlp.set_config(variable, value)

This function is used to reset global environment variables. The following variables can be changed:

  • text: Tokenized words as a list with length = # documents
  • data_: pandas.DataFrame containing text after all processing
  • corpus: List containing tuples of id to word mapping
  • id2word: gensim.corpora.dictionary.Dictionary
  • seed: random state set through session_id
  • target_: Name of column containing text. ‘en’ by default.
  • html_param: html_param configured through setup
  • exp_name_log: Name of experiment set through setup
  • logging_param: log_experiment param set through setup
  • log_plots_param: log_plots param set through setup
  • USI: Unique session ID parameter set through setup

Example

>>> set_config('seed', 123)

This will set the global seed to ‘123’.

pycaret.nlp.get_system_logs()

Reads and prints the ‘logs.log’ file from the current active directory.

pycaret.nlp.get_topics(data, text, model=None, num_topics=4)

Callable from any external environment without requiring setup initialization.

Arules

pycaret.arules.setup(data, transaction_id, item_id, ignore_items=None, session_id=None)

This function initializes the environment in pycaret. setup() must be called before executing any other function in pycaret. It takes three mandatory parameters: (i) data, (ii) the transaction_id param identifying the basket, and (iii) the item_id param used to create rules. These three params are normally found in any transactional dataset. pycaret will internally convert the pandas.DataFrame into a sparse matrix, which is required for association rules mining.

Example

>>> from pycaret.datasets import get_data
>>> france = get_data('france')
>>> experiment_name = setup(data = france, transaction_id = 'InvoiceNo', item_id = 'ProductName')
Parameters:
  • data (pandas.DataFrame) – Shape (n_samples, n_features) where n_samples is the number of samples and n_features is the number of features.
  • transaction_id (string) – Name of column representing transaction id. This will be used to pivot the matrix.
  • item_id (string) – Name of column used for creation of rules. Normally, this will be the variable of interest.
  • ignore_items (list, default = None) – List of strings to be ignored when considering rule mining.
  • session_id (int, default = None) – If None, a random seed is generated and returned in the Information grid. The unique number is then distributed as a seed in all functions used during the experiment. This can be used for later reproducibility of the entire experiment.
Returns:

  • info_grid – Information grid is printed.
  • environment – This function returns various outputs that are stored in variable as tuple. They are used by other functions in pycaret.

pycaret.arules.create_model(metric='confidence', threshold=0.5, min_support=0.05, round=4)

This function creates an association rules model using data and identifiers passed at setup stage. This function internally transforms the data for association rule mining.

setup() function must be called before using create_model()

Example

>>> from pycaret.datasets import get_data
>>> france = get_data('france')
>>> experiment_name = setup(data = france, transaction_id = 'InvoiceNo', item_id = 'ProductName')
>>> rule1 = create_model(metric = 'confidence')

This will return a pandas.DataFrame containing rules sorted by the metric param.

Parameters:
  • metric (string, default = 'confidence') –

    Metric to evaluate if a rule is of interest. Default is set to confidence. Other available metrics include ‘support’, ‘lift’, ‘leverage’, ‘conviction’. These metrics are computed as follows:

    • support(A->C) = support(A+C) [aka ‘support’], range: [0, 1]
    • confidence(A->C) = support(A+C) / support(A), range: [0, 1]
    • lift(A->C) = confidence(A->C) / support(C), range: [0, inf]
    • leverage(A->C) = support(A->C) - support(A)*support(C), range: [-1, 1]
    • conviction = [1 - support(C)] / [1 - confidence(A->C)], range: [0, inf]
  • threshold (float, default = 0.5) – Minimal threshold for the evaluation metric, via the metric parameter, to decide whether a candidate rule is of interest.
  • min_support (float, default = 0.05) – A float between 0 and 1 for minimum support of the itemsets returned. The support is computed as the fraction transactions_where_item(s)_occur / total_transactions.
  • round (integer, default = 4) – Number of decimal places metrics in score grid will be rounded to.
Returns:

Dataframe containing rules of interest with all metrics including antecedents, consequents, antecedent support, consequent support, support, confidence, lift, leverage, conviction.

Return type:

pandas.DataFrame
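
For instance, to keep only rules with a lift of at least 1.2 while lowering the minimum support, the metric and thresholds can be adjusted as follows (the specific values are illustrative):

>>> rule2 = create_model(metric = 'lift', threshold = 1.2, min_support = 0.02)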

Warning

  • Setting low values for min_support may increase training time.
pycaret.arules.plot_model(model, plot='2d', scale=1)

This function takes a model dataframe returned by the create_model() function. ‘2d’ and ‘3d’ plots are available.

Example

>>> rule1 = create_model(metric='confidence', threshold=0.7, min_support=0.05)
>>> plot_model(rule1, plot='2d')
>>> plot_model(rule1, plot='3d')
Parameters:
  • model (pandas.DataFrame, default = none) – pandas.DataFrame of rules returned by create_model().
  • plot (string, default = '2d') –

    Enter abbreviation of type of plot. The current list of plots supported are (Name - Abbreviated String):

    • Support, Confidence and Lift (2d) - ‘2d’
    • Support, Confidence and Lift (3d) - ‘3d’
  • scale (float, default = 1) – The resolution scale of the figure.
Returns:

Prints the visual plot.

Return type:

Visual_Plot

pycaret.arules.get_rules(data, transaction_id, item_id, ignore_items=None, metric='confidence', threshold=0.5, min_support=0.05)

Magic function to get Association Rules in Power Query / Power BI.
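
Like get_outliers() in the anomaly module, this function can be called directly without a prior setup(). A minimal sketch reusing the france dataset and the column names from the setup example above:

>>> from pycaret.datasets import get_data
>>> from pycaret.arules import get_rules
>>> france = get_data('france')
>>> rules = get_rules(data = france, transaction_id = 'InvoiceNo', item_id = 'ProductName', metric = 'confidence', threshold = 0.7)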

Datasets

pycaret.datasets.get_data(dataset, save_copy=False, profile=False, verbose=True)

This function loads sample datasets that are available in the pycaret git repository. The full list of available datasets and their descriptions can be viewed by calling get_data('index').

Example

>>> data = get_data('index')

This will display the list of available datasets that can be loaded using the get_data() function. For example, to load the credit dataset:

>>> credit = get_data('credit')
Parameters:
  • dataset (string) – Index value of dataset
  • save_copy (bool, default = False) – When set to true, it saves a copy of the dataset to your local active directory.
  • profile (bool, default = False) – If set to true, a data profile for Exploratory Data Analysis will be displayed in an interactive HTML report.
  • verbose (bool, default = True) – When set to False, head of data is not displayed.
Returns:

Pandas dataframe is returned.

Return type:

pandas.DataFrame
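
The save_copy and profile options can be combined when a local copy of the data and a quick exploratory report are both wanted; a minimal sketch:

>>> credit = get_data('credit', save_copy = True, profile = True)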

Warning

  • Use of get_data() requires an internet connection.

Indices and tables