Natural Language Processing
- pycaret.nlp.setup(data, target=None, custom_stopwords=None, html=True, session_id=None, log_experiment=False, experiment_name=None, experiment_custom_tags: Optional[Dict[str, Any]] = None, log_plots=False, log_data=False, verbose=True)
This function initializes the training environment and creates the transformation pipeline. The setup function must be called before executing any other function. It takes only one mandatory parameter: data. All other parameters are optional; a usage sketch with several of them follows the warning at the end of this entry.
Example
>>> from pycaret.datasets import get_data
>>> kiva = get_data('kiva')
>>> from pycaret.nlp import *
>>> exp_name = setup(data = kiva, target = 'en')
- data: pandas.DataFrame or list
pandas.DataFrame with shape (n_samples, n_features), or a list.
- target: str
When data is a pandas.DataFrame, the name of the column containing text.
- custom_stopwords: list, default = None
List of stopwords.
- html: bool, default = True
When set to False, prevents the runtime display of the monitor. This must be set to False when the environment does not support IPython, for example in a command-line terminal, Databricks notebook, Spyder, or similar IDEs.
- session_id: int, default = None
Controls the randomness of the experiment. It is equivalent to ‘random_state’ in scikit-learn. When None, a pseudo-random number is generated. This can be used for later reproducibility of the entire experiment.
- log_experiment: bool, default = False
When set to True, all metrics and parameters are logged on the MLflow server.
- experiment_name: str, default = None
Name of the experiment for logging. Ignored when log_experiment is not True.
- experiment_custom_tags: dict, default = None
Dictionary of tag_name: tag_value pairs (values are converted to strings if they are not already) passed to mlflow.set_tags to add custom tags to the experiment.
- log_plots: bool or list, default = False
When set to True, certain plots are logged automatically in the MLflow server.
- log_data: bool, default = False
When set to True, the dataset is logged on the MLflow server as a CSV file. Ignored when log_experiment is not True.
- verbose: bool, default = True
When set to False, the information grid is not printed.
- Returns
Global variables that can be changed using the set_config function.
Warning
pycaret.nlp requires the following language models:
python -m spacy download en_core_web_sm
python -m textblob.download_corpora
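A minimal usage sketch, assuming the kiva dataset from the example above; the stopword list, session id, and experiment name are illustrative values, not defaults.
>>> from pycaret.datasets import get_data
>>> kiva = get_data('kiva')
>>> from pycaret.nlp import *
>>> # reproducible setup that also logs the run to MLflow
>>> exp_name = setup(data = kiva, target = 'en',
...                  custom_stopwords = ['loan', 'business'],
...                  session_id = 123,
...                  log_experiment = True,
...                  experiment_name = 'kiva_nlp')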
- pycaret.nlp.create_model(model=None, multi_core=False, num_topics=None, verbose=True, system=True, experiment_custom_tags: Optional[Dict[str, Any]] = None, **kwargs)
This function trains a given topic model. All the available models can be accessed using the models function (see the sketch after this entry for the optional parameters).
Example
>>> from pycaret.datasets import get_data
>>> kiva = get_data('kiva')
>>> from pycaret.nlp import *
>>> exp_name = setup(data = kiva, target = 'en')
>>> lda = create_model('lda')
- model: str, default = None
Models available in the model library (ID - Name):
‘lda’ - Latent Dirichlet Allocation
‘lsi’ - Latent Semantic Indexing
‘hdp’ - Hierarchical Dirichlet Process
‘rp’ - Random Projections
‘nmf’ - Non-Negative Matrix Factorization
- multi_core: bool, default = False
True would utilize all CPU cores to parallelize and speed up model training. Ignored when model is not ‘lda’.
- num_topics: int, default = 4
Number of topics to be created. If None, default is set to 4.
- verbose: bool, default = True
Status update is not printed when verbose is set to False.
- system: bool, default = True
Must remain True at all times. Only to be changed by internal functions.
- experiment_custom_tags: dict, default = None
Dictionary of tag_name: tag_value pairs (values are converted to strings if they are not already) passed to mlflow.set_tags to add custom tags to the experiment.
- **kwargs:
Additional keyword arguments to pass to the estimator.
- Returns
Trained Model
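A hedged sketch of the optional parameters above; the topic count is arbitrary, multi_core only takes effect for ‘lda’, and it assumes setup() has already been called as in the example.
>>> # train an LDA model with six topics, parallelized across CPU cores
>>> lda6 = create_model('lda', num_topics = 6, multi_core = True)
>>> # **kwargs are forwarded to the underlying estimator (assumed here to be
>>> # gensim's LdaModel, so passes is a plausible, illustrative kwarg)
>>> lda_slow = create_model('lda', num_topics = 6, passes = 10)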
- pycaret.nlp.assign_model(model, verbose=True)
This function assigns topic labels to the dataset for a given model.
Example
>>> from pycaret.datasets import get_data
>>> kiva = get_data('kiva')
>>> from pycaret.nlp import *
>>> exp_name = setup(data = kiva, target = 'en')
>>> lda = create_model('lda')
>>> lda_df = assign_model(lda)
- model: trained model object, default = None
Trained model object
- verbose: bool, default = True
Status update is not printed when verbose is set to False.
- Returns
pandas.DataFrame
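A small hedged follow-up: inspecting the returned frame. The exact columns that assign_model appends (topic weights, dominant topic, and so on) are not guaranteed here, so the sketch simply lists them.
>>> lda_df = assign_model(lda)
>>> lda_df.columns.tolist()   # see which topic columns were added to the original data
>>> lda_df.head()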
- pycaret.nlp.plot_model(model=None, plot='frequency', topic_num=None, save=False, system=True, display_format=None)
This function takes a trained model object (optional) and returns a plot based on the inferred dataset, internally calling assign_model before generating the plot. When no model is passed, the plot is generated on the entire dataset rather than at the topic level, so plot_model can be used with or without a model. When a trained model object is passed, plots are based on the first topic, i.e. ‘Topic 0’; this can be changed using the topic_num parameter (see the sketch after this entry's warning).
Example
>>> from pycaret.datasets import get_data
>>> kiva = get_data('kiva')
>>> from pycaret.nlp import *
>>> exp = setup(data = kiva, target = 'en')
>>> lda = create_model('lda')
>>> plot_model(lda, plot = 'frequency')
- model: object, default = None
Trained Model Object
- plot: str, default = ‘frequency’
List of available plots (ID - Name):
Word Token Frequency - ‘frequency’
Word Distribution Plot - ‘distribution’
Bigram Frequency Plot - ‘bigram’
Trigram Frequency Plot - ‘trigram’
Sentiment Polarity Plot - ‘sentiment’
Part of Speech Frequency - ‘pos’
t-SNE (3d) Dimension Plot - ‘tsne’
Topic Model (pyLDAvis) - ‘topic_model’
Topic Infer Distribution - ‘topic_distribution’
Wordcloud - ‘wordcloud’
UMAP Dimensionality Plot - ‘umap’
- topic_num: str, default = None
Topic number to be passed as a string. If set to None, default generation will be on ‘Topic 0’.
- save: string or bool, default = False
The plot is saved as a png file in the local directory when the save parameter is set to True, or in the specified directory when a path to that directory is passed.
- system: bool, default = True
Must remain True at all times. Only to be changed by internal functions.
- display_format: str, default = None
To display plots in Streamlit (https://www.streamlit.io/), set this to ‘streamlit’. Currently, not all plots are supported.
- Returns
None
Warning
The ‘pos’ and ‘umap’ plots are not available at the model level; the model parameter is ignored for these plots, and the result is always based on the entire training corpus.
The ‘topic_model’ plot is based on the pyLDAvis implementation, so it is not available for model = ‘lsi’, ‘rp’, or ‘nmf’.
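A hedged sketch of the optional parameters above; the topic number and the save behaviour are illustrative.
>>> # topic_num is passed as a string such as 'Topic 1'
>>> plot_model(lda, plot = 'bigram', topic_num = 'Topic 1')
>>> # save = True writes the plot as a png file in the current directory
>>> plot_model(lda, plot = 'frequency', save = True)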
- pycaret.nlp.tune_model(model=None, multi_core=False, supervised_target=None, estimator=None, optimize=None, custom_grid=None, auto_fe=True, fold=10, verbose=True)
This function tunes the num_topics parameter of a given model; see the sketch after the warning below for the optional parameters.
Example
>>> from pycaret.datasets import get_data
>>> kiva = get_data('kiva')
>>> from pycaret.nlp import *
>>> exp_name = setup(data = kiva, target = 'en')
>>> tuned_lda = tune_model(model = 'lda', supervised_target = 'status')
- model: str, default = None
Enter ID of the models available in model library (ID - Model):
‘lda’ - Latent Dirichlet Allocation
‘lsi’ - Latent Semantic Indexing
‘hdp’ - Hierarchical Dirichlet Process
‘rp’ - Random Projections
‘nmf’ - Non-Negative Matrix Factorization
- multi_core: bool, default = False
True would utilize all CPU cores to parallelize and speed up model training. Ignored when model is not ‘lda’.
- supervised_target: str
Name of the target column for supervised learning. If None, the model coherence value is used as the objective function.
- estimator: str, default = None
- Classification (ID - Name):
‘lr’ - Logistic Regression (Default)
‘knn’ - K Nearest Neighbour
‘nb’ - Naive Bayes
‘dt’ - Decision Tree Classifier
‘svm’ - SVM - Linear Kernel
‘rbfsvm’ - SVM - Radial Kernel
‘gpc’ - Gaussian Process Classifier
‘mlp’ - Multi Level Perceptron
‘ridge’ - Ridge Classifier
‘rf’ - Random Forest Classifier
‘qda’ - Quadratic Discriminant Analysis
‘ada’ - Ada Boost Classifier
‘gbc’ - Gradient Boosting Classifier
‘lda’ - Linear Discriminant Analysis
‘et’ - Extra Trees Classifier
‘xgboost’ - Extreme Gradient Boosting
‘lightgbm’ - Light Gradient Boosting
‘catboost’ - CatBoost Classifier
- Regression (ID - Name):
‘lr’ - Linear Regression (Default)
‘lasso’ - Lasso Regression
‘ridge’ - Ridge Regression
‘en’ - Elastic Net
‘lar’ - Least Angle Regression
‘llar’ - Lasso Least Angle Regression
‘omp’ - Orthogonal Matching Pursuit
‘br’ - Bayesian Ridge
‘ard’ - Automatic Relevance Determ.
‘par’ - Passive Aggressive Regressor
‘ransac’ - Random Sample Consensus
‘tr’ - TheilSen Regressor
‘huber’ - Huber Regressor
‘kr’ - Kernel Ridge
‘svm’ - Support Vector Machine
‘knn’ - K Neighbors Regressor
‘dt’ - Decision Tree
‘rf’ - Random Forest
‘et’ - Extra Trees Regressor
‘ada’ - AdaBoost Regressor
‘gbr’ - Gradient Boosting
‘mlp’ - Multi Level Perceptron
‘xgboost’ - Extreme Gradient Boosting
‘lightgbm’ - Light Gradient Boosting
‘catboost’ - CatBoost Regressor
- optimize: str, default = None
- For Classification tasks:
Accuracy, AUC, Recall, Precision, F1, Kappa (default = ‘Accuracy’)
- For Regression tasks:
MAE, MSE, RMSE, R2, RMSLE, MAPE (default = ‘R2’)
- custom_grid: list, default = None
By default, a pre-defined number of topics is iterated over to optimize the supervised objective. To override the default iteration, pass a list of num_topics values in the custom_grid parameter.
- auto_fe: bool, default = True
Automatic text feature engineering. When set to True, text-based features such as polarity, subjectivity, and word counts are generated. Ignored when supervised_target is None.
- fold: int, default = 10
Number of folds to be used in K-fold CV. Must be at least 2.
- verbose: bool, default = True
Status update is not printed when verbose is set to False.
- Returns
Trained Model with the optimized num_topics parameter.
Warning
Random Projections (‘rp’) and Non-Negative Matrix Factorization (‘nmf’) are not available for unsupervised learning. An error is raised when ‘rp’ or ‘nmf’ is passed without supervised_target.
Estimators that use kernel-based methods, such as Kernel Ridge Regressor, Automatic Relevance Determination, Gaussian Process Classifier, Radial Basis Support Vector Machine, and Multi Level Perceptron, may have longer training times.
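A hedged tuning sketch combining several of the parameters above; the ‘status’ column, the estimator choice, and the grid values are illustrative assumptions. It assumes setup() has been called on the kiva dataset as in the example.
>>> tuned_lda = tune_model(model = 'lda',
...                        supervised_target = 'status',
...                        estimator = 'lightgbm',
...                        optimize = 'AUC',
...                        custom_grid = [2, 4, 8, 16],
...                        fold = 5)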
- pycaret.nlp.evaluate_model(model)
This function displays a user interface for analyzing the performance of a trained model. It calls the plot_model function internally.
Example
>>> from pycaret.datasets import get_data
>>> kiva = get_data('kiva')
>>> from pycaret.nlp import *
>>> experiment_name = setup(data = kiva, target = 'en')
>>> lda = create_model('lda')
>>> evaluate_model(lda)
- model: object, default = None
A trained model object should be passed.
- Returns
None
- pycaret.nlp.save_model(model, model_name, verbose=True, **kwargs)
This function saves the trained model object into the current active directory as a pickle file for later use.
Example
>>> from pycaret.datasets import get_data
>>> kiva = get_data('kiva')
>>> from pycaret.nlp import *
>>> experiment_name = setup(data = kiva, target = 'en')
>>> lda = create_model('lda')
>>> save_model(lda, 'saved_lda_model')
- model: object
A trained model object should be passed.
- model_name: str
Name of pickle file to be passed as a string.
- verbose: bool, default = True
When set to False, success message is not printed.
- **kwargs:
Additional keyword arguments to pass to joblib.dump().
- Returns
Tuple of the model object and the filename.
- pycaret.nlp.load_model(model_name, verbose=True)
This function loads a previously saved model.
Example
>>> from pycaret.nlp import load_model
>>> saved_lda = load_model('saved_lda_model')
- model_name: str
Name of pickle file to be passed as a string.
- verbose: bool, default = True
When set to False, success message is not printed.
- Returns
Trained Model
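A hedged save/load round trip; compress is forwarded to joblib.dump() per the **kwargs note above, and the filename is illustrative.
>>> save_model(lda, 'saved_lda_model', compress = 3)
>>> # reload later using the same name that was used when saving
>>> saved_lda = load_model('saved_lda_model')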
- pycaret.nlp.models()
Returns a table of models available in the model library.
Example
>>> from pycaret.nlp import models
>>> all_models = models()
- Returns
pandas.DataFrame
- pycaret.nlp.get_logs(experiment_name=None, save=False)
Returns a table of experiment logs. Only works when log_experiment is True when initializing the setup function.
Example
>>> from pycaret.datasets import get_data
>>> kiva = get_data('kiva')
>>> from pycaret.nlp import *
>>> exp_name = setup(data = kiva, target = 'en', log_experiment = True)
>>> lda = create_model('lda')
>>> exp_logs = get_logs()
- experiment_name: str, default = None
When None, the current active run is used.
- save: bool, default = False
When set to True, a CSV file is saved in the current working directory.
- Returns
pandas.DataFrame
- pycaret.nlp.get_config(variable)
This function retrieves the global variables created when initializing the setup function. The following variables are accessible:
text: Tokenized words as a list with length = # documents
data_: pandas.DataFrame containing text after all processing
corpus: List containing tuples of id to word mapping
id2word: gensim.corpora.dictionary.Dictionary
seed: random state set through session_id
target_: Name of column containing text. ‘en’ by default.
html_param: html_param configured through setup
exp_name_log: Name of experiment set through setup
logging_param: log_experiment param set through setup
log_plots_param: log_plots param set through setup
USI: Unique session ID parameter set through setup
Example
>>> from pycaret.datasets import get_data
>>> kiva = get_data('kiva')
>>> from pycaret.nlp import *
>>> exp_name = setup(data = kiva, target = 'en')
>>> text = get_config('text')
- Returns
Global variable
- pycaret.nlp.set_config(variable, value)
This function resets the global variables. The following variables are accessible:
text: Tokenized words as a list with length = # documents
data_: pandas.DataFrame containing text after all processing
corpus: List containing tuples of id to word mapping
id2word: gensim.corpora.dictionary.Dictionary
seed: random state set through session_id
target_: Name of column containing text. ‘en’ by default.
html_param: html_param configured through setup
exp_name_log: Name of experiment set through setup
logging_param: log_experiment param set through setup
log_plots_param: log_plots param set through setup
USI: Unique session ID parameter set through setup
Example
>>> from pycaret.datasets import get_data
>>> kiva = get_data('kiva')
>>> from pycaret.nlp import *
>>> exp_name = setup(data = kiva, target = 'en')
>>> set_config('seed', 123)
- Returns
None
- pycaret.nlp.get_topics(data, text, model=None, num_topics=4)
Callable from any external environment without requiring setup initialization (a usage sketch follows).
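A minimal hedged sketch of get_topics; it assumes the kiva dataset, that text is the name of the column holding the documents ('en' here), and that model accepts the same IDs as create_model.
>>> from pycaret.datasets import get_data
>>> kiva = get_data('kiva')
>>> from pycaret.nlp import get_topics
>>> # one-call shorthand: no prior setup() is required
>>> topics_df = get_topics(data = kiva, text = 'en', model = 'lda', num_topics = 4)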