Data Science Mentor
Abstract
This section covers how to craft prompts that turn a chatbot into a data science mentor.
Chatbots can serve as useful mentors in data science, offering support, explanations, and guidance across a range of topics. They can help learners understand fundamental principles in statistics, machine learning, and data analysis through explanations and illustrations.
Chatbots can also support tasks such as data preprocessing, model selection and evaluation, and the choice of data science tools and libraries, helping mentees make informed decisions throughout the data science workflow.
Data Preprocessing
Data preprocessing is an important step in the data science workflow. It applies a series of steps to clean the raw data, which matters because a model struggles to learn from noisy raw data. Let us see how to craft prompts that use a chatbot as a data science mentor for data preprocessing.
Suggest preprocessing steps
Assume that you have to work on a dataset of tweets and you are comparatively new to NLP. You want to preprocess the tweets but are not sure which steps to apply to clean them. In this case, you can ask the chatbot which preprocessing steps to apply, using a prompt like the following:
Prompt
You are an expert data scientist specialising in natural language processing. Tell me what preprocessing steps
I have to apply to clean the tweets.
Constraints: Don't give detailed explanations, just generate the list of preprocessing steps to clean the tweets.
Tweet preprocessing steps:
Output
Sure, here is a list of preprocessing steps to clean tweets:
1. Lowercasing
2. Removing special characters and punctuation
3. Removing URLs and hyperlinks
4. Removing mentions and usernames (@mentions)
5. Removing hashtags (#hashtags)
6. Tokenization
7. Removing numbers and digits
8. Removing stop words
9. Lemmatization or stemming
10. Handling emojis and emoticons
11. Handling repeated characters (e.g., "loooove" -> "love")
12. Handling contractions (e.g., "can't" -> "cannot")
13. Removing extra white spaces
14. Spell-checking and correction (optional)
15. Removing retweets (e.g., "RT @username: original tweet" -> "original tweet")
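Several of the steps in this list can be sketched with Python's standard `re` module. The following is a minimal illustration only, not a complete tweet cleaner: the function name `clean_tweet`, the tiny `STOP_WORDS` set, and the sample tweet are assumptions made for the example, and steps such as lemmatization, emoji handling, and spell-checking are left out because they need dedicated libraries.

```python
import re

# Tiny illustrative stop-word list (an assumption; real projects use a fuller list).
STOP_WORDS = {"a", "an", "the", "is", "to", "and", "of", "in"}

def clean_tweet(text):
    """Apply a subset of the listed steps and return the remaining tokens."""
    text = text.lower()                                  # step 1: lowercasing
    text = re.sub(r"^rt\s+@\w+:\s*", "", text)           # step 15: strip the retweet prefix
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)   # step 3: URLs
    text = re.sub(r"[@#]\w+", " ", text)                 # steps 4-5: mentions and hashtags
    text = re.sub(r"(.)\1{2,}", r"\1", text)             # step 11: "loooove" -> "love"
    text = re.sub(r"[^a-z\s]", " ", text)                # steps 2 and 7: punctuation and digits
    tokens = text.split()                                # steps 6 and 13: split on whitespace
    return [t for t in tokens if t not in STOP_WORDS]    # step 8: stop words

tokens = clean_tweet("RT @user: I loooove the new phone!!! https://t.co/abc #NLP 2024")
print(tokens)
```

The order of the steps matters here: URLs, mentions, and hashtags are removed before punctuation, since stripping punctuation first would break the patterns that identify them. Collapsing any run of three or more identical characters to a single character is a crude rule that also shortens words such as "coool" to "col".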
Now assume that you have to work on a dataset with missing values and are not sure of the best strategies to handle them. In such a case, you can ask the chatbot for strategies to handle missing data, using a prompt like the following:
Prompt
You are an expert data scientist with ten years of industry experience. Tell me some of the best strategies to
handle missing values.
Constraints: Don't give detailed explanations, just generate the list of strategies.
Strategies to handle missing values:
Output
Certainly, here is a list of strategies to handle missing values:
1. Data imputation
2. Mean imputation
3. Median imputation
4. Mode imputation
5. Forward fill imputation
6. Backward fill imputation
7. Interpolation
8. K-nearest neighbors imputation
9. Regression imputation
10. Multiple imputation
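To make a few of these strategies concrete, here is a small sketch using only the Python standard library, with `None` marking a missing value. The helper names (`impute_constant`, `forward_fill`) and the sample `readings` are assumptions for illustration; in practice a data frame library would usually handle this.

```python
from statistics import mean, median

def impute_constant(values, fill):
    """Replace every missing entry (None) with a single fill value."""
    return [fill if v is None else v for v in values]

def forward_fill(values):
    """Carry the last observed value forward; leading gaps stay None."""
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled

readings = [10.0, None, 14.0, None, None, 20.0]
observed = [v for v in readings if v is not None]

mean_filled = impute_constant(readings, mean(observed))      # mean imputation
median_filled = impute_constant(readings, median(observed))  # median imputation
ffilled = forward_fill(readings)                             # forward fill
bfilled = forward_fill(readings[::-1])[::-1]                 # backward fill via reversal

print(median_filled)
print(ffilled)
print(bfilled)
```

Mean and median imputation ignore the order of the data, while forward and backward fill rely on it, so the fill strategies suit ordered data such as time series.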
Machine Learning Model Training
Model training is an important part of the data science workflow, and it requires labelled data. In general, data labelling is an expensive and laborious process, so in many cases the labelled data available is imbalanced or insufficient. It is essential to address these issues to ensure that the model is trained well.
Now let us see how to craft prompts that use a chatbot as a data science mentor to address two issues with labelled data, namely data imbalance and small size.
Data Imbalance
Assume that the labelled dataset you are working with is highly imbalanced. Training a model on such a dataset will result in poor performance on the minority classes. To avoid this, you want to know the various strategies for handling data imbalance. You can ask the chatbot to generate the list of imbalance handling strategies using the following prompt.
Prompt
You are an expert data scientist with more than ten years of industry experience. Tell me some of the best
data imbalance handling strategies.
Constraints: Don't give detailed explanations, just generate the list of strategies.
Strategies to handle data imbalance:
Output
1. Resampling Techniques (Oversampling and Undersampling)
2. Synthetic Data Generation (SMOTE, ADASYN)
3. Cost-sensitive Learning
4. Ensemble Methods (Random Forest, XGBoost)
5. Anomaly Detection
6. Change the Threshold
7. Using Different Algorithms
8. Collect More Data
9. Data Augmentation
10. One-Class Classification
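The first strategy in this list, oversampling, can be illustrated with a short standard-library sketch. This is plain random oversampling, which duplicates minority-class rows until every class matches the largest one; it is not SMOTE or ADASYN. The function name `random_oversample` and the toy data are assumptions for the example.

```python
import random
from collections import Counter

def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen minority-class rows until all classes are equal in size."""
    rng = random.Random(seed)          # fixed seed so the sketch is reproducible
    counts = Counter(labels)
    target = max(counts.values())      # size of the largest class
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, y in zip(rows, labels) if y == cls]
        extra = rng.choices(pool, k=target - n)   # sample with replacement
        out_rows.extend(extra)
        out_labels.extend([cls] * len(extra))
    return out_rows, out_labels

rows = list(range(8))                  # toy feature rows
labels = ["neg"] * 6 + ["pos"] * 2     # 6 majority vs 2 minority examples
new_rows, new_labels = random_oversample(rows, labels)
print(Counter(new_labels))
```

Oversampling is normally applied to the training split only, so that duplicated rows do not leak into the validation or test data.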
Small Training Data
Assume that the labelled dataset you are working with is small in size. Training the model on such a small training dataset will result in poor performance as the model can learn very little from a small number of labelled instances. To avoid this, you want to know the various strategies to train models with small labelled datasets. You can ask the chatbot for the list of strategies using the following prompt.
Prompt
You are an expert data scientist with more than ten years of industry experience. Tell me some of the best
strategies to handle small training datasets to train a model. Constraints: Don't give detailed explanations,
just generate the list of strategies.
Strategies to handle small training datasets:
Output
1. Data augmentation
2. Transfer learning
3. Regularization techniques
4. Ensemble methods
5. Feature engineering
6. Cross-validation
7. Bayesian methods
8. Active learning
9. Semi-supervised learning
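Of these strategies, cross-validation is easy to sketch without any library: every example is used for validation exactly once, which helps when labelled data is scarce. The generator below is a minimal k-fold index splitter; the name `k_fold_indices` and the sizes used are assumptions for illustration.

```python
import random

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_indices, validation_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)          # shuffle once, reproducibly
    folds = [indices[i::k] for i in range(k)]     # k roughly equal folds
    for i in range(k):
        val = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, val

splits = list(k_fold_indices(10, k=5))
for train, val in splits:
    print(len(train), "train /", len(val), "validation:", sorted(val))
```

Each pair can then be used to fit a model on the training indices and score it on the validation indices, and the average score across folds gives a steadier estimate than a single split.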
Selecting Data Science Tools and Libraries
Many tools and libraries are available for the various tasks in the data science workflow. For example, for basic NLP tasks there are libraries like NLTK, spaCy, and Stanza. Similarly, for data augmentation there are libraries like imgaug, Augmentor, and nlpaug. Beginners in data science may not be aware of these libraries or may not know which library to use. As a data science mentor, a chatbot can recommend libraries for various data science tasks.
Tweets Preprocessing
You can ask the chatbot to suggest some good libraries to preprocess tweets using the following prompt.
Prompt
You are an expert data scientist with more than ten years of industry experience. Suggest a good library for
tweet preprocessing.
Constraints: Just suggest the library, don't give any explanations.
Output
The "Tweets Preprocessor" library is a dedicated tool for cleaning and preprocessing tweet text. It offers various
functionalities to prepare tweet data for analysis or natural language processing tasks.
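The prompts in this section share a common shape: a persona, a request, a constraints line, and sometimes a closing cue such as "Strategies to handle missing values:". As a rough sketch, that shape can be captured in a small helper so the pieces are easy to vary. The function `build_mentor_prompt` and its parameter names are hypothetical, and sending the prompt to a chatbot is left out.

```python
def build_mentor_prompt(persona, request, constraints, cue=""):
    """Assemble a mentor-style prompt from its parts; the cue line is optional."""
    lines = [f"You are {persona}. {request}", f"Constraints: {constraints}"]
    if cue:
        lines.append(cue)              # e.g. "Tweet preprocessing steps:"
    return "\n".join(lines)

prompt = build_mentor_prompt(
    persona="an expert data scientist with more than ten years of industry experience",
    request="Suggest a good library for tweet preprocessing.",
    constraints="Just suggest the library, don't give any explanations.",
)
print(prompt)
```

Keeping the constraints as a separate argument makes it simple to tighten them when the chatbot returns more explanation than requested.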
Exploratory Data Analysis
Exploratory data analysis (EDA) is the first step when working with any dataset, as it gives insights about the dataset that help data scientists decide how to proceed. You can ask the chatbot to suggest a good library for exploratory data analysis.
Prompt
You are an expert data scientist with more than ten years of industry experience. Suggest a good library for
exploratory data analysis.
Constraints: Just suggest the library, don't give any explanations.
Output