🌅 Data Science Mentor

Abstract

This section covers how to craft prompts to use chatbots as a data science mentor.

Chatbots can serve as good mentors in the field of data science, offering support, explanations, and guidance across a range of data science topics. They can aid individuals in comprehending fundamental principles in statistics, machine learning, and data analysis, providing explanations and illustrations to improve their comprehension.

Additionally, the chatbots can provide support with tasks such as data preprocessing, model selection, and evaluation, choosing data science tools and libraries, and helping mentees make informed decisions throughout the data science workflow.

✅ Data Preprocessing

Data preprocessing is one of the most important steps in the data science workflow. It applies a series of steps to clean the raw data, which matters because models struggle to learn from noisy raw data. Let us see how to craft prompts that use a chatbot as a data science mentor for data preprocessing.

🍿 Suggest preprocessing steps

Suppose you have to work on a dataset of tweets and you are relatively new to NLP. You want to preprocess the tweets but are not sure which steps to apply to clean them. In this case, you can ask the chatbot which preprocessing steps to apply, using a prompt like the following:

Prompt

You are an expert data scientist specialising in natural language processing. Tell me what preprocessing steps 
I have to apply to clean the tweets. 

Constraints: Don’t give detailed explanations, just generate the list of preprocessing steps to clean the tweets.

Tweet preprocessing steps:

Output

Sure, here is a list of preprocessing steps to clean tweets:

1. Lowercasing
2. Removing special characters and punctuation
3. Removing URLs and hyperlinks
4. Removing mentions and usernames (@mentions)
5. Removing hashtags (#hashtags)
6. Tokenization
7. Removing numbers and digits
8. Removing stop words
9. Lemmatization or stemming
10. Handling emojis and emoticons
11. Handling repeated characters (e.g., "loooove" -> "love")
12. Handling contractions (e.g., "can't" -> "cannot")
13. Removing extra white spaces
14. Spell-checking and correction (optional)
15. Removing retweets (e.g., "RT @username: original tweet" -> "original tweet")
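
Several of the listed steps can be sketched with regular expressions from the Python standard library. The following is a minimal illustration, not a complete tweet cleaner: the function name `clean_tweet` and the patterns are assumptions made for this example, and steps such as tokenization, stop-word removal, lemmatization, and emoji handling are left out because they usually need an NLP library.

```python
import re

# Illustrative patterns (assumptions for this sketch); real tweets may need broader rules.
RETWEET_RE = re.compile(r"^RT\s+@\w+:\s*")    # "RT @username: " prefix
URL_RE = re.compile(r"https?://\S+|www\.\S+")  # URLs and hyperlinks
MENTION_RE = re.compile(r"@\w+")               # @mentions
HASHTAG_RE = re.compile(r"#(\w+)")             # hashtags
REPEAT_RE = re.compile(r"(.)\1{2,}")           # a character repeated 3+ times
NON_ALPHA_RE = re.compile(r"[^a-z\s]")         # digits, punctuation, symbols
WHITESPACE_RE = re.compile(r"\s+")


def clean_tweet(text: str) -> str:
    """Apply a subset of the listed preprocessing steps to one tweet."""
    text = RETWEET_RE.sub("", text)       # drop the retweet prefix
    text = URL_RE.sub(" ", text)          # remove URLs
    text = MENTION_RE.sub(" ", text)      # remove mentions
    text = HASHTAG_RE.sub(r"\1", text)    # design choice: keep the word, drop '#'
    text = text.lower()                   # lowercasing
    text = REPEAT_RE.sub(r"\1\1", text)   # "loooove" -> "loove" (needs spell-check to reach "love")
    text = NON_ALPHA_RE.sub(" ", text)    # remove numbers, punctuation, special characters
    return WHITESPACE_RE.sub(" ", text).strip()  # collapse extra white space


print(clean_tweet("RT @user: I loooove this!!! https://t.co/abc #NLP 2024"))
# -> i loove this nlp
```

The order of the steps matters: URLs and mentions are removed before punctuation is stripped, otherwise their fragments would survive as ordinary words.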

🍿 Strategies for Handling Missing Values

Suppose you have to work on a dataset with missing values and are not sure of the best strategies to handle them. In such a case, you can ask the chatbot for strategies to handle missing data, using a prompt like the following:

Prompt

You are an expert data scientist with ten years of industry experience. Tell me some of the best strategies to 
handle missing values. 

Constraints: Don’t give detailed explanations, just generate the list of strategies.

Strategies to handle missing values:

Output

Certainly, here is a list of strategies to handle missing values:

1. Data imputation
2. Mean imputation
3. Median imputation
4. Mode imputation
5. Forward fill imputation
6. Backward fill imputation
7. Interpolation
8. K-nearest neighbors imputation
9. Regression imputation
10. Multiple imputation
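
To make a few of these strategies concrete, here is a minimal pure-Python sketch of mean/median imputation, forward fill, and linear interpolation on a single column. Missing entries are represented as `None`, and the function names are hypothetical helpers written for this illustration. In practice, a data-frame library would normally be used instead.

```python
from statistics import mean, median
from typing import Callable, List, Optional, Sequence

Values = Sequence[Optional[float]]


def impute_statistic(values: Values, statistic: Callable = mean) -> List[float]:
    """Replace each None with a statistic (mean, median, ...) of the observed values."""
    observed = [v for v in values if v is not None]
    fill = statistic(observed)
    return [fill if v is None else v for v in values]


def forward_fill(values: Values) -> List[Optional[float]]:
    """Replace each None with the most recent observed value; leading gaps stay None."""
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled


def interpolate_linear(values: Values) -> List[Optional[float]]:
    """Fill gaps between two observed values along a straight line; edge gaps stay None."""
    result = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (values[right] - values[left]) / (right - left)
        for i in range(left + 1, right):
            result[i] = values[left] + step * (i - left)
    return result


column = [1.0, None, 3.0, None, 8.0]
print(impute_statistic(column, mean))    # [1.0, 4.0, 3.0, 4.0, 8.0]
print(impute_statistic(column, median))  # [1.0, 3.0, 3.0, 3.0, 8.0]
print(forward_fill(column))              # [1.0, 1.0, 3.0, 3.0, 8.0]
print(interpolate_linear(column))        # [1.0, 2.0, 3.0, 5.5, 8.0]
```

Backward fill can be obtained from the same helper by reversing the input and the output: `forward_fill(column[::-1])[::-1]`.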

✅ Machine Learning Model Training

Model training is a central step in the data science workflow, and it requires labelled data. In general, data labelling is an expensive and laborious process, so the labelled data available is often imbalanced or insufficient. In such cases, it is essential to address these issues to ensure that the model is trained well.

Now let us see how to craft prompts that use a chatbot as a data science mentor to address two issues with labelled data: data imbalance and small size.

🍿 Data Imbalance

Assume that the labelled dataset you are working with is highly imbalanced. Training a model on such a dataset typically results in poor performance on the minority classes. To avoid this, you want to know the various strategies for handling data imbalance. You can ask the chatbot to generate the list of imbalance handling strategies using the following prompt.

Prompt

You are an expert data scientist with more than ten years of industry experience. Tell me some of the best 
data imbalance handling strategies. 

Constraints: Don’t give detailed explanations, just generate the list of strategies.

Strategies to handle data imbalance:

Output

1. Resampling Techniques (Oversampling and Undersampling)
2. Synthetic Data Generation (SMOTE, ADASYN)
3. Cost-sensitive Learning
4. Ensemble Methods (Random Forest, XGBoost)
5. Anomaly Detection
6. Change the Threshold
7. Using Different Algorithms
8. Collect More Data
9. Data Augmentation
10. One-Class Classification
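
Two of these strategies, random oversampling (a resampling technique) and class weights (a simple form of cost-sensitive learning), can be sketched in a few lines of standard-library Python. The function names and the toy labels below are assumptions for illustration only. The weight formula `n_samples / (n_classes * class_count)` is a common "balanced" heuristic, not the only possible choice.

```python
import random
from collections import Counter
from typing import Dict, Hashable, List, Sequence, Tuple


def balanced_class_weights(labels: Sequence[Hashable]) -> Dict[Hashable, float]:
    """Weight each class inversely to its frequency: n / (k * count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {label: n / (k * count) for label, count in counts.items()}


def random_oversample(samples: Sequence, labels: Sequence[Hashable],
                      seed: int = 0) -> Tuple[List, List]:
    """Duplicate randomly chosen minority samples until all classes match the largest one."""
    rng = random.Random(seed)
    by_class: Dict[Hashable, List] = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)
    target = max(len(group) for group in by_class.values())

    out_samples, out_labels = [], []
    for label, group in by_class.items():
        extra = [rng.choice(group) for _ in range(target - len(group))]
        out_samples.extend(group + extra)
        out_labels.extend([label] * target)
    return out_samples, out_labels


# Hypothetical toy dataset: 8 negative and 2 positive examples.
labels = ["neg"] * 8 + ["pos"] * 2
samples = list(range(10))

print(balanced_class_weights(labels))  # {'neg': 0.625, 'pos': 2.5}
_, new_labels = random_oversample(samples, labels)
print(Counter(new_labels))             # both classes now have 8 examples
```

Oversampling is normally applied to the training split only. Duplicating examples before splitting would leak copies of the same instance into the evaluation data.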

🍿 Small Training Data

Assume that the labelled dataset you are working with is small in size. Training the model on such a small training dataset will result in poor performance as the model can learn very little from a small number of labelled instances. To avoid this, you want to know the various strategies to train models with small labelled datasets. You can ask the chatbot for the list of strategies using the following prompt.

Prompt

You are an expert data scientist with more than ten years of industry experience. Tell me some of the best 
strategies to handle small training datasets to train a model. 

Constraints: Don’t give detailed explanations, just generate the list of strategies.

Strategies to handle small training datasets:

Output

1. Data augmentation
2. Transfer learning
3. Regularization techniques
4. Ensemble methods
5. Feature engineering
6. Cross-validation
7. Bayesian methods
8. Active learning
9. Semi-supervised learning
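
Of these strategies, cross-validation is the easiest to illustrate without any external library. The sketch below generates k-fold train/validation index splits so that every labelled instance is used for validation exactly once. The function name `k_fold_indices` is a hypothetical helper for this example, and model training itself is deliberately left out.

```python
import random
from typing import Iterator, List, Tuple


def k_fold_indices(n_samples: int, k: int,
                   seed: int = 0) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, validation_indices) for each of k folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)       # shuffle once, reproducibly
    folds = [indices[i::k] for i in range(k)]  # k near-equal, disjoint folds
    for i, validation in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


# With 10 labelled instances and 5 folds, each split trains on 8 and validates on 2.
for train, validation in k_fold_indices(10, k=5):
    print(len(train), len(validation))
```

Averaging the validation score over all folds gives a more stable performance estimate than a single hold-out split, which is especially useful when the labelled set is small.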

✅ Selecting Data Science Tools and Libraries

Many tools and libraries are available for the various tasks in the data science workflow. For example, for basic NLP tasks there are libraries such as NLTK, spaCy, and Stanza. Similarly, for data augmentation there are libraries such as imgaug, Augmentor, and nlpaug. Beginners in data science may not be aware of these libraries or may not know which one to use. As a data science mentor, a chatbot can recommend libraries for various data science tasks.

🍿 Tweets Preprocessing

You can ask the chatbot to suggest some good libraries to preprocess tweets using the following prompt.

Prompt

You are an expert data scientist with more than ten years of industry experience. Suggest a good library for 
tweet preprocessing. 

Constraints : Just suggest the library, don’t give any explanations.

Output

The "Tweets Preprocessor" library is a dedicated tool for cleaning and preprocessing tweet text. It offers various 
functionalities to prepare tweet data for analysis or natural language processing tasks.

🍿 Exploratory Data Analysis

Exploratory data analysis (EDA) is usually the first step when working with a new dataset, because it gives insights that help data scientists decide how to proceed. You can ask the chatbot to suggest a good library for exploratory data analysis.

Prompt

You are an expert data scientist with more than ten years of industry experience. Suggest a good library for 
exploratory data analysis. 

Constraints : Just suggest the library, don’t give any explanations.

Output

I would recommend using the "Pandas" library for exploratory data analysis (EDA). Pandas is a powerful and 
popular Python library for data manipulation and analysis, and it is widely considered one of the best choices 
for EDA.
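
As a rough idea of what a first EDA pass with Pandas might look like, the sketch below collects the shape, column types, missing-value counts, and numeric summary of a data frame. It assumes pandas is installed. The `quick_eda` helper and the toy data are hypothetical and exist only for this illustration.

```python
import pandas as pd


def quick_eda(df: pd.DataFrame) -> dict:
    """Collect a few basic facts that are commonly checked first in EDA."""
    return {
        "shape": df.shape,                          # (rows, columns)
        "dtypes": df.dtypes.astype(str).to_dict(),  # column types
        "missing": df.isna().sum().to_dict(),       # missing values per column
        "numeric_summary": df.describe(),           # count, mean, std, quartiles
    }


# Hypothetical toy data with one missing value in each column.
df = pd.DataFrame({
    "age": [23, 35, None, 41, 29],
    "city": ["Pune", "Delhi", "Pune", None, "Pune"],
})

report = quick_eda(df)
print(report["shape"])
print(report["missing"])
print(report["numeric_summary"])
print(df["city"].value_counts())  # frequency of each category
```

By default, `describe()` summarizes only the numeric columns, so categorical columns such as `city` are inspected separately with `value_counts()`.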