Data Science: Scenario-Based Interview Questions
Check out this list of the most commonly asked scenario-based interview questions for data scientists.
As we step into a new era of technology-driven data, we find ourselves facing an abundance of possibilities on the data horizon. The data science job market is highly competitive, so it can be difficult to land a good position. Only the strongest candidates make it to the interview stage and get hired, so it is important to prepare well.
The formats used for data science interviews vary from company to company. Interviews most often involve questions related to SQL, machine learning, and Python, but there are certainly scenario-based questions, programming knowledge tests, and soft skills tests, as well.
The purpose of this article is to provide you with an overview of 10+ data science scenario-based interview questions provided by practitioners in the field of data science. Practicing the following data science interview questions can give you an edge over your competitors.
Let's take a look.
Scenario-based Interview Questions for Data Science Jobs
1. Consider training your neural networks with a dataset of 20 GB. Your RAM is 3 GB. What will you do to solve this problem of training large datasets?
Solution: Here's how we can train our neural network with limited memory (a minimal sketch follows this list):
- First, rather than loading all 20 GB into RAM, we memory-map the data on disk as a NumPy array.
- Next, we obtain a slice of the data from the NumPy array by passing an index range, which reads only that slice from disk.
- This slice is then passed to our neural network, and the model is trained in small batches.
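As a minimal sketch of that approach, assuming the 20 GB dataset has already been saved to disk as .npy files (the file names, array shapes, and network architecture below are placeholders, and any framework that supports batch-wise training would work):

```python
import numpy as np
import tensorflow as tf  # any framework that supports batch-wise training works

# Memory-map the arrays on disk instead of loading 20 GB into 3 GB of RAM.
# "features.npy" / "labels.npy" and the network architecture are placeholders.
data = np.load("features.npy", mmap_mode="r")    # shape: (n_samples, n_features)
labels = np.load("labels.npy", mmap_mode="r")    # shape: (n_samples,)

model = tf.keras.Sequential([
    tf.keras.Input(shape=(data.shape[1],)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")

batch_size = 256
for epoch in range(5):
    for start in range(0, len(data), batch_size):
        # Slicing a memory-mapped array reads only this batch from disk.
        x_batch = np.asarray(data[start:start + batch_size])
        y_batch = np.asarray(labels[start:start + batch_size])
        model.train_on_batch(x_batch, y_batch)
```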
2. Training a dataset's features results in 100% accuracy, but when the dataset is validated, it reaches 75% accuracy. What should you look out for?
Solution: Upon achieving a training accuracy of 100%, our model needs to be screened for overfitting. When a model overfits, it learns the details and noise associated with its training data to the point that it adversely impacts its performance on new datasets. Therefore, verifying the model for overfitting is imperative.
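A quick way to screen for overfitting is to compare training and validation accuracy directly; the gap is the warning sign. Here is a minimal sketch with scikit-learn on toy data (the model and dataset are stand-ins for the real ones):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Toy data standing in for the real dataset.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

# An unconstrained tree can memorize the training set.
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

train_acc = model.score(X_train, y_train)   # often close to 1.00
val_acc = model.score(X_val, y_val)         # noticeably lower => overfitting
print(f"train={train_acc:.2f}, validation={val_acc:.2f}")
```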
3. Take the case where you are training a machine learning model using text data. Approximately 200K documents are included in the document matrix created. If you want to reduce the data dimensions, what techniques do you have?
Solution: Depending on the situation, we can use any of the following three techniques to reduce the dimensions of the data (a short sketch of the second technique follows this list):
- Latent Dirichlet Allocation (LDA).
- Latent Semantic Indexing (LSI).
- Keyword normalization.
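As promised above, here is a minimal sketch of latent semantic indexing with scikit-learn; the tiny corpus and the number of components are purely illustrative:

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["first example document", "second example document", "more sample text"]  # placeholder corpus

# Build the (sparse) document-term matrix.
tfidf = TfidfVectorizer(stop_words="english")
dtm = tfidf.fit_transform(docs)

# Latent semantic indexing: project the matrix onto a small number of latent topics.
lsi = TruncatedSVD(n_components=2, random_state=0)
reduced = lsi.fit_transform(dtm)
print(reduced.shape)  # (n_documents, n_components)
```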
4. Does PCA (principal component analysis) require rotation? If so, why? If the components aren't rotated, what will happen?
Solution: As a matter of fact, rotation (orthogonal) is necessary in order to maximize the variance captured by each component. Components can be interpreted more easily in this way. In addition, this is what PCA is intended to do, where we select fewer components than features to explain the maximum variance in the dataset. A rotation does not change the relative location of the components, only their coordinates. Without rotating the components, the PCA will have a diminished impact, and we will be forced to select more components to explain variances in the dataset as a result.
5. Imagine you're given a dataset. There are missing values in the dataset that vary by 1 standard deviation from the median. How much of the data will remain unaffected? What's the reason?
Solution: To get you started, this question contains enough hints. Since the data is described as spread around the median, we'll assume a normal distribution, where the mean, median, and mode coincide. Approximately 68% of the data in a normal distribution lies within 1 standard deviation of the mean (and therefore the median), leaving about 32% outside that range. Therefore, roughly 32% of the data will remain unaffected by the missing values.
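The 68% figure comes straight from the standard normal distribution and can be verified quickly, for example with SciPy:

```python
from scipy.stats import norm

# Probability mass within one standard deviation of the mean/median of a normal distribution.
within_one_sd = norm.cdf(1) - norm.cdf(-1)
print(f"within 1 SD: {within_one_sd:.2%}")     # ~68.27%
print(f"unaffected:  {1 - within_one_sd:.2%}") # ~31.73%, i.e., roughly 32%
```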
6. What makes Naive Bayes so naive?
Solution: The primary reason that Naive Bayes is so "naive" is that it assumes all features are equally important and conditionally independent of one another given the class. In real-world situations, these assumptions rarely hold true.
7. Imagine that you find out that your model has low bias and high variance. Which algorithm would be best suited to this problem? What's the reason?
Solution: Low bias means the model's predicted values are close to the actual values. In other words, the model is flexible enough to mimic the distribution of the training data. Although that sounds like an accomplishment, remember that such a flexible model does not generalize: it fails to perform well on unseen data.
We can resolve high variance problems using a bagging algorithm (such as a random forest). With bagging algorithms, the dataset is divided into subsets drawn by repeated randomized sampling. A single learning algorithm is then used to build a set of models on these samples, and their predictions are combined by averaging (regression) or voting (classification); a minimal sketch appears after this list. The following can also be done to combat high variance:
- Reduce model complexity by using regularization, where high model coefficients are penalized.
- Consider only the top n features from the variable importance chart; with all the variables in the dataset included, the algorithm may have trouble finding a meaningful signal.
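Here is the promised sketch of taming variance with bagging, comparing a single unpruned decision tree to a random forest; the synthetic data and hyperparameters are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=3000, n_features=25, n_informative=5, random_state=0)

# A single unpruned tree: low bias, high variance.
tree = DecisionTreeClassifier(random_state=0)
# Bagging many trees on bootstrap samples averages away much of that variance.
forest = RandomForestClassifier(n_estimators=200, random_state=0)

print("tree  :", cross_val_score(tree, X, y, cv=5).mean())
print("forest:", cross_val_score(forest, X, y, cv=5).mean())
```

The forest's cross-validated score is typically higher and more stable, because averaging over many bootstrapped trees cancels out much of the variance of any single tree.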
8. Let's say you're given a dataset. There are many variables in the dataset, some of which are highly correlated, and you are aware of this. You've been asked to run PCA by your manager. Do you think it would be better to remove correlated variables first? What's the reason?
Solution: There's a good chance you'll be tempted to answer "No," but that's not the right answer. Removing correlated variables first makes a substantial difference to PCA, because the variance explained by a particular component is inflated when correlated variables are present. Take a dataset with three variables, two of which are correlated: a PCA of this dataset would appear to explain twice as much variance along those correlated variables as a PCA of the uncorrelated equivalent. In other words, correlated variables make PCA put extra emphasis on them, which can be misleading.
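A hedged sketch of that preprocessing step with pandas and scikit-learn; the 0.9 correlation threshold, the toy columns, and the random seed are arbitrary choices:

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Toy frame with two strongly correlated columns (a and b).
rng = np.random.default_rng(0)
a = rng.normal(size=500)
df = pd.DataFrame({"a": a,
                   "b": a * 0.98 + rng.normal(scale=0.05, size=500),
                   "c": rng.normal(size=500)})

# Drop one column from every pair whose absolute correlation exceeds the threshold.
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
reduced = df.drop(columns=to_drop)

# Run PCA on the standardized, decorrelated data.
pca = PCA().fit(StandardScaler().fit_transform(reduced))
print("dropped:", to_drop, "explained variance:", pca.explained_variance_ratio_)
```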
9. Consider a train data set based on a classification problem with 1,000 columns and 1 million rows. You have been asked by your manager to reduce the dimension of this data in order to reduce the computation time for the model. There are memory constraints on your machine. How would you respond? (You are free to make practical assumptions.)
Solution: Managing high-dimensional data on a machine with limited memory is a challenge. It would be obvious to your interviewer. You can deal with such a situation in the following ways:
- Since RAM is limited, we should close all other programs on our machine, including the web browser, so that the available memory can be used to its full potential.
- The dataset can be randomly sampled. Essentially, we can create a smaller dataset, say 1,000 columns and 300,000 rows, and compute on that.
- We can reduce dimensionality by separating numerical and categorical variables and removing those that are correlated. In the case of numerical variables, we will use correlation. We will use the chi-square test for categorical variables.
- Furthermore, we can use PCA to identify the components that explain the most variance.
- An online learning algorithm such as Vowpal Wabbit (available in Python) might be a good choice.
- The stochastic gradient descent method can also be used to build linear models incrementally (a minimal sketch follows the note below).
- Also, we can use our business understanding to estimate which predictors might affect the response variable. It is, however, an intuitive approach. The loss of information could be significant if useful predictors are not identified.
Note: Be sure to read about stochastic gradient descent and online learning algorithms.
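Along those lines, here is a minimal sketch of incremental (online) training with scikit-learn's SGDClassifier, reading a hypothetical train.csv in chunks so the full dataset never has to fit in memory:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import SGDClassifier

# "train.csv" and the "target" column name are placeholders for the real 1,000-column dataset.
classes = np.array([0, 1])
model = SGDClassifier(random_state=0)  # default loss trains a linear SVM via SGD

for chunk in pd.read_csv("train.csv", chunksize=100_000):
    y = chunk["target"].to_numpy()
    X = chunk.drop(columns=["target"]).to_numpy()
    # partial_fit updates the model one chunk at a time (online learning).
    model.partial_fit(X, y, classes=classes)
```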
10. Assume you are given a dataset for cancer detection. You build a classification model and achieve 96% accuracy. Why might the performance of your model still not be satisfactory? Is there anything you can do about it?
Solution: After working with enough datasets, it becomes obvious that cancer-detection data is usually highly imbalanced. On a highly imbalanced dataset, accuracy shouldn't be used to measure performance: the 96% (as given) may only reflect correct predictions for the majority class, while our primary concern is the minority class (4%), the people who are actually diagnosed with cancer. To evaluate class-wise performance, we should use specificity (true negative rate), sensitivity (true positive rate), and the F measure. If minority-class performance is poor, the following steps can be taken (a minimal sketch follows this list):
- Among the methods available to make the data balanced, we can use under-sampling, over-sampling, or SMOTE (Synthetic Minority Oversampling Technique).
- Probability calibration can be used to adjust the prediction threshold value, and the AUC-ROC curve can be used to determine the optimal threshold.
- It is possible to assign weights to classes in a way that makes minority classes heavier.
- Anomaly detection can also be used.
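As a minimal sketch of two of these ideas, class weighting plus class-wise evaluation (sensitivity/specificity via a classification report and ROC AUC), on toy imbalanced data standing in for the cancer dataset:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import train_test_split

# Toy imbalanced data: roughly 4% positives, standing in for the cancer dataset.
X, y = make_classification(n_samples=10_000, weights=[0.96, 0.04], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# class_weight="balanced" makes the minority class count for more during training.
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X_train, y_train)

proba = clf.predict_proba(X_test)[:, 1]
print("ROC AUC:", roc_auc_score(y_test, proba))
# Per-class precision/recall instead of raw accuracy.
print(classification_report(y_test, clf.predict(X_test)))
```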
11. Let's say you're working with time series data. Your manager has requested that you create a high-accuracy model. In the beginning, you choose the decision tree algorithm because you know it works fairly well with all kinds of data. Afterwards, you attempted a time series regression model and obtained greater accuracy than the decision tree model. Is this possible? How come?
Solution: Time series data tends to be linear, whereas a decision tree algorithm works best at detecting non-linear interactions. The decision tree fails to provide robust predictions here because it cannot map linear relationships as effectively as a regression model. In short, a linear regression model can produce robust predictions when the dataset satisfies its linearity assumptions.
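To see why, here is a minimal sketch on a purely synthetic linear trend: the tree cannot extrapolate beyond the range it was trained on, while the linear model follows the trend:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# Synthetic time series with a linear trend plus noise.
rng = np.random.default_rng(0)
t = np.arange(200).reshape(-1, 1)
y = 3.0 * t.ravel() + rng.normal(scale=5.0, size=200)

train_t, train_y = t[:150], y[:150]
future_t = t[150:]  # time steps beyond the training window

tree = DecisionTreeRegressor(random_state=0).fit(train_t, train_y)
linear = LinearRegression().fit(train_t, train_y)

# The tree predicts a flat value past the last training point; the linear model follows the trend.
print("tree forecast (last 3):  ", tree.predict(future_t)[-3:])
print("linear forecast (last 3):", linear.predict(future_t)[-3:])
```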
12. As part of your new assignment, you are supposed to assist a food delivery company in saving money. Despite their best efforts, the company's delivery team is not able to deliver food on time. Due to this, customers become unhappy, so to keep them satisfied, the team delivers food for free. Would a machine learning algorithm be able to save them?
Solution: It's possible that you've already started going through the list of ML algorithms in your head. Hold on! These questions are used as a way of testing your machine learning fundamentals. The problem is not one of machine learning, but one of route optimization. Three things make up a machine learning problem:
- A pattern exists.
- The problem cannot be pinned down mathematically (no analytical solution exists, no matter how elaborate the equations).
- There is data on it.
Consider these three factors when determining whether machine learning is the right tool for the job.
Conclusion
How many questions did you manage to answer on your own? As the data science industry booms and companies seek more data scientists, interviews are likely to become more difficult as well. These questions will help prepare you for that advanced level.
Concept-based questions are vital, but scenario-based questions are just as crucial, since the answers you give ultimately reveal how you think through a problem. Recruiters look for problem-solvers: people who can come up with optimal solutions under pressure. The scenarios above are based on data science interview questions asked in real interviews over the past two years. Continue to learn, and continue to succeed. Best wishes to you all!