What is Data Science? List the differences between supervised and unsupervised learning.
Data Science is a blend of various tools, algorithms, and machine learning principles with the goal to discover hidden patterns from the raw data. How is this different from what statisticians have been doing for years?
The answer lies in the difference between explaining and predicting.

The differences between supervised and unsupervised learning are as follows:
| Supervised Learning | Unsupervised Learning |
| --- | --- |
| Input data is labelled. | Input data is unlabelled. |
| Uses a training data set. | Uses the input data set. |
| Used for prediction. | Used for analysis. |
| Enables classification and regression. | Enables classification, density estimation, and dimension reduction. |
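As a minimal illustration in code (scikit-learn is assumed; the toy feature matrix X and labels y are invented), a supervised model needs the labels y while an unsupervised one works on X alone:

from sklearn.linear_model import LogisticRegression   # supervised: needs labels
from sklearn.cluster import KMeans                    # unsupervised: labels not used

X = [[0.1, 1.2], [0.4, 0.9], [3.1, 2.8], [2.9, 3.3]]  # toy feature matrix (assumed)
y = [0, 0, 1, 1]                                      # labels, only needed for supervision

clf = LogisticRegression().fit(X, y)         # learns a mapping from X to y (prediction)
km = KMeans(n_clusters=2, n_init=10).fit(X)  # finds structure in X alone (analysis)

print(clf.predict([[0.2, 1.0]]))  # predicted label for a new point
print(km.labels_)                 # cluster assignments discovered without labels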
What is Selection Bias?
Selection bias is a kind of error that occurs when the researcher decides who is going to be studied. It is usually associated with research where the selection of participants isn’t random. It is sometimes referred to as the selection effect. It is the distortion of statistical analysis, resulting from the method of collecting samples. If the selection bias is not taken into account, then some conclusions of the study may not be accurate.
The types of selection bias include:
- Sampling bias: It is a systematic error due to a non-random sample of a population causing some members of the population to be less likely to be included than others resulting in a biased sample.
- Time interval: A trial may be terminated early at an extreme value (often for ethical reasons), but the extreme value is likely to be reached by the variable with the largest variance, even if all variables have a similar mean.
- Data: When specific subsets of data are chosen to support a conclusion or rejection of bad data on arbitrary grounds, instead of according to previously stated or generally agreed criteria.
- Attrition: Attrition bias is a kind of selection bias caused by attrition (loss of participants) discounting trial subjects/tests that did not run to completion.
What is bias-variance trade-off?
Bias: Bias is the error introduced in your model by oversimplifying the machine learning algorithm; it can lead to underfitting. During training, the model makes simplified assumptions so that the target function is easier to learn.
Low-bias machine learning algorithms: Decision Trees, k-NN, and SVM.
High-bias machine learning algorithms: Linear Regression and Logistic Regression.
Variance: Variance is the error introduced by an overly complex machine learning algorithm; the model also learns the noise in the training data set and performs badly on the test data set. It leads to high sensitivity to the training data and overfitting.
Normally, as you increase the complexity of your model, you will see a reduction in error due to lower bias in the model. However, this only happens until a particular point. As you continue to make your model more complex, you end up over-fitting your model and hence your model will start suffering from high variance.

Bias-Variance trade-off: The goal of any supervised machine learning algorithm is to have low bias and low variance to achieve good prediction performance.
- The k-nearest neighbour algorithm has low bias and high variance, but the trade-off can be changed by increasing the value of k which increases the number of neighbours that contribute to the prediction and in turn increases the bias of the model.
- The support vector machine algorithm has low bias and high variance, but the trade-off can be changed by increasing the C parameter that influences the number of violations of the margin allowed in the training data which increases the bias but decreases the variance.
There is no escaping the relationship between bias and variance in machine learning. Increasing the bias will decrease the variance. Increasing the variance will decrease bias.
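As a rough illustrative sketch (not from the original article), the k-NN trade-off can be seen by varying k with scikit-learn; the dataset and split below are assumptions:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

for k in (1, 5, 25):  # small k: low bias / high variance; large k: higher bias / lower variance
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    print(k, knn.score(X_tr, y_tr), knn.score(X_te, y_te))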
What is a confusion matrix?
The confusion matrix is a 2×2 table that contains the 4 outputs produced by a binary classifier. Various measures, such as error rate, accuracy, specificity, sensitivity, precision, and recall, are derived from it.

A data set used for performance evaluation is called a test data set. It should contain the correct labels and predicted labels.

The predicted labels will be exactly the same as the observed labels if the performance of the binary classifier is perfect.

The predicted labels usually match with part of the observed labels in real-world scenarios.

A binary classifier predicts all data instances of a test data set as either positive or negative. This produces four outcomes-
- True-positive(TP) — Correct positive prediction
- False-positive(FP) — Incorrect positive prediction
- True-negative(TN) — Correct negative prediction
- False-negative(FN) — Incorrect negative prediction

Basic measures derived from the confusion matrix-
- Error Rate = (FP+FN)/(P+N)
- Accuracy = (TP+TN)/(P+N)
- Sensitivity(Recall or True positive rate) = TP/P
- Specificity(True negative rate) = TN/N
- Precision(Positive predicted value) = TP/(TP+FP)
- F-Score (weighted harmonic mean of precision and recall) = (1+b²)(Precision·Recall)/(b²·Precision+Recall), where b is commonly 0.5, 1, or 2; b = 1 gives the balanced F1 score.
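A small sketch computing these measures directly from hypothetical TP/FP/TN/FN counts (the numbers are made up for illustration):

TP, FP, TN, FN = 40, 10, 45, 5           # hypothetical counts from a binary classifier
P, N = TP + FN, TN + FP                   # actual positives and negatives

accuracy    = (TP + TN) / (P + N)
error_rate  = (FP + FN) / (P + N)
sensitivity = TP / P                      # recall / true positive rate
specificity = TN / N                      # true negative rate
precision   = TP / (TP + FP)
b = 1                                     # b = 1 gives the balanced F1 score
f_score = (1 + b**2) * precision * sensitivity / (b**2 * precision + sensitivity)

print(accuracy, error_rate, sensitivity, specificity, precision, f_score)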
Describe Markov chains?
Markov chain is a type of stochastic process. In Markov chains, the future probability of any state depends only on the current state.

In a Markov chain model, each step produces an output that depends only on the current state.
An example is word recommendation. When we type a paragraph, the next word is suggested by a model that depends only on the previous word and not on anything before it. The Markov chain model is trained beforehand on similar text, where the possible next words for each word in the training data are stored. Based on this training data, the next words are suggested.
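A minimal next-word sketch of this idea (the tiny transition table below is invented purely for illustration):

import random

# Hypothetical transition table learned from training text:
# current word -> possible next words
transitions = {
    "data":    ["science", "set", "point"],
    "science": ["is", "interview"],
    "is":      ["fun", "useful"],
}

def suggest_next(word):
    # The suggestion depends only on the current word (the Markov property)
    candidates = transitions.get(word)
    return random.choice(candidates) if candidates else None

print(suggest_next("data"))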
What do you understand by true positive rate and false positive rate?
The True Positive Rate (TPR) is the ratio of true positives to the sum of true positives and false negatives. It is the probability that an actual positive will test positive.
TPR = TP / (TP + FN)
The False Positive Rate (FPR) is the ratio of false positives to all actual negatives (false positives and true negatives). It is the probability of a false alarm, i.e., that a positive result will be given when the true value is negative.
FPR = FP / (FP + TN)
Why R is used in Data Visualization?
R is used in data visualization as it has many inbuilt functions and libraries that help in data visualizations. These libraries include ggplot2, leaflet, lattice, etc.
R helps in exploratory data analysis as well as feature engineering. Using R, almost any type of graph can be created, and customizing graphics is easier in R than in Python.
What is the ROC curve?
The ROC curve is a graph with the false positive rate on the x-axis and the true positive rate on the y-axis. The true positive rate is the ratio of true positives to the total number of positive samples, and the false positive rate is the ratio of false positives to the total number of negative samples. The FPR and TPR are computed at several threshold values to construct the ROC curve. The area under the ROC curve (AUC) ranges from 0 to 1. A completely random model has an AUC of 0.5, represented by the diagonal straight line; the more the ROC curve deviates from this straight line towards the top-left corner, the better the model. ROC curves are used for binary classification.
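A short sketch with scikit-learn, assuming y_true are the observed binary labels and y_score are predicted probabilities (both invented here):

from sklearn.metrics import roc_curve, roc_auc_score

y_true  = [0, 0, 1, 1, 0, 1, 1, 0]                    # observed labels (assumed)
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.3]   # predicted probabilities (assumed)

fpr, tpr, thresholds = roc_curve(y_true, y_score)     # FPR/TPR at several thresholds
print(list(zip(fpr, tpr)))
print("AUC:", roc_auc_score(y_true, y_score))         # area under the ROC curve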

What are dimensionality reduction and its benefits?
Reducing the number of features for a given dataset is known as dimensionality reduction. There are many techniques used to reduce dimensionality such as-
Feature Selection Methods
Matrix Factorization
Manifold Learning
Autoencoder Methods
Linear Discriminant Analysis (LDA)
Principal component analysis (PCA)
One of the main reasons for dimensionality reduction is the curse of dimensionality. When the number of features increases, the model becomes more complex, and if the number of data points is small relative to the number of features, the model starts memorizing (overfitting) the data instead of generalizing from it. This is known as the curse of dimensionality.
Other benefits of dimensionality reduction include-
The time and storage space required are reduced.
It becomes easier to visualize and visually represent the data in 2D or 3D.
Space complexity is reduced.
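For example, a minimal PCA sketch with scikit-learn, assuming a numeric feature matrix X:

import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(100, 15)           # assumed dataset: 100 rows, 15 features

pca = PCA(n_components=3)             # keep 3 components instead of 15 features
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                # (100, 3)
print(pca.explained_variance_ratio_)  # how much variance each component retains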
How do you find RMSE and MSE in a linear regression model?
Root Mean Squared Error (RMSE) is used to test the performance of the linear regression model. It evaluates how much the data is spread around the line of best fit. Its formula is:
RMSE = sqrt( Σ (y_i - y_hat_i)² / N )
Where,
y_hat_i is the predicted value,
y_i is the actual value of the output variable, and
N is the number of data points.
Mean Squared Error (MSE) tells how close the line is to the actual data. The differences between the line and the data points are squared and averaged. The MSE value should be low for a good model, meaning the error between actual and predicted output values is low. It is calculated as:
MSE = Σ (y_i - y_hat_i)² / N
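A quick NumPy sketch of both metrics, using small made-up actual and predicted values:

import numpy as np

y_actual = np.array([3.0, 5.0, 2.5, 7.0])   # assumed actual values
y_pred   = np.array([2.8, 5.4, 2.9, 6.1])   # assumed model predictions

mse  = np.mean((y_actual - y_pred) ** 2)    # Mean Squared Error
rmse = np.sqrt(mse)                         # Root Mean Squared Error

print(mse, rmse)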
How to deal with unbalanced binary classification?
While doing binary classification, if the data set is imbalanced, the performance of the model cannot be judged correctly using accuracy alone. For example, if the data belonging to one of the two classes is very small in quantity compared to the other class, the traditional accuracy metric barely reflects the smaller class. If only 5% of the examples belong to the smaller class and the model classifies every output as the larger class, the accuracy would still be around 95%, which is misleading. To deal with this, we can do the following-
- Use other methods for calculating the model performance like precision/recall, F1 score, etc.
- Resample the data with techniques like undersampling (reducing the sample size of the larger class) or oversampling (increasing the sample size of the smaller class using repetition, SMOTE, and other such techniques).
- Using K-fold cross-validation
- Using ensemble learning such that each decision tree considers the entire sample of the smaller class and only a subset of the larger class.
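As one illustrative sketch (assumptions: scikit-learn is available and a synthetic imbalanced dataset stands in for real data), class weighting plus precision/recall-based evaluation can be combined like this:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Synthetic data: only ~5% of samples belong to the positive class
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# class_weight='balanced' penalizes mistakes on the rare class more heavily
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X_tr, y_tr)

# Precision, recall and F1 per class instead of plain accuracy
print(classification_report(y_te, clf.predict(X_te)))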
What is the difference between a box plot and a histogram?
Both histograms and box plots are used to visually represent the frequency of values of a certain feature. A histogram is used to understand the underlying probability distribution of the data, while box plots are used more to compare several datasets. Box plots show fewer details but take up less space than histograms.
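A small matplotlib sketch drawing both plots for the same made-up sample:

import numpy as np
import matplotlib.pyplot as plt

data = np.random.normal(loc=50, scale=10, size=500)  # assumed sample

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.hist(data, bins=30)      # shows the shape of the distribution
ax1.set_title("Histogram")
ax2.boxplot(data)            # shows median, quartiles and outliers compactly
ax2.set_title("Box plot")
plt.show()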
What does NLP stand for?
NLP stands for Natural Language Processing. It is the study of programming computers to process and analyze large amounts of natural language (textual) data. Examples of NLP tasks include tokenization, stop-word removal, stemming, and sentiment analysis.
Walkthrough the probability fundamentals
The possibility of the occurrence of an event, among all the possible outcomes, is known as its probability. The probability of an event always lies between 0 and 1, inclusive.

Factorial – it is used to find the total number of ways n things can be arranged in n places without repetition. Its value is n multiplied by every natural number below it down to 1, e.g., 5! = 5×4×3×2×1 = 120.
Permutation – it is used when replacement is not allowed and the order of items is important. Its formula is:
nPr = n! / (n - r)!
Where,
n is the total number of items, and
r is the number of items being selected.
Combination – it is used when replacement is not allowed and the order of items is not important. Its formula is:
nCr = n! / (r! (n - r)!)
Some rules for probability are-
Addition Rule
P(A or B)= P(A) + P(B) – P(A and B)
Conditional probability
It is the probability of event B occurring, assuming that event A has already occurred.
P(A and B)= P(A) . P(B|A)
Central Limit theorem
It states that when we draw sufficiently large random samples from a population and take the mean of each sample, these sample means form an approximately normal distribution, regardless of the shape of the population's distribution.
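The permutation and combination counts above can be checked quickly with Python's standard library (math.perm and math.comb require Python 3.8+):

import math

n, r = 5, 2
print(math.factorial(5))   # 120, i.e. 5!
print(math.perm(n, r))     # 20  = 5!/(5-2)!      (order matters)
print(math.comb(n, r))     # 10  = 5!/(2!(5-2)!)  (order does not matter)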
Describe different regularization methods, such as L1 and L2 regularization
There are 3 important regularization methods as follows-
L2 regularization (Ridge regression) – In L2 regularization, we add the sum of the squares of all the weights, multiplied by a value lambda, to the loss function. The Ridge loss is:
Loss = Σ (y_i - y_hat_i)² + λ Σ w_j²
If the weights become very large, the penalty term λ Σ w_j² grows quickly and dominates the loss, so the optimizer is pushed to keep the weights small; conversely, when the data-fit term is large, the optimizer focuses on reducing it. The penalty thus keeps the final weights from becoming too large.
L1 regularization (Lasso regression) – In L1 regularization, we add the sum of the absolute values of all the weights, multiplied by a value lambda, to the loss function. The Lasso loss is:
Loss = Σ (y_i - y_hat_i)² + λ Σ |w_j|
Ridge brings parameters close to zero but not exactly zero, while Lasso eliminates less important features by setting their weights exactly to zero.
Dropout
This is used for regularization in neural networks. Fully connected layers are more prone to overfitting. Dropout leaves out some neurons with 1-p probability in neural networks. Dropout reduces overfitting, improves training speed, and makes the model more robust.
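A brief scikit-learn sketch contrasting the two penalties on synthetic data (the dataset and alpha values are assumptions); Lasso can drive some coefficients exactly to zero while Ridge only shrinks them:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, Lasso

X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)

ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty: shrinks weights toward zero
lasso = Lasso(alpha=1.0).fit(X, y)   # L1 penalty: can set weights exactly to zero

print(np.round(ridge.coef_, 2))
print(np.round(lasso.coef_, 2))
print("Zeroed by Lasso:", np.sum(lasso.coef_ == 0))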
How should you maintain a deployed model?
After a model has been deployed, it needs to be maintained. The data being fed may change over time. For example, in the case of a model predicting house prices, the prices of houses may rise over time or fluctuate due to some other factor. The accuracy of the model on new data can be recorded. Some common ways to ensure accuracy include-
- The model should be frequently checked by feeding it negative test data; if the model gives low accuracy on this deliberately out-of-pattern data, that is fine.
- An autoencoder can be built for anomaly detection: the AE model computes a reconstruction error for incoming data, and if the reconstruction error value is high, it means the new data does not follow the old pattern learned by the model.
If the model shows good prediction accuracy with new data, it means that the new data follows the pattern or the generalization learned by the model on old data. So, the model can be retrained on the new data. If the accuracy on new data is not that good, the model can be retrained on the new data with feature engineering on the data features along with the old data.
If the accuracy is not good, the model may need to be trained from scratch.
Write the equation and calculate the precision and recall rate.
Precision quantifies the number of correct positive predictions made. Precision is calculated as the number of true positives divided by the total number of true positives and false positives.
Precision = True Positives / (True Positives + False Positives)
Recall quantifies the number of correct positive predictions made out of all actual positive examples. Recall is calculated as the number of true positives divided by the total number of true positives and false negatives.
Recall = True Positives / (True Positives + False Negatives)
Why do we use the summary function?
A summary function gives descriptive statistics for the numeric values in a data frame, e.g., R's summary() or pandas' describe(). Calling describe() on a column, for example df['column_name'].describe(), gives the following values for the numeric data in that column-
- Count
- Mean
- Std-Standard deviation
- Min-Minimum
- 25%
- 50%
- 75%
- max-Maximum
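For instance, a minimal pandas sketch (the column names and values are invented):

import pandas as pd

df = pd.DataFrame({"price": [10, 12, 9, 15, 11], "qty": [1, 3, 2, 5, 4]})  # assumed data
print(df["price"].describe())  # count, mean, std, min, 25%, 50%, 75%, max
print(df.describe())           # the same summary for every numeric column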
How will you measure the Euclidean distance between the two arrays in NumPy?
The Euclidean distance between two arrays A = [1, 2, 3] and B = [8, 9, 10] can be calculated as the norm of their element-wise difference. The built-in function numpy.linalg.norm() can be used as follows-
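A minimal sketch of that calculation:

import numpy as np

A = np.array([1, 2, 3])
B = np.array([8, 9, 10])

distance = np.linalg.norm(A - B)  # sqrt((1-8)^2 + (2-9)^2 + (3-10)^2)
print(distance)                   # ~12.124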

What is the difference between an error and a residual error?
An error refers to the difference between an observed value and the true value from the underlying population; popular aggregate error metrics in data science are Mean Absolute Error (MAE), Mean Squared Error (MSE), and Root Mean Squared Error (RMSE). A residual is the difference between an observed value and the value estimated from the sample (for example, the fitted value or the sample mean). An error is generally unobservable, while a residual can be calculated and visualized on a graph: the error measures how the observed data differs from the actual population, whereas the residual measures how it differs from the sample-based estimate.
Difference between Normalisation and Standardization?
Normalization, also known as min-max scaling, is a technique where all the data values are converted such that they lie between 0 and 1.
The formula for normalization is:
X_norm = (X - X_min) / (X_max - X_min)
Where,
X is the original value,
X_max is the maximum value of the feature, and
X_min is the minimum value of the feature.
Standardization refers to converting our data such that the data has normal distribution with its mean as 0 and standard deviation as 1.
The formula for standardization is:
Z = (X - μ) / σ
where μ is the mean and σ is the standard deviation of the feature.
So, while normalization rescales the data into the range from 0 to 1 only, standardization ensures data follows the standard normal distribution.
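A compact sketch of both rescalings with scikit-learn (the sample column is made up):

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[10.0], [20.0], [30.0], [40.0]])    # assumed single feature

print(MinMaxScaler().fit_transform(X).ravel())    # normalization: values in [0, 1]
print(StandardScaler().fit_transform(X).ravel())  # standardization: mean 0, std 1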
What is the difference between “long” and “wide” format data?
In the wide-format, a subject’s repeated responses will be in a single row, and each response is in a separate column. In the long-format, each row is a one-time point per subject. You can recognize data in wide format by the fact that columns generally represent groups.
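A pandas sketch converting an assumed wide table to long format and back:

import pandas as pd

wide = pd.DataFrame({"subject": ["s1", "s2"],
                     "test1": [80, 75],
                     "test2": [85, 70]})          # one row per subject (wide)

long = wide.melt(id_vars="subject", var_name="test", value_name="score")  # one row per measurement
back = long.pivot(index="subject", columns="test", values="score")        # back to wide

print(long)
print(back)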

What is correlation and covariance in statistics?
Covariance and Correlation are two mathematical concepts; these two approaches are widely used in statistics. Both Correlation and Covariance establish the relationship and also measure the dependency between two random variables. Though the work is similar between these two in mathematical terms, they are different from each other.

Correlation: Correlation is considered or described as the best technique for measuring and also for estimating the quantitative relationship between two variables. Correlation measures how strongly two variables are related.
Covariance: Covariance measures the extent to which two random variables change together. It is a statistical term that describes the systematic relationship between a pair of random variables, wherein a change in one variable is accompanied by a corresponding change in the other. Unlike correlation, covariance is not standardized, so its magnitude depends on the units of the variables.
What is the difference between Point Estimates and Confidence Interval?
Point Estimation gives us a particular value as an estimate of a population parameter. Method of Moments and Maximum Likelihood estimator methods are used to derive Point Estimators for population parameters.
A confidence interval gives us a range of values that is likely to contain the population parameter. The confidence interval is generally preferred, as it tells us how likely this interval is to contain the population parameter. This likeliness, or probability, is called the confidence level or confidence coefficient and is represented by 1 - alpha, where alpha is the level of significance.
What is the goal of A/B Testing?
It is hypothesis testing for a randomized experiment with two variants, A and B.
The goal of A/B testing is to identify whether changes to a web page maximize or increase the outcome of interest. A/B testing is a fantastic method for figuring out the best online promotional and marketing strategies for your business. It can be used to test everything from website copy to sales emails to search ads.
An example of this could be identifying the click-through rate for a banner ad.
What is p-value?
When you perform a hypothesis test in statistics, a p-value can help you determine the strength of your results. p-value is a number between 0 and 1. Based on the value it will denote the strength of the results. The claim which is on trial is called the Null Hypothesis.
A low p-value (≤ 0.05) indicates evidence against the null hypothesis, which means we can reject the null hypothesis. A high p-value (> 0.05) indicates weak evidence against the null hypothesis, which means we fail to reject it. A p-value around 0.05 is marginal and could go either way. To put it another way,
High p-values: your data are likely under a true null. Low p-values: your data are unlikely under a true null.
In any 15-minute interval, there is a 20% probability that you will see at least one shooting star. What is the probability that you see at least one shooting star in the period of an hour?
Probability of not seeing any shooting star in 15 minutes is
= 1 – P( Seeing one shooting star )
= 1 – 0.2 = 0.8
Probability of not seeing any shooting star in the period of one hour
= (0.8) ^ 4 = 0.4096
Probability of seeing at least one shooting star in the one hour
= 1 – P( Not seeing any star )
= 1 – 0.4096 = 0.5904
How can you generate a random number between 1 – 7 with only a die?
- Any die has six sides from 1-6. There is no way to get seven equal outcomes from a single rolling of a die. If we roll the die twice and consider the event of two rolls, we now have 36 different outcomes.
- To get our 7 equal outcomes we have to reduce this 36 to a number divisible by 7. We can thus consider only 35 outcomes and exclude the other one.
- A simple scenario can be to exclude the combination (6,6), i.e., to roll the die again if 6 appears twice.
- All the remaining combinations from (1,1) till (6,5) can be divided into 7 parts of 5 each. This way all the seven sets of outcomes are equally likely.
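A small sketch of this procedure in Python (random.randint stands in for the physical die):

import random

def roll_die():
    return random.randint(1, 6)

def random_1_to_7():
    while True:
        a, b = roll_die(), roll_die()
        if (a, b) == (6, 6):           # discard (6,6) and roll again
            continue
        index = (a - 1) * 6 + (b - 1)  # 0..34 over the remaining 35 outcomes
        return index // 5 + 1          # 7 equally likely groups of 5 -> 1..7

print(random_1_to_7())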
A certain couple tells you that they have two children, at least one of which is a girl. What is the probability that they have two girls?
In the case of two children, there are 4 equally likely possibilities
BB, BG, GB and GG;
where B = Boy and G = Girl and the first letter denotes the first child.
From the question, we can exclude the first case of BB. Thus, from the remaining 3 possibilities of BG, GB & GG, we have to find the probability of the case with two girls.
Thus, P(Having two girls given one girl) = 1 / 3
What do you understand by statistical power of sensitivity and how do you calculate it?
Sensitivity is commonly used to validate the accuracy of a classifier (Logistic, SVM, Random Forest etc.).
Sensitivity is nothing but “Predicted True events/ Total events”. True events here are the events which were true and model also predicted them as true.
Calculation of sensitivity is pretty straightforward.
Sensitivity = ( True Positives ) / ( Positives in Actual Dependent Variable )
Why Is Re-sampling Done?
Resampling is done in any of these cases:
- Estimating the accuracy of sample statistics by using subsets of accessible data or drawing randomly with replacement from a set of data points
- Substituting labels on data points when performing significance tests
- Validating models by using random subsets (bootstrapping, cross-validation)
What are the differences between over-fitting and under-fitting?
In statistics and machine learning, one of the most common tasks is to fit a model to a set of training data so as to be able to make reliable predictions on general, unseen data.

In overfitting, a statistical model describes random error or noise instead of the underlying relationship. Overfitting occurs when a model is excessively complex, such as having too many parameters relative to the number of observations. A model that has been overfitted, has poor predictive performance, as it overreacts to minor fluctuations in the training data.
Underfitting occurs when a statistical model or machine learning algorithm cannot capture the underlying trend of the data. Underfitting would occur, for example, when fitting a linear model to non-linear data. Such a model too would have poor predictive performance.
How to combat Overfitting and Underfitting?
To combat overfitting and underfitting, you can resample the data to estimate the model accuracy (k-fold cross-validation) and by having a validation dataset to evaluate the model.
What is regularization? Why is it useful?
Regularization is the process of adding a tuning parameter (penalty term) to a model to induce smoothness and prevent overfitting. This is most often done by adding a constant multiple of the norm of the weight vector to the loss function; the penalty is typically the L1 norm (lasso) or the L2 norm (ridge). The model predictions should then minimize this regularized loss function computed on the training set.
What Is the Law of Large Numbers?
It is a theorem that describes the result of performing the same experiment a large number of times. This theorem forms the basis of frequency-style thinking. It says that the sample means, the sample variance and the sample standard deviation converge to what they are trying to estimate.
What Are Confounding Variables?
In statistics, a confounder is a variable that influences both the dependent variable and independent variable.
For example, if you are researching whether a lack of exercise leads to weight gain,
lack of exercise = independent variable
weight gain = dependent variable.
A confounding variable here would be any other variable that affects both of these variables, such as the age of the subject.
What Are the Types of Biases That Can Occur During Sampling?
- Selection bias
- Undercoverage bias
- Survivorship bias
What is Survivorship Bias?
It is the logical error of focusing on the aspects that support surviving some process while casually overlooking those that did not survive because of their lack of prominence. This can lead to wrong conclusions in numerous ways.
What is selection Bias?
Selection bias occurs when the sample obtained is not representative of the population intended to be analyzed.
Explain how a ROC curve works?
The ROC curve is a graphical representation of the contrast between true positive rates and false-positive rates at various thresholds. It is often used as a proxy for the trade-off between the sensitivity(true positive rate) and false-positive rate.

What is TF/IDF vectorization?
TF-IDF, short for term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus. It is often used as a weighting factor in information retrieval and text mining.
The TF–IDF value increases proportionally to the number of times a word appears in the document but is offset by the frequency of the word in the corpus, which helps to adjust for the fact that some words appear more frequently in general.
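A tiny scikit-learn sketch (the three documents are invented; get_feature_names_out assumes a recent scikit-learn version):

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat on the mat",
        "the dog sat on the log",
        "cats and dogs are pets"]        # assumed corpus

vec = TfidfVectorizer()
tfidf = vec.fit_transform(docs)          # rows = documents, columns = terms

print(vec.get_feature_names_out())       # vocabulary
print(tfidf.toarray().round(2))          # TF-IDF weights per document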
Why we generally use SoftMax non-linearity function as last operation in-network?
It is because it takes in a vector of real numbers and returns a probability distribution. Its definition is as follows. Let x be a vector of real numbers (positive, negative, whatever, there are no constraints).
Then the i'th component of SoftMax(x) is:
SoftMax(x)_i = exp(x_i) / Σ_j exp(x_j)
It should be clear that the output is a probability distribution: each element is non-negative and the sum over all components is 1.
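A numerically stable NumPy sketch of that definition:

import numpy as np

def softmax(x):
    x = np.asarray(x, dtype=float)
    shifted = x - x.max()          # subtract the max for numerical stability
    exps = np.exp(shifted)
    return exps / exps.sum()

probs = softmax([2.0, 1.0, -1.0])
print(probs, probs.sum())          # non-negative components that sum to 1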
How does data cleaning play a vital role in analysis?
Data cleaning can help in analysis because:
- Cleaning data from multiple sources helps to transform it into a format that data analysts or data scientists can work with.
- Data Cleaning helps to increase the accuracy of the model in machine learning.
- It is a cumbersome process because as the number of data sources increases, the time taken to clean the data increases exponentially due to the number of sources and the volume of data generated by these sources.
- It might take up to 80% of the time for just cleaning data making it a critical part of the analysis task.
Differentiate between univariate, bivariate and multivariate analysis.
Univariate analysis is a descriptive statistical technique that involves only one variable at a time. For example, a pie chart of sales by territory involves only one variable, so the analysis can be referred to as univariate analysis.
Bivariate analysis attempts to understand the relationship between two variables at a time, as in a scatterplot. For example, analyzing sales volume together with spending is an example of bivariate analysis.
Multivariate analysis deals with the study of more than two variables to understand the effect of variables on the responses.
Explain Star Schema.
It is a traditional database schema with a central table. Satellite tables map IDs to physical names or descriptions and can be connected to the central fact table using the ID fields; these tables are known as lookup tables and are principally useful in real-time applications, as they save a lot of memory. Sometimes star schemas involve several layers of summarization to recover information faster.
What is Cluster Sampling?
Cluster sampling is a technique used when it becomes difficult to study the target population spread across a wide area and simple random sampling cannot be applied. Cluster Sample is a probability sample where each sampling unit is a collection or cluster of elements.
For eg., A researcher wants to survey the academic performance of high school students in Japan. He can divide the entire population of Japan into different clusters (cities). Then the researcher selects a number of clusters depending on his research through simple or systematic random sampling.
What is Systematic Sampling?
Systematic sampling is a statistical technique where elements are selected from an ordered sampling frame at a regular interval. In systematic sampling, the list is traversed in a circular manner, so once you reach the end of the list, you continue from the top again. The best-known example of systematic sampling is the equal-probability (every kth element) method.
What are Eigenvectors and Eigenvalues?
Eigenvectors are used for understanding linear transformations. In data analysis, we usually calculate the eigenvectors for a correlation or covariance matrix. Eigenvectors are the directions along which a particular linear transformation acts by flipping, compressing or stretching.
Eigenvalue can be referred to as the strength of the transformation in the direction of eigenvector or the factor by which the compression occurs.
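For instance, NumPy can compute both for a small covariance-style matrix (the matrix is made up):

import numpy as np

M = np.array([[2.0, 1.0],
              [1.0, 2.0]])                 # assumed symmetric (covariance-style) matrix

eigenvalues, eigenvectors = np.linalg.eig(M)
print(eigenvalues)      # strength of the transformation along each direction
print(eigenvectors)     # columns are the corresponding eigenvectors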
Can you cite some examples where a false positive is more important than a false negative?
Let us first understand what false positives and false negatives are.
- False Positives are the cases where you wrongly classified a non-event as an event a.k.a Type I error.
- False Negatives are the cases where you wrongly classify events as non-events, a.k.a Type II error.
Example 1: In the medical field, assume you have to give chemotherapy to patients. Assume a patient comes to that hospital and he is tested positive for cancer, based on the lab prediction but he actually doesn’t have cancer. This is a case of false positive. Here it is of utmost danger to start chemotherapy on this patient when he actually does not have cancer. In the absence of cancerous cell, chemotherapy will do certain damage to his normal healthy cells and might lead to severe diseases, even cancer.
Example 2: Let's say an e-commerce company decided to give a $1000 gift voucher to the customers whom they assume will purchase at least $10,000 worth of items. They send the free voucher mail directly to 100 customers without any minimum purchase condition because they assume they will make at least a 20% profit on items sold above $10,000. The issue arises if we send the $1000 gift vouchers to customers who have not actually purchased anything but are wrongly marked as likely to make $10,000 worth of purchases; every such false positive directly costs the company money.
Can you cite some examples where a false negative is more important than a false positive?
Example 1: Assume there is an airport ‘A’ which has received high-security threats and based on certain characteristics they identify whether a particular passenger can be a threat or not. Due to a shortage of staff, they decide to scan passengers being predicted as risk positives by their predictive model. What will happen if a true threat customer is being flagged as non-threat by airport model?
Example 2: What if Jury or judge decides to make a criminal go free?
Example 3: What if you rejected to marry a very good person based on your predictive model and you happen to meet him/her after a few years and realize that you had a false negative?
What is a Neural Network?
A Neural Network is a supervised machine learning algorithm inspired by the human nervous system; it mimics, in a simplified way, how the human brain learns. It consists of input layers, hidden layers, and output layers.
What is the difference between Bayesian Estimate and Maximum Likelihood Estimation (MLE)?
In bayesian estimate, we have some knowledge about the data/problem (prior). There may be several values of the parameters which explain data and hence we can look for multiple parameters like 5 gammas and 5 lambdas that do this. As a result of the Bayesian Estimate, we get multiple models for making multiple predictions i.e. one for each pair of parameters but with the same prior. So, if a new example needs to be predicted, computing the weighted sum of these predictions serves the purpose.
Maximum likelihood does not take prior into consideration (ignores the prior) so it is like being a Bayesian while using some kind of a flat prior.
How would you develop a model to identify plagiarism?
Follow the steps below for developing a model that identifies plagiarism:
- Tokenise the document.
- Use the NLTK library in Python for the removal of stop words from data.
- Create LDA or SDA of the document and then use the GenSim library to identify the most relevant words, line by line.
- Use Google Search API to search for those words.
How will you tackle a vanishing gradient problem?
A vanishing gradient problem is something that occurs when multiple layers using a certain activation function are added to the network, making the network hard to train. Activation functions like sigmoid squish large input values to a smaller input space i.e., between 0 and 1. Because of this squishing, larger changes in the sigmoid input produce very small derivatives, producing very small gradients. It does not help the weights move anywhere significantly, and the weights get into a stuck state. We can overcome this by following one or more of the following methods.
Residual networks: Residual network blocks can help us overcome this problem as they provide a direct connection to the earlier layers skipping 1 or more weighted layers.
ReLu: Opting other activation functions like ReLu can also fix this problem efficiently.
Batch normalization: Batch normalization helps us overcome this problem by normalizing the input data so that the data doesn’t reach the outer edges of the sigmoid function.
How will you tackle an exploding gradient problem?
By sticking to a small learning rate, scaled target variables, a standard loss function, one can carefully configure the network of a model and avoid exploding gradients. Another approach for tackling exploding gradients is using gradient scaling or gradient clipping to change the error before it is propagated back through the network. This change in error allows rescaling of weights.
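A hedged PyTorch-style sketch of the gradient-clipping variant inside a single training step (the tiny model, data, and hyperparameters are all invented for illustration):

import torch
import torch.nn as nn

model = nn.Linear(4, 1)                        # tiny stand-in model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
x, y = torch.randn(8, 4), torch.randn(8, 1)    # assumed batch

optimizer.zero_grad()
loss = nn.functional.mse_loss(model(x), y)
loss.backward()
# Rescale gradients so their global norm never exceeds 1.0 (gradient clipping)
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()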
Write a function that takes in two sorted lists and outputs a sorted list that is their union.
The first solution which will come to your mind is to merge two lists and sort them afterwards.
Python code-
def return_union(list_a, list_b):
    return sorted(list_a + list_b)
R code-
return_union <- function(list_a, list_b)
{
  list_c <- list(c(unlist(list_a), unlist(list_b)))
  return(list(list_c[[1]][order(list_c[[1]])]))
}
Generally, the tricky part of the question is not to use any sorting or ordering function. In that case, you will have to write your own logic to answer the question and impress your interviewer.
Python code-
def return_union(list_a, list_b):
    len1 = len(list_a)
    len2 = len(list_b)
    final_sorted_list = []
    j = 0  # index into list_b
    k = 0  # index into list_a
    for i in range(len1 + len2):
        if k == len1:
            # list_a is exhausted; append the rest of list_b
            final_sorted_list.extend(list_b[j:])
            break
        elif j == len2:
            # list_b is exhausted; append the rest of list_a
            final_sorted_list.extend(list_a[k:])
            break
        elif list_a[k] < list_b[j]:
            final_sorted_list.append(list_a[k])
            k += 1
        else:
            final_sorted_list.append(list_b[j])
            j += 1
    return final_sorted_list
Similar function can be returned in R as well by following similar steps.
return_union <- function(list_a, list_b)
{
  # Initializing length variables
  len_a <- length(list_a)
  len_b <- length(list_b)
  len <- len_a + len_b
  # Initializing counter variables
  j <- 1
  k <- 1
  # Creating an empty list which has length equal to the sum of both the lists
  list_c <- list(rep(NA, len))
  # Here goes our for loop
  for (i in 1:len)
  {
    if (j > len_a)
    {
      list_c[i:len] <- list_b[k:len_b]
      break
    }
    else if (k > len_b)
    {
      list_c[i:len] <- list_a[j:len_a]
      break
    }
    else if (list_a[[j]] <= list_b[[k]])
    {
      list_c[[i]] <- list_a[[j]]
      j <- j + 1
    }
    else if (list_a[[j]] > list_b[[k]])
    {
      list_c[[i]] <- list_b[[k]]
      k <- k + 1
    }
  }
  return(list(unlist(list_c)))
}
How can you iterate over a list and also retrieve element indices at the same time?
It can be done using the enumerate function, which takes every element in a sequence just like in a list and adds its location just before it.
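For example:

fruits = ["apple", "banana", "cherry"]
for index, value in enumerate(fruits):   # yields (index, element) pairs
    print(index, value)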
What is the difference between gradient descent optimization algorithms Adam and Momentum?
MOMENTUM ALGORITHM
Vanilla gradient descent with momentum is a method of accelerating the gradient descent to move faster towards the global minimum. Mathematically, a decay rate is multiplied to the previous sum of gradients and added with the present gradient to get a new sum of gradients. When the decay rate is assigned zero, it denotes a normal gradient descent. When the decay rate is set to 1, it oscillates like a ball in a frictionless bowl without any end. Hence decay rate is typically chosen around 0.8 to 0.9 to arrive at an end. The momentum algorithm gives us the advantage of escaping the local minima and getting into global minima.
ADAM ALGORITHM
Adaptive Moment Estimation, shortly called Adam, is a combination of Momentum and RMSProp. In the AdaGrad algorithm, the sum of squared gradients only grows, so the effective learning rate keeps shrinking and training becomes prolonged. RMSProp, root mean square propagation, fixes this issue by applying a decay factor to that running sum. In the Adam algorithm, two decay rates are used, namely beta1 and beta2: beta1 is for the first moment, in which the running sum of gradients is considered, and beta2 is for the second moment, in which the running sum of squared gradients is considered. Since the Momentum component gives us faster progress and the RMSProp component lets the gradient adapt its scale in different directions, the combination of the two works well. Thus, the Adam algorithm is considered the go-to choice among deep learning optimizers.
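A schematic NumPy sketch of the two update rules for a single parameter vector (the learning rate, decay rates, and the toy gradient function are all assumptions):

import numpy as np

def grad(w):                      # toy gradient of f(w) = ||w||^2 / 2
    return w

w_m = np.array([1.0, -2.0])       # parameters updated with momentum
v = np.zeros_like(w_m)            # momentum accumulator
w_a = np.array([1.0, -2.0])       # parameters updated with Adam
m = np.zeros_like(w_a)            # first moment (mean of gradients)
s = np.zeros_like(w_a)            # second moment (mean of squared gradients)
lr, beta, beta1, beta2, eps = 0.1, 0.9, 0.9, 0.999, 1e-8

for t in range(1, 101):
    # Momentum: decayed running sum of past gradients plus the current gradient
    v = beta * v + grad(w_m)
    w_m -= lr * v

    # Adam: bias-corrected first and second moments scale the step per direction
    g = grad(w_a)
    m = beta1 * m + (1 - beta1) * g
    s = beta2 * s + (1 - beta2) * g**2
    m_hat = m / (1 - beta1**t)
    s_hat = s / (1 - beta2**t)
    w_a -= lr * m_hat / (np.sqrt(s_hat) + eps)

print(w_m, w_a)                   # both should approach the minimum at [0, 0]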
Why is vectorization considered a powerful method for optimizing numerical code?
Most modern CPUs support SIMD (Single Instruction, Multiple Data), where a single instruction can be applied to multiple data points simultaneously. Vectorization can be defined as the process of transforming code from operating on a single data point at a time to operating on multiple data points simultaneously. Hence, when we say that we have vectorized the code, we are applying a single instruction to multiple data points at once. With a conventional for loop (or while loop, or any other looping technique), we apply the instructions to only one data point per iteration, but with a vectorized approach the instruction can be applied to n (say n = 3) data points per iteration. If we have N such data points and the instruction takes 1 second per data point, the conventional for loop takes about 1 × N = N seconds, whereas the vectorized approach takes about N/n seconds, i.e., the time taken is reduced n-fold.
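A quick NumPy comparison of the looped and vectorized forms of the same computation (the array size is chosen arbitrarily):

import time
import numpy as np

a = np.random.rand(1_000_000)
b = np.random.rand(1_000_000)

start = time.time()
dot_loop = 0.0
for i in range(len(a)):          # one multiply-add per iteration
    dot_loop += a[i] * b[i]
loop_time = time.time() - start

start = time.time()
dot_vec = np.dot(a, b)           # SIMD/BLAS handles many elements at once
vec_time = time.time() - start

print(loop_time, vec_time)       # the vectorized version is typically far faster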
Why it is not advisable to use a SoftMax output activation function in a multi-label classification problem for a one-hot encoded target?
The Sigmoid function, also known as the logistic function, is used when the problem is a binary classification problem. The sigmoid function is given by:
sigmoid(x) = 1 / (1 + e^(-x))
Generally, if the output of the sigmoid function is greater than 0.5, the prediction corresponds to class 1, and to class 0 otherwise.
The SoftMax function is a generalized version of the sigmoid function to multiple dimensions or classes. The SoftMax function assumes that the outputs are mutually exclusive. If the outputs are one-hot encoded, they might not be mutually exclusive, so we do not prefer SoftMax activation functions in such cases.
Another reason is that when we say that the labels are one-hot encoded, then it means that the output will contain either a 0 or a 1, which is a more comfortable scenario for the Sigmoid function rather than the SoftMax function.
What do you understand by conjugate-prior with respect to Naïve Bayes?
Bayes' theorem works on the condition that the probability of an event can be found given that another event has already occurred. In Bayes' theorem, a prior probability is the probability that an observation will belong to a group before the data are collected. The posterior probability is the probability of assigning observations to groups given the data. So, what exactly does a conjugate prior mean? For some likelihood functions, if you choose a certain prior, the posterior ends up in the same distribution family as the prior; such a prior is then known as a conjugate prior.
What do you understand by Hypothesis in the content of Machine Learning?
In machine learning, a hypothesis represents a mathematical function that an algorithm uses to represent the relationship between the target variable and features.
Is Naïve Bayes bad? If yes, under what aspects.
Naïve Bayes is a machine learning algorithm based on the Bayes Theorem. This is used for solving classification problems. It is based on two assumptions, first, each feature/attribute present in the dataset is independent of another, and second, each feature carries equal importance. But this assumption of Naïve Bayes turns out to be disadvantageous. As it assumes that the features are independent of each other, but in real-life scenarios, this assumption cannot be true as there is always some dependence present in the given set of features. Another disadvantage of this algorithm is the ‘zero-frequency problem’ where the model assigns value zero for those features in the test dataset that were not present in the training dataset.
A test has a true positive rate of 100% and a false-positive rate of 5%. There is a population with a 1/1000 rate of having the condition the test identifies. Considering a positive test, what is the probability of having that condition?
Let’s suppose you are being tested for a disease. If you have the illness, the test will end up saying you have the illness. However, if you don’t have the illness, 5% of the time, the test will say you have the illness, and 95% of the time, the test will give an accurate result that you don’t have the illness. Thus there is a 5% error in case you do not have the illness.
Out of 1000 people, the 1 person who has the disease will get a true positive result.
Out of the remaining 999 people, 5% will get false positive results.
Close to 50 people will therefore get a (false) positive result for the disease.
It means that out of 1000 people, 51 people will be tested positive for the disease even though only one person has the illness. There is only a 2% probability of you having the disease even if your reports say that you have the disease.
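The same Bayes' rule arithmetic as a quick check in code:

p_disease, tpr, fpr = 1 / 1000, 1.0, 0.05
p_positive = tpr * p_disease + fpr * (1 - p_disease)   # P(test positive)
print(tpr * p_disease / p_positive)                    # ~0.0196, i.e. about 2%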
In experimental design, is it necessary to do randomization? If yes, why?
Yes, it is necessary to use randomization while designing experiments. By randomization, we try to eliminate the bias as much as possible. The main purpose of randomization is it automatically controls for all lurking variables. Experiments with randomization establish a clearer causal relationship between explanatory variables and response variables by having control over explanatory variables.
How will you assess the statistical significance of insight whether it is a real insight or just by chance?
The statistical importance of insight can be accessed using Hypothesis Testing.
Can you write the formula to calculate R-square?
R-Square can be calculated using the below formula –
1 – (Residual Sum of Squares/ Total Sum of Squares)
What do you understand by statistical power of sensitivity and how do you calculate it?
Sensitivity is commonly used to validate the accuracy of a classifier (Logistic, SVM, RF etc.). Sensitivity is nothing but “Predicted TRUE events/ Total events”. True events here are the events that were true, and the model also predicted them as true.
Calculation of sensitivity is pretty straightforward-
Sensitivity = True Positives /Positives in Actual Dependent Variable
Where True positives are Positive events that are correctly classified as Positives.
Why L1 regularizations cause parameter sparsity whereas L2 regularization does not?
Regularizations in statistics or in the field of machine learning are used to include some extra information in order to solve a problem in a better way. L1 & L2 regularizations are generally used to add constraints to optimization problems.
Geometrically, the L1 constraint region is a diamond with corners on the coordinate axes, while the L2 constraint region is a smooth circle. The optimal solution is likely to hit one of the L1 corners, where some coefficients are exactly zero, whereas this does not happen with L2. So in L1 the less useful variables are penalized all the way to zero, which results in sparsity.
In other words, errors are squared in L2, so the model sees the higher error and tries to minimize that squared error.
What is the difference between skewed and uniform distribution?
When the observations in a dataset are spread equally across the range of the distribution, it is referred to as a uniform distribution. There are no clear peaks in a uniform distribution. Distributions that have more observations on one side of the graph than the other are referred to as skewed distributions. Distributions with fewer observations on the left (towards lower values) are skewed left, and distributions with fewer observations on the right (towards higher values) are skewed right.
How are confidence intervals constructed, and how will you interpret them?
A confidence interval provides a range of values that is likely to contain the population parameter of interest. In most statistical case studies, we tend to estimate the population mean. If the standard deviation of the population is known, the confidence interval for the population mean is calculated as:
CI = x̄ ± z * (σ / √n)
Here, x̄ is the sample mean, σ is the population standard deviation, n is the sample size, and z is the z value from the normal distribution; the z value changes according to the desired confidence level.
While interpreting a confidence interval, It is always necessary to remember that when we are estimating a confidence interval, we are estimating a population parameter using the data from a sample.
The correct way to interpret a 95% confidence interval is: "we are 95% confident that the population parameter lies between the computed lower limit and the computed upper limit."
What is the difference between squared error and absolute error?
Mean Absolute Error
Mean absolute error is the average absolute difference between the predicted and the actual values across the validation set. It gives us the average residual of the validation data. The formula for mean absolute error is MAE = Σ |y_i - y_hat_i| / N.
Mean square error
Mean square error is the average of the squared differences between the predicted and the actual values across the validation set. It gives us the variance of the residuals in the validation data. Unlike MAE, MSE punishes large errors more since it is a squared metric. The formula for mean squared error is MSE = Σ (y_i - y_hat_i)² / N.
How do you decide whether your linear regression model fits the data?
A good fitting regression model results in predicted values closer to the observed values. We can use any of the metrics below to check the performance of a linear regression model on our data.
1. R-squared: It is based on Sum of Squares Total (SST) and Sum of Squares Error (SSE). SST measures how far the data are from the mean of the data, and SSE measures how far the data are from the model’s predicted values. Dividing the difference of SST and SSE with SST will give us the R-squared value. This proportion indicates how well the model is fit. R-squared ranges from zero to one, where zero indicates that the model makes poor predictions and one indicates perfect predictions. An increase in R square is proportional to improvement in the regression model.
2. F-test: The F-test assesses the null hypothesis that all coefficients in the regression model are zero against the alternative hypothesis that at least one is not zero. We fail to reject the null hypothesis when R-squared equals zero.
3. RMSE: It is the square root of the average of the squared residuals. The lower the value of RMSE, the better the model is. The formula for RMSE is RMSE = sqrt( Σ (y_i - y_hat_i)² / N ).
R-squared is considered as a relative measure of fit, whereas RMSE is an absolute measure of fit.
How can you make data normal using Box-Cox transformation?
The Box-Cox transformation is a method of normalizing data, named after the two statisticians who introduced it, George Box and David Cox. Each data point X is transformed using a power transform of the form X^a (more precisely, (X^a - 1)/a for a ≠ 0 and log X for a = 0), where a is the power to which each data point is raised. The Box-Cox procedure searches values of a, typically in the range -5 to +5, until the optimal value of a that best normalizes the data is identified.
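A short SciPy sketch (the skewed sample is synthetic; scipy.stats.boxcox requires strictly positive data):

import numpy as np
from scipy import stats

data = np.random.exponential(scale=2.0, size=1000)     # right-skewed, positive sample

transformed, best_lambda = stats.boxcox(data)           # searches for the optimal lambda
print(best_lambda)
print(stats.skew(data), stats.skew(transformed))        # skewness should shrink toward 0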
How will you define the number of clusters in a clustering algorithm?
Though the Clustering Algorithm is not specified, this question will mostly be asked in reference to K-Means clustering where “K” defines the number of clusters. The objective of clustering is to group similar entities in a way that the entities within a group are similar to each other but the groups are different from each other.
For example, the following image shows three different groups.
The Within Sum of Squares (WSS) is generally used to measure the homogeneity within a cluster. If you plot WSS against a range of numbers of clusters, you typically get a curve that falls steeply and then flattens out; this graph is generally known as the elbow curve. The point after which you don't see any significant decrease in WSS (for example, Number of Clusters = 6 in a typical plot) is known as the bending point and is taken as K in K-Means.
This is the widely used approach but few data scientists also use Hierarchical clustering first to create dendrograms and identify the distinct groups from there.
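A small scikit-learn sketch of this elbow approach on synthetic blobs (the data and the range of k are assumptions):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)   # assumed data

for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 1))   # inertia_ is the within-cluster sum of squares (WSS)
# Plot k against inertia_ and pick the "elbow" where the decrease levels off.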
How important it is to introduce non-linearities in a neural network and why?
We will start by understanding what is neural networks. Neural networks reflect the behaviour of the human brain, allowing computer programs to recognize patterns and solve common problems in the fields of AI, machine learning, and deep learning. A neural network consists of an input layer, an output layer, and one or multiple hidden layers. These hidden layers can contain n number of neurons within them. A neural network works on forward and backward propagation. Weights and biases are fed to the network and an activation function is applied (differs according to the problem statement) and the output is generated.
What is non-linearity? Non-linear activation functions such as sigmoid, tanh, ReLU, and Leaky ReLU allow a neural network to model relationships that are not simply linear. A neural network without an activation function is essentially just a linear regression model. Neurons cannot learn complex mappings with just a linear function; non-linearity is required to capture the errors and make better predictions. Hence, the activation function applies a non-linear transformation to the input, making the network capable of learning and performing more complex tasks. In short, introducing non-linearities helps the network make more complex decisions.
You are given a dataset with 1500 observations and 15 features. How many observations you will select in each decision tree in a random forest?
In a random forest, each decision tree considers a subset of features at each split but is trained on a bootstrap sample of the same size as the original dataset. In this case, the answer will be 1500, as each tree's bootstrap sample contains 1500 observations (drawn with replacement).
Explain the use of Combinatorics in data science.
Combinatorics is the branch of mathematics concerned with counting and arranging sets of objects that meet certain conditions. In computer science, combinatorics is used to study algorithms, i.e., sets of steps or rules devised to address a specific problem. Combinatorial optimization is a subfield of combinatorics related to algorithm theory, machine learning, image analysis, and artificial neural networks. Probability theory uses combinatorics to assign probability values between 0 and 1 to events and to compare them with probability models. Real-world machine learning tasks frequently involve combinatorial structure: models often have to infer or predict over graphs, matchings, hierarchies, informative subsets, or other discrete structures underlying the data. In artificial neural networks, combinatorics appears in feature selection and parameter optimization: in feature selection, you are trying to find an optimal combination of features from a finite set of candidates, and greedy algorithms, meta-heuristics, and information-gain filtering are all common approaches. Back-propagation, in turn, is the algorithm used to find a near-optimal set of weights/parameters.
According to the universal approximation theorem, any function can be approximated as closely as required using a single hidden layer. Then why do people use more?
One of the prominent researchers in AI, Ian Goodfellow, said that "a feedforward network with a single layer is sufficient to represent any function, but the layer may be infeasibly large and may fail to learn and generalize correctly." If you read the statement closely, it tells you the flaws of approximating a function with only a single layer: first, the layer may need to be infeasibly large, and second, the network may fail to learn the function correctly. With just a single hidden layer we cannot always obtain good approximations in practice; as the functions we model become more complex, deeper networks become necessary. From a theoretical perspective you can always approximate the function with a single hidden layer, but in practice getting close to that function is difficult to achieve with only one layer.
How will you calculate the accuracy of a model using a confusion matrix?
The matrix used to describe the performance of a classification model on data for which the true values are known is called a confusion matrix. For a binary classifier it contains the counts of true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN). Accuracy is the quotient of the sum of correctly classified values by the sum of total values: Accuracy = (TP + TN) / (TP + FP + TN + FN).
What is the curse of dimensionality?
High dimensional data refers to data that has a large number of features. The dimension of data is the number of features or attributes in the data. The problems arising while working with high dimensional data are referred to as the curse of dimensionality. It basically means that error increases as the number of features increases in data. Theoretically, more information can be stored in high dimensional data, but practically, it does not help as it can have higher noise and redundancy. It is hard to design algorithms for high dimensional data. Also, the running time increases exponentially with the dimension of data.
Give some situations where you will use an SVM over a Random Forest Machine Learning algorithm and vice-versa.
SVM and Random Forest are both used in classification problems.
a) If you are sure that your data is outlier free and clean, go for SVM. It is the opposite – if your data might contain outliers, then Random forest would be the best choice
b) Generally, SVM consumes more computational resources than Random Forest, so if you are constrained on compute or memory, go for the Random Forest machine learning algorithm.
c) Random Forest gives you a good idea of variable importance in your data, so choose the Random Forest machine learning algorithm if you want to report variable importance.
d) Random Forest machine learning algorithms are preferred for multiclass problems.
e) SVM is preferred for high-dimensional problem sets, such as text classification. As a good data scientist, you should experiment with both, compare their accuracy, or even use an ensemble of several machine learning techniques; a comparison sketch follows below.
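A minimal comparison sketch, assuming scikit-learn; the dataset, kernel, and hyperparameters are illustrative rather than a recipe:

```python
# Comparing an SVM and a Random Forest with cross-validation, and reading
# variable importance from the forest. Dataset and hyperparameters are illustrative.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

svm = make_pipeline(StandardScaler(), SVC(C=1.0, kernel="rbf"))
rf = RandomForestClassifier(n_estimators=200, random_state=0)

print("SVM CV accuracy:", cross_val_score(svm, X, y, cv=5).mean().round(3))
print("RF  CV accuracy:", cross_val_score(rf, X, y, cv=5).mean().round(3))

rf.fit(X, y)
top = rf.feature_importances_.argsort()[::-1][:5]   # variable importance, only from the forest
print("top 5 important feature indices:", top)
```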
What is the benefit of batch normalization?
- The model is less sensitive to hyperparameter tuning.
- High learning rates become acceptable, which results in faster training of the model.
- Weight initialization becomes an easy task.
- Using different non-linear activation functions becomes feasible.
- Training deep neural networks becomes easier because of batch normalization.
- It introduces mild regularization into the network (a minimal usage sketch follows the list).
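A minimal usage sketch, assuming PyTorch; the layer sizes and batch size are arbitrary:

```python
# Batch normalization inside a small fully connected network (PyTorch assumed).
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(20, 64),
    nn.BatchNorm1d(64),   # normalizes each mini-batch, then rescales with learnable gamma/beta
    nn.ReLU(),
    nn.Linear(64, 2),
)

x = torch.randn(32, 20)   # a mini-batch of 32 samples with 20 features
out = model(x)
print(out.shape)          # torch.Size([32, 2])
```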
How will you explain logistic regression to an economist, physician-scientist, and biologist?
Logistic regression is one of the simplest machine learning algorithms. It is used to model the relationship between a categorical dependent variable and one or more independent variables. For a single predictor, the model is given by
P(Y = 1) = 1 / (1 + e^-(a + bX))
where X is the independent variable, a and b are the coefficients, and Y is the dependent variable that takes categorical values (here 0 or 1).
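A small illustrative fit, assuming scikit-learn; the “hours studied vs. passed” numbers are made up purely for demonstration:

```python
# Fitting the logistic model P(Y=1) = 1 / (1 + e^-(a + bX)) on toy data.
import numpy as np
from sklearn.linear_model import LogisticRegression

hours_studied = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])
passed = np.array([0, 0, 0, 1, 0, 1, 1, 1])

clf = LogisticRegression().fit(hours_studied, passed)
print("intercept a:", clf.intercept_[0], "coefficient b:", clf.coef_[0][0])
print("P(pass | 4.5 hours):", clf.predict_proba([[4.5]])[0, 1])
```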
How beneficial is dropout regularization in deep learning models? Does it speed up or slow down the training process, and why?
The dropout regularization method mostly proves beneficial for cases where the dataset is small, and a deep neural network is likely to overfit during training. The computational factor has to be considered for large datasets, which may outweigh the benefit of dropout regularization.
Dropout randomly removes units (not entire layers) from the network during training. Each individual update can be slightly cheaper because some units are inactive, but the noise introduced by dropping units usually means the network needs more epochs to converge, so in practice dropout tends to slow down the overall training process rather than speed it up.
How does the use of dropout work as a regularizer for deep neural networks?
Dropout is a regularization method for deep neural networks that effectively trains an ensemble of different network architectures on a given dataset. During training, a fraction of the units (nodes) in each layer is randomly dropped from the network on every update. This injects noise into the network by forcing the remaining units to probabilistically take on more or less responsibility for the inputs, so units cannot co-adapt too strongly with one another. As a result, dropout makes the neural network model more robust.
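A minimal sketch, assuming PyTorch; the layer sizes and dropout probability are illustrative:

```python
# Dropout as a regularizer (PyTorch assumed).
# During training, each unit's activation is zeroed with probability p=0.5.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(100, 64),
    nn.ReLU(),
    nn.Dropout(p=0.5),   # active only in model.train() mode
    nn.Linear(64, 10),
)

x = torch.randn(8, 100)
model.train()
train_out = model(x)     # units randomly dropped, remaining ones scaled by 1/(1-p)
model.eval()
eval_out = model(x)      # dropout disabled at inference time
print(train_out.shape, eval_out.shape)
```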
What is the benefit of weight initialization in neural networks?
Weights are initialized carefully in neural networks to avoid problems such as vanishing and exploding gradients. If the weights are poorly initialized, the gradient can vanish or explode rapidly during the forward and backward passes through a deep network, which can cause slow convergence or prevent the network from converging at all. Good initialization also helps ensure that training does not oscillate around the minima.
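A small sketch of explicit initialization, assuming PyTorch; He (Kaiming) initialization is used here as one common choice for ReLU layers:

```python
# Explicit weight initialization (PyTorch assumed): He init for ReLU layers
# helps keep activation and gradient magnitudes stable across layers.
import torch
import torch.nn as nn

layer = nn.Linear(256, 256)
nn.init.kaiming_normal_(layer.weight, nonlinearity="relu")  # He initialization
nn.init.zeros_(layer.bias)

x = torch.randn(64, 256)
print("activation std after one layer:", torch.relu(layer(x)).std().item())
```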
What are categorical variables?
A variable that can take a value from a limited, usually fixed, set of values is known as a categorical variable. As the name suggests, categorical variables have a limited number of categories or levels. For example, a variable representing the blood type of a human can only take the values A, B, AB, and O, so it is categorical. The height of a human, in contrast, can take any positive value and is therefore a continuous variable. Unlike a continuous variable, which can take an unlimited number of values, a categorical variable can only take discrete values. A categorical variable that can take only two values is known as a binary variable.
How can you assess a good logistic model?
There are various methods to assess the results of a logistic regression analysis:
•Using the classification (confusion) matrix to look at the true/false positives and negatives.
•Concordance that helps identify the ability of the logistic model to differentiate between the event happening and not happening.
•Lift helps assess the logistic model by comparing it with random selection.
How can outlier values be treated?
Outlier values can be identified using univariate or other graphical analysis methods. If there are only a few outliers, they can be assessed individually; for a large number of outliers, the values can be capped at the 99th or 1st percentile values. Note that not all extreme values are outliers. The most common ways to treat outlier values are:
1) To cap or transform the value to bring it within an acceptable range.
2) To simply remove the value. (Both options are sketched in the example below.)
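A short sketch of both treatments on synthetic data, using NumPy only; the percentile cut-offs follow the 1st/99th convention mentioned above:

```python
# Capping outliers at the 1st and 99th percentiles (winsorization-style clipping),
# versus dropping them. Synthetic data for illustration.
import numpy as np

rng = np.random.default_rng(0)
values = np.concatenate([rng.normal(50, 5, 1000), [250, -120]])  # two extreme outliers

low, high = np.percentile(values, [1, 99])
capped = np.clip(values, low, high)                 # option 1: bring values within a range
kept = values[(values >= low) & (values <= high)]   # option 2: drop them entirely

print("before:", values.min(), values.max())
print("capped:", capped.min(), capped.max(), " rows kept after removal:", kept.size)
```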
What do you understand by feature vectors?
Feature vectors are the set of variables containing values describing each observation’s characteristics in a dataset. These vectors serve as input vectors to a machine learning model.
When working with text data, what are the benefits of using a recurrent neural network over a fully connected network?
The significant benefit of using a recurrent neural network over a fully connected network is that RNN can process temporal information, i.e., the data that comes in sequence, such as sentences.
Moreover, CNN and RNN are used for entirely different purposes.
RNNs are used for sequence classification and sequence labeling, whereas CNNs are used for image classification, recognition, and similar tasks. RNNs and CNNs also differ in structure.
CNNs employ filters within convolutional layers to transform the data, whereas RNNs reuse the hidden state (activations) from previous steps in the sequence to generate the next output.
CNNs also struggle to interpret temporal information such as blocks of text or video frames, whereas RNNs are specifically designed for that purpose.
What are the benefits of using a convolutional neural network over a fully connected network when working with image classification problems?
As the name suggests, the fully connected network is a type of artificial neural network where each neuron in one layer is connected to all the neurons in the next layer while a CNN uses convolution in place of general matrix multiplication in at least one of the layers. CNN is specifically designed to take input as images and differentiate one from another. Fully connected networks aren’t good enough for feature extraction but CNNs are trained to identify and extract the best features from the images. The main advantage of CNN is that it automatically detects the critical features without any human supervision. For example, given a set of images of dogs and cats, it learns features of each class by itself. CNN is also computationally efficient.
In experimental design, is it necessary to do randomization? If yes, why?
Yes, it is necessary to use randomization while designing experiments. Through randomization, we try to eliminate bias as much as possible; its main purpose is that it automatically controls for lurking (confounding) variables. Randomized experiments establish a clearer causal relationship between the explanatory variables and the response variable, because the experimenter retains control over the explanatory variables.
Is it better to have too many false negatives or too many false positives?
The answer depends entirely on the application. A false positive occurs when a test returns a positive result that should have been negative; a false negative occurs when a test returns a negative result that should have been positive. For example, in cancer screening, a false negative means the test comes back negative for a patient who actually has cancer; that patient would not be treated immediately, may suffer a lot, and could eventually succumb to the disease. In such cases, too many false negatives are clearly worse than too many false positives.
What do you understand by outliers and inliers? What would you do if you find them in your dataset?
Data points that lie several standard deviations away from the mean are called outliers, while points that lie in the interior of the distribution are called inliers. There are various ways to deal with them, the most common being to remove them from the dataset entirely, but the right treatment depends on the data under consideration. Broadly, there are three options: keep them in the data, remove them, or replace them with another value (for example, a capped or imputed value).
What do you understand by long and wide data formats?
In the wide format, the dataset has a separate column for each variable. In the long format, the dataset has one column identifying the variable type and another column holding the values of those variables, so each observation occupies several rows.
For example, the wide format stores each subject as one row:
Name | Height | Weight |
A | 170 | 65 |
B | 160 | 58 |
while the long format stores one measurement per row:
Name | Variable | Value |
A | Height | 170 |
A | Weight | 65 |
B | Height | 160 |
B | Weight | 58 |
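A short sketch of the same conversion in code, assuming pandas; the column names and values are illustrative:

```python
# Converting between wide and long formats with pandas (names/values are illustrative).
import pandas as pd

wide = pd.DataFrame({"Name": ["A", "B"], "Height": [170, 160], "Weight": [65, 58]})

# wide -> long: one row per (Name, Variable) pair
long = wide.melt(id_vars="Name", var_name="Variable", value_name="Value")

# long -> wide again
back = long.pivot(index="Name", columns="Variable", values="Value").reset_index()

print(long)
print(back)
```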
What are the advantages and disadvantages of using regularization methods like Ridge Regression?
When a model overfits, we essentially get a low-bias, high-variance model. Regularization is the technique used to reduce this overfitting. Lasso regularization is termed L1 regularization, and ridge regularization is termed L2 regularization. Ridge regression is an extension of linear regression in which a penalty equal to the sum of the squared coefficients, multiplied by alpha, is added to the RSS (residual sum of squares). This penalty term discourages large coefficients and helps get rid of overfitting. Ridge regression = min( sum of squared errors + alpha * (slope)^2 )
Advantages: Ridge regression helps avoid overfitting of the model and works well with data that has high multicollinearity.
Disadvantages: Ridge regression trades variance for bias, so the estimates are not unbiased. All predictors remain in the final model; the coefficients shrink towards zero but are never set exactly to zero, so it does not perform feature selection.
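A minimal sketch contrasting ordinary least squares with ridge, assuming scikit-learn; the dataset and alpha value are arbitrary:

```python
# Ridge regression sketch: the alpha penalty shrinks coefficients toward zero
# without setting any of them exactly to zero.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge

X, y = make_regression(n_samples=100, n_features=10, noise=10.0, random_state=0)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)   # larger alpha => more shrinkage (more bias, less variance)

print("OLS   coefficient magnitudes:", np.abs(ols.coef_).round(1))
print("Ridge coefficient magnitudes:", np.abs(ridge.coef_).round(1))
```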
How do data management procedures like missing data handling make selection bias worse?
Missing value treatment is one of the first tasks a data scientist should perform before starting data analysis. There are multiple methods for it, and if it is not done properly, it can introduce selection bias. Let's see a few missing-value treatments and their impact on selection bias:
Complete case treatment: Complete-case treatment removes an entire row even if only one value is missing. This can introduce selection bias if the values are not missing at random and instead follow some pattern. Suppose you are conducting a survey and a few people did not specify their gender; would you remove all those people? Couldn't that tell a different story?
Available case analysis: Say you are computing a correlation matrix and you remove missing values separately for each pair of variables needed for a particular correlation coefficient. Each coefficient is then computed on a different subset of the data, so the results may not be consistent with one another or representative of the full population.
Mean substitution: Missing values are replaced with the mean of the available values. This can distort the distribution; quantities such as the standard deviation, correlation, and regression coefficients all depend on the spread around the mean, and mean imputation artificially reduces that spread.
Hence, various data management procedures can introduce selection bias into your data if they are not chosen carefully.
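A small sketch of how two of these treatments can distort the picture, assuming pandas; the toy survey data is entirely made up:

```python
# How different missing-value treatments change what the data says (pandas assumed).
import pandas as pd

df = pd.DataFrame({
    "gender": ["F", "M", None, None, "F", "M", None],
    "income": [40, 55, 90, 95, 42, 58, 88],
})

# Complete-case treatment: dropping rows loses the (possibly systematic) non-responders
complete = df.dropna()
print("mean income, all rows:      ", df["income"].mean())
print("mean income, complete cases:", complete["income"].mean())

# Mean substitution: variance is artificially reduced
with_nan = df["income"].where(df["gender"].notna())   # pretend income is missing for non-responders
imputed = with_nan.fillna(with_nan.mean())
print("std before vs after mean imputation:", round(with_nan.std(), 1), round(imputed.std(), 1))
```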
Differentiate between Batch Gradient Descent, Mini-Batch Gradient Descent, and Stochastic Gradient Descent.
Gradient descent is one of the most popular machine learning and deep learning optimization algorithms used to update a learning model’s parameters. There are 3 variants of gradient descent.
Batch Gradient Descent: In batch gradient descent, computation is carried on the entire dataset.
Stochastic Gradient Descent: In stochastic gradient descent, computation is carried over only one single training sample.
Mini Batch Gradient Descent: A small number/batch of training samples is used for computation in mini-batch gradient descent.
For example, if a dataset has 1000 data points, batch GD will compute the gradient on all 1000 points for each parameter update, stochastic GD will update the parameters using one sample at a time, and mini-batch GD will use a batch of, say, 100 data points per update.
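A compact NumPy-only sketch of the three variants on a simple least-squares problem; the learning rate, epoch count, and batch sizes are illustrative:

```python
# Batch, stochastic, and mini-batch gradient descent on a least-squares problem.
# With 1000 points: batch GD uses all 1000 per update, SGD uses 1, mini-batch uses e.g. 100.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=1000)

def gradient(w, Xb, yb):
    # gradient of the mean squared error on the current batch
    return 2 * Xb.T @ (Xb @ w - yb) / len(yb)

def train(batch_size, lr=0.05, epochs=100):
    w = np.zeros(3)
    for _ in range(epochs):
        idx = rng.permutation(len(X))          # reshuffle each epoch
        for start in range(0, len(X), batch_size):
            b = idx[start:start + batch_size]
            w -= lr * gradient(w, X[b], y[b])
    return w

print("batch GD     :", train(batch_size=1000).round(2))   # 1 update per epoch
print("stochastic GD:", train(batch_size=1).round(2))      # 1000 updates per epoch
print("mini-batch GD:", train(batch_size=100).round(2))    # 10 updates per epoch
```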