# Data Science Interview Question Part 2

#### Can you cite some examples where both false positive and false negatives are equally important?

In the Banking industry giving loans is the primary source of making money but at the same time if your repayment rate is not good you will not make any profit, rather you will risk huge losses.

Banks don’t want to lose good customers and at the same point in time, they don’t want to acquire bad customers. In this scenario, both the false positives and false negatives become very important to measure.

#### Can you explain the difference between a Validation Set and a Test Set?

A Validation set can be considered as a part of the training set as it is used for parameter selection and to avoid overfitting of the model being built.

On the other hand, a Test Set is used for testing or evaluating the performance of a trained machine learning model.

In simple terms, the differences can be summarized as; training set is to fit the parameters i.e. weights and test set is to assess the performance of the model i.e. evaluating the predictive power and generalization.

#### Explain cross-validation.

Cross-validation is a model validation technique for evaluating how the outcomes of statistical analysis will generalize to an independent dataset. Mainly used in backgrounds where the objective is forecast and one wants to estimate how accurately a model will accomplish in practice.

The goal of cross-validation is to term a data set to test the model in the training phase (i.e. validation data set) in order to limit problems like overfitting and get an insight on how the model will generalize to an independent data set.

Data Science Interview Question

#### What are some of the techniques used for sampling? What is the main advantage of sampling?

Data analysis can not be done on a whole volume of data at a time especially when it involves larger datasets. It becomes crucial to take some data samples that can be used for representing the whole population and then perform analysis on it. While doing this, it is very much necessary to carefully take sample data out of the huge data that truly represents the entire dataset.

There are majorly two categories of sampling techniques based on the usage of statistics, they are:

Probability Sampling techniques: Clustered sampling, Simple random sampling, Stratified sampling.
Non-Probability Sampling techniques: Quota sampling, Convenience sampling, snowball sampling, etc.

#### What does it mean when the p-values are high and low?

A p-value is the measure of the probability of having results equal to or more than the results achieved under a specific hypothesis assuming that the null hypothesis is correct. This represents the probability that the observed difference occurred randomly by chance.

• Low p-value which means values ≤ 0.05 means that the null hypothesis can be rejected and the data is unlikely with true null.
• High p-value, i.e values ≥ 0.05 indicates the strength in favor of the null hypothesis. It means that the data is like with true null.
• p-value = 0.05 means that the hypothesis can go either way.

#### When is resampling done?

Resampling is a methodology used to sample data for improving accuracy and quantify the uncertainty of population parameters. It is done to ensure the model is good enough by training the model on different patterns of a dataset to ensure variations are handled. It is also done in the cases where models need to be validated using random subsets or when substituting labels on data points while performing tests.

#### What do you understand by Imbalanced Data?

Data is said to be highly imbalanced if it is distributed unequally across different categories. These datasets result in an error in model performance and result in inaccuracy.

#### Are there any differences between the expected value and mean value?

There are not many differences between these two, but it is to be noted that these are used in different contexts. The mean value generally refers to the probability distribution whereas the expected value is referred to in the contexts involving random variables.

Data Science Interview Question

#### Define the terms KPI, lift, model fitting, robustness and DOE.

• KPI: KPI stands for Key Performance Indicator that measures how well the business achieves its objectives.
• Lift: This is a performance measure of the target model measured against a random choice model. Lift indicates how good the model is at prediction versus if there was no model.
• Model fitting: This indicates how well the model under consideration fits given observations.
• Robustness: This represents the system’s capability to handle differences and variances effectively.
• DOE: stands for the design of experiments, which represents the task design aiming to describe and explain information variation under hypothesized conditions to reflect variables.

#### Define confounding variables.

Confounding variables are also known as confounders. These variables are a type of extraneous variables that influence both independent and dependent variables causing spurious association and mathematical relationships between those variables that are associated but are not casually related to each other.

#### How are the time series problems different from other regression problems?

• Time series data can be thought of as an extension to linear regression which uses terms like autocorrelation, movement of averages for summarizing historical data of y-axis variables for predicting a better future.
• Forecasting and prediction is the main goal of time series problems where accurate predictions can be made but sometimes the underlying reasons might not be known.
• Having Time in the problem does not necessarily mean it becomes a time series problem. There should be a relationship between target and time for a problem to become a time series problem.
• The observations close to one another in time are expected to be similar to the ones far away which provide accountability for seasonality. For instance, today’s weather would be similar to tomorrow’s weather but not similar to weather from 4 months from today. Hence, weather prediction based on past data becomes a time series problem.

#### Suppose there is a dataset having variables with missing values of more than 30%, how will you deal with such a dataset?

Depending on the size of the dataset, we follow the below ways:

• In case the datasets are small, the missing values are substituted with the mean or average of the remaining data. In pandas, this can be done by using `mean = df.mean()` where df represents the pandas dataframe representing the dataset and mean() calculates the mean of the data. To substitute the missing values with the calculated mean, we can use `df.fillna(mean)`.
• For larger datasets, the rows with missing values can be removed and the remaining data can be used for data prediction.

#### What are the differences between correlation and covariance?

Although these two terms are used for establishing a relationship and dependency between any two random variables, the following are the differences between them:

• Correlation: This technique is used to measure and estimate the quantitative relationship between two variables and is measured in terms of how strong are the variables related.
• Covariance: It represents the extent to which the variables change together in a cycle. This explains the systematic relationship between pair of variables where changes in one affect changes in another variable.

Mathematically, consider 2 random variables, X and Y where the means are represented as {“detectHand”:false} and {“detectHand”:false} respectively and standard deviations are represented by {“detectHand”:false} and {“detectHand”:false} respectively and E represents the expected value operator, then:

• covarianceXY = E[(X-{“detectHand”:false}),(Y-{“detectHand”:false})]
• correlationXY = E[(X-{“detectHand”:false}),(Y-{“detectHand”:false})]/({“detectHand”:false}{“detectHand”:false})
so that
``correlation(X,Y) = covariance(X,Y)/(covariance(X) covariance(Y))``

Based on the above formula, we can deduce that the correlation is dimensionless whereas covariance is represented in units that are obtained from the multiplication of units of two variables.

#### How do you approach solving any data analytics based project?

Generally, we follow the below steps:

• First step is to thoroughly understand the business requirement/problem
• Next, explore the given data and analyze it carefully. If you find any data missing, get the requirements clarified from the business.
• Data cleanup and preparation step is to be performed next which is then used for modeling. Here, the missing values are found and the variables are transformed.
• Run your model against the data, build meaningful visualization and analyze the results to get meaningful insights.
• Release the model implementation, track the results and performance over a specified period to analyze the usefulness.
• Perform cross-validation of the model.

Data Science Interview Question

#### Why is data cleaning crucial? How do you clean the data?

While running an algorithm on any data, to gather proper insights, it is very much necessary to have correct and clean data that contains only relevant information. Dirty data most often results in poor or incorrect insights and predictions which can have damaging effects.

For example, while launching any big campaign to market a product, if our data analysis tells us to target a product that in reality has no demand and if the campaign is launched, it is bound to fail. This results in a loss of the company’s revenue. This is where the importance of having proper and clean data comes into the picture.

• Data Cleaning of the data coming from different sources helps in data transformation and results in the data where the data scientists can work on.
• Properly cleaned data increases the accuracy of the model and provides very good predictions.
• If the dataset is very large, then it becomes cumbersome to run data on it. The data cleanup step takes a lot of time (around 80% of the time) if the data is huge. It cannot be incorporated with running the model. Hence, cleaning data before running the model, results in increased speed and efficiency of the model.
• Data cleaning helps to identify and fix any structural issues in the data. It also helps in removing any duplicates and helps to maintain the consistency of the data.

#### What are the available feature selection methods for selecting the right variables for building efficient predictive models?

While using a dataset in data science or machine learning algorithms, it so happens that not all the variables are necessary and useful to build a model. Smarter feature selection methods are required to avoid redundant models to increase the efficiency of our model. Following are the three main methods in feature selection:

Filter Methods:
These methods pick up only the intrinsic properties of features that are measured via univariate statistics and not cross-validated performance. They are straightforward and are generally faster and require less computational resources when compared to wrapper methods.
There are various filter methods such as the Chi-Square test, Fisher’s Score method, Correlation Coefficient, Variance Threshold, Mean Absolute Difference (MAD) method, Dispersion Ratios, etc.

Wrapper Methods:
These methods need some sort of method to search greedily on all possible feature subsets, access their quality by learning and evaluating a classifier with the feature.
The selection technique is built upon the machine learning algorithm on which the given dataset needs to fit.
There are three types of wrapper methods, they are:
Forward Selection: Here, one feature is tested at a time and new features are added until a good fit is obtained.
Backward Selection: Here, all the features are tested and the non-fitting ones are eliminated one by one to see while checking which works better.
Recursive Feature Elimination: The features are recursively checked and evaluated how well they perform.
These methods are generally computationally intensive and require high-end resources for analysis. But these methods usually lead to better predictive models having higher accuracy than filter methods.

Embedded Methods:
Embedded methods constitute the advantages of both filter and wrapper methods by including feature interactions while maintaining reasonable computational costs.
These methods are iterative as they take each model iteration and carefully extract features contributing to most of the training in that iteration.
Examples of embedded methods: LASSO Regularization (L1), Random Forest Importance.

#### Will treating categorical variables as continuous variables result in a better predictive model?

Yes! A categorical variable is a variable that can be assigned to two or more categories with no definite category ordering. Ordinal variables are similar to categorical variables with proper and clear ordering defines. So, if the variable is ordinal, then treating the categorical value as a continuous variable will result in better predictive models.

#### How will you treat missing values during data analysis?

The impact of missing values can be known after identifying what kind of variables have the missing values.

• If the data analyst finds any pattern in these missing values, then there are chances of finding meaningful insights.
• In case of patterns are not found, then these missing values can either be ignored or can be replaced with default values such as mean, minimum, maximum, or median values.
• If the missing values belong to categorical variables, then they are assigned with default values. If the data has a normal distribution, then mean values are assigned to missing values.
• If 80% values are missing, then it depends on the analyst to either replace them with default values or drop the variables.

#### What are the differences between univariate, bivariate and multivariate analysis?

Statistical analyses are classified based on the number of variables processed at a given time.

#### What is the difference between the Test set and validation set?

The test set is used to test or evaluate the performance of the trained model. It evaluates the predictive power of the model.
The validation set is part of the training set that is used to select parameters for avoiding model overfitting.

Data Science Interview Question

#### What do you understand by a kernel trick?

Kernel functions are generalized dot product functions used for the computing dot product of vectors xx and yy in high dimensional feature space. Kernal trick method is used for solving a non-linear problem by using a linear classifier by transforming linearly inseparable data into separable ones in higher dimensions.

#### Differentiate between box plot and histogram.

Box plots and histograms are both visualizations used for showing data distributions for efficient communication of information.
Histograms are the bar chart representation of information that represents the frequency of numerical variable values that are useful in estimating probability distribution, variations and outliers.
Boxplots are used for communicating different aspects of data distribution where the shape of the distribution is not seen but still the insights can be gathered. These are useful for comparing multiple charts at the same time as they take less space when compared to histograms.

#### How will you balance/correct imbalanced data?

There are different techniques to correct/balance imbalanced data. It can be done by increasing the sample numbers for minority classes. The number of samples can be decreased for those classes with extremely high data points. Following are some approaches followed to balance data:

Use the right evaluation metrics: In cases of imbalanced data, it is very important to use the right evaluation metrics that provide valuable information.
Specificity/Precision: Indicates the number of selected instances that are relevant.
Sensitivity: Indicates the number of relevant instances that are selected.
F1 score: It represents the harmonic mean of precision and sensitivity.
MCC (Matthews correlation coefficient): It represents the correlation coefficient between observed and predicted binary classifications.
AUC (Area Under the Curve): This represents a relation between the true positive rates and false-positive rates.
For example, consider the below graph that illustrates training data:

Here, if we measure the accuracy of the model in terms of getting “0”s, then the accuracy of the model would be very high -> 99.9%, but the model does not guarantee any valuable information. In such cases, we can apply different evaluation metrics as stated above.

Training Set Resampling: It is also possible to balance data by working on getting different datasets and this can be achieved by resampling. There are two approaches followed under-sampling that is used based on the use case and the requirements:
Under-sampling This balances the data by reducing the size of the abundant class and is used when the data quantity is sufficient. By performing this, a new dataset that is balanced can be retrieved and this can be used for further modeling.
Over-sampling This is used when data quantity is not sufficient. This method balances the dataset by trying to increase the samples size. Instead of getting rid of extra samples, new samples are generated and introduced by employing the methods of repetition, bootstrapping, etc.
Perform K-fold cross-validation correctly: Cross-Validation needs to be applied properly while using over-sampling. The cross-validation should be done before over-sampling because if it is done later, then it would be like overfitting the model to get a specific result. To avoid this, resampling of data is done repeatedly with different ratios.

#### What is better – random forest or multiple decision trees?

Random forest is better than multiple decision trees as random forests are much more robust, accurate, and lesser prone to overfitting as it is an ensemble method that ensures multiple weak decision trees learn strongly.

#### What are some examples when false positive has proven important than false negative?

Before citing instances, let us understand what are false positives and false negatives.

• False Positives are those cases that were wrongly identified as an event even if they were not. They are called Type I errors.
• False Negatives are those cases that were wrongly identified as non-events despite being an event. They are called Type II errors.

Some examples where false positives were important than false negatives are:

• In the medical field: Consider that a lab report has predicted cancer to a patient even if he did not have cancer. This is an example of a false positive error. It is dangerous to start chemotherapy for that patient as he doesn’t have cancer as starting chemotherapy would lead to damage of healthy cells and might even actually lead to cancer.
• In the e-commerce field: Suppose a company decides to start a campaign where they give \$100 gift vouchers for purchasing \$10000 worth of items without any minimum purchase conditions. They assume it would result in at least 20% profit for items sold above \$10000. What if the vouchers are given to the customers who haven’t purchased anything but have been mistakenly marked as those who purchased \$10000 worth of products. This is the case of false-positive error.

#### Give one example where both false positives and false negatives are important equally?

In Banking fields: Lending loans are the main sources of income to the banks. But if the repayment rate isn’t good, then there is a risk of huge losses instead of any profits. So giving out loans to customers is a gamble as banks can’t risk losing good customers but at the same time, they can’t afford to acquire bad customers. This case is a classic example of equal importance in false positive and false negative scenarios.

Data Science Interview Question

#### Is it good to do dimensionality reduction before fitting a Support Vector Model?

If the features number is greater than observations then doing dimensionality reduction improves the SVM (Support Vector Model).

#### What are various assumptions used in linear regression? What would happen if they are violated?

Linear regression is done under the following assumptions:

• The sample data used for modeling represents the entire population.
• There exists a linear relationship between the X-axis variable and the mean of the Y variable.
• The residual variance is the same for any X values. This is called homoscedasticity
• The observations are independent of one another.
• Y is distributed normally for any value of X.

Extreme violations of the above assumptions lead to redundant results. Smaller violations of these result in greater variance or bias of the estimates.

#### How is feature selection performed using the regularization method?

The method of regularization entails the addition of penalties to different parameters in the machine learning model for reducing the freedom of the model to avoid the issue of overfitting.
There are various regularization methods available such as linear model regularization, Lasso/L1 regularization, etc. The linear model regularization applies penalty over coefficients that multiplies the predictors. The Lasso/L1 regularization has the feature of shrinking some coefficients to zero, thereby making it eligible to be removed from the model.

#### How do you identify if a coin is biased?

To identify this, we perform a hypothesis test as below:
According to the null hypothesis, the coin is unbiased if the probability of head flipping is 50%. According to the alternative hypothesis, the coin is biased and the probability is not equal to 500. Perform the below steps:

• Flip coin 500 times
• Calculate p-value.
• Compare the p-value against the alpha -> result of two-tailed test (0.05/2 = 0.025). Following two cases might occur:
• p-value > alpha: Then null hypothesis holds good and the coin is unbiased.
• p-value < alpha: Then the null hypothesis is rejected and the coin is biased.

#### What do you understand by linear regression?

Linear regression helps in understanding the linear relationship between the dependent and the independent variables. Linear regression is a supervised learning algorithm, which helps in finding the linear relationship between two variables. One is the predictor or the independent variable and the other is the response or the dependent variable. In Linear Regression, we try to understand how the dependent variable changes w.r.t the independent variable. If there is only one independent variable, then it is called simple linear regression, and if there is more than one independent variable then it is known as multiple linear regression.

#### How is Data Science different from traditional application programming?

Data Science takes a fundamentally different approach to building systems that provide value than traditional application development.

In traditional programming paradigms, we used to analyze the input, figure out the expected output, and write code, which contains rules and statements needed to transform the provided input into the expected output. As we can imagine, these rules were not easy to write, especially for those data that even computers had a hard time understanding, e.g., images, videos, etc.

Data Science shifts this process a little bit. In it, we need access to large volumes of data that contain the necessary inputs and their mappings to the expected outputs. Then, we use Data Science algorithms, which use mathematical analysis to generate rules to map the given inputs to outputs. This process of rule generation is called training. After training, we use some data that was set aside before the training phase to test and check the system’s accuracy. The generated rules are a kind of a black box, and we cannot understand how the inputs are being transformed into outputs. However. if the accuracy is good enough, then we can use the system (also called a model).

As described above, in traditional programming, we had to write the rules to map the input to the output, but in Data Science, the rules are automatically generated or learned from the given data. This helped solve some really difficult challenges that were being faced by several companies.

Data Science Interview Question

#### Why Python is used for Data Cleaning in DS?

Data Scientists have to clean and transform the huge data sets in a form that they can work with. It’s is important to deal with the redundant data for better results by removing nonsensical outliers, malformed records, missing values, inconsistent formatting, etc.

Python libraries such as  Matplotlib, Pandas, Numpy, Keras, and SciPy are extensively used for Data cleaning and analysis. These libraries are used to load and clean the data doe effective analysis. For example, a CSV file named “Student” has information about the students of an institute like their names, standard, address, phone number, grades, marks, etc.

#### What are the popular libraries used in Data Science?

Below are the popular libraries used for data extraction, cleaning, visualization, and deploying DS models:

• Tensor Flow: Supports parallel computing with impeccable library management backed by Google.
• SciPy: Mainly used for solving differential equations, multidimensional programming, data manipulation, and visualization through graphs and charts.
• Matplotlib: Being free and open-source, it can be used as a replacement for MATLAB, which results in better performance and low memory consumption.
• PyTorch: Best for projects which involve Machine Learning algorithms and Deep Neural Networks.

#### What is variance in Data Science?

Variance is a type of error that occurs in a Data Science model when the model ends up being too complex and learns features from data, along with the noise that exists in it. This kind of error can occur if the algorithm used to train the model has high complexity, even though the data and the underlying patterns and trends are quite easy to discover. This makes the model a very sensitive one that performs well on the training dataset but poorly on the testing dataset, and on any kind of data that the model has not yet seen. Variance generally leads to poor accuracy in testing and results in overfitting.

#### What is pruning in a decision tree algorithm?

Pruning a decision tree is the process of removing the sections of the tree that are not necessary or are redundant. Pruning leads to a smaller decision tree, which performs better and gives higher accuracy and speed.

#### What is entropy in a decision tree algorithm?

In a decision tree algorithm, entropy is the measure of impurity or randomness. The entropy of a given dataset tells us how pure or impure the values of the dataset are. In simple terms, it tells us about the variance in the dataset.

For example, suppose we are given a box with 10 blue marbles. Then, the entropy of the box is 0 as it contains marbles of the same color, i.e., there is no impurity. If we need to draw a marble from the box, the probability of it being blue will be 1.0. However, if we replace 4 of the blue marbles with 4 red marbles in the box, then the entropy increases to 0.4 for drawing blue marbles.

#### What is information gain in a decision tree algorithm?

When building a decision tree, at each step, we have to create a node that decides which feature we should use to split data, i.e., which feature would best separate our data so that we can make predictions. This decision is made using information gain, which is a measure of how much entropy is reduced when a particular feature is used to split the data. The feature that gives the highest information gain is the one that is chosen to split the data.

Data Science Interview Question

#### What is k-fold cross-validation?

In k-fold cross-validation, we divide the dataset into k equal parts. After this, we loop over the entire dataset k times. In each iteration of the loop, one of the k parts is used for testing, and the other k − 1 parts are used for training. Using k-fold cross-validation, each one of the k parts of the dataset ends up being used for training and testing purposes.

#### Explain how a recommender system works.

A recommender system is a system that many consumer-facing, content-driven, online platforms employ to generate recommendations for users from a library of available content. These systems generate recommendations based on what they know about the users’ tastes from their activities on the platform.

For example, imagine that we have a movie streaming platform, similar to Netflix or Amazon Prime. If a user has previously watched and liked movies from action and horror genres, then it means that the user likes watching the movies of these genres. In that case, it would be better to recommend such movies to this particular user. These recommendations can also be generated based on what users with a similar taste like watching.

#### What is a normal distribution?

Data distribution is a visualization tool to analyze how data is spread out or distributed. Data can be distributed in various ways. For instance, it could be with a bias to the left or to the right, or it could all be jumbled up. Data may also be distributed around a central value, i.e., mean, median, etc. This kind of distribution has no bias either to the left or to the right and is in the form of a bell-shaped curve. This distribution also has its mean equal to the median. This kind of distribution is called a normal distribution.

#### How is Data modeling different from Database design?

Data Modeling: It can be considered as the first step towards the design of a database. Data modeling creates a conceptual model based on the relationship between various data models. The process involves moving from the conceptual stage to the logical model to the physical schema. It involves the systematic method of applying data modeling techniques. Database Design: This is the process of designing the database. The database design creates an output which is a detailed data model of the database. Strictly speaking, database design includes the detailed logical model of a database but it can also include physical design choices and storage parameters.

#### What are precision?

Precision: When we are implementing algorithms for the classification of data or the retrieval of information, precision helps us get a portion of positive class values that are positively predicted. Basically, it measures the accuracy of correct positive predictions. Below is the formula to calculate precision:

#### What is recall?

Recall: It is the set of all positive predictions out of the total number of positive instances. Recall helps us identify the misclassified positive predictions. We use the below formula to calculate recall:

Data Science Interview Question

#### What is the F1 score and how to calculate it?

F1 score helps us calculate the harmonic mean of precision and recall that gives us the test’s accuracy. If F1 = 1, then precision and recall are accurate. If F1 < 1 or equal to 0, then precision or recall is less accurate, or they are completely inaccurate. See below for the formula to calculate the F1 score:

#### Why do we use p-value?

We use the p-value to understand whether the given data really describe the observed effect or not. We use the below formula to calculate the p-value for the effect ‘E’ and the null hypothesis ‘H0’ as true:

#### How are Data Science and Machine Learning related to each other?

Data Science and Machine Learning are two terms that are closely related but are often misunderstood. Both of them deal with data. However, there are some fundamental distinctions that show us how they are different from each other.

Data Science is a broad field that deals with large volumes of data and allows us to draw insights out of this voluminous data. The entire process of Data Science takes care of multiple steps that are involved in drawing insights out of the available data. This process includes crucial steps such as data gathering, data analysis, data manipulation, data visualization, etc.

Machine Learning, on the other hand, can be thought of as a sub-field of Data Science. It also deals with data, but here, we are solely focused on learning how to convert the processed data into a functional model, which can be used to map inputs to outputs, e.g., a model that can expect an image as an input and tell us if that image contains a flower as an output.

In short, Data Science deals with gathering data, processing it, and finally, drawing insights from it. The field of Data Science that deals with building models using algorithms is called Machine Learning. Therefore, Machine Learning is an integral part of Data Science.