#### What is the difference between data science and big data?

The common differences between data science and big data are:

| Big Data | Data Science |
| --- | --- |
| A large collection of data sets that cannot be stored in a traditional system | An interdisciplinary field that includes analytics, statistics, data mining, machine learning, etc. |
| Popular in communication, the purchase and sale of goods, financial services, and the educational sector | Common applications are digital advertising, web research, recommendation systems (Netflix, Amazon, Facebook), and speech and handwriting recognition |
| Solves problems related to data management and handling, and analyzes data for insights that lead to informed decision making | Uses machine learning algorithms and statistical methods to obtain accurate predictions from raw data |
| Popular tools are Hadoop, Spark, Flink, NoSQL, Hive, etc. | Popular tools are Python, R, SAS, SQL, etc. |

#### How do you check for data quality?

Some of the dimensions used to check data quality are:

- Completeness
- Consistency
- Uniqueness
- Integrity
- Conformity
- Accuracy
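As a rough illustration, two of these dimensions (completeness and uniqueness) can be measured directly. This is a minimal sketch; the records, field names, and helper functions are hypothetical.

```python
# Hypothetical records; None marks a missing value.
records = [
    {"id": 1, "email": "a@example.com", "age": 34},
    {"id": 2, "email": None, "age": 29},
    {"id": 2, "email": "c@example.com", "age": None},
    {"id": 4, "email": "d@example.com", "age": 41},
]

def completeness(rows, field):
    """Share of rows in which `field` is present (not None)."""
    return sum(row[field] is not None for row in rows) / len(rows)

def uniqueness(rows, field):
    """Share of distinct values among the non-missing values of `field`."""
    values = [row[field] for row in rows if row[field] is not None]
    return len(set(values)) / len(values)

email_completeness = completeness(records, "email")  # 3 of 4 rows have an email
id_uniqueness = uniqueness(records, "id")            # 3 distinct ids among 4 values
```

Scores below 1.0 point to gaps or duplicates worth investigating before any modeling.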

#### How would you deal with missing random values from a data set?

There are two forms of randomly missing values:

MCAR or Missing completely at random. Data are MCAR when the missing values are randomly distributed across all observations.

We can check for MCAR by partitioning the data into two parts:

- One set with the missing values
- Another set with the non-missing values.

After we have partitioned the data, we conduct a t-test of mean difference to check if there is any difference in the sample between the two data sets.

In case the data are MCAR, we may choose a pair-wise or a list-wise deletion of missing value cases.

MAR or Missing at random. It is a common occurrence. Here, the missing values are not randomly distributed across all observations but are randomly distributed within one or more sub-samples. The probability that a value is missing can be predicted from other observed variables in the model, but not from the missing value itself. Data imputation is mainly performed to replace them.
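The partition-and-compare check for MCAR described above can be sketched as follows. The data are hypothetical, and the sketch computes Welch's variant of the t statistic (an assumption; the text does not say which t-test to use) without deriving a p-value.

```python
import math
import statistics

# Hypothetical rows: (income, age); income is None where it is missing.
rows = [(52000, 31), (None, 29), (61000, 45), (None, 38), (48000, 27),
        (75000, 52), (None, 41), (58000, 36), (None, 33), (66000, 48)]

# Partition on whether income is missing, keeping the fully observed variable (age).
age_missing = [age for income, age in rows if income is None]
age_observed = [age for income, age in rows if income is not None]

def welch_t(a, b):
    """Welch's t statistic for the difference between two sample means."""
    var_a, var_b = statistics.variance(a), statistics.variance(b)
    standard_error = math.sqrt(var_a / len(a) + var_b / len(b))
    return (statistics.mean(a) - statistics.mean(b)) / standard_error

t_stat = welch_t(age_missing, age_observed)
# A |t| well below about 2 gives no evidence against MCAR for this variable;
# a full test would compare t against the t distribution to get a p-value.
```

The same comparison would be repeated for each fully observed variable in the data set.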

Data Science Interview Question

#### What is Hadoop, and why should I care?

Hadoop is an open-source framework that manages data processing and storage for big data applications running on clustered systems.

Apache Hadoop is a collection of open-source utility software that makes it easy to use a network of multiple computers to solve problems involving large amounts of data and computation. It provides a software framework for distributed storage and big data processing using the MapReduce programming model.

Hadoop splits files into large blocks and distributes them across nodes in a cluster. It then transfers packets of code to nodes to process the data in parallel. This allows the data set to be processed faster and more efficiently than if conventional supercomputing architecture were used.
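The MapReduce model mentioned above can be illustrated with a single-process word count. This is only a sketch of the idea in plain Python, not Hadoop's API; the function names and sample blocks are hypothetical.

```python
from collections import defaultdict

# Each string stands in for a block of a file stored on a different node.
blocks = [
    "big data needs big storage",
    "data is split into blocks",
    "blocks are processed in parallel",
]

def map_phase(block):
    """Emit a (word, 1) pair for every word in one block."""
    return [(word, 1) for word in block.split()]

def shuffle(pairs):
    """Group the emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Sum the counts collected for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}

# On a cluster the map calls would run in parallel, one per block.
mapped = [pair for block in blocks for pair in map_phase(block)]
word_counts = reduce_phase(shuffle(mapped))
```

Because each block is mapped independently, the work scales out by adding nodes rather than by using a faster machine.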

#### Which is better – good data or good models?

This might be one of the frequently asked data science interview questions.

The answer to this question is very subjective and depends on the specific case. Big companies prefer good data; it is the foundation of any successful business. Moreover, good models cannot be created without good data.

There is probably no right or wrong answer, so answer based on your own preference (unless the company specifically requires one).

#### Differentiate between wide and long data formats.

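In a wide format, each subject has a single row, and repeated measurements or categories are grouped into separate columns.

In a long format, each row holds one observation, so a subject appears in many rows, with one column naming the variable and another holding its value.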
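A small sketch of converting wide data to long data may make the distinction concrete. The data and the `wide_to_long` helper are hypothetical; in pandas, `DataFrame.melt` performs this kind of reshaping.

```python
# Wide format: one row per subject, one column per measurement (hypothetical data).
wide = [
    {"subject": "A", "week1": 5.1, "week2": 5.4},
    {"subject": "B", "week1": 4.8, "week2": 5.0},
]

def wide_to_long(rows, id_field):
    """Turn each measurement column into its own (id, variable, value) row."""
    long_rows = []
    for row in rows:
        for column, value in row.items():
            if column != id_field:
                long_rows.append(
                    {id_field: row[id_field], "variable": column, "value": value}
                )
    return long_rows

long_format = wide_to_long(wide, "subject")
```

Two subjects with two measurements each become four rows in the long format.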

Advance Data Science Interview Question

#### How much data is enough to get a valid outcome?

Every business is different and is measured in different ways, so there is no single right answer and no fixed amount of data that is always enough. The amount of data required depends on the methods you use and must be large enough to give you a good chance of obtaining meaningful results.

#### What is the importance of statistics in data science?

Statistics helps data scientists get a better idea of customers’ expectations. Using statistical methods, data scientists can acquire knowledge about consumer interest, behavior, engagement, retention, etc. It also helps to build robust data models to validate certain inferences and predictions.

#### What are the different statistical techniques used in data science?

There are many statistical techniques used in data science, including:

- Arithmetic mean: a measure of the average of a set of data
- Graphic display: charts and graphs, such as histograms, pie charts, and bar charts, used to visually display, analyze, clarify, and interpret numerical data
- Correlation: establishes and measures relationships between different variables
- Regression: identifies whether changes in one variable affect others
- Time series: predicts future values by analyzing sequences of past values
- Data mining and other Big Data techniques: process large volumes of data
- Sentiment analysis: determines the attitude of specific agents or people towards an issue, often using data from social networks
- Semantic analysis: helps to extract knowledge from large amounts of text
- A/B testing: determines which of two variants works best, using randomized experiments
- Machine learning: uses learning algorithms to ensure excellent performance in the presence of big data

#### What is an RDBMS? Name some examples of RDBMSs.

This is among the most frequently asked data science interview questions.

A relational database management system (RDBMS) is a database management system that is based on a relational model.

Some examples of RDBMS are MS SQL Server, IBM DB2, Oracle, MySQL, and Microsoft Access.

Interviewers often ask such questions, so be prepared to expand and explain abbreviations like this one.

#### What are a Z test, Chi-Square test, F test, and T-test?

The Z-test is applied to large samples, or when the real (population) variance is known. Z = (estimated mean - real mean) / sqrt(real variance / n).

The Chi-Square test is a statistical method for assessing the goodness of fit between a set of observed values and those expected theoretically.

The F-test is used to compare the variances of two populations. When two sample variances are compared directly, F is their ratio; in ANOVA and regression, F = explained variance / unexplained variance.

The T-test is applied to small samples when the real variance is unknown. T = (estimated mean - real mean) / sqrt(estimated variance / n).
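The Z and T formulas above can be worked through on a small sample. The measurements, the hypothesized mean `mu`, and the known variance `sigma_sq` are hypothetical values chosen for illustration.

```python
import math
import statistics

sample = [5.2, 4.9, 5.6, 5.1, 5.4, 4.8, 5.3, 5.5]  # hypothetical measurements
mu = 5.0         # real (hypothesized population) mean
sigma_sq = 0.09  # real variance, assumed known for the Z-test

n = len(sample)
mean = statistics.mean(sample)  # estimated mean

# Z-test: uses the known population variance.
z = (mean - mu) / math.sqrt(sigma_sq / n)

# T-test: uses the variance estimated from the sample (n - 1 denominator).
t = (mean - mu) / math.sqrt(statistics.variance(sample) / n)
```

The only difference between the two statistics is which variance sits in the denominator; the resulting value is then compared against the Normal or Student's t distribution respectively.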

#### What is association analysis? Where is it used?

Association analysis is the task of uncovering relationships among data. It is used to understand how the data items are associated with each other.

#### What do you understand by Recall and Precision?

Precision is the fraction of retrieved instances that are relevant, while Recall is the fraction of relevant instances that are retrieved.
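As a small worked example of these two definitions, using hypothetical sets of retrieved and relevant item ids:

```python
# Hypothetical retrieval result: ids the system returned vs. ids that are truly relevant.
retrieved = {1, 2, 3, 4, 5}
relevant = {2, 4, 6, 8}

true_positives = retrieved & relevant  # relevant items that were retrieved

precision = len(true_positives) / len(retrieved)  # 2 of 5 retrieved are relevant
recall = len(true_positives) / len(relevant)      # 2 of 4 relevant were retrieved
```

Retrieving more items tends to raise recall while lowering precision, which is why the two are usually reported together.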

#### What is market basket analysis?

Market Basket Analysis is a modeling technique based upon the theory that if you buy a certain group of items, you are more (or less) likely to buy another group of items.
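One common way to quantify "more (or less) likely" is with support, confidence, and lift. These measures are not named in the text above, so treat this as an illustrative sketch on hypothetical baskets.

```python
# Hypothetical baskets, one set of items per transaction.
baskets = [
    {"bread", "butter"},
    {"bread", "butter", "jam"},
    {"bread", "milk"},
    {"milk", "jam"},
    {"bread", "butter", "milk"},
]

def support(items):
    """Fraction of baskets that contain every item in `items`."""
    return sum(items <= basket for basket in baskets) / len(baskets)

# Rule under test: customers who buy bread also buy butter.
confidence = support({"bread", "butter"}) / support({"bread"})
lift = confidence / support({"butter"})  # above 1 suggests a positive association
```

Here three of the four bread baskets also contain butter, and the lift above 1 suggests that buying bread makes butter more likely than its baseline rate.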

#### What is the central limit theorem?

The central limit theorem states that the distribution of a sample average tends toward the Normal distribution as the sample size increases, regardless of the distribution from which the samples are drawn, except when the required moments of the parent distribution (a finite mean and variance) do not exist.
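A quick simulation can make the theorem tangible. This sketch assumes an exponential parent distribution with mean 1 (a strongly skewed choice made only for illustration) and a fixed random seed.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

def sample_mean(n):
    """Mean of n draws from an exponential distribution (mean 1, strongly skewed)."""
    return statistics.mean(random.expovariate(1.0) for _ in range(n))

n = 50
means = [sample_mean(n) for _ in range(2000)]

# The theorem predicts the means cluster around 1 with spread 1 / sqrt(n),
# in a roughly symmetric bell shape despite the skewed parent distribution.
center = statistics.mean(means)
spread = statistics.stdev(means)
```

Plotting a histogram of `means` would show the bell shape emerging even though individual draws are far from Normal.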

#### Explain the difference between type I and type II errors.

Type I error is the rejection of a true null hypothesis or false-positive finding, while Type II error is the non-rejection of a false null hypothesis or false-negative finding.

#### What is Linear Regression?

It is one of the most commonly asked data science interview questions.

Linear regression is the most popular type of predictive analysis. It is used to model the relationship between a scalar response and explanatory variables.
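For a single explanatory variable, the fit can be computed in closed form with ordinary least squares. The data below are hypothetical, and the sketch covers only the one-predictor case.

```python
import statistics

# Hypothetical data: y is roughly 2x + 1 with a little noise.
x = [1, 2, 3, 4, 5]
y = [3.1, 4.9, 7.2, 8.8, 11.0]

x_bar, y_bar = statistics.mean(x), statistics.mean(y)

# Ordinary least squares for one predictor:
# slope = sum of cross-deviations / sum of squared x-deviations.
slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum(
    (xi - x_bar) ** 2 for xi in x
)
intercept = y_bar - slope * x_bar

def predict(value):
    """Predicted response for a new value of the explanatory variable."""
    return intercept + slope * value
```

The fitted line passes through the point of means, which is why the intercept is derived from `x_bar` and `y_bar`.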

#### What are the limitations of a Linear Model/Regression?

- Linear models are limited to linear relationships between the dependent and independent variables
- Linear regression looks at the relationship between the mean of the dependent variable and the independent variables, not at the extremes of the dependent variable
- Linear regression is sensitive to univariate and multivariate outliers
- Linear regression assumes that the data are independent

#### What is a Gaussian distribution and how is it used in data science?

The Gaussian distribution, commonly known as the bell curve, is a common probability distribution. In your answer, describe in detail how it can be used in data science.

#### What is Root Cause Analysis?

Root Cause is defined as a fundamental failure of a process. To analyze such issues, a systematic approach has been devised that is known as Root Cause Analysis (RCA). This method addresses a problem or an accident and gets to its “root cause”.

#### What is the Confusion Matrix?

The confusion matrix is a very useful tool to assess how good a classification model based on machine learning is. It is also known as an error matrix and can be presented as a summary table to evaluate the performance of a classification model. The number of correct and incorrect predictions are summarized with the count values and broken down by each class.

The confusion matrix serves to show explicitly when one class is confused with another, which allows us to work separately with different types of errors.
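A confusion matrix can be built by counting (actual, predicted) pairs. The labels and predictions here are hypothetical.

```python
from collections import Counter

# Hypothetical true labels and model predictions for a two-class problem.
actual    = ["spam", "spam", "ham", "ham",  "spam", "ham", "ham", "spam"]
predicted = ["spam", "ham",  "ham", "spam", "spam", "ham", "ham", "spam"]

# Count every (actual, predicted) pair; each distinct pair is one cell of the matrix.
cells = Counter(zip(actual, predicted))

labels = sorted(set(actual))
# Rows are the actual class, columns are the predicted class.
matrix = [[cells[(a, p)] for p in labels] for a in labels]

for label, row in zip(labels, matrix):
    print(label, row)
```

The diagonal holds the correct predictions, and each off-diagonal cell shows one specific kind of confusion between classes.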

#### What is the difference between Causation and Correlation?

Causation denotes any causal relationship between two events and represents its cause and effects.

Correlation determines the relationship between two or more variables.

Causation necessarily denotes the presence of correlation, but correlation doesn’t necessarily denote causation.

#### What is cross-validation?

Cross-validation is a technique to assess the performance of a model on a new independent dataset. One example of cross-validation could be – splitting the data into two groups – training and testing data, where you use the testing data to test the model and training data to build the model.
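The single train/test split generalizes to k-fold cross-validation, where each fold takes a turn as the testing data. A minimal sketch of generating the index splits (the helper name is hypothetical, and real use would normally shuffle the data first):

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of k roughly equal folds."""
    indices = list(range(n_samples))
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

folds = list(k_fold_indices(n_samples=10, k=3))
# Every sample lands in exactly one test fold, so each is used once for evaluation.
```

The model is built on each `train` set and scored on the matching `test` set, and the scores are averaged.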

#### What do you mean by logistic regression?

Also known as the logit model, logistic regression is a technique to predict a binary outcome from a linear combination of predictor variables.
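The prediction step can be sketched as a linear combination passed through the logistic (sigmoid) function. The coefficients below are hand-picked, hypothetical values; a real model would learn them from data.

```python
import math

def predict_probability(features, weights, bias):
    """Pass a linear combination of the predictors through the logistic function."""
    linear = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-linear))

# Hypothetical, hand-picked coefficients.
weights = [0.8, -0.5]
bias = -0.2

probability = predict_probability([2.0, 1.0], weights, bias)
label = 1 if probability >= 0.5 else 0  # threshold the probability for a binary result
```

The 0.5 threshold is a common default, not a requirement; it can be moved to trade off the two kinds of classification error.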

#### What is ‘cluster sampling’?

Cluster sampling is a probability sampling technique where the researcher divides the population into separate groups, called clusters. Then a simple random sample of clusters is selected from the population. The researcher analyzes the data from the sampled clusters.

#### What happens if two users access the same HDFS file at the same time?

This is a bit of a tricky question. The answer itself is not complicated, but it is easy to get confused because the programs’ reactions look similar.

Several users can read the same file at the same time. For writes, when the first user has the file open for writing, the second user’s write request is rejected because the HDFS NameNode grants exclusive write access to one client at a time.

#### What are the Resampling methods?

Resampling methods are used to estimate the precision of sample statistics, to exchange labels on data points (as in permutation tests), and to validate models.
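The bootstrap is one such method for estimating the precision of a statistic. This sketch estimates the standard error of a mean on hypothetical data, with a fixed seed for repeatability.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is repeatable

data = [12, 15, 9, 20, 14, 11, 17, 13, 16, 10]  # hypothetical sample

def bootstrap_means(values, n_resamples):
    """Mean of each resample drawn with replacement, same size as the original."""
    return [
        statistics.mean(random.choices(values, k=len(values)))
        for _ in range(n_resamples)
    ]

means = bootstrap_means(data, 1000)
standard_error = statistics.stdev(means)  # bootstrap estimate of the mean's precision
```

The spread of the resampled means stands in for the sampling variability that would otherwise require collecting many new samples.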

#### What is selection bias, and how can you avoid it?

Selection bias is an experimental error that occurs when the participant pool, or the subsequent data, is not representative of the target population.

Selection biases cannot be overcome with statistical analysis of existing data alone, though Heckman correction may be used in special cases.

#### What is Correlation Analysis?

Correlation Analysis is a statistical method to evaluate the strength of the relationship between two quantitative variables. In spatial analysis, it consists of estimating autocorrelation coefficients at different spatial separations, so that data are correlated based on distance.
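For two quantitative variables, the strength of a linear relationship is commonly summarized by the Pearson correlation coefficient. A minimal sketch on hypothetical data (the choice of Pearson's coefficient is an assumption; the text does not name a specific one):

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    covariance = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - x_bar) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - y_bar) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)

# Hypothetical data: hours studied vs. exam score.
hours = [1, 2, 3, 4, 5]
scores = [52, 58, 61, 70, 74]

r = pearson_r(hours, scores)  # close to +1: a strong positive linear relationship
```

Values near -1 or +1 indicate a strong linear relationship, and values near 0 indicate little or none.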

#### What is imputation? List the different types of imputation techniques.

Imputation is the process that allows you to replace missing data with other values. Types of imputation techniques include –

Single Imputation: Each missing value is replaced by a single value.

Hot-deck Imputation: The missing value is imputed from a similar record chosen at random from the same data set. The name comes from the decks of punched cards once used to store data.

Cold-deck Imputation: Donor values are selected from another data set.

Mean Imputation: The missing value is replaced with the mean of that variable in the other cases.

Regression Imputation: The missing value is replaced with a value predicted for that variable from the other variables.

Stochastic Regression Imputation: The same as regression imputation, but a random residual term is added to each predicted value.

Multiple Imputation: It is a general approach to the problem of missing data, available in commonly used statistical packages. Unlike single imputation, Multiple Imputation estimates the values multiple times.
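Mean imputation, the simplest technique in the list above, can be sketched in a few lines. The column of ages is hypothetical.

```python
import statistics

# Hypothetical column with missing entries marked as None.
ages = [34, None, 29, 41, None, 36]

observed = [value for value in ages if value is not None]
fill_value = statistics.mean(observed)  # mean of the cases that are present

# Single (mean) imputation: every gap receives the same value.
imputed = [fill_value if value is None else value for value in ages]
```

Because every gap gets the same value, mean imputation shrinks the variable's variance, which is one reason the other techniques exist.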

#### What is an Eigenvalue and Eigenvector?

Eigenvectors are used for understanding linear transformations.

An eigenvalue can be thought of as the strength of the transformation in the direction of its eigenvector, that is, the factor by which the eigenvector is stretched or compressed.
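The defining property, that a matrix only rescales its eigenvectors, can be checked by hand on a small example. The matrix and its eigenpairs below were chosen for illustration; in practice a routine such as `numpy.linalg.eig` computes them.

```python
def mat_vec(matrix, vector):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]

A = [[2, 1],
     [1, 2]]

# Known eigenpairs of A: the transformation only rescales these directions.
pairs = [(3, [1, 1]), (1, [1, -1])]

for eigenvalue, eigenvector in pairs:
    transformed = mat_vec(A, eigenvector)
    scaled = [eigenvalue * x for x in eigenvector]
    print(transformed, scaled)  # the two lists match: A v == lambda v
```

The direction [1, 1] is stretched by a factor of 3, while [1, -1] is left unchanged (a factor of 1).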

#### Which technique is used to predict categorical responses?

Classification techniques are used to predict categorical responses.

#### What is the importance of Sampling?

Sampling is a crucial statistical technique for analyzing large volumes of data. It involves drawing samples that represent the entire data population. It is imperative to choose samples that are true representatives of the whole data set. There are two types of sampling methods: Probability Sampling and Non-probability Sampling.

#### Is it possible to stack two series horizontally? If yes then how will you do it?

Yes, it is possible to stack two series horizontally. We can use the concat() function with axis=1:

df = pd.concat([s1, s2], axis=1)
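A self-contained version of the call above, with two hypothetical series (the names `a` and `b` and their values are made up for illustration):

```python
import pandas as pd

# Two hypothetical series sharing the same default index.
s1 = pd.Series([1, 2, 3], name="a")
s2 = pd.Series(["x", "y", "z"], name="b")

# axis=1 places the series side by side as columns of a DataFrame.
df = pd.concat([s1, s2], axis=1)
```

The series names become the column names, and rows are aligned on the index.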

#### Tell me the method to convert date-strings to timeseries in a series.

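Input: `s = pd.Series(['22 Feb 1984', '22-02-2013', '20170105', '2012/02/08', '2016-11-04', '2015-03-02T11:15'])`

We will use the to_datetime() function: `pd.to_datetime(s)`. Because these strings use different formats, in pandas 2.0 and later pass `format="mixed"` so that the format of each string is inferred individually.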

#### What are the data types used in Python?

Python has the following built-in data types:

- Number (float, integer)
- String
- Tuple
- List
- Set
- Dictionary

Numbers, strings, and tuples are immutable data types, which means that they cannot be modified after they are created. Lists, sets, and dictionaries are mutable, which means they can be modified in place.
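The difference shows up as soon as you try to change a value in place, as in this small sketch:

```python
point = (1, 2)      # tuple: immutable
values = [1, 2]     # list: mutable

values[0] = 99      # in-place change succeeds

try:
    point[0] = 99   # in-place change is rejected
    tuple_changed = True
except TypeError:
    tuple_changed = False

name = "data"
upper = name.upper()  # string methods return a new string; `name` is unchanged
```

One practical consequence is that only immutable values such as tuples and strings can be used as dictionary keys or set members.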

#### What libraries do data scientists use to plot data in Python?

Matplotlib is the main library used to plot data in Python. However, graphics created with this library need a lot of tweaking to look polished and professional. For that reason, many data scientists prefer Seaborn, which is built on top of Matplotlib and can often create attractive and meaningful charts with just one line of code.

#### What packages are used for data mining in Python and R?

There are various packages in Python and R:

Python – Orange, Pandas, NLTK, Matplotlib, and Scikit-learn are some of them.

R – arules, tm, forecast, and ggplot2 are some of the packages.

#### What is Gradient Descent?

Gradient Descent is a popular algorithm used for training Machine Learning models. It finds the values of the parameters of a function (f) that minimize a cost function.
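The core loop is short enough to sketch on a one-parameter problem. The cost function, learning rate, and step count below are hypothetical choices for illustration.

```python
def gradient_descent(gradient, start, learning_rate, steps):
    """Repeatedly step against the gradient to move toward a minimum."""
    x = start
    for _ in range(steps):
        x -= learning_rate * gradient(x)
    return x

# Hypothetical cost function: cost(x) = (x - 3)^2, whose gradient is 2 * (x - 3).
minimum = gradient_descent(
    lambda x: 2 * (x - 3), start=0.0, learning_rate=0.1, steps=100
)
```

A learning rate that is too large makes the updates overshoot and diverge, while one that is too small makes convergence slow.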

#### When do you need to update the algorithm in Data science?

You need to update an algorithm in the following situations:

- You want your data model to evolve as data streams through the infrastructure
- The underlying data source is changing
- The data is non-stationary

#### How to deal with unbalanced data?

Machine learning algorithms often perform poorly on imbalanced data. We can handle such data in a number of ways:

- Using appropriate evaluation metrics for models built on imbalanced data
- Resampling the training set through undersampling and oversampling
- Applying cross-validation properly when using oversampling to address imbalance problems
- Using more data, primarily by ensembling different resampled datasets
- Resampling with different ratios, where the best ratio depends largely on the data and models used
- Clustering the abundant class
- Designing your own models and being creative in combining different techniques and approaches to get the best outcome
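Random oversampling, one of the resampling options listed above, can be sketched as follows. The training set and the `oversample` helper are hypothetical, and the seed is fixed for repeatability.

```python
import random
from collections import Counter

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical imbalanced training set of (features, label) pairs.
training = [([i], "majority") for i in range(8)] + [([i], "minority") for i in range(2)]

def oversample(rows):
    """Duplicate random rows of smaller classes until every class matches the largest."""
    by_label = {}
    for row in rows:
        by_label.setdefault(row[1], []).append(row)
    target = max(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(group)
        balanced.extend(random.choices(group, k=target - len(group)))
    return balanced

balanced = oversample(training)
counts = Counter(label for _, label in balanced)
```

Oversampling should be applied only to the training portion of each cross-validation split; otherwise duplicated rows leak into the test data and inflate the scores.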

#### What is Big Data?

Big Data is a massive, exponentially growing collection of data that cannot be managed, stored, or processed by traditional data management tools.

#### What are some of the important tools used in Big Data analytics?

The important Big Data analytics tools are:

- NodeXL
- KNIME
- Tableau
- Solver
- Open Refine
- Rattle GUI
- QlikView

#### What is Natural Language Processing?

Natural language processing (NLP) is a branch of artificial intelligence that helps computers understand, interpret, and manipulate human language. It focuses on processing human communications, dividing them into parts, and identifying the most relevant elements of the message. It covers both the comprehension and the generation of natural language.

#### Why is natural language processing important?

NLP helps computers communicate with humans in their language and scales other language-related tasks. It contributes towards structuring a highly unstructured data source.

#### What is the usage of natural language processing?

There are several usages of NLP, including –

**Content categorization** – Generate a linguistics-based summary of the document, including search and indexing, content alerts, and duplication detection.

**Discovery and modeling of themes** – Accurately capture meaning and themes in text collections, and apply advanced analytics to text, such as optimization and forecasting.

**Contextual extraction** – Automatically extract structured information from text-based sources.

**Sentiment analysis** – Identification of mood or subjective opinions in large amounts of text, including sentiment mining and average opinions.

**Speech-to-text and text-to-speech conversion** – Transformation of voice commands into written text and vice versa.

**Document summarization** – Automatic generation of synopses of large bodies of text.

**Machine-based translation** – Automatic translation of text or speech from one language to another.

#### What is data visualization?

Data visualization is the process of presenting datasets and other information through visual mediums like charts, graphs, and others. It enables the user to detect patterns, trends, and correlations that might otherwise go unnoticed in traditional reports, tables, or spreadsheets.

#### What does a data scientist do?

A data scientist is a professional who develops complex data analysis processes by designing and developing algorithms that surface relevant findings in the data. They interpret the results and draw conclusions, providing valuable knowledge for a company’s strategic decisions.
