Top MNC Interview Questions - Data Science
Q1. What is cross-validation?
Cross-validation is a technique used to assess how well a model performs on
new, independent data. The simplest example is a single split of your data
into two groups, training data and testing data: you use the training data to
build the model and the testing data to test it. k-fold cross-validation
repeats this over k different splits and averages the results.
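The single train/test split described above generalizes to k-fold cross-validation, in which the data is cut into k folds and each fold serves once as the test set. The following is a minimal sketch using only the standard library; the function name `k_fold_indices` and the toy size of 10 samples are assumptions made for illustration.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

# Hypothetical toy setup: 10 samples, 5 folds.
folds = list(k_fold_indices(n_samples=10, k=5))
```

In practice you would train on `train_idx`, score on `test_idx` for each fold, and average the k scores.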
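Q2. Which metrics can be used to evaluate a model, and how does the
distribution of the target variable affect the choice?
There are a number of metrics that can be used. For regression they include
adjusted R-squared, MAE, and MSE; for classification they include accuracy,
recall, precision, and F1 score. The right choice depends on the task and on
the distribution of the target variable: for example, accuracy can be
misleading when the classes are highly imbalanced.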
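A few of the metrics listed above can be computed by hand, which helps when explaining them in an interview. This is a small sketch with made-up toy values; the function names are illustrative and not taken from any particular library.

```python
def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Toy regression example: absolute errors are 1, 0 and 2.
reg_mae = mae([3.0, 5.0, 2.0], [2.0, 5.0, 4.0])
reg_mse = mse([3.0, 5.0, 2.0], [2.0, 5.0, 4.0])

# Toy classification example: 2 true positives, 1 false positive, 1 false negative.
precision, recall, f1 = precision_recall_f1([1, 1, 1, 0, 0, 0], [1, 1, 0, 1, 0, 0])
```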
Q3. What does NLP stand for?
NLP stands for Natural Language
Processing. It is a branch of artificial intelligence that gives machines the
ability to read and understand human languages.
Q4. What is the significance of
Sampling? Name some techniques for Sampling?
Answer: For large datasets we cannot analyze the whole volume at once.
Instead we take a sample that represents the whole population. When drawing a
sample from the complete data, we should choose data that is a true
representative of the whole dataset.
There are two main types of sampling techniques in statistics: probability
sampling and non-probability sampling.
· Probability sampling: simple random sampling, cluster sampling, stratified sampling.
· Non-probability sampling: convenience sampling, quota sampling, snowball sampling.
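Two of the probability sampling techniques named above, simple random and stratified sampling, can be sketched with the standard library. The population, the function names, and the 10% sampling fraction below are hypothetical choices for illustration only.

```python
import random
from collections import defaultdict

def simple_random_sample(population, n, seed=0):
    """Draw n items without replacement, each equally likely."""
    return random.Random(seed).sample(population, n)

def stratified_sample(population, key, fraction, seed=0):
    """Sample the same fraction from every stratum defined by key(item)."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for item in population:
        strata[key(item)].append(item)
    sample = []
    for members in strata.values():
        n = max(1, round(len(members) * fraction))  # keep at least one per stratum
        sample.extend(rng.sample(members, n))
    return sample

# Hypothetical population: 80 items in group "A", 20 in group "B".
population = [("A", i) for i in range(80)] + [("B", i) for i in range(20)]
srs = simple_random_sample(population, 10)
strat = stratified_sample(population, key=lambda item: item[0], fraction=0.1)
```

The stratified sample keeps the 80/20 group proportions by construction, while the simple random sample only matches them on average.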
Q5. Explain Naive Bayes Classifier and
the principle on which it works?
Answer: The Naive Bayes Classifier is a probabilistic model that works on the
principle of Bayes' Theorem. It is called "naive" because it assumes the
features are conditionally independent of each other given the class. Its
accuracy can sometimes be improved by combining it with kernel functions, for
example by estimating the feature distributions with kernel density
estimation.
Bayes' Theorem: This theorem describes conditional probability, that is, the
probability of Event A occurring given that Event B has already occurred. It
states that P(A|B) = P(B|A) × P(A) / P(B).
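The conditional probability described above can be made concrete with a short calculation. This is a sketch only: the function name and the numbers (1% prior, 95% and 5% likelihoods) are hypothetical, and P(B) is expanded with the law of total probability.

```python
def bayes_posterior(prior_a, p_b_given_a, p_b_given_not_a):
    """Return P(A|B) = P(B|A) * P(A) / P(B)."""
    # P(B) = P(B|A) * P(A) + P(B|not A) * P(not A)
    evidence = p_b_given_a * prior_a + p_b_given_not_a * (1 - prior_a)
    return p_b_given_a * prior_a / evidence

# Hypothetical example: A = "has condition" (1% of cases),
# B = "test is positive" (95% if A holds, 5% if it does not).
posterior = bayes_posterior(prior_a=0.01, p_b_given_a=0.95, p_b_given_not_a=0.05)
```

With these assumed numbers the posterior is roughly 0.16, which shows how a small prior keeps P(A|B) low even when P(B|A) is high.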
Q6. What is Imbalanced Data? How do
you manage to balance the data?
Answer: If data is distributed across different categories and the
distribution is highly uneven, it is known as imbalanced data. These kinds of
datasets hurt model performance because the category with the most samples
dominates training, resulting in a model that is inaccurate for the smaller
categories.
There are various techniques to handle imbalanced data. We can increase the
number of samples for minority classes (oversampling). We can decrease the
number of samples for classes with extremely high numbers of data points
(undersampling). We can also use a cluster-based technique to increase the
number of data points for all the categories.
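Increasing the number of minority-class samples, as mentioned above, can be done in its simplest form by random oversampling with replacement. The sketch below assumes that approach; the function name and the toy data are hypothetical.

```python
import random
from collections import Counter

def random_oversample(rows, labels, seed=0):
    """Duplicate random minority rows until every class matches the largest one."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, count in counts.items():
        pool = [row for row, lab in zip(rows, labels) if lab == cls]
        for _ in range(target - count):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels

# Hypothetical toy data: four "neg" rows and one "pos" row.
rows = [[0.1], [0.2], [0.3], [0.4], [0.9]]
labels = ["neg", "neg", "neg", "neg", "pos"]
balanced_rows, balanced_labels = random_oversample(rows, labels)
balanced_counts = Counter(balanced_labels)
```

Oversampling should be applied to the training split only, so that duplicated rows do not leak into the test data.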
Q7. Discuss Decision Tree algorithm
A decision tree is a popular supervised machine learning algorithm. It is
mainly used for regression and classification. It breaks a dataset down into
smaller and smaller subsets. A decision tree can handle both categorical and
numerical data.
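The way a tree breaks a dataset into subsets can be illustrated with one split on a single numeric feature. This sketch assumes Gini impurity as the split criterion (one common choice, not the only one); the feature, labels, and function names are made up.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 means pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(values, labels):
    """Return (threshold, weighted_gini) for the best split `value <= threshold`."""
    best = (None, float("inf"))
    distinct = sorted(set(values))
    # Candidate thresholds are midpoints between consecutive distinct values.
    for lo, hi in zip(distinct, distinct[1:]):
        threshold = (lo + hi) / 2
        left = [lab for v, lab in zip(values, labels) if v <= threshold]
        right = [lab for v, lab in zip(values, labels) if v > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (threshold, score)
    return best

# Hypothetical toy data: customer age and whether they bought a product.
ages = [22, 25, 30, 41, 47, 52]
bought = ["no", "no", "no", "yes", "yes", "yes"]
threshold, impurity = best_split(ages, bought)
```

A full tree repeats this search recursively on each resulting subset until a stopping rule is met.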
Q8. What is Power Analysis?
Power analysis is an integral part of experimental design. It helps you
determine the sample size required to detect an effect of a given size with a
specific level of confidence. It also lets you estimate the probability of
detecting that effect under a sample size constraint.
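The sample size calculation described above can be sketched for one common case: comparing the means of two equal-sized groups. The code assumes a two-sided test and a normal approximation, with the effect size expressed as a standardized mean difference (Cohen's d); the function name and the default values of 0.05 and 0.8 are illustrative conventions.

```python
import math
from statistics import NormalDist

def sample_size_per_group(effect_size, alpha=0.05, power=0.8):
    """Approximate n per group: n = 2 * ((z_(1-alpha/2) + z_power) / d) ** 2."""
    z = NormalDist()  # standard normal distribution
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for a two-sided test
    z_power = z.inv_cdf(power)          # quantile matching the desired power
    return math.ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)

# Hypothetical scenario: medium effect (d = 0.5), 5% significance, 80% power.
n = sample_size_per_group(effect_size=0.5)
```

Because this uses the normal approximation, an exact t-test calculation gives a slightly larger sample size.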
Q9. Explain Collaborative filtering
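Collaborative filtering is a technique that searches for patterns by combining
viewpoints from multiple data sources and agents. Recommender systems
commonly use it to predict a user's preferences from the preferences of
similar users.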
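One way to picture collaborative filtering is a user-based recommender that predicts a missing rating from the ratings of similar users. The sketch below assumes cosine similarity over shared items as the similarity measure; the users, films, ratings, and function names are all hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity over the items both users have rated."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[i] * b[i] for i in shared)
    norm_a = math.sqrt(sum(a[i] ** 2 for i in shared))
    norm_b = math.sqrt(sum(b[i] ** 2 for i in shared))
    return dot / (norm_a * norm_b)

def predict_rating(ratings, user, item):
    """Similarity-weighted average of other users' ratings for `item`."""
    num = den = 0.0
    for other, other_ratings in ratings.items():
        if other == user or item not in other_ratings:
            continue
        sim = cosine_similarity(ratings[user], other_ratings)
        num += sim * other_ratings[item]
        den += abs(sim)
    return num / den if den else None

# Hypothetical ratings on a 1-5 scale; "ann" has not rated film_c.
ratings = {
    "ann": {"film_a": 5, "film_b": 1},
    "bob": {"film_a": 5, "film_b": 1, "film_c": 5},
    "cat": {"film_a": 1, "film_b": 5, "film_c": 1},
}
predicted = predict_rating(ratings, "ann", "film_c")
```

The prediction lands closer to bob's rating than to cat's because ann's existing ratings match bob's.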
Q10. What is bias?
Bias is an error introduced in your model because of the oversimplification
of a machine learning algorithm. It can lead to underfitting.