Top MNC Interview Questions - Data Science

Q1. What is cross-validation?

Cross-validation is a technique used to assess how well a model performs on new, independent data. The simplest form is a holdout split: you divide your data into training data, used to build the model, and testing data, used to evaluate it. In k-fold cross-validation, the data is split into k folds, and each fold serves once as the testing data while the remaining folds are used for training.
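As a rough illustration, here is a minimal k-fold sketch in plain Python. The function names (k_fold_indices, cross_validated_mse) and the mean-predictor "model" are assumptions made for this example, not part of any library.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) once for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index, starting at offset i.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

def cross_validated_mse(ys, k=5):
    """Average test MSE of a toy model that always predicts the training mean."""
    scores = []
    for train, test in k_fold_indices(len(ys), k):
        prediction = sum(ys[i] for i in train) / len(train)
        mse = sum((ys[i] - prediction) ** 2 for i in test) / len(test)
        scores.append(mse)
    return sum(scores) / len(scores)
```

In practice you would replace the mean predictor with a real model; the point is only that every sample is used for testing exactly once.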

Q2. What is the distribution of the target variable?

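The distribution of the target variable is one of the factors that determines which metric you should use to evaluate a model. For example, whether the target is continuous or categorical, and whether its classes are balanced or imbalanced, both affect the choice. There are a number of metrics that can be used, including adjusted R-squared, MAE, and MSE for regression, and accuracy, recall, precision, and F1 score for classification.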
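A few of these metrics can be sketched in plain Python as follows. The function names are chosen for this example, and returning 0.0 when a denominator is zero is a convention assumed here.

```python
def mae(y_true, y_pred):
    """Mean absolute error for regression targets."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mse(y_true, y_pred):
    """Mean squared error for regression targets."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}
```

On an imbalanced target, accuracy can look high even when recall on the rare class is poor, which is why the distribution matters when picking a metric.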

Q3. What does NLP stand for?

NLP stands for Natural Language Processing. It is a branch of artificial intelligence that gives machines the ability to read and understand human languages.

Q4. What is the significance of sampling? Name some sampling techniques.

Answer: With large datasets, we usually cannot analyze the whole volume of data at once. Instead, we take samples that represent the whole population. When drawing a sample from the complete data, we should choose it so that it is a true representative of the whole dataset.

There are two main types of sampling techniques in statistics: probability sampling and non-probability sampling.

·        Probability sampling: simple random sampling, cluster sampling, stratified sampling.

·        Non-probability sampling: convenience sampling, quota sampling, snowball sampling.
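Two of the probability sampling techniques listed above can be sketched with Python's standard random module. The helper names and the stratum_of parameter are hypothetical, and keeping at least one record per stratum is a choice made for this example.

```python
import random
from collections import defaultdict

def simple_random_sample(records, n, seed=0):
    """Draw n records uniformly at random, without replacement."""
    return random.Random(seed).sample(records, n)

def stratified_sample(records, stratum_of, fraction, seed=0):
    """Draw the same fraction from every stratum (at least one record each)."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for record in records:
        strata[stratum_of(record)].append(record)
    sample = []
    for group in strata.values():
        n = max(1, round(len(group) * fraction))
        sample.extend(rng.sample(group, n))
    return sample
```

Stratified sampling keeps the proportions of each group in the sample close to those in the population, which simple random sampling does not guarantee for small samples.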

Q5. Explain the Naive Bayes classifier and the principle on which it works.

Answer: The Naive Bayes classifier is a probabilistic model based on Bayes' theorem. It is called "naive" because it assumes that the features are conditionally independent of each other given the class. Its accuracy can sometimes be improved by estimating the feature likelihoods with kernel density estimation instead of assuming a fixed distribution.

Bayes' theorem: This theorem describes conditional probability, that is, the probability of event A occurring given that event B has already occurred. It states that P(A|B) = P(B|A) * P(A) / P(B).
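To make the principle concrete, here is a small sketch for a two-class case (A versus not A). The function names and the spam-filter numbers in the usage note are made up for illustration.

```python
import math

def bayes_posterior(prior_a, p_b_given_a, p_b_given_not_a):
    """Return P(A|B) = P(B|A) * P(A) / P(B) for a binary event A."""
    # P(B) via the law of total probability over A and not-A.
    evidence = p_b_given_a * prior_a + p_b_given_not_a * (1 - prior_a)
    return p_b_given_a * prior_a / evidence

def naive_bayes_posterior(prior_a, likelihoods_given_a, likelihoods_given_not_a):
    """Same as above, but B is several features assumed independent given the class."""
    joint_a = prior_a * math.prod(likelihoods_given_a)
    joint_not_a = (1 - prior_a) * math.prod(likelihoods_given_not_a)
    return joint_a / (joint_a + joint_not_a)
```

For instance, if 20% of emails are spam, a word appears in 60% of spam and in 5% of non-spam, then bayes_posterior(0.2, 0.6, 0.05) gives the probability that an email containing the word is spam. The "naive" version simply multiplies the per-feature likelihoods together.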

Q6. What is imbalanced data? How do you balance the data?

Answer: When data is distributed across different categories and the distribution is highly skewed, it is known as imbalanced data. Such datasets hurt model performance because the model becomes biased toward the categories with many samples, which results in inaccurate predictions for the rarer categories.

There are various techniques to handle imbalanced data. We can increase the number of samples for the minority classes (oversampling). We can decrease the number of samples for the classes with a very high number of data points (undersampling). We can also use a cluster-based technique to increase the number of data points for all the categories.
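The first two techniques can be sketched as random oversampling and random undersampling. The function names and the label_of parameter are hypothetical, and duplicating or dropping records at random is only the simplest possible variant.

```python
import random
from collections import defaultdict

def _group_by_class(records, label_of):
    """Group records into lists keyed by their class label."""
    groups = defaultdict(list)
    for record in records:
        groups[label_of(record)].append(record)
    return groups

def random_oversample(records, label_of, seed=0):
    """Duplicate random minority records until every class matches the largest."""
    rng = random.Random(seed)
    groups = _group_by_class(records, label_of)
    target = max(len(group) for group in groups.values())
    balanced = []
    for group in groups.values():
        balanced.extend(group)
        balanced.extend(rng.choices(group, k=target - len(group)))
    return balanced

def random_undersample(records, label_of, seed=0):
    """Keep a random subset of each class, sized to match the smallest class."""
    rng = random.Random(seed)
    groups = _group_by_class(records, label_of)
    target = min(len(group) for group in groups.values())
    balanced = []
    for group in groups.values():
        balanced.extend(rng.sample(group, target))
    return balanced
```

Resampling should be applied to the training data only, so that the test data still reflects the real class distribution.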

Q7. Discuss the decision tree algorithm.

A decision tree is a popular supervised machine learning algorithm, used mainly for regression and classification. It breaks a dataset down into smaller and smaller subsets by splitting on feature values. A decision tree can handle both categorical and numerical data.
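One way to picture how a tree breaks data into subsets is a single split on one numeric feature. The sketch below picks the threshold with the lowest weighted Gini impurity; Gini is one common criterion, assumed here for illustration, and the function names are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of class labels (0.0 means perfectly pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(values, labels):
    """Return (threshold, weighted_gini) for the best 'value <= threshold' split."""
    best = (None, float("inf"))
    n = len(values)
    # The largest value is skipped because it would leave the right side empty.
    for threshold in sorted(set(values))[:-1]:
        left = [l for v, l in zip(values, labels) if v <= threshold]
        right = [l for v, l in zip(values, labels) if v > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = (threshold, score)
    return best
```

A full tree repeats this search over every feature and then recurses into the left and right subsets until they are pure or a stopping rule is reached.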

Q8. What is Power Analysis?

Power analysis is an integral part of experimental design. It helps you determine the sample size required to detect an effect of a given size with a specific level of confidence. It also lets you estimate the probability of detecting an effect when the sample size is constrained.
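As a rough sketch, the required sample size for comparing two group means can be estimated with the normal approximation n = 2 * ((z_alpha + z_power) / d)^2 per group, where d is the standardized effect size. The function name and default values are assumptions for this example, and an exact t-test calculation gives slightly larger numbers.

```python
import math
from statistics import NormalDist

def sample_size_per_group(effect_size, alpha=0.05, power=0.8):
    """Approximate n per group for a two-sided, two-sample comparison of means."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the test
    z_power = NormalDist().inv_cdf(power)          # quantile for the desired power
    return math.ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)
```

Smaller effects need far larger samples: halving the effect size roughly quadruples the required sample size.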

Q9. Explain Collaborative filtering

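Collaborative filtering is a technique that searches for patterns by combining viewpoints, multiple data sources, and various agents. In recommender systems, it predicts what a user will like based on the preferences of other users with similar tastes.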
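A minimal user-based sketch is shown below. It assumes ratings are stored as a dictionary of positive numbers per user, and it uses cosine similarity over the items two users have both rated; the function names and data layout are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two users, computed over the items both have rated."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[i] * b[i] for i in shared)
    norm_a = math.sqrt(sum(a[i] ** 2 for i in shared))
    norm_b = math.sqrt(sum(b[i] ** 2 for i in shared))
    return dot / (norm_a * norm_b)

def predict_rating(ratings, user, item):
    """Similarity-weighted average of other users' ratings for the item."""
    weighted_sum = 0.0
    total_similarity = 0.0
    for other, other_ratings in ratings.items():
        if other == user or item not in other_ratings:
            continue
        similarity = cosine_similarity(ratings[user], other_ratings)
        weighted_sum += similarity * other_ratings[item]
        total_similarity += similarity
    # None signals that no similar user has rated the item.
    return weighted_sum / total_similarity if total_similarity else None
```

Users whose past ratings look like yours get more weight, so their opinion of an unseen item dominates the prediction.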

Q10. What is bias?

Bias is an error introduced into your model when a machine learning algorithm oversimplifies the problem. It can lead to underfitting.
