1. What are outliers? What are the various ways to detect them?
Outliers, also referred to as anomalies, are values in a dataset that lie far away from the other values.
It is important to deal with them because when you are building machine learning models, you have to either explain the significance of their occurrence or remove them so that they do not distort the model.
Following are the most commonly used methods to detect them:
i.) Standard Deviation – For approximately normal data, a value more than three standard deviations from the mean is usually treated as an outlier.
ii.) Boxplots – The data is summarized in a box-and-whisker plot. The whiskers typically extend 1.5 times the interquartile range (IQR) beyond the first and third quartiles; any values beyond the whiskers are treated as anomalies.
iii.) DBSCAN Clustering – This method groups the data into clusters made up of core points and border points. Two points are considered part of the same cluster if they lie within a maximum distance “eps” of each other. Points that belong to no cluster, i.e. fall beyond the border points, are labelled noise points.
The biggest challenge with this method is choosing the right value of “eps”.
iv.) Isolation Forest – This method works differently from the others. It assigns an anomaly score to each data point, on the assumption that anomalies are few in number and have attribute values that differ from the normal values. It works well with large datasets.
v.) Random Cut Forest – This method also associates a score with each data value: a low score means the point is normal, while a high score tags it as an anomaly. It works with both streaming (online) and batch (offline) data and can handle high-dimensional data.
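As a minimal sketch, the first two detection rules above (standard deviation and boxplot whiskers) can be implemented with the Python standard library alone; the dataset and thresholds below are illustrative:

```python
import statistics

def zscore_outliers(data, threshold=3.0):
    """Flag values more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(data)
    stdev = statistics.stdev(data)
    return [x for x in data if abs(x - mean) > threshold * stdev]

def iqr_outliers(data, k=1.5):
    """Flag values beyond the boxplot whiskers (Q1 - k*IQR, Q3 + k*IQR)."""
    q1, _, q3 = statistics.quantiles(data, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [x for x in data if x < lower or x > upper]

data = [10, 12, 11, 13, 12, 11, 10, 12] * 3 + [95]  # 95 is an obvious outlier
print(zscore_outliers(data))  # → [95]
print(iqr_outliers(data))     # → [95]
```

Note that both rules can disagree: a single extreme value inflates the standard deviation itself, so on very small samples the z-score rule may mask the outlier it is meant to catch, while the IQR rule is more robust.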
2. What is an outlier? Explain how you might screen for outliers and what you would do if you found them in your dataset. Also, explain what an inlier is, how you might screen for inliers, and what you would do if you found them in your dataset.
– An observation point that is distant from other observations
– Can occur by chance in any distribution
– Often, they indicate measurement error or a heavy-tailed distribution
– Measurement error: discard them or use robust statistics
– Heavy-tailed distribution: high skewness, can’t use tools assuming a normal distribution
– Three-sigma rules (normally distributed data): 1 in 22 observations will differ by twice the standard deviation from the mean
– Three-sigma rules: 1 in 370 observations will differ by three times the standard deviation from the mean
Three-sigma rules example: in a sample of 1000 observations, the presence of up to 5 observations deviating from the mean by more than three times the standard deviation is within the range of what can be expected, being less than twice the expected number and hence within 1 standard deviation of the expected number (Poisson distribution).
If the nature of the distribution is known a priori, it is possible to check whether the number of outliers deviates significantly from what can be expected. For a given cutoff (samples fall beyond the cutoff with probability p), the number of outliers among n samples can be approximated by a Poisson distribution with lambda = pn.
Example: if one takes a normal distribution with a cutoff of 3 standard deviations from the mean, p ≈ 0.3%, and thus, for n = 1000, we can approximate the number of samples whose deviation exceeds 3 sigma by a Poisson distribution with lambda = 3.
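This approximation is easy to check numerically. A small sketch using the values from the example above (n = 1000, p = 0.3%, so lambda = 3):

```python
import math

# Approximate the count of 3-sigma outliers among n samples by Poisson(lam),
# with lam = p * n. For a normal distribution, p ≈ 0.3% beyond 3 sigma.
n, p = 1000, 0.003
lam = p * n  # = 3

def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam)."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

# Probability of observing at most 5 such outliers in 1000 samples:
prob_up_to_5 = sum(poisson_pmf(k, lam) for k in range(6))
print(round(prob_up_to_5, 3))  # → 0.916
```

So seeing up to five 3-sigma deviations in 1000 samples is unremarkable, consistent with the rule of thumb above.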
– No rigid mathematical method
– Subjective exercise: be careful
– QQ plots (sample quantiles vs. theoretical quantiles)
– Depends on the cause
– Retention: when the underlying model is confidently known
– Regression problems: only exclude points which exhibit a large degree of influence on the estimated coefficients (Cook’s distance)
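To illustrate the last point, here is a from-scratch sketch of Cook's distance for a simple one-predictor OLS regression; the data and the common 4/n cutoff are illustrative, and in practice one would use a statistics library:

```python
def cooks_distance(xs, ys):
    """Cook's distance of each point in a simple OLS fit y = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    a = y_bar - b * x_bar
    residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
    p = 2  # number of fitted parameters (intercept and slope)
    s2 = sum(e ** 2 for e in residuals) / (n - p)  # residual variance
    distances = []
    for x, e in zip(xs, residuals):
        h = 1 / n + (x - x_bar) ** 2 / sxx  # leverage of this point
        distances.append(e ** 2 * h / (p * s2 * (1 - h) ** 2))
    return distances

xs = [1, 2, 3, 4, 5, 6, 7, 20]                   # 20 is a high-leverage x
ys = [1.1, 2.0, 2.9, 4.2, 5.1, 5.9, 7.2, 35.0]   # and its y is far off the line
d = cooks_distance(xs, ys)
# Rule of thumb: flag points with D_i > 4/n as influential.
influential = [i for i, di in enumerate(d) if di > 4 / len(xs)]
print(influential)  # → [7]
```

Only the high-leverage point is flagged; the ordinary points barely move the fitted coefficients when deleted, so their distances stay small.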
– Observation lying within the general distribution of other observed values
– Doesn’t perturb the results but is non-conforming and unusual
– Simple example: observation recorded in the wrong unit (°F instead of °C)
– Mahalanobis distance
– Measures the distance between a point and a distribution (or between two correlated random vectors)
– Difference with Euclidean distance: accounts for correlations
– Discard them
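A minimal from-scratch sketch for the 2-D case (the dataset and test points are made up): unlike Euclidean distance, Mahalanobis distance rescales by the covariance, so a point that breaks the correlation pattern scores as far even when it is Euclidean-close to the mean.

```python
def mahalanobis_2d(point, data):
    """Mahalanobis distance of a 2-D point from the sample distribution of data."""
    n = len(data)
    mx = sum(x for x, _ in data) / n
    my = sum(y for _, y in data) / n
    # Sample covariance matrix [[sxx, sxy], [sxy, syy]]
    sxx = sum((x - mx) ** 2 for x, _ in data) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in data) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in data) / (n - 1)
    det = sxx * syy - sxy ** 2
    # Inverse of the 2x2 covariance matrix
    ixx, iyy, ixy = syy / det, sxx / det, -sxy / det
    dx, dy = point[0] - mx, point[1] - my
    return (dx * dx * ixx + 2 * dx * dy * ixy + dy * dy * iyy) ** 0.5

data = [(1, 1.1), (2, 1.9), (3, 3.2), (4, 3.8), (5, 5.0)]  # strongly correlated
on_trend = (6, 6)   # Euclidean-far from the mean, but follows the correlation
off_trend = (2, 4)  # Euclidean-close to the mean, but breaks the correlation
print(mahalanobis_2d(off_trend, data) > mahalanobis_2d(on_trend, data))  # → True
```

The off-trend point is the inlier-style anomaly described above: it sits inside the general range of the data yet does not conform to its correlation structure.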