In robust statistics, robust regression is a form of regression analysis designed to circumvent the limitations of traditional parametric and non-parametric methods. In particular, least squares estimates for regression models are highly non-robust to outliers.
Despite their superior performance over least squares estimation in many situations, robust methods for regression are still not widely used.
Although it is sometimes claimed that least squares (or classical statistical methods in general) are robust, they are only robust in the sense that the type I error rate does not increase under violations of the model. In fact, the type I error rate tends to be lower than the nominal level when outliers are present, and there is often a dramatic increase in the type II error rate. The reduction of the type I error rate has been labelled as the conservatism of classical methods. Other labels might include inefficiency or inadmissibility.
In 1973, Huber introduced M-estimation for regression (see robust statistics for a description of M-estimation). The M in M-estimation stands for "maximum likelihood type". The method is robust to outliers in the response variable, but turned out not to be resistant to outliers in the explanatory variables (leverage points). In fact, when there are outliers in the explanatory variables, the method has no advantage over least squares.
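To make the idea concrete, the following is a minimal sketch of a Huber-type M-estimate for a single-predictor line, computed by iteratively reweighted least squares. The function names, the tuning constant k = 1.345, the use of the rescaled median absolute residual as the scale, and the example data are all assumptions made for illustration; they are not taken from Huber's paper or from any particular library.

```python
import statistics


def weighted_line_fit(x, y, w):
    """Weighted least squares fit of y = a + b*x; returns (a, b)."""
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    b = sxy / sxx
    return my - b * mx, b


def huber_line_fit(x, y, k=1.345, iterations=50):
    """Huber-type M-estimate of a line by iteratively reweighted least squares."""
    a, b = weighted_line_fit(x, y, [1.0] * len(x))  # start from ordinary least squares
    for _ in range(iterations):
        residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
        # Robust residual scale: median absolute residual, rescaled for normal errors.
        scale = 1.4826 * statistics.median([abs(r) for r in residuals])
        if scale == 0.0:
            break  # at least half of the points are fitted exactly
        # Huber weights: 1 inside the threshold, k*scale/|r| beyond it.
        weights = [1.0 if abs(r) <= k * scale else k * scale / abs(r) for r in residuals]
        a, b = weighted_line_fit(x, y, weights)
    return a, b


# Hypothetical data: y = 2 + 3x plus small noise, with one gross outlier in the response.
x = [float(i) for i in range(10)]
noise = [0.1, -0.2, 0.05, 0.15, -0.1, 0.0, 0.2, -0.15, -0.05, 0.1]
y = [2.0 + 3.0 * xi + ei for xi, ei in zip(x, noise)]
y[5] += 30.0

ols_intercept, ols_slope = weighted_line_fit(x, y, [1.0] * len(x))
huber_intercept, huber_slope = huber_line_fit(x, y)
print("OLS:  ", ols_intercept, ols_slope)
print("Huber:", huber_intercept, huber_slope)
```

The outlier here is in the response at an unremarkable value of x, which is the situation the method handles well. Because the weights depend only on the residuals, a point with an extreme x value that drags the fitted line towards itself can end up with a small residual and keep its full weight.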
In the 1980s, several alternatives to M-estimation were proposed as attempts to overcome the lack of resistance. See the book by Rousseeuw and Leroy for a very practical review. Least median of squares and least trimmed squares both appeared to be viable alternatives. However, both of these methods are inefficient, producing parameter estimates with high variability. Another proposed solution was S-estimation. This method finds a line (plane or hyperplane) that minimizes a robust estimate of the scale of the residuals (and it is "scale" from which the method gets the S in its name). This method is highly resistant to leverage points, and is robust to outliers in the response. However, this method was also found to be inefficient.
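As an illustration of resistance to leverage points, here is a rough sketch of least median of squares for a single predictor. It approximates the estimate by searching only over lines through pairs of data points, which is an assumption made to keep the example short rather than an exact minimization. The function names and the data are hypothetical.

```python
import statistics
from itertools import combinations


def ols_line_fit(x, y):
    """Ordinary least squares fit of y = a + b*x; returns (a, b)."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sxy / sxx
    return my - b * mx, b


def lms_line_fit(x, y):
    """Approximate LMS: smallest median squared residual among lines through point pairs."""
    best = None
    for i, j in combinations(range(len(x)), 2):
        if x[i] == x[j]:
            continue  # a vertical line is not a regression candidate
        slope = (y[j] - y[i]) / (x[j] - x[i])
        intercept = y[i] - slope * x[i]
        criterion = statistics.median(
            [(yk - (intercept + slope * xk)) ** 2 for xk, yk in zip(x, y)]
        )
        if best is None or criterion < best[0]:
            best = (criterion, intercept, slope)
    return best[1], best[2]


# Hypothetical data: nine points exactly on y = 2 + 3x and one bad leverage point.
x = [float(i) for i in range(9)] + [30.0]
y = [2.0 + 3.0 * xi for xi in x[:9]] + [0.0]

ols_intercept, ols_slope = ols_line_fit(x, y)
lms_intercept, lms_slope = lms_line_fit(x, y)
print("OLS:", ols_intercept, ols_slope)
print("LMS:", lms_intercept, lms_slope)
```

The pairwise search costs on the order of n squared candidate lines, each scored in about n log n time, so it is only practical for small data sets. The criterion looks at the median squared residual alone, which is one way to see why the estimate uses the data inefficiently.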
MM-estimation attempts to retain the robustness and resistance of S-estimation, whilst gaining the efficiency of M-estimation. The method proceeds by finding a highly robust and resistant S-estimate that minimizes an M-estimate of the scale of the residuals (the first M in the method's name). The estimated scale is then held constant whilst a close-by M-estimate of the parameters is located (the second M).
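The two stages can be sketched for a single-predictor line as follows. This is an illustration under several assumptions, not a reference implementation: the S-estimate is approximated by searching over lines through pairs of points, the loss is Tukey's bisquare, and the tuning constants 1.547 (S stage) and 4.685 (M stage) are commonly quoted textbook choices rather than values given in this article. All function names and the data are hypothetical.

```python
import math
import statistics
from itertools import combinations


def rho_bisquare(u, c):
    """Tukey bisquare loss, scaled so that it equals 1 for |u| >= c."""
    if abs(u) >= c:
        return 1.0
    return 1.0 - (1.0 - (u / c) ** 2) ** 3


def residuals_of(a, b, x, y):
    return [yi - (a + b * xi) for xi, yi in zip(x, y)]


def m_scale(residuals, c=1.547, b=0.5, iterations=30):
    """M-estimate of scale: solves mean(rho(r / s)) = b by fixed-point iteration."""
    s = 1.4826 * statistics.median([abs(r) for r in residuals])
    if s == 0.0:
        return 0.0
    n = len(residuals)
    for _ in range(iterations):
        s *= math.sqrt(sum(rho_bisquare(r / s, c) for r in residuals) / (n * b))
    return s


def s_estimate_line(x, y):
    """Approximate S-estimate: the line through two points with the smallest M-scale."""
    best = None
    for i, j in combinations(range(len(x)), 2):
        if x[i] == x[j]:
            continue
        slope = (y[j] - y[i]) / (x[j] - x[i])
        intercept = y[i] - slope * x[i]
        scale = m_scale(residuals_of(intercept, slope, x, y))
        if best is None or scale < best[0]:
            best = (scale, intercept, slope)
    return best


def mm_line_fit(x, y, c=4.685, iterations=50):
    """MM-type fit: S-estimate first, then a bisquare M-step with the scale held fixed."""
    scale, a, b = s_estimate_line(x, y)
    if scale == 0.0:
        return a, b, scale  # the S-fit already passes exactly through most points
    for _ in range(iterations):
        u = [r / scale for r in residuals_of(a, b, x, y)]
        w = [(1.0 - (ui / c) ** 2) ** 2 if abs(ui) < c else 0.0 for ui in u]
        # Weighted least squares update of the line.
        sw = sum(w)
        mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
        my = sum(wi * yi for wi, yi in zip(w, y)) / sw
        sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
        sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
        b = sxy / sxx
        a = my - b * mx
    return a, b, scale


# Hypothetical data: y = 2 + 3x plus small noise, and one bad leverage point at x = 30.
noise = [0.1, -0.2, 0.05, 0.15, -0.1, 0.0, 0.2, -0.15, -0.05]
x = [float(i) for i in range(9)] + [30.0]
y = [2.0 + 3.0 * xi + ei for xi, ei in zip(x[:9], noise)] + [0.0]

mm_intercept, mm_slope, mm_scale = mm_line_fit(x, y)
print("MM fit:", mm_intercept, mm_slope, "scale:", mm_scale)
```

Holding the scale fixed in the second stage is what lets the M-step use a wider, more efficient tuning constant while still giving zero weight to points that lie many scale units from the initial robust fit.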
It seems reasonable to assume that another reason for the slow uptake of robust methods is the confusing terminology.
Under the assumption of t-distributed residuals, the distribution is a location-scale family. That is, $x \leftarrow (x - \mu)/\sigma$. The degrees of freedom of the t-distribution is sometimes called the kurtosis parameter.
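The location-scale property can be checked numerically with a short sketch of the log-density. The names mu (location), sigma (scale) and nu (degrees of freedom), and the function names, are assumptions introduced for this example.

```python
import math


def t_logpdf(x, mu, sigma, nu):
    """Log-density of a location-scale t-distribution with nu degrees of freedom."""
    z = (x - mu) / sigma  # standardized value
    return (
        math.lgamma((nu + 1) / 2)
        - math.lgamma(nu / 2)
        - 0.5 * math.log(nu * math.pi)
        - math.log(sigma)
        - (nu + 1) / 2 * math.log1p(z * z / nu)
    )


def t_regression_loglik(x, y, intercept, slope, sigma, nu):
    """Log-likelihood of a line y = intercept + slope*x with t-distributed residuals."""
    return sum(t_logpdf(yi, intercept + slope * xi, sigma, nu) for xi, yi in zip(x, y))


# Shifting and rescaling only changes the log-density by -log(sigma).
lhs = t_logpdf(7.0, 5.0, 2.0, 4.0)
rhs = t_logpdf((7.0 - 5.0) / 2.0, 0.0, 1.0, 4.0) - math.log(2.0)
print(lhs, rhs)
```

Because the log-density falls off only logarithmically in the standardized residual, a single very large residual lowers the likelihood far less than it would under a normal model, and the effect is stronger for small nu.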
An alternative parametric approach is to assume that the residuals follow a mixture of normal distributions; in particular, a contaminated normal distribution in which the majority of observations are from a specified normal distribution, but a small proportion are from a normal distribution with much higher variance. That is, residuals have probability $1 - \varepsilon$ of coming from a normal distribution with variance $\sigma^2$ and probability $\varepsilon$ of coming from a normal distribution with variance $c\sigma^2$ for some $c > 1$:
$$e_i \sim (1 - \varepsilon)\,N(0, \sigma^2) + \varepsilon\,N(0, c\sigma^2).$$
Typically, $\varepsilon < 0.1$. This is sometimes called the $\varepsilon$-contamination model.
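A small simulation can make the contamination model tangible. In the sketch below, epsilon is the contamination proportion, sigma the standard deviation of the main component, and c the variance inflation factor of the contaminating component. These names, the default values and the seed are assumptions chosen for illustration.

```python
import random
import statistics


def sample_contaminated_normal(n, epsilon=0.05, sigma=1.0, c=9.0, seed=0):
    """Draw n residuals: N(0, sigma^2) with prob. 1 - epsilon, N(0, c*sigma^2) otherwise."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n):
        contaminated = rng.random() < epsilon
        sd = sigma * math_sqrt(c) if contaminated else sigma
        draws.append(rng.gauss(0.0, sd))
    return draws


def math_sqrt(value):
    """Square root via exponentiation, to keep the example dependency-free."""
    return value ** 0.5


def mixture_variance(epsilon, sigma, c):
    """Theoretical variance of the mixture: sigma^2 * (1 - epsilon + epsilon * c)."""
    return sigma ** 2 * (1.0 - epsilon + epsilon * c)


draws = sample_contaminated_normal(20000, epsilon=0.05, sigma=1.0, c=9.0, seed=1)
empirical_variance = statistics.pvariance(draws)
theoretical_variance = mixture_variance(0.05, 1.0, 9.0)
print(empirical_variance, theoretical_variance)
```

With these settings only about one residual in twenty comes from the wide component, yet the overall variance is inflated by a factor of 1.4, which shows how a small contaminating fraction can dominate variance-based summaries.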
Parametric approaches have the advantage that likelihood theory provides an 'off the shelf' approach to inference, and it is possible to build simulation models from the fit. However, such parametric models still assume that the underlying model is literally true. As such, they do not account for skewed residual distributions or finite observation precisions.
J. J. Faraway, Linear Models with R, Chapman & Hall/CRC, 2004
A. Gelman, J. B. Carlin, H. S. Stern, D. B. Rubin, Bayesian Data Analysis (Second Edition), Chapman & Hall/CRC, 2003
R. Maronna, D. Martin and V. Yohai, Robust Statistics: Theory and Methods, Wiley, 2006
P. J. Rousseeuw and A. M. Leroy, Robust Regression and Outlier Detection, Wiley, 1986 (republished in paperback, 2003)
G. A. F. Seber and A. J. Lee, Linear Regression Analysis (Second Edition), Wiley, 2003
A. J. Stromberg, Why write statistical software? The case of robust statistical methods, Journal of Statistical Software, 2004
W. N. Venables and B. D. Ripley, Modern Applied Statistics with S, Springer, 2002
This article is licensed under the GNU Free Documentation License.
It uses material from the Wikipedia article "Robust regression".