Algorithms

Hadi & Simonoff (1993)

LinRegOutliers.HS93.hs93Function
hs93(setting; alpha = 0.05, basicsubsetindices = nothing)

Perform the Hadi & Simonoff (1993) algorithm for the given regression setting.

Arguments

  • setting::RegressionSetting: RegressionSetting object with a formula and dataset.
  • alpha::Float64: Optional argument of the probability of rejecting the null hypothesis.
  • basicsubsetindices::Array{Int, 1}: Initial basic subset; by default, the algorithm creates an initial set of clean observations.

Description

Performs a forward search by selecting and enlarging an initial clean subset of observations, iterating until the scaled residuals exceed a threshold.

Output

  • ["outliers"]: Array of indices of outliers
  • ["t"]: Threshold, specifically, calculated quantile of a Student-T distribution
  • ["d"]: Internal and external scaled residuals.
  • `["betas"]: Vector of estimated regression coefficients.
  • `["converged"]: Boolean value indicating whether the algorithm converged or not.

Examples

julia> reg0001 = createRegressionSetting(@formula(calls ~ year), phones);
julia> hs93(reg0001)
Dict{Any,Any} with 4 entries:
  "outliers" => [14, 15, 16, 17, 18, 19, 20, 21]
  "t"        => -3.59263
  "d"        => [2.04474, 1.14495, -0.0633255, 0.0632934, -0.354349, -0.766818, -1.06862, -1.47638, -0.7…
  "converged"=> true

References

Hadi, Ali S., and Jeffrey S. Simonoff. "Procedures for the identification of multiple outliers in linear models." Journal of the American Statistical Association 88.424 (1993): 1264-1272.

source

Kianifard & Swallow (1989)

LinRegOutliers.KS89.ks89Function
ks89(setting; alpha = 0.05)

Perform the Kianifard & Swallow (1989) algorithm for the given regression setting.

Arguments

  • setting::RegressionSetting: RegressionSetting object with a formula and dataset.
  • alpha::Float64: Optional argument of the probability of rejecting the null hypothesis.

Description

The algorithm starts with a clean subset of observations. This initial set is then enlarged using recursive residuals. The algorithm terminates when the calculated statistic exceeds a threshold.

Output

  • ["outliers]: Array of indices of outliers.
  • ["betas"]: Vector of regression coefficients.

Examples

julia> reg0001 = createRegressionSetting(@formula(stackloss ~ airflow + watertemp + acidcond), stackloss)
julia> ks89(reg0001)
Dict{String, Vector} with 2 entries:
  "betas"    => [-42.4531, 0.956605, 0.555571, -0.108766]
  "outliers" => [4, 21]

References

Kianifard, Farid, and William H. Swallow. "Using recursive residuals, calculated on adaptively-ordered observations, to identify outliers in linear regression." Biometrics (1989): 571-585.

source

Sebert & Montgomery & Rollier (1998)

LinRegOutliers.SMR98.smr98Function
smr98(setting)

Perform the Sebert, Montgomery and Rollier (1998) algorithm for the given regression setting.

Arguments

  • setting::RegressionSetting: RegressionSetting object with a formula and dataset.

Description

The algorithm starts with an ordinary least squares estimation for a given model and data. Residuals and fitted responses are calculated using the estimated model. A hierarchical clustering analysis is applied using the standardized residuals and standardized fitted responses. The resulting cluster tree is cut using a threshold, e.g., the Mojena criterion, as suggested by the authors. Subtrees with a relatively small number of observations are expected to be clusters of outliers.

Output

  • ["outliers"]: Array of indices of outliers.
  • ["betas"]: Vector of regression coefficients.

Examples

julia> reg0001 = createRegressionSetting(@formula(calls ~ year), phones);
julia> smr98(reg0001)
Dict{String, Vector} with 2 entries:
  "betas"    => [-55.4519, 1.15692]
  "outliers" => [15, 16, 17, 18, 19, 20, 21, 22, 23, 24]

References

Sebert, David M., Douglas C. Montgomery, and Dwayne A. Rollier. "A clustering algorithm for identifying multiple outliers in linear regression." Computational statistics & data analysis 27.4 (1998): 461-484.

source

Least Median of Squares

LinRegOutliers.LMS.lmsFunction
lms(setting; iters = nothing, crit = 2.5)

Perform Least Median of Squares regression estimator with random sampling.

Arguments

  • setting::RegressionSetting: A regression setting object.
  • iters::Int: Number of random samples.
  • crit::Float64: Critical value for standardized residuals.

Description

The LMS (Least Median of Squares) estimator is highly robust, with a 50% breakdown point. The algorithm searches for regression coefficients that minimize the h-th ordered squared residual, where h is Int(floor((n + 1.0) / 2.0)).

Output

  • ["stdres"]: Array of standardized residuals
  • ["S"]: Standard error of regression
  • ["outliers"]: Array of indices of outliers
  • ["objective"]: LMS objective value
  • ["betas"]: Estimated regression coefficients
  • ["crit"]: Threshold value.

Examples

julia> reg = createRegressionSetting(@formula(calls ~ year), phones);

julia> lms(reg)
Dict{Any,Any} with 6 entries:
  "stdres"    => [2.28328, 1.55551, 0.573308, 0.608843, 0.220321, -0.168202, -0.471913, -0.860435, -0.31603, -0.110871  …  85.7265, 88.9849, 103.269, 116.705, 135.229, 159.69,…
  "S"         => 1.17908
  "outliers"  => [14, 15, 16, 17, 18, 19, 20, 21]
  "objective" => 0.515348
  "betas"      => [-56.1972, 1.1581]
  "crit"      => 2.5

References

Rousseeuw, Peter J. "Least median of squares regression." Journal of the American statistical association 79.388 (1984): 871-880.

source

Least Trimmed Squares

LinRegOutliers.LTS.ltsFunction
lts(setting; iters = nothing, crit = 2.5, earlystop = true)

Perform the Fast-LTS (Least Trimmed Squares) algorithm for a given regression setting.

Arguments

  • setting::RegressionSetting: RegressionSetting object with a formula and dataset.
  • iters::Int: Number of iterations.
  • crit::Float64: Critical value.
  • earlystop::Bool: Early stop if the best objective does not change in iters / 2 iterations.

Description

The algorithm searches for estimates of the regression parameters that minimize the sum of the first h ordered squared residuals, where h is Int(floor((n + p + 1.0) / 2.0)). Specifically, this implementation uses the Fast-LTS algorithm, in which concentration steps are used to enlarge a basic subset into a clean subset of size h.
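
As a quick illustrative check of the subset size (using the dimensions of the phones data, n = 24 observations and p = 2 coefficients), h equals the length of the "hsubset" vector in the example below:

julia> n, p = 24, 2;
julia> Int(floor((n + p + 1.0) / 2.0))
13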

Output

  • ["betas"]: Estimated regression coefficients
  • ["S"]: Standard error of regression
  • ["hsubset"]: Best subset of clean observation of size h.
  • ["outliers"]: Array of indices of outliers
  • ["scaled.residuals"]: Array of scaled residuals
  • ["objective"]: LTS objective value.

Examples

julia> reg = createRegressionSetting(@formula(calls ~ year), phones);
julia> lts(reg)
Dict{Any,Any} with 6 entries:
  "betas"            => [-56.5219, 1.16488]
  "S"                => 1.10918
  "hsubset"          => [11, 10, 5, 6, 23, 12, 13, 9, 24, 7, 3, 4, 8]
  "outliers"         => [14, 15, 16, 17, 18, 19, 20, 21]
  "scaled.residuals" => [2.41447, 1.63472, 0.584504, 0.61617, 0.197052, -0.222066, -0.551027, -0.970146, -0.397538, -0.185558  …  …
  "objective"        => 3.43133

References

Rousseeuw, Peter J., and Katrien Van Driessen. "An algorithm for positive-breakdown regression based on concentration steps." Data Analysis. Springer, Berlin, Heidelberg, 2000. 335-346.

source

Minimum Volume Ellipsoid (MVE)

LinRegOutliers.MVE.mveFunction
mve(data; alpha = 0.01)

Performs the Minimum Volume Ellipsoid algorithm for a robust covariance matrix.

Arguments

  • data::DataFrame: Multivariate data.
  • alpha::Float64: Probability for quantiles of Chi-Squared statistic.

Description

mve searches for a robust location vector and a robust scale matrix, e.g., a covariance matrix. The method also reports a useful diagnostic measure, Mahalanobis distances, calculated using these robust counterparts instead of the mean vector and the usual covariance matrix. The Mahalanobis distances are directly comparable with quantiles of a Chi-Squared distribution with p degrees of freedom.

Output

  • ["goal"]: Objective value
  • ["best.subset"]: Indices of best h-subset of observations
  • ["robust.location"]: Vector of robust location measures
  • ["robust.covariance"]: Robust covariance matrix
  • ["squared.mahalanobis"]: Array of Mahalanobis distances calculated using robust location and scale measures.
  • ["chisq.crit"]: Chisquare quantile used in threshold
  • ["alpha"]: Probability used in calculating the Chisquare quantile, e.g chisq.crit
  • ["outliers"]: Array of indices of outliers.

References

Van Aelst, Stefan, and Peter Rousseeuw. "Minimum volume ellipsoid." Wiley Interdisciplinary Reviews: Computational Statistics 1.1 (2009): 71-82.

source

MVE & LTS Plot

LinRegOutliers.MVELTSPlot.mveltsplotFunction
mveltsplot(setting; alpha = 0.05, showplot = true)

Generate MVE - LTS plot for visual detecting of regression outliers.

Arguments

  • setting::RegressionSetting: A regression setting object.
  • alpha::Float64: Probability for quantiles of Chi-Squared statistic.
  • showplot::Bool: Whether a plot is shown; if false, only the statistics are returned.

Description

This method combines lts and mve. Scaled regression residuals obtained by lts and robust distances obtained by mve are used to generate a plot. Although this is a visual method, drawing the plot is not strictly necessary. The algorithm divides the residuals-distances space into 4 parts: one for clean observations, one for vertical outliers (y-space outliers), one for bad leverage points (x-space outliers), and one for good leverage points (observations far from the remaining data in both x and y space).

Output

  • ["plot"]: Generated plot object
  • ["robust.distances"]: Robust Mahalanobis distances
  • ["scaled.residuals"]: Scaled residuals of an lts estimate
  • ["chi.squared"]: Quantile of Chi-Squared distribution
  • ["regular.points"]: Array of indices of clean observations
  • ["outlier.points"]: Array of indices of y-space outliers (vertical outliers)
  • ["leverage.points"]: Array of indices of x-space outliers (bad leverage points)
  • ["outlier.and.leverage.points"]: Array of indices of xy-space outliers (good leverage points)

References

Van Aelst, Stefan, and Peter Rousseeuw. "Minimum volume ellipsoid." Wiley Interdisciplinary Reviews: Computational Statistics 1.1 (2009): 71-82.

Dependencies

This method is enabled when the Plots package is installed and loaded.

source

Billor & Chatterjee & Hadi (2006)

LinRegOutliers.BCH.bchFunction
bch(setting; alpha = 0.05, maxiter = 1000, epsilon = 0.000001)

Perform the Billor & Chatterjee & Hadi (2006) algorithm for the given regression setting.

Arguments

  • setting::RegressionSetting: RegressionSetting object with a formula and dataset.
  • alpha::Float64: Optional argument of the probability of rejecting the null hypothesis.
  • maxiter::Int: Maximum number of iterations for calculating iterative weighted least squares estimates.
  • epsilon::Float64: Accuracy for determining convergence.

Description

The algorithm initially constructs a basic subset. This basic subset is then used to generate initial weights for an iteratively weighted least squares estimation. The regression coefficients obtained in this stage are robust regression estimates. Squared normalized distances and squared normalized residuals are used in bchplot, which provides a visual way of investigating outliers and their properties.

Output

  • ["betas"]: Final estimate of regression coefficients
  • ["squared.normalized.robust.distances"]:
  • ["weights"]: Final weights used in calculation of WLS estimates
  • ["outliers"]: Array of indices of outliers
  • ["squared.normalized.residuals"]: Array of squared normalized residuals
  • ["residuals"]: Array of regression residuals
  • ["basic.subset"]: Array of indices of basic subset.

Examples

julia> reg = createRegressionSetting(@formula(calls ~ year), phones);
julia> bch(reg)
Dict{Any,Any} with 7 entries:
  "betas"                               => [-55.9205, 1.15572]
  "squared.normalized.robust.distances" => [0.104671, 0.0865052, 0.0700692, 0.0553633, 0.0423875, 0.03…
  "weights"                             => [0.00186158, 0.00952088, 0.0787321, 0.0787321, 0.0787321, 0…
  "outliers"                            => [1, 14, 15, 16, 17, 18, 19, 20, 21]
  "squared.normalized.residuals"        => [5.53742e-5, 2.42977e-5, 2.36066e-6, 2.77706e-6, 1.07985e-7…
  "residuals"                           => [2.5348, 1.67908, 0.523367, 0.567651, 0.111936, -0.343779, …
  "basic.subset"                        => [1, 2, 3, 4, 5, 6, 7, 8, 9, 10  …  15, 16, 17, 18, 19, 20, …

References

Billor, Nedret, Samprit Chatterjee, and Ali S. Hadi. "A re-weighted least squares method for robust regression estimation." American journal of mathematical and management sciences 26.3-4 (2006): 229-252.

source

Pena & Yohai (1995)

LinRegOutliers.PY95.py95Function
py95(setting)

Perform the Pena & Yohai (1995) algorithm for the given regression setting.

Arguments

  • setting::RegressionSetting: RegressionSetting object with a formula and dataset.

Description

The algorithm starts by constructing an influence matrix using the results of an ordinary least squares estimate for a given model and data. In the second stage, the eigenstructure of the influence matrix is examined to detect suspected subsets of the data.

Output

  • ["outliers"]: Array of indices of outliers
  • ["suspected.sets"]: Arrays of indices of observations for corresponding eigen value of the influence matrix.
  • ["betas]: Vector of estimated regression coefficients using the clean observations.

Examples

julia> reg0001 = createRegressionSetting(@formula(y ~ x1 + x2 + x3), hbk);
julia> py95(reg0001)
Dict{Any,Any} with 2 entries:
  "outliers"       => [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]
  "suspected.sets" => Set([[14, 13], [43, 54, 24, 38, 22], [6, 10], [14, 7, 8, 3, 10, 2, 5, 6, 1, 9, 4…

References

Peña, Daniel, and Victor J. Yohai. "The detection of influential subsets in linear regression by using an influence matrix." Journal of the Royal Statistical Society: Series B (Methodological) 57.1 (1995): 145-156.

source

Satman (2013)

LinRegOutliers.Satman2013.satman2013Function
satman2013(setting)

Perform Satman (2013) algorithm for the given regression setting.

Arguments

  • setting::RegressionSetting: RegressionSetting object with a formula and dataset.

Description

The algorithm constructs a fast and robust covariance matrix to calculate robust Mahalanobis distances. These distances are then used to construct weights for later use in a weighted least squares estimation. In the last stage, C-steps are iterated on the basic subset found in the previous stages.

Output

  • ["outliers"]: Array of indices of outliers.
  • ["betas"]: Array of estimated regression coefficients.
  • ["residuals"]: Array of residuals.

Examples

julia> reg0001 = createRegressionSetting(@formula(y ~ x1 + x2 + x3), hbk);
julia> satman2013(reg0001)
Dict{Any,Any} with 3 entries:
  "outliers"  => [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 47]
  "betas"     => ...
  "residuals" => ...

References

Satman, Mehmet Hakan. "A new algorithm for detecting outliers in linear regression." International Journal of statistics and Probability 2.3 (2013): 101.

source

Satman (2015)

LinRegOutliers.Satman2015.satman2015Function
satman2015(setting)

Perform Satman (2015) algorithm for the given regression setting.

Arguments

  • setting::RegressionSetting: RegressionSetting object with a formula and dataset.

Description

The algorithm starts by sorting the design matrix using the non-dominated sorting algorithm. An initial basic subset is then constructed using the ranks obtained in the previous stage. After a number of C-steps, observations with high standardized residuals are reported to be outliers.

Output

  • ["outliers]": Array of indices of outliers.
  • [betas]: Array of regression coefficients.
  • [residuals]: Array of residuals.
  • [standardized_residuals]: Array of standardized residuals.

Examples

julia> reg0001 = createRegressionSetting(@formula(y ~ x1 + x2 + x3), hbk);
julia> satman2015(reg0001)
Dict{Any,Any} with 1 entry:
  "outliers" => [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 14, 47]

References

Satman, Mehmet Hakan. "Fast online detection of outliers using least-trimmed squares regression with non-dominated sorting based initial subsets." International Journal of Advanced Statistics and Probability 3.1 (2015): 53.

source

Setan & Halim & Mohd (2000)

LinRegOutliers.ASM2000.asm2000Function
asm2000(setting)

Perform the Setan, Halim and Mohd (2000) algorithm for the given regression setting.

Arguments

  • setting::RegressionSetting: RegressionSetting object with a formula and dataset.

Description

The algorithm performs a Least Trimmed Squares (LTS) estimate and yields pairs of standardized residuals and fitted responses. A single-linkage clustering algorithm is applied to these pairs. As in smr98, the cluster tree is cut using the Mojena criterion. Subtrees with a relatively small number of observations are declared to be outliers.

Output

  • ["outliers"]: Vector of indices of outliers.
  • ["betas"]: Vector of regression coefficients.

Examples

julia> reg0001 = createRegressionSetting(@formula(calls ~ year), phones);
julia> asm2000(reg0001)
Dict{Any, Any} with 2 entries:
  "betas"    => [-63.4816, 1.30406]
  "outliers" => [15, 16, 17, 18, 19, 20]

References

Robiah Adnan, Mohd Nor Mohamad, & Halim Setan (2001). Identifying multiple outliers in linear regression: robust fit and clustering approach. Proceedings of the Malaysian Science and Technology Congress 2000: Symposium C, Vol VI, (p. 400). Malaysia: Confederation of Scientific and Technological Associations in Malaysia COSTAM.

source

Least Absolute Deviations (LAD)

LinRegOutliers.LAD.ladFunction
lad(setting; exact = true)

Perform Least Absolute Deviations regression for a given regression setting.

Arguments

  • setting::RegressionSetting: RegressionSetting object with a formula and dataset.
  • exact::Bool: If true, use exact LAD regression. If false, estimate LAD regression parameters using GA. Default is true.

Description

The LAD estimator searches for the regression parameter estimates that minimize the sum of absolute residuals. The optimization problem is:

Min z = u1(-) + u1(+) + u2(-) + u2(+) + ... + un(-) + un(+)

Subject to:

    y1 - beta0 - beta1 * x1 + u1(-) - u1(+) = 0
    y2 - beta0 - beta1 * x2 + u2(-) - u2(+) = 0
    ...
    yn - beta0 - beta1 * xn + un(-) - un(+) = 0

where

    ui(-), ui(+) >= 0 for i = 1, 2, ..., n
    beta0, beta1 in R
    n: number of observations
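
The sketch below is not the model object that lad builds internally; it is only an illustration of the same linear program for simple regression, assuming JuMP and GLPK are available:

    using JuMP, GLPK

    # Illustrative LAD fit for y = beta0 + beta1 * x, mirroring the formulation above
    function lad_sketch(x, y)
        n = length(y)
        model = Model(GLPK.Optimizer)
        @variable(model, beta0)
        @variable(model, beta1)
        @variable(model, uminus[1:n] >= 0)
        @variable(model, uplus[1:n] >= 0)
        # One equality constraint per observation
        @constraint(model, [i = 1:n], y[i] - beta0 - beta1 * x[i] + uminus[i] - uplus[i] == 0)
        # Minimize the total absolute deviation
        @objective(model, Min, sum(uminus) + sum(uplus))
        optimize!(model)
        return (value(beta0), value(beta1))
    end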

Output

  • ["betas"]: Estimated regression coefficients
  • ["residuals"]: Regression residuals
  • ["model"]: Linear Programming Model

Examples

julia> reg0001 = createRegressionSetting(@formula(calls ~ year), phones);
julia> lad(reg0001)
Dict{Any,Any} with 2 entries:
  "betas"     => [-57.3269, 1.19155]
  "residuals" => [2.14958, 1.25803, 0.0664872, 0.0749413, -0.416605, -0.90815, -1.2997, -1.79124,…
source
lad(X, y, exact = true)

Perform Least Absolute Deviations regression for a given design matrix and response vector.

Arguments

  • X::AbstractMatrix{Float64}: Design matrix of the linear model.
  • y::AbstractVector{Float64}: Response vector of the linear model.
  • exact::Bool: If true, use exact LAD regression. If false, estimate LAD regression parameters using GA. Default is true.
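
Examples

An illustrative call sketch with a hand-built design matrix (a column of ones for the intercept); output omitted:

julia> X = hcat(ones(length(phones.year)), phones.year);
julia> lad(X, Float64.(phones.calls))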
source

Least Trimmed Absolute Deviations (LTA)

LinRegOutliers.LTA.ltaFunction
lta(setting; exact = false, earlystop = true)

Perform the Hawkins & Olive (1999) algorithm (Least Trimmed Absolute Deviations) for the given regression setting.

Arguments

  • setting::RegressionSetting: RegressionSetting object with a formula and dataset.
  • exact::Bool: Consider all possible subsets of p or not where p is the number of regression parameters.
  • earlystop::Bool: Early stop if the best objective does not change in number of remaining iters / 5 iterations.

Description

lta is a trimmed version of lad in which the sum of the first h ordered absolute residuals is minimized, where h is Int(floor((n + p + 1.0) / 2.0)).

Output

  • ["betas"]: Estimated regression coefficients
  • ["objective]: Objective value

Examples

julia> reg0001 = createRegressionSetting(@formula(calls ~ year), phones);
julia> lta(reg0001)
Dict{Any,Any} with 2 entries:
  "betas"     => [-55.5, 1.15]
  "objective" => 5.7

julia> lta(reg0001, exact = true)
Dict{Any,Any} with 2 entries:
  "betas"     => [-55.5, 1.15]
  "objective" => 5.7  

References

Hawkins, Douglas M., and David Olive. "Applications and algorithms for least trimmed sum of absolute deviations regression." Computational Statistics & Data Analysis 32.2 (1999): 119-134.

source
lta(X, y; exact = false)

Perform the Hawkins & Olive (1999) algorithm (Least Trimmed Absolute Deviations) for a given design matrix and response vector.

Arguments

  • X::AbstractMatrix{Float64}: Design matrix of linear regression model.
  • y::AbstractVector{Float64}: Response vector of linear regression model.
  • exact::Bool: Consider all possible subsets of p or not where p is the number of regression parameters.
  • earlystop::Bool: Early stop if the best objective does not change in number of remaining iters / 5 iterations.
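
Examples

An illustrative call sketch with a hand-built design matrix; setting exact = true enumerates all subsets of size p and is only practical for small p:

julia> X = hcat(ones(length(phones.year)), phones.year);
julia> lta(X, Float64.(phones.calls), exact = false)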

References

Hawkins, Douglas M., and David Olive. "Applications and algorithms for least trimmed sum of absolute deviations regression." Computational Statistics & Data Analysis 32.2 (1999): 119-134.

source

Hadi (1992)

LinRegOutliers.Hadi92.hadi1992Function
hadi1992(multivariateData)

Perform Hadi (1992) algorithm for a given multivariate data.

Arguments

  • multivariateData::AbstractMatrix{Float64}: Multivariate data.

Description

The algorithm starts with an initial subset and enlarges it to obtain robust covariance and location estimates.

Output

  • ["outliers"]: Array of indices of outliers
  • ["critical.chi.squared"]: Threshold value for determining being an outlier
  • ["rth.robust.distance"]: rth robust distance, where (r+1)th robust distance is the first one that exceeds the threshold.

Examples

julia> multidata = hcat(hbk.x1, hbk.x2, hbk.x3);

julia> hadi1992(multidata)
Dict{Any,Any} with 3 entries:
  "outliers"              => [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]
  "critical.chi.squared" => 7.81473
  "rth.robust.distance"   => 5.04541

# Reference Hadi, Ali S. "Identifying multiple outliers in multivariate data." Journal of the Royal Statistical Society: Series B (Methodological) 54.3 (1992): 761-771.

source

Marchette & Solka (2003) Data Images

LinRegOutliers.DataImage.dataimageFunction
dataimage(dataMatrix; distance = :mahalanobis)

Generate the Marchette & Solka (2003) data image for a given data matrix.

Arguments

  • dataMatrix::AbstractMatrix{Float64}: Data matrix with dimensions n x p, where n is the number of observations and p is the number of variables.
  • distance::Symbol: Optional argument for the distance function.

Notes

distance is :mahalanobis by default, for Mahalanobis distances. Use

    dataimage(mat, distance = :euclidean)

to use Euclidean distances instead.

Examples

julia> x1 = hbk[:,"x1"];
julia> x2 = hbk[:,"x2"];
julia> x3 = hbk[:,"x3"];
julia> mat = hcat(x1, x2, x3);
julia> di = dataimage(mat, distance = :euclidean)
julia> Plots.plot(di)

References

Marchette, David J., and Jeffrey L. Solka. "Using data images for outlier detection." Computational Statistics & Data Analysis 43.4 (2003): 541-552.

Dependencies

This method is enabled when the Plots package is installed and loaded.

source

Satman's GA based LTS estimation (2012)

LinRegOutliers.GALTS.galtsFunction
galts(setting)

Perform the Satman (2012) algorithm for estimating LTS coefficients.

Arguments

  • setting: A regression setting object.

Description

The algorithm performs a genetic search for estimating LTS coefficients using C-Steps.

Output

  • ["betas"]: Robust regression coefficients
  • ["best.subset"]: Clean subset of h observations, where h is an integer greater than n / 2. The default value of h is Int(floor((n + p + 1.0) / 2.0)).
  • ["objective"]: Objective value

Examples

julia> reg = createRegressionSetting(@formula(calls ~ year), phones);
julia> galts(reg)
Dict{Any,Any} with 3 entries:
  "betas"       => [-56.5219, 1.16488]
  "best.subset" => [3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 23, 24]
  "objective"   => 3.43133

References

Satman, M. Hakan. "A genetic algorithm based modification on the lts algorithm for large data sets." Communications in Statistics-Simulation and Computation 41.5 (2012): 644-652.

source

Fischler & Bolles (1981) RANSAC Algorithm

LinRegOutliers.Ransac.ransacFunction
ransac(setting; t, w=0.5, m=0, k=0, d=0, confidence=0.99)

Run the RANSAC (1981) algorithm for the given regression setting

Arguments

  • setting::RegressionSetting: RegressionSetting object with a formula and a dataset.
  • t::Float64: The threshold distance of a sample point to the regression hyperplane to determine if it fits the model well.
  • w::Float64: The probability of a sample point being an inlier, default = 0.5.
  • m::Int: The number of points to sample when estimating the model parameters in each iteration. If set to 0, it defaults to p points, which is the minimum required.
  • k::Int: The number of iterations to run. If set to 0, it is calculated from the outlier probability and the sample size using the formula given in the paper (see the note after this list).
  • d::Int: The number of close data points required to accept the model. Defaults to the number of data points multiplied by the inlier ratio.
  • confidence::Float64: Required confidence level, used to determine the optimum number of iterations if k is not specified.
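
Notes

The usual RANSAC iteration count, presumably the formula referenced above, chooses k so that at least one all-inlier sample is drawn with the requested confidence: k = ceil(log(1 - confidence) / log(1 - w^m)). An illustrative computation:

julia> w, m, confidence = 0.5, 2, 0.99;
julia> ceil(Int, log(1 - confidence) / log(1 - w^m))
17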

Output

  • ["outliers"]: Array of indices of outliers.

Examples

julia> df = DataFrame(y=[0,1,2,3,3,4,10], x=[0,1,2,2,3,4,2])
julia> reg = createRegressionSetting(@formula(y ~ x), df)
julia> ransac(reg, t=0.8, w=0.85)
Dict{String,Array{Int64,1}} with 1 entry:
  "outliers" => [7]

References

Martin A. Fischler & Robert C. Bolles (June 1981). "Random Sample Consensus: A Paradigm for Model Fitting with Applications to Image Analysis and Automated Cartography" Comm. ACM. 24 (6): 381–395.

source

Minimum Covariance Determinant Estimator (MCD)

LinRegOutliers.MVE.mcdFunction
mcd(data; alpha = 0.01)

Performs the Minimum Covariance Determinant algorithm for a robust covariance matrix.

Arguments

  • data::DataFrame: Multivariate data.
  • alpha::Float64: Probability for quantiles of Chi-Squared statistic.

Description

mcd searches for a robust location vector and a robust scale matrix, e.g., a covariance matrix. The method also reports a useful diagnostic measure, Mahalanobis distances, calculated using these robust counterparts instead of the mean vector and the usual covariance matrix. The Mahalanobis distances are directly comparable with quantiles of a Chi-Squared distribution with p degrees of freedom.

Output

  • ["goal"]: Objective value
  • ["best.subset"]: Indices of best h-subset of observations
  • ["robust.location"]: Vector of robust location measures
  • ["robust.covariance"]: Robust covariance matrix
  • ["squared.mahalanobis"]: Array of Mahalanobis distances calculated using robust location and scale measures.
  • ["chisq.crit"]: Chisquare quantile used in threshold
  • ["alpha"]: Probability used in calculating the Chisquare quantile, e.g chisq.crit
  • ["outliers"]: Array of indices of outliers.

Notes

The algorithm is implemented using concentration steps as described in the reference paper; however, details such as the number of iterations differ slightly.
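
Examples

A minimal illustrative call on three columns of the hbk data (output omitted); the returned dictionary has the keys listed above:

julia> mcd(hbk[:, ["x1", "x2", "x3"]])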

References

Rousseeuw, Peter J., and Katrien Van Driessen. "A fast algorithm for the minimum covariance determinant estimator." Technometrics 41.3 (1999): 212-223.

source

Imon (2005) Algorithm

LinRegOutliers.Imon2005.imon2005Function
imon2005(setting)

Perform the Imon 2005 algorithm for a given regression setting.

Arguments

  • setting::RegressionSetting: A regression setting.

Description

The algorithm estimates the GDFFITS diagnostic, which is an extension of the well-known regression diagnostic DFFITS. Unlike the original, which targets single outliers, GDFFITS is designed for detecting multiple outliers.

Output

  • ["crit"]: The critical value used
  • ["gdffits"]: Array of GDFFITS diagnostic calculated for observations
  • ["outliers"]: Array of indices of outliers.
  • ["betas"]: Vector of regression coefficients.

Notes

The implementation uses LTS rather than LMS as suggested in the paper.
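
Examples

An illustrative call on the phones setting used elsewhere on this page (output omitted):

julia> reg = createRegressionSetting(@formula(calls ~ year), phones);
julia> imon2005(reg)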

References

A. H. M. Rahmatullah Imon (2005) Identifying multiple influential observations in linear regression, Journal of Applied Statistics, 32:9, 929-946, DOI: 10.1080/02664760500163599

source

Barratt & Angeris & Boyd (2020) CCF algorithm

LinRegOutliers.CCF.ccfFunction
ccf(setting; starting_lambdas = nothing)

Perform signed gradient descent for clipped convex functions for a given regression setting.

Arguments

  • setting::RegressionSetting: RegressionSetting object with a formula and dataset.
  • starting_lambdas::AbstractVector{Float64}: Starting values of weighting parameters used by signed gradient descent.
  • alpha::Float64: Loss at which a point is labeled as an outlier (points with loss ≥ alpha will be called outliers).
  • max_iter::Int64: Maximum number of iterations to run signed gradient descent.
  • beta::Float64: Step size parameter.
  • tol::Float64: Tolerance below which convergence is declared.

Output

  • ["betas"]: Robust regression coefficients
  • [""outliers"]: Array of indices of outliers
  • [""lambdas"]: Lambda coefficients estimated in each iteration
  • [""residuals"]: Regression residuals.

Examples

julia> reg0001 = createRegressionSetting(@formula(calls ~ year), phones);
julia> ccf(reg0001)
Dict{Any,Any} with 4 entries:
  "betas"     => [-63.4816, 1.30406]
  "outliers"  => [15, 16, 17, 18, 19, 20]
  "lambdas"   => [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0  …  2.77556e-17, 2.77556e-17, 0…
  "residuals" => [-2.67878, -1.67473, -0.37067, -0.266613, 0.337444, 0.941501, 1.44556, 2.04962, 1…

References

Barratt, S., Angeris, G. & Boyd, S. Minimizing a sum of clipped convex functions. Optim Lett 14, 2443–2459 (2020). https://doi.org/10.1007/s11590-020-01565-4

source
ccf(X, y; starting_lambdas = nothing)

Perform signed gradient descent for clipped convex functions for a given regression setting.

Arguments

  • X::AbstractMatrix{Float64}: Design matrix of the linear model.
  • y::AbstractVector{Float64}: Response vector of the linear model.
  • starting_lambdas::AbstractVector{Float64}: Starting values of weighting parameters used by signed gradient descent.
  • alpha::Float64: Loss at which a point is labeled as an outlier. If unspecified, will be chosen as p*mean(residuals.^2), where residuals are OLS residuals.
  • p::Float64: Points that have squared OLS residual greater than p times the mean squared OLS residual are considered outliers.
  • max_iter::Int64: Maximum number of iterations to run signed gradient descent.
  • beta::Float64: Step size parameter.
  • tol::Float64: Tolerance below which convergence is declared.

Output

  • ["betas"]: Robust regression coefficients
  • [""outliers"]: Array of indices of outliers
  • [""lambdas"]: Lambda coefficients estimated in each iteration
  • [""residuals"]: Regression residuals.

References

Barratt, S., Angeris, G. & Boyd, S. Minimizing a sum of clipped convex functions. Optim Lett 14, 2443–2459 (2020). https://doi.org/10.1007/s11590-020-01565-4

source

Atkinson (1994) Forward Search Algorithm

LinRegOutliers.Atkinson94.atkinson94Function
    atkinson94(setting, iters, crit)

Runs the Atkinson (1994) forward search algorithm to detect outliers using the LMS method.

Arguments

  • setting::RegressionSetting: A regression setting object.
  • iters::Int: Number of random samples.
  • crit::Float64: Critical value for residuals

Description

The algorithm randomly selects initial basic subsets and uses a very robust method, e.g., lms, to enlarge the basic subset. In each iteration of the forward search, the best objective value and parameter estimates are stored. These values are also used in Atkinson's stalactite plot for a visual investigation of outliers. See atkinsonstalactiteplot.

Output

  • ["optimum_index"]: The iteration number in which the minimum objective is obtained
  • ["residuals_matrix"]: Matrix of residuals obtained in each iteration
  • ["outliers"]: Array of indices of detected outliers
  • ["objective"]: Minimum objective value
  • ["coef"]: Estimated regression coefficients
  • ["crit"]: Critical value given by the user.

Examples

julia> reg = createRegressionSetting(@formula(stackloss ~ airflow + watertemp + acidcond), stackloss)
julia> atkinson94(reg)
Dict{Any,Any} with 6 entries:
  "optimum_index"    => 10
  "residuals_matrix" => [0.0286208 0.0620609 … 0.0796249 0.0; 0.0397778 0.120547 … 0.118437 0.0397778; … ; 1.21133 1.80846 … 0.690327 4.14366; 1.61977 0.971592 … 0.616204 3.58098]
  "outliers"         => [1, 3, 4, 21]
  "objective"        => 0.799134
  "coef"             => [-38.3133, 0.745659, 0.432794, 0.0104587]
  "crit"             => 3.0

References

Atkinson, Anthony C. "Fast very robust methods for the detection of multiple outliers." Journal of the American Statistical Association 89.428 (1994): 1329-1339.

source

BACON Algorithm (Billor & Hadi & Velleman (2000))

LinRegOutliers.Bacon.baconFunction
    bacon(setting, m, method, alpha)

Run the BACON algorithm to detect outliers on regression data.

Arguments

  • setting: RegressionSetting object with a formula and a dataset.
  • m: The number of elements to be included in the initial subset.
  • method: The distance method to use for selecting the points of the initial subset.
  • alpha: The quantile used for the cutoff.

Description

The BACON (Blocked Adaptive Computationally efficient Outlier Nominators) algorithm, defined in the citation below, has many versions, e.g., BACON for multivariate data, BACON for regression, etc. Since the design matrix of a regression model is multivariate data, BACON for multivariate data is performed in the early stages of the algorithm. After a clean subset of observations is selected, a forward search is applied. Observations with high studentized residuals are reported as outliers.

Output

  • ["outliers"]: Array of indices of outliers.
  • ["betas"]: Array of estimated coefficients.

Examples

julia> reg = createRegressionSetting(@formula(stackloss ~ airflow + watertemp + acidcond), stackloss)
julia> bacon(reg, m=12)
Dict{String, Vector} with 2 entries:
  "betas"    => [-37.6525, 0.797686, 0.57734, -0.0670602]
  "outliers" => [1, 3, 4, 21]

References

Billor, Nedret, Ali S. Hadi, and Paul F. Velleman. "BACON: blocked adaptive computationally efficient outlier nominators." Computational statistics & data analysis 34.3 (2000): 279-298.

source

Hadi (1994) Algorithm

LinRegOutliers.Hadi94.hadi1994Function
hadi1994(multivariateData)

Perform Hadi (1994) algorithm for a given multivariate data.

Arguments

  • multivariateData::AbstractMatrix{Float64}: Multivariate data.

Description

The algorithm starts with an initial subset and enlarges it to obtain robust covariance and location estimates. This algorithm is an extension of hadi1992.

Output

  • ["outliers"]: Array of indices of outliers
  • ["critical.chi.squared"]: Threshold value for determining being an outlier
  • ["rth.robust.distance"]: rth robust distance, where (r+1)th robust distance is the first one that exceeds the threshold.

Examples

julia> multidata = hcat(hbk.x1, hbk.x2, hbk.x3);

julia> hadi1994(multidata)
Dict{Any,Any} with 3 entries:
  "outliers"              => [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]
  "critical.chi.squared" => 7.81473
  "rth.robust.distance"   => 5.04541

# Reference Hadi, Ali S. "A modification of a method for the dedection of outliers in multivariate samples" Journal of the Royal Statistical Society: Series B (Methodological) 56.2 (1994): 393-396.

source

Chatterjee & Mächler (1997)

LinRegOutliers.CM97.cm97Function
cm97(setting; maxiter = 1000)

Perform the Chatterjee and Mächler (1997) algorithm for the given regression setting.

Arguments

  • setting::RegressionSetting: RegressionSetting object with a formula and dataset.
  • maxiter::Int: Maximum number of iterations. Default is 1000.

Description

The algorithm performs an iteratively weighted least squares estimation to obtain robust regression coefficients.

Output

  • ["betas"]: Robust regression coefficients
  • ["iterations"]: Number of iterations performed
  • ["converged"]: true if the algorithm converges, otherwise, false.

Examples

julia> myreg = createRegressionSetting(@formula(stackloss ~ airflow + watertemp + acidcond), stackloss)
julia> result = cm97(myreg)
Dict{String,Any} with 3 entries:
  "betas"      => [-37.0007, 0.839285, 0.632333, -0.113208]
  "iterations" => 22
  "converged"  => true

References

Chatterjee, Samprit, and Martin Mächler. "Robust regression: A weighted least squares approach." Communications in Statistics-Theory and Methods 26.6 (1997): 1381-1394.

source

Quantile Regression

LinRegOutliers.QuantileRegression.quantileregressionFunction
quantileregression(setting; tau = 0.5)

Perform Quantile Regression for a given regression setting (multiple linear regression).

Arguments

  • setting::RegressionSetting: RegressionSetting object with a formula and dataset.
  • tau::Float64: Quantile level. Default is 0.5.

Description

The Quantile Regression estimator searches for the regression parameter estimates that minimize:

Min z = (1 - tau) (u1(-) + u2(-) + ... + un(-)) + tau (u1(+) + u2(+) + ... + un(+))

Subject to:

    y1 - beta0 - beta1 * x1 + u1(-) - u1(+) = 0
    y2 - beta0 - beta1 * x2 + u2(-) - u2(+) = 0
    ...
    yn - beta0 - beta1 * xn + un(-) - un(+) = 0

where

    ui(-), ui(+) >= 0 for i = 1, 2, ..., n
    beta0, beta1 in R
    n: number of observations

for the model y = beta0 + beta1 * x + u.

Output

  • ["betas"]: Estimated regression coefficients
  • ["residuals"]: Regression residuals
  • ["model"]: Linear Programming Model

Examples

julia> reg0001 = createRegressionSetting(@formula(calls ~ year), phones);
julia> quantileregression(reg0001)
source
quantileregression(X, y, tau = 0.5)

Estimates the parameters of linear regression using the Quantile Regression estimator for a given design matrix and response vector.

Arguments

  • X::AbstractMatrix{Float64}: Design matrix of the linear model.
  • y::AbstractVector{Float64}: Response vector of the linear model.
  • tau::Float64: Quantile level. Default is 0.5.

Examples

julia> income = [420.157651, 541.411707, 901.157457, 639.080229, 750.875606];
julia> foodexp = [255.839425, 310.958667, 485.680014, 402.997356, 495.560775];

julia> n = length(income)
julia> X = hcat(ones(Float64, n), income)

julia> result = quantileregression(X, foodexp, tau = 0.25)
source

Theil-Sen estimator for multiple regression

LinRegOutliers.TheilSen.theilsenFunction
theilsen(setting, m, nsamples = 5000)

Theil-Sen estimator for multiple regression.

Arguments

  • setting::RegressionSetting: RegressionSetting object with a formula and dataset.
  • m::Int: Number of observations to be used in each iteration. This number must be in the range [p, n], where p is the number of regressors and n is the number of observations.
  • nsamples::Int: Number of m-samples. Default is 5000.

Description

The function starts with a regression formula and a dataset. The number of observations to be used in each iteration, m, is specified by the user. The function randomly selects m observations from the dataset and performs an ordinary least squares estimation, saving the estimated coefficients. The process is repeated until nsamples regressions are estimated. The multivariate median of the estimated coefficients is then calculated; here, the multivariate median is the point that minimizes the sum of distances to all the estimated coefficient vectors. The Hooke & Jeeves algorithm is used for this optimization problem.
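
Examples

An illustrative call (output omitted); here m = 5 is an arbitrary subsample size inside the required range [p, n] for this setting:

julia> reg = createRegressionSetting(@formula(calls ~ year), phones);
julia> theilsen(reg, 5)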

References

Dang, X., Peng, H., Wang, X., & Zhang, H. (2008). Theil-sen estimators in a multiple linear regression model. Olemiss Edu.

source

Deepest Regression Estimator

LinRegOutliers.DeepestRegression.deepestregressionFunction
deepestregression(setting; maxit = 1000)

Estimate Deepest Regression parameters.

Arguments

  • setting::RegressionSetting: RegressionSetting object with a formula and dataset.
  • maxit: Maximum number of iterations

Description

Estimates Deepest Regression Estimator coefficients.

References

Van Aelst S., Rousseeuw P.J., Hubert M., Struyf A. (2002). The deepest regression method. Journal of Multivariate Analysis, 81, 138-166.

Output

  • betas: Vector of estimated regression coefficients.
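
Examples

An illustrative call (output omitted):

julia> reg = createRegressionSetting(@formula(calls ~ year), phones);
julia> deepestregression(reg)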
source