Outlier detection using Mahalanobis distance [Question]

miguelG's picture

Hello everyone,
Sorry if my question is too basic, but I have been struggling with a problem.
I want to detect outliers, and I have been using the Quant software in OPUS (Bruker) to sort the outliers for me. To build calibration and prediction models I use /Toolbox for MATLAB.
My question is: what is the mathematical formula for outlier detection in NIR spectra using the Mahalanobis distance with PLS?
Could you please explain in some detail? I have researched books and papers and tried many approaches, but none seem to work (when compared with the values obtained by OPUS). Maybe I am missing something...
Any help is appreciated!

ptillmann's picture

The answer to your question is not trivial. The basics can be found in Howard Mark's paper
Mark, H. (1986) Normalized distances for qualitative near-infrared reflectance analysis. Anal. Chem. 58, 379 ff.
and in Mahalanobis's original paper from the 1930s :-). (But I doubt people will try to read it; it will hardly be available online.)
The concept can easily be transferred to Mahalanobis distances computed from PLS scores and to quantitative analysis.
When looking at the numbers from a given chemometric package and interpreting them, the problems are:
- determining the number of components used in the calculation
- whether the distances are summed across components (the sum increases with the number of components used) or a "mean distance" across components is calculated
- whether the result is quoted as a squared distance ("variance") or as a square-root distance ("standard deviation")
Each software package has its own solution. The M, H, or whatever the Mahalanobis distance is called, from different packages never agree numerically. But they all solve the practical problem of outlier detection, regardless of the chemometric package.
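
To make the three points above concrete, here is a minimal Python sketch (not the OPUS or any other package's actual calculation). It assumes the PLS score columns are mutually uncorrelated, so that the squared distance reduces to D^2 = sum over components of (t_a - mean_a)^2 / var_a. The calibration scores, the two test samples, and the function names are all made up for illustration.

```python
import math


def mahalanobis_from_scores(cal_scores, sample_scores):
    """Squared Mahalanobis distance of one sample from the calibration centre.

    cal_scores: one row of A PLS scores per calibration sample.
    sample_scores: the A PLS scores of the sample being checked.
    Assumption: score columns are uncorrelated, so the covariance matrix
    is diagonal and each component contributes (t - mean)^2 / variance.
    """
    n = len(cal_scores)
    d2 = 0.0
    for a in range(len(sample_scores)):
        column = [row[a] for row in cal_scores]
        mean = sum(column) / n
        var = sum((t - mean) ** 2 for t in column) / (n - 1)
        d2 += (sample_scores[a] - mean) ** 2 / var
    return d2


def report_variants(d2, n_comp):
    """The same distance under four reporting conventions."""
    return {
        "sum_squared": d2,                       # summed over components, squared
        "sum_root": math.sqrt(d2),               # summed over components, square root
        "mean_squared": d2 / n_comp,             # averaged over components, squared
        "mean_root": math.sqrt(d2 / n_comp),     # averaged over components, square root
    }


# Hypothetical calibration scores for 5 samples and 2 components
# (columns are mean-centred and uncorrelated by construction).
cal = [
    [-2.0, 1.0],
    [-1.0, -2.0],
    [0.0, 2.0],
    [1.0, -2.0],
    [2.0, 1.0],
]

typical = [0.5, -0.5]   # hypothetical sample close to the calibration centre
suspect = [4.0, 3.0]    # hypothetical sample far from the calibration centre

d2_typical = mahalanobis_from_scores(cal, typical)
d2_suspect = mahalanobis_from_scores(cal, suspect)

variants_typical = report_variants(d2_typical, n_comp=2)
variants_suspect = report_variants(d2_suspect, n_comp=2)

for name in variants_typical:
    print(name, round(variants_typical[name], 3), round(variants_suspect[name], 3))
```

The four numbers differ for the same sample, which is one reason values from two packages need not match. For a fixed number of components, however, each convention is a monotone transformation of the same D^2, so the samples are ranked in the same order under all four.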

jrodrigues's picture

Dear Peter
There is a link to the Mahalanobis paper (1936) on the Wiki page; see the 1st reference.
Kind regards

miguelG's picture

Hello everyone,
I read the papers you proposed and now I understand the issue better. The main problem I was facing was explaining why my values weren't equal to the ones calculated by the program. Now I understand that the numerical values don't necessarily need to be the same, but the conclusion must be similar!
Thanks for your kind replies!