fredtz's picture

Dear all,
I hope you are doing well. This is a really enlightening forum, and thank you all for enabling beginners like me to learn a lot about this area of science.
This is my first topic here, so please consider it my knock on the door of this arena.
I'm developing a qualitative model for the analysis of vegetable oils in GRAMS/AI, using discriminant analysis with the Mahalanobis distance method. Could someone explain how I can determine the number of principal components (PCs) in this software?

jjakhm's picture

Hi fredtz,
I suspect that there may not be many GRAMS users here, but we have each other. I can only speak to the version of GRAMS/AI that I have experience with (7.02); hopefully the information will be sufficient for your needs. If you search the PLSplus IQ Help file with keywords like “PCA Eigenvalue Methods”, you should find very helpful information on selecting factors for the MD/PCA/R discriminant analysis method within the first few hits. I will try to summarize the main points.
The software will actually select the number of factors for you (you can change this number of course). Paraphrased from the help file:
Factor selection is based on the Reduced Eigenvalue (REV) F-test at a significance level of alpha = 0.01 (plot type “Eigenvalues” in the .tdf Report Viewer section of the PLS/IQ Navigator; it is usually the first plot displayed when you load your experiment). Therefore, any factors with probabilities greater than or equal to 0.99 are retained as primary factors. The remaining factors are assumed to be noise, i.e. the set of secondary factors.
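For anyone who wants to see the idea outside GRAMS, here is a minimal Python sketch of a reduced-eigenvalue F-test in the form usually attributed to Malinowski. It is an assumption that GRAMS computes the test exactly this way; the function name `rev_f_test`, the synthetic data, and the rule of keeping everything up to the last significant factor are mine, not taken from the help file. The sketch works on the data exactly as passed in and does not adjust the degrees of freedom for mean centering.

```python
import numpy as np
from scipy import stats


def rev_f_test(X, alpha=0.01):
    """Reduced-eigenvalue F-test (assumed Malinowski form, not GRAMS code).

    Returns the suggested number of primary factors and the probability
    (F-distribution CDF value) for each factor.
    """
    r, c = X.shape
    s = min(r, c)
    # Eigenvalues of X^T X are the squared singular values of X.
    eig = np.linalg.svd(X, compute_uv=False) ** 2
    j = np.arange(1, s + 1)
    dof = (r - j + 1) * (c - j + 1)  # weights that define the reduced eigenvalues
    probs = np.zeros(s)
    for n in range(1, s):  # the last factor has no pool to be tested against
        pooled_rev = eig[n:].sum() / dof[n:].sum()  # pooled REV of factors n+1..s
        f_stat = (eig[n - 1] / dof[n - 1]) / pooled_rev
        probs[n - 1] = stats.f.cdf(f_stat, 1, s - n)
    # Assumption: keep all factors up to the last one that passes the test.
    significant = np.nonzero(probs >= 1.0 - alpha)[0]
    n_factors = int(significant[-1] + 1) if significant.size else 0
    return n_factors, probs


# Synthetic stand-in for spectra: 3 real factors plus a little noise.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(30, 3)) @ rng.normal(size=(3, 50))
X_demo += 0.01 * rng.normal(size=X_demo.shape)
n_demo, probs_demo = rev_f_test(X_demo)
print(n_demo, np.round(probs_demo[:5], 3))
```

Comparing the printed probabilities with the “Eigenvalues” plot for your own experiment is a reasonable way to check whether this form matches what your GRAMS version reports.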
Other methods for determining the optimum number of factors are also discussed in the help file. Among these, Cross Validation Distance Prediction is offered as an option (plot type “Avg Pred Distance”). There are others, such as Total % Variance.
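As a rough illustration of the Total % Variance idea, the sketch below computes the cumulative percent variance captured by successive factors of mean-centered data. The function names and the 99% target are hypothetical choices for the example, not GRAMS settings.

```python
import numpy as np


def cumulative_percent_variance(X):
    """Cumulative % variance captured by successive PCA factors."""
    Xc = X - X.mean(axis=0)  # mean-center each variable
    eig = np.linalg.svd(Xc, compute_uv=False) ** 2
    return 100.0 * np.cumsum(eig) / eig.sum()


def factors_for_variance(X, target=99.0):
    """Smallest number of factors whose cumulative % variance reaches target."""
    cpv = cumulative_percent_variance(X)
    # searchsorted gives the first index with cpv >= target; clamp for safety.
    return min(int(np.searchsorted(cpv, target)) + 1, len(cpv))


# Synthetic stand-in for spectra: 2 real factors plus a little noise.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(30, 2)) @ rng.normal(size=(2, 50))
X_demo += 0.01 * rng.normal(size=X_demo.shape)
cpv_demo = cumulative_percent_variance(X_demo)
k_demo = factors_for_variance(X_demo, target=99.0)
print(np.round(cpv_demo[:4], 3), k_demo)
```

The target percentage is as arbitrary as any other default, so it is best treated as a starting point rather than a decision rule.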
To determine the optimum number of factors, many on this forum would recommend employing these methods, but they would also recommend doing so in conjunction with examining the data, factor loadings, etc. In other words, don’t just accept anything the software does on your behalf without due diligence. It is also important to remember that many of the statistical settings used as default values in software are somewhat arbitrary (although commonly accepted). Perhaps most importantly, the optimum number of factors should be determined through external validation of your models, i.e. prediction/classification with an independent set of data (data not included in the calibration model and from different batches, etc.). As a simple first approach, I would suggest starting with the recommended number of factors, then adding and subtracting several factors, and evaluating the performance in each case. Regarding evaluation of performance (and building/validating models in general), there are useful PASG and ASTM documents available.
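The add-and-subtract approach can be sketched in a few lines of Python. To keep the example short, a plain spectral-residual limit stands in for the GRAMS distance metric, so this is not the MD/PCA/R calculation itself. The function names, the factor of 3 on the acceptance limit, and the synthetic "oil" data are all hypothetical.

```python
import numpy as np


def residual_ss(X, mean, loadings):
    """Sum of squared spectral residuals after projecting onto the loadings."""
    Xc = X - mean
    return ((Xc - Xc @ loadings @ loadings.T) ** 2).sum(axis=1)


def evaluate_rank(cal, val_same, val_other, k):
    """Fit a k-factor PCA model on cal, then score two independent sets."""
    mean = cal.mean(axis=0)
    _, _, vt = np.linalg.svd(cal - mean, full_matrices=False)
    loadings = vt[:k].T
    # Hypothetical acceptance limit: 3x the largest calibration residual.
    limit = 3.0 * residual_ss(cal, mean, loadings).max()
    accept_same = float(np.mean(residual_ss(val_same, mean, loadings) <= limit))
    reject_other = float(np.mean(residual_ss(val_other, mean, loadings) > limit))
    return accept_same, reject_other


# Synthetic data: one class with 3 real factors, and a second class that
# differs by a sloping offset. Validation sets are drawn independently.
rng = np.random.default_rng(2)
n_vars = 40
true_loadings = rng.normal(size=(3, n_vars))


def make_samples(n, offset):
    signal = rng.normal(size=(n, 3)) @ true_loadings
    return signal + 0.05 * rng.normal(size=(n, n_vars)) + offset


cal = make_samples(40, 0.0)
val_same = make_samples(20, 0.0)
val_other = make_samples(20, np.linspace(0.0, 1.0, n_vars))

recommended = 3  # pretend this is the number the software suggested
results = {k: evaluate_rank(cal, val_same, val_other, k)
           for k in range(recommended - 2, recommended + 3)}
for k, (acc, rej) in results.items():
    print(f"k={k}: accepted same-class {acc:.2f}, rejected other-class {rej:.2f}")
```

With real spectra the independent sets should come from different batches, as described above, rather than from the same random generator.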
As for any of the theory behind the aforementioned, you will find some useful information throughout the forums along with great literature recommendations.
On a somewhat related topic, you may want to read my post in the forum regarding the use of RMSG with MD/PCA/R in GRAMS. Note that the Mahalanobis distance is determined with the mean-centered residuals vector appended to the scores matrix. This value is then normalized by RMSG (which is essentially equal to the square root of the number of factors). The manual presents this as a way of normalizing by group size, but you will note that GRAMS (at least my version) does not provide a way to create one PCA model consisting of multiple classes/groups, or a way to specify the scores space of those constituent classes for use in classifying unknowns. This may lead to some confusion. You will therefore need to create a PCA model for each class, assuming that you have several classes of oil and would like to discriminate future samples using those classes. In this case, the RMSG normalization seems more appropriately viewed as a way to account for the dependence of the M-distance on model rank. There is still some uncertainty associated with that statement, as my question in the forum currently remains unanswered.
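To make the calculation concrete, here is a minimal Python sketch of a per-class model with the residual appended to the scores. Several details are assumptions on my part rather than documented GRAMS behavior: the residual is taken as the sum of squared spectral residuals per sample, and RMSG is taken as the root mean square of the calibration samples' distances. The function names and synthetic data are hypothetical.

```python
import numpy as np


def _distances(A, s_inv):
    """Mahalanobis distance of each row of A from the origin."""
    return np.sqrt(np.einsum("ij,jk,ik->i", A, s_inv, A))


def fit_class_model(X, k):
    """k-factor PCA model for one class, with residuals appended to scores."""
    mean = X.mean(axis=0)
    Xc = X - mean
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    P = vt[:k].T                              # loadings, shape (n_vars, k)
    T = Xc @ P                                # scores, shape (n_samples, k)
    res = ((Xc - T @ P.T) ** 2).sum(axis=1)   # assumed residual definition
    res_mean = res.mean()
    A = np.column_stack([T, res - res_mean])  # scores + mean-centered residual
    s_inv = np.linalg.inv(A.T @ A / (len(X) - 1))
    rmsg = np.sqrt(np.mean(_distances(A, s_inv) ** 2))  # assumed RMSG definition
    return {"mean": mean, "P": P, "res_mean": res_mean,
            "s_inv": s_inv, "rmsg": rmsg}


def normalized_distance(model, X_new):
    """RMSG-normalized Mahalanobis distance of new samples to the class."""
    Xc = X_new - model["mean"]
    T = Xc @ model["P"]
    res = ((Xc - T @ model["P"].T) ** 2).sum(axis=1)
    A = np.column_stack([T, res - model["res_mean"]])
    return _distances(A, model["s_inv"]) / model["rmsg"]


# Synthetic data: class A has 2 real factors; class B adds a sloping offset.
rng = np.random.default_rng(3)
n_vars = 40
true_loadings = rng.normal(size=(2, n_vars))


def make_samples(n, offset):
    signal = rng.normal(size=(n, 2)) @ true_loadings
    return signal + 0.05 * rng.normal(size=(n, n_vars)) + offset


oil_a_cal = make_samples(30, 0.0)
oil_a_new = make_samples(10, 0.0)
oil_b_new = make_samples(10, np.linspace(0.0, 1.0, n_vars))

model_a = fit_class_model(oil_a_cal, k=2)
d_a = normalized_distance(model_a, oil_a_new)
d_b = normalized_distance(model_a, oil_b_new)
print(model_a["rmsg"] ** 2, d_a.max(), d_b.min())
```

Under these assumptions the squared RMSG works out to (n - 1)/n times the number of columns in the augmented matrix (factors plus one for the residual), which depends only on model rank and is consistent with reading the normalization as a rank correction. With several oil classes, one such model would be fitted per class and an unknown compared against each.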
Hopefully you find this information useful. Let us know how you get on and whether you have any more questions.
Kind Regards,