NIRS Model Calibration

sabki's picture

Dear All,

I have some questions about NIRS calibration and hope someone can give me feedback:

  1. What percentage of the samples is best for calibration versus validation: 75/25, 65/35, 70/30, or something else?
  2. What is the best wavelength range for NIRS of wood chemistry (extractives, lignin, cellulose)?
  3. Is letting the software choose the number of factors in the NIRS model automatically better than adjusting it manually?
  4. Which is more important, RPDcal or RPDprediction?

Thanks & regards,


dardenne's picture


  1. The percentage split between calibration and validation is not important. What matters is to set up one or several validation sets that are independent of the calibration set. The easiest way is to order the spectra by acquisition date and cross-validate by blocks. If the model is to be used in the future with new samples, it has to be challenged to prove its robustness.
  2. For any calibration, if the X variance is well covered, the coefficients at the non-informative wavelengths will be zero or close to it. With small data sets, variable selection is often beneficial.
  3. The automatic choice of the number of factors is generally too optimistic because the validation or cross-validation is set up incorrectly.
  4. Never consider SEC or RPDcal. RMSEV and RPDpred are the real parameters for estimating model performance, but again only when the validation is done correctly with independent samples. In agriculture, that means, for instance, samples from a new season.
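The block cross-validation by acquisition date from point 1 could be sketched roughly as below. This is only an illustration in plain Python: the function name `date_block_folds`, the number of blocks, and the example dates are assumptions, not something taken from any NIRS software.

```python
from datetime import date


def date_block_folds(dates, n_blocks):
    """Split sample indices into contiguous blocks ordered by acquisition date.

    Returns a list of (calibration_indices, validation_indices) pairs,
    one pair per block, so that every block is held out exactly once.
    """
    order = sorted(range(len(dates)), key=lambda i: dates[i])
    # Spread the samples as evenly as possible over the blocks.
    base, extra = divmod(len(order), n_blocks)
    folds, start = [], 0
    for b in range(n_blocks):
        size = base + (1 if b < extra else 0)
        held_out = order[start:start + size]
        kept = order[:start] + order[start + size:]
        folds.append((kept, held_out))
        start += size
    return folds


# Hypothetical acquisition dates for six spectra (three seasons).
dates = [date(2020, 5, 1), date(2021, 6, 1), date(2020, 5, 2),
         date(2022, 7, 1), date(2021, 6, 3), date(2022, 7, 2)]
folds = date_block_folds(dates, 3)
for cal_idx, val_idx in folds:
    print("calibrate on", cal_idx, "validate on", val_idx)
```

With real data, spectra acquired on the same date or in the same season should probably stay in the same block, which this even-sized split does not guarantee.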

Best regards,



shileyda's picture


You ask some very good (and important) questions.

1. I use 80/20 for my calibration and test sets; 20% should be the minimum for the test set. However, I believe that how the test set is selected is much more important than the percentage of samples used in it. I rank-order all available samples by their constituent value, then mark every 5th sample for the test set to ensure that the test set represents the range of values present in the calibration set.
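A minimal sketch of that rank-and-pick selection, in plain Python. The function name `every_nth_split` and the lignin values are made up for illustration, and starting the count at the 5th ranked sample is an assumption.

```python
def every_nth_split(values, n=5):
    """Rank samples by constituent value and send every n-th one to the test set.

    Returns (calibration_indices, test_indices), both in ranked order.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    test = order[n - 1::n]  # 5th, 10th, 15th, ... sample of the ranked list
    test_set = set(test)
    cal = [i for i in order if i not in test_set]
    return cal, test


# Hypothetical lignin contents (%) for ten samples.
lignin = [27.1, 24.3, 30.2, 25.8, 28.9, 22.5, 26.4, 29.5, 23.7, 31.0]
cal_idx, test_idx = every_nth_split(lignin)
print("calibration:", cal_idx)
print("test:", test_idx)
```

With ten samples this gives the 80/20 split mentioned above, and the test samples are spread across the ranked range rather than clustered at one end.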

2. Mostly this depends on whether the sample is wet or dry. If the sample is dry, I would include 1000-2500 nm; if the sample is wet, I would use 1000-2100 nm. Moisture absorbs most of the information above 2100 nm.
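Trimming a spectrum to those ranges might look like this in plain Python. The function name `restrict_range` and the example wavelength grid are hypothetical; the 1000, 2100 and 2500 nm limits are the ones suggested above.

```python
def restrict_range(wavelengths_nm, absorbances, wet):
    """Keep only the points inside 1000-2100 nm (wet) or 1000-2500 nm (dry)."""
    upper_nm = 2100 if wet else 2500
    kept = [(w, a) for w, a in zip(wavelengths_nm, absorbances)
            if 1000 <= w <= upper_nm]
    return [w for w, _ in kept], [a for _, a in kept]


# Hypothetical spectrum sampled every 100 nm from 800 to 2500 nm.
wavelengths = list(range(800, 2600, 100))
absorbances = [0.1] * len(wavelengths)

wet_wl, wet_abs = restrict_range(wavelengths, absorbances, wet=True)
dry_wl, dry_abs = restrict_range(wavelengths, absorbances, wet=False)
print(wet_wl[0], wet_wl[-1], dry_wl[0], dry_wl[-1])
```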

3. If you have a sufficiently large sample set, I would tend to trust the automatic factor identification. You should have at least 10 samples in the calibration set for every factor used in the model; more conservatively, you can use 20 samples per factor. If you have fewer than 10 samples per factor, I would not use the automatically identified number of factors and would manually select a lower number.
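The samples-per-factor rule of thumb can be written as a small check. This is one possible reading of it: the function names are hypothetical, and capping the automatic choice at the rule-of-thumb limit (rather than picking some other lower number) is an assumption.

```python
def max_factors(n_calibration, samples_per_factor=10):
    """Largest number of factors allowed by the samples-per-factor rule."""
    return n_calibration // samples_per_factor


def capped_factors(auto_factors, n_calibration, samples_per_factor=10):
    """Accept the automatic choice unless it exceeds the rule-of-thumb limit."""
    return min(auto_factors, max_factors(n_calibration, samples_per_factor))


# Hypothetical case: 80 calibration samples, software suggests 12 factors.
print(capped_factors(12, 80))                         # 10 samples per factor
print(capped_factors(12, 80, samples_per_factor=20))  # 20 samples per factor
print(capped_factors(5, 80))                          # automatic choice is within the limit
```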

4. RPD prediction is much more important. Any statistic created from the calibration set is only an estimate of model performance, whereas statistics produced on the test set represent actual model performance. If the model has been properly fitted, the calibration and test set statistics should be similar.
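For reference, RPD is usually computed as the standard deviation of the reference values divided by the prediction error on the same set. A minimal sketch in plain Python follows; the numbers are invented, and using RMSE rather than a bias-corrected SEP in the denominator is an assumption, since conventions vary.

```python
import math
import statistics


def rmse(reference, predicted):
    """Root mean square error between reference and predicted values."""
    squared = [(r - p) ** 2 for r, p in zip(reference, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def rpd(reference, predicted):
    """Ratio of the reference standard deviation to the prediction error."""
    return statistics.stdev(reference) / rmse(reference, predicted)


# Hypothetical test-set reference values and model predictions.
reference = [20.0, 22.0, 24.0, 26.0, 28.0]
predicted = [21.0, 21.0, 25.0, 25.0, 29.0]
print("RMSEP:", rmse(reference, predicted))
print("RPD:", round(rpd(reference, predicted), 2))
```

Running the same two functions on the calibration set and on the test set gives the pair of statistics that, as noted above, should be similar for a properly fitted model.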