k-fold Cross Validation

The !KCV k qualifier, when used with !CYCLE 1:n and !FILTER f !EXCLUDE $I, causes ASReml to save cross-validation predictions for model term k from repeated analyses of the data, each excluding the records where variable f has the value of the current cycle.

This implementation of cross-validation requires the user to define a variable, say CVgroup, which allocates the data records to g groups. The analysis can then be repeated g times (using !CYCLE 1:g), dropping the records in group i in run i by means of !EXCLUDE $I. If the records dropped pertain uniquely to levels of a model term (grm1(Nclone) in the example below), the !KCV grm1(Nclone) qualifier collects the predicted values for those levels of grm1(Nclone) that are predicted in the run but have no direct data. These are written to a .kcv file.
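The cycle logic described above can be sketched outside ASReml. The following Python fragment is an illustration only, not ASReml code: the function name heldout_levels and the toy records are hypothetical, and no model is fitted. It simply shows which records would be retained in each cycle and which levels of the term are left without direct data.

```python
def heldout_levels(records, g):
    """For cycles 1..g, list the records kept and the levels left with no data.

    records: list of (level, cv_group) pairs, one per data record.
    Returns {cycle: (kept_records, levels_without_data)}.
    """
    all_levels = {level for level, _ in records}
    result = {}
    for cycle in range(1, g + 1):
        # Mimic the filter: drop records whose group equals the cycle number.
        kept = [(level, grp) for level, grp in records if grp != cycle]
        levels_with_data = {level for level, _ in kept}
        result[cycle] = (kept, sorted(all_levels - levels_with_data))
    return result


# Toy data (made up): four clones, two records each, allocated to g = 2 groups.
records = [(1, 1), (1, 1), (2, 2), (2, 2), (3, 1), (3, 1), (4, 2), (4, 2)]
folds = heldout_levels(records, g=2)
for cycle, (kept, missing) in folds.items():
    print(cycle, len(kept), missing)
```

Because each clone's records fall in a single group in this toy data, every clone is left without data in exactly one cycle, which is the situation the !KCV qualifier is designed for.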

Predicted values from cross-validation should normally be correlated with an independent measure of the value. In a simulation context, we might keep the 'true' value. A less desirable option is to correlate the predicted values with values predicted from the whole data. If the !CYCLE range is extended by 1 (that is, n=g+1), no records are dropped in the final round, and ASReml reports the predictions from the full data in a second field in the .kcv file, together with their correlation with the CV predictions.
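Where an independent measure is available, the correlation can be computed by hand once the cross-validation predictions have been lined up against it. The sketch below is a plain Pearson correlation in Python. It assumes the two sets of values have already been read into lists in the same level order; it does not parse the .kcv file, whose layout is not described here, and the numbers shown are made up purely for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length sequences of numbers."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical values: CV predictions and simulated 'true' values per level.
cv_pred = [0.12, -0.30, 0.45, 0.05, -0.22]
true_val = [0.20, -0.25, 0.60, -0.10, -0.35]
print(round(pearson(cv_pred, true_val), 3))
```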

For example, in evaluating the accuracy of prediction from a genomic model, one might run the following model.

  !WORK 1
 Cross Validation test with Nassau Data
  Nfam 71 !A
  Nfemale 26 !A
  Nmale 37 !A
  Nclone 857 !A   !L  Clones.txt !LSKIP 1
  MatOrder 914 !A
  rep 8 !A
  iblk 80 !A
  culture 2 !A
  DBH6	 HT6 HT8 CWAC6 !M-9
  CVgroup 10 !=Nclone !-1 !MOD 10 !+1
 !CYCLE 1:11
 snpData.mkr  !SKIP 1 !HEAD 0  !CENTRE   !MARKERS 4854 !IDS 923
 nassau_cut_v3.csv !MAXIT 30 !SKIP 1 !DFF -1
       !FILTER CVgroup !EXCLUDE $I  !KCV grm1(Nclone)		# Data

  HT6 ~  mu culture culture.rep  !r grm1(Nclo) 0.276 Nclone 0.152 rep.iblk 0.308

This code partitions the data into 10 classes using the variable CVgroup, defined from variable Nclone in this example by allocating every 10th clone to the same group. The !CYCLE 1:11 runs the analysis 11 times. The first 10 runs each drop the records pertaining to the respective group. The last run includes all the data. The !KCV grm1(Nclone) qualifier causes ASReml to save the solutions for model term grm1(Nclone) corresponding to levels for which the data was omitted from the run in one field and the values from cycle 11 in a second field. The correlation between the fields is reported to the .asr file.
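The arithmetic behind the CVgroup transformation (!=Nclone !-1 !MOD 10 !+1) can be checked with a few lines of Python. This is a sketch under two assumptions: that the 857 levels of Nclone are coded 1 to 857, and that !MOD behaves like the ordinary remainder for non-negative values. The function name cv_group is hypothetical.

```python
from collections import Counter


def cv_group(level, g=10):
    """Group number in 1..g for a clone coded 1, 2, 3, ..."""
    # Copy the level, subtract 1, take the remainder mod g, then add 1.
    return (level - 1) % g + 1


# Assumed coding: clone levels 1..857, as declared for Nclone.
groups = [cv_group(level) for level in range(1, 858)]
sizes = Counter(groups)

print(cv_group(1), cv_group(10), cv_group(11))   # clones 1 and 11 share a group
print([sizes[k] for k in range(1, 11)])          # number of clones per group
```

Under these assumptions groups 1 to 7 receive 86 clones each and groups 8 to 10 receive 85. These are counts of clones, not of data records; the number of records dropped in each cycle depends on how many records each clone has.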

Important: When performing cross-validation, the manner of partitioning the records can be critical. The method used here is just a simple one chosen for convenience in this example. Furthermore, correlating the predicted values from reduced data with predicted values from the full data is not very helpful. Where an independent 'true' value exists (as in the case of simulated data), that should be used.
