Genetic Marker operators

Setting marker positions

!MMAP s assigns Haldane map positions (s) to marker variables and imputes values to the markers where they have missing values. This transformation will normally be used on a !G n factor where the n variables are the marker states for n markers in a linkage group in map order and coded [-1,1] (backcross) or [-1,0,1] (F2 design). s (length n+1) should be the n marker positions relative to a left telomere position of zero, and an extra value being the length of the linkage group (the position of the right telomere). The length (right telomere) may be omitted in which case the last marker is taken as the end of the linkage group. The positions may be given in Morgans or centiMorgans (if the length is greater than 10, it will be divided by 100 to convert to Morgans). Further details are provided in the User Guide.

 ChrAadd !G 10 !MM 1 ...
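
The unit convention described above can be illustrated with a minimal Python sketch. This is not ASReml code: the function names `to_morgans` and `haldane_recombination` and the example map are hypothetical, and the sketch assumes the whole position vector is rescaled when the final value exceeds 10. The Haldane map function r = (1 - exp(-2d))/2 is the standard relation between a map distance d (in Morgans) and a recombination fraction; how ASReml uses it internally when imputing is not shown here.

```python
import math

def to_morgans(positions):
    """Return positions in Morgans.

    Assumption: the last value is the linkage-group length, and when it
    exceeds 10 the whole vector is taken to be in centiMorgans.
    """
    if positions[-1] > 10:
        return [p / 100.0 for p in positions]
    return list(positions)

def haldane_recombination(d):
    """Haldane map function: recombination fraction for map distance d (Morgans)."""
    return 0.5 * (1.0 - math.exp(-2.0 * d))

# Hypothetical map: three markers at 10, 35 and 80 cM, linkage group length 100 cM.
s = [10, 35, 80, 100]
morgans = to_morgans(s)

# Recombination fractions between adjacent markers (the final value is the telomere).
markers = morgans[:-1]
r_adjacent = [haldane_recombination(b - a) for a, b in zip(markers, markers[1:])]

print(morgans)
print([round(r, 4) for r in r_adjacent])
```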

Dominance variables

!DOM A forms dominance covariables from a set of additive marker covariables previously declared with the !MM marker map qualifier. The argument A must be an existing group of marker variables for a linkage group, defined using !MM, which represents additive marker variation coded [-1, 0, 1] (representing marker states aa, aA and AA respectively). It is a group transformation which takes the values in the interval [-1, 1] and calculates (|X|-0.5)*2, i.e. -1 and 1 become 1, and 0 becomes -1. The marker map is also copied and applied to this model term so that it can be the argument of qtl().

  ChrAdom !DOM ChrAadd
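
The arithmetic of the dominance transformation can be checked with a short Python sketch. It is illustrative only (the function name `dominance_covariate` is hypothetical, not an ASReml call); the fractional score 0.5 stands in for an imputed marker value.

```python
def dominance_covariate(x):
    """Map an additive marker score in [-1, 1] to (|x| - 0.5) * 2."""
    return (abs(x) - 0.5) * 2

additive = [-1, 0, 1, 0.5]   # 0.5 stands in for an imputed (fractional) score
dominance = [dominance_covariate(x) for x in additive]
print(dominance)             # [1.0, -1.0, 1.0, 0.0]
```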

Large Random Regression (GBLUP from large marker files)

One use of the GRM matrix is to fit a Genomic model where G=MM'/s, M is a centred matrix of marker scores (0, 1, 2) and s=2Σ p_i(1-p_i) where p_i is the proportion for the minor allele. The marker file is recognised by the filename extension .mkr and is specified after any pedigree file and before the data file (with any other GRM files). There may only be one .mkr file. The standard format is to have SNP markers coded in the columns, Genotypes in the rows, the first column identifying (labelling) the Genotypes and the first row identifying the markers. ASReml does not utilize the information in the first row or first column. The syntax for the line is
SNP.mkr !IDS n !MARKERS m [!CENTRE !CSKIP c !SKIP s !HEADER h] where
SNP.mkr is the name of the marker file
!IDS n informs ASReml to expect markers for n genotypes
!MARKERS m informs ASReml to expect data for m markers
!CENTRE requests ASReml to centre the marker scores at the mean; this also reverses the coding if the mean is greater than one.
!CSKIP c tells ASReml to skip the first c columns in the file. Use it if the first field is not an integer, or if there is more than one field to skip.
!SKIP s tells ASReml to skip the first s rows in the file.
!HEADER 0 indicates that a numeric header line is not present. If the header is alphanumeric, use !SKIP 1 !HEADER 0.

Missing values in the marker file may be represented by *, NA or any number outside the range (-2,2). They are replaced by the mean of the marked levels.
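
As a rough sketch of the construction G=MM'/s, the following Python fragment centres a small matrix of 0/1/2 scores and scales the cross-product. It is not ASReml's implementation: the function name `genomic_matrix` and the data are hypothetical, missing values are assumed to have been replaced already, and p_i is assumed to be half the column mean. Note that p(1-p) is unchanged if the coding is reversed, so s does not depend on which allele is counted.

```python
def genomic_matrix(scores):
    """Build G = M M' / s from complete 0/1/2 scores.

    Rows are genotypes and columns are markers. Assumptions: p_i is taken
    as half the column mean, and columns are centred at their means.
    """
    n, m = len(scores), len(scores[0])
    means = [sum(row[j] for row in scores) / n for j in range(m)]
    # Centre each marker column at its mean.
    M = [[row[j] - means[j] for j in range(m)] for row in scores]
    # s = 2 * sum of p(1-p) over markers, with p = mean / 2.
    s = sum(2 * (mu / 2) * (1 - mu / 2) for mu in means)
    G = [[sum(M[a][j] * M[b][j] for j in range(m)) / s for b in range(n)]
         for a in range(n)]
    return G, s

# Hypothetical scores for 3 genotypes (rows) and 4 markers (columns).
scores = [[2, 1, 0, 2],
          [1, 1, 1, 2],
          [0, 1, 2, 1]]
G, s = genomic_matrix(scores)
for row in G:
    print([round(g, 4) for g in row])
```

Because every column of M is centred, each row of G sums to zero, which is one reason a matrix built this way is singular.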

Example

 !WORK 1
 Nassau Clone Data
 Nfam 71 !A
 Nfemale 26 !A
 Nmale 37 !A
 Nclone 857 !A   !L  Clones.txt !LSKIP 1
 rep 8 !A
 iblk 80 !A
 culture 2 !A
 DBH6	

 snpData.mkr  !SKIP 1 !HEAD 0  !CENTRE   !MARKERS 4854 !IDS 923

 nassau.csv !MAXIT 30 !SKIP 1 !DFF -1
 DBH6 ~  mu culture culture.rep !r grm1(Nclo) 0.27 Nclone 0.15 rep.iblk 0.31

Clones.txt contains the genotype identifiers (as used in the data file) as a list in the first field, in the same order as the marker file snpData.mkr, which looks like

 Genotype,0-10024-01-114,0-10037-01-257,0-10040-02-394,...
 140099,2,2,1,2,2,2,2,2,2,1,2,1,2,1,1,2,1,2,2,2,2,2,1,2...
 141099,2,2,0,0,2,2,1,2,2,1,2,1,2,2,0,2,2,2,2,1,2,2,1,1...
 ...
 547853,2,2,1,2,2,2,1,2,2,0,2,1,2,2,2,2,2,2,2,1,2,...
 547966,2,2,1,1,1,2,0,2,2,1,2,2,2,2,2,2,2,2,2,1,2,...
 548082,2,2,1,2,2,2,1,2,1,2,2,1,2,2,1,2,2,2,2,1,2,...


Using GRM matrices

One use of the GRM matrix is to allow more computationally efficient fitting of random regression models associating u, a vector of f factor effects, with v, a vector of m regression effects, through the model u=Mv, where the matrix M contains m regressor variables for each of the f levels of the factor. Direct fitting of the regression effects is facilitated by the "my basis function" (the mbf function), which associates the regressor variables with the levels of the factor, essentially fitting ZMv where Z is the design matrix linking observations to the levels of the factor. But if m is much bigger than f, it is more computationally efficient to fit the equivalent model Zu with a variance structure for u based on MM'. ASReml can read the matrix M associated with a factor and group of regressor variables from a .grr file, construct a GRM matrix (G=MM'/s), fit the equivalent model and report both factor and regressor predictions. One common case of this model is when u represents genotype effects, the regressors represent SNP marker counts (typically 0/1/2) and v are marker effects.
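
The equivalence of the two forms can be seen in a toy Python sketch. The matrices and the helper names `matmul` and `transpose` are hypothetical and the sizes are far smaller than any real marker set. It checks that ZMv and Zu give the same fitted contributions when u=Mv, and forms the f x f kernel MM'. If the regression effects are assumed independent with common variance, the variance matrix of u=Mv is proportional to MM', which is why the factor form needs only an f x f matrix however large m is.

```python
def matmul(A, B):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    """Return the transpose of a list-of-rows matrix."""
    return [list(col) for col in zip(*A)]

# Hypothetical sizes: f = 2 factor levels, m = 5 regressors, 3 observations.
M = [[0, 1, 2, 1, 0],
     [2, 1, 0, 1, 2]]                        # f x m regressor matrix
Z = [[1, 0], [0, 1], [1, 0]]                 # observations x f design matrix
v = [[0.5], [-1.0], [0.25], [2.0], [0.0]]    # m x 1 regression effects

u = matmul(M, v)                     # factor effects, u = Mv
direct = matmul(matmul(Z, M), v)     # regression form, Z M v
equivalent = matmul(Z, u)            # factor form, Z u
MMt = matmul(M, transpose(M))        # f x f kernel for the variance of u

print(direct == equivalent)          # True
print(MMt)                           # [[6, 2], [2, 10]]
```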

The .grr file is specified after any pedigree file and before the data file (with any other GRM files). There may only be one .grr file. It is assumed to contain a row for each level of the factor, each row containing m regressor values. Optionally the factor level name associated with the i-th row can be included before the relevant regressor values. Also a heading row might include a name for each field/regressor variable. Superfluous fields before the factor or regressor fields can be skipped and superfluous rows before the regressor information can be skipped.

The syntax for specifying and reading the .grr file is
M.grr [!CSKIP c₁] Factor [f] [!NOID] [!CSKIP c₂] Regressors [m] [!NONAMES] [!SKIP s]
where
M.grr is the name of the file to be read,
!CSKIP c₁ indicates c₁ fields are to be skipped before the factor identifiers are read,
Factor is the name of the variable in the data that is associated with the regressors,
f sets the maximum number of levels (default 1000) of Factor with regressor data; ASReml will count the actual number,
!NOID indicates that the factor identifiers are not present in the .grr file,
!CSKIP c₂ indicates c₂ fields are to be skipped before the regressor variables are read,
Regressors is the name for the set of regressor variables,
m sets the number of regressor variables (default is the number of names found); must be set if there are extraneous fields to be ignored,
!SKIP s specifies how many lines are to be skipped before reading the regressor data,
!NONAMES indicates there is no line containing the individual names of the regressor variables; otherwise names are taken from the first (non-skipped) line in the file.

If the factor identifiers are not present (!NOID), ASReml assumes that the order of the factor classes in the data file matches the order in the .grr file. If the factor identifiers are present, ASReml uses the identifiers obtained from the .grr file to define the order of the factor classes when the data is read; any extra identifiers in the data not in the .grr file are appended at the end of the factor level name list. If !NOID is set, identifiers in the .grr file are not needed and if present should be skipped using !CSKIP.
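
The level ordering described above, when identifiers are present, can be sketched in Python as follows. This is only an illustration of the stated rule, not ASReml's code: the function name `factor_levels` and the identifier 999001 are hypothetical.

```python
def factor_levels(grr_ids, data_ids):
    """Order levels as: .grr identifiers first, then unseen data identifiers."""
    levels = list(grr_ids)
    seen = set(levels)
    for ident in data_ids:
        if ident not in seen:
            levels.append(ident)   # extra identifier found only in the data
            seen.add(ident)
    return levels

grr_ids = ["140099", "141099", "547853"]               # row order in the .grr file
data_ids = ["547853", "999001", "140099", "999001"]    # identifiers as met in the data
levels = factor_levels(grr_ids, data_ids)
print(levels)   # ['140099', '141099', '547853', '999001']
```

Levels that appear only in the data have no regressor values, which is consistent with the example output further on, where 926 Clone levels are found but only 923 have marker data.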

Values are typically TAB, COMMA or SPACE separated but may be packed (no separator) when all values are integers 0/1/2. Missing values in the regression variables may be represented by * or NA. Invalid data is also treated as missing. Missing values are replaced by the mean of the respective regressor. Alternative missing data methods that involve imputation from neighbouring markers have not been implemented.
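
A minimal Python sketch of these reading rules is given below. It is not how ASReml parses the file: the function names `parse_row` and `impute_column_means` are hypothetical, rows are assumed to hold regressor values only (identifiers already skipped), and a field with no separators made up solely of the characters 0, 1 and 2 is assumed to be packed.

```python
def parse_row(line):
    """Split one row of regressor values; None marks a missing value.

    Assumption: a row with no separators and only the characters 0/1/2 is
    treated as packed, one value per character.
    """
    line = line.strip()
    if line and all(ch in "012" for ch in line):
        return [float(ch) for ch in line]
    values = []
    for field in line.replace(",", " ").split():   # TAB, COMMA or SPACE separated
        try:
            values.append(float(field))
        except ValueError:                         # '*', 'NA' and other invalid fields
            values.append(None)
    return values

def impute_column_means(rows):
    """Replace None by the mean of the observed values in the same column."""
    m = len(rows[0])
    means = []
    for j in range(m):
        observed = [r[j] for r in rows if r[j] is not None]
        means.append(sum(observed) / len(observed))
    return [[means[j] if r[j] is None else r[j] for j in range(m)] for r in rows]

# Hypothetical rows: comma separated, TAB separated and packed.
lines = ["2,1,*,0", "0\t1\t2\tNA", "2101"]
rows = impute_column_means([parse_row(l) for l in lines])
for r in rows:
    print(r)
```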

Some general qualifiers are:
!SAVEGIV instructs ASReml to write the G matrix in .dgiv format,
!PSD s declares that the derived variance matrix may have up to s singularities,
!PEV requests calculation of the prediction error variance of marker effects, which is reported in the .mef file. This calculation is computationally very expensive,
!CENTRE [c] requests ASReml to centre the regressors at c if c is specified, else at the individual regressor means; otherwise the G matrix is formed from uncentred regressors.

Other qualifiers relate specifically to whether the regressors are markers. Markers are typically coded 0/1/2 being counts of the minor allele. However, if they are imputed, they will take real values between 0 and 2. Since marker files may be huge,
!SMODE b sets the storage mode for the regressor data, indicating whether it is marker data: b = 2 sets 2bit storage for strictly 0/1/2 marker data, b=8 (the default) sets 8bit storage useful for marker data with imputed values having 2 digits after the decimal, b = 16 sets 16bit storage useful for marker data with imputation with more than 2 digits and b = 32 sets 32bit real storage and should be used for non-marker data,
!RANGE l h indicates that the marker scores range over l:h and are to be transformed to have a range of 0:2,
!GSCALE s controls the scaling of the GRM matrix. If unspecified, s=Σ 2p(1-p) is used for marker data and s=1 for non-marker data (!SMODE 32). Scaling is often used with centred marker data to scale the MM' matrix so that it is a genomic matrix.
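
The effect of !RANGE followed by the default scaling can be sketched in Python. This is an illustration under stated assumptions rather than ASReml's algorithm: the transformation to 0:2 is assumed to be linear, p is assumed to be half the mean of the 0:2 scores, and the names `rescale_to_0_2` and `default_gscale` are hypothetical.

```python
def rescale_to_0_2(x, low, high):
    """Linearly map a score from the range low:high onto 0:2 (assumed linear)."""
    return 2.0 * (x - low) / (high - low)

def default_gscale(columns):
    """Sum of 2p(1-p) over markers, taking p as half the mean 0..2 score."""
    total = 0.0
    for col in columns:
        p = sum(col) / (2.0 * len(col))
        total += 2.0 * p * (1.0 - p)
    return total

# Hypothetical markers coded -1/0/1, one list per marker (column).
raw = [[-1, 0, 1, 1], [0, 0, 1, -1]]
cols = [[rescale_to_0_2(x, -1, 1) for x in col] for col in raw]
s = default_gscale(cols)
print(cols)   # [[0.0, 1.0, 2.0, 2.0], [1.0, 1.0, 2.0, 0.0]]
print(s)      # 0.96875
```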

Example

In this forestry example, there are multiple trees per clone and so two clone model terms:
grm1(Clone) fits the additive genetic variance based on marker covariance,
Clone fits the non-additive variance, while
the residual represents the within-clone sampling variance.

 !WORK 1
 Nassau Clone Data
  Nfam 71 !A
  Nfemale 26 !A
  Nmale 37 !A
  Clone  !A 860
  rep 8
  iblk 80
  culture  !A
  DBH6

 snpData.grr Clone

 nassau.csv !MAXIT 30 !SKIP 1 !DFF -1
 DBH6 ~  mu culture/rep  !r grm1(Clon) 0.27 Clone 0.15 rep.iblk 0.31

where snpData.grr is first used to declare Clone identifiers (taken from the first field) in the correct order, and then contains the marker scores; it looks like

 Genotype,0-10024-01-114,0-10037-01-257,0-10040-02-394,...
 140099,2,2,1,2,2,2,2,2,2,1,2,1,2,1,1,2,1,2,2,2,2,2,1,2...
 141099,2,2,0,0,2,2,1,2,2,1,2,1,2,2,0,2,2,2,2,1,2,2,1,1...
 ...
 547853,2,2,1,2,2,2,1,2,2,0,2,1,2,2,2,2,2,2,2,1,2,...
 547966,2,2,1,1,1,2,0,2,2,1,2,2,2,2,2,2,2,2,2,1,2,...
 548082,2,2,1,2,2,2,1,2,1,2,2,1,2,2,1,2,2,2,2,1,2,...

The primary output follows.

 ASReml 4.1 [01 Apr 2014] Testing Pedigree Matrices against Marker Matrices for Variance Partition with Na
   Build lg [15 Sep 2014]   64 bit  Windows x64
 16 Sep 2014 14:11:26.277   1024 Mbyte  clonesHT6_2/clones
..

 Nfam 71 !A
 Nfemale 26 !A
 Nmale 37 !A
 Clone  !A 860
 MatOrder 914 !A
 rep 8 !A
 iblk 80 !A
 prop 1 !A
 culture 2 !A
 treat 2 !A
 measure 1 !A
 CWAC6 !M-9
 Class names for factor "Clone" are initialized from the .grr file.
 Marker Header: Genotype,0-10024-01-114,0-10037-01-257,0
        4854 Marker labels found
 Marker labels 0-10024-01-114 ... UMN-CL98Contig1-
 Notice: SNP data begins: 140099,2,2,1,2,2,2,2,2,2,1,2,1,2,1,1,
 Notice: Markers coded -9 treated as missing.
 Marker data [0/1/2] for 923 genotypes and 4854 markers read from snpData.grr
      160414 missing marker values (  3.6%) replaced by column average!
        Marker values ranged 0.00 to 2.00
        Marker Means ranged  1.00 to 2.00
          Sigma2p(1-p) is   1057.12515
 GIV1  snpData.grr      923       9     -947.91
 QUALIFIERS: !MAXIT 30 !SKIP 1 !DFF -1
 QUALIFIER: !DOPART    2 is active
 Reading nassau.csv  FREE FORMAT skipping     1 lines

 Univariate analysis of HT6
 Summary of 6399 records retained of 6795 read

  Model term          Size #miss #zero   MinNon0    Mean      MaxNon0  StndDevn
   1 Nfam               71     0     0      1    36.3379         71
   2 Nfemale            26     0     0      1    12.8823         26
   3 Nmale              37     0     0      1    15.2285         37
  Warning: More levels found in Clone  than specified
   4 Clone             926     0     0      1   464.6765        926
  Warning: Fewer levels found in MatOrder  than specified
   5 MatOrder          914     0     0      1   432.5760        860
   6 rep                 8     0     0      1     4.4837          8
   7 iblk               80     0     0      1    40.1164         80
   8 tree                      0     0 1.0000      7.473      14.00      4.018
   9 row                       0     0 1.0000      28.52      56.00      16.09
  10 col                       0     0 1.0000      10.50      20.00      5.760
  Warning: Fewer levels found in prop  than specified
  11 prop                2     0     0      1     1.0000          1
  12 culture             2     0     0      1     1.4945          2
  13 treat               2     0     0      1     1.4945          2
  Warning: Fewer levels found in measure  than specified
  14 measure             2     0     0      1     1.0000          1
  15 SURV                      0     6 1.0000     0.9991     1.0000     0.3061E-01
  16 DBH6                      4     0 0.3000E-01  11.29      18.80      2.400
  17 HT6            Variate    0     0  76.20      838.6      1286.      163.6
  18 HT8                      83     0  91.44      1148.      1576.      170.6
  19 CWAC6                  3167     0  97.54      301.3      542.5      52.26
  20 mu                          1
  21 culture.rep                16 12 culture   :   2   6 rep            :    8
 Warning: GRM matrix is too SMALL
  22 grm1(Clone)       923
  23 rep.iblk                  640  6 rep       :   8   7 iblk           :   80
 Forming     2508 equations:  19 dense.
 Initial updates will be shrunk by factor    0.316
 Notice: LogL values are reported relative to a base of -30000.000
 Notice:     11 singularities detected in design matrix.
   1 LogL=-2845.13     S2=  8956.4       6390 df
   2 LogL=-2798.45     S2=  8568.1       6390 df
   3 LogL=-2758.19     S2=  8131.3       6390 df
   4 LogL=-2741.14     S2=  7766.2       6390 df
   5 LogL=-2740.55     S2=  7702.9       6390 df
   6 LogL=-2740.54     S2=  7700.1       6390 df

          - - - Results from analysis of HT6 - - -
 Akaike Information Criterion    65489.09 (assuming 4 parameters).
 Bayesian Information Criterion  65516.14

 Model_Term                      Gamma         Sigma   Sigma/SE   % C
 rep.iblk         IDV_V  640  0.307847       2370.47      13.00   0 P
 grm1(Clone)      GRM_V  923  0.275811       2123.79       5.82   0 P
 Clone            IDV_V  926  0.152452       1173.90       6.08   0 P
 Residual         SCA_V 6399  1.000000       7700.14      49.64   0 P

                                   Wald F statistics
     Source of Variation           NumDF              F-inc
  20 mu                                1           0.11E+06
  12 culture                           1            2616.00
  21 culture.rep                       6              30.44
  23 rep.iblk                            640 effects fitted
  22 grm1(Clone)                         923 effects fitted
   4 Clone                               926 effects fitted (   66 are zero)
          78  possible outliers: see .res file
 Finished: 16 Sep 2014 14:12:50.574   LogL Converged

Notes:
Of 926 clones identified, 860 have data and 923 have genomic data.
The .res file contains additional details about the analysis, including a listing of the larger marker effects. All marker effects are reported in the .mef file.
Particular columns of the .grr data can be included in the model using the grr(Factor,i) model term, where Factor is the factor associated with the regressors and i specifies which regressor variable (by number) to include.

 Listing of the larger marker effects
       368  0-12761-01-121     1.40736       0.00000
       617  0-14383-01-111     1.26081       0.00000
       777  0-15417-01-138    -1.25597       0.00000
      1246  0-18644-02-210     1.22522       0.00000
      1903  0-6963-01-202     -1.24800       0.00000
      2102  0-8683-02-432      1.15496       0.00000
      2445  2-1563-02-244     -1.35181       0.00000
      2497  2-2167-01-413     -1.21339       0.00000
      3180  2-8668-03-42      -1.21629       0.00000
      3521  CL1577Contig1-03  -1.15833       0.00000
      3802  CL2573Contig1-03   1.17005       0.00000
      4195  CL595Contig1-01-  -1.19330       0.00000
      4351  UMN-1397-01-416   -1.34916       0.00000
