Factor Definition

Introduction

The data fields are defined immediately after the job title. They tell ASReml how many fields to expect in the data file and what they are. No more than 10,000 variables may be read or formed.

Data field definitions
  • should be given for all fields in the data file; data fields on the end of a data line that do not have a corresponding field definition will be ignored,
  • must be presented in the order in which they appear in the data file,
  • may be indented one or more spaces,
  • may appear with other definitions on the same line,
  • may transform data fields as they are defined (see below),
  • may create additional data fields by transformation; these should be listed after the data fields read from the data file.

    Syntax

    Usually there will be a field definition for every data field. For example, field definitions typical of a simple randomised block might be
     Randomised Block Experiment       #  Title Line
      Blocks *                         #  coded 1...
      Treatments !A                    #  alphabetic names
      yield                            #  response variable
     rcb.dat                           #  data file
     yield ~ mu Treatments !r Blocks   #  model line
    

    Field definitions appear in the ASReml command file in the form
  • LABEL [ FieldType ] [ Size ] [ transformations ], where LABEL names the data field.

    The LABEL
  • is an alphanumeric string to identify the field,
  • has a maximum of 48 characters of which only 20 are printed; the remaining characters are not displayed,
  • must begin with a letter,
  • must not contain the special characters ., *, :, /, !, #, | or ( ,
  • must not be the name of a predefined model term or variance structure.
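
    The LABEL rules above can be checked mechanically before a command file is run. The following is a minimal sketch in Python, not part of ASReml; the function names check_label and printed_form are hypothetical. It covers only the length, first-character and special-character rules, because the list of predefined model terms and variance structures is not given here.

```python
import string

# Characters the text lists as not allowed in a LABEL.
FORBIDDEN = set(".*:/!#|(")
MAX_LABEL_LENGTH = 48   # maximum label length stated in the text
PRINTED_LENGTH = 20     # only this many characters are printed


def check_label(label):
    """Return a list of rule violations; an empty list means the label passes.

    Hypothetical helper: reserved names (predefined model terms and
    variance structures) are NOT checked.
    """
    if not label:
        return ["label is empty"]
    problems = []
    if len(label) > MAX_LABEL_LENGTH:
        problems.append(f"longer than {MAX_LABEL_LENGTH} characters")
    if label[0] not in string.ascii_letters:
        problems.append("does not begin with a letter")
    bad = sorted(set(label) & FORBIDDEN)
    if bad:
        problems.append("contains special characters: " + " ".join(bad))
    return problems


def printed_form(label):
    """Mimic the truncation applied when a label is printed."""
    return label[:PRINTED_LENGTH]
```

    For instance, check_label("Treatments") returns an empty list, while check_label("a.b") reports the full stop as a special character.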

    FieldType and Size control how a variable is interpreted as it is read, and whether it is regarded as a factor or a variate when specified in the linear model.

    For a simple variate, the Size attribute is 1; leave FieldType and Size blank or specify 1 after the LABEL.

    For a model factor, various qualifiers (FieldType) are required depending on how the factor levels (Size > 1) relate to the values in the data field.

    * or n [ !L c ]
    indicates the data field has n classes labelled 1... n directly coding for the factor. Names may be supplied for the classes by specifying !L and a list of names. The list of names may run over several lines if incomplete lines end with a comma (,). These names will be printed in the .sln and .pvs output files. If * is specified, the largest value in the field is taken as the number of classes.
    For example
     Sex 2 !L Male Female
     Row *
    
    !A [n] [ !L c|f [ !LSKIP r ] ]
    indicates the data field is alphanumeric; n must be specified if more than 2000 level names are present, for example
     Location !A
    

    Class names are assigned in the order they are discovered. The !L qualifier allows the user to provide a list of class names (thereby explicitly setting the order). The names may be listed in the first field of a file. Use !LSKIP to skip any heading lines in the file. Enclose the filename in quotes if it does not contain a file extension. Examples
     Sex !A !L Male Female # Ensure Male is coded 1.
     Genotype !A !L MyNames.txt
     Genotype !A !L 'My Names.txt'   !LSKIP 1
     Genotype !A !L 'MyNames'
    
    Note that class names not present in the list but found in the data are appended to the list.
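
    The coding rule described above (codes assigned in order of discovery, with an optional !L list fixing the leading codes and unlisted names appended) can be illustrated with a short Python sketch. This is an illustration of the stated behaviour only, not ASReml's implementation; code_levels and its arguments are hypothetical names.

```python
def code_levels(values, preset_names=()):
    """Assign 1-based codes to class names.

    preset_names plays the role of an !L list: those names take the first
    codes in the order given. Names met in the data but absent from the
    list are appended in order of discovery.
    """
    codes = {}
    for name in preset_names:
        codes.setdefault(name, len(codes) + 1)
    coded = []
    for value in values:
        if value not in codes:
            codes[value] = len(codes) + 1
        coded.append(codes[value])
    return codes, coded
```

    With the data values Female, Male, Female, Unknown and no list, Female is coded 1. Supplying the list ("Male", "Female") makes Male code 1 and Female code 2, and Unknown is appended as code 3.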

    !I [n]
    indicates the data are numeric but not coded 1... n; n must be specified if more than 1000 codes are present, for example Year !I.

    !AS p
    indicates the data field is similar to a previous !A variable p and is to be coded identically, for example in a plant diallel experiment
     Male !A 22 Female !AS Male # integrated coding

    !P
    indicates the special case of a pedigree factor; ASReml will determine the classes from the pedigree file.

    In all of these, a warning is printed if the nominated value for n does not agree with the actual number of levels found in the data; if the nominated value is too small, the correct value is used.

    !G m [n]
    is used when m contiguous data fields are to be treated as a set or group of variates (n omitted or 1) or factor variables (n> 1). For example
     :
      X1 X2 X3 X4 X5 y
     data.dat
     y ~ mu X1 X2 X3 X4 X5
    
    can be expressed as
     :
      X !G 5 y
     data.dat
     y ~ mu X
    
    so that the 5 variates can be referred to in the model as X by using X !G 5 as the factor definition.

    Date and Time fields

  • !DATE specifies the field has one of the date formats dd/mm/yy, dd/mm/ccyy, dd-Mon-yy or dd-Mon-ccyy and is to be converted into a Julian day. Here dd is a 1 or 2 digit day of the month, mm is a 1 or 2 digit month of the year, Mon is a three letter month name (Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec), yy is the year within the century (00 to 99) and cc is the century (19 or 20). The separators '/' and '-' must be present as indicated. The dates are converted to days since 1899. When the century is not specified, yy of 0-32 is taken as 2000-2032 and 33-99 is taken as 1933-1999.
  • !DMY specifies the field has one of the date formats dd/mm/yy or dd/mm/ccyy and is to be converted into a Julian day.
  • !MDY specifies the field has one of the date formats mm/dd/yy or mm/dd/ccyy and is to be converted into a Julian day.
  • !TIME specifies the field has the format hh:mm:ss and is to be converted into seconds past midnight, where hh is hours (0 to 23), mm is minutes (0 to 59) and ss is seconds (0 to 59). The separator ':' must be present as indicated.
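
    The century rule for !DATE and the !TIME conversion can be sketched in Python as follows. This is an illustration under stated assumptions, not ASReml code: the function names are hypothetical, month names are matched case-sensitively, and the result is a calendar date rather than a day number because the exact origin of the day count is not spelled out beyond "days since 1899".

```python
from datetime import date

MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]


def expand_year(year_text):
    """Apply the century rule to a yy or ccyy year string."""
    year = int(year_text)
    if len(year_text) == 4:          # century given explicitly (ccyy)
        return year
    # yy of 0-32 is taken as 2000-2032, 33-99 as 1933-1999
    return 2000 + year if year <= 32 else 1900 + year


def parse_date(text):
    """Parse dd/mm/yy, dd/mm/ccyy, dd-Mon-yy or dd-Mon-ccyy into a date."""
    if "/" in text:
        day, month, year = text.split("/")
        month_number = int(month)
    else:
        day, month, year = text.split("-")
        month_number = MONTHS.index(month) + 1   # assumes exact spelling
    return date(expand_year(year), month_number, int(day))


def seconds_past_midnight(text):
    """Convert hh:mm:ss into seconds past midnight."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return hours * 3600 + minutes * 60 + seconds
```

    Under these assumptions, "5/3/32" is read as 5 March 2032 while "05/03/33" is read as 5 March 1933, and "23:59:59" converts to 86399 seconds.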

    Storage of alphabetic factor labels

    Space is allocated dynamically for the storage of alphabetic factor labels, with a default allocation of 2000 labels, each 16 characters long. If there are large !A factors (so that the total across all factors will exceed 2000), you must specify the anticipated size (within say 5%).
  • If some labels are longer than 16 characters and the extra characters are significant, you must lengthen the space for each label by specifying !LL c, for example

    cross !A 2300 !LL 48
    indicates the factor cross will have about 2300 levels and needs 48 characters to hold the class names. Note that only the first 20 characters of the names are ever printed. The names from all factors are held in a single list and the name length need only be set once, at the highest value required.
  • !PRUNE on a field definition line means that if fewer levels are actually present in the factor than were declared, ASReml will reduce the factor size to the actual number of levels. Use !PRUNALL for this action to be taken on the current and subsequent factors up to (but not including) a factor with the !PRUNEOFF qualifier.
  • The user may overestimate the size for large ALPHA and INTEGER coded factors so that ASReml reserves enough space for the list. Using !PRUNE will mean the extra (undefined) levels will not appear in the .sln file. Since it is sometimes necessary that factors not be pruned in this way, for example in pedigree/GIV factors, pruning is only done if requested.
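
    The relationship between the declared size, the number of levels actually found and !PRUNE can be summarised in a small Python sketch. It follows only the behaviour stated in this section (a warning on any mismatch, the correct value used when the declared size is too small, reduction to the actual size only when pruning is requested); resolve_factor_size is a hypothetical name and the warning wording is invented for illustration.

```python
def resolve_factor_size(declared, actual, prune=False):
    """Return (size_used, warning) for a factor.

    declared: number of levels given on the field definition line
    actual:   number of levels found in the data
    prune:    True when pruning has been requested for this factor
    """
    warning = None
    if declared != actual:
        warning = f"declared {declared} levels but found {actual}"
    if declared < actual:
        return actual, warning       # too small: the correct value is used
    if prune:
        return actual, warning       # overestimate trimmed on request
    return declared, warning         # overestimate kept (no pruning)
```

    For a factor declared with 2300 levels of which 2250 occur in the data, the sketch keeps 2300 without pruning and returns 2250 when pruning is requested.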

    Reordering the factor levels

    !SORT declared after !A or !I on a field definition line will cause ASReml to sort the levels so that labels occur in alphabetic/numeric order for the analysis. As ASReml reads the data file, it encodes !I and !A factor levels in the order they appear in the data. For example, the user cannot tell whether SEX will be coded 1=Male, 2=Female or 1=Female, 2=Male without looking at the data file to see whether Male or Female appears first in the SEX field. If !SORT is specified, ASReml creates a lookup table after reading the data to select levels in sorted order and uses this sorted order when forming the design matrices. Consequently, with the !SORT qualifier, the order of fitted effects will be 1=Female, 2=Male in the analysis regardless of which appears first in the file. This can lead to some confusion because some other operations will be applied to the unsorted order. In particular, any transformations are performed as the data is read in and before the sorting occurs.

    !SORTALL means that the levels for the current and subsequent factors are to be sorted.
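
    The difference between discovery-order codes and the sorted order used with !SORT can be pictured with the following Python sketch. It is an illustration of the lookup-table idea described above, not ASReml's internal code; discovery_codes and sorted_lookup are hypothetical names.

```python
def discovery_codes(values):
    """Code levels 1, 2, ... in the order they first appear in the data."""
    codes = {}
    for value in values:
        codes.setdefault(value, len(codes) + 1)
    return codes


def sorted_lookup(codes):
    """Map each discovery-order code to its position in sorted label order.

    String labels sort alphabetically and integer labels numerically,
    matching the alphabetic/numeric ordering described in the text.
    """
    ordered = sorted(codes)
    return {codes[label]: rank for rank, label in enumerate(ordered, start=1)}
```

    If Male appears first in the data, discovery_codes gives Male=1 and Female=2, and sorted_lookup maps code 2 (Female) to position 1 and code 1 (Male) to position 2. Anything applied while the data is read, such as a transformation, still sees the discovery-order codes.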

    Skipping input fields

    !CSKIP f will skip f data fields. It is particularly useful in large files with alphabetic fields which are not needed as it saves ASReml the time required to classify the alphabetic labels. !CSKIP is intended to replace !SKIP. For example
     !CSKIP 1 Sire !I
    
    would skip the field before the field which is read as 'Sire'.

    !SKIP f will skip f data fields before reading the next field to keep. For example
     Sire !I !SKIP 1
    
    would skip the field before the field which is read as 'Sire'.
