Xml Input File

SIMMAP 1.5 uses an Xml-like formatted input file. This is by far the largest change from version 1. The input to the program (data, trees, and, for molecular data, model parameters) is collected into a single file rather than supplied as multiple files. To convert your file(s) to the required Xml format, see the Read Me file in the Nex2Xml folder of the SIMMAP 1.5 distribution, or download the most recent distribution here.

Some common problems experienced with the Nex2Xml program:

  1. The program does not produce a valid output file because the Nexus input file is formatted with separate taxa and characters blocks (common when exporting from MacClade, Mesquite, or PAUP*). Simply save the file with a single data block. Nex2Xml and all versions of SIMMAP do not support taxa and characters blocks.
  2. Trees exported by the program are not accepted by SIMMAP 1.5. See section 2 below.

The following is a description of the Xml format used by SIMMAP 1.5. See one of the sample files for additional examples.

A few things to keep in mind if you are going to generate these files on your own:

  1. The format conforms to basic Xml/Html behavior, using opening tags and closing tags. Some opening tags have attributes associated with them; others do not. SIMMAP 1.5 currently does not allow a forward slash at the end of an opening tag to indicate that the element is empty and has no closing tag. I realize this is probably a minor thing to fix, but it is also a minor thing to add the closing tag.
  2. The file can contain only one root element (see below) and it must be the <simmap></simmap> element.
  3. Two child elements of the root element are absolutely required: <data></data> and <trees></trees>.

Basic file anatomy:

The basic root and child element structure of the file is as follows (i.e., the basic, minimum, file should look like this):

<?xml version="1.0"?>
<simmap>

     <data ntaxa="4" nchars="4" datatype="dna">
     </data>
     <trees>
     </trees>
</simmap>

Of course, the indentation is optional and simply makes viewing easier.

Within the <data>...</data> element will be a set of sequence tags containing the sequence names and characters (see below).

Within the <trees>...</trees> element will be a set of translate and tree tags containing the sequence names and the trees (see below).

Finally, one optional child element of the root is <parameters>...</parameters> (not shown above). This element applies only to molecular data, and it can be omitted regardless of the data type (see below for more details).
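The structural rules above (a single <simmap> root, required <data> and <trees> children, no self-closing tags) can be checked or generated with a short script. The following is a minimal Python sketch and is not part of SIMMAP. The function names check_skeleton and build_skeleton are made up for illustration, and the sketch assumes the file can be read by a standard Xml parser.

```python
import xml.etree.ElementTree as ET


def check_skeleton(xml_text):
    """Return a list of structural problems; an empty list means none were found."""
    problems = []
    # ET.fromstring raises ParseError if the text has more than one root element.
    root = ET.fromstring(xml_text)
    if root.tag != "simmap":
        problems.append("root element must be <simmap>")
    for required in ("data", "trees"):
        if root.find(required) is None:
            problems.append(f"missing required <{required}> element")
    return problems


def build_skeleton(ntaxa, nchars, datatype):
    """Build the minimum file as text, always writing explicit closing tags."""
    root = ET.Element("simmap")
    ET.SubElement(root, "data", ntaxa=str(ntaxa), nchars=str(nchars), datatype=datatype)
    ET.SubElement(root, "trees")
    # short_empty_elements=False writes <trees></trees> instead of <trees />.
    body = ET.tostring(root, encoding="unicode", short_empty_elements=False)
    return '<?xml version="1.0"?>\n' + body


if __name__ == "__main__":
    text = build_skeleton(4, 4, "dna")
    print(text)
    print(check_skeleton(text))
```

The short_empty_elements=False argument matters here because many Xml writers emit the self-closing form by default, which SIMMAP 1.5 does not accept.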

All of the elements and their child elements are discussed below.

1. <data> Element:

The data is contained within the data tags. The opening tag requires a number of attributes that define the data matrix contents. The required attributes are: ntaxa, nchars, and datatype.

ntaxa defines the number of taxa and should take the following form, ntaxa="4", with 4 being replaced by the actual number of taxa or sequences.

nchars defines the number of characters and should take the following form, nchars="4", with 4 being replaced by the actual number of characters or sites.

datatype defines the type of data and should take the following form, datatype="dna". The following are valid values: dna, rna, nucleotide, and standard.

A valid opening tag should look like this:

<data ntaxa="4" nchars="4" datatype="dna">

Each sequence of characters should be enclosed within a <seq name="taxon_name"></seq> tag. The opening tag, as shown, has the required name attribute which defines the taxon or sequence label. Each <seq name="taxon_name"> must have a unique name attribute.

The following is an example of a data block:

<data ntaxa="4" nchars="4" datatype="dna">
     <seq name="mickey">AACT</seq>
     <seq name="minnie">ACCT</seq>
     <seq name="goofey">ATTT</seq>
     <seq name="donald">CGGA</seq>
</data>

Constraints, or lack of constraints, in the <data> element:

  1. Nucleotide data coding (values between the <seq name="xxxx">...</seq> tags): All IUPAC symbols are allowed. However, the use of "." to represent "the same as previous" is not supported. Characters can be lowercase or uppercase. White space, including spaces, tabs, and carriage [line] returns, can be included. However, the attribute name="xxxx" cannot include white space.
  2. Morphological/Standard data coding (values between the <seq name="xxxx">...</seq> tags): All characters should have states from 0 to 6. A two-state character must be coded as 0 and 1, a three-state character as 0, 1, and 2, and so on.
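The constraints on the <data> element can be checked before loading a file. The following Python sketch is one possible check and is not part of SIMMAP. The function name check_data is hypothetical. Accepting "-" and "?" in nucleotide data, and ignoring white space in standard data as well as nucleotide data, are assumptions not stated above.

```python
import xml.etree.ElementTree as ET

# IUPAC nucleotide codes. Accepting "-" and "?" for gaps and missing data is an
# assumption; the text only states that "." is not supported.
NUCLEOTIDE_SYMBOLS = set("ACGTURYSWKMBDHVN-?")
STANDARD_SYMBOLS = set("0123456")  # states 0 to 6
DATATYPES = ("dna", "rna", "nucleotide", "standard")


def check_data(data):
    """Return a list of problems found in a parsed <data> element."""
    for attr in ("ntaxa", "nchars", "datatype"):
        if data.get(attr) is None:
            return [f"missing required attribute {attr}"]
    if data.get("datatype") not in DATATYPES:
        return [f"invalid datatype {data.get('datatype')!r}"]
    ntaxa, nchars = int(data.get("ntaxa")), int(data.get("nchars"))
    standard = data.get("datatype") == "standard"
    allowed = STANDARD_SYMBOLS if standard else NUCLEOTIDE_SYMBOLS

    problems = []
    seqs = data.findall("seq")
    if len(seqs) != ntaxa:
        problems.append(f"ntaxa is {ntaxa} but {len(seqs)} <seq> elements found")
    names = [seq.get("name") for seq in seqs]
    if len(set(names)) != len(names):
        problems.append("duplicate name attribute")
    for seq in seqs:
        name = seq.get("name")
        if not name or any(ch.isspace() for ch in name):
            problems.append(f"invalid name attribute: {name!r}")
        # White space inside the sequence is dropped; case does not matter.
        chars = "".join((seq.text or "").split()).upper()
        if len(chars) != nchars:
            problems.append(f"{name}: {len(chars)} characters, expected {nchars}")
        bad = sorted(set(chars) - allowed)
        if bad:
            problems.append(f"{name}: unsupported symbol(s) {''.join(bad)}")
    return problems
```

For example, check_data(ET.parse("myfile.xml").getroot().find("data")) returns an empty list when no problem is found, where myfile.xml is a placeholder file name.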

2. <trees> Element:

The trees are contained within the <trees> element. Within the opening and closing tags, two kinds of elements are used: translate, which defines the translation (the integer that represents a taxon name in the trees), and tree, which defines a tree.

The translate tag defines the mapping of taxon names to the integers used in the trees. The number of taxa must be the same as in the data block. The tag, for example <translate id="1">mickey</translate>, has a single attribute, id, which defines the integer that will appear in each Newick tree. The taxon name goes between the translate tags, and its spelling must match the name in a <seq> definition.

The tree tag is used to define each tree and has no attributes. This is not to be confused with the <trees> tag. For example:

<tree>((1:0.1,2:0.1):0.1,(3:0.1,4:0.1):0.1)</tree>

The following is an example of a trees block:

<trees>
     <translate id="1">mickey</translate>
     <translate id="2">minnie</translate>
     <translate id="3">goofey</translate>
     <translate id="4">donald</translate>
     <tree>((1:0.1,2:0.1):0.1,(3:0.1,4:0.1):0.1)</tree>
     <tree>((1:0.1,3:0.1):0.1,(2:0.1,4:0.1):0.1)</tree>
</trees>
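The relationship between the <seq> names, the translate table, and the integers in each tree can be verified with a short script. The following Python sketch is not part of SIMMAP, and the function name check_trees is hypothetical. It assumes that every tree contains every taxon exactly once and that tip labels are plain integers.

```python
import re
import xml.etree.ElementTree as ET


def check_trees(root):
    """Return a list of problems found in the <trees> element of a parsed file."""
    problems = []
    seq_names = {seq.get("name") for seq in root.findall("data/seq")}
    trees = root.find("trees")
    # Map each translate id (kept as a string) to its taxon name.
    table = {t.get("id", ""): (t.text or "").strip() for t in trees.findall("translate")}
    if set(table.values()) != seq_names:
        problems.append("translate names do not match the <seq> names")
    for number, tree in enumerate(trees.findall("tree"), start=1):
        # A tip label is an integer that directly follows "(" or ",".
        tips = re.findall(r"[(,]\s*(\d+)", tree.text or "")
        if sorted(tips) != sorted(table):
            problems.append(f"tree {number}: tips do not match the translate ids")
    return problems
```

Branch lengths are not mistaken for tip labels because they follow a colon rather than an opening parenthesis or a comma.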

Solutions for what to do if you try to load the file and nothing happens or the program crashes:

  1. Remove comments from the trees before running the program. For example: ((1:[&color green]0.1,2:[&color green]0.1):0.1,(3:[&color red]0.1,4:0.1):0.1) should be ((1:0.1,2:0.1):0.1,(3:0.1,4:0.1):0.1).
  2. Branch lengths in your trees are derived from parsimony, so some are zero and others are not. By default, when SIMMAP 1.5 finds a zero-length branch (i.e., the length is absent, not zero), it applies a value of 0.1 to ALL of the branches (even those that have positive, non-zero lengths) in ALL trees. There are three possible solutions. First, estimate branch lengths using molecular data. Second, collapse all zero-length branches to hard polytomies. Third, load the trees and then apply branch lengths using the Assign Branch Lengths option (see here for more information).
  3. Remove clade support values from the trees. For example: ((1:0.1,2:0.1)'1.0':0.1,(3:0.1,4:0.1)'1.0':0.1) should be ((1:0.1,2:0.1):0.1,(3:0.1,4:0.1):0.1).
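The clean-up steps in items 1 and 3 can be scripted. The following Python sketch is not part of SIMMAP, and the function names clean_newick and count_zero_lengths are hypothetical. It assumes comments are enclosed in square brackets and support values are single-quoted labels placed directly after a closing parenthesis, as in the examples above. The zero-length count covers only lengths written explicitly as zero and does not detect lengths that are absent.

```python
import re


def clean_newick(newick):
    """Remove bracketed comments and quoted clade support values from a tree."""
    # Item 1: drop anything of the form [ ... ].
    without_comments = re.sub(r"\[[^\]]*\]", "", newick)
    # Item 3: drop a quoted label that directly follows a closing parenthesis.
    return re.sub(r"\)\s*'[^']*'", ")", without_comments)


def count_zero_lengths(newick):
    """Count branch lengths explicitly written as zero (run on a cleaned tree)."""
    lengths = re.findall(r":\s*([0-9.eE+-]+)", newick)
    return sum(1 for value in lengths if float(value) == 0.0)


if __name__ == "__main__":
    commented = "((1:[&color green]0.1,2:[&color green]0.1):0.1,(3:[&color red]0.1,4:0.1):0.1)"
    supported = "((1:0.1,2:0.1)'1.0':0.1,(3:0.1,4:0.1)'1.0':0.1)"
    print(clean_newick(commented))
    print(clean_newick(supported))
```

Both calls print ((1:0.1,2:0.1):0.1,(3:0.1,4:0.1):0.1), the cleaned form given in items 1 and 3.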

3. <parameters> Element:

The evolutionary model parameters for molecular data are contained within a parameters tag. This is optional for molecular data and is not required for morphological/standard data. Each set of model parameters is given as attributes in the opening tag of a <model></model> element. The following attributes are available (requirements are indicated in square brackets):

  1. nst: number of substitution types (valid values: 1, 2, and 6) [REQUIRED]
  2. pia: frequency of nucleotide A [REQUIRED IF nst > 1]
  3. pic: frequency of nucleotide C [REQUIRED IF nst > 1]
  4. pig: frequency of nucleotide G [REQUIRED IF nst > 1]
  5. pit: frequency of nucleotide T [REQUIRED IF nst > 1]
  6. kappa: transition/transversion rate ratio [REQUIRED IF nst = 2]
  7. rac: instantaneous rate from A<->C [REQUIRED IF nst = 6]
  8. rag: instantaneous rate from A<->G [REQUIRED IF nst = 6]
  9. rat: instantaneous rate from A<->T [REQUIRED IF nst = 6]
  10. rcg: instantaneous rate from C<->G [REQUIRED IF nst = 6]
  11. rct: instantaneous rate from C<->T [REQUIRED IF nst = 6]
  12. rgt: instantaneous rate from G<->T [REQUIRED IF nst = 6]
  13. alpha: shape parameter of discrete gamma distribution [REQUIRES nratecats]
  14. nratecats: number of discrete gamma categories [REQUIRES alpha]
  15. pinv: proportion of invariant sites (value must be between 0 and 1)
  16. rate: overall rate multiplier

The following is an example of a parameters block:

<parameters>
     <model nst="2" pia="0.25" pic="0.25" pig="0.25" pit="0.25"
     kappa="10.0" alpha="0.5" nratecats="4"></model>

     <model nst="2" pia="0.25" pic="0.25" pig="0.25" pit="0.25"
     kappa="5.0" alpha="0.5" nratecats="4"></model>

</parameters>

Each model is matched to a tree in the order in which they appear in the file. This only applies when trees and parameters are linked.
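The attribute requirements listed above, and the order-based matching of models to trees, can be sketched in Python as follows. This is an illustration and is not part of SIMMAP. The function names check_model and pair_models_with_trees are hypothetical, treating 0 and 1 as valid pinv values is an assumption, and the pairing assumes one model per tree.

```python
import xml.etree.ElementTree as ET

FREQUENCIES = ["pia", "pic", "pig", "pit"]
RATES = ["rac", "rag", "rat", "rcg", "rct", "rgt"]


def check_model(model):
    """Return a list of problems found in a parsed <model> element."""
    attrs = model.attrib
    nst = attrs.get("nst")
    if nst not in ("1", "2", "6"):
        return ["nst is required and must be 1, 2, or 6"]
    required = []
    if nst in ("2", "6"):
        required += FREQUENCIES  # base frequencies are needed when nst > 1
    if nst == "2":
        required.append("kappa")
    if nst == "6":
        required += RATES
    problems = []
    missing = [name for name in required if name not in attrs]
    if missing:
        problems.append(f"nst={nst} requires: {', '.join(missing)}")
    # alpha and nratecats each require the other.
    if ("alpha" in attrs) != ("nratecats" in attrs):
        problems.append("alpha and nratecats must be given together")
    if "pinv" in attrs and not 0.0 <= float(attrs["pinv"]) <= 1.0:
        problems.append("pinv must be between 0 and 1")
    return problems


def pair_models_with_trees(root):
    """Pair each tree with the model in the same position in the file."""
    trees = root.findall("trees/tree")
    models = root.findall("parameters/model")
    # zip stops at the shorter list, so unmatched trees or models are dropped.
    return [(tree.text, model.attrib) for tree, model in zip(trees, models)]
```

With the example parameters block above and a file containing two trees, the first tree would be paired with the kappa="10.0" model and the second with the kappa="5.0" model.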