Wilke:Using HyPhy

Notice: The Wilke Lab page has moved to http://wilkelab.org.
The page you are looking at is kept for archival purposes and will not be further updated.

The Basics

Each HyPhy analysis must include several essential components:

  • Data Set
    • This is a multiple sequence alignment file, which may be in one of several formats, including FASTA, PHYLIP, or NEXUS.
  • Data Filter
    • This selects which parts of the data set are used in the analysis. In the simplest case, the entire data set is processed as a single unit. In a more complex scenario, the data set may contain both introns and exons, which you would want to analyze under different evolutionary models. Defining one data filter per region "partitions" the data set so that each region can be given its own model.
  • Evolutionary Model
    • You will need to provide HyPhy with a rate matrix describing your substitution model of choice in order to process the data.
  • Phylogeny
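    • This should be in Newick format.
  • Likelihood Function
    • This combines the data filter and the phylogeny (with its evolutionary model) into the function that HyPhy optimizes.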
HyPhy Batch File

Here is a basic HyPhy script.

DataSet myData = ReadDataFile ("aln.fasta");
DataSetFilter myFilter = CreateFilter (myData,1,"","","");
F81RateMatrix =
                {{* ,mu,mu,mu}
                 {mu,* ,mu,mu}
                 {mu,mu,* ,mu}
                 {mu,mu,mu,* }};
HarvestFrequencies (obsFreqs, myFilter, 1, 1, 1);
Model F81 = (F81RateMatrix, obsFreqs);
UseModel (F81);
/* The tree is declared after the model so that the model is attached to its branches. */
Tree myTree = ((a,b),c,d);
LikelihoodFunction theLikFun = (myFilter, myTree);
Optimize (MLEs, theLikFun);
fprintf  (stdout, theLikFun);

Now, let's go line by line through the script above.

DataSet myData = ReadDataFile ("aln.fasta");

  • Stores your multiple alignment file in the variable myData. Note that the path to the file must be specified if it is not found in the working directory.
  • A phylogeny may optionally be appended to the end of the data file for use later in the analysis (see the Tree statement below).

DataSetFilter myFilter = CreateFilter (myData,1,"", "", "" );

  • Stores a data filter in the variable myFilter. From this point on, you should refer to your data through its filter rather than through the unprocessed data set read in previously. The function CreateFilter takes five arguments, the last three of which are optional: CreateFilter (DataSetId, Unit, Vertical Partition, Horizontal Partition, Exclusions);
  • DataSetId is the variable name for the previously imported data set, in this case called myData.
  • Unit defines how many characters should be treated as a single object. For codon data, this value would be 3 since every three characters are analyzed together. For nucleotide data, this value is 1.
  • Vertical Partition specifies which sites should be analyzed. In this case, the entire data set is analyzed together, so no partition is specified.
  • Horizontal Partition ......
  • Alphabet Exclusions (the Exclusions argument) is a comma-separated list of characters to be ignored during analysis. A typical example is the set of stop codons, which would be written "TAA,TAG,TGA".
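
As a rough sketch of how the partition and exclusion arguments might be used, the lines below define two nucleotide partitions and one codon filter. The site ranges and the filter names (firstRegion, secondRegion, codonFilter) are hypothetical and chosen only for illustration; site indices in a partition string are counted from 0.

```hyphy
DataSet myData = ReadDataFile ("aln.fasta");

/* Hypothetical partitions: sites 0-299 and sites 300-599, read one nucleotide at a time. */
DataSetFilter firstRegion  = CreateFilter (myData, 1, "0-299");
DataSetFilter secondRegion = CreateFilter (myData, 1, "300-599");

/* Codon filter over the whole alignment (unit of 3), with stop codons excluded. */
DataSetFilter codonFilter = CreateFilter (myData, 3, "", "", "TAA,TAG,TGA");
```

Each filter can then be given its own frequencies, model, and tree, which is how different regions end up being analyzed under different evolutionary models.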

F81RateMatrix = etc.

  • Here, a nucleotide rate matrix is defined with a single rate parameter, mu. Each * on the diagonal tells HyPhy to fill in that entry so that the row sums to zero. The value of mu is estimated later, when the likelihood function is optimized. More models are included in the HyPhy package in the directory “trunk/SubstitutionModels.”
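
To illustrate how another model can be written in the same notation, here is a minimal sketch of an HKY85-style matrix, assuming the nucleotide order A, C, G, T. The names HKY85RateMatrix, HKY85, and kappa are chosen for this example and are not part of the script above. kappa is declared global so that a single transition/transversion ratio is shared by all branches, while mu remains a branch-specific rate.

```hyphy
/* Shared transition/transversion rate ratio, initialized to 1. */
global kappa = 1;

/* Transitions (A<->G and C<->T) are scaled by kappa; transversions are not. */
HKY85RateMatrix =
    {{*       ,mu      ,kappa*mu,mu      }
     {mu      ,*       ,mu      ,kappa*mu}
     {kappa*mu,mu      ,*       ,mu      }
     {mu      ,kappa*mu,mu      ,*       }};

/* Assumes obsFreqs has already been filled in by HarvestFrequencies. */
Model HKY85 = (HKY85RateMatrix, obsFreqs);
```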

HarvestFrequencies (obsFreqs, myFilter, 1, 1, 1);

  • This function collects the data set's nucleotide frequencies into a vector, which must be supplied to the model. It takes five arguments: HarvestFrequencies (Receptacle, FilterId, Atom, Unit, Position dependent flag);
  • Receptacle is the name of the variable that will store the resulting vector.
  • FilterId refers to the data set filter to be analyzed.
  • Atom is the unit of data. For codon data, this is 3, and for nucleotide data it is 1.
  • Unit ......
  • Position dependent flag .......
  • Important: this example uses a nucleotide substitution model. If you are using a codon model, you will also need BuildCodonFrequencies, which converts the nucleotide frequencies into codon frequencies. As far as we are aware, BuildCodonFrequencies is a helper function defined in HyPhy's template and example batch files rather than a built-in command, so its definition has to be available to your script.
    • It may be used as follows: codonFreq = BuildCodonFrequencies(obsFreqs);
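
A small sketch of harvesting frequencies for codon data follows. It assumes a codon filter named codonFilter (a hypothetical name) built with a unit of 3. The HyPhy template files commonly call HarvestFrequencies with the arguments 3, 1, 1 for codon data, which should yield a 4x3 matrix of nucleotide frequencies by codon position; treat that shape as an assumption and print the result to confirm it for your HyPhy version.

```hyphy
/* Assumes myData has already been read with ReadDataFile. */
DataSetFilter codonFilter = CreateFilter (myData, 3, "", "", "TAA,TAG,TGA");

/* Position-specific nucleotide frequencies for the codon filter. */
HarvestFrequencies (posFreqs, codonFilter, 3, 1, 1);

/* Print the harvested matrix to inspect its dimensions and values. */
fprintf (stdout, posFreqs, "\n");
```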

Tree myTree = ((a,b),c,d);

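  • Here, a tree for the data is written directly as a Newick string. The taxon names in the tree must match the sequence names in the alignment. HyPhy attaches the currently active model to the branches at the moment a tree is declared, so in a working script the Tree statement must come after the Model definition.
  • Alternatively, the phylogeny may be included at the bottom of the multiple alignment file. In that case, the command would instead be Tree myTree = DATAFILE_TREE;
    • DATAFILE_TREE refers to the tree found in the most recently read data file.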
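
The DATAFILE_TREE option could be used along the following lines. This is only a sketch: the file name aln_with_tree.nex is hypothetical, and the check relies on the HyPhy variable IS_TREE_PRESENT_IN_DATA, which is set when a data file is read.

```hyphy
DataSet myData = ReadDataFile ("aln_with_tree.nex");

/* A model is assumed to have been defined before the tree is declared. */
if (IS_TREE_PRESENT_IN_DATA)
{
    Tree myTree = DATAFILE_TREE;
}
else
{
    fprintf (stdout, "No tree was found in the data file.\n");
}
```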

Model F81 = (F81RateMatrix, obsFreqs);

  • Defines the model to be used and stores it in F81. Its arguments are the rate matrix and the nucleotide frequency vector (for a codon model, pass codonFreq instead).

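UseModel (F81);

  • Explicitly selects F81 as the model to be used in the analysis. The Model statement already makes a newly defined model the active one, so this call matters mainly when several models have been defined.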


Finally, you can define and maximize the likelihood function and then print its output.

LikelihoodFunction theLikFun = (myFilter, myTree);
Optimize (MLEs, theLikFun);
fprintf (stdout, theLikFun);

  • MLEs is the variable that receives the results of the optimization, that is, the maximum likelihood estimates of the model parameters.
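
Individual results can also be read out of the matrix that Optimize fills in. The sketch below assumes the layout described in the HyPhy documentation: row 0 holds the parameter estimates, MLEs[1][0] the maximized log-likelihood, and MLEs[1][1] the number of independently optimized parameters.

```hyphy
LikelihoodFunction theLikFun = (myFilter, myTree);
Optimize (MLEs, theLikFun);

/* Assumed layout of the result matrix: see the paragraph above. */
fprintf (stdout, "Log-likelihood: ", MLEs[1][0], "\n");
fprintf (stdout, "Number of estimated parameters: ", MLEs[1][1], "\n");
```

Reading the log-likelihood this way is convenient when the fits of two models are to be compared within one script.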

Good Resources

The HyPhy website may be found here: HyPhy

An excellent overview of running more advanced HyPhy scripts, with many examples, may be found in chapter six of the book Statistical Methods in Molecular Evolution, edited by Rasmus Nielsen, which may be accessed via Google Books here: Book