Wilke:Using HyPhy: Difference between revisions

From OpenWetWare
Jump to navigationJump to search
Line 131: Line 131:
*codonFreqs is used rather than obsFreqs since we are concerned with codon, not nucleotide, frequencies under the current model.
*codonFreqs is used rather than obsFreqs since we are concerned with codon, not nucleotide, frequencies under the current model.


*The parameters that this script will print values for are k, t, and w - you now have some useful information about the selective regime on your protein sequence!
*The parameters that this script will determine values for are k, t, and w - you now have some useful information about the selective regime on your protein sequence!


== Good Resources ==
== Good Resources ==

Revision as of 09:56, 8 February 2012

Notice: The Wilke Lab page has moved to http://wilkelab.org.
The page you are looking at is kept for archival purposes and will not be further updated.
THE WILKE LAB

Home        Contact        People        Research        Publications        Materials

The Basics

Each HyPhy analysis must include several essential components:

  • Data Set
    • This is a multiple sequence alignment file which may be in one of several formats, including fasta, phylip, or nexus.
  • Data Filter
    • This selects which parts of the data sets should be used in analysis. In the simplest case, the entire set will be processed as a single unit. In a more complex scenario, however, you may have a data set which includes both introns and exons, which you would want to analyze under different evolutionary models. This may be specified using a data filter, which thus "partitions" your data set.
  • Evolutionary Model
    • You will need to provide HyPhy with a rate matrix describing your substitution model of choice in order to process the data.
  • Phylogeny
    • This should be in newick format.
  • Likelihood Function

HyPhy Batch File

Here is a basic HyPhy script.

DataSet myData = ReadDataFile ("aln.fasta");
DataSetFilter myFilter = CreateFilter (myData,1,"", "", "" );
F81RateMatrix =
           {{* ,mu,mu,mu}
                 {mu,* ,mu,mu}
                 {mu,mu,* ,mu}
                 {mu,mu,mu,* }};
HarvestFrequencies (obsFreqs, myFilter, 1, 1, 1);
Tree myTree = ((a,b),c,d);
Model F81 = (F81RateMatrix, obsFreqs);
UseModel(F81);
LikelihoodFunction theLikFun = (myFilter, myTree);
Optimize (MLEs, theLikFun);
fprintf  (stdout, theLikFun);

Now, let's go line by line through the script above.

DataSet myData = ReadDataFile ("aln.fasta");

  • Stores your multiple alignment file in the variable myData. Note that the path to the file must be specified if it is not found in the working directory.
  • A phylogeny may be (optionally) included at the bottom of the data file to be included in later analysis. More on this later...

DataSetFilter myFilter = CreateFilter (myData,1,"", "", "" );

  • Stores a data filter in the variable myFilter. From this point on, you should refer to your data as its filter rather than its unprocessed form inputted previously. The function "Create Filter" takes five arguments, the last three of which are optional: CreateFilter (DataSetId, Unit, Vertical Partition, Horizontal Partition, Exclusions);
  • DataSetId is the variable name for the previously imported data set, in this case called myData.
  • Unit defines how many characters should be treated as a single object. For codon data, this value would be 3 since every three characters are analyzed together. For nucleotide data, this value is 1.
  • Vertical Partition specifies which sites should be analyzed. In this case, the entire data set is analyzed together so no partition is specified
  • Horizontal Partition ......
  • Alphabet Exclusions is a comma-separated list of characters to be ignored during analysis.

F81RateMatrix = etc.

  • Here, a nucleotide rate matrix is defined with the parameter “mu.” The value for mu will be returned by the likelihood function later. More models are included in the HyPhy package in the directory “trunk/SubstitutionModels.”

HarvestFrequencies (obsFreqs, myFilter, 1, 1, 1);

  • This function collects the dataset's nucleotide frequencies into a vector, which is necessary to provide to the model. It takes five arguments:HarvestFrequencies (Receptacle, FilterId, Atom, Unit, Position dependent flag);
  • Receptacle is the name of the variable which will store the outputted vector.
  • FilterId refers to the data set filter to be analyzed.
  • Atom is the unit of data. For codon data, this is 3, and for nucleotide data it is 1.
  • Unit ......
  • Position dependent flag .......

Tree myTree = ((a,b),c,d);

  • Here, a tree for the data is directly written as a string. One may alternatively include the phylogeny at the bottom of the multiple alignment file.
  • If this is done, the command called would instead be Tree myTree = DATAFILE_TREE;
    • DATAFILE_TREE refers to the tree found in the most recently inputted data file.

Model F81 = (F81RateMatrix, obsFreqs);

  • Defines the model to be used. The command Model will store the evolutionary model in F81. Its arguments are the rate matrix and nucleotide frequency vector (or, if using a codon model, input codonFreq instead!)

UseModel(F81)

  • Explicitly calls for F81 model to be used in analysis.


Finally, you can define and maximize the likelihood function and then print its output. LikelihoodFunction theLikFun = (myFilter, myTree); Optimize (MLEs, theLikFun); fprintf (stdout, theLikFun);

  • MLEs refers to the parameter values of the evolutionary model.

Using Codon Models

The previous example provides an example of running an analysis using a nucleotide model of substitution. When analyzing protein coding data, however, it is often more useful and informative to use codon models of substitution. Such models also use nucleotide data, but consider them in terms of how the amino acid data they provide factors in to their composition. Matrices for codon substitution models, therefore, describe how each tri-nucleotide codon might evolve into a different codon as a unit rather than simply considering simple nucleotide substitutions.

Here is an example script which calculates the value of "omega," or dN/dS, which provides information about the direction and strength of selection on a coding sequence. It uses the Goldman-Yang codon model GY94, which has four parameters: omega, kappa (ratio of the rate of transversions to the rate of transitions), time, and "pi" (equilibrium codon frequencies). The original paper describing this model can be found here. Values for the first three parameters (which will be represented as w, k, and t, respectively) are deduced by optimizing the likelihood function, where as the codon frequencies are directly estimated from the provided data.


#include "ratematrixfile.txt"
#include "functions.txt"

global k;
global t;
global w;

DataSet myData = ReadDataFile ("aln.fasta");
DataSetFilter myFilter = CreateFilter (myData,3,"", "", " TAA,TAG,TGA" );
HarvestFrequencies (obsFreqs, myFilter, 3, 1, 1);
codonFreqs=BuildCodonFrequencies(obsFreqs);
Tree myTree = ((a,b),c,d);
Model GY94 = (GY94RateMatrix,codonFreqs);
UseModel(GY94);
LikelihoodFunction theLikFun = (myFilter, myTree);
Optimize (MLEs, theLikFun);
fprintf  (stdout, theLikFun);

Now, let's go through the aspects of this script which differ from the one discussed in the previous section.

#include "ratematrixfile.txt" and #include "functions.txt"

  • The function #include tells HyPhy to read in certain files and be considered as part of this script. Note that the paths to these files must be specified if they are not found in the working directory.
  • First, a file which contains your rate matrix. In the example presented above, the rate matrix was written into the script since it is very small (only 4x4). A codon substitution model's matrix, however, is much bigger - the GY94 matrix is 61x61 as there are 64 codons in the universal genetic code, but the three stop codons are excluded from the matrix.
    • A text file containing the GY94 rate matrix can be downloaded here.
  • Secondly, you will want to include functions required to analyze codon data in HyPhy. These include a matrix that defines the universal genetic code, and the function "BuildCodonFrequencies." This function uses the genetic code matrix to find the codon equilibrium frequencies from your data's nucleotide frequencies.
    • A text file containing the matrix and that function can be downloaded here.

global k; global t; global w;

  • These lines are optional and depend on how you want your data to be analyzed. These variable are found in the provided rate matrix as the parameters kappa, time, and omega, respectively. If you specify that they are "global," then the likelihood analysis will fit one kappa value, one time variable, and one omega value to your entire data set. If, however, you prefer that lineage/branch-specific k,t, and w values are calculated, these lines may be excluded from your script.

DataSetFilter myFilter = CreateFilter (myData,3,"", "", " TAA,TAG,TGA" );

  • Here, the second parameter specified should be "3" since each evolving unit should have 3 nucleotides in it (3 nucs/codon).
  • The last parameter is "TAA, TGA, TAG" since those are the stop codons. As stop codons are not included in the GY94 matrix, we want to make sure there are none in the data filter.

HarvestFrequencies (obsFreqs, myFilter, 3, 1, 1);

  • Again, the third parameter reflects that the unit of the data is 3.

codonFreqs=BuildCodonFrequencies(obsFreqs);

  • This command uses the vector of nucleotide frequencies determined with the "HarvestFrequencies" function to determine the equilibrium codon frequencies.

Model GY94 = (GY94RateMatrix,codonFreqs);

  • The name "Gy94RateMatrix" should correspond to however you have named the matrix in the file you included that contains the matrix.
  • codonFreqs is used rather than obsFreqs since we are concerned with codon, not nucleotide, frequencies under the current model.
  • The parameters that this script will determine values for are k, t, and w - you now have some useful information about the selective regime on your protein sequence!

Good Resources

The HyPhy website may be found here: HyPhy They have a very extensive user forum which we highly recommending looking at.

An excellent overview of running more advanced (tons of great examples!) HyPhy scripts may be found in chapter six of the book Statistical Methods in Molecular Evolution edited by Rasmus Nielsen, which may be accessed via GoogleBooks here: Book