Wilke:Using HyPhy: Difference between revisions

From OpenWetWare
Jump to navigationJump to search
Line 2: Line 2:


== The Basics==
== The Basics==
Here is a basic HyPhy script which will perform a maximum likelihood analysis given a multiple alignment file and a phylogeny. Below, the script is discussed in more depth.
Each HyPhy analysis must include several essential components:
<pre>
DataSet myData = ReadDataFile ("aln.fasta");
DataSetFilter myFilter = CreateFilter (myData,1);
F81RateMatrix =
                {{* ,mu,mu,mu}
                {mu,* ,mu,mu}
                {mu,mu,* ,mu}
                {mu,mu,mu,* }};
HarvestFrequencies (obsFreqs, myFilter, 1, 1, 1);
Tree myTree = ((a,b),c,d);
Model F81 = (F81RateMatrix, obsFreqs);
LikelihoodFunction theLikFun = (myFilter, myTree);
Optimize (MLEs, theLikFun);
fprintf  (stdout, theLikFun);
</pre>
 
This script illustrates the essential components that each HyPhy analysis requires:


* Data Set
* Data Set

Revision as of 15:34, 9 January 2012

Notice: The Wilke Lab page has moved to http://wilkelab.org.
The page you are looking at is kept for archival purposes and will not be further updated.
THE WILKE LAB

Home        Contact        People        Research        Publications        Materials

The Basics

Each HyPhy analysis must include several essential components:

  • Data Set
    • This is a multiple sequence alignment file which may be in one of several formats, including fasta, phylip, or nexus.
  • Data Filter
    • This selects which parts of the data sets should be used in analysis. In the simplest case, the entire set will be processed as a single unit. In a more complex scenario, however, you may have a data set which includes both introns and exons, which you would want to analyze under different evolutionary models. This may be specified using a data filter, which thus "partitions" your data set.
  • Evolutionary Model
    • You will need to provide HyPhy with a rate matrix describing your substitution model of choice in order to process the data.
  • Phylogeny
    • This should be in newick format.
  • Likelihood Function

Functions

The following is a HYPHY translation of a Python function written by Tong Zhou to calculate the proportions of synonymous and nonsynonymous mutations, or rho. Takes as parameters a 61 x 1 frequency vector, a 61 x 61 rate matrix, and the number of species in the alignment. Returns the proportion of nonsynonymous mutations, the proportion of synonymous mutations, and the sum of all the mutations divided by the number of species, which is equal to the t parameter. Function must be placed after optimization in the script. To calculate dN or dS, the function must be called twice: once where the selection on the protein has been estimated and once where selection on the protein has been fixed.

function calcRhoRate(frequency_vector, rate_matrix, species)
{
	Codon_Index = {
					{14,13,14,13, 7, 7, 7, 7,19, 5,19, 5, 2, 2, 3, 2,
					 12,11,12,11, 6, 6, 6, 6,19,19,19,19, 1, 1, 1, 1,
					 16,15,16,15, 8, 8, 8, 8,20,20,20,20, 4, 4, 4, 4,
					     9,    9, 5, 5, 5, 5,   17,18,17, 1, 0, 1, 0}
				};
					
	Codon_Index_ENG = {

		{"AAA","AAC","AAG","AAT","ACA","ACC","ACG","ACT","AGA","AGC","AGG","AGT","ATA","ATC","ATG","ATT",
		 "CAA","CAC","CAG","CAT","CCA","CCC","CCG","CCT","CGA","CGC","CGG","CGT","CTA","CTC","CTG","CTT",
		 "GAA","GAC","GAG","GAT","GCA","GCC","GCG","GCT","GGA","GGC","GGG","GGT","GTA","GTC","GTG","GTT",
		       "TAC",      "TAT","TCA","TCC","TCG","TCT",      "TGC","TGG","TGT","TTA","TTC","TTG","TTT"}

				   };				
					
					
	n = 61;
	sum_diag = 0.0;
	sum_all = 0.0;
	sum_syn = 0.0;

	for (i=0; i<n; i=i+1)
	{
		for (j=0; j<n; j=j+1)
		{
			if (i == j)
			{
				sum_diag = sum_diag + (frequency_vector[i] * rate_matrix[i][j] * frequency_vector[j]);
				continue;
			}
			else
			{
					codon_i = Codon_Index[i];
					codon_j = Codon_Index[j];
					sum_all =  sum_all + (frequency_vector[i] * rate_matrix[i][j] * frequency_vector[j]);
					if (codon_i == codon_j)
					{	
						sum_syn = sum_syn + (frequency_vector[i] * rate_matrix[i][j] * frequency_vector[j]);
					}
			}
		
		}
	}

	if (sum_all != 0)
	{
		rho_ns = (sum_all - sum_syn) / sum_all;
		rho_syn = sum_syn / sum_all;
	}
	else
	{
		rho_ns = 0;
		rho_syn = 0;
	}

        t = sum_all/species;

	return {{ rho_ns, rho_syn, t }};
	
}

Scripts