Dahlquist:Modified ANOVA and p value Corrections for Microarray Data: Difference between revisions

Revision as of 16:36, 18 October 2011


To identify significant changes in gene expression, p-values were calculated from an F statistic.
The first step in calculating this F statistic is to calculate the sum of squares under the null hypothesis, or SSH.
The null hypothesis is that no genes experience any significant change in expression, so the population mean (μ) of the log fold change is 0.
To calculate the SSH, each gene's log fold change was squared and summed over every flask i (i = 1, 2, 3, 4, 5) and every time point j (j = t15, t30, t60, t90, t120).

SSH=Σ_iΣ_j(Y_ij)²
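As an illustration only, the SSH for a single gene can be sketched in Python. The layout of the data (a dictionary mapping each time point to the list of log fold changes from its flasks) and the name `sum_squares_null` are assumptions made for this example, not part of the original analysis, which was done in Excel.

```python
def sum_squares_null(log_fold_changes):
    """SSH: sum of squared log fold changes over all flasks and time points.

    log_fold_changes is an assumed layout: {time_point: [Y_1j, Y_2j, ...]}.
    """
    return sum(y * y
               for replicates in log_fold_changes.values()
               for y in replicates)


# Toy data for one gene (made up for illustration, two time points only).
example = {"t15": [0.5, -0.5], "t30": [1.0, 2.0]}
print(sum_squares_null(example))  # 0.25 + 0.25 + 1.0 + 4.0 = 5.5
```

Using a dictionary of lists lets each time point have a different number of flasks.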

The second step is to calculate the sum of squares under the alternative hypothesis, or SSF. It differs from the SSH in that the alternative hypothesis states that there is at least one significantly changed gene.
This is represented by subtracting the population mean from the log fold change before squaring it.
The population mean for each time point (μ_j) was calculated by averaging the gene's log fold changes at that time point, and this value was subtracted from each of the gene's log fold changes at the same time point.
The differences were then squared and summed over every flask i and time point j.

SSF=Σ_iΣ_j(Y_ij-μ_j)²
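A minimal Python sketch of the SSF for one gene follows. It assumes the same hypothetical data layout as a dictionary from time point to the list of log fold changes from each flask, and the function name `sum_squares_alt` is invented for this example.

```python
def sum_squares_alt(log_fold_changes):
    """SSF: squared deviations from each time point's mean, summed over all data.

    log_fold_changes is an assumed layout: {time_point: [Y_1j, Y_2j, ...]}.
    """
    total = 0.0
    for replicates in log_fold_changes.values():
        mean_j = sum(replicates) / len(replicates)  # estimate of mu_j
        total += sum((y - mean_j) ** 2 for y in replicates)
    return total


# Toy data for one gene (made up for illustration, two time points only).
example = {"t15": [0.5, -0.5], "t30": [1.0, 2.0]}
print(sum_squares_alt(example))  # (0.25 + 0.25) + (0.25 + 0.25) = 1.0
```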

Finally, the F statistic is calculated by subtracting the SSF from the SSH and dividing by the SSF, then multiplying this value by (N-F)/F, where N is the number of trials and F is the number of flasks.
This statistic follows an F-distribution with degrees of freedom F and N-F.

[(SSH-SSF)/SSF * (N-F)/F] ~ F(F,N-F)
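The formula above can be written as a small Python function. This is a sketch under the document's own symbols (SSH, SSF, N, F); the function name `f_statistic` and the toy numbers are assumptions for illustration.

```python
def f_statistic(ssh, ssf, n, f):
    """F statistic: [(SSH - SSF) / SSF] * [(N - F) / F].

    n and f are the document's N and F; ssf must be non-zero.
    """
    return ((ssh - ssf) / ssf) * ((n - f) / f)


# Toy values (made up): SSH = 5.5, SSF = 1.0, N = 4, F = 2.
print(f_statistic(5.5, 1.0, 4, 2))  # (4.5 / 1.0) * (2 / 2) = 4.5
```

A gene whose replicates agree exactly at every time point would give SSF = 0 and a division by zero, so real data may need a guard for that case.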

The F statistic can then be converted to a p-value using the FDIST() function in Excel with degrees of freedom F and N-F.
Note that F is 5 for every deletion strain and for the wild type, while N-F varies because the wild type has 23 repetitions rather than the 20 of the deletion strains.
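For readers working outside Excel, the right-tail probability that FDIST() returns can be approximated in plain Python. The sketch below is not how Excel computes it: it numerically integrates the incomplete beta function with Simpson's rule, purely to stay free of dependencies, and it assumes both degrees of freedom are at least 2. The name `f_dist_right_tail` is hypothetical; where SciPy is available, `scipy.stats.f.sf` gives the same quantity.

```python
import math


def f_dist_right_tail(x, d1, d2, steps=2000):
    """Approximate P(F(d1, d2) >= x), the right-tail p-value.

    Uses P = I_u(d2/2, d1/2) with u = d2 / (d2 + d1*x), where I is the
    regularized incomplete beta function, integrated by Simpson's rule.
    steps must be even.
    """
    if d1 < 2 or d2 < 2:
        raise ValueError("this sketch assumes d1 >= 2 and d2 >= 2")
    if x <= 0:
        return 1.0
    a, b = d2 / 2.0, d1 / 2.0
    upper = d2 / (d2 + d1 * x)  # upper integration limit, strictly below 1
    h = upper / steps

    def integrand(t):
        return t ** (a - 1) * (1 - t) ** (b - 1)

    total = integrand(0.0) + integrand(upper)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * integrand(k * h)
    integral = total * h / 3.0
    # Normalize by the complete beta function B(a, b).
    log_beta = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    return integral / math.exp(log_beta)


# Toy call (made-up statistic): degrees of freedom 5 and 15, as for N = 20, F = 5.
print(f_dist_right_tail(4.5, 5, 15))
```

The result should be checked against FDIST() for a few genes before relying on it.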

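Because many hypotheses are tested at once, the p-values must be corrected for false positives before the results can be considered reliable.
One way of correcting these p-values is the Bonferroni correction.
It is accomplished by multiplying each p-value by the total number of hypotheses (n) and comparing the result with the significance level α, which is equivalent to requiring: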

P≤α/n

One drawback of this correction is that only the most extreme p-values remain significant, so some potentially significant genes are ignored.
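The Bonferroni correction can be sketched in a few lines of Python. The function name `bonferroni_adjust` and the example p-values are made up, and capping the corrected values at 1 is a common convention assumed here rather than something stated in the text.

```python
def bonferroni_adjust(p_values):
    """Multiply each p-value by the number of hypotheses n, capped at 1."""
    n = len(p_values)
    return [min(1.0, p * n) for p in p_values]


# Toy p-values (made up). With alpha = 0.05, only the first stays significant,
# since 0.01 * 3 is about 0.03 while 0.04 * 3 is about 0.12.
corrected = bonferroni_adjust([0.01, 0.04, 0.5])
print([p <= 0.05 for p in corrected])  # [True, False, False]
```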

A more robust method of correcting the p-values is the Benjamini & Hochberg correction, or B&H.
Once the p-values are calculated, they are sorted from least to greatest and ranked with an index (i) from 1 to n, with 1 being the lowest p-value and n the highest.
Each p-value is then multiplied by the total number of hypotheses (n) and divided by its rank (i); comparing the result with α is equivalent to requiring:

P≤i*α/n
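A minimal Python sketch of the B&H calculation as described above follows. The name `benjamini_hochberg_adjust` and the `step_up` flag are assumptions for this example. With `step_up=False` it does exactly what the text describes (p × n / rank); with `step_up=True` it also applies the running minimum from the largest rank downward that the standard step-up procedure uses, which the text does not mention.

```python
def benjamini_hochberg_adjust(p_values, step_up=False):
    """Return B&H-corrected p-values in the original order.

    Each p-value is multiplied by n and divided by its rank (1 = smallest),
    then capped at 1. If step_up is True, corrected values are also made
    non-decreasing in rank by a running minimum from the largest rank.
    """
    n = len(p_values)
    order = sorted(range(n), key=lambda k: p_values[k])  # indices by rank
    adjusted = [0.0] * n
    for rank, k in enumerate(order, start=1):
        adjusted[k] = min(1.0, p_values[k] * n / rank)
    if step_up:
        running_min = 1.0
        for k in reversed(order):
            running_min = min(running_min, adjusted[k])
            adjusted[k] = running_min
    return adjusted


# Toy p-values (made up); ranks are 1, 3, 2, 4 in the order given.
print(benjamini_hochberg_adjust([0.01, 0.04, 0.03, 0.5]))
print(benjamini_hochberg_adjust([0.01, 0.04, 0.03, 0.5], step_up=True))
```

Returning the values in the original order keeps each corrected p-value aligned with its gene without a separate re-sorting step.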

@@ Line 22: / Line 22: @@
 *This F-distribution, F(F,N-F), can then be converted to p-values using the FDIST() command in excel with the degrees of freedom F, N-F.
 *Note, the F value is 5 for every deletion strain and the wild type, while N-F will vary because the wildtype has 23 repetitions and not 20 like the deletion strains.
-*While the Benjamini and Hochberg correction is our main focus due to its robustness compared to the Bonferroni, both corrections were calculated to compare the resulting sets of genes with significant changes in expression.
-*The Bonferroni was calculated by multiplying all of the p-values by the total number of hypotheses, in this case the 6189 genes being expressed.
-*The Benjamini and Hochberg was calculated by first sorting all of the p-values from least to greatest, then multiplying each by the total number of hypotheses and dividing by its rank in the sorted list.
 ==Bonferroni==