Dahlquist:Modified ANOVA and p value Corrections for Microarray Data: Difference between revisions

From OpenWetWare
Jump to navigationJump to search
(→‎Statistical Analysis Within Excel: Made modification to how to sort.)
(Renamed and edited the sections regarding the ANOVA.)
Line 2: Line 2:
==Introduction==
==Introduction==


===Modified ANOVA===
The protocol below describes the statistical tests that have been performed to determine the significance of the gene expression changes seen in the microarray data. The Bonferonni and Benjamini & Hochberg corrections were used to correct for false positives after performing the statistical tests.
*To analyze the significant changes in our gene expression, p-values were calculated using an F-distribution.
*The first step to calculating this F-distribution is by calculating the sum of the squares of the null hypothesis, or SSH.  
*The null hypothesis being that no genes experience any significant change in expression, therefore the population mean (μ) is 0.
*To calculate the SSH each genes log fold change was squared and summed over every flask i; i=1, 2, 3, 4, 5; and every time point j; j=t15, t30, t60, t90, and t120.  


==Analyzing Significant Gene Expression Changes Within Each Strain==
For each strain, an ANOVA statistic was calculated to determine what genes have a significant change in expression during at least one time point.
*To analyze significant changes in gene expression, we need to compute an F statistic for each gene, which is the ratio of the MST (the mean square among the time points) to the MSE (the mean square within each time point).
*The MSE is the sum of squares of deviations from the mean (henceforth, sum of squares) within each time point (SSF) divided by the corresponding degrees of freedom (i.e. the number of time points T).
*The SSF is calculated by finding the sum of the squared difference between a gene’s log fold change for flask i and time point j and the mean log fold change for the time point.
SSF=Σ<sub>i</sub>Σ<sub>j</sub>(Y<sub>ij</sub>-μ<sub>j</sub>)<sup>2</sup>
*The MST is the sum of squares among the time points (SSG) divided by the corresponding degrees of freedom (the difference between the total number of time points and replicates N and the number of time points T). The SSG is the sum of squares within each group (SSF) subtracted from the sum of squares for all of the data for a gene (SSH).
*The SSH is calculated by squaring the log fold change and then summing over all of the flasks i and all of the time points j.
  SSH=Σ<sub>i</sub>Σ<sub>j</sub>(Y<sub>ij</sub>)<sup>2</sup>  
  SSH=Σ<sub>i</sub>Σ<sub>j</sub>(Y<sub>ij</sub>)<sup>2</sup>  
*So, the F statistic with degrees of freedom N-T and T is
  [(SSH-SSF)/SSF] * [(N-F)/F]
===Calculate the F Statistic in Excel===
*Create an Excel workbook containing a separate spreadsheet for each of the following labels: "wt", "dCIN5", "dGLN3", "dHMO1", "dZAP1", "Bonferroni", and "B&H".
*For each of the strains, copy the gene names and the normalized data from the output after normalization into the respective spreadsheets.
*In order to continue with the calculations, it is very important that there are no cells with the value "NA". Replace all of the “NA” values in the data with a space by doing the following.
#Select all of the cells.
#Click the “Find & Select” icon in the Excel toolbar.
#Click on “Replace…”
#In the window, type in “NA” in the “Find what:” box.
#In the “Replace with” box, put your cursor in the box and then hit the space key once.
#Press the “Replace All” button in the window.
*Complete the following steps for each of the spreadsheets containing the normalized data:
*After the last column of the normalized data, label the next five columns (one for each time point), with the designation “<strain>_Avg_LogFC_<timepoint> in row 1.
*Calculate the average log fold change for each time point. For the data calculated at the t15 time point for the wild type strain, type the following formula into the cell beneath the column header “wt_Avg_LogFC_t15”:
=AVERAGE(B2:E2)
Press enter. Double click on the lower right hand corner of the cell so that Excel computes the average log fold change for the rest of the genes. 
*Calculate the average log fold change for the remaining time points by using the formula aforementioned substituting the range for the cell corresponding to the first replicate measurement for the time point to the cell corresponding to the last replicate measurement for the time point.
*After the last column of the average log fold changes, label the next few columns (the number of which corresponds to the number of columns of the normalized data) with the header “<strain>_mean_centered_LogFC_<timepoint>-<flask>” in row 1.
*For each time point, subtract the mean fold change for the time point from each fold change measurement for the time point for each gene (henceforth, mean centered log fold changes).  For examples, in row 2 of the column labeled “wt_mean_centered_LogFC-t15-1”, type the following formula and hit enter:
=B2-$Y2.
Double click on the lower right hand corner of the cell to make the computation for each gene. Use the same formula for the remainder of the time points and genes substituting the first term for the cell corresponding to the average log fold change for the time point and the second term for a cell containing the fold change for a replicate of the time point.
*After the last column of the mean centered normalized log fold changes, label the next column with the header "<strain>_SSH" in row 1.
*Calculate the SSH by finding the sum of squares of the log fold changes. For example, in row 2 of the column labeled “wt_SSH” type the following formula
=SUMSQ(B2:X2)
and hit enter. Double click on the lower right hand corner of the cell to perform the calculation for all of the genes.


*The second step is to calculate the sum of squares of the alternate hypothesis, or SSF, the difference from the SSH being that the hypothesis states there is at least one significantly changed gene.  
*Label the next column after the SSH calculates with the header "<strain>_SSF in row 1.
*This is represented by subtracting the population mean from the log fold change before squaring it.  
*Calculate the SSF by find the sum of squares of all of the mean centered log fold changes. For example, in row 2 of the column labeled “wt_SSF” type the following formula
*The population mean can be calculated by averaging the log fold change for each time point, these values were subtracted from each log fold change for each gene at their respective time point.
=SUMSQ(AD2:AZ2)
*The values were then squared, and summed over every flask i, and time point j.  
and press enter. Double click on the lower right hand corner of the cell to perform the calculation for all of the genes.


SSF<sub>i</sub>Σ<sub>j</sub>(Y<sub>ij</sub>-μ<sub>j</sub>)<sup>2</sup>
*In the first cell of the next column after the SSF calculations, enter the header “<strain>_Fstat” in row 1.
*Calculate the F statistic as previously described. For example, in row 2 of the column labeled “wt_Fstat” type the formula
=((23-5)/5)*((BA2-BB2)/BB2)
and press enter. Double click on the lower right hand corner of the cell to perform the calculation for all of the genes.


*Finally, the F-distribution is calculated by subtracting the SSF from the SSH and dividing by the SSH, then by multiplying this value by the number of flasks(F) subtracted from the number of trials(N) divided by the number of flasks.
*In the first cell of the next column after F stat calculations, type the header “<strain>_Pval”. 
*This will give you the F-distribution with degrees of freedom F, N-F.
*To calculate the p value, use the formula
=FDIST(F statistic, T, N-T)
For example, in row 2 of the column labeled “wt_Pval” type the formula
=FDIST(BC2,5,23-5)


  [(SSH-SSF)/SSF * (N-F)/F] ~ F(F,N-F)
*In the first cell of the next column after p value calculations, type the header "Significantly_Expressed_Genes".
*Find which genes are significantly differentially expressed (i.e. the p value for that gene is less than 0.05). For example, for the wild type data, type the following formula into the second cell of the column.
  =IF(BD2<0.05,1,0)
and press enter. Double click on the lower right hand corner of the cell to perform the calculation for all of the genes.
*Find the sum is the total number of genes that have a p value of less than 0.05.
*For the wild type data, scroll down to the bottom of the column and type the following formula in the first blank cell
=SUM(BD2:BD6190)
and press enter.


*This F-distribution, F(F,N-F), can then be converted to p-values using the FDIST() command in excel with the degrees of freedom F, N-F.
*Note, the F value is 5 for every deletion strain and the wild type, while N-F will vary because the wildtype has 23 repetitions and not 20 like the deletion strains.


===Bonferroni p value Correction===
===Bonferroni p value Correction===
Line 43: Line 89:


*For any p values that are above 1, change the values to 1.
*For any p values that are above 1, change the values to 1.
==Statistical Analysis Within Excel==
*Create an Excel workbook with a spreadsheet for each of the following labels: "wt", "dCIN5", "dGLN3", "dHMO1", "dZAP1", "Bonferroni", and "B&H".
*Copy the list of gene IDS from the final output of normalized data from R into the first column (labelled "Gene ID") of each of the first five spreadsheets of the workbook.
*For each strain, copy all of the normalized log fold changes from the same R output into the columns after the first of the spreadsheet within the workbook corresponding to the strain. If not already labelled appropriately in the R output, label the columns containing the normalized log fold changes in the format "[strain]_LogFC_[timepoint]-[flask number]". For example, column B in the first spreadsheet should be labelled "wt_LogFC_t15-1".
*Select the entire first column with the Gene ID's. Sort by ascending order. You can click the menu option Data < Sort... In the new window that opens, keep the option "Expand the selection" and click "Sort...". Keep the "Sort by..." option as "Ascending" and click "OK." Alternately, you can click the sort button from A to Z on the toolbar, keep the option "Expand the selection" and click "OK".
*In order to continue with the calculations, tt is very important that you make sure there are no cells with the term "NA". Every cell with an "NA" should have been replaced with a space as described in the protocol [[Dahlquist:Microarray Data Processing in R]].
===Calculate the Modified ANOVA===
*In the first spreadsheet after the last column of normalized log fold changes, label the next five columns by typing "<strain>_Avg_LogFC_<timepont>" in the first cell of each column. For example, the first of cell of the first of the five columns should be "wt_Avg_LogFC_t15".
*Calculate the average log fold change for each timepoint in the wildtype. To do so, type the following formula into the second cell of the t15 column:
=AVERAGE(B2:E2)
and press enter. Instead of typing the cell designations, click on the beginning cell and then double click on the lower right hand corner. Calculate the average log fold change for the other timepoints using the formula above but changing the columns to those that correspond to the timepoint you are working on.
*After the last column of the average log fold changes, label the next few columns (the number of which corresponds to the number of columns of the normalized log fold changes) by typing "<strain>_mean_centered_LogFC_<timepoint>-<flask number>" into the first cell of each column. For example, the first cell of the first column should be "wt_mean_centered_LogFC_t15-1".
*Stubtract the mean from the log fold changes for each timepoint for each flask. To do so, type the following formula into the second cell of the first column "wt_mean_centered_LogFC_t15-1":
=B2-$Y2
and press enter. Instead of typing the cell designations, click on the beginning cell and then double click on the lower right hand corner. Repeat this procedure for the rest of the timepoints and flaks. You can use the same formula as above but change the columns to the normalized log fold change and average log fold change columns that you are using.
*After the last column of the mean centered normalized log fold changes, label the first cell of the next column as "wt_SSH".
*Calculate the SSH (the sum of the squares of the normalized log fold changes for all timepoints and flasks)of the wildtype data. To do so, type the following formula into the second cell of the column:
=SUMSQ(B2:X2)
and press enter. Instead of typing the cell designations, click on the beginning cell and then double click on the lower right hand corner.
*In first cell of the column after the "SSH column", type "wt_SSF".
*Calculate the SSF (the sum of the squares of the mean centered normalized log fold changes for all timepoints and flasks)of the wildtype data. To do so, type the following formula into the second cell of the column:
=SUMSQ(AD2:AZ2)
and press enter. Instead of typing the cell designations, click on the beginning cell and then double click on the lower right hand corner.
*In the first cell of the next column after the "wt_SSF" column, type "wt_Fstat".
*Calculate the F statistic for the wildtype. To do so, type the following formula into the second cell of the column:
=((23-5)/5)*((BA2-BB2)/BB2)
and press enter. Instead of typing the cell designations, click on the beginning cell and then double click on the lower right hand corner.
*In the first cell of the next column after the "wt_Fstat" column, type "wt_Pval".
*To calculate the p value for the wildtype data, type the following formula into the second cell of the column:
=FDIST(BC2,5,23-5)
*In the first cell of the next column after the "wt_Pval" column, type "Significantly_Expressed_Genes".
*Find which genes are significantly differentially expressed. In other words, the p value for that gene is less than 0.05. In order to do so, type the following formula into the second cell of the column.
=IF(BD2<0.05,1,0)
and press enter. Instead of typing the cell designations, click on the beginning cell and then double click on the lower right hand corner.
*Scroll down to the bottom of the column.
*In cell 6191 of the column, type the following formula:
=SUM(BD2:BD6190)
and press enter. This sum is the total number of genes that have a p value of less than 0.05.
For the remainder of the strains in the other spreadsheets,
*Change the column names so that they reflect the strain which the spreadsheet corresponds to.
*Calculate the average log fold change for each time point, the mean centered log fold change for each timepoint and flask, the SSH, and the SSF as described for the wildtype but remembering to alter the columns in the formula to match the columns for which you are doing the calculations.
*Calculate the F statistic using the formula outlined for the wildtype with some alterations. Make sure to use the correct the total number of flasks for the strain. While it was 23 for the wildtype, it is 20 for the deletion strains. Also, make sure the columns in the formula to match the columns for which you are doing the calculations.
*Calculate the p values and the number of significantly differentially expressed genes using the formula outlined for the wildtype but remembering to change the columns in the formula to match the columns for which you are doing the calculations.


===Calculate the Bonferroni p value Correction===
===Calculate the Bonferroni p value Correction===

Revision as of 01:51, 23 August 2013

Home        Research        Protocols        Notebook        People        Publications        Courses        Contact       


Introduction

The protocol below describes the statistical tests that have been performed to determine the significance of the gene expression changes seen in the microarray data. The Bonferonni and Benjamini & Hochberg corrections were used to correct for false positives after performing the statistical tests.

Analyzing Significant Gene Expression Changes Within Each Strain

For each strain, an ANOVA statistic was calculated to determine what genes have a significant change in expression during at least one time point.

  • To analyze significant changes in gene expression, we need to compute an F statistic for each gene, which is the ratio of the MST (the mean square among the time points) to the MSE (the mean square within each time point).
  • The MSE is the sum of squares of deviations from the mean (henceforth, sum of squares) within each time point (SSF) divided by the corresponding degrees of freedom (i.e. the number of time points T).
  • The SSF is calculated by finding the sum of the squared difference between a gene’s log fold change for flask i and time point j and the mean log fold change for the time point.
SSF=ΣiΣj(Yijj)2 
  • The MST is the sum of squares among the time points (SSG) divided by the corresponding degrees of freedom (the difference between the total number of time points and replicates N and the number of time points T). The SSG is the sum of squares within each group (SSF) subtracted from the sum of squares for all of the data for a gene (SSH).
  • The SSH is calculated by squaring the log fold change and then summing over all of the flasks i and all of the time points j.
SSH=ΣiΣj(Yij)2 
  • So, the F statistic with degrees of freedom N-T and T is
 [(SSH-SSF)/SSF] * [(N-F)/F]

Calculate the F Statistic in Excel

  • Create an Excel workbook containing a separate spreadsheet for each of the following labels: "wt", "dCIN5", "dGLN3", "dHMO1", "dZAP1", "Bonferroni", and "B&H".
  • For each of the strains, copy the gene names and the normalized data from the output after normalization into the respective spreadsheets.
  • In order to continue with the calculations, it is very important that there are no cells with the value "NA". Replace all of the “NA” values in the data with a space by doing the following.
  1. Select all of the cells.
  2. Click the “Find & Select” icon in the Excel toolbar.
  3. Click on “Replace…”
  4. In the window, type in “NA” in the “Find what:” box.
  5. In the “Replace with” box, put your cursor in the box and then hit the space key once.
  6. Press the “Replace All” button in the window.
  • Complete the following steps for each of the spreadsheets containing the normalized data:
  • After the last column of the normalized data, label the next five columns (one for each time point), with the designation “<strain>_Avg_LogFC_<timepoint> in row 1.
  • Calculate the average log fold change for each time point. For the data calculated at the t15 time point for the wild type strain, type the following formula into the cell beneath the column header “wt_Avg_LogFC_t15”:
=AVERAGE(B2:E2)

Press enter. Double click on the lower right hand corner of the cell so that Excel computes the average log fold change for the rest of the genes.

  • Calculate the average log fold change for the remaining time points by using the formula aforementioned substituting the range for the cell corresponding to the first replicate measurement for the time point to the cell corresponding to the last replicate measurement for the time point.
  • After the last column of the average log fold changes, label the next few columns (the number of which corresponds to the number of columns of the normalized data) with the header “<strain>_mean_centered_LogFC_<timepoint>-<flask>” in row 1.
  • For each time point, subtract the mean fold change for the time point from each fold change measurement for the time point for each gene (henceforth, mean centered log fold changes). For examples, in row 2 of the column labeled “wt_mean_centered_LogFC-t15-1”, type the following formula and hit enter:
=B2-$Y2. 

Double click on the lower right hand corner of the cell to make the computation for each gene. Use the same formula for the remainder of the time points and genes substituting the first term for the cell corresponding to the average log fold change for the time point and the second term for a cell containing the fold change for a replicate of the time point.

  • After the last column of the mean centered normalized log fold changes, label the next column with the header "<strain>_SSH" in row 1.
  • Calculate the SSH by finding the sum of squares of the log fold changes. For example, in row 2 of the column labeled “wt_SSH” type the following formula
=SUMSQ(B2:X2)

and hit enter. Double click on the lower right hand corner of the cell to perform the calculation for all of the genes.

  • Label the next column after the SSH calculates with the header "<strain>_SSF in row 1.
  • Calculate the SSF by find the sum of squares of all of the mean centered log fold changes. For example, in row 2 of the column labeled “wt_SSF” type the following formula
=SUMSQ(AD2:AZ2)

and press enter. Double click on the lower right hand corner of the cell to perform the calculation for all of the genes.

  • In the first cell of the next column after the SSF calculations, enter the header “<strain>_Fstat” in row 1.
  • Calculate the F statistic as previously described. For example, in row 2 of the column labeled “wt_Fstat” type the formula
=((23-5)/5)*((BA2-BB2)/BB2)

and press enter. Double click on the lower right hand corner of the cell to perform the calculation for all of the genes.

  • In the first cell of the next column after F stat calculations, type the header “<strain>_Pval”.
  • To calculate the p value, use the formula
=FDIST(F statistic, T, N-T)

For example, in row 2 of the column labeled “wt_Pval” type the formula

=FDIST(BC2,5,23-5)
  • In the first cell of the next column after p value calculations, type the header "Significantly_Expressed_Genes".
  • Find which genes are significantly differentially expressed (i.e. the p value for that gene is less than 0.05). For example, for the wild type data, type the following formula into the second cell of the column.
=IF(BD2<0.05,1,0)

and press enter. Double click on the lower right hand corner of the cell to perform the calculation for all of the genes.

  • Find the sum is the total number of genes that have a p value of less than 0.05.
  • For the wild type data, scroll down to the bottom of the column and type the following formula in the first blank cell
=SUM(BD2:BD6190)

and press enter.


Bonferroni p value Correction

  • False positives must be corrected from the data's p values before it can be considered accurate.
  • One way of correcting these p values is the Bonferroni Correction.
  • Accomplished by multiplying each p value by the total number of hypotheses(n).
P≤α/n
  • One negative aspect of this correction is that the final result will consist of only the most extreme outliers, thus ignoring some potential significant genes.
  • For any p values that are above 1, change the values to 1.

Benjamini & Hochberg p value Correction

  • A more robust method of correcting the data's p values is the Benjamini & Hochberg correction, or B&H.
  • Once the p values are calculated they are sorted from least to greatest and an index(i) from 1 to n is created to rank these values, 1 being the lowest p value and n being the highest.
  • The p values are then multiplied by the total number of hypotheses(n) and divided by their rank(i).
P≤i*α/n
  • For any p values that are above 1, change the values to 1.

Calculate the Bonferroni p value Correction

  • Select the spreadsheet labelled "Bonferroni".
  • Paste the "Gene ID" column from one of the earlier five spreadsheets into the colum A of this spreadsheet.
  • Paste the entire column of p values for the wildtype into the column B of the spreadsheet as values. Paste the p values for the other strains in the same manner into columns C-F.
  • In cell G1 and H1, type "wt_Bonferroni_pval".
  • Calculate the Bonferroni corrected p value for the wildtype data. To do so, type the following formula into cell G2:
=B2*6189

and press enter. Instead of typing the cell designations, click on the beginning cell and then double click on the lower right hand corner.

  • Replace any corrected p value that is greater than 1 by the number 1 by typing the following formula into cell H2:
=IF(G2>1,1,G2)

and press enter. Instead of typing the cell designations, click on the beginning cell and then double click on the lower right hand corner.

  • In cell I2, type "Significantly_Expressed_Genes".
  • Find which genes are significantly differentially expressed by typing the following formula into the cell I2:
=IF(G2<0.05,1,0)

and press enter. Instead of typing the cell designations, click on the beginning cell and then double click on the lower right hand corner.

  • Scroll down to the bottom of the column.
  • In cell I6192, type the following formula:
=SUM(I2:I6190)
  • Find the Bonferroni corrected p values for the rest of the strains following the procedure outlined for the wildtype above. Change the column names so that they reflect the strain on which you are working on. Change the cell designations in the formulas to reflect the cells corresponding to the strain on which you are working on.

Calculate the Benjamini & Hochberg p value Correction

  • Click on the spreadsheet named "B&H".
  • Create an index column by first typing "Index" into cell A1. Then type "1" into cell A2 and "2" into cell A3. Select both cells A2 and A3. Click, hold, and then drag the lower right hand corner until you reach row 6190. Release once you get to row 6190.
  • Copy and paste the column of Gene ID's from one of the previous spreadsheets into columns C, G, K, O, and S. Paste the column of p values from the wildtype spreadsheet into column D. Paste the column of p values from the dCIN5 spreadsheet into column H. Paste the column of p values from the dGLN3 spreadsheet into column L. Paste the column of p values from the dHMO1 spreadsheet into column P. Paste the column of p values from the dZAP1 spreadsheet into column T.
  • Select all of column D. Sort by ascending values. You can click the menu option Data < Sort... In the new window that opens, keep the option "Expand the selection" and click "Sort...". Keep the "Sort by..." option as "Ascending" and click "OK." Alternately, you can click the sort button from A to Z on the toolbar, keep the option "Expand the selection" and click "OK".

Sort columns, H, L, P, and T in the same way.

  • Calculate the Benjamini and Hochberg p value correction for the wildtype. Type "wt_BH_pval" in cell E1. Type the following formula in cell E2:
=(D2*6189)/A2 

and press enter. Instead of typing the cell designations, click on the beginning cell and then double click on the lower right hand corner.

  • Calculate the Benjamini and Hochberg correction for the rest of the strains in column I for dCIN5, in column M for dGLN3, in column Q for dHMO1, and in column U for dZAP1.
  • Type "Significantly_Expressed_Genes" into cell F1, J1, N1, R1, and V1.
  • Find which genes are significantly differentially expressed. In order to do so for the wildtype, type the following formula into cell F2:
=IF(E2<0.05,1,0)

and press enter. Instead of typing the cell designations, click on the beginning cell and then double click on the lower right hand corner.

  • Scroll down to the bottom of the column F.
  • In cell 6191 of the column, type the following formula:
=SUM(F2:F6190)
  • Find which genes are significantly differentially expressed for the rest of the strains using columns J, N, R, and V. You can use the formulas above but make sure to alter the cell number.