Dahlquist:Modified ANOVA and p value Corrections for Microarray Data: Difference between revisions

From OpenWetWare
Jump to navigationJump to search
(→‎Calculate the Modified ANOVA: Continued the procedure.)
(→‎Statistical Analysis Within Excel: Made changes to intially setting up the spreadsheets.)
Line 46: Line 46:
==Statistical Analysis Within Excel==
==Statistical Analysis Within Excel==


*Create an Excel workbook with a spreadsheet for each of the following labels: "wt_logFCs_SSH", "wt_LogFCs_SSF_SSH_Fstat_pval", "dCIN5", "dGLN3", "dHMO1", "dZAP1", "Bonferroni", and "B&H".
*Create an Excel workbook with a spreadsheet for each of the following labels: "wt", "dCIN5", "dGLN3", "dHMO1", "dZAP1", "Bonferroni", and "B&H".
*Copy the list of gene IDS from the final output of normalized data from R into the first column (labelled "Gene ID") of each of the first six spreadsheets of the workbook.
*Copy the list of gene IDS from the final output of normalized data from R into the first column (labelled "Gene ID") of each of the first five spreadsheets of the workbook.
*For each strain, copy all of the normalized log fold changes from the same R output into the columns after the first of the spreadsheet within the workbook corresponding to the strain. If not already labelled appropriately in the R output, label the columns containing the normalized log fold changes in the format "[strain]_LogFC_[timepoint]-[flask number]". For example, column B in the first spreadsheet should be labelled "wt_LogFC_t15-1".  
*For each strain, copy all of the normalized log fold changes from the same R output into the columns after the first of the spreadsheet within the workbook corresponding to the strain. If not already labelled appropriately in the R output, label the columns containing the normalized log fold changes in the format "[strain]_LogFC_[timepoint]-[flask number]". For example, column B in the first spreadsheet should be labelled "wt_LogFC_t15-1".  
*Select the entire first column with the Gene ID's. Click the menu option Data < Sort... In the new window that opens, keep the option "Expand the selection" and click "Sort...". Keep the "Sort by..." option as "Ascending" and click "OK."
*In order to continue with the calculations, tt is very important that you make sure there are no cells with the term "NA". Every cell with an "NA" should have been replaced with a space as described in the protocol


===Calculate the Modified ANOVA===
===Calculate the Modified ANOVA===

Revision as of 16:09, 1 November 2011

Home        Research        Protocols        Notebook        People        Publications        Courses        Contact       


Introduction

Modified ANOVA

  • To analyze the significant changes in our gene expression, p-values were calculated using an F-distribution.
  • The first step to calculating this F-distribution is by calculating the sum of the squares of the null hypothesis, or SSH.
  • The null hypothesis being that no genes experience any significant change in expression, therefore the population mean (μ) is 0.
  • To calculate the SSH each genes log fold change was squared and summed over every flask i; i=1, 2, 3, 4, 5; and every time point j; j=t15, t30, t60, t90, and t120.
SSH=ΣiΣj(Yij)2 
  • The second step is to calculate the sum of squares of the alternate hypothesis, or SSF, the difference from the SSH being that the hypothesis states there is at least one significantly changed gene.
  • This is represented by subtracting the population mean from the log fold change before squaring it.
  • The population mean can be calculated by averaging the log fold change for each time point, these values were subtracted from each log fold change for each gene at their respective time point.
  • The values were then squared, and summed over every flask i, and time point j.
SSF=ΣiΣj(Yijj)2 
  • Finally, the F-distribution is calculated by subtracting the SSF from the SSH and dividing by the SSH, then by multiplying this value by the number of flasks(F) subtracted from the number of trials(N) divided by the number of flasks.
  • This will give you the F-distribution with degrees of freedom F, N-F.
[(SSH-SSF)/SSF * (N-F)/F] ~ F(F,N-F)
  • This F-distribution, F(F,N-F), can then be converted to p-values using the FDIST() command in excel with the degrees of freedom F, N-F.
  • Note, the F value is 5 for every deletion strain and the wild type, while N-F will vary because the wildtype has 23 repetitions and not 20 like the deletion strains.

Bonferroni p value Correction

  • False positives must be corrected from the data's p values before it can be considered accurate.
  • One way of correcting these p values is the Bonferroni Correction.
  • Accomplished by multiplying each p value by the total number of hypotheses(n).
P≤α/n
  • One negative aspect of this correction is that the final result will consist of only the most extreme outliers, thus ignoring some potential significant genes.
  • For any p values that are above 1, change the values to 1.

Benjamini & Hochberg p value Correction

  • A more robust method of correcting the data's p values is the Benjamini & Hochberg correction, or B&H.
  • Once the p values are calculated they are sorted from least to greatest and an index(i) from 1 to n is created to rank these values, 1 being the lowest p value and n being the highest.
  • The p values are then multiplied by the total number of hypotheses(n) and divided by their rank(i).
P≤i*α/n
  • For any p values that are above 1, change the values to 1.

Statistical Analysis Within Excel

  • Create an Excel workbook with a spreadsheet for each of the following labels: "wt", "dCIN5", "dGLN3", "dHMO1", "dZAP1", "Bonferroni", and "B&H".
  • Copy the list of gene IDS from the final output of normalized data from R into the first column (labelled "Gene ID") of each of the first five spreadsheets of the workbook.
  • For each strain, copy all of the normalized log fold changes from the same R output into the columns after the first of the spreadsheet within the workbook corresponding to the strain. If not already labelled appropriately in the R output, label the columns containing the normalized log fold changes in the format "[strain]_LogFC_[timepoint]-[flask number]". For example, column B in the first spreadsheet should be labelled "wt_LogFC_t15-1".
  • Select the entire first column with the Gene ID's. Click the menu option Data < Sort... In the new window that opens, keep the option "Expand the selection" and click "Sort...". Keep the "Sort by..." option as "Ascending" and click "OK."
  • In order to continue with the calculations, tt is very important that you make sure there are no cells with the term "NA". Every cell with an "NA" should have been replaced with a space as described in the protocol

Calculate the Modified ANOVA

  • In the first spreadsheet after the last column of normalized log fold changes, label the next five columns by typing "[strain]_Avg_LogFC_[timepont]" in the first cell of each column. For example, the first of cell of the first of the five columns should be "wt_Avg_LogFC_t15".
  • Calculate the average log fold change for each timepoint in the wildtype. To do so, type the following formula into the second cell of the t15 column:
=AVERAGE(B2:E2)

and press enter. Instead of typing the cell designations, click on the beginning cell and then double click on the lower right hand corner. Calculate the average log fold change for the other timepoints using the formula above but changing the columns to those that correspond to the timepoint you are working on.

  • After the last column of the average log fold changes, label the next few columns (the number of which corresponds to the number of columns of the normalized log fold changes) by typing "[strain]_mean_centered_LogFC_[timepoint]-[flask number]" into the first cell of each column. For example, the first cell of the first column should be "wt_mean_centered_LogFC_t15-1".
  • Stubtract the mean from the log fold changes for each timepoint for each flask. To do so, type the following formula into the second cell of the first column "wt_mean_centered_LogFC_t15-1":
=B2-$Y2

and press enter. Instead of typing the cell designations, click on the beginning cell and then double click on the lower right hand corner. Repeat this procedure for the rest of the timepoints and flaks. You can use the same formula as above but change the columns to the normalized log fold change and average log fold change columns that you are using.

  • After the last column of the mean centered normalized log fold changes, label the first cell of the next column as "wt_SSH".
  • Calculate the SSH (the sum of the squares of the normalized log fold changes for all timepoints and flasks)of the wildtype data. To do so, type the following formula into the second cell of the column:
=SUMSQ(B2:X2)

and press enter. Instead of typing the cell designations, click on the beginning cell and then double click on the lower right hand corner.

  • Select all of the columns with the mean centered normalized log fold changes.
  • Copy these columns and paste them as values into the columns after the first column of the second spreadsheet "wt_LogFCs_SSF_SSH_Fstat_pval". To paste as values, right click on the first cell of the second column of the second spreadsheet, click "Paste Special...", and select "Values".
  • In first cell of the column after the last column of the mean centered normalized log fold changes, type "wt_SSF".
  • Calculate the SSF (the sum of the squares of the mean centered normalized log fold changes for all timepoints and flasks)of the wildtype data. To do so, type the following formula into the second cell of the column:
=SUMSQ(B2:X2)

and press enter. Instead of typing the cell designations, click on the beginning cell and then double click on the lower right hand corner.

  • In the first cell of the next column after the "wt_SSF" column, type "wt_Fstat".
  • Calculate the F statistic for the wildtype. To do so, type the following formula into the second cell of the column:
=((23-5)/5)*((Z2-Y2)/Y2)

and press enter. Instead of typing the cell designations, click on the beginning cell and then double click on the lower right hand corner.

  • In the first cell of the next column after the "wt_Fstat" column, type "wt_Pval".
  • To calculate the p value for the wildtype data, type the following formula into the second cell of the column:
=FDIST(AA2,5,23-5)
  • In the first cell of the next column after the "wt_Pval" column, type "Significantly_Expressed_Genes".
  • Find which genes are significantly differentially expressed. In other words, the p value for that gene is less than 0.05. In order to do so, type the following formula into the second cell of the column.
=IF(AB2<0.05,1,0)

and press enter. Instead of typing the cell designations, click on the beginning cell and then double click on the lower right hand corner.

  • Scroll down to the bottom of the column.
  • In cell 6191 of the column, type the following formula:
=SUM(AC2:AC6190)

and press enter. This sum is the total number of genes that have a p value of less than 0.05.

For the remainder of the strains in the other spreadsheets,

  • Calculate the average log fold change for each time point and the mean centered log fold change for each timepoint and flask as described for the wildtype but remembering to alter the columns in the formula to match the columns for which you are doing the calculations.
  • There is no need to copy and paste the mean centered log fold changes into any part of the spreadsheet or any other spreadsheet.
  • Calculate the SSH and SSF using the formula outlined for the wildtype but remembering to alter the columns in the formula to match the columns for which you are doing the calculations.
  • Calculate the F statistic using the formula outlined for the wildtype with some alterations. Make sure to use the correct the total number of flasks for the strain. While it was 23 for the wildtype, it is 20 for the deletion strains. Also, make sure the columns in the formula to match the columns for which you are doing the calculations.
  • Calculate the p values and the number of significantly differentially expressed genes using the formula outlined for the wildtype but remembering to change the columns in the formula to match the columns for which you are doing the calculations.

Calculate the Bonferroni p value Correction

Calculate the Benjamini & Hochberg p value Correction