Dahlquist:Modified ANOVA and p value Corrections for Microarray Data: Difference between revisions

From OpenWetWare
Jump to navigationJump to search
(Added a section.)
Line 297: Line 297:


==Comparing Significant Changes in Expression Between Two Strains==
==Comparing Significant Changes in Expression Between Two Strains==
A general linear model is used to test the hypothesis that each gene shows the same pattern of gene expression over the time course of the experiment in each strain. The null hypothesis is that the average log fold change of a gene ''g'' for a time point ''j'' in one strain ''s'' is the same as that in another strain ''s<nowiki>'</nowiki>''. The statistical analysis was performed in MATLAB.

Revision as of 23:21, 9 May 2014

Home        Research        Protocols        Notebook        People        Publications        Courses        Contact       


Introduction

The protocol below describes the statistical tests that have been performed to determine the significance of the gene expression changes seen in the microarray data. The Bonferroni and Benjamini & Hochberg corrections were used to correct for false positives after performing the statistical tests.

Analyzing Significant Gene Expression Changes Within Each Strain

For each strain, an ANOVA statistic was calculated to determine what genes have a significant change in expression during at least one time point.

  • To analyze significant changes in gene expression, we need to compute an F statistic for each gene, which is the ratio of the MST (among time points mean square) to the MSE (within time points mean square).
  • The MSE is the within time points sum of squares of deviations from the mean (henceforth, sum of squares (SSF) divided by the corresponding degrees of freedom (i.e. the number of time points T).
  • The SSF is calculated by finding the sum of the squared difference between a gene’s log fold change for time point i and flask j and the mean log fold change for the time point.
SSF=ΣiΣj(Yijj)2 
  • The MST is the among time points sum of squares (SST) divided by the corresponding degrees of freedom (the difference between the total number of time points and replicates N and the number of time points T). The SST is the within time points sum of squares (SSF) subtracted from the sum of squares for all of the data for a gene (SSH).
  • The SSH is calculated by squaring the log fold change and then summing over all of the time points i and flasks j.
SSH=ΣiΣj(Yij)2 
  • So, the F statistic with degrees of freedom N-T and T is
 [(SSH-SSF)/SSF] * [(N-T)/T]
  • Find the p value by using the F distribution with your computed F statistic, significance value, and N-T and T degrees of freedom.

In the case, such as this, when multiple statistical tests are being performed simultaneously, a correction should be applied to the p values in order to mitigate the false positive rate. The Bonferroni method is one such method performed by multiplying each p value by the total number of hypotheses, which in this case corresponds to the number of genes.

p*n≤α

A downside to this correction is that the final will consist of only the most extreme outliers, thus ignoring some potentially significant genes.

A more robust method of correcting the p values is the Benjamini & Hochberg correction (B&H). This method is performed by first sorting all the p values from least to greatest and each gene is given a rank i. Then, the p values are multiplied by the total number of hypotheses n and divided by their rank i.

p*n/i≤α

Statistical Analysis in Excel

The F statistic, Bonferroni correction, and the Benjamini and Hochberg correction can be performed in Excel.

Calculate the F Statistic

  • Create an Excel workbook containing a separate spreadsheet for each of the following labels: "wt", "dCIN5", "dGLN3", "dHMO1", "dZAP1", "Bonferroni", and "B&H".
  • For each of the strains, copy the gene names and the normalized data from the output after normalization into the respective spreadsheets.
  • In order to continue with the calculations, it is very important that there are no cells with the value "NA". Replace all of the “NA” values in the data with a space.
  • Complete the following steps for each of the spreadsheets containing the normalized data:
  • After the last column of the normalized data, label the next five columns (one for each time point), with the designation “<strain>_Avg_LogFC_<timepoint> in row 1.
  • Calculate the average log fold change for each time point. For the data calculated at the t15 time point for the wild type strain, type the following formula into the cell beneath the column header “wt_Avg_LogFC_t15”:
=AVERAGE(B2:E2)

Press enter. Double click on the lower right hand corner of the cell so that Excel computes the average log fold change for the rest of the genes.

  • Calculate the average log fold change for the remaining time points by using the formula aforementioned substituting the range for the cell corresponding to the first replicate measurement for the time point to the cell corresponding to the last replicate measurement for the time point.
  • After the last column of the average log fold changes, label the next few columns (the number of which corresponds to the number of columns of the normalized data) with the header “<strain>_mean_centered_LogFC_<timepoint>-<flask>” in row 1.
  • For each time point, subtract the mean fold change for the time point from each fold change measurement for the time point for each gene (henceforth, mean centered log fold changes). For examples, in row 2 of the column labeled “wt_mean_centered_LogFC-t15-1”, type the following formula and hit enter:
=B2-$Y2. 

Double click on the lower right hand corner of the cell to make the computation for each gene. Use the same formula for the remainder of the time points and genes substituting the first term for the cell corresponding to the average log fold change for the time point and the second term for a cell containing the fold change for a replicate of the time point.

  • After the last column of the mean centered normalized log fold changes, label the next column with the header "<strain>_SSH" in row 1.
  • Calculate the SSH by finding the sum of squares of the log fold changes. For example, in row 2 of the column labeled “wt_SSH” type the following formula
=SUMSQ(B2:X2)

and hit enter. Double click on the lower right hand corner of the cell to perform the calculation for all of the genes.

  • Label the next column after the SSH calculates with the header "<strain>_SSF in row 1.
  • Calculate the SSF by find the sum of squares of all of the mean centered log fold changes. For example, in row 2 of the column labeled “wt_SSF” type the following formula
=SUMSQ(AD2:AZ2)

and press enter. Double click on the lower right hand corner of the cell to perform the calculation for all of the genes.

  • In the first cell of the next column after the SSF calculations, enter the header “<strain>_Fstat” in row 1.
  • Calculate the F statistic as previously described. For example, in row 2 of the column labeled “wt_Fstat” type the formula
=((23-5)/5)*((BA2-BB2)/BB2)

and press enter. Double click on the lower right hand corner of the cell to perform the calculation for all of the genes.

  • In the first cell of the next column after F stat calculations, type the header “<strain>_Pval”.
  • To calculate the p value, use the formula
=FDIST(F statistic, T, N-T)

For example, in row 2 of the column labeled “wt_Pval” type the formula

=FDIST(BC2,5,23-5)
  • In the first cell of the next column after p value calculations, type the header "Significantly_Expressed_Genes".
  • Find which genes are significantly differentially expressed (i.e. the p value for that gene is less than 0.05). For example, for the wild type data, type the following formula into the second cell of the column.
=IF(BD2<0.05,1,0)

and press enter. Double click on the lower right hand corner of the cell to perform the calculation for all of the genes.

  • Find the sum is the total number of genes that have a p value of less than 0.05.
  • For the wild type data, scroll down to the bottom of the column and type the following formula in the first blank cell
=SUM(BD2:BD6190)

and press enter.

Calculate the Bonferroni p value Correction

  • Select the spreadsheet labelled "Bonferroni".
  • Paste the "Gene ID" column from one of the earlier five spreadsheets into the colum A of this spreadsheet.
  • Paste the entire column of p values for the wildtype into the column B of the spreadsheet as values. Paste the p values for the other strains in the same manner into columns C-F.
  • In cell G1 and H1, type "wt_Bonferroni_pval".
  • Calculate the Bonferroni corrected p value for the wildtype data. To do so, type the following formula into cell G2:
=B2*6189

and press enter. Instead of typing the cell designations, click on the beginning cell and then double click on the lower right hand corner.

  • Replace any corrected p value that is greater than 1 by the number 1 by typing the following formula into cell H2:
=IF(G2>1,1,G2)

and press enter. Instead of typing the cell designations, click on the beginning cell and then double click on the lower right hand corner.

  • In cell I2, type "Significantly_Expressed_Genes".
  • Find which genes are significantly differentially expressed by typing the following formula into the cell I2:
=IF(G2<0.05,1,0)

and press enter. Instead of typing the cell designations, click on the beginning cell and then double click on the lower right hand corner.

  • Scroll down to the bottom of the column.
  • In cell I6192, type the following formula:
=SUM(I2:I6190)
  • Find the Bonferroni corrected p values for the rest of the strains following the procedure outlined for the wildtype above. Change the column names so that they reflect the strain on which you are working on. Change the cell designations in the formulas to reflect the cells corresponding to the strain on which you are working on.

Calculate the Benjamini & Hochberg p value Correction

  • Click on the spreadsheet named "B&H".
  • Create an index column by first typing "Index" into cell A1. Then type "1" into cell A2 and "2" into cell A3. Select both cells A2 and A3. Click, hold, and then drag the lower right hand corner until you reach row 6190. Release once you get to row 6190.
  • Copy and paste the column of Gene ID's from one of the previous spreadsheets into columns C, G, K, O, and S. Paste the column of p values from the wildtype spreadsheet into column D. Paste the column of p values from the dCIN5 spreadsheet into column H. Paste the column of p values from the dGLN3 spreadsheet into column L. Paste the column of p values from the dHMO1 spreadsheet into column P. Paste the column of p values from the dZAP1 spreadsheet into column T.
  • Select all of column D. Sort by ascending values. You can click the menu option Data < Sort... In the new window that opens, keep the option "Expand the selection" and click "Sort...". Keep the "Sort by..." option as "Ascending" and click "OK." Alternately, you can click the sort button from A to Z on the toolbar, keep the option "Expand the selection" and click "OK".

Sort columns, H, L, P, and T in the same way.

  • Calculate the Benjamini and Hochberg p value correction for the wildtype. Type "wt_BH_pval" in cell E1. Type the following formula in cell E2:
=(D2*6189)/A2 

and press enter. Instead of typing the cell designations, click on the beginning cell and then double click on the lower right hand corner.

  • Calculate the Benjamini and Hochberg correction for the rest of the strains in column I for dCIN5, in column M for dGLN3, in column Q for dHMO1, and in column U for dZAP1.
  • Type "Significantly_Expressed_Genes" into cell F1, J1, N1, R1, and V1.
  • Find which genes are significantly differentially expressed. In order to do so for the wildtype, type the following formula into cell F2:
=IF(E2<0.05,1,0)

and press enter. Instead of typing the cell designations, click on the beginning cell and then double click on the lower right hand corner.

  • Scroll down to the bottom of the column F.
  • In cell 6191 of the column, type the following formula:
=SUM(F2:F6190)
  • Find which genes are significantly differentially expressed for the rest of the strains using columns J, N, R, and V. You can use the formulas above but make sure to alter the cell number.

Calculating the F statistic and p value corrections in MATLAB

  • The MATLAB script for the statistical analysis needs as input a spreadsheet with all of the normalized data. To prepare this input,
  1. Open the .csv file created after normalizing the data (i.e. GCAT_and_Ontario_Final_Normalized_Data.csv).
  2. Rename the spreadsheet as "Master_Sheet".
  3. Type the identifier "Gene ID" in cell A1.
  4. Insert a new column after column A.
  5. Type the identifier "Standard Name" in cell B1. Column B will contain the common names for the genes on the microarray.
  6. Copy the gene names in column A.
  7. Paste the names into the "Value" field of the List <-> Gene List tool in YEASTRACT [1]. Then, click on the "Transform" button.
  8. Select all of the names in the "Gene Name" column of the resulting table.
  9. Copy and paste these names into column B of the .csv file.
  10. Save the .csv file as a .xls file.
  • Close the .csv file and open MATLAB. Make sure the directory is set to where the input file is located.
  • Specify the names of the Excel file and spreadsheet that contains the data and strain for which the data will be import into MATLAB.
filename    = 'Katrina_Latest_Statistical_Analysis_Normalized_Data_20111101.xls';
sheetname   = 'Master_Sheet';
strain      = 'wt';

Make sure that the strain name is exactly as it appears in the column headers in the Excel spreadsheet.

  • Import the Excel spreadsheet and create a separate matrix containing only the numerical data in the spreadsheet.
[log2FC,b]=xlsread(filename,sheetname);
a = log2FC(:,2:end);
  • For each time point in the experiment, identify the columns in the input file that contain the data for the specified strain and time point.
strainT15 = strfind(b(1,:),[strain '_LogFC_t15']);
strainT30 = strfind(b(1,:),[strain '_LogFC_t30']);
strainT60 = strfind(b(1,:),[strain '_LogFC_t60']);
strainT90 = strfind(b(1,:),[strain '_LogFC_t90']);
strainT120 = strfind(b(1,:),[strain '_LogFC_t120']);
  • Create vectors of column indicies that were identified for each time point for the strain specified.
ind15 = find(not(cellfun('isempty',strainT15)))-(length(b(1,:))-length(a(1,:)));
ind30 = find(not(cellfun('isempty',strainT30)))-(length(b(1,:))-length(a(1,:)));
ind60 = find(not(cellfun('isempty',strainT60)))-(length(b(1,:))-length(a(1,:)));
ind90 = find(not(cellfun('isempty',strainT90)))-(length(b(1,:))-length(a(1,:)));
ind120 = find(not(cellfun('isempty',strainT120)))-(length(b(1,:))-length(a(1,:)));
  • Specify the significance value alpha for the ANOVA, the total number of genes n to be analyzed, the number of independent variables p, and the number of constraints q on the linear model.
alpha = 0.05; 
n = length(a(:,1));
p = 5;
q = 5;
  • Concatenate all of the indices into one vector.
ind = [ind15,ind30,ind60,ind90,ind120];
N = length(ind);
  • To perform the ANOVA, it is necessary to set up a regression matrix X, which is a matrix of ones and zeros that specifies the independent variables. Set up the regressor matrix so that it has as many rows as the number of indices and as many columns as there are independent variables.
X = zeros(N,p);
  • If the indices do not start at 1, then the regressor matrix will not be constructed correctly. To remedy this, create a new set of indices by manipulating the initial indices to begin at 1. Then, create the regressor matrix X.
if ind(1)~= 1
 ind151 = ind15-ind15(1)+1;
 ind301 = ind30-ind30(1)+ind151(end)+1;
 ind601 = ind60-ind60(1)+ind301(end)+1;
 ind901 = ind90-ind90(1)+ind601(end)+1;
 ind1201 = ind120-ind120(1)+ind901(end)+1;
 X(ind151,1) = 1;
 X(ind301,2) = 1;
 X(ind601,3) = 1;
 X(ind901,4) = 1;
 X(ind1201,5) = 1;
else 
  X(ind15,1) = 1;
  X(ind30,2) = 1;
  X(ind60,3) = 1;
  X(ind90,4) = 1;
  X(ind120,5) = 1;
end
  • Specify the dimensions of the output matrix. This output matrix will contain two columns of gene names, a column for the average log fold change for each time point, the F statistic, the p value, the SSH, the SSF, the Benjamini & Hochberg corrected p values, whether or not the p value is significant, and the rank of each gene's p value after performing the Benjamini & Hochberg correction.
out_data = zeros(n,12);
  • To compute the F statistic and p value for each gene for the wild type strain, first the indices have to be modified. This modification involves ignoring the indices that correspond to the time points for which a gene has "NaN" as the log fold change.
for ii = 1:n
if strcmpi(strain,'wt') == 1
   I = find(~isnan(a(ii,ind)));
   ind15x = I(I>=ind15(1)&I<=ind15(end));
   ind30x = I(I>=ind30(1)&I<=ind30(end));
   ind60x = I(I>=ind60(1)&I<=ind60(end));
   ind90x = I(I>=ind90(1)&I<=ind90(end));
   ind120x = I(I>=ind120(1)&I<=ind120(end));
  • Concatenate these modified indices into a new vector indx. Find the length N of the newly created vector.
   indx = [ind15x,ind30x,ind60x,ind90x,ind120x];
   N = length(indx);
  • The continuity of the indices from one time point to another was lost in modifying the indices for the wild type data. Manipulate the modified indices so that they are continuous from one time point to the next.
   if N<23
       ind15x = find(ind15x);
       ind30x = find(ind30x)+ind15x(end);
       ind60x = find(ind60x)+ind30x(end);
       ind90x = find(ind90x)+ind60x(end);
       ind120x = find(ind120x)+ind90x(end);
   end
  • Set up the regressor matrix using the indices ind15x, ind30x, ind60x, ind90x, and ind120x.
   X = zeros(N,p);
   X(ind15x,1) = 1;
   X(ind30x,2) = 1;
   X(ind60x,3) = 1;
   X(ind90x,4) = 1;
   X(ind120x,5) = 1;
  • The aforementioned modifications to the indices do not need to be performed for the deletion strains. Therefore, the following condition ensures that the indices remain as they were computed at the beginning of the code.
else
   indx = ind;
end
  • For each gene, create a vector fo the log fold change values that correspond to that gene.
 Y = a(ii,indx)'; 
  • Compute the average log fold change.
 beta = X\Y;
  • Specify which columns in the output matrix will store the average log fold changes.
 out_data(ii,1:q) = beta';
  • Find the sum of squares of error for the hypothesized model s2h and the full model s2. Then, find the F statistic and p value.
 s2 = (1/N)*(Y-X*beta)'*(Y-X*beta); 
 s2h = (1/N)*(Y'*Y); 
 W      = ((N-p)/q)*((s2h-s2)/s2); 
 pval    = 1-fcdf(W,q,N-p); 
  • Specify the columns that are to be outputted.
 out_data(ii,6) = W;
 out_data(ii,7) = pval;
 out_data(ii,8) = s2*N;
 out_data(ii,9) = s2h*N;
end
  • Perform the Benjamini & Hochberg correction.
p1 = out_data(:,7);
q1 = (alpha/n)*[1:n]';
[ps,is] = sort(p1); 
tests   = ps<=q1; 
itests  = find(tests==1);
itestm  = max(itests);
qs = zeros(size(q1));
rs = qs;
for ii = 1:n
   qs(is(ii)) = ps(ii)*n/ii; 
   rs(is(ii)) = ii;
end
  • Output the data.
p1 = out_data(:,7); %P values
q1 = (alpha/n)*[1:n]';
[ps,is] = sort(p1); %Sort the p values.
tests   = ps<=q1; 
itests  = find(tests==1);
itestm  = max(itests);
qs = zeros(size(q1));
rs = qs;
for ii = 1:n
   qs(is(ii)) = ps(ii)*n/ii; % Perform B&H Correction
   rs(is(ii)) = ii;
end

Comparing Significant Changes in Expression Between Two Strains

A general linear model is used to test the hypothesis that each gene shows the same pattern of gene expression over the time course of the experiment in each strain. The null hypothesis is that the average log fold change of a gene g for a time point j in one strain s is the same as that in another strain s'. The statistical analysis was performed in MATLAB.