Dahlquist:Modified ANOVA and p value Corrections for Microarray Data
From OpenWetWare
(→Finding a General Linear Model in MATLAB: Created two new variables in the if strcmpi(strain1,'wt') == 0 loop to get rid of error in running code line by line) |
(→Calculate the Benjamini & Hochberg p value Correction: note to paste values when pasting p values to B&H correction) |
||
Line 110: | Line 110: | ||
*Create an index column by first typing "Index" into cell A1. Then type "1" into cell A2 and "2" into cell A3. Select both cells A2 and A3. Click, hold, and then drag the lower right hand corner until you reach row 6190. Release once you get to row 6190. | *Create an index column by first typing "Index" into cell A1. Then type "1" into cell A2 and "2" into cell A3. Select both cells A2 and A3. Click, hold, and then drag the lower right hand corner until you reach row 6190. Release once you get to row 6190. | ||
- | *Copy and paste the column of Gene ID's from one of the previous worksheets into columns C, G, K, O, and S. Paste the column of p values from the wildtype worksheet into column D. Paste the column of p values from the dCIN5 worksheet into column H. Paste the column of p values from the dGLN3 worksheet into column L. Paste the column of p values from the dHMO1 worksheet into column P. Paste the column of p values from the dZAP1 worksheet into column T. | + | *Copy and paste the column of Gene ID's from one of the previous worksheets into columns C, G, K, O, and S. For the following, use Paste special > Paste values. Paste the column of p values from the wildtype worksheet into column D. Paste the column of p values from the dCIN5 worksheet into column H. Paste the column of p values from the dGLN3 worksheet into column L. Paste the column of p values from the dHMO1 worksheet into column P. Paste the column of p values from the dZAP1 worksheet into column T. |
*Select all of column D. Sort by ascending values. You can click the menu option Data < Sort... In the new window that opens, keep the option "Expand the selection" and click "Sort...". Keep the "Sort by..." option as "Ascending" and click "OK." Alternately, you can click the sort button from A to Z on the toolbar, keep the option "Expand the selection" and click "OK". | *Select all of column D. Sort by ascending values. You can click the menu option Data < Sort... In the new window that opens, keep the option "Expand the selection" and click "Sort...". Keep the "Sort by..." option as "Ascending" and click "OK." Alternately, you can click the sort button from A to Z on the toolbar, keep the option "Expand the selection" and click "OK". |
Revision as of 17:45, 13 August 2014
Contents |
Introduction
The protocol below describes the statistical tests that have been performed to determine the significance of the gene expression changes seen in the microarray data. The Bonferroni and Benjamini & Hochberg corrections were used to correct for false positives after performing the statistical tests.
Analyzing Significant Gene Expression Changes Within Each Strain
For each strain, an ANOVA statistic was calculated to determine what genes have a significant change in expression during at least one time point.
- To analyze significant changes in gene expression, we need to compute an F statistic for each gene, which is the ratio of the MST (among time points mean square) to the MSE (within time points mean square).
- The MSE is the within time points sum of squares of deviations from the mean (henceforth, sum of squares (SSF) divided by the corresponding degrees of freedom (i.e. the number of time points T).
- The SSF is calculated by finding the sum of the squared difference between a gene’s log fold change for time point i and flask j and the mean log fold change for the time point.
SSF=Σ_{i}Σ_{j}(Y_{ij}-μ_{j})^{2}
- The MST is the among time points sum of squares (SST) divided by the corresponding degrees of freedom (the difference between the total number of time points and replicates N and the number of time points T). The SST is the within time points sum of squares (SSF) subtracted from the sum of squares for all of the data for a gene (SSH).
- The SSH is calculated by squaring the log fold change and then summing over all of the time points i and flasks j.
SSH=Σ_{i}Σ_{j}(Y_{ij})^{2}
- So, the F statistic with degrees of freedom n-p and q is
[(SSF-SSH)/SSF] * [(n-p)/q] ,
where n is the number of data points for a gene, p is the number of time points, and q = p*(number of strains-1).
- Find the p value by using the F distribution with your computed F statistic, significance value, and n-p and q degrees of freedom.
In the case, such as this, when multiple statistical tests are being performed simultaneously, a correction should be applied to the p values in order to mitigate the false positive rate. The Bonferroni method is one such method performed by multiplying each p value by the total number of hypotheses m, which in this case corresponds to the number of genes.
p*m≤α
A downside to this correction is that the final will consist of only the most extreme outliers, thus ignoring some potentially significant genes.
A more robust method of correcting the p values is the Benjamini & Hochberg correction (B&H). This method is performed by first sorting all the p values from least to greatest and each gene is given a rank i. Then, the p values are multiplied by the total number of hypotheses m and divided by their rank i.
p*m/i≤α
Statistical Analysis in Excel
The F statistic, Bonferroni correction, and the Benjamini and Hochberg correction can be performed in Excel.
Calculate the F Statistic
- Create an Excel workbook containing a separate worksheet for each of the following labels: "wt", "dCIN5", "dGLN3", "dHMO1", "dZAP1", "Bonferroni", and "B&H".
- For each of the strains, copy the gene names and the normalized data from the output after normalization into the respective worksheets.
- In order to continue with the calculations, it is very important that there are no cells with the value "NA". Replace all of the “NA” values in the data with a space.
- Complete the following steps for each of the worksheets containing the normalized data:
- After the last column of the normalized data, label the next five columns (one for each time point), with the designation “<strain>_Avg_LogFC_<timepoint> in row 1.
- Calculate the average log fold change for each time point. For the data calculated at the t15 time point for the wild type strain, type the following formula into the cell beneath the column header “wt_Avg_LogFC_t15”:
=AVERAGE(B2:E2)
Press enter. Double click on the lower right hand corner of the cell so that Excel computes the average log fold change for the rest of the genes.
- Calculate the average log fold change for the remaining time points by using the formula aforementioned substituting the range for the cell corresponding to the first replicate measurement for the time point to the cell corresponding to the last replicate measurement for the time point.
- After the last column of the average log fold changes, label the next few columns (the number of which corresponds to the number of columns of the normalized data) with the header “<strain>_mean_centered_LogFC_<timepoint>-<flask>” in row 1.
- For each time point, subtract the mean fold change for the time point from each fold change measurement for the time point for each gene (henceforth, mean centered log fold changes). For examples, in row 2 of the column labeled “wt_mean_centered_LogFC-t15-1”, type the following formula and hit enter:
=B2-$Y2
Double click on the lower right hand corner of the cell to make the computation for each gene. Use the same formula for the remainder of the time points and genes substituting the first term for a cell containing the fold change for a replicate of the time point and the second term for the cell corresponding to the average log fold change for the time point.
- After the last column of the mean centered normalized log fold changes, label the next column with the header "<strain>_SSH" in row 1.
- Calculate the SSH by finding the sum of squares of the log fold changes. For example, in row 2 of the column labeled “wt_SSH” type the following formula
=SUMSQ(B2:X2)
and hit enter. Double click on the lower right hand corner of the cell to perform the calculation for all of the genes.
- Label the next column after the SSH calculates with the header "<strain>_SSF in row 1.
- Calculate the SSF by find the sum of squares of all of the mean centered log fold changes. For example, in row 2 of the column labeled “wt_SSF” type the following formula
=SUMSQ(AD2:AZ2)
and press enter. Double click on the lower right hand corner of the cell to perform the calculation for all of the genes.
- In the first cell of the next column after the SSF calculations, enter the header “<strain>_Fstat” in row 1.
- Calculate the F statistic as previously described. For example, in row 2 of the column labeled “wt_Fstat” type the formula
=((23-5)/5)*((BA2-BB2)/BB2)
and press enter. Double click on the lower right hand corner of the cell to perform the calculation for all of the genes.
- In the first cell of the next column after F stat calculations, type the header “<strain>_Pval”.
- To calculate the p value, use the formula
=FDIST(F statistic, T, N-T)
For example, in row 2 of the column labeled “wt_Pval” type the formula
=FDIST(BC2,5,23-5)
- NOTE: There are 54 genes that occur on the "Ontario" chip, but not on the "GCAT" chip, which results in missing data for those replicates. This will cause a #VALUE! error.
- To fix this, filter the records to show these errors and then replace all of the #VALUE! errors in the "LogFC" and "mean_centered_LogFC" columns with a single space character.
- This affects the degrees of freedom in the Fstat and Pval calculations. Instead of 23 chips, there are just 14 with data. Replace "23" with "14" in the formulas above, accordingly.
- In the first cell of the next column after p value calculations, type the header "Significantly_Expressed_Genes".
- Find which genes are significantly differentially expressed (i.e. the p value for that gene is less than 0.05). For example, for the wild type data, type the following formula into the second cell of the column.
=IF(BD2<0.05,1,0)
and press enter. Double click on the lower right hand corner of the cell to perform the calculation for all of the genes.
- Find the sum is the total number of genes that have a p value of less than 0.05.
- For the wild type data, scroll down to the bottom of the column and type the following formula in the first blank cell
=SUM(BD2:BD6190)
and press enter.
Calculate the Bonferroni p value Correction
- Select the worksheet labelled "Bonferroni".
- Paste the "Gene ID" column from one of the earlier five worksheets into the column A of this worksheet.
- Paste the entire column of p values for the wildtype into the column B of the worksheet as values. Paste the p values for the other strains in the same manner into columns C-F.
- In cell G1 and H1, type "wt_Bonferroni_pval".
- Calculate the Bonferroni corrected p value for the wildtype data. To do so, type the following formula into cell G2:
=B2*6189
and press enter. Instead of typing the cell designations, click on the beginning cell and then double click on the lower right hand corner.
- Replace any corrected p value that is greater than 1 by the number 1 by typing the following formula into cell H2:
=IF(G2>1,1,G2)
and press enter. Instead of typing the cell designations, click on the beginning cell and then double click on the lower right hand corner.
- In cell I2, type "Significantly_Expressed_Genes".
- Find which genes are significantly differentially expressed by typing the following formula into the cell I2:
=IF(G2<0.05,1,0)
and press enter. Instead of typing the cell designations, click on the beginning cell and then double click on the lower right hand corner.
- Scroll down to the bottom of the column.
- In cell I6192, type the following formula:
=SUM(I2:I6190)
- Find the Bonferroni corrected p values for the rest of the strains following the procedure outlined for the wildtype above. Change the column names so that they reflect the strain on which you are working on. Change the cell designations in the formulas to reflect the cells corresponding to the strain on which you are working on.
Calculate the Benjamini & Hochberg p value Correction
- Click on the worksheet named "B&H".
- Create an index column by first typing "Index" into cell A1. Then type "1" into cell A2 and "2" into cell A3. Select both cells A2 and A3. Click, hold, and then drag the lower right hand corner until you reach row 6190. Release once you get to row 6190.
- Copy and paste the column of Gene ID's from one of the previous worksheets into columns C, G, K, O, and S. For the following, use Paste special > Paste values. Paste the column of p values from the wildtype worksheet into column D. Paste the column of p values from the dCIN5 worksheet into column H. Paste the column of p values from the dGLN3 worksheet into column L. Paste the column of p values from the dHMO1 worksheet into column P. Paste the column of p values from the dZAP1 worksheet into column T.
- Select all of column D. Sort by ascending values. You can click the menu option Data < Sort... In the new window that opens, keep the option "Expand the selection" and click "Sort...". Keep the "Sort by..." option as "Ascending" and click "OK." Alternately, you can click the sort button from A to Z on the toolbar, keep the option "Expand the selection" and click "OK".
Sort columns, H, L, P, and T in the same way.
- Calculate the Benjamini and Hochberg p value correction for the wildtype. Type "wt_BH_pval" in cell E1. Type the following formula in cell E2:
=(D2*6189)/A2
and press enter. Instead of typing the cell designations, click on the beginning cell and then double click on the lower right hand corner.
- Calculate the Benjamini and Hochberg correction for the rest of the strains in column I for dCIN5, in column M for dGLN3, in column Q for dHMO1, and in column U for dZAP1.
- Type "Significantly_Expressed_Genes" into cell F1, J1, N1, R1, and V1.
- Find which genes are significantly differentially expressed. In order to do so for the wildtype, type the following formula into cell F2:
=IF(E2<0.05,1,0)
and press enter. Instead of typing the cell designations, click on the beginning cell and then double click on the lower right hand corner.
- Scroll down to the bottom of the column F.
- In cell 6191 of the column, type the following formula:
=SUM(F2:F6190)
- Find which genes are significantly differentially expressed for the rest of the strains using columns J, N, R, and V. You can use the formulas above but make sure to alter the cell number.
Calculating the F statistic and p value corrections in MATLAB
- The MATLAB script for the statistical analysis needs as input a Microsoft Excel spreadsheet with all of the normalized data. To prepare this input,
- Open the .csv file created after normalizing the data (i.e. GCAT_and_Ontario_Final_Normalized_Data.csv).
- Rename the worksheet as "Master_Sheet".
- Type the identifier "Gene ID" in cell A1.
- Insert a new column after column A.
- Type the identifier "Standard Name" in cell B1. Column B will contain the common names for the genes on the microarray.
- Copy the gene names in column A.
- Paste the names into the "Value" field of the List <-> Gene List tool in YEASTRACT [1]. Then, click on the "Transform" button.
- Select all of the names in the "Gene Name" column of the resulting table.
- Copy and paste these names into column B of the .csv file.
- Save the .csv file as a .xls file.
- Close the .csv file and open MATLAB. Make sure the directory is set to where the input file is located.
- Specify the name of the Excel file in filename and the name of the worksheet that contains the data and strain for which the data will be import into MATLAB in sheetname. Specify the strain in strain.
filename = 'GCAT_and_Ontario_Final_Normalized_Data.xls'; sheetname = 'Master_Sheet'; strain = 'wt';
Make sure that the strain name is exactly as it appears in the column headers in the Excel spreadsheet (including the same case).
- Import the Excel spreadsheet and create a separate matrix containing only the numerical data in the spreadsheet.
[log2FC,b]=xlsread(filename,sheetname); a = log2FC(:,2:end);
- For each time point in the experiment, identify the columns in the input file that contain the data for the specified strain and time point.
strainT15 = strfind(b(1,:),[strain '_LogFC_t15']); strainT30 = strfind(b(1,:),[strain '_LogFC_t30']); strainT60 = strfind(b(1,:),[strain '_LogFC_t60']); strainT90 = strfind(b(1,:),[strain '_LogFC_t90']); strainT120 = strfind(b(1,:),[strain '_LogFC_t120']);
- Create vectors of column indices that were identified for each time point for the strain specified.
ind15 = find(not(cellfun('isempty',strainT15)))-(length(b(1,:))-length(a(1,:)))+1; ind30 = find(not(cellfun('isempty',strainT30)))-(length(b(1,:))-length(a(1,:)))+1; ind60 = find(not(cellfun('isempty',strainT60)))-(length(b(1,:))-length(a(1,:)))+1; ind90 = find(not(cellfun('isempty',strainT90)))-(length(b(1,:))-length(a(1,:)))+1; ind120 = find(not(cellfun('isempty',strainT120)))-(length(b(1,:))-length(a(1,:)))+1;
- Specify the significance value alpha for the ANOVA, the total number of genes n to be analyzed, the number of independent variables p, and the number of constraints q on the linear model.
alpha = 0.05; n = length(a(:,1)); p = 5; q = 5;
- Concatenate all of the indices into one vector.
ind = [ind15,ind30,ind60,ind90,ind120]; N = length(ind);
- To perform the ANOVA, it is necessary to set up a regression matrix X, which is a matrix of ones and zeros that specifies the independent variables. Set up the regressor matrix so that it has as many rows as the number of indices and as many columns as there are independent variables.
X = zeros(N,p);
- If the indices do not start at 1, then the regressor matrix will not be constructed correctly. To remedy this, create a new set of indices by manipulating the initial indices to begin at 1. Then, create the regressor matrix X.
if ind(1)~= 1 ind151 = ind15-ind15(1)+1; ind301 = ind30-ind30(1)+ind151(end)+1; ind601 = ind60-ind60(1)+ind301(end)+1; ind901 = ind90-ind90(1)+ind601(end)+1; ind1201 = ind120-ind120(1)+ind901(end)+1; X(ind151,1) = 1; X(ind301,2) = 1; X(ind601,3) = 1; X(ind901,4) = 1; X(ind1201,5) = 1; else X(ind15,1) = 1; X(ind30,2) = 1; X(ind60,3) = 1; X(ind90,4) = 1; X(ind120,5) = 1; end
- Specify the dimensions of the output matrix. This output matrix will contain two columns of gene names, a column for the average log fold change for each time point, the F statistic, the p value, the SSH, the SSF, the Benjamini & Hochberg corrected p values, whether or not the p value is significant, and the rank of each gene's p value after performing the Benjamini & Hochberg correction.
out_data = zeros(n,12);
- To compute the F statistic and p value for each gene for the wild type strain, first the indices have to be modified. This modification involves ignoring the indices that correspond to the time points for which a gene has "NaN" as the log fold change.
for ii = 1:n if strcmpi(strain,'wt') == 1 I = find(~isnan(a(ii,ind))); ind15x = I(I>=ind15(1)&I<=ind15(end)); ind30x = I(I>=ind30(1)&I<=ind30(end)); ind60x = I(I>=ind60(1)&I<=ind60(end)); ind90x = I(I>=ind90(1)&I<=ind90(end)); ind120x = I(I>=ind120(1)&I<=ind120(end));
- Concatenate these modified indices into a new vector indx. Find the length N of the newly created vector.
indx = [ind15x,ind30x,ind60x,ind90x,ind120x]; N = length(indx);
- The continuity of the indices from one time point to another was lost in modifying the indices for the wild type data. Manipulate the modified indices so that they are continuous from one time point to the next.
if N<23 ind15x = find(ind15x); ind30x = find(ind30x)+ind15x(end); ind60x = find(ind60x)+ind30x(end); ind90x = find(ind90x)+ind60x(end); ind120x = find(ind120x)+ind90x(end); end
- Set up the regressor matrix using the indices ind15x, ind30x, ind60x, ind90x, and ind120x.
X = zeros(N,p); X(ind15x,1) = 1; X(ind30x,2) = 1; X(ind60x,3) = 1; X(ind90x,4) = 1; X(ind120x,5) = 1;
- The aforementioned modifications to the indices do not need to be performed for the deletion strains. Therefore, the following condition ensures that the indices remain as they were computed at the beginning of the code.
else indx = ind; end
- For each gene, create a vector for the log fold change values that correspond to that gene.
Y = a(ii,indx)';
- Compute the average log fold change.
beta = X\Y;
- Specify which columns in the output matrix will store the average log fold changes.
out_data(ii,1:q) = beta';
- Find the sum of squares of error for the hypothesized model s2h and the full model s2. Then, find the F statistic and p value.
s2 = (1/N)*(Y-X*beta)'*(Y-X*beta); s2h = (1/N)*(Y'*Y); W = ((N-p)/q)*((s2h-s2)/s2); pval = 1-fcdf(W,q,N-p);
- Specify the columns that are to be outputted.
out_data(ii,6) = W; out_data(ii,7) = pval; out_data(ii,8) = s2*N; out_data(ii,9) = s2h*N; end
- Perform the Benjamini & Hochberg correction.
p1 = out_data(:,7); q1 = (alpha/n)*[1:n]';
[ps,is] = sort(p1);
tests = ps<=q1; itests = find(tests==1); itestm = max(itests);
qs = zeros(size(q1)); rs = qs; for ii = 1:n qs(is(ii)) = ps(ii)*n/ii; rs(is(ii)) = ii; end
- Output the data.
signif = zeros(n,1); signif(is(1:itestm)) = 1;
out_data(:,10) = qs; out_data(:,11) = signif; out_data(:,12) = rs;
eval(['save ' strain '_out_data out_data;']);
out_data_cells{1,1} = 'Systematic Name'; out_data_cells{1,2} = 'Standard Name'; out_data_cells{1,3} = [strain '_t15']; out_data_cells{1,4} = [strain '_t30']; out_data_cells{1,5} = [strain '_t60']; out_data_cells{1,6} = [strain '_t90']; out_data_cells{1,7} = [strain '_t120']; out_data_cells{1,8} = 'f stat'; out_data_cells{1,9} = 'p val'; out_data_cells{1,10} = 'SS full'; out_data_cells{1,11} = 'SS hyp'; out_data_cells{1,12} = 'B&H comps'; out_data_cells{1,13} = '? signif ?'; out_data_cells{1,14} = 'B&H rank'; for ii = 1:n out_data_cells{1+ii,1} = b{ii+1,1}; out_data_cells{1+ii,2} = b{ii+1,2}; for jj = 1:12 out_data_cells{1+ii,2+jj} = out_data(ii,jj); end end xlswrite([strain '_one_strain_ANOVA_out_data.xls'],out_data_cells)
Comparing Significant Changes in Expression Between Two Strains
A general linear model is used to test the hypothesis that each gene shows the same pattern of gene expression over the time course of the experiment in each strain. The null hypothesis is that the average log fold change μ of a gene g for a time point j in one strain s is the same as that in another strain s'. The alternative hypothesis is then that the average log fold change μ of a gene g for a time point j in one strain s is different as that in another strain s'. As with the statistical test to determine significant changes in expression in each strain, the analysis of gene expression between two genes involves finding the sum of squares of error for the models under the null and alternative hypothesis, called SSH and SSF, respectively:
SSH=Σ_{i,j,s}(Y^{g}_{ij}-μ^{g}_{j})^{2} SSF=Σ_{i,j,s}(Y^{g,s}_{ij}-μ^{g,s}_{j})^{2}
and, then, finding the F statistic and p value for each gene.
Finding a General Linear Model in MATLAB
All two strain comparisons were performed in MATLAB using the following code:
For the first block of code, the user must input the name of the Excel file (.xls) to be imported as the variable "filename", the sheet from which the data will be imported as the variable "sheetname", and the two strains that will be compared as the variables "strain1" and "strain2".
%% User must input filename, sheetname, and strains for comparison filename = 'GCAT_and_Ontario_Final_Normalized_Data.xls'; % Name of input file sheetname = 'Master_Sheet'; % Name of sheet in input file containing data to analyze % % If one of the two strains you are working on is the wildtype, keep that % % wildtype as strain 1. strain1 = 'wt'; %Here should be wt, dCIN5, dGLN3, dHMO1, or dZAP1 % % Select strain 2 to be one of the other strains you would like to % % compare with the first strain. strain2 = 'dZAP1'; %Here should be dCIN5, dGLN3, dHMO1, or dZAP1
The user does not have to modify any of the code from here on.
- The next two lines of code ask the user whether or not they would like to see plots for each gene with an unadjusted p-value < 0.05. If the user does want to see these plots, they enter "1". If they would not like to see these plots, the user enters "0".
disp('Do you want to view plots for each gene with an unadjusted p-value < 0.05?') graph = input('If yes, enter "1". If no, enter "0". ');
- Load numerical and character data separately. The numerical data is stored in a matrix "a" and the character data is stored in a matrix "b".
[a,b]=xlsread(filename,sheetname); % % a = numerical data from sheet in matrix form % b = character data from sheet in cell array form % %
- Identify where in the cell array b does the first strain strain1 and the specified time point appear.
T15S1 = strfind(b(1,:),[strain1 '_LogFC_t15']); %15 minutes T30S1 = strfind(b(1,:),[strain1 '_LogFC_t30']); %30 minutes T60S1 = strfind(b(1,:),[strain1 '_LogFC_t60']); %60 minutes T90S1 = strfind(b(1,:),[strain1 '_LogFC_t90']); %90 minutes T120S1 = strfind(b(1,:),[strain1 '_LogFC_t120']); %120 minutes
- Identify where in the cell array b does the second strain strain2 and the specified time point appear.
T15S2 = strfind(b(1,:),[strain2 '_LogFC_t15']); %15 minutes T30S2 = strfind(b(1,:),[strain2 '_LogFC_t30']); %30 minutes T60S2 = strfind(b(1,:),[strain2 '_LogFC_t60']); %60 minutes T90S2 = strfind(b(1,:),[strain2 '_LogFC_t90']); %90 minutes T120S2 = strfind(b(1,:),[strain2 '_LogFC_t120']); %120 minutes
- Create an index designating which column of the data array a corresponds to each time point for the first strain strain1.
ind15S1 = find(not(cellfun('isempty',T15S1)))-(length(b(1,:))-length(a(1,:))); %15 minutes ind30S1 = find(not(cellfun('isempty',T30S1)))-(length(b(1,:))-length(a(1,:))); %30 minutes ind60S1 = find(not(cellfun('isempty',T60S1)))-(length(b(1,:))-length(a(1,:))); %60 minutes ind90S1 = find(not(cellfun('isempty',T90S1)))-(length(b(1,:))-length(a(1,:))); %90 minutes ind120S1 = find(not(cellfun('isempty',T120S1)))-(length(b(1,:))-length(a(1,:))); %120 minutes
- Combine all data column indices into one array
indS1 = [ind15S1,ind30S1,ind60S1,ind90S1,ind120S1];
- Create an index designating which column of the data array a corresponds to each time point for the second strain strain2.
ind15S2 = find(not(cellfun('isempty',T15S2)))-(length(b(1,:))-length(a(1,:))); %15 minutes ind30S2 = find(not(cellfun('isempty',T30S2)))-(length(b(1,:))-length(a(1,:))); %30 minutes ind60S2 = find(not(cellfun('isempty',T60S2)))-(length(b(1,:))-length(a(1,:))); %60 minutes ind90S2 = find(not(cellfun('isempty',T90S2)))-(length(b(1,:))-length(a(1,:))); %90 minutes ind120S2 = find(not(cellfun('isempty',T120S2)))-(length(b(1,:))-length(a(1,:))); %120 minutes
- Combine all data column indices into one array
indS2 = [ind15S2,ind30S2,ind60S2,ind90S2,ind120S2]; N = length(indS1)+length(indS2);
- Determine the significance value for the statistical test and the total number of genes. The variable p is number of time points multiplied by the number of strains while q = p/number of strains.
alpha = 0.05; %Significance value for statistical test. n = length(a(:,1)); %Total number of genes. p = 10; % Number of time points multiplied by the number of strains. q = 5; % p/number of strains
- To compute the sum of the square of the error in the models associated with the null and alternative hypotheses, it is necessary to construct two matrices (X and Xh) representing the independent variables in the linear models associated with the two hypotheses. The following code constructs these matrices in the case that neither of the two strains in the statistical analysis are the wild type.
if strcmpi(strain1,'wt') == 0 indS1x = find(indS1); i0 = ind15S1(1); i1 = indS1x(end); i2 = indS2(1); %Create new indices without any breaks so as to properly construct the %regressor matrices X and Xh. ind151 = find(ind15S1); ind301 = ind30S1-i0+1; ind601 = ind60S1-i0+1; ind901 = ind90S1-i0+1; ind1201 = ind120S1-i0+1;
ind1 = [ind151,ind301,ind601,ind901,ind1201]; ind152 = ind15S2-i2+1+i1; ind302 = ind30S2-i2+1+i1; ind602 = ind60S2-i2+1+i1; ind902 = ind90S2-i2+1+i1; ind1202 = ind120S2-i2+1+i1;
ind2 = [ind152,ind302,ind602,ind902,ind1202]; %Length of each array of indices n151 = length(ind151); n301 = length(ind301); n601 = length(ind601); n901 = length(ind901); n1201 = length(ind1201); n152 = length(ind152); n302 = length(ind302); n602 = length(ind602); n902 = length(ind902); n1202 = length(ind1202); %ts1 and ts2 are arrays of time points reflecting the number of %replicates in the data for strain1 and strain2, respectively. %This will be necessary to later plot the linear model for each gene. ts1 = [ones(n151,1)*15;ones(n301,1)*30;ones(n601,1)*60;ones(n901,1)*90;ones(n1201,1)*120]; ts2 = [ones(n152,1)*15;ones(n302,1)*30;ones(n602,1)*60;ones(n902,1)*90;ones(n1202,1)*120]; t = [ts1;ts2]; [tsort,is] = sort(t); %tsort contains the sorted list of time points after combining arrays ts1 and ts2. % X and Xh are regressor matrices for the full and hypothesized % models, respectively X = zeros(N,p); Xh = zeros(N,p-q); Xh(ind151,1) = 1; Xh(ind301,2) = 1; Xh(ind601,3) = 1; Xh(ind901,4) = 1; Xh(ind1201,5) = 1; Xh(ind152,1) = 1; Xh(ind302,2) = 1; Xh(ind602,3) = 1; Xh(ind902,4) = 1; Xh(ind1202,5) = 1; X(ind151,1) = 1; X(ind301,2) = 1; X(ind601,3) = 1; X(ind901,4) = 1; X(ind1201,5) = 1; X(ind152,6) = 1; X(ind302,7) = 1; X(ind602,8) = 1; X(ind902,9) = 1; X(ind1202,10) = 1; end
- Specify a matrix to store the parameters estimated by solving the linear system of equations to obtain the model.
% % out_data includes % out_data = zeros(n,12); nsig = 0;
- The following code constructs the matrices X and Xh in the case that the first of the two specified strains in the statistical analysis is the wild type. This for loop also creates graphs of the linear model for each gene for each strain.
for ii=1:n if strcmpi(strain1,'wt') == 1 % Excludes all fold changes that have are an NaN. I = find(~isnan(a(ii,indS1))); ind15x = I(I>=ind15S1(1)&I<=ind15S1(end)); ind30x = I(I>=ind30S1(1)&I<=ind30S1(end)); ind60x = I(I>=ind60S1(1)&I<=ind60S1(end)); ind90x = I(I>=ind90S1(1)&I<=ind90S1(end)); ind120x = I(I>=ind120S1(1)&I<=ind120S1(end)); %indx represents the list of columns from which the data will be drawn indx = [ind15x,ind30x,ind60x,ind90x,ind120x]; c = length(indx); if c<23 ind15x = find(ind15x); ind30x = find(ind30x)+ind15x(end); ind60x = find(ind60x)+ind30x(end); ind90x = find(ind90x)+ind60x(end); ind120x = find(ind120x)+ind90x(end); end indX = find(indx); i1 = indX(end); i2 = indS2(1); ind152m = ind15S2-i2+1+i1; ind302m = ind30S2-i2+1+i1; ind602m = ind60S2-i2+1+i1; ind902m = ind90S2-i2+1+i1; ind1202m = ind120S2-i2+1+i1; N = length(indx)+length(indS2); % X and Xh are regressor matrices for the full and hypothesized % models, respectively X = zeros(N,p); Xh = zeros(N,p-q); Xh(ind15x,1) = 1; Xh(ind30x,2) = 1; Xh(ind60x,3) = 1; Xh(ind90x,4) = 1; Xh(ind120x,5) = 1; Xh(ind152m,1) = 1; Xh(ind302m,2) = 1; Xh(ind602m,3) = 1; Xh(ind902m,4) = 1; Xh(ind1202m,5) = 1; X(ind15x,1) = 1; X(ind30x,2) = 1; X(ind60x,3) = 1; X(ind90x,4) = 1; X(ind120x,5) = 1; X(ind152m,6) = 1; X(ind302m,7) = 1; X(ind602m,8) = 1; X(ind902m,9) = 1; X(ind1202m,10) = 1; %Length of each array of indices n151 = length(ind15x); n301 = length(ind30x); n601 = length(ind60x); n901 = length(ind90x); n1201 = length(ind120x); n152 = length(ind152m); n302 = length(ind302m); n602 = length(ind602m); n902 = length(ind902m); n1202 = length(ind1202m); %ts1 and ts2 are arrays of time points reflecting the number of %replicates in the data for strain1 and strain2, respectively. %This will be necessary to later plot the linear model for each gene. ts1 = [ones(n151,1)*15;ones(n301,1)*30;ones(n601,1)*60;ones(n901,1)*90;ones(n1201,1)*120]; ts2 = [ones(n152,1)*15;ones(n302,1)*30;ones(n602,1)*60;ones(n902,1)*90;ones(n1202,1)*120]; t = [ts1;ts2]; [tsort,is] = sort(t); %tsort contains the sorted list of time points after combining arrays ts1 and ts2. else indx = indS1; end Ys1 = a(ii,indx)'; %Data for a gene for the first strain Ys2 = a(ii,indS2)'; %Data for a gene for the second strain Y = [Ys1;Ys2]; beta = X\Y; %Average log fold change for each time point in the full model. betah = Xh\Y; %Average log fold change for each time point in the hypothesized model. out_data(ii,1:10) = beta'; out_data(ii,11:15) = betah'; s2 = (1/N)*(Y-X*beta)'*(Y-X*beta); %Sum of squares of error in the full model s2h = (1/N)*(Y-Xh*betah)'*(Y-Xh*betah); %Sum of squares of error in the hypothesized model W = ((N-p)/q)*((s2h-s2)/s2); % F statistic pval = 1-fcdf(W,q,N-p); %P value out_data(ii,16) = W; out_data(ii,17) = pval; out_data(ii,18) = s2*N; out_data(ii,19) = s2h*N;
- NOTE: When running this protocol piece by piece using the version of MATLAB installed on computers in new PC lab in the Mathematics Department, an error comes up at the following if statement designating that the construct "ind1201(...end...)" is ambiguous. However, this error does not appear when running the MATLAB script all the way through. In addition, this error does not appear when I run the protocol piece by piece on my personal computer running MATLAB R2009b.
if strcmpi(strain1,'wt') == 1 Ym1 = X(ind15x(1):ind120x(end),1:5)*beta(1:5,1); Ym2 = X(ind152m(1):ind1202m(end),6:10)*beta(6:10,1); else Ym1 = X(ind151(1):ind1201(end),1:5)*beta(1:5,1); Ym2 = X(ind152(1):ind1202(end),6:10)*beta(6:10,1); end Yh = Xh*betah; %Model based on null hypothesis that both genes exhibit the same dynamics
- Graph the expression profile of each gene as determined by the general linear model.
if graph == 1 pause on if pval<0.05 figure(1),plot(ts1,Ys1,'ro'),hold on,plot(ts2,Ys2,'mo'),plot(ts1,Ym1,'r','LineWidth',2),plot(ts2,Ym2,'m','LineWidth',2),... plot(tsort,Yh(is),'k','LineWidth',2),title([b{ii+1,1} '/' b{ii+1,2},' W = ',num2str(W),' pval = ',num2str(pval)]),... xlabel('time, minutes'),ylabel('log fold change'),... legend([strain1 ' data'],[strain2 ' data'],[strain1 ' model'],[strain2 ' model'],'hypothesized model'),drawnow hold off nsig = nsig + 1; [ii,nsig] drawnow pause end end end
- Perform the Benjamini & Hochberg correction.
p1 = out_data(:,17); %P values q1 = (alpha/n)*[1:n]';
[ps,is] = sort(p1); %Sort the p values.
tests = ps<=q1; itests = find(tests==1); itestm = max(itests);
qs = zeros(size(q1)); rs = qs; for ii = 1:n qs(is(ii)) = ps(ii)*n/ii; % Perform B&H Correction rs(is(ii)) = ii; end
signif = zeros(n,1); signif(is(1:itestm)) = 1; % Specify whether or not the corrected p value is significant
- Output all of the statistical data.
% Output all data
out_data(:,20) = qs; out_data(:,21) = signif; out_data(:,22) = rs;
eval(['save ' strain1 '_' strain2 '_out_data out_data;']);
out_data_cells{1,1} = 'Systematic Name'; out_data_cells{1,2} = 'Standard Name'; out_data_cells{1,3} = [strain1 '_t15']; out_data_cells{1,4} = [strain1 '_t30']; out_data_cells{1,5} = [strain1 '_t60']; out_data_cells{1,6} = [strain1 '_t90']; out_data_cells{1,7} = [strain1 '_t120']; out_data_cells{1,8} = [strain2 '_t15']; out_data_cells{1,9} = [strain2 '_t30']; out_data_cells{1,10} = [strain2 '_t60']; out_data_cells{1,11} = [strain2 '_t90']; out_data_cells{1,12} = [strain2 '_t120']; out_data_cells{1,13} = 'hyp_t15'; out_data_cells{1,14} = 'hyp_t30'; out_data_cells{1,15} = 'hyp_t60'; out_data_cells{1,16} = 'hyp_t90'; out_data_cells{1,17} = 'hyp_t120'; out_data_cells{1,18} = 'f stat'; out_data_cells{1,19} = 'p val'; out_data_cells{1,20} = 'SS full'; out_data_cells{1,21} = 'SS hyp'; out_data_cells{1,22} = 'B&H comps'; out_data_cells{1,23} = '? signif ?'; out_data_cells{1,24} = 'B&H rank'; for ii = 1:n out_data_cells{1+ii,1} = b{ii+1,1}; out_data_cells{1+ii,2} = b{ii+1,2}; for jj = 1:22 out_data_cells{1+ii,2+jj} = out_data(ii,jj); end end xlswrite([strain1 '_' strain2 '_ANOVA_out_data.xls'],out_data_cells)