My Computational Journal Summer 2012
May 14, 2012
Today code for a simple ANOVA and a comparison of two strains was tested to see if the results produced by our adviser could be reproduced. The output without modifying the code for the comparison of two strains using wildtype and dGLN3 for the comparison could be reproduced. A few difficulties have been faced when trying to modify the simple ANOVA code. It was attempted to match the results produced for dGLN3 using the two strain comparison code with the results of the simple ANOVA code with the input specified as the log fold concentrations for the dGLN3 data. After being initially unsuccessful, the code was revisited and looked at more closely. It was found that two matrices in the matrix division had an unequal number of rows (matrix X and matrix Y). The indices (which specify columns of the input file) were modified after which matrix X had the same number of rows as matrix Y. The following code was used:
ind15 = ind15-indx(1)+1; ind30 = ind30-indx(1)+1; ind60 = ind60-indx(1)+1; ind90 = ind90-indx(1)+1; ind120 = ind120-indx(1)+1;
However, the results from the simple ANOVA for dGLN3 still did not match the results for dGLN3 from the comparison of two strains code.
Katrina Sherbina 21:31, 14 May 2012 (EDT)
May 15, 2012
A separate set of indices were made to designate the rows of the X matrices for the reduced and the full model. This separate set of indices (one for each timpeoint) removed rows of zeros that were previously added when substituting the select 0's for 1's to create the two X matrices. As a result, it was possible to call upon the original indices to extract the log fold changes corresponding to a specific deletion to create the Y matrix. After these corrections, the two strain comparisons were performed between wildtype and each one of the deletion strains.
Further modifications were made to the two strain comparison code to be able to make a comparison of all of the strains simulatenously. To do so, indices were added to designate the columns in the input corresponding to each of the deletion strains. A separate set of indices (one index per timepoint) for each strain as in the case of the two strain comparison. The parameters p were increased to 25 (five timepoints for each deletion strain) and the constraints q were increased to 20 to yield an X and Xh matrices with the appropriate numbers of columns. Changes were made to the out_data(ii,[number]) lines to account for the increase in the number of strains being compared. In the output, the data for each of the timepoints for each of the deletion strains matched the corresponding data in the two strain comparisons.
The next task is figure out how to handle missing data. Since both GCAT and Ontario chips were used, log fold change concentrations are missing for some genes for some timepoints for the wildtype (these cells show up as NaN). One possibility is to modify the X matrices so that they take into account the NaN's for each gene. On this note, a for loop was begun to try to exclude timepoints for a gene for which there is an NaN.
Katrina Sherbina 20:38, 15 May 2012 (EDT)
May 16, 2012
NaN's were removed from the Y matrix for each gene by using the command YY = Y(~isnan(Y(:,1)),1). This truncated the Y matrix from 43 rows to 34 rows. However, in doing so, it was not possible to solve for beta (the solution to the linear system XB=Y). Another method to deal with the NaN's was to replace them with a very large number using the command Y(isnan(Y(:,1)),1)=1e6 in order to be able to recognize the timepoints and flasks for which there were NaN's in the output. However, this messed up the B&H corrections.
The next step will build on trucating the Y matrix by getting rid of the rows with the NaN's altogether. To keep this step, the X and Xh matrix pairs must be formulated so that they are unique to each gene. A possibilty is to use a for loop to define each of the indices for each gene by the columns that do not contain an NaN for that gene. The Y matrix calculations would have to be within this loop.
Katrina Sherbina 21:33, 16 May 2012 (EDT)
May 17, 2012
A for loop was experimented with to generate different X and Xh matrix pairs for each gene to take into account those genes that have an NaN for a a flask for a timepoint in the wildtype. This loop contained a majority of the same coding that was used to set up the indices in the original two strain comparison code. The different lines of code are
for ii=1:n I = find(~isnan(a(ii,1:23))); ind15 = I(find(I>=1&I<=4)); ind30 = I(find(I>=5&I<=9)); ind60 = I(find(I>=10&I<=13)); ind90 = I(find(I>=14&I<=18)); ind120 = I(find(I>=19&I<=23));
indX = find(indx); end
These new lines of code retain the original column numbers of matrix a (the matrix containing the log fold change data) in the indices. The NaN's were then also removed from the corresponding Y vectors. At first, the loop to compute the Y matrix was put into the loop to compute the X and Xh pairs. However, the code ran for too long at that point. Then the Y loop was separated from the loop to compute the X and Xh matrices. An output was generated. However, for any of the genes that had at least one NaN, the entire row for that gene was blank. In dissecting this new code several problems were found. First, the X matrices for genes that had NaN's were not being computed correctly. Since the indices still had the original column numbers of matrix a, the ones were not put into the X matrix in the correct places. Briefly, cell arrays were looked at as a solution instead of numerical arrays because there can be blank cells in a cell array but numerical arrays do not contain blanks. However, this potential solution proved to complicate the calculations that need to be made.
May 18, 2012
(This entry is being written on May 20th regarding the work that was done on May 18th.)
The first issue that was tackled was the improper formation of the X and Xh matrices. For any of the genes that had at least one NaN, the 1's were put in the wrong positions in the X and Xh matrices. This is because the indices that were used to replace the 0's in the X and Xh matrices with 1 did not start at 1. This was necessary to later exlude the cells that had NaN's in the Y matrix. Rather than getting rid of the method to select the cells that do not have NaN's, the indices that were used to replace the 0's with 1's in the X and Xh matrices were renumbered if indx (the matrix of the indices for each time point) was less than 23 (the maximum number of columns correponding to the wildtype if none of the cells have an NaN for that gene). The following added bit of code was put into the for loop for the X and Xh matrices after the indx, indX, i1, and i2 lines of code:
if indx<23 ind15 = find(ind15); ind30 = ind30-ind15(end)+1; ind60 = ind60-ind30(1)-1; ind90 = ind90-ind60(1); ind120 = ind120-ind90(1); end
However, this may have fixed the X and Xh matrices. However, in the final output, the rows corresponding to the genes with NaN's were still completely blank.
At this point, the wrong direction was taken to find a solution to this. Some experimenting was done to produce X and Xh matrices all of which had 43 rows but some of which had something other than a 0 or 1 in the cell to designate the NaN in the original a matrix. Along this line, the NaN's were kept in the Y matrix. The goal was to be able to somehow ignore the NaN's then when calculating beta and betah.
It was considered to keep the NaN's because it was thought that the problems with the output had to do with some X, Xh, and Y matrices having less than 43 rows while others had 43 rows. Then looking back at how linear systems of equations are solved from what I learned in my linear algebra class, I realized that the number of rows could not be the reason why the some rows of genes in the output were completely blank. Even though the number of rows changed, the number of columns corresponding to the different type points never did. All the X matrices had 10 columns and all the Xh matrices had 5 columns. Therefore, there should be 5 betas being calculated (one for each timepoint) for the wildtype and for dGLN3 for all of the genes regardless of whether or not a gene had NaN's.
The next step in troubleshooting the code will be testing it with just a select number of genes that have NaN's for some timpeoints and flasks.
Katrina Sherbina 23:44, 20 May 2012 (EDT)
May 21, 2012
The two strain comparison code was tested with an output that contained only seven genes (six of which had an NaN). At this point, there were no blank rows. To fix the blank rows that appear for genes with NaN's, the computation for the y loop was put in the same loop as the computations for the X and Xh matrices. This ensured that when the betas and betahswere being caluclated that the X and Xh matrices changed accordingly (depending on the gene for which the computation was being done). Previously, the code was set up so that he betas and betahs were calculated with the last X and Xh matrix that was computed in the loop.
Several changes were made to this code to compare all strains (wildtype, dCIN5, dGLN3, dHMO1, and dZAP1). First, indices were created for dCIN5, dGLN3, dHMO1, and dZAP1 (the two strain comparison code only had matrices for the widlytpe and dGLN3). The indices for the deletion strains were constructed for each timepoint similar to how they were constructed for dGLN3 making sure to specifiy the appropriate columns in the original data for each strain.
So that the X and Xh matrices were of the correct size and had 1's and 0's positions in the correct columns and rows, the indices for each timepoint for each strain were adjusted in the for loop.
i3 = ind1202m(end); i4 = indx3(1);
ind153m = ind153-i4+1+i3; ind303m = ind303-i4+1+i3; ind603m = ind603-i4+1+i3; ind903m = ind903-i4+1+i3; ind1203m = ind1203-i4+1+i3;
i5 = ind1203m(end); i6 = indx4(1);
ind154m = ind154-i6+1+i5; ind304m = ind304-i6+1+i5; ind604m = ind604-i6+1+i5; ind904m = ind904-i6+1+i5; ind1204m = ind1204-i6+1+i5;
i7 = ind1204m(end); i8 = indx5(1);
ind155m = ind155-i8+1+i7; ind305m = ind305-i8+1+i7; ind605m = ind605-i8+1+i7; ind905m = ind905-i8+1+i7; ind1205m = ind1205-i8+1+i7;
In addition, the the calculations for t, t2, and N were expanded to include all of the indices for all of the strains. The X and Xh matrix for genes with no NaN's had dimensions 103x25 and 103x5, respectively. The X and Xh matrices for the genes with NaN's had dimensions 94x25 and 94x5, respectively.
The coputation for the Y matrix was also altered to include all of the strains:
Y = [a(ii,indx)';a(ii,indx2)';a(ii,indx3)';a(ii,indx4)';a(ii,indx5)'];
The lines of code corresponding to the dimensions of the out_data were also altered to include all of the betas and betahs.
In addition, a few plots from the all strain comparison of the time versus expression genes with both low and high p values were saved. To do so, the if statement for the plots was removed. The save statement used to save each plot generated separately was
saveas(gcf,['Fig' num2str(ii) '.jpg'])
It would be beneficial to change this line of code so that the name of the figure corresponding to the gene for which the expression was being plotted.
Also, an avi file was made of the plots of time versus expression for genes with a p value of less than 0.05 and then those with less than 0.04 (this p value was observed to be the cut off for the B&H corrected p values in the final output for the all strian comparison). The plots for the genes with a p value less than 0.04 have to be redone because a banner that popped up in the figure window was also recorded (it blocks the gene name).
Before the for loop, the following line of code was added
aviobj = avifile('allstrain.avi','fps',15,'compression','Indeo3');
Some fiddling may still need to be done with the number of frames per second.
The following lines of code were added at the end of the for loop (after the if statement for to generate the plots):
%To get the handles from the figure open. Saves the entire figure window that is open. frame = getframe(gcf); %Adds the currently open plot to the avi file. aviobj = addframe(aviobj,frame);
Right after the for loop, a command line was added to close the avi file:
aviobj = close(aviobj);
The avi file had to be opened in QuickTime player. It would not open in windows media player when either the compression format was Indeo3 or Indeo5.