# My Computational Journal Summer 2012

### From OpenWetWare

(→June 12, 2012: Corrected some logical and grammatical errors.) |
Current revision (18:33, 8 October 2012) (view source)m (→July 12, 2012: Corrected a grammatical error.) |
||

Line 1,017: | Line 1,017: | ||

Before the 20,000 iteration reverse stochastic simulation could finish, the entire simulation needed to be restarted setting the fix_P to 1 (it was 0) so that the prodution rates are not estimated. | Before the 20,000 iteration reverse stochastic simulation could finish, the entire simulation needed to be restarted setting the fix_P to 1 (it was 0) so that the prodution rates are not estimated. | ||

- | While the simulation was running during the work day (it has not finished yet), we performed within array normalization and between array normalization for S. paradoxus microarray data gathered during a cold shock experiment. In looking over the normalized data (after both normalization procedures), we noticed some very drastic log fold change values (ex. around -5 and downwards to -15). As this caused us some concern, we made boxplots of the data before normalization, after within array normalization, and between array normalization. The log fold change values after the between array normalization for t120 flask 1 (there was only one flask for t120 for analysis) were mostly negative. There is a very long tail for the box plot below 0 reaching all the way to -15. We then decided to also produce MA plots for before normalization and after within array normalization to corroborate what we saw with t120 flask 1. For t120 flask 1, there was a prominent tail with a negative slope in the before and after within array normalization MA plots. The log fold change values in this tail had increasingly negative log fold changes and increasing intensity values. What was reassuring to see was the robustness of loess normalization as this procedure did not try to somehow correct this negative tail by improperly shifting the values in the tail. For the other chips, the normalization seemed to work well getting rid of hooks. After looking at the graphs, we went back to the pictures of the microarray slides with Dr. Dahlquist. We | + | While the simulation was running during the work day (it has not finished yet), we performed within array normalization and between array normalization for S. paradoxus microarray data gathered during a cold shock experiment. In looking over the normalized data (after both normalization procedures), we noticed some very drastic log fold change values (ex. around -5 and downwards to -15). As this caused us some concern, we made boxplots of the data before normalization, after within array normalization, and between array normalization. The log fold change values after the between array normalization for t120 flask 1 (there was only one flask for t120 for analysis) were mostly negative. There is a very long tail for the box plot below 0 reaching all the way to -15. We then decided to also produce MA plots for before normalization and after within array normalization to corroborate what we saw with t120 flask 1. For t120 flask 1, there was a prominent tail with a negative slope in the before and after within array normalization MA plots. The log fold change values in this tail had increasingly negative log fold changes and increasing intensity values. What was reassuring to see was the robustness of loess normalization as this procedure did not try to somehow correct this negative tail by improperly shifting the values in the tail. For the other chips, the normalization seemed to work well getting rid of hooks. After looking at the graphs, we went back to the pictures of the microarray slides with Dr. Dahlquist. We saw that for the t120 flask 1 microarray, a majority of the spots were very green and there were some blocks that fluoresced very little and if so the spots were green. Tomorrow, we will try to extract from the MA plots which genes consist of the negative slope tail by selecting the cutoffs for the M values and A values where the tail seems to begin and finding the genes that have M values and A values within the cutoff. |

We also performed an ANOVA test with the normalized data to find which genes had significant differential expression during at least one time point. Over 1000 genes had significant differential expression according to the ANOVA (p value<0.05) but there were only 4 significantly differentially expressed genes after performing the Benjamini & Hochberg p value correction (p value<0.05). Looking at the graphs of the data and the full model from the ANOVA test for those 4 genes with B&H p values < 0.05, we saw that the t120 flask 1 data really skewed the model. The results of this ANOVA test are prone to suspicion however considering the normalized log fold changes for t120 flask 1 and the fact that there were at most 2 replicates for each timepoint. | We also performed an ANOVA test with the normalized data to find which genes had significant differential expression during at least one time point. Over 1000 genes had significant differential expression according to the ANOVA (p value<0.05) but there were only 4 significantly differentially expressed genes after performing the Benjamini & Hochberg p value correction (p value<0.05). Looking at the graphs of the data and the full model from the ANOVA test for those 4 genes with B&H p values < 0.05, we saw that the t120 flask 1 data really skewed the model. The results of this ANOVA test are prone to suspicion however considering the normalized log fold changes for t120 flask 1 and the fact that there were at most 2 replicates for each timepoint. |

## Current revision

## Contents |

## Week 1

### May 14, 2012

Today code for a simple ANOVA and a comparison of two strains was tested to see if the results produced by our adviser could be reproduced. The output without modifying the code for the comparison of two strains using wildtype and dGLN3 for the comparison could be reproduced. A few difficulties have been faced when trying to modify the simple ANOVA code. It was attempted to match the results produced for dGLN3 using the two strain comparison code with the results of the simple ANOVA code with the input specified as the log fold concentrations for the dGLN3 data. After being initially unsuccessful, the code was revisited and looked at more closely. It was found that two matrices in the matrix division had an unequal number of rows (matrix X and matrix Y). The indices (which specify columns of the input file) were modified after which matrix X had the same number of rows as matrix Y. The following code was used:

ind15 = ind15-indx(1)+1; ind30 = ind30-indx(1)+1; ind60 = ind60-indx(1)+1; ind90 = ind90-indx(1)+1; ind120 = ind120-indx(1)+1;

However, the results from the simple ANOVA for dGLN3 still did not match the results for dGLN3 from the comparison of two strains code.

Katrina Sherbina 21:31, 14 May 2012 (EDT)

### May 15, 2012

A separate set of indices were made to designate the rows of the X matrices for the reduced and the full model. This separate set of indices (one for each timpeoint) removed rows of zeros that were previously added when substituting the select 0's for 1's to create the two X matrices. As a result, it was possible to call upon the original indices to extract the log fold changes corresponding to a specific deletion to create the Y matrix. After these corrections, the two strain comparisons were performed between wildtype and each one of the deletion strains.

Further modifications were made to the two strain comparison code to be able to make a comparison of all of the strains simulatenously. To do so, indices were added to designate the columns in the input corresponding to each of the deletion strains. A separate set of indices (one index per timepoint) for each strain as in the case of the two strain comparison. The parameters p were increased to 25 (five timepoints for each deletion strain) and the constraints q were increased to 20 to yield an X and Xh matrices with the appropriate numbers of columns. Changes were made to the out_data(ii,[number]) lines to account for the increase in the number of strains being compared. In the output, the data for each of the timepoints for each of the deletion strains matched the corresponding data in the two strain comparisons.

The next task is figure out how to handle missing data. Since both GCAT and Ontario chips were used, log fold change concentrations are missing for some genes for some timepoints for the wildtype (these cells show up as NaN). One possibility is to modify the X matrices so that they take into account the NaN's for each gene. On this note, a for loop was begun to try to exclude timepoints for a gene for which there is an NaN.

Katrina Sherbina 20:38, 15 May 2012 (EDT)

### May 16, 2012

NaN's were removed from the Y matrix for each gene by using the command YY = Y(~isnan(Y(:,1)),1). This truncated the Y matrix from 43 rows to 34 rows. However, in doing so, it was not possible to solve for beta (the solution to the linear system XB=Y). Another method to deal with the NaN's was to replace them with a very large number using the command Y(isnan(Y(:,1)),1)=1e6 in order to be able to recognize the timepoints and flasks for which there were NaN's in the output. However, this messed up the B&H corrections.

The next step will build on trucating the Y matrix by getting rid of the rows with the NaN's altogether. To keep this step, the X and Xh matrix pairs must be formulated so that they are unique to each gene. A possibilty is to use a for loop to define each of the indices for each gene by the columns that do not contain an NaN for that gene. The Y matrix calculations would have to be within this loop.

Katrina Sherbina 21:33, 16 May 2012 (EDT)

### May 17, 2012

A for loop was experimented with to generate different X and Xh matrix pairs for each gene to take into account those genes that have an NaN for a a flask for a timepoint in the wildtype. This loop contained a majority of the same coding that was used to set up the indices in the original two strain comparison code. The different lines of code are

for ii=1:n I = find(~isnan(a(ii,1:23))); ind15 = I(find(I>=1&I<=4)); ind30 = I(find(I>=5&I<=9)); ind60 = I(find(I>=10&I<=13)); ind90 = I(find(I>=14&I<=18)); ind120 = I(find(I>=19&I<=23));

indX = find(indx); end

These new lines of code retain the original column numbers of matrix a (the matrix containing the log fold change data) in the indices. The NaN's were then also removed from the corresponding Y vectors. At first, the loop to compute the Y matrix was put into the loop to compute the X and Xh pairs. However, the code ran for too long at that point. Then the Y loop was separated from the loop to compute the X and Xh matrices. An output was generated. However, for any of the genes that had at least one NaN, the entire row for that gene was blank. In dissecting this new code several problems were found. First, the X matrices for genes that had NaN's were not being computed correctly. Since the indices still had the original column numbers of matrix a, the ones were not put into the X matrix in the correct places. Briefly, cell arrays were looked at as a solution instead of numerical arrays because there can be blank cells in a cell array but numerical arrays do not contain blanks. However, this potential solution proved to complicate the calculations that need to be made.

### May 18, 2012

(This entry is being written on May 20th regarding the work that was done on May 18th.)

The first issue that was tackled was the improper formation of the X and Xh matrices. For any of the genes that had at least one NaN, the 1's were put in the wrong positions in the X and Xh matrices. This is because the indices that were used to replace the 0's in the X and Xh matrices with 1 did not start at 1. This was necessary to later exlude the cells that had NaN's in the Y matrix. Rather than getting rid of the method to select the cells that do not have NaN's, the indices that were used to replace the 0's with 1's in the X and Xh matrices were renumbered if indx (the matrix of the indices for each time point) was less than 23 (the maximum number of columns correponding to the wildtype if none of the cells have an NaN for that gene). The following added bit of code was put into the for loop for the X and Xh matrices after the indx, indX, i1, and i2 lines of code:

if indx<23 ind15 = find(ind15); ind30 = ind30-ind15(end)+1; ind60 = ind60-ind30(1)-1; ind90 = ind90-ind60(1); ind120 = ind120-ind90(1); end

However, this may have fixed the X and Xh matrices. However, in the final output, the rows corresponding to the genes with NaN's were still completely blank.

At this point, the wrong direction was taken to find a solution to this. Some experimenting was done to produce X and Xh matrices all of which had 43 rows but some of which had something other than a 0 or 1 in the cell to designate the NaN in the original a matrix. Along this line, the NaN's were kept in the Y matrix. The goal was to be able to somehow ignore the NaN's then when calculating beta and betah.

It was considered to keep the NaN's because it was thought that the problems with the output had to do with some X, Xh, and Y matrices having less than 43 rows while others had 43 rows. Then looking back at how linear systems of equations are solved from what I learned in my linear algebra class, I realized that the number of rows could not be the reason why the some rows of genes in the output were completely blank. Even though the number of rows changed, the number of columns corresponding to the different type points never did. All the X matrices had 10 columns and all the Xh matrices had 5 columns. Therefore, there should be 5 betas being calculated (one for each timepoint) for the wildtype and for dGLN3 for all of the genes regardless of whether or not a gene had NaN's.

The next step in troubleshooting the code will be testing it with just a select number of genes that have NaN's for some timpeoints and flasks.

Katrina Sherbina 23:44, 20 May 2012 (EDT)

## Week 2

### May 21, 2012

The two strain comparison code was tested with an output that contained only seven genes (six of which had an NaN). At this point, there were no blank rows. To fix the blank rows that appear for genes with NaN's, the computation for the y loop was put in the same loop as the computations for the X and Xh matrices. This ensured that when the betas and betahswere being caluclated that the X and Xh matrices changed accordingly (depending on the gene for which the computation was being done). Previously, the code was set up so that he betas and betahs were calculated with the last X and Xh matrix that was computed in the loop.

Several changes were made to this code to compare all strains (wildtype, dCIN5, dGLN3, dHMO1, and dZAP1). First, indices were created for dCIN5, dGLN3, dHMO1, and dZAP1 (the two strain comparison code only had matrices for the widlytpe and dGLN3). The indices for the deletion strains were constructed for each timepoint similar to how they were constructed for dGLN3 making sure to specifiy the appropriate columns in the original data for each strain.

So that the X and Xh matrices were of the correct size and had 1's and 0's positions in the correct columns and rows, the indices for each timepoint for each strain were adjusted in the for loop.

i3 = ind1202m(end); i4 = indx3(1);

ind153m = ind153-i4+1+i3; ind303m = ind303-i4+1+i3; ind603m = ind603-i4+1+i3; ind903m = ind903-i4+1+i3; ind1203m = ind1203-i4+1+i3;

i5 = ind1203m(end); i6 = indx4(1);

ind154m = ind154-i6+1+i5; ind304m = ind304-i6+1+i5; ind604m = ind604-i6+1+i5; ind904m = ind904-i6+1+i5; ind1204m = ind1204-i6+1+i5;

i7 = ind1204m(end); i8 = indx5(1);

ind155m = ind155-i8+1+i7; ind305m = ind305-i8+1+i7; ind605m = ind605-i8+1+i7; ind905m = ind905-i8+1+i7; ind1205m = ind1205-i8+1+i7;

In addition, the the calculations for t, t2, and N were expanded to include all of the indices for all of the strains. The X and Xh matrix for genes with no NaN's had dimensions 103x25 and 103x5, respectively. The X and Xh matrices for the genes with NaN's had dimensions 94x25 and 94x5, respectively.

The coputation for the Y matrix was also altered to include all of the strains:

Y = [a(ii,indx)';a(ii,indx2)';a(ii,indx3)';a(ii,indx4)';a(ii,indx5)'];

The lines of code corresponding to the dimensions of the out_data were also altered to include all of the betas and betahs.

In addition, a few plots from the all strain comparison of the time versus expression genes with both low and high p values were saved. To do so, the if statement for the plots was removed. The save statement used to save each plot generated separately was

saveas(gcf,['Fig' num2str(ii) '.jpg'])

It would be beneficial to change this line of code so that the name of the figure corresponding to the gene for which the expression was being plotted.

Also, an avi file was made of the plots of time versus expression for genes with a p value of less than 0.05 and then those with less than 0.04 (this p value was observed to be the cut off for the B&H corrected p values in the final output for the all strian comparison). The plots for the genes with a p value less than 0.04 have to be redone because a banner that popped up in the figure window was also recorded (it blocks the gene name).

Before the for loop, the following line of code was added

aviobj = avifile('allstrain.avi','fps',15,'compression','Indeo3');

Some fiddling may still need to be done with the number of frames per second.

The following lines of code were added at the end of the for loop (after the if statement for to generate the plots):

%To get the handles from the figure open. Saves the entire figure window that is open. frame = getframe(gcf); %Adds the currently open plot to the avi file. aviobj = addframe(aviobj,frame);

Right after the for loop, a command line was added to close the avi file:

aviobj = close(aviobj);

The avi file had to be opened in QuickTime player. It would not open in windows media player when either the compression format was Indeo3 or Indeo5.

### May 22, 2012

Some changes were made to the code to write an avi file. However, despite all of the revisions, the avi file was not being made properly. For the first few seconds the plots that showed up for some of the genes looked nothing like they should. To be able to even play the file, the compression had to be set to 'None'. This is what probably caused the file to be 4 GB. They appearance of the strange graphs at the beginning of the movie may be attributed to the large size of the file.

Initially, the p values showing up in the plots in the avi file did not match the p values for the corresponding genes in the output. However, this error was fixed by adjusting the if statement in the loop that creates the plots for genes with a p value of less than 0.05.

if pval<0.05 figure(1),plot(t,Y,'bo'),hold on,plot(t,Ym,'r+','LineWidth',3),plot(t,Yh,'k^','LineWidth',3), title([b{ii+1,1} ' W = ',num2str(W),' pval = ',num2str(pval)]),... xlabel('time, minutes'),ylabel('log fold change'),legend('data','model'),drawnow hold off nsig = nsig + 1; [ii,nsig] end

The rest of today was spent dissecting the code we currently have for the deterministic model. The goal is to split the code for the general parameter estimation function into multiple parts, namely, reading the files into matlab, the least squares error function, plotting the graphs, and writing the output.

Katrina Sherbina 20:38, 22 May 2012 (EDT)

### May 23, 2012

The lines of code to create an avi file from the plots from the all strains comparison were fixed (thanks to Dr. Fitzpatrick). The changes include the following line of code before the for loop:

figure(1) set(gcf, 'DoubleBuffer','on'); yourMovieFileName = ['all_strain_comparison_p_value_less_than_0.05_Cinepak.avi'];

The frame = getframe(gcf) and aviobj = addframe(aviobj,frame) inside the if statement to generate the plots was removed. The aviobj=close(aviobj) line of code immediately after the for loop was replaced with

movie2avi(movieArray,yourMovieFileName,'compression','Cinepak','fps',5,'quality',100);

This time the movie generated with Cinepak compression code be played on Windows Media Player on an XP operating system. The size of this movie was significantly smaller than the size of the previous movie made without compression that could play with Windows Media Player.

There are three different scripts for the statistical analysis in MATLAB:

- all_strains_compare_movie.m to do a comparison for all strains
- wildtype_deletion_strain_compare.m to do a comparison between the wildtype and another strain
- two_deletion_strain_compare.m to do a comparison between two deletion strains

The two strain comparison code was divided into two (the scripts in the last two bullets above) because the for loop designed to remove NaN's is not necessary for the deletion strain data. The principle differences between these two scripts is the calculations to renumber the indices. The following is the part of the script for the wildtype to deletion strain comparisons that contains differences with the other two strain comparison script:

%These indices are changed depending on what deletion strain is being analyzed ind152 = [33,34,35,36]; ind302 = [37,38,39,40]; ind602 = [41,42,43,44]; ind902 = [45,46,47,48]; ind1202 = [49,50,51,52];

p = 10; q = 5;

alpha = 0.05; % significance level for the tests

n152 = length(ind152); n302 = length(ind302); n602 = length(ind602); n902 = length(ind902); n1202 = length(ind1202);

indx2 = [ind152,ind302,ind602,ind902,ind1202];

n = length(a(:,1));

out_data = zeros(n,12); nsig = 0;

for ii=1:n I = find(~isnan(a(ii,1:23))); ind15 = I(find(I>=1&I<=4)); ind30 = I(find(I>=5&I<=9)); ind60 = I(find(I>=10&I<=13)); ind90 = I(find(I>=14&I<=18)); ind120 = I(find(I>=19&I<=23)); indx = [ind15,ind30,ind60,ind90,ind120]; indX = find(indx); if length(indx)<23 ind15 = find(ind15); ind30 = ind30 - ind15(end); ind60 = ind60 - ind30(1) - 1; ind90 = ind90 - ind60(1); ind120 = ind120-ind90(1); end n15 = length(ind15); n30 = length(ind30); n60 = length(ind60); n90 = length(ind90); n120 = length(ind120); i1 = indX(end); i2 = indx2(1);

ind152m = ind152-i2+1+i1; ind302m = ind302-i2+1+i1; ind602m = ind602-i2+1+i1; ind902m = ind902-i2+1+i1; ind1202m = ind1202-i2+1+i1; t = [ones(n15,1)*15;ones(n30,1)*30;ones(n60,1)*60;ones(n90,1)*90;ones(n120,1)*120;... ones(n152,1)*15;ones(n302,1)*30;ones(n602,1)*60;ones(n902,1)*90;ones(n1202,1)*120];

t2 = [min(t):5:max(t)]'; N = length(indx)+length(indx2);

X = zeros(N,p); Xh = zeros(N,p-q);

Xh(ind15,1) = 1; Xh(ind30,2) = 1; Xh(ind60,3) = 1; Xh(ind90,4) = 1; Xh(ind120,5) = 1; Xh(ind152m,1) = 1; Xh(ind302m,2) = 1; Xh(ind602m,3) = 1; Xh(ind902m,4) = 1; Xh(ind1202m,5) = 1;

X(ind15,1) = 1; X(ind30,2) = 1; X(ind60,3) = 1; X(ind90,4) = 1; X(ind120,5) = 1; X(ind152m,6) = 1; X(ind302m,7) = 1; X(ind602m,8) = 1; X(ind902m,9) = 1; X(ind1202m,10) = 1; end

The calculations for the Y matrices, beta's, betah's, fstats, pvals, and plots are within the for loop. These are the same calculations in the deletion strain to deletion strain comparison script. These caluclations are the only ones in the for loop for the deletion strain to deletion strain comparison script. The following is a part of the deletion strain to deletion strain comparison script that contains differences in comparison to the previous script.

%These indices are changed depending on what deletion strain is being analyzed ind15 = [91,92,93,94]; ind30 = [95,96,97,98]; ind60 = [99,100,101,102]; ind90 = [103,104,105,106]; ind120 = [107,108,109,110];

ind152 = [120,121,122,123]; ind302 = [124,125,126,127]; ind602 = [128,129,130,131]; ind902 = [132,133,134,135]; ind1202 = [136,137,138,139];

p = 10; q = 5;

alpha = 0.05; % significance level for the tests

n15 = length(ind15); n30 = length(ind30); n60 = length(ind60); n90 = length(ind90); n120 = length(ind120); n152 = length(ind152); n302 = length(ind302); n602 = length(ind602); n902 = length(ind902); n1202 = length(ind1202);

indx = [ind15,ind30,ind60,ind90,ind120]; indX = find(indx); indx2 = [ind152,ind302,ind602,ind902,ind1202];

i0 = ind15(1); i1 = indX(end); i2 = indx2(1);

ind15m = find(ind15); ind30m = ind30-i0+1; ind60m = ind60-i0+1; ind90m = ind90-i0+1; ind120m = ind120-i0+1;

ind152m = ind152 -i2+1+i1; ind302m = ind302 -i2+1+i1; ind602m = ind602 -i2+1+i1; ind902m = ind902 -i2+1+i1; ind1202m = ind1202 -i2+1+i1;

t = [ones(n15,1)*15;ones(n30,1)*30;ones(n60,1)*60;ones(n90,1)*90;ones(n120,1)*120;... ones(n152,1)*15;ones(n302,1)*30;ones(n602,1)*60;ones(n902,1)*90;ones(n1202,1)*120];

t2 = [min(t):5:max(t)]';

N = length(indx)+length(indx2);

X = zeros(N,p); Xh = zeros(N,p-q);

Xh(ind15m,1) = 1; Xh(ind30m,2) = 1; Xh(ind60m,3) = 1; Xh(ind90m,4) = 1; Xh(ind120m,5) = 1; Xh(ind152m,1) = 1; Xh(ind302m,2) = 1; Xh(ind602m,3) = 1; Xh(ind902m,4) = 1; Xh(ind1202m,5) = 1;

X(ind15m,1) = 1; X(ind30m,2) = 1; X(ind60m,3) = 1; X(ind90m,4) = 1; X(ind120m,5) = 1; X(ind152m,6) = 1; X(ind302m,7) = 1; X(ind602m,8) = 1; X(ind902m,9) = 1; X(ind1202m,10) = 1;

n = length(a(:,1));

out_data = zeros(n,12);

nsig = 0;

The dimensions of the X and Xh matrices in this case stay the same for all genes.

The rest of today was spent continuing to make changes to the MATLAB script for the deterministic model. Nick succeeded in separating the general parameter estimation function into four parts:

- Input and labeling of the master sheet containing the network, log2 fold change concentrations for each strain, optimization parameters, production rates, degradation rates, network b's, and simulation times.
- The call for the general least squares error and optimization
- Generating graphs
- Creating the output

The input of the mastersheet in this script is different from that in the previous general parameter estimation function. In the actual input sheet, the log2 concentrations were divided into several sheets (one for each deletion strain) and inputted sheet by sheet into MATLAB. In addition, the log2 concentrations for each flask for each timepoint were inputted rather than the average log2 concentrations for each timepoint. Indicators for the timepoints, strain name, and strain index were added to the optimization parameters sheet. This required modifying the for loop for inputting the optimization parameters.

The next steps with the deterministic model include making some changes to the general least squares error script where the error = ((log2(x1) - d1).^2) line of code needs to be modified to take into account the replicates in the log fold change data for each timepoint for each strain.

Katrina Sherbina 21:22, 23 May 2012 (EDT)

### May 24, 2012

In order to modify the error calculation in the general least squares error function (in MATLAB) to take into account the change from using average log fold changes for each timpepoint to using all of the data, several modifications were made to other scripts. (To make any of these changes, I first dissected the various functions to understand the various parts more.)

The log fold change data had to be imported separately for each strain (this modification was made earlier in the week):

[log2FC(1).d,TX1] = xlsread(input_file,'wt_log2FC'); [log2FC(2).d,TX1] = xlsread(input_file,'dCIN5_log2FC'); [log2FC(3).d,TX1] = xlsread(input_file,'dGLN3_log2FC'); [log2FC(4).d,TX1] = xlsread(input_file,'dHMO1_log2FC'); [log2FC(5).d,TX1] = xlsread(input_file,'dZAP1_log2FC');

Nevertheless, averages for each timepoint were later calculated for the purposes of later plotting standard deviations on the time versus log fold change plots.

The driver to detect active genes will be taken out. All of the genes in the network are considered active genes. The assignment of a number to each active gene was done with the detect active genes script. Now the active changes are "detected" in the matrices with the log fold change data:

active = find(log2FC(1).d(2:end,1));

This array is called for in the general network dynamic function. The ode45 function in matlab approximates a solution to the differential equation defined in the general network dynamic function.

Previously, the error was calculated using the model and the average log fold changes for each timepoint from the data. However, this error calculation needs to be changed so that the unaveraged log fold changes can be used. A preliminary equation that was experiment with was

myerror = ((log2(x1) - d1([1 5 10],:)).^2)+((log2(x1) - d1([2 6 11],:)).^2)+ ((log2(x1) - d1([3 7 12],:)).^2) +((log2(x1) - d1([4 8 13],:)).^2);

The problem here is that the equation does not take into account the last replicate for the t30 timepoint (because t30 has five replicates while the other timpeoints have four replicates).

Another concern is the actual data d and d1 that is being fed into the general least squares error function and other functions. For now, both d and d1 correspond to just the wildtype data. However, we will want to use the deletion strain data for the model as well.

Katrina Sherbina 21:29, 24 May 2012 (EDT)

### May 25, 2012

The functions written in MATLAB for the deterministic model were modified to include all of the data (not just the average log fold changes for each timepoint) in the various calculations.

Modifications were made in the parameters script and the error calculation (now called errormat) in the general least quares error function. The following lines of code were added to the script:

% % The first row of the log2FC data indicating all of the replicate timepoints td = (log2FC(1).d(1,:)); % % Finds the indices in td that correspond to each timepoint in tspan. tindex(1).indx = find(td == tspan(1)); tindex(2).indx = find(td == tspan(2)); tindex(3).indx = find(td == tspan(3)); n_times = length(tspan); % % Log2FC data for the wildtype d = (log2FC(1).d(2:end,:))'; % % The average log2FC for each timepoint for each gene. means = zeros(length(log2FC(1).d(2:end,1)),n_times); for iT = 1:n_times means(:,iT) = mean(log2FC(1).d(2:end,tindex(iT).indx(1):tindex(iT).indx(end)),2); end

% % Standard deviations for each timepoint for each gene. Necessary for the graphs % % that will be produced. sigmas = zeros(length(log2FC(1).d(2:end,1)),n_times); for iT = 1:n_times sigmas(:,iT) = std(log2FC(1).d(2:end,tindex(iT).indx(1):tindex(iT).indx(end)),0,2); end

It is necessary to still find the standard deviations and the means to be able to produce graphs for each gene of the log fold change versus time with the upper and lower error bounds for each timepoint.

The error calculation in the general least squares error function was modified with the following lines of code:

errormat = 0; for iT = 1:length(tspan) for iF = 1:length(tindex(iT).indx) errormat = errormat+(log2(x1(iT,:))-d(tindex(iT).indx(iF),:)).^2; end end % This is the sum of all the errors L = sum(errormat(:));

In the general least squares error function, a modification was also made to the first figure produced (with the counter and lse_out). At first, all of the data (not the average log fold changes) was being used to specify the second subplot in the figure which fits the model to the data. When viewing the figure, the model was being fit to only three of the thirteen sets of data points in the second subplot. There were thriteen sets of data points (in columns) since there are thirteen columns in the data due to the replicates in each timepoint. Instead of using all of the data, the code to produce the subplots was modified to instead use the average log fold change for each timepoint:

if(100*round(counter/100)==counter) figure(1),subplot(211),plot(theta,'d'), title(['counter = ' num2str(counter) ' lseout = ' num2str(lse_out)]) subplot(212),plot(means','*'),hold on,plot(log2(x1)), hold off,pause(.1) end

The script specifying the plots for each gene of log fold change versus time was modified. The if else statement having to do with cssigflg was removed. This designation had to do with the concentration sigmas (standard deviations) inputted into MATLAB when they were still in the input excel spreadsheet. However, since the standard deviations are now being calculated in MATLAB, there is no need to determine the kind of the plot to be produced depending on the value of cssigflg. The following is the new code for the plots:

if igraph == 1 concd = 2.^d(:,active); error_up = (means(active,:)+1.96*sigmas); error_dn = (means(active,:)-1.96*sigmas); for ii=1:n_active_genes figure(ii+2),hold on plot(time,log2(model(:,ii)),td,d(:,active(ii)),'o',tspan,error_up(ii,:),'r+',tspan, error_dn(ii,:),'r+','LineWidth',3),axis([tmin tmax -3 3]) title(TX1{1+(ii),2},'FontSize',16) xlabel('Time (minutes)','FontSize',16) ylabel('Expression (log2 fold change)','FontSize',16) end end

The global statement at the top of each script or function was also changed due to the aforementioned changes:

global A prorate degrate active inactive tspan d wts b alpha lse_out penalty_out counter parms parmnames tindex means

Further work will need ot be done on the plots. In the plots for MAL33 and ZAP1, the upper error bounds are not shown. The y axis may not to be adjusted to be larger than [-3,3]. Also, the model for HAP5, the model goes beyond the error bounds for the t60 timepoint.

In the output excel spreadsheet, there are no log2 optimized concentrations nor network optimized B's for FHL1, SKO1, or SWI6.

In addition to dealing with the aforementioned concerns, we need to explore how to include the deletion strain data in the model.

Katrina Sherbina 20:23, 25 May 2012 (EDT)

## Week 3

### May 29, 2012

Copies were made of all of the functions and scripts in order to keep the original scripts that worked last week when making modifications to these scripts today. The ending _strains was added to each m file.

A line of code was added to the estimation driver that indicates the sheet with the data of the deletion strain for which the script is being run at a given time. Since the model is to be found for each one of the strains, it makes it easier in the long run just to import the data for the deletion strain that the code will be run for.

This change was made in the process of spending the day changing the general network dynamics function to be able to run the model for each of the deletion strains. It was necessary to delete the information in the weights matrix and the degradation rates array that corresponded to the gene that was deleted in the deletion strain of interest. One of the lines of code added, finds the index of the gene deleted in a given strain by matching it to the sheet number in the input corresponding to the log fold change data for the deletion strain. To be able to incorporate this line, the variables num and indices were added to the global command line.

When running the entire code, the first error come up was a mismatch of dimesnions in the line WXB = -W*zz+B which computes the exponent in part of the differential equation that is a sigmoid function. This error was due to the fact that B (which corresponds to b) had 18 rows while the other two arrays had 21. This occurred because b was being recomputed in the general least squares error function. As a result, the recomputed b was renamed to bb.

However, another error occurred at line 43 in the LSE_strains script. The matrix dimensions do not agree. It's uncertain why these errors are occuring because after comparing each line of code Nick and I seem to be using the same scripts.

More troubleshooting will be done tomorrow to address the newest error.

Katrina Sherbina 20:57, 29 May 2012 (EDT)

### May 30. 2012

The errors that were encountered yesterday were fixed with some changes to the general network dynamic function and the calls to various functions. The reason why there was a mismatch in dimensions in the line WXB = -W*zz+B in the network dynamic function is that B was the same array as b. Intially, b was a 21x1 vector of 0's. However, in the general least squares error function b changed to a 18x1 vector of 0's because there 18 of the 21 total number of genes in the network regulate the expression of other genes. However, B needs to be a 21x1 vector of 0's to be able to compute WXB. For this to be possible, a line of code was added to the general network dynamics function instead of renaming the recalculated b in the general least squares error function:

B = zeros(n_genes,1);

As a result, n_genes was added to the global line.

The other error that occurred was solved by changing the function calls. There are currently two sets of scripts (one set is the original set that was composed last week and the other is the set for which the sheet number of the strain being worked on is specified in order to import the correct log fold change data). However, within the second aforementioned set of scripts, the LSE script and a few other functions called for functions from the first set of scripts. All function calls were changed to those corresponding to the scripts in the second set of script previously discussed.

However, with these changes, another problem arose. After running the fmincon function in the LSE script, the numbers composing the w1 array did not change as expected. Some of the numbers differed only in the ninth or so decimal place. This resulted in many of the genes having the same network optimized weights. This was not the case when the scripts from last week were executed.

To explore this problem, I started troubleshooting from the set of scripts from last week making changes that lead to the newest set of scripts one by one. After doing so, it seems that the problem that was observed arose when the general dynamic function is changed from what it was originally to the one constructed yesterday. This problem was evident in the first subplot of figure 1 plotting all of the theta values. Most of the 71 theta's stayed in practially the same line centered at 0. Hoever, what is odd is that the new general dynamic function produces the same dz vector as the old network dynamic function. In addition, the same x is found after applying the ode45 function to the new general network dynamic function as when applying the ode45 function to the original general network dynamic function. Despite this, the w1 vector (which is being recalculated by fmincon from the original vector corresponding to w0) is not being recalculated correctly by the fmincon function. This problem will be troubleshooted further tomorrow.

Towards the end of the day, the MATLAB scripts to do the ANOVA and p value corrections for the all strain and two strain comparisons were revisted. The ability to pause at a plot and to save a plot as a .jpg was added to the lines of code to produce the plots of the data and model for each gene having a p value of less than 0.05:

pause on if pval<0.05 figure(1),plot(t,Y,'bo'),hold on,plot(t,Ym,'r+','LineWidth',3),plot(t,Yh,'k^','LineWidth',3), title([b{ii+1,1} ' W = ',num2str(W),' pval = ',num2str(pval)]),... xlabel('time, minutes'),ylabel('log fold change'),legend('data','model'),drawnow hold off nsig = nsig + 1; [ii,nsig] drawnow %Allows you to pause at a graph for as long as you do not press a key on the keyboard. pause %Save as a .jpg saveas(figure(1),[b{ii+1,1} '.jpg']) %%Save as a .fig %saveas(figure(1),[b{ii+1,1} '.fig']) %%Captures the figure being currently displayed to put into the movie. %movieArray(nsig) = getframe(gcf); end

Katrina Sherbina 21:22, 30 May 2012 (EDT)

### May 31 2012

The errors arising with w1 vector were were troubleshooted. First, the code including the new general network dynamics function was run through the fmincon function for 5 different networks: an identity matrix, a matrix with two diagonals of 1, a matrix with a diagonal of 1 and alternating 1's and 0's above the diagonal, a matrix with a diagonal of 1's and the first row of 1's, and a matrix with a diagonal of 1's and the first column of 1's. The results were as follow.

-The output (w1 vector) produced with the network matrix with a diagonal of 1's and the first row of 1's had the most number of entries that were not identical (the first 21).

-With the exception of the network matrix just mentioned, the number of unique entries within the w1 vector for the most part increased as the number of edges in the network increased:

- identity matrix: 21 edges, 1 unique entry in w1 vector
- matrix with two diagonals of 1: 35 edges, 2 unique entries in w1 vector
- matrix with a diagonal of 1's and the first column of 1's: 41 edges, 2 unique entries in w1 vector
- matrix with a diagonal of 1 and alternating 1's and 0's above the diagonal: 121 edges, 11 unique entries

-Both the network matrix with a diagonal of 1's and the first row of 1's and the network matrix with a diagonal of 1's and the first column of 1's had 41 edges. However, there were 21 unique entries for the former network and only 11 unique entries for the later network in the w1 vector.

Then, I ran each of the matrices through the code but setting the thresholds to 1.00E-10 because one of the messages that popped up in the command window was that the optimizaiton was terminated because "the magnitude of directional derivative in search direction less than 2*options." I obtained pretty much the same results. Between some corresponding entries in the corresponding w1 vectors, there may have been a difference in the fifth decimal place.

I then ran each of these five networks using the original general network dynamics function. All of the entires in the w1 vectors for each of the networks were unique. At which point, the general network dynamics function was then revisted because that is most likely whyere the problem lay.

It was observed that the parms_used variable changes within the for loop in the original network dyanmics code but it stays at 0 in the new general network dynamics code. Therefore, the line parms_used = parms_used + nAii; was added inside the for loop to calculate W in the new general network dynamics function. In the output, the network optimized weights for each gene were different (they were the same for a majority of the genes when the w1 vector was not being properly computed as described before). However, the new w1 vector obtained did not match the one calculated using the old general network dynamic function. Then the output of the general network dynamics function was looked at more closely. It was observed that the computed dz was different with in the new general network dynamics code (with parms_used not fixed) in comparison to the dz computed by the original general network dynamics code. This will need to be looked into.

In addition, the plots generated in the two strain comparison and all strain comparison code will need to be changed to fit a polynomial through the data for each strain being compared and to include the standard name as well as the systematic name for each gene for which a plot is being produced.

Katrina Sherbina 21:40, 31 May 2012 (EDT)

### June 1, 2012

Yesterday, using two different functions to specify the general network dynamics were produced different results (different w1 vectors, different b's, and different network optimized weights). Looking at yesterday's output again today, it was observed that the w1 vector had 21 0's at the end. Also, all of the network b's outputted were 0. This indicates that the b's were not changing despite the intial parameters indicating that b is not fixed. The problem lay in the general network dynamics function. To stop the b's from being fixed, the following line was added before the for loop in the function:

B(i_forced) = b;

To check whether or not the two different general network dynamics function were producing the same results, the two outputs after solving the differential equation with ode45 for the original general network dynamics function and the new general network dynamics function were plotted on the same graph (x versus t). The outputs of both functions matched.

The entire new code for the deterministic model was rerun. This time a majority of the b's in the output were nonzero with the exception of three for genes that did not regulate other genes. This result was expected.

The original code for the deterministic model was rerun to compare with the output of the new code. For the most part, the b's and the optimized weights in the output were similar in magnitude and the same in signs to those in the output of the new code.

The rest of the day was spent trying to figure out how to run the model for each of the sets of deletion strain data. Rather than creating multiple for loops within each of the scripts and functions, one for loop was experimented with in the estimation driver that would run all of the scripts and functions for each of the deletion strain data sets. The first for loop tried with the input was

input_file_name = 'New_Input';

S = ['wt '; 'dcin5'; 'dgln3'; 'dhmo1'; 'dzap1']; strain = cellstr(S);

input_file = [input_file_name '.xls']; [type, sheets] = xlsfinfo(input_file);

Common_Parameters;

for ii = 1:length(sheets) for jj = 1:length(strain) if strcmp(sheets(ii),strain(jj)) sheet = strain(jj); num = ii; Strain_Parameters; LSE_strains; Graphs_strains; Output_strains; end end continue end

The original Parameters script was split into two:

- The Common_Parameters script imports all parameters from the input sheet that are the same for all strains.
- The Strain_Parameters script imports the parts of the input specific to the strain (like the log fold changes) and makes computations that are specific to the strain.

The Common_Parameters script was created to keep the code more efficient by not reinputting parts of the input sheet that are the same for all strains each time the for loop is executed.

This for loop would run through the whole deterministic model code for the widltype producing all of the specified graphs and the output sheet. However, instead of moving on to the next (deletion) strain data set, the loop was terminated and an error appeared for the line strcmp(sheets(ii),strain(jj)) indicating that the matrix dimensions did not match.

The for loop was then modified so that it created an array of sheet numbers corresponding to the sheet number of each of the strains indicated and an array of sheet names corresponding to the sheet name that contained the data for each of the strains. Then another for loop was created that ran all of the scripts and functions for each pair of sheet numbers and sheet names:

input_file_name = 'New_Input';

S = ['wt '; 'dcin5'; 'dgln3'; 'dhmo1'; 'dzap1']; strain = cellstr(S);

input_file = [input_file_name '.xls']; [type, sheets] = xlsfinfo(input_file);

Common_Parameters;

sheet_name = cell(length(strain),1); sheet_number = zeros(length(strain),1);

for ii = 1:length(sheets) for jj = 1:length(strain) if strcmp(sheets(ii),strain(jj)) sheet_name(jj,1) = strain(jj); sheet_number(jj,1) = ii; end end end

sheet_name = cellstr(sheet_name);

for ii = 1:length(sheet_name) sheet = char(sheet_name(ii,1)) num = sheet_number(ii,1) Strain_Parameters; LSE_strains; Graphs_strains; Output_strains; end

When running this script, the model was computed for each of the strain data sets. The output for the wildtype with this script was the same compared to the output for the wildtype that was computed earlier in the day with the code that just ran for one strain. The network optimized weights and the network b's were the same in both outputs.

The outputs for the deletion strain data differed from that of the wildtype as expected. The network b was 0 for the gene which was deleted to obtain a given output.

The graphs that were produced of log fold change versus time plotted the model for each of the strain data sets on the same plot for each gene. A next step would be to also separate these graphs so that a separate set of plots for each gene is produced for each strain data set.

In addition, the graphs from the statistical tests need to be revisted to fit a line through the data points. Rather than fitting a polynomial using Dr. Fitzpatrick's code, the symbol for the plot of the data set will be removed in the code so that a continuous line is produced through the data points.

Katrina Sherbina 21:22, 1 June 2012 (EDT)

## Week 4

### June 4, 2012

Today two problems became apparent with the code that I had left off on last Friday (June 1st):

- While the the b's corresponding to the deletion strains were being zeroed out, the rows and columns corresponding to the deletion strains in the weights matrix were not being zeroed out. There were non zero values for these genes supposedly deleted in the model in the network optimized weights.
- The process to go through the model with the deletion strain data treated the data for each strain separately when in revamping the code we should be using all of the data for all of the strains simulatneously.

To start fixing the second of the two problems, the second of the two for loops in the estimation driver was modified taken out of the driver and put into the general least squares error function. The data without the log fold changes (d) and the errormat calculation were put into this loop. Also, the call for the ode45 function was also put into this loop. In doing so, the LSE was being computed as a sum of the difference between the model and data squared over flask, times, genes, and strains whether than just over flasks, times, and genes as before.

Know I'm a little bit confused about the how the output should look. In making this change, it seems that the rows and columns corresponding to the genes deleted in each of the deletion strains are being zeroed out for all of the deletion strains at once. Therefore, I'm not sure how to have a separate output of the results of running through the model for each deletion strain.

Katrina Sherbina 20:39, 4 June 2012 (EDT)

### June 5, 2012

We spent the majority of the day in Dr. Dahlquist's lab scanning the microarrays that were prepared from the mRNA degradation experiment with GenePix Pro. We used the normalization procedure that was developed last summer to normalize all of the chips using both within array and between array normalization. See the protocal in the following link: Dahlquist:Microarray Data Processing in R. Nick and I both normalized the microarray data separately to make sure that we obtained in the same results. Both of our final outputs are the between array normalization were the same.

Katrina Sherbina 20:22, 5 June 2012 (EDT)

### June 6, 2012

Dr. Fitzpatrick asked me to work on a problem aside from the deterministic model involving the data from the 72-hr growth experiment performed a couple of weeks looking at the growth rate of S. cerevisiae wildtype and S. paradoxus in both 13 degrees C and 30 degrees Celsius. When Dr. Dahlquist crunched the numbers, she found that S. paradoxus grew better than S. cerevisiae in both temperatures. My task today was to use a linear model to determine whether or not the difference in the growth rates between S. cerevisiae and S. paradoxus are significantly different. The result of this analysis performed in SPSS using the linear regression tools was that the growth rate (the slope of the data consisting of a fixed component, a temperature dependent component, and a strain dependent component) was siginficant. The next step in analyzing the data is to determine whether or not it is significant that S. paradoxus grew faster in the cold that what would be expected considering its fast growth rate in the warm? I will be working on constructing a matrix for the linear model for this question tomorrow.

The other tasks of today included working more on the deterministic model. The code from Monday was redudant in that the tindices and d's (the log fold change data) were computed twice in both a for loop in the Parameters_all_strain_data code and the general_least_squares_error_sum_over_strains function. To remove this redudancy, the for loop in the Parameters script was redone as follows

for ii = 1:length(sheet_number) sheet = sheet_name(ii,1); num = sheet_number(ii,1); td = (log2FC(ii).d(1,:)); tindex(1).t(ii).indx = find(td == tspan(1)); tindex(2).t(ii).indx = find(td == tspan(2)); tindex(3).t(ii).indx = find(td == tspan(3)); d = log2FC(ii).d(2:end,:); data(ii).d = d; for iT = 1:n_times means(ii).m(:,iT) = mean(data(ii).d(:,tindex(iT).t(ii).indx(1):tindex(iT).t(ii).indx(end)),2); sigmas(ii).s(:,iT) = std(data(ii).d(:,tindex(iT).t(ii).indx(1):tindex(iT).t(ii).indx(end)),0,2); end end

Then tindex and data was added to the global statement. The tindex and d computations were then removed from the for loop in the general least squares error code.

The code were it stands right now uses all of the data for all strains in finding the model. The issue has been trying to write separate outputs and graphs depicting the model for each strain. Dr. Fizpatrick suggested making a for loop in the LSE line of code to output the model for each strain. This loop currently includes the model outputted when using the ode45 function to solve the differential equation defined by the generla network dynamics, the code to make the log FC graphs versus time for each gene, and calls to other scripts to make the output. Both the graphs and the output were crated after some tweaks to make sure that vectors and matrices were the right dimensions. However, each of the outputs produced (one for each strain) were exactly the same in the network b's and optimized weights. In fact, they all looked at the output for just the wildtype. In addition to working on the linear model described earlier, tomorrow will be spent also trying to correct the outputs. We will check to make sure that the problem is in the way the output is being written and not in how the model is being computed for each strain.

Katrina Sherbina 21:34, 6 June 2012 (EDT)

### June 7, 2012

Part of the morning was spent testing a second hypothesis regarding the 72 hour growth experiment using a linear model. Using SPSS, I tested whether or not the change in the growth rate between 13 degrees Celsius and 30 degrees Celsius conditions for S. paradoxus was significantly different from the change in the growth rate between these temperatures for S. cerevisiae. This involved modifying the matrix created yesterday to include 8 parameters (4 intercepts, 2 slopes for cold conditions, and 2 changes in slope between the cold growth rate and hot growth rate). Both a full model and a hypothesized model was computed in SPSS where the hypothesized model had one less parameter (one less change in growth rate) than the full model. The full model adn hypothesized model residual sum of squares was used to calculate and F statistic and p value. The result was that the change in growth rate for S. paradoxus as a result of shifting from hot to cold is significantly different from that of S. cerevisiae. SPSS was not able to compute the p value because it was so small. An external p value calculator was used but it could only compute that the p value was less than 0.0001.

I also went back to the tast of outputting graphs of the log fold changes versus time for each gene for different strain data sets. I tried Dr. Fitzpatrick's suggestion to get rid of the symbol in the designating the data poitns in plot line of code in the wildtype deletion strain compare and two deletion strain compare scripts. However, this produced bizarre graphs that made it impossible to tell what the data values were. Therefore, I went back to the polynomial fit script that Dr. Fitzpatrick used in the past and I modified it to display a polynomial fit for two strains (wildtype and deletion strain or deletion strain and deletion strain). Dr. Dahlquist wanted the title of the graph to include the standard name of the gene. However, at this point, the title contained the systematic and standard names of the gene, the f stats for the gene for both sets of strain data, and the p values for the gene for both sets of strain data. The title could not fit onto one line and threw off the dimensiosn of the plot. Tomorrow, I will look into breaking the title into two lines or maybe having a subtitle with the statistics to keep the systematic and standard name in the same line as the title.

I did not put much work inot the deterministic model today. Tomorrow, I will be getting the code that Nick finished that made it possible to get different outputs showing the different model for each strain. We will probably end up cleaning up the parameters script to take advantage of the strcmp command to import optimization parameters among other small changes.

Katrina Sherbina 21:40, 7 June 2012 (EDT)

### June 8, 2012

I spent a part of the morning working on the polynomial fit code so that the title would just include the systematic and standard names of the genes while the caption beneath the x axis label would have the F statistic and p value for each strain being analyzed. While I was able to format the title and the caption, the polynomial fit code was put aside and the original two deletion strain compare and wildtype deletion strain compare script was revisted. Both codes were changed similarily to output graphs for each gene with a significant change in expression that plotted the data points, continuous lines through models of both strains, and the hypothesized model. In order to do so, the times needed to be separated to designate which corresponded to which strain. This was added inside the loop in the wildtype deletion strain comparison and outside the two deletion strain comparison:

ts1 = [ones(n15,1)*15;ones(n30,1)*30;ones(n60,1)*60;ones(n90,1)*90;ones(n120,1)*120]; ts2 = [ones(n152,1)*15;ones(n302,1)*30;ones(n602,1)*60;ones(n902,1)*90;ones(n1202,1)*120]; t = [ts1;ts2];

[tsort,is] = sort(t);

Inside the loop for both sets of code, the way Y was calculated was separated for each strain:

Ys1 = a(ii,indx)'; Ys2 = a(ii,indx2)'; Y = [Ys1;Ys2];

Also, calculation for Ym (the full model) was split into two: one for each strain compared.

Ym1 = X(ind15(1):ind120(end),1:5)*beta(1:5,1); Ym2 = X(ind152m(1):ind1202m(end),6:10)*beta(6:10,1);

The plot command was then changed:

pause on if pval<0.05 figure(1),plot(ts1,Ys1,'ro'),hold on,plot(ts2,Ys2,'mo'),plot(ts1,Ym1,'r','LineWidth',2),... plot(ts2,Ym2,'m','LineWidth',2),plot(tsort,Yh(is),'k','LineWidth',2),title([b{ii+1,1} '/' b{ii+1,2},... ' W = ',num2str(W),' pval = ',num2str(pval)]),xlabel('time, minutes'),ylabel('log fold change'),... legend([strain1 ' data'],[strain2 ' data'],[strain1 ' model'],[strain2 ' model'],'hypothesized model'),drawnow hold off nsig = nsig + 1; [ii,nsig] drawnow % Save as a .jpg saveas(figure(1),[b{ii+1,1} '.jpg']) end

Katrina Sherbina 19:18, 8 June 2012 (EDT)

## Week 5

### June 11, 2012

Today, we began working on the two abstracts we will be submitting to the conference held by the Society for Mathematical Biology. Right now, I have a first draft of the abstract for the poster that will deal with the work we have done with the deterministic model.

The other part of today was spent continuing to try to debug the code for the deterministic model. The outputted weights of the current code for the model seem to be matching the outputted weights for when just the wildtype data was used as was seen in the graph Nick created plotting the weights against a numerical index for the current weights output and the output for each of the deletion strains in which the data for one strain was being used at a time. While we do want one set of network weights and network b's for all of the strains, the weights should not be identical to those for the previous wildtype only model. We're not sure exactly where the problem is in our code. We've already determined that the rror calculation in the general least squares function that fmincon calls upon is being updated for each strain data set, for each timepoint, for each replicate. We will be troubleshooting this problem further tomorrow.

In addition, Chiddy today emailed me two sets of microarray data for the mRNA decay experiment that differ in the way the PMT was calculated. The goal is to be able to normalize this data and compare it to the normalized microarray data we obtained last week to see which way is better to calculate the PMT when scanning the microarrays. The problem is that our current normalization assumes that the expression of most of the genes remains unchanged. However, in this mRNA decay experiment, the end result is that all of the mRNA has decayed so that the spots on the microarrays should be getting progressively more intensely green. We will need to talk about how to properly normalize the data so that we can determine whether or not to perform replicates of the experiment to be able to later determine mRNA decay rates for the transcripts of the genes being analyzed.

### June 12, 2012

I focused today mainly on analyzing the most recent microarray data from the first mRNA decay experiment that Dr. Dahlquist performed. There were three sets of data differing in how the PMT was set: manually for each array, automatically by the program for each array, or one PMT for each array based on an average. We decided to hold off on the within array normalization and between array normalization (the protocol that was created last summer) because these kind of normalization procedures are based on the assumption that a majority of the genes will not have a significant change in expression. However, we do expect to see a change in the mRNA decay experiment.

I began by first making MA plots for the non-normalized data (one plot for each GPR file). I used a script similar to the one that was used last summer where the M values and A values were calculated by hand from the RG file that was created when reading in the targets. I did have to change the M and A matrices (after the tapply function was performed) to data frames so that I could delete the controls Arabidopsis and 3XSSC. To deleted the controls, I first specified the rownames (gene names) within R rather an importing a csv file of the gene names as was done previously:

RGt<-tapply(RG$R[,1],as.factor(GeneList),mean) geneID <-row.names(RGt) MG2<-as.data.frame.matrix(MG) #lfc is the matrix of log fold changes before the tapply function has been done. colnames(MG2)<-colnames(lfc) rownames(MG2)<-geneID #The same thing was done to make AG (intensities) a data frame.

Then I used subsetting to get rid of the controls:

MG3<-subset(MG2,row.names(MG2)!="Arabidopsis") MG4<-subset(MG3,row.names(MG3)!="3XSSC") #The same thing was done to get rid of the controls in AG.

In plotting the A values versus the M values I noticed that the plots liked like what I would expect them to look after within array normalization. A majority of the plots seemed centered at 0.

I also did box plots of the M values for each timepoint for each of the data sets. Again, I saw that a majority of the box plots were almost centered at 0.

Then, as Dr. Fitzpatrick advised, I did a linear regression of the mean M (lnG-lnR) for each timepoint for each data set versus time. I tried to do this at first in R. First, I recalculated the M values by finding the log2((RG$G-RG$Gb)/(RG$R-RG$Rb) for each of the data points before performing the tapply function. Since some of the values were NaN, Inf, or -Inf, I ran into a few snags using colMeans. The designation na.rm=T did not seem to work because there were Inf and -Inf values. I tried to change these values to NaN within R but could not get a for loop consisting of an if statement to work. So I exported the data and made the change in excel. I then reimported the data. I could successfully get the equation for perofrming the linear regression on meanM versus time for each data set. However, I had difficulty trying to plot the actual data points and the regression. Finally, I just made plots with trendlines of meanM versus time for each data set in Excel. It was hard to see any trend in the mean M values over time. The slopes for two plots (automatic PMT and average PMT) seemed close to 0. However, it is difficult to infer anything from these plots sine the R^2 value for each regression was week.

Tomorrow, I will try to make plots of the M values (lnG-lnR) of each gene versus time for each data set.

Katrina Sherbina 20:59, 12 June 2012 (EDT)

### June 13, 2012

Todya, I continued to analyze the mRNA decay experiment microarray data. We decided to do plots of the M values (lnG-lnR) versus time for each gene. However, observing any kind of trend or biase among the arrays this way was already considered a stretch from the onset but we decided to try anyway. I made the plots specifically for the average PMT data. First, I exported the M values calculated in R and loaded them into MATLAB as a .csv file. Within MATLAB, I constructed a loop that plotted the M values (from the average PMT data) versus time (timepoints in the experiment) for each gene and also a line of best fit through the data points for each gene using the polyfit function. As expected, there was no observable bias across all of the genes. One of the problems may have been that there were no replicates so the lines of best fit probably did not fit the data points well. After sharing these graphs with Fitzpatrick, there is not much else to interpret from this data. We need to talk with Dr. Dahlquist about the experiment and the data more at the next lab meeting.

The rest of today was spent continuing to work on the abstract for the mathematical biology conference in July that is due this Friday.

Katrina Sherbina 20:47, 13 June 2012 (EDT)

### June 14, 2012

Today, I revised the abstract due tomorrow for the mathematical biology conference more with Dr. Fitzpatrick. In addition, I began putting together a PowerPoint to present Monday on the progress of our work and the general summary of what we have done in research since we started last summer.

Katrina Sherbina 20:34, 14 June 2012 (EDT)

## Week 6

### June 18, 2012

Today we spent a majority of the day in a lab meeting presenting the work we've done since last summer to other members of the lab and a professor and student from Chapman University.

We did begun running more tests with the deterministic model to verify that is working. We started by changing the input sheets for the most recent code so that the a single input sheet only has the mean and standard deviation for one strain. Nick ran the code with these input sheets because my computer is not nearly as fast as his. We intiially foresaw having errors with the mean and standard deviation lines of code because we changed from using all of the log fold change data to the mean log fold change data. However, there were no error messages. Nevertheless, tomorrow, I would like to look more closely at the code to see if it is using the means and standard deviations were are inputting that it is using like we think.

Another test that we would like to run is a forward simulation in which we would use the model we obtain through the code to get data (in contrast to the back simulation we did to find the model so that the parameters are optimized). We would want to use a smaller network to run the forward simulation. However, I think this means we would have to run the code as is with the back simulation with the network so that we can make the proper comparisons between the the results.

One of our next big tasks is to determine whether or not the sigmoid function we use in the deterministic model models an AND (production only occurs if all regulators of a target gene are acting) or an OR (production occurs if at least one regulator is acting on a target gene). Ideally, we want to have a model in which production occurs if at least one regulator is acting. In order to do so, we will be using a smaller network (maybe something with two genes/regulators) and probably graphing the sigmoid function taking into account the regulatory weights in this network on MATLAB. Additionally, we woudl want to see how the sigmoid function would look if one regulator acted as a repressor and the other as an activator.

There is still more work to do regarding the mRNA decay experiment. First, as Dr. Dahlquist mentioned, it would be best to change the y labels on all of the grpahs that I outputted last Friday for each gene so that it is the same for all graphs. It would also be interesting to use a general linear model to determine if any of the slopes plotted through the linear regression for each graph are significantly different from 0 or not.

Katrina Sherbina 20:07, 18 June 2012 (EDT)

### June 19, 2012

Today, I reworked the most current deterministic model we had so that it can do a forward simulation. This means that, instead of comparing a model to experimental data to optimize parameters as in a back simulation, parameters (weights and b's) are fixed and log 2 data is calculated by the model. First, I created a new input consisting of a 3x3 network matrix (ACE2, CIN5, and ZAP1) with 5 connections, a network weights matrix of randomly selected weights, and randomly selected b's. A random number generator in Exxcel was used to determine the weights and b's. I was able to output three outputs (wildtype, dCIN5, dZAP1) of different log2 optimized data deleted for the appropriate gene in a given deletion strain output. This simulation outputted results after some minor changes to the dimensions of some of the matrices and vectors calculated in the execution of the code and after calculating the log 2 of the data calculated by the model. The whole code that I used for the forward simulation consisted of the estimation_driver which called on Parameters (same as in the current back simulation code), forward_simulation (where the w0, wts, x0, tspan1, and deletion indices vectors/matrices were calculted and the call to the ode45 function), and the general network dynamics function (which is the same as in the back simulation code).

Initially, I wasn't sure how to compare the fixed parameters of the forward simulation with those that would be optimized by the back simulation since the forward simulation produces data for each time point but no replicates while the back simulation uses replicate data. In the end, I decided to use the 15, 30, and 60 simulation time point log2 optimized data from the forward simulation as the data input for the back simulation code. Again, some tweaks were made to the back simulation code to make sure that the dimesnions of matrices and vectors matched and that the model outputted log 2 base data. Looking at the output graphs of the log fold change versus time, the data points fit their respective models what seems like exactly. My concern is that there are only two sets of data points and two corrsponding model curves. However, there are technically three strains so there should be three sets of data points and corresponding model curves. My guess is that only the data points and model curves for the deletion strains are being graphed but I will need to verify this tomorrow. I did get three output files for the back simulation (one for each strain). The log 2 optimized data were different in each output file. However, for each output, it did not match the log 2 optimized data in the output file of the corresponding strain for the forward simulation. The optimized network weights and optimized b's were the same in each of the output files from the back simulation. However, neither matched the network weights and b's that were fixed for the back simulation. Tomorrow, I will confer with Dr. Fitzpatrick to make sure that I have the correct idea of how to compare the results of the forward simulation with that of the back simulation.

Katrina Sherbina 21:14, 19 June 2012 (EDT)

### June 20, 2012

(Aaaaaaaaaaaaaaaaaaaaaaaaaaaahhhhhhhhhhhhhhhhhhhhhhhhh!) Today, I continued working on verifying the results of the forward simulation with those of the back simulation. I worked with two different networks because with the first network I was obtaining a model for dCIN5 that was the same to that of the wildtype. With the new network, there were differences in the model calculated for dCIN5 and that for the wildtype. However, I was still not getting the same optimized parameters from the back simulation (as I fixed in the forward simulation) using the data from the forward simulation. Dr. Fitzpatrick suggested two things which I tried: dealing with only one strain rather than using all strains to define the network dynamics and adding more time points. I focused on doing the forward simulation with just the wildtype and simulation times starting from one minute to 30 minutes with increments of one minute. Again, I could not get the parameters optimized by the back simulation using the forward simulation model data to match with the fixed parameters in the forward simulation.

Katrina Sherbina 20:05, 20 June 2012 (EDT)

### June 21, 2012

In trying to find a solution to another problem, I may have found the reason why the optimized parameters in the back simulation did not match the fixed parameteres from the forward simulation. The matrix W (which is a reconstruction of the network weights in the general network dynamics function) was being constructed incorrectly. This was due to how wts (a vector of all of the non-zero network weights) was being constructed in the foward_simulation code. The wts was constructed by taking the non-zero values from the network weights by moving down the columns and finding those values that correspond to the positions indicated by [ig,jg] (which give the position in the network weights that has a non-zero value). The matrix W was then constructed from the wts row by row even those the wts vector was constructed by moving from column to colum in the network weights. This is did not show up as a problem that changed the final result in the back simulation because in this case all of the non-zero network weights are 1's. To solve this problem, [ig,jg] was sorted by the column jg so that the pairings of row and column designations for each position were ordered in such a way that the rows (ig) increased from 1. In so doing, the W matrix constructed in the general network dynamics function matched the network weights matrix. No changes needed to be made to the loop that put together the W matrix in the general network dynamics function. With this change, I will try to run the foward and back simulations and compare the parameters.

The correct construction of the W matrix also corrected the output we were getting from the new Michaelis-Menten model that we have begun trying out with the forward simulation. Originally, the matrix from the new function corresponding to the construction of this model had non zero values for those weights that were negative in the W matrix. This was a problem because the function (which depends on the x's and w's) should be 0 for a repression corresponding to a negative weight. However, with the aforementioned correction, the outputted matrix from this function only had non-zero values corresponding to the non-negative non-zero weights in the W matrix.

A new loop was constructed in the general network dynamcis that brings in the output from the Michaelis-Menten function into a sum that is used to calculated the production term of the differential equation. The result of the loop is an column vector in which each cell corresponds to a sum over all of the regulators affecting a target gene. For those genes with no inputs, there is an NaN in the vector(because for those genes a division by 0 occurred in the loop). For those genes that were only influenced by repressors, there was a 0 in the vector (since there was a 0 for those regulators in the matrix from the Michaelis-Menten function). I am no wondering if it would be accurate to make those NaN's a 0 because since the corresponding genes are being repressed there is not change in their expression because they are not being expressed (at least that is the assumption I think we are working on now).

Katrina Sherbina 21:39, 21 June 2012 (EDT)

### June 22, 2012

Today, we continued working on incorporating the Michaelis-Menten function into the general network dynamics function. It is wtihin the general network dynamics function that the Michaelis-Menten function is called calculate the production part of the the differential equation (calculated by a loop in the general network dynamcis). The problem that we were originally having was that the array for the production part of the differential equation had NaN's for the three genes that did not have any inputs because for those genes division by zero was occuring in the loop that was calculating the production array. In turn, the model outputted by the ode45 function had all NaN's. To solve this problem, an if statement was inserted after the production loop that calculated dz only when there were no NaN's in the production. If there is an NaN, then the corresponding value in dz remained 0. Additionally, any mention of b's was removed from the scripts for the Michaelis-Menten model.

A forward simulation was run for both the sigmoid function model and the Michaelis-Menten model. The input for the first forward simulation for the sigmoid model had randomly selected b's for the genes (0's for the three genes that do not have inputs). The input for the second forward simulation for the sigmoid model set all of the b's to zero. A comparison was done of the log2 optimized values from both models and both simulation runs. The results were that values were not the same between the two models for each run of the simulation.

Then a back simulation was run with both the sigmoid function and the Michaelis-Menten model. For the sigmoid model, the b's and production rates were not fixed. For the Michaelis-Menten model, the b's were fixed but the production rates were not. In the output for the sigmoid function model, there were production rates for all but two genes that were controlled only by repressors. This does make some biological sense. A quick comparison was done between the log fold change graphs produced for the two models. Some strange graphs were outputted in both cases (a few of the graphs had sharp turns).

Katrina Sherbina 20:09, 22 June 2012 (EDT)

## Week 7

### June 25, 2012

The morning was spent comparing the log2 expression versus time graphs for the target genes for the sigmoid model and the Michaelis-Menten model. We figured out the some of the strange graphs for the sigmoid model appeared as such because in the graphs code was plotting the log2 of the log2 of the model. With these corrections, many of the graphs for the target genes looked similar between the two models (on occasion those for the Michaelis-Menten model had more of a curve). The graphs between the models were different for the three genes that do not have inputs in the network. For the Michaelis-Menten model, the plots for these genes appeared as a line with a negative slope. In continuing to edit the Michaelis-Menten function, we will try to put in some kind of if statement so that there will be some level of transcription of those genes. This may involve altering the dz array in the general network dynamics so that for those three genes the value is calculated as the difference between the production rate constant and the degradation rate constant.

In addition, we tested the back simulation with the Michaelis-Menten model with seven different alpha values. An L curve was plotted of the penalty versus the LSE to determine which is the best alpha value to use.

We gabn running the back simulation with the Michaelis-Menten model for all of the strain data simulataneously with an alpha value of 0.05 (determine from the L curve). While the simulation did finish, I realized that the lines of code in the general network dynamics that do the appropriate deletions for the deletion strains were commented out. This simulation will need to be run again tomorrow with those lines of code uncommented.

Katrina Sherbina 20:18, 25 June 2012 (EDT)

### June 26, 2012

For part of the day, we were working on putting a legend on all of the expression graphs that were outputted so that the legend for each graph identified what colored curve corresponded to what strain model. One of the first problems was that the legend mislabelled by plot by assigning strain names to the symbols representing the data and the model curves for each strain. However, the once this was no longer a problem, the next recurring problem was that a legend would output for each graph for one turn of the outermost loop showing the strain name corresponding to the model curve being plotted. However, as soon as the next iteration of the loop began, the previous legend was replaced by a new legend which identified the strain name corresponding to the next model curve that was being plotted. In these cases, the legend command was in the innermost loop. When trying to put the legend command outside the outermost loop in the graph script, it was not possible to attach a legend to the completed graphs that labelled the model curve with the appropriate strain name for each graph. More work will need to be done tomorrow to figure out how to incorporate a legend ocmmand in a loop so that, if at all possible, the legend will update (rather than replace the exisitng legend) with the correct strain designation for the next model curve being plotted on the already open graph.

The rest of the day involved running a ccuple of simulations (more were not possible due to the time constraint and the number of iterations it took for a simulation to finish) of the entire back simulation with the Michaelis-Menten model. Several changes were made to the general network dynamics function so that all target genes (even if they have no inputs or are regulated solely by repressors) still have some base level of transcription. To do so, first the dz array was computed as dz = P(:).*prod - deg where P is a copy of the prorates arrays. This is necessary so that the deletions function correctly as eah iteration of the optimization calls on the general network dynamics function. However, an if stated was inserted in the general dynamics such taht for those cells in dz that correspond to target genes with no input or only regulated by repressors, dz was calculated (there is a 0 in the prod array corresponding to those genes), dz is calculated as dz(i) = P(i) - deg(i). This if statement is imbedded in another if statement that only calculates the particular cell in dz as aforementioned only when the target gene corresponding to that cell is not the gene that is being deleted in whatever strain for which the data is currently being used in the optimization. However, there is now some question as to whether or not that outermost if loop is necessary if we are looking for one set of parameters that satisifies all of the data for all of the strains.

In running this simulation with all of the changes mentioned above, there were optimized production rates for each target gene for each output (one for each strain) that was outputted. However, there was no optimized production rate for HAP5 when Nick finished his simulation (the code is the same with the exception that the alpha value he used was 0.01 while the alpha value I used was 0.05). In addition, the optimized production rates in his outputs were different from the corresponding optimized production rates in my outputs. Tomorrow, we will have to compare our codes to see if there are other differences that would cause this discrepancy. I doubt that the different alpha values would explain these different outputs.

Katrina Sherbina 20:34, 26 June 2012 (EDT)

### June 27, 2012

The biggest challenge today was trying to optimize the production values correctly. We want there to be a constant production function for the three genes that have no input. The production rates for the other genes should be optimized. We tried to modify how the production rates are appended to the w0 array with the if statment if fix_P = 0 in the LSE by changing the for loop in this if loop to for ii = i_forced. However, the result of running the simulation with this correction was that the optimized production rates for the no input genes and Hap5 were 0. After chaning the loop back to what it was before the correction, we investigated altering the loop that calculates prod in the general network dynamics. For the no input genes, we set the cell in prod corresponding to that gene to 1 no matter the summation (prod) that was calculated for that gene. This way we reasoned that dz would be calculated the same way for the no input gene no matter how many iterations fmincon makes. As a result, the value corresponding to the no input gene in w0 which is being optimized would stay the same through all of the iterations of the optimization. However, in running the simulation with this modification (with an alpha value of 0.01), we ran into the same problem as yesterday: the production rate was optimized to 0 for Hap5. We are not sure why this is occurring because the same code run at an alpha value of 0.05 did produce a non-zero value for Hap5. We are wondering whether an optimized production rate of 0 for Hap5 is biologically accurate since Hap5 is controlled by only one gene in the network which seems to be a repressor. We began to think about whether or not there is a way to modify how w0 is constructed in the LSE script so that the production rates of the no input genes are not optimized but remain constant. We have not yet been able to think of a way to do this.

Katrina Sherbina 20:41, 27 June 2012 (EDT)

### June 28, 2012

At the start of the day, we kept working on getting the production rates to optimize correctly. I tried to fix the production rates for the no input genes in the general network dynamics by setting the the production rates in the prorate array (which is recalculated with every iteration of the optimization) for the no inputs genes as the rates in the original prorate array taken from the input sheet. However, in taking to Dr. Fitzpatrick, this turned out to be unnecessary. The code as we have had it for a couple of days does optimize the production rates correctly. The production rate of Hap5 is always optimized to 0 in all of the simulations because it has a very low degradation rate. Any production rate will make the expression of Hap5 shoot up in the expression graphs because of the low degradation rate.

The rest of today was spent running the back simulation with the Michaelis-Menten function for all of the strain data with seven different alpha values (0.4, 0.2, 0.1, 0.05, 0.04, 0.02, 0.01). We first ran the simulation for 0.02 and 0.04 on the computers that we usually work on. However, the simulations were taking too long so we moved to the PC's in the math department. From here, we ran into some problems trying to generate the L curve of the lse_out versus the penatly_out for the different alpha values. We ran the simulation for the five remaining alpha values and then generated the L curve. The curve that was graphed had several kinks. We reasoned this might have happened because the first two simulations and the last five simulations were run on different versions of MATLAB. Therefore, we reran the simulation with all seven values on the same version of MATLAB in the math department. However, the graph that was outputted still had kinks. The results for 0.04 were especially strange because it had a lower penalty_out and lower lse_out than those for 0.05. We're not entirely sure what is causing the kinks to appear because identical code was used to run the simulation for each of the seven alpha values. Tomorrow we will be following up on Dr. Fitzpatrick's suggestion to run the simulation with one of the alpha values and then use the production rates and weights from that simulation as the intial production rates and weights for the simulation with the next lowest alpha value in the list and then continue with this process for the remaining alpha values.

Katrina Sherbina 20:25, 28 June 2012 (EDT)

### June 29, 2012

We reran the simulations with seven different alpha values using all of the strain data at once to generate the model. We began with running the simulation with an alpha value of 0.4 and then used the optimized production rates and weights from the simulation as the intial parameters for the back simulation with an alpha value of 0.2. We replaced the intial parameters this way for the remaining seven alpha values and generated an L curve of the lse_out versus the penalty_out for each of thes seven alpha values. The L curve seemed to have a turn around the data points for the 0.05 and 0.04 alpha values. As a follow up to the L curve, the optimized weights and production rates were plotted for each simulation performed for each alpha value. After emailing Dr. Fitzpatrick, we decided to use an alpha value of 0.04 for the back simulations for each individual strain (wildtype, dCIN5, dGLN3, dHMO1, and dZAP1). After running these simulations, the optimized production rates for each individual strain model and the all strain model were plotted. The optimized weights for each individual strian model and the all strain model were plotted on a different graph. In the optimized production rates graph, all of the models had a similar trend and similar peak locations. In the optimized weights graph, the models had similar peak locations for the most part with some minor differences as evident when comparing the model generated from the dZAP5 data with the other models.

Katrina Sherbina 20:03, 29 June 2012 (EDT)

## Week 8

### July 2, 2012

After journal club, the remainder of the afternoon was spent outputting gene expression graphs that compare the different models calculated by the sigmoid function simulation and the Michaelis-Menten function simulation. This involvevd modifing a graph script Nick previously developed to plot the weights and production rates for each of the different simulations for comparison. At first, the graphs we were outputting (expression versus time for each gene in the network) had the log2FC data, the model for each strain from the all strain simulation, and the model for each strain for the individual strain simulations. The log2FC data was the same for all of the graphs. For most of the graphs, the model was the same for each strain from the all strain simulation with the exception of those genes that have been deleted. At this point, the graphs were only plotting results from the Michaelis-Menten function and were already pretty difficult to look at. It would have been impossible to make any sense of the graphs if the different models produced from the sigmoid function were plotted on the same axes. As an alternative, following Dr. Fitzpatrick's advice, I began to modify the code to create a graph for each gene in the network of the log2FC data, strain model from the sigmoid all strain simulation, strain model from the Michaelis-Menten all strain simulation, strain model from the sigmoid individual strain simulation, and strain model from the Michaelis-Menten individual strain simulations on one set of axes for the wildtype strain, dGLN3 strain, and dHMO1 strain. This makes for three separate sets of graphs (one for each strain). However, I have run into a few set of problems which I will need to resolve tomorrow. The first being that I have used the latest results we obtained from the sigmoid function model for all strain and individual strain simulations. However, the models from these results are 13x21 matrices while the models from the Michaelis-Menten model simulations are 21X13 matrices. In addition, I am using the same syntax to call the model from the sigmoid and Michaelis-Menten simulations: log2FC.model. However, in the sigmoid simulation output, log2FC.model does not seem to be the log transformed model because in the graphs the model curve begins at 1. It may be prudent to compare the scripts that output the model for the sigmoid function and Michaelis-Menten function simulations to make sure that in both cases the log-transformed model is beign outputted. Also, the simgoid function simulations were all run at an alpha value of 0.05 while the Michaelis-Menten simulations were all run at an alpha value of 0.04. I will be asking Dr. Fitzpatrick tomorrow whether or not we can make a valid comparison if the two different types of models were generated using different alpha values.

Katrina Sherbina 21:22, 2 July 2012 (EDT)

### July 3, 2012

We continued working on the comparison graphs of the models from the sigmoid and michaelis-menten all strain and one strain simulations for the wt, dGLN3, and dHMO1 strains. To fix the matrix dimension mismatch problem I was having yesterday, we reran the sigmoid model all strain and individual strain simulation (we did each of the five individual strains for the latter). We kept the sigmoid simulation at an alpha value of 0.01 and the Michaelis-Menten simulation at an alpha value of 0.04 because these are the optimal alpha values (as determined by looking at an L curve of the lse_out versus penalty_out) for each of the respective types of simulations. At the end, I had three sets of graphs one set for each strain. Each set of graphs contained an expression versus time graph for each gene in the network. On this graph, the data for the strain being analyzed was plotted as well as the curves for the sigmoid model all strain simulation, sigmoid model individual strain simulation, Michaelis-Menten model all strain simulation, and Michaelis-Menten model individual strain simulation. The all strain simulation model curves are designated by a single asterisk (*) in the graph legend and the individual strain simulation curves are designated by a double asterisk (**). In each of the graphs, a similar trend was observed in the individual strain simulations for the sigmoid model and Michaelis-menten model. A similar (but somewhat different trend from the aforementioned) was observed between the all strain simulations for the sigmoid model and Michaelis-Menten model.

I used the same method to insert a legend in graphs aforemented to insert a legend into the expression graphs versus time for each gene in the network outputted in the sigmoid model and Michaelis-Menten model simulations. To implement this method, the following lines of code were added before the for loop to generate each plot in the graphs script in both model simulations:

if length(Strain) == 1 Targets = {[Strain{1} ' data'],[Strain{1} ' model']}; else Targets = cell(1,(length(Strain)*2)); for i = 0:((length(Strain))-1) Targets{2*(i)+1} = [Strain{i+1} ' data']; Targets{(i+1)*2} = [Strain{i+1} ' model']; end end

The Targets are calculated two different ones depending on if an individual strain simulation is being run or an all strain simulation is being run.

Then, the legend command in the for loop that actually generates the plots was changed to the following:

legend(Targets,'Location','NorthEastOutside');

For the remainder of the day, we worked on making changes to the stochastic model code that Dr. Fitzpatrick resent us. This code currently consists of a general network parameter type code for a stochastic simulation and another code that sets up the parameters and creates the model. We are currently working on the first of the two codes that the second code calls on to make the model. We are stuck on how to define production and degradation. From what the code was when it was sent, the production has been changed to be the product of the weight and expression multipled by the product of the weight and expression that is greater than 0. The degradation is just defined by the degradation rate inputted from the input file. With the way it is defined now, we get expression graphs that are difficult to justify biologically (there are large kinks in the graphs).

Katrina Sherbina 20:18, 3 July 2012 (EDT)

### July 5, 2012

We regenerated expression graphs for each of the genes in the network using the two stochastic codes we have with an x axis from 0 to 60 minutes (the maximum was previously greater). In comparing, the graphs to those generated by the deterministic model, some of the genes had similar trends between the two models. However, some of genes seemed up-regulated in one and down-regulated in the other model. However, the stochastic simulation was done without any least squares optimization procedure. Therefore, we have begun to change our current deterministic code to run a stochastic simulation. Namely, the ode45 function is being replaced with the function stochastic_sim_fn which performs the simulation. This function then calls on the function network_rates which defines the the jump probabilities. The fmincon function is being replaced with a function that was developed last semester called fstochmin which will perform the minimization. Right now, the fstochmin function calls on the general_stochastic_ls which is essentially a modified version of the general_least_squares_error function. In contrast to the general_least_squares_error function, the geeneral_stochastic_ls then calls on the stochastic_sim_fn. Currently, we are troubleshooting the fstochmin function. At the moment, there is a dimensions mismatch in the line

theta(i+1,:) = alpha*theta(i,:) + (1-alpha).*theta1

We have changed some of the dimenions of the vectors and matrices in this minimization code including that of theta (it was at first transposed). The dimensions mismatch is probably occuring because theta has been untransposed. We have also changed the network_rates function a little. Originally, the pro and deg variables were being calculated in the log2 space. We decided that this may be difficult to do so we have changed it so that the variables are no longer being defined in the log2 space. In the general_stochastic_ls function, the log2 is taken of the model determined by the call to network_rates since the data to which the model is being compared is in the log2 space. We will need to deal with the degradation rates because they are calculated in the ln space while the model and data are being calculated in the log2 space. However, this would probably be a simple fix consiting of adding a lines of code in the parameters that converts the degradation rates from ln to log2.

Katrina Sherbina 20:45, 5 July 2012 (EDT)

### July 6, 2012

The dimensions of several variables in the fstochmin function were changed to make sure that what needed to be an array was an array and that what needed to be a scalar was a scalar. In addition, a line of code was added at the end of the for loop that just states theta1 so that we can see the progress of the optimization. In addition, some small changes were added to the way L and penalty_out are calculated at the end of the general_stoch_ls function.

We began the entire stochastic simulation from the beginning before lunch. Now at the end of the work day, the simulation has not yet stopped. Interestingly, a similar simulation that is still running on Nick's computer is showing the same kind of subplot that was generated during the deterministic simulation to see how the model is fitting the day with each step of the optimization. Howvever, even though I have the same line of code to specify this subplot, no figure has appeared over the course of the simulation. Instead, I can only see how the theta1 values are changing. At the moment, there all seem to be within -1.5 and 1.5. I will keep the simulation running through out the weekend and check on it when I come back to work Monday.

Katrina Sherbina 20:41, 6 July 2012 (EDT)

## Week 9

### July 9, 2012

The simulation that I started running last Friday ended with an error message yesterday night that the penalty_out term was not defined. After putting the penalty_out in the global statement, we reran the simulation with just the wildtype strain. The simulation began before lunch and when we came back after lunch there was yet another error message. Our primary concern today was why it was taking so long for the stochastic code to finish running. To debug, we changed iestimate to 0 so that we only run the forward simulation. We also changed the time that was being used in the stochastic simulation function from the 101 times from 0 to 60 min to simtime (0 to 60 min by increments of 5 minutes). The forward simulation finished on my computer within a matter of minutes. However, the model that was outputted was not correct. This was particularly evident in the expression graphs outputted as few if any actually began at 0 as they should when taking the log2 of the model in which the first row should be 1's corresponding to time 0 minutes. At this point, there is an error occuring in the stochastic sim fun when the model is being computed. When running the code I compiled, the first model outputted through one iteration of the for loop in the simulation does have 1's in the first row for the model. However, by the time the last model is outputted, the first row has changed to values other than 1's. When running the code Nick compiled (which seems the same with the exception of a few of the variables being changed), the first row in the first model outputted consists of all 0.1's. In the last model that is outputted, the first row has changed from the first model.

I have some questions about the line of code in the stochastic sim fun that calculates the model:

model = model + interp1(tlist,Nlist,time)/nmc;

With the way this line of code is set up, it seems like the rows that have already been calculated for the previous model change with the new model that is computed through the next iteration of the for loop. However, in taking out the summation of the interpolation and the previous model, the final model is still not what we are looking for it to be.

Katrina Sherbina 20:07, 9 July 2012 (EDT)

### July 10, 2012

The line that computes the model in the stochastic sim fun is correct. With Dr. Fitzpatrick, we did a series of debugging procedures on the forward simulation. The forward simulation calls only the paramters, stoch LSE, stochastic sim fn and the network rates function. The problem that stumped us for some time was that Nlist (which is computed in the stochastic sim fu) was not being zeroed out after one iteration of the for loop from 1 to the number of Monte Carlo's (nmc). This did not seem to be a problem when the original simple stochastic simulation and network rates function were run. The problem arose in taking out the loop from 1 to the nmc and making it a separate function. In order to zero out Nlist, we needed to add a line of code before the for loop (x0 = N0;) and replace the assignment Nlist(1,:) (in the for loop) so that it is x0' not N0'.

After running through the forward simulation for just the wildtype strain, it we saw that Nlist was being zeroed out because the final model outputted had 1's in the first row, as we expected. The entire forward simulation on my computer took over an hour to finish. In looking over the results, the log2 optimized concernations of the three input genes did not change much over the time course of the simulation (0 to 60 in increments of 5 minutes). The log2 optimized fold changes for SWI6 were for the most part negative and the same magnitude throughout a majority of the simulation. Those for FHL1 and SKO1 were positive and of the same magnitude through the simulation. In looking at the expression graphs that were outputted, specfically those for FHL1, SKO1, and SWI6, Dr. Fitzpatrick suggested that we make sure that there is some basal level of transcription accounted in the code. In order to do so, the network rates function was altered so that jump_rates(ii,1) and jump_rates(ii,2) were calculated twice (for when upj and downj are calculated and for when pro and deg are calculated). In addition, Dr. Fitzpatrick suggested increasing the number of iterations n that occur of the for loop in the fstochmin function to see if we can get a better fit of the model to the data. Thus, we increased n from 1000 to 5000.

We began the reverse simulation for just the wildtype strain on two computers. There is a marked difference in the speed of the calculations between the two computers as determined by observing the counter change as the fstochmin function goes through multiple iterations calculating theta1. By now, Nick's computer is over halfway done with the fstochmin iterations and my computer is still below 1000. When we begun running the simulation for all the strain data, we will most likely have to work on the computers in the math department.

Toward the end of the day, I also began setting up the matrices to use GLM tests to determine whether or not a difference in growth rate in 13 degrees between the wildtype and a deletion strain is significant when comparing to the growth rate at 30 degrees. I have thus far completed the matrices for wt versus GAT1, wt versus GZF3, and wt versus URE2. Each matrix has 8 parameters (4 slopes and 4 intercepts). Before I move on to the other matrix, I will check with Dr. Fitzpatrick to make sure that I am understanding the hypothesis for the GLM correctly.

Katrina Sherbina 20:25, 10 July 2012 (EDT)

### July 11, 2012

The reverse simulation for the wildtype strain using the stochastic model finished some time this morning. When looking at the expression graphs generated by the simulation, it was evident that the model did not fit all of the genes very well. For some genes, the model curve was above that of the data. In future graphs that we generate, it may be useful to include the upper and lower error bounds as they were once shown in the expression graphs from the original deterministic model simulation to make a better judgement of the goodness of fit of the model. To try to find a model that fits the data better, we repeated the simulation with different values of a (calculated in the fstochmin function) using n = 1000 for the stochastic optimization. After outputting the expresson graph for each gene in the network for each value of a on the same set of axes, we decided to use a = 0.05. Nick started a simulation with a = 0.05 and n=20000 which will hopefully be done by tomorrow.

Another comparison we made today the percent difference between the input network weights and the network weights optimized by the stochastic simulation that finished this morning. It seems that for a majority of the genes the optimized network weights were not much different from the input weights. However, for some genes (particularly YAP6) the weights changed sign. Along the lines of improving the model, Dr. Fitzpatrick also suggested changing the inital guesses we use for the network weights from the optimized network weights from the determinstic model to those weights multiplied by the production rate. However, this is a simulation that we will probably consider tomorrow. In waiting for simulations to finish, we worked on the presentation for the SURP symposium.

Katrina Sherbina 20:44, 11 July 2012 (EDT)

### July 12, 2012

Before the 20,000 iteration reverse stochastic simulation could finish, the entire simulation needed to be restarted setting the fix_P to 1 (it was 0) so that the prodution rates are not estimated.

While the simulation was running during the work day (it has not finished yet), we performed within array normalization and between array normalization for S. paradoxus microarray data gathered during a cold shock experiment. In looking over the normalized data (after both normalization procedures), we noticed some very drastic log fold change values (ex. around -5 and downwards to -15). As this caused us some concern, we made boxplots of the data before normalization, after within array normalization, and between array normalization. The log fold change values after the between array normalization for t120 flask 1 (there was only one flask for t120 for analysis) were mostly negative. There is a very long tail for the box plot below 0 reaching all the way to -15. We then decided to also produce MA plots for before normalization and after within array normalization to corroborate what we saw with t120 flask 1. For t120 flask 1, there was a prominent tail with a negative slope in the before and after within array normalization MA plots. The log fold change values in this tail had increasingly negative log fold changes and increasing intensity values. What was reassuring to see was the robustness of loess normalization as this procedure did not try to somehow correct this negative tail by improperly shifting the values in the tail. For the other chips, the normalization seemed to work well getting rid of hooks. After looking at the graphs, we went back to the pictures of the microarray slides with Dr. Dahlquist. We saw that for the t120 flask 1 microarray, a majority of the spots were very green and there were some blocks that fluoresced very little and if so the spots were green. Tomorrow, we will try to extract from the MA plots which genes consist of the negative slope tail by selecting the cutoffs for the M values and A values where the tail seems to begin and finding the genes that have M values and A values within the cutoff.

We also performed an ANOVA test with the normalized data to find which genes had significant differential expression during at least one time point. Over 1000 genes had significant differential expression according to the ANOVA (p value<0.05) but there were only 4 significantly differentially expressed genes after performing the Benjamini & Hochberg p value correction (p value<0.05). Looking at the graphs of the data and the full model from the ANOVA test for those 4 genes with B&H p values < 0.05, we saw that the t120 flask 1 data really skewed the model. The results of this ANOVA test are prone to suspicion however considering the normalized log fold changes for t120 flask 1 and the fact that there were at most 2 replicates for each timepoint.

Katrina Sherbina 21:07, 12 July 2012 (EDT)

### July 13, 2012

Today I spent the day working on the PowerPoint presentation for the SURP Symposium and the poster about the deterministic model for the Mathematical Biology Conference. Intially, the the gene expression graphs in the PowerPoint came from running the MATLAB simulation to solve the differential equation and optimize parameters with just wildtype data and with just dCIN5 data. Today, I outputted expression graphs for the 21 genes based on the widltype model and dCIN5 model produced from the simulation using data from all strains. In the intial graphs, there seemed to be a differences in the expression profile of SWI6 (a no input gene) between the wildtype model and the dCIN5 model. However, none of the input genes had a difference in their expression profiles in the models produced for the wildtype and dCIN5 strains from the simulation run with data from all strains. This is what we expected because by definition the expression of the genes with no inputs in the model should not change if a gene encoding a transcription factor in that same model is deleted. These graphs provide evidence that using the data from all strains produces a more robust model than using data from just one strain.

Katrina Sherbina 20:58, 13 July 2012 (EDT)

## Week 10

### July 18, 2012

Sometime last week, Dr. Dahlquist sent revised growth curve data for several deletion strains and S. paradoxus. Since we spent Monday working on posters for the Mathematical Biology Conference and the presentation for the SURP Symposium (which was yesterday), I returned to the statistical analysis of the growth curve data. I already had matrices set up for GLM testing for wildtype versus URE2, GAT1, and GZF3. However, I redid these matrices with the revised growth curve data that Dr. Dahlquist sent. In addition, I worked on setting up the matrices for the other deletion strains as well as S. paradoxus. Each matrix has a dependent variable column and eight other columns corresponding to eight parameters: slope for wildtype during cold shock, slope for wildtype during optimal temperature (which is more like a delta representing the change in the growth rate between cold and optimal temperatures), intercept for wildtype during cold shock, intercept for wildtype during optimal temperature, and those corresponding slopes and intercepts for the deletion strain. Each matrix is set up to compare the growth rate of one deletion strain at a time to the growth rate of the wildtype. Tomorrow, I will be finishing the matrices for the remaining deletion strains so that i can perform the GLM testing before research ends for the summer.

I also spent a part of today continuing to work on the determinstic model poster for the Mathematical Biology conference. Namely, I redid the "AND" gate, "OR" gate, and sigmoid model representations in 3D. From what Dr. Fitzpatrick said today, the graph we called the sigmoid model actually was a representation of the production of a target gene in the case of an activator and repressor with an "OR" gate type. The original representation of the "OR" gate type actually corresponded to the production of a target gene in the case of two activators with an "OR" gate. I kept these two graphs and substituted the "AND" gate graph with two representations of the production of a target gene in the case of two activators with the sigmoid model and in the case of an activator and a repressor with the sigmoid model.

Katrina Sherbina 20:52, 18 July 2012 (EDT)

### July 19, 2012

Today, we finished analyzing the growth curve data from 2009 to this summer to determine if to determine whether or not a difference in growth rate in 13 degrees between the wildtype and another strain (deletion strain or S. paradoxus) is significant when compared to the growth rate at 30 degrees. The following strains were compared to the wildtype: dCIN5, dGLN3, dRPN4, dZAP1, dHMO1, dINO2, dOPI1, dRIM101, dGAT1, dGZF3, S. paradoxus, dURE2, dGAT1, dGZF3, dHAP3, and dHAP5. There were two sets of data (from two different growth curve experiments) for dGAT1 and dGZF3. To test the hypothesis, a general linear model was constructed for all 16 comparisons in SPSS to obtain the hypothesized model residuals and the full model residuals. In the case of the hypothesized residuals, only 7 of the 8 parameters were used in the calculation (all except the slope for the deletion strain or S. paradoxus, depending on the comparison, for 30 degrees Celsius). For the full model residuals, all 8 parameters were used in the calculation. For all 16 comparisons, very large F statistics were calculated from the hypothesized residuals and full model residuals (all above 100 and some into the 1000s). As was expected from the F statistics, very small p values were obtained for all 16 comparisons. The only p value that SPSS could compute was that for the comparison between wild type and dZAP1 (on the order of 10^-14). For the other comparisons, Graph Pad was used to try to obtain a p value. The best that it could do was calculate the p value to be less than <0.0001 for all of the comparisons. Therefore, the difference in growth rate between the 13 degrees and 30 degrees for the wildtype and the other strains is significant.

Katrina Sherbina 20:06, 19 July 2012 (EDT)