# Computational Journal

### From OpenWetWare

m (→added June 14, 2012 journal) |
(→June 18 journal entry) |
||

Line 685: | Line 685: | ||

[[User:Nicholas A. Rohacz|Nicholas A. Rohacz]] 21:28, 14 June 2012 (EDT) | [[User:Nicholas A. Rohacz|Nicholas A. Rohacz]] 21:28, 14 June 2012 (EDT) | ||

+ | |||

+ | |||

+ | ==Week 6== | ||

+ | ===June 18, 2012=== | ||

+ | Today was spent reading some papers that Dr. Fitzpatrick emailed us last Friday, which was spent putting together a ppt on all the work we had done up until now for journal club. Journal club took a good 3 hours today because our presentation was on a years worth of work. Once we got back, we decided to run some confidence tests to make sure the code is really doing what we want it to do. The first test was to run the new code for all the strains simultaneously, but instead using one individual strain at a time, along with using all of the means instead of the flask data. This was pretty quick, especially computationally, the only problem was that our newest version of the code did not have all the code it needed. And since we are testing the code and aren't allowed to change it, we can't be sure if this causes problems or not, for example, the standard deviations are calculated in MATLAB in the new code, but the old input format called for having a sheet with the sigmas present. Since this was so, the graphical output did not have the standard deviations present, and some other place in the code may have been affected. The next test was to make sure the the optimized model would output the same data if the optimized parameter values were put back into MATLAB. This was difficult though because we need to make a new model of 3 gene with no connections and rewrite all of the input sheets. This has not been run yet, some problems that may occur are size problems, if MATLAB is hardcoded at any point for # of data points or time or flasks, the other problem is that we aren't sure if we need to run an initial optimization with our new network, or if we can use the optimized values from our 21 gene network run with this new network. The problem with the latter idea is that this may change the optimization of the network if it is specific for the 21 gene network and the 50 connections it has already. The final goal for today was to figure out if our code was running in an AND format, where the exact number of TFs that are required to transcribe the mRNA are required to make the model, or if it is in an OR format, where at least one TF has to bind. Biologically, the latter idea would make more sense, even though some genes do require more than one TF, those would likely be case specific. This needs to be checked for both activators or repressors, and for a combination of the two to give us a good idea of what we are running. At this point Dr. Fitzpatrick assumes that it is somewhere in between the two formats. The last two confidence tests will be run tomorrow, hopefully we can get this minor problems worked out too. | ||

+ | |||

+ | [[User:Nicholas A. Rohacz|Nicholas A. Rohacz]] 20:06, 18 June 2012 (EDT) |

## Revision as of 20:06, 18 June 2012

## Contents |

# Summer 2011

## Week 1

### May 20, 2011

After attempting to find a match between each .gpr file and the individual LogFC columns in the original raw data some problems occurred. A single .gpr file was chosen to run a few tests on to try and match with one of the columns. The file that was selected in this case was t15/t0 for flask 3 of the Dahlquist wildtype strain. Flask 3 was chosen for a dye swap of Red/Green to Green/Red, however, when computing the LogFC for each spot on the chip, it was discovered that the LogFC was performed using the Red intensity divided by the Green intensity. This will have to be confirmed with Dr. Dahlquist to see if this should be changed at all or if the Excel sheets obtained from May 19th, 2011, a journal for this day does not exist. However, using Red/Green for the fold change, a column was obtained that matched the LogFC originally performed in the .gpr file. The next step was to try and perform this in the statistics processing software R. The original code given was changed to focus on the median, due to the fact that R was formatted to process the mean foreground intensity over the median background intensity. Once this was done, R successfully created a list of LogFC, of which the first five were focused on to compare to the original .gpr file. Of these five, two were duplicates giving three individual genes to be focused on in the file. The first five data points given by R matched up with the first five genes given in the .gpr file, so we know that this was functioning correctly. The main problem that arose was that when the LogFC given by the .gpr file and R and the LogFC found in the raw data were compared, an inconsistency occurred. These numbers did not match, and after running a seperate test on the .gpr file for t30/t0 for the same culture, the same problem occurred.

## Week 2

### May 25, 2011

Spent the morning trying to match up the Log Fold Changes before global normalization and after global normalization from the .gpr files. Ideally the graph should have a linear trend line if global normalization had subtracted the same constant from every spot. However, some of the files output a fuzzy region around the origin, this can be seen in the separate powerpoint with the graphs. The individual red/green intensities were graphed next. The normalized red intensities was graphed against the non-normalized red intensities and the same for the green intensities. These graphs should all be a linear line as well, however some awkward points occurred. The fuzzy region from the LogFC graph is thought to be a resultant from the "error" messages that are scattered throughout the .gpr files. These are a result of the intensity ratio being negative and the Log of any negative number does not exist. The number of "error" messages in each file seemed to correlate with the fuzzy section around the origin, as the number of "error"'s went up, the fuzziness increased to an extent were it almost was not linear anymore.

### May 26, 2011

This morning was spent troubleshooting a line of code given by Dr. Fitzpatrick, M1<-tapply(MA$M[,1],as.factor(MA$genes[,5]),mean), meant to average the duplicate spots in the .gpr files after they were locally normalized using loess. We had to come back later fix this code using a for loop so that R would normalize all of the .gpr files instead of the one we had specified, 5 given that that was the first column with gene names when we were working on this yesterday. However, this was a .gpr file so the first four columns were just 1's, were as the MA data frame that was created did not incoporate these columns into the data frame. Once this was all figured out, I went to lunch and to check on the S. cerevisiae cultures (see lab notebook for details). Once we got back, we spent the remainder of the time trying to make sure the loess normalization worked, after talking with Dr. Fitzpatrick for a bit we switched our determined this was best done by graphing the normalized vs. nonnormalized log fold changes, after they had both been averaged using the tapply function. We spent a short amount of time trouble shooting this, mostly clerical errors when typing the source code, after which we were able to graph the relationship, which came out to be a relatively linear line, showing that loess normalization was working correctly. With this we are able to go back and do the loess normalization on the data given by the rest of strains.

### May 27, 2011

We ran into a few problems at our first attempts at running the code even though we had had successful previous attempts on the previous day. The main problem we found was for the first for loop, the MM must have the [,i] afterwards, or else the output sheet will only be one column, and not all the columns that were put into the loop. After a few more typos that were taken care of this was the outcome:

par(mfrow=c(2,1)) targets<-readTargets(file.choose()) f<-function(x) as.numeric(x$Flags > -99) RG<-read.maimages(targets, source="genepix.median", wt.fun=f) plotMA(RG,main="before within array normalization") MA<-normalizeWithinArrays(RG, method="loess", bc.method="normexp") M1<-tapply(MA$M[,1],as.factor(MA$genes[,5]),mean) n0<-length(MA$M[1,]) n1<-length(M1) MM<-matrix(nrow=n1,ncol=n0) MM[,1]=M1 for(i in 1:20) {MM[,i]<-tapply(MA$M[,i],as.factor(MA$genes[,5]),mean)} write.table(MM,"wt_M_avg.csv",sep=",") plotMA(MA,main="after within array normalization") MAB<-normalizeBetweenArrays(MA,method="scale",targets=NULL) b0<-length(MAB$M[1,]) N2<-tapply(MAB$M[,1],as.factor(MAB$genes[,5]),mean) b1<-length(M1) MMB<-matrix(nrow=b1,ncol=b0) MMB[,1]=N2 for(i in 1:9) {MMB[,i]<-tapply(MAB$M[,i],as.factor(MAB$genes[,5]),mean)} write.table(MMB,"between_array_norm.csv",sep=",")

The point of all this code is to take the .gpr files that we were given, put them into the R statistics database, create an MA plot for before normalization and then normalize them using the locally weighted scatterplot smoother (Loess), note the e in loess may stand for estimated seeing as how the we in lowess stands for weighted. After this was done the data was placed into a matrix, because R has trouble translating a MAlist to a data frame, this matrix was then averaged for every duplicate gene using the line of code discussed on May 26th. This was all written into an excel table and saved along with the accompanying picture of the MAplots. This data was then put through a between array normalization, placed into a separate matrix, boxplots were done of the before and after between array normalization, these were all saved. Some notes on the code is that all the for loops must be changed for the number of .gpr files that are put into the target text file. Work is still progressing on the GCAT chips, first of all, the flask numbering for the GCAT chip as well as flask 4 must be double checked because of some discrepancies as well as a want to not have to run all the data twelve times. However, I was able to place both targets onto R, normalize them, and merge them into a single MAlist. This can be done for the RG list before, at the moment, I am only testing this code to make sure it will accomplish what I need by testing segments and checking the attributes of the outcomes. Once the flask discrepancies can be sorted out with Dr. Dahlquist, a majority of the "New Master List" can be assembled and the GCAT can be focused on from there.

## Week 3

### June 1, 2011

How to use Microsoft Access to match tables:

- Open microsoft access
- Make sure all columns are labeled.
- Save desired files as tab delimited text files.
- File -> Get External Data -> Import your text files.
- Choose ID's to be Field Name
- Check first box for first row is field names
- ID column can be text and is ok with duplicates, will kick it back out if it doesn't like it.
- All other columns should be numbers
- Repeat for other text file.
- Choose own access key
- Create a new query and add both text files
- Click and drag from one ID field name to the other
- Right click on line and choose join properties
- Choose option three, to keep all "Ontario" data and add only new "GCAT" data, arrow should appear showing relationship.
- Select all the field names and click and drag them to Field on the Query, it will automatically assign them.
- Query -> Make-Table Query and name your table
- Click run to get output, select all and copy and paste values into an excel file, make sure format is correct

Spent today working through the GCAT chips, once this all got finished, Dr. Dahlquist showed us how to us Microsoft Access to combine the GCAT and Ontario chips, as well as how to ignore any extra genes on the GCAT chips. After this I spent the afternoon working on constructing a powerpoint with all the MA plots and boxplots as well as a process description and the code. The validation graphs still need to be graphed and placed on the spreadsheet, I will try and get this done tonight so it is ready for June 2nd.

### June 2, 2011

Spent the day going over our results with Dr. Fitzpatrick to fully understand them and make sure everything was scaled correctly. We discovered that the MA plots that were originally outputted by us through R are in fact only the first .gpr file for each strain. As we went back to fix all of those and get all of the graphs on the same page. This was solved by running a for loop for our MA plot maker:

for(i in 1:20) {plotMA(MA[,1])}

Next we decided to match up all the boxplots next to each other. This was a particularly tough time and upsetting once we made the discovery that we had to write out the whole code:

boxplot(MA$M[,1],MA$M[,2],MA$M[,3],MA$M[,4],MA$M[,5],MA$M[,6],MA$M[,7],MA$M[,8],MA$M[,9],MA$M[,10],MA$M[,11],MA$M[,12],MA$M [,13],MA$M[,14],MA$M[,15],MA$M[,16],MA$M[,17],MA$M[,18],MA$M[,19],MA$M[,20])

We made a correction to this code, changed the MA$M to an MAB$M in my case, to plot out the after within array normalization MA plots. After this we attempted to go through the "multtest" tutorial that Dr. Fitzpatrick had given us but ran into a few problems installing the packages. These problems persisted even after we thought we had solved them all, however, Dr. Dahlquist should have the mean to help us install these packages. The last piece of code that I generated was the boxplots for the before within array normalization data so we could see the centering effects. This was probably the most annoying code I have written, thankfully I only made one error which was easily corrected. The code is:

boxplot(log2((RG$R[,1]-RG$Rb[,1])/(RG$G[,1]-RG$Gb[,1])),log2((RG$R[,2]-RG$Rb[,2])/(RG$G[,2]-RG$Gb[,2])),-log2((RG$R[,3]-RG$Rb[,3])/(RG$G[,3]-RG$Gb[,3])),-log2((RG$R[,4]-RG$Rb[,4])/(RG$G[,4]-RG$Gb[,4])),log2((RG$R[,5]-RG$Rb[,5])/(RG$G[,5]-RG$Gb[,5])),log2((RG$R[,6]-RG$Rb[,6])/(RG$G[,6]-RG$Gb[,6])),-log2((RG$R[,7]-RG$Rb[,7])/(RG$G[,7]-RG$Gb[,7])),-log2((RG$R[,8]-RG$Rb[,8])/(RG$G[,8]-RG$Gb[,8])),log2((RG$R[,5]-RG$Rb[,9])/(RG$G[,9]-RG$Gb[,9])),log2((RG$R[,10]-RG$Rb[,10])/(RG$G[,10]-RG$Gb[,10])),-log2((RG$R[,11]-RG$Rb[,11])/(RG$G[,11]-RG$Gb[,11])),-log2((RG$R[,12]-RG$Rb[,12])/(RG$G[,12]-RG$Gb[,12])),log2((RG$R[,13]-RG$Rb[,13])/(RG$G[,13]-RG$Gb[,13])),log2((RG$R[,14]-RG$Rb[,14])/(RG$G[,14]-RG$Gb[,14])),-log2((RG$R[,15]-RG$Rb[,15])/(RG$G[,15]-RG$Gb[,15])),-log2((RG$R[,16]-RG$Rb[,16])/(RG$G[,16]-RG$Gb[,16])),log2((RG$R[,17]-RG$Rb[,17])/(RG$G[,17]-RG$Gb[,17])),log2((RG$R[,18]-RG$Rb[,18])/(RG$G[,18]-RG$Gb[,18])),-log2((RG$R[,19]-RG$Rb[,19])/(RG$G[,19]-RG$Gb[,19])),-log2((RG$R[,20]-RG$Rb[,20])/(RG$G[,20]-RG$Gb[,20])))

## Week 4

### June 8, 2011

Spent the morning lecturing about statistics and the Family-Wise Error Rate(FWER) and the False Discovery Rate(FDR). After a quick intro to MTP we started working through R to get a better understanding of what MTP needed to process through R. This was probably the most tedious point of the day because the output in R consisted of thousands of NaN's(not a number). This was frustrating to say the least. Once Dr. Fitzpatrick looked at what was going wrong we were able to step through the errors and output the pvalues that we wanted, both raw and adjusted, for all the data as well as an average. We used this line of code to do this:

seed<-99 group<-c(rep(0,3)) m1<-MTP(X=matrix,Y=group,test="t.onesamp",standardize=FALSE,B=100,method="?")

The method still needs to be changed for the FDR, without specifying it defaults to one we don't want. The matrix object in this line of code is just the GPR files saved as a CSV excel sheet and written into R. The test also needs to be one sample otherwise it defaults to a two sample t test. See the help section in R or type in ?MTP and look at the examples for more help. Tomorrow will be the real test to see if we can get this to work for the rest of the data points, we also need to compare the raw pvalues that are output with the calculated pvalues from our new normalized master spreadsheet to see how they correlate.

### June 9, 2011

Spent most of the day waiting for R to output the adjusted p-values that we wanted. This took a good deal of time because there are 6000 different genes to calculate the p-value for. We adjusted the number of iterations so that we could get a more accurate result, because with a B=100, the p-values that come back are only shown up to two decimal places, while a B=1000 gives you three decimal places. Once all of these were calculated by R, we placed them into a spreadsheet to compare them to the p-values we calculated by hand. After this was done, I spent some time trying to graph the results and compare the number of rejects from each type of adjustment, I managed to get a graph but I am still researching to see if there is a way to specifiy our data some more and manage to have a better graph. Another thing that I need to research is the use of the TPPFP and the gFWER adjustments. As to what I need to learn, I need to figure out where they are applied, exactly why and in which situations, and how to implement these into R. Hopefully the materials sent by Dr. Fitzpatrick and the paper I need to read for Journal Club will give me a better understanding of what is going on.

### June 10, 2011

Spent the day comparing the different MTP results using different adjustment methods. Attempting to see a correlation of the data however none has been seen yet. This may be due to the methods, however another and possibly more likely cause is that our calculations are not right. We need to make sure we are comparing the same strains, the same time points, and the same methods, otherwise there is a high possibility that no correlation will be seen between the excel calculated data and the R calculated data. After seeing little to no relationship between the data, we started compiling a spreadsheet consisting of the 15 genes/transcription factors that we are focusing on along with all their time points for every strain and the averages and standard deviations for every time point. After going through some of the data some errors were found, one we need to go back and refill our GLN3 data because we started out with only a fraction of the GPR files. We also need to compare the CIN5 values we were using to test our methods today with other CIN5 data, we managed to compare it to wild type data instead and took that as no correlation.

## Week 5

### June 13, 2011

Spent the morning going over the Matlab model that I did this previous semester. After this large review we started making the individual input sheets for Matlab from the yeast regulatory sheet we made on Friday. After a few minutes of this we discovered a mistake in GCAT's in the Master data sheet. So all the Matlab files had to be erased and re-pasted with the corrected data. While going back and fixing the GCAT data we discovered a few more errors in the number of replicates for ZAP1 as well as extra rows due to the controls still being present in the spreadsheet. Just to make sure that nothing was wrong with the rest of the data, we went back and re made the master spreadsheet by combining all of the individual files we got from R. Once all this was done we had to re make the yeast regulatory network spreadsheet, with all of the averages and standard deviations. After this we were able to make the Matlab input sheets for all of the deletion strains. We will start on the Matlab tomorrow morning.

### June 14, 2011

Spent the day reviewing the yeast regulatory network and the Matlab simulation that I focused on this past six months. After making another powerpoint with all of the expression graphs that Matlab outputs for each gene in the network, we went on to graph the relationship between the lse and the norm values at different alpha levels. The Matlab iterations took up most of the time for the workday so I did not spend a lot of time on the actual code, and the concepts were mostly a review for me.

### June 15, 2011

This morning consisted of running more Matlab simulations to try and find the switching point on the lse vs norm graph. The wild type graph produced a smooth curve, however, the dGLN3 graph had a awkward jump at the alpha value of 0.002. We chose this point to focus on for more statistical analysis. The afternoon was spent finding the relative error between each alpha value and 0.002, this was done for the optimized production rates, optimized weights, and the optimized thresholds. Having to do the relative error for every alpha value in all three categories caused this step to take the rest of the day.

### June 17, 2011

Spent the morning comparing the estimated production rates, network weights and network thresholds with the seperate p-values that we simulated the other day. After comparing the graphs and seeing that an alpha value of 0.002 was a good choice and that the trends in the graphs are similar, we decided to re-build the regulation networks and run the simulations with them. The afternoon consisted of building the 10 different excel files, one for each strain and 1 for both the old regulation matrix and the new one with the genes in the strain names added in to see the effect. The first simulation took the rest of the day and the rest will be done over the coming week.

## Week 6

### June 20, 2011

In the morning we convened with Dr. Dahlquist to work out the details in the Matlab models that we are going through this next week. As we made the gene regulation networks in Yeastract we decided to figure out where Yeastract was getting these references for the relationships between the genes. As we went through the references given on the website, Dr. Dahlquist asked us to figure out what the different options in the gene regulation matrix menu were. The two options were to test with direct or indirect evidence, or to test for the number of binding sites on the promoters, anywhere from 1-9. The latter gave us less and less as the number of binding sites to specify increased, so we knew this wasn't right. We talked with Dr. Dahlquist about the direct and indirect options to decide on which to use, they both compare the given transcription factors and genes to a dataset, and decided that a direct set of matrices and a direct & indirect set of matrices were the best option. However, one thing we are still working is exactly where they are getting this dataset from. We have started searching the SGD, or Saccharomyces Genome Database, with any luck we will find it in here. We spent the afternoon on something different than the reference checking.

We put together more data sheets for the direct and the direct & indirect matrices. The first five matrices for the direct evidence matrix had already been put together, however, Dr. Dahlquist wanted us to check on the AFT1 synonym, RCS1, that we had been using in our list of genes and transcription factors to see if this would provide a change in the original direct matrix. This change was evident so we had to go back and recreate all of the original datasheets we had done by changing RCS1 to AFT1 in the list and reforming the matrix in Yeastract. Once these had all been corrected we spent the rest of the day creating the old 21 gene regulatory network matrix so that we could create the target excel sheets to put in Matlab. These corrections and the formatting for these sheets took the remainder of the day, we should start the simulations tomorrow in the Math PC computer lab.

## Week 8

### July 5, 2011

Spent the morning going over the steps very thoroughly with Dr. Dahlquist and Dr. Fitzpatrick, so that we could iron out all the discrepancies in our process and have the best chance of avoiding a typo error in our code. We also started compiling the individual codes and tutorials for the GCAT and Ontario chips as well as the processes to output their graphs, both boxplots and MA plots. After going through and matching up our code so that we were as sure as possible that we would get consistent data we went through R and started outputting all of the data. However, we started finding some problems as we went through the code. Mainly that Katrina and I were getting different values, but only different by values of E-10. While this is not a huge difference, one would expect the same code and inputs to output the same data. The bigger problem came when we tried the process with the GCAT chips, of which Katrina was missing the GPR files and any segments of the code so I copied my own over to her flash drive. The same differences arose when we went through this data and this started to make me feel a bit uneasy, it got worse when I tried outputting my own data to see if R was just not computing each run the same, however the output I got matched up perfectly, in value and sign, to the first run of the data. These differences got even larger after between array normalization where the value of the differences increased to E-04. We took our time going through the code to ensure that no typos were present. We also checked the dye swap csv file and the target files to see if we had them all in the correct order, which we did. Currently we cannot find out where this problem is coming from, hopefully Dr. Fitzpatrick or Dr. Dahlquist will be able to weigh in and help. The code for the GCAT is as follows, the code for the Ontario chips will be located approximately 10 or so spaces below. The process for between array normalization is only located on the GCAT code. I added some descriptions of the steps just to break up the code so it is not overwhelming to look at.

>Read the top and bottom chips into R separately and rbind them >to model the data.frame after ontario chips, 14000 spots in each column

targets<-readTargets(file.choose()) f<-function(x) as.numeric(x$Flags > -99) RT<-read.maimages(targets, source="genepix.median", wt.fun=f) targets<-readTargets(file.choose()) f<-function(x) as.numeric(x$Flags > -99) RB<-read.maimages(targets, source="genepix.median", wt.fun=f) RG<-rbind(RT,RB)

>normalize within each array

MA<-normalizeWithinArrays(RG,method="loess", bc.method="normexp")

>tapply an average function to average duplicate spots so that only unique >spots remain in the data

M<-matrix(nrow=6403,ncol=9) for(i in 1:9) {M[,i]<-tapply(MA$M[,i],as.factor(MA$genes[,4]),mean)}

>write the table to put through microsoft access

write.table(M,"GCAT_WA.csv",sep=",")

>read csv file back in once all logFC are in a single sheet, model after master data >then normalize between arrays

B<-read.csv(file.choose(),sep=",") BX<-as.matrix(B) BA<-normalizeBetweenArrays(BX,method="scale",targets=NULL)

>write out as a table

write.table(BA,"Master_data.csv",sep=",")

- Ontario Code

>Read Target Files into R and Loess Normalize

targets<-readTargets(file.choose()) f<-function(x) as.numeric(x$Flags > -99) RG<-read.maimages(targets, source="genepix.median", wt.fun=f) MA<-normalizeWithinArrays(RG, method="loess", bc.method="normexp")

>Average Duplicate spots so that each GPR has only unique spots left

M1<-tapply(MA$M[,1],as.factor(MA$genes[,5]),mean) n0<-length(MA$M[1,]) n1<-length(M1) MM<-matrix(nrow=n1,ncol=n0) MM[,1]=M1 for(i in 1:94) {MM[,i]<-tapply(MA$M[,i],as.factor(MA$genes[,5]),mean)}

>Read dye swap and perform dye swap on ontario chips

ds<-read.csv(file.choose(),sep=",") MN<-matrix(nrow=6191,ncol=94) for(i in 1:94) {MN[,i]<-ds[i,]*MM[,i]}

>Write file into excel to delete controls (first two rows)

write.table(MN,"Ont_WA_ds.csv",sep=",")

## Week 9

### July 11, 2011

Today was spent researching how to create an index in R consisting of the gene names or just a simple numerical index, so that as the data moves through R, the labels are still present after each major step. So far I have managed to add a column of names and a row of headers to a data frame. There are still some problems arising as I move through. One being that a matrix converts every cell to a character instead of keeping the distinction of a character versus a numeric. The next major step is finding a line of code to keep these indices connected to their corresponding cells in R, we have found a few examples but have nothing to post on R for these at the moment, we are still working through getting the correct format down. One thing I might want to check on later is keeping everything as a matrix, because some functions have problems with data frames, however we may have moved past this problem with our change in the scale normalization step. We have changed the line of code from:

BA<-normalizeBetweenArrays(BX,method="scale",targets=NULL)

to:

for(i in 1:15) {BH[,i]<-MP[,i]/mad(MP[,i])}

More research will need to be done to compile a code for the new format of target files we have, that being a csv file with the corresponding columns:FileName,Header,Strain,TimePoint,Flask,Dyeswap. This is so that everything we need is present in R so that we won't make any human errors by adding typos to our data. Here is the compiled code I have so far:

>Read the target file, and name file, separate the index, dye swap, and row ID's Targets<-read.csv("New_Target.csv",sep=",") Names<-read.csv("ONT_Index_ID.csv",sep=",") f<-function(x) as.numeric(x$Flags > -99) ds<-Targets[,6] Index<-Names[,1] row<-Names[,2] col<-Targets[,2] RG<-read.maimages(Targets[,1],source="genepix.median",wt.fun=f)

>normalize within arrays MA<-normalizeWithinArrays(RG, method="loess", bc.method="normexp") M1<-tapply(MA$M[,1],as.factor(MA$genes[,5]),mean) n0<-length(MA$M[1,]) n1<-length(M1) MM<-matrix(nrow=n1,ncol=n0) MM[,1]=M1 for(i in 1:15) {MM[,i]<-tapply(MA$M[,i],as.factor(MA$genes[,5]),mean)}

>dye swap M2<-tapply(MA$M[,1],as.factor(MA$genes[,5]),mean) n2<-length(MA$M[1,]) n3<-length(M1) MN<-matrix(nrow=n3,ncol=n2) MN[,1]=M2 for(i in 1:15) {MN[,i]<-ds[i]*MM[,i]}

>Assign MAList to a data frame and assign col and row names >delete first two rows as the are controls MO<-as.data.frame.matrix(MN) colnames(MO)<-col rownames(MO)<-row MP<-MO[-c(1,2),]

bn<-readTargets("matrix_names.txt") b0<-length(MP) b1<-length(MP[,1]) BH<-matrix(nrow=b1,ncol=b0,dimnames=list(bn)) for(i in 1:15) {BH[,i]<-MP[,i]/mad(MP[,i])}

### July 12, 2011

Spent the morning working through the Ontario code and fixing a few bugs, we managed to get the names to stick and so we moved onto the GCAT chips. Katrina worked on getting the GCAT chips through R using the new segments of the code worked into the original. After reformatting the entire code a bit we have compiled a mostly finished segment of code however there is still one part to work out. That is getting R to perform what microsoft access did, that is get rid of all the GCAT spots we don't want, without having the actually touch the data yet. Katrina has been working on subsets for the remainder of the afternoon, while I was going through on my own to try and compile subsets I stumbled across a function that may do exactly what we want. That is the function, merge(), will combine two data frames and match up the cells based on criteria you provide. However, this is where the hard part comes in, which is figuring out how to tell R the criteria. I have been getting an error, that tells me that the memory required for the outcome cannot be allotted in R. I have tried using less gpr files for the Ontario chips, seeing as how I was using all 94 originally, as well as saving the workspace so that the data is saved under in a separate folder. However, none of these have seemed to work yet, trying to fix the memory or R to allot more memory may be my best bet at this point, this will be explored more tomorrow, for now I will post the compiled code thus far: Note: The last four lines of code are currently being tempered with, they are not permanent and may not be present next time.

>Read the target file, and name file, separate the index, dye swap, and row ID's

Targets<-read.csv("Targets.csv",sep=",") Names<-read.csv("ONT_Index_ID.csv",sep=",") f<-function(x) as.numeric(x$Flags > -99) ds<-Targets[,6] row<-Names[,2] col<-Targets[,2] RG<-read.maimages(Targets[,1],source="genepix.median",wt.fun=f)

>normalize within arrays

MA<-normalizeWithinArrays(RG, method="loess", bc.method="normexp") M1<-tapply(MA$M[,1],as.factor(MA$genes[,5]),mean) n0<-length(MA$M[1,]) n1<-length(M1) MM<-matrix(nrow=n1,ncol=n0) MM[,1]=M1 for(i in 1:94) {MM[,i]<-tapply(MA$M[,i],as.factor(MA$genes[,5]),mean)}

>dye swap and scale normalization in the same step

M2<-tapply(MA$M[,1],as.factor(MA$genes[,5]),mean) n2<-length(MA$M[1,]) n3<-length(M2) MN<-matrix(nrow=n3,ncol=n2) MN[,1]<-M2*ds[1]/mad(M2) for(i in 2:94) {MN[,i]<-ds[i]*MM[,i]/mad(MM[,i])}

>Assign MAList to a data frame and assign col and row names >delete first two rows as the are controls

MO<-as.data.frame.matrix(MN) colnames(MO)<-col rownames(MO)<-row MP<-MO[-c(1,2),] write.table(MP,"ONT_07122011.csv",sep=",",col.names=NA,row.names=TRUE)

>switch to GCAT directory and start on the GCAT chips

targets<-readTargets("Top.txt") f<-function(x) as.numeric(x$Flags > -99) RT<-read.maimages(targets, source="genepix.median", wt.fun=f) targets<-readTargets("Bottom.txt") f<-function(x) as.numeric(x$Flags > -99) RB<-read.maimages(targets, source="genepix.median", wt.fun=f) RGG<-rbind(RT,RB)

>add headers and row names for GCAT

Header<-read.csv("GCAT_Targets.csv",sep=",") GNames<-read.csv("GCAT_ID.csv",sep=",") Gcol<-Header[,2] Grow<-GNames[,2]

>normalize within arrays and tapply the data to average all the "empty" spots

MAG<-normalizeWithinArrays(RGG,method="loess",bc.method="normexp") R1<-tapply(MAG$M[,1],as.factor(MAG$genes[,4]),mean) r0<-length(MAG$M[1,]) r1<-length(R1) RR<-matrix(nrow=r1,ncol=r0) RR[,1]=R1 for(i in 1:9) {RR[,i]<-tapply(MAG$M[,i],as.factor(MAG$genes[,4]),mean)}

GD<-as.data.frame.matrix(RR) colnames(GD)<-Gcol rownames(GD)<-Grow

NAM<-as.data.frame(Names[,2]) Q<-merge(P,GD,match(P$row.names,GD$row.names),all.NAM=all)

Z<-merge(MP,GD,match(MP$row.names,GD$row.names),by.MP=row.names(MP),by.GD=row.names(MP),all.MP=all)

Z<-merge(MP,GD,match(MP$row.names,GD$row.names),all.MP=all)

### July 13, 2011

Continued on with the merge() code this morning to figure out the secondary commands and try and get it to work. After browsing through a few hundred emails, forums, and instruction manuals, I managed to find an example that showed how to work out the parts I was still having trouble with. My next problem came trying to get only the Ontario IDs on both the Ontario and GCAT chips. This is done by adding the all.x=TRUE line to the inside of merge(). However, R seemed to just ignore the command and I could not figure out why. After trying a few examples to try and make it work, I moved on to creating a subset of the data frame that consisted of only these IDs. Katrina had a doctors appointment so it took us a while to do a comparison because I had to wait for her to come back. After a simple typo that lead us to thinking our outputs were different in only the GCAT chips, we managed to confirm both of our methods were doing the same thing. As to which one we will us I do not know at the moment. The last hour of the afternoon was spent fixing up our presentation for journal club tomorrow. As usual I will post my compiled code at the end, this code is correct and as long as the input files are consistent then everything should run smoothly. The only break that should occur is done by the programmer when they need to switch directories. This can be managed by paying attention to the headers posted above the code, they should list when the change occurs. Here is the code:Note: I originally had a hard delete to get rid of the ONT controls, I will be working in a subsetting procedure to do this instead.

>Read the ONT target file, and name file, separate the index, dye swap, and row ID's

Targets<-read.csv("Targets.csv",sep=",") Names<-read.csv("ONT_Index_ID.csv",sep=",") f<-function(x) as.numeric(x$Flags > -99) ds<-Targets[,6] row<-Names[,2] col<-Targets[,2] RG<-read.maimages(Targets[,1],source="genepix.median",wt.fun=f)

>normalize within arrays

MA<-normalizeWithinArrays(RG, method="loess", bc.method="normexp") M1<-tapply(MA$M[,1],as.factor(MA$genes[,5]),mean) n0<-length(MA$M[1,]) n1<-length(M1) MM<-matrix(nrow=n1,ncol=n0) for(i in 1:94) {MM[,i]<-tapply(MA$M[,i],as.factor(MA$genes[,5]),mean)}

>dye swap and scale normalization in the same step

M2<-tapply(MA$M[,1],as.factor(MA$genes[,5]),mean) n2<-length(MA$M[1,]) n3<-length(M2) MN<-matrix(nrow=n3,ncol=n2) for(i in 1:94) {MN[,i]<-ds[i]*MM[,i]}

>Assign MAList to a data frame and assign col and row names >delete first two rows as the are controls

MO<-as.data.frame.matrix(MN) colnames(MO)<-col rownames(MO)<-row ont1<-subset(MO,Ontario_ID!="Arabidopsis") ont2<-subset(ont1,Ontario_ID!="3XSSC")

>switch to GCAT directory and start on the GCAT chips

targets<-readTargets("Top.txt") f<-function(x) as.numeric(x$Flags > -99) RT<-read.maimages(targets, source="genepix.median", wt.fun=f) targets<-readTargets("Bottom.txt") f<-function(x) as.numeric(x$Flags > -99) RB<-read.maimages(targets, source="genepix.median", wt.fun=f) RGG<-rbind(RT,RB)

>add headers and row names for GCAT

Header<-read.csv("GCAT_Targets.csv",sep=",") GNames<-read.csv("GCAT_ID.csv",sep=",") Gcol<-Header[,2] Grow<-GNames[,2]

>normalize within arrays and tapply the data to average all the "empty" spots

MAG<-normalizeWithinArrays(RGG,method="loess",bc.method="normexp") R1<-tapply(MAG$M[,1],as.factor(MAG$genes[,4]),mean) r0<-length(MAG$M[1,]) r1<-length(R1) RR<-matrix(nrow=r1,ncol=r0) for(i in 1:9) {RR[,i]<-tapply(MAG$M[,i],as.factor(MAG$genes[,4]),mean)}

>make data frame and attach GCAT row names and headers

GD<-as.data.frame.matrix(RR) colnames(GD)<-Gcol rownames(GD)<-Grow

>Deleting unnecessary GCAT IDs using R

Q<-merge(MP,GD,by="row.names",all=T) Z<-subset(Q,Q[,1] %in% Names[,2]) x0<-length(Z[1,]) x1<-length(Z[,1]) XX<-matrix(nrow=x1,ncol=x0) XX[,1]=Z[,1] for(i in 2:104) {XX[,i]<-Z[,i]/mad(Z[,i],na.rm=TRUE)} write.table(XX,"Master_07132011.csv",sep=",",col.names=NA,row.names=TRUE,append=FALSE)

### July 14, 2011

Still working out all the kinks in our code so that when the final data sheet is outputted it will be the master data sheet, that is, without all of the averages, tstats and pvals. Everything has been solved up to keeping the column and row names through out R. The last step we are on is ordering the flasks in the correct order. So far we have developed a hard sort, where we tell R where the columns are going to go. In theory this is fine, however, we are trying to develop a way for R to do this on its own so that we can minimize human error. This is for in the future when more chips are added so that the hard sort does not have to be corrected or added to every time a new deletion strain is added to the data set. This is not as big a problem as previously thought because only the GCAT chips are out of order so only they would have to be changed, or even ignored if the GCAT chips are 1-9 and the Ontario chips are 10-..., this way the strains would only have to be added and the final number would have to be adjusted for the total number of chips being loaded into R. There is one package I found, gregmisc, that contains a function matchcols() and is supposed to match up the columns of one data frame based on the criteria of a vector. I am trying to work out the problems with this code to see if it is even a viable option. The other problem with this code is that it was developed using 2.13.1 meaning that some previous function being called could have changed and our data is no longer consistent. To determine this I need to compare the within array data between versions 2.7.2 and 2.13.1, the between array step has already been removed so this is no longer a problem. However I have a financial problem so this must be taken care of tomorrow. As usual, here is my compiled code, note: the last few lines with no description above them are still being worked with:

>Read the target file, and name file, separate the index, dye swap, and row ID's

Targets<-read.csv("Targets.csv",sep=",") Names<-read.csv("ONT_Index_ID.csv",sep=",") f<-function(x) as.numeric(x$Flags > -99) ds<-Targets[,6] row<-Names[,2] col<-Targets[,2] RG<-read.maimages(Targets[,1],source="genepix.median",wt.fun=f)

>normalize within arrays

MA<-normalizeWithinArrays(RG, method="loess", bc.method="normexp") M1<-tapply(MA$M[,1],as.factor(MA$genes[,5]),mean) n0<-length(MA$M[1,]) n1<-length(M1) MM<-matrix(nrow=n1,ncol=n0) for(i in 1:94) {MM[,i]<-tapply(MA$M[,i],as.factor(MA$genes[,5]),mean)}

>dye swap

M2<-tapply(MA$M[,1],as.factor(MA$genes[,5]),mean) n2<-length(MA$M[1,]) n3<-length(M2) MN<-matrix(nrow=n3,ncol=n2) for(i in 1:94) {MN[,i]<-ds[i]*MM[,i]}

>Assign MAList to a data frame and assign col and row names >delete first two rows as the are controls

MO<-as.data.frame.matrix(MN) colnames(MO)<-col rownames(MO)<-row ont1<-subset(MO,row.names(MO)!="Arabidopsis") MP<-subset(ont1,row.names(ont1)!="3XSSC")

>switch to GCAT directory and start on the GCAT chips

targets<-read.csv("GCAT_Targets.csv",sep=",") f<-function(x) as.numeric(x$Flags > -99) RT<-read.maimages(targets[1:9,1],source="genepix.median",wt.fun=f) RB<-read.maimages(targets[10:18,1],source="genepix.median",wt.fun=f) RGG<-rbind(RT,RB)

>normalize within arrays and tapply the data to average all the "empty" spots

MAG<-normalizeWithinArrays(RGG,method="loess",bc.method="normexp") R1<-tapply(MAG$M[,1],as.factor(MAG$genes[,4]),mean) r0<-length(MAG$M[1,]) r1<-length(R1) RR<-matrix(nrow=r1,ncol=r0) for(i in 1:9) {RR[,i]<-tapply(MAG$M[,i],as.factor(MAG$genes[,4]),mean)}

>make data frame and attach GCAT row names and headers

Header<-read.csv("GCAT_Targets.csv",sep=",") GNames<-read.csv("GCAT_ID.csv",sep=",") Gcol<-Header[1:9,2] Grow<-GNames[,2] GD<-as.data.frame.matrix(RR) colnames(GD)<-Gcol rownames(GD)<-Grow

>Getting rid of the microsoft access step and disposing of unnecessary GCAT IDs using R

Q<-merge(MP,GD,by="row.names",all=T) Z<-subset(Q,Q[,1] %in% Names[,2]) x0<-length(Z[1,]) x1<-length(Z[,1]) XX<-matrix(nrow=x1,ncol=x0) XX[,1]=Z[,1] colnames(XX)=colnames(Z) for(i in 2:104) {XX[,i]<-Z[,i]/mad(Z[,i],na.rm=TRUE)}

XY<-XX[,c(1,100,2,3,4,96,101,5,6,7,97,102,8,9,98,103,10,11,12,99,104,13,14,15,16:95)]

XZ<-read.csv("Ordered_headers.csv",sep=",") XV<-as.vector(XZ) GW<-matchcols(colnames(XX),with=c(as.character(XZ)),method="and") C<-grep(XZ[,1],colnames(XX),ignore.case=FALSE,fixed=TRUE)

F<-sort(XX,by=colnames(XZ))

write.table(XY,"Master_07142011.csv",sep=",",col.names=NA,row.names=TRUE,append=FALSE)

>optional within array data

write.table(Z,"within_array_07142011.csv",sep=",")

### July 15, 2011

This morning consisted of finishing up the ordering of the chips once they are in R. After going through my code separately to make sure everything worked we compared our outputs. To our relief, they all turned out to match perfectly, which means even though Katrina and I have different codes we still both get the same output. The afternoon was spent putting the code on openwetware, this is located under the Dahlquist Lab protocols page in the LMU section. A description of each step was added so now there are rougly as many comments as there are lines of code. Once this was done we went through and did the Pvals individually. We also did the Bonferroni correction to the p-values, which is done by multiplying each p-value by the total number of spots. We also did the Benjamini Hochberg correction, which is similar in that you multiply each spot by the total number of spots, but you then divide by the location of the spot in the index, i.e. the first spot is Pval x 6189/1, the second is Pval x 6189/2, and so on. After this was all done we had two options, to do the F stats and find the p-values with them, or to start working on the multcomp package. While Katrina did the F stats for her data, I worked on learning about two pieces of code, on located in the base package and one located in the multcomp package. The first was lm(), which fits the data to a linear model that you set, however to apply this to our data I will need to know what the model is. The second is glht() which is the general linear hypotheses, this is a much more detailed process and I am still working on getting through the pieces inside the parentheses. The other part I am working on this is the linfct() which is located inside the parentheses in glht(), these lines of code just require some time spent with them to figure out what they do and how to adjust them correctly for our data set. As usual I will post the compiled code, for now this seems to be the final version, and changes will be commented on later. Here is the code:

>Read the target file, and name file, separate the index, dye swap, and row ID's

>normalize within arrays

MA<-normalizeWithinArrays(RG, method="loess", bc.method="normexp") M1<-tapply(MA$M[,1],as.factor(MA$genes[,5]),mean) n0<-length(MA$M[1,]) n1<-length(M1) MM<-matrix(nrow=n1,ncol=n0) for(i in 1:94) {MM[,i]<-tapply(MA$M[,i],as.factor(MA$genes[,5]),mean)}

>dye swap

M2<-tapply(MA$M[,1],as.factor(MA$genes[,5]),mean) n2<-length(MA$M[1,]) n3<-length(M2) MN<-matrix(nrow=n3,ncol=n2) for(i in 1:94) {MN[,i]<-ds[i]*MM[,i]}

MO<-as.data.frame.matrix(MN) colnames(MO)<-col rownames(MO)<-row ont1<-subset(MO,row.names(MO)!="Arabidopsis") MP<-subset(ont1,row.names(ont1)!="3XSSC")

>switch to GCAT directory and start on the GCAT chips

targets<-read.csv("GCAT_Targets.csv",sep=",") f<-function(x) as.numeric(x$Flags > -99) RT<-read.maimages(targets[1:9,1],source="genepix.median",wt.fun=f) RB<-read.maimages(targets[10:18,1],source="genepix.median",wt.fun=f) RGG<-rbind(RT,RB)

>normalize within arrays and tapply the data to average all the "empty" spots

MAG<-normalizeWithinArrays(RGG,method="loess",bc.method="normexp") R1<-tapply(MAG$M[,1],as.factor(MAG$genes[,4]),mean) r0<-length(MAG$M[1,]) r1<-length(R1) RR<-matrix(nrow=r1,ncol=r0) for(i in 1:9) {RR[,i]<-tapply(MAG$M[,i],as.factor(MAG$genes[,4]),mean)}

>make data frame and attach GCAT row names and headers

Header<-read.csv("GCAT_Targets.csv",sep=",") GNames<-read.csv("GCAT_ID.csv",sep=",") Gcol<-Header[1:9,2] Grow<-GNames[,2] GD<-as.data.frame.matrix(RR) colnames(GD)<-Gcol rownames(GD)<-Grow

>Getting rid of the microsoft access step and disposing of unnecessary GCAT IDs using R

Q<-merge(MP,GD,by="row.names",all=T) Z<-subset(Q,Q[,1] %in% Names[,2]) x0<-length(Z[1,]) x1<-length(Z[,1]) XX<-matrix(nrow=x1,ncol=x0) XX[,1]=Z[,1] colnames(XX)=colnames(Z) for(i in 2:104) {XX[,i]<-Z[,i]/mad(Z[,i],na.rm=TRUE)}

XZ<-read.csv("Ordered_headers.csv",sep=",") XV<-as.data.frame.matrix(XX) XY<-XV[,match(XZ[,1],colnames(XV))]

write.table(XY,"Master_07142011.csv",sep=",",col.names=NA,row.names=TRUE,append=FALSE)

## October

### October 4th, 2011

To analyze the significant changes in our gene expression, p-values were calculated using an F-distribution. The first step to calculating this F-distribution is by calculating the sum of the squares of the null hypothesis, or SSH. The null hypothesis being that no genes experience any significant change in expression, therefore the population mean(μ) is 0. To calculate the SSH each genes log fold change was squared and summed over every flask i; i=1, 2, 3, 4, 5; and every time point j; j=t15, t30, t60, t90, and t120. SSH=Σ_{i}Σ_{j}(Y_{ij})^{2}. The second step is to calculate the sum of squares of the alternate hypothesis, or SSF, the difference from the SSH being that the hypothesis states there is at least one significantly changed gene. This is represented by subtracting the population mean from the log fold change before squaring it. The population mean can be calculated by averaging the log fold change for each time point, these values were subtracted from each log fold change for each gene at their respective time point. These values are then squared, and summed over every flask i, and time point j. SSF=Σ_{i}Σ_{j}(Y_{ij}-μ_{j})^{2}. Finally, the F-distribution is calculated by subtracting the SSF from the SSH and dividing by the SSH, then by multiplying this value by the number of flasks(F) subtracted from the number of trials(N) divided by the number of flasks. This will give you the F-distribution with degrees of freedom F, N-F. [(SSH-SSF)/SSF * (N-F)/F] ~ F(F,N-F). This F-distribution, F(F,N-F), can then be converted to p-values using the FDIST() command in excel with the degrees of freedom F, N-F. Note, the F value is 5 for every deletion strain and the wild type, while N-F will vary because the wildtype has 23 repetitions and not 20 like the deletion strains. While the Benjamini Hochberg correction is our main focus due to its robustness when compared to the Bonferroni, both the Benjamini and Hochberg and Bonferroni were calculated to compare the results of signifcant change in gene expression. The Bonferroni was calculated by multiplying all of the p-values by the total number of hypotheses, in this case the 6189 genes being expressed. The Benjamini and Hochberg was calculated by first sorting all of the p-values from least to greatest, and then multiplying by the total number of hypotheses and divided by its position it was sorted in.

### October 25th, 2011

The stem clusters were obtained using the STEM package and instructions located under Assignments: Week 12. Wildtype, dGLN3, and dCIN5 were the three strains that were examined, dHMO1 had an absence of genes in this case and the uncorrected values are to be used in obtaining this profile. During the lecture, Dr. Fitzpatrick asked if I had gone over the Hill plots yet, and in turn, I thought about it and suggested using these plots to somehow estimate the cooperativity, or weight, between each transcription factor. This will have to be elaborated on more in the future. The master data was also formatted such that an extra sheet was added at the end to summarize all of the values we were looking for. This order that the sheet was formatted in was this: first, the index; second, the Gene Symbols; third, the normalized Log fold changes obtained through R; fourth, the average Log fold changes; fifth, the F-distribution, the calculation of which is described above on October 4th as well as the Dahlquist Dry Labs protocol; sixth, the p-values, seventh, the Bonferroni corrections; and finally eighth, the Benjamini & Hochberg corrections. And in turn, each of these columns was grouped by their strain, such that these eight groupings exist for the wildtype, then dCIN5, then dGLN3, and so on. While the all of the columns up to the Bonferroni corrections were sorted in R, the Benjamini & Hochberg corrections were sorted in excel. This poses a problem seeing how excel and R rank symbols, such as a hyphen inside of the Gene Symbol. This can be fixed by a simple step, once all of the data, not including the B&H corrections, is in the new sheet with the Gene Symbol column, sort said column from A→Z. This will allow the hypenated terms to be sorted in excel, and allow the terms associated with these symbols to be sorted along with them. Once everything is sorted properly, the B&H corrections can be added to the excel sheet.

## December

### December 6th, 2011

Most of the day was spent using a MATLAB tutorial on the website <http://www.math.ufl.edu/help/matlab-tutorial/matlab-tutorial.html#SEC5>. Using this tutorial, a previously prepared matlab worksheet was examined to better this understanding. This script was to be used in creating a network of significant genes, as well as a network using the transcription factors. The first step in this process, however, is troubleshooting the code. Suffice to say, this took a majority of the time due to MATLABs "user-friendly" coding and layout. At the moment a matrix is trying to be created using several lines of code:

for i = 1:size(ind2) outarray{i,1}=b{1+ind2(i),7}; outarray{i,2}=IDX(i); outarray{i,3}=A{ind2(i),1}; outarray{i,4}=B{ind2(i),2}; outarray{i,5}=A{ind2(i),3}; outarray{i,6}=A{ind2(i),4}; outarray{i,7}=A{ind2(i),5}; outarray{i,8:12}=C{IDX(i),1:5}; end

This compilation of code is currently in the process of being tweaked. The first difference from the original section is that outarray{i,4} should read:

outarray{i,4}=A{ind2(i),2);

The second is that the whole section of outarray{i,#}=A can be condensded into a smaller line:

outarray{i,3:7}=A{ind2(i),1:5};

The third piece of code that I am currently tweaking with is that originally the A was a numerical matrix that would not group together with the rest of the columns. This was a problem so A is a cell array that is made from a:

A = cellstr(num2str(q(:,1:5)));

While this did create a cell array in matlab, this did not fully separate all of the columns so several iterations of this code are being used. At the moment this seems like the correct piece of code to use, however when trying to split the columns up and place them individually, an error stating incorrect dimensions is produced and no further progress can be made.

Nicholas A. Rohacz 20:34, 6 December 2011 (EST)

# Summer 2012

## Week 1

### May 14, 2012

Today was spent going through the new ANOVA code that Dr. Fitzpatrick sent to us. This code will be posted later on as revisions are made and a finalized version is reached. For now we are attempting to understand all of the lines of code. Some problems that we have run into are what is happening in the two strain comparison code that causes 15 columns of timepoints to be written in the output file. While the output file could be reproduced using the exact files Dr. Fitzpatrick used, the WildType, dGLN3 and his code, we began running into problems once the strains were changed, or some of the timepoints. Another problem that we ran into was when beta = X\Y <-note this is not division, this is a backslask matrix operator in matlab. The Y array ends up as one column, while the for loop it is located in calls for it to be computed for every gene. We are checking with Dr. Fitzpatrick for exactly what this is. The final problem we had with the code was that the first column of many of the simple ANOVA outputs for different strains stayed the same. That is the values for the whole column did not change once the strains changed, but the other columns in the output did. This also showed us that we were running into a problem, because none of the columns from a simple dGLN3 ANOVA output matched the two strain comparison output. Thus we were unable to identify the unknown columns mentioned above.

Nicholas A. Rohacz 20:36, 14 May 2012 (EDT)

### May 15, 2012

Katrina and I spent the day covering two tasks, the first was to combine all of the deletion strains into the two-strain comparison code so that we could analyze all of the strains at once, a side point to fix was to take into account any deleted values, arrising from the difference in genes between the GCAT and Ontario microarray chips being processed. The second task was to begin coding the stochastic model sheets, the first step being figuring out how to use the code {fminsearch} in matlab. While Katrina focused on the first task I began figuring out the coding process for optimizing some test code created a few weeks back labeled my-F and my-F-pt1. These were created to visualize the minimization process using some sample equations. Most of the day was spent dechipering the instructions left by Dr. Fitzpatrick. After I had to ask him about some of the inputs he wanted in the code, he gave me a better heading and I was able to make a little way, but the biggest problem I have run into is putting a function handle on the my-F function such that I can call it from the newly created sheet. While trying to balance this problem with my code, another problem began to arise, the coding began to stop recognizing some of my parameters, or began altering the outcome of my-F-pt1, which was working fine before this started. I am working on putting a handle on the function from my-F, or finding away to code around it.

Nicholas A. Rohacz 20:08, 15 May 2012 (EDT)

### May 16, 2012

Spent the day working out the fstochmin.m code. After attempting different ways for hours to get the equation connected to a function handle(@), I managed to work out the code. The fstochmin.m function is now being called by a separate script, my-F-pt1.m, which also graphs the minimization of the equation in my-F.m. The next step to focus on removing or ignoring the NaNs from the input file and indices such that the output will have all of the correct data points and none of the blank rows that have been resulting from the NaNs. Some ways around this are possibly using accumarray() or something similar, or to use the ignoreNaN.m function I found online. The latter will take some time to figure out how to tailor the function specifically for our needs. Other options are also being researched at the moment by Katrina. I have been leaving early for the past few days due to MCAT studying. The test occurs this coming thursday, the 24th, so after that these entries will be more complete.

Nicholas A. Rohacz 20:26, 16 May 2012 (EDT)

### May 17, 2012

Dr Fitzpatrick emailed me corrected versions of my code last night. No major changes, mostly just rearranging the code around. my-F-pt1.m is now the fstochmin.m function, this plots the minimization as it happens. This was the only sheet with a change to specify the theta values in their new role. The old fstochmin.m was renamed to my-f-driver.m, which just called on the functions and outputted the final values from fminsearch() in the fstochmin.m function. This script held the parameters to be called on in the function. The last function, new-F.m, simply held the function to be minimized, as well as any necessary parameters. The day was spent trying to remove the NaN values from the input data sheets when calcuting the F distribution and Pvals. This became increasingly more challenging as we discovered that matlab, in fact, sees NaN as a real value. While not a integer, it is still classes as being numeric, and thus is throwing off our code. While Katrina is trying to build new regression matrices for the two strain comparison code, I spent my time focusing on the ignoreNaN.m function I discovered. However, this function uses accumarray to piece together the indices, and applies a sum or prod function to the values. After being unable to find a replacement function for accumarray, I decided to help Katrina with debugging her current code and attempting to create separate regression matrices. We have had no luck yet and my MCAT books are calling my name. Some ideas for tomorrows work are to somehow interpolate the missing NaNs back into the Y matrix to somehow account for the empty spaces. Another is to specify exactly which columns are supposed to be used in the calculations for the beta values.

Nicholas A. Rohacz 19:57, 17 May 2012 (EDT)

## Week 2

### May 21, 2012

Today Katrina managed to get past our problems of having the NaNs from the GCAT chips delete the entire row of wildtype data for those specific genes. An updated version of this code will be posted once the two piece comparisons have been compiled for the all strain comparisons. Dr. Fitzpatrick gave me some more equations to work on in the stochastic model, on being a sum of squares equation and the other being a line fit of a least squares error (LSE) equation. The latter requires much more extra coding than the first equation, which involved solely changing the equation function in matlab. I am working on getting through this coding but my MCAT is this Thursday and I am freaking out mildly whenever I look at a practice test. Thus my studying will take priority so that I do not sit down for my test on Thursday and not have a single idea for any question because I am too nervous to think.

Nicholas A. Rohacz 18:04, 21 May 2012 (EDT)

### May 22, 2012

The day was spent looking over the estimation code that was used in last weeks of the class journals and assignments posted in my openwetware. Albeit, this version of the code has been revised several times, the outcome is still the same in that it maps out the gene network using a differential equation and the ODE45 solver in matlab. The first goal in this code is to partition out the general-parameters-estimation-function.m that was received into separate parts. These being part 1: the input parameters and xls spreadsheets, part 2: the least squares error (lse) function, part 3: generating the graphs, and part 4: generating the output xls files. So far only the first partition has been started, this is due to confusion that arose when going over the code. Much of this confusion was solved by Dr. Fitzpatrick telling us that we no longer use the threshold(tau) pieces of the function and instead are using the threshold(b) options throughout the function. The other part that was not removed by Dr. Fitzpatrick that was the process of "rolling" up the network matrix into a vector that, combined with a new thresholds(b) vector, could be used by the lse and other functions. Much of this code will be continued tomorrow and hopefully a few working partitions can be separated.

Nicholas A. Rohacz 20:37, 22 May 2012 (EDT)

### May 23, 2012 (MCAT DAY TOMORROW!!!!)

Today was spent revising the estimation code that we've been working on this week so far. Some of the major changes today consist of changing the input from the average Log2FC to the originals and averaging in matlab. This is important for later on in the LSE function. There are more changes to be done to this function, such as some changes to the model (possibly) and the averaged log2FC may be incomplete as is. The main problem with today is that I have my MCAT tomorrow and I have had trouble focusing all day. My plan is to come in this memorial day weekend and get some more work done. Hopefully I do well on my MCAT.

## Week 3

### May 28, 2012

Memorial Day, No Work.

### May 29, 2012

The day was spent catching up on a the changes that Katrina made to the code during the days I missed last week. Some of the changes were similar to mine but featured better ways to call up necessary inputs so I replaced much of my code with hers. The second part of the day before journal club was spent making minor tweaks to the code so that I could more easily understand it when I needed to find out certain parts or try and debug a problem. The final half of the day after our Vu and Voradhsky journal club was to continue to alter the code such that we could use it on multiple strains. Some of the major changes today include telling matlab when the deletions happen and altering a good deal of the code to handle all the timepoints and strains instead of just one strain at a time using the average Log2FC. As of now the plans for tomorrow are to debug Katrinas code, which is almost identical in function to mine with some minor difference but will not run all the way through like mine will, and to work on correcting the output because so far the optimized parameters are coming out as the same number continuously or as all 0's. Work on altering the code for multiple strains will also continue tomorrow.

Nicholas A. Rohacz 20:39, 29 May 2012 (EDT)

### May 30, 2012

The entire day was spent trying to debug a certain problem in the code. When we tried to optimize our parameters using the fmincon function, the input all checked out ok, along with everything else that had been called. But after the minimization, the optimized weights only contained seven different values, the rest of the values were same, about 0.00653. This is a problem because it means that our code is not using our entire dataset. Katrina and I spent the morning and afternoon trying to solve this problem but to no avail. Tomorrow I have my last day of traveling this summer during work, after that I will be here the entire time. I will be looking over this code on my computer because I have a version of MATLAB available on there.

Nicholas A. Rohacz 19:40, 30 May 2012 (EDT)

### June 1, 2012

Turns out that we were missing a line of code from our sheets. Even more disheartening, the line of code we were missing was present in Dr. Fitzpatricks original code he sent us. After that was all worked out in my absence, Katrina informed me of a problem with the dz output, which is the production part of the differential equation minus the degradation part of the equation. It turned out though that running the code in pieces gave different outputs while running it in sequence gave the same, even with the old code that Dr. Dahlquist and Dr. Fitzpatrick used originally. The remainder of the day was spent working in the multiple strain deletions for the code. The line of code that ended up working for us was known as strcmp(...) which allowed us to compare individual parts of an array with what we wanted. Katrina is working on getting MATLAB to identify the necessary sheets it wants. I went with the hard code route and just told MATLAB what sheets it was looking for. The hard coding worked for multiple deletions, however I can't get the weights to delete themselves with each deletion strain, but I can get the b values to delete themselves for each deletion strain. The output files also save for each individual deletion strain, allowing for comparisons. The next step is to get the weights to delete themselves with each deletion. I did discover that the way I have the code written, that I recall all the parameters, arrays and matrices each time I go through a successive loop, MATLAB doesn't stack the deletions on top of each other, such that the dCIN5 deletion is not present when it is going through any of the successive loops. While this may slow down the processing a little bit it helps with the problem previously stated. A good idea to look into would be to separate the calls for the parameters, matrices and arrays that need to stay the same with each deletion, and separate any that are changing. This will be looked into Monday.

Nicholas A. Rohacz 20:17, 1 June 2012 (EDT)

## Week 4

### June 4, 2012

The morning was spent working out how to get the deletions in the network of edges to occur step wise and get all the necessary outputs. It wasn't until lunch that we learned we might have being doing something entirely different. In all honesty I don't necessarily understand where we are going at this point and I will raise my questions to Dr. Fitzpatrick when I'm not dying from high blood sugar.

Nicholas A. Rohacz 19:50, 4 June 2012 (EDT)

### June 5, 2012

The morning was spent helping Dr. Dahlquist run the first microarray slides of the summer. The strain was BY4741 of S. cerevisiae, and in this case two extra timepoints were recorded, t5 and t240. After the microarray data was final outputted, we made our way back to UHALL only to realize that we had forgotten which slides matched with timepoints. On the way back up to Seaver again we made a wonderful discovery of a gathering of people trying to view the transit of Venus across the Sun in front of the building. After partaking in this experience I've been waiting for for the past week. Once notifying Dr. Dahlquist and the rest of our lab mates, we gathered the necessary information and made our way back to UHALL to normalize the data. Upon comparison between our two separate trials we found that our outputs matched and our processes were still ok, after a few minor tweaks to deal with the lack of GCAT chips and the different number of microarray slides we processed. Tomorrow we will have an impromptu meeting with Dr. Fitzpatrick about our degradation rate paper because we are actually going to calculate them and then back to the deletions and making the stochastic model.

Nicholas A. Rohacz 20:13, 5 June 2012 (EDT)

### June 6, 2012

This morning was spent going over the Journal Club paper on decay rates. Towards the end of our meeting with Dr. Fitzpatrick, he decided to take a look at another paper that was closer to what we were doing in lab, using DNA levels to measure gene expression, whereas the Grigull, old journal club, paper focused on mRNA levels to determine decay rates. It was then decided that we would switch our journal club paper and go over the new one tomorrow morning. The rest of my afternoon was spent fixng the code to output the multiple deletion strains. The biggest problem we are running into is that the output graphs are graphing the model estimated by ODE45 and the general network dynamics function. The plot line of code calls for the model outputted by the ODE45 function, however, we have not worked a switch in anywhere into the code, thus everything outputs correctly on the graph except for the blue model curve fitted to the data points. The next problem is that nowhere in this code does the general network dynamics function call for a switchable input, thus I am not sure how to get the deletion strains to run through that code. Dr. Fitzpatrick told me that if I put a loop after the general network dynamics function estimation and put the ODE45 line of code inside that loop is should work, but this ran into a problem that it was only looking at the wildtype data still. Much of the work this afternoon was debugging the function as new lines of code were added and new data structures were created to house all of the output data and input data in MATLAB. I am working on getting this to work, at the same time we now have a new journal club to read for tomorrow.

Nicholas A. Rohacz 20:09, 6 June 2012 (EDT)

### June 7, 2012

This morning I finished debugging the multiple strains code, with the help of Dr. Fitzpatrick, and managed to start getting outputs. My problem originally had been I was unable to identify the exact input for the ode45 function in MATLAB. It turns out though that the ode45 function picked its own data according to separate conditions, the one I was failing to specify yesterday was the deletions. After some quick debugging there everything worked and we managed to output graphs. This afternoon was spent assembling the graphs into a powerpoint. The format of the powerpoint was that each deletion strain was separated for the old and new code and both were compared to the multiple strain code. This means that the powerpoint had an astonishing 106 slide, 21 for each gene, x5 for each strain, plus the title slide. Tomorrow will likely be spent working on the stochastic model again to try and build the stochastic model for our experiment.

Nicholas A. Rohacz 21:26, 7 June 2012 (EDT)

### June 8, 2012

Today was spent debugging the code and working on our journal club while matlab ran simulations. Hopefully, the last problem we are having is with the outputs. When the model.mod is being written to the outputs, it is log2 fold changed to standardize the values. However, this is changing the values in the first row, for gene ACE2, and is not writing out the last row, for ZAP1. After numerous attempts at writing the output a separate way to try and maintain the values once they've been log2FC, I've narrowed down the problem to likely resulting from one of the loops being too nested in the graphs sheets. The likely places for this are in the LSE sheet. Another likely error to look at would be the loops in the Graphs sheet and see if they are rewriting the values or performing multiple log2 fold changes on the model.mod optimized values. This problem will be looked at over the weekend and on Monday, before and after journal club if I can't figure out the problem before that.

Nicholas A. Rohacz 20:10, 8 June 2012 (EDT)

## Week 5

### June 11, 2012

This morning I managed to fix the problem we had last week where the outputted log2 optimized fold changes were incorrect for some of the strains. The problem was that the dZAP1 strain was being saved as all of the outputs. Once this was specified to change, all of the optimized log2FC's were being outputted correctly. This afternoon consisted of journal club and a continuation of debugging the matlab code, along with starting the abstracts for the symposium this summer in Tennessee for the Society of Mathematical Biology. We are doing two posters, one on the deterministic model, and one on the stochastic model. There is a third poster, of all the statistical testing, but Nicki and Chiddy are going to handle that poster along with the new S. paradoxis information. The problem we are working on now is to get the code to output a single set of weights and bs from our multiple inputs, that works for all the inputs. Right now it seems as thought all the other optimized outputs match up with the wildtype optimized outputs, when running each strain individually using the new code that takes into account all the time points and replicates. It took a while to really understand that we are looking for this output, which seemed to be happening, but it probably won't match up with any of the individual outputs that we already are outputting. That is the main focus of tomorrow, to output a single optimized weights and bs that fit for every model. And by fit, we are looking for the differences between strains to not flip signs and have drastic changes in magnitude.

Nicholas A. Rohacz 20:42, 11 June 2012 (EDT)

### June 12, 2012

This morning Dr. Fitzpatrick emailed us a cleaned up version of the code I had sent him the previous evening. With the tune up, the outputted Wts and Bs looked better and didn't have as many drastic sign flips and magnitude changes. The All strain Wts also differed slightly from the Wildtype Wts which is better than when we were getting the same line. Whether this is what we are looking for or not is up to Dr. Dahlquist and Dr. Fitzpatrick but as of now it looks good. The rest of the morning was spent fixing the output sheet and renaming some of the variables that were deleated in the clean up. This afternoon was spent putting together some comparison graphs, and even developing a piece of code for the process. This came in handy because now I am performing some alterations to the base parameters, such as the alpha value, and some of the scaling taking place in the code. Work also began on the Stochastic Model abstract for the SMB symposium in Tenn. this summer. I have a general understanding of the stochastic model and what it is trying to represent, I just have a few questions for Dr. Fitzpatrick the rest of the experiement, i.e. whether or not its the same in most aspects, what parts are different, and some questions on the stochastic process.

Nicholas A. Rohacz 20:11, 12 June 2012 (EDT)

### June 13, 2012

Today was spent running alpha value comparisons. The first part of the comparisons was to get some scaling factors in the code for the penalty and the lse out values. The lse out values were scaled by 1/# of data points, that is every gene for every time point for every flask in every strain. This was under the variable name errormat, or the error function, in the general least squares error script and was scaled by the length of this array and not the actual values. The penalty was scaled by 1/# of free parameters, that is every parameter that was not fixed and therefore was changing throughout the estimation code. This was given by the variable named theta, and the penalty was scaled by the length of this theta array. Once these scaling factors were successfully incorporated into the code, different alpha values were tested to determine a "best fit". This "best fit" was determined by graphing the lse out vs penalty out to make an L curve. Where this L curve goes from a mostyl vertical slope to a mostly horizontal slope is where the optimal alpha value is. While the L curve for all strains is complete, the L curve for each individual strain is almost done. A problem arose at the last second where some of the values cause the curve to double back on itself, making the graph look like a hairpin. This is likely do to one of the original scaling factors being present in the code still for some, but not all of the output mat files. This just requires me to re run the code and obtain the correct outputs again, once this is done the graphs shouldn't take more than a few seconds.

Nicholas A. Rohacz 20:47, 13 June 2012 (EDT)

### June 14, 2012

This morning was short because I accidently slept through my alarms after the 3am growth curve shift. The rest of the day was spent looking at the abstract Dr. Fitzpatrick had drawn up, a majority of it was similar to the deterministic model, the rest of it was mainly what I had questions on before I began writing it. The other project for today was to test some extra alpha values, but closer to the 0.01 value we picked using the L curves, instead of the wide range, 1-0.0001, used in those curves. This took a while mostly because the individual strain code that Katrina and I had was actually a jumbled compilation of our work before we got the all strain estimation code running. This created a lot of cleaning up that needed to be done before the alpha comparisons could be made and sent to Dr. Dahlquist and Dr. Fitzpatrick.

Nicholas A. Rohacz 21:28, 14 June 2012 (EDT)

## Week 6

### June 18, 2012

Today was spent reading some papers that Dr. Fitzpatrick emailed us last Friday, which was spent putting together a ppt on all the work we had done up until now for journal club. Journal club took a good 3 hours today because our presentation was on a years worth of work. Once we got back, we decided to run some confidence tests to make sure the code is really doing what we want it to do. The first test was to run the new code for all the strains simultaneously, but instead using one individual strain at a time, along with using all of the means instead of the flask data. This was pretty quick, especially computationally, the only problem was that our newest version of the code did not have all the code it needed. And since we are testing the code and aren't allowed to change it, we can't be sure if this causes problems or not, for example, the standard deviations are calculated in MATLAB in the new code, but the old input format called for having a sheet with the sigmas present. Since this was so, the graphical output did not have the standard deviations present, and some other place in the code may have been affected. The next test was to make sure the the optimized model would output the same data if the optimized parameter values were put back into MATLAB. This was difficult though because we need to make a new model of 3 gene with no connections and rewrite all of the input sheets. This has not been run yet, some problems that may occur are size problems, if MATLAB is hardcoded at any point for # of data points or time or flasks, the other problem is that we aren't sure if we need to run an initial optimization with our new network, or if we can use the optimized values from our 21 gene network run with this new network. The problem with the latter idea is that this may change the optimization of the network if it is specific for the 21 gene network and the 50 connections it has already. The final goal for today was to figure out if our code was running in an AND format, where the exact number of TFs that are required to transcribe the mRNA are required to make the model, or if it is in an OR format, where at least one TF has to bind. Biologically, the latter idea would make more sense, even though some genes do require more than one TF, those would likely be case specific. This needs to be checked for both activators or repressors, and for a combination of the two to give us a good idea of what we are running. At this point Dr. Fitzpatrick assumes that it is somewhere in between the two formats. The last two confidence tests will be run tomorrow, hopefully we can get this minor problems worked out too.

Nicholas A. Rohacz 20:06, 18 June 2012 (EDT)