Conor Keith Week 11

From OpenWetWare
Jump to navigationJump to search

Purpose

  • The purpose of this assignment is to perform statistical analysis on microarray dated obtained from the gene GLN3 to determine the genes response to cold-shock at any timepoint.

Workflow

  • The dataset I used for this assignment analyzed microarray data dGLN3, the change in expression of the gene GLN3
  • The spreadsheet used was named "Master_Sheet_dGLN3", I saved the file as "dGLN3spreadsheet.xlsx"
  • There were 6188 replicates individual strain ID's and there were three trials for each timepoint 15, 30, 60, 90, and 120 minutes.

Statistical Analysis Part 1

  1. I Create a new worksheet, and named it "dGLN3_ANOVA"
  2. I then copied the first three columns containing the "MasterIndex", "ID", and "Standard Name" from the "Master_Sheet" worksheet for dGLN3 and pasted it into my new worksheet. I then copied the columns containing the data for my strain and pasted it into my new worksheet.
  3. At the top of the first column to the right of my data, I created five column headers of the form dGLN3_AvgLogFC_(TIME) where (TIME) is 15, 30, etc.
  4. In the cell below the dGLN3_AvgLogFC_t15 header, I typed =AVERAGE( to obtain the mean of the data at time point 15.
  5. I then highlighted all the data in row 2 associated with dGLN3 and t15, and pressed enter.
  6. This cell now contains the average of the log fold change data from the first gene at t=15 minutes.
  7. I then copied the command to obtain the means for the rest of the gene Id's at t=15 min.
  8. I then repeated steps (4) through (8) with the t30, t60, t90, and the t120 data.
  9. In the first empty column to the right of the dGLN3_AvgLogFC_t120 calculation, I created the column header dGLN3_ss_HO.
  10. In the first cell below this header, I typed =SUMSQ(
  11. I then highlighted all the LogFC data in row 2 for dGLN3 (but not the AvgLogFC), pressing the closing paren key (shift 0),and pressing the "enter" key.
  12. In the next empty column to the right of dGLN3_ss_HO, I created the column headers dGLN3_ss_(TIME) as in (3).
  13. I made a note of how many data points you have at each time point for your strain.
  14. In the first cell below the header dGLN3_ss_t15, I typed =SUMSQ(<range of cells for logFC_t15>)-<number of data points>*<AvgLogFC_t15>^2 and hit enter.
  15. I Repeated this computation for the t30 through t120 data points.
  16. In the first column to the right of dGLN3_ss_t120, I created the column header dGLN3_SS_full.
  17. In the first row below this header, I typed =sum(<range of cells containing "ss" for each timepoint>) and hit enter.
  18. In the next two columns to the right, I created the headers dGLN3_Fstat and dGLN3_p-value.
  19. The total number of data points is denoted 'n'
  20. In the first cell of the dGLN3_Fstat column, I typed =((n-5)/5)*(dGLN3_ss_HO>-<dGLN3_SS_full>)/dGLN3_SS_full> and hit enter.
  21. In the first cell below the dGLN3_p-value header, I typed =FDIST(dGLN3_Fstat>,5,n-5) replacing the phrase dGLN3_Fstat with the cell designation and the "n" as in (13) with the number of data points total.

Results

  • I ended up with negative sum of squares results and negative values for F-stats, as well as errors for my p-value results.
  • After talking to my lab partner and Dr. Fitzpatrick, we deduced that this was due to the large number of missing values from the dataset, which caused us to have to divide by zero.
  • With Dr. Fitzpatrick's permission, I loaded the data into Stata to attempt to replace missing values with the mean of each ID at each individual time point.
  • After many failed attempts, I was unable to create a loop which would allowed me to perform this operation, as stata does not treat its data as tables, rather as multiple variables.
  • I then transformed the data into a matrix and tried to create a loop which would allow me to replace the missing values with row means at each time point. This attempt failed as well.
  • I then uploaded the data into matlab and attempted to do the same, this attempt failed as well.
  • I then tried to replace the missing values in my excel spreadsheet with "null" and have excel ignore the missing string values in the analysis.
  • I was unable to find a macro which would tell excel to ignore missing values in calculations.

Conclusion

  • I was unable to obtain the correct values for my statistical analysis due to a missing values problem which resulted in numerous errors. In order to get the proper results I need to solve the missing values problem.

Acknowledgements

  • For this assignment I worked with Nika and we communicated in class and over text. Other than what was previously noted this assignment was completed by me and me alone without help from an outside source.

Conor Keith 01:06, 6 April 2017 (EDT)

References

Assignment Pages:

Individual Journal Entries :

Shared Journal Pages: