Leanne Kuwahara-Week 4/5

From OpenWetWare
Jump to navigationJump to search

Purpose

  • To analyze the microarray data of the yeast mutant dGLN3.

Materials

  • Microarray data Filename: BIOL388_S19_microarray-data_dGLN3_LK (URDATED)
  • STEM GO Tables (Excel) Filename: STEM_dGLN3_Tables_LK
    • Strain: Δgln3
    • Time points and Replicates:
      • t=15min: 4 replicates (data not provided for replicates 0 and 1)
      • t=30min: 4 replicates (data not provided for replicates 0 and 3)
      • t=60min: 4 replicates (data not provided for replicate 0)
      • t=90min: 4 replicates (data not provided for replicates 0 and 4)
      • t=120min: 4 replicates (data not provided for replicates 0 and 3)
        • t15, t30, & t60 (CS @ 13C)
        • t90 & t120 (Recovery @30C)

Overview of DNA Microarray Analysis

  • Steps 1-3 have been performed using GenePix Pro software (Runs the microarray scanner).
  • Steps 4-5 were performed for using R (see: Microarray Data Analysis Workflow)
  • Statistical analysis will be performed in Microsoft Excel (steps 6 & 7)
  • STEM software used for clustering and mapping (steps 8 & 9)
  • Modeling will be performed in MATLAB (step 10 & 11)
  1. Quantitate the fluorescence signal in each spot
  2. Calculate the ratio of red/green fluorescence
  3. Log2 transform the ratios
  4. Normalize the ratios on each microarray slide
  5. Normalize the ratios for a set of slides in an experiment
  6. Perform statistical analysis on the ratios
  7. Compare individual genes with known data
  8. Pattern finding algorithms (clustering)
  9. Map onto biological pathways
  10. Identifying regulatory transcription factors responsible for observed changes in gene expression
  11. Dynamical systems modeling of the gene regulatory network

Protocol

  • Data was provided in the form of an Excel spreadsheet by Dr. Dahlquist and Dr. Fitzpatrick. Each row on the worksheet was assigned a gene, and the "Mast Index" (column A) numbers the rows sequentially, so that they may be used to sort the genes into their initial order. the "ID" (column B) contains the gene identifier from Saccharomyces Genome Database. Column C contains the standard name for each corresponding gene. Subsequent columns contain the log(2) ratio of Red:Green fluorescence from each microarray experiment. Column headings strain_log(2) fold change_time(min)-replicate.

Statistical Analysis Part 1: Witin-stain ANOVA

  • Purpose: To determine if any genes had a gene expression change that was significantly different than zero at any timepoint.
  1. New "dGLN3_ANOVA" Excel worksheet was created.
  2. All data from the "Master_Sheet" worksheet for dGLN3 was copied and pasted into dGLN3_ANOVA.
  3. Five columns were added labeled dGLN3_AvgLogFC_(TIME)
    • (TIME) is t15, t30, t60, t90, or t120 minutes.
  4. The formula =AVERAGE(D2:G2) was used to calculate the average of the log fold change data for the TFC3 gene at t15 minutes. The remaining genes and timepoints were autofilled in Excel.
    • To Auto Fill Cells: Click on this cell and position your cursor at the bottom right corner. You should see your cursor change to a thin black plus sign (not a chubby white one). When it does, double click, and the formula will magically be copied to the entire column of 6188 other genes.
  5. The column dGLN3_ss_HO was created, and the formula =SUMSQ(D2:W2) was used to square the sum of the data points collected for TFC3. remaining genes were autofilled in Excel.
  6. The column dGLN3_ss_(TIME) was created for each time point. The formula =SUMSQ(D2:G2)-COUNTA(D2:G2)*X2^2 was used for TFC3 at t15 minutes. Remaining genes and time points were autofilled in Excel.
    • The COUNTA function counts the number of cells in the specified range that have data in them (i.e., does not count cells with missing values).
  7. The column dGLN3_SS_full was created. The sum for the range of cells containing "ss" for TFC3 was calculated through Excel using the formula =SUM(AD2:AH2).
  8. The columns dGLN3_Fstat and dGLN3_p-value were created.
  9. Under the dGLN3_Fstat column, the formula =((20-5)/5)*(AC2-AI2)/AI2) was used for gene TFC3. Remaining genes were autofilled in Excel.
    • "20" represents the total number of replicates from all timepoints.
    • "5" is the number of timepoints.
  10. Below the dGLN3_p-value header, the formula =FDIST(AJ2,5,20-5) was used for gene TFC3. Remaining genes were autofilled in Excel.

Sanity Check: Filter p-values less then 0.05

  • Excel was used to filter the data according to the p-value (dGLN3_p-value). Criterion was set so that only data with a p value less than 0.05 was present.
    • A number will appear in the lower left hand corner of the window giving you the number of rows that meet that criterion.

Calculate the Bonferroni and p-value Correction

  • Purpose: To make adjustments to the p-value to correct for the multiple testing problem. (multiplies p-value by number of hypotheses tests run)
  • The "less than 0.05" filter on the dGLN3_p-value was removed for the following steps.
  1. Added 2 columns labeled dGLN3_Bonferroni_p-value.
  2. The formula =AK2*6189 was used for TFC3 in the first column labeled dGLN3_Bonferroni_p-value. The remaining genes were autofilled in Excel.
  3. Corrected p-values greater than 1 were replaced with the number 1 using the formula =IF(AL2>1,1,AL2) for TFC3 in the second dGLN3_Bonferroni_p-value column. The remaining genes were autofilled in Excel.

Calculate the Benjamini & Hochberg p-value Correction

  1. Created a new worksheet titled "dGLN3_ANOVA_B-H".
  2. The "MasterIndex", "ID", and "Standard Name" columns were copied and pasted into the first columns (A, B, & C).
    • For the following, use Paste special > Paste values.
  3. The unadjusted p-values from the dGLN3_ANOVA worksheet was copied and pasted into column D.
  4. Columns A, B, C, and D were selected and sorted by ascending values on Column D.
    • To Sort Data: Click the "sort" button from A to Z on the toolbar. Select sort by column D, smallest to largest.
  5. The Column labeled "Rank" was created in cell E1. This will be used to create a series of numbers in ascending order from 1 to 6189 in this column.
    • This is the p-value rank, smallest to largest. Type "1" into cell E2 and "2" into cell E3.
    • To autofill: Select both cells E2 and E3. Double-click on the plus sign on the lower right-hand corner of your selection to fill the column with a series of numbers from 1 to 6189.
  6. To calculate the Benjamini and Hochberg p-value correction, a column labeled dGLN3_B-H_p-value was created in cell F1. The formula =(D2*6189)/E2 was used for TFC. The remaining genes were autofilled in Excel.
  7. A new column labeled dGLN3_B-H_p-value was entered into cell G1. The formula =IF(F2>1,1,F2) was used for TFC3. The remaining genes were autofilled in Excel.
  8. Columns A through G were selected and sorted them by the MasterIndex in Column A in ascending order.
  9. Column G values were copied and pasted into the dGLN3_ANOVA sheet.

Clustering and GO Term Enrichment with stem

  • Purpose: To cluster gene expression profiles based on the similarity in the change of gene expression over time. This suggests that these clusters are by the same set of transcription factors.
  1. Preparing the microarray data file for loading into STEM
    • A new worksheet was created in Excel named "dGLN3_stem"
    • All of the data was selected from the "dGLN3_ANOVA" worksheet and the values were pasted into the "dGLN3_stem" worksheet
      • The column labeled "Master_Index" was renamed to "SPOT". The column named "ID" was renamed to "Gene Symbol". The column named "Standard_Name" was deleted
      • The B-H corrected p-value was filtered to be > 0.05
        • Once the data was filtered, all data was deleted and the filter was removed. This ensures that we will cluster only the genes with a "significant" change in expression and not the noise
      • All data was deleted EXCEPT for the Average Log Fold change columns for each timepoint (i.e. dGLN3_AvgLogFC_t15, etc.)
      • These columns were renamed using the time and units (i.e. 15m, 30m, 60m, 90m, & 120m)
      • This spreadsheet was saved as a Text (Tab-delimited) (*.txt) file for STEM analysis
  2. Download and extract the STEM software. Click here to go to the STEM web site
    • Click on the download link and download the stem.zip file
    • Unzip the file
      • This will create a folder called stem
    • Select stem.jar in the stem folder to launch the STEM program
  3. Running STEM
    1. In section 1 (Expression Data Info) of the the main STEM interface window, click on the Browse... button to navigate to and select the stem.txt file
      • SelectNo normalization/add 0
      • Check the box next to Spot IDs included in the data file
    2. In section 2 (Gene Info) of the main STEM interface window, leave the default selection for the three drop-down menu selections for Gene Annotation Source, Cross Reference Source, and Gene Location Source as "User provided"
    3. Click the "Browse..." button to the right of the "Gene Annotation File" item. Browse to your "stem" folder and select the file "gene_association.sgd.gz"
    4. In section 3 (Options) of the main STEM interface window, make sure that the Clustering Method says "STEM Clustering Method". Do NOT change the defaults for "Maximum Number of Model Profiles" or "Maximum Unit Change in Model Profiles between Time Points"
    5. In section 4 (Execute) click on the yellow Execute button to run STEM
  4. Viewing and Saving STEM Results
    • A new window will open called "All STEM Profiles (1)"
    • Each box corresponds to a model expression profile
      • Colored profiles have a statistically significant number of genes assigned (they are arranged in order from most to least significant p value)
      • Profiles with the same color belong to the same cluster of profiles
    • The number in each box is simply an ID number for the profile
    1. Click on the button that says "Interface Options..."
      • At the bottom of the Interface Options window that appears below where it says "X-axis scale should be:", select "Based on real time"
    2. Take a screenshot of this window and paste it into a PowerPoint presentation
    3. Click on each of the SIGNIFICANT profiles (the colored ones)
      • This opens a window showing a more detailed plot containing all of the genes in that profile
    4. Take a screenshot of each of the individual profile windows and save the images in your PowerPoint presentation.
    5. At the bottom of each profile window, there are two yellow buttons "Profile Gene Table" and "Profile GO Table". For each of the significant profiles, click on the "Profile Gene Table" button to see the list of genes belonging to the profile. In the window that appears, click on the "Save Table" button and save the file as "dGLN3_profile#_genelist.txt"
      • Replace the number symbol with the actual profile number
    6. For each of the significant profiles, click on the "Profile GO Table" to see the list of Gene Ontology terms belonging to the profile. In the window that appears, click on the "Save Table" button and save the file as "dGLN3_profile#_GOlist.txt"
      • Replace the number symbol with the actual profile number.
        • The file has been uploaded to OWW and linked in the Results section as "dGLN3 STEM Figures and Tables"
    • At this point, all of the primary data from the STEM software has been saved and it's time to interpret the results!!!
  5. Analyzing and Interpreting STEM Results
    1. Profile 45 was chosen for further analysis, because this cluster of genes was up-regulated during the early timepoints of cold-chock (15min and 30min), and down-regulated during recovery. Additionally, this profile contained clusters of genes encoding for ribonucleoproteins (RNPs), which I have studied previously.
    2. Gene profile 45 was used to determine the number of genes assigned, the number of genes expected, and the p-value for the enrichment of genes
      • This p-value determines whether the number of genes that show this particular expression profile across the time points is significantly more than expected.
    3. The GO list file for profile 45 was copied and pasted into Excel
      • This list shows all of the Gene Ontology terms that are associated with genes that fit this profile
    4. The data was filtered so that only genes with a corrected p < 0.05 was shown
    5. The six most significant (but not redundant) genes were defined at http://geneontology.org
      • The GO ID (e.g. GO:0044848) was entered into the search field on the left of the page.
      • In the results page, click on the button that says "Link to detailed information about Category Name"
      • The definition will be on the next results page, e.g. here
  • In your research presentation, you will discuss the biological interpretation of these GO terms. In other words, why does the cell react to cold shock by changing the expression of genes associated with these GO terms? Also, what does this have to do with the transcription factor being deleted?

Using YEASTRACT to Infer which Transcription Factors Regulate a Cluster of Genes

  • Purpose: To explore the regulatory transcription factors responsible for the change in gene expression of profile 45 using YEASTRACT.
  1. The gene list for profile 45 was opened in Excel
    • Cluster 45 was chosen because it was induced during cold shock and repressed during recovery
  2. Copy the list of gene IDs onto your clipboard
  3. Launch a web browser and go to the YEASTRACT database
    • On the left panel of the window, click on the link to Rank by TF
    • Paste your list of genes from your cluster into the box labeled ORFs/Genes
    • Check the box for Check for all TFs
    • Accept the defaults for the Regulations Filter (Documented, DNA binding plus expression evidence)
    • Do not apply a filter for "Filter Documented Regulations by environmental condition"
    • Rank genes by TF using: The % of genes in the list and in YEASTRACT regulated by each TF
    • Click the Search button
    • In the results window that appears, the p values colored green are considered "significant", the ones colored yellow are considered "borderline significant" and the ones colored pink are considered "not significant"
    • The table of results from the web page was copied and pasted it into a new Excel workbook to preserve the results (YEASTRACT_dGLN3_LK)
      • The Excel file was uploaded to Box
  4. YEASTRACT was used to define a gene regulatory network of 15-20 transcription factors that regulate other transcription factors
    • The 18 most significant transcription factors were chosen. HAP4 was also included for analysis
    • The YEASTRACT database Generate Regulation Matrix was used for analysis
      • Copy and paste the list of transcription factors you identified into both the "Transcription factors" field and the "Target ORF/Genes" field
    • Use the "Regulations Filter" options of "Documented", "Only DNA binding evidence"
      • Click the "Generate" button
      • In the results window that appears, click on the link to the "Regulation matrix (Semicolon Separated Values (CSV) file)"

Visualizing Your Gene Regulatory Networks with GRNsight

  • Purpose: to analyze the regulatory matrix files you generated above in Microsoft Excel and visualize them using GRNsight to determine which one will be appropriate to pursue further in the modeling
  1. Properly format the output files from YEASTRACT
    • Open the CSV file in Excel
      • It will not open properly in Excel because a semicolon was used as the column delimiter instead of a comma
    • To fix this, Select the entire Column A. Then go to the "Data" tab and select "Text to columns"
    • In the Wizard that appears, select "Delimited" and click "Next"
    • In the next window, select "Semicolon", and click "Next"
    • In the next window, leave the data format at "General", and click "Finish"
      • This should now look like a table with the names of the transcription factors across the top and down the first column and all of the zeros and ones distributed throughout the rows and columns. This is called an "adjacency matrix"
        • A "1" means there is a connection between the transcription factor in that row and column
        • A "0" means there is NO connection between the two transcription factors
    • This file was saved as a Microsoft Excel workbook (.xlsx) (dGLN3_RegulationMatrix_LK_Excel) and uploaded to BOX
  2. Transpose (col<-->row) the adjacency matrix so that is is usable in GRNmap (the modeling software) and GRNsight (the visualization software)
    • A new worksheet was created in Excel named "network". Go back to the previous sheet and select the entire matrix and copy it
    • In "Network" worksheet, "Paste special" >> "Transpose" and paste data
      • This is necessary because we want the transcription factors that are the "regulatORS" across the top and the "regulatEES" along the side
    • The labels for the genes in the columns and rows need to match. Thus, delete the "p" from each of the gene names in the columns. Adjust the case of the labels to make them all upper case
    • In cell A1, enter the text "rows genes affected/cols genes controlling"
    • For ease of working with the adjacency matrix in Excel, we want to alphabetize the gene labels both across the top and side
      • Select the entire adjacency matrix
      • Custom sort column A alphabetically, being sure to exclude the header row
      • Sort row 1, from left to right, the same way
  3. Visualize what these gene regulatory networks look like with the GRNsight software
    • Go to the GRNsight home page
    • Select the menu item File > Open and select the regulation matrix .xlsx file that has the "network" worksheet in it that you formatted above
    • GRNsight will automatically create a graph of your network
    • Click the "Grid Layout" button to arrange the nodes in a grid
    • The network was pasted into PowerPoint presentation BIOL388_S19_pptSlide_LK
  4. Pruning Out Transcription Factors
    • Nodes (genes) with the least amount of connections (except GLN3, ZAP1, and HAP4) were deleted from the network to have a total of 15 TF in the regulatory model. the TF's TUP1, YHP1, STB5, and UME6 were excluded from the model. GCR2 only had one connection, however it was included in the model because it acts on GLN3.
    • Go back to the Excel workbook and network sheet and delete both the row and column with the gene's name. Then re-upload the edited file to GRNsight to visualize it. Use this final version in your PowerPoint and subsequent modeling


Results

Sanity Check: Number of genes significantly changed

  • Purpose: ensure that data analysis was performed correctly.
    • Determine the number of genes that are significantly changed at various p-value cut-offs.
  • In dGLN3_ANOVA worksheet the unadjusted p-value was filtered for the following criterion:
  1. Genes that have a p < 0.05:
    • 2135 genes (34.5%)
  2. Genes that have a p < 0.01:
    • 1204 genes (19.5%)
  3. Genes that have a p < 0.001:
    • 514 genes (8.31%)
  4. Genes that have a p < 0.0001:
    • 180 genes (2.91%)
  • p < 0.05, means that you would have seen a gene expression change that deviates this far from zero by chance less than 5% of the time.
    • Another way to state what we are seeing with p < 0.05 is that we would expect to see a gene expression change for at least one of the timepoints by chance in about 5% of our tests, or 309 times. Since we have more than 309 genes that pass this cut off, we know that some genes are significantly changed. However, we don't know which ones. To apply a more stringent criterion to our p-values, we performed the Bonferroni and Benjamini and Hochberg corrections to these unadjusted p-values. The Bonferroni correction is very stringent. The Benjamini-Hochberg correction is less stringent. To see this relationship, the data was filtered as follows:
  1. Genes that have a p < 0.05 for the Bonferroni-corrected p value:
    • 45 genes (0.727%)
  2. Genes that have a p < 0.05 for the Benjamini and Hochberg-corrected p value:
    • 1185 genes (19.1%)
  • The p-value cut-off is a movable confidence level.
Comparing results with known data:
  • The expression of the gene NSR1 (ID: YGR159C) is known to be induced by cold shock.
  • Unadjusted p-value of NSR1: 0.000506764
  • Bonferroni-corrected p-value of NSR1: 3.136364678
  • B-H-corrected p-value of NSR1: 0.008167616
  • The average Log fold change at each timepoint for NSR1 in the experiment:
    • t15 = 3.506225
    • t30 = 4.5319
    • t60 = 2.7592
    • t90 = -1.85025
    • t120 = -1.867425

STEM analysis

  • General function of genes within:
    • Profile 45: ribosome and RNA processing
    • Profile 9: amino acid metabolism
    • Profile 2: protein modification
    • Profile 22: protein transport
    • Profile 48: cell division
    • Profile 7: regulation of membranes
    • Profile 31: intracellular organization
      • Profile 45 was chosen for further analysis because the trend of up-regulation during cold shock and down-regulation during recovery was very clear
  • Profile 45 was assigned 406.0 genes
  • 29.9 genes were expected to be in the profile
  • The p-value for the enrichment of genes in this profile is 0.00
    • This p-value determines whether the number of genes that show this particular expression profile across the time points is significantly more than expected
      • There are 259 GO terms associated with this profile at p (uncorrected) < 0.05
  • The GO list also has a column called "Corrected p-value"
    • This correction is needed because the software has performed thousands of significance tests
    • There are 26 GO terms associated with this profile with a corrected p value < 0.05

GO Term Definitions:

  • Six Gene Ontology terms from the filtered corrected p < 0.05 list (column H) were defined
    • Selected terms reflected those that were the most significant, but not redundant
      • For example, "RNA metabolism" and "RNA biosynthesis" are redundant with each other because they mean almost the same thing
  1. Ribosome assembly: The aggregation, arrangement and bonding together of the mature ribosome and of its subunits
  2. Ribonucleoprotein complex export from nucleus: The directed movement of a ribonucleoprotein complex from the nucleus to the cytoplasm
  3. snRNA metabolic process: The chemical reactions and pathways involving snRNA, small nuclear RNA, any of various low-molecular-mass RNA molecules found in the eukaryotic nucleus as components of the small nuclear ribonucleoprotein
  4. Nucleolar metabolic process: Any constituent part of a nucleolus, a small, dense body one or more of which are present in the nucleus of eukaryotic cells. It is rich in RNA and protein, is not bounded by a limiting membrane, and is not seen during mitosis
  5. Nitrogen compound metabolic process: The chemical reactions and pathways involving organic or inorganic compounds that contain nitrogen
  6. snoRNA binding: Interacting selectively and non-covalently with small nucleolar RNA

YEASTRACT

  • 30 of the transcription factors are "significant"
  • The transcription factors GLN3 and ZAP1 were both significant.
    • GLN3
      • % in user set: 31.02%
      • % in YEASTRACT: 10.52%
      • p-value: 6.10E-14
    • ZAP1
      • % in user set: 30.52%
      • % in YEASTRACT: 7.99%
      • p-value: 5.78E-6
  • HAP4 was not classified as significant

Transcription Factors Used to Generate a Regulation Matrix on YEASTRACT

  1. Sfp1p
  2. Msn2p
  3. Yox1p
  4. Ace2p
  5. Gln3p
  6. Yap1p
  7. Pdr3p
  8. Pdr1p
  9. Swi5p
  10. Mig2p
  11. Asg1p
  12. Rim101p
  13. Gcr2p
  14. Zap1p
  15. Hap4p (not significant)

Conclusion

The purpose of this assignment was to analyze the microarray data of the yeast mutant dGLN3. An ANOVA test was run in Excel and the p-values were corrected for via the Bonferroni method and the Benjamini-Hochberg method. The Bonferroni-corrected p-value was the most stringent, and resulted in the least amount of genes remaining after filtering the data to include only those whose p<0.05 (0.727%). The data was also compared to with the NSR1 gene to check that the analysis was run correctly. The data was then visualized and analyzed using STEM software. The program produced seven significant clusters: profiles 45, 48, and 31 were induced during cold shock, then repressed during recovery, while profiles 9, 2, and 7 showed an opposite trend. Profile 22 appeared to have no change in gene expression during cold shock, but was up-regulated during recovery. Profile 45 was chosen for further analysis, because this cluster of genes was demonstrated a trend of up-regulation during the early timepoints of cold-chock (15min and 30min), and down-regulation during recovery. Profile 45 consisted of genes involved in ribosome and RNA regulation, and had 406.0 genes assigned (29.9 genes expected) and a p-value of 0.00. The up-regulation of this cluster is likely to inhibit translation and prevent protein synthesis during stress. Profile 45 was then analyzed through YEASTRACT to determine the transription factors responsible for regulation of these genes. 30 Transcription factors were classified as significant, and of these the 18 most significant were chosen for further analysis on GRNsight. HAP4 was also included in the analysis, although it was not a significant transcription factor. GRNsight was used to visualize which transcription factors were interacting and to form a regulatory matrix. TUP1, YHP1, STB5, and UME6 were excluded from the matrix because they had the least amount of connections in the model. GCR2, while having only one connection, was included in the model as the one connection was with GLN3.

Acknowledgements

-Texted a few times to compare numbers and clarify aspects of the assignment.
  • Spoke with Dr. Dahlquist during office hours to fix an issue with my Excel spreadsheet, resulting in incorrect or missing p-values.
Except for what is noted above, this individual journal entry was completed by me and not copied from another source.

References

  • Dahlquist, K. & Fitpatrick, B. (2019). "BIOL388/S19: Week 4" Biomathematical Modeling, Loyola Marymount University. Accessed from:Week 4 Assignment Page

Links