Kristen M. Horstmann Lab Notebook

October 11, 2015

Finally made electronic notebook, will be posting work from previous weeks.
- Began work on updated Microarray data analysis workflow: Dahlquist:Microarray Data Analysis Workflow. Completed steps 4-5 and began step 6, Statistical Analysis, by creating the Excel sheet. The workflow was followed up until the heading "Within-Strain ANOVA." The Excel sheet was saved on Google Drive. There were 41,998 deletions of #VALUE! in the Excel sheet.
-Sanity check for GLN3
Sanity Check
p < 0.05......2255
p < 0.01......1325
p < 0.001......616
p < 0.0001......259
Bonferroni p < 0.05......106
B-H p < 0.05......1356


For NSR1
unadjusted p-value......0.00050676
Bonferroni p-value......1
B-H p-value......0.00660287
AvgLogFC_t15......3.506225
AvgLogFC_t30......4.5319
AvgLogFC_t60......2.7592
AvgLogFC_t90......-1.85025
AvgLogFC_t120......-1.867425

-Sanity check for ZAP1


p < 0.05......2559
p < 0.01......1683
p < 0.001......953
p < 0.0001......521
Bonferroni p < 0.05......251
B-H p < 0.05......1859

For NSR1
unadjusted p-value.....6.06E-08
Bonferroni p-value......0.000374851
B-H p-value......6.15E-06
AvgLogFC_t15......3.8996
AvgLogFC_t30......3.7238
AvgLogFC_t60......3.962775
AvgLogFC_t90......-2.156
AvgLogFC_t120......0.0542


-Sanity check for SWI4: encountered some issues with SWI4, as the data for it was formatted slightly differently than the others, so the equations typed in may have been slightly altered
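
The sanity-check counts above were tallied in Excel following the workflow. As a rough cross-check, the same counts could be computed in MATLAB; this is only a hypothetical sketch (the file name, sheet, and cell range are placeholders, not the actual workbook), assuming the unadjusted ANOVA p-values sit in a single column:

 % Hypothetical sketch; the actual counts were done in Excel. File/sheet/range are placeholders.
 p = xlsread('dGLN3_stats.xlsx', 'statistics', 'B2:B6190');  % unadjusted p-values, one per gene
 m = numel(p);
 bonf = min(p * m, 1);                                        % Bonferroni-adjusted p-values
 [ps, order] = sort(p);                                       % Benjamini-Hochberg step-up adjustment
 bh = min(1, cummin(ps .* m ./ (1:m)', 'reverse'));
 bh(order) = bh;                                              % restore original gene order
 counts = [sum(p < 0.05), sum(p < 0.01), sum(p < 0.001), sum(p < 0.0001), ...
           sum(bonf < 0.05), sum(bh < 0.05)]                  % compare against the counts above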

October 12, 2015

dZAP1 Results

  • Number of significant transcription factors: 99
  • List of significant transcription factors with % in user set, % in YEASTRACT, p-value:
  • Are CIN5, GLN3, HAP4, HMO1, SWI4, and ZAP1 on the list?
    • CIN5p, SWI4p, and ZAP1p were on this list
  • Opted not to continue with the next transcriptional map task, as there were too many genes to create and it was unclear which genes to continue with.
  • Accessed evening of 10/12

dGLN3 Results

  • Number of significant transcription factors: none
  • List of significant transcription factors with % in user set, % in YEASTRACT, p-value:
  • accessed morning of 10/14

dSWI4 Results

Opted not to continue with SWI4 as the numbers did not match Tessa's check on my work. Waiting for confirmation from a third party on whose data to use.

October 14, 2015

Notes from team meeting:

  • Removing Spar from further analysis; not fair to assume the yeast results apply to it
  • all-all green excel matrix but cut out the genes that are not connected.
  • Delete the least significant transcription factor
  • Then redo with the deletion strains and ensure that wt, dCIN5, dGLN3, HMO1, dHAP4, dSWI4, and dZAP1 are kept; do not delete a TF if it is attached to these genes
  • Create "family" of about 35 transcription factors, no smaller than 15
  • Not making networks off of SPAR, HMO1, or SWI4.
  • Each person assigned to different gene
    • Kayla: dCIN5
    • Kristen: dZAP1
    • Natalie: wt
    • Tessa: dGLN3
    • Grace: dHAP4


  • Paring down networks:
    • First, eliminate unconnected genes
    • Then, systematically pare down by p-value, one by one. For each elimination, check and get rid of unconnected genes. We want a 15-35 range for our family of networks. (A rough MATLAB sketch of this loop follows this list.)
    • For now, we will focus on wild type and deletion strains (not Spar)
  • Two types of pare downs:
    • First, just the genes from YEASTRACT
    • Later, adding in our deletion strain genes. Careful with elimination of TF's - deletion strain genes must stay in
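
Not something we actually coded; just a rough MATLAB sketch of the pare-down loop above, assuming the adjacency matrix A (N-by-N), YEASTRACT p-values pvals (N-by-1), gene names, and a logical mask keep marking the six deletion-strain genes are already loaded (all variable names are placeholders):

 % Rough sketch of the pare-down loop; A, pvals, names, keep are assumed to already be in memory.
 target = 35;                                       % stop once the network is within the 15-35 range
 connected = @(M) (sum(M, 1)' + sum(M, 2)) > 0;     % gene has at least one edge in or out
 alive = connected(A) | keep;                       % first pass: drop unconnected genes (never the 6)
 while sum(alive) > target
     cand = find(alive & ~keep);                    % deletion-strain genes are never candidates
     if isempty(cand), break, end
     [~, worst] = max(pvals(cand));                 % least significant remaining TF
     alive(cand(worst)) = false;
     idx = find(alive);                             % re-check connectivity among the survivors
     still = connected(A(idx, idx));
     alive(idx(~still & ~keep(idx))) = false;       % drop anything left floating
 end
 prunedNames = names(alive);                        % the pared-down gene family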

October 19, 2015

  • For Binding AND Expression
    • Was able to delete 21 transcription factors, but that still only pared them down to 83.
    • Pared down based off YEASTRACT p-values to a network of 31
  • For Binding PLUS Expression
    • Could not delete any transcription factors, all of them had at least one 1 in the row or column
    • Turned to paring down based off the least significant p-values according to the YEASTRACT output of p-values
    • Pruned out 72 rows of the least significant p-values, down to 31 transcription factors
  • For ONLY Binding
    • Had a few rows without a connection but could not delete any transcription factors
    • Turned to paring down based off the least significant p-values, down to 31 transcription factors

October 21, 2015

  • Notes from Meeting
    • Told only need to be creating networks for ONLY Binding, and confirmed that p-values were used from YEASTRACT.
    • Will make a handful of networks with transcription factors ranging from 15-35 based only off of ONLY binding.
    • Need to redo as may have deleted too many lines of data before checking back on the 0s in the edges

October 27, 2015

  • Made 17 sheets ranging from 35 genes to 15 genes by deleting based off 0s in edges and off significance of p-values INCLUDING the six deletion strains (CIN5, HAP4, GLN3, ZAP1, SWI4, HMO1)
  • Made 16 by deleting based off of 0s in edges and off significance of p-values. Deletions were done regardless of whether the deleted gene was one of the 6 or not
  • Ran some of them through GRNsight to check their networks and connections
  • First had to reformat Excel
    • For this adjacency matrix to be usable in GRNmap (the modeling software) and GRNsight (the visualization software), we must transpose the matrix. Insert a new worksheet into your Excel file and name it "network". Select the entire matrix and copy it. Go to your new worksheet and click on the A1 cell in the upper left. Select "Paste special" and "Transpose". This will paste your data with the columns transposed to rows and vice versa. This is necessary because we want the transcription factors that are the "regulatORS" across the top and the "regulatEES" along the side. (A MATLAB sketch of this transpose step appears at the end of this entry.)
    • Delete the "p" from each of the gene names in the columns. Adjust the case of the labels to make them all upper case.
    • In cell A1, copy and paste the text "rows genes affected/cols genes controlling"
    • On the GRNsight website, just clicked on "File" and "Open" to upload the file and create the network
    • This helped to see the floating transcription factors and if any genes were connected together and not to the main network
  • Number of networks in the subfamilies for all the strains
    • dHAP4 (Grace)
      • Regardless of deletion strains: 13
      • Preserving deletion strains:19
    • wt (Natalie)
      • Regardless of deletion strains: 14
      • Preserving deletion strains:19
    • dGLN3 (Tessa)
      • Regardless of deletion strains: 16
      • Preserving deletion strains: TBD
    • dZAP1 (Kristen)
      • Regardless of deletion strains: 16
      • Preserving deletion strains: 17
    • dCIN5 (Kayla)
      • Regardless of deletion strains: TBD
      • Preserving deletion strains: TBD
  • Abstract for April Conference

Tessa, Kayla, and I will be presenting based off of findings from spring semester. Introduce the bio problem, introduce modeling, and discuss results from Tessa's and my work on Zap1 and Gln3. Introduce future work (what we're doing right now).
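
The Excel transpose step described above (for preparing the "network" worksheet) can also be done in MATLAB. This is only an illustrative sketch; the file and sheet names are made-up placeholders, and the actual reformatting was done by hand in Excel:

 % Illustrative only: transpose an adjacency matrix so the regulatORS run across the top.
 % File and sheet names are hypothetical placeholders.
 [~, ~, raw] = xlsread('dZAP1_network.xlsx', 'network_raw');    % labels + 0/1 matrix as a cell array
 flipped = raw';                                                % swap rows and columns
 flipped{1, 1} = 'rows genes affected/cols genes controlling';  % required A1 label
 xlswrite('dZAP1_network.xlsx', flipped, 'network');            % write the "network" worksheet
 % (trimming the trailing "p" from gene names and upper-casing them is still done by hand)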

October 28, 2015

  • Notes from the meeting
    • Attending the ASMB Conference and need to submit abstracts for each team
    • Keep working on the excel, keep making deletions from the sheet without the deletion strains
    • Start making input sheets as well
  • Abstracts
    • Talk about class experience and which way we're extending the model
    • Instead of individual instances, discuss how we're making families of networks
    • Introduce biological problem, introduce modeling process
    • Put on github wiki and work on it

November 4, 2015

  • Notes from the meeting
    • Start/continue making input sheets (protocol should be up to date online)
    • Next make expression sheets
      • Use truncated, normalized data as the input
      • ID in left column -> 15, 15, 15 -> 30,30,30 etc.
      • Start to populate by doing the biggest one first and doing the rest by deletion
    • Abstract
      • Write results and conclusions (from paper)
      • Future directions of similar things with different strains
      • Write names with middle initial and how we want it to be established for scientific career
  • Input sheets
    • Use truncated normalized data
  • Most protocol should be up for everything except degradation and production rates sheets
  • Using Microsoft Access:
    • Create an Excel spreadsheet with tabs containing expression for each strain (i.e. have one tab named "wt_expression", etc.).
    • Use rounded normalized data for each strain
    • Use only data from the 15, 30, and 60 timepoints --> use the LFC and not the average
    • Find either a space or the #VALUE!, and replace with nothing
    • Open Access, select External Data tab, then import Excel spreadsheet
    • One tab is already available upon opening the program. Importing the sheet of interest creates a new tab instead of renaming the original tab (named Table1).
    • To create a new table, go to the Create tab and select the Table button
    • Select the file you wish to upload. Then select the tab that you want to build a database for.
    • Choose primary key and select ID --> Systematic name of TFs; the systematic name is the primary key because it is unique to a specific transcription factor
    • To rename a table, right click on the tab. Select Design View

Name the new sheet whatever network you are creating it for (i.e. dHAP_35_network)

    • Populate the ID column with genes from the desired network (i.e. CIN5, GLN3).
    • Select Query Design and select the tables of interest (Expression table and Network table). Both windows will show up
    • Select ID from Network and Click, Drag, and Drop onto the ID of Expression Table
    • Right click on connection formed and select Join Properties
    • Make selection of Include ALL records from 'network'
    • Drag Network down to the first Field and drag all records of expression into the fields right of Network record (first field)
    • Select Make Table and then name your table. Hit RUN. (An equivalent MATLAB join is sketched after this list.)
    • Wang mRNA degradation rate extraction from half-lives
    • Use degradation and production rates from site provided
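
For reference, the Access join described above can also be expressed in MATLAB. This is a hypothetical sketch, not part of the protocol; the file, sheet, and table names are placeholders:

 % Hypothetical MATLAB equivalent of the Access query; file/sheet names are placeholders.
 expr    = readtable('expression.xlsx', 'Sheet', 'wt_expression');    % rounded normalized data
 network = readtable('expression.xlsx', 'Sheet', 'dHAP_35_network');  % ID column of network genes
 % Left outer join on ID, mirroring "Include ALL records from 'network'"
 joined  = outerjoin(network, expr, 'Keys', 'ID', 'Type', 'left', 'MergeKeys', true);
 writetable(joined, 'dHAP_35_input.xlsx', 'Sheet', 'wt_expression');  % paste into the input sheet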


Began generation of input sheets for network family testing. Microsoft Access was used to extract logFC data for t15, t30, and t60 for each strain (wt, dCIN5, dGLN3, dHAP4, dHMO1, dSWI4, and dZAP1) for the largest network of the dHAP4, deletion-strains-added family.

  • Took a while to learn how to use Access, sheets were created November 30
  • Optimization parameters set according to standard used over the summer (copied from Tessa's sheets)
  • Network_weights contains network with initial weight guesses of 0 (from summer)

November 30, 2015

Created Access sheets based off of the directions above

December 9, 2015

  • Formatted sheets identically to Tessa's and Kayla's, and all three received the same errors:
    Index exceeds matrix dimensions.


Error in readInputSheet (line 166)
log2FC(i).deletion = Deletion(i);


Error in GRNmodel (line 30)
GRNstruct = readInputSheet(GRNstruct);

  • Tessa and Kayla were able to finish up and get rid of their errors from formatting issues, like spelling, spaces, and addition of different parameters.
  • I copied their same format and checked the sheets vs theirs, and still got errors, although new ones:


Error using barrier (line 22)
Objective function is undefined at initial point. Fmincon cannot continue.

Error in fmincon (line 799)
[X,FVAL,EXITFLAG,OUTPUT,LAMBDA,GRAD,HESSIAN] = barrier(funfcn,X,A,B,Aeq,Beq,l,u,confcn,options.HessFcn, ...

Error in lse (line 85)
estimated_guesses = fmincon(@general_least_squares_error,estimated_guesses,[],[],[],[],lb,ub,[],options);

Error in GRNmodel (line 32)
GRNstruct = lse(GRNstruct);

  • Could not solve errors, cross-checked against others with no results. Sheet uploaded to GitHub. Next semester I will be doing one of two things:

A.) Rerunning, taking the averages of each time step, which may cause an issue with creation of standard deviations
B.) Copying into a new Excel spreadsheet entirely to ensure the first one wasn't corrupt.

January 15, 2016

  • Wants coders to get ahead of us in order to check and run for bugs
  • Start back from where we were and use information from over the summer in order to avoid same/similar bugs from trying to use different codes in different versions
  • Coders need to turn on/off L-curve added
  • Then fix bugs of can't handle missing data and only 5 strains
  • May need to change to replace current data with averages of all data for expediency's sake
    • No replicate, only one value for 15,30,etc
  • No more overwhelming of many open threads
  • Errors need to include: Branch (i.e. master, beta, etc.). Date/time of download. Name/file link to download. Bug, functionality, priority .5
    • Only tag data analysis if highly prioritized
  • Give us updates as we're working between meetings

  • From last semester...
    • Disc with two families (w/ and w/o deletion)
    • Alphabetize the genes in the excel sheet to make easier to read
      • Use sorting for all the sheets but the network (alphabetize one way, transpose, alphabetize back)
    • Not all families were complete in the parameters. Go with Bell's published data for production and degradation rates for doing it quickly.
    • Put in average for missing values
    • Highlight cells that were empty since MATLAB doesn't care about color. Don't leave equation there, paste values, to ensure that the equation isn't being calculated multiple different ways. Do largest network and work down.
  • For now, restricted to only 5 strains. For largest network, keep all there, but we will decide after which strains to use

January 20, 2016

  • Edited the largest input sheet with the deletion strains purposefully added, will be paring down row by row next and then with the largest sheet with deletion strains deleted out.
  • Showed Brandon the ropes and helped answer his questions. He showed us a way to highlight all the blank cells without doing it individually. Yay.
  • Also added the average values to blank cells, added the production and degradation rates, and edited the optimization sheet

January 29, 2016

  • Edited a few more input sheets but paused in order to wait for proper formatting techniques as changing with the code
  • LMU Symposium abstracts due the 12th
    • Update abstracts from the SD conference by the 5th in order to submit to LMU. Tessa will go "solo" on a verbal presentation
  • Three more changes (taken from GitHub issue #166)
    • We can now get rid of the row that says "Deletion". The code now can figure out the gene that is deleted from the strain information.
    • We need to change the word "Model" to "production_function" (cell A8)
    • We need to add a row beneath "production_function" called "L_curve". A zero value for this parameter means no L-curve analysis is done and a 1 value for this parameter means that an L-curve analysis will be run.
  • Start running the L-curve analysis
  • Do 4 runs this week: largest and smallest networks (+/- deletion strains)
  • Make 4 L-curves for each of these runs
    • The L-curve has LSE on the y axis and penalty on the x axis. We get these values from the output sheets.
    • Get these values and plot them against each other (a plotting sketch follows this list).
    • Tells us which alpha to choose and compares the largest alpha against the smallest.
  • Experienced crashes when making graphs (too many alphas) so changed optimization parameter sheet to make_graphs=0, but hopefully should not be an issue this time around
  • By now, coder progress should not be interfering with what we are doing. They will soon merge code into the master branch and have a new release. We need to still use the beta branch, but soon they will make all changes to beta and we will be using master.
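
A small MATLAB sketch of the plotting step, assuming the LSE and penalty values for each alpha have first been copied from the output sheets into a simple spreadsheet (the file name and column layout here are my own placeholders, not a GRNmap output format):

 % Hypothetical sketch: plot the L-curve from values collected off the output sheets.
 % 'lcurve_values.xlsx' is a placeholder: col 1 = penalty term, col 2 = LSE, col 3 = alpha.
 vals = xlsread('lcurve_values.xlsx');
 penalty = vals(:, 1);  lse_vals = vals(:, 2);  alphas = vals(:, 3);
 plot(penalty, lse_vals, 'o-');                       % loglog() is another common way to view it
 text(penalty, lse_vals, cellstr(num2str(alphas)), 'VerticalAlignment', 'bottom');  % label points
 xlabel('Penalty'); ylabel('LSE'); title('L-curve');  % look for the "elbow" when choosing alpha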

February 2, 2016

  • Went and checked runs after running for about 21 hours. The two smallest sheets had finished earlier, as collected by Tessa at about 10 AM (roughly 14 hours after start).
  • Largest network w/o deletions strains (ONLY_DNA_binding_dZAP1_28_genes_2_1_16) was still running
  • One crashed, but realized had accidentally run one that I had already run (not sure why it ran to completion on one computer, and crashed on the other).
  • Started largest network with deletion strains (with_deletions_ONLY_DNA_binding_dZAP1_34_genes_2_1_16), will check back tomorrow afternoon.
  • Wednesday, 2/3, biostats class will be in here. Hopefully all models are collected and completed by then in case people mess up the program.

February 3, 2016

Feb 5, 2016

  • Wait out the rest of the runs to see how they finish
  • Rerun the L-curves with fewer alphas but more condensed into the area of interest (i.e. more specific alphas in the "elbow" of the graph)
  • Maybe plot the W's and B's and everything (like in class) but wait until after we finish the reruns to examine those again
  • Brought up that the team should have a universal naming convention of the files.
    • Apparently one may have been started like: 21-genes_50-edges_Dahlquist-data_MM_estimation, but maybe substitute Dahlquist-data with the user's initials to tell apart whose worksheet is whose. Also include the gene family (i.e. zap1)
    • Everyone may have cut out different strains/formatted slightly differently, so maybe everyone include the deletion strains in the title as well

Feb 11, 2016

  • Going to create bar charts of b and p values similar to those done in the biomath modeling class
  • Label the graphs since better to do it now
  • Do in alphabetical order of regulatOR (i.e. CIN5->ZAP1)
  • From now on, use only released code and the latest released code.
    • Latest code, can use all 6 strains on same code, but need to turn off creation of graphs/L-curves in order to not crash Matlab
    • Set make_graphs=1 in order to overwrite the images
  • Large networks likely too large... Not the number of genes but the number of edges
    • With MSE and LSE, can compare to ANOVA to show that some genes are better than other genes
  • If there's time, would like to pick particular network, generate random networks, and run
  • Large networks were recreated to try to fix wonky L-curves
    • Even though the small network L-curves were decent, still pare down and rerun, as it is confusing how the small network L-curves were acceptable and the large ones were not.
    • Pare down, and after the L-curves are created, then do step-by-step pare downs

Feb 17, 2016

  • Worked on the L-curves of the "larger" input sheets
  • Dahlquist sent me updated inputs for the 33 and 25 input sheets.
  • Ran 33 genes with multiple errors; the Dahlquist lab computers got to 10 alphas and crashed, but I was finally able to get one going in SEA 120.
  • Plotted 10 alphas into the L-curve found on the output page

Feb 26, 2016

  • L-curves turning into s-curves as the threshold b sheet requires a title in order to work
  • The new version of beta automatically does the L-curve if make_graphs is set to 1
    • should have no need to continue individually plotting the l-curves later but Tessa's work on it has shown otherwise.
    • Tessa's plots are still showing S-curves
    • Have not continued making L-curves yet, as the most recently formatted L-curves had the final few alphas stacked on each other, so they looked more like downward slopes than "L"s
    • See last two links in DZAP1 L-curve analysis KH
  • Grace will be trying to repeat alpha = .002 to try to replicate the issue in the L-curve code
  • Think we should just reuse alpha .002 to try as it is generating the best information in the overall codes
  • Would it be fair to repeat different output runs for different alphas? No, there is a natural magnitude difference. Need to find a consistent alpha term, so we will continue using .002
  • May have indexing problem with threshold and production rates due to Grace's bar charts
  • Running models were killed as a bug is evident. Fitzpatrick wrote code last semester and we just stuck it in... bugs are to be expected. Indexing errors between scripts


  • For the future: GitHub issue 187
    • Only do families with deletion strains included
    • Small, large, small, large networks in case we run out of time
    • Need to do beta code since has bug fix for allowing graphs for six strains
    • Collect output sheets, theoretical set of graphs, parameter comparison, MSE. Single out individual genes to see which are being modeled better or worse based on connections, ANOVA, etc.

March 11, 2016

  • have been working on poster for LMU's Undergraduate Research Symposium
  • ran production runs with graphs produced for each gene in the 33, 23, and 15 gene networks. Decided to do three network sizes because I initially thought I would be comparing with Tessa's; once I realized the poster was only Zap1, there was not time to run all 5 networks as originally planned (33, 29, 24, 19, 15)
  • On the 24 gene network, noticed HSF1 floating around with no connections. Tried to delete it from the network and network weights, but was unable to. In 33 and 29, it only regulates RLM1; it is floating freely in 24 and is deleted in 19 and 15. Will return to this later in order to permanently delete it, but for now will leave it in the network due to time constraints.

September 12, 2016

  • Discussed weekly updates on github and commenting the completed tasks of that week on GitHub
  • Reviewed Code of Conduct in order to place it on GRNsight page
  • This week our goals are further research into graph theory, perhaps using MATLAB to implement these, and work on TRACE documentation
  • GRNsight is working on betweenness centrality and shortest path
    • Look into systems bio package in MATLAB to see if there are any shortcuts for those
    • Start testing those, will be used as independent check against GRNsight
    • Play with what we would get if we substitute the weights
  • Do writeup on the literature we've been looking into and then move on to MATLAB coding
  • In the future, get some graphs from degree distributions from random networks
  • Start on TRACE Documentation so we have easier documentation when GRNmap is published

September 13, 2016

  • Made a powerpoint with quick summations of the articles we had read. Sent slides to Maggie to consolidate, will update with slides when completed
  • Explored possibility of systems biology toolbox for MATLAB online and in Dahlquist Lab computers
  • The Bioinformatics Toolbox was available, as well as the Computational Biology Apps Molecule Viewer, NGS Browser, Phylogenetic Tree, Sequence Alignment, and Sequence Viewer
  • Googled what makes up the Bioinformatics Toolbox; it seems like we will be able to analyze shortest path, betweenness centrality, and degree distribution. Next week we will explore how. (A degree-distribution sketch follows below.)
  • Found article that created Systems Biology and Evolution toolbox: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3767578/
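
As a first pass at the degree-distribution idea noted above, a hypothetical MATLAB sketch; the file and sheet names are placeholders, and the orientation (regulators in columns, targets in rows) follows the GRNsight "network" worksheet convention:

 % Hypothetical sketch: degree distribution from an adjacency matrix on a "network" worksheet.
 [~, ~, raw] = xlsread('dZAP1_network.xlsx', 'network');  % placeholder file/sheet names
 A = cell2mat(raw(2:end, 2:end)) ~= 0;                    % drop label row/column, keep edge presence
 outDegree = sum(A, 1);                                   % column sums: edges leaving each regulator
 inDegree  = sum(A, 2);                                    % row sums: edges arriving at each target
 histogram([inDegree(:); outDegree(:)]);                  % overall degree distribution
 xlabel('Degree'); ylabel('Number of genes');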

September 19, 2016

  • Notes from Meeting:
    • Narrow search space for pre-built function of betweenness centrality over others. If betweenness centrality is found, others will likely be found.
  • First four topics of TRACE documentation from: http://www.openwetware.org/wiki/Dahlquist:TRACE_Documentation
  • Double check the small networks from before, run the 5 real networks, then start generating random networks and collect data from this
    • Do control experiments to ensure there aren't any major bugs before investing time into creating the data
  • Go to Nicole's talk next week on Gephi, the graph layout software.
  • Make sure to comment on the GitHub issues to show our progress



September 20, 2016

  • Explored the possibilities of the Bioinformatics toolbox. Split into high-throughput Sequencing, Microarray Analysis, Sequence Analysis, Structural Analysis, Mass Spectrometry and Bioanalytics.
  • Within Network Analysis and Visualization, some pieces of interest below:
    • "graphallshortestpaths(G)", "graphallshortestpaths(G,...'Weights', WeightsValue, ...)", and "graphallshortestpaths(G, ...'Weights', WeightsValue, ...)
    • For more details and a worked example, see: http://www.mathworks.com/help/bioinfo/ref/graphallshortestpaths.html
      • Using only G: G is an N-by-N sparse matrix that represents a graph. Nonzero entries in matrix G represent the weights of the edges
      • Using the Directed property indicates whether the graph is directed or undirected. Can enter "false" for an undirected graph
      • Using the Weights property lets you specify custom weights for the edges. WeightsValue is a column vector that specifies custom weights for the edges in matrix G.
    • "graphshortestpath(..., 'Weights', DirectedValue, ...) (DirectedValue can be replaced with MethodValue or WeightsValue). See full info and examples here: http://www.mathworks.com/help/bioinfo/ref/graphshortestpath.html
    • [dist, path, pred] = graphshortestpath(G, S) determines the single-source shortest paths from node S to all other nodes in the graph represented by matrix G. Input G is an N-by-N sparse matrix that represents a graph. dist are the N distances from the source to every node.
      • DirectedValue indicates whether the graph is directed or undirected
      • MethodValue is a character vector that specifies the algorithm used to find the shortest path. Choices are:
        • Bellman-Ford: Assumes weights of the edges to be nonzero entries in sparse matrix G. Time complexity is O(N*E), where N and E are the number of nodes and edges
        • BFS: Breadth-first search. Assumes all weights to be equal, and nonzero entries in sparse matrix G to represent edges. Time complexity is O(N+E), where N and E are the number of nodes and edges respectively
        • Acyclic: Assumes G to be a directed acyclic graph and that weights of the edges are nonzero entries in sparse matrix G. Time complexity is O(N+E) where N and E are the number of nodes and edges respectively
        • Dijkstra: Default algorithm. Assumes weights of the edges to be positive values in sparse matrix G. Time complexity is O(log(N)*E), where N and E are the number of nodes and edges respectively.
    • graphconncomp and graphmaxflow may also be worth exploring in the Bioinformatics Toolbox
    • GRNmap is already calculating shortest path, so we may be able to use this as a control to check against any errors. We may also be able to calculate betweenness centrality after finding the shortest paths; look into the math on this. (A toy example of these calls, plus a crude betweenness sketch, follows this list.)
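
A toy example of the calls above, plus a crude betweenness estimate. The 4-node graph and its weights are made up purely for illustration, and the betweenness loop is my own rough sketch (it credits only the single path returned by graphshortestpath for each pair; Brandes' algorithm, which counts all shortest paths, would be needed for exact values):

 % Toy directed graph, illustration only: edges 1->2, 1->3, 2->4, 3->4 with weights 1, 2, 1, 1.
 G = sparse([1 1 2 3], [2 3 4 4], [1 2 1 1], 4, 4);
 [dist, nodePath, pred] = graphshortestpath(G, 1, 4)  % one source-target pair (Dijkstra by default)
 allDist = graphallshortestpaths(G)                   % N-by-N matrix of all pairwise distances
 % e.g. graphshortestpath(G, 1, 4, 'Method', 'Bellman-Ford') switches algorithms.
 % Crude betweenness sketch (my own, not a toolbox function): credit the intermediate
 % nodes of the ONE shortest path returned for each ordered pair (s, t).
 N = size(G, 1);
 bc = zeros(N, 1);
 for s = 1:N
     for t = 1:N
         if s == t, continue, end
         [d, pth] = graphshortestpath(G, s, t);
         if isinf(d), continue, end                   % skip unreachable pairs
         bc(pth(2:end-1)) = bc(pth(2:end-1)) + 1;     % intermediate nodes only
     end
 end
 bc                                                   % unnormalized betweenness estimates per node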

Searched through the article above to explore if Systems Biology and Evolution toolbox could be right for the GRNsight team to download and use for further data analysis. Main functions listed in the article:

  • Betweenness centrality, clustering coefficient, and closeness centrality, bridging centrality.
  • Statistics include local average connectivity, core number, graph mean distance, graph efficiency, etc.
  • Random networks can be generated using Erdos-Renyi, small world, and ring lattice algorithms
  • Can also simulate evolution of a network via node duplication, node loss, and edge rewiring

The paper notes that it is similar to toolboxes such as the "Functional Genomics Assistant" and the "MathWorks Bioinformatics Toolbox." It notes that MBT has basic graph theory algorithms, but no functions for statistical analysis. The paper goes into detail about computer memory and graphs with 10,000-80,000 nodes. They were able to create a random network with 10,000 nodes and 450,000 edges using 2 GB of memory, and all of the core functions finished within 10 minutes.

I would give a light recommendation to download this toolbox and test it out. It seems like the people who were developing it anticipated that systems biology needs both statistical and network analysis. The only issue foreseen is that, when exploring the GitHub page, there's a link to a manual on how to use the program, and it seems like a Texas A&M ID is necessary for login. If we decide to proceed with using this program, I can try to contact the authors to get this user's manual.