User talk:Anugraha Raman
Check out [our page] for a summary of our final thoughts and our final presentation!
Over the course of this project I've worked on finding primary literature to help identify SNPs, reading primary literature to understand what potential models could look like, meeting with the bio group to discuss future directions, helping create data sets, creating a small tool to help sift through primary literature for SNPs relevant to various characteristics, and documenting what we've done.
It's been a really fun interdisciplinary project getting to know really fun people!
Thoughts between Nov 9 and Final Days
We met to discuss who to contact about getting more real data sets, and then we met again to discuss how to create the data set for the modeling group to use. See explanation on the biology group page.
Increasing Target Audience
By altering the trait-o-matic user interface we can expand the target audience. One idea was to let users search by traits in drop-down menus; by working backwards from these phenotypes you could determine different potential genotypes, each with a different probability. This way you could potentially identify someone in a forensic setting, or piece together an image of someone for use in a dating service (similar to SNP cupid, except you're not making predictions about the children).
A 3D visualization in which the user could click on a part of the person and examine traits in that particular area would let users with less genetics background use this tool easily. It would open it up to a younger audience.
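Working backwards from a phenotype to candidate genotypes can be sketched with Bayes' rule. This is only a minimal illustration, not trait-o-matic code: the function name `genotype_posterior`, the single-locus model, and all the numbers are assumptions made up for the example.

```python
def genotype_posterior(prior, likelihood, phenotype):
    """Return P(genotype | phenotype) using Bayes' rule.

    prior:      {genotype: P(genotype)}
    likelihood: {genotype: {phenotype: P(phenotype | genotype)}}
    """
    # Joint probability of each genotype together with the observed phenotype
    joint = {g: prior[g] * likelihood[g].get(phenotype, 0.0) for g in prior}
    total = sum(joint.values())
    if total == 0:
        raise ValueError("phenotype has zero probability under this model")
    return {g: p / total for g, p in joint.items()}


# Hypothetical numbers for one dominant/recessive locus (not real data)
prior = {"AA": 0.25, "Aa": 0.50, "aa": 0.25}
likelihood = {
    "AA": {"dominant trait": 1.0},
    "Aa": {"dominant trait": 1.0},
    "aa": {"recessive trait": 1.0},
}

posterior = genotype_posterior(prior, likelihood, "dominant trait")
# Rank candidate genotypes from most to least probable
ranked = sorted(posterior.items(), key=lambda item: item[1], reverse=True)
print(ranked)
```

With real data, the priors would come from population allele frequencies and the likelihoods from the literature; here they are placeholders.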
- Identify all traits studied by these GWAS studies and group them together
- List rsids linked to information from dbSNP about these SNPs (including, in particular, population diversity data, some of which is linked to HapMap info)
- Link rsids to corresponding primary literature
Here are some screenshots: Basic listing of traits found in study:
When asked previously for wish-list ideas, something that struck me was understanding protein-protein interactions. When reading about blood type, I found it interesting that an individual can sometimes be type O when neither of the parents was a carrier. This is because inheritance of the H antigen, found on the surface of RBCs, plays an important role in determining blood type. Since the H antigen is a precursor to the type A and type B antigens, an individual who is homozygous recessive at the H locus can still have type O blood.
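The H antigen case is a compact example of one locus masking another. A minimal sketch, assuming a simplified two-locus model (the function name `blood_type` and the tuple encoding of genotypes are made up for illustration):

```python
def blood_type(abo_genotype, h_genotype):
    """Predict the apparent ABO blood type from two loci.

    abo_genotype: two alleles drawn from 'A', 'B', 'O', e.g. ('A', 'O')
    h_genotype:   two alleles drawn from 'H', 'h', e.g. ('H', 'h')
    """
    # Without a functional H allele there is no H antigen, so neither the
    # A nor the B antigen can be built and the person appears to be type O.
    if "H" not in h_genotype:
        return "O"
    alleles = set(abo_genotype)
    if "A" in alleles and "B" in alleles:
        return "AB"
    if "A" in alleles:
        return "A"
    if "B" in alleles:
        return "B"
    return "O"


print(blood_type(("A", "B"), ("H", "h")))  # H antigen present
print(blood_type(("A", "B"), ("h", "h")))  # H antigen absent
```

The second call shows the masking effect: the ABO genotype is the same in both calls, but the phenotype differs because of the H locus.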
November 9, 2009
We met earlier today to discuss the ideal data set. This was an interesting modeling paper about a two-locus model: [http://www.biomedcentral.com/1471-2156/9/17]
- It discusses the four common models found:
It seems that the ideal data set would be two alleles --> 1 trait
Our first example just has to be biologically inspired, not necessarily related to humans. Jackie suggested a data set currently known for yeast in which knockouts were done to figure out the epistatic interactions. The idea is that we could look for homologues of these genes in humans, just to show a proof of principle.
Looking at height: break up the body into three parts, using the average as a baseline, so we can have additive properties. For example, a torso two inches below the mean and legs three inches above the mean --> 1 inch taller than the mean.
Ideally we want to see all-or-nothing epistasis, but we need to start with the simplest example, i.e. total dominance.
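The three kinds of model mentioned here (additive, total dominance, all-or-nothing epistasis) can be written down in a few lines. This is a sketch only; the function names, the segment names, and the genotype encoding (uppercase letter = dominant allele) are assumptions for illustration.

```python
def additive_height_offset(segment_offsets):
    """Additive model: total offset from the mean is the sum of the parts."""
    return sum(segment_offsets.values())


def dominant_phenotype(genotype):
    """Total dominance at one locus: one dominant (uppercase) allele is enough."""
    return "dominant" if any(allele.isupper() for allele in genotype) else "recessive"


def all_or_nothing(genotype1, genotype2):
    """All-or-nothing epistasis: the trait shows only if BOTH loci are dominant."""
    both = (dominant_phenotype(genotype1) == "dominant"
            and dominant_phenotype(genotype2) == "dominant")
    return "trait" if both else "no trait"


# Height example from the notes: torso 2 inches below mean, legs 3 inches above.
# The third segment name is a placeholder with no offset.
print(additive_height_offset({"torso": -2, "legs": 3, "head_and_neck": 0}))
print(dominant_phenotype("Aa"))
print(all_or_nothing("Aa", "bb"))
```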
- Looking at pharmacogenetics, it seems that we would have a very simple project (one SNP --> one phenotype)
- I finally realized that the reason for this simplicity is that we have yet to go through metabolic pathways to find how multiple SNPs work together to yield a certain response
- This data would be much more difficult to obtain from pgp because we need to find
- enough people with the same response to a certain product
- ability to study metabolic pathways in these people beyond their genome
Questions that I had:
- How exactly would this learning work? Can we input an extremely simple example with just two genes leading to one trait, and then feed in multiple genes?
- For learning, do you have to feed in every single example separately? What exactly is the least amount of information needed?
November 3, 2009
- When first hunting for literature on polygenic traits, I found many articles about the steps to take towards identifying and characterizing polygenic traits. As it turns out, identifying QTLs is very important for identifying polygenic traits.
- However, in our project, we are primarily interested in figuring out which particular polygenic trait an individual possesses. Then we can move on to cooler applications of finding novel polygenic traits.
- I found this cool article, Pharmacogenomics: Translating Functional Genomics into Rational Therapeutics, that would really help out with the pharmacogenetics project :D
- If we look at Table 1, we find a list of polymorphisms of genes important to drug metabolism, and how they would affect different phenotypes. We could immediately start scanning the genomes entered as input for these specific mutations, and thus readily report a phenotype
- Perhaps, to make our searching method more efficient, we could first look for genes involved in the largest number of pathways, such as CYP3A4, look for mutations in those, and then work our way from most common to least common. It is nice that in this picture we can start looking at genes in terms of frequency
- Another interesting find in this article was that pharmacogenetic polymorphisms differ in frequency among ethnic and racial groups. So now we would know to include these as a primary criterion when we choose to look at external factors
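The scanning idea in the list above might look roughly like this. It is a sketch under stated assumptions: apart from the gene name CYP3A4, every identifier, allele, pathway count, and phenotype label is a placeholder invented for illustration, not data from the article or from any database.

```python
# Placeholder table of known polymorphisms (NOT real annotations)
KNOWN_VARIANTS = [
    {"gene": "GENE_B", "variant_id": "variant_2", "risk_allele": "T",
     "phenotype": "hypothetical phenotype B"},
    {"gene": "CYP3A4", "variant_id": "variant_1", "risk_allele": "G",
     "phenotype": "hypothetical altered drug metabolism"},
]

# Placeholder counts of pathways each gene takes part in
PATHWAY_COUNTS = {"CYP3A4": 5, "GENE_B": 2}


def scan_genome(genotypes, known_variants, pathway_counts):
    """Check genes in order of pathway count (most first) and report hits.

    genotypes: {variant_id: two-letter genotype string such as "AG"}
    """
    ordered = sorted(known_variants,
                     key=lambda v: pathway_counts.get(v["gene"], 0),
                     reverse=True)
    hits = []
    for variant in ordered:
        alleles = genotypes.get(variant["variant_id"], "")
        if variant["risk_allele"] in alleles:
            hits.append((variant["gene"], variant["variant_id"], variant["phenotype"]))
    return hits


person = {"variant_1": "AG", "variant_2": "CC"}
print(scan_genome(person, KNOWN_VARIANTS, PATHWAY_COUNTS))
```

Sorting by pathway count only changes the order in which genes are checked; population frequency by ethnic group could be added as another sort key.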
October 29, 2009
Thoughts on Tuesday's discussion.
The first step for either project would involve developing a method to analyze polygenic traits.
This would involve
a) developing a method
b) testing the method on pre-existing polygenic traits
c) verifying that this method works accurately
This can be introduced directly into trait-o-matic, but it can also be the basis for SNP cupid or the metabolism project.
I would feel comfortable helping out with some programming aspects in this first step. Overall, though, I would feel more comfortable thinking about the questions that need to be answered, how the results would be presented to the user, what other factors to take into account when thinking about either polygenic traits, or genetic inheritance, or metabolic pathways, and what is and isn't biologically realistic.
I think a second step, after identifying polygenic traits, would be to work the effects of epistasis into our predictions.
With SNP cupid, I envision the different possible results for the child's traits for different genes being ranked, with probabilities attached to them.
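For a single gene with simple Mendelian inheritance, ranking the child's possible genotypes by probability is a Punnett square. A minimal sketch (the function name and the two-letter genotype encoding are assumptions for illustration; real traits would need more than one locus):

```python
from collections import Counter
from itertools import product


def child_genotype_probabilities(parent1, parent2):
    """Return [(genotype, probability), ...] sorted from most to least likely.

    Each parent is a two-letter genotype such as "Aa"; each allele is passed
    on with probability 1/2.
    """
    # Sort the allele pair so that "aA" and "Aa" count as the same genotype
    counts = Counter("".join(sorted(a + b)) for a, b in product(parent1, parent2))
    total = sum(counts.values())
    return sorted(((g, n / total) for g, n in counts.items()),
                  key=lambda item: item[1], reverse=True)


for genotype, prob in child_genotype_probabilities("Aa", "Aa"):
    print(genotype, prob)
```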
October 27, 2009
Thoughts on last week's discussion
After last week's discussion it seems that SNP cupid would be a wonderful class project. It is broad enough that it can incorporate many different sub-projects into it, and by creating a base many tools can be added on later. The whole polygenic trait subproject could be incorporated into SNP cupid results to increase interest. Furthermore, SNP cupid allows work to be divided up into different areas that are not all very programming intensive. Much thought would need to go into what kinds of questions people would need answers to, and the best and most reliable places to go to in order to obtain this information before feeding it to a program.
October 13/20 2009
Step 1: We would first identify a phenotype of interest in our PGP population. Say for example, people who do not gain weight from high fat diets.
Step 2: We would use a tool like BLAST to look at sequence alignment and find portions that were the same. Then we would use OMIM to look for known genes, and weed out similar sequences that belonged to known genes. (At this point we could also use HGMD to look for known mutations, and Gene Tests to see if any clinical testing had been done on our sequence.) Additionally, in the weeding-out step we could go to SNPedia and Gene Tests to find common SNPs and mutations of the known genes to further our weeding-out process.
Step 3: We would look at a larger genome dataset to confirm our new phenotype-genotype finding, and add to the Gene tests database.
Checking the validity of our tool:
- Existing information on Wikipedia*
When looking at the genetics wiki page under gene regulation, I found a small paragraph on epigenetic factors influencing DNA and DNA inheritance. I thought it would be interesting to add that changes such as methylation or acetylation occur post-translationally and silence genes from being expressed. It is also interesting to note that by removing methylation patterns one can more easily reprogram a cell, which would be useful in therapeutics development. Furthermore, by looking for hyper/hypomethylated promoter CpG islands, we may be able to identify potential cancerous tumor sites more easily.
October 8 2009
Our idea was to use OMIM, Gene Tests, and SNPedia in order to find new linkages between genotype sequences and their corresponding phenotypes. We also wanted to attempt to find the minimal number of genotypic sequences that would correspond to a complex phenotypic trait. To demonstrate our system, we would first show our program working with known genotype-phenotype linkages. See the Project Talk Page under projects for a more detailed summary.
September 29 Due Assignment
The premise of the major idea is that the environment and our lifestyle play major roles in disease onset probability due to their effect on our Epigenetic patterns. The Homo Sapiens toolkit should include the means for our species to test itself for specific diseases due to these epigenetic factors.
PBS Nova aired a show titled “Tale of Two Mice” that focused on Epigenetics. It featured genetically identical mice having the same sex and age, found to be phenotypically distinct due to a methyl-rich diet. Specific DNA regions becoming hyper-methylated can lead to onset of cancer. Amongst humans, one twin getting cancer and the other not, can be explained by diet and environmental factors that resulted in Methylation and eventually cancer.
Today the only known natural modification of human DNA is DNA methylation. This methylation affects the cytosine base (C) when it is followed by a guanine (G), i.e. only at CpG sites. When promoter CpG islands become methylated, the associated gene becomes permanently silenced.
Wet-lab methylation "profiling" studies have shown a characteristic set of aberrantly methylated genes with varying CpG island methylation patterns in specific cancer tumors. One of the challenges faced by the lab techniques is degradation of 90% of incubated DNA. The conditions necessary for complete conversion, such as long incubation times, elevated temperature, and high bisulphite concentration, can lead to this degradation.
An immediate small step towards the Homo Sapiens 2.0 goal of self-testing for diseases based on epigenetic factors is trying to predict algorithmically whether a specific gene is methylation-prone or methylation-resistant.
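One ingredient such a prediction might use is whether a promoter region looks like a CpG island at all. The sketch below computes GC content and the observed/expected CpG ratio and applies commonly used thresholds (at least 200 bp, GC content above 50%, observed/expected above 0.6). It is an illustrative feature extractor only: CpG-island status is not by itself a prediction of methylation proneness, and the function names are made up for this example.

```python
def cpg_stats(seq):
    """Return (GC fraction, observed/expected CpG ratio) for a DNA string."""
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        return 0.0, 0.0
    c_count = seq.count("C")
    g_count = seq.count("G")
    cpg_count = seq.count("CG")           # observed CpG dinucleotides
    gc_fraction = (c_count + g_count) / n
    expected = c_count * g_count / n      # expected CpGs if C and G were independent
    obs_exp = cpg_count / expected if expected else 0.0
    return gc_fraction, obs_exp


def looks_like_cpg_island(seq, min_len=200, min_gc=0.5, min_obs_exp=0.6):
    """Apply the length, GC-content, and observed/expected thresholds."""
    gc_fraction, obs_exp = cpg_stats(seq)
    return len(seq) >= min_len and gc_fraction > min_gc and obs_exp > min_obs_exp


print(looks_like_cpg_island("CG" * 100))
print(looks_like_cpg_island("AT" * 100))
```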
- PMID 17284773:
example of environment and lifestyle linkage to epigenetics
Small Step 1: Predict algorithmically whether a specific gene is methylation-prone or methylation-resistant
- PMID 11782440
- PMID 16837523
- PMID 14519846
Medium-sized Step 2: "Count" and curate methylation levels for specific genes in normal and diseased tissue
- PMID 17932060
- PMID 11106248:
Lung cancer example: CDKN2A gene showing normal methylation of ~0% and diseased methylation of ~40%
- PMID 12912953:
Another lung cancer example: DAPK1 gene showing normal methylation of ~4% and diseased methylation of ~40%
Large-sized Step 3: Predict methylation levels based on variables (TBD)
Much larger-sized Step 4: Create an in vivo logic-based "counter" that will light up when it detects biomarkers within the range of disease, based on methylation levels
- PMID 19478183
- PMID 17515909
Final large-sized Step 5: Make the Step 4 setup into a kit and let Homo sapiens test themselves
- PMID 19458720
- PMID 19424153
Week 3 Assignment
Problems 1, 2, and 3 were done within one Python script and problem 4 in a separate script. The Biopython Tutorial helped me understand how to proceed with the functions that had to be written for this assignment!
The fourth problem was very interesting! I had a blast working on it. Hopefully it is done correctly :)
The length of the given sequence is 1020 base pairs. For every 100 base pairs the script randomly tries to mutate a to t/g/c, t to a/g/c, etc., with a probability of 0.01. I then made the script run this 100 times and called it a simulation. The script did 800 such simulations and aggregated the results in a plot.
After 800 simulations of 100 sets of single-base-pair evolutionary mutations, as described in assignment 3b problem 4, the plot shows about four to five (4.65) premature terminations for every 1020 mutations.
- Reading in the Input Sequence
I created a simple FASTA type text file to read in the p53seg sequence that was provided.
from Bio import SeqIO

input_file = open('p53seg.txt', 'r')
for cur_record in SeqIO.parse(input_file, "fasta"):
    my_seq = cur_record.seq
- Problem 1 (GC Content)
Computing the GC content percentage required a float in the denominator; otherwise Python 2's integer division truncates the ratio to 0.
# GC count done explicitly, i.e. problem #1 in this assignment set
# Get the number of Guanines in the sequence
g_count = cur_record.seq.count('g')
# Get the number of Cytosines in the sequence
c_count = cur_record.seq.count('c')
# Get the length of the sequence
seq_count = len(cur_record)
# use float in denominator to get the decimal answer for GC%
gc_percent = ((g_count + c_count) / float(seq_count)) * 100
print('GC % is: ' + str(gc_percent))
- Problem 2 (Reverse Complement)
The reverse complement was obtained by simply using the Seq.reverse_complement() function.
# get the reversed complement of the sequence, i.e. problem #2 in this assignment set
rev_seq = my_seq.reverse_complement()
print('DNA reverse complement of p53seg is: ')
# output_file is assumed to have been opened for writing earlier in the script
output_file.write('\nDNA reverse complement of p53seg is: ')
print(rev_seq)
output_file.write(str(rev_seq))
- Problem 3 (Frame Translation)
- Standard translation from Biophys101_assign3b.doc
- +2 Frame
- -1 Frame
- Method 2 ===>Using the Standard table defined
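The translation code itself did not survive on this page. As a stand-in, here is a minimal sketch of translating a chosen reading frame (such as +2 or -1) with the standard genetic code, using only the Python standard library rather than Biopython. The function names are made up for illustration and are not the ones in my original script.

```python
# Standard genetic code, built in T, C, A, G order
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


def translate(seq):
    """Translate complete codons from the start of seq; '*' marks a stop."""
    seq = seq.upper()
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))


def translate_frame(seq, frame):
    """Translate frame +1, +2, +3 (forward) or -1, -2, -3 (reverse strand)."""
    strand = seq if frame > 0 else reverse_complement(seq)
    return translate(strand[abs(frame) - 1:])


print(translate_frame("aatggcctaa", 2))
print(translate_frame("ggccat", -1))
```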
- Problem 4 (Single bp mutation simulation to detect early terminations)
- Functions defined in this script file are as follows:
- writeheader(myfile) : Writes a specific HTML header using the myfile handle
- writefooter(myfile) : Writes a specific HTML footer using the myfile handle
- writerunsummary(myfile) : Writes a specific set of summary information using the myfile handle
- mutatesinglebp(seq, random_Seed, forevery, prob) : Mutates a single base pair for forevery location range using a probability of prob; returns mutated sequence
- writeAA(myfile, seq,stop_locs) : Writes the amino acids using the myfile handle
- findstops(seq) : Finds the stop locations in the given DNA sequence
- mutate_singlebp mutates a single base pair for forevery location range using a probability of prob
- end of function mutate_singlebp
- end of function findstops
- I have to create a mutable sequence to use the mutate_singlebp function
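Since the function bodies are not shown on this page, here is a rough stand-in for the core of the simulation using only the standard library. It is a sketch under assumptions, not my original script: it simplifies the "for every 100 base pairs" rule to an independent per-base mutation probability, counts a premature termination as any stop codon in frame +1 that was not in the starting sequence, and uses the hypothetical names `mutate_single_bp`, `find_stops`, and `count_new_stops`.

```python
import random

# Standard genetic code, built in T, C, A, G order
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def mutate_single_bp(seq, rng, prob):
    """Mutate each base independently with probability prob to a different base."""
    mutated = []
    for base in seq.upper():
        if rng.random() < prob:
            base = rng.choice([b for b in "ACGT" if b != base])
        mutated.append(base)
    return "".join(mutated)


def find_stops(seq):
    """Return the codon indices (frame +1) at which a stop codon occurs."""
    seq = seq.upper()
    return [i // 3 for i in range(0, len(seq) - 2, 3)
            if CODON_TABLE[seq[i:i + 3]] == "*"]


def count_new_stops(seq, generations, prob, seed):
    """Apply repeated rounds of mutation and count stops absent from the start."""
    rng = random.Random(seed)
    original_stops = set(find_stops(seq))
    current = seq.upper()
    for _ in range(generations):
        current = mutate_single_bp(current, rng, prob)
    return len(set(find_stops(current)) - original_stops)


# Toy sequence for illustration (not the p53 segment)
toy = "ATGGCCAAGTGGCTGTAA" * 5
print(count_new_stops(toy, generations=100, prob=0.01, seed=1))
```

Running `count_new_stops` for many seeds and averaging would give an aggregate figure comparable in spirit to the 800-simulation summary.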
Week 2 Assignment
With the first graph (exponential), larger values of k made the graph increase much faster. In fact, as you can see when you compare the "red triangles" to the "red circles", the exponential curve associated with the red triangles (k=4.03) dwarfs the exponential curve associated with the red circles (k=.9) so much that the red circles curve appears to be almost linear in the top graph.
In the second graph (logistic), with larger values of k, the graph not only grows faster, but also starts leveling off sooner. If we were relating this to population growth, larger k values result in the population reaching its carrying capacity sooner.
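Both observations can be checked numerically. The sketch below assumes the continuous closed-form versions of the two models (the assignment may have used a discrete update instead), and the starting population and carrying capacity are made-up values.

```python
import math


def exponential(n0, k, t):
    """Exponential growth: N(t) = N0 * exp(k * t)."""
    return n0 * math.exp(k * t)


def logistic(n0, k, t, capacity):
    """Logistic growth: N(t) = K / (1 + ((K - N0) / N0) * exp(-k * t))."""
    return capacity / (1 + ((capacity - n0) / n0) * math.exp(-k * t))


def time_to_fraction(n0, k, capacity, fraction=0.9):
    """Time at which the logistic curve reaches the given fraction of capacity."""
    a = (capacity - n0) / n0
    return math.log(a * fraction / (1 - fraction)) / k


N0, CAPACITY = 10.0, 1000.0  # hypothetical values
for k in (0.9, 4.03):
    print(k, exponential(N0, k, 2.0), time_to_fraction(N0, k, CAPACITY))
```

The larger k gives both a much larger exponential value at the same time point and a shorter time to reach 90% of the carrying capacity.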