User talk:Anna Turetsky
For the final project, I started out on the biology team, with the goal of searching the literature for concrete examples of polygenic trait data to use for modeling. After finding some diabetes literature (below), I met with that group to discuss ideas and help decide on a direction for finding examples. However, I soon realized that my interests were more in line with those of the infrastructure group and, going beyond infrastructure, that I really wanted to create a new visual interface for users of trait-o-matic (I met with the infrastructure group to discuss what they were working on and showed them my ideas). Currently, the SNPs are in one giant list, and there should be a more user-friendly way of looking through them. I liked the idea of organizing SNPs by the type of trait they affect (loosely based on the MeSH organization) and displaying this information in a visual online tool. In addition, many of the SNP descriptions are not understandable to the average person, who really just wants to know the absolute risk of a trait or disease (e.g. 20% risk of colon cancer) or the relative risk compared to the population (e.g. 1.4 times as likely to get colon cancer). At the same time, anyone wishing to delve further into how those risks were calculated should be able to do so.
The categories given by MeSH can be found here.
These are the categories I was thinking about, which include some not found in MeSH (such as aesthetic traits and syndromes):
-Whole body / aesthetic (Obesity, Height, Pigmentation / dermatological)
While I did not get a chance to create the actual tools to accomplish these goals, I created animations in Flash to show how they would look and what they could do. The first is a tool where users can scroll over various parts of the human body to see a trait-o-matic participant's SNPs based on the categorizations above. It can be found here: Body-SNP Applet animation (I didn't know how to embed the file). The second is an easy-to-use tool for viewing phenotypes at whatever level of complexity the user wants. The user can click on the category of trait they want to view, which will show a list of traits the trait-o-matic participant has in that category. If a trait has a little black box next to it, it can be clicked on to show the data upon which the conclusion is based. This data can be in the form of primary literature, models made by our class, or eventually models uploaded onto trait-o-matic. The animation showing this tool can be found here: Trait-Model Applet animation.
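A minimal Python sketch of the data that could sit behind the two tools described above. Everything in it is hypothetical: the trait records, the "Digestive" category, the field names (`baseline_risk`, `relative_risk`, `sources`) and all the numbers are placeholders, not real trait-o-matic output. It also makes the simplifying assumption that absolute risk is roughly the population risk times the relative risk.

```python
from collections import defaultdict

# Hypothetical trait records; every name and number here is a placeholder.
TRAITS = [
    {"trait": "Obesity", "category": "Whole body / aesthetic",
     "baseline_risk": 0.30, "relative_risk": 1.2,
     "sources": ["primary literature"]},
    {"trait": "Colon cancer", "category": "Digestive",
     "baseline_risk": 0.05, "relative_risk": 1.4,
     "sources": ["class model"]},
    {"trait": "Height", "category": "Whole body / aesthetic",
     "baseline_risk": None, "relative_risk": None, "sources": []},
]


def group_by_category(traits):
    """Map each category name to the list of trait records in it."""
    grouped = defaultdict(list)
    for trait in traits:
        grouped[trait["category"]].append(trait)
    return dict(grouped)


def absolute_risk(baseline_risk, relative_risk):
    """Assumed approximation: population risk times relative risk, capped at 1."""
    return min(1.0, baseline_risk * relative_risk)


def summarize(trait):
    """One plain-language line per trait; '[+]' plays the role of the black box."""
    if trait["baseline_risk"] is None:
        return "%s: no risk estimate" % trait["trait"]
    risk = absolute_risk(trait["baseline_risk"], trait["relative_risk"])
    marker = " [+]" if trait["sources"] else ""
    return "%s: %.0f%% risk (%.1f times the population risk)%s" % (
        trait["trait"], risk * 100, trait["relative_risk"], marker)


if __name__ == "__main__":
    for category, traits in sorted(group_by_category(TRAITS).items()):
        print(category)
        for trait in traits:
            print("  " + summarize(trait))
```

Keeping the sources on each record is what would let a curious user click through to the literature or model behind a number, while everyone else sees only the one-line summary.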
A quick lit search: diabetes
Type 2 diabetes is a late-onset disease that may be of interest, as it is both polygenic and includes behavioral/environmental risk factors. Janssens and van Duijn point out that, rather than being predictive, information about genes contributing to heart disease and diabetes can lead to behavioral changes aimed at lowering the risk of developing the disease.
Prior to this, Weedon et al. showed that having multiple risk allele copies increases risk in accordance with a multiplicative model (this type of statistical information can be used to affirm the effectiveness of our modeling). However, other studies, such as here and here, found that lifestyle/phenotypic factors and family history were more predictive than genetics of whether someone would actually develop diabetes.
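To make the multiplicative model concrete, here is a small Python sketch. It assumes the model works on the odds scale, so that each extra copy of a risk allele multiplies the odds of disease by that allele's per-allele odds ratio. The function names are hypothetical and the numbers are made-up illustration values, not figures from Weedon et al.

```python
def combined_odds_ratio(per_allele_odds_ratios, allele_counts):
    """Multiply per-allele odds ratios, once for every risk allele copy carried.

    allele_counts[i] is 0, 1 or 2: the number of copies of risk allele i.
    """
    combined = 1.0
    for odds_ratio, count in zip(per_allele_odds_ratios, allele_counts):
        if count not in (0, 1, 2):
            raise ValueError("allele count must be 0, 1 or 2")
        combined *= odds_ratio ** count
    return combined


def risk_from_odds_ratio(baseline_risk, odds_ratio):
    """Convert a baseline risk plus an odds ratio back into a probability."""
    baseline_odds = baseline_risk / (1.0 - baseline_risk)
    odds = baseline_odds * odds_ratio
    return odds / (1.0 + odds)


if __name__ == "__main__":
    # Illustration only: two loci, two copies at the first and one at the second.
    ratio = combined_odds_ratio([1.2, 1.5], [2, 1])
    print("combined odds ratio:", round(ratio, 2))
    print("risk at 10% baseline:", round(risk_from_odds_ratio(0.10, ratio), 3))
```

Comparing observed risk in people carrying several risk alleles against this kind of prediction is one way the published statistics could be used to check our own models.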
As a side note, I was slightly amused that a Google Scholar search for "highly predictive polygenic disease" turns up zero hits. Hopefully this will change in the years to come...
Basically, I'm up for doing anything that does not include programming. If anything design-oriented needs to be done, I'm happy to contribute to that rather than the biology.
Also, I like the idea of splitting into groups with similar interests.
I added to the Quantitative Trait Locus page in the "QTL mapping" subsection:
This can be done using BLAST, an online tool that allows users to enter a primary sequence and search for similar sequences within the BLAST database of genes from various organisms.
Using gene databases
HGMD "represents an attempt to collate known (published) gene lesions responsible for human inherited disease". This is extraordinarily useful for SNPCupid given that these are the only types of genes we want to focus on (for now). Thus, this website could be the best way to decide which genes to include in the SNPCupid database. I have not yet registered for access to the website, however, so I can't see details about any specific genes. Since single-nucleotide changes are shown as just the triplet changes, perhaps there is a way to parse the website for these mutations in particular. In addition, every gene has links to the mutations listed on OMIM, so we can cross-reference to discover phenotypic patterns, as well as perhaps the severity of the phenotype. Note: before explicitly using this website in its current form, we would need to know which genes to look at, which means that we would need to find out the severity of each phenotype.
Genecards proves more useful as an information source than as a route to discovering autosomal recessive genes on which to focus. The most useful parts of a page for a particular gene are the table of SNPs with their population frequency, and below that, the "Disorders and mutations" related to the gene, with links to OMIM and HGMD. It may be possible to parse Genecards to find the SNP variants in those genes that contain these "disorders and mutations" links to other websites. Alternatively, once a gene is found to be relevant for human disease in HGMD and confirmed in either HGMD or OMIM to have autosomal recessive inheritance, Genecards can be used to find more information, especially on the SNP variants.
An example of a Genecards page with these links is for NEK8, a gene involved in autosomal recessive polycystic kidney disease. Its Genecards page has links to the NEK8 OMIM page and to the NEK8 HGMD page. On the OMIM page, it can be seen that the disease is autosomal recessive, making this gene useful for SNPCupid. One can then use the SNPs back in Genecards to determine which SNPs to include in the SNPCupid-relevant allele database.
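The lookup route described above (find a disease gene in HGMD, confirm autosomal recessive inheritance in OMIM, then take the SNPs from Genecards) could be sketched as a simple filter. This is only an illustration: none of the three websites is actually queried, the records are hand-typed dictionaries with invented field names, `GENE_X` and `GENE_Y` are dummy genes, and the SNP identifiers are placeholders rather than real variants.

```python
# Hand-typed stand-ins for what would be collected from HGMD, OMIM and Genecards.
# Only the NEK8 inheritance pattern comes from the notes above; the rest is dummy data.
GENE_RECORDS = [
    {"gene": "NEK8", "in_hgmd": True,
     "omim_inheritance": "autosomal recessive",
     "genecards_snps": ["SNP_PLACEHOLDER_1", "SNP_PLACEHOLDER_2"]},
    {"gene": "GENE_X", "in_hgmd": True,
     "omim_inheritance": "autosomal dominant",
     "genecards_snps": ["SNP_PLACEHOLDER_3"]},
    {"gene": "GENE_Y", "in_hgmd": False,
     "omim_inheritance": None,
     "genecards_snps": []},
]


def snpcupid_candidates(records):
    """Keep genes that are in HGMD and autosomal recessive; return their SNP lists."""
    candidates = {}
    for record in records:
        if not record["in_hgmd"]:
            continue  # not a known disease gene
        if record["omim_inheritance"] != "autosomal recessive":
            continue  # SNPCupid only handles recessive inheritance for now
        candidates[record["gene"]] = list(record["genecards_snps"])
    return candidates


if __name__ == "__main__":
    print(snpcupid_candidates(GENE_RECORDS))
```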
I will note that this does not address polygenic inheritance patterns. Unfortunately, with the amount that is currently known, it would be too difficult to include these in the SNPCupid algorithm.
...what Joe said. (He pretty much wrote up all the details we discussed. (Thanks Joe))
This might be too broad (as in, I'm not thinking about one specific trait), but this takes the personal genome project one step further.
For now, the only really effective way to control a person's genes is by controlling the alleles inherited from parents. Thus, I have been thinking about the idea of playing with the statistical recombination of genes due to meiosis during reproduction to create Humans 2.0. Not only could this type of selection induce extra diversity, but it could try to prevent offspring from being homozygous recessive for a well-characterized genetic disease, eradicating those disorders.
The idea: create a genetics-based dating website where potential partners are chosen based on genetic characteristics that their offspring are likely to exhibit. Users could choose what they care about most, e.g. highest diversity of alleles, avoidance of homozygous recessive alleles, highest probability of blue eyes, dark hair, high intelligence, etc. (Though some of these would run into problems by creating less diversity...). The search would return the best genetic matches after taking into account other more tangible considerations, such as age. Of course, this requires getting to the point where we have personal genetic information for everyone taking part, which is currently a major barrier.
I will admit, even in the future the likelihood of someone using something like this is quite small. However, perhaps it could help people decide whether to have children once they are already together (similar to, but broader than, the genetic testing that is done for Tay-Sachs in Jewish communities). With a few known diseases in mind, it would be easier to test users' genetic information, as well. Eventually, perhaps we can pave the way for generations of some very healthy (and maybe smart and good-looking!) people. It might even be well-received by the anti-abortion crowd!
The step we could take towards achieving this is to document known recessive genes and write a program to predict the chance of a homozygous recessive offspring from the genetic information of the parents. We could also rate the different genes by severity (including negative vs. positive impact of the homozygous recessive, if any are non-disease) and include that in the algorithm when rating the desirability of possible combinations spanning the entire genome.
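A first pass at such a program might look like the following Python sketch. It assumes simple Mendelian inheritance at each gene: every parent passes on one of their two alleles with equal probability, independently of the other parent and of other genes. Genotypes are written as two-letter strings with a lowercase letter for the recessive allele (so `'Aa'` is a carrier). The function names, gene names and severity weights are hypothetical.

```python
def recessive_allele_fraction(genotype, recessive="a"):
    """Chance that a parent with this genotype passes on the recessive allele."""
    if len(genotype) != 2:
        raise ValueError("genotype must have exactly two alleles, e.g. 'Aa'")
    return genotype.count(recessive) / 2.0


def prob_homozygous_recessive(parent1, parent2, recessive="a"):
    """Chance that a child inherits the recessive allele from both parents."""
    return (recessive_allele_fraction(parent1, recessive)
            * recessive_allele_fraction(parent2, recessive))


def match_score(parent1_genotypes, parent2_genotypes, severity):
    """Severity-weighted sum of per-gene risks; a lower score is a better match.

    severity maps gene name to a weight (hypothetical scale, larger is worse).
    """
    score = 0.0
    for gene, weight in severity.items():
        score += weight * prob_homozygous_recessive(
            parent1_genotypes[gene], parent2_genotypes[gene])
    return score


if __name__ == "__main__":
    print(prob_homozygous_recessive("Aa", "Aa"))  # two carriers
    print(match_score({"gene1": "Aa", "gene2": "AA"},
                      {"gene1": "Aa", "gene2": "Aa"},
                      {"gene1": 10, "gene2": 3}))
```

The weighted sum is just one possible way to fold severity into a single rating; a positive-impact homozygous recessive could be given a negative weight under the same scheme.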
I'm tempted to steal this idea: Journal of Imaginary Genomics, 2006
(It's from this article about scientific integrity from a few years ago: One Last Question: Who Did the Work?)
Assignment for Sept. 24
(this got formatted strangely by openwetware...hopefully it still makes sense...)
seq = 'cggagcagctcactattcacccgatgagaggggaggagagagagagaaaatgtcctttag\
gccggttcctcttacttggcagagggaggctgctattctccgcctgcatttctttttctg\
gattacttagttatggcctttgcaaaggcaggggtatttgttttgatgcaaacctcaatc\
cctccccttctttgaatggtgtgccccaccccccgggtcgcctgcaacctaggcggacgc\
taccatggcgtagacagggagggaaagaagtgtgcagaaggcaagcccggaggcactttc\
aagaatgagcatatctcatcttcccggagaaaaaaaaaaaagaatggtacgtctgagaat\
gaaattttgaaagagtgcaatgatgggtcgtttgataatttgtcgggaaaaacaatctac\
ctgttatctagctttgggctaggccattccagttccagacgcaggctgaacgtcgtgaag\
cggaaggggcgggcccgcaggcgtccgtgtggtcctccgtgcagccctcggcccgagccg\
gttcttcctggtaggaggcggaactcgaattcatttctcccgctgccccatctcttagct\
cgcggttgtttcattccgcagtttcttcccatgcacctgccgcgtaccggccactttgtg\
ccgtacttacgtcatctttttcctaaatcgaggtggcatttacacacagcgccagtgcac\
acagcaagtgcacaggaagatgagttttggcccctaaccgctccgtgatgcctaccaagt\
cacagacccttttcatcgtcccagaaacgtttcatcacgtctcttcccagtcgattcccg\
accccacctttattttgatctccataaccattttgcctgttggagaacttcatatagaat\
ggaatcaggatgggcgctgtggctcacgcctgcactttggctcacgcctgcactttggga\
ggccgaggcgggcggattacttgaggataggagttccagaccagcgtggccaacgtggtg'
# Note: from here on the name len holds the sequence length (an integer),
# so the built-in len() function cannot be called again later in this script.
len = len(seq)
# length is 1020 bp

# 1. Please determine the GC content of p53seg.
# a is the counter and the loop adds 1 to 'a' every time it sees a
# c or a g in the sequence
a = 0
for i in range(len):
    if seq[i] == 'c' or seq[i] == 'g':
        a = a + 1
# a, which is the number of g's and c's, is 540
print('GC content is about %d percent' % ((a * 100) // len))
# 2. Determine the DNA reverse complement of p53seg.
# comp is the array of the complement sequence and the loop appends
# the complement of each base. revcomp is the reverse comp - the
# backwards sequence of comp
comp = []
for base in seq:
    if base == 'c':
        comp.append('g')
    elif base == 'g':
        comp.append('c')
    elif base == 'a':
        comp.append('t')
    elif base == 't':
        comp.append('a')
revcomp = []
for base in reversed(comp):
    revcomp.append(base)
revcompseq = "".join(revcomp)
print('the reverse complement sequence is ' + revcompseq)
# 3. Translate the p53seg gene into its protein
# sequence in all 6 frames (+1, +2, +3, -1, -2, -3)
# The c function turns a sequence into an array of codons; that is, 3
# bases per array element. The inputs are the sequence, which array is being
# made, and the open reading frame (orf).
def c(seq, codonarray, orfstart):
    for i in range(orfstart, len, 3):
        codonarray.append(seq[i:(i+3)])

codonarray1 = []
codonarray2 = []
codonarray3 = []
c(seq, codonarray1, 0)
c(seq, codonarray2, 1)
c(seq, codonarray3, 2)
revcodonarray1 = []
revcodonarray2 = []
revcodonarray3 = []
c(revcompseq, revcodonarray1, 0)
c(revcompseq, revcodonarray2, 1)
c(revcompseq, revcodonarray3, 2)
# The p function finds the start codon and then turns all the codons after it
# into the amino acids they code, stopping when the sequence reaches a stop
# codon. It then prints the amino acid sequence of the protein.
# codontable holds the same codon-to-amino-acid pairs as the original chain
# of if statements; '*' marks a stop codon.
codontable = {
    'ttt': 'F', 'ttc': 'F', 'tta': 'L', 'ttg': 'L',
    'tct': 'S', 'tcc': 'S', 'tca': 'S', 'tcg': 'S',
    'tat': 'Y', 'tac': 'Y', 'taa': '*', 'tag': '*',
    'tgt': 'C', 'tgc': 'C', 'tga': '*', 'tgg': 'W',
    'ctt': 'L', 'ctc': 'L', 'cta': 'L', 'ctg': 'L',
    'cct': 'P', 'ccc': 'P', 'cca': 'P', 'ccg': 'P',
    'cat': 'H', 'cac': 'H', 'caa': 'Q', 'cag': 'Q',
    'cgt': 'R', 'cgc': 'R', 'cga': 'R', 'cgg': 'R',
    'att': 'I', 'atc': 'I', 'ata': 'I', 'atg': 'M',
    'act': 'T', 'acc': 'T', 'aca': 'T', 'acg': 'T',
    'aat': 'N', 'aac': 'N', 'aaa': 'K', 'aag': 'K',
    'agt': 'S', 'agc': 'S', 'aga': 'R', 'agg': 'R',
    'gtt': 'V', 'gtc': 'V', 'gta': 'V', 'gtg': 'V',
    'gct': 'A', 'gcc': 'A', 'gca': 'A', 'gcg': 'A',
    'gat': 'D', 'gac': 'D', 'gaa': 'E', 'gag': 'E',
    'ggt': 'G', 'ggc': 'G', 'gga': 'G', 'ggg': 'G',
}

def p(codonarray):
    protseq = []
    if 'atg' not in codonarray:
        # without this check a frame with no start codon would crash
        print('no start codon found in this reading frame')
        return
    start = codonarray.index('atg')
    for codon in codonarray[start:]:
        aminoacid = codontable.get(codon)
        if aminoacid is None:
            continue  # skip the incomplete codon at the end of a frame
        protseq.append(aminoacid)
        if aminoacid == '*':
            break
    protseqjoined = "".join(protseq)
    print(protseqjoined)

print('the amino acid sequence for reading frame +1 is:')
p(codonarray1)
print('the amino acid sequence for reading frame +2 is:')
p(codonarray2)
print('the amino acid sequence for reading frame +3 is:')
p(codonarray3)
print('the amino acid sequence for reading frame -1 is:')
p(revcodonarray1)
print('the amino acid sequence for reading frame -2 is:')
p(revcodonarray2)
print('the amino acid sequence for reading frame -3 is:')
p(revcodonarray3)
# 4. Please introduce single base-pair mutations (i.e. replacement of
# A by T/C/G, G by A/T/C, etc…) to the p53seg gene at a rate of 1%
# (i.e. ~1 mutation every 100 base pairs) and document the changes to the
# protein sequence (give a couple of trial results). How often do you see
# premature terminations?
# mutseqarray is the sequence with the mutations put in through the loop that
# picks a random base to change out of every 100 and then picks a random
# different nucleotide to change it to. the functions of writing the codons and
# then the protein sequence are then repeated using the mutant sequence. The
# reading frame of the mutant sequence can be modified in the program code.
import random as rd

seqarray = []
for base in seq:
    seqarray.append(base)
# copy the list so that the mutations do not also change seqarray
mutseqarray = list(seqarray)
mutsite = []
# one random position inside each complete 100-base window
# (the last 20 bases of the 1020 bp sequence are never chosen)
for j in range(len // 100):
    mutsite.append(rd.randrange(0, 100) + (100 * j))
for site in mutsite:
    # choose among the three other bases so that every chosen site really changes
    otherbases = [b for b in 'atgc' if b != mutseqarray[site]]
    mutseqarray[site] = rd.choice(otherbases)
mutseq = "".join(mutseqarray)
mutcodonarray = []
c(mutseq, mutcodonarray, 2)
print('the amino acid sequence for reading frame +3 after mutations is:')
p(mutcodonarray)
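Question 4 also asks how often premature terminations occur, which the script above does not count. One way to estimate it is to repeat the mutation step many times and count the trials in which the protein ends early. The sketch below is self-contained and written for Python 3, so it does not reuse the functions above. It makes several assumptions: each base mutates independently with probability `rate` (rather than exactly one change per 100-base window), translation always starts at the first `atg` of the unmutated sequence, the input contains only lowercase a, c, g and t, and the short test sequence is made up for illustration rather than taken from p53seg. All function names are hypothetical.

```python
import random

BASES = "tcag"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, built from the usual t, c, a, g ordering of bases.
CODON_TABLE = {a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}


def translate_from(seq, start):
    """Translate codon by codon from index start up to the first stop ('*')."""
    protein = []
    for i in range(start, len(seq) - 2, 3):
        amino_acid = CODON_TABLE[seq[i:i + 3]]
        protein.append(amino_acid)
        if amino_acid == "*":
            break
    return "".join(protein)


def mutate(seq, rate, rng):
    """Replace each base by a different base with probability rate."""
    mutated = []
    for base in seq:
        if rng.random() < rate:
            base = rng.choice([b for b in BASES if b != base])
        mutated.append(base)
    return "".join(mutated)


def premature_stop_fraction(seq, rate, trials, seed=0):
    """Fraction of mutated copies whose protein ends in a stop earlier than the original."""
    start = seq.find("atg")
    if start == -1:
        raise ValueError("sequence has no atg start codon")
    rng = random.Random(seed)
    original = translate_from(seq, start)
    premature = 0
    for _ in range(trials):
        mutant = translate_from(mutate(seq, rate, rng), start)
        if mutant.endswith("*") and len(mutant) < len(original):
            premature += 1
    return premature / trials


if __name__ == "__main__":
    # Made-up test gene: start codon, 60 glutamine codons, stop codon.
    toy_gene = "atg" + "caa" * 60 + "taa"
    print(premature_stop_fraction(toy_gene, 0.01, 1000))
```

Fixing the seed makes a run repeatable; changing it gives another set of trials, in the spirit of "give a couple of trial results".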
Assignment for Sept. 15
Not only have I not done much programming, but I haven't really done graphing in Excel, making this a more confusing assignment than I think it should have been. I think I need to see during class how to graph the logistic growth curve. For the exponential growth, as others commented, whether it is growth or decay depends on the value of k being greater than or less than 1, respectively.
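For reference, the numbers for both curves can also be generated in a few lines of Python and then graphed in any tool. This sketch assumes discrete forms of the two models, x(n+1) = k*x(n) for exponential change and x(n+1) = x(n) + r*x(n)*(1 - x(n)/K) for logistic growth. The assignment may have used a different parameterization, and the function names and parameter values here are arbitrary.

```python
def exponential_series(x0, k, steps):
    """x(n+1) = k * x(n): grows when k > 1, decays when k < 1."""
    values = [x0]
    for _ in range(steps):
        values.append(values[-1] * k)
    return values


def logistic_series(x0, r, capacity, steps):
    """x(n+1) = x(n) + r * x(n) * (1 - x(n) / capacity)."""
    values = [x0]
    for _ in range(steps):
        x = values[-1]
        values.append(x + r * x * (1.0 - x / capacity))
    return values


if __name__ == "__main__":
    print(exponential_series(1.0, 2.0, 5))   # k > 1: growth
    print(exponential_series(8.0, 0.5, 5))   # k < 1: decay
    for value in logistic_series(1.0, 0.5, 100.0, 20):
        print(round(value, 2))
```

The logistic values rise almost exponentially at first and then level off near the carrying capacity, which gives the S-shaped curve when graphed.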