User talk:Kelly Brock

From OpenWetWare

Revision as of 05:18, 18 December 2009

Final Project

For the final project, I'm trying to find out how to further modularize the project in terms of OMIM and SNPedia as part of the Infrastructure group. While SNPedia is organized by rs IDs - meaning that its entries are officially registered in dbSNP, the NIH's public database of single nucleotide polymorphisms - OMIM uses a different categorization scheme. Its entries are classified using a six-digit numbering scheme whose first digit reflects where the locus lies in the genome (autosomal, X-linked, Y-linked, or mitochondrial) [[1]]. I'm trying to find out how I can translate between the two classification schemes in order to more fully integrate OMIM into our application of Trait-o-Matic, along with studying how different databases could be integrated into the system.

I've found out that the OMIM site also has a genetic map displaying where its entries are found in the genome. From what I've read so far, there should be a way to translate this location information into the rs IDs already used.

I'm not sure when the project is due, but barring any complications I plan on figuring this out and implementing it this weekend. The infrastructure group has been talking about meeting over this time span, so we should be able to present a polished final project. I would also like to keep working on this for the remainder of final exam period.

Also, I would definitely be up for helping anybody document their stuff, as a side note.


Yay - a general Project Idea Exists!

If we're looking at gene interactions, I'm not sure how much information is available biologically. The first step would be to decide how we're going to go about doing the research. Should we look at gene networks themselves, and see what metabolites coexist in a certain process à la flux balance analysis (FBA), or should we go through the literature and make a list of which genes have been shown to interact with which other genes? There will be a lot of biology to go through before we get to the hard coding. That being said, I would rather work on coding details than biology details, although I'm looking forward to doing both!


Thoughts on Class Meeting October 22 2009

I really liked the idea of exploring a genealogy-type experiment where we randomly pair males and females in our database, predict their offspring, and recurse to see how the population would change over time. This tool could help evolutionary biologists to test for a specific mutation and how likely it is to be present after k generations. Furthermore, this project would encompass SNPcupid - we would have to write the functionality to simulate offspring genomes, after all - and it would also give the project a more research-oriented focus. Then, everybody's happy! We could make the different parts available individually so that couples who wanted to learn about any potential time bombs in their potential child's DNA could get a preliminary screening through our program. It's the best of both worlds! The biggest issue, I think, would be computing time, since we would potentially be dealing with an entire database's worth of human DNA sequences, or at least the coding regions.

Flow of program:

Have options for:

 -How many generations you want to simulate, how many offspring per couple, etc.
 -The parents and any parameters (e.g. just test two, randomly select pairs in database, how many pairs, etc.)
 -How many random mutations are introduced into the genome
 -Have the option to look at entire genome or just a portion of it
 -Could highlight "interesting" portions of genome that have changed
 -Might also get into predicting traits based on genes, another idea that was discussed (e.g. Trait-o-matic) - could use those handy-dandy databases for that
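
The options above could be wired together roughly as follows. This is only a minimal sketch under strong assumptions: each person is reduced to a single two-allele site ('A' or 'a'), mating is random, and children are assigned to the female and male pools alternately. The names `make_child`, `allele_frequency`, and `simulate` are hypothetical and do not come from SNPcupid or Trait-o-matic.

```python
import random

def make_child(mother, father, rng, mutation_rate=0.0):
    """Return a child genotype: one allele drawn at random from each parent.

    Genotypes are 2-tuples of alleles, e.g. ('A', 'a'). With probability
    mutation_rate, each inherited allele flips to the other allele.
    """
    child = [rng.choice(mother), rng.choice(father)]
    for idx, allele in enumerate(child):
        if rng.random() < mutation_rate:
            child[idx] = 'a' if allele == 'A' else 'A'
    return tuple(child)

def allele_frequency(population, allele='a'):
    """Fraction of all allele copies in the population equal to `allele`."""
    copies = [a for genotype in population for a in genotype]
    return copies.count(allele) / float(len(copies))

def simulate(females, males, generations, offspring_per_couple=2,
             mutation_rate=0.0, seed=None):
    """Randomly pair females and males, breed, and repeat for k generations.

    Returns the frequency of allele 'a' at generations 0..k.
    Assumes offspring_per_couple >= 2 so both pools stay populated.
    """
    if offspring_per_couple < 2:
        raise ValueError("need at least 2 offspring per couple")
    rng = random.Random(seed)
    # Work on copies so the caller's lists are not shuffled in place.
    females, males = list(females), list(males)
    history = [allele_frequency(females + males)]
    for _ in range(generations):
        rng.shuffle(females)
        rng.shuffle(males)
        next_f, next_m = [], []
        for mother, father in zip(females, males):
            for n in range(offspring_per_couple):
                child = make_child(mother, father, rng, mutation_rate)
                (next_f if n % 2 == 0 else next_m).append(child)
        females, males = next_f, next_m
        history.append(allele_frequency(females + males))
    return history
```

Calling `simulate(females, males, generations=10, seed=1)` returns a list of eleven allele frequencies, which is the "how likely is the mutation after k generations" number from the discussion above. Scaling this from one site to whole genomes is where the computing-time worry comes in.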

Over and out!

Kelly


Assignments

Assignment 5: 10/13/2009 and 10/20/2009

I'm not sure exactly what the assignment is referring to - we have to specifically use the databases? Also, I'm not sure if this is due today . . .

So Anugraha's and my algorithm would give an output something like:

Genes in Common:
 AACTG....GTA     location     trait in common
 ATGACCT...TA     location     trait in common
 ...
 AGTCA.....AA     location     trait in common

Then we could use this data structure to look up the gene in OMIM. OMIM identifies its genes by a six-digit code [[2]], so we would first have to find the translation key between a position and the gene it corresponds to. For example, if we identify a sequence associated with susceptibility to malaria, then we would cross-check it against the OMIM database. Then we could actively search for that gene to see if the trait has already been documented. Traits are also listed under a six-digit code, so we would also need translation services for that component. I am sure that the information needed to find which codes correspond to which location is floating around on the internet. Everything's on the internet.
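
One way the position-to-gene "translation key" might look is sketched below. Everything in it is an assumption for illustration: the `Hit` record, the `build_index` and `lookup` helpers, and especially `GENE_TABLE`, whose coordinates and IDs are invented placeholders rather than real OMIM entries. A real table would have to be built from a downloaded annotation file.

```python
import bisect
from collections import namedtuple

# One row of the "Genes in Common" output: sequence, where it is, shared trait.
Hit = namedtuple("Hit", ["sequence", "chromosome", "position", "trait"])

# PLACEHOLDER annotation table: (chromosome, start, end, database ID).
# These coordinates and IDs are invented for illustration only.
GENE_TABLE = [
    ("chr1", 1000, 5000, "ID_000001"),
    ("chr1", 8000, 9500, "ID_000002"),
    ("chr2", 200, 900, "ID_000003"),
]

def build_index(gene_table):
    """Group gene intervals by chromosome, sorted by start position."""
    index = {}
    for chrom, start, end, gene_id in gene_table:
        index.setdefault(chrom, []).append((start, end, gene_id))
    for intervals in index.values():
        intervals.sort()
    return index

def lookup(index, chromosome, position):
    """Return the ID of the gene containing `position`, or None.

    Assumes the intervals on one chromosome do not overlap.
    """
    intervals = index.get(chromosome, [])
    starts = [start for start, _, _ in intervals]
    # Rightmost interval whose start is <= position.
    i = bisect.bisect_right(starts, position) - 1
    if i >= 0:
        start, end, gene_id = intervals[i]
        if start <= position <= end:
            return gene_id
    return None

index = build_index(GENE_TABLE)
hit = Hit("ACGT", "chr1", 1200, "placeholder trait")
gene_id = lookup(index, hit.chromosome, hit.position)
```

The binary search keeps each lookup fast even with a large gene table, and the same index could be filled from whichever database ends up supplying the coordinates.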

Update: Actually, when I was looking at GeneCards, the database that links all "known and predicted human genes" and includes disease relationship information, I found that it would actually be better than OMIM to use. HGMD, a different database, also lists gene locations up front, which could definitely be helpful in cross-correlating between different data sets if one set is incomplete (or to help modularize so that individual gene name designations are not as necessary).

For the information part of the assignment:

QTL refers to quantitative trait locus, which describes polygenic traits and how each gene might contribute to the overall phenotype; epistasis is "the interaction between genes," which seems to be a fancier way of saying "gene interactions." I didn't change anything on the wiki - I figured my classmates pretty much had it covered and I didn't want to be a source of incorrect knowledge. For the record, our algorithm can - conceivably, at least - work with epistasis!
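
To make the distinction concrete, here is a toy sketch contrasting a purely additive polygenic trait with one that has an epistatic interaction. The locus names, effect sizes, and both function names are made up for illustration; this is not a model taken from the class wiki or from our algorithm.

```python
def additive_phenotype(genotype, effects):
    """Phenotype as a plain sum of per-locus effects (no interaction).

    genotype: dict of locus name -> number of variant copies (0, 1, or 2)
    effects:  dict of locus name -> effect per copy (must cover every locus)
    """
    return sum(effects[locus] * count for locus, count in genotype.items())

def epistatic_phenotype(genotype, effects, interactions):
    """Additive phenotype plus pairwise interaction terms.

    interactions maps a (locus_a, locus_b) pair to an extra effect that
    applies only when both loci carry at least one copy of the variant.
    """
    value = additive_phenotype(genotype, effects)
    for (a, b), extra in interactions.items():
        if genotype.get(a, 0) > 0 and genotype.get(b, 0) > 0:
            value += extra
    return value

# Hypothetical loci and effect sizes.
effects = {"locus1": 1.0, "locus2": 2.0}
interactions = {("locus1", "locus2"): -3.0}

both = {"locus1": 1, "locus2": 2}
only_second = {"locus1": 0, "locus2": 2}
```

With these made-up numbers, a person carrying variants at both loci scores 5.0 under the additive model but 2.0 once the interaction is included, while a person with the variant at only one locus scores 4.0 either way.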


Is this what the assignment's asking for?

Kelly



Assignment 4: 10/08/09

Anugraha and I met and talked about implementing the Phenomenal Pheno-matic, a program to generate hypotheses based on data. We could search our databases for genes that were overexpressed to try to find a correlation between different gene types and their phenotypes. This would involve use of OMIM, GeneTests, SNPedia, and PGP, and would be a good interdisciplinary project. For the longer, more detailed description, please see the edited Project talk page [[3]].


Assignment 3. Brainstorming Human 2.0

Looking through OMIM, GeneTests, and SNPedia was definitely an adventure - I've never felt so paranoid about my own genes before! (unless you include that time I went to the fourth grade in enormous bell-bottom pants with cartoon characters drawn on them). I think the most surprising entry from SNPedia was Rs3057, which has been linked to having perfect pitch [[4]].

If we want to create a Human 2.0, I wonder if we want to start by focusing on artistic abilities. Are painters more likely to have a certain mutation, for example? In my MCB80 course, we talked about how a not-insignificant number of famous painters have bad depth perception - maybe seeing the world as flat helps them translate their visions onto flat paper. We could document musicians with sequenced genes and try to find correlations between their genotypes and musical phenotypes. As a side project, we could also study how big a role the environment plays in determining somebody's artistic abilities by seeing how dissimilar the genes are.

Alternatively, another cool project would be to use genetic information to determine a person's risk for becoming addicted to things like nicotine and alcohol. We could do this by refreshing a list of genes currently thought to be associated with these diseases and implementing a fast search algorithm to go through a given genome. Gaining large data sets and comparing our predicted results with reality would be an interesting assignment as well.
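
A minimal sketch of what such a search could look like, assuming a genome is already available as a dictionary from variant ID to a two-letter genotype. The function name `find_risk_variants` and the IDs `variant_1` through `variant_4` are hypothetical placeholders, not real SNPs associated with addiction.

```python
def find_risk_variants(genotypes, risk_alleles):
    """Return {variant_id: copies} for risk alleles present in a genome.

    genotypes:    dict of variant ID -> two-letter genotype string, e.g. "AG"
    risk_alleles: dict of variant ID -> the single-letter risk allele
    """
    found = {}
    # Loop over the (short) risk list; each dict lookup is O(1) on average,
    # so the cost depends on the list length, not on the genome size.
    for variant_id, risk in risk_alleles.items():
        genotype = genotypes.get(variant_id)
        if genotype is None:
            continue  # this variant was not called in the genome
        copies = genotype.count(risk)
        if copies:
            found[variant_id] = copies
    return found

# Placeholder data for illustration only.
risk_list = {"variant_1": "A", "variant_2": "T", "variant_3": "G"}
person = {"variant_1": "AG", "variant_2": "CC", "variant_4": "TT"}
report = find_risk_variants(person, risk_list)
```

Refreshing the list of associated genes then only means rebuilding `risk_list`; the search itself stays the same.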


Assignment 2. Python Epicness

I organized the Python code by questions 1, 2, 3, and 4 as indicated in the comments. I redirected the output stream into a txt file, which is what I turned in on Thursday for my answers. This was by far my favorite homework assignment this week! I like Python as a language because it seems like a really good mix of C, Scheme (especially the lists and dictionaries), and Matlab. As far as the experiment itself goes, part 4 was the most interesting to me because it modeled actual mutations instead of providing intrinsic data about the sequence. I hope I did the "get 1% and randomly mutate them" like y'all wanted!

<syntax = python>

# Kelly Brock
# File: BiophysAsst3P1.py
# Answer four parts of Assignment 3

import random

# For part 4, when we have to do multiple experiments
TRIALS = 6

# Input genetic sequence into memory
sequence = "cggagcagctcactattcacccgatgagaggggaggagagagagagaaaatgtcctttag"
sequence += "gccggttcctcttacttggcagagggaggctgctattctccgcctgcatttctttttctg"
sequence += "gattacttagttatggcctttgcaaaggcaggggtatttgttttgatgcaaacctcaatc"
sequence += "cctccccttctttgaatggtgtgccccaccccccgggtcgcctgcaacctaggcggacgc"
sequence += "taccatggcgtagacagggagggaaagaagtgtgcagaaggcaagcccggaggcactttc"
sequence += "aagaatgagcatatctcatcttcccggagaaaaaaaaaaaagaatggtacgtctgagaat"
sequence += "gaaattttgaaagagtgcaatgatgggtcgtttgataatttgtcgggaaaaacaatctac"
sequence += "ctgttatctagctttgggctaggccattccagttccagacgcaggctgaacgtcgtgaag"
sequence += "cggaaggggcgggcccgcaggcgtccgtgtggtcctccgtgcagccctcggcccgagccg"
sequence += "gttcttcctggtaggaggcggaactcgaattcatttctcccgctgccccatctcttagct"
sequence += "cgcggttgtttcattccgcagtttcttcccatgcacctgccgcgtaccggccactttgtg"
sequence += "ccgtacttacgtcatctttttcctaaatcgaggtggcatttacacacagcgccagtgcac"
sequence += "acagcaagtgcacaggaagatgagttttggcccctaaccgctccgtgatgcctaccaagt"
sequence += "cacagacccttttcatcgtcccagaaacgtttcatcacgtctcttcccagtcgattcccg"
sequence += "accccacctttattttgatctccataaccattttgcctgttggagaacttcatatagaat"
sequence += "ggaatcaggatgggcgctgtggctcacgcctgcactttggctcacgcctgcactttggga"
sequence += "ggccgaggcgggcggattacttgaggataggagttccagaccagcgtggccaacgtggtg"

# Part 1 - CG content
# %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

print("Kelly Brock\n")
print("Biophysics 101 Asst 3\n")
print("Part I\n\n")

# Variable to keep track of how many c's and g's we've encountered
count = 0

# Check each character in our genetic string
for i in range(0, len(sequence)):
    if sequence[i] == 'g' or sequence[i] == 'c':
        count += 1

# Compute fraction of total characters equal to c or g
answer = count*1.0/len(sequence)
print("CG fraction is: " + str(answer) + "\n")

# Part 2 - Find reverse complement
# %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

print("\nPart II\n")

# Make list to hold reversed sequence
RevSeq = list(sequence[::-1])

# Change all values to their complements
for i in range(0, len(RevSeq)):
    if RevSeq[i] == 'c':
        RevSeq[i] = 'g'
    elif RevSeq[i] == 'g':
        RevSeq[i] = 'c'
    elif RevSeq[i] == 't':
        RevSeq[i] = 'a'
    elif RevSeq[i] == 'a':
        RevSeq[i] = 't'

# Recast our sequence back into a string
RevSeq = "".join(RevSeq)
print("Reverse Complement Sequence")
print(RevSeq)

# Part 3 - Determining protein sequence
# %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

print("\nPart III\n")

# Hardcode the protein dictionary (codon -> one-letter amino acid, * = stop)
standard = {
    'ttt': 'F', 'tct': 'S', 'tat': 'Y', 'tgt': 'C',
    'ttc': 'F', 'tcc': 'S', 'tac': 'Y', 'tgc': 'C',
    'tta': 'L', 'tca': 'S', 'taa': '*', 'tga': '*',
    'ttg': 'L', 'tcg': 'S', 'tag': '*', 'tgg': 'W',

    'ctt': 'L', 'cct': 'P', 'cat': 'H', 'cgt': 'R',
    'ctc': 'L', 'ccc': 'P', 'cac': 'H', 'cgc': 'R',
    'cta': 'L', 'cca': 'P', 'caa': 'Q', 'cga': 'R',
    'ctg': 'L', 'ccg': 'P', 'cag': 'Q', 'cgg': 'R',

    'att': 'I', 'act': 'T', 'aat': 'N', 'agt': 'S',
    'atc': 'I', 'acc': 'T', 'aac': 'N', 'agc': 'S',
    'ata': 'I', 'aca': 'T', 'aaa': 'K', 'aga': 'R',
    'atg': 'M', 'acg': 'T', 'aag': 'K', 'agg': 'R',

    'gtt': 'V', 'gct': 'A', 'gat': 'D', 'ggt': 'G',
    'gtc': 'V', 'gcc': 'A', 'gac': 'D', 'ggc': 'G',
    'gta': 'V', 'gca': 'A', 'gaa': 'E', 'gga': 'G',
    'gtg': 'V', 'gcg': 'A', 'gag': 'E', 'ggg': 'G'
}

# Make function to find protein abbreviations
# sequence is forward genetic seq, RevSeq is reverse
# complement, and posneg indicates whether we want to
# do all frames (>1) or just the positive ones (1)

def proteinabbr(sequence, RevSeq, posneg):

    # Will hold the list of one-letter abbreviations for the proteins
    # encoded by p53 in different frames
    protein = list()
    totalprot = list()

    # Top loop chooses + open frame (0) or - open frame (1)
    for l in range(0, posneg):

        # Use + open frame with normal sequence
        if l == 0:
            sign = " + "
            seq = sequence

        # The second time, use reverse complement sequence
        else:
            sign = " - "
            seq = RevSeq

        # There are 3 possible reading frames for both normal and
        # reverse complement sequences
        for m in range(0, 3):
            print("\nFrame" + sign + str(m+1))

            # Go through each triple in our frame
            for i in range(m, len(seq), 3):

                # Prevents error if not evenly divisible by 3
                if (i+2) < len(seq):

                    # Lookup protein value in dictionary and add it to the
                    # protein list
                    protein.append(standard[seq[i:(i+3)]])

            # Prints result as string and clears list
            print("".join(protein))
            totalprot.append(protein)
            protein = list()

    return totalprot

# Do all that we just defined and store as original, unmutated sequence
original = proteinabbr(sequence, RevSeq, 2)

# Part 4
# %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

print("\nPart IV\n")

# Have to do this for a certain number of trials
for j in range(0, TRIALS):

    # Make list to hold the random positions - there should be 1% of total
    mutspot = random.sample(range(len(sequence)), len(sequence)//100)

    # Make string of sequence mutable
    mutseq = list(sequence)

    # Possibilities for mutation for each character
    A = ['c', 'g', 't']
    C = ['a', 'g', 't']
    G = ['a', 'c', 't']
    T = ['a', 'c', 'g']

    # Choose another nucleotide for each spot where you assigned a mutation
    for i in mutspot:
        if mutseq[i] == 'a':
            mutseq[i] = random.choice(A)
        elif mutseq[i] == 'c':
            mutseq[i] = random.choice(C)
        elif mutseq[i] == 'g':
            mutseq[i] = random.choice(G)
        else:
            mutseq[i] = random.choice(T)

    print("\nMutated Protein Sequence for Frames +1,2,3 in Trial " + str(j))
    print("\nMUTATED SEQUENCE")
    print("".join(mutseq))

    # Translate mutated string into 3-frame protein abbreviations
    mutprotseq = proteinabbr("".join(mutseq), RevSeq, 1)

    print("\nNumber of premature stop codons: ")

    # Go through each ORF we computed
    for k in range(0, 3):

        # We also want to see how many changes were introduced
        countmut = 0

        # We want to count how many times a new stop codon is introduced
        # into the code, compared to the original sequence. * = stop
        countstop = 0

        # Find each protein within each ORF
        for l in range(0, len(mutprotseq[k])):

            # Is it a mutation?
            if mutprotseq[k][l] != original[k][l]:
                countmut += 1

                # Did you introduce a new stop codon?
                if mutprotseq[k][l] == '*':
                    countstop += 1

        print("\nThe total number of protein mutations was " + str(countmut))
        print("Of these, " + str(countstop) + " were incorrect stop codons.")
</syntax>

Assignment 1. Python and Excel

I'm currently having technical difficulties getting Python to run - it doesn't want to recognize the matplotlib or numpy libraries. However, I did complete the Excel graphs - with increasing k for the first equation, the function values also increased as expected, resulting in different endpoints of the curve. For the second equation, the curve decreased to very negative numbers, like a reflection of a normal exponential curve. For the third graph, I got that all values were zero since (k* e^x * (1-e^x)) would always be <= 0, and the max would automatically choose zero.
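
The third result can be double-checked without matplotlib or numpy. The sketch below assumes that the third graph plots max(0, k * e^x * (1 - e^x)), as quoted above, and that the Excel sheet used k > 0 and x >= 0; the particular k values and x range are my own guesses, and `third_curve` is a hypothetical name.

```python
import math

def third_curve(x, k):
    """max(0, k * e^x * (1 - e^x)), the expression quoted above."""
    return max(0.0, k * math.exp(x) * (1.0 - math.exp(x)))

# Assumed sampling: x from 0 to 10 in steps of 0.5, a few positive k values.
xs = [i * 0.5 for i in range(0, 21)]
all_zero = all(third_curve(x, k) == 0.0 for k in (0.5, 1.0, 2.0) for x in xs)

# For comparison, one negative x, where 1 - e^x is positive.
negative_x_value = third_curve(-1.0, 1.0)
```

Under those assumptions `all_zero` is True, matching the flat Excel graph. The expression only becomes positive for negative x, so the all-zero result depends on the x range that was plotted.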