From OpenWetWare

Revision as of 11:57, 11 May 2007 by JCAnderson (Talk | contribs)

Sequencing Analysis

So, you've made your basic or composite part, and you think you've found a colony that contains the right product. You now want to confirm that it's right. You need to sequence it to find out exactly what is in your tube. In general, sequencing is cheaper and better when outsourced to a company or core facility. So, you send out your sample, and they send you back a sequence file. In this part of the tutorial, we're going to go through how sequencing works and how to analyze the data they send you.
Before we start, you will need ApE (A Plasmid Editor). If you haven't already done so, download it. Then replace the feature annotation database within ApE with an updated version. To do this, find the directory in which you installed ApE on your computer and locate the file "Default_Features.txt". On my computer, it's in C:\Program Files\ApE\Accessory Files\Features. Replace this file with the updated version. Additionally, download the program FinchTV so you can view chromatograms.

How sequencing works

What you really need to know

Sequencing starts with a sample of plasmid DNA or PCR product and one DNA oligonucleotide. This is what you send to the sequencing facility. They do something called "cycle sequencing" to your sample, and email you back some data. You can expect between 400bp and 1000bp of "true" data that begins somewhere around 20-50 bp into the read. Where "good" and "bad" data starts and ends varies from sample to sample, and we'll get into that later. The sequence you read corresponds to the region 3' of where your oligo anneals. So, you must pick the appropriate oligo to send to the sequencers based on which region of the plasmid you are interested in.
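If you want to sanity-check an oligo choice before sending a sample out, the rule of thumb above can be written down as a few lines of Python. This is only a sketch: the function names are made up for this tutorial, the `skip=50` and `read_length=800` defaults are just values picked from the ranges quoted above, and it assumes the oligo anneals in the forward direction on a linear coordinate system.

```python
def expected_read_window(primer_end, skip=50, read_length=800):
    """Return the 1-indexed (start, end) of template bases likely to read cleanly.

    primer_end  -- position of the last (3') base of the annealed oligo
    skip        -- bases just past the oligo assumed unreadable (assumption: 50)
    read_length -- number of usable bases assumed for the read (assumption: 800)
    """
    start = primer_end + skip + 1
    end = start + read_length - 1
    return start, end


def covers(primer_end, region_start, region_end, **window_options):
    """True if the whole region of interest falls inside the expected window."""
    start, end = expected_read_window(primer_end, **window_options)
    return start <= region_start and region_end <= end


# An oligo ending at base 100 should read roughly bases 151-950.
print(expected_read_window(100))
print(covers(100, 160, 700))   # region starts far enough from the oligo
print(covers(100, 120, 700))   # region starts too close to the oligo
```

The real limits vary from sample to sample, so treat the window as a planning aid and not as a guarantee.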

Overview of the process

(From http://
So, the facility is going to run a reaction similar to a PCR on your sample and then take the products generated in that reaction and load them into an instrument. The reaction is going to contain many little fragments of your sequence, and the machine will separate them to single base pair resolution using capillary electrophoresis. Capillary electrophoresis is basically like the agarose gels we run in lab, but the gel is in a long narrow tube. The instrument detects each little DNA fragment as it comes off the column by fluorescence, and the spectrum of fluorescence (the "chromatogram") can be interpreted by software as a string of A's, C's, T's and G's referred to as the "calls". They will send you both a text file of calls and a chromatogram file.

How the cycling reaction works

From http:// The cycling reaction starts by denaturing your sample plasmid and annealing the oligo to its homologous sequence. The reaction is essentially a PCR reaction (dNTPs, thermostable DNA polymerase, buffer), so the polymerase starts adding bases to the 3' end of the oligo. However, there are two additional components -- ddNTPs (dideoxynucleotides, in "A" of the figure) and dyes. The dye makes the synthesized products visible in the electrophoresis instrument. The ddNTPs are chain terminators. Because they lack a 3' hydroxyl group, whenever one of these gets incorporated into a growing DNA strand, synthesis cannot proceed further, resulting in a truncated product. For each cycling reaction, one of the 4 ddNTPs is added. So, in the reaction with ddATP, some chains get terminated at each A, giving a set of products that end at every A position, and so on for all 4 reactions. The "cycling" aspect of this is that the process of denaturing, annealing, and extending is repeated so that there is linear (but not exponential) amplification of the original plasmid template.
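The chain-termination idea is easy to play with in code. The sketch below is a toy model in Python, not a simulation of real chemistry: it pretends every possible terminated product is made in each of the four reactions, sorts the products by length the way electrophoresis would, and reads the sequence back from which reaction each length came from. The template and the function names are invented for illustration.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def terminated_lengths(new_strand, dd_base):
    """Lengths of the products that end on dd_base.

    new_strand is the strand the polymerase would synthesize off the oligo,
    written 5'->3'. Termination can happen wherever dd_base would be added.
    """
    return [i + 1 for i, base in enumerate(new_strand) if base == dd_base]


def read_from_fragments(new_strand):
    """Rebuild the sequence from the four ddNTP reactions."""
    # One reaction ("lane") per ddNTP.
    lanes = {base: terminated_lengths(new_strand, base) for base in "ACGT"}
    # Shortest fragment first, as on a gel or capillary; the lane gives the base.
    by_length = sorted(
        (length, base) for base, lengths in lanes.items() for length in lengths
    )
    return "".join(base for _, base in by_length)


template = "TACGGTCA"  # hypothetical template, written 3'->5'
new_strand = "".join(COMPLEMENT[b] for b in template)  # synthesized 5'->3'

print(terminated_lengths(new_strand, "A"))  # lengths of products in the ddATP lane
print(read_from_fragments(new_strand))      # same as new_strand
```

Because every length from 1 up to the full strand shows up in exactly one lane, reading the lanes in order of fragment size recovers the sequence base by base.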

If we were to load these cycling reactions on a normal gel like we have in lab, you'd see something like the gel at left (from http:// In this gel, you can see that for every vertical space on the gel there is one band from one of the lanes. You can therefore read off what base was present at each position.

In practice, since the sample is run by capillary electrophoresis, you end up with a chromatogram plotting fluorescence intensity versus time like this:

(From http:// That is an example of a chromatogram--the raw data from sequencing which you receive with every sample. The "calling" of the bases is just an algorithm's best interpretation of what each peak of that spectrum corresponds to. Usually the calls are pretty accurate, but occasionally the quality of the data is poor and the calls are wrong. The remainder of this tutorial deals with how you interpret the sequencing results.
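To make "calling" a little more concrete, here is a toy base caller in Python. It is not the algorithm that any real base-calling software uses; the intensity numbers and the `min_ratio` threshold are invented for illustration. It just picks the brightest of the four dye channels at each peak and gives up (calls "N") when no channel clearly wins, which is roughly what a messy stretch of chromatogram looks like.

```python
def call_base(intensities, min_ratio=2.0):
    """Call one base from the four channel intensities at a peak.

    intensities -- dict mapping "A", "C", "G", "T" to a fluorescence value
    min_ratio   -- how much brighter the top channel must be than the
                   runner-up before we trust the call (assumption: 2.0)
    """
    ranked = sorted(intensities.items(), key=lambda kv: kv[1], reverse=True)
    (best, top), (_, second) = ranked[0], ranked[1]
    if second > 0 and top / second < min_ratio:
        return "N"  # ambiguous peak: no channel clearly wins
    return best


peaks = [
    {"A": 900, "C": 40, "G": 30, "T": 25},     # clean A peak
    {"A": 35, "C": 20, "G": 850, "T": 30},     # clean G peak
    {"A": 300, "C": 280, "G": 310, "T": 290},  # everything overlaps
]

calls = "".join(call_base(p) for p in peaks)
print(calls)
```

The takeaway is the same as in the text: a call is only as good as the peak behind it, so when a call looks suspicious, go look at the trace.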

Interpreting a sequencing result

Go ahead and download the following 2 files:

 Image:Jca387 ca998 2007-03-10 D02 005.txt (The calls)
 Image:Jca387 ca998 2007-03-10 D02 005.ab1 (The chromatogram)

Open up jca387_ca998_2007-03-10_D02_005.txt in notepad, select all the text, and paste it into a window of ApE. Hit ctrl-K. This will search through the feature database and light up any features present in the sequence. Run your cursor over some of the colored text and take a look at what's in there. You've seen this plasmid before, it's pBca9145-Bca1089, the Biobricks version 2.0 RFP basic part. You should see RFP and the 4 Biobrick 2.0 restriction sites: EcoRI, BglII, BamHI, and XhoI.
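ApE does the feature lookup for you, but the restriction-site part of it is simple enough to sketch in Python. The recognition sequences below are the standard ones for these four enzymes; the function name and the short test sequence are made up for illustration. All four sites are palindromic, so scanning one strand is enough.

```python
SITES = {
    "EcoRI": "GAATTC",
    "BglII": "AGATCT",
    "BamHI": "GGATCC",
    "XhoI": "CTCGAG",
}


def find_sites(seq, sites=SITES):
    """Return {enzyme: [1-indexed start positions]} for each site in seq."""
    seq = seq.upper()
    hits = {}
    for name, motif in sites.items():
        positions = []
        i = seq.find(motif)
        while i != -1:
            positions.append(i + 1)      # report 1-indexed, like ApE
            i = seq.find(motif, i + 1)   # keep looking past this hit
        hits[name] = positions
    return hits


# Hypothetical snippet with one of each site, not a real read.
snippet = "ttgaattcatgagatcttttggatccaactcgagtt"
for enzyme, positions in find_sites(snippet).items():
    print(enzyme, positions)
```

To try it on real data, paste the contents of the calls file into a string in place of `snippet`.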

Let's now compare this read to the sequence file you downloaded for the previous tutorials. If you can't find the file, here's the link. Open up JCASeq_pBca9145-Bca1089.str in a second window of ApE. Highlight all the sequence in the sequencing read, copy it, and search for that string of text in pBca9145-Bca1089.

Uh oh...what happened? (You should have gotten an error saying "No sequence found".) Does this mean the plasmid is wrong? Um, no, not at all. In fact, this is par for the course. This is about as good as a read gets, and we know this plasmid is perfectly fine. So, what's up? Well, go ahead and open the ab1 file in FinchTV and let's look at the raw data.

First of all, the read begins directly 3' to the spot where the oligo anneals. In this case, the oligo was ca998 (gtatcacgaggcagaatttcag), so the first few bases should have been "ataaaaaaaat". Clearly, though, the first 35 bases of this read are total garbage. That's normal. An important take-home point from this is that if your oligo anneals closer than 50bp to the sequence you need to read, you're probably not going to get the data you want. From around 35bp in to around 800bp, this read looks really nice. Go ahead and select bases 35 to 813 of the sequence file in ApE and see if they match pBca9145-Bca1089. You should now be able to light up this region within pBca9145-Bca1089.
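The "trim the junk off both ends, then search" step can also be sketched in Python. This is an illustration of the idea, not what ApE does internally: the function names are invented, the default cut points are just the 35 and 813 used above, and the toy sequences are made up. It also tries the reverse complement, since a read from an oligo on the other strand would only match that way, and it lets the match wrap around the end of the reference because plasmids are circular.

```python
def revcomp(seq):
    """Reverse complement of a DNA string (other letters are left alone)."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def locate(read, reference, start=35, end=813):
    """Trim read to 1-indexed bases start..end and look for it in reference.

    Returns (strand, 1-indexed position in reference) or None if the
    trimmed read is not found exactly on either strand.
    """
    core = read.upper()[start - 1:end]
    ref = reference.upper()
    # Treat the reference as circular: allow a match to run past the end.
    circular = ref + ref[:len(core) - 1]
    for strand, query in (("+", core), ("-", revcomp(core))):
        i = circular.find(query)
        if i != -1:
            return strand, i + 1
    return None


# Toy data: a made-up reference and a "read" with junk at both ends.
reference = "GATTACAGGCTTAACCGGTTAGC"
read = "NNNNN" + reference[4:16] + "NNNN"

print(locate(read, reference, start=1, end=len(read)))  # whole read: junk spoils it
print(locate(read, reference, start=6, end=17))         # trimmed read: found
```

An exact search like this fails on a single wrong call, which is why the untrimmed read found nothing even though the plasmid is fine.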

So, what can we conclude about plasmid pBca9145-Bca1089? We definitely can conclude that the part we made is absolutely correct. The quality of the read for the region between the BglII and BamHI sites is perfect here. And really, that's all we really want to get out of this sequencing effort. However, we can't say anything about the rest of the plasmid. We know nothing from this read about the sequence around the colE1 origin or the Bla gene. If we wanted to know something about those regions, we'd have to use a different oligonucleotide for sequencing that corresponded to those regions.

This first example is pretty easy. We see the entire biobrick part, and it was a perfect match to the template. So, no worries. Now let's look at a harder-to-interpret case. Download the following:


Put the calls file into ApE, and hit ctrl-K. Two EcoRI sites, one BglII site, and no Biobricks pop up. Now open up pBca9145-Bca1126.str and look at what should be in this file. So, is this thing wrong? Well, we can't really say anything yet, but we'll have to look at it very closely. First of all, we were only expecting one EcoRI site. The site at position 1 of the read has a good chance of being false. Open up the ab1 file and look at the region of the read. Does the chromatogram trace file support this conclusion? Looks to me like any base calls in that region are pure speculation. So, I won't worry about a potential EcoRI site.

Select bases 190-315 of the read file and search for this string of text in pBca9145-Bca1126.str. You should see that both files have this string, which is a fragment of the intended biobrick part, a phoA coding sequence. So, the thing isn't totally wrong. Select residues 16-1368 of pBca9145-Bca1126.str. You've highlighted the entire phoA part. Search for it in the calls file. It shouldn't find its cognate. Why not? First of all, notice the size of the fragment--it's 1353bp long. There is no possible way you could find the whole thing in your read. The whole file is only 1140bp long. So, you can't possibly see the entire phoA Biobrick part in one sequencing read. All we can do here is evaluate whether the sequence we do see is consistent with the model file.
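The arithmetic here is worth spelling out, because off-by-one mistakes are easy to make with inclusive coordinates. A small Python sketch (the function names are made up, and the 600 usable bases per read is only a round number assumed for illustration):

```python
def span_length(first, last):
    """Length of a region given 1-indexed, inclusive coordinates."""
    return last - first + 1


def reads_needed(part_length, usable_per_read):
    """Fewest reads that could cover the part, ignoring any overlap."""
    return -(-part_length // usable_per_read)  # ceiling division


part = span_length(16, 1368)   # the phoA part in pBca9145-Bca1126.str
read_length = 1140             # total length of the calls file

print(part)                    # length of the part
print(part > read_length)      # True means one read cannot contain it
print(reads_needed(part, 600)) # rough lower bound, assuming 600 good bases per read
```

In practice you would want neighbouring reads to overlap, so the real number of sequencing reactions can be higher than this lower bound.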

Let's focus on just the N-terminus of phoA, then. Open up your trace file and let's see where the read starts to get messy. It looks pretty good until at least 650. We won't worry about any similarity between the files after 650, then. Close the trace file--we're done with it. Now find bases 400-650 in your sequence file, select them, copy them, and find that string within pBca9145-Bca1126.str. That should have worked. If it didn't, try it again. Now you have a region of brown-colored sequence selected. Copy it, search for it in the calls file, and assuming the sequence gets highlighted within the calls file, paste in the annotated version. You now should have a little patch of sequence in there that looks like this:
You've now marked the 3' end of the sequence you care about. Anything below that is garbage.
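If you'd like a second opinion on where the read and the model file stop agreeing, Python's standard `difflib` module can find the longest stretch the two sequences share exactly. This is a sketch using toy sequences, not part of the ApE workflow; the function name and the example strings are invented. `autojunk=False` turns off a heuristic in `SequenceMatcher` that otherwise ignores very common characters in long inputs, which would be a problem with a four-letter alphabet.

```python
from difflib import SequenceMatcher


def longest_shared_block(read, reference):
    """Return (read_start, reference_start, length) of the longest exact match.

    Start positions are 1-indexed so they line up with what ApE shows.
    """
    a, b = read.upper(), reference.upper()
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    m = matcher.find_longest_match(0, len(a), 0, len(b))
    return m.a + 1, m.b + 1, m.size


# Toy data: good sequence in the middle of the read, junk at both ends.
reference = "TTGACCGTAGCTAGGATCCAAGT"
read = "NNNN" + reference[3:18] + "ACNNTG"

read_start, ref_start, length = longest_shared_block(read, reference)
print(read_start, ref_start, length)
print(read[read_start - 1:read_start - 1 + length])  # the shared stretch
```

The end of the shared block is a reasonable first guess for where the trustworthy data stops, but always confirm against the trace before drawing conclusions.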

We know phoA starts at 16, and the trace gets sloppy around

If you have any comments or want to report a potential error in the tutorial, please email me (Chris Anderson) at
