Exploring HIV Evolution: An Opportunity for Research

From OpenWetWare

Authors: Sam Donovan and Anton E. Weisstein

Human Immunodeficiency Virus (HIV), like other retroviruses, has a much higher mutation rate than is typically found in organisms that do not go through reverse transcription (the copying of RNA into DNA). Mansky and Temin (1995) estimated the rate of point mutations in HIV to be 3 × 10^-5 errors per base per replication cycle. The impact of HIV as a pathogen is due in large part to this high mutation rate, which, among other things, causes the surface proteins to change and evade normal immune detection and suppression. A great deal has been learned about the evolution of HIV because it is relatively easy to sample the rapidly changing population of viruses within an infected individual and look at the patterns of molecular change over time.

In this activity you will study aspects of sequence evolution by working with a set of HIV sequence data from 15 different subjects (Markham, et al., 1998). You will first learn about the dataset, then study the possible sources of HIV for these subjects, and then design and pursue your own research project.

Basic HIV biology, such as the life history characteristics of the virus and its interactions with the human immune system, is not discussed here but is important background for placing these exercises into a broader biological and clinical context. A few suggested resources for reviewing basic HIV biology are provided in the reference section.

An orientation to the HIV sequence data

The HIV genome is very small and relatively simple. It is made up of nine genes and about 9,500 nucleotides. In this lab, you will use sequences from the envelope gene, env (see Figure 1). The envelope gene codes for two membrane proteins (gp41 and gp120) that extend from the viral envelope and are involved in identifying target cells for HIV to infect (see Figures 2 and 3). The HIV surface proteins are also sites that the immune system can sometimes detect, making it possible to destroy that virus particle (see Figure 4). For the dataset you will be working with, the researchers identified different forms, or clones, of HIV based on differences in the nucleotide sequence of a short (285 base pair) region of gp120 called V3 (see Figure 1). The V3 region is known to be highly variable and is involved in both host cell and antibody recognition. Characterizing the population of HIV in a subject based on the different versions of the V3 region sequence gave researchers a measure of HIV evolution that has potential clinical significance.
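To get a feel for what these numbers imply, the short Python sketch below multiplies the point mutation rate quoted in the introduction by the two sequence lengths mentioned here. It assumes, purely for illustration, that the rate is uniform across the genome, which is a simplification and not a claim made by the original study.

```python
# Values quoted in the text; the uniform-rate assumption is ours.
MUTATION_RATE = 3e-5   # errors per base per replication cycle (Mansky and Temin, 1995)
GENOME_LENGTH = 9_500  # approximate HIV genome length in nucleotides
V3_LENGTH = 285        # length of the sequenced V3 region in base pairs


def expected_mutations(length, rate=MUTATION_RATE):
    """Expected number of point mutations in `length` bases per replication cycle."""
    return length * rate


genome_expected = expected_mutations(GENOME_LENGTH)
v3_expected = expected_mutations(V3_LENGTH)

print(f"Whole genome: {genome_expected:.3f} expected mutations per cycle")
print(f"V3 region:    {v3_expected:.5f} expected mutations per cycle")
```

Under this assumption, roughly one in every three to four newly copied genomes would carry a new point mutation, while a change within the 285 base pair V3 region in any single replication cycle would be much rarer.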


Activity 1: Looking at the NCBI Resources and HIV sequence data

As previously mentioned, these data were originally published as part of a research study looking at HIV evolution in different subjects. To learn a little more about the study and the data itself, this first activity involves searching the National Center for Biotechnology Information (NCBI) databases for additional information related to this research.

Part 1: PubMed

The NCBI provides access to a variety of databases including PubMed (a literature database) and GenBank (a nucleic acid sequence database). This federally funded research resource is very useful in part because the databases are linked together. This means that finding information in one area will allow you to look up related information in other areas. You should find the PubMed record for the Markham et al. article and then follow the links option for that record to find all the nucleotide sequences associated with that article.

NCBI website URL: http://www.ncbi.nlm.nih.gov

Original paper: Markham, R. B., W. C. Wang, A. E. Weisstein, Z. Wang, A. Munoz, A. Templeton, J. Margolick, D. Vlahov, T. Quinn, H. Farzadegan, X. F. Yu (1998). Patterns of HIV-1 evolution in individuals with differing rates of CD4 T cell decline. Proceedings of the National Academy of Sciences 95(21):12568-73. PubMed ID: 98445411

* How did you search for the PubMed entry?
* What other ways might you have searched?
* What other types of related information are available?
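One other way to search is programmatically. NCBI offers a web service called E-utilities, and the sketch below only builds a search URL for its `esearch` endpoint; it does not send the request. The search term and the helper name `build_pubmed_search_url` are illustrative assumptions, not part of the original activity, and other terms would work equally well.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities search endpoint.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_pubmed_search_url(term, max_results=20):
    """Build (but do not send) a PubMed search URL for the given query term."""
    params = {"db": "pubmed", "term": term, "retmax": max_results}
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"


# Hypothetical query: first author plus a few words from the title.
search_url = build_pubmed_search_url(
    "Markham RB[Author] AND HIV-1 evolution CD4 T cell decline"
)
print(search_url)
```

Pasting the printed URL into a browser should return a list of matching record IDs, which is the same information the PubMed search page shows in a friendlier form.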

Part 2: GenBank

In this section you will take a closer look at a GenBank record and the type of data that is stored there. Once you reach the nucleic acid data associated with the Markham et al. paper you will see that there are a variety of different ways to view the data.

The data you will be working with are coded to help you recognize their source. While all of the data are HIV sequences, each sequence is identified based on the subject it was taken from, the visit during which it was collected, and its clone number. Thus, each sequence has a code like S4V2-4 that can be read as subject 4, visit 2, 4th clone. Each clone is a unique sequence collected during a particular visit. Over 600 different HIV sequences were identified in these 15 subjects and published electronically in GenBank. Choose one of the GenBank records and view both the full record and the FASTA formatted sequence.
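If you later want to sort or group sequences by subject or visit, the coding scheme can be decoded mechanically. The following Python sketch assumes every code follows exactly the S<subject>V<visit>-<clone> pattern of the example above; the names `CloneLabel` and `parse_clone_label` are made up for this illustration.

```python
import re
from typing import NamedTuple


class CloneLabel(NamedTuple):
    subject: int
    visit: int
    clone: int


# Assumed pattern: "S", subject number, "V", visit number, "-", clone number.
LABEL_PATTERN = re.compile(r"^S(\d+)V(\d+)-(\d+)$")


def parse_clone_label(label):
    """Split a code such as 'S4V2-4' into subject, visit, and clone numbers."""
    match = LABEL_PATTERN.match(label.strip())
    if match is None:
        raise ValueError(f"label does not match the expected pattern: {label!r}")
    subject, visit, clone = (int(group) for group in match.groups())
    return CloneLabel(subject, visit, clone)


example = parse_clone_label("S4V2-4")
print(example.subject, example.visit, example.clone)
```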

* What was the accession number of the sequence you chose?
* Which subject of the study was that HIV sequence from? Which section of the record contains information about who the HIV was collected from?
* Download several (4 to 6) sequences in FASTA format to your local hard drive by selecting several at the same time in the summary view so they are saved into a single text file. Be careful to remember where you put the file and what you name it so that you can find it later.
* Open the file that you saved with a word processor to confirm that you have the sequences and that they are in the FASTA format. In the FASTA format each sequence is preceded by a label that begins with the greater-than sign (>).
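The FASTA check in the last step can also be done with a few lines of code. This is a minimal Python sketch of a FASTA reader, not a full-featured parser; the example records are invented, so the labels and sequences in your own file will look different.

```python
def parse_fasta(text):
    """Return a list of (label, sequence) pairs from FASTA-formatted text."""
    records = []
    label = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new label closes the previous record, if there was one.
            if label is not None:
                records.append((label, "".join(chunks)))
            label = line[1:].strip()
            chunks = []
        elif label is None:
            raise ValueError("sequence data found before the first '>' label")
        else:
            chunks.append(line)
    if label is not None:
        records.append((label, "".join(chunks)))
    return records


# Made-up example records; real labels and sequences will differ.
example_text = """>example_record_1
ACGTACGT
ACGT
>example_record_2
TTGACA
"""

records = parse_fasta(example_text)
for label, sequence in records:
    print(label, len(sequence))
```

To check your own download, read the saved file into a string (for example with `open(path).read()`, where `path` is wherever you saved it) and pass that string to `parse_fasta`. The number of records returned should match the number of sequences you selected.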

Part 3: Introduction to the Biology Workbench

In order to analyze sequence data we will use the Biology Workbench, an on-line suite of bioinformatics tools. This section contains a brief introduction to the Biology Workbench that is intended to get you up and running quickly. There is more information about how the Biology Workbench is organized and how to use various tools in the “Orientation to the Biology Workbench” supplement on the Microbes Count! CD.

* Log in to the Biology Workbench.
* If you do not already have an account, you will need to set one up by following the Set up an account link.
* Once you have logged in, scroll down until you see the 5 buttons that take you to the different tool sets. This exercise uses nucleic sequence data, so follow the appropriate link.
* You should now see a scrolling list of tools for working with nucleic sequence data. Select Add new sequences and press the Run button. The next window allows you to enter sequences in a variety of ways. You can type them in directly, paste them in from another file, or upload a text file containing sequence data.
* Choose the Browse button to select the file you saved earlier from NCBI. Once you have selected the file use the Upload button to open it and read the data into this page.
* Once the labels and sequences appear in the data fields, choose Save to import that data into your Biology Workbench session.
* Each sequence should now appear as a data line below the list of analysis tools. Select one sequence and use the command View the Sequence to confirm that the sequence was successfully imported.

Now that you have been introduced to the Biology Workbench interface and procedures for uploading data files, you are ready to use the same procedures to upload a real dataset for analysis in the next activity.

* Look in the list of nucleic acid tools and find the ClustalW tool. Highlight the tool and then select Help to see more information about what this tool does.
* Select all of your sequences using the appropriate command and run a multiple sequence alignment using ClustalW.
* Look over the output and see if you can relate the differences in the sequences to the topology of the unrooted tree diagram and the pairwise similarity scores.
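To make the idea of a pairwise similarity score concrete, the Python sketch below computes a simple percent identity between two already-aligned sequences, skipping any column that contains a gap. This is an illustrative measure only; the scores reported by ClustalW may be calculated differently, and the two short sequences are invented.

```python
def percent_identity(seq_a, seq_b, gap="-"):
    """Percent of ungapped alignment columns where the two sequences match."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == gap or b == gap:
            continue  # ignore columns where either sequence has a gap
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * matches / compared


# Invented aligned sequences: one gap column and one mismatch.
aligned_1 = "ACGT-ACGTA"
aligned_2 = "ACGTTACCTA"
score = percent_identity(aligned_1, aligned_2)
print(f"{score:.1f}% identical over ungapped columns")
```

Here 8 of the 9 ungapped columns match. Sequences that join close together on the tree should generally show higher pairwise scores than sequences separated by long branches.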

Note that the sequence labels on your tree will be the accession numbers for those sequence records. With just the accession numbers to describe the sequences, it may be difficult to think about the biological basis for the comparisons you just performed. Given what you have learned in Activity 1 you should be able to go back to GenBank and find additional information about the sequences you selected. We will use the ClustalW tool extensively in Activity 2. Look over this practice tree and think about how you will interpret your experimental trees to draw biological inferences from them.

* Go to the Session Tools and create a new session to store the sequences you will be working with in Activity 2.