Jmenzago Week 13

Purpose

The purpose of this assignment is to use various bioinformatics tools to analyze the spike glycoprotein of SARS-CoV-2 to better understand its structure-function relationship.

Combined Methods/Results

Converting the DNA sequence to a protein sequence

The sequence converted was that of surface glycoprotein [Severe acute respiratory syndrome coronavirus 2]
- QHD43416.1: spike protein (Severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1)
- DNA Sequence FASTA:

>spike protein (Severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1) DNA sequence
ATGTTTGTTTTTCTTGTTTTATTGCCACTAGTCTCTAGTCAGTGTGTTAATCTTACAACCAGAACTCA
ATTACCCCCTGCATACACTAATTCTTTCACACGTGGTGTTTATTACCCTGACAAAGTTTTCAGATCCTCA
GTTTTACATTCAACTCAGGACTTGTTCTTACCTTTCTTTTCCAATGTTACTTGGTTCCATGCTATACATG
TCTCTGGGACCAATGGTACTAAGAGGTTTGATAACCCTGTCCTACCATTTAATGATGGTGTTTATTTTGC
TTCCACTGAGAAGTCTAACATAATAAGAGGCTGGATTTTTGGTACTACTTTAGATTCGAAGACCCAGTCC
CTACTTATTGTTAATAACGCTACTAATGTTGTTATTAAAGTCTGTGAATTTCAATTTTGTAATGATCCAT
TTTTGGGTGTTTATTACCACAAAAACAACAAAAGTTGGATGGAAAGTGAGTTCAGAGTTTATTCTAGTGC
GAATAATTGCACTTTTGAATATGTCTCTCAGCCTTTTCTTATGGACCTTGAAGGAAAACAGGGTAATTTC
AAAAATCTTAGGGAATTTGTGTTTAAGAATATTGATGGTTATTTTAAAATATATTCTAAGCACACGCCTA
TTAATTTAGTGCGTGATCTCCCTCAGGGTTTTTCGGCTTTAGAACCATTGGTAGATTTGCCAATAGGTAT
TAACATCACTAGGTTTCAAACTTTACTTGCTTTACATAGAAGTTATTTGACTCCTGGTGATTCTTCTTCA
GGTTGGACAGCTGGTGCTGCAGCTTATTATGTGGGTTATCTTCAACCTAGGACTTTTCTATTAAAATATA
ATGAAAATGGAACCATTACAGATGCTGTAGACTGTGCACTTGACCCTCTCTCAGAAACAAAGTGTACGTT
GAAATCCTTCACTGTAGAAAAAGGAATCTATCAAACTTCTAACTTTAGAGTCCAACCAACAGAATCTATT
GTTAGATTTCCTAATATTACAAACTTGTGCCCTTTTGGTGAAGTTTTTAACGCCACCAGATTTGCATCTG
TTTATGCTTGGAACAGGAAGAGAATCAGCAACTGTGTTGCTGATTATTCTGTCCTATATAATTCCGCATC
ATTTTCCACTTTTAAGTGTTATGGAGTGTCTCCTACTAAATTAAATGATCTCTGCTTTACTAATGTCTAT
GCAGATTCATTTGTAATTAGAGGTGATGAAGTCAGACAAATCGCTCCAGGGCAAACTGGAAAGATTGCTG
ATTATAATTATAAATTACCAGATGATTTTACAGGCTGCGTTATAGCTTGGAATTCTAACAATCTTGATTC
TAAGGTTGGTGGTAATTATAATTACCTGTATAGATTGTTTAGGAAGTCTAATCTCAAACCTTTTGAGAGA
GATATTTCAACTGAAATCTATCAGGCCGGTAGCACACCTTGTAATGGTGTTGAAGGTTTTAATTGTTACT
TTCCTTTACAATCATATGGTTTCCAACCCACTAATGGTGTTGGTTACCAACCATACAGAGTAGTAGTACT
TTCTTTTGAACTTCTACATGCACCAGCAACTGTTTGTGGACCTAAAAAGTCTACTAATTTGGTTAAAAAC
AAATGTGTCAATTTCAACTTCAATGGTTTAACAGGCACAGGTGTTCTTACTGAGTCTAACAAAAAGTTTC
TGCCTTTCCAACAATTTGGCAGAGACATTGCTGACACTACTGATGCTGTCCGTGATCCACAGACACTTGA
GATTCTTGACATTACACCATGTTCTTTTGGTGGTGTCAGTGTTATAACACCAGGAACAAATACTTCTAAC
CAGGTTGCTGTTCTTTATCAGGATGTTAACTGCACAGAAGTCCCTGTTGCTATTCATGCAGATCAACTTA
CTCCTACTTGGCGTGTTTATTCTACAGGTTCTAATGTTTTTCAAACACGTGCAGGCTGTTTAATAGGGGC
TGAACATGTCAACAACTCATATGAGTGTGACATACCCATTGGTGCAGGTATATGCGCTAGTTATCAGACT
CAGACTAATTCTCCTCGGCGGGCACGTAGTGTAGCTAGTCAATCCATCATTGCCTACACTATGTCACTTG
GTGCAGAAAATTCAGTTGCTTACTCTAATAACTCTATTGCCATACCCACAAATTTTACTATTAGTGTTAC
CACAGAAATTCTACCAGTGTCTATGACCAAGACATCAGTAGATTGTACAATGTACATTTGTGGTGATTCA
ACTGAATGCAGCAATCTTTTGTTGCAATATGGCAGTTTTTGTACACAATTAAACCGTGCTTTAACTGGAA
TAGCTGTTGAACAAGACAAAAACACCCAAGAAGTTTTTGCACAAGTCAAACAAATTTACAAAACACCACC
AATTAAAGATTTTGGTGGTTTTAATTTTTCACAAATATTACCAGATCCATCAAAACCAAGCAAGAGGTCA
TTTATTGAAGATCTACTTTTCAACAAAGTGACACTTGCAGATGCTGGCTTCATCAAACAATATGGTGATT
GCCTTGGTGATATTGCTGCTAGAGACCTCATTTGTGCACAAAAGTTTAACGGCCTTACTGTTTTGCCACC
TTTGCTCACAGATGAAATGATTGCTCAATACACTTCTGCACTGTTAGCGGGTACAATCACTTCTGGTTGG
ACCTTTGGTGCAGGTGCTGCATTACAAATACCATTTGCTATGCAAATGGCTTATAGGTTTAATGGTATTG
GAGTTACACAGAATGTTCTCTATGAGAACCAAAAATTGATTGCCAACCAATTTAATAGTGCTATTGGCAA
AATTCAAGACTCACTTTCTTCCACAGCAAGTGCACTTGGAAAACTTCAAGATGTGGTCAACCAAAATGCA
CAAGCTTTAAACACGCTTGTTAAACAACTTAGCTCCAATTTTGGTGCAATTTCAAGTGTTTTAAATGATA
TCCTTTCACGTCTTGACAAAGTTGAGGCTGAAGTGCAAATTGATAGGTTGATCACAGGCAGACTTCAAAG
TTTGCAGACATATGTGACTCAACAATTAATTAGAGCTGCAGAAATCAGAGCTTCTGCTAATCTTGCTGCT
ACTAAAATGTCAGAGTGTGTACTTGGACAATCAAAAAGAGTTGATTTTTGTGGAAAGGGCTATCATCTTA
TGTCCTTCCCTCAGTCAGCACCTCATGGTGTAGTCTTCTTGCATGTGACTTATGTCCCTGCACAAGAAAA
GAACTTCACAACTGCTCCTGCCATTTGTCATGATGGAAAAGCACACTTTCCTCGTGAAGGTGTCTTTGTT
TCAAATGGCACACACTGGTTTGTAACACAAAGGAATTTTTATGAACCACAAATCATTACTACAGACAACA
CATTTGTGTCTGGTAACTGTGATGTTGTAATAGGAATTGTCAACAACACAGTTTATGATCCTTTGCAACC
TGAATTAGACTCATTCAAGGAGGAGTTAGATAAATATTTTAAGAATCATACATCACCAGATGTTGATTTA
GGTGACATCTCTGGCATTAATGCTTCAGTTGTAAACATTCAAAAAGAAATTGACCGCCTCAATGAGGTTG
CCAAGAATTTAAATGAATCTCTCATCGATCTCCAAGAACTTGGAAAGTATGAGCAGTATATAAAATGGCC
ATGGTACATTTGGCTAGGTTTTATAGCTGGCTTGATTGCCATAGTAATGGTGACAATTATGCTTTGCTGT
ATGACCAGTTGCTGTAGTTGTCTCAAGGGCTGTTGTTCTTGTGGATCCTGCTGCAAATTTGATGAAGACG
ACTCTGAGCCAGTGCTCAAAGGAGTCAAATTACATTACACATAA

Insert the DNA FASTA sequence of the spike gene (above) into the text box on the NCBI Open Reading Frame Finder (ORFfinder), then click "Submit"
- Translated protein sequence:

>lcl|ORF1
MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSSVLHS
TQDLFLPFFSNVTWFHAIHVSGTNGTKRFDNPVLPFNDGVYFASTEKSNI
IRGWIFGTTLDSKTQSLLIVNNATNVVIKVCEFQFCNDPFLGVYYHKNNK
SWMESEFRVYSSANNCTFEYVSQPFLMDLEGKQGNFKNLREFVFKNIDGY
FKIYSKHTPINLVRDLPQGFSALEPLVDLPIGINITRFQTLLALHRSYLT
PGDSSSGWTAGAAAYYVGYLQPRTFLLKYNENGTITDAVDCALDPLSETK
CTLKSFTVEKGIYQTSNFRVQPTESIVRFPNITNLCPFGEVFNATRFASV
YAWNRKRISNCVADYSVLYNSASFSTFKCYGVSPTKLNDLCFTNVYADSF
VIRGDEVRQIAPGQTGKIADYNYKLPDDFTGCVIAWNSNNLDSKVGGNYN
YLYRLFRKSNLKPFERDISTEIYQAGSTPCNGVEGFNCYFPLQSYGFQPT
NGVGYQPYRVVVLSFELLHAPATVCGPKKSTNLVKNKCVNFNFNGLTGTG
VLTESNKKFLPFQQFGRDIADTTDAVRDPQTLEILDITPCSFGGVSVITP
GTNTSNQVAVLYQDVNCTEVPVAIHADQLTPTWRVYSTGSNVFQTRAGCL
IGAEHVNNSYECDIPIGAGICASYQTQTNSPRRARSVASQSIIAYTMSLG
AENSVAYSNNSIAIPTNFTISVTTEILPVSMTKTSVDCTMYICGDSTECS
NLLLQYGSFCTQLNRALTGIAVEQDKNTQEVFAQVKQIYKTPPIKDFGGF
NFSQILPDPSKPSKRSFIEDLLFNKVTLADAGFIKQYGDCLGDIAARDLI
CAQKFNGLTVLPPLLTDEMIAQYTSALLAGTITSGWTFGAGAALQIPFAM
QMAYRFNGIGVTQNVLYENQKLIANQFNSAIGKIQDSLSSTASALGKLQD
VVNQNAQALNTLVKQLSSNFGAISSVLNDILSRLDKVEAEVQIDRLITGR
LQSLQTYVTQQLIRAAEIRASANLAATKMSECVLGQSKRVDFCGKGYHLM
SFPQSAPHGVVFLHVTYVPAQEKNFTTAPAICHDGKAHFPREGVFVSNGT
HWFVTQRNFYEPQIITTDNTFVSGNCDVVIGIVNNTVYDPLQPELDSFKE
ELDKYFKNHTSPDVDLGDISGINASVVNIQKEIDRLNEVAKNLNESLIDL
QELGKYEQYIKWPWYIWLGFIAGLIAIVMVTIMLCCMTSCCSCLKGCCSC
GSCCKFDEDDSEPVLKGVKLHYT

Results on page:

The correct reading frame for these results is ORF1
- The correct reading frame is the one that begins with the start codon (ATG in the DNA sequence, AUG in the mRNA), covers the sequence for the entire protein, and runs from 5' to 3'
  - The start codon is translated as methionine, shown as "M" in the ORFfinder output
- Correct reading frame can be checked on NCBI protein record
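
The frame-selection reasoning above can be sketched in Python. This is a minimal illustration under stated assumptions, not how ORFfinder is implemented: it uses the standard genetic code, scans only the three forward frames (ORFfinder also reports reverse-strand ORFs), and the function names `translate` and `longest_forward_orf` are hypothetical helpers introduced here.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate DNA codon by codon, stopping at the first stop codon."""
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE.get(dna[i:i + 3], "X")  # "X" marks an unknown codon
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


def longest_forward_orf(dna):
    """Return the longest ATG-initiated translation in the three forward frames."""
    dna = dna.upper()
    best = ""
    for frame in range(3):
        i = frame
        while i + 3 <= len(dna):
            if dna[i:i + 3] == "ATG":
                candidate = translate(dna[i:])
                if len(candidate) > len(best):
                    best = candidate
                # Skip past this ORF and its stop codon before scanning on.
                i += 3 * (len(candidate) + 1)
            else:
                i += 3
    return best


# First 48 nucleotides of the spike DNA sequence listed above.
spike_start = "ATGTTTGTTTTTCTTGTTTTATTGCCACTAGTCTCTAGTCAGTGTGTT"
print(translate(spike_start))
```

Run on the full spike DNA sequence, `longest_forward_orf` would be expected to return the frame that starts at the first ATG, which is the same criterion used to pick ORF1 here.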

Determining what is already known about the S protein

Information about the S protein was found using UniProt Knowledgebase (UniProt KB)
Run a search with "SARS-CoV" as the only keyword
- Produced 70 reviewed results and 748 unreviewed results
Run a search using the accession number "P59594"
- Produced one reviewed result
- Information provided in the database entry
  - Protein name, gene, and organism it is from
  - Function of all its subunits (S1, S2, S2')
  - Different names and taxonomy of the protein
  - Subcellular location of the protein
  - Pathology and biotech
  - Post-translational modification and processing
  - Protein interaction
  - Structure
  - Family and domains
  - Amino acid and genetic sequences
  - Similar proteins
  - Cross-references

Analyzing the S protein sequence

Protein sequence analyzed was taken from Yan et al. (2020)
- 6M17
  - Links to the 2019-nCoV RBD/ACE2-B0AT1 complex discussed in the paper
    - The FASTA file has three sections, one for each part of the complex
    - The FASTA sequence for the S protein is the 3rd one on the file
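
Picking the third record out of a multi-record FASTA file can also be done programmatically. The sketch below is a minimal parser written for illustration; the record headers and short sequences in `example` are made up and are not the real 6M17 entries, and `parse_fasta` is a hypothetical helper name.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples, in file order."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Toy three-record file standing in for the downloaded FASTA (placeholder data).
example = """>chain_1 hypothetical first entry
MKT
LLV
>chain_2 hypothetical second entry
GSA
>chain_3 hypothetical third entry
RVQ
PTE
"""

records = parse_fasta(example)
third_header, third_seq = records[2]  # third record, as described above
print(third_header, len(third_seq))
```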
Insert the FASTA sequence for the S protein into the text box on the PredictProtein server and click "PredictProtein"
- Results:
  - The sequence was 223 amino acids long
    - There were 90 aligned proteins
    - There were 31 matched PDB structures
  - PredictProtein also provides structure and function annotations
    - Structure annotations available:
      - Secondary structure and solvent accessibility
      - Transmembrane helices
      - Protein disorder and flexibility
      - Disulphide bridges
    - Function annotations available:
      - Effect of point mutations
      - Gene ontology terms
      - Subcellular localization
      - Binding sites
Whereas UniProt provides a collection of known information about a protein, PredictProtein offers a variety of predictions about a protein's structure and function, including how they would change if part of the sequence is altered.

Image of resulting predicted features for S protein:

Analyzing a 3D model of the S protein

Protein structure from Yan et al. (2020) 6M17 used for this task
- Structure is the 2019-nCoV RBD/ACE2-B0AT1 complex, not just the S protein
Model analyzed using NCBI Structure viewer iCn3D
- Search for protein on NCBI using PDB ID (6M17 for Yan et al.)
- Click on "full-feature 3D viewer" to interact with protein
Rotate the 3D model so that it is oriented in the same way as Figure 4A from Yan et al. (2020)
- There are two RBD-PD complexes in the entire structure; it does not matter which one is used to match Figure 4A
- To hide other tertiary structures:
  - Select the structure by clicking on it
  - On the tabs at the top, go to View->Hide Selection
- To change the color of a structure:
  - Select a structure
  - On the tabs at the top, go to Color->Unicolor->select desired color
- To show N and C termini:
  - Select a structure
  - On the tabs at the top, go to View->Label->N- & C- Termini
- To highlight secondary structures in a complex:
  - Select a structure
  - On the tabs at the top, go to Color->Secondary->select desired color

Figure 4A from Yan et al. (2020)

Top left - Figure 4A from Yan et al. (2020)
Top right - Replicate of whole RBD-PD complex in Figure 4A
- Alpha helices and beta sheets not labeled like in Figure 4A
  - It is difficult to see beta3 and beta4 unless the secondary structures are different colors
  - For labeled secondary structures, see image with colored secondary structures below
Bottom left - Replicate of RBD-PD bridge from Figure 4A
- Unable to show polar interactions
Bottom right - Replicate of RBD-PD bridge rotated 180 degrees from Figure 4A
- Unable to show polar interactions

Image shows the secondary structures and N and C termini in RBD-PD complex from Figure 4A
- SARS-CoV-2 RBD
  - No alpha helices
  - Beta sheets yellow
  - N and C termini labeled in yellow
- ACE2-PD
  - Alpha helices in red
  - Beta sheets in green
  - N terminus labeled in cyan
  - C terminus does not appear in figure from Yan et al. (2020), so it is not labeled in the replicate
The image is similar to the results generated by PredictProtein
- Most of the S protein was predicted to consist of loops (grey), and about 25% of it would be made of beta strands (yellow), which is reflected in the image above
  - About 5% was predicted to be alpha helices, which was incorrect as there are no helices in the image above
- The ACE2-PD was predicted to be about 50% loops and 50% helices, with about 5% of it made up of strands
  - This seems correct based on the image above, as it looks to be made of roughly equal amounts of loops (green) and helices (red), with a small amount of strands (green)
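
The percentages in this comparison were estimated by eye. As a rough sketch of how they could be computed instead, the Python below counts the fraction of each state in a per-residue secondary-structure string. It assumes a three-state encoding (H = helix, E = strand, L = loop); the `toy_states` string is invented to mirror the approximate S protein proportions quoted above and is not PredictProtein output.

```python
from collections import Counter


def composition(states):
    """Return the percentage of helix (H), strand (E), and loop (L) residues."""
    if not states:
        raise ValueError("empty secondary-structure string")
    counts = Counter(states)
    total = len(states)
    return {state: 100.0 * counts.get(state, 0) / total for state in "HEL"}


# Made-up 20-residue example: 14 loop, 5 strand, 1 helix.
toy_states = "L" * 14 + "E" * 5 + "H" * 1
percentages = composition(toy_states)
print(percentages)
```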
All amino acids discussed in the paper came from any of the four labeled secondary structures (marked in blue)
- Primarily alpha1

Beginning your research project

When comparing the RBD-PD complex of SARS-CoV and SARS-CoV-2, Yan et al. (2020) found that differences in residues lead to weaker interactions. Walls et al. (2020) discuss that when the SARS outbreak of 2002 reemerged in 2003-04, the virus had a weaker interaction with ACE2 and patients showed milder symptoms. This research project will further explore the role that certain amino acids play in the structure-function relationship of SARS-CoV-2 and ACE2.
ACE2 sequences from humans, mice, and bats will be analyzed
- Humans because our species is currently experiencing a pandemic caused by this virus
- Mice because they have been unaffected by the virus, and it could be because of the interaction between ACE2 and the S protein
- Bats because it is theorized that the virus jumped from them to humans
Sequences will be taken from UniProt and compared through a multiple sequence alignment. Their 3D structures will also be compared using iCn3D or a similar program to see whether any apparent structural differences would weaken or inhibit binding to the S protein.
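
Once the sequences are aligned, the columns where the species differ can be listed with a short script. The sketch below is only an illustration of that comparison step: the three aligned strings are invented placeholders, not real ACE2 sequences, and `differing_columns` is a hypothetical helper name. It assumes the alignment has already been produced by an external tool, with "-" marking gaps.

```python
def differing_columns(aligned):
    """List (1-based position, {name: residue}) for columns that are not identical."""
    names = list(aligned)
    length = len(aligned[names[0]])
    if any(len(seq) != length for seq in aligned.values()):
        raise ValueError("aligned sequences must all have the same length")
    diffs = []
    for pos in range(length):
        column = {name: aligned[name][pos] for name in names}
        if len(set(column.values())) > 1:
            diffs.append((pos + 1, column))
    return diffs


# Placeholder alignment; real input would come from the multiple sequence alignment.
toy_alignment = {
    "human": "MSSQ-W",
    "mouse": "MSSK-W",
    "bat": "MSGQAW",
}

diffs = differing_columns(toy_alignment)
for position, column in diffs:
    print(position, column)
```

Positions flagged this way could then be checked against the residues Yan et al. (2020) describe at the RBD-PD interface.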

Scientific Conclusion

The purpose of this assignment was to use various bioinformatical tools to analyze the spike glycoprotein of SARS-CoV-2. Working with these tools to visualize and predict aspects of the protein from its sequence built familiarity with the software and how they can be used in conjunction to answer potential research questions. Using a few of the tools from this assignment to take a closer look at the role that polar residues play in the strength of the interaction between ACE2 and SARS-CoV-2 can provide insight into the structure-function relationship of the two.

Acknowledgements

My homework partners for the week were Drew Cartmel and Nicholas Yeo
- We communicated multiple times through Zoom or texts to brainstorm about our research project
I followed the instructions on BIOL368/S20:Week 13 to complete this assignment
- Syntax for links to any software or protein structures used for this assignment was copied from this page
Citation for Walls et al. (2020) copied from this page
- DNA sequence for spike protein (Severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1) used in task 1 copied from this page
I used MediaWiki Help:Images to learn how to resize the images on this page.
Except for what is noted above, this individual journal entry was completed by me and not copied from another source.

Jmenzago (talk) 23:01, 22 April 2020 (PDT)

References

MediaWiki. (2020). Help:Images. Retrieved April 22, 2020, from https://www.mediawiki.org/wiki/Help:Images
OpenWetWare. (2020). BIOL368/S20:Week 13. Retrieved April 20, 2020, from https://openwetware.org/wiki/BIOL368/S20:Week_13
NCBI. (2020). 6M17: The 2019-nCoV RBD/ACE2-B0AT1 complex. Retrieved April 22, 2020, from https://www.ncbi.nlm.nih.gov/Structure/pdb/6M17
NCBI. (n.d.). Home - ORFfinder - NCBI. Retrieved April 22, 2020, from https://www.ncbi.nlm.nih.gov/orffinder/
NCBI. (n.d.). Surface glycoprotein [Severe acute respiratory syndrome coronavirus 2]. Retrieved April 22, 2020, from https://www.ncbi.nlm.nih.gov/protein/1791269090
PredictProtein. (2020). RostLab. Retrieved April 22, 2020, from https://open.predictprotein.org/
RCSB PDB. (2020). 6M17: The 2019-nCoV RBD/ACE2-B0AT1 complex. Retrieved April 22, 2020, from https://www.rcsb.org/structure/6M17
Walls, A. C., Park, Y. J., Tortorici, M. A., Wall, A., McGuire, A. T., & Veesler, D. (2020). Structure, function, and antigenicity of the SARS-CoV-2 spike glycoprotein. Cell. doi: 10.1016/j.cell.2020.02.058
Yan, R., Zhang, Y., Li, Y., Xia, L., Guo, Y., & Zhou, Q. (2020). Structural basis for the recognition of SARS-CoV-2 by full-length human ACE2. Science, 367(6485), 1444-1448. doi: 10.1126/science.abb2762

Assignments

Individual Journal Entries

Class Journal Entries
