Open writing projects/Python all a scientist needs


Revision as of 14:48, 13 February 2008

This is a paper/presentation for PyCon 2008 that I am writing on OWW. The paper is about how I used Python, together with its libraries and extensions, as a complete scientific programming package for a recent comparative genomics study. I am experimenting with writing the paper on OWW and eventually submitting it to the arXiv. If anyone is interested in the topic, or the process, please email me through OWW.

Abstract

Any cutting-edge scientific research project requires a myriad of computational tools for data generation, analysis and visualization. Python is a flexible and extensible scientific programming platform that offered the perfect solution in our recent comparative genomics investigation (http://arxiv.org/abs/0708.2038). In this talk, I discuss the challenges of this project and how the combined power of Biopython (http://biopython.org), SWIG (http://www.swig.org) and Matplotlib (http://matplotlib.sourceforge.net) met them. I finish by discussing how Python promotes good scientific practice and why its use should be encouraged within the scientific community.

The Scientist's Dilemma

  • A typical research project requires a variety of computational tasks
    • Data generation
    • Data analysis
    • Data visualization
  • The most common solution is to use separate tools for each task
    • Data generation in C
    • Data analysis in proprietary software
    • Data visualization in a separate graphing package
  • This is an inadequate solution
    • These tools can't be pipelined easily
      • Many manual steps have to be repeated if something changes
    • Data provenance is poor at best
      • Not sure if an error is due to a program or human error
      • Can only repeat analysis by following written steps in a lab notebook
      • Steps are easily forgotten and hard to pass on
  • Python overcomes these weaknesses

Comparative Genomics Case Study

  • Brief Project Description - Compare DNA sequences of viruses
    • Download and parse the genome files of many viruses
    • Store the genome in a project-specific genome class
    • Draw random genomes to compare to the 'real' genome
    • Visualize the genomic data in a 'genome landscape' plot
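
One way to picture the "draw random genomes" step is the sketch below. It assumes the simplest possible null model, shuffling the bases of the real genome so that base composition is preserved; the outline does not say which randomization the study used, so the function name and the model are illustrative only.

```python
import random


def draw_random_genome(sequence, rng=None):
    """Return a shuffled copy of `sequence` with the same base composition.

    Assumption: a plain shuffle stands in for whatever randomization the
    study actually used; the outline does not specify it.
    """
    rng = rng or random.Random()
    bases = list(sequence)
    rng.shuffle(bases)  # in-place shuffle of the copied list
    return "".join(bases)


if __name__ == "__main__":
    real = "ATGGCGTACGCTTAA"
    # A seeded generator makes the draw repeatable, which helps provenance.
    print(draw_random_genome(real, random.Random(42)))
```

Passing a seeded `random.Random` instance keeps every draw reproducible, so a comparison against the real genome can be re-run exactly.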

Biopython

  • Overview of Biopython
    • A suite of bioinformatics tools for tasks such as parsing bio-database files, computing alignments between biological sequences, and interacting with bio web services
  • Use of Biopython in this project
    • Parsing GenBank files from the National Center for Biotechnology Information
    • example code
  • Benefits of using Biopython
    • parsing code can be wrapped in custom classes that make sense for the particular project
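
As a minimal sketch of the parsing step, the code below wraps Biopython's `SeqIO.read` in a small project-specific class. The `Genome` class, its fields, and `load_genome` are hypothetical names chosen for illustration, not the code used in the study, and the sketch is written for a current Python rather than the 2008 toolchain.

```python
from dataclasses import dataclass


@dataclass
class Genome:
    """Hypothetical project-specific container for one viral genome."""

    accession: str
    description: str
    sequence: str

    def gc_fraction(self) -> float:
        """Fraction of bases that are G or C (0.0 for an empty sequence)."""
        if not self.sequence:
            return 0.0
        seq = self.sequence.upper()
        return (seq.count("G") + seq.count("C")) / len(seq)


def load_genome(path: str) -> Genome:
    """Parse a single-record GenBank file with Biopython into a Genome."""
    # Imported here so the Genome class stays usable without Biopython.
    from Bio import SeqIO

    record = SeqIO.read(path, "genbank")  # expects exactly one record
    return Genome(record.id, record.description, str(record.seq))
```

Keeping the Biopython call inside one loader means the rest of the project only ever sees `Genome` objects, which is the sense in which parsing code can be wrapped in classes that fit the project.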

Matplotlib

  • Overview of Matplotlib
    • Matlab-like graphical environment
  • Use of Matplotlib in this project
    • generating genome landscapes
    • example code
  • Benefits of using Matplotlib
    • graphics code resides alongside data generation code
    • quick troubleshooting
    • complicated plots can easily be regenerated by tweaking the code
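
A genome landscape plot might be produced along the lines of the sketch below. It assumes a landscape is the running sum of per-position values after subtracting their mean; the outline does not define the term, so that definition, the function names, and the axis labels are illustrative assumptions.

```python
from itertools import accumulate


def landscape(values):
    """Cumulative sum of deviations from the mean of `values`."""
    values = list(values)
    if not values:
        return []
    mean = sum(values) / len(values)
    return list(accumulate(v - mean for v in values))


def plot_landscape(values, path):
    """Save a line plot of the landscape to `path` using Matplotlib."""
    # Imported here so landscape() can be used without Matplotlib installed.
    import matplotlib

    matplotlib.use("Agg")  # non-interactive backend, no display required
    import matplotlib.pyplot as plt

    heights = landscape(values)
    fig, ax = plt.subplots()
    ax.plot(range(1, len(heights) + 1), heights)
    ax.set_xlabel("Position along genome")
    ax.set_ylabel("Cumulative deviation from mean")
    fig.savefig(path)
    plt.close(fig)
```

Because the numbers and the figure come from the same script, changing the input data or a plot setting and re-running it regenerates the figure with no manual steps.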

SWIG

  • Overview of SWIG
    • allows you to speed up selected parts of an application by writing them in another language (C, C++)
  • Use of SWIG in this project
    • speeding up the random genome drawing routine
    • example code
  • Benefits of using SWIG
    • get all the benefits of Python, with compiled speed for the critical parts
    • sped-up parts are used in exactly the same context, so no glue code is needed
    • can leverage, from within Python, the experience in other languages that scientists typically have
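
From the Python side, calling a SWIG-wrapped routine could look like the sketch below. The module name `_fast_genome_draw` and the function `shuffle_sequence` are hypothetical; a real SWIG build would also need a C source file and an interface file, which are not shown. The pure-Python fallback shares the same signature, so the calling code is identical either way.

```python
import random

try:
    # Hypothetical SWIG-generated extension; the name is illustrative only.
    from _fast_genome_draw import shuffle_sequence
except ImportError:

    def shuffle_sequence(sequence, seed):
        """Pure-Python fallback with the same call signature."""
        bases = list(sequence)
        random.Random(seed).shuffle(bases)
        return "".join(bases)


def draw_many(sequence, n, seed=0):
    """Draw `n` shuffled genomes; callers never see which backend ran."""
    return [shuffle_sequence(sequence, seed + i) for i in range(n)]


if __name__ == "__main__":
    for genome in draw_many("ATGGCGTACGCTTAA", 3):
        print(genome)
```

The caller only ever uses `draw_many`, so swapping the slow routine for a compiled one requires no changes elsewhere in the analysis.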

Conclusions

  • Practical Conclusions
    • community modules are useful for a variety of scientific tasks
    • Python can easily be used by more scientists
  • Bigger picture conclusions for good scientific practice
    • code readability and package structure promote good scientific practice
    • Python and its modules provide a consistent framework that promotes data provenance
    • can plug into other community tools and practices to help science, e.g. unit testing


References/Resources