Open writing projects/Python all a scientist needs

From OpenWetWare
Revision as of 15:40, 13 February 2008 by Julius B. Lucks (talk | contribs) (→‎BioPython: notes)

This is a paper/presentation for PyCon 2008 that I am writing on OWW. The paper is about how I used Python and its libraries and extensions as a complete scientific programming package for a recent comparative genomics study. I am experimenting with writing the paper on OWW and eventually submitting it to the arXiv. If anyone is interested in the topic, or the process, please email me through OWW.

Abstract

Any cutting-edge scientific research project requires a myriad of computational tools for data generation, analysis and visualization. Python is a flexible and extensible scientific programming platform that offered the perfect solution in our recent comparative genomics investigation (http://arxiv.org/abs/0708.2038). In this talk, I discuss the challenges of this project, and how the combined power of Biopython (http://biopython.org), SWIG (http://www.swig.org) and Matplotlib (http://matplotlib.sourceforge.net) was a perfect solution. I finish by discussing how Python promotes good scientific practice, and how its use should be encouraged within the scientific community.

The Scientist's Dilemma

A typical scientific research project requires a variety of computational tasks to be performed. At the very heart of every investigation is the generation of data to test hypotheses. An experimental physicist might build instruments to collect light scattering data; a crystallographer will collect X-ray diffraction data; a biologist might collect fluorescence intensity data for a variety of reporter genes, or DNA sequence data; and a computational researcher might write programs to collect simulation data. All of these types of data collection require computer programming, to control instruments (or simulation code) and to collect and manage data in an electronic format.

Once data is collected, the next task is to analyze it in the context of hypothesis-driven models that help scientists understand the phenomena they are studying. In the case of light- or X-ray-scattering data, there is a well-proven physical theory of scattering that is used to process the data and calculate the observed structure function of the material being studied (reference). This structure function is then compared to predictions made by the hypotheses being tested. In the case of biological reporter gene data, the light intensity data is matched up with genetic or DNA sequencing data and statistically analyzed for trends. Large-scale biomedical genetic studies might search for correlations between DNA sequences and cancer rates among patients.

What is clear is that the original raw data in each case is typically processed extensively by further computational programs. Visualization tools that create a variety of scientific plots are a preferred aid for troubleshooting ongoing experiments, whether laboratory or computational. Scientific plots and charts are also often the end product of an investigation, in the form of data-rich graphics that demonstrate the strength of a hypothesis compared to its alternatives (cite Tufte).

Unfortunately, scientists too often resort to a grab-bag of tools to perform these varied computational tasks. For physicists, C or FORTRAN is often used to generate simulation data, and C code is used to control experimental apparatus. Data analysis is performed in external software packages such as Matlab or Mathematica for equation solving (reference), or Stata, SPSS or R for statistics (reference). Furthermore, separate data visualization packages may be used, making the toolset extremely varied.

Such a hodge-podge of tools is an inadequate solution for a variety of reasons. From a computational perspective, most of these tools cannot be easily pipelined, which forces many manual steps or excessive glue code that most scientists are not trained to write. Far more important than the inconvenience of gluing these tools together is the burden placed on the scientist in terms of data management. The result is, at best, poor data provenance, even though provenance is of utmost importance in scientific studies, where data integrity is the foundation of every conclusion reached and every fact established. In such a complicated system there is often a plethora of data files in several different formats residing in many different locations. Most tools do not produce adequate metadata for these files, and scientists typically fall back on cryptic file-naming schemes to indicate what type of data a file contains and how it was generated. Such complications easily lead to mistakes.

Furthermore, when data files are manually moved from tool to tool, it is not clear whether an error is due to a program bug or to human error in using the wrong file. Analyses can be repeated only by following work flows that must be manually recorded in a paper or electronic lab notebook, so steps are easily forgotten and hard to pass on to future generations of scientists.

The Python programming language and its associated community tools (reference) can help scientists overcome some of these problems by providing a general scientific programming platform that allows scientists to generate, analyze, visualize and manage their data within the same computational framework. Python can be used to generate simulation data, or to control instrumentation to capture data. Data analysis can be accomplished in the same framework, and there are graphics libraries that can produce scientific charts and graphs. Furthermore, Python code can be used to glue all of these pieces together, so that visualization code resides alongside the code that generates the data it is applied to. This streamlines the generation of data and makes data management feasible. Most importantly, such a uniform toolset allows the steps of a data work flow to be written down in Python code, enabling automatic provenance tracking.
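As a toy illustration of this kind of single-framework workflow (every name and number here is invented for illustration, not taken from the study), data generation, analysis, and a record of the work flow can live in one short Python script:

```python
import random

def generate_data(n, seed=42):
    """Simulate an 'experiment': n noisy measurements around 10.0 (toy example)."""
    rng = random.Random(seed)
    return [10.0 + rng.gauss(0, 1) for _ in range(n)]

def analyze(data):
    """Analysis step: the sample mean."""
    return sum(data) / len(data)

# Each step of the work flow is recorded as it runs, so the provenance
# of every result is written down automatically rather than by hand.
provenance = []

def step(description, func, *args):
    provenance.append(description)
    return func(*args)

data = step("generate_data(n=100, seed=42)", generate_data, 100)
mean = step("analyze: sample mean", analyze, data)
print(provenance)
```

Because the generation, analysis, and bookkeeping are all plain Python, re-running the entire pipeline, or handing it to a colleague, is a matter of re-running one script.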

In this paper, we outline a recent comparative genomics case study in which Python and associated libraries were used as a complete scientific programming platform. We introduce several specific Python libraries, and show how they were used to facilitate input of standardized biological data, generate comparative genomic data based on a designed statistical test, and relieve speed bottlenecks in the code. Throughout, we point to resources for further reading on these topics. We conclude with ideas about how Python promotes good scientific programming practice, and ...

Comparative Genomics Case Study

  • Brief Project Description - Compare DNA sequences of viruses
    • Download and parse the genome files of many viruses
    • Store the genome in a project-specific genome class
    • Draw random genomes to compare to the 'real' genome
    • Visualize the genomic data in a 'genome landscape' plot
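The steps above can be sketched with a toy version of such a project-specific genome class (the class, its API, and the sequence are invented for illustration, not the study's actual code):

```python
import random

class Genome:
    """Minimal sketch of a project-specific genome class (hypothetical API)."""

    def __init__(self, name, sequence):
        self.name = name
        self.sequence = sequence.upper()

    def gc_content(self):
        """Fraction of G and C bases in the genome."""
        gc = sum(self.sequence.count(b) for b in "GC")
        return gc / len(self.sequence)

    def random_genome(self, rng=None):
        """Draw a shuffled 'random' genome preserving nucleotide composition."""
        rng = rng or random.Random(0)
        bases = list(self.sequence)
        rng.shuffle(bases)
        return Genome(self.name + "_random", "".join(bases))

real = Genome("toy_virus", "ATGGCGTACGTTAGC")
rand = real.random_genome()
# Shuffling preserves base counts, so summary statistics that depend only
# on composition are identical between the real and random genomes.
print(real.gc_content() == rand.gc_content())
```

The study's actual null model is more constrained (preserving amino acid sequence and codon usage, as described below), but the pattern of comparing the real genome against draws from a randomized ensemble is the same.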

Biopython

  • Overview of Biopython
    • A suite of bioinformatics tools (reference old Biopython paper) for tasks such as parsing bio-database files, computing alignments between biological sequences, and interacting with biological web services
  • Use of Biopython in this project
    • Parsing GenBank files from the National Center for Biotechnology Information
    • example code
  • Benefits of using Biopython
    • parsing code can be wrapped in custom classes that make sense for the particular project
  • show snippet of a lambda phage GenBank file
  • snippets of parsing code
    • example class structure of what I needed (removing overlaps, getting CDS sequences, drawing random genomes that preserve amino acid sequences and global codon usage)
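As a sketch of the kind of parsing code involved (assuming Biopython is installed; the tiny GenBank record below is invented to stand in for a real NCBI download such as the lambda phage genome), extracting CDS sequences looks roughly like this:

```python
from io import StringIO
from Bio import SeqIO

# A minimal, invented GenBank record used in place of a downloaded file.
GENBANK = """\
LOCUS       TOY                       12 bp    DNA     linear   UNK 01-JAN-1980
DEFINITION  toy record for illustration.
ACCESSION   TOY
FEATURES             Location/Qualifiers
     CDS             1..9
ORIGIN
        1 atggcgtaat aa
//
"""

# SeqIO understands the GenBank format directly; for a real file the
# StringIO handle would simply be replaced by open("lambda.gb").
record = SeqIO.read(StringIO(GENBANK), "genbank")

# Pull the coding sequences (CDS features) out of the annotated record.
cds_seqs = [str(feat.extract(record.seq)).upper()
            for feat in record.features if feat.type == "CDS"]
print(cds_seqs)
```

The point is that Biopython handles the file-format details, so project code can immediately wrap the resulting records in custom classes like the genome class described above.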

Matplotlib

  • Overview of Matplotlib
    • Matlab-like graphical environment
  • Use of Matplotlib in this project
    • generating genome landscapes
    • example code
  • Benefits of using Matplotlib
    • graphics code resides alongside the data generation code
    • quick troubleshooting
    • complicated plots can easily be re-generated by tweaking the code

SWIG

  • Overview of SWIG
    • allows you to speed up selected parts of an application by writing them in another language (C, C++)
  • Use of SWIG in this project
    • speeding up the random-genome drawing routine
    • example code
  • Benefits of using SWIG
    • get all the benefits of Python, with speed for the critical parts
    • the sped-up parts are used in exactly the same context - no need for glue code
    • can leverage, within Python, the experience that scientists typically have in other languages
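As a sketch of the SWIG pattern (file names, the module name, and the toy routine are all illustrative; the study wrapped its random-genome drawing code rather than this function), a speed-critical routine is written in plain C:

```c
/* gcutil.c - hypothetical hot spot moved to C for speed */
int gc_count(const char *seq) {
    int n = 0;
    for (; *seq; seq++)
        if (*seq == 'G' || *seq == 'g' || *seq == 'C' || *seq == 'c')
            n++;
    return n;
}
```

and described to SWIG in a small interface file:

```
/* gcutil.i - SWIG interface file */
%module gcutil
%{
extern int gc_count(const char *seq);
%}
extern int gc_count(const char *seq);
```

A typical build (exact flags and include paths vary by platform and Python version) is:

```
swig -python gcutil.i
gcc -shared -fPIC gcutil.c gcutil_wrap.c $(python-config --includes) -o _gcutil.so
```

after which the C routine is called from Python as if it were native (`import gcutil; gcutil.gc_count("ATGC")`), in exactly the same context as the pure-Python version it replaced.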

Conclusions

  • Practical Conclusions
    • community modules are useful for a variety of scientific tasks
    • Python can easily be adopted by more scientists
  • Bigger-picture conclusions for good scientific practice
    • code readability and package structure promote good scientific practice
    • Python and its modules provide a consistent framework that promotes data provenance
    • can plug into other community tools and practices that help science - e.g. unit testing


References/Resources