Julius B. Lucks/Projects/Python Articles/Scientific Pipelines

From OpenWetWare
Revision as of 17:02, 17 February 2007 by Julius B. Lucks

I have been on a quest to improve the scientific pipeline.

The Beginning

I have to start off by explaining a bit about my crusade to find the perfect tools to do science. I started programming as an undergraduate chemistry student at UNC Chapel Hill with FORTRAN (link). Back in the day, FORTRAN was evidently a godsend to scientists because it gave them a language so close to the machine that they could write programs that made the absolute most of precious memory and clock cycles. While this is a good thing, it is not a fun thing. I was delighted at first to even be able to program a computer, but my quest to improve my tools began when I encountered a comment-less, 300,000-line monolithic FORTRAN program, CADPAC (link), at Cambridge University as a master's student.

Now you could argue that there was more to the huge pain that CADPAC was than FORTRAN (such as the almost complete lack of comments), but a lot of it was FORTRAN. In particular, FORTRAN code does not look much different from a string of 0s and 1s if you look at it for long enough, which as I mentioned earlier was a good thing for people who wanted to waste not a resource. However, this leads to some really awkward-looking code that is hard to read, and hard to follow even if you were the author trying to remember what you were doing. In other words, FORTRAN code is not very self-documenting, a quality I now consider very valuable.

So at that time I was given a brand new project, and in my mind a brand new chance to learn a new tool. A lot of my friends knew C++, so I thought I would give it a try. Now C++, with its object-oriented features, was a totally new world to me. It was both delightful in the doors it opened in terms of thinking about a problem, and a royal pain in how hard it was to make it do some things. C++ still requires a very keen eye for detail, and a novice can quickly get lost in it. I was thrilled to be able to create variables when I needed them (unlike FORTRAN, where all memory usage was declared at the beginning of the program so that compilers could optimize the code more easily), but I was confused by having to tidy up after myself, especially after I had already created way more object complexity than I needed just because I could. I want to stress that I was a novice, and that the experts don't have these problems. But I was also trying to do a lot of science with my code and had other things to worry about. I needed a solution that was a little more friendly to the novice.

Along came a trip to a damaged book sale where I picked up Randal Schwartz and Tom Christiansen's Learning Perl. I know that Perl and C++ don't even try to do the same thing, so they are really not comparable from the language point of view. But I am telling the story of a novice scientific programmer who has many different tasks at hand. I had heard of Perl, so I bought the book for 2 pounds and read it over the weekend. It was the first time I realized that there is room for more than one tool in my scientific toolkit.

Well, I shouldn't say the first time. I was already familiar with Unix systems and the variety of tools they offer. I had even spent a considerable amount of time becoming a relative awk (link) master (at least compared to my other skills). But Perl opened my eyes to a single, consistent tool in which I could write all my other, non-number-crunching tasks.

So my first project pipeline started to develop. I would write code to do serious number crunching in C++. This code took a parameter file and made several output files. I used Perl to write the parameter files, to move the output files around, and to script multiple runs of the number-crunching code. This system worked pretty well, but the C++ code took me a long time to develop, and there was the disadvantage of having two languages. This only got more complicated as I started to need to make plots and graphics of the results.
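To make the shape of that driver script concrete, here is a minimal sketch of the same idea written in Python rather than Perl. Everything specific in it is an assumption for illustration: the "key = value" parameter file format, the run_NNN directory layout, and the function names are hypothetical, and a Python one-liner stands in for the compiled number-crunching program.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def write_params(path, params):
    """Write one 'key = value' line per parameter (a hypothetical file format)."""
    lines = [f"{key} = {value}" for key, value in sorted(params.items())]
    path.write_text("\n".join(lines) + "\n")


def run_sweep(command, param_sets, workdir):
    """Run `command <param_file>` once per parameter set, one directory per run.

    `command` is a list such as ["./simulate"]; that executable name is a
    placeholder for whatever compiled number-crunching code you have.
    Returns the files holding each run's captured standard output.
    """
    outputs = []
    for index, params in enumerate(param_sets):
        run_dir = Path(workdir) / f"run_{index:03d}"
        run_dir.mkdir(parents=True, exist_ok=True)
        param_file = run_dir / "params.txt"
        write_params(param_file, params)
        # check=True stops the sweep if the program exits with an error.
        result = subprocess.run(
            command + [str(param_file)],
            capture_output=True, text=True, check=True,
        )
        out_file = run_dir / "output.txt"
        out_file.write_text(result.stdout)
        outputs.append(out_file)
    return outputs


if __name__ == "__main__":
    # Stand-in for the compiled code: a one-liner that echoes its parameter file.
    fake_solver = [sys.executable, "-c",
                   "import sys; print(open(sys.argv[1]).read(), end='')"]
    sweep = [{"temperature": t, "steps": 1000} for t in (0.5, 1.0, 2.0)]
    with tempfile.TemporaryDirectory() as tmp:
        for out in run_sweep(fake_solver, sweep, tmp):
            print(out.parent.name, "->", out.read_text().splitlines())
```

Keeping each run in its own directory, with the parameter file next to the output, is one simple way to make sure a result can always be traced back to the inputs that produced it.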

I won't bore you with the details of my exploration into how to make good graphics, but I will tell you that I finally added a third tool to my toolkit, MATLAB (link). The idea was two-fold: I could prototype number-crunching code in MATLAB and eventually move it to C++ if I needed the speed, and I could make graphics in MATLAB. So eventually I had another step in my pipeline: Perl scripts would mash up the data generated by the C++ code, write MATLAB input files, and call batch runs of MATLAB to generate the graphics. At the time I was also starting to get into the idea of putting my data on the web for my collaborators to see, so I was using Perl to generate a lot of HTML with these graphics as well.

Scientific Pipelines

It took me about 7 years to come up with this pipeline in the midst of the many scientific projects I was working on at various times. There are two important points here. The minor point is that I had many different tools to accomplish the various tasks of the pipeline. This was not a bad learning experience, but it was certainly hard to maintain. Sometimes I would go months without having to scale a code up with C++, so I would naturally be rusty when I came back to it.

The major point is that the scientific pipeline process is more general than my particular experience. I did not receive any formal training in this stuff, which I think speaks volumes about the level of training of the majority of scientific programmers. Most classes I have seen listed cover the algorithms, but not the day-to-day work of bringing those algorithms to bear on your projects.

So I want to extract the abstract ideas of what a good scientific pipeline is. First comes a development and prototyping phase. This usually happens when you are playing around with various hypotheses in your research. You need to be able to test many ideas in rapid succession, as well as keep track of what you have tried and whether it worked or not (the lab notebook concept). You don't want to write code that will save computer time at this stage; you want to write code that will save your own time. After all, you aren't going for the gold run here, you are merely sniffing around a bit to find out where the big leads and clues are.
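As one possible reading of the lab notebook concept in code, the sketch below appends a record of every trial (what was tried, with which parameters, and whether it worked) to a plain text file, one JSON object per line. The function names, the record fields, and the example hypotheses are all made up for illustration; any format you can append to and read back would do.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def record_trial(notebook, hypothesis, params, outcome, worked):
    """Append one trial to a JSON-lines notebook file (one JSON object per line)."""
    entry = {
        "when": datetime.now(timezone.utc).isoformat(),
        "hypothesis": hypothesis,
        "params": params,
        "outcome": outcome,
        "worked": worked,
    }
    # Opening in append mode means earlier entries are never rewritten.
    with open(notebook, "a", encoding="utf-8") as handle:
        handle.write(json.dumps(entry) + "\n")
    return entry


def read_trials(notebook):
    """Return every recorded trial, oldest first."""
    with open(notebook, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        notebook = Path(tmp) / "notebook.jsonl"
        # Illustrative entries only; these are not results from a real project.
        record_trial(notebook, "signal decays exponentially", {"rate": 0.1},
                     "fit residuals too large", worked=False)
        record_trial(notebook, "signal decays as a power law", {"exponent": 1.5},
                     "fit within noise", worked=True)
        for trial in read_trials(notebook):
            status = "worked" if trial["worked"] else "failed"
            print(f"{trial['hypothesis']}: {status}")
```

Because the log is cheap to write, it can be called from every throwaway script during prototyping, which is exactly when it is easiest to forget what has already been tried.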

Next comes the first round of production. You have isolated a hypothesis or two and you need to do some serious analysis. Here you might want to consider re-writing some portions of your code so that they will save computer time.

After that, you still want the flexibility to go back and readjust your code rapidly because inevitably you made some mistakes in your thinking. You are thus back to the development and prototyping phase, armed with new knowledge you learned in your first production phase. Typically you can go back and forth between these phases many times in the life of a single project.

At every point you need to be able to communicate your results with others, and often scientific graphics are a primary means of communication. They are useful as intermediate results, and are often crucial for that final stamp of approval of a project, the publication.

Python

This discussion finally turns to why I started writing this article in the first place. Recently I have been proselytizing Python as my language of choice for almost any new programming project I undertake. Every time I talk to someone, I think of new reasons to like Python, and I find better ways to explain my existing reasons. I think Python and the Python community effectively address each step of the scientific pipeline, and I think it could become a new de facto scientific computing language. When you couple this with all of the industry-related uses it has, it becomes an extremely powerful platform for blending the latest technology with scientific methods.

Briefly, here is why I love it:

  • It is object oriented from the ground up (unlike Perl, which tacked it on later), so it has a better structure than Perl.
  • It is a little more verbose than other languages, which naturally makes the code more self-documenting than Perl. This makes it easy to quickly write code that you will still understand many months later.
  • The code is very clean-looking, which is extremely important for maintaining code.
  • It comes with a very good unit-testing module that makes it trivial to change the internals of the code, while still making sure it does the job you want it to. (Unit-testing is not prevalent in scientific programming yet, something which I hope changes.)
  • It has an interactive interpreter (unlike Perl), with great object introspection, so you can really develop fast in it.
  • The scientific support is very extensive:
    • It has a very good numerical module, numpy (link), so you can prototype serious computations in it.
    • It has a BioPython (link) project similar to BioPerl's (link) (though not as mature), supporting many common biological tasks.
    • You can write C/C++ extensions for it very easily, so you can speed up parts of the code to true production speed.
    • It has R (link) bindings so you can do statistics in it.
    • It has a very good plotting library, matplotlib (link), that is designed to be similar to MATLAB, so you can do all of your graphics in it.
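The unit-testing point in the list above can be sketched with the standard library's unittest module. This is only an illustration under assumed names: gc_content_prototype and gc_content_rewrite are hypothetical functions computing the fraction of G and C bases in a sequence, standing in for a first draft and a later rewrite of the same calculation. The same checks run against both, which is what makes it safe to change the internals.

```python
import unittest


def gc_content_prototype(sequence):
    """First draft: count G and C bases one character at a time."""
    if not sequence:
        raise ValueError("empty sequence")
    hits = 0
    for base in sequence.upper():
        if base in "GC":
            hits += 1
    return hits / len(sequence)


def gc_content_rewrite(sequence):
    """Later rewrite with different internals; behaviour should be unchanged."""
    if not sequence:
        raise ValueError("empty sequence")
    upper = sequence.upper()
    return (upper.count("G") + upper.count("C")) / len(upper)


class GCContentChecks:
    """Checks shared by every implementation; subclasses set `func`."""

    func = None

    def test_known_values(self):
        self.assertAlmostEqual(self.func("GGCC"), 1.0)
        self.assertAlmostEqual(self.func("ATAT"), 0.0)
        self.assertAlmostEqual(self.func("atgc"), 0.5)

    def test_empty_sequence_is_rejected(self):
        with self.assertRaises(ValueError):
            self.func("")


class TestPrototype(GCContentChecks, unittest.TestCase):
    # staticmethod keeps the plain function from being bound to the test instance.
    func = staticmethod(gc_content_prototype)


class TestRewrite(GCContentChecks, unittest.TestCase):
    func = staticmethod(gc_content_rewrite)
```

Saved to a file, these tests can be run with `python -m unittest` followed by the file name. If a rewrite ever changes an answer, the shared checks fail for that implementation only, which points straight at the culprit.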

I plan to write several articles highlighting these features of Python, and how these ideas from software engineering can be applied to scientific pipelines.