Blast link

From OpenWetWare
Revision as of 17:16, 3 January 2012 by Ben Woodcroft (talk | contribs) (→‎sh: blastdbcmd: not found: new error - thanks to Kenlee Nakasugi for pointing out this fix)

Blast_link is an add-on for wwwblast that allows arbitrary links to be added to each hit on the results page. The default, and most common, use is a link that extracts the entire sequence of a hit, which the default install of wwwblast does not permit. This use case is referred to below as the link-to-sequences mode.

Pre-requisites

Blast_link requires Perl. The default link-to-sequences mode of blast_link also requires the BLAST+ executables makeblastdb and blastdbcmd.

Installation

Pre-requisites

Blast_link is an add-on for wwwblast, and requires that to be installed first. Note that the preparation of the binary BLAST databases is slightly more complex when working with blast_link, so follow the instructions on this wiki page and not the wwwblast wiki page for that part.

Download

The newest version of blast_link can be obtained from GitHub (less direct link). Blast_link is an entirely open source program, and modifications and improvements are appreciated. It is not affiliated with NCBI.

Put the extracted files in the base blast directory (the directory that blast.html is in).

After the extracted files have been put in place, you should be able to navigate to the new blast query page, blast_link.cgi. For example, the URL might be http://localhost/blast/blast_link.cgi. The old blast.html page will still work, but using it bypasses the blast_link code.

Preparation of binary BLAST+ databases

A binary BLAST database is a collection of multiple files (.nhr, .nin and .nsq files for nucleotide databases). They must be created from a fasta file in a terminal, using the BLAST+ toolkit, available from NCBI (PubMed citation). The legacy BLAST toolkit can be used to achieve the same goal, though the command line syntax differs.

For the link-to-sequences mode of blast_link, creation of the binary databases is slightly more onerous than usual, because the sequence identifiers need to be indexed by BLAST+. For the identifiers to be indexed, two conditions must be met:

  1. Each sequence identifier in the fasta file must conform to the NCBI naming standards.
  2. When issuing the makeblastdb command using the fasta file, the -parse_seqids flag must be used.

The simplest way to make the fasta file conform to the standard is to prepend 'gnl|blast|' to each sequence identifier. You might do this in a text editor with a find and replace of '>' with '>gnl|blast|', but take care that no '>' characters occur anywhere other than at the start of an identifier line. The same task can also be achieved with sed, a command line tool that comes by default with OSX and Linux:

$ sed -e 's/^>/>gnl|blast|/' -e 's/,/_/g' mysequences.fasta >mysequences.ncbi_standard_ids.fasta

Note that the second sed expression also replaces all commas with underscores. Then all that remains is to create the binary BLAST database using the -parse_seqids flag. For instance, if mysequences.ncbi_standard_ids.fasta is a fasta file of nucleotide sequences,

$ makeblastdb -in mysequences.ncbi_standard_ids.fasta -dbtype nucl -parse_seqids
Building a new DB, current time: 09/23/2010 14:12:18
New DB name:   mysequences.ncbi_standard_ids.fasta
New DB title:  mysequences.ncbi_standard_ids.fasta
Sequence type: Nucleotide
Keep Linkouts: T
Keep MBits: T
Maximum file size: 1073741824B
Adding sequences from FASTA; added 1620 sequences in 0.207906 seconds.

Then the name of the database to specify in blast.rc and blast.html will be 'mysequences.ncbi_standard_ids.fasta'.
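
Before running makeblastdb, it can be worth confirming that the renaming did what was intended. The following is a minimal sketch and not part of blast_link: the function name check_fasta_ids is made up for illustration, and it only checks for the 'gnl|blast|' prefix and the absence of commas in header lines, not the full NCBI naming standards.

```shell
#!/bin/sh
# Hypothetical helper (not part of blast_link or BLAST+): count header
# lines that do not follow the 'gnl|blast|' convention described above.
check_fasta_ids() {
    fasta="$1"
    # Header lines that do not start with the expected prefix.
    # '|| true' is needed because grep -c exits non-zero when the count is 0.
    unprefixed=$(grep '^>' "$fasta" | grep -c -v '^>gnl|blast|' || true)
    # Header lines that still contain a comma.
    with_commas=$(grep '^>' "$fasta" | grep -c ',' || true)
    echo "unprefixed=$unprefixed with_commas=$with_commas"
    # Succeed only if both counts are zero.
    [ "$unprefixed" -eq 0 ] && [ "$with_commas" -eq 0 ]
}

# Usage: check_fasta_ids mysequences.ncbi_standard_ids.fasta
```

If either count is non-zero, the sed command was probably run on a different file, or the file was edited afterwards.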

To test if the parsing worked, use the blastdbcmd tool, like so:

$ blastdbcmd -entry 'gnl|blast|Contig2' -db mysequences.ncbi_standard_ids.fasta
>Contig2
ATGCAAAACCCCCCCC
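
To check more than one identifier, the same blastdbcmd call can be looped over every header in the fasta file. This is only a sketch: list_ids and check_all_ids are hypothetical helper names, and the code assumes that the identifier is the first whitespace-delimited word of each header line and contains no shell wildcard characters.

```shell
#!/bin/sh
# Hypothetical helpers (not part of blast_link or BLAST+).

# Print the identifier from each header line: the text between '>' and
# the first whitespace character.
list_ids() {
    sed -n 's/^>\([^[:space:]]*\).*/\1/p' "$1"
}

# Try to fetch every identifier from the database with blastdbcmd, using
# the same -entry and -db options as the single test above, and report
# any identifier that cannot be retrieved.
check_all_ids() {
    fasta="$1"
    db="$2"
    failed=0
    for id in $(list_ids "$fasta"); do
        if ! blastdbcmd -entry "$id" -db "$db" >/dev/null 2>&1; then
            echo "not retrievable: $id"
            failed=1
        fi
    done
    return "$failed"
}

# Usage:
# check_all_ids mysequences.ncbi_standard_ids.fasta mysequences.ncbi_standard_ids.fasta
```

No output means every identifier was found. A long list of failures suggests that -parse_seqids was omitted when the database was built.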

Using blast_link

Once you have set up blast.rc and blast.html as for wwwblast, blast_link is ready to go. From the entry form, e.g. http://localhost/blast/blast_link.cgi, you should be able to click on hits to get their full sequences.

Troubleshooting

makeblastdb not working

Trying to run makeblastdb on OSX 10.4 results in this error:

$ makeblastdb -in /Users/someone/Sites/blast/db/Queries_fna.txt -dbtype nucl -parse_seqids
dyld: lazy symbol binding failed: Symbol not found: __ZSt16__ostream_insertIcSt11char_traitsIcEERSt13basic_ostreamIT_T0_ES6_PKS3_i
 Referenced from: /Users/someone/blast-2.2.24+/bin/makeblastdb
 Expected in: flat namespace

dyld: Symbol not found: __ZSt16__ostream_insertIcSt11char_traitsIcEERSt13basic_ostreamIT_T0_ES6_PKS3_i
 Referenced from: /Users/someone/blast-2.2.24+/bin/makeblastdb
 Expected in: flat namespace

Trace/BPT trap

The problem is that BLAST+ isn't supported on OSX 10.4. One solution to the problem is to use the legacy BLAST software (i.e. not BLAST+). The procedure for using the legacy BLAST with Blast_link is much the same as for BLAST+, except that formatdb is used, instead of makeblastdb, and the get_sequences.cgi file from Blast_link needs to be configured. An example of using formatdb:

$ formatdb -i my_nucleotide_sequences.fasta -o -p F

The -i flag specifies the sequences to format into the database, -o specifies that the sequence IDs should be parsed (required), and '-p F' specifies that the sequences are nucleotide (by default protein sequences are assumed).

To configure get_sequences.cgi, in a text editor modify the following lines:

#my $BLAST_DB_EXTRACT_METHOD = 'fastacmd';
my $BLAST_DB_EXTRACT_METHOD = 'blastdbcmd';

becomes

my $BLAST_DB_EXTRACT_METHOD = 'fastacmd';
#my $BLAST_DB_EXTRACT_METHOD = 'blastdbcmd';


Internal Server Error

If a web page says "Internal Server Error", something went wrong while the web server was running wwwblast. To find out more about the error, look at the end of the Apache error log. The error log is a file, which might be /var/log/httpd/error_log or /var/log/apache2/error.log.
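
Since the location of the log differs between systems, a small helper can try each candidate in turn. This is a sketch under the assumption that one of the paths passed in is readable by the current user (root privileges may be needed); show_last_errors is a hypothetical name.

```shell
#!/bin/sh
# Hypothetical helper: print the last 20 lines of the first readable log
# file among the candidate paths given as arguments.
show_last_errors() {
    for log in "$@"; do
        if [ -r "$log" ]; then
            echo "== $log"
            tail -n 20 "$log"
            return 0
        fi
    done
    echo "no readable error log found" >&2
    return 1
}

# Usage, with the two locations mentioned above:
# show_last_errors /var/log/httpd/error_log /var/log/apache2/error.log
```

Reloading the failing page and then running the helper should leave the relevant message near the bottom of the output.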

sh: blastdbcmd: not found

If this error is encountered when users click to get a fasta file of the sequences, one way to fix this is to modify get_sequences.cgi so that instead of

$cmd = "blastdbcmd ...

you have

$cmd = "/path/to/blastdbcmd ...
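
The absolute path can usually be found from a terminal in which blastdbcmd does work. A minimal sketch, assuming a POSIX shell; the helper name full_path_of is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print where a program is found on the current
# PATH, or fail with a message if it is not there. For an installed
# executable such as blastdbcmd this is its absolute path.
full_path_of() {
    path=$(command -v "$1") || {
        echo "$1 is not on PATH: $PATH" >&2
        return 1
    }
    echo "$path"
}

# Usage: full_path_of blastdbcmd
```

The printed path is what would replace /path/to/blastdbcmd in get_sequences.cgi. A likely cause of the original error is that the web server runs CGI scripts with a different PATH from the one in your login shell.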

Acknowledgements

Blast_link was created by Ben J. Woodcroft working in Bernie Degnan's group at the University of Queensland, the Molecular Geo- and Palaeobiology Lab of the Ludwig-Maximilians-Universität (LMU) and Stuart Ralph's group at the University of Melbourne. Creation of this wiki page and implementation of the legacy blast compatibility was funded by Dan Jackson's group at the University of Göttingen.