BLAST on Lewis


The fastest way to run BLAST on Lewis is with MPI.  One limitation, however, is that nucleotide sequences must be less than ~120 KB in length.  If that limit is a problem, consider using the tcp/ip connection instead.  It, in turn, is limited to 4 cpus, all of which must be on one node.  See Lewis FAQs or Running MPI Jobs on Lewis for more information about using Lewis.  If you were looking for WU-BLAST go here.
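
To choose between the two methods it may help to check the sequence lengths in your query file first.  The following is a minimal sketch, assuming a FASTA file and treating "~120 KB" as 120000 residues (an assumption; adjust the limit as needed).  The function name long_records is hypothetical.

```shell
#!/bin/sh
# List FASTA records whose sequence length exceeds a limit.
# Assumption: "~120 KB" is treated as 120000 residues; pass another limit as $2.
long_records() {
    # $1 = FASTA file, $2 = length limit (default 120000)
    awk -v limit="${2:-120000}" '
        /^>/ { if (id != "" && len > limit) print id, len
               id = substr($1, 2); len = 0; next }
             { gsub(/[ \t\r]/, ""); len += length($0) }
        END  { if (id != "" && len > limit) print id, len }
    ' "$1"
}

# Demo on a small scratch file with a limit of 10 so the effect is visible.
demo=$(mktemp)
printf '>short\nACGT\n>long\nACGTACGT\nACGTACGT\n' > "$demo"
long_records "$demo" 10    # prints: long 16
```

If the function prints nothing for your query file at the default limit, the mpi method in section A should be usable; otherwise consider section B.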

A.  Running BLAST with mpi.

  1. Databases properly formatted for mpiBlast on Lewis are listed under the directory /dfs/databases/mpi-db.  To query one of them create a file like this one in your home directory:
    [<userid>@lewis ~]$ cat file1
    mpirun.lsf -np 16 /evbio/src/mpiBlast.last/bin/mpiblast -d gball -p blastn -v 10 -b 0 -i $HOME/query.fa -o $HOME/query.gball.out
    Note that $HOME expands to your home directory; if you write out the full path to the query file instead, you must put your userid in it.  The "-d" flag names the database; here gball indicates all of GenBank.  Other databases are listed here.  Finally, make the file executable with the following command:  chmod 755 file1

  2. Create a second file that calls the first:
    [<userid>@lewis ~]$ cat file2
    #BSUB -J gb
    #BSUB -n 16
    #BSUB -a mvapich2
    #BSUB -oo %Jgb.o
    #BSUB -eo %Jgb.e
    ./file1
    The "-J" option gives the job a name.  The "-a mvapich2" option is required for LSF integration with InfiniBand.  The "-n" option sets the number of cpus.  We have found that mpiBlast runs best with 16 cpus; more will not help.  The "-oo" and "-eo" options create output and error files, respectively, in your home directory; "%J" in the file name is replaced by the job number.  To learn more about the bsub options, use the command "man bsub".
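
The cpu count appears in both files ("-np" in file1 and "#BSUB -n" in file2).  As a sketch only, the hypothetical helper below writes both files from a single value so the two counts cannot drift apart.  The mpiblast path and flags are copied from steps 1 and 2; the function name write_mpi_job is an assumption.

```shell
#!/bin/sh
# Write file1 and file2 for an mpiBLAST job from one cpu count.
# Hypothetical helper; command lines are copied from steps 1 and 2 above.
write_mpi_job() {
    # $1 = target directory, $2 = number of cpus (default 16)
    dir=$1
    ncpu=${2:-16}
    # \$HOME stays literal in file1 and is expanded when the job runs.
    cat > "$dir/file1" <<EOF
mpirun.lsf -np $ncpu /evbio/src/mpiBlast.last/bin/mpiblast -d gball -p blastn -v 10 -b 0 -i \$HOME/query.fa -o \$HOME/query.gball.out
EOF
    chmod 755 "$dir/file1"
    cat > "$dir/file2" <<EOF
#BSUB -J gb
#BSUB -n $ncpu
#BSUB -a mvapich2
#BSUB -oo %Jgb.o
#BSUB -eo %Jgb.e
./file1
EOF
}

# Demo: write into a scratch directory and show the result.
scratch=$(mktemp -d)
write_mpi_job "$scratch" 16
cat "$scratch/file1" "$scratch/file2"
```

Call it as write_mpi_job "$HOME" to create the files in your home directory once the output looks right.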

  3. Create a third file to tell mpiBlast where to find things.  You must call it ".ncbirc":
    [<userid>@lewis ~]$ cat .ncbirc
    [NCBI]
    Data=/evbio/src/mpiBlast/build/ncbi/data
    [mpiBLAST]
    Shared=/dfs/databases/mpi-db
    Local=/home/userid/data/temp
    Note that you must put your own userid in the last line.  Also, because the file name begins with the character ".", listing it requires the "-a" option, as in "ls -la".
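
A minimal sketch of writing this file from the shell is shown here.  The helper name write_ncbirc is hypothetical, and creating the Local directory up front is an assumption on our part; the section names and paths are copied from the listing above.

```shell
#!/bin/sh
# Write a .ncbirc and create the Local directory it points to.
# Hypothetical helper; creating the Local directory is an assumption.
write_ncbirc() {
    # $1 = directory to write .ncbirc into, $2 = Local scratch directory
    cat > "$1/.ncbirc" <<EOF
[NCBI]
Data=/evbio/src/mpiBlast/build/ncbi/data
[mpiBLAST]
Shared=/dfs/databases/mpi-db
Local=$2
EOF
    mkdir -p "$2"
}

# Demo in a scratch directory so an existing ~/.ncbirc is not overwritten.
scratch=$(mktemp -d)
write_ncbirc "$scratch" "$scratch/data/temp"
ls -la "$scratch"          # "-a" is needed to see the dot file
cat "$scratch/.ncbirc"
```

For real use the call would be something like write_ncbirc "$HOME" "/home/$USER/data/temp", assuming $USER holds your userid.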

  4. To submit the BLAST job defined in file1, enter at the command line:
    [<userid>@lewis ~]$ bsub < file2
    You will immediately see a response like this:
    Job <nnnn> is submitted to default queue .
    Here nnnn is the job number.  To check the status of your job, enter:
    [<userid>@lewis ~]$ bjobs -w
    The job's output and error messages are returned to your home directory under the file names given in file2.  For the above submission, e.g., they are in two files called nnnngb.o and nnnngb.e, where ".o" refers to output and ".e" refers to error.  The BLAST results themselves are written to the file named by the "-o" flag in file1.
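
If you want the job number in a script, it can be pulled out of the line that bsub prints.  The sketch below assumes the message has the form shown in this step and uses a made-up job number; the function name job_id_from_bsub is hypothetical.

```shell
#!/bin/sh
# Extract the job number from the line bsub prints, then derive the
# file names implied by "-oo %Jgb.o" and "-eo %Jgb.e" in file2.
job_id_from_bsub() {
    # Reads bsub's message on stdin, prints the number between "Job <" and ">".
    sed -n 's/^Job <\([0-9][0-9]*\)>.*/\1/p'
}

# Demo with a made-up job number; on Lewis the input would come from:
#   bsub < file2
jobid=$(echo 'Job <1234> is submitted to default queue .' | job_id_from_bsub)
echo "output: ${jobid}gb.o"
echo "errors: ${jobid}gb.e"
```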

B.  Running BLAST with tcp/ip.

  1. Test whether the NCBI tools are in your path by typing "blastall" at the command line.  You should see the list of options for blast printed on your screen.  If so, skip to step 7.
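
As an alternative that does not print the whole option list, the shell's own "command -v" can report whether a program is on the path.  This is a small sketch; the function name have_tool is hypothetical.

```shell
#!/bin/sh
# Report whether a program is on the PATH without running it.
have_tool() {
    command -v "$1" >/dev/null 2>&1
}

if have_tool blastall; then
    echo "blastall found at: $(command -v blastall)"
else
    echo "blastall is not on your PATH"
fi
```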

  2. Put the NCBI tools in your path.  Assuming you are using the bash shell, open the .bashrc file in your home directory and add the following line:
    export NCBI=/evbio/NCBI/ncbitools/ncbi/build/

  3. Include the new variable in the PATH statement of the same file so it reads something like this:
    export PATH=$PATH:$NCBI
    (There may be more variables included in the PATH statement. This is just an example.)
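
If you prefer to make the two changes from the command line rather than in an editor, the sketch below appends each line only when it is not already present, so running it twice does no harm.  The function name append_once is hypothetical, and the demo works on a scratch file rather than your real .bashrc.

```shell
#!/bin/sh
# Append a line to a file only if that exact line is not already there.
append_once() {
    # $1 = file, $2 = line
    grep -qxF -- "$2" "$1" 2>/dev/null || printf '%s\n' "$2" >> "$1"
}

# Demo on a scratch file; point rc at "$HOME/.bashrc" to apply it for real.
rc=$(mktemp)
append_once "$rc" 'export NCBI=/evbio/NCBI/ncbitools/ncbi/build/'
append_once "$rc" 'export PATH=$PATH:$NCBI'
append_once "$rc" 'export PATH=$PATH:$NCBI'   # second call changes nothing
cat "$rc"
```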

  4. Now exit your session on Lewis and log back in so that the new settings take effect.

  5. Type at the command line:
    echo $NCBI
    You should see the path indicated as /evbio/NCBI/ncbitools/ncbi/build/

  6. Repeat step 1.  You should now see all the options for running BLAST.

  7. To run BLAST, create two files.  The first, file1, is an executable file that might contain:
    blastall -p blast -d database -i query -a 4 -v 100 -b 1
    Replace "blast" with the kind of BLAST you want (blastn, blastx, blastp, tblastx, tblastn), "database" with the full path and name of the database, and "query" with the name of the query file.  For gball, e.g., use /dfs/databases/genbank2/gball; for NR protein use /dfs/databases/NR/nr.  The list of databases is here.

    Create file2:
    # Set job parameters
    #BSUB -a mpichp4
    #BSUB -J jobname
    #BSUB -oo jobname.o%J
    #BSUB -eo jobname.e%J

    # Set number of CPUs
    #BSUB -n 4
    #BSUB -R "span[hosts=1]"

    # Start MPI job
    mpirun.lsf ./file1
    Note that the maximum possible number of cpus is 4.  The "span[hosts=1]" argument ensures that all 4 cpus are on the same node.  This line is required; without it LSF may over-allocate threads to the node your job is running on, causing great inefficiency for you and others.
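
Because the thread count is given twice ("-a" in file1 and "#BSUB -n" in file2), a quick consistency check before submitting may save a wasted job.  The following is a sketch under the assumption that the two files look like the examples in this step; the function name check_cpus is hypothetical.

```shell
#!/bin/sh
# Compare the thread count passed to blastall ("-a N" in file1) with the
# cpu count requested from LSF ("#BSUB -n N" in file2).
check_cpus() {
    # $1 = file1, $2 = file2; returns 0 only if both counts match and are <= 4
    threads=$(sed -n 's/.*blastall.* -a \([0-9][0-9]*\).*/\1/p' "$1" | head -n 1)
    cpus=$(sed -n 's/^#BSUB -n \([0-9][0-9]*\).*/\1/p' "$2" | head -n 1)
    echo "blastall -a: ${threads:-missing}, #BSUB -n: ${cpus:-missing}"
    [ -n "$threads" ] && [ "$threads" = "$cpus" ] && [ "$threads" -le 4 ]
}

# Demo with scratch copies of the two files from this step.
scratch=$(mktemp -d)
echo 'blastall -p blastn -d /dfs/databases/genbank2/gball -i query -a 4 -v 100 -b 1' > "$scratch/file1"
printf '#BSUB -n 4\n#BSUB -R "span[hosts=1]"\nmpirun.lsf ./file1\n' > "$scratch/file2"
check_cpus "$scratch/file1" "$scratch/file2" && echo "ok"
```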

  8. To submit the BLAST job defined in file1, enter at the command line:
    [<userid>@lewis ~]$ bsub < file2
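    See step 4 of section A (Running BLAST with mpi) above for more details on the output files.  With the file2 shown here they are named jobname.onnnn and jobname.ennnn, where nnnn is the job number.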