Authored by Jin Choi for EDAMAME2015, based on a previous tutorial. Modified by Adina Howe. EDAMAME-2015 wiki
EDAMAME tutorials have a CC-BY license. Share, adapt, and attribute please!
Let’s have a BLAST! That’s right, the Basic Local Alignment Search Tool!
You all may have previously used the NCBI BLAST Web page to do individual searches. Today we’ll automate batch searches at the command line on your own computer.
Today, imagine that you have genes that you have sequenced and want to identify the closest known sequences within a database.
Make your current working directory your home directory, for starters:
cd ~/
First, let’s check and see if we have BLAST installed, and this will depend on the compute resources you are using.
which blastn
If you have BLAST installed and in your path (BLAST may be installed but not in your path), it will give you something like this:
/usr/bin/blastn
If you don’t have BLAST, then you will need to install it.
To install the BLAST software, type the following command. This installs the most recent version:
sudo apt-get install -y ncbi-blast+
Now, we can’t run BLAST without creating a database to compare our sequences to. A popular database is to use all the genes in NCBI (e.g., similar to the web interface’s databases). This, like a lot of NCBI databases is huge, so I don’t suggest putting this on your laptop unless you have a lot of room. It’s best on a larger computer (HPCC, Amazon machine). For this tutorial, let’s download a small database for this tutorial.
We’ll also download the sequences that you want to compare to your database. This would be similar to obtaining sequences from your sequencing facility.
First, let’s use curl to retrieve a database of genes. These often come from a literature search or hard work. In this tutorial, its easy though, you’ll just download it...but assume you’ve searched the literature and these are genes of interest that your sequence might match:
curl -O https://s3.amazonaws.com/edamame/Blast_Tutorial.tar.gz
To get access to this data, let’s unzip it. It’s in a compressed form to transfer from the “cloud” to your local computer faster.
tar -zxvf Blast_Tutorial.tar.gz
Next, let’s navigate into the directory and take a look at the files it contains.
cd Blast_Tutorial
Now you’ve got these files. We can take a look at the size and names of this file with the following command...
ls -l
Here’s some description of some these files:
MyQuery.txt: This is actually (mis)labeled as a ”.txt” file but it is a FASTA file. How can you tell? We’re going to leave it as a ”.txt” file, mainly because its very likely people will send you sequences like this in the future and its important to know that the extension of file type is somewhat arbitrary.
rep_set.fna: This is a database of previously observed genes. You would like to know if your sequences match any of these genes which have been sequenced from the bacterial isolates in your lab.
So, now we’ve got the database files, but BLAST requires that each subject database be preformatted for use; this is a way of speeding up certain types of searches. To do this, we have to format the database. You should do:
makeblastdb -in rep_set.fna -dbtype nucl -out rep_set.fna.db
The -in parameter gives the name of the database, the -out parameter says “save the results”, and the -dbtype parameter says “what type of the database”. For DNA, you’d want to use ‘-dbtype nucl’. FYI, for protein, ‘-dbtype prot’.
Let’s see what the BLAST database looks like
ls
You may notice that there are 3 files that were generated.
rep_set.fna.db.nhr
, rep_set.fna.db.nin
, and rep_set.fna.db.nsq
BLAST uses these 3 “indexed” files to make searching genes against genes go faster and more efficiently.
Just a reminder:
Now try a BLAST: You need a file that have your query in. Remember, here is your query looks like.
cat MyQuery.txt
This file contains sequences of 4 bacteria and 5 fungi that you have gotten sequenced. You wonder if it matches anything in the known database.
To do this, we’ll run a BLAST. We use blastn because we will search a nucleotide database using a nucleotide query.
blastn -db rep_set.fna.db -query MyQuery.txt
Once you get the output, you can do a few different things...
blastn -db rep_set.fna.db -query MyQuery.txt -out out.txt
and then use ‘less’ to look at it:
less out.txt
blastn -db rep_set.fna.db -query MyQuery.txt | less
Sometimes tabular form (output format) is useful. To get the result in tabular form,
blastn -db rep_set.fna.db -query MyQuery.txt -out outtabular.txt -outfmt 6
Let’s try a different blast so you get your own practice, with a slight twist. Imagine you’ve compiled a database of genes from all isolates that originate from soil, and you would like to compare it to genes in NCBI RefSeq (a popular genome database). We’re providing you two files – the RefSoil16s.fa file – the sequences of 16S rRNA genes from soil isolates and the database – rep_set_sub.fna that you have downloaded from NCBI.
Imagine you’ve compiled a database of genes from all isolates that originate from soil, and you would like to compare it to genes in NCBI RefSeq (a popular genome database). We’re providing you two files – the RefSoil16s.fa file – the sequences of 16S rRNA genes from soil isolates and the database – rep_set_sub.fna that you have downloaded from NCBI. How would you find the sequences within the database which are the closest match to the genes from your isolates?
BLAST has lots and lots and lots of options. Run ‘blastn’ by itself to see what they are. Some of the most useful ones are -evalue
.
If you want to search protein, use ‘blastp’ instead of ‘blastn’. ‘blastx’, ‘tblastn’, ‘tblastx’ also available.
Here are the options/flags available to this program:
0 = pairwise,
1 = query-anchored showing identities,
2 = query-anchored no identities,
3 = flat query-anchored, show identities,
4 = flat query-anchored, no identities,
5 = XML Blast output,
6 = tabular,
7 = tabular with comment lines,
8 = Text ASN.1,
9 = Binary ASN.1
10 = Comma-separated values
11 = BLAST archive format (ASN.1)
##Help and other resources