CB2442

Here we store material for the course CB2442, Bioinformatics.

This project is maintained by kth-gt

Bioinformatics Tools Booklet

There is a seemingly endless supply of bioinformatics tools. The field of bioinformatics itself is very broad and, on top of that, there are often many ways of solving a problem, which will be more or less adequate in different scenarios. This list includes all the tools that you need to complete the labs (plus a few extra), but is by no means exhaustive.

Useful commands in the terminal:

Command               Description
cat file              displays an entire file
less file             shows a file one screen at a time; press q to leave
pwd                   prints the path to the current directory (shows you where you are)
head -n N file        displays the first N lines of a file
tail -n N file        displays the last N lines of a file
wc file               displays the number of lines, words and bytes in a file
ls                    lists the files in the current directory
ls -lh                lists the files in the current directory, including size and permissions
cp file1 file2        makes a copy of file1 called file2
mv file1 file2        moves file1 to the name or location file2
rm file               erases (removes) file
mkdir dir             makes a directory called dir
cd dir                changes the current directory to dir
rmdir dir             removes the directory dir, provided it is empty
grep pattern file     prints all lines of the file that contain the pattern
grep -c pattern file  counts how many lines in the file contain the pattern
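
As a brief illustration of how these commands combine, the sketch below builds a tiny FASTA file and inspects it. The directory name booklet_demo, the file name demo.fasta and the three sequences are made up for this example.

```shell
# Work in a scratch directory (hypothetical name); -p avoids an error if it already exists
mkdir -p booklet_demo
cd booklet_demo

# Create a small FASTA file with three made-up sequences
printf '>seq1\nATGCGT\n>seq2\nGGGTTTAAA\n>seq3\nATGAAATAG\n' > demo.fasta

pwd                      # show where you are
ls -lh                   # list files with sizes and permissions
head -n 2 demo.fasta     # first two lines: header and sequence of seq1
tail -n 2 demo.fasta     # last two lines: header and sequence of seq3
wc -l demo.fasta         # number of lines (6)
grep -c '>' demo.fasta   # number of sequences = number of header lines (3)
grep 'ATG' demo.fasta    # lines containing ATG

# Copy, rename, then remove the copy again
cp demo.fasta copy.fasta
mv copy.fasta renamed.fasta
rm renamed.fasta
cd ..
```

Note the quotes around '>' in the grep command: without them the shell would treat > as a redirection and overwrite the file.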

Prodigal

Prodigal is a gene-finding tool for bacteria and archaea. A special mode can be used for certain bacteria which have non-standard genetic codes. Prodigal outputs the coordinates of the genes found, and it can also write their nucleotide sequences and their translations into protein. To get both sequence files, run Prodigal from the command line in the following way:

$ prodigal -i input_file -d nucleotide_output_file -a aminoacid_output_file

Prodigal is available on Bioconda. To install it, run

$ conda install -c bioconda prodigal 
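
Each FASTA header that Prodigal writes to its sequence output files records the coordinates of the gene, so a simple gene table can be pulled out with standard terminal tools. The following is a minimal sketch, assuming headers of the form ">id # start # end # strand # attributes"; the file name proteins.faa and its two records are invented stand-ins for real Prodigal output.

```shell
# Stand-in for the file produced by: prodigal -i input_file -a proteins.faa
# (both records below are made up for illustration)
cat > proteins.faa <<'EOF'
>contig1_1 # 3 # 23 # 1 # ID=1_1;partial=00;start_type=ATG
MKRLLA*
>contig1_2 # 150 # 179 # -1 # ID=1_2;partial=00;start_type=GTG
MSTNPKPQR*
EOF

# Count predicted genes (one header line per gene)
grep -c '>' proteins.faa

# Print gene id, start, end and strand as tab-separated columns.
# Fields are split on " # "; strand 1 is forward, -1 is reverse.
awk -F' # ' '/^>/ { sub(/^>/, "", $1); print $1 "\t" $2 "\t" $3 "\t" $4 }' proteins.faa > genes.tsv
cat genes.tsv
```

The resulting genes.tsv (a hypothetical name) can be opened in a spreadsheet or processed further with cut, sort and grep.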

GenScan: http://hollywood.mit.edu/GENSCAN.html

GenScan is a web-based tool for finding genes and exons in nucleotide sequences. It is meant for vertebrates and certain plants. If the sequences to be scanned are too large, it is possible to download GenScan and run it from the command line.

BLAST: Basic Local Alignment Search Tool

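BLAST is a tool for comparing sequences to each other. It can be used simply to compare two sequences, or to compare a sequence of interest against a very large database; searching against a database is the standard usage. The BLAST suite includes many different tools. The main ones are blastn (nucleotide query against a nucleotide database), blastp (protein query against a protein database), blastx (translated nucleotide query against a protein database), tblastn (protein query against a translated nucleotide database) and tblastx (translated nucleotide query against a translated nucleotide database).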

Blast can be run online or locally from the command line. In the latter case, you can build your own database of relevant reference sequences.

Online BLAST: https://blast.ncbi.nlm.nih.gov

The BLAST server includes several databases. The most popular ones are nr (the non-redundant protein database), which pools sequences from several source databases and is therefore very large, and RefSeq, which includes only well-annotated, carefully selected reference sequences. In some cases, instead of giving single proteins as hits, BLAST will give whole annotated genomes. In this case, one must open the genome in question, go to the position of the match (marked in the BLAST output) and read the annotation there.

To compare sequences against each other, one must check the box “Align two or more sequences” in online Blast.

BLAST+

BLAST+ is a set of command line tools that have the same functionality as online BLAST but use custom, locally built databases. To format a BLAST database, use the command makeblastdb as follows:

$ makeblastdb -dbtype type -in input_fasta_file -out database_name

where type is either prot (for a protein database) or nucl (for a nucleotide database). Several files will be created (three or more, depending on the BLAST+ version), each named database_name plus an extension. To run a nucleotide BLAST search, type:

$ blastn -query query_fasta_file -db database_name -evalue threshold -outfmt 7 -out output_filename.txt

The commands blastp, blastx, tblastn and tblastx can be used analogously. To see all the options available for a given tool, type the name of the tool followed by the flag -help. Note that the default e-value threshold (10) is very high and will give many false positives. It is usually a good idea to use a much lower value, such as 1e-10.
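
With -outfmt 7, BLAST+ writes one tab-separated line per hit, plus comment lines that start with #. The sketch below shows one way to filter such a file after the search, assuming the default twelve columns (subject ID in column 2, percent identity in column 3, e-value in column 11). The file hits.txt and its three hits are invented stand-ins for real BLAST output, and the 1e-10 cut-off is only an example.

```shell
# Stand-in for the file written by: blastn ... -outfmt 7 -out hits.txt
# (comment lines and hits below are made up for illustration)
printf '%s\n' '# Query: query1' '# 3 hits found' > hits.txt
printf 'query1\trefA\t98.5\t200\t3\t0\t1\t200\t51\t250\t2e-95\t350\n' >> hits.txt
printf 'query1\trefB\t91.0\t100\t9\t0\t1\t100\t10\t109\t4e-30\t120\n' >> hits.txt
printf 'query1\trefC\t100.0\t12\t0\t0\t5\t16\t300\t311\t0.85\t22.9\n' >> hits.txt

# Count all hits (every line that is not a comment)
grep -vc '^#' hits.txt

# Keep only hits with e-value <= 1e-10; print subject, % identity and e-value.
# Adding 0 forces awk to compare the e-value as a number, not as text.
awk -F'\t' '!/^#/ && $11 + 0 <= 1e-10 { print $2 "\t" $3 "\t" $11 }' hits.txt > strong_hits.tsv
cat strong_hits.tsv
```

Here the short, perfect match to refC is dropped: its e-value of 0.85 means a match of that quality is expected by chance.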

BLAST+ is available on Bioconda. To install it, run:

$ conda install -c bioconda blast 

tRNAscan: http://lowelab.ucsc.edu/tRNAscan-SE/

tRNAscan is a tool for identifying transfer RNA in nucleotide sequences. It can be run online or downloaded to be run locally.

tRNAscan is available on Bioconda. To install it, run

$ conda install -c bioconda trnascan-se 

Barrnap

Barrnap is a tool for finding ribosomal RNA in nucleotide sequences. It can take bacterial, archaeal and eukaryotic sequences.

Barrnap is available on Bioconda. To install it, run

$ conda install -c bioconda barrnap 

To run Barrnap and save the sequences of the rRNA genes it finds to output.fasta:

$ barrnap input.fasta --outseq output.fasta
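
Barrnap reports the location of each rRNA gene in GFF3 format on standard output, which can be saved to a file and summarised with the terminal commands listed earlier. A minimal sketch, assuming the gene type is stored as Name=... in the ninth column; the file name rrna.gff and its three records are invented stand-ins for real Barrnap output.

```shell
# Stand-in for: barrnap input.fasta > rrna.gff
# (the three records below are made up for illustration)
printf '##gff-version 3\n' > rrna.gff
printf 'contig1\tbarrnap:0.9\trRNA\t100\t1620\t0\t+\t.\tName=16S_rRNA;product=16S ribosomal RNA\n' >> rrna.gff
printf 'contig1\tbarrnap:0.9\trRNA\t1900\t4800\t0\t+\t.\tName=23S_rRNA;product=23S ribosomal RNA\n' >> rrna.gff
printf 'contig2\tbarrnap:0.9\trRNA\t50\t160\t0\t-\t.\tName=5S_rRNA;product=5S ribosomal RNA\n' >> rrna.gff

# List contig, start, end and strand for the 16S genes only
awk -F'\t' '!/^#/ && $9 ~ /Name=16S_rRNA/ { print $1 "\t" $4 "\t" $5 "\t" $7 }' rrna.gff > 16S.tsv
cat 16S.tsv

# Count how many genes of each rRNA type were found
grep -v '^#' rrna.gff | cut -f9 | cut -d';' -f1 | sort | uniq -c
```

The 16S coordinates are a convenient starting point when a sequence is needed for classification or a phylogenetic tree.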

Multiple sequence aligners

These three multiple sequence alignment tools are hosted at EBI. They run different algorithms in the background, but the user interface is always the same. The sequences to be aligned are pasted into a text box or uploaded from a file. Both protein and nucleotide sequences are accepted, in a variety of formats. Several output formats can also be chosen; the most used ones are FASTA and ClustalW. With the ClustalW option, you can choose to colour amino acids according to their chemical properties, which makes the alignment easier to inspect visually.

Weblogo: http://weblogo.berkeley.edu/logo.cgi

Weblogo is a tool for producing logos of conserved sequences based on short multiple alignments. The sequences are pasted or uploaded in FASTA or ClustalW format, and an image is generated in the chosen format and size.

InterPro/Pfam: http://pfam.xfam.org/

InterPro is a comprehensive bioinformatics resource that integrates data from various protein domain and family databases, and Pfam is one of the databases integrated into InterPro. Pfam is a specific protein domain and family database that uses Hidden Markov Models (HMMs) to represent protein domains and families, enabling the identification of conserved regions in protein sequences.

UniProtKB: http://www.uniprot.org/

UniProtKB is a high-quality annotated protein database. The annotation is done either manually (collected in the Swiss-Prot database) or automatically (the TrEMBL database).

Deep TMHMM: http://www.cbs.dtu.dk/services/TMHMM-2.0/

Deep TMHMM is a tool for predicting transmembrane domains from amino acid sequences in FASTA format. The output is a partitioning of your protein sequence into regions inside the cell, regions outside the cell and regions inside the membrane, together with a plot showing the probability that each amino acid belongs to each type of region.

Philius: http://www.yeastrc.org/philius

Philius is a tool for predicting transmembrane domains and signal peptides from an amino acid sequence (FASTA-format input is supported only through the e-mail submission form). The output is a confidence measure of the sequence being transmembrane and a partitioning of your protein sequence into regions inside the cell, regions outside the cell and regions inside the membrane, together with a confidence measure for each region (press the “show list” link next to “Predicted protein segments” to view these statistics).

Silva Classifier: https://www.arb-silva.de/aligner/

SILVA is an rRNA database project that provides a comprehensive online resource of quality-checked and aligned ribosomal RNA sequence data. It can be used to classify ribosomal RNA sequences or subsequences phylogenetically by checking the “Search and Classify” box.

RDP Hierarchy browser: http://rdp.cme.msu.edu/hierarchy/hb_intro.jsp

This tool helps you obtain the ribosomal RNA sequences of many different species from the Ribosomal Database Project. You can search for and select the organisms of interest and download their rRNA sequences, which could, for example, be used for a phylogenetic analysis. For most organisms multiple rRNA sequences are listed; pick just one of them if you want to make a phylogenetic tree.

Galaxy: https://usegalaxy.org/

Galaxy is an open-source, web-based platform for data-intensive biomedical research. The interface is divided into three panels: Tools (left), Display (center) and History (right). You use the Tools panel to upload data and select tools to run. Every time you upload data or run a tool, a new item appears in the History panel. From the History panel you can choose to view your raw data or the results from the tools you have used, which are then shown in the Display panel. Some files are in binary format (for example BAM files) and cannot be viewed; if you choose to view them, they are downloaded to your computer instead.

When you need to execute the same tool on a number of datasets, there is an option available to run them all at once in parallel (as shown in the figure below).

Most, if not all, of the tools available in Galaxy are also available as open-source software that can be run from the command line. While that may be the ‘standard’ way to run these tools, the Galaxy environment is a great platform for getting familiar with the programs, data files and results.

A good place to start learning about Galaxy is Galaxy 101.

Following is a short list of the tools in Galaxy, some of which you will be using through Galaxy in the labs: