Here we store material for the course CB2442, Bioinformatics.
This project is maintained by kth-gt
There is a seemingly endless supply of bioinformatics tools. The field of bioinformatics itself is very broad and, on top of that, there are often many ways of solving a problem, which will be more or less adequate in different scenarios. This list includes all the tools that you need to complete the labs (plus a few extra), but is by no means exhaustive.
Useful commands in the terminal:
Command | Description |
---|---|
cat file | display an entire file |
less file | show a file one screen at a time; press q to leave |
pwd | print working directory; shows you where you are |
head -n N file | displays the first N lines of a file |
tail -n N file | displays the last N lines of a file |
wc file | displays number of lines, words and bytes in a file |
ls | shows (lists) all files in the current directory |
ls -lh | shows all files in the current directory, including size and permissions |
cp file1 file2 | makes a copy of file1 called file2 |
mv file1 file2 | moves file1 to the name or location file2 |
rm file | erases (removes) file |
mkdir dir | makes a directory called dir |
cd dir | changes current directory to dir |
rmdir dir | removes directory dir, but only if it is empty |
grep pattern file | prints all lines of the file that contain the pattern |
grep -c pattern file | counts how many lines in the file contain the pattern |
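As a small worked example, the commands above can be combined to inspect a FASTA file. This is only a sketch: the directory name, the file names and the three sequences are made up for illustration.

```shell
# Work in a fresh directory so that no existing file is overwritten
# (terminal_demo and toy.fasta are made-up names).
mkdir -p terminal_demo
cd terminal_demo

# Create a toy FASTA file with three short sequences.
printf '>seq1\nATGGCGTAA\n>seq2\nATGAAATGA\n>seq3\nATGCCCTAG\n' > toy.fasta

pwd                      # show where we are
ls -lh                   # list files with sizes and permissions
head -n 2 toy.fasta      # first two lines: header and sequence of seq1
tail -n 2 toy.fasta      # last two lines: header and sequence of seq3

# Every FASTA record starts with '>', so counting header lines counts sequences.
n_seqs=$(grep -c '>' toy.fasta)
echo "Number of sequences: $n_seqs"

# wc -l counts lines; in this toy file every record takes exactly two lines.
n_lines=$(wc -l < toy.fasta)
echo "Number of lines: $n_lines"

# Make a copy, rename it, and remove it again.
cp toy.fasta copy.fasta
mv copy.fasta renamed.fasta
rm renamed.fasta
```

Note that `grep -c '>'` only counts sequences correctly when `>` appears in header lines alone, which is the case for ordinary FASTA files.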
Prodigal is a gene-finding tool for bacteria and archaea. A special mode can be used for certain bacteria which have non-standard genetic codes. Prodigal outputs the coordinates of the genes found and their translation into protein. To also get the nucleotide sequences of the genes, run Prodigal from the command line as follows:
$ prodigal -i input_file -d nucleotide_output_file -a aminoacid_output_file
Prodigal is available on Bioconda. To install it, run:
$ conda install -c bioconda prodigal
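Once Prodigal has run, the gene coordinates can be read from the headers of the protein output file, which have the form `>name # start # end # strand # attributes`. The sketch below does not run Prodigal itself. It works on a made-up stand-in file (`toy_proteins.faa`, with invented headers and truncated placeholder sequences) to show how to count the predicted genes and tabulate their coordinates.

```shell
# Two invented records that imitate Prodigal's protein FASTA output.
# The sequences are truncated placeholders, not real predictions.
printf '%s\n' \
    '>contig1_1 # 3 # 98 # 1 # ID=1_1;partial=00' \
    'MKTAYIAKQRQISFVKSHFSRQ*' \
    '>contig1_2 # 150 # 401 # -1 # ID=1_2;partial=00' \
    'MSLNRERAVALLQEGK*' > toy_proteins.faa

# One header line per predicted gene.
n_genes=$(grep -c '^>' toy_proteins.faa)
echo "Predicted genes: $n_genes"

# Split each header on " # " and print name, start, end and strand
# as a tab-separated table (strand is 1 for forward, -1 for reverse).
coords=$(grep '^>' toy_proteins.faa |
    awk -F ' # ' '{ sub(/^>/, "", $1); print $1 "\t" $2 "\t" $3 "\t" $4 }')
printf '%s\n' "$coords"
```

With a real run, replace `toy_proteins.faa` with the file you passed to the `-a` option.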
GenScan is a web-based tool for finding genes and exons in nucleotide sequences. It is meant for vertebrates and certain plants. If the sequences to be scanned are too large, it is possible to download GenScan and run it from the command line.
Blast is a tool for comparing sequences to each other. It can be used simply to compare two sequences, or to compare a sequence of interest against a very large database. The standard usage of Blast is to compare against a database. The Blast suite includes many different tools; the main ones are:
Other Blast tools are used to compare nucleotide sequences against protein sequences. This is done by translating the nucleotides into proteins in all three possible frames for each DNA strand (six reading frames in total).
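To see where the six frames come from, the sketch below prints the three reading frames of a toy DNA sequence and the three frames of its reverse complement. It only illustrates the idea and is not how Blast is implemented; the sequence and the helper function `print_frames` are made up for this example, and the translation into amino acids is left out.

```shell
# Toy DNA sequence, made up for this illustration.
seq="ATGGCCATTGTAATG"

# Reverse complement: reverse the string, then swap A<->T and C<->G.
revcomp=$(echo "$seq" | rev | tr 'ACGT' 'TGCA')

# Print the three reading frames of one strand by skipping 0, 1 or 2 bases.
print_frames() {
    strand_name=$1
    strand_seq=$2
    for offset in 0 1 2; do
        # cut -c is 1-based, so the frame starts at character offset+1.
        frame=$(echo "$strand_seq" | cut -c "$((offset + 1))-")
        echo "$strand_name frame $((offset + 1)): $frame"
    done
}

print_frames forward "$seq"
print_frames reverse "$revcomp"
```

Each printed frame would then be read three bases at a time, so the same stretch of DNA can give six different protein translations.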
Blast can be run online or locally from the command line. In the latter case, you can build your own database of relevant reference sequences.
The blast server includes several databases. The most popular ones are nr, which includes every sequence ever submitted to the NCBI servers, and RefSeq, which includes only well annotated, carefully selected references. In some cases, instead of giving single proteins as hits, Blast will give whole annotated genomes. In this case, one must open the genome in question, go to the position of the match (marked in the Blast output) and read the annotation there.
To compare sequences against each other, one must check the box “Align two or more sequences” in online Blast.
BLAST+ is a set of command line tools that have the same functionality as online Blast but use custom, locally built databases. To format a BLAST database, use the command makeblastdb as follows:
$ makeblastdb -dbtype type -in input_fasta_file -out database_name
where type is either prot (for a protein database) or nucl (for a nucleotide database). Several files will be created with the name database_name plus an extension (the exact number depends on the BLAST+ version). To run a BLAST nucleotide search, type:
$ blastn -query query_fasta_file -db database_name -evalue threshold -outfmt 7 -out output_filename.txt
The commands blastp, blastx, tblastn, and tblastx can be used analogously. To see all the options available for a given tool, type the name of the desired tool followed by the flag -help. Note that the default e-value threshold (10) is very high and will give many false positives. It is usually a good idea to use a much lower value, such as 10e-10.
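The output written with `-outfmt 7` is a tab-separated table with comment lines starting with `#`. The sketch below shows one way to filter such a file afterwards. It assumes the default column order (e-value in column 11, bit score in column 12) and uses a made-up stand-in file, `toy_blast_output.txt`, whose rows are invented for illustration.

```shell
# Write a small stand-in for a BLAST output file (all rows are invented).
{
    printf '%s\n' '# Fields: query acc.ver, subject acc.ver, % identity, alignment length, mismatches, gap opens, q. start, q. end, s. start, s. end, evalue, bit score'
    printf 'query1\trefA\t98.5\t200\t3\t0\t1\t200\t51\t250\t1e-95\t350\n'
    printf 'query1\trefB\t85.0\t180\t27\t0\t10\t189\t5\t184\t2e-40\t160\n'
    printf 'query1\trefC\t78.2\t55\t12\t0\t100\t154\t7\t61\t0.5\t30.1\n'
} > toy_blast_output.txt

# A literal tab character, used as the field separator for sort.
tab=$(printf '\t')

# Drop the comment lines, then keep hits whose e-value (column 11) is at most 1e-10.
good_hits=$(grep -v '^#' toy_blast_output.txt | awk -F '\t' '($11 + 0) <= 1e-10')
n_hits=$(printf '%s\n' "$good_hits" | wc -l)
echo "Hits passing the threshold: $n_hits"

# Sort the remaining hits by bit score (column 12), highest first,
# and report the subject name (column 2) of the best one.
best_hit=$(printf '%s\n' "$good_hits" | sort -t "$tab" -k12,12nr | head -n 1 | cut -f 2)
echo "Best hit: $best_hit"
```

Filtering afterwards like this is an alternative to setting `-evalue` when running the search; it lets you try several thresholds without rerunning BLAST.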
BLAST+ is available on Bioconda. To install it, run:
$ conda install -c bioconda blast
tRNAscan is a tool for identifying transfer RNA in nucleotide sequences. It can be run online or downloaded to be run locally.
tRNAscan is available on Bioconda. To install it, run:
$ conda install -c bioconda trnascan-se
Barrnap is a tool for finding ribosomal RNA in nucleotide sequences. It can take bacterial, archaeal and eukaryotic sequences.
Barrnap is available on Bioconda. To install it, run:
$ conda install -c bioconda barrnap
To run Barrnap:
$ barrnap input.fasta --outseq output.fasta
These three multiple sequence alignment tools are hosted at EBI. They run different algorithms in the background, but the user interface is always the same. The sequences to be aligned are pasted into a text box or uploaded from a file. Both protein and nucleotide sequences are accepted, in a variety of formats. Several output formats can also be chosen; the most used ones are Fasta and ClustalW. With the ClustalW option, you can choose to colour amino acids according to their chemical properties, which makes the alignment easier to read.
Weblogo is a tool for producing logos of conserved sequences based on short multiple alignments. The sequences, in Fasta or ClustalW format, are pasted or uploaded, and an image is generated in the chosen format and size.
InterPro is a comprehensive bioinformatics resource that integrates data from various protein domain and family databases, and Pfam is one of the databases integrated into InterPro. Pfam is a specific protein domain and family database that uses Hidden Markov Models (HMMs) to represent protein domains and families, enabling the identification of conserved regions in protein sequences.
UniProtKB is a high-quality annotated protein database. The annotation is either done manually (collected in the SwissProt database) or automatically (TrEMBL database).
Deep TMHMM is a tool for predicting transmembrane domains by inputting amino acid sequences in fasta format. The output is a list of partitions of your protein sequence into regions inside/outside the cell and regions inside the membrane, together with a plot showing the probability for each amino acid to be placed in each type of region.
Philius is a tool for predicting transmembrane domains and signal peptides based on an amino acid sequence (fasta format is supported only by submitting it through an e-mail form). The output is a confidence measure of the sequence being transmembrane and a partitioning of your protein sequence into regions inside/outside the cell and regions inside the membrane, together with a confidence measure for each region (press the “show list” link next to “Predicted protein segments” to view these statistics).
An rRNA database project with a comprehensive online resource for quality-checked and aligned ribosomal RNA sequence data. It can be used as a tool for assigning phylogeny to ribosomal RNA sequences or subsequences, by checking the “Search and Classify” box.
This tool helps you to obtain the ribosomal RNA sequences of many different species from the Ribosomal Database Project. You can search for and select the organisms of interest and download their rRNA sequences, which could, for example, be used for a phylogenetic analysis. For most organisms multiple rRNA sequences are listed; just pick one of them if you want to make a phylogenetic tree.
Galaxy is an open source, web-based platform for data-intensive biomedical research. The interface is divided into three panels: Tools (left), Display (center) and History (right). You use the Tools panel to upload data and select tools to run. Every time you upload data or run a tool, a new item appears in the History panel. From the History panel you can choose to view your raw data and the results from the tools you have used, which will then be displayed in the Display panel. Some files are in binary format (for example BAM files) and cannot be viewed. If you choose to view them, they will be downloaded to your computer instead.
When you need to execute the same tool on a number of datasets, there is an option available to run them all at once in parallel (as shown in the figure below).
Most, if not all, of the tools available in Galaxy are also available as open source software to be run from the command line. While that may be the ‘standard’ way to run these tools, the Galaxy environment is a great platform for getting familiar with the programs, the data files and the results.
A good place to start learning about Galaxy is Galaxy 101.
The following is a short list of the tools in Galaxy, some of which you will be using through Galaxy in the labs: