Bioinformatics Packages

Package List

ANGSD

ANGSD is software for analyzing next-generation sequencing data. It can handle a number of different input types, from mapped reads to imputed genotype probabilities. Most methods take genotype uncertainty into account instead of basing the analysis on called genotypes, which is especially useful for low- and medium-depth data. The software is written in C++ and has been used on large sample sizes.

Usage, version 0.921

module load angsd/0.921

Usage, version 0.931 with 2019-11-05 bugfix

module load angsd/2019-11-05

Example: /share/Apps/examples/angsd

#!/bin/bash

#SBATCH -p lts
#SBATCH -t 60
#SBATCH -n 1
#SBATCH -N 1

echo "This example downloads sample data if not present"

if [[ ! -d bams ]]; then
  if [[ ! -f bams.tar.gz ]]; then
    wget http://popgen.dk/software/download/angsd/bams.tar.gz
  fi
  # Extract even when the archive was already downloaded by an earlier run
  tar -xvzf bams.tar.gz
  module load samtools/1.10
  for i in bams/*.bam
  do
    samtools index "$i"
  done
  ls bams/*.bam > bam.filelist
  module unload samtools/1.10
fi

module load angsd/2019-11-05

angsd -b bam.filelist -GL 1 -doMajorMinor 1 -doMaf 2 -P 5

angsd -b bam.filelist -GL 1 -doMajorMinor 1 -doMaf 2 -P 5 -minMapQ 30 -minQ 20 -minMaf 0.05

For more information visit http://www.popgen.dk/angsd/index.php/ANGSD

BamTools

BamTools is a project that provides both a C++ API and a command-line toolkit for reading, writing, and manipulating BAM (genome alignment) files.

Usage, version 2.4.1

module load bamtools/2.4.1

Usage, version 2.5.1

module load bamtools/2.5.1

For more information visit https://github.com/pezmaster31/bamtools/wiki

Bartender

Bartender is a C++ tool designed to process random barcode data. Bartender is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY.

It currently provides three functions:

It extracts barcodes from FASTA or FASTQ files.
It clusters barcode reads and counts the frequency of each cluster.
It generates count trajectories for time-course data.

Usage

module load bartender/1.1

Example

/share/Apps/examples/bartender

For more information visit https://github.com/LaoZZZZZ/bartender-1.1

BayeScan

BayeScan aims at identifying candidate loci under natural selection from genetic data, using differences in allele frequencies between populations. BayeScan is based on the multinomial-Dirichlet model.

Usage

module load bayescan/2.1

Example Script, available at /share/Apps/examples/bayescan

#!/bin/bash

#SBATCH -p lts
#SBATCH -t 60
#SBATCH -n 20
#SBATCH -N 1

#BayeScan can only be run on a single node

# Load BayeScan module

module load bayescan/2.1

# Set number of OpenMP Threads
export OMP_NUM_THREADS=${SLURM_NTASKS}

cd ${SLURM_SUBMIT_DIR}

bayescan_2.1 test_band_intensity_AFLP.txt
# Want to use different number of threads, say 10
bayescan_2.1 test_binary_AFLP.txt -threads 10
# Use SNP genotypes matrix (See manual for more details)
bayescan_2.1 test_genotype_SNP.txt -snp
# Optional input file (discarded.txt) containing list of loci to discard
bayescan_2.1 test_msats.txt -d discarded.txt

bayescan_2.1 test_SNPs.txt -d discarded.txt -snp -threads 10
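The script above exports OMP_NUM_THREADS from SLURM_NTASKS but also hardcodes -threads 10. The following is a minimal sketch, assuming you want the -threads value to follow the job allocation and fall back to 1 outside a job. The helper name pick_threads is hypothetical, and the final line only prints the command it would run.

```shell
#!/bin/bash
# Hypothetical helper: choose a thread count from the SLURM allocation,
# falling back to 1 when SLURM_NTASKS is unset (e.g. on a login node).
pick_threads() {
  local n="${SLURM_NTASKS:-1}"
  # Guard against empty or non-numeric values.
  case "$n" in
    ''|*[!0-9]*) n=1 ;;
  esac
  echo "$n"
}

threads=$(pick_threads)
export OMP_NUM_THREADS="$threads"
# Printed rather than executed, so the sketch runs without BayeScan installed.
echo "bayescan_2.1 test_SNPs.txt -snp -threads $threads"
```

Inside a job submitted with -n 20, this would print the command with -threads 20.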

For more information visit http://cmpg.unibe.ch/software/BayeScan/

BayesTraits

BayesTraits is a computer package for performing analyses of trait evolution among groups of species for which a phylogeny or sample of phylogenies is available. This new package incorporates our earlier and separate programs Multistate, Discrete and Continuous. BayesTraits can be applied to the analysis of traits that adopt a finite number of discrete states, or to the analysis of continuously varying traits. Hypotheses can be tested about models of evolution, about ancestral states and about correlations among pairs of traits.

module load bayestraits

For more information visit http://www.evolution.rdg.ac.uk/BayesTraitsV3.0.2/BayesTraitsV3.0.2.html

bgc

bgc implements Bayesian estimation of genomic clines to quantify introgression at many loci.

Usage

module load bgc/1.03

Example script, available at /share/Apps/examples/bgc

#!/bin/bash

#SBATCH -p eng
#SBATCH -n 1
#SBATCH -t 24:00:00
#SBATCH --qos=nogpu
#SBATCH -J bgctest

cd ${SLURM_SUBMIT_DIR}

module load bgc/1.03

bgc -a p0in.txt -b p1in.txt -h admixedin.txt -M map.txt -O 0 -x 50000 -n 25000 -p 1 -q 1 -N 1 -m 1 -D 0.5 -t 5 -E 0.0001 -d 1 -s 1 -I 0 -u 0.04

estpost -i mcmcout.hdf5 -p LnL -o ln1 -s 2 -w 0
estpost -i mcmcout.hdf5 -p alpha -o a1 -s 2 -w 0

estpost -i mcmcout.hdf5 -p alpha -o a.out -s 0 -c 0.95 -w 0
estpost -i mcmcout.hdf5 -p beta -o b.out -s 0 -c 0.95 -w 0

estpost -i mcmcout.hdf5 -p gamma-quantile -o qa.out -s 0 -c 0.95 -w 0
estpost -i mcmcout.hdf5 -p zeta-quantile -o qb.out -s 0 -c 0.95 -w 0

exit

For more information visit https://sites.google.com/site/bgcsoftware/

BLAST

The Basic Local Alignment Search Tool (BLAST) finds regions of local similarity between sequences. The program compares nucleotide or protein sequences to sequence databases and calculates the statistical significance of matches. BLAST can be used to infer functional and evolutionary relationships between sequences as well as help identify members of gene families.

module load blast-plus
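BLAST reports a statistical significance (E-value) and a bit score for every match. As a rough sketch of how those values are commonly used downstream, the function below keeps the best-scoring hit per query from tabular output. It assumes the default 12-column -outfmt 6 layout; the function name best_hits, the file names in the comments, and the demo rows are made up for illustration.

```shell
#!/bin/bash
# Keep the highest-bitscore hit per query from BLAST tabular output
# (-outfmt 6), dropping hits whose E-value exceeds a threshold.
# Default -outfmt 6 columns: qseqid sseqid pident length mismatch gapopen
# qstart qend sstart send evalue bitscore. Reads from standard input.
best_hits() {
  local max_evalue="${1:-1e-5}"
  awk -F '\t' -v max="$max_evalue" '
    $11 + 0 <= max + 0 && (!($1 in best) || $12 + 0 > score[$1]) {
      best[$1] = $0; score[$1] = $12 + 0
    }
    END { for (q in best) print best[q] }
  ' | LC_ALL=C sort
}

# Typical upstream steps (not run here; file names are placeholders):
#   makeblastdb -in db.fa -dbtype nucl
#   blastn -query query.fa -db db.fa -outfmt 6 -out hits.tsv
#   best_hits 1e-5 < hits.tsv

# Self-contained demo on three made-up hits; only the q1/s1 row survives.
printf 'q1\ts1\t99.0\t100\t1\t0\t1\t100\t1\t100\t1e-50\t180\nq1\ts2\t90.0\t100\t10\t0\t1\t100\t1\t100\t1e-20\t120\nq2\ts3\t80.0\t50\t10\t0\t1\t50\t1\t50\t0.5\t30\n' | best_hits 1e-5
```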

For more information visit https://blast.ncbi.nlm.nih.gov/Blast.cgi

BLAT

BLAT is a bioinformatics tool that performs rapid mRNA/DNA and cross-species protein alignments. According to its author, BLAT is more accurate and 500 times faster than the popular tools existing at the time of its publication for mRNA/DNA alignments, and 50 times faster for protein alignments, at sensitivity settings typically used when comparing vertebrate sequences. (Source: Kent, W.J. 2002. BLAT -- The BLAST-Like Alignment Tool. Genome Research 12(4): 656-664.)

BLAT is not BLAST. DNA BLAT works by keeping an index of the entire genome (but not the genome itself) in memory. Since the index takes up a bit less than a gigabyte of RAM, BLAT can deliver high performance on a reasonably priced Linux box.

module load blat

For more information, visit https://genome.ucsc.edu/cgi-bin/hgBlat or http://www.kentinformatics.com/

Bowtie

Bowtie is an ultrafast, memory-efficient aligner for short DNA sequences (reads) from next-generation sequencers. Please cite: Langmead B, et al. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biology 10:R25 (2009).

module load bowtie

For more information visit https://sourceforge.net/projects/bowtie-bio/

Bowtie2

Bowtie 2 is an ultrafast and memory-efficient tool for aligning sequencing reads to long reference sequences. It is particularly good at aligning reads of about 50 up to 100s or 1,000s of characters, and particularly good at aligning to relatively long (e.g. mammalian) genomes. Bowtie 2 indexes the genome with an FM Index to keep its memory footprint small: for the human genome, its memory footprint is typically around 3.2 GB. Bowtie 2 supports gapped, local, and paired-end alignment modes.

Usage, version 2.3.4.1

module load bowtie2/2.3.4.1

Usage, version 2.3.5

module load bowtie2/2.3.5

For more information visit http://bowtie-bio.sourceforge.net/bowtie2/index.shtml

BWA

BWA is a software package for mapping DNA sequences against a large reference genome, such as the human genome.

Usage, version 0.7.15

module load bwa/0.7.15

Usage, version 0.7.17

module load bwa/0.7.17
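BWA is usually followed by a sort-and-index step with SAMtools. The sketch below prints a SLURM job script for that workflow instead of running it, so it can be inspected first. It assumes the lts partition and the bwa/0.7.17 and samtools/1.9 modules listed on this page; the function name make_bwa_job, the file names, and the default of 4 threads are hypothetical.

```shell
#!/bin/bash
# Print a SLURM job script that aligns paired-end reads with BWA-MEM and
# produces a sorted, indexed BAM. Nothing is executed here; redirect the
# output to a file and submit it with sbatch.
# Arguments: reference FASTA, read 1 FASTQ, read 2 FASTQ, output prefix,
# and an optional thread count (default 4).
make_bwa_job() {
  local ref="$1" r1="$2" r2="$3" out="$4" threads="${5:-4}"
  cat <<EOF
#!/bin/bash
#SBATCH -p lts
#SBATCH -t 60
#SBATCH -n ${threads}
#SBATCH -N 1

module load bwa/0.7.17
module load samtools/1.9

cd \${SLURM_SUBMIT_DIR}

# Build the BWA index once; skip if it already exists.
[[ -f ${ref}.bwt ]] || bwa index ${ref}

# Align, then sort the alignments into a BAM file and index it.
bwa mem -t ${threads} ${ref} ${r1} ${r2} | samtools sort -o ${out}.sorted.bam -
samtools index ${out}.sorted.bam
EOF
}

make_bwa_job ref.fa reads_1.fq reads_2.fq sample1
```

To use it, redirect the output to a file (for example, make_bwa_job ref.fa reads_1.fq reads_2.fq sample1 > bwa_job.sh) and submit that file with sbatch.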

For more information visit https://github.com/lh3/bwa

Cactus

Cactus is a reference-free whole-genome multiple alignment program. The principal algorithms are described here: https://doi.org/10.1101/gr.123356.111

Canu

Canu is a fork of the Celera Assembler designed for high-noise single-molecule sequencing (such as the PacBio RSII or Oxford Nanopore MinION).

module load canu

For more information visit https://canu.readthedocs.io/en/latest/

eXpress

eXpress is a streaming DNA/RNA sequence quantification tool. It has initially been tested for RNA-Seq transcriptome quantification but can be used in any application where abundances of target sequences need to be estimated from short reads sequenced from them. More details, installation instructions, and the manual can be found at http://bio.math.berkeley.edu/eXpress/

module load express

Fastqc

A quality control tool for high throughput sequence data.

module load fastqc

For more information visit http://www.bioinformatics.babraham.ac.uk/projects/fastqc/

FASTX-Toolkit

The FASTX-Toolkit is a collection of command line tools for Short-Reads FASTA/FASTQ files preprocessing.

Usage

module load fastx-toolkit/0.0.14
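Before preprocessing, it can help to sanity-check a FASTQ file. The following is a small sketch in plain awk, not part of the FASTX-Toolkit. It assumes an uncompressed file with the common four-lines-per-record layout; the function name fastq_stats and the demo reads are made up.

```shell
#!/bin/bash
# Count reads and mean read length in an uncompressed FASTQ file.
# Assumes the common 4-line-per-record layout (no wrapped sequences).
fastq_stats() {
  awk 'NR % 4 == 2 { reads++; bases += length($0) }
       END {
         if (reads == 0) { print "reads=0 mean_length=0"; exit }
         printf "reads=%d mean_length=%.1f\n", reads, bases / reads
       }' "$1"
}

# Tiny demo file (hypothetical data) so the function can be tried as-is.
demo=$(mktemp)
printf '@r1\nACGT\n+\nIIII\n@r2\nACGTAC\n+\nIIIIII\n' > "$demo"
fastq_stats "$demo"
rm -f "$demo"
```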

For more information visit http://hannonlab.cshl.edu/fastx_toolkit/index.html

FreeBayes

FreeBayes is a Bayesian genetic variant detector designed to find small polymorphisms, specifically SNPs (single-nucleotide polymorphisms), indels (insertions and deletions), MNPs (multi-nucleotide polymorphisms), and complex events (composite insertion and substitution events) smaller than the length of a short-read sequencing alignment.

Usage

module load freebayes/1.1.0

For more information visit https://github.com/ekg/freebayes

GATK

A genomic analysis toolkit focused on variant discovery. The GATK is the industry standard for identifying SNPs and indels in germline DNA and RNAseq data. Its scope is now expanding to include somatic short variant calling, and to tackle copy number (CNV) and structural variation (SV). In addition to the variant callers themselves, the GATK also includes many utilities to perform related tasks such as processing and quality control of high-throughput sequencing data, and bundles the popular Picard toolkit.

These tools were primarily designed to process exomes and whole genomes generated with Illumina sequencing technology, but they can be adapted to handle a variety of other technologies and experimental designs. And although it was originally developed for human genetics, the GATK has since evolved to handle genome data from any organism, with any level of ploidy.

module load gatk

For more information visit https://gatk.broadinstitute.org/hc/en-us

Guppy

Hal

Produces multiple alignments and trees from genomic data. Hal is a phylogenetic pipeline. The alignments can be produced by a choice of four alignment programs and analyzed by a variety of phylogenetic programs. The Hal pipeline connects the programs BLASTP, MCL, user specified alignment programs, GBlocks, ProtTest and user specified phylogenetic programs to produce species trees.

JELLYFISH

JELLYFISH is a tool for fast, memory-efficient counting of k-mers in DNA. A k-mer is a substring of length k, and counting the occurrences of all such substrings is a central step in many analyses of DNA sequence. JELLYFISH can count k-mers using an order of magnitude less memory and an order of magnitude faster than other k-mer counting packages by using an efficient encoding of a hash table and by exploiting the "compare-and-swap" CPU instruction to increase parallelism.

JELLYFISH is a command-line program that reads FASTA and multi-FASTA files containing DNA sequences. It outputs its k-mer counts in a binary format, which can be translated into a human-readable text format using the "jellyfish dump" command. See the documentation below for more details.
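To make the idea of k-mer counting concrete, here is a toy sketch in awk. It is not how JELLYFISH works internally (JELLYFISH uses a compact hash table and parallel updates), and it counts k-mers on a single strand of one sequence string only. The function name count_kmers is hypothetical.

```shell
#!/bin/bash
# Count k-mers in a DNA string with awk (illustration only; JELLYFISH is
# far faster and reads FASTA input).
# Usage: count_kmers SEQUENCE K  -> prints "kmer count" lines, sorted by k-mer.
count_kmers() {
  printf '%s\n' "$1" | awk -v k="$2" '
    {
      seq = toupper($0)
      # Slide a window of width k across the sequence.
      for (i = 1; i + k - 1 <= length(seq); i++)
        counts[substr(seq, i, k)]++
    }
    END { for (m in counts) print m, counts[m] }
  ' | LC_ALL=C sort
}

# The 8-base string below contains six 3-mers: ACG and CGT twice each,
# GTA and TAC once each.
count_kmers ACGTACGT 3
```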

module load jellyfish

For more information visit http://www.cbcb.umd.edu/software/jellyfish/

Kallisto

kallisto is a program for quantifying abundances of transcripts from bulk and single-cell RNA-Seq data, or more generally of target sequences using high-throughput sequencing reads. It is based on the novel idea of pseudoalignment for rapidly determining the compatibility of reads with targets, without the need for alignment. On benchmarks with standard RNA-Seq data, kallisto can quantify 30 million human reads in less than 3 minutes on a Mac desktop computer using only the read sequences and a transcriptome index that itself takes less than 10 minutes to build. Pseudoalignment of reads preserves the key information needed for quantification, and kallisto is therefore not only fast, but also as accurate as existing quantification tools. In fact, because the pseudoalignment procedure is robust to errors in the reads, in many benchmarks kallisto significantly outperforms existing tools.

module load kallisto

kallisto is described in detail in:

Nicolas L Bray, Harold Pimentel, Páll Melsted and Lior Pachter, Near-optimal probabilistic RNA-seq quantification, Nature Biotechnology 34, 525–527 (2016), doi:10.1038/nbt.3519

For more information visit http://pachterlab.github.io/kallisto

Miniasm

Miniasm is a very fast OLC-based de novo assembler for noisy long reads. It takes all-vs-all read self-mappings (typically by minimap) as input and outputs an assembly graph in the GFA format. Different from mainstream assemblers, miniasm does not have a consensus step. It simply concatenates pieces of read sequences to generate the final unitig sequences. Thus the per-base error rate is similar to the raw input reads.

module load miniasm
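Because miniasm writes an assembly graph in GFA rather than FASTA, a conversion step is usually needed before downstream tools can read the unitigs. The sketch below extracts the segment (S) lines with awk. The function name gfa_to_fasta, the file names in the comments, and the demo record are placeholders, and the upstream commands are shown as a typical invocation rather than a tested recipe for this cluster.

```shell
#!/bin/bash
# Convert the segment (S) lines of a GFA file into FASTA records.
# In GFA, an S line holds: S <TAB> name <TAB> sequence [<TAB> tags...].
gfa_to_fasta() {
  awk -F '\t' '$1 == "S" { print ">" $2; print $3 }' "$1"
}

# Typical upstream steps (not run here; file names are placeholders):
#   minimap2 -x ava-ont -t 8 reads.fq reads.fq | gzip -1 > reads.paf.gz
#   miniasm -f reads.fq reads.paf.gz > reads.gfa
#   gfa_to_fasta reads.gfa > unitigs.fa

# Self-contained demo on a two-line GFA fragment (made-up data).
demo=$(mktemp)
printf 'S\tutg000001l\tACGTACGT\tLN:i:8\nL\tutg000001l\t+\tutg000002l\t-\t0M\n' > "$demo"
gfa_to_fasta "$demo"
rm -f "$demo"
```

Since miniasm has no consensus step, the resulting sequences keep roughly the error rate of the raw reads.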

For more information visit https://github.com/lh3/miniasm

Minimap2

Minimap2 is a versatile sequence alignment program that aligns DNA or mRNA sequences against a large reference database. Typical use cases include: (1) mapping PacBio or Oxford Nanopore genomic reads to the human genome; (2) finding overlaps between long reads with error rate up to ~15%; (3) splice-aware alignment of PacBio Iso-Seq or Nanopore cDNA or Direct RNA reads against a reference genome; (4) aligning Illumina single- or paired-end reads; (5) assembly-to-assembly alignment; (6) full-genome alignment between two closely related species with divergence below ~15%.

For ~10kb noisy read sequences, minimap2 is tens of times faster than mainstream long-read mappers such as BLASR, BWA-MEM, NGMLR and GMAP. It is more accurate on simulated long reads and produces biologically meaningful alignments ready for downstream analyses. For >100bp Illumina short reads, minimap2 is three times as fast as BWA-MEM and Bowtie2, and as accurate on simulated data. Detailed evaluations are available from the minimap2 paper or the preprint.

module load minimap2

For more information visit https://github.com/lh3/minimap2

NGSTools

NGS (Next-Generation Sequencing) technologies have revolutionised population genetic research by enabling unparalleled data collection from the genomes or subsets of genomes from many individuals. Current technologies produce short fragments of sequenced DNA called reads that are either de novo assembled or mapped to a pre-existing reference genome. This leads to chromosomal positions being sequenced a variable number of times across the genome. This parameter is usually referred to as the sequencing depth. Individual genotypes are then inferred from the proportion of nucleotide bases covering each site after the reads have been aligned.
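As a concrete illustration of sequencing depth, the sketch below averages per-site depth values. It assumes three-column, tab-separated input (chromosome, position, depth), which is the layout printed by samtools depth. The function name mean_depth and the demo values are made up.

```shell
#!/bin/bash
# Mean sequencing depth from three-column "chrom<TAB>pos<TAB>depth" text,
# the layout printed by `samtools depth`. Reads from standard input.
mean_depth() {
  awk -F '\t' '{ total += $3; sites++ }
       END { if (sites > 0) printf "%.2f\n", total / sites; else print "NA" }'
}

# Demo on made-up values: depths 2, 4 and 9 average to 5.00.
printf 'chr1\t1\t2\nchr1\t2\t4\nchr1\t3\t9\n' | mean_depth
```

With real data this would be used as samtools depth sample.bam | mean_depth (sample.bam is a placeholder). By default samtools depth omits positions with zero coverage, so the mean is taken over reported sites only; its -a option includes them.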

Low sequencing depth and high error rates stemming from base calling and mapping errors can cause SNP (Single Nucleotide Polymorphism) and genotype calling from NGS data to be associated with considerable statistical uncertainty. Probabilistic models, which take these errors into account, have been proposed to accurately assign genotypes and estimate allele frequencies (e.g. Nielsen et al., 2012; for a review Nielsen et al., 2011).

ngsTools is a collection of programs for population genetics analyses from NGS data, taking into account the statistical uncertainty of the data. The methods implemented in these programs do not rely on SNP or genotype calling, and are particularly suitable for low sequencing depth data. An application note illustrating its use has been published (Fumagalli et al., 2014).

module load ngstools

For more information visit https://github.com/mfumagalli/ngsTools

PAML

PAML is a package of programs for phylogenetic analyses of DNA or protein sequences using maximum likelihood.

module load paml

For more information visit http://abacus.gene.ucl.ac.uk/software/paml.html

PEAR

PEAR is an ultrafast, memory-efficient and highly accurate paired-end read merger. It is fully parallelized and can run with as little as a few kilobytes of memory.

PEAR evaluates all possible paired-end read overlaps without requiring the target fragment size as input. In addition, it implements a statistical test for minimizing false-positive results. Together with a highly optimized implementation, it can merge millions of paired-end reads within a couple of minutes on a standard desktop computer.

module load pear

For more information visit https://cme.h-its.org/exelixis/web/software/pear/

PHAST

Phylogenetic Analysis with Space/Time models (PHAST) is a freely available software package consisting of a collection of command-line programs and supporting libraries for comparative and evolutionary genomics. Best known as the search engine behind the Conservation tracks in the University of California, Santa Cruz (UCSC) Genome Browser, PHAST also includes several tools for phylogenetic modeling, functional element identification, as well as utilities for manipulating alignments, trees and genomic annotations.

module load phast

For more information visit http://compgen.cshl.edu/phast/index.php

Pilon

Pilon is a software tool which can be used to:

Automatically improve draft assemblies
Find variation among strains, including large event detection

Pilon requires as input a FASTA file of the genome along with one or more BAM files of reads aligned to the input FASTA file. Pilon uses read alignment analysis to identify inconsistencies between the input genome and the evidence in the reads. It then attempts to make improvements to the input genome, including:

Single base differences
Small indels
Larger indel or block substitution events
Gap filling
Identification of local misassemblies, including optional opening of new gaps

Pilon then outputs a FASTA file containing an improved representation of the genome from the read data and an optional VCF file detailing variation seen between the read data and the input genome.

To aid manual inspection and improvement by an analyst, Pilon can optionally produce tracks that can be displayed in genome viewers such as IGV and GenomeView, and it reports other events (such as possible large collapsed repeat regions) in its standard output.

module load pilon

For more information visit https://github.com/broadinstitute/pilon/wiki

Porechop

Porechop is a tool for finding and removing adapters from Oxford Nanopore reads. Adapters on the ends of reads are trimmed off, and when a read has an adapter in its middle, it is treated as chimeric and chopped into separate reads. Porechop performs thorough alignments to effectively find adapters, even at low sequence identity.

Porechop also supports demultiplexing of Nanopore reads that were barcoded with the Native Barcoding Kit, PCR Barcoding Kit or Rapid Barcoding Kit.

module load porechop

For more information visit https://github.com/rrwick/Porechop

Relion

RELION (for REgularised LIkelihood OptimisatioN) is a stand-alone computer program for Maximum A Posteriori refinement of (multiple) 3D reconstructions or 2D class averages in cryo-electron microscopy. It is developed in the research group of Sjors Scheres at the MRC Laboratory of Molecular Biology.

module load relion

For more information visit https://github.com/3dem/relion

RSEM

RSEM is a software package for estimating gene and isoform expression levels from RNA-Seq data. The RSEM package provides a user-friendly interface and supports threads for parallel computation of the EM algorithm, single-end and paired-end read data, quality scores, variable-length reads and RSPD estimation. In addition, it provides posterior mean and 95% credibility interval estimates for expression levels. For visualization, it can generate BAM and Wiggle files in both transcript and genomic coordinates. Genomic-coordinate files can be visualized by both the UCSC Genome Browser and the Broad Institute's Integrative Genomics Viewer (IGV). Transcript-coordinate files can be visualized by IGV. RSEM also has its own scripts to generate transcript read depth plots in PDF format. A unique feature of RSEM is that the read depth plots can be stacked, with read depth contributed by unique reads shown in black and read depth contributed by multi-reads shown in red. In addition, models learned from data can also be visualized. Last but not least, RSEM contains a simulator.

module load rsem

For more information visit https://github.com/deweylab/RSEM

Salmon

Salmon is a tool for quantifying the expression of transcripts using RNA-seq data. Salmon uses new algorithms (specifically, coupling the concept of quasi-mapping with a two-phase inference procedure) to provide accurate expression estimates very quickly (i.e. wicked-fast) and while using little memory. Salmon performs its inference using an expressive and realistic model of RNA-seq data that takes into account experimental attributes and biases commonly observed in real RNA-seq data.

module load salmon

For more information visit http://combine-lab.github.io/salmon/

SAMTools

SAMtools provides various utilities for manipulating alignments in the SAM format, including sorting, merging, indexing and generating alignments in a per-position format.

Usage, version 1.4

module load samtools/1.4

Usage, version 1.9

module load samtools/1.9

For more information visit http://www.htslib.org/

SnpEff

SnpEff is a variant annotation and effect prediction tool. It annotates and predicts the effects of genetic variants (such as amino acid changes).

module load snpeff

For more information visit http://snpeff.sourceforge.net/

STAR

STAR is an ultrafast universal RNA-seq aligner.

module load star

For more information visit https://github.com/alexdobin/STAR

tabix

tabix is a generic indexer for TAB-delimited genome position files.

Usage

module load tabix/2013-12-16

For more information visit https://github.com/samtools/tabix

trimmomatic

A flexible read trimming tool for Illumina NGS data.

Usage, version 0.36

module load trimmomatic/0.36

Usage, version 0.38

module load trimmomatic/0.38

For more information visit http://www.usadellab.org/cms/?page=trimmomatic

Trinity

Trinity, developed at the Broad Institute and the Hebrew University of Jerusalem, represents a novel method for the efficient and robust de novo reconstruction of transcriptomes from RNA-seq data. Trinity combines three independent software modules: Inchworm, Chrysalis, and Butterfly, applied sequentially to process large volumes of RNA-seq reads. Trinity partitions the sequence data into many individual de Bruijn graphs, each representing the transcriptional complexity at a given gene or locus, and then processes each graph independently to extract full-length splicing isoforms and to tease apart transcripts derived from paralogous genes.

module load trinity

For more information visit http://trinityrnaseq.github.io/

VCFTools

VCFtools is a program package designed for working with VCF files, such as those generated by the 1000 Genomes Project. The aim of VCFtools is to provide easily accessible methods for working with complex genetic variation data in the form of VCF files.

Usage, version 0.1.14

module load vcftools/0.1.14

Usage, version 0.1.15

module load vcftools/0.1.15
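For a quick look at a VCF file before running VCFtools, the sketch below counts variant records per chromosome with awk. It assumes an uncompressed VCF; the function name vcf_counts and the demo records are made up.

```shell
#!/bin/bash
# Count variant records per chromosome in an uncompressed VCF file.
# Header lines start with "#"; each remaining line is one record whose
# first tab-separated column is the chromosome name.
vcf_counts() {
  awk -F '\t' '!/^#/ && NF > 0 { n[$1]++ }
       END { for (c in n) print c, n[c] }' "$1" | LC_ALL=C sort
}

# Demo on a made-up three-record VCF.
demo=$(mktemp)
printf '##fileformat=VCFv4.2\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n' > "$demo"
printf 'chr1\t10\t.\tA\tG\t50\tPASS\t.\nchr1\t20\t.\tC\tT\t50\tPASS\t.\nchr2\t5\t.\tG\tGA\t50\tPASS\t.\n' >> "$demo"
vcf_counts "$demo"
rm -f "$demo"
```

VCFtools itself can then subset or filter the file, for example with vcftools --vcf input.vcf --chr chr1 --recode --out chr1_only, where input.vcf and chr1_only are placeholder names.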

For more information visit http://vcftools.sourceforge.net/