Bioinformatics is an interdisciplinary field that develops methods and software tools for understanding biological data. — Wikipedia
A curated list of awesome Bioinformatics software, resources, and libraries. Mostly command line based, and free or open-source. Please feel free to contribute!
Table of Contents
Package suites
Package suites gather software packages and installation tools for specific languages or platforms. We have some for bioinformatics software.
- Bioconductor - A plethora of tools for analysis and comprehension of high-throughput genomic data, including 1500+ software packages. [ paper-2004 | web ]
- BioJulia - Bioinformatics and computational biology infrastructure for the Julia programming language. [ web ]
- Rust-Bio - Rust implementations of algorithms and data structures useful for bioinformatics. [ paper-2016 ]
- SeqAn - The modern C++ library for sequence analysis.
- GGD - Go Get Data; A command line interface for obtaining genomic data. [ web ]
- SRA-Explorer - Easily get SRA download links and other information. [ web ]
Data Processing
Command Line Utilities
- Bioinformatics One Liners - Git repo of useful single line commands.
- BioNode - Modular and universal bioinformatics. Bionode provides pipeable UNIX command line tools and JavaScript APIs for bioinformatics analysis workflows. [ web ]
- bioSyntax - Syntax Highlighting for Computational Biology file formats (SAM, VCF, GTF, FASTA, PDB, etc…) in vim/less/gedit/sublime. [ paper-2018 | web ]
- CSVKit - Utilities for working with CSV/Tab-delimited files. [ web ]
- csvtk - Another cross-platform, efficient, practical and pretty CSV/TSV toolkit. [ web ]
- datamash - Data transformations and statistics. [ web ]
- easy_qsub - Easily submit PBS jobs with a script template. Multiple input files supported.
- GNU Parallel - General parallelizer that runs jobs in parallel on a single multi-core machine. Here are some example scripts using GNU Parallel. [ web ]
- grabix - A wee tool for random access into BGZF files.
- gsort - Sort genomic files according to a specified order.
- tabix - Table file index. [ paper-2011 ]
- wormtable - Write-once-read-many table for large datasets.
- zindex - Create an index on a compressed text file.
Next Generation Sequencing
Workflow Managers
- BigDataScript - A cross-system scripting language for working with big data pipelines in computer systems of different sizes and capabilities. [ paper-2014 | web ]
- Bpipe - A small language for defining pipeline stages and linking them together to make pipelines. [ web ]
- Common Workflow Language - a specification for describing analysis workflows and tools that are portable and scalable across a variety of software and hardware environments, from workstations to cluster, cloud, and high performance computing (HPC) environments. [ web ]
- Cromwell - A Workflow Management System geared towards scientific workflows. [ web ]
- Galaxy - a popular open-source, web-based platform for data intensive biomedical research. Has several features, from data analysis to workflow management to visualization tools. [ paper-2018 | web ]
- Nextflow (recommended) - A fluent DSL modelled around the UNIX pipe concept, that simplifies writing parallel and scalable pipelines in a portable manner. [ paper-2018 | web ]
- Ruffus - Computation Pipeline library for Python widely used in science and bioinformatics. [ paper-2010 | web ]
- SeqWare - Hadoop Oozie-based workflow system focused on genomics data analysis in cloud environments. [ paper-2010 | web ]
- Snakemake - A workflow management system in Python that aims to reduce the complexity of creating workflows by providing a fast and comfortable execution environment. [ paper-2018 | web ]
- Workflow Description Language - Workflow standard developed by the Broad Institute. [ web ]
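The workflow managers above share one core idea: stages declare what they depend on, and the manager works out a valid execution order. The sketch below illustrates that idea only, using Python's standard-library `graphlib` (Python 3.9+). It is not how any of the listed tools is implemented, and the stage names in `PIPELINE` are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical stages: each key maps to the set of stages it depends on.
PIPELINE = {
    "trim": {"download"},
    "align": {"trim"},
    "qc_report": {"trim"},
    "call_variants": {"align"},
}


def execution_order(dependencies):
    """Return one valid order in which every stage follows its dependencies."""
    return list(TopologicalSorter(dependencies).static_order())


if __name__ == "__main__":
    print(execution_order(PIPELINE))
```

Stages with no dependency between them (here `align` and `qc_report`) may appear in either order, which is also what lets a real workflow manager run them in parallel.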
Pipelines
- Awesome-Pipeline - A list of pipeline resources.
- bcbio-nextgen - Batteries included genomic analysis pipeline for variant and RNA-Seq analysis, structural variant calling, annotation, and prediction. [ web ]
- R-Peridot - Customizable pipeline for differential expression analysis with an intuitive GUI. [ web ]
Sequence Processing
Sequence Processing includes tasks such as demultiplexing raw read data, and trimming low quality bases.
- AfterQC - Automatic Filtering, Trimming, Error Removing and Quality Control for fastq data. [ paper-2017 ]
- FastQC - A quality control tool for high throughput sequence data. [ web ]
- Fastqp - FASTQ and SAM quality control using Python.
- FASTX-Toolkit - FASTQ/A short-reads pre-processing tools: Demultiplexing, trimming, clipping, quality filtering, and masking utilities. [ web ]
- MultiQC - Aggregate results from bioinformatics analyses across many samples into a single report. [ paper-2016 | web ]
- SeqKit - A cross-platform and ultrafast toolkit for FASTA/Q file manipulation in Golang. [ paper-2016 | web ]
- seqmagick - Convenient file format conversion built on Biopython. [ web ]
- Seqtk - Toolkit for processing sequences in FASTA/Q formats.
- smof - UNIX-style FASTA manipulation tools.
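As a rough illustration of the "trimming low quality bases" task mentioned at the top of this section, here is a minimal sketch in plain Python. It assumes Phred+33 quality encoding and a simple rule (drop bases from the 3' end until one reaches the threshold). The function names and the default threshold of 20 are illustrative choices, not taken from any tool listed here, and real trimmers typically use more robust sliding-window or running-sum strategies.

```python
def phred_scores(quality, offset=33):
    """Decode a FASTQ quality string into Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality]


def trim_low_quality_tail(sequence, quality, min_quality=20):
    """Drop bases from the 3' end until one has a score >= min_quality."""
    scores = phred_scores(quality)
    end = len(scores)
    while end > 0 and scores[end - 1] < min_quality:
        end -= 1
    return sequence[:end], quality[:end]


if __name__ == "__main__":
    # 'I' decodes to 40 and '#' to 2, so the last two bases are removed.
    print(trim_low_quality_tail("ACGTACGT", "IIIIII##"))
```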
Data Analysis
The following items allow for scalable genomic analysis by introducing specialized databases.
- Hail - Scalable genomic analysis.
- GLNexus - Scalable gVCF merging and joint variant calling for population sequencing projects. [ paper-2018 ]
Sequence Alignment
Pairwise
- Bowtie 2 - An ultrafast and memory-efficient tool for aligning sequencing reads to long reference sequences. [ paper-2012 | web ]
- BWA - Burrows-Wheeler Aligner for pairwise alignment between DNA sequences.
- WFA - The wavefront alignment algorithm (WFA), which exploits sequence similarity to speed up alignment. [ paper-2020 ]
- Parasail - SIMD C library for global, semi-global, and local pairwise sequence alignments [ paper-2016 ]
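For readers new to pairwise alignment, the sketch below computes a global alignment score with the textbook Needleman-Wunsch dynamic program and a linear gap penalty. It is a teaching example only: the tools above use far faster index-based, wavefront, or SIMD methods, and the scoring values (`match=1`, `mismatch=-1`, `gap=-2`) are arbitrary assumptions.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment score with a linear gap penalty."""
    # previous[j] holds the best score for aligning the processed prefix of a with b[:j].
    previous = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        current = [i * gap]  # aligning a[:i] against an empty prefix of b
        for j in range(1, len(b) + 1):
            substitution = match if a[i - 1] == b[j - 1] else mismatch
            current.append(max(
                previous[j - 1] + substitution,  # align a[i-1] with b[j-1]
                previous[j] + gap,               # gap in b
                current[j - 1] + gap,            # gap in a
            ))
        previous = current
    return previous[-1]


if __name__ == "__main__":
    print(global_alignment_score("GATTACA", "GATTACA"))
    print(global_alignment_score("ACGT", "AGT"))
```

Only two rows of the score matrix are kept, so memory grows with the length of the second sequence rather than with the product of both lengths. Recovering the alignment itself, not just its score, would require keeping the full matrix or traceback pointers.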
Multiple Sequence Alignment
- POA - Partial-Order Alignment for fast alignment and consensus of multiple homologous sequences. [ paper-2002 ]
Quantification
- Cufflinks - Cufflinks assembles transcripts, estimates their abundances, and tests for differential expression and regulation in RNA-Seq samples. [ paper-2010 ]
- RSEM - A software package for estimating gene and isoform expression levels from RNA-Seq data. [ paper-2011 | web ]
Variant Calling
- freebayes - Bayesian haplotype-based polymorphism discovery and genotyping. [ web ]
- GATK - Variant Discovery in High-Throughput Sequencing Data. [ web ]
Structural variant callers
- Delly - Structural variant discovery by integrated paired-end and split-read analysis. [ paper-2012 ]
- lumpy - lumpy: a general probabilistic framework for structural variant discovery. [ paper-2014 ]
- manta - Structural variant and indel caller for mapped sequencing data. [ paper-2015 ]
- gridss - GRIDSS: the Genomic Rearrangement IDentification Software Suite. [ paper-2017 ]
- smoove - Structural variant calling and genotyping with existing tools, but, smoothly.
BAM File Utilities
- Bamtools - Collection of tools for working with BAM files. [ paper-2011 ]
- bam toolbox - MtDNA:Nuclear Coverage; BAM Toolbox can output the ratio of MtDNA:nuclear coverage, a proxy for mitochondrial content.
- mergesam - Automate common SAM & BAM conversions.
- mosdepth - fast BAM/CRAM depth calculation for WGS, exome, or targeted sequencing. [ paper-2017 ]
- Somalier - Fast sample-swap and relatedness checks on BAMs/CRAMs/VCFs/GVCFs. [ paper-2020 ]
- Telseq - Telseq is a tool for estimating telomere length from whole genome sequence data. [ paper-2014 ]
VCF File Utilities
- vcfanno - Annotate a VCF with other VCFs/BEDs/tabixed files. [ paper-2016 ]
- vcflib - A C++ library for parsing and manipulating VCF files.
- vcftools - VCF manipulation and statistics (e.g. linkage disequilibrium, allele frequency, Fst). [ paper-2011 ]
GFF BED File Utilities
- gffutils - GFF and GTF file manipulation and interconversion. [ web ]
- BEDOPS - The fast, highly scalable and easily-parallelizable genome analysis toolkit. [ paper-2012 ]
Variant Simulation
- Bam Surgeon - Tools for adding mutations to existing .bam files, used for testing mutation callers. [ web ]
- wgsim - Comes with samtools! - Reads simulator. [ web ]
Variant Prediction/Annotation
- SIFT - Predicts whether an amino acid substitution affects protein function. [ paper-2003 | web ]
Python Modules
Data
Visualization
Genome Browsers / Gene Diagrams
The following tools can be used to visualize genomic data or for constructing customized visualizations of genomic data including sequence data from DNA-Seq, RNA-Seq, and ChIP-Seq, variants, and more.
- Squiggle - Easy-to-use DNA sequence visualization tool that turns FASTA files into browser-based visualizations. [ paper-2018 | web ]
- biodalliance - Embeddable genome viewer. Integrates data from a wide variety of sources, and can load data directly from popular genomics file formats including bigWig, BAM, and VCF. [ paper-2011 | web ]
- BioJS - BioJS is a library of over a hundred JavaScript components enabling you to visualize and process data using current web technologies. [ paper-2014 | web ]
- Circleator - Flexible circular visualization of genome-associated data with BioPerl and SVG. [ paper-2014 ]
- IGV js - JavaScript-based browser. Fast, efficient, scalable visualization tool for genomics data and annotations. Handles a large variety of formats. [ paper-2019 | web ]
- Island Plot - D3 JavaScript based genome viewer. Constructs SVGs. [ paper-2015 ]
- JBrowse - JavaScript genome browser that is highly customizable via plugins and track customizations. [ paper-2016 | web ]
- PHAT - Point and click, cross platform suite for analysing and visualizing next-generation sequencing datasets. [ paper-2018 | web ]
- pileup.js - JavaScript library that can be used to generate interactive and highly customizable web-based genome browsers. [ paper-2016 ]
- Lucid Align - A modern sequence alignment viewer. [ web ]
- Circos - Perl package for circular plots, which are well suited for genomic rearrangements. [ paper-2009 | web ]
- ClicO FS - An interactive web-based service of Circos. [ paper-2015 ]
- OmicCircos - R package for circular plots for omics data. [ paper-2014 | web ]
- J-Circos - A Java application for doing interactive work with circos plots. [ paper-2014 | web ]
- fujiplot - A circos representation of multiple GWAS results. [ paper-2018 ]
Database Access
Resources
Sequencing
- Next-Generation Sequencing Technologies - Elaine Mardis (2014) [1:34:35] - Excellent (technical) overview of next-generation and third-generation sequencing technologies, along with some applications in cancer research.
- Annotated bibliography of *Seq assays - List of ~100 papers on various sequencing technologies and assays ranging from transcription to transposable element discovery.
- For all you seq… (PDF) (3456x5471) - Massive infographic by Illumina illustrating how many sequencing techniques work. Techniques cover protein-protein interactions, RNA transcription, RNA-protein interactions, RNA low-level detection, RNA modifications, RNA structure, DNA rearrangements and markers, DNA low-level detection, epigenetics, and DNA-protein interactions. References included.
RNA-Seq
- Review papers on RNA-seq (Biostars) - Includes lots of seminal papers on RNA-seq and analysis methods.
- Informatics for RNA-seq: A web resource for analysis on the cloud - Educational resource on performing RNA-seq analysis in the cloud using Amazon AWS cloud services. Topics include preparing the data, preprocessing, differential expression, isoform discovery, data visualization, and interpretation.
- RNA-seqlopedia - RNA-seqlopedia provides an awesome overview of RNA-seq and of the choices necessary to carry out a successful RNA-seq experiment.
- A survey of best practices for RNA-seq data analysis - Gives an awesome roadmap for RNA-seq computational analyses, including challenges/obstacles and things to look out for, but also how you might integrate RNA-seq data with other data types.
- Stories from the Supplement [46:39] - Dr. Lior Pachter shares his stories from the supplement for well-known RNA-seq analysis software CuffDiff and Cufflinks and explains some of their methodologies.
- List of RNA-seq Bioinformatics Tools - Extensive list on Wikipedia of RNA-seq bioinformatics tools, covering all parts of an analysis pipeline from quality control and alignment to splice analysis and visualization.
- RNA-seq Analysis - @crazyhottommy’s notes on various steps and considerations when doing RNA-seq analysis.
ChIP-Seq
YouTube Channels and Playlists
- Current Topics in Genome Analysis 2016 - Excellent series of fourteen lectures given at NIH about current topics in genomics ranging from sequence analysis, to sequencing technologies, and even more translational topics such as genomic medicine.
- GenomeTV - “GenomeTV is NHGRI’s collection of official video resources from lectures, to news documentaries, to full video collections of meetings that tackle the research, issues and clinical applications of genomic research.”
- Leading Strand - Keynote lectures from Cold Spring Harbor Laboratory (CSHL) Meetings. More on The Leading Strand.
- Genomics, Big Data and Medicine Seminar Series - “Our seminars are dedicated to the critical intersection of GBM, delving into ‘bleeding edge’ technology and approaches that will deeply shape the future.”
- Rafael Irizarry’s Channel - Dr. Rafael Irizarry’s lectures and academic talks on statistics for genomics.
- NIH VideoCasting and Podcasting - “NIH VideoCast broadcasts seminars, conferences and meetings live to a world-wide audience over the Internet as a real-time streaming video.” Not exclusively genomics and bioinformatics video but many great talks on domain specific use of bioinformatics and genomics.
Blogs
- ACGT - Dr. Keith Bradnam writes about his “thoughts on biology, genomics, and the ongoing threat to humanity from the bogus use of bioinformatics acronyms.”
- Opiniomics - Dr. Mick Watson writes about bioinformatics, genomes, and biology.
- Bits of DNA - Dr. Lior Pachter writes review and commentary on computational biology.
- it is NOT junk - Dr. Michael Eisen writes “a blog about genomes, DNA, evolution, open science, baseball and other important things”
Miscellaneous
Online networking groups
License
