JavaGene

Update: a couple of years ago the (much bigger) open source BioJava project incorporated most of the JavaGene code.

JavaGene is a small set of Java classes for genomic sequence analysis that support fluent programming.

It's good for low-level sequence manipulations, such as one might need when writing gene finders, analyzing gene structure, modeling molecular evolution, and similar projects. It's not for annotation pipelines or integrating disparate data sources.

JavaGene completely encapsulates tedious sequence arithmetic, eliminating pesky off-by-one bugs and edge condition error checking, and generally makes working with stranded sequences a breeze.
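To give a feel for the kind of arithmetic being encapsulated, here is a minimal sketch of a half-open interval type. This is not JavaGene's actual API: the class name IntervalSketch and the methods mirror() and slice() are made up for illustration, and the 0-based, end-exclusive convention is an assumption. Only the names prefix, suffix, overlaps and distance echo operations listed in the Overview.

```java
// Hypothetical illustration only; not a class from JavaGene.
public final class IntervalSketch {
    private final int start; // inclusive, 0-based (assumed convention)
    private final int end;   // exclusive

    public IntervalSketch(int start, int end) {
        if (start < 0 || end < start) {
            throw new IllegalArgumentException("bad interval [" + start + ", " + end + ")");
        }
        this.start = start;
        this.end = end;
    }

    public int start() { return start; }
    public int end() { return end; }

    // With half-open coordinates the length needs no "+ 1" correction.
    public int length() { return end - start; }

    // First n positions of this interval.
    public IntervalSketch prefix(int n) {
        checkLength(n);
        return new IntervalSketch(start, start + n);
    }

    // Last n positions of this interval.
    public IntervalSketch suffix(int n) {
        checkLength(n);
        return new IntervalSketch(end - n, end);
    }

    // For non-empty intervals: true when the two share at least one position.
    public boolean overlaps(IntervalSketch other) {
        return start < other.end && other.start < end;
    }

    // Number of positions in the gap between the intervals; 0 if they touch or overlap.
    public int distance(IntervalSketch other) {
        return Math.max(0, Math.max(other.start - end, start - other.end));
    }

    // The same stretch of sequence, expressed in coordinates counted from the other end,
    // as one would need when switching to the opposite strand.
    public IntervalSketch mirror(int sequenceLength) {
        if (end > sequenceLength) {
            throw new IllegalArgumentException("interval extends past the sequence");
        }
        return new IntervalSketch(sequenceLength - end, sequenceLength - start);
    }

    // The characters of the sequence covered by this interval.
    public String slice(String sequence) {
        return sequence.substring(start, end);
    }

    private void checkLength(int n) {
        if (n < 0 || n > length()) {
            throw new IllegalArgumentException("length out of range: " + n);
        }
    }

    public static void main(String[] args) {
        String seq = "ACGTACGTAC";
        IntervalSketch exon = new IntervalSketch(2, 6);
        System.out.println(exon.slice(seq));                           // GTAC
        System.out.println(exon.prefix(2).slice(seq));                 // GT
        System.out.println(exon.suffix(2).slice(seq));                 // AC
        System.out.println(exon.distance(new IntervalSketch(8, 10)));  // 2
        IntervalSketch mirrored = exon.mirror(seq.length());
        System.out.println(mirrored.start() + ".." + mirrored.end());  // 4..8
    }
}
```

Keeping every coordinate calculation inside one small immutable type like this is what lets calling code chain operations without ever writing a "+ 1" or "- 1" by hand.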

I originally wrote this for my own use while studying bioinformatics at the University of Pennsylvania, and used it there for three good-sized research projects. The version here is a refactored subset.

License: MIT

Overview

• Simple! Documented! Includes sample programs!

• Encapsulates common sequence data types and operations; supports chained, fluent programming.

• Implements a rich set of methods to manipulate locations on a sequence, such as "sliding window" iterators, prefix(), suffix(), contains(), distance(), overlaps(), upstream(), union(), and many more like them.

• Transparently handles operations on forward and reverse strands.

• Supports FASTA format sequence files. Seamlessly supports both typical in-memory sequences and oversized, gigabyte-scale sequences (using memory-mapped I/O).

• Supports various flavors of GFF/GTF feature files. Supports selection of features such as genes, exons, etc., based on attributes. Can splice a sequence based on a list of features.

• AminoAcid utilities: Translate nucleotides to amino acids. Check for synonyms. Find BLOSUM62 distance.

• Nucleotide utilities: Complementation for both DNA and RNA sequences. Check for matches using the IUPAC "ambiguous" symbols, such as R, Y, A, etc.
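As an illustration of the nucleotide utilities mentioned in the last bullet, the sketch below shows one common way to implement DNA reverse complementation and IUPAC ambiguity matching. It is not JavaGene code: the class NucleotideSketch and its method names are hypothetical. The symbol meanings (R = A or G, Y = C or T, N = any base, and so on) follow the standard IUPAC nucleotide codes, and the bitmask encoding is just one convenient choice.

```java
// Hypothetical illustration only; not a class from JavaGene.
public final class NucleotideSketch {
    private NucleotideSketch() {}

    // Each IUPAC symbol is encoded as a bit set over the four bases: A=1, C=2, G=4, T/U=8.
    public static int mask(char symbol) {
        switch (Character.toUpperCase(symbol)) {
            case 'A': return 1;
            case 'C': return 2;
            case 'G': return 4;
            case 'T':
            case 'U': return 8;
            case 'R': return 1 | 4;      // A or G (purine)
            case 'Y': return 2 | 8;      // C or T (pyrimidine)
            case 'S': return 2 | 4;      // C or G
            case 'W': return 1 | 8;      // A or T
            case 'K': return 4 | 8;      // G or T
            case 'M': return 1 | 2;      // A or C
            case 'B': return 2 | 4 | 8;  // anything but A
            case 'D': return 1 | 4 | 8;  // anything but C
            case 'H': return 1 | 2 | 8;  // anything but G
            case 'V': return 1 | 2 | 4;  // anything but T
            case 'N': return 15;         // any base
            default:
                throw new IllegalArgumentException("not an IUPAC nucleotide symbol: " + symbol);
        }
    }

    // Two symbols match when they have at least one concrete base in common.
    public static boolean matches(char a, char b) {
        return (mask(a) & mask(b)) != 0;
    }

    // Complement of a single unambiguous DNA base (N stays N).
    public static char complementDna(char base) {
        switch (Character.toUpperCase(base)) {
            case 'A': return 'T';
            case 'T': return 'A';
            case 'C': return 'G';
            case 'G': return 'C';
            case 'N': return 'N';
            default:
                throw new IllegalArgumentException("unsupported DNA base: " + base);
        }
    }

    // Reads the input back to front, complementing each base along the way.
    public static String reverseComplement(String dna) {
        StringBuilder out = new StringBuilder(dna.length());
        for (int i = dna.length() - 1; i >= 0; i--) {
            out.append(complementDna(dna.charAt(i)));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(reverseComplement("ATGC")); // GCAT
        System.out.println(matches('R', 'A'));         // true
        System.out.println(matches('R', 'C'));         // false
        System.out.println(matches('N', 'G'));         // true
    }
}
```

The bitmask encoding keeps the match test to a single AND, which matters when it runs inside a sliding-window scan over a long sequence.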

Resources

1. Sample programs:

Essentials (getting started)
GenesToProteins (a small but real program)
Strands (working with stranded sequences)
Bio (nucleotide and protein tools)

2. Online JavaDoc for the entire library

Downloads

sample-data.zip	Data for sample programs
GitHub repo	Source, JAR file, sample programs