Efficient construction of a compressed de Bruijn graph for pan-genome analysis
Timo Beller
Enno Ohlebusch


Algorithm:
These are the algorithms proposed in
	Efficient construction of a compressed de Bruijn graph for pan-genome analysis
	Accepted at the 26th Annual Symposium on Combinatorial Pattern Matching (CPM 2015).
Please cite the paper above if you use one of these algorithms.


Requirements:
	- a C++11-capable compiler such as GCC 4.7 or newer
	- the Succinct Data Structure Library (sdsl-lite)
	  (commit d533b2600950b4f878cb063ca0cd1bf340c53df4; newer versions may also work)


Install:
1. Install the SDSL with the commands:
	git clone git@github.com:simongog/sdsl-lite.git sdsl-lite
	cd sdsl-lite
	./install.sh [PATH]

2. Save the path of the SDSL in the environment variable SDSLLITE, e.g. with:
	export SDSLLITE=[PATH to SDSL]

3. Build the executables with the command:
	make


Usage:
All programs take three parameters, in this order:
- the input file in FASTA format
- the output filename (used as the prefix of all generated files)
- a file containing the values of k


Example:
	./a1.x input.fa example kfile.txt
or
	./a2.x input.fa example kfile.txt
or
	./a3.x input.fa example kfile.txt
will produce four files, two for k=10 and two for k=20 (assuming kfile.txt contains the values 10 and 20):
     example.k10.dot
     example.k10.start_nodes.txt
     example.k20.dot
     example.k20.start_nodes.txt
The *.dot files contain the compressed de Bruijn graph and
the *.start_nodes.txt files contain the list of the start nodes.


Algorithms:
a1.cpp - has a theoretical running time of O(n log n) and should be the fastest in practice.
         The space requirement is roughly n log n bits + the size of the compressed de Bruijn graph.
         Note that the suffix-array construction needs 5n bytes for files <2GB and 9n bytes for files >2GB.

a2.cpp - has a theoretical running time of O(n log sigma), but in practice a2 is usually slightly slower than a1.
         The space requirement is roughly 2 n log n bits + the size of the compressed de Bruijn graph.
         Note that the suffix-array construction needs 5n bytes for files <2GB and 9n bytes for files >2GB.

a3.cpp - has a theoretical running time of O(n log sigma), but is in practice much slower than a1 and a2.
         The space requirement is roughly 1.5n bytes + the size of the compressed de Bruijn graph.
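
To get a feeling for the space figures above, the following C++ sketch turns them
into byte estimates for a given input length n. It is not part of a1.cpp, a2.cpp or
a3.cpp; all function names are made up for this illustration. It assumes that n is
the number of input characters, that "log n" means the base-2 logarithm rounded up
(the bit width of one index), and that "2GB" means 2^31 bytes. The size of the
compressed de Bruijn graph itself is not included.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helpers for illustration only; not part of the distributed programs.

// Bits needed to store one value in [0, n), i.e. ceil(log2(n)), at least 1.
// Assumption: this is what "log n" stands for in the space figures.
inline std::uint64_t bits_per_index(std::uint64_t n) {
    assert(n >= 1);
    std::uint64_t bits = 1;
    while (bits < 64 && (std::uint64_t{1} << bits) < n) ++bits;
    return bits;
}

// a1: roughly n log n bits, rounded up to whole bytes.
inline std::uint64_t a1_bytes(std::uint64_t n) {
    return (n * bits_per_index(n) + 7) / 8;
}

// a2: roughly 2 n log n bits, rounded up to whole bytes.
inline std::uint64_t a2_bytes(std::uint64_t n) {
    return (2 * n * bits_per_index(n) + 7) / 8;
}

// a3: roughly 1.5n bytes (rounded down for odd n).
inline std::uint64_t a3_bytes(std::uint64_t n) {
    return n + n / 2;
}

// Suffix-array construction used by a1 and a2: 5n bytes below 2GB, 9n bytes otherwise.
// Assumption: the threshold is 2^31 bytes.
inline std::uint64_t sa_construction_bytes(std::uint64_t n) {
    const std::uint64_t two_gb = std::uint64_t{2} << 30;
    return n < two_gb ? 5 * n : 9 * n;
}
```

For example, under these assumptions an input of 1,000,000 characters gives
a1_bytes = 2,500,000 and sa_construction_bytes = 5,000,000, so the suffix-array
construction and not the final data structure would dominate the memory use of a1.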


Notes:
	The programs should compile with commit d533b2600950b4f878cb063ca0cd1bf340c53df4
	of the SDSL, but may also work with newer versions of the SDSL.


Limitations:
	The input file must be in FASTA format. In particular, the input must not contain the 0-byte or the 1-byte.
	Newlines and all characters between a '>' and the next newline are removed.
	k must be smaller than the length of the shortest sequence.
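
	The limitations above can be checked before running the programs. The following
	C++ sketch is one possible pre-flight check; it is not part of the distributed
	programs, and the names fasta_report, scan_fasta, scan_fasta_string and input_ok
	are made up for this illustration. It assumes that every '>' starts a header that
	runs to the next newline, that empty sequences can be ignored, and it flags a
	0-byte or 1-byte anywhere in the file, including inside headers.

```cpp
#include <cstdint>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical pre-flight check for illustration; not part of a1/a2/a3.
struct fasta_report {
    bool forbidden_byte = false;  // a 0-byte or 1-byte was seen anywhere
    std::uint64_t sequences = 0;  // number of non-empty sequences
    std::uint64_t shortest = 0;   // length of the shortest sequence (0 if none)
};

// Scans the stream once, skipping headers ('>' up to the next newline)
// and newlines, and records the length of the shortest sequence.
inline fasta_report scan_fasta(std::istream& in) {
    fasta_report r;
    std::uint64_t current = 0;    // length of the sequence being read
    bool in_header = false;
    auto close_sequence = [&]() {
        if (current > 0) {        // assumption: empty sequences are ignored
            if (r.sequences == 0 || current < r.shortest) r.shortest = current;
            ++r.sequences;
        }
        current = 0;
    };
    char c;
    while (in.get(c)) {
        if (c == '\0' || c == '\1') r.forbidden_byte = true;
        if (in_header) {
            if (c == '\n') in_header = false;
            continue;
        }
        if (c == '>') {
            close_sequence();
            in_header = true;
        } else if (c != '\n') {
            ++current;
        }
    }
    close_sequence();
    return r;
}

// Convenience wrapper for in-memory data.
inline fasta_report scan_fasta_string(const std::string& data) {
    std::istringstream in(data);
    return scan_fasta(in);
}

// True if the input satisfies the limitations for the given k.
inline bool input_ok(const fasta_report& r, std::uint64_t k) {
    return !r.forbidden_byte && r.sequences > 0 && k < r.shortest;
}
```

	To check a file, open it as a std::ifstream in binary mode, pass it to scan_fasta,
	and call input_ok once for every value of k in the k file.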
