About

On this site you will find the test data that were used for "Efficient construction of a compressed de Bruijn graph for pan-genome analysis".

Human Genome

The file 7 genomes has a size of 21,625,319,541 bytes (md5sum: affc827aa48b7cfd07eb9d7e071a3bf3) and contains 21,201,290,946 base pairs. It was created by concatenating the following files in the order listed:

hg16 (NCBI34) from July 2003

Download (md5sum: 9c4567258b47b6dd466225c58da65eb4)

Src: ftp://hgdownload.cse.ucsc.edu/goldenPath/hg16/chromosomes/

Comment: Modified file - converted lowercase to uppercase and removed 3 characters (RR and M) from chromosome 3.

 

hg17 (NCBI35) from May 2004

Download (md5sum: 57f5af6e6004497f82b284b75a712486)

Src: ftp://hgdownload.cse.ucsc.edu/goldenPath/hg17/chromosomes/

Comment: Modified file - converted lowercase to uppercase and removed 3 characters (RR and M) from chromosome 3.

 

hg18 (NCBI36) from Mar. 2006

Download (md5sum: f37590f3007ac483488891113f222dc8)

Src: ftp://hgdownload.cse.ucsc.edu/goldenPath/hg18/chromosomes/

Comment: Modified file - converted lowercase to uppercase and removed 3 characters (RR and M) from chromosome 3.

 

hg19 (GRCh37) from Feb. 2009

Download (md5sum: 55c0eb9b019d9f727b0d0ae42b5ca237)

Src: ftp://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/

Comment: Modified file - converted lowercase to uppercase.

 

hg38 (GRCh38) from Dec. 2013

Download (md5sum: ea47ff706942f5e58b327aac61e528d6)

Src: ftp://hgdownload.cse.ucsc.edu/goldenPath/hg38/chromosomes/

Comment: Modified file - converted lowercase to uppercase.

 

maternal haplotype of NA12878

The Gerstein Lab at Yale University has created a version of the NA12878 genome based on NCBI build 36 and incorporating SNPs, indels and SVs identified by the 1000 Genomes project. This genome sequence is available at http://sv.gersteinlab.org/NA12878_diploid.

Download (md5sum: 4a5e7ffec07364de66e56022d5864107)

Src: http://sv.gersteinlab.org/NA12878_diploid/NA12878_diploid_genome_2012_dec16.zip

Comment: Users of this assembly are requested to cite: Rozowsky J et al. (2011). AlleleSeq: Analysis of allele-specific expression and binding in a network framework. Molecular Systems Biology, 7, 522.

paternal haplotype of NA12878

The Gerstein Lab at Yale University has created a version of the NA12878 genome based on NCBI build 36 and incorporating SNPs, indels and SVs identified by the 1000 Genomes project. This genome sequence is available at http://sv.gersteinlab.org/NA12878_diploid.

Download (md5sum: 75e170b383de42aeb14732cabeab9a00)

Src: http://sv.gersteinlab.org/NA12878_diploid/NA12878_diploid_genome_2012_dec16.zip

Comment: Users of this assembly are requested to cite: Rozowsky J et al. (2011). AlleleSeq: Analysis of allele-specific expression and binding in a network framework. Molecular Systems Biology, 7, 522.
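
The checksums listed above can be verified after download, and the concatenation step can be sketched in a few lines of Python. This is only a minimal sketch, not the procedure that was actually used to build the 7 genomes file: the function names md5_of_file and concatenate are hypothetical, and the sketch assumes that the parts are joined byte for byte with nothing inserted between them, which this page does not state.

```python
import hashlib
import shutil


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops as soon as read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def concatenate(parts, target):
    """Write the files in `parts` to `target`, byte for byte, in the given order.

    Assumption: plain byte-wise concatenation with no separator between parts.
    """
    with open(target, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)
```

Calling md5_of_file on a downloaded file and comparing the result with the md5sum given next to its Download link indicates whether the download is intact. Reading in chunks matters here because the files are far larger than typical main memory.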

Human Chromosome 1

Chr1 of hg16 (NCBI34) from July 2003

Download (md5sum: f339b1e234e9b708d04ef7928ccbcd7e)

Src: ftp://hgdownload.cse.ucsc.edu/goldenPath/hg16/chromosomes/chr1.fa.zip

Comment: Modified file - converted lowercase to uppercase.

 

Chr1 of hg17 (NCBI35) from May 2004

Download (md5sum: 057693e7e5be4a813610dc49d2647f05)

Src: ftp://hgdownload.cse.ucsc.edu/goldenPath/hg17/chromosomes/chr1.fa.gz

Comment: Modified file - converted lowercase to uppercase.

 

Chr1 of hg18 (NCBI36) from Mar. 2006

Download (md5sum: b9fb6b270b7e6cf777a5925c904a7f9e)

Src: ftp://hgdownload.cse.ucsc.edu/goldenPath/hg18/chromosomes/chr1.fa.gz

Comment: Modified file - converted lowercase to uppercase.

 

Chr1 of hg19 (GRCh37) from Feb. 2009

Download (md5sum: a46474e572b3be254b8f4e59034d6238)

Src: ftp://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/chr1.fa.gz

Comment: Modified file - converted lowercase to uppercase.

 

Chr1 of hg38 (GRCh38) from Dec. 2013

Download (md5sum: 358f980f9e54a41f2df778b7a89a620e)

Src: ftp://hgdownload.cse.ucsc.edu/goldenPath/hg38/chromosomes/chr1.fa.gz

Comment: Modified file - converted lowercase to uppercase.

 

1_NA12878_maternal.fa of NA12878

The Gerstein Lab at Yale University has created a version of the NA12878 genome based on NCBI build 36 and incorporating SNPs, indels and SVs identified by the 1000 Genomes project. This genome sequence is available at http://sv.gersteinlab.org/NA12878_diploid.

Download (md5sum: c4af94539eab79b56de98f7767a72f3f)

Src: http://sv.gersteinlab.org/NA12878_diploid/NA12878_diploid_genome_2012_dec16.zip

Comment: Users of this assembly are requested to cite: Rozowsky J et al. (2011). AlleleSeq: Analysis of allele-specific expression and binding in a network framework. Molecular Systems Biology, 7, 522.

 

1_NA12878_paternal.fa of NA12878

The Gerstein Lab at Yale University has created a version of the NA12878 genome based on NCBI build 36 and incorporating SNPs, indels and SVs identified by the 1000 Genomes project. This genome sequence is available at http://sv.gersteinlab.org/NA12878_diploid.

Download (md5sum: 335f5e6754218a825939a4b485c5c85d)

Src: http://sv.gersteinlab.org/NA12878_diploid/NA12878_diploid_genome_2012_dec16.zip

Comment: Users of this assembly are requested to cite: Rozowsky J et al. (2011). AlleleSeq: Analysis of allele-specific expression and binding in a network framework. Molecular Systems Biology, 7, 522.
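
The "converted lowercase to uppercase" modification mentioned in the comments above could be reproduced roughly as follows. This is a hedged sketch under stated assumptions: the function names are hypothetical, and it assumes that FASTA header lines (those starting with ">") are left unchanged, which this page does not specify.

```python
def uppercase_sequence_lines(lines):
    """Yield FASTA lines with sequence lines uppercased and header lines unchanged."""
    for line in lines:
        # Assumption: header lines start with ">" and are kept as they are.
        yield line if line.startswith(">") else line.upper()


def uppercase_fasta(src_path, dst_path):
    """Stream a FASTA file line by line and write the uppercased copy."""
    # newline="" keeps the original line endings untouched on both sides.
    with open(src_path, "r", encoding="ascii", newline="") as src, \
            open(dst_path, "w", encoding="ascii", newline="") as dst:
        dst.writelines(uppercase_sequence_lines(src))
```

Processing the file line by line keeps memory use small even for whole-genome inputs. The checksum of the result will only match the md5sum listed above if the header and line-ending assumptions hold.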

E. coli

We used (almost) the same E. coli files as described in the Supplementary Data of "SplitMEM: a graphical algorithm for pan-genome analysis with suffix skips". We downloaded the sequences (or, where appropriate, their reverse complements) from www.ncbi.nlm.nih.gov/nuccore/ using the following accession numbers.

 

accession number | filename | md5sum | comment
FM180568 | 01.FM180568.sequence.fasta | 3939f45d6d97afa76a2fbb603307a237
FN554766 | 02.FN554766.sequence.fasta | b5726c3bae898831d5240f8897736c12
CP000247 | 03.CP000247.sequence.fasta | 08643a4078ec97b36ea3da402bef95f6
CU928145 | 04.CU928145.sequence.fasta | d6e80db065ddf221b5925888ff8edd67
CP001671 | 05.CP001671.sequence.fasta | 97f209b1693b222e97a828d3c5a9c449
CP000468 | 06.CP000468.sequence.fasta | 80d233b0ffab129579f55789b136ad2e
CP004009 | 07.CP004009.sequence.fasta | 59b2b0fe09f5c2d4bd93475a0ecee948
CP004009 | 08.CP004009.sequence.fasta | c6ed68fd4899e8deac84f57afbb8e700
AM946981 | 09.AM946981.sequence.fasta | 6bd14dc519e588723715ff04fb289742
CP001396 | 10.CP001396.sequence.fasta | 6aa6aee90306e7800d681ad2a20e0c03
CP000819 | 11.CP000819.sequence.fasta | e79fd22c2716252d440e9f6b7e60546f
AE014075 | 12.AE014075.sequence.fasta | 5b6fb26a0e33185fbedc6ef017858c85
CP000946 | 13.CP000946.rev.sequence.fasta | a08d364f46c80c5cffe06c1fe8384e35 | In contrast to SplitMEM, we use the reverse complement sequence
CP001637 | 14.CP001637.rev.sequence.fasta | a6c4702cc30c2bc5b797cdb100652eea | This is the reverse complemented sequence
AP012030 | 15.AP012030.sequence.fasta | 376ed43655a0b17ab910e953f0ee0f58
CP000800 | 16.CP000800.sequence.fasta | c308c831695ae52b69b3e3a06d91f91b
CU928162 | 17.CU928162.sequence.fasta | c0a080d2e1ad0e1f5db1f29de0783802
FN649414 | 18.FN649414.sequence.fasta | 05d5f327b5786f16e569edcd5a4ae4d5
CP000802 | 19.CP000802.sequence.fasta | 5796adc77aaf0ac2803e5a84a3bf9e1a
CU928160 | 20.CU928160.sequence.fasta | 96e0455c30720ab5cbf950ba5deef623
CU928164 | 21.CU928164.sequence.fasta | 17af046949d2bc75da6be07441b09b24
CP001969 | 22.CP001969.sequence.fasta | 3ae46456e0f1bad4214e4cad7101cb6e
CP006784 | 23.CP006784.sequence.fasta | 6543a8baf0a9aedbfe9661674b38ae4e
CP002516 | 24.CP002516.rev.sequence.fasta | 6fd8c3dafd4e5c3c691185429edd9e7e | This is the reverse complemented sequence
CP002970 | 25.CP002970.rev.sequence.fasta | 5ad67263edba15f2ece68c6afab893cf | This is the reverse complemented sequence
CP000948 | 26.CP000948.sequence.fasta | c5e591bf4793f8649c4d880151aff13c
AP012306 | 27.AP012306.sequence.fasta | ab94cfcbf5015cbca0bba3b35b74b0da
U00096 | 28.U00096.sequence.fasta | 7c5486a762455b4b811ea2411aa111d7 | SplitMEM paper reports a different file size
AP009048 | 29.AP009048.sequence.fasta | 059c8fae616045cb0a3447ed04316295
CU651637 | 30.CU651637.sequence.fasta | dcbb0c93454bca2b1ab274bf6e14c876
CP006584 | 31.CP006584.sequence.fasta | 5ff79d9d936dec004aa190fec1c91b98
CP002797 | 32.CP002797.sequence.fasta | 59780302ab6da977cf19f0858dd84ba3
NC_013353 | 33.NC_013353.sequence.fasta | e35447195a6eb85b956c93c66b32853d | SplitMEM paper uses the sequence with the accession number P010958
CP003297 | 34.CP003297.rev.sequence.fasta | dfc12bbdc2e4fc66dfe075671f1310db | This is the reverse complemented sequence
CP003301 | 35.CP003301.rev.sequence.fasta | 4f0036e0cbf337b4424b1d647c9f2717 | This is the reverse complemented sequence
CP003289 | 36.CP003289.rev.sequence.fasta | 86ce567ff234f60cf515819e83e75d49 | This is the reverse complemented sequence
AP010960 | 37.AP010960.sequence.fasta | 1f3cadfd676587468621ec6118d0a572
AE005174 | 38.AE005174.sequence.fasta | 0605d65eccb8411792eb03517d8d6e1c
BA000007 | 39.BA000007.sequence.fasta | 2819c48fc28e4399eefdcd6b1695c6c0
CP001164 | 40.CP001164.sequence.fasta | 99308a74849430a134967e2645b62ca5
CP001368 | 41.CP001368.sequence.fasta | 36c1491fb544769e24c4dfeb997aeb0b
AP010953 | 42.AP010953.sequence.fasta | db3a056d9472995ebfd524c1a0720112
CP001846 | 43.CP001846.sequence.fasta | 284b034e88e11e169de16b46df425f83
CP003109 | 44.CP003109.sequence.fasta | ee68b91e95eb5e9821101df85e98b184
CP003034 | 45.CP003034.sequence.fasta | 466006b02d21f7ff8fb68e43f6ee305e
CP001855 | 46.CP001855.sequence.fasta | 60b61bb47971940d1ff7f5ae98e7b678
CP002291 | 47.CP002291.sequence.fasta | 95b8882b832471c5ff4bda6fc53754d7
CU928161 | 48.CU928161.sequence.fasta | f08f59f924fb3d0b2786cacfd95db243
AP009240 | 49.AP009240.sequence.fasta | 6647d2f1657637e133b37401b30c3e68
AP009378 | 50.AP009378.sequence.fasta | e8cc199da10e1fa45157683720e4b68f
CP000970 | 51.CP000970.sequence.fasta | 979fe28928cddf5f8bed140173c71077
CP002167 | 52.CP002167.rev.sequence.fasta | 8d84661b1cf63e9b9732cd830812c072 | This is the reverse complemented sequence
CU928163 | 53.CU928163.sequence.fasta | d8bfddb71f6dd61661fb1a4f9eedc372 | SplitMEM paper reports a different file size
CP002729 | 54.CP002729.sequence.fasta | d5344aaf32bd04140be3140624c61817
CP000243 | 55.CP000243.sequence.fasta | d3e1c749af0a39d29190c61f089aa6df
CP002185 | 56.CP002185.sequence.fasta | 8e54c38451669a0e1dba0c49d76caf90
CP002967 | 57.CP002967.sequence.fasta | 72dd80f519cc369c19507ad5e522ece2
CP001925 | 58.CP001925.sequence.fasta | 911666bb04b01804400ba351fc1effe1
CP001665 | 59.CP001665.rev.sequence.fasta | fd656e6dcbd4f0806c62d0fce255769c | This is the reverse complemented sequence
CP002212 | 60.CP002212.sequence.fasta | 93aa4ed287b350492ad41b44e4260e3e | SplitMEM paper has accession number P002212
CP002211 | 61.CP002211.sequence.fasta | 57ca40270401c08f240fd9a5b49fe601
CP006698 | 62.CP006698.sequence.fasta | 0a0da9962b86a31ae2a92d59c579399c

All files are also available in this archive.