28.6. 14:00 - Grüner Vortrag M.Sc. Uwe Baier

Universität Ulm

BWT Tunneling: Theorie sowie Einsatz in Datenkompression und Sequenzanalyse

Vorstellung des Dissertationsvorhabens "BWT Tunneling":

The Burrows Wheeler Transform (BWT) is a famous reversible text transforma-
tion invented by David J. Wheeler around 1978 and published by Mike Burrows
and David J. Wheeler in 1994. The key idea of the transform is to sort all
suffixes of a given string S in lexicographic order and to concatenate all cycli-
cally preceding characters of the sorted suffixes, resulting in a new string L. It
is well-known that S can be reconstructed solely from string L, and that L has
some nice properties making it very attractive for lossless data compression as
well as sequence analysis.
One such property comes from the observation that, in real-world texts,
similar contexts are succeeded (but also preceded) by similar characters. For
example, in English text, the letter q is always followed by the letter u. Moving
over to the BWT, the lexicographically sorted suffixes serve as kind of a context,
and thus the BWT gathers preceding characters of similar contexts. As a result,
the BWT string L typically contains long repetitions of the same character,
making it highly compressible using e.g. run-length encoding.
The observation that similar contexts are preceded by similar characters
naturally can be extended to the observation that similar contexts are preceded
by similar strings, e.g. the string “peasy” typically is preceded by “easy”. A
new technique called Tunneling is able to exploit this observation, leading to a
significant reduction of the size of BWT-based data structures, and therefore
will be the main topic of the talk.
Beside of the pure theory behind Tunneling, applications of the technique
to typical BWT fields are shown: in the case of lossless data compression, the
technique allows to build a BWT-based data compressor achieving compression
rates competitive or even superior to those of nowadays best lossless data com-
pressors. In the field of sequence analysis, Tunneling allows to represent some
string-based data structures in a succinctness level below of previously known