Burrows Wheeler Transform (BWT) for data compression, indexing and search (including BWA)

Burrows Wheeler Transform (BWT)

LA home
Computing
Algorithms
Glossary
Strings
  Suffix array
  BWT
  Factors
  Suffix tree
  BWT

Example:

	0	1	2	3	4	5	6	7	8	9	1 0	1 1
S =	a	g	c	a	g	c	a	g	a	c	t	$

where the end of sequence pseudo-symbol, $, is less than all proper symbols.

S → BWT(S)

Sort the suffixes of S; the Burrows Wheeler transform [BW94] of S, BWT(S), consists of the symbols before each sorted suffix in turn. Note that $ comes before S[0].

Equivalently (with $), sort the rotations of S; BWT(S) consists of the last symbol of each sorted rotation in turn.

	suffix#	BWT(S)	suffix/rotation
0	11	t	$agcagcagact	$...
1	8	g	act$agcagcag	a...
2	6	c	agact$agcagc
3	3	c	agcagact$agc
4	0	$	agcagcagact$
5	5	g	cagact$agcag	c...
6	2	g	cagcagact$ag
7	9	a	ct$agcagcaga
8	7	a	gact$agcagca	g...
9	4	a	gcagact$agca
10	1	a	gcagcagact$a
11	10	c	t$agcagcagac	t...

ch	$	a	c	g	t
rank(ch)	0	1	5	8	11

BWT(agcagcagact) = tgcc$ggaaaac

Demonstration:

refStr= pat= output area needs JavaScript on op (errors and tracing) msgs

Data Compression

A virtue of the transform is that BWT(S) is likely to be compressible by quite simple data-compression methods, unless S is inherently incompressible. For example, suppose that some word, w=w₀...w_k-1, occurs frequently in S. Then all of the suffixes starting w₁...w_k-1... will be sorted in a "block" and there will be a corresponding "run" of w₀ in BWT(S). Similarly for suffixes starting w₂...w_k-1... and runs of w₁, and so on.

Recovering S from BWT(S)

Surprisingly, it is possible to recover S from BWT(S).

The first symbol of BWT(S), here t, must be the last symbol of S because the empty suffix ($) is the first suffix.

That t starts the suffix t$, at position 11 = rank(t) in the sorted list.

It must be preceded in S by symbol 11 of BWT(S), c.

Suffix ct$ must be in the range [rank(c), rank(g)) = [5, 8) in the sorted list; in fact it must be at position 7, because two other cs precede it in BWT(S).

Hence ct$ must be preceded by the 7th symbol of BWT(S), a.

All of S can be recovered in this way.

function recover(bwt)
 { // Recover original string from its transform
   var pos = 0,
       ans = endChar;  // $-terminated here
   for( var i = 1; i < bwt.length; i++ )
    { ans = bwt.charAt(pos) + ans;
      pos = inverse(pos, bwt);
    }
   return ans;
 }//recover(BWT)

function inverse(pos, bwt)
 { // one step of the reverse reconstruction
   var ch     = bwt.charAt(pos);
   var chCode = ch.charCodeAt(0);
   return rank[chCode] + occ(ch, bwt, pos);
 }//inverse(pos,bwt)

function occ(ch, bwt, pos)
  // returns the # of occurences of ch in bwt
  // before position pos

It might seem that occ() must take O(|S|)-time,

function occSlow(ch, bwt, i)  // (SLOW, but see occFast)
 { var count = 0;
   for( var j = 0; j < i; j++ )
      if( bwt.charAt(j) == ch ) count++;
   return count;
 }//occSlow(ch,bwt,i)

but it can be made to take O(step)-time by "caching" symbol-counts at regular intervals of size 'step' along BWT(S).

function occFast(ch, bwt, loc)
 { if( loc < 0 ) return 0;
   var bucket = Math.floor(loc/freqBucketSize);
   var lo = bucket * freqBucketSize;
   var count = freqCache[bucket][ch.charCodeAt(0)];
   for(var j = lo; j < loc; j++ )
      if( bwt.charAt(j) == ch ) count ++ ;
   return count;
 }//occFast(ch,bwt,loc)

Note that [FM00] use two levels of caching, [LD09] one.

Multiplicity

The number of occurences, if any, of a pattern, pat, in S can be found, using BWT(S), by a reverse search of pat. Initially all of BWT(S), the range [lo, hi) = [0, |BWT(S)| ), is considered.

Assuming that pat does occur in S, one or more suffixes must start with the last symbol of pat, s.

These suffixes will occur in a block, from rank[s] upwards.

One or more will be preceded by the 2nd-last symbol of pat, s'.

Variables lo and hi are changed depending on rank[s'] and the number of occurrences of s' before lo and before hi respectively.

function multiplicity(pat, bwt)
 { // Return the number of times, if any, that pat occurs
   // in refStr where bwt is the transform of refStr.
   var lo = 0, hi = bwt.length;  // i.e. [lo,hi)
   for( var i = pat.length - 1; hi = lo && i >= 0; i-- )
    { var pati     = pat.charAt(i);
      var patiCode = pati.charCodeAt(0);
      lo = rank[patiCode] + occ(pati, bwt, lo);
      hi = rank[patiCode] + occ(pati, bwt, hi);
    }//for
   return hi - lo;
 }//multiplicity(pat,bwt)

This is equivalent to a search on an implicit trie (or similar) of reversed prefixes. Note that there is no actual explicit trie but the BWT can be used as a substitue for one, and a range, [lo, hi), created during search, acts like a node in the implicit trie:

sorted reverse prefixes	~ trie
agcagcagact$	$
gcagcagact$a	$		a
gcagact$agca	$	agc
gact$agcagca	$agc	agc
ct$agcagcaga	$agcagcag
t$agcagcagac	$agcagcaga		c
agcagact$agc	$	ag
agact$agcagc	$agc	ag
cagcagact$ag	$		ag
cagact$agcag	$	agc
act$agcagcag	$agc	agc
$agcagcagact	$agcagcagact

For example, gca occurs twice in agcagcagact.

Search

The locations of occurrences, if any, of a pattern, pat, in S can be found by first finding the multiplicity of pat in S and then locating the positions, in S, that correspond to each BWT(S) position in the final range, [lo, hi). In other words the suffix array, the suffix# column above, can be reconstructed when needed.

function locations(pat, bwt)
 { var lo = 0, hi = bwt.length;   // [lo,hi)
   for( var i = pat.length - 1; hi > lo && i >= 0; i-- )
    { var pati     = pat.charAt(i);
      var patiCode = pati.charCodeAt(0);
      lo = rank[patiCode] + occ(pati, bwt, lo);
      hi = rank[patiCode] + occ(pati, bwt, hi);
    }//for
   for( var i = lo; i < hi; i++ )
      print( locate(i,bwt) + "," );
   println( "." );
   return;
 }//locations(pat,bwt)

The naive way to map a position, pos, in BWT(S) onto S is to recover S from that symbol back to the start of S, counting the number of steps necessary to do so.

function locateSlow(pos, bwt)
 { pos = inverse(pos, bwt);
   var count;
   for( count = 0; pos > 0; count ++ )
      pos = inverse(pos, bwt);
   return count;
 }//locateSlow(pos,bwt)

This is, of course, slow. It can be speeded up by caching a map from selected points in BWT(S) to the corresponding positions in S.

function locateFast(pos, bwt)   // Fast
 { var count;
   for( count = 0; pos % saBucketSize > 0; count++ )
      pos = inverse(pos, bwt);
   return (count+saCache[pos/saBucketSize]) % bwt.length;
 }//locateFast(pos,bwt)

The process can be halted early when a cached value is available.

The BWA [LD09] implementation, which is for DNA, divides BWT(S) up into equal sized intervals. These will not in general correspond to equal sized intervals in S so the worst-case time-complexity of BWA's search may be much slower than the best case. In contrast, the "FM-index" [FM00] gives better performance guarantees. It divides S, rather than BWT(S), up into equal sized intervals (buckets), and one of these boundaries must be encountered in bounded time, but this requires more sophisticated data structures.

Approximate Search

Exact search (above) is equivalent to descending a trie of reversed prefixes as directed by a reverse scan of the search pattern. The BW-transform of S acts as a compact substitute for the trie. Just as the trie can be traversed for approximate search, given an "allowance" of mutations between the pattern and what is considered an approximate instance of it, so approximate search can be performed [LD09] using BWT(S) as a substitute for the trie.

function approx(pat, errsLeft)
 { approxB(pat, errsLeft, pat.length-1, 0, bwt.length);
 }//approx(pat,errsLeft)

function approxB(pat, errsLeft, loc,  lo, hi)
 { if( errsLeft < 0 ) return; // fail, else >=0
   if( loc < 0 ) // done all pat, ... are we ok?
    { for( var i = lo; i < hi; i ++ )
         print( locate(i,bwt) + ',' );
      return;
    }//else loc >= 0
   approxB(pat, errsLeft-1, loc-1, lo, hi);       //del(*)
   var patLoc = pat.charCodeAt(loc);
   for( var sy = minCode; sy <= maxCode; sy ++ )
    { var rankSy   = rank[sy],
          syAsChar = String.fromCharCode(sy);
      var lo2 = rankSy + occ(syAsChar, bwt, lo),
          hi2 = rankSy + occ(syAsChar, bwt, hi);
      if( lo2 < hi2 ) // not a dead-end, yet
       { approxB(pat, errsLeft-1, loc, lo2, hi2); //ins(*)
         var e2 = errsLeft - (sy == patLoc ? 0 : 1);
         approxB(pat, e2, loc-1, lo2, hi2);
       }//if
    }//for
  return;
 }//approxB(,,,,)

Note that (approximately-) matching instances of the pattern in S may overlap and even have common start and/or end positions.

A further refinment is that approximate search can be pruned. The search above fails when another error is required but the allowance has been used up. It is possible [LD09] to estimate a useful lower bound on the number of errors required, at a given position in the pattern, to successfully complete an approximate search yielding at least one approximate match.

-- L.A., A.K., 3/2009 & 9/2009 with thanks to S.P..

[BW94] M. Burrows & D. J. Wheeler, A block-sorting lossless data compression algorithm, TR-124, SRC (Digital, Palo Alto), pp.18, 1994.: The original BW-Transform (BWT) paper.
[FM00] P. Ferragina & G. Manzini, Opportunistic data structures with applications, Proc. 41st Annual Symp. on Foundations of Comp. Sci., pp.390, 2000.: Efficient, but complex, BWT-based index.
[LD09] H. Li & R. Durbin, Fast and accurate short read alignment with Burrows-Wheeler transform, J. Bioinformatics, 25(14), pp.1754-1760, 2009.: A simpler, "BWA", implementation of a BWT-based index. (Also see bioinformatics.)

www #ad:

The C++ Cookbook

↑ © L. Allison, www.allisons.org/ll/ (or as otherwise indicated).
Created with "vi (Linux)", charset=iso-8859-1, fetched Friday, 19-Apr-2024 06:42:42 UTC.

Free: Linux, Ubuntu operating-sys, OpenOffice office-suite, The GIMP ~photoshop, Firefox web-browser, FlashBlock flash on/off.