Longest Biased Interval and Longest Non-Negative Sum Interval,
J. Bioinformatics, 2003

L. Allison. Longest Biased Interval and Longest Non-Negative Sum Interval, Bioinformatics, 19(10), pp1294-1295, 1 July 2003, [doi:10.1093/bioinformatics/btg135]['05] also see [more (click)] discussion on the algorithm and on related algorithms.

Java source file: [Biased.java], and class files: [1], [2]
To run, e.g.
java Biased "AT" 0.77 dataFileName
for a particular figure (here 0.77 "AT"), or ...
java Biased "AT" -1 dataFileName
to run through a range of values.
NB. Make sure that you have a just-in-time compiler if you want to analyse long sequences.
NB. Characters other than acgtuACGTUrnyRNY are skipped but this is trivial to change.
Example of running [time (click)].
graph
-- Figure: cummulative sum, first-below & last-above --

Sung Kwon Kim suggested a 3-pass, (always) linear-time algorithm (personnal communication, 28/7/2003). I have coded this up in a program that runs both algorithms: [new code .java].


21/10/2002: Examples from PlasmoDB C.D., seq's dated 2002.8.19  [19/8/2002]

NB. Plasmodium falciparum causes malaria. Its DNA is 80% AT, contains repetitive patterns and is highly [compressible] compared to ``most'' other DNA. The data files of chromosome sequences 10, 11, 14 contain some non-{A,C,G,T} characters, e.g. M={A,C}, W={A,T}, K={G,T} etc.. For these experiments, a read-routine was used that does read these characters (to maintain position numbering), but they were treated as ``regular'' characters, not contributing to AT-richness, say, because (i) it makes only a little difference unless there are many occurrences of such characters, and (ii) it is not quite clear what is the best way to treat some of them anyway (e.g. does M count towards AT or not?). In any case, the matter is orthogonal to the algorithm; the ``special characters'' can be included in the list of desired characters, e.g. "ATW", at your choice.

Running time, on a 900MHz Pentium, Java, Linux, for the largest sequence (chr14) is 7 or 8 seconds for five passes with different values of f and this includes reading the data.

Key:   >=<min_frac>: [<start>..<end>] (<length>) <beginning>...<ending>
 
chr1
#--- Biased.java, L.A., CSSE, Monash, .au ---
# argv[0] = AT
# argv[1] = -1
# argv[2] = chr01-2002-8-19-Sanger
# 643292bp CTAAACCTAAACCTAA...
>=1.0:[459134..459485](352)AAAATAATAA...AATAAATAAA
>=0.95:[458479..461841](3363)TTAACATTTT...TTTATTATAT
>=0.9:[458757..465678](6922)ATGTTTCTAA...TTTATTTAAT
>=0.85:[439465..465974](26510)TTAATTGTAA...TCTTTTGATT
>=0.8:[8596..625720](617125)ATTTCTACAC...TATATTAAGA
>=0.75:[0..643291](643292)CTAAACCTAA...GGAATAGGGT
#--- end ---
 
chr2
#--- Biased.java, L.A., CSSE, Monash, .au ---
# argv[0] = AT
# argv[1] = -1
# argv[2] = chr02-200208019-TIGR
# 947102bp AACCCTAAACCCTAAACCCTAAACCCTA...
>=1.0:[449263..449686](424)TTATTTTAAT...TATATTATTT
>=0.95:[447292..450582](3291)TACAAAATAA...AAAAAAAATA
>=0.9:[445946..451647](5702)TATATTTTAT...GGAAAAAAAT
>=0.85:[807673..840316](32644)ATGAAAATTA...ATTTATCATA
>=0.8:[0..947101](947102)AACCCTAAAC...TCAGGGTTCA
#--- end ---
 
chr3
#--- Biased.java, L.A., CSSE, Monash, .au ---
# argv[0] = AT
# argv[1] = -1
# argv[2] = chr03-2002-8-19-Sanger
# 1060087bp TAAACCCTGAACCCTAAACCCT...
>=1.0:[594855..595306](452)AAAATAAATT...ATTAATAATT
>=0.95:[594084..597349](3266)AAATAAATAA...TTTTGTTAAA
>=0.9:[590628..597388](6761)AATACTAAAG...ATATACTTAT
>=0.85:[76744..100990](24247)AATATTAAAC...AACTCATAAA
>=0.8:[0..1060086](1060087)TAAACCCTGA...TCAGGTTTTA
#--- end ---
 
chr4
#--- Biased.java, L.A., CSSE, Monash, .au ---
# argv[0] = AT
# argv[1] = -1
# argv[2] = chr04-2002-8-19-Sanger
# 1204112bp AACCCTAAACCCTGA...
>=1.0:[650061..650262](202)TTTTAATAAT...ATAATTAAAA
>=0.95:[648539..651828](3290)TATAAACATT...TTTTATATTT
>=0.9:[647731..654304](6574)AACATTTAAA...AATAAAAAAA
>=0.85:[639114..665595](26482)CAATTTCTAA...TATATTTTTT
>=0.8:[41709..1188699](1146991)TTATTTTTTT...AGTTAAGTTA
>=0.75:[0..1204111](1204112)AACCCTAAAC...TTAGGGTTTA
#--- end ---
 
chr5
#--- Biased.java, L.A., CSSE, Monash, .au ---
# argv[0] = AT
# argv[1] = -1
# argv[2] = chr05-2002-8-19-Sanger
# 1343552bp CTAAACCCTGAACCCT...
>=1.0:[208403..208597](195)AAAAAAAAAT...AATAAAATAA
>=0.95:[454931..458053](3123)CAAATTTTTG...TGGAATATAA
>=0.9:[453877..459351](5475)TTTATTTTAT...TATAAGAAAT
>=0.85:[1304498..1321991](17494)TTTAAAAATG...TGTTTTAATT
>=0.8:[0..1343551](1343552)CTAAACCCTG...GGTTCAGGGT
#--- end ---
 
chr6
#--- Biased.java, L.A., CSSE, Monash, .au ---
# argv[0] = AT
# argv[1] = -1
# argv[2] = chr06-2002-8-19-Sanger
# 1377956bp CCTAAACCCTGAACCCTAA...
>=1.0:[39936..40118](183)ATAAAATAAA...AAAATTTTAT
>=0.95:[528479..529541](1063)ATTAAAATAA...ATTTTTAGAT
>=0.9:[1015753..1019942](4190)TTATTATTAA...TTAGTTTTAT
>=0.85:[1012661..1025787](13127)CAAAACCACC...TTTATGGATT
>=0.8:[0..1377955](1377956)CCTAAACCCT...AGGGTTTAGG
#--- end ---
 
chr7
#--- Biased.java, L.A., CSSE, Monash, .au ---
# argv[0] = AT
# argv[1] = -1
# argv[2] = chr07-2002-8-19-Sanger
# 1350452bp AAACCCCAACCCTAAACCCTA...
>=1.0:[714118..714370](253)TTATTTTTAA...TAATTAATAT
>=0.95:[713290..715097](1808)AATATATTTA...ATCTTAATTT
>=0.9:[105239..108481](3243)AACATAAATA...ATATTAAAAA
>=0.85:[67481..86120](18640)TTCACAATTT...AAAATATATT
>=0.8:[0..1350451](1350452)AAACCCCAAC...TAAGACAGTA
#--- end ---
 
chr8
#--- Biased.java, L.A., CSSE, Monash, .au ---
# argv[0] = AT
# argv[1] = -1
# argv[2] = chr08-2002-08-19-Sanger
# 1323195bp TCAAATATTCCGAAAACCCTAA...
>=1.0:[344230..344437](208)ATTATTTATA...TTTTTTTTTT
>=0.95:[1022302..1023765](1464)TTTTATATGA...ATTTAAAAAA
>=0.9:[378658..381469](2812)ATTTTGATAA...TTTTTAGTTT
>=0.85:[500245..516452](16208)TATTAATTTA...AATATATATT
>=0.8:[0..1323194](1323195)TCAAATATTC...TTAGGGTTGA
#--- end ---
 
chr9
#--- Biased.java, L.A., CSSE, Monash, .au ---
# argv[0] = AT
# argv[1] = -1
# argv[2] = chr09-2002-8-19-Sanger
# 1541723bp AACCCTGAACCCTAAA...
>=1.0:[1130660..1130874](215)ATAAATTAAT...TATATTAATA
>=0.95:[1241781..1244923](3143)TAATTATAAA...AATACATTTA
>=0.9:[323390..329864](6475)TAATTATATG...TTATCATAAT
>=0.85:[1373697..1416898](43202)TATATTTTGA...TATTTATTTT
>=0.8:[0..1541722](1541723)AACCCTGAAC...GTTTAGGGTT
#--- end ---
 
chr10
#--- Biased.java, L.A., CSSE, Monash, .au ---
# argv[0] = AT
# argv[1] = -1
# argv[2] = chr10-2002-8-19-TIGR
# 1694445bp CTAAACCCTGAACCC...
>=1.0:[937258..937557](300)TATATAATTT...ATATTATATT
>=0.95:[936412..938559](2148)ATTTATTAAT...TAAAACTTAA
>=0.9:[934530..938484](3955)AATAAAATAT...TTTTTATGAT
>=0.85:[87704..106686](18983)AATATGAACA...ACCATATAAA
>=0.8:[0..1694444](1694445)CTAAACCCTG...TTTTAGGGTT
#--- end ---
 
chr11
#--- Biased.java, L.A., CSSE, Monash, .au ---
# argv[0] = AT
# argv[1] = -1
# argv[2] = chr11-2002-8-19-TIGR
# 2035250bp AACCCTAAACCC...
>=1.0:[829543..830061](519)TATTAAATAT...TAAATATTTA
>=0.95:[829012..831558](2547)AATGTAAATA...TACTTATATT
>=0.9:[827029..831902](4874)ATAAACAAAT...AAATATATAT
>=0.85:[114132..140968](26837)TATTTGATAA...TATTTTTTAA
>=0.8:[0..2035249](2035250)AACCCTAAAC...TTGGGTTTAG
#--- end ---
 
chr12
#--- Biased.java, L.A., CSSE, Monash, .au ---
# argv[0] = AT
# argv[1] = -1
# argv[2] = chr12-2002-8-19-Stanford
# 2271477bp CTGAACCCTAAACCCTA...
>=1.0:[1282796..1283108](313)AAATAAATAA...AAATAATATT
>=0.95:[1281883..1285109](3227)AAAAAAAACG...ATATATATTT
>=0.9:[1279598..1285959](6362)ATTTTTTTTA...ATTTTAAATA
>=0.85:[2156101..2180543](24443)TAATTAGAAG...TATATTGTAT
>=0.8:[0..2271476](2271477)CTGAACCCTA...AGAGTAAGTA
#--- end ---
 
chr13_1 (NB)
#--- Biased.java, L.A., CSSE, Monash, .au ---
# argv[0] = AT
# argv[1] = -1
# argv[2] = chr13_1-2002-8-19-Sanger
# 2729159bp AACCCTAAACCTAAACCCTAA...
>=1.0:[918090..918461](372)AAATATTATA...TAAATATATA
>=0.95:[1766404..1767423](1020)ATATTATATT...TTATTATTAT
>=0.9:[1817357..1822403](5047)AATTAATATA...ATAATTTTTA
>=0.85:[1809128..1827314](18187)ATTTTTACTT...TACCGTAATT
>=0.8:[0..2729158](2729159)AACCCTAAAC...GTTTAGGGTT
#--- end --
 
chr13_2 (NB)
#--- Biased.java, L.A., CSSE, Monash, .au ---
# argv[0] = AT
# argv[1] = -1
# argv[2] = chr13_2-2002-8-19-Sanger
# 18168bp TATTTTAGTTTATATTTATTACA...
>=1.0:[10716..10816](101)TTAATTATAA...ATTAAAATAT
>=0.95:[10604..10816](213)AAAAGTATTT...ATTAAAATAT
>=0.9:[9293..9869](577)AATAAAAAAA...TATATAAATT
>=0.85:[9070..12951](3882)TTACTATATA...AATAAGATAT
>=0.8:[8084..18162](10079)AATGCCCCCA...TAAAAAAATA
>=0.75:[0..18167](18168)TATTTTAGTT...AAATAGCAAC
#--- end ---
 
chr14
#--- Biased.java, L.A., CSSE, Monash, .au ---
# argv[0] = AT
# argv[1] = -1
# argv[2] = chr14-2002-8-19-TIGR
# 3291006bp CTGAACCCTAAACCCTAAAC...
>=1.0:[1072223..1072590](368)TATATTAATA...AAATATAATT
>=0.95:[1071769..1073716](1948)TATAAAATAT...TTGTTTTATT
>=0.9:[2716237..2720216](3980)TTTTTAATAA...CTCACAGAAT
>=0.85:[2792750..2814996](22247)ATCTAAAATG...TTTTTTTTTT
>=0.8:[0..3291005](3291006)CTGAACCCTA...GTTTAGGGTT
#--- end ---
 

17/10/2002: Examples from Malaria Information Resource (CD) Release 5

Key: >=<min_frac>: [<start>..<end>] (<length>) <beginning>...<ending>
 
Example chr5: /mnt/cdrom1/genome/sanger/chr5/z010701f.txt of 26 Aug 2001
#--- Biased.java, L.A., CSSE, Monash, .au ---
# argv[0] = AT
# argv[1] = -1
# argv[2] = c5-z010701f.txt
# 1343552bp CTAAACCCTGAACCCTA...
>=1.0:[208403..208597](195)AAAAAAAAAT...AATAAAATAA
>=0.95:[454931..458053](3123)CAAATTTTTG...TGGAATATAA
>=0.9:[453877..459351](5475)TTTATTTTAT...TATAAGAAAT
>=0.85:[1304498..1321991](17494)TTTAAAAATG...TGTTTTAATT
>=0.8:[0..1343551](1343552)CTAAACCCTG...GGTTCAGGGT
#--- end ---
 
Example chr10: /mnt/cdrom1/genome/tigr/chr10/pfa1_10.txt of 26 Aug 2001
#--- Biased.java, L.A., CSSE, Monash, .au ---
# argv[0] = AT
# argv[1] = -1
# argv[2] = pfa1_10.txt
# 1694401bp CTAAACCCTGAACC...
>=1.0:[937258..937557](300)TATATAATTT...ATATTATATT
>=0.95:[936412..938559](2148)ATTTATTAAT...TAAAACTTAA
>=0.9:[934530..938484](3955)AATAAAATAT...TTTTTATGAT
>=0.85:[87704..106686](18983)AATATGAACA...ACCATATAAA
>=0.8:[0..1694400](1694401)CTAAACCCTG...TTTTAGGGTT
#--- end ---
 
Example chr11: /mnt/cdrom1/genome/tigr/chr11/pfa1_11.txt of 26 Aug 2001
#--- Biased.java, L.A., CSSE, Monash, .au ---
# argv[1] = -1
# argv[2] = pfa1_11.txt
# 2035244bp AACCCTAAACCCTGAACCCTGAACCCTGAA...
>=1.0:[829539..830057](519)TATTAAATAT...TAAATATTTA
>=0.95:[829008..831554](2547)AATGTAAATA...TACTTATATT
>=0.9:[827025..831898](4874)ATAAACAAAT...AAATATATAT
>=0.85:[114130..140966](26837)TATTTGATAA...TATTTTTTAA
>=0.8:[0..2035243](2035244)AACCCTAAAC...TTGGGTTTAG
#--- end ---
 
Example chr12: /mnt/cdrom1/genome/stanford/chr12/z010524.txt (>Chr12Contig01.010524 2270790 bp) of 26 Aug 2001
#--- Biased.java, L.A., CSSE, Monash, .au ---
# argv[0] = AT
# argv[1] = -1
# argv[2] = c12-z010524f.txt
# 2270790bp CTAAACCCTGAACCCTAAAC...
>=1.0:[1281901..1282213](313)AAATAAATAA...AAATAATATT
>=0.95:[1280988..1284214](3227)AAAAAAAACG...ATATATATTT
>=0.9:[1278703..1285042](6340)ATTTTTTTTA...TTAGAAGAAA
>=0.85:[2155396..2179822](24427)TAATTAGAAG...GAAAATGAAT
>=0.8:[0..2270789](2270790)CTAAACCCTG...GTAAGGAATG
#--- end ---
 
Example chr14: /mnt/cdrom1/genome/tigr/chr14/pfa1_14.txt of 26 Aug 2001
#--- Biased.java, L.A., CSSE, Monash, .au ---
# argv[0] = AT
# argv[1] = -1
# argv[2] = pf-chr14.txt
# 3290890bp CTGAACCCTAAACCCTAAACCCTAAACCCTAA...
>=1.0 : [1072219..1072586] (368) TATATTAATA...AAATATAATT
>=0.95: [1071765..1073712] (1948) TATAAAATAT...TTGTTTTATT
>=0.9 : [2716123..2720102] (3980) TTTTTAATAA...CTCACAGAAT
>=0.85: [2792661..2814887] (22227) ATCTTGTAAA...TTTCCCCATA
>=0.8 : [0..3290889] (3290890) CTGAACCCTA...GTTTAGGGTT
#--- end ---
 
Time (chr14, bayes  900mHz Intel + Linux): real 0m7.799s; user 0m7.250s; sys 0m0.340s. Note that the time to read the input file was a significant part of the elapsed time.