Longest Biased Interval and Longest Non-Negative Sum Interval,
J. Bioinformatics, 2003
L. Allison. Longest Biased Interval and Longest Non-Negative Sum Interval, Bioinformatics, 19(10), pp1294-1295, 1 July 2003, [doi:10.1093/bioinformatics/btg135]['05] also see [more (click)] discussion on the algorithm and on related algorithms. |
- Java source file:
[Biased.java], and class files:
[1], [2]
- To run, e.g.
- java Biased "AT" 0.77 dataFileName
- for a particular figure (here 0.77 "AT"), or ...
- java Biased "AT" -1 dataFileName
- to run through a range of values.
- NB. Make sure that you have a just-in-time compiler if you want to analyse long sequences.
- NB. Characters other than acgtuACGTUrnyRNY are skipped but this is trivial to change.
- Example of running [time (click)].
- for a particular figure (here 0.77 "AT"), or ...
-- Figure: cummulative sum, first-below & last-above --
Sung Kwon Kim suggested a 3-pass, (always) linear-time algorithm (personnal communication, 28/7/2003). I have coded this up in a program that runs both algorithms: [new code .java]. |
21/10/2002: Examples from PlasmoDB C.D., seq's dated 2002.8.19 [19/8/2002]
NB. Plasmodium falciparum causes malaria. Its DNA is 80% AT, contains repetitive patterns and is highly [compressible] compared to ``most'' other DNA. The data files of chromosome sequences 10, 11, 14 contain some non-{A,C,G,T} characters, e.g. M={A,C}, W={A,T}, K={G,T} etc.. For these experiments, a read-routine was used that does read these characters (to maintain position numbering), but they were treated as ``regular'' characters, not contributing to AT-richness, say, because (i) it makes only a little difference unless there are many occurrences of such characters, and (ii) it is not quite clear what is the best way to treat some of them anyway (e.g. does M count towards AT or not?). In any case, the matter is orthogonal to the algorithm; the ``special characters'' can be included in the list of desired characters, e.g. "ATW", at your choice.
Running time, on a 900MHz Pentium, Java, Linux, for the largest sequence (chr14) is 7 or 8 seconds for five passes with different values of f and this includes reading the data.
- Key: >=<min_frac>: [<start>..<end>] (<length>) <beginning>...<ending>
- chr1
- #--- Biased.java, L.A., CSSE, Monash, .au ---
- # argv[0] = AT
- # argv[1] = -1
- # argv[2] = chr01-2002-8-19-Sanger
- # 643292bp CTAAACCTAAACCTAA...
>=1.0: [459134..459485] (352) AAAATAATAA...AATAAATAAA >=0.95: [458479..461841] (3363) TTAACATTTT...TTTATTATAT >=0.9: [458757..465678] (6922) ATGTTTCTAA...TTTATTTAAT >=0.85: [439465..465974] (26510) TTAATTGTAA...TCTTTTGATT >=0.8: [8596..625720] (617125) ATTTCTACAC...TATATTAAGA >=0.75: [0..643291] (643292) CTAAACCTAA...GGAATAGGGT - #--- end ---
- # argv[0] = AT
- chr2
- #--- Biased.java, L.A., CSSE, Monash, .au ---
- # argv[0] = AT
- # argv[1] = -1
- # argv[2] = chr02-200208019-TIGR
- # 947102bp AACCCTAAACCCTAAACCCTAAACCCTA...
>=1.0: [449263..449686] (424) TTATTTTAAT...TATATTATTT >=0.95: [447292..450582] (3291) TACAAAATAA...AAAAAAAATA >=0.9: [445946..451647] (5702) TATATTTTAT...GGAAAAAAAT >=0.85: [807673..840316] (32644) ATGAAAATTA...ATTTATCATA >=0.8: [0..947101] (947102) AACCCTAAAC...TCAGGGTTCA - #--- end ---
- # argv[0] = AT
- chr3
- #--- Biased.java, L.A., CSSE, Monash, .au ---
- # argv[0] = AT
- # argv[1] = -1
- # argv[2] = chr03-2002-8-19-Sanger
- # 1060087bp TAAACCCTGAACCCTAAACCCT...
>=1.0: [594855..595306] (452) AAAATAAATT...ATTAATAATT >=0.95: [594084..597349] (3266) AAATAAATAA...TTTTGTTAAA >=0.9: [590628..597388] (6761) AATACTAAAG...ATATACTTAT >=0.85: [76744..100990] (24247) AATATTAAAC...AACTCATAAA >=0.8: [0..1060086] (1060087) TAAACCCTGA...TCAGGTTTTA - #--- end ---
- # argv[0] = AT
- chr4
- #--- Biased.java, L.A., CSSE, Monash, .au ---
- # argv[0] = AT
- # argv[1] = -1
- # argv[2] = chr04-2002-8-19-Sanger
- # 1204112bp AACCCTAAACCCTGA...
>=1.0: [650061..650262] (202) TTTTAATAAT...ATAATTAAAA >=0.95: [648539..651828] (3290) TATAAACATT...TTTTATATTT >=0.9: [647731..654304] (6574) AACATTTAAA...AATAAAAAAA >=0.85: [639114..665595] (26482) CAATTTCTAA...TATATTTTTT >=0.8: [41709..1188699] (1146991) TTATTTTTTT...AGTTAAGTTA >=0.75: [0..1204111] (1204112) AACCCTAAAC...TTAGGGTTTA - #--- end ---
- # argv[0] = AT
- chr5
- #--- Biased.java, L.A., CSSE, Monash, .au ---
- # argv[0] = AT
- # argv[1] = -1
- # argv[2] = chr05-2002-8-19-Sanger
- # 1343552bp CTAAACCCTGAACCCT...
>=1.0: [208403..208597] (195) AAAAAAAAAT...AATAAAATAA >=0.95: [454931..458053] (3123) CAAATTTTTG...TGGAATATAA >=0.9: [453877..459351] (5475) TTTATTTTAT...TATAAGAAAT >=0.85: [1304498..1321991] (17494) TTTAAAAATG...TGTTTTAATT >=0.8: [0..1343551] (1343552) CTAAACCCTG...GGTTCAGGGT - #--- end ---
- # argv[0] = AT
- chr6
- #--- Biased.java, L.A., CSSE, Monash, .au ---
- # argv[0] = AT
- # argv[1] = -1
- # argv[2] = chr06-2002-8-19-Sanger
- # 1377956bp CCTAAACCCTGAACCCTAA...
>=1.0: [39936..40118] (183) ATAAAATAAA...AAAATTTTAT >=0.95: [528479..529541] (1063) ATTAAAATAA...ATTTTTAGAT >=0.9: [1015753..1019942] (4190) TTATTATTAA...TTAGTTTTAT >=0.85: [1012661..1025787] (13127) CAAAACCACC...TTTATGGATT >=0.8: [0..1377955] (1377956) CCTAAACCCT...AGGGTTTAGG - #--- end ---
- # argv[0] = AT
- chr7
- #--- Biased.java, L.A., CSSE, Monash, .au ---
- # argv[0] = AT
- # argv[1] = -1
- # argv[2] = chr07-2002-8-19-Sanger
- # 1350452bp AAACCCCAACCCTAAACCCTA...
>=1.0: [714118..714370] (253) TTATTTTTAA...TAATTAATAT >=0.95: [713290..715097] (1808) AATATATTTA...ATCTTAATTT >=0.9: [105239..108481] (3243) AACATAAATA...ATATTAAAAA >=0.85: [67481..86120] (18640) TTCACAATTT...AAAATATATT >=0.8: [0..1350451] (1350452) AAACCCCAAC...TAAGACAGTA - #--- end ---
- # argv[0] = AT
- chr8
- #--- Biased.java, L.A., CSSE, Monash, .au ---
- # argv[0] = AT
- # argv[1] = -1
- # argv[2] = chr08-2002-08-19-Sanger
- # 1323195bp TCAAATATTCCGAAAACCCTAA...
>=1.0: [344230..344437] (208) ATTATTTATA...TTTTTTTTTT >=0.95: [1022302..1023765] (1464) TTTTATATGA...ATTTAAAAAA >=0.9: [378658..381469] (2812) ATTTTGATAA...TTTTTAGTTT >=0.85: [500245..516452] (16208) TATTAATTTA...AATATATATT >=0.8: [0..1323194] (1323195) TCAAATATTC...TTAGGGTTGA - #--- end ---
- # argv[0] = AT
- chr9
- #--- Biased.java, L.A., CSSE, Monash, .au ---
- # argv[0] = AT
- # argv[1] = -1
- # argv[2] = chr09-2002-8-19-Sanger
- # 1541723bp AACCCTGAACCCTAAA...
>=1.0: [1130660..1130874] (215) ATAAATTAAT...TATATTAATA >=0.95: [1241781..1244923] (3143) TAATTATAAA...AATACATTTA >=0.9: [323390..329864] (6475) TAATTATATG...TTATCATAAT >=0.85: [1373697..1416898] (43202) TATATTTTGA...TATTTATTTT >=0.8: [0..1541722] (1541723) AACCCTGAAC...GTTTAGGGTT - # argv[0] = AT
- chr10
- #--- Biased.java, L.A., CSSE, Monash, .au ---
- # argv[0] = AT
- # argv[1] = -1
- # argv[2] = chr10-2002-8-19-TIGR
- # 1694445bp CTAAACCCTGAACCC...
>=1.0: [937258..937557] (300) TATATAATTT...ATATTATATT >=0.95: [936412..938559] (2148) ATTTATTAAT...TAAAACTTAA >=0.9: [934530..938484] (3955) AATAAAATAT...TTTTTATGAT >=0.85: [87704..106686] (18983) AATATGAACA...ACCATATAAA >=0.8: [0..1694444] (1694445) CTAAACCCTG...TTTTAGGGTT - # argv[0] = AT
- chr11
- #--- Biased.java, L.A., CSSE, Monash, .au ---
- # argv[0] = AT
- # argv[1] = -1
- # argv[2] = chr11-2002-8-19-TIGR
- # 2035250bp AACCCTAAACCC...
>=1.0: [829543..830061] (519) TATTAAATAT...TAAATATTTA >=0.95: [829012..831558] (2547) AATGTAAATA...TACTTATATT >=0.9: [827029..831902] (4874) ATAAACAAAT...AAATATATAT >=0.85: [114132..140968] (26837) TATTTGATAA...TATTTTTTAA >=0.8: [0..2035249] (2035250) AACCCTAAAC...TTGGGTTTAG - #--- end ---
- # argv[0] = AT
- chr12
- #--- Biased.java, L.A., CSSE, Monash, .au ---
- # argv[0] = AT
- # argv[1] = -1
- # argv[2] = chr12-2002-8-19-Stanford
- # 2271477bp CTGAACCCTAAACCCTA...
>=1.0: [1282796..1283108] (313) AAATAAATAA...AAATAATATT >=0.95: [1281883..1285109] (3227) AAAAAAAACG...ATATATATTT >=0.9: [1279598..1285959] (6362) ATTTTTTTTA...ATTTTAAATA >=0.85: [2156101..2180543] (24443) TAATTAGAAG...TATATTGTAT >=0.8: [0..2271476] (2271477) CTGAACCCTA...AGAGTAAGTA - #--- end ---
- # argv[0] = AT
- chr13_1 (NB)
- #--- Biased.java, L.A., CSSE, Monash, .au ---
- # argv[0] = AT
- # argv[1] = -1
- # argv[2] = chr13_1-2002-8-19-Sanger
- # 2729159bp AACCCTAAACCTAAACCCTAA...
>=1.0: [918090..918461] (372) AAATATTATA...TAAATATATA >=0.95: [1766404..1767423] (1020) ATATTATATT...TTATTATTAT >=0.9: [1817357..1822403] (5047) AATTAATATA...ATAATTTTTA >=0.85: [1809128..1827314] (18187) ATTTTTACTT...TACCGTAATT >=0.8: [0..2729158] (2729159) AACCCTAAAC...GTTTAGGGTT - #--- end --
- # argv[0] = AT
- chr13_2 (NB)
- #--- Biased.java, L.A., CSSE, Monash, .au ---
- # argv[0] = AT
- # argv[1] = -1
- # argv[2] = chr13_2-2002-8-19-Sanger
- # 18168bp TATTTTAGTTTATATTTATTACA...
>=1.0: [10716..10816] (101) TTAATTATAA...ATTAAAATAT >=0.95: [10604..10816] (213) AAAAGTATTT...ATTAAAATAT >=0.9: [9293..9869] (577) AATAAAAAAA...TATATAAATT >=0.85: [9070..12951] (3882) TTACTATATA...AATAAGATAT >=0.8: [8084..18162] (10079) AATGCCCCCA...TAAAAAAATA >=0.75: [0..18167] (18168) TATTTTAGTT...AAATAGCAAC - #--- end ---
- # argv[0] = AT
- chr14
- #--- Biased.java, L.A., CSSE, Monash, .au ---
- # argv[0] = AT
- # argv[1] = -1
- # argv[2] = chr14-2002-8-19-TIGR
- # 3291006bp CTGAACCCTAAACCCTAAAC...
>=1.0: [1072223..1072590] (368) TATATTAATA...AAATATAATT >=0.95: [1071769..1073716] (1948) TATAAAATAT...TTGTTTTATT >=0.9: [2716237..2720216] (3980) TTTTTAATAA...CTCACAGAAT >=0.85: [2792750..2814996] (22247) ATCTAAAATG...TTTTTTTTTT >=0.8: [0..3291005] (3291006) CTGAACCCTA...GTTTAGGGTT - #--- end ---
- # argv[0] = AT
17/10/2002: Examples from Malaria Information Resource (CD) Release 5
- Key: >=<min_frac>: [<start>..<end>] (<length>) <beginning>...<ending>
- Example chr5: /mnt/cdrom1/genome/sanger/chr5/z010701f.txt of 26 Aug 2001
- #--- Biased.java, L.A., CSSE, Monash, .au ---
- # argv[0] = AT
- # argv[1] = -1
- # argv[2] = c5-z010701f.txt
- # 1343552bp CTAAACCCTGAACCCTA...
>=1.0: [208403..208597] (195) AAAAAAAAAT...AATAAAATAA >=0.95: [454931..458053] (3123) CAAATTTTTG...TGGAATATAA >=0.9: [453877..459351] (5475) TTTATTTTAT...TATAAGAAAT >=0.85: [1304498..1321991] (17494) TTTAAAAATG...TGTTTTAATT >=0.8: [0..1343551] (1343552) CTAAACCCTG...GGTTCAGGGT - #--- end ---
- # argv[0] = AT
- Example chr10: /mnt/cdrom1/genome/tigr/chr10/pfa1_10.txt of 26 Aug 2001
- #--- Biased.java, L.A., CSSE, Monash, .au ---
- # argv[0] = AT
- # argv[1] = -1
- # argv[2] = pfa1_10.txt
- # 1694401bp CTAAACCCTGAACC...
>=1.0: [937258..937557] (300) TATATAATTT...ATATTATATT >=0.95: [936412..938559] (2148) ATTTATTAAT...TAAAACTTAA >=0.9: [934530..938484] (3955) AATAAAATAT...TTTTTATGAT >=0.85: [87704..106686] (18983) AATATGAACA...ACCATATAAA >=0.8: [0..1694400] (1694401) CTAAACCCTG...TTTTAGGGTT - #--- end ---
- # argv[0] = AT
- Example chr11: /mnt/cdrom1/genome/tigr/chr11/pfa1_11.txt of 26 Aug 2001
- #--- Biased.java, L.A., CSSE, Monash, .au ---
- # argv[1] = -1
- # argv[2] = pfa1_11.txt
- # 2035244bp AACCCTAAACCCTGAACCCTGAACCCTGAA...
>=1.0: [829539..830057] (519) TATTAAATAT...TAAATATTTA >=0.95: [829008..831554] (2547) AATGTAAATA...TACTTATATT >=0.9: [827025..831898] (4874) ATAAACAAAT...AAATATATAT >=0.85: [114130..140966] (26837) TATTTGATAA...TATTTTTTAA >=0.8: [0..2035243] (2035244) AACCCTAAAC...TTGGGTTTAG - #--- end ---
- # argv[1] = -1
- Example chr12: /mnt/cdrom1/genome/stanford/chr12/z010524.txt (>Chr12Contig01.010524 2270790 bp) of 26 Aug 2001
- #--- Biased.java, L.A., CSSE, Monash, .au ---
- # argv[0] = AT
- # argv[1] = -1
- # argv[2] = c12-z010524f.txt
- # 2270790bp CTAAACCCTGAACCCTAAAC...
>=1.0: [1281901..1282213] (313) AAATAAATAA...AAATAATATT >=0.95: [1280988..1284214] (3227) AAAAAAAACG...ATATATATTT >=0.9: [1278703..1285042] (6340) ATTTTTTTTA...TTAGAAGAAA >=0.85: [2155396..2179822] (24427) TAATTAGAAG...GAAAATGAAT >=0.8: [0..2270789] (2270790) CTAAACCCTG...GTAAGGAATG - #--- end ---
- # argv[0] = AT
- Example chr14: /mnt/cdrom1/genome/tigr/chr14/pfa1_14.txt of 26 Aug 2001
- #--- Biased.java, L.A., CSSE, Monash, .au ---
- # argv[0] = AT
- # argv[1] = -1
- # argv[2] = pf-chr14.txt
- # 3290890bp CTGAACCCTAAACCCTAAACCCTAAACCCTAA...
>=1.0 : [1072219..1072586] (368) TATATTAATA...AAATATAATT >=0.95: [1071765..1073712] (1948) TATAAAATAT...TTGTTTTATT >=0.9 : [2716123..2720102] (3980) TTTTTAATAA...CTCACAGAAT >=0.85: [2792661..2814887] (22227) ATCTTGTAAA...TTTCCCCATA >=0.8 : [0..3290889] (3290890) CTGAACCCTA...GTTTAGGGTT - #--- end ---
- Time (chr14, bayes 900mHz Intel + Linux): real 0m7.799s; user 0m7.250s; sys 0m0.340s. Note that the time to read the input file was a significant part of the elapsed time.
- # argv[0] = AT