The bits between proteins

Dinithi Sumanaweera, Lloyd Allison, Arun S. Konagurthu

Data Compression Conference (DCC), pp. 177-186, doi:10.1109/DCC.2018.00026, 27-30 March 2018

Abstract: Comparison of protein sequences via alignment is an important routine in modern biological studies. Although the technologies for aligning proteins are mature, the current state of the art continues to be plagued by many shortcomings, chiefly due to the reliance on: (i) naive objective functions, (ii) fixed substitution scores independent of the sequences being considered, (iii) arbitrary choices for gap costs, and (iv) reporting, often, only one optimal alignment without a way to recognise other competing sequence alignments. Here, we address these shortcomings by applying the compression-based Minimum Message Length (MML) inference framework to the protein sequence alignment problem. This grounds the problem in statistical learning theory, directly handles the complexity-vs-fit trade-off without ad hoc gap costs, allows unsupervised inference of all the statistical parameters, and permits the visualization and exploration of the landscape of competing sequence alignments.

At the IEEE: [doi:10.1109/DCC.2018.00026].