Inductive Logic Programming
Predicting hexose binding sites in proteins
Since we're not doing original research, but rather intend to demonstrate a port of the Aleph ILP package to ISO Prolog running on Quantum Prolog, we cite the problem's definition in full from the original paper (ilp09):
Hexoses are 6-carbon simple sugar molecules that play a key role in different biochemical pathways, including cellular energy release, signaling, carbohydrate synthesis, and the regulation of gene expression. Hexose binding proteins belong to diverse functional families that lack significant sequence or, often, structural similarity. Despite this fact, these proteins show high specificity to their hexose ligands. The few amino acids (also called residues) present at the binding site play a large role in determining the binding site's distinctive topology and biochemical properties and hence the ligand type and the protein's functionality.
This work presents an ILP classifier that extracts rules [...] without prior biochemical knowledge. It classifies binding sites based on the extracted biochemical rules, clearly specifying the rules used to discriminate each instance.
For demonstration purposes, we're running the ILP problem in its extended form including facts about amino-acid geometries, using biochemical knowledge from publicly available sources as formulated in (bmcbioinfo11):
The background knowledge is the set of features, facts and rules known a priori. This is given to the ILP system as a basis for learning and constructing the classification rules. The piece of background knowledge central to our task is the binding site representation. [The table below] is an excerpt of the background knowledge for protein 1BDG. The
center_coords
predicate specifies the binding-site center coordinates, which is the pyranose ring centroid of the bound glucose in this structure. Thehas_aminoacid
predicate specifies each amino acid present within the protein binding site, listing its unique identifier and name. Thehas_atom
predicate details the residue atoms, specifying the protein database (PDB) atom name and its coordinates. By extracting the coordinates of the center and the various atoms, we compute their respective distances. We set a tolerance of 0.5 Angstrom on distances between atoms, a sensible error margin in a hexose binding site.
Predicates for binding site representation
center coords/2, has_atom/4, dist/4,
aminoacid/3, atom_to_center dist/4,
atom_to_atom_dist/6, diff_aminoacid/2
Excerpt of the background knowledge for protein 1BDG
center_coords(p1BDG, p(27.0,22.1,64.9)).
has_aminoacid(p1BDG, a64, phe).
has_aminoacid(p1BDG, a85, leu).
has_aminoacid(p1BDG, a86, gly).
has_aminoacid(p1BDG, a87, gly).
has_atom(p1BDG, a64, 'CD2', p(22.4,13.3,65.5)).
has_atom(p1BDG, a64, 'CE2', p(21.6,14.0,66.4)).
has_atom(p1BDG, a85, 'C', p(24.6,25.9,57.4)).
has_atom(p1BDG, a85, 'O', p(24.6,24.8,57.8)).
has_atom(p1BDG, a86, 'N', p(24.8,27.0,58.3)).
has_atom(p1BDG, a86, 'CA', p(24.9,26.8,59.7)).
Result theory
From these facts, the ILP process derives rules for protein-hexose bindings such as the following:
bind(A) :-
has_aminoacid(A, B, gly),
atom_to_center_dist(B, CA, 7.2, 0.5),
has_aminoacid(A, C, phe),
atom_to_center_dist(C, C, 7.4, 0.5),
atom_to_center_dist(C, O, 7.2, 0.5).
bind(A) :-
has_aminoacid(A, B, tyr),
atom_to_center_dist(B, CZ, 9.5, 0.5),
has_aminoacid(A, C, trp),
atom_to_center_dist(C, O, 7.6, 0.5),
has_aminoacid(A, D, gln).
bind(A) :-
has_aminoacid(A, B, glu),
atom_to_center_dist(B, N, 9.5, 0.5),
atom_to_center_dist(B, CG, 6.8, 0.5),
has_aminoacid(A, C, his).
bind(A) :-
has_aminoacid(A, B, asn),
atom_to_center_dist(B, C, 8.1, 0.5),
atom_to_center_dist(B, O, 8.3, 0.5),
atom_to_center_dist(B, CB, 7.2, 0.5),
atom_to_center_dist(B, CG, 5.7, 0.5).
bind(A) :-
has_aminoacid(A, B, trp),
atom_to_center_dist(B, CA, 9.0, 0.5),
atom_to_center_dist(B, O, 9.6, 0.5),
has_aminoacid(A, C, ser).
bind(A) :-
has_aminoacid(A, B, arg),
atom_to_center_dist(B, C, 8.1, 0.5),
atom_to_center_dist(B, CD, 9.1, 0.5).
bind(A) :-
has_aminoacid(A, B, gln),
atom_to_center_dist(B, CG, 9.0, 0.5),
has_aminoacid(A, C, asp),
atom_to_center_dist(C, CG, 5.3, 0.5).
Running the Aleph ISO Prolog port on Quantum Prolog
For running the ISO Prolog Aleph port on the hexose_aminoacid
ILP problem, we're invoking a Prolog program that includes
(via the ensure_loaded/1
directive) the Aleph ISO Port source
starting from a current working directory containing
the hexose.f
and hexose.n
positive and negative examples,
respectively, along with the hexose.pl
background theory that,
in turn, includes further utility clauses. Then we're executing
Aleph goals using an initialization/1
directive such as the
following:
:- ensure_loaded('../aleph-iso.pl').
:- initialization((
read_all('hexose'),
induce,
write_rules('write_rules.out')).
To give an idea about performance relative to popular Prolog
implementations, we're comparing against a version of SWI Prolog.
Note that up-to-date versions of SWI Prolog can no longer run
Aleph v5 as-is, since newer SWI Prolog versions include a declaration
for the false/0
predicate as non-dynamic; which is in accordance
with ISO/IEC 13211-1 (2012) but unfortunate, since Aleph uses
predicates of the form
false :- ...
to encode negative examples. For comparison, we must thus run the last version of SWI Prolog that doesn't have this nor other issues, which is the rather old SWI Prolog version 5.8.3 (the slightly newer SWI Prolog version 5.10.x will abort an attempt to run Aleph with an unrelated assertion error).
That older SWI Prolog version, though, lacks secondary term indexing, so for fairness we're switching this feature off in Quantum Prolog as well for obtaining the figures stated below.
Moreover, we're stopping the loop performed by induce/0
after
the first iteration; that is, after AtomsLeft
as of
'$aleph_global'(atoms_left,atoms_left(pos,AtomsLeft))
has reached
[2-12,14-19,21-26,28-37,40-49,51-52,57-80].
Under these conditions, we can report running times for Aleph on Quantum Prolog roughly comparable to those of SWI Prolog, at about 1,5 to 1.75min on mainstream notebook hardware.
Note both Quantum Prolog and SWI Prolog exhibit severe performance
degradation due to not using secondary term indexing, when
running times are dominated by accesses to indexable
has_aminoaid/4
calls. On the other hand, YAP Prolog,
the Prolog system on which Aleph was originally developed,
with its known performance characteristic on suitable large
Prolog databases by employing heuristic term indexing, is expected
to perform way better than either Quantum Prolog or SWI Prolog.
However, at the time of this writing, no current YAP Prolog
on the test system was available or could be successfully
compiled for comparison.
Of course, re-enabling secondary term indexing, Quantum Prolog reaches much higher execution times (up to four times faster than without), just as up-to-date versions of SWI Prolog do.
This isn't surprising since execution times of Aleph on the
Hexose/Aminoacid problem is pretty much dominated by verifying examples
against hypotheses; for a full run of Aleph, the has_aminoacid/3
predicate is called more than 100.000.000 times! We're going to
tackle exactly that problem in part 2.
Bibliography
Prolog has a rich history of applications in bioinformatics,
starting with classical discrete approaches towards gene
sequence representation (aaai88s), annotation (moeller01),
and analysis (nar84, magpie), protein motif search
(jicslp92), classification (ieicetrans95), genome
comparison (bansai95, padl98), protein structure databases
(tibc96, computchem86), rule-based metabolic simulations
(bmcbioinfo08, cabios91), and grammar-based models
(psob97, aisrra91), the latter with connections
to recent probabilistic grammar models
(eg. https://www.biorxiv.org/content/10.1101/2020.12.08.416503v1
).
Prolog also lends itself well to classic dynamic programming approaches for comparison, alignment, and mutation of nucleotide sequences useful for implementating Needleman-Wunsch's, Hirschberg's, or Smith-Waterman's algorithms.
For example, tcm90 contains an extended implementation of the classic Kabsch-Sander secondary structure definition algorithm, whereas Smith-Waterman's alignment algorithm is central to aisrra91. Needleman-Wunsch alignment has been applied to identify SARS-CoV2 mutations during the SARS/CoVid2 pandemic (jop19).
Prolog sees also use in medical practice, such as in Rule-based and logic-based tools for encoding logical knowledge for scheduling vaccinations, and clinical trials (arxiv20).
In the post-genome era, a large body of work in computational molecular biology is devoted to protein folding (protein tertiary structure prediction from DNA/RNA). With significant advances using Machine Learning. The role of Inductive-logic Programming (ILP), pioneered in the 1980s and 1990s (mis, pe92) is re-evaluated among other Machine Learning approaches.
Citing again the original paper (ilp09), where the role and contribution of ILP is described as follows:
Rule learning is especially appealing because of its easy-to-understand format. A set of if-then rules describing a certain concept is highly expressive and readable.
ILP has evolved into the Prolog-based systems Aleph, and ProGolem, a Prolog re-implementation of the pioneering Golem system into the GILPS framework. While GILPS itself is highly YAP Prolog-specific, the reproducibility of readily available test data suites on the author's web site stand out as a particularly well-understood ILP problem formulated and pioneered by ilp09.
- mis
- Shapiro, E.Y. (1981). The model inference system. IJCAI'81: Proceedings of the 7th international joint conference on Artificial intelligence - Volume 2, August 19811064
- bmcbioinfo08
- Whelan, K., King, R. (2008). Using a logical model to predict the growth of yeast. BMC Bioinformatics. 9. 10.1186/1471-2105-9-97.
- computchem86
- Morffew, A.J., Todd, S.J. (1986). The use of prolog as a protein querying language. Comput. Chem., 10, 9-14.
- ieicetrans95
- Nishikawa, A., Satou, K., Furuichi, E., Kuhara, S., Ushijima, K. (1995). Data Classification Component in a Deductive Database System and Its Application to Protein Structural Analysis. IEICE Trans. Inf. Syst., 78-D, 1377-1387.
- aaai88s
- Searls, D.B. (1988). Representing Genetic Information with Formal Grammars. AAAI.
- tibc96>
- Gray, P.M., Kemp, G.J., Rawlings, C.J., Brown, N.P., Sander, C., Thornton, J.M., Orengo, C.M., Wodak, S.J., Richelle, J. (1996). Macromolecular structure information and databases.Trends in Biochemical Sciences, 21, 251-256.
- padl98
- Bansal, A.K., Bork, P. (1999). Applying Logic Programming to Derive Novel Functional Information of Genomes. PADL.
- bansai95
- Bansal, A.K. (1995). Establishing a framework for comparative analysis of genome sequences.
- beahrea92
- Baehr, A.L., Dunham, G.S., Matsuda, H., Michaels, G.S., Taylor, R., Overbeek, R.A., Rudd, K.E., Ginsburg, A., Joerg, D., Kazic, T., Hagstrom, R., Zawada, D.G., Smith, C.L., Yoshida, K. (1992). An Integrated Database to Support Research on Escherichia Coli.
- jomg85
- Rawlings, C.J., Taylor, W.R., Nyakairu, J., Fox, J., Sternberg, M.J. (1985). Reasoning about protein topology using the logic programming language PROLOG. Journal of Molecular Graphics, 3, 151-157.
- moeller01
- Moeller, S. (2001) An Environment for Consistent Sequence Annotation and its Application to Transmembrane Proteins
- nar84
- Lyall, A., Hammond, P., Brough, D.R., Glover, D.M. (1984). BIOLOG - a DNA sequence analysis system in PROLOG. Nucleic acids research, 12 1 Pt 2, 633-42.
- psob97
- Thieffry, D., Rosenblueth, D.A., Huerta, A.M., Salgado, H., Collado-Vides, J. (1997). Definite-clause grammars for the analysis of cis-regulatory regions in E. coli. Pacific Symposium on Biocomputing, 441-52.
- magpie
- Gaasterland T., Sensen C.W., (1996) Fully automated genome analysis that reflects user needs and preferences. A detailed introduction to the MAGPIE system architecture, Biochimie, Volume 78, Issue 5, 302-310
- tcm90
- Barton, G.J., Rawlings, C.J. (1990). A PROLOG approach to analysing protein structure. Tetrahedron Computer Methodology, 3, 739-756.
- jop19
- Isa Irawan, M., Mukhlash, I., Rizky, A., Ririsati Dewi, A. (2019). Application of Needleman-Wunch Algorithm to identify mutation in DNA sequences of Corona virus. Journal of Physics. Conference Series, 1218.
- pe92
- Muggleton, S., King, R., Sternberg, M.J. (1992). Protein secondary structure prediction using logic-based machine learning. Protein engineering, 5 7, 647-57.
- bmcbioinfo11
- Santos, J., Nassif, H., Page, D., Muggleton, S., Sternberg, M.J. (2011). Automated identification of protein-ligand interaction features using Inductive Logic Programming: a hexose binding case study. BMC Bioinformatics, 13, 162-162.
- arxiv20
- Norris, D.C. (2020). What Were They Thinking? Pharmacologic priors implicit in a choice of 3+3 dose-escalation design. arXiv: Methodology.
- jicslp92
- Lusk, E., Mudambi, S., Overbeek, R. (1993). Applications of the Aurora Parallel Prolog System to Computational Molecular Biology JICSLP'92 Post-Conference Joint Workshop on Distributed and Parallel Implementations of Logic Programming Systems
- aisrra91
- Taylor, R.C. (1991) Automated Insertion of Sequences into a Ribosomal RNA Alignment: An Application of Computational Linguistics in Molecular Biology, Thesis submitted to Case Western Reserve University, Argonne National Laboratory
- cabios91
- Brutlag, D.L., Galper, A., Millis, D.H. (1991). Knowledge-based simulation of DNA metabolism: prediction of enzyme action. Computer applications in the biosciences : CABIOS, 7 1, 9-19.
- ilp09
- Nassif, H., Al-Ali, H., Khuri, S., Keirouz, W., Page, D. (2009). An Inductive Logic Programming Approach to Validate Hexose Binding Biochemical Knowledge. Inductive logic programming ILP, 5989, 149-165.