Inductive Logic Programming

We're going to run a well-studied Prolog standard workload for learning hexose bindings to proteins from publicly available information. The Prolog code we're going to employ here is the Aleph Inductive-Logic Programming (ILP) package; but first, we must port the Aleph code base to ISO Prolog.

Predicting hexose binding sites in proteins

Since we're not doing original research, but rather intend to demonstrate a port of the Aleph ILP package to ISO Prolog running on Quantum Prolog, we cite the problem's definition in full from the original paper (ilp09):

Hexoses are 6-carbon simple sugar molecules that play a key role in different biochemical pathways, including cellular energy release, signaling, carbohydrate synthesis, and the regulation of gene expression. Hexose binding proteins belong to diverse functional families that lack significant sequence or, often, structural similarity. Despite this fact, these proteins show high specificity to their hexose ligands. The few amino acids (also called residues) present at the binding site play a large role in determining the binding site's distinctive topology and biochemical properties and hence the ligand type and the protein's functionality.
This work presents an ILP classifier that extracts rules [...] without prior biochemical knowledge. It classifies binding sites based on the extracted biochemical rules, clearly specifying the rules used to discriminate each instance.

For demonstration purposes, we're running the ILP problem in its extended form including facts about amino-acid geometries, using biochemical knowledge from publicly available sources as formulated in (bmcbioinfo11):

The background knowledge is the set of features, facts and rules known a priori. This is given to the ILP system as a basis for learning and constructing the classification rules. The piece of background knowledge central to our task is the binding site representation. [The table below] is an excerpt of the background knowledge for protein 1BDG. The center_coords predicate specifies the binding-site center coordinates, which is the pyranose ring centroid of the bound glucose in this structure. The has_aminoacid predicate specifies each amino acid present within the protein binding site, listing its unique identifier and name. The has_atom predicate details the residue atoms, specifying the protein database (PDB) atom name and its coordinates. By extracting the coordinates of the center and the various atoms, we compute their respective distances. We set a tolerance of 0.5 Angstrom on distances between atoms, a sensible error margin in a hexose binding site.

Predicates for binding site representation

center coords/2, has_atom/4, dist/4,
aminoacid/3, atom_to_center dist/4,
atom_to_atom_dist/6, diff_aminoacid/2

Excerpt of the background knowledge for protein 1BDG

center_coords(p1BDG, p(27.0,22.1,64.9)).
has_aminoacid(p1BDG, a64, phe).
has_aminoacid(p1BDG, a85, leu).
has_aminoacid(p1BDG, a86, gly).
has_aminoacid(p1BDG, a87, gly).
has_atom(p1BDG, a64, 'CD2', p(22.4,13.3,65.5)).
has_atom(p1BDG, a64, 'CE2', p(21.6,14.0,66.4)).
has_atom(p1BDG, a85, 'C', p(24.6,25.9,57.4)).
has_atom(p1BDG, a85, 'O', p(24.6,24.8,57.8)).
has_atom(p1BDG, a86, 'N', p(24.8,27.0,58.3)).
has_atom(p1BDG, a86, 'CA', p(24.9,26.8,59.7)).

Result theory

From these facts, the ILP process derives rules for protein-hexose bindings such as the following:

bind(A) :-
	has_aminoacid(A, B, gly),
	atom_to_center_dist(B, CA, 7.2, 0.5),
	has_aminoacid(A, C, phe),
	atom_to_center_dist(C, C, 7.4, 0.5),
	atom_to_center_dist(C, O, 7.2, 0.5).

bind(A) :-
	has_aminoacid(A, B, tyr),
	atom_to_center_dist(B, CZ, 9.5, 0.5),
	has_aminoacid(A, C, trp),
	atom_to_center_dist(C, O, 7.6, 0.5),
	has_aminoacid(A, D, gln).

bind(A) :-
	has_aminoacid(A, B, glu),
	atom_to_center_dist(B, N, 9.5, 0.5),
	atom_to_center_dist(B, CG, 6.8, 0.5),
	has_aminoacid(A, C, his).

bind(A) :-
	has_aminoacid(A, B, asn),
	atom_to_center_dist(B, C, 8.1, 0.5),
	atom_to_center_dist(B, O, 8.3, 0.5),
	atom_to_center_dist(B, CB, 7.2, 0.5),
	atom_to_center_dist(B, CG, 5.7, 0.5).

bind(A) :-
	has_aminoacid(A, B, trp),
	atom_to_center_dist(B, CA, 9.0, 0.5),
	atom_to_center_dist(B, O, 9.6, 0.5),
	has_aminoacid(A, C, ser).

bind(A) :-
	has_aminoacid(A, B, arg),
	atom_to_center_dist(B, C, 8.1, 0.5),
	atom_to_center_dist(B, CD, 9.1, 0.5).

bind(A) :-
	has_aminoacid(A, B, gln),
	atom_to_center_dist(B, CG, 9.0, 0.5),
	has_aminoacid(A, C, asp),
	atom_to_center_dist(C, CG, 5.3, 0.5).

Running the Aleph ISO Prolog port on Quantum Prolog

For running the ISO Prolog Aleph port on the hexose_aminoacid ILP problem, we're invoking a Prolog program that includes (via the ensure_loaded/1 directive) the Aleph ISO Port source starting from a current working directory containing the hexose.f and hexose.n positive and negative examples, respectively, along with the hexose.pl background theory that, in turn, includes further utility clauses. Then we're executing Aleph goals using an initialization/1 directive such as the following:

:- ensure_loaded('../aleph-iso.pl').

:- initialization((
   read_all('hexose'),
   induce,
   write_rules('write_rules.out')).

To give an idea about performance relative to popular Prolog implementations, we're comparing against a version of SWI Prolog. Note that up-to-date versions of SWI Prolog can no longer run Aleph v5 as-is, since newer SWI Prolog versions include a declaration for the false/0 predicate as non-dynamic; which is in accordance with ISO/IEC 13211-1 (2012) but unfortunate, since Aleph uses predicates of the form

false :- ...

to encode negative examples. For comparison, we must thus run the last version of SWI Prolog that doesn't have this nor other issues, which is the rather old SWI Prolog version 5.8.3 (the slightly newer SWI Prolog version 5.10.x will abort an attempt to run Aleph with an unrelated assertion error).

That older SWI Prolog version, though, lacks secondary term indexing, so for fairness we're switching this feature off in Quantum Prolog as well for obtaining the figures stated below.

Moreover, we're stopping the loop performed by induce/0 after the first iteration; that is, after AtomsLeft as of

'$aleph_global'(atoms_left,atoms_left(pos,AtomsLeft))

has reached

[2-12,14-19,21-26,28-37,40-49,51-52,57-80].

Under these conditions, we can report running times for Aleph on Quantum Prolog roughly comparable to those of SWI Prolog, at about 1,5 to 1.75min on mainstream notebook hardware.

Note both Quantum Prolog and SWI Prolog exhibit severe performance degradation due to not using secondary term indexing, when running times are dominated by accesses to indexable has_aminoaid/4 calls. On the other hand, YAP Prolog, the Prolog system on which Aleph was originally developed, with its known performance characteristic on suitable large Prolog databases by employing heuristic term indexing, is expected to perform way better than either Quantum Prolog or SWI Prolog. However, at the time of this writing, no current YAP Prolog on the test system was available or could be successfully compiled for comparison.

Of course, re-enabling secondary term indexing, Quantum Prolog reaches much higher execution times (up to four times faster than without), just as up-to-date versions of SWI Prolog do.

This isn't surprising since execution times of Aleph on the Hexose/Aminoacid problem is pretty much dominated by verifying examples against hypotheses; for a full run of Aleph, the has_aminoacid/3 predicate is called more than 100.000.000 times! We're going to tackle exactly that problem in part 2.

Bibliography

Prolog has a rich history of applications in bioinformatics, starting with classical discrete approaches towards gene sequence representation (aaai88s), annotation (moeller01), and analysis (nar84, magpie), protein motif search (jicslp92), classification (ieicetrans95), genome comparison (bansai95, padl98), protein structure databases (tibc96, computchem86), rule-based metabolic simulations (bmcbioinfo08, cabios91), and grammar-based models (psob97, aisrra91), the latter with connections to recent probabilistic grammar models (eg. https://www.biorxiv.org/content/10.1101/2020.12.08.416503v1).

Prolog also lends itself well to classic dynamic programming approaches for comparison, alignment, and mutation of nucleotide sequences useful for implementating Needleman-Wunsch's, Hirschberg's, or Smith-Waterman's algorithms.

For example, tcm90 contains an extended implementation of the classic Kabsch-Sander secondary structure definition algorithm, whereas Smith-Waterman's alignment algorithm is central to aisrra91. Needleman-Wunsch alignment has been applied to identify SARS-CoV2 mutations during the SARS/CoVid2 pandemic (jop19).

Prolog sees also use in medical practice, such as in Rule-based and logic-based tools for encoding logical knowledge for scheduling vaccinations, and clinical trials (arxiv20).

In the post-genome era, a large body of work in computational molecular biology is devoted to protein folding (protein tertiary structure prediction from DNA/RNA). With significant advances using Machine Learning. The role of Inductive-logic Programming (ILP), pioneered in the 1980s and 1990s (mis, pe92) is re-evaluated among other Machine Learning approaches.

Citing again the original paper (ilp09), where the role and contribution of ILP is described as follows:

Rule learning is especially appealing because of its easy-to-understand format. A set of if-then rules describing a certain concept is highly expressive and readable.

ILP has evolved into the Prolog-based systems Aleph, and ProGolem, a Prolog re-implementation of the pioneering Golem system into the GILPS framework. While GILPS itself is highly YAP Prolog-specific, the reproducibility of readily available test data suites on the author's web site stand out as a particularly well-understood ILP problem formulated and pioneered by ilp09.

mis: Shapiro, E.Y. (1981). The model inference system. IJCAI'81: Proceedings of the 7th international joint conference on Artificial intelligence - Volume 2, August 19811064
bmcbioinfo08: Whelan, K., King, R. (2008). Using a logical model to predict the growth of yeast. BMC Bioinformatics. 9. 10.1186/1471-2105-9-97.
computchem86: Morffew, A.J., Todd, S.J. (1986). The use of prolog as a protein querying language. Comput. Chem., 10, 9-14.
ieicetrans95: Nishikawa, A., Satou, K., Furuichi, E., Kuhara, S., Ushijima, K. (1995). Data Classification Component in a Deductive Database System and Its Application to Protein Structural Analysis. IEICE Trans. Inf. Syst., 78-D, 1377-1387.
aaai88s: Searls, D.B. (1988). Representing Genetic Information with Formal Grammars. AAAI.
tibc96>: Gray, P.M., Kemp, G.J., Rawlings, C.J., Brown, N.P., Sander, C., Thornton, J.M., Orengo, C.M., Wodak, S.J., Richelle, J. (1996). Macromolecular structure information and databases.Trends in Biochemical Sciences, 21, 251-256.
padl98: Bansal, A.K., Bork, P. (1999). Applying Logic Programming to Derive Novel Functional Information of Genomes. PADL.
bansai95: Bansal, A.K. (1995). Establishing a framework for comparative analysis of genome sequences.
beahrea92: Baehr, A.L., Dunham, G.S., Matsuda, H., Michaels, G.S., Taylor, R., Overbeek, R.A., Rudd, K.E., Ginsburg, A., Joerg, D., Kazic, T., Hagstrom, R., Zawada, D.G., Smith, C.L., Yoshida, K. (1992). An Integrated Database to Support Research on Escherichia Coli.
jomg85: Rawlings, C.J., Taylor, W.R., Nyakairu, J., Fox, J., Sternberg, M.J. (1985). Reasoning about protein topology using the logic programming language PROLOG. Journal of Molecular Graphics, 3, 151-157.
moeller01: Moeller, S. (2001) An Environment for Consistent Sequence Annotation and its Application to Transmembrane Proteins
nar84: Lyall, A., Hammond, P., Brough, D.R., Glover, D.M. (1984). BIOLOG - a DNA sequence analysis system in PROLOG. Nucleic acids research, 12 1 Pt 2, 633-42.
psob97: Thieffry, D., Rosenblueth, D.A., Huerta, A.M., Salgado, H., Collado-Vides, J. (1997). Definite-clause grammars for the analysis of cis-regulatory regions in E. coli. Pacific Symposium on Biocomputing, 441-52.
magpie: Gaasterland T., Sensen C.W., (1996) Fully automated genome analysis that reflects user needs and preferences. A detailed introduction to the MAGPIE system architecture, Biochimie, Volume 78, Issue 5, 302-310
tcm90: Barton, G.J., Rawlings, C.J. (1990). A PROLOG approach to analysing protein structure. Tetrahedron Computer Methodology, 3, 739-756.
jop19: Isa Irawan, M., Mukhlash, I., Rizky, A., Ririsati Dewi, A. (2019). Application of Needleman-Wunch Algorithm to identify mutation in DNA sequences of Corona virus. Journal of Physics. Conference Series, 1218.
pe92: Muggleton, S., King, R., Sternberg, M.J. (1992). Protein secondary structure prediction using logic-based machine learning. Protein engineering, 5 7, 647-57.
bmcbioinfo11: Santos, J., Nassif, H., Page, D., Muggleton, S., Sternberg, M.J. (2011). Automated identification of protein-ligand interaction features using Inductive Logic Programming: a hexose binding case study. BMC Bioinformatics, 13, 162-162.
arxiv20: Norris, D.C. (2020). What Were They Thinking? Pharmacologic priors implicit in a choice of 3+3 dose-escalation design. arXiv: Methodology.
jicslp92: Lusk, E., Mudambi, S., Overbeek, R. (1993). Applications of the Aurora Parallel Prolog System to Computational Molecular Biology JICSLP'92 Post-Conference Joint Workshop on Distributed and Parallel Implementations of Logic Programming Systems
aisrra91: Taylor, R.C. (1991) Automated Insertion of Sequences into a Ribosomal RNA Alignment: An Application of Computational Linguistics in Molecular Biology, Thesis submitted to Case Western Reserve University, Argonne National Laboratory
cabios91: Brutlag, D.L., Galper, A., Millis, D.H. (1991). Knowledge-based simulation of DNA metabolism: prediction of enzyme action. Computer applications in the biosciences : CABIOS, 7 1, 9-19.
ilp09: Nassif, H., Al-Ali, H., Khuri, S., Keirouz, W., Page, D. (2009). An Inductive Logic Programming Approach to Validate Hexose Binding Biochemical Knowledge. Inductive logic programming ILP, 5989, 149-165.