# Algorithms for Subgraph Isomorphism

Having eliminated molecules using the bitstring screen the subgraph isomorphism search is then performed on the molecules that remain to determine which actually match the substructure. The subgraph isomorphism search is inherently slower; indeed it belongs to a class of problems known as NP-complete, meaning that to find a solution all algorithms require an amount of time that varies exponentially with the size of the problem (which for subgraph isomorphism corresponds to the number of nodes in the graphs). The "brute-force" approach to subgraph isomorphism involves testing every possible way to map the nodes of the substructure graph onto those of the database molecule and then checking whether the corresponding bonds also match. Unfortunately, if there are NS nodes in the substructure query and NM nodes in the database molecule then there are Nm!/(Nm — NS)! possible mappings that must be tested. It is clear that even for very small substructures and molecules with modest numbers of atoms this is an impractical approach.

Fortunately, more efficient methods are available [Barnard 1993]. Many of these approaches make use of heuristics (i.e. rules) that aim to improve the chances of finding a match early on or that efficiently reject candidates which cannot give rise to a match. One of the earliest practical methods uses a backtracking mechanism [Ray and Kirsch 1957]. In its simplest form this method first maps a node from the substructure graph onto a matching node in the molecule graph (such that, for example, the two matching nodes have the same atom type). The algorithm next attempts to map neighbouring nodes of the substructure node onto the corresponding neighbouring nodes in the molecule graph. This process continues until all the nodes have been successfully matched or until it proves impossible to find a match, at which point the algorithm returns (or backtracks) to the last successful match and attempts an alternative mapping. This backtracking procedure continues until either a match of all nodes can be found or until all possible mappings of the initial node have been attempted, at which point the algorithm terminates because the substructure cannot be present in the molecule. To improve the efficiency of this algorithm further it is possible to take into account additional characteristics of the graph nodes such as information about a node's neighbours. This may enable unproductive mappings to be rejected even earlier in the process.

The Ullmann algorithm [Ullmann 1976] is a widely used method for subgraph isomorphism that has been shown to be one of the most efficient algorithms for chemical structures [Downs et al. 1988a; Brint and Willett 1987]. It uses adjacency matrices to represent the molecular graphs of the query substructure and the molecule. In an adjacency matrix the rows and columns correspond to the atoms in the structure such that if atoms i and j are bonded then the elements (i, j) and (j ,i) of the matrix have the value "1". If they are not bonded then the value is assigned "0". It is thus an alternative formulation of the information in a connection table. The adjacency matrix for aspirin is shown in Figure 1-9.

The Ullmann algorithm operates as follows. Suppose there are NS atoms in the substructure query and NM atoms in the database molecule and that S is the adjacency matrix of the substructure and M is the adjacency matrix of the molecule. A matching matrix A is constructed with dimensions NS rows by NM

io o i o o o i o o o o o o o ioioooooooooo o i o i o o o o o o o o o o o i o i o o o o o o o o oooioioooiooo ioooioioooooo oooooioi ioooo o o o o o o i o o o o o o o o o o o o i o o o o o o o o o o i o o o o o i o o oooooooooioi i o o o o o o o o o o i o o o o o o o o o o o o i o o

Figure 1-9. The adjacency matrix for aspirin, using the atom numbering shown.

columns so that the rows represent the atoms of the substructure and the columns represent the atoms of the database molecule. The elements of the matching matrix take the value "1" if a match is possible between the corresponding pair of atoms and "0" otherwise. The aim is to find matching matrices in which each row contains just one element equal to "1" and each column contains no more than one element equal to "1". In this case a unique mapping has been found for every atom of the substructure and for those elements Aij that have a value "1", atom i in the substructure matches atom j in the molecule. More concisely, each matching matrix must satisfy A(AM)T = S (the superscript1 indicates the transposed matrix in which the rows and columns of the matrix have been interchanged; a summary of matrix operations and properties can be found in Appendix 1). By way of illustration, Figure 1-10 shows such a situation with one of the possible matches of the substructure onto the molecule and the corresponding matrices.

One way to implement the Ullmann algorithm would be to generate systematically all possible matrices A, testing them to determine whether they e\

4

7

3

1

0

1

0

0

0

0

0

1

0

1

0

0

0

0

0

1

0

1

0

0

0

0

0

1

0

1

0

1

0

0

0

1

0

1

0

0

0

0

0

1

0

0

0

0

0

1

0

0

V 111 1

 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0

Figure 1-10. Simple illustration of the operation of the Ullmann algorithm using a 5-atom substructure and a 7-atom "molecule" represented by their adjacency matrices. One of the proposed matches is shown in the bottom figure together with the corresponding matching matrix.

correspond to a viable match. However, this would be a rather time-consuming and inefficient procedure and so more effective methods have been devised. Ullmann himself suggested that neighbouring atoms could be taken into account when considering possible matches (i.e. an atom in the query is not permitted to match an atom in the molecule unless each of their neighbouring atoms also match). This step in which an initial match is improved by using local information is referred to as a refinement or relaxation step. Atom matches are initially assigned in the matching matrix based on atom type and connectivity. A depth-first search is then carried out incorporating the refinement step suggested above. Thus, the first substructure atom is initially assigned to the first molecule atom that it might match, the appropriate column in the matching matrix is set to "1" and all remaining columns in that row are set to "0". The refinement step is then applied; if it fails then the algorithm backtracks and the next potential match is tested. If it succeeds then the next row of the substructure is operated on. The process is repeated recursively until either a complete match for the substructure is discovered or until all options have been attempted. The refinement step and related techniques enable current implementations of the Ullmann algorithm to provide very efficient performance even with today's large databases.

Different approaches to the substructure search problem have been adopted by some systems that involve preprocessing the entire database into a format that then enables the searches to be performed extremely rapidly. For example, in the approach described by Christie et al. [1993] the screening stage of substructure search is implemented using inverted files. A set of bitstrings is constructed, one for each fragment included in a structural key. These bitstrings are of length equal to the number of molecules in the database, with the ith bit being set if the i th molecule contains the fragment. For molecule i to be a potential match the i th bit must be set to "1" in the bitstrings of all the fragments in the substructure. This can be determined by combining the relevant bitstrings using the logical AND operator. The i th position in A AND B is set to "1" only if this bit is set in both of the bitstrings A and B; otherwise it is set to "0". These inverted bitstrings provided a more rapid way to identify potential matches than traditional bitstrings. More complex systems have also been invented that involve encoding the atoms in each molecule in a way that reflects their environments (e.g. the type of the atom and its bonds, the number of neighbours, whether it is in a ring or in an acyclic chain). As this encoding is extended to the second and third neighbours of each atom, so the fragments become more and more discriminating. In the Hierarchical Tree Substructure Search (HTSS) system [Nagy et al. 1988] the entire database is stored in a tree-like structure based on these encoded atoms. A substructure search on the entire database is then reduced to a set of simple tree searches. Other database systems use a similar approach, including the S4 system [Hicks and Jochum 1990] developed by the Beilstein Institute and its collaborators. One potential drawback of some of these systems is that it can be difficult to update the database with new structures. This may not be a problem for databases that are released on a periodic basis, but can be difficult for systems that need to be continually updated.