Druglikeness And Compound Filters

The advent of combinatorial chemistry and HTS enabled much larger numbers of compounds to be synthesised and tested, but it did not lead to the expected improvements in the numbers of lead molecules being identified. This observation sparked much interest in the concept of "drug-likeness" [Clark and Pickett 2000; Walters and Murcko 2002] and attempts to determine which features of drug molecules confer their biological activity and distinguish them from general "organic" compounds.

Among the simplest types of methods that can be used to assess "drug-likeness" are substructure filters. As discussed in Chapter 7, a compound collection may include molecules that contain reactive groups known to interact in a nonspecific manner with biological targets, molecules that give "false positive" results due to interference with certain types of biological assay, or molecules that are simply inappropriate starting points for a drug discovery programme [Roche et al. 2002]. Many of these features can be defined as substructures or as substructure counts that are used to filter both real and virtual data sets. One may wish to apply such filters to the output from an HTS run, in order to eliminate known problem molecules from further consideration (of course, one would prefer to eliminate such compounds prior to the screen). They are also extremely useful when designing virtual libraries and when selecting compounds to purchase from external suppliers. One should, however, always remember that such filters tend to be rather general in nature and that for any specific target it may be necessary to modify or extend them accordingly.

Other approaches to the question of "drug-likeness" were derived by analysing the values of relatively simple properties such as molecular weight, the number of rotatable bonds and the calculated logP in known drug molecules. Considerations such as these led to the formulation of the "rule of five" [Lipinski et al. 1997] which constitutes a set of simple filters that suggest whether or not a molecule is likely to be poorly absorbed. The "rule of five" states that poor absorption or permeation is more likely when:

1. The molecular weight is greater than 500

2. The log P is greater than five

3. There are more than five hydrogen bond donors (defined as the sum of OH and NH groups)

4. There are more than ten hydrogen bond acceptors (defined as the number of N and O atoms)

Excluded from this definition are those compounds that are substrates for biological transporters. An obvious attraction of this model is that it is extremely simple to implement and very fast to compute; many implementations report the number of rules that are violated, flagging or rejecting molecules that fail two or more of the criteria. A more extensive evaluation of property distributions for a set of drugs and non-drugs has identified the most likely values for "drug-like" molecules [Oprea 2000]. For example, 70% of the "drug-like" compounds had between zero and two hydrogen bond donors, between two and nine hydrogen bond acceptors, between two and eight rotatable bonds and between one and four rings. The "rule of five" was derived following a statistical analysis of known drugs; a similar analysis has since been carried out on agrochemicals with modified sets of rules being derived that relate to the properties of herbicides and insecticides [Tice 2001].
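
The violation-counting approach described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: it assumes the four properties have already been computed by some other tool, and the `Properties` container, the function names and the two example molecules (with invented property values) are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Properties:
    """Precomputed properties for one molecule (hypothetical container)."""
    mol_weight: float  # molecular weight
    log_p: float       # calculated log P
    h_donors: int      # sum of OH and NH groups
    h_acceptors: int   # number of N and O atoms


def rule_of_five_violations(p: Properties) -> int:
    """Count how many of the four rule-of-five criteria are violated."""
    checks = (
        p.mol_weight > 500,
        p.log_p > 5,
        p.h_donors > 5,
        p.h_acceptors > 10,
    )
    return sum(checks)


def passes_filter(p: Properties, max_violations: int = 1) -> bool:
    """By default, reject molecules that fail two or more criteria."""
    return rule_of_five_violations(p) <= max_violations


# Invented property values, for illustration only.
small = Properties(mol_weight=180.2, log_p=1.2, h_donors=1, h_acceptors=4)
large = Properties(mol_weight=1200.0, log_p=3.0, h_donors=5, h_acceptors=23)

print(rule_of_five_violations(small), passes_filter(small))  # 0 True
print(rule_of_five_violations(large), passes_filter(large))  # 2 False
```

Each test is a strict "greater than", so a molecule sitting exactly on a threshold (for example, a molecular weight of exactly 500) is not counted as a violation. The `max_violations` parameter makes it easy to switch between flagging and rejecting molecules.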

Others have attempted to derive more sophisticated computational models of "drug-likeness" using techniques such as neural networks or decision trees. Typically these models start with a training set of drugs and non-drugs, for which a variety of descriptors are calculated. The training set and its corresponding descriptors are then used to develop the model, which is evaluated using a test set. For example, Sadowski and Kubinyi constructed a feed-forward neural network with 92 input nodes, 5 hidden nodes and 1 output node to predict "drug-likeness" [Sadowski and Kubinyi 1998]. The data set comprised compounds from the WDI (the drugs) and a set of structures extracted from the Available Chemicals Directory [ACD] (assumed to have no biological activity, and therefore to be non-drugs). Each molecule was characterised using a set of atom types originally devised by Ghose and Crippen for the purposes of predicting logP [Ghose and Crippen 1986]. The counts of each of the 92 atom types for the molecules provided the input for the neural network. These descriptors act as a form of extended molecular formula and were found to perform better than whole-molecule descriptors such as the log P itself or detailed descriptors such as a structural key or hashed fingerprint. The network was able to correctly assign 83% of the molecules from the ACD to the non-drugs class and 77% of the WDI molecules to the drugs class. Other groups have obtained comparable results [Ajay et al. 1998; Frimurer et al. 2000].
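
To make the 92-5-1 architecture concrete, the following sketch shows the forward pass of such a network in plain Python. Only the layer sizes come from the text. The sigmoid activation, the random placeholder weights, the function names and the example atom-type counts are assumptions made for illustration, and no training is performed, so the output score has no chemical meaning.

```python
import math
import random

N_INPUT, N_HIDDEN = 92, 5  # layer sizes quoted in the text; one output node


def sigmoid(x: float) -> float:
    # Assumed activation function; the text does not specify one.
    return 1.0 / (1.0 + math.exp(-x))


def make_network(seed: int = 0):
    """Build random placeholder weights; a real model would learn these."""
    rng = random.Random(seed)
    # Each node stores its input weights followed by one bias term.
    hidden = [[rng.uniform(-0.1, 0.1) for _ in range(N_INPUT + 1)]
              for _ in range(N_HIDDEN)]
    output = [rng.uniform(-0.1, 0.1) for _ in range(N_HIDDEN + 1)]
    return hidden, output


def drug_likeness(network, atom_type_counts):
    """Forward pass: 92 atom-type counts -> 5 hidden nodes -> 1 score in (0, 1)."""
    hidden, output = network
    if len(atom_type_counts) != N_INPUT:
        raise ValueError(f"expected {N_INPUT} atom-type counts")
    # zip stops at the shorter sequence, so the trailing bias is added separately.
    activations = [
        sigmoid(w[-1] + sum(wi * xi for wi, xi in zip(w, atom_type_counts)))
        for w in hidden
    ]
    return sigmoid(output[-1] + sum(wi * hi for wi, hi in zip(output, activations)))


network = make_network()
counts = [0] * N_INPUT
counts[0], counts[7], counts[40] = 6, 2, 1  # invented atom-type counts

score = drug_likeness(network, counts)
n_params = sum(len(w) for w in network[0]) + len(network[1])
print(n_params, round(score, 3))
```

With one bias per node, the network has 5 × 93 + 6 = 471 adjustable parameters, which is small compared with the size of the training sets typically drawn from databases such as the WDI and ACD.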

Wagener and van Geerestein [2000] used decision trees to tackle this problem. The same databases were used to identify drug and non-drug molecules and the same set of Ghose-Crippen atom type counts were used to characterise each molecule. The C5.0 algorithm [Quinlan 1993; C5.0] was employed. Its performance was comparable to that of the neural network, correctly classifying 82.6% of an independent validation set. A second model designed to reduce the false negative rate (i.e. the misclassification of drug molecules) was able to correctly classify 91.9% of the drugs but at the expense of an increased false positive rate (34.3% of non-drugs misclassified). Some of the rules in the decision tree were of particular interest; these suggested that merely testing for the presence of some simple functional groups such as hydroxyl, tertiary or secondary amino, carboxyl, phenol or enol groups would distinguish a large proportion of the drug molecules. Non-drug molecules were characterised by their aromatic nature and a low functional group count (apart from halogen atoms).

Gillet et al. [1998] used a genetic algorithm to build a scoring scheme for "drug-likeness". The scoring scheme was based on the following physicochemical properties: molecular weight; a shape index (the kappa-alpha 2 index [Hall and Kier 1991]) and numbers of the following substructural features: hydrogen bond donors; hydrogen bond acceptors; rotatable bonds; and aromatic rings. Each property was divided into a series of bins that represent ranges of values of the property, such as molecular weight ranges or counts of the number of times a particular substructure or feature occurs in a molecule. A weight is associated with each bin and a molecule is scored by determining its property values and then summing the appropriate weights across the different properties. The genetic algorithm was used to identify an optimum set of weights such that maximum discrimination between the two classes was achieved, with molecules in one class scoring highly while molecules in the other class have low scores. In the case of "drug-likeness" the genetic algorithm was trained using a sample of the SPRESI database [SPRESI] to represent "non-drug-like" compounds and a sample of the WDI to represent "drug-like" compounds. The resulting model was surprisingly effective at distinguishing between the two classes of compounds, as can be seen in Figure 8-2. It has subsequently been used to filter compounds prior to high-throughput screening [Hann et al. 1999].
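
The scoring step of such a scheme (look up the bin for each property, then sum the associated weights) can be sketched as follows. The bin boundaries and weights shown here are invented purely for illustration; in the published work the weights were optimised by the genetic algorithm, which is not reproduced here, and only three of the properties are included to keep the example short.

```python
from bisect import bisect_right

# Hypothetical scheme: property -> (bin edges, one weight per bin).
# With n edges there are n + 1 bins: below the first edge, between
# consecutive edges, and at or above the last edge.
SCHEME = {
    "mol_weight": ([200, 350, 500], [-1.0, 1.0, 0.5, -1.0]),
    "h_donors": ([1, 3, 6], [-0.5, 1.0, 0.5, -1.0]),
    "aromatic_rings": ([1, 3], [-0.5, 1.0, -0.5]),
}


def score(props: dict, scheme: dict = SCHEME) -> float:
    """Sum the weight of the bin that each property value falls into."""
    total = 0.0
    for name, (edges, weights) in scheme.items():
        total += weights[bisect_right(edges, props[name])]
    return total


# Invented property values for two hypothetical molecules.
mol_a = {"mol_weight": 320.0, "h_donors": 2, "aromatic_rings": 1}
mol_b = {"mol_weight": 650.0, "h_donors": 0, "aromatic_rings": 4}

print(score(mol_a))  # 3.0
print(score(mol_b))  # -2.0
```

Because the score is a simple sum of table look-ups, it is very fast to evaluate, and the optimisation problem reduces to choosing one weight per bin so that the two classes are separated as widely as possible.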

Figure 8-2. Output from the genetic algorithm scoring scheme used to distinguish drugs from non-drugs showing the degree of discrimination that can be achieved. The WDI contains known drug-like molecules whereas SPRESI is a database containing general "organic" molecules, assumed to have no biological activity.

One interesting development has been the introduction of "lead-likeness" as a concept distinct from "drug-likeness". The underlying premise is that during the optimisation phase of a lead molecule to give the final drug there is an increase in the molecular "complexity", as measured by properties such as molecular weight, the numbers of hydrogen bond donors and acceptors and ClogP. It has therefore been argued [Teague et al. 1999; Hann et al. 2001] that one should use "lead-like" criteria when performing virtual screening at that stage rather than the "drug-like" criteria typified by the "rule of five". These arguments are supported by analyses of case-histories of drug discovery together with theoretical models of molecular complexity. Interest in lead-likeness led in turn to fragment-based approaches to drug discovery, wherein less complex molecules are screened to provide starting points for subsequent optimisation. The small size of the molecules used in such approaches means that they need to be screened at higher concentrations or using biophysical techniques such as X-ray crystallography or NMR. In addition to the practical aspects of fragment-based drug discovery there have also been associated theoretical developments, such as the "rule of three" [Congreve et al. 2003] (a fragment equivalent of the rule of five) and the concept of ligand efficiency [Hopkins et al. 2004] (a method for prioritising the output from screening experiments in order to identify the most promising initial candidates).
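
As an illustration of how ligand efficiency can be used to rank screening hits, the sketch below uses the commonly quoted definition: the binding free energy divided by the number of non-hydrogen atoms, with the free energy estimated from a dissociation constant as ΔG = RT ln Kd. This formula is not spelled out in the text above and is stated here as an assumption; the function names, the default temperature and the two example hits are likewise hypothetical.

```python
import math

GAS_CONSTANT = 1.987204e-3  # kcal / (mol K)


def binding_free_energy(kd_molar: float, temperature: float = 298.15) -> float:
    """Estimate the binding free energy in kcal/mol as RT ln(Kd)."""
    if kd_molar <= 0:
        raise ValueError("Kd must be positive")
    return GAS_CONSTANT * temperature * math.log(kd_molar)


def ligand_efficiency(kd_molar: float, heavy_atoms: int,
                      temperature: float = 298.15) -> float:
    """Free energy of binding per non-hydrogen atom (sign flipped to be positive)."""
    if heavy_atoms <= 0:
        raise ValueError("heavy atom count must be positive")
    return -binding_free_energy(kd_molar, temperature) / heavy_atoms


# Invented values: a weakly binding fragment and a more potent, larger hit.
fragment_le = ligand_efficiency(kd_molar=1e-4, heavy_atoms=12)
hit_le = ligand_efficiency(kd_molar=1e-6, heavy_atoms=35)

print(round(fragment_le, 2), round(hit_le, 2))
```

In this invented example the fragment binds a hundred times more weakly than the larger hit, yet it has the higher ligand efficiency because each of its atoms contributes more to binding. This is the sense in which the measure helps to identify promising starting points for optimisation.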
