QSAR (quantitative structure-activity relationship) is pronounced "q-sar" or "quasar". The A sometimes also stands for affinity (reactivity), and the closely related term quantitative structure-property relationship (QSPR) is used when a property rather than an activity is modeled. Lying at the intersection of cheminformatics, molecular modeling, and machine learning, the field describes the quantitative correlation between chemical structure and chemical or biological activity. This also allows the prediction of the so-called "drug efficacy" of a structurally related compound.
SAR and SAR paradox
The basic assumption of all molecule-based hypotheses is that similar molecules have similar activities. This principle is also called the structure-activity relationship (SAR). The SAR paradox refers to the fact that not all similar molecules have similar activities. The underlying problem is therefore how to define a small difference on a molecular level, since each kind of activity, e.g. reaction ability, biotransformation ability, solubility, target activity, and so on, might depend on a different kind of difference. A good example is given in the bioisosterism review of Patani and LaVoie.
[G. A. Patani, E. J. LaVoie, Bioisosterism: A Rational Approach in Drug Design, Chem. Rev., 1996, 96, 3147-3176.]
From a computer science standpoint, the no-free-lunch theorem implies that no general algorithm can exist that defines, e.g., a small difference in such a way that it always yields the best hypothesis.
In general, one is more interested in finding strong trends. Hypotheses are usually derived from a finite amount of chemical data. The induction principle should therefore be respected in order to avoid overfitted hypotheses and overfitted, useless interpretations of structural or molecular data.
Applications
Chemical
One of the first, and by now rather historical, QSAR applications was the prediction of boiling points.
[D. Bonchev, D. H. Rouvray, Chemical Graph Theory: Introduction and Fundamentals, Gordon and Breach Science Publishers, 1990. ISBN 0-85626-454-7]
It is well known, for instance, that within a particular family of chemical compounds, especially in organic chemistry, there are strong correlations between structure and observed properties. A simple example is the relationship between the number of carbon atoms in alkanes and their boiling points. The boiling point rises clearly with the number of carbon atoms, and this trend serves as a means of predicting the boiling points of higher alkanes.
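The alkane trend can be illustrated with a small least-squares fit. The following is a minimal sketch, not a validated model: the boiling points are approximate literature values for the straight-chain alkanes butane to octane, and the function names `fit_line` and `predict_boiling_point` are chosen here for illustration only.

```python
# Approximate normal boiling points (degrees Celsius) of straight-chain
# alkanes, keyed by the number of carbon atoms (butane to octane).
BOILING_POINTS_C = {4: -0.5, 5: 36.1, 6: 68.7, 7: 98.4, 8: 125.6}


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


carbons = sorted(BOILING_POINTS_C)
slope, intercept = fit_line(carbons, [BOILING_POINTS_C[c] for c in carbons])


def predict_boiling_point(n_carbons):
    """Extrapolate the fitted linear trend to another chain length."""
    return slope * n_carbons + intercept


print(f"slope: {slope:.2f} degrees C per carbon atom")
print(f"predicted boiling point of nonane: {predict_boiling_point(9):.1f} degrees C")
```

The straight line overestimates the boiling point of nonane (roughly 151 degrees Celsius) by several degrees, because the increase per carbon atom slowly flattens. The trend is strong, but a simple hypothesis extrapolated beyond its data should be treated with caution.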
Biological
The biological activity of molecules is usually measured in assays to establish the level of inhibition of particular signal transduction or metabolic pathways. Chemicals can also be biologically active by being toxic.
Drug discovery often involves the use of QSAR to identify chemical structures that could have good inhibitory effects on specific targets and low toxicity (non-specific activity). Of special interest is the prediction of logP, which is an important measure used in identifying "drug-likeness" according to Lipinski's Rule of Five.
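As an illustration of how a predicted logP enters a drug-likeness check, the sketch below applies the commonly quoted form of Lipinski's Rule of Five (molecular weight at most 500, logP at most 5, at most 5 hydrogen-bond donors, at most 10 hydrogen-bond acceptors, with one violation tolerated). The `Descriptors` class, the function names, and the two example compounds with their descriptor values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Descriptors:
    """Precomputed descriptors of one compound (all values supplied by the caller)."""
    molecular_weight: float  # in daltons
    log_p: float             # octanol-water partition coefficient
    h_bond_donors: int
    h_bond_acceptors: int


def rule_of_five_violations(d: Descriptors) -> int:
    """Count how many of the four Rule of Five criteria are violated."""
    checks = [
        d.molecular_weight > 500,
        d.log_p > 5,
        d.h_bond_donors > 5,
        d.h_bond_acceptors > 10,
    ]
    return sum(checks)


def is_drug_like(d: Descriptors, max_violations: int = 1) -> bool:
    """A compound passes if it violates no more than max_violations criteria."""
    return rule_of_five_violations(d) <= max_violations


# Hypothetical compounds with made-up descriptor values.
small_polar = Descriptors(molecular_weight=310.4, log_p=2.1,
                          h_bond_donors=2, h_bond_acceptors=5)
large_greasy = Descriptors(molecular_weight=712.9, log_p=6.8,
                           h_bond_donors=1, h_bond_acceptors=12)

print(is_drug_like(small_polar))
print(is_drug_like(large_greasy))
```

In a QSAR setting the `log_p` value would itself come from a prediction model, so any error in that prediction propagates into the drug-likeness decision.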
While many quantitative structure-activity relationship analyses involve the interactions of a family of molecules with an enzyme or receptor binding site, QSAR can also be used to study the interactions between the structural domains of proteins. Protein-protein interactions can be analyzed quantitatively for structural variations resulting from site-directed mutagenesis.[E. K. Freyhult, K. Andersson, M. G. Gustafsson, Structural modeling extends QSAR analysis of antibody-lysozyme interactions to 3D-QSAR, Biophys. J., 2003, 84, 2264-2272. PMID 12668435] In that study, a wild-type antibody specific for lysozyme and 17 single and double mutants of the antibody were investigated, and quantitative models for the affinity of the antibody-antigen interaction were developed.
Prediction methods
Part of the task of a machine learning method is to reduce the risk of running into a SAR paradox, especially given that only a finite amount of data is available (see also MVUE). In general, all QSAR problems can be divided into a coding[R. Todeschini, V. Consonni, Handbook of Molecular Descriptors, Wiley-VCH, 2000. ISBN 3-52-29913-0] part and a learning[R. O. Duda, P. E. Hart, D. G. Stork, Pattern Classification, John Wiley & Sons, 2001. ISBN 0-471-05669-3] part.
Data mining
For the coding, a relatively large number of features is usually calculated, and these features may lack structural interpretability. This gives rise to a feature selection problem, which is addressed either in combination with the subsequently applied learning method or as a preprocessing step.
A typical data-mining-based prediction uses, e.g., support vector machines, decision trees, or neural networks to induce a predictive model.
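Feature selection as a preprocessing step can be sketched with a simple correlation filter: each descriptor is ranked by the absolute Pearson correlation with the measured activity, and only the top-ranked ones are passed to the learning method. This is just one of many possible selection schemes and is not claimed to be the one used in any particular QSAR study. The descriptor names and all values below are made up for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 for a constant column."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def select_features(columns, activity, k):
    """Return the names of the k descriptors most correlated with the activity.

    columns maps a descriptor name to its list of values, one per compound.
    """
    ranked = sorted(
        columns,
        key=lambda name: abs(pearson(columns[name], activity)),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical descriptor table for five compounds (values are made up).
descriptors = {
    "d1": [1.0, 2.0, 3.0, 4.0, 5.0],   # rises with the activity
    "d2": [5.0, 1.0, 4.0, 2.0, 3.0],   # weakly related
    "d3": [2.0, 2.0, 2.0, 2.0, 2.0],   # constant, carries no information
    "d4": [10.0, 8.0, 6.0, 4.0, 2.5],  # falls as the activity rises
}
activity = [1.1, 2.0, 2.9, 4.2, 5.0]

print(select_features(descriptors, activity, k=2))
```

A filter of this kind looks at each descriptor in isolation, so it can miss descriptors that are informative only in combination. Selection methods that are coupled to the learning method address this at higher computational cost.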
3D-QSAR
3D-QSAR refers to the application of force field calculations that require three-dimensional structures, e.g. based on protein crystallography or molecule superposition. It uses computed potentials, e.g. the Lennard-Jones potential, rather than experimental constants, and is concerned with the overall molecule rather than a single substituent. It examines the steric fields (shape of the molecule) and the electrostatic fields based on the applied energy function.
[A. Leach, Molecular Modelling: Principles and Applications, Prentice Hall, 2001. ISBN 0582382106]
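How a steric field turns a three-dimensional structure into descriptors can be sketched with the 12-6 Lennard-Jones potential, V(r) = 4ε[(σ/r)^12 - (σ/r)^6], evaluated between a probe and every atom at each point of a grid. The code below is a minimal sketch under simplifying assumptions: one shared ε and σ for all atoms, a hypothetical energy cap, and made-up coordinates. Real force fields use atom-type-specific parameters.

```python
import math


def lennard_jones(r, epsilon, sigma):
    """12-6 Lennard-Jones energy at distance r (infinite at r == 0)."""
    if r == 0:
        return math.inf
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6 * sr6 - sr6)


def steric_field(atoms, grid_points, epsilon=0.1, sigma=3.4, energy_cap=30.0):
    """Sum the probe-atom energies at each grid point, capped at energy_cap.

    The default parameter values are illustrative assumptions, not taken
    from any specific force field.
    """
    field = []
    for point in grid_points:
        energy = sum(
            lennard_jones(math.dist(point, atom), epsilon, sigma)
            for atom in atoms
        )
        field.append(min(energy, energy_cap))
    return field


# Hypothetical two-atom "molecule" and a short row of grid points (angstroms).
atoms = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)]
grid = [(x, 0.0, 4.0) for x in (-2.0, 0.0, 2.0, 4.0)]

for point, energy in zip(grid, steric_field(atoms, grid)):
    print(point, round(energy, 4))
```

Each grid point contributes one feature per compound, which is why the resulting data space becomes large and is usually reduced before or during learning.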
The resulting data space is then usually reduced by a subsequent feature extraction step (see also dimensionality reduction). The learning method that follows can be any of the machine learning methods already mentioned, e.g. support vector machines.[B. Schölkopf, K. Tsuda, J. P. Vert, Kernel Methods in Computational Biology, MIT Press, Cambridge, MA, 2004.]
In the literature, chemists often show a preference for partial least squares (PLS) methods, since these perform feature extraction and induction in one step.
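The way PLS combines feature extraction and induction can be sketched with a single-component PLS1 model: a weight vector projects the centered descriptors onto one latent score (the extraction), and the activity is then regressed on that score (the induction). This is a deliberately reduced sketch with one latent component and no deflation. The function names and the data are hypothetical.

```python
import math


def fit_pls1_one_component(X, y):
    """Fit a one-component PLS1 model; X is a list of rows, y a list of targets."""
    n, p = len(X), len(X[0])
    x_means = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xc = [[row[j] - x_means[j] for j in range(p)] for row in X]
    yc = [value - y_mean for value in y]

    # Weight vector: direction of maximal covariance between X and y.
    w = [sum(Xc[i][j] * yc[i] for i in range(n)) for j in range(p)]
    norm = math.sqrt(sum(value * value for value in w))
    w = [value / norm for value in w]

    # Latent scores (feature extraction) and regression on them (induction).
    t = [sum(Xc[i][j] * w[j] for j in range(p)) for i in range(n)]
    q = sum(ti * yi for ti, yi in zip(t, yc)) / sum(ti * ti for ti in t)
    return {"x_means": x_means, "y_mean": y_mean, "w": w, "q": q}


def predict_pls1(model, x):
    """Predict the target for one descriptor vector x."""
    score = sum((xj - mj) * wj
                for xj, mj, wj in zip(x, model["x_means"], model["w"]))
    return model["y_mean"] + model["q"] * score


# Made-up data: two strongly correlated descriptors and one activity.
X = [[1.0, 2.1], [2.0, 3.9], [3.0, 6.2], [4.0, 7.8]]
y = [3.1, 5.9, 9.2, 11.8]

model = fit_pls1_one_component(X, y)
print(round(predict_pls1(model, [5.0, 10.0]), 2))
```

Because the latent score pools correlated descriptors into one direction, the model remains usable when descriptors are collinear or outnumber the compounds, which is a common situation with 3D field descriptors.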
Molecule mining
Molecule mining approaches, a special case of structured data mining approaches, apply a similarity-matrix-based prediction or an automatic fragmentation scheme that splits molecules into substructures. There are also approaches that use maximum common subgraph searches or graph kernels.
[D. Gusfield, Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology, Cambridge University Press, 1997. ISBN 0521585198]
[C. Helma (ed.), Predictive Toxicology, CRC, 2005. ISBN 082472397X]
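A similarity-based prediction can be sketched as follows: every molecule is represented by the set of fragments it contains, the Tanimoto coefficient (shared fragments divided by all distinct fragments) measures pairwise similarity, and the activity of a query molecule is estimated from its most similar neighbors. This is a minimal sketch that assumes the fragmentation has already been done. The fragment labels, activities, and function names are hypothetical.

```python
def tanimoto(a, b):
    """Tanimoto coefficient of two fragment sets (0.0 if both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def similarity_matrix(molecules):
    """Pairwise Tanimoto similarities for a list of fragment sets."""
    return [[tanimoto(m1, m2) for m2 in molecules] for m1 in molecules]


def predict_activity(query, training, k=2):
    """Similarity-weighted mean activity of the k most similar training molecules.

    training is a list of (fragment_set, activity) pairs.
    """
    scored = sorted(
        ((tanimoto(query, fragments), activity) for fragments, activity in training),
        key=lambda pair: pair[0],
        reverse=True,
    )[:k]
    total_weight = sum(similarity for similarity, _ in scored)
    if total_weight == 0:
        # No overlap with any neighbor: fall back to the unweighted mean.
        return sum(activity for _, activity in scored) / len(scored)
    return sum(similarity * activity for similarity, activity in scored) / total_weight


# Hypothetical training set: fragment labels and activities are made up.
training = [
    ({"ring", "OH", "CH3"}, 6.0),
    ({"ring", "OH"}, 5.5),
    ({"chain", "NH2"}, 2.0),
]

print(predict_activity({"ring", "OH", "Cl"}, training, k=2))
```

The prediction inherits the basic SAR assumption: it is only as good as the chosen notion of similarity, so a pair of molecules that the fragment sets consider similar but that behave differently will be predicted poorly.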
Fragment based (group contribution)
It has been shown that the logP of a compound can be estimated as the sum of the contributions of its fragments. Fragment logP values have been determined statistically. This method gives mixed results and is generally not trusted to be more accurate than +/- 0.1 units.
[S. A. Wildman, G. M. Crippen, Prediction of Physicochemical Parameters by Atomic Contributions, J. Chem. Inf. Comput. Sci., 1999, 39, 868-873.]
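The additive scheme amounts to a table lookup and a sum. In the sketch below, the contribution values are placeholders invented for illustration. They are not the published Wildman-Crippen parameters, and the fragment labels and function name are likewise hypothetical.

```python
# Placeholder fragment contributions to logP. These numbers are invented
# for illustration and are NOT taken from any published parameter set.
FRAGMENT_CONTRIBUTIONS = {
    "CH3": 0.5,
    "CH2": 0.4,
    "OH": -1.0,
    "NH2": -1.2,
}


def estimate_logp(fragment_counts):
    """Sum contribution * count over all fragments of a compound.

    fragment_counts maps a fragment label to how often it occurs.
    Raises KeyError for a fragment without a tabulated contribution.
    """
    return sum(
        FRAGMENT_CONTRIBUTIONS[fragment] * count
        for fragment, count in fragment_counts.items()
    )


# A hypothetical compound made of one CH3, two CH2, and one OH fragment.
print(round(estimate_logp({"CH3": 1, "CH2": 2, "OH": 1}), 2))
```

Raising an error for an unknown fragment, rather than silently treating it as zero, makes it visible when a compound falls outside the set of fragments for which contributions were fitted.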