
Although this field is recognized as
an important research area, its overall objectives and classifications had not
been set out until the recent release of a comprehensive four-volume handbook of
chemoinformatics, edited by Prof. Gasteiger. It contains in-depth contributions
from top authors around the world, with the content organized into chapters
dealing with the representation of molecular structures and reactions, data
types and databases/data sources, search methods, methods for data analysis, and
applications. The Handbook of Chemoinformatics is the
first reference work to be exclusively devoted to this developing field, from
data to knowledge, and will set the standard as the premier information source
for the next decade. This handbook is a must-read for experts as well as
students of chemistry and biology.
The handbook provides a comprehensive and coherent overview of the state of the art of chemoinformatics. The first volume begins with the history of chemoinformatics, aptly written by Peter Willett. The next few chapters deal with chemical nomenclature and the representation of chemical structures using graph theory and SMILES. Chemoinformatics uses a wide variety of algorithms for indexing and retrieving chemical compounds in databases, and four chapters are devoted to processing the constitutional information of molecules. Computational methods for 3D structure generation, and for ligand-based and structure-based design aimed at the so-called bioactive conformation of a potential drug, are described. Shape analysis is a powerful tool in chemistry, since molecular recognition in receptor-ligand interactions is likely to be described more precisely near the molecular surface than elsewhere around the molecule. The large amount of data generated by computation and experiment needs to be visualized in order to identify trends and structures and to recognize shapes and patterns. In this context, the strategic basis of molecular graphics, which optimizes information transfer between human activity and computational processes, assumes great importance.
The chemoinformatics of chemical reactions is not as well developed as
that of chemical structures. The two fundamental tasks for chemical reaction
representation are the prediction of the outcome of a reaction and the design
of a synthesis. Both data-driven and model-driven reaction classification
methods used for knowledge extraction are described. However, the automatic
assignment of a reaction center is not presented.
Data acquisition and data analysis
are important tools for building up knowledge in chemistry and for ensuring that
outgoing products meet all customer requirements. The next topic, Experimental
Design (ED), familiarizes readers with this mathematical technique for planning
and carrying out experiments so that the maximum possible information is gained
from experimental data. Standard data formats are essential for facilitating the
exchange of data between scientists. XML (eXtensible Markup Language) deals with
the electronic exchange of information and documents in every discipline. This
standardized language has a specific extension for handling chemical
information, and many of its features are under review or development.
There are various types of databases
available in the field of chemistry, which store treasures of information.
Abstracting and indexing in bibliographic databases are described in
detail. As the CAS information system has been the major provider of chemical
information since the start of the computer age, a complete section is devoted
to the CAS databases (CAplus and Registry) and their online access tools (STN
Express and SciFinder). The largest information database on organic compounds is
the Beilstein database, and with additional features such as CrossFire its
potential has now been realized. Databases for retrieving inorganic and
organometallic compounds are also included in this chapter. The Cambridge
Structural Database (CSD) provides information on the 3D structures of small
organic and organometallic molecules. Databases for spectroscopy, patents,
environmental information, molecular topology and biochemistry are also
mentioned here. The Internet is the largest repository of data, and the next
section naturally leads to chemistry on the Internet, with an overview of the
internet technologies used to harness chemical knowledge. Laboratories generate
a lot of data that needs to be organized and managed. The chapter concludes with
the basic structure, modules and functioning of Laboratory Information
Management Systems (LIMS).
Chemical structure search is the
most important method of accessing chemical information. This section begins
with the methods available for 2D structure and substructure search.
Markush structures are the generic chemical structures found in patents, and
their retrieval poses a problem in chemical structure searching. This section
sheds light on the current state of the art of Markush topological search
systems. Computable structure similarities are strongly correlated with
biological similarities (structure-property principle); similarity searching is
now widely used for virtual screening as a precursor to substructural analysis.
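The idea behind similarity searching can be sketched in a few lines. The review does not say which similarity measure the handbook uses; the Tanimoto coefficient on fingerprint bits shown here is simply one commonly used choice, and the compound names and fingerprints are hypothetical.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit positions."""
    if not fp_a and not fp_b:
        return 1.0  # convention chosen for this sketch: two empty fingerprints are identical
    common = len(fp_a & fp_b)
    return common / (len(fp_a) + len(fp_b) - common)


def rank_by_similarity(query, library):
    """Return (name, score) pairs sorted from most to least similar to the query."""
    scores = [(name, tanimoto(query, fp)) for name, fp in library.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical fingerprints: each integer marks a structural feature that is present.
query = {1, 4, 7, 9}
library = {
    "cmpd_A": {1, 4, 7, 9},
    "cmpd_B": {1, 4, 8},
    "cmpd_C": {2, 3, 5},
}
ranking = rank_by_similarity(query, library)
```

In a virtual screen, the top of such a ranking would be passed on to more detailed analysis; a real system would derive the fingerprints from the structures rather than write them by hand.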
Chemical structure information can
be correlated with physical, chemical or biological data to make a model, which
can be used to predict new data. The third volume of the handbook focuses on the
calculation of physical and chemical data through direct computational methods.
Molecular mechanics or force field methods are often used, as they are rapid and
can be applied to a large number of molecules with many atoms. Force field
methods mainly for small molecules include MM2, MM3, Tinker, UFF, Momec and
Osmos; those for biological molecules include AMBER, CHARMM, GROMOS, OPLS,
ECEPP, CVFF and MMFF.
The quantum mechanical methods can be applied to large molecules or large data
sets unlike molecular mechanics methods. The molecular orbital theories are
described first, and the properties from quantum mechanical calculations of
interest to chemoinformatics, for instance net atomic charges, dipole moments,
polarizabilities and orbital energies, are described in detail. The extra
information and detail provided by quantum mechanics are important for accurate
work involving specific interactions, such as docking studies.
The eighth chapter of this volume
provides detailed information on descriptors for chemical compounds. As more
than 1500 descriptors are known, care must be taken to choose the correct set.
The first section covers topological descriptors, which have now been superseded
by more sophisticated descriptors. Searching for relationships between molecular
structure and biological activity can be done efficiently using geometric
descriptors, with their large information content. The next section, by
Gasteiger, is on a series of structure coding methods, that is, different ways
of encoding a molecular structure into a vector of numerical values. He suggests
a hierarchy of structure representation: constitution, 3D structure and
molecular surface. The section also touches upon descriptors of molecular
chirality, mainly developed in his group. The last section in this series deals
with the representation of molecular chirality, as qualitative representation of
chiral structure is necessary for QSAR studies. Even though many approaches have
been devised for computer detection, specification and representation of
chirality, correlation with observable properties has been limited, and the data
sets are smaller than those for non-chiral structure-property relationships.
The succeeding chapter delves into
the methods for data analysis, collectively referred to as “inductive learning
methods”. Machine learning is a common term used by computer scientists for the
classification and generalization of data, basically to extract regularity from
data or harvest latent knowledge from databases. Another method of data
analysis is multivariate data analysis, a tool commonly used in chemometrics, as
more than one variable is required to describe chemically relevant objects. Yet
another method is Partial Least Squares, which can be used to analyze data with
strongly collinear, noisy and numerous X-variables and also to model several
response variables Y.
A chapter on Artificial Neural
Networks (ANNs) and their applications, viz., classification, mapping, modeling,
prediction of missing data, reduction of representation, etc., is followed by a
section on the concept of fuzzy logic. Fuzzy logic is viewed as a system of
concepts, principles and methods for modes of reasoning that are approximate
rather than exact and are expressed in natural language. The authors demonstrate
that pattern recognition strategies, which are related to the application of the
human senses, can be transferred to an algorithmic process applicable in the
field of molecular recognition.
Evolutionary algorithms (EAs) or
evolutionary computations are stochastic search methods that are inspired by the
basic principles of Darwinian evolution and by DNA-like genetics, containing a
component of randomness in their algorithmic procedure. The main algorithms used
under this term are genetic algorithms (GA), evolutionary programming (EP),
evolutionary strategies (ES), genetic programming (GP) and classifier systems
(CFS). Their vast applications in chemistry include conformational search and
structure optimization, protein-ligand docking, de novo molecular design,
pharmacophore perception, pseudo-receptor modeling, chemical structure
handling, QSAR, chemometrics, combinatorial libraries, crystallography,
spectroscopy, structure prediction of biological macromolecules, force field
parameterization, chemical reaction handling and sequence alignment; in fact,
the entire world of chemistry.
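To make the stochastic search idea concrete, here is a minimal sketch of a genetic algorithm. It is not taken from the handbook: the bit-counting objective, the parameter values, and the particular operators (tournament selection, one-point crossover, bit-flip mutation, elitism) are assumptions chosen for illustration. In a chemical application, the bit string would encode, for example, torsion angles or library building blocks.

```python
import random


def fitness(bits):
    """Toy objective (an assumption for illustration): the number of 1-bits."""
    return sum(bits)


def evolve(n_bits=20, pop_size=30, generations=40, mutation_rate=0.02, seed=0):
    """Run a simple genetic algorithm and return the best bit string found."""
    rng = random.Random(seed)  # seeded so that runs are reproducible
    population = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    best = max(population, key=fitness)
    for _ in range(generations):
        next_population = [best[:]]  # elitism: the best individual survives unchanged
        while len(next_population) < pop_size:
            # Tournament selection: the fitter of two random individuals becomes a parent.
            parent_a = max(rng.sample(population, 2), key=fitness)
            parent_b = max(rng.sample(population, 2), key=fitness)
            # One-point crossover combines the two parents.
            cut = rng.randint(1, n_bits - 1)
            child = parent_a[:cut] + parent_b[cut:]
            # Mutation: each bit is flipped with a small probability.
            child = [1 - b if rng.random() < mutation_rate else b for b in child]
            next_population.append(child)
        population = next_population
        best = max(population, key=fitness)
    return best


best_individual = evolve()
```

Because the best individual is always carried over, the best fitness can never decrease from one generation to the next; the randomness enters through selection, crossover and mutation.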
Expert systems are computer programs
derived from artificial intelligence research that aid experts in making
decisions. The next section, on expert systems, defines the various terms used
under this concept and describes the development of expert systems using
rule-based programming, inference engines, fuzzy logic, etc. The last chapter in
the third volume delves into the applications of chemoinformatics methods,
though only selected ones are described in detail. The first section, on the
prediction of physical and chemical properties, elaborates on lipophilicity, a
property widely applied to large databases and quantified by the partition
coefficient P or its logarithm, log P. The existing log P data are negligible
compared with the number of known compounds of interest, hence the need to
develop methods to derive log P from molecular structure. Both substructural and
whole-molecule approaches for quantifying log P exist, each with its intrinsic
advantages and drawbacks. QSPR, the computer-assisted prediction of chemical,
physical and biological properties directly from molecular structure, is of
great relevance. QSPR methods can be used to predict properties such as normal
boiling points, critical temperatures, surface tension, Henry’s law constants,
gas chromatographic retention times, ion mobility, etc. The three major parts of
QSPR studies (representation, feature selection and mapping) are covered. This
chapter gives insight into various descriptors, the design and implementation of
which is a current research area in QSPR.
Web technology, owing to its ease of
use and high interactivity, offers many advantages for processing chemical
information, and the next section is accordingly on the web-based calculation of
molecular properties. The development of Java programs and other new
technologies such as servlets, VRML, XML and CML is making the web an ideal
environment for processing chemical information. Some representative examples of
web tools and the in-silico profiling of molecules at Novartis are
described by the authors; however, not all the commercially available software
packages are mentioned, which would have been useful for readers. Correlating
structural and spectroscopic information, IR and NMR in particular, is an
important aspect of chemoinformatics. The digital encoding of IR spectra
and coding of the chemical structure, and the computational correlation between
NMR spectra and molecular structure, are described in two sections. The
SpecInfo, CSEARCH, NMRShiftDB and CNMR databases form the basis for shift
prediction tools. From these, compressed representations of data such as HOSE
code tables can be generated, which aid in chemical shift prediction for new
structures. Structure validation by ab initio quantum mechanical computations is
now feasible with PCs and workstations. The simultaneous use of various spectral
data provides leads for the exact structure elucidation of a molecule. The next
section sheds light on the development of automatic systems for structure
elucidation, CASE (Computer Assisted Structure Elucidation), so far only for
small organic molecules. A typical CASE process involves spectral database
searching and storage as a bit string representation.
The last volume of the Handbook is
on chemical reactions and synthesis design. The analysis and processing of
reaction data is very important to chemists for solving any
synthetic problem. Topology-based reaction classification codes and Kohonen
neural networks help in retrieving reaction information from different sources
by using algorithmically derived hash codes. Computer Assisted Synthesis Design
(CASD) looks at technical ways of organizing communication between computer and
chemist for the description of reactions. Molecules are described by a
connectivity table, matrices or numerical linear notation. These three systems
lead to three methods for coding reactions in CASD programs: the transform
approach, the BE-matrices approach and the numerical approach. The next article
features an interesting design system, WODCA (Workbench for the Organization of
Data for Chemical Applications). All the aspects of organic reactions (reaction
planning, reaction prediction and synthesis design) are dealt with. Specific
examples are given to explain the various disconnection strategies available for
the perception of strategic bonds within a target compound.
Drug discovery is undoubtedly the
most important application of chemoinformatics. All chemoinformatics activities,
viz., chemical libraries, virtual screening, structure-activity relationships,
high-throughput screening, in-silico screening, de novo ligand design and data
mining, are vital to the processes of drug discovery. The drug discovery
paradigm (HTS hits, HTS actives, lead series, drug candidates, launched drug)
has shifted focus from good-quality drug candidates to good-quality leads. The
succeeding section deals with QSAR contributions to drug design. QSAR
applications in drug design include the transport and distribution of drugs in
biological systems, enzyme inhibition and the correlation of different kinds of
biological activities. Classical QSAR studies do not consider the 3D structures
of drugs or their chirality. CoMFA (Comparative Molecular Field Analysis)
was therefore developed for deriving 3D QSAR models. It is mostly used in the
field of ligand-protein interaction, describing affinity and inhibition
constants. Yet another section, on 3D and nD QSAR methods, describes Comparative
Molecular Moment Analysis (CoMMA), a rapid method of determining 3D QSAR
descriptors based on the moments of a molecule’s shape and charge distribution;
the descriptors are then converted by PLS into a QSAR model with better
predictivity. The methodology of nD QSAR adds to the 3D QSAR methodology by
incorporating unique physical characteristics into the available descriptor pool
for the creation of models. Other types of QSAR methods (5D QSAR, RD QSAR, FEFF,
MI QSAR) are briefly touched upon. The implementation of these methodologies
will add a wealth of information about how small organic molecules interact with
biological molecules and macromolecules.
An overview of the applications of
combinatorial chemistry in drug discovery is given in the next section, entitled
“high throughput chemistry”. Traditionally, the term high throughput chemistry
encompasses all the technologies of combinatorial chemistry and the multiple
parallel synthesis of chemical entities by condensing a small number of reagents
together in all possible combinations, with the aim of expediting the drug
discovery process. Some of the techniques are described schematically, such as
matrix and split synthesis, encoding libraries, deconvolution, etc. The concepts
of solid-phase synthesis, solution-phase synthesis, dynamic combinatorial
chemistry and combinatorial biosynthesis are explained in detail. The
advancement in HTS and combinatorial chemistry has led to large collections of
compounds, which require equally advanced methods for their property
characterization. The field of molecular diversity allows a selection of
dissimilar compounds from a large range of chemical space in order to discover
new leads. The methods and descriptors available to solve the problem of making
a diverse selection are summarized in this section.
The pharmacophore approach is
intermediate between 3D QSAR, a strictly ligand-based approach, and full
computation at the quantum mechanical level of the dynamic interaction between
the ligand and the receptor site. Applications of the pharmacophore are in de
novo drug design, guidance for the design of targeted combinatorial libraries,
interpretation of data from high-throughput screening and, mostly, database
searches of 3D structures of small molecules. The current trends in
pharmacophore development include 3D substructure perception, electron
conformational methods and property-based pharmacophores.
There are different approaches used
for structure generation, also known as de novo design, of potential ligands
that can bind to the receptor site of an enzyme whose 3D structure is known. The
de novo design process involves steps such as analyzing the structural
information of the receptor to determine the active site, meeting the
requirements of the active site by placing appropriate chemical functionality in
the required locations, constructing a molecular scaffold to hold them in place,
and finally sorting and selecting the designed molecules by estimating their
chemical and biological properties. In practice, de novo systems are generally
used in combination with other modeling tools, and the initially designed
structures are modified by medicinal chemists before any synthesis is carried
out. Some of the computer programs reported in the literature are SPROUT, TOPAS,
LEGEND and SEEDS; however, most of the work in this area is not published. The
limitation of de novo design systems is that they do not take into account
factors such as transport properties, toxicity and stability.
The next section introduces the reader
to the basic concept of docking, that is, the formation of non-covalent
ligand-receptor complexes, and to the docking problem, i.e., the task of
predicting the structure of the resulting complex. There are two opposing
approaches to this: either to reformulate it as a discrete problem that can be
solved with combinatorial algorithms, or to use stochastic search algorithms.
Basically, docking is an energy minimization problem concerned with the search
for the lowest free energy binding mode of a ligand within a protein binding
site. After the search, the next step in docking is to rank the different
configurations generated with respect to their binding affinity. Special aspects
of the docking problem such as protein flexibility, water molecules, protein
homology and combinatorial docking are described briefly.
The increase in structural
information on proteins and the systematic evaluation of the geometries of
protein-ligand complexes using protein crystallography or multidimensional NMR
will expedite the process of lead discovery. However, mere raw information is
not enough; it has to be evaluated, distilled and transformed into a unique data
format for storage. In structural biology the central database system is the PDB
(Protein Data Bank), which is accessible to the public. This section describes
an object-oriented database tool, Relibase, developed by the authors to handle
protein-ligand information. Relibase operates on intramolecular geometries and
correlated intermolecular interaction patterns, and also has tools for protein
information such as sequence similarity, secondary structural elements or
solvent accessibility. The water-based module in Relibase can detect
surface-exposed as well as deeply buried water molecules in the protein-ligand
interface. Specialized topics such as the comparative analysis of ligand binding
pockets and secondary structural elements, which provide special binding motifs
in proteins, are also dealt with.
The last chapter of the handbook
consists of two sections that deal with the interface of chemoinformatics and
bioinformatics: protein structure, sequence and genome. The first section deals
with the prediction of 3D protein structure from amino acid sequence. The
databases of known protein sequences (1,000,000) are expanding owing to the
implementation of large-scale genome projects, but proteins whose structures are
known (PDB, 20,000) are considerably fewer in comparison. In practice, the
prediction of 3D structure from sequence is challenging, first because the
energy difference between native and unfolded proteins is extremely small, and
second because the high complexity of protein folding requires more computing
time.
There are three prediction methods
that try to bridge the sequence-structure gap: homology modeling, threading and
1D prediction. For proteins to perform their function, they need to maintain a
specific 3D structure, so structure tends to be conserved in evolution. This
evolutionary history is used successfully for aligning protein (or nucleotide)
sequences. Generally, advanced alignment algorithms use programs such as BLAST
and FASTA and then apply a dynamic programming algorithm. 1D prediction can be a
useful precursor to 3D prediction, and the 1D features predicted are solvent
accessibility, transmembrane strands, helices and regions of structural
switches. Predictions in two or three dimensions have met with limited success
so far. The section on genome bioinformatics explores the vast information
encoded in DNA to identify all the genetic elements that perform any biological
function. The comprehensive analysis of a genome starts with the identification
of coding regions, regulatory sites, tRNAs and rRNAs. The two major branches in
high-throughput analysis are expression analysis and ‘proteomics’, i.e., the
study of the protein products of the genome and their interactions and
functions. Though major advances have been made in these areas, topics such as
tertiary structure prediction and protein-protein interaction remain unsolved to
this day.
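The dynamic programming step used in sequence alignment can be sketched as follows. This is a generic Needleman-Wunsch style global alignment score, not the specific procedure of BLAST, FASTA or the handbook; the scoring values (match +1, mismatch -1, gap -1) are assumptions chosen for illustration, whereas real protein alignments use substitution matrices and more elaborate gap penalties.

```python
def global_alignment_score(seq_a, seq_b, match=1, mismatch=-1, gap=-1):
    """Best global alignment score of two sequences by dynamic programming."""
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    # score[i][j] holds the best score for aligning seq_a[:i] with seq_b[:j].
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap  # seq_a[:i] aligned against gaps only
    for j in range(1, cols):
        score[0][j] = j * gap  # seq_b[:j] aligned against gaps only
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # align the two residues
                score[i - 1][j] + gap,       # gap in seq_b
                score[i][j - 1] + gap,       # gap in seq_a
            )
    return score[-1][-1]


example_score = global_alignment_score("ACGT", "AGT")
```

Each cell depends only on its three neighbours, so the table is filled in time proportional to the product of the two sequence lengths; recovering the alignment itself would require an additional traceback step that is omitted here.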
The volume concludes with a brief
chapter on future directions in the field of chemoinformatics by Gasteiger. He
foresees chemoinformatics gaining importance in chemistry and being
incorporated into regular chemistry curricula. The Computer Assisted Structure
Elucidation (CASE) process and Computer Assisted Synthesis Design (CASD) would
be integrated into the daily work of bench chemists. Chemoinformatics methods
will be extended to theoretical chemistry; the simulation of reactions, the
modeling of biochemical and metabolic reactions, and the study of proteins will
be the future thrust areas for chemoinformatics.
Another field of great activity will be the merging of bioinformatics and
chemoinformatics; their common problems can be solved using methods developed in
both fields. Drug design will no longer be the major domain of
chemoinformatics; other fields such as materials science, non-linear optical
properties, adhesives, electrical energy, hair coloring chemicals, detergents,
etc., will also be part of chemoinformatics. Another challenge before
chemoinformatics is multivariate optimization, i.e., the simultaneous
optimization of several properties; for example, methods should predict not just
the activity of a drug but also its toxicity, solubility, penetration, etc.
Gasteiger argues for chemists to use electronic lab notebooks to record data,
which can be used to fill other information sources such as manuscripts,
journals, books and databases. Finally, chemoinformatics should speak the
language of chemists and provide them with just the desired information and not
heaps of unnecessary data.
The handbook is rich in content and original in presentation; my personal opinion is that “this is the first set of books that should find a place on every chemoinformatician’s desk and in every university library in the world”.
M.Karthikeyan
Scientist
National Chemical Laboratory
Pune - 411 008
Please write your personal comments directly to me: karthi@ems.ncl.res.in
Handbook of Chemoinformatics: From Data to Knowledge, Volumes 1-4.
Edited by Johann Gasteiger (University of Erlangen-Nürnberg).
Wiley-VCH Verlag GmbH & Co. KGaA: Weinheim. 2003. xlvii + 1870 pp. $750.00.
ISBN 3-527-30680-3.