83,99 €
This essential guide to the knowledge and tools in the field includes everything from the basic concepts to modern methods, while also forming a bridge to bioinformatics.
The textbook has a clear didactic structure: it starts from the basics and the theory and then gives an overview of the methods. Exercises at the end of each section or chapter support learning. Software tools are explained in detail, so that students learn not only the necessary theoretical background but also how to use the different software packages available. The wide range of applications is presented in the companion book Applied Chemoinformatics - Achievements and Future Opportunities (ISBN 9783527342013). The book is aimed at Master's and PhD students in chemistry, biochemistry, and computer science, and also serves as an introduction for other newcomers to the field.
Number of pages: 1079
Cover
Dedication
Foreword
List of Contributors
Chapter 1: Introduction
1.1 The Rationale for the Books
1.2 The Objectives of Chemoinformatics
1.3 Learning in Chemoinformatics
1.4 Outline of the Book
1.5 The Scope of the Book
1.6 Teaching Chemoinformatics
References
Chapter 2: Principles of Molecular Representations
2.1 Introduction
2.2 Chemical Nomenclature
2.3 Chemical Notations
2.4 Mathematical Notations
2.5 Specific Types of Chemical Structures
2.6 Spatial Representation of Structures
2.7 Molecular Surfaces
Selected Reading
References
Exercises
Chapter 3: Computer Processing of Chemical Structure Information
3.1 Introduction
3.2 Standard File Formats for Chemical Structure Information
3.3 Input and Output of Chemical Structures
3.4 Processing Constitutional Information
3.5 Processing 3D Structure Information
3.6 Visualization of Molecular Models
3.7 Calculation of Molecular Surfaces
3.8 Chemoinformatic Toolkits and Workflow Environments
Selected Reading
References
Exercises
Chapter 4: Representation of Chemical Reactions
4.1 Introduction
4.2 Reaction Equation
4.3 Reaction Types
4.4 Reaction Center and Reaction Mechanisms
4.5 Chemical Reactivity
4.6 Learning from Reaction Information
4.7 Building of Reaction Databases
4.8 Reaction Center Perception
4.9 Reaction Classification
4.10 Stereochemistry of Reactions
4.11 Reaction Networks
Selected Reading
References
Exercises
Chapter 5: The Data
5.1 Introduction
5.2 Data Types
5.3 Storage and Manipulation of Data
5.4 Conclusions
Selected Reading
References
Exercises
Chapter 6: Databases and Data Sources in Chemistry
6.1 Introduction
6.2 Chemical Literature and Databases
6.3 Major Chemical Database Systems
6.4 Compound Databases
6.5 Databases with Properties of Compounds
6.6 Reaction Databases
6.7 Bibliographic and Citation Databases
6.8 Full‐Text Databases
6.9 Architecture of a Structure‐Searchable Database
Selected Reading
References
Exercises
Chapter 7: Searching Chemical Structures
7.1 Introduction
7.2 Full Structure Search
7.3 Substructure Search
7.4 Similarity Search
7.5 Three‐Dimensional Structure Search Methods
7.6 Sequence Searching in Protein and Nucleic Acid Databases
7.7 Summary
Selected Reading
References
Exercise
Chapter 8: Computational Chemistry
8.1 Empirical Approaches to the Calculation of Properties
8.2 Molecular Mechanics
8.3 Molecular Dynamics
8.4 Quantum Mechanics
Chapter 9: Modeling and Prediction of Properties (QSPR/QSAR)
Chapter 10: Calculation of Structure Descriptors
10.1 Introduction
10.2 Structure Descriptors for Classification and Similarity Searching
10.3 Structure Descriptors for Quantitative Modeling
10.4 Descriptors That Are Not Calculated from the Chemical Structure
10.5 Summary and Outlook
Selected Reading
References
Exercises
Chapter 11: Data Analysis and Data Handling (QSPR/QSAR)
11.1 Methods for Multivariate Data Analysis
11.2 Artificial Neural Networks (ANNs)
11.3 Deep and Shallow Neural Networks
Chapter 12: QSAR/QSPR Revisited
12.1 Best Practices of QSAR Modeling
12.2 The Data Science of QSAR Modeling
Selected Reading
References
Exercises
Chapter 13: Bioinformatics
13.1 Introduction
13.2 Sequence Databases
13.3 Searching Sequence Databases
13.4 Characterization of Protein Families
13.5 Homology Modeling
Selected Reading
References
Exercises
Chapter 14: Future Directions
14.1 Access to Chemical Information
14.2 Representation of Chemical Compounds
14.3 Representation of Chemical Reactions
14.4 Learning from Chemical Information
14.5 Training in Chemoinformatics
Answers Section
Chapter 2
Chapter 3
Chapter 4
Chapter 5
Chapter 6
Chapter 7
Chapter 8
Chapter 10
Chapter 11
Chapter 12
Chapter 13
Index
Substance Index
End User License Agreement
Chapter 2
Table 2.1 Different types of molecular representations of phenylalanine (without stereochemistry).
Chapter 3
Table 3.1 Important file formats for the representation and exchange of chemical structure information and their respective possibilities for representing or coding the constitution, the configuration, that is, the stereochemistry, and the 3D structure or conformation.
Table 3.2 Comparison of the SMILES and the SLN syntax (see also Section 3.2.3).
Table 3.3 Format details of the ATOM record (cf. Figure 3.18).
Table 3.4 Sample STAR‐based dictionaries.
Table 3.5 Collection of selected molecule editors and molecule viewers (not complete).
Table 3.6 Freely available toolkits and workflow environments in chemo‐ and bioinformatics.
Chapter 4
Table 4.1 Substituent constants for various groups to be used in Eqs. 4.1 and 4.2: $\sigma_m$ for substituents in meta‐position and $\sigma_p$ for substituents in para‐position.
Chapter 5
Table 5.1 Selected artifacts in chemistry that require complex data types.
Table 5.2 Example of efficiency of an ab initio geometry optimization in different coordinate sets.
Table 5.3 Probability ($P$) of collision avoidance for 32‐ and 64‐bit hash codes.
Chapter 6
Table 6.1 CAS databases in SciFinder [1, 4].
Table 6.2 Searches for references to the analgesic paracetamol and to substituted naphthalenes in SciFinder and Reaxys.
Table 6.3 Hits for lidocaine 6.1 in SciFinder and Reaxys with different queries.
Table 6.4 Search results (hits) for query structures of Figure 6.3 in SciFinder and Reaxys.
Table 6.5 Special patent sequence databases available from STN at cost [6].
Table 6.6 Free Web portals for sequence and related databases.
Table 6.7 Searching for compounds containing the elements Li, O, P, F, and Fe.
Table 6.8 3D structure databases of inorganic, organic, and biomolecules.
Table 6.9 Selected properties of (+)‐estrone from diverse databases.
Table 6.10 Major databases with thermodynamic data available at cost.
Table 6.11 NIST databases with thermodynamic data.
Table 6.12 Major spectral databases and database systems available at cost.
Table 6.13 Databases with special spectroscopic data at no charge.
Table 6.14 Single‐step reactions in Reaxys and SciFinder CASREACT (May 2017).
Table 6.15 Selected bibliographic databases at cost [6].
Table 6.16 Selected patent databases at cost.
Chapter 7
Table 7.1 List of some popular similarity coefficients.
Table 7.2 Distance distributions for structures A and B (see Figure 7.16).
Chapter 8
Table 8.1 Experimental mean molecular polarizabilities and values calculated by Eq. 8.2.
Table 8.2 Parameters for the dependence of orbital electronegativity on charge (Eq. 8.4).
Chapter 10
Table 10.1 Classification of molecular descriptors by data type.
Table 10.2 Classification of fragment descriptors and fingerprints.
Table 10.3 Classification of descriptors by the dimensionality of their molecular representation.
Table 10.4 Invariance properties of a selection of molecular descriptors.
Chapter 11
Table 11.1 Basic results obtained from the application of a binary classifier to a test set with $n$ objects, $n_1$ actually belonging to class 1 and $n_2$ to class 2; $n_1 + n_2 = n$. $n_{CA}$ is the number of objects belonging to class $C$ (1 or 2) and being assigned to class $A$ (1 or 2).
Table 11.2 Comparison of prediction errors of the Kaggle test sets using DNN [3] and BNN [22].
Chapter 1
Figure 1.1 Inductive learning.
Chapter 2
Figure 2.1 Hierarchy levels of molecular representation providing different information contents.
Figure 2.2 Classification of different types of molecular representations with examples given in the bottom line of the boxes.
Figure 2.3 The IUPAC name of caffeine: 1,3,7‐trimethylpurine‐2,6‐dione.
Figure 2.4 Scientific notation for the information on chemical elements, for example, oxygen (O).
Figure 2.5 Identical empirical formula of two compounds: (a) phenylalanine and (b) 2‐(3‐furanyl)‐1‐pyrrolidinecarboxaldehyde and their subdivided formula representations.
Figure 2.6 Different graph theoretical representations (b) to (d) of the structure diagram of 5‐ethyl‐2‐methylheptane (a). In graph theory, only the connections are important, not the length of the edges or the angles between them. (Usually, only heavy atom types are represented, not hydrogen atoms.)
Figure 2.7 (a) The simple graph $G_1(V_1, E_1)$ has four vertices $V_1 = \{v_1, v_2, v_3, v_4\}$ and three edges $E_1 = \{e_1, e_2, e_3\}$. The latter are all incident in vertex $v_2$. Thus, the degree of vertex $v_2$ is three, that is, $d(v_2) = 3$; (b) The multigraph $G_2$ with $E_1 = \{e_1, \ldots, e_5\}$ on the right‐hand side has an additional edge between $v_2$ and $v_3$.
Figure 2.8 Vertex coloring (a) and edge coloring (b) of a graph.
Figure 2.9 The example (a) of a simple undirected graph $G_1$ consists of the set of vertices $V_1 = \{v_1, v_2, v_3, v_4\}$ and the set of edges $E_1 = \{\{v_1, v_2\}, \{v_2, v_3\}, \{v_2, v_4\}\}$. The example (b) of a simple directed graph $G_3$ consists of the set of vertices $V_3 = \{v_1, v_2, v_3, v_4\}$ and the set of edges $E_3 = \{(v_1, v_2), (v_2, v_4), (v_3, v_2)\}$.
Figure 2.10 Graph examples of a walk (a) and a path (b) starting in $v_1$.
Figure 2.11 Examples of a subgraph and a supergraph. Graph $G_2$ with the vertex set $V_2 = \{1, 2, 3, 4, 5, 6\}$ and graph $G_3$ with the vertex set $V_3 = \{1, 2, 6, 7, 8\}$ are subgraphs of graph $G_1$ with the vertex set $V_1 = \{1, 2, 3, 4, 5, 6, 7, 8\}$. Accordingly, graph $G_1$ is a supergraph of the graphs $G_2$ and $G_3$.
Figure 2.12 An adjacency matrix of ethanal (a) is simplified stepwise by setting all matrix elements of the lower‐left corner to zero, retaining the upper‐right corner of the matrix and (b), finally, omitting the matrix elements for hydrogen atoms.
Figure 2.13 An incidence matrix $A_I$ of ethanal. In the non‐square matrix, the atoms are listed in columns and the bonds in rows. Only an undirected graph is presented here.
Figure 2.14 Distance matrices of ethanal with (a) geometric distances in angstrom and (b) topological distances. The matrix elements of (b) result from counting the number of bonds along the shortest path between the chosen atoms (Figure 2.12).
Figure 2.15 A bond matrix $A_B$ of ethanol (a) and the redundant matrix (b).
Figure 2.16 A BE matrix of ethanal (a) and the redundant matrix (b).
Figure 2.17 The BE matrix of ethanal enables the determination of the number of valence electrons on each atom (the sum of each row/column).
Figure 2.18 The cross sum of the BE matrix of Figure 2.17 validates the octet rule.
Figure 2.19 The structure diagram of ethanal represented as a connection table consisting of two lists, a list of atoms and a list of bonds. The atom list contains atom indices that have been arbitrarily assigned to each atom and are used as atom references in the bond list.
Figure 2.20 Connection table of ethanal.
Figure 2.21 More compact (nonredundant) connection table of ethanal. Only non‐hydrogen atoms are considered; bonds with the lower indices are given only once (compare with Figure 2.20).
Figure 2.22 Workflow for the classification of isomeric structures of organic molecules.
Figure 2.23 Scheme of tautomerism. X, Y, and Z can be a carbon, nitrogen, phosphorus, oxygen, or sulfur atom. G represents an electrofuge or nucleofuge group, in most cases a hydrogen atom.
Figure 2.24 Keto–enol (a), imine–enamine (b), and lactam–lactim (c) tautomerism.
Figure 2.25 Thioacetic acid shows tautomerism (from left to right at the top), mesomerism (from left to right at the bottom), and ionization (top to bottom on the left‐ and right‐hand sides).
Figure 2.26 The substituted phenyl derivative is a typical Markush structure (a). Here, a number of compounds are described in a one‐structure diagram by different types of variability: R1 substituent variation, R2 homology variation, R3 position variation, and n frequency variation. Phenylalanine (b) is one of these structures when R1 is COOH, R2 and R3 are H. Another example that is covered by the Markush notation is (c) 3,4‐dichloro‐N‐ethyl‐α‐phenyl‐benzenebutanamine.
Figure 2.27 Single bonds (a) are stored as σ‐systems and double bonds (b) as π‐systems.
Figure 2.28 Lewis representations of (a) benzene and 1,3‐butadiene (b), as examples of resonance of aromaticity and conjugated double bonds.
Figure 2.29 (a) Two zwitterionic resonance structures are required for a valid VB representation of the nitro group. (b) Improper representation of a nitro group with a pentavalent nitrogen atom. (c) The hybrid diagram needs no charged resonance structures. The π‐system contains four electrons on three atoms.
Figure 2.30 Lone pairs, radicals, and orbitals without electrons are represented by a π‐system with two, one, or zero electrons on the corresponding atom, respectively.
Figure 2.31 (a) Singlet carbene and (b) triplet carbene.
Figure 2.32 Enol ethers (a) have two different ionization potentials (IP) (b), depending on the orbitals concerned.
Figure 2.33 Example of an electron‐deficient bond in diborane with a B-H-B bond.
Figure 2.34 The multi‐haptic bonding in organometallic complexes (e.g., ferrocene) cannot be expressed adequately by a connection table.
Figure 2.35 From the constitution to the configuration and then to the conformation (3D structure) of a molecule with the example of phenylalanine (from left to right).
Figure 2.36 Workflow for the classification of stereoisomerism.
Figure 2.37 Butenedioic acid with (a) E configuration = fumaric acid, and (b) Z configuration = maleic acid.
Figure 2.38 Different possibilities for displaying stereochemistry: (a) the wedge/hashed projection is the usual way for depicting stereochemistry, (b) this representation is ambiguous and should therefore not be used, and (c) the Fischer projection is particularly used for carbohydrates.
Figure 2.39 The four ligands attached to a chiral center are ranked according to the CIP rules (OH > COOH > CH₃ > H); then the enantiomers of 2‐hydroxy‐propionic acid (lactic acid) can be assigned with the R (right‐hand) or S (left‐hand) configuration.
Figure 2.40 Examples of chiral molecules with different types of stereogenic units. (a) Lactic acid, (b) tetra‐substituted adamantane, (c) binaphthyl compound, and (d) substituted paracyclophane.
Figure 2.41 Cartesian coordinate system and Cartesian coordinates of chloromethane.
Figure 2.42 Internal coordinates of 1,2‐dichloroethane: bond lengths $r_1$ and $r_2$, bond angle $\alpha$, and torsion angle $\tau$.
Figure 2.43 Graphical representation of the electrostatic potential on molecular surfaces of phenylalanine: (a) dots, (b) mesh or chicken wire, (c) solid, and (d) semitransparent.
Chapter 3
Figure 3.1 The standard InChI of phenylalanine.
Figure 3.2 The standard InChI and InChIKey of phenylalanine including stereochemistry.
Figure 3.3 XYZ file representing ethanal.
Figure 3.4 Z‐matrix of 1,2‐dichloroethane.
Figure 3.5 All MDL file formats based on the Molfile format.
Figure 3.6 Structure of (2R,3E)‐4‐chlorobut‐3‐en‐2‐ol.
Figure 3.7 Molfile representing (2R,3E)‐4‐chlorobut‐3‐en‐2‐ol shown in Figure 3.6 in the V2000 and V3000 format. The line numbers listed in the gray columns and the descriptions are not part of a Molfile. The fundamental structure of a Molfile is indicated in the central column.
Figure 3.8 Detailed description of the counts line of both Molfile versions (V2000 and V3000) from the molecule shown in Figure 3.6.
Figure 3.9 Detailed description of the atom block of both Molfile versions (V2000 and V3000) from the molecule shown in Figure 3.6.
Figure 3.10 Detailed description of the bond block of both Molfile versions (V2000 and V3000) from the molecule shown in Figure 3.6.
Figure 3.11 Structure of sulfamide (sulfuric diamide).
Figure 3.12 Sample SDfile for sulfamide (sulfuric diamide) including both structural information and associated data, for example, physicochemical properties.
Figure 3.13 Visualization of the 3D molecular structure of α‐conotoxin PNI1 polypeptide (PDB ID: 1pen) including secondary information (as cartoons) and water molecules.
Figure 3.14 Title section of the analyzed PDB file. (The gray rows specify the position of the 80 possible characters and the gray column indicates the line number of the file.)
Figure 3.15 Remark section of the analyzed PDB file.
Figure 3.16 Primary structure and heterogen sections of the analyzed PDB file.
Figure 3.17 Crystallographic and coordinate transformation sections of the analyzed PDB file.
Figure 3.18 Atomic coordinate data section of the analyzed PDB file.
Figure 3.19 Last lines of the analyzed PDB file.
Figure 3.20 CML file representing (2R,3E)‐4‐chlorobut‐3‐en‐2‐ol shown in Figure 3.6. The line numbers are listed in the gray column and are not part of a CML file.
Figure 3.21 The JSME as an example of a molecular editor.
Figure 3.22 The Jmol stand‐alone application as an example of a molecular viewer.
Figure 3.23 Different approaches to storing and retrieving tautomeric forms of molecules – here shown with acetylacetone. In model (a) only one tautomeric form represents the molecule, whereas in approach (b) the set of all tautomeric forms is used. Representation (c) indicates that only one generalized tautomeric form describes all the tautomers of (b).
Figure 3.24 Standard (a) and nonstandard InChIs (b) and the InChIKeys of acetylacetone to describe mobile hydrogens in tautomeric forms. A specific tautomeric structure (c) is described by the fixed hydrogen atom layer, where one hydrogen is fixed at atoms 3 and 6. (/b4‐3 describes in addition a double bond in the stereolayer.)
Figure 3.25 Six different possibilities for labeling the atoms in hypochlorous acid.
Figure 3.26 During the iteration, the EC value of each atom is calculated by summing the EC values of the directly connected non‐hydrogen atoms of the former sphere (relaxation process).
Figure 3.27 The EC values of the atoms of phenylalanine (simplified to nodes for all non‐hydrogen atoms, ignoring stereochemistry and bond types) are calculated by considering the EC values of the neighboring atoms. After each relaxation process (a–d), $k$, the number of equivalent classes (different EC values), is determined.
Figure 3.28 Canonicalization starts at the atom with the highest EC value (in the example of Figure 3.27 the atom with the value of 16), which obtains the number 1.
Figure 3.29 The molecular graph consists of three cycles (c1, c2, c3), which can be represented by the incidence vectors of their edges. The set of three cycles is dependent, since the symmetric difference (= exclusive or) of two of the incidence vectors results in the third one. For example, the symmetric difference of c2 and c3 results in c1. Each combination of two of these cycles forms a cycle basis. Only the combination of c1 and c2 forms a minimum cycle basis.
Figure 3.30 Two simple cases for the reduction of a graph by removing a vertex $x$ according to Hanser's algorithm.
Figure 3.31 Stepwise reduction of the number of vertices according to Hanser's algorithm. The molecular graph contains 3 rings.
Figure 3.32 Cubane, (a) spatial and (b) planar representation, contains six rings (c) of size four. Every combination of five of these 4‐cycles forms a valid SSSR. The SMARTS pattern [R3] matches four of the eight carbon atoms.
Figure 3.33 Comparison of interchangeability classes (ICs), relevant cycles (RCs), and the smallest set of smallest rings (SSSR). (a) The molecular graph contains atoms A, B, and C. (b) The molecular graph consists of six rings. (c) The membership of atoms A, B, and C to an SSSR and the set of RCs and ICs are shown in table C. All of the six rings belong to the set of RCs. Each combination of two small rings and one of the large rings forms a valid SSSR. Ring 1 and ring 2 each form a separate IC. The third IC is formed by the union of the rings 3, 4, 5, and 6.
Figure 3.34 (a) The cyclophane‐like molecular graph contains three URFs that are identical to the ICs. (b) Cubane contains six URFs in contrast to only one IC. (c) The number of URFs and ICs are identical for atoms A and B. Atom C is a member of three URFs but only a member of one IC.
Figure 3.35 Stereocenters can be identified by permutation groups. Thus, the structure is separated into a skeleton and its ligands. Both are numbered independently, by the indices of the skeleton (1–4) and those of ligands (A–D) and are described in a permutation matrix.
Figure 3.36 The ordered list of 24 priority sequences of the ligands A–D around a tetrahedral stereocenter. According to the number of permutations, the list can be separated into two classes (odd number of permutations on the right‐hand side or even number of permutations on the left‐hand side).
Figure 3.37 The permutation matrices of two structures that differ through rotation by 120°. The permutation matrix of the rotated isomer can be brought into correspondence with the permutation matrix of the reference isomer by two interchanges of two ligands (transpositions).
Figure 3.38 Automatic 3D structure generation.
Figure 3.39 Classification of automatic 3D structure generators.
Figure 3.40 General workflow to generate a 3D model by CORINA.
Figure 3.41 An unsymmetrical superphane and its superstructure.
Figure 3.42 The principle of longest pathways for acyclic fragments and molecules.
Figure 3.43 Elimination of nonbonded interactions (close contacts).
Figure 3.44 Superimposition of a set of conformations of 2R‐benzylsuccinate with the benzene ring fixed on the right (rotatable bonds are highlighted on the left).
Figure 3.45 Dependence of the potential energy curve of n‐butane on the torsion angle $\tau$ between carbon atoms C2 and C3.
Figure 3.46 General workflow scheme as utilized by the program ROTATE for the generation of multiple conformations.
Figure 3.47 Derivation of the $\mathrm{RMS}_{XYZ}$ deviation of two conformations.
Figure 3.48 Ring templates for a saturated six‐membered ring and a six‐membered ring with one double bond as implemented in the ring conformation table of the 3D structure generator CORINA and ROTATE.
Figure 3.49 Derivation of the Torsion Angle Library.
Figure 3.50 Derivation of a symbolic potential energy function from the torsion angle distribution of a torsion fragment.
Figure 3.51 The first dynamic molecular display of small molecules by Levinthal was driven by the “crystal ball.”
Figure 3.52 The most common molecular graphic representations of phenylalanine such as (a) wire frame, (b) capped sticks, (c) ball and stick, (d) space filling, and (e) inorganic molecules ($\mathrm{YBa_2Cu_3O_{7-x}}$) with balls and sticks (left) and the same molecule polyhedral (right).
Figure 3.53 The most common molecular graphic representations of biological molecules (lysozyme): (a) balls and sticks; (b) backbone or Cα‐stick; and (c) cartoon (including the cylinder, ribbon, and tube model).
Figure 3.54 Different molecular surfaces of lysozyme (5LA5.pdb): (a) van der Waals Surface, (b) Connolly surface, and (c) solvent‐accessible surface.
Figure 3.55 Dependence of the van der Waals energy on the distance between two non‐connected atom nuclei. With decreasing atomic distance, the interaction between the two atoms becomes attractive, going through a minimum at the van der Waals distance. Then, upon a further decrease in the distance, a rapid increase in repulsion energy is observed.
Figure 3.56 Cross section of the 3D model of formic acid (HCOOH). The van der Waals radius of each atom of the molecule is taken and by fusing the spheres the van der Waals surface is obtained.
Figure 3.57 The Connolly surface is determined by moving a probe sphere (usually a water molecule) over the van der Waals surface.
Figure 3.58 The center of the rolling probe sphere defines the solvent‐accessible surface during movement of the probe over the van der Waals surface. Thus, the molecular surface is expanded by the radius of the solvent molecule.
Figure 3.59 Different isovalue‐based surfaces of phenylalanine: (a) isoelectronic density, (b) molecular orbitals (HOMO–LUMO), (c) isopotential surface, and (d) isosurface of electron cryo‐microscopic volume of the ribosome of Escherichia coli.
Chapter 4
Figure 4.1 A typical equation of a chemical reaction.
Figure 4.2 Two reaction equations showing two completely different uses for the (+) symbol: (a) giving a fully balanced single reaction and (b) combining two parallel reactions into a single equation that is also not stoichiometrically balanced.
Figure 4.3 The scheme of a biochemical reaction indicating not only reactant (substrate) and product but also enzyme, coenzyme, and regulator, as well as showing in which species the reaction occurs.
Figure 4.4 Representative, simple examples of a substitution, an addition, and an elimination reaction showing the number, $n$, of reaction partners, and its change, $\Delta n$, during the reaction.
Figure 4.5 The reaction site of an elimination reaction with the bonds to be broken crossed and the bonds made in heavy lines (a) and the three mechanisms to achieve this reaction (b–d).
Figure 4.6 Illustrations of the charge distribution (a), the inductive effect (b), and the resonance effect (c), the polarizability effect (d), the steric effect (e), and the stereoelectronic effect (f).
Figure 4.7 FMO treatment of a Diels–Alder reaction: (a) reaction equation, (b) correlation diagram, and (c) orbital coefficients.
Figure 4.8 The dissociation of substituted benzoic acids (x = substituent) (a) and the hydrolysis of benzoic acid methyl esters (b).
Figure 4.9 Input of a chemical reaction by JSME also showing the atom‐to‐atom mapping.
Figure 4.10 Input of a metabolic reaction by the METIS editor.
Figure 4.11 The present pathway of reaction information from the producer to a consumer using a reaction database indicating the many steps in newly conceiving the information.
Figure 4.12 The pathway of reaction information by the use of an electronic laboratory notebook (ELN) into a reaction database indicating the seamless flow of information.
Figure 4.13 The search for oxidations of primary alcohols to carboxylic acids (a) will obtain reaction (b) as a hit, although this reaction is in reality a hydrolysis of an ester. (c) shows the correct specification of the query to obtain reactions involving the oxidation of alcohols to carboxylic acids.
Figure 4.14 The reaction of formaldehyde with hydrocyanic acid to give a cyanohydrin, and the matrix representation of this reaction.
Figure 4.15 The reaction scheme comprising the breaking and the making of two bonds and some examples of reactions following this scheme.
Figure 4.16 Different levels of specification for a bond participating in a reaction.
Figure 4.17 The reaction scheme breaking three and making three bonds and some of the reaction types that fall into this scheme.
Figure 4.18 A reaction scheme that changes the number of bonds at one atom and some specific examples.
Figure 4.19 Consecutive application of two reaction schemes to model the oxidation of thioethers to sulfoxides.
Figure 4.20 Classification of reactions by the CLASSIFY approach.
Figure 4.21 Different classification of substituents at the reaction site.
Figure 4.22 Reaction center of the dataset of 120 reactions (reacting bonds are indicated by dotted lines), and some reaction instances of this dataset.
Figure 4.23 Distribution of the dataset of 120 reactions in the Kohonen network. (a) The neurons were patterned on the basis of intellectually assigned reaction types. (b) In addition, empty neurons were patterned on the basis of their k‐nearest neighbors.
Figure 4.24 Some reactions that proceed under control of stereochemistry.
Figure 4.25 The treatment of the stereochemistry of an $\mathrm{S_N2}$ reaction by permutation group theory.
Figure 4.26 Part of the reaction network of the pentose phosphate pathway.
Chapter 5
Figure 5.1 Folding of a structure key.
Figure 5.2 Different usage of the (+)‐symbol in reaction representation.
Figure 5.3 Data model for reaction representation.
Figure 5.4 Raman spectra of acetaminophen [40].
Figure 5.5 Resolved fragment of a thin‐film spectrum of liquid N‐methylpyrrole using Lorentz functions.
Figure 5.6 Research pathways.
Figure 5.7 A simple file of the “dat” format.
Figure 5.8 An example of the JCAMP‐DX file format.
Figure 5.9 A simple PMML file.
Figure 5.10 Overview of the wavelet transform and multi‐resolution analysis scheme.
Figure 5.11 An initial dataset has to be split into two or three parts.
Chapter 6
Figure 6.1 Lidocaine 6.1, an important local anesthetic drug, and related structures (see Table 6.3).
Figure 6.2 Markush‐type query.
Figure 6.3 Keto–enol tautomerism of acetylacetone.
Figure 6.4 Representation of π‐complexes in chemistry (right formulas) and in databases (left formulas).
Figure 6.5 (+)‐Estrone.
Figure 6.6 Chemistry‐aware three‐tiered architecture.
Chapter 7
Figure 7.1 Different types of representation of the chemical graph of 4‐methylcyclohexene (note that the numbering does not conform to the IUPAC rules): (a) labeled graph (2D – arbitrary atom labeling); (b) connection table (CT); (c) linear notations; (d) topological indices; and (e) registry numbers.
Figure 7.2 Basic principle for efficient full structure search (unique and semi‐unique techniques).
Figure 7.3 Hashing function and fast address lookup.
Figure 7.4 Mappings between the query graph ($G_Q$) and the target graph ($G_T$). The notation (2, 1, 3, 4) means that atom 1 of the query subgraph $G_Q$ is mapped to atom 2 from the target graph $G_T$, query atom 2 to target atom 1, query atom 3 to target atom 3, and query atom 4 to target atom 4.
Figure 7.5 Search tree of mappings obtained by applying the backtracking algorithm to the pair of structures $G_Q$ and $G_T$ (see the graphs in Figure 7.4). Array ($M_1$, $M_2$, $M_3$, $M_4$) denotes the mapping 1 (of $G_T$) → $M_1$ (of $G_Q$), 2 → $M_2$, 3 → $M_3$, 4 → $M_4$.
Figure 7.6 Backtracking approach realized as depth‐first search algorithm. Dotted arrows trace the route used for traversing all mappings in the search tree. Each node in the tree corresponds to a mapping between $G_Q$ and $G_T$.
Figure 7.7 Screening process as part of DB substructure search: (1) calculate query keys, (2) filter the DB by comparing the query bits with the pre‐calculated bits, and (3) use the screening result list for the final mapping.
Figure 7.8 Generation of hashed fingerprints.
Figure 7.9 Classes (groups) of topologically equivalent atoms. The bold numbers are the atom labels; the numbers in parentheses are the equivalence classes.
Figure 7.10 Maximum common substructure (MCS) of two chemical structures.
Figure 7.11 SMARTS query examples.
Figure 7.12 The similarity search process.
Figure 7.13 The similarity concept applied in different chemical spaces.
Figure 7.14 Descriptor examples (0D, 1D, 2D, 3D) for the definition of similarity measures.
Figure 7.15 Flowchart of a pharmacophore mapping algorithm. An antihistamine pharmacophore [55] is searched against a target molecule. The pharmacophore is defined by three features: two aromatic rings (A) and an amino group (N).
Figure 7.16 Structures A and B with corresponding distance matrices (distances are given in nanometers).
Figure 7.17 Examples for DNA score matrices for the calculation of the “edit distance” similarity metric: (a) the simplest scoring matrix that corresponds to the Hamming distance and (b) the customized scoring matrix that distinguishes different types of DNA changes.
Figure 7.18 The stages of the DP algorithm: (a) forward filling of the scores S(i,j) and (b) backtracking to obtain the optimal alignment(s).
Figure 7.19 Flowchart of a heuristics‐based algorithm for fast sequence alignment.
Chapter 9
Figure 9.1 An indirect approach to the prediction of properties.
Chapter 10
Figure 10.1 Fragmentation of a molecule; the fragment code counts the number of occurrences of a fragment.
Figure 10.2 The derivation of fingerprints.
Figure 10.3 Screen of ChemoTyper: on the left is part of the Chemotype library; on the right is part of a dataset of 8193 organic molecules with the fitting chemotypes highlighted.
Figure 10.4 Screen of ChemoTyper: on the left are three chemotypes of the thalidomide skeleton with chemotypes differentiated by sigma charge (green) and total charge (blue); on the right‐hand side is part of a dataset that indicates hits for the two different chemotypes.
Figure 10.5 MNA descriptors for the atoms of phenol of levels 0, 1, and 2.
Figure 10.6 Self‐organizing map of bonds of a dataset of organic molecules. Black neurons (center), reactive bonds; gray neurons, non‐classified bonds; light gray neurons, non‐reactive bonds; cross, neurons that contain both reactive and non‐reactive bonds; white, empty neurons (no bonds mapped).
Figure 10.7 Hydrogen‐depleted graph of 2,2‐dimethylbutane (2).
Figure 10.8 Calculation of the ECFP descriptors starting from an atom.
Figure 10.9 Scheme of an electron diffraction experiment.
Figure 10.10 Procedure for encoding a structure with an RDF code.
Figure 10.11 Radial distribution function of an organic molecule indicating how the longer C-S bond expresses itself in the peak for 1–2 and the peak for 1–3 interactions.
Figure 10.12 Comparison of the RDF code for aromatic compounds with different substitution patterns.
Figure 10.13 Comparison of the radial distribution function of the chair, boat, and twist conformations of cyclohexane (hydrogen atoms are not considered).
Figure 10.14 (a) Example of a chiral molecule. (b) The atoms A, B, C, and D are those directly bonded to the chiral center. The neighborhood of atom A is the set of atoms whose distance (in number of bonds) to A is less than their distance to B, C, and D. (c) Example of a combination of four atoms (i, j, k, and l), each at a different ligand (A, B, C, or D) of the chiral center.
Figure 10.15 Graphical representation of f_CICC(u) versus u for (+) (R) and (−) (S)‐2‐Pyrrolidinemethanol sampled at 75 evenly distributed points between −0.03 e² Å⁻¹ and +0.03 e² Å⁻¹. Hydrogen atoms not bonded to chiral carbon atoms were not considered.
Figure 10.16 Mapping the coordinates of points on a molecular surface into a self‐organizing neural (Kohonen) network.
Figure 10.17 The surface of a torus, a plane without beginning and end, is stretched out into two dimensions by making two perpendicular cuts.
Figure 10.18 Self‐organizing map of 2‐chloro‐4‐hydroxy‐2‐methylbutane colored by the molecular electrostatic potential.
Figure 10.19 A self‐organizing map (SOM) is trained with points on the molecular surface of n‐butane (for visualization the surface of the four carbon atoms and their hydrogen atoms are colored differently). This SOM is now taken as a template, and points from the surface of 1‐propanol are sent through this SOM, indicating the two hydrogen atoms missing in 1‐propanol against n‐butane as empty (white) neurons.
Figure 10.20 An SOM trained with n‐butane is taken as a template for various 1‐substituted propane derivatives indicating the differences in the molecular surface by empty (white) neurons.
Figure 10.21 The box containing a steroid indicating the points for calculating the steric or electrostatic field in the CoMFA approach.
Figure 10.22 Extension of the QSAR method by descriptors not based on structure.
Figure 10.23 Structure descriptors contain information from physics (geometric resolution), chemistry (properties of atoms), and mathematics (mathematical transformation).
Figure 10.24 The comparison of a molecule with a human.
Chapter 12
Figure 12.1 QSAR modeling workflow. This workflow is implemented within the Chembench portal (https://chembench.mml.unc.edu/; see also Ref. [18]).
Figure 12.2 Data cycle in cheminformatics.
Figure 12.3 General workflow for comprehensive curation of chemogenomics datasets.
Figure 12.4 Example of duplicate retrieval using PubChem ID, SMILES, chemical names, InChI keys (InChI strings are not shown but are also different), and 2D similarity. Note that computing the 2D similarity as a Tanimoto coefficient using CDK descriptors yields T_c = 1 (implying structural duplicates) for the two curated compounds (no salts, standardized functional groups and aromatization), whereas all other representations fail to suggest that the two compounds are identical.
Chapter 13
Figure 13.1 Growth of GenBank from 1982 to 2016.
Figure 13.2 Sequencing cost per genome (3 billion base pairs) from 2001 to 2015. Please note the logarithmic scaling of the y‐axis.
Figure 13.3 Excerpt from GenBank entry AF123456 of a messenger RNA coding for a transcription factor in chicken (Gallus gallus). Dots […] denote deleted lines.
Figure 13.4 Excerpt from a pairwise sequence alignment of Abl (ABL1_HUMAN) and Src (SRC_HUMAN) tyrosine kinases illustrating the meaning of identity, similarity, and gaps.
Figure 13.5 Degree of sequence identity as a function of the point accepted mutations (PAM) during protein evolution.
Figure 13.6 BLOSUM62 scoring matrix. Amino acid exchanges that result in positive scores are highlighted in gray. The left column gives the three‐letter amino acid code in addition to the one‐letter code used for the labeling of the scoring matrix.
Figure 13.7 Example of a hit obtained from an NCBI BLAST sequence search against the UniProtKB/Swiss‐Prot database using the sequence of Src (SRC_HUMAN) as a query.
Figure 13.8 Steps of the progressive multiple sequence alignment.
Figure 13.9 (a) Section from a multiple sequence alignment of eight Src‐family tyrosine kinases (Src, Lck, Hck, Lyn, BLK, Fyn, Yes, Fgr) and the Abl1‐kinase. The sequence positions refer to SRC_HUMAN. A “*” marks a strictly conserved sequence position; “:” and “.” denote decreasing degrees of sequence similarity. The three strictly conserved residues, G277, G279, and K298, which are part of the ATP‐binding site, are marked in bold/italic. (b) Location of residues G277, G279, and K298 in the three‐dimensional structure of Src kinase (PDB entry: 4MXO [25]). The conserved residues are shown as a space‐filled presentation, and an inhibitor is shown as a sticks model. Figure prepared with the UCSF Chimera package [26].
Figure 13.10 Excerpt from the PROSITE entry PS00107 describing the protein kinase ATP‐binding region signature.
Figure 13.11 Overlay of an Src kinase homology model (black) with the Src kinase crystal structure (PDB: 4MXO [25]). The model was generated based on the crystal structure of Abl kinase (PDB: 4WA9 [38]) sharing 47% sequence identity to Src for the modeled region. Figure prepared with the UCSF Chimera package [26].
Edited by Thomas Engel and Johann Gasteiger
Editors
Dr. Thomas Engel
LMU München
Department of Chemistry
Butenandtstraße 5-13
81377 München
Germany
Prof. Dr. Johann Gasteiger
Universität Erlangen-Nürnberg
Computer-Chemie-Centrum
Nägelsbachstr. 25
91052 Erlangen
Germany
Cover Design
Dr. Christian R. Wick
University of Erlangen-Nürnberg
Institute for Theoretical Physics I
Nägelsbachstr. 49b (EAM)
91052 Erlangen
Germany
All books published by Wiley-VCH are carefully produced. Nevertheless, authors, editors, and publisher do not warrant the information contained in these books, including this book, to be free of errors. Readers are advised to keep in mind that statements, data, illustrations, procedural details or other items may inadvertently be inaccurate.
Library of Congress Card No.: applied for
British Library Cataloguing-in-Publication Data
A catalogue record for this book is available from the British Library.
Bibliographic information published by the Deutsche Nationalbibliothek
The Deutsche Nationalbibliothek lists this publication in the Deutsche Nationalbibliografie; detailed bibliographic data are available on the Internet at http://dnb.d-nb.de.
© 2018 Wiley-VCH Verlag GmbH & Co. KGaA, Boschstr. 12, 69469 Weinheim, Germany
All rights reserved (including those of translation into other languages). No part of this book may be reproduced in any form – by photoprinting, microfilm, or any other means – nor transmitted or translated into a machine language without written permission from the publishers. Registered names, trademarks, etc. used in this book, even when not specifically marked as such, are not to be considered unprotected by law.
Print ISBN: 978-3-527-33109-3
ePDF ISBN: 978-3-527-69377-1
ePub ISBN: 978-3-527-69378-8
Mobi ISBN: 978-3-527-69379-5
oBook ISBN: 978-3-527-81366-7
Cover Design Grafik-Design Schulz, Fußgönheim, Germany
Thomas Engel
To my family, especially Benedikt, and to Guido Kirsten, Markus Sitzmann, and Achim Zielesny for the very valuable feedback.
Johann Gasteiger
To all the friends, colleagues and coworkers that ventured with me into the exciting field of chemoinformatics.
And to my wife Uli for never complaining about my long working hours.
If you want to build a ship, don't drum up people to collect wood and don't assign them tasks and work, but rather teach them to long for the endless immensity of the sea.
Antoine de Saint-Exupéry
Chemistry began with magic. Who but a wizard could, with a puff of smoke, turn one thing into another? The alchemists believed that the ability to transform materials was a valuable skill, so valuable in fact that they devised complex descriptions and alchemical symbols, known only to them, to represent their secret methods. Information was encoded and hidden, suffused with allegorical and religious symbolism, slowing progress. Medicinal chemists today may be particularly interested in a legendary stone called a Bezoar, found in the bodies of animals (if you knew which animal to dissect), that had universal curative properties. I'm still looking. However, to plagiarize a recent Nobel Prize winner for literature, the times they were a changin'. Departing from the secretive “alchemist” approach, Berzelius (1779–1848) suggested compounds should be named from the elements that made them up, and Archibald Scott Couper (1831–1892) devised the “connections” between “atoms,” which gave rise to structural diagrams (1858). In 1787, the symbols created by Jean Henri Hassenfratz and Pierre Auguste Adet to complement the Méthode de Nomenclature Chimique were a revolutionary approach to chemical information. A jumbled, confused, and incorrect nomenclature was replaced by our modern‐day designations such as oxygen, hydrogen, and sodium chloride. The new chemistry of Lavoisier was becoming systematized. The “Age of Enlightenment” created a new philosophy of science where information, validated by experiment, could be tested by an expanding community of “scientists” (a term coined by William Whewell in 1833), placing data at the core of chemistry.
With the accumulation of knowledge, and a language to communicate chemistry, the stage was set for the creation of a new science of information in the domain of chemistry. Up stepped Friedrich Beilstein (1838–1906), who systematically collected chemical data on substances, reactions, and properties of chemical compounds in the Handbuch der organischen Chemie (Handbook of Organic Chemistry, published in 1881). The naming of compounds was a key feature that enabled the storage and retrieval of chemical information on a “grand” scale (1500 compounds). The indexing of chemical information meant chemistry could be reliably stored, common links between data established, and – most importantly – the information could be retrieved without loss. This drive for efficient indexing was the dominant feature of chemical information research for the next half century. As chemistry (and its many related disciplines) continued on an ever upward trajectory of innovation (and data collection), the paper trail required to go from perfectly reasonable questions like “how do I synthesize this compound?” to “has this compound been made before?” became rather complex and time‐consuming. I remember many happy hours spent in the library of the Wellcome Foundation trawling through the multitude of bookshelves of Chemical Abstracts to find one compound and, if lucky, a synthesis simple enough that I could perform with a yield better than my usual ten percent. Of course things got worse (or better if you were a librarian), and I recall an interesting RSC symposium in 1994 called “The Chemical Information Explosion: Chaos, Chemists and Computers.” We had clearly reached a point where someone had to invent chemoinformatics.
Although the “someone” is of course a worldwide community of scientists interested in chemical data, the term was coined by Frank Brown in 1998, and he defined it as
“The mixing of those information resources to transform data into information and information into knowledge for the intended purpose of making better decisions faster in the area of drug lead identification and optimization.”
The combination of multidisciplinarity, the reduction of data to knowledge, and the driving force of the pharmaceutical industry have been key features of the advance of chemoinformatics. The enabling technologies have been the availability of unprecedented amounts of chemical data (increasingly publicly available) and the continuous development of new algorithms, designed specifically for chemistry, to achieve the goal of turning information into knowledge. Of course Moore's law (an observation by Gordon Moore at Intel), wherein the density of computer components (and the computation power offered) doubles every 2 years, has underpinned the hardware necessary to keep pace with the data explosion. But perhaps some of the chaos remains, hence the popularity of software such as Babel (which converts many data formats to many data formats!). Some numbers here are interesting. If we recall that the first edition of Beilstein's Handbuch contained 1500 compounds, the Chemical Abstracts Service of the American Chemical Society reported in 2015 that they had registered their 100 millionth chemical substance. What is truly transformational (if you think about it) is that a new student, with a basic knowledge of chemistry, when asked to search for a single compound from the 100M registry, gets the correct result in a microsecond. Not only that, a host of measured and predicted chemical properties, synthesis strategies, available reagents, structurally similar compounds, and Internet links to a multitude of other diverse, information‐rich databases is available.
Clearly, chemoinformatics has come of age. In fact the term “chemoinformatics” has gained a certain elastic quality. The methodologies and data analysis tools developed for chemical information have evolved and extended to the data analytics of essentially any data that includes chemistry. Examples include the simulations of large systems of molecules such as proteins, machine learning (and the recent resurgence in artificial intelligence) to create predictive models of, for example, metabolism, ADME properties, and quantitative structure–activity relationships of drugs (including quantum chemistry, bioinformatics, and analytical chemistry), and the detection and analysis of drug binding sites. Although much of the early work in chemoinformatics has been applied to problems of the pharmaceutical industry, the subject has been embraced across the sciences wherever chemistry is required, for example, in agricultural and food research, cosmetics, and materials science.
But in an age when computers can do “magic” (“Any sufficiently advanced technology is indistinguishable from magic” – Arthur C. Clarke), it is tempting to return to where we were in the time of Berzelius and hide the technology behind an alchemical mask of symbols, for example, a simple interface that hides highly complex search and retrieval algorithms or a machine learning application to predict metabolism. The antidote to this is of course education. A firm grounding in the principles and practice of chemoinformatics provides students and expert practitioners alike with the knowledge of the underlying algorithms, how they are implemented, their availability, and of course the limitations of software for a given purpose, as well as future challenges for those with a keen interest in developing the field.
The best textbooks are naturally written from the viewpoint of those who are intimately connected to their subject. The members of the chemoinformatics group at the Computer‐Chemie‐Centrum (CCC) at the University of Erlangen‐Nuremberg have been pioneers in chemoinformatics for over thirty years and are recognized as both innovators and experts at applying these methods to a large variety of chemical problems. However, it is as educators that perhaps their greatest impact on the field may accrue over time. The new textbook Chemoinformatics – Basic Concepts and Methods builds on the successful first edition Chemoinformatics: A Textbook (published in 2003) and is again edited by Johann Gasteiger and Thomas Engel. This volume is complemented by an additional book, Applied Chemoinformatics – Achievements and Future Opportunities, which shows the many fields chemoinformatics is now applied to. Johann Gasteiger has had a distinguished career in chemistry and is well known for his seminal contributions to chemoinformatics. He was the recipient of the 1991 Gmelin‐Beilstein Medal of the German Chemical Society for Achievements in Computer Chemistry, the 2005 Mike Lynch Award of the Chemical Structure Association, the 2006 ACS Award for Computers in Chemical and Pharmaceutical Research for his outstanding achievements in research and education in the field of Chemoinformatics, and the 1997 Herman Skolnik Award of the Division of Chemical Information of the American Chemical Society. Thomas Engel is a specialist in chemoinformatics who studied chemistry and education at the University of Würzburg and spent a significant tenure at the CCC at the University of Erlangen‐Nürnberg, followed by the Chemical Computing Group AG in Cologne, and is presently at the Ludwig‐Maximilians‐Universität, Munich.
As editors, they have brought together a wide range of experts and topics, which will inform, educate, and motivate the reader to delve deeper into the subject of chemoinformatics. The new edition provides both the foundations for chemoinformatics and also a range of developing topics of active research, providing the reader with an introduction to the subject as well as advanced topics and future directions. This new edition is complemented by the Handbook of Chemoinformatics: From Data to Knowledge (by the same editors). It belongs on the bookshelf of students and experts alike, all who have an interest in the field of chemoinformatics, and especially those who see “magic” in chemistry.
Robert C. Glen
Professor of Molecular Sciences Informatics
Director of the Centre for Molecular Informatics
Department of Chemistry
University of Cambridge
Cambridge
United Kingdom
Ivan Bangov
Konstantin Preslavski Shumen University
Natural Sciences Faculty, Department of General Chemistry
115 Universitetska Street
9712 Shumen
Bulgaria
Tim Clark
Friedrich‐Alexander‐University Erlangen‐Nuremberg
Computer‐Chemie‐Centrum and Interdisciplinary Center for Molecular Materials
Nägelsbachstrasse 25
91052 Erlangen
Germany
Thomas Engel
Ludwig‐Maximilians‐Universität München
Department Chemie
Butenandtstraße 5‐13, Haus F
81377 Munich
Germany
Johann Gasteiger
Friedrich‐Alexander‐Universität
Erlangen‐Nürnberg
Computer‐Chemie‐Centrum
Nägelsbachstrasse 25
91052 Erlangen
Germany
Alexander Golbraikh
University of North Carolina
Eshelman School of Pharmacy, Laboratory for Molecular Modeling, Division of Chemical Biology and Medicinal Chemistry
Chapel Hill, NC, 27599
USA
Nikolay Kochev
University of Plovdiv
Faculty of Chemistry, Department of Analytical Chemistry and Computer Chemistry
24, Tzar Assen Street
4000 Plovdiv
Bulgaria
Adrian Kolodzik
University of Hamburg
ZBH – Center for Bioinformatics
Bundesstraße 32
20146 Hamburg
Germany
Harald Lanig
Friedrich‐Alexander‐University Erlangen‐Nuremberg
Central Institute for Scientific Computing (ZISC)
Martensstrasse 5a
91058 Erlangen
Germany
Giorgi Lekishvili
Tbilisi State Medical University
Faculty of Pharmacy, Department of Medical Chemistry
33, Vazha‐Pshavela avenue
0186 Tbilisi
Georgia
Valentin Monev
Bulgarian Academy of Sciences
Institute of Organic Chemistry
Acad. Georgi Bonchev Street,
Sofia 1113
Bulgaria
Matthias Rarey
University of Hamburg
ZBH – Center for Bioinformatics
Bundesstraße 32
20146 Hamburg
Germany
Oliver Sacher
Molecular Networks GmbH
Neumeyerstraße 28
90411 Nürnberg
Germany
Christof Schwab
Molecular Networks GmbH
Neumeyerstraße 28
90411 Nürnberg
Germany
Joao Aires de Sousa
Universidade Nova de Lisboa
Faculdade de Ciencias e Tecnologia, Departamento de Quimica
2829‐516 Caparica
Portugal
www: http://joao.airesdesousa.com
Heinrich Sticht
Friedrich‐Alexander‐University Erlangen‐Nuremberg
Institut für Biochemie, Emil‐Fischer Centrum
Fahrstraße 17
91054 Erlangen
Germany
Lothar Terfloth
Insilico Biotechnology AG
Meitnerstrasse 9
70563 Stuttgart
Germany
Jarosław Tomczak
Informatics Unlimited Ltd
8 Station Road, Histon
Cambridge CB24 9LQ
UK
Alexander Tropsha
University of North Carolina
Eshelman School of Pharmacy, Laboratory for Molecular Modeling, Division of Chemical Biology and Medicinal Chemistry
Chapel Hill, NC, 27599
USA
Kurt Varmuza
Vienna University of Technology, Vienna, Austria
Institute of Statistics and Mathematical Methods in Economics, and Institute of Chemical Engineering
Wiedner Hauptstrasse 7
1040 Vienna
Austria
www: www.lcm.tuwien.ac.at/vk/
David A. Winkler
La Trobe University
La Trobe Institute for Molecular Science
3082 Bundoora
Australia
and
Monash University
Monash Institute of Pharmaceutical Sciences
Parkville 3052
Australia
Engelbert Zass
ETH Zurich (retired)
Laboratory of Organic Chemistry
Zürich
Switzerland
Jure Zupan
National Institute of Chemistry
Laboratory of Chemometrics,
Hajdrihova 19
SI‐1000, Ljubljana
Slovenia
Thomas Engel and Johann Gasteiger
1Ludwig‐Maximilians‐University Munich, Department of Chemistry, Butenandtstraße 5‐13, 81377 Munich, Germany
2Computer‐Chemie‐Centrum, Universität Erlangen‐Nürnberg, Nägelsbachstr. 25, 91052 Erlangen, Germany
1.1 The Rationale for the Books
1.2 The Objectives of Chemoinformatics
1.3 Learning in Chemoinformatics
1.4 Outline of the Book
1.5 The Scope of the Book
1.6 Teaching Chemoinformatics
In 2003 we issued the book Chemoinformatics: A Textbook (J. Gasteiger, T. Engel, Editors, Wiley‐VCH Verlag GmbH, Weinheim, Germany, ISBN 13: 978‐3‐527‐30681‐7), which was well accepted and contributed to the development of the field of chemoinformatics. However, with the enormous progress in chemoinformatics, it is now time for an update. As we started out on this endeavor, it rapidly became clear that all the developments require presenting the field in more than a single book. We have therefore edited two volumes:
Chemoinformatics – Basic Concepts and Methods
Applied Chemoinformatics – Achievements and Future Opportunities [1]
In this first volume, “Basic Concepts and Methods,” the essential foundations and methods that comprise the technology of chemoinformatics are presented.
The second volume, “Achievements and Future Opportunities,” shows how this technology has been applied to a variety of fields such as chemistry, drug discovery, pharmacology, toxicology, agricultural, food, and materials science as well as process control. The links to the second volume are referenced in the present volume by “Applications Volume”