Chemoinformatics - Thomas Engel - E-Book

Description

This essential guide to the knowledge and tools in the field includes everything from the basic concepts to modern methods, while also forming a bridge to bioinformatics.
The textbook has a clear, didactic structure: it starts with the basics and the theory and then gives an overview of the methods. Exercises at the end of each section or chapter support learning. Software tools are explained in detail, so that students learn not only the necessary theoretical background but also how to use the different software packages available. The wide range of applications is presented in the companion book Applied Chemoinformatics - Achievements and Future Opportunities (ISBN 9783527342013). The book is written for Master's and PhD students in chemistry, biochemistry, and computer science, and it also provides an excellent introduction for other newcomers to the field.


Number of pages: 1079




Table of Contents

Cover

Dedication

Foreword

List of Contributors

Chapter 1: Introduction

1.1 The Rationale for the Books

1.2 The Objectives of Chemoinformatics

1.3 Learning in Chemoinformatics

1.4 Outline of the Book

1.5 The Scope of the Book

1.6 Teaching Chemoinformatics

References

Chapter 2: Principles of Molecular Representations

2.1 Introduction

2.2 Chemical Nomenclature

2.3 Chemical Notations

2.4 Mathematical Notations

2.5 Specific Types of Chemical Structures

2.6 Spatial Representation of Structures

2.7 Molecular Surfaces

Selected Reading

References

Exercises

Chapter 3: Computer Processing of Chemical Structure Information

3.1 Introduction

3.2 Standard File Formats for Chemical Structure Information

3.3 Input and Output of Chemical Structures

3.4 Processing Constitutional Information

3.5 Processing 3D Structure Information

3.6 Visualization of Molecular Models

3.7 Calculation of Molecular Surfaces

3.8 Chemoinformatic Toolkits and Workflow Environments

Selected Reading

References

Exercises

Chapter 4: Representation of Chemical Reactions

4.1 Introduction

4.2 Reaction Equation

4.3 Reaction Types

4.4 Reaction Center and Reaction Mechanisms

4.5 Chemical Reactivity

4.6 Learning from Reaction Information

4.7 Building of Reaction Databases

4.8 Reaction Center Perception

4.9 Reaction Classification

4.10 Stereochemistry of Reactions

4.11 Reaction Networks

Selected Reading

References

Exercises

Chapter 5: The Data

5.1 Introduction

5.2 Data Types

5.3 Storage and Manipulation of Data

5.4 Conclusions

Selected Reading

References

Exercises

Chapter 6: Databases and Data Sources in Chemistry

6.1 Introduction

6.2 Chemical Literature and Databases

6.3 Major Chemical Database Systems

6.4 Compound Databases

6.5 Databases with Properties of Compounds

6.6 Reaction Databases

6.7 Bibliographic and Citation Databases

6.8 Full‐Text Databases

6.9 Architecture of a Structure‐Searchable Database

Selected Reading

References

Exercises

Chapter 7: Searching Chemical Structures

7.1 Introduction

7.2 Full Structure Search

7.3 Substructure Search

7.4 Similarity Search

7.5 Three‐Dimensional Structure Search Methods

7.6 Sequence Searching in Protein and Nucleic Acid Databases

7.7 Summary

Selected Reading

References

Exercise

Chapter 8: Computational Chemistry

8.1 Empirical Approaches to the Calculation of Properties

8.2 Molecular Mechanics

8.3 Molecular Dynamics

8.4 Quantum Mechanics

Chapter 9: Modeling and Prediction of Properties (QSPR/QSAR)

Chapter 10: Calculation of Structure Descriptors

10.1 Introduction

10.2 Structure Descriptors for Classification and Similarity Searching

10.3 Structure Descriptors for Quantitative Modeling

10.4 Descriptors That Are Not Calculated from the Chemical Structure

10.5 Summary and Outlook

Selected Reading

References

Exercises

Chapter 11: Data Analysis and Data Handling (QSPR/QSAR)

11.1 Methods for Multivariate Data Analysis

11.2 Artificial Neural Networks (ANNs)

11.3 Deep and Shallow Neural Networks

Chapter 12: QSAR/QSPR Revisited

12.1 Best Practices of QSAR Modeling

12.2 The Data Science of QSAR Modeling

Selected Reading

References

Exercises

Chapter 13: Bioinformatics

13.1 Introduction

13.2 Sequence Databases

13.3 Searching Sequence Databases

13.4 Characterization of Protein Families

13.5 Homology Modeling

Selected Reading

References

Exercises

Chapter 14: Future Directions

14.1 Access to Chemical Information

14.2 Representation of Chemical Compounds

14.3 Representation of Chemical Reactions

14.4 Learning from Chemical Information

14.5 Training in Chemoinformatics

Answers Section

Chapter 2

Chapter 3

Chapter 4

Chapter 5

Chapter 6

Chapter 7

Chapter 8

Chapter 10

Chapter 11

Chapter 12

Chapter 13

Index

Substance Index

End User License Agreement

List of Tables

Chapter 2

Table 2.1 Different types of molecular representations of phenylalanine (without stereochemistry).

Chapter 3

Table 3.1 Important file formats for the representation and exchange of chemical structure information and their respective possibilities for representing or coding the constitution, the configuration, that is, the stereochemistry, and the 3D structure or conformation.

Table 3.2 Comparison of the SMILES and the SLN syntax (see also Section 3.2.3).

Table 3.3 Format details of the ATOM record (cf. Figure 3.18).

Table 3.4 Sample STAR‐based dictionaries.

Table 3.5 Collection of selected molecule editors and molecule viewers (not complete).

Table 3.6 Freely available toolkits and workflow environments in chemo‐ and bioinformatics.

Chapter 4

Table 4.1 Substituent constants for various groups to be used in Eqs. 4.1 and 4.2. $\sigma_m$ for substituents in meta‐position and $\sigma_p$ for substituents in para‐position.

Chapter 5

Table 5.1 Selected artifacts in chemistry that require complex data types.

Table 5.2 Example of efficiency of an ab initio geometry optimization in different coordinate sets.

Table 5.3 Probability ($P$) of collision avoidance for 32‐ and 64‐bit hash codes.

Chapter 6

Table 6.1 CAS databases in SciFinder [1, 4].

Table 6.2 Searches for references to the analgesic paracetamol and to substituted naphthalenes in SciFinder and Reaxys.

Table 6.3 Hits for lidocaine 6.1 in SciFinder and Reaxys with different queries.

Table 6.4 Search results (hits) for query structures of Figure 6.3 in SciFinder and Reaxys.

Table 6.5 Special patent sequence databases available from STN at cost [6].

Table 6.6 Free Web portals for sequence and related databases.

Table 6.7 Searching for compounds containing the elements Li, O, P, F, and Fe.

Table 6.8 3D structure databases of inorganic, organic, and biomolecules.

Table 6.9 Selected properties of (+)‐estrone from diverse databases.

Table 6.10 Major databases with thermodynamic data available at cost.

Table 6.11 NIST databases with thermodynamic data.

Table 6.12 Major spectral databases and database systems available at cost.

Table 6.13 Databases with special spectroscopic data at no charge.

Table 6.14 Single‐step reactions in Reaxys and SciFinder CASREACT (May 2017).

Table 6.15 Selected bibliographic databases at cost [6].

Table 6.16 Selected patent databases at cost.

Chapter 7

Table 7.1 List of some popular similarity coefficients.

Table 7.2 Distance distributions for structures A and B (see Figure 7.16).

Chapter 8

Table 8.1 Experimental mean molecular polarizabilities and values calculated by Eq. 8.2.

Table 8.2 Parameters for the dependence of orbital electronegativity on charge (Eq. 8.4).

Chapter 10

Table 10.1 Classification of molecular descriptors by data type.

Table 10.2 Classification of fragment descriptors and fingerprints.

Table 10.3 Classification of descriptors by the dimensionality of their molecular representation.

Table 10.4 Invariance properties of a selection of molecular descriptors.

Chapter 11

Table 11.1 Basic results obtained from the application of a binary classifier to a test set with $n$ objects, $n_1$ actually belonging to class 1 and $n_2$ to class 2; $n_1 + n_2 = n$. $n_{CA}$ is the number of objects belonging to class $C$ (1 or 2) and being assigned to class $A$ (1 or 2).


Table 11.2 Comparison of prediction errors of the Kaggle test sets using DNN [3] and BNN [22].

List of Illustrations

Chapter 1

Figure 1.1 Inductive learning.

Chapter 2

Figure 2.1 Hierarchy levels of molecular representation providing different information contents.

Figure 2.2 Classification of different types of molecular representations with examples given in the bottom line of the boxes.

Figure 2.3 The IUPAC name of caffeine: 1,3,7‐trimethylpurine‐2,6‐dione.

Figure 2.4 Scientific notation for the information on chemical elements, for example, oxygen (O).

Figure 2.5 Identical empirical formula of two compounds: (a) phenylalanine and (b) 2‐(3‐furanyl)‐1‐pyrrolidinecarboxaldehyde and their subdivided formula representations.

Figure 2.6 Different graph theoretical representations (b) to (d) of the structure diagram of 5‐ethyl‐2‐methylheptane (a). In graph theory, only the connections are important, not the length of the edges or the angles between them. (Usually, only heavy atom types are represented, not hydrogen atoms.)

Figure 2.7 (a) The simple graph $G_1(V_1, E_1)$ has four vertices $V_1 = \{v_1, v_2, v_3, v_4\}$ and three edges $E_1 = \{e_1, e_2, e_3\}$. The latter are all incident in vertex $v_2$. Thus, the degree of vertex $v_2$ is three, that is, $d(v_2) = 3$; (b) The multigraph $G_2$ with $E_2 = \{e_1, \ldots, e_5\}$ on the right‐hand side has an additional edge between $v_2$ and $v_3$.

Figure 2.8 Vertex coloring (a) and edge coloring (b) of a graph.

Figure 2.9 The example (a) of a simple undirected graph $G_1$ consists of the set of vertices $V_1 = \{v_1, v_2, v_3, v_4\}$ and the set of edges $E_1 = \{\{v_1, v_2\}, \{v_2, v_3\}, \{v_2, v_4\}\}$. The example (b) of a simple directed graph $G_3$ consists of the set of vertices $V_3 = \{v_1, v_2, v_3, v_4\}$ and the set of edges $E_3 = \{(v_1, v_2), (v_2, v_4), (v_3, v_2)\}$.

Figure 2.10 Graph examples of a walk (a) and a path (b) starting in $v_1$.

Figure 2.11 Examples of a subgraph and a supergraph. Graph $G_2$ with the vertex set $V_2 = \{1, 2, 3, 4, 5, 6\}$ and graph $G_3$ with the vertex set $V_3 = \{1, 2, 6, 7, 8\}$ are subgraphs of graph $G_1$ with the vertex set $V_1 = \{1, 2, 3, 4, 5, 6, 7, 8\}$. Accordingly, graph $G_1$ is a supergraph of the graphs $G_2$ and $G_3$.

Figure 2.12 An adjacency matrix of ethanal (a) is simplified stepwise by setting all matrix elements of the lower‐left corner to zero, retaining the upper‐right corner of the matrix and (b), finally, omitting the matrix elements for hydrogen atoms.

Figure 2.13 An incidence matrix $A_I$ of ethanal. In the non‐square matrix, the atoms are listed in columns and the bonds in rows. Only an undirected graph is presented here.

Figure 2.14 Distance matrices of ethanal with (a) geometric distances in angstrom and (b) topological distances. The matrix elements of (b) result from counting the number of bonds along the shortest path between the chosen atoms (Figure 2.12).

Figure 2.15 A bond matrix $A_B$ of ethanol (a) and the redundant matrix (b).

Figure 2.16 A BE matrix of ethanal (a) and the redundant matrix (b).

Figure 2.17 The BE matrix of ethanal enables the determination of the number of valence electrons on each atom (the sum of each row/column).

Figure 2.18 The cross sum of the BE matrix of Figure 2.17 validates the octet rule.
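The bookkeeping described in the captions of Figures 2.17 and 2.18 can be illustrated with a minimal Python sketch. The matrix below is an assumption: it is written out by hand for ethanal in the atom order C1, C2, O, H1 to H4, with bond orders off the diagonal and free valence electrons on the diagonal. The function names are illustrative and are not taken from the book.

```python
# Hand-written BE matrix of ethanal (CH3-CHO); the atom order is an assumption.
ATOMS = ["C1", "C2", "O", "H1", "H2", "H3", "H4"]
BE = [
    [0, 1, 0, 1, 1, 1, 0],  # C1: bonded to C2 and three H atoms
    [1, 0, 2, 0, 0, 0, 1],  # C2: bonded to C1, double bond to O, one H
    [0, 2, 4, 0, 0, 0, 0],  # O: double bond to C2, four free electrons
    [1, 0, 0, 0, 0, 0, 0],  # H1 on C1
    [1, 0, 0, 0, 0, 0, 0],  # H2 on C1
    [1, 0, 0, 0, 0, 0, 0],  # H3 on C1
    [0, 1, 0, 0, 0, 0, 0],  # H4 on C2
]


def valence_electrons(be):
    """Row sum: valence electrons contributed by each atom."""
    return [sum(row) for row in be]


def shell_electrons(be):
    """Cross sum (row + column - diagonal): electrons in each atom's valence shell."""
    n = len(be)
    return [
        sum(be[i]) + sum(be[j][i] for j in range(n)) - be[i][i]
        for i in range(n)
    ]


for atom, v, s in zip(ATOMS, valence_electrons(BE), shell_electrons(BE)):
    print(f"{atom}: valence electrons = {v}, shell electrons = {s}")
```

In this sketch carbon and oxygen reach a cross sum of eight and hydrogen a cross sum of two, which is the check the caption of Figure 2.18 refers to.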

Figure 2.19 The structure diagram of ethanal represented as a connection table consisting of two lists, a list of atoms and a list of bonds. The atom list contains atom indices that have been arbitrarily assigned to each atom and are used as atom references in the bond list.

Figure 2.20 Connection table of ethanal.

Figure 2.21 More compact (nonredundant) connection table of ethanal. Only non‐hydrogen atoms are considered; bonds with the lower indices are given only once (compare with Figure 2.20).

Figure 2.22 Workflow for the classification of isomeric structures of organic molecules.

Figure 2.23 Scheme of tautomerism. X, Y, and Z can be a carbon, nitrogen, phosphorus, oxygen, or sulfur atom. G represents an electrofuge or nucleofuge group, in most cases a hydrogen atom.

Figure 2.24 Keto–enol (a), imine–enamine (b), and lactam–lactim (c) tautomerism.

Figure 2.25 Thioacetic acid shows tautomerism (from left to right at the top), mesomerism (from left to right at the bottom), and ionization (top to bottom on the left‐ and right‐hand sides).

Figure 2.26 The substituted phenyl derivative is a typical Markush structure (a). Here, a number of compounds are described in one structure diagram by different types of variability: R1 substituent variation, R2 homology variation, R3 position variation, and n frequency variation. Phenylalanine (b) is one of these structures when R1 is COOH and R2 and R3 are H. Another example that is covered by the Markush notation is (c) 3,4‐dichloro‐N‐ethyl‐α‐phenyl‐benzenebutanamine.

Figure 2.27 Single bonds (a) are stored as σ‐systems and double bonds (b) as π‐systems.

Figure 2.28 Lewis representations of (a) benzene and (b) 1,3‐butadiene as examples of resonance in aromatic systems and in conjugated double bonds.

Figure 2.29 (a) Two zwitterionic resonance structures are required for a valid VB representation of the nitro group. (b) Improper representation of a nitro group with a pentavalent nitrogen atom. (c) The hybrid diagram needs no charged resonance structures. The π‐system contains four electrons on three atoms.

Figure 2.30 Lone pairs, radicals, and orbitals without electrons are represented by a π‐system with two, one, or zero electrons on the corresponding atom, respectively.

Figure 2.31 (a) Singlet carbene and (b) triplet carbene.

Figure 2.32 Enol ethers (a) have two different ionization potentials (IP) (b), depending on the orbitals concerned.

Figure 2.33 Example of an electron‐deficient bond in diborane with a B-H-B bond.

Figure 2.34 The multi‐haptic bonding in organometallic complexes (e.g., ferrocene) cannot be expressed adequately by a connection table.

Figure 2.35 From the constitution to the configuration and then to the conformation (3D structure) of a molecule with the example of phenylalanine (from left to right).

Figure 2.36 Workflow for the classification of stereoisomerism.

Figure 2.37 Butenedioic acid with (a) E configuration = fumaric acid, and (b) Z configuration = maleic acid.

Figure 2.38 Different possibilities for displaying stereochemistry: (a) the wedge/hashed projection is the usual way of depicting stereochemistry, (b) this representation is ambiguous and should therefore not be used, and (c) the Fischer projection is used particularly for carbohydrates.

Figure 2.39 The four ligands attached to a chiral center are ranked according to the CIP rules ($\mathrm{OH > COOH > CH_3 > H}$); then the enantiomers of 2‐hydroxy‐propionic acid (lactic acid) can be assigned with the R (right‐hand) or S (left‐hand) configuration.

Figure 2.40 Examples of chiral molecules with different types of stereogenic units. (a) Lactic acid, (b) tetra‐substituted adamantane, (c) binaphthyl compound, and (d) substituted paracyclophane.

Figure 2.41 Cartesian coordinate system and Cartesian coordinates of chloromethane.

Figure 2.42 Internal coordinates of 1,2‐dichloroethane: bond lengths $r_1$ and $r_2$, bond angle $\alpha$, and torsion angle $\tau$.

Figure 2.43 Graphical representation of the electrostatic potential on molecular surfaces of phenylalanine: (a) dots, (b) mesh or chicken wire, (c) solid, and (d) semitransparent.

Chapter 3

Figure 3.1 The standard InChI of phenylalanine.

Figure 3.2 The standard InChI and InChIKey of phenylalanine including stereochemistry.

Figure 3.3 XYZ file representing ethanal.

Figure 3.4 Z‐matrix of 1,2‐dichloroethane.

Figure 3.5 All MDL file formats based on the Molfile format.

Figure 3.6 Structure of (2R,3E)‐4‐chlorobut‐3‐en‐2‐ol.

Figure 3.7 Molfile representing (2R,3E)‐4‐chlorobut‐3‐en‐2‐ol shown in Figure 3.6 in the V2000 and V3000 format. The line numbers listed in the gray columns and the descriptions are not part of a Molfile. The fundamental structure of a Molfile is indicated in the central column.

Figure 3.8 Detailed description of the counts line of both Molfile versions (V2000 and V3000) from the molecule shown in Figure 3.6.

Figure 3.9 Detailed description of the atom block of both Molfile versions (V2000 and V3000) from the molecule shown in Figure 3.6.

Figure 3.10 Detailed description of the bond block of both Molfile versions (V2000 and V3000) from the molecule shown in Figure 3.6.

Figure 3.11 Structure of sulfamide (sulfuric diamide).

Figure 3.12 Sample SDfile for sulfamide (sulfuric diamide) including both structural information and associated data, for example, physicochemical properties.

Figure 3.13 Visualization of the 3D molecular structure of α‐conotoxin PNI1 polypeptide (PDB ID: 1pen) including secondary information (as cartoons) and water molecules.

Figure 3.14 Title section of the analyzed PDB file. (The gray rows specify the position of the 80 possible characters and the gray column indicates the line number of the file.)

Figure 3.15 Remark section of the analyzed PDB file.

Figure 3.16 Primary structure and heterogen sections of the analyzed PDB file.

Figure 3.17 Crystallographic and coordinate transformation sections of the analyzed PDB file.

Figure 3.18 Atomic coordinate data section of the analyzed PDB file.

Figure 3.19 Last lines of the analyzed PDB file.

Figure 3.20 CML file representing (2R,3E)‐4‐chlorobut‐3‐en‐2‐ol shown in Figure 3.6. The line numbers are listed in the gray column and are not part of a CML file.

Figure 3.21 The JSME as an example of a molecular editor.

Figure 3.22 The Jmol stand‐alone application as an example of a molecular viewer.

Figure 3.23 Different approaches to storing and retrieving tautomeric forms of molecules – here shown with acetylacetone. In model (a) only one tautomeric form represents the molecule, whereas in approach (b) the set of all tautomeric forms is used. Representation (c) indicates that only one generalized tautomeric form describes all the tautomers of (b).

Figure 3.24 Standard (a) and nonstandard InChIs (b) and the InChIKeys of acetylacetone to describe mobile hydrogens in tautomeric forms. A specific tautomeric structure (c) is described by the fixed hydrogen atom layer, where one hydrogen is fixed at atoms 3 and 6. (/b4‐3 describes in addition a double bond in the stereolayer.)

Figure 3.25 Six different possibilities for labeling the atoms in hypochlorous acid.

Figure 3.26 During the iteration, the EC value of each atom is calculated by summing the EC values of the directly connected non‐hydrogen atoms of the former sphere (relaxation process).

Figure 3.27 The EC values of the atoms of phenylalanine (simplified to nodes for all non‐hydrogen atoms, ignoring stereochemistry and bond types) are calculated by considering the EC values of the neighboring atoms. After each relaxation process (a–d), $k$, the number of equivalence classes (different EC values), is determined.
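The relaxation process described in the captions of Figures 3.26 and 3.27 can be sketched as follows. This is a minimal illustration under stated assumptions, not the book's implementation: the initial EC value is taken to be the number of non‐hydrogen neighbors, the iteration stops as soon as the number of classes $k$ no longer increases, and the example graph is the carbon chain of n‐pentane rather than phenylalanine. The function name `morgan_ec` is hypothetical.

```python
def morgan_ec(adjacency):
    """Iterate extended connectivity (EC) values on a hydrogen-free graph.

    adjacency maps each atom index to the list of its neighbor indices.
    Returns the EC values of the last iteration that increased k, and k itself.
    """
    # Assumed starting point: EC value = number of non-hydrogen neighbors.
    ec = {atom: len(neighbors) for atom, neighbors in adjacency.items()}
    k = len(set(ec.values()))
    while True:
        # Relaxation: sum the EC values of the directly connected atoms.
        new_ec = {
            atom: sum(ec[n] for n in neighbors)
            for atom, neighbors in adjacency.items()
        }
        new_k = len(set(new_ec.values()))
        if new_k <= k:  # no further discrimination: keep the previous values
            return ec, k
        ec, k = new_ec, new_k


# Carbon skeleton of n-pentane: 0-1-2-3-4
pentane = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
ec_values, k = morgan_ec(pentane)
print(ec_values, k)
```

For the pentane chain, one relaxation step raises the number of classes from two to three, and the next step lowers it again, so the values from the first step are kept. The central carbon ends up with the highest EC value and would be the starting atom for canonical numbering.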

Figure 3.28 Canonicalization starts at the atom with the highest EC value (in the example of Figure 3.27 the atom with the value of 16), which obtains the number 1.

Figure 3.29 The molecular graph consists of three cycles (c1, c2, c3), which can be represented by the incidence vectors of their edges. The set of three cycles is dependent, since the symmetric difference (= exclusive or) of two of the incidence vectors results in the third one. For example, the symmetric difference of c2 and c3 results in c1. Each combination of two of these cycles forms a cycle basis. Only the combination of c1 and c2 forms a minimum cycle basis.
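The dependence between the three cycles can be reproduced with Python sets, whose `^` operator is the symmetric difference. The graph used here is a hypothetical stand‐in (two fused three‐membered rings sharing the edge b–c), not the molecular graph of Figure 3.29; each cycle is stored as the set of its edges, which is equivalent to an incidence vector.

```python
def edge(a, b):
    """Undirected edge as an order-independent key."""
    return frozenset((a, b))


# Hypothetical bicyclic graph with vertices a, b, c, d.
c1 = {edge("a", "b"), edge("b", "c"), edge("c", "a")}                    # small ring 1
c2 = {edge("b", "c"), edge("c", "d"), edge("d", "b")}                    # small ring 2
c3 = {edge("a", "b"), edge("b", "d"), edge("d", "c"), edge("c", "a")}    # envelope ring


def basis_length(*cycles):
    """Total number of edges in a candidate cycle basis."""
    return sum(len(c) for c in cycles)


print(c2 ^ c3 == c1)   # symmetric difference of c2 and c3 gives c1
print(basis_length(c1, c2), basis_length(c1, c3), basis_length(c2, c3))
```

Any two of the three cycles generate the third, so each pair is a cycle basis, but only the pair of small rings has the minimum total length.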

Figure 3.30 Two simple cases for the reduction of a graph by removing a vertex $x$ according to Hanser's algorithm.

Figure 3.31 Stepwise reduction of the number of vertices according to Hanser's algorithm. The molecular graph contains 3 rings.

Figure 3.32 Cubane, (a) spatial and (b) planar representation, contains six rings (c) of size four. Every combination of five of these 4‐cycles forms a valid SSSR. The SMARTS pattern [R3] matches four of the eight carbon atoms.

Figure 3.33 Comparison of interchangeability classes (ICs), relevant cycles (RCs), and the smallest set of smallest rings (SSSR). (a) The molecular graph contains atoms A, B, and C. (b) The molecular graph consists of six rings. (c) The membership of atoms A, B, and C to an SSSR and the set of RCs and ICs are shown in table C. All of the six rings belong to the set of RCs. Each combination of two small rings and one of the large rings forms a valid SSSR. Ring 1 and ring 2 each form a separate IC. The third IC is formed by the union of the rings 3, 4, 5, and 6.

Figure 3.34 (a) The cyclophane‐like molecular graph contains three URFs that are identical to the ICs. (b) Cubane contains six URFs in contrast to only one IC. (c) The number of URFs and ICs are identical for atoms A and B. Atom C is a member of three URFs but only a member of one IC.

Figure 3.35 Stereocenters can be identified by permutation groups. Thus, the structure is separated into a skeleton and its ligands. Both are numbered independently, by the indices of the skeleton (1–4) and those of ligands (A–D) and are described in a permutation matrix.

Figure 3.36 The ordered list of 24 priority sequences of the ligands A–D around a tetrahedral stereocenter. According to the number of permutations, the list can be separated into two classes (odd number of permutations on the right‐hand side or even number of permutations on the left‐hand side).

Figure 3.37 The permutation matrices of two structures that differ through rotation by 120°. The permutation matrix of the rotated isomer can be brought into correspondence with the permutation matrix of the reference isomer by two interchanges of two ligands (transpositions).
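The split of the 24 ligand sequences into an even and an odd class (Figure 3.36), and the statement that a 120° rotation corresponds to two transpositions (Figure 3.37), can be checked with a short sketch. It assumes that parity is determined by counting inversions relative to the reference order A, B, C, D, which gives the same parity as counting transpositions; the function name is illustrative.

```python
from itertools import permutations

REFERENCE = ("A", "B", "C", "D")


def transposition_parity(sequence, reference=REFERENCE):
    """Return 0 for an even and 1 for an odd number of ligand interchanges."""
    positions = [reference.index(ligand) for ligand in sequence]
    inversions = sum(
        1
        for i in range(len(positions))
        for j in range(i + 1, len(positions))
        if positions[i] > positions[j]
    )
    return inversions % 2


even = [p for p in permutations(REFERENCE) if transposition_parity(p) == 0]
odd = [p for p in permutations(REFERENCE) if transposition_parity(p) == 1]
print(len(even), len(odd))

# Rotation by 120 degrees about the bond to ligand A cycles B, C, D.
rotated = ("A", "C", "D", "B")
# A single interchange of two ligands yields the other class.
swapped = ("B", "A", "C", "D")
print(transposition_parity(rotated), transposition_parity(swapped))
```

Sequences in the same parity class as the reference describe the same stereoisomer; a single interchange of two ligands moves the sequence into the other class.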

Figure 3.38 Automatic 3D structure generation.

Figure 3.39 Classification of automatic 3D structure generators.

Figure 3.40 General workflow to generate a 3D model by CORINA.

Figure 3.41 An unsymmetrical superphane and its superstructure.

Figure 3.42 The principle of longest pathways for acyclic fragments and molecules.

Figure 3.43 Elimination of nonbonded interactions (close contacts).

Figure 3.44 Superimposition of a set of conformations of 2R‐benzylsuccinate with the benzene ring fixed on the right (rotatable bonds are highlighted on the left).

Figure 3.45 Dependence of the potential energy curve of n‐butane on the torsion angle $\tau$ between carbon atoms C2 and C3.

Figure 3.46 General workflow scheme as utilized by the program ROTATE for the generation of multiple conformations.

Figure 3.47 Derivation of the $RMS_{XYZ}$ deviation of two conformations.

Figure 3.48 Ring templates for a saturated six‐membered ring and a six‐membered ring with one double bond as implemented in the ring conformation table of the 3D structure generator CORINA and ROTATE.

Figure 3.49 Derivation of the Torsion Angle Library.

Figure 3.50 Derivation of a symbolic potential energy function from the torsion angle distribution of a torsion fragment.

Figure 3.51 The first dynamic molecular display of small molecules by Levinthal was driven by the “crystal ball.”

Figure 3.52 The most common molecular graphic representations of phenylalanine such as (a) wire frame, (b) capped sticks, (c) ball and stick, (d) space filling, and (e) inorganic molecules ($\mathrm{YBa_2Cu_3O_{7-x}}$) with balls and sticks (left) and the same molecule in polyhedral representation (right).

Figure 3.53 The most common molecular graphic representations of biological molecules (lysozyme): (a) balls and sticks; (b) backbone or ‐stick; and (c) cartoon (including the cylinder, ribbon, and tube model).

Figure 3.54 Different molecular surfaces of lysozyme (5LA5.pdb): (a) van der Waals Surface, (b) Connolly surface, and (c) solvent‐accessible surface.

Figure 3.55 Dependence of the van der Waals energy on the distance between two non‐connected atom nuclei. With decreasing atomic distance, the interaction between the two atoms becomes attractive, going through a minimum at the van der Waals distance. Then, upon a further decrease in the distance, a rapid increase in repulsion energy is observed.

Figure 3.56 Cross section of the 3D model of formic acid (HCOOH). The van der Waals radius of each atom of the molecule is taken and by fusing the spheres the van der Waals surface is obtained.

Figure 3.57 The Connolly surface is determined by moving a probe sphere (usually a water molecule) over the van der Waals surface.

Figure 3.58 The center of the rolling probe sphere defines the solvent‐accessible surface during movement of the probe over the van der Waals surface. Thus, the molecular surface is expanded by the radius of the solvent molecule.

Figure 3.59 Different isovalue‐based surfaces of phenylalanine: (a) isoelectronic density, (b) molecular orbitals (HOMO–LUMO), (c) isopotential surface, and (d) isosurface of electron cryo‐microscopic volume of the ribosome of Escherichia coli.

Chapter 4

Figure 4.1 A typical equation of a chemical reaction.

Figure 4.2 Two reaction equations showing two completely different uses for the (+) symbol: (a) giving a fully balanced single reaction and (b) combining two parallel reactions into a single equation that is also not stoichiometrically balanced.

Figure 4.3 The scheme of a biochemical reaction indicating not only reactant (substrate) and product but also enzyme, coenzyme, and regulator, as well as showing in which species the reaction occurs.

Figure 4.4 Representative, simple examples of a substitution, an addition, and an elimination reaction showing the number, $n$, of reaction partners, and its change, $\Delta n$, during the reaction.

Figure 4.5 The reaction site of an elimination reaction with the bonds to be broken crossed and the bonds made in heavy lines (a) and the three mechanisms to achieve this reaction (b–d).

Figure 4.6 Illustrations of the charge distribution (a), the inductive effect (b), and the resonance effect (c), the polarizability effect (d), the steric effect (e), and the stereoelectronic effect (f).

Figure 4.7 FMO treatment of a Diels–Alder reaction: (a) reaction equation, (b) correlation diagram, and (c) orbital coefficients.

Figure 4.8 The dissociation of substituted benzoic acids (x = substituent) (a) and the hydrolysis of benzoic acid methyl esters (b).

Figure 4.9 Input of a chemical reaction by JSME also showing the atom‐to‐atom mapping.

Figure 4.10 Input of a metabolic reaction by the METIS editor.

Figure 4.11 The present pathway of reaction information from the producer to a consumer using a reaction database indicating the many steps in newly conceiving the information.

Figure 4.12 The pathway of reaction information by the use of an electronic laboratory notebook (ELN) into a reaction database indicating the seamless flow of information.

Figure 4.13 The search for oxidations of primary alcohols to carboxylic acids (a) will obtain reaction (b) as a hit, although this reaction is in reality a hydrolysis of an ester. (c) shows the correct specification of the query to obtain reactions involving the oxidation of alcohols to carboxylic acids.

Figure 4.14 The reaction of formaldehyde with hydrocyanic acid to give a cyanohydrin, and the matrix representation of this reaction.

Figure 4.15 The reaction scheme comprising the breaking and the making of two bonds and some examples of reactions following this scheme.

Figure 4.16 Different levels of specification for a bond participating in a reaction.

Figure 4.17 The reaction scheme breaking three and making three bonds and some of the reaction types that fall into this scheme.

Figure 4.18 A reaction scheme that changes the number of bonds at one atom and some specific examples.

Figure 4.19 Consecutive application of two reaction schemes to model the oxidation of thioethers to sulfoxides.

Figure 4.20 Classification of reactions by the CLASSIFY approach.

Figure 4.21 Different classification of substituents at the reaction site.

Figure 4.22 Reaction center of the dataset of 120 reactions (reacting bonds are indicated by dotted lines), and some reaction instances of this dataset.

Figure 4.23 Distribution of the dataset of 120 reactions in the Kohonen network. (a) The neurons were patterned on the basis of intellectually assigned reaction types. (b) In addition, empty neurons were patterned on the basis of their k‐nearest neighbors.

Figure 4.24 Some reactions that proceed under control of stereochemistry.

Figure 4.25 The treatment of the stereochemistry of an $\mathrm{S_N2}$ reaction by permutation group theory.

Figure 4.26 Part of the reaction network of the pentose phosphate pathway.

Chapter 5

Figure 5.1 Folding of a structure key.

Figure 5.2 Different usage of the (+)‐symbol in reaction representation.

Figure 5.3 Data model for reaction representation.

Figure 5.4 Raman spectra of acetaminophen [40].

Figure 5.5 Resolved fragment of a thin‐film spectrum of liquid N‐methylpyrrole using Lorentz functions.

Figure 5.6 Research pathways.

Figure 5.7 A simple file of the “dat” format.

Figure 5.8 An example of the JCAMP‐DX file format.

Figure 5.9 A simple PMML file.

Figure 5.10 Overview of the wavelet transform and multi‐resolution analysis scheme.

Figure 5.11 An initial dataset has to be split into two or three parts.

Chapter 6

Figure 6.1 Lidocaine 6.1, an important local anesthetic drug, and related structures (see Table 6.3).

Figure 6.2 Markush‐type query.

Figure 6.3 Keto–enol tautomerism of acetylacetone.

Figure 6.4 Representation of π‐complexes in chemistry (right formulas) and in databases (left formulas).

Figure 6.5 (+)‐Estrone.

Figure 6.6 Chemistry‐aware three‐tiered architecture.

Chapter 7

Figure 7.1 Different types of representation of the chemical graph of 4‐methylcyclohexene (note that the numbering does not conform to the IUPAC rules): (a) labeled graph (2D – arbitrary atom labeling); (b) connection table (CT); (c) linear notations; (d) topological indices; and (e) registry numbers.

Figure 7.2 Basic principle for efficient full structure search (unique and semi‐unique techniques).

Figure 7.3 Hashing function and fast address lookup.

Figure 7.4 Mappings between the query graph ($G_Q$) and the target graph ($G_T$). The notation (2, 1, 3, 4) means that atom 1 of the query subgraph $G_Q$ is mapped to atom 2 from the target graph $G_T$, query atom 2 to target atom 1, query atom 3 to target atom 3, and query atom 4 to target atom 4.

Figure 7.5 Search tree of mappings obtained by applying the backtracking algorithm to the pair of structures $G_Q$ and $G_T$ (see the graphs in Figure 7.4). Array ($M_1$, $M_2$, $M_3$, $M_4$) denotes the mapping 1 (of $G_T$) → $M_1$ (of $G_Q$), 2 → $M_2$, 3 → $M_3$, 4 → $M_4$.

Figure 7.6 Backtracking approach realized as depth‐first search algorithm. Dotted arrows trace the route used for traversing all mappings in the search tree. Each node in the tree corresponds to a mapping between $G_Q$ and $G_T$.

Figure 7.7 Screening process as part of DB substructure search: (1) calculate the query keys, (2) filter the DB by comparing the query bits with the pre‐calculated bits, and (3) use the screening result list for the final mapping.

Figure 7.8 Generation of hashed fingerprints.

Figure 7.9 Classes (groups) of topologically equivalent atoms. The bold numbers are the atom labels; the numbers in parentheses are the equivalence classes.

Figure 7.10 Maximum common substructure (MCS) of two chemical structures.

Figure 7.11 SMARTS queries examples.

Figure 7.12 The similarity search process.

Figure 7.13 The similarity concept applied in different chemical spaces.

Figure 7.14 Descriptor examples (0D, 1D, 2D, 3D) for the definition of similarity measures.

Figure 7.15 Flowchart of a pharmacophore mapping algorithm. An antihistamine pharmacophore [55] is searched against a target molecule. The pharmacophore is defined by three features: two aromatic rings (A) and an amino group (N).

Figure 7.16 Structures A and B with corresponding distance matrices (distances are given in nanometers).

Figure 7.17 Examples for DNA score matrices for the calculation of the “edit distance” similarity metric: (a) the simplest scoring matrix that corresponds to the Hamming distance and (b) the customized scoring matrix that distinguishes different types of DNA changes.

Figure 7.18 The stages of the DP algorithm: (a) forward filling of the scores $S(i,j)$ and (b) backtracking to obtain the optimal alignment(s).

Figure 7.19 Flowchart of a heuristics‐based algorithm for fast sequence alignment.

Chapter 9

Figure 9.1 An indirect approach to the prediction of properties.

Chapter 10

Figure 10.1 Fragmentation of a molecule; the fragment code counts the number of occurrences of a fragment.

Figure 10.2 The derivation of fingerprints.

Figure 10.3 Screen of ChemoTyper: on the left is part of the Chemotype library; on the right is part of a dataset of 8193 organic molecules with the fitting chemotypes highlighted.

Figure 10.4 Screen of ChemoTyper: on the left are three chemotypes of the thalidomide skeleton with chemotypes differentiated by sigma charge (green) and total charge (blue); on the right‐hand side is part of a dataset that indicates hits for the two different chemotypes.

Figure 10.5 MNA descriptors for the atoms of phenol of levels 0, 1, and 2.

Figure 10.6 Self‐organizing map of bonds of a dataset of organic molecules. Black neurons (center), reactive bonds; gray neurons, non‐classified bonds; light gray neurons, non‐reactive bonds; crossed neurons, both reactive and non‐reactive bonds; white, empty neurons (no bonds mapped).

Figure 10.7 Hydrogen‐depleted graph of 2,2‐dimethylbutane (2).

Figure 10.8 Calculation of the ECFP descriptors starting from an atom.

Figure 10.9 Scheme of an electron diffraction experiment.

Figure 10.10 Procedure for encoding a structure with an RDF code.

Figure 10.11 Radial distribution function of an organic molecule indicating how the longer C-S bond expresses itself in the peak for 1–2 and the peak for 1–3 interactions.

Figure 10.12 Comparison of the RDF code for aromatic compounds with different substitution patterns.

Figure 10.13 Comparison of the radial distribution function of the chair, boat, and twist conformations of cyclohexane (hydrogen atoms are not considered).

Figure 10.14 (a) Example of a chiral molecule. (b) The atoms A, B, C, and D are those directly bonded to the chiral center. The neighborhood of atom A is the set of atoms whose distance (in number of bonds) to A is less than their distance to B, C, and D. (c) Example of a combination of four atoms ($i$, $j$, $k$, and $l$), each at a different ligand (A, B, C, or D) of the chiral center.

Figure 10.15 Graphical representation of $f_{\mathrm{CICC}}(u)$ versus $u$ for (+)‐(R)‐ and (−)‐(S)‐2‐pyrrolidinemethanol sampled at 75 evenly distributed points between −0.03 $e^{2}\,\text{Å}^{-1}$ and +0.03 $e^{2}\,\text{Å}^{-1}$. Hydrogen atoms not bonded to chiral carbon atoms were not considered.

Figure 10.16 Mapping the coordinates of points on a molecular surface into a self‐organizing neural (Kohonen) network.

Figure 10.17 The surface of a torus, a plane without beginning and end, is stretched out into two dimensions by making two perpendicular cuts.

Figure 10.18 Self‐organizing map of 2‐chloro‐4‐hydroxy‐2‐methylbutane colored by the molecular electrostatic potential.

Figure 10.19 A self‐organizing map (SOM) is trained with points on the molecular surface of n‐butane (for visualization the surface of the four carbon atoms and their hydrogen atoms are colored differently). This SOM is now taken as a template, and points from the surface of 1‐propanol are sent through this SOM, indicating the two hydrogen atoms missing in 1‐propanol against n‐butane as empty (white) neurons.

Figure 10.20 An SOM trained with n‐butane is taken as a template for various 1‐substituted propane derivatives indicating the differences in the molecular surface by empty (white) neurons.

Figure 10.21 The box containing a steroid indicating the points for calculating the steric or electrostatic field in the CoMFA approach.

Figure 10.22 Extension of the QSAR method by descriptors not based on structure.

Figure 10.23 Structure descriptors contain information from physics (geometric resolution), chemistry (properties of atoms), and mathematics (mathematical transformation).

Figure 10.24 The comparison of a molecule with a human.

Chapter 12

Figure 12.1 QSAR modeling workflow. This workflow is implemented within the Chembench portal (https://chembench.mml.unc.edu/; see also Ref. [18]).

Figure 12.2 Data cycle in cheminformatics.

Figure 12.3 General workflow for comprehensive curation of chemogenomics datasets.

Figure 12.4 Example of duplicate retrieval using PubChem ID, SMILES, chemical names, InChI keys (InChI strings are not shown but are also different), and 2D similarity. Note that computing the 2D similarity as a Tanimoto coefficient using CDK descriptors yields $T_c = 1$ (implying structural duplicates) for the two curated compounds (no salts, standardized functional groups and aromatization), whereas all other representations fail to suggest that the two compounds are identical.

Chapter 13

Figure 13.1 Growth of GenBank from 1982 to 2016.

Figure 13.2 Sequencing cost per genome (3 billion base pairs) from 2001 to 2015. Please note the logarithmic scaling of the y‐axis.

Figure 13.3 Excerpt from GenBank entry AF123456 of a messenger RNA coding for a transcription factor in chicken (Gallus gallus). Dots […] denote deleted lines.

Figure 13.4 Excerpt from a pairwise sequence alignment of Abl (ABL1_HUMAN) and Src (SRC_HUMAN) tyrosine kinases illustrating the meaning of identity, similarity, and gaps.

Figure 13.5 Degree of sequence identity as a function of the point accepted mutations (PAM) during protein evolution.

Figure 13.6 BLOSUM62 scoring matrix. Amino acid exchanges that result in positive scores are highlighted in gray. The left column gives the three‐letter amino acid code in addition to the one‐letter code used for the labeling of the scoring matrix.

Figure 13.7 Example of a hit obtained from an NCBI BLAST sequence search against the UniProtKB/Swiss‐Prot database using the sequence of Src (SRC_HUMAN) as a query.

Figure 13.8 Steps of the progressive multiple sequence alignment.

Figure 13.9 (a) Section from a multiple sequence alignment of eight Src‐family tyrosine kinases (Src, Lck, Hck, Lyn, BLK, Fyn, Yes, Fgr) and the Abl1‐kinase. The sequence positions refer to SRC_HUMAN. A “*” marks a strictly conserved sequence position; “:” and “.” denote decreasing degrees of sequence similarity. The three strictly conserved residues, G277, G279, and K298, which are part of the ATP‐binding site, are marked in bold/italic. (b) Location of residues G277, G279, and K298 in the three‐dimensional structure of Src kinase (PDB entry: 4MXO [25]). The conserved residues are shown as a space‐filled presentation, and an inhibitor is shown as a sticks model. Figure prepared with the UCSF Chimera package [26].

Figure 13.10 Excerpt from the PROSITE entry PS00107 describing the protein kinase ATP‐binding region signature.

Figure 13.11 Overlay of an Src kinase homology model (black) with the Src kinase crystal structure (PDB: 4MXO [25]). The model was generated based on the crystal structure of Abl kinase (PDB: 4WA9 [38]) sharing 47% sequence identity to Src for the modeled region. Figure prepared with the UCSF Chimera package [26].


Chemoinformatics

Basic Concepts and Methods

Edited by Thomas Engel and Johann Gasteiger

Copyright

Editors

 

Dr. Thomas Engel

LMU München

Department of Chemistry

Butenandtstraße 5-13

81377 München

Germany

 

Prof. Dr. Johann Gasteiger

Universität Erlangen-Nürnberg

Computer-Chemie-Centrum

Nägelsbachstr. 25

91052 Erlangen

Germany

 

Cover Design

Dr. Christian R. Wick

University of Erlangen-Nürnberg

Institute for Theoretical Physics I

Nägelsbachstr. 49b (EAM)

91052 Erlangen

Germany

 

All books published by Wiley-VCH are carefully produced. Nevertheless, authors, editors, and publisher do not warrant the information contained in these books, including this book, to be free of errors. Readers are advised to keep in mind that statements, data, illustrations, procedural details or other items may inadvertently be inaccurate.

Library of Congress Card No.: applied for

British Library Cataloguing-in-Publication Data

A catalogue record for this book is available from the British Library.

Bibliographic information published by the Deutsche Nationalbibliothek

The Deutsche Nationalbibliothek lists this publication in the Deutsche Nationalbibliografie; detailed bibliographic data are available on the Internet at http://dnb.d-nb.de.

© 2018 Wiley-VCH Verlag GmbH & Co. KGaA, Boschstr. 12, 69469 Weinheim, Germany

All rights reserved (including those of translation into other languages). No part of this book may be reproduced in any form – by photoprinting, microfilm, or any other means – nor transmitted or translated into a machine language without written permission from the publishers. Registered names, trademarks, etc. used in this book, even when not specifically marked as such, are not to be considered unprotected by law.

 

Print ISBN: 978-3-527-33109-3

ePDF ISBN: 978-3-527-69377-1

ePub ISBN: 978-3-527-69378-8

Mobi ISBN: 978-3-527-69379-5

oBook ISBN: 978-3-527-81366-7

Cover Design Grafik-Design Schulz, Fußgönheim, Germany

Dedication

Thomas Engel

To my family, especially Benedikt, and to Guido Kirsten, Markus Sitzmann, and Achim Zielesny for their very valuable feedback.

Johann Gasteiger

To all the friends, colleagues and coworkers that ventured with me into the exciting field of chemoinformatics.

And to my wife Uli for never complaining about my long working hours.

If you want to build a ship, don't drum up people to collect wood and don't assign them tasks and work, but rather teach them to long for the endless immensity of the sea.

Antoine de Saint-Exupéry

Foreword

Chemistry began with magic. Who but a wizard could, with a puff of smoke, turn one thing into another? The alchemists believed that the ability to transform materials was a valuable skill, so valuable in fact that they devised complex descriptions and alchemical symbols, known only to them, to represent their secret methods. Information was encoded and hidden, suffused with allegorical and religious symbolism, slowing progress. Medicinal chemists today may be particularly interested in a legendary stone called a Bezoar, found in the bodies of animals (if you knew which animal to dissect), that had universal curative properties. I'm still looking. However, to plagiarize a recent Nobel Prize winner for literature, the times they were a changin'. Departing from the secretive “alchemist” approach, Berzelius (1779–1848) suggested compounds should be named from the elements that made them up, and Archibald Scott Couper (1831–1892) devised the “connections” between “atoms,” which gave rise to structural diagrams (1858). In 1787, the symbols created by Jean Henri Hassenfratz and Pierre Auguste Adet to complement the Méthode de Nomenclature Chimique were a revolutionary approach to chemical information. A jumbled, confused, and incorrect nomenclature was replaced by our modern‐day designations such as oxygen, hydrogen, and sodium chloride. The new chemistry of Lavoisier was becoming systematized. The “Age of Enlightenment” created a new philosophy of science where information, validated by experiment, could be tested by an expanding community of “scientists” (a term coined by William Whewell in 1833), placing data at the core of chemistry.

With the accumulation of knowledge, and a language to communicate chemistry, the stage was set for the creation of a new science of information in the domain of chemistry. Up stepped Friedrich Beilstein (1838–1906), who systematically collected chemical data on substances, reactions, and properties of chemical compounds in the Handbuch der organischen Chemie (Handbook of Organic Chemistry, published in 1881). The naming of compounds was a key feature that enabled the storage and retrieval of chemical information on a “grand” scale (1500 compounds). The indexing of chemical information meant chemistry could be reliably stored, common links between data established, and – most importantly – the information could be retrieved without loss. This drive for efficient indexing was the dominant feature of chemical information research for the next half century. As chemistry (and its many related disciplines) continued on an ever upward trajectory of innovation (and data collection), the paper trail required to go from perfectly reasonable questions like “how do I synthesize this compound?” to “has this compound been made before?” became rather complex and time‐consuming. I remember many happy hours spent in the library of the Wellcome Foundation trawling through the multitude of bookshelves of Chemical Abstracts to find one compound and, if lucky, a synthesis simple enough that I could perform with a yield better than my usual ten percent. Of course things got worse (or better if you were a librarian), and I recall an interesting RSC symposium in 1994 called “The Chemical Information Explosion: Chaos, Chemists and Computers.” We had clearly reached a point where someone had to invent chemoinformatics.

Although the “someone” is of course a worldwide community of scientists interested in chemical data, the term was coined by Frank Brown in 1998, and he defined it as

“The mixing of those information resources to transform data into information and information into knowledge for the intended purpose of making better decisions faster in the area of drug lead identification and optimization.”

The combination of multidisciplinarity, the reduction of data to knowledge, and the driving force of the pharmaceutical industry have been key features of the advance of chemoinformatics. The enabling technologies have been the availability of unprecedented amounts of chemical data (increasingly publicly available) and the continuous development of new algorithms, designed specifically for chemistry, to achieve the goal of turning information into knowledge. Of course Moore's law (an observation by Gordon Moore at Intel), wherein the density of computer components (and the computation power offered) doubles every 2 years, has underpinned the hardware necessary to keep pace with the data explosion. But perhaps some of the chaos remains, hence the popularity of software such as Babel (which converts many data formats to many data formats!). Some numbers here are interesting. If we recall that the first edition of Beilstein's Handbuch contained 1500 compounds, the Chemical Abstracts Service of the American Chemical Society reported in 2015 that they had registered their 100 millionth chemical substance. What is truly transformational (if you think about it) is that a new student, with a basic knowledge of chemistry, when asked to search for a single compound from the 100M registry, gets the correct result in a microsecond. Not only that, a host of measured and predicted chemical properties, synthesis strategies, available reagents, structurally similar compounds and Internet links to a multitude of other diverse, information‐rich databases is available.

Clearly, chemoinformatics has come of age. In fact the term “chemoinformatics” has gained a certain elastic quality. The methodologies and data analysis tools developed for chemical information have evolved and extended to the data analytics of essentially any data that includes chemistry. Examples include the simulations of large systems of molecules such as proteins, machine learning (and the recent resurgence in artificial intelligence) to create predictive models, for example, metabolism, ADME properties, and quantitative structure–activity relationships of drugs (including quantum chemistry, bioinformatics, and analytical chemistry) and the detection and analysis of drug binding sites. Although much of the early work in chemoinformatics has been applied to problems of the pharmaceutical industry, the subject has been embraced across the sciences wherever chemistry is required, for example, in agricultural and food research, cosmetics, and materials science.

But in an age when computers can do “magic” (“Any sufficiently advanced technology is indistinguishable from magic” – Arthur C. Clarke), it is tempting to return to where we were in the time of Berzelius and hide the technology behind an alchemical mask of symbols, for example, a simple interface that hides highly complex search and retrieval algorithms or a machine learning application to predict metabolism. The antidote to this is of course education. A firm grounding in the principles and practice of chemoinformatics provides students and expert practitioners alike with the knowledge of the underlying algorithms, how they are implemented, their availability and of course limitations of software for a given purpose as well as future challenges for those with a keen interest in developing the field.

The best textbooks are naturally written from the viewpoint of those who are intimately connected to their subject. The members of the chemoinformatics group at the Computer‐Chemie‐Centrum (CCC) at the University of Erlangen‐Nuremberg have been pioneers in chemoinformatics for over thirty years and are recognized as both innovators and experts at applying these methods to a large variety of chemical problems. However, it is as educators that perhaps their greatest impact on the field may accrue over time. The new textbook Chemoinformatics – Basic Concepts and Methods builds on the successful first edition Chemoinformatics: A Textbook (published in 2003) and is again edited by Johann Gasteiger and Thomas Engel. This volume is complemented by an additional book Applied Chemoinformatics – Achievements and Future Opportunities, which shows the many fields chemoinformatics is now applied to. Johann Gasteiger has had a distinguished career in chemistry and is well known for his seminal contributions to chemoinformatics. He was the recipient of the 1991 Gmelin‐Beilstein Medal of the German Chemical Society for Achievements in Computer Chemistry, the 2005 Mike Lynch Award of the Chemical Structure Association, the 2006 ACS Award for Computers in Chemical and Pharmaceutical Research for his outstanding achievements in research and education in the field of Chemoinformatics, and the 1997 Herman Skolnik Award of the Division of Chemical Information of the American Chemical Society. Thomas Engel is a specialist in chemoinformatics who studied chemistry and education at the University of Würzburg and spent a significant tenure at the CCC at the University of Erlangen‐Nürnberg, followed by the Chemical Computing Group AG in Cologne and is presently at the Ludwig‐Maximilians‐Universität, Munich.

As editors, they have brought together a wide range of experts and topics, which will inform, educate, and motivate the reader to delve deeper into the subject of chemoinformatics. The new edition provides both the foundations for chemoinformatics and also a range of developing topics of active research, providing the reader with an introduction to the subject as well as advanced topics and future directions. This new edition is complemented by the Handbook of Chemoinformatics: From Data to Knowledge (by the same editors). It belongs on the bookshelf of students and experts alike, all who have an interest in the field of chemoinformatics, and especially those who see “magic” in chemistry.

Robert C. Glen

Professor of Molecular Sciences Informatics

Director of the Centre for Molecular Informatics

Department of Chemistry

University of Cambridge

Cambridge

United Kingdom

List of Contributors

Ivan Bangov

Konstantin Preslavski Shumen University

Natural Sciences Faculty, Department of General Chemistry

115 Universitetska Street

9712 Shumen

Bulgaria

Email: [email protected]

Tim Clark

Friedrich‐Alexander‐University Erlangen‐Nuremberg

Computer‐Chemie‐Centrum and Interdisciplinary Center for Molecular Materials

Nägelsbachstrasse 25

91052 Erlangen

Germany

Thomas Engel

Ludwig‐Maximilians‐Universität München

Department Chemie

Butenandtstraße 5‐13, Haus F

81377 Munich

Germany

Johann Gasteiger

Friedrich‐Alexander‐Universität

Erlangen‐Nürnberg

Computer‐Chemie‐Centrum

Nägelsbachstrasse 25

91052 Erlangen

Germany

Alexander Golbraikh

University of North Carolina

Eshelman School of Pharmacy, Laboratory for Molecular Modeling, Division of Chemical Biology and Medicinal Chemistry

Chapel Hill, NC, 27599

USA

Nikolay Kochev

University of Plovdiv

Faculty of Chemistry, Department of Analytical Chemistry and Computer Chemistry

24, Tzar Assen Street

4000 Plovdiv

Bulgaria

Email: [email protected]

Adrian Kolodzik

University of Hamburg

ZBH – Center for Bioinformatics

Bundesstraße 32

20146 Hamburg

Germany

Email: [email protected]

Harald Lanig

Friedrich‐Alexander‐University Erlangen‐Nuremberg

Central Institute for Scientific Computing (ZISC)

Martensstrasse 5a

91058 Erlangen

Germany

Email: [email protected]

Giorgi Lekishvili

Tbilisi State Medical University

Faculty of Pharmacy, Department of Medical Chemistry

33, Vazha‐Pshavela avenue

0186 Tbilisi

Georgia

Valentin Monev

Bulgarian Academy of Sciences

Institute of Organic Chemistry

Acad. Georgi Bonchev Street,

Sofia 1113

Bulgaria

Email: [email protected]

Matthias Rarey

University of Hamburg

ZBH – Center for Bioinformatics

Bundesstraße 32

20146 Hamburg

Germany

Email: [email protected]

Oliver Sacher

Molecular Networks GmbH

Neumeyerstraße 28

90411 Nürnberg

Germany

Christof Schwab

Molecular Networks GmbH

Neumeyerstraße 28

90411 Nürnberg

Germany

Joao Aires de Sousa

Universidade Nova de Lisboa

Faculdade de Ciencias e Tecnologia, Departamento de Quimica

2829‐516 Caparica

Portugal

www: http://joao.airesdesousa.com

Heinrich Sticht

Friedrich‐Alexander‐University Erlangen‐Nuremberg

Institut für Biochemie, Emil‐Fischer Centrum

Fahrstraße 17

91054 Erlangen

Germany

Lothar Terfloth

Insilico Biotechnology AG

Meitnerstrasse 9

70563 Stuttgart

Germany

Jarosław Tomczak

Informatics Unlimited Ltd

8 Station Road, Histon

Cambridge CB24 9LQ

UK

Alexander Tropsha

University of North Carolina

Eshelman School of Pharmacy, Laboratory for Molecular Modeling, Division of Chemical Biology and Medicinal Chemistry

Chapel Hill, NC, 27599

USA

E-mail: [email protected].

Kurt Varmuza

Vienna University of Technology, Vienna, Austria

Institute of Statistics and Mathematical Methods in Economics, and Institute of Chemical Engineering

Wiedner Hauptstrasse 7

1040 Vienna

Austria

[email protected]; www.lcm.tuwien.ac.at/vk/

David A. Winkler

Latrobe University

Latrobe Institute for Molecular Science

3082 Bundoora

Australia

and

Monash University

Monash Institute of Pharmaceutical Sciences

Parkville 3052

Australia

[email protected]

Engelbert Zass

ETH Zurich (retired)

Laboratory of Organic Chemistry

Zürich

Switzerland

Jure Zupan

National Institute of Chemistry

Laboratory of Chemometrics,

Hajdrihova 19

SI‐1000, Ljubljana

Slovenia

E-mail: [email protected]

1 Introduction

Thomas Engel and Johann Gasteiger

1 Ludwig‐Maximilians‐University Munich, Department of Chemistry, Butenandtstraße 5‐13, 81377 Munich, Germany

2 Computer‐Chemie‐Centrum, Universität Erlangen‐Nürnberg, Nägelsbachstr. 25, 91052 Erlangen, Germany

Outline

1.1 The Rationale for the Books

1.2 The Objectives of Chemoinformatics

1.3 Learning in Chemoinformatics

1.4 Outline of the Book

1.5 The Scope of the Book

1.6 Teaching Chemoinformatics

1.1 The Rationale for the Books

In 2003 we issued the book Chemoinformatics: A Textbook (J. Gasteiger, T. Engel, Editors, Wiley‐VCH Verlag GmbH, Weinheim, Germany, ISBN 13: 978‐3‐527‐30681‐7), which was well accepted and contributed to the development of the field of chemoinformatics. However, with the enormous progress in chemoinformatics, it is now time for an update. As we started out on this endeavor, it became rapidly clear that all the developments require presenting the field in more than a single book. We have therefore edited two volumes:

Chemoinformatics – Basic Concepts and Methods

Applied Chemoinformatics – Achievements and Future Opportunities [1]

In this first volume, “Basic Concepts and Methods,” the essential foundations and methods that comprise the technology of chemoinformatics are presented.

The second volume, “Applied Chemoinformatics – Achievements and Future Opportunities,” shows how this technology has been applied to a variety of fields such as chemistry, drug discovery, pharmacology, toxicology, agricultural, food, and materials science as well as process control. The links to the second volume are referenced in the present volume by “Applications Volume”