Chemometrics

Matthias Otto

Description

Serving as both an introductory text and a practical guide, this book covers all relevant chemometric topics, from basic statistics through modeling and databases to the latest developments in artificial intelligence and neural networks.




Table of Contents

Cover

Table of Contents

Title Page

Copyright

Preface

Overview of the Book

New to this Edition

Acknowledgements

List of Abbreviations

1 What is Chemometrics?

1.1 The Computer‐Based Laboratory

1.2 Statistics and Data Interpretation

1.3 Computer‐Based Information Systems/Artificial Intelligence

General Reading

Questions and Problems

2 Basic Statistics

2.1 Descriptive Statistics

2.2 Statistical Tests

2.3 Analysis of Variance

General Reading

Questions and Problems

3 Signal Processing and Time Series Analysis

3.1 Signal Processing

3.2 Time Series Analysis

General Reading

Questions and Problems

4 Optimization and Experimental Design

4.1 Systematic Optimization

4.2 Objective Functions and Factors

4.3 Experimental Design and Response Surface Methods

4.4 Sequential Optimization: Simplex Method

General Reading

Questions and Problems

5 Pattern Recognition and Classification

5.1 Preprocessing of Data

5.2 Unsupervised Methods

5.3 Supervised Methods

General Reading

Questions and Problems

6 Modeling

6.1 Univariate Linear Regression

6.2 Multiple Linear Regression

6.3 Nonlinear Methods

General Reading

Questions and Problems

7 Analytical Databases

7.1 Representation of Analytical Information

7.2 Library Search

7.3 Simulation of Spectra

General Reading

Questions and Problems

8 Knowledge Processing and Soft Computing

8.1 Artificial Intelligence and Expert Systems

8.2 Neural Networks

8.3 Fuzzy Theory

8.4 Genetic Algorithms and Other Global Search Strategies

General Reading

Questions and Problems

9 Quality Assurance and Good Laboratory Practice

9.1 Validation and Quality Control

9.2 Accreditation and Good Laboratory Practice

General Reading

Questions and Problems

Appendix

Statistical Distributions

Digital Filters

Experimental Designs

Matrix Algebra

Software

Index

End User License Agreement

List of Tables

Chapter 1

Table 1.1 Relationship between binary and decimal numbers.

Table 1.2 Truth values for logical connectives of predicates

p

and

q

based ...

Table 1.3 Chemometric methods for data evaluation and interpretation.

Chapter 2

Table 2.1 Spectrophotometric measurements (absorbances) of a sample solutio...

Table 2.2 Frequency distribution of measurements from Table 2.1.

Table 2.3 Descriptive statistics for the spectrophotometric measurements in...

Table 2.4 Examples of error propagation for different dependencies of the a...

Table 2.5 Important areas according to the error integral.

Table 2.6 Quantile of the one‐sided Student's t distribution for three sign...

Table 2.7 Overview of hypothesis testing based on the Student's t‐test.

Table 2.8 F‐quantiles for α = 0.05...

Table 2.9 Relationship between testing hypotheses and the errors of the fir...

Table 2.10 Critical values for the Q‐test at the 1% risk level.

Table 2.11 Critical values for Grubbs' test at two significance levels.

Table 2.12 Data scheme for a one‐way analysis of variance.

Table 2.13 Potassium concentration in mg·l⁻¹ from triple determinatio...

Table 2.14 Results of a one‐way analysis of variance for the potassium dete...

Table 2.15 Data design for a two‐way analysis of variance.

Table 2.16 Analytical determinations of manganese (percentage mass) in stee...

Table 2.17 ANOVA for two‐way analysis of variance of the data in Table 2.16...

Table 2.18 Schematic representation of factors and the response for a multi...

Chapter 3

Table 3.1 Coefficients of the Savitzky–Golay filter for smoothing based on ...

Table 3.2 Kalman filter algorithm.

Table 3.3 Individual values for the time series in Figure 3.20.

Chapter 4

Table 4.1 Selectivity measures in analytics.

Table 4.2 Aggregation of performance characteristics to an objective functi...

Table 4.3 Full factorial design at two levels, 2³ design.

Table 4.4 Factorial design for four factors at two levels.

Table 4.5 Factorial design for five factors at two levels.

Table 4.6 Plackett and Burman fractional factorial design for estimating th...

Table 4.7 A 2 × 2 Latin square design in different representations.

Table 4.8 4 × 4 Latin square.

Table 4.9 A two‐level supersaturated design for 12 runs and 16 factors.

Table 4.10 A 2³ screening design and factor levels for estimation of the fa...

Table 4.11 Central composite design for three factors consisting of a full‐...

Table 4.12 Box–Behnken design for three factors.

Table 4.13 Number of factors and experimental points for the Box–Behnken de...

Table 4.14 Factor levels and Box–Behnken design for studying the ceruloplas...

Table 4.15 Full factorial 2³ design arranged in two blocks.

Table 4.16 Choice of initial simplexes for up to nine variables coded in th...

Chapter 5

Table 5.1 Elemental contents of hair of different subjects in parts per mil...

Table 5.2 Eigenvalues and explained variances for the hair data in Table 5....

Table 5.3 Absorbances in milliabsorbance units for the spectrochromatogram ...

Table 5.4 Concentrations of calcium and phosphate in six blood serum sample...

Table 5.5 Parameters for hierarchical cluster analysis by means of the gene...

Table 5.6 Iodine content of hair samples from different patients.

Table 5.7 Elemental content of an unknown hair sample in parts per million....

Table 5.8 Classification vector for the hair samples in Table 5.1.

Chapter 6

Table 6.1 Computation of the sums of squares (SS) for a complete ANOVA in l...

Table 6.2 xy data.

Table 6.3 ANOVA table for linear regression of the data in Table 6.2.

Table 6.4 ANOVA table for OLS calibration of NIR spectra in the analysis of...

Table 6.5 pH dependence of the retention (k′) of anthranilic acid in HPLC....

Table 6.6 Parameter for the model of retention (k′) of anthranilic acid in ...

Table 6.7 Error for modeling the pH dependence of the retention (k′) of ant...

Chapter 7

Table 7.1 Some analytical databases.

Table 7.2 Important information about a JCAMP/DX exchange file for an IR sp...

Table 7.3 Coding of spectra in the molecular database of the Organic Instit...

Table 7.4 Symbols for the HOSE substructure code ordered by priority.

Table 7.5 Representation of a chemical structure based on undirected graphs...

Table 7.6 Connection matrix for the molecule phosgene (cf. Figure 7.4).

Table 7.7 Bond atoms and bonds in a connection table of phosgene (cf. Figur...

Table 7.8 A nonredundant connection table for phosgene (cf. Figure 7.4).

Table 7.9 Example relation for characterization of an analytical sample.

Table 7.10 Hit list for comparison of UV spectra of corticoids in methanol/...

Table 7.11 Molecular descriptors categorized by data type.

Table 7.12 Molecular descriptors categorized by dimensionality of data.

Chapter 8

Table 8.1 Program to find an element in a list of elements by means of list...

Table 8.2 Examples of expert systems in analytics.

Table 8.3 Transfer functions for neural nets.

Table 8.4 Exclusive OR (XOR problem).

Table 8.5 Applications of neural networks in analytics.

Table 8.6 Calculation of modifiers for the membership functions of the ling...

Table 8.7 Applications of fuzzy theory in analytics [22].

Table 8.8 Selection in the genetic algorithm in Example 8.13.

Table 8.9 Crossovers in the genetic algorithm in Example 8.13.

Table 8.10 Applications of genetic algorithms.

Chapter 9

Table 9.1 Control quantities in quality control chart.

Table 9.2 D factors for calculation of the limits of the range chart for th...

Table 9.3 General criteria in the directive series EN 45000.

Appendix

Table A.1 Probability density function (ordinate values) of the standardize...

Table A.2 Areas for the standard normal variate z (Eq. (2.28)) of the normal...

Table A.3 Two‐ and one‐sided Student's t‐distribution for different risk lev...

Table A.4 F‐distribution for the risk levels α = 0.025 (lightface type)...

Table A.5 Chi‐square distribution for different degrees of freedom f at diff...

Table A.6 Kolmogorov–Smirnov test statistic d(1−α, n) to test for a no...

Table A.7 Coefficients for computing first derivatives.

Table A.8 Coefficients for computing second derivatives.

Table A.9 Two-level designs (half‐cell designs) for three, four, and f...

Table A.10 Central composite design for four factors with triplicate measure...

Table A.11 Box–Behnken design for four factors with triplicate measure...

Table A.12 Mixture designs (lattice designs) for three and four factors.

List of Illustrations

Chapter 1

Figure 1.1 Signal dependence on time of an analog (a) and a digital detector...

Figure 1.2 Digitization of an analog signal by an analog‐to‐digital converte...

Figure 1.3 Local area network (LAN) to connect analytical instruments, a rob...

Figure 1.4 Anthropomorphic (a) and cylindrical (b) geometry of robot arms.

Chapter 2

Figure 2.1 Histogram for the measurements in Table 2.1 and the theoretical d...

Figure 2.2 Probability density distribution function of the Gaussian distrib...

Figure 2.3 Probability density function of the Poisson distribution accordin...

Figure 2.4 Illustration of the central limit theorem for the distribution of...

Figure 2.5 Box‐and‐whisker plot for the data in Table 2.1 with an additional...

Figure 2.6 Examples for the integration of the Gaussian distribution in the ...

Figure 2.7 Distribution function (error integral) for the Gaussian distribut...

Figure 2.8 Illustration of critical areas for one‐sided tests at the upper (...

Figure 2.9 Schematic plot of the observed frequency, hᵢ, and the theoretical...

Figure 2.10 Determination of the test statistics of Kolmogorov–Smirnov's tes...

Figure 2.11 Errors of the first and second kind at larger (a) and smaller (b...

Figure 2.12 NIR spectra of textiles polyamide (blue line) and velvet (ochre ...

Figure 2.13 Sums of squares for a reference distribution of factor B found b...

Chapter 3

Figure 3.1 Moving‐average filter for a filter width of 2m...

Figure 3.2 Filtering of a discrete analytical signal consisting of k data po...

Figure 3.3 Relative error for smoothing Gaussian and Lorentz peaks in depend...

Figure 3.4 Shape of a Gaussian peak (see Eq. (2.2)) and a Lorentz peak. b₁,...

Figure 3.5 Application of a Kalman filter for smoothing a time‐dependent sig...

Figure 3.6 Second derivative of a peak based on a Lorentz function.

Figure 3.7 Visual resolution of two Lorentz peaks with a resolution of 0.4 f...

Figure 3.8 Trapezoidal rule for integration of a signal.

Figure 3.9 Fourier transformation: the sum signal (a) contains the sine func...

Figure 3.10 Walsh function as the basis function for the Hadamard transforma...

Figure 3.11 Low‐pass filter (a) and high‐pass filter (b) in the frequency do...

Figure 3.12 Decomposition (deconvolution) of a peak (solid line) into two un...

Figure 3.13 Transformation of signals from 32 data points and back‐transform...

Figure 3.14 Interpolating spline function.

Figure 3.15 Wavelets of the classes Haar (a), Daubechies (b), Coiflet (c), S...

Figure 3.16 Shifting the Daubechies‐7 wavelet function ψ(x) (a) and by ...

Figure 3.17 Multilevel decomposition of a signal by discrete wavelet transfo...

Figure 3.18 Signal from mass spectrometry of Staphylococcus nuclease (SNase)...

Figure 3.19 Decomposition of the signal in Figure 3.18 into...

Figure 3.20 (a) Simulated chromatogram for a reference signal (solid line) a...

Figure 3.21 Time series for monthly recorded concentration of sulfur as sulf...

Figure 3.22 Pointwise correlations for the time series in Figure 3.21 for di...

Figure 3.23 Autocorrelation function for the time series in Figure 3.21. The...

Figure 3.24 Autocorrelation function of uncorrelated data.

Figure 3.25 Autocorrelation function for weakly correlated data according to...

Figure 3.26 Autocorrelation function for a time series with drift, as found ...

Figure 3.27 Schematic demonstration of cross‐correlation between an input si...

Chapter 4

Figure 4.1 Response versus factors plot. In the case of response surface met...

Figure 4.2 Sensitivity for a straight‐line calibration as functional depende...

Figure 4.3 Random and systematic errors for measurements in the signal domai...

Figure 4.4 Illustration of the analytical resolution for differently separat...

Figure 4.5 Peak separation according to Kaiser.

Figure 4.6 Goal region for the simultaneous investigation of two objective c...

Figure 4.7 Experimental observations for constructing a calibration graph in...

Figure 4.8 Full factorial design at two levels, 2³ design. x₁, x₂, and x₃ re...

Figure 4.9 Fractional factorial design at two levels, 2³⁻¹ half‐fracti...

Figure 4.10 Factor effects in the kinetic–enzymatic oxidation of p‐phenylene...

Figure 4.11 Full‐factorial three‐level design for two factors, 3² design.

Figure 4.12 Central composite design for three factors.

Figure 4.13 Box–Behnken design for three factors.

Figure 4.14 Mixture design for three factors at 10 levels based on a (3,3)‐l...

Figure 4.15 Surface (a) and contour (b) plots of enzymatic rate versus the f...

Figure 4.16 Sequential optimization with single‐factor‐at‐a‐time strategy in...

Figure 4.17 Fixed‐size simplex according to Nelder and Mead along an unknown...

Figure 4.18 Variable‐size simplex.

Figure 4.19 Simplex search for optimum PPD concentration and pH value for th...

Chapter 5

Figure 5.1 Demonstration of translation and scaling procedures: the original...

Figure 5.2 Projection of a swarm of objects from the original two dimensions...

Figure 5.3 Scree plot for the principal component model of the hair data in ...

Figure 5.4 Principal component scores for the hair data in Table 5.1.

Figure 5.5 Principal component loadings of the hair data in Table 5.1.

Figure 5.6 Biplot for the simultaneous characterization of the scores and loa...

Figure 5.7 Biplot for the simultaneous characterization of the scores and lo...

Figure 5.8 Simplified spectrochromatogram for an incompletely resolved peak ...

Figure 5.9 Simplified UV spectra of the compounds benzo[k]fluoranthene (•), ...

Figure 5.10 Comparison of the target spectrum anthracene (▴) with the predic...

Figure 5.11 Schematic representation of evolving eigenvalues, λ, in dep...

Figure 5.12 Source signals (a) and mixed signals (b).

Figure 5.13 Estimation of signals by PCA (a) and ICA (b).

Figure 5.14 Unfolding of three‐way arrays for data analysis by conventional ...

Figure 5.15 The parallel factor analysis (PARAFAC) model for F factors.

Figure 5.16 Electrospray ionization–mass spectrometry (ESI–MS) spectrum of a...

Figure 5.17 Charge states for a single sample (a) derived from ESI‐MS spectr...

Figure 5.18 Folding states of the protein myoglobin derived from PARAFAC dec...

Figure 5.19 Tucker3 model with different component numbers P, Q, and R in th...

Figure 5.20 City‐block distance (a) and Euclidean distance (b) for two featu...

Figure 5.21 Dendrogram for the clinical–analytical data from Table 5.4.

Figure 5.22 Graphical method for grouping of the hair data in Table 5.1 for ...

Figure 5.23 Chernoff faces for distinguishing of serum samples of diseased a...

Figure 5.24 Representation of iodine data from hair samples in Table 5.6; th...

Figure 5.25 Linear learning machine (LLM): representation of iodine data fro...

Figure 5.26 Decision lines based on linear regression (LR) and linear discri...

Figure 5.27 Discriminant function of the LDA of the hair data in Table 5.1 b...

Figure 5.28 Differently spread (a) and differently directed (b) objects of t...

Figure 5.29 Separation boundary for classification of objects into two class...

Figure 5.30 k‐NN classification by cross‐validated models with one neighbori...

Figure 5.31 Fraction of misclassified objects in dependence on the number of...

Figure 5.32 Soft independent modeling of class analogies (SIMCA) models for ...

Figure 5.33 Separable case (a) and nonseparable (overlap) case (b) of the de...

Figure 5.34 Partitioning in CART: The panel (a) demonstrates the partitionin...

Figure 5.35 Decision boundaries for classification of four classes by CART i...

Figure 5.36 Dependence of the fraction of misclassifications in bagged CART ...

Figure 5.37 Dependence of the fraction of misclassifications in boosted CART...

Figure 5.38 Application of the random forests algorithm for classification o...

Figure 5.39 Plot of the proximity matrix with the same class labeling as in ...

Figure 5.40 Simulated two‐class data (a) and their classification by (b) LDA...

Figure 5.41 Star (a) and face (b) plots for grouping spectra at selected wav...

Chapter 6

Figure 6.1 Illustration of analysis of variance (ANOVA) for linear regressio...

Figure 6.2 Plot of the xy data in Table 6.2 in connection with an ANO...

Figure 6.3 Confidence bands for the prediction of individual y values (broke...

Figure 6.4 Residual analysis in linear regression. (a) Time‐dependent observ...

Figure 6.5 xy‐Values for the case of heteroscedastic data.

Figure 6.6 Original NIR spectra (a) and OPLS corrected spectra (b).

Figure 6.7 NIR spectra of textiles with different off‐set and spreads at the...

Figure 6.8 Straight‐line modeling in the presence of an outlier and an influentia...

Figure 6.9 NIR spectrum of wheat.

Figure 6.10 Recovery function for resubstitution of samples in case of inver...

Figure 6.11 Regression diagnostics for influential observations and outliers...

Figure 6.12 Error for calibration of 30 wheat samples by means of 176 wavele...

Figure 6.13 Ridge trace for the calibration problem in Example 6.11, where t...

Figure 6.14 Trace plot of coefficients fit by lasso (a) and the 176 regressi...

Figure 6.15 Scheme of N‐way PLS regression of responses, Y...

Figure 6.16 Plot of the transformation of an x variable in the alternating c...

Figure 6.17 Demonstration of the nonlinear relationship between latent varia...

Figure 6.18 Modeling of the retention (k′) of anthranilic acid in HPLC in de...

Figure 6.19 Recovery functions for cross‐validated predictions of benzanthra...

Chapter 7

Figure 7.1 Structure of analytical databases.

Figure 7.2 Evaluation of the peak list of a spectrum.

Figure 7.3 Spheres around a carbon atom (bold face) as the basis for encodin...

Figure 7.4 Chemical structure of phosgene represented as an undirected graph...

Figure 7.5 Example of a Markush structure with the general groups G₁ of phen...

Figure 7.6 Inverted list for an IR spectral library. ID Identity number of s...

Figure 7.7 Connection of bits by exclusive OR (XOR).

Figure 7.8 Comparison of an unknown spectrum (a) with a candidate library sp...

Figure 7.9 Set‐oriented comparison of two spectra: (a) Unknown original spec...

Figure 7.10 Example of substructure search based on connection tables of the...

Figure 7.11 Simulation of the ¹H NMR spectrum for cumene (above right) and 1...

Chapter 8

Figure 8.1 Modules of artificial intelligence (AI) research in analogy to th...

Figure 8.2 Representation of knowledge in the form of semantic nets (a) and ...

Figure 8.3 Strategies for depth‐first search (a) and breadth‐first search (b...

Figure 8.4 Structure of an expert system.

Figure 8.5 Analogy between a biological (a) and an artificial neuron (b).

Figure 8.6 Structure of an artificial neural network.

Figure 8.7 Operation of a single neuron.

Figure 8.8 Bidirectional associative memory (BAM) consisting of n input and ...

Figure 8.9 Insolubility of the XOR problem by using a network based on a sin...

Figure 8.10 Solution of the XOR problem with a network that contains two neu...

Figure 8.11 Multilayer perceptron as basis for a backpropagation network.

Figure 8.12 Decision boundaries of a feed‐forward neural network trained by ...

Figure 8.13 Artificial neural network for classification of four classes bas...

Figure 8.14 Structure of self‐organizing networks in linear arrangement (a) ...

Figure 8.15 Neighborhood function in the form of a triangle (a) and a Mexica...

Figure 8.16 Simple recurrent network with one input x and one output y. h an...

Figure 8.17 1D convolutional layer with a filter of size 3 and stride 2....

Figure 8.18 Confusion matrix for classification of textiles by NIR‐spectra w...

Figure 8.19 Membership function for the detectability of elements by X‐ray f...

Figure 8.20 Membership functions for characterization of uncertainty of expe...

Figure 8.21 Two‐dimensional membership function for fitting a straight line....

Figure 8.22 Membership functions for the linguistic variable “high” accordin...

Figure 8.23 Membership functions for truth values after E. Baldwin.

Figure 8.24 Intersection (a) and union (b) of two fuzzy sets and the cardina...

Figure 8.25 Adaptive neuro‐fuzzy inference system.

Figure 8.26 Simulated data (a) and assignment of cases in the test data set ...

Figure 8.27 Example of a fuzzy difference of two numbers, that is, about 20 ...

Chapter 9

Figure 9.1 Sequence of objective quantities in a control chart. The dotted l...


Chemometrics

Statistics and Computer Application in Analytical Chemistry

Fourth Edition

 

Matthias Otto

 

 

 

 

Author

Prof. Matthias Otto

TU Bergakademie Freiberg

Institut für Analytische Chemie

Leipziger Str. 29

09599 Freiberg

Germany

Cover Image: Courtesy of Matthias Otto

All books published by WILEY‐VCH are carefully produced. Nevertheless, authors, editors, and publisher do not warrant the information contained in these books, including this book, to be free of errors. Readers are advised to keep in mind that statements, data, illustrations, procedural details or other items may inadvertently be inaccurate.

Library of Congress Card No.: applied for

British Library Cataloguing‐in‐Publication Data

A catalogue record for this book is available from the British Library.

Bibliographic information published by the Deutsche Nationalbibliothek

The Deutsche Nationalbibliothek lists this publication in the Deutsche Nationalbibliografie; detailed bibliographic data are available on the Internet at <http://dnb.d-nb.de>.

© 2024 WILEY‐VCH GmbH, Boschstraße 12, 69469 Weinheim, Germany

All rights reserved (including those of translation into other languages). No part of this book may be reproduced in any form – by photoprinting, microfilm, or any other means – nor transmitted or translated into a machine language without written permission from the publishers. Registered names, trademarks, etc. used in this book, even when not specifically marked as such, are not to be considered unprotected by law.

Print ISBN: 978‐3‐527‐35266‐1

ePDF ISBN: 978‐3‐527‐84378‐7

ePub ISBN: 978‐3‐527‐84379‐4

oBook ISBN: 978‐3‐527‐84380‐0

Preface

The idea to write a textbook on chemometrics originated from my lecturing for undergraduate and graduate students. At the TU Bergakademie Freiberg (Germany), I have been giving lectures and computer exercises about chemometrics for more than 30 years. These activities were supplemented by lecturing as a visiting professor at the Vienna University of Technology (Austria), the Technical University of Munich (Germany), and the Universidad de Concepción (Chile).

Over the years, I have found that students work willingly and enthusiastically in chemometrics. The initial difficulties with the computer exercises are scarcely imaginable today, since many students arrive at university already as computer enthusiasts. Difficulties do arise, however, in the appropriate evaluation of chemical measurements, because a good deal of statistical and mathematical knowledge is required. In most countries, the basic statistical and mathematical education of chemists is unfortunately poor compared with that of, say, physicists.

This textbook on chemometrics therefore covers the major topics in the statistical and mathematical evaluation of chemical, and especially analytical, measurements. It is dedicated to the evaluation of experimental observations, not to theoretical aspects of chemistry.

Overview of the Book

The book is subdivided into nine chapters. In the first chapter, the subject of chemometrics and its application areas are introduced. Chapter 2 provides the statistical fundamentals required to describe chemical data and apply statistical tests. The methods of signal processing for filtering data and for characterizing data as time series are the subject of Chapter 3. In Chapter 4, the methods for effective experimentation based on experimental design and optimization are covered. The methods are outlined in such a way that they can be equally applied to optimize a chemical synthesis, an analytical procedure, or a drug formulation. The methods of pattern recognition and the assignment of data in the sense of classification are presented in Chapter 5, which consists of sections on unsupervised and supervised learning. After introducing the methods of data preprocessing, the typical chemometric methods for the analysis of multidimensional data are outlined. Chapter 6 is dedicated to the modeling of relationships, ranging from straight‐line regression to methods of multiple and nonlinear regression analysis. In Chapter 7, analytical databases are discussed, i.e., the computer‐accessible representation of chemical structures and spectra, including the use of LIMS. More recent developments in chemometrics are considered in Chapter 8. Apart from the fundamentals of artificial intelligence, applications of expert systems, neural networks, the theory of fuzzy sets, and global search methods, such as genetic algorithms, are discussed. Current applications of statistical methods in the chemical laboratory are described in Chapter 9 for internal and external quality assurance, validation, accreditation, and good laboratory practice.

In the Appendix, the reader will find statistical tables, recommendations for software, and an introduction to matrix algebra. The application of chemometric methods should be made easier by following the Learning Objectives found in each chapter, studying the dozens of worked examples, and solving the Questions and Problems at the end of each chapter.

The textbook is not only written for chemometrics courses within the chemistry curriculum, but also for the individual study of chemists, biochemists, pharmacists, mineralogists, geologists, biologists, and scientists of related disciplines. In this context, the book should be considered useful for colleagues from industry, and it might be used, e.g., if multivariate methods are needed to run a NIR spectrometer, to apply statistical tests in quality assurance, or to investigate quantitative structure–activity relationships.

New to this Edition

The present edition is the first to be printed in full color. To that end, I have redrawn all of the more than 200 figures in the book. Some figures have been completely redesigned to make didactic use of color.

Apart from updating the general reading references, new examples, e.g., for the Kalman filter and supersaturated designs, new questions, and margin notes have been added. Novel methods are also introduced: ASCA (ANOVA Simultaneous Component Analysis), random forest classifications, shrinkage methods in the form of ridge and lasso regression, as well as deep learning networks based on multi‐layered feedforward, convolutional, and recurrent neural networks.

Acknowledgements

As usual, this text would not have been written without the comments and suggestions of my colleagues in chemometrics. I took my first steps in computer applications in analytics together with Heinz Zwanziger many years ago, and later I ran many courses on chemometrics together with Wolfhard Wegscheider; I would particularly like to acknowledge both of them on this occasion.

Finally, I would like to thank the publisher, WILEY‐VCH, for producing the book. I am especially grateful to Jenny Cossham (Associate Editorial Director at John Wiley & Sons, Inc., Chichester, UK), who initiated the call for a fourth edition; to Dr. Sakeena Quraishi and Tanya Domeier (WILEY‐VCH, Weinheim, Germany), who took over the editorial tasks after the retirement of J. Cossham; to Kubra Ameen (Wiley, Chennai, India) as the managing editor; and finally to Farhath Fathima (K&L Content Manager, Wiley) as the production editor of this edition.

July 2023

Matthias Otto

TU Bergakademie Freiberg, Germany

List of Abbreviations

ACE

alternating conditional expectations

ADC

analog‐to‐digital converter

ANFIS

adaptive neuro‐fuzzy inference system

ANOVA

analysis of variance

ASCA

ANOVA simultaneous component analysis

CART

classification and regression trees

CAS

Chemical Abstracts Service

CNN

convolutional neural network

CPU

central processing unit

CRF

chromatographic response function

DAC

digital‐to‐analog converter

DTW

dynamic time warping

DWT

discrete wavelet transformation

EPA

Environmental Protection Agency

ESI

electrospray ionization

FA

factor analysis

FFT

fast Fourier transformation

FHT

fast Hadamard transformation

FT

Fourier transformation

GC

gas chromatography

HORD

hierarchically ordered ring description

HOSE

hierarchically ordered spherical description of environment

HPLC

high performance liquid chromatography

HT

Hadamard transformation

I/O

input/output

ICA

independent component analysis

IND

indicator function

IR

infrared

ISO

International Organization for Standardization

JCAMP

Joint Committee on Atomic and Molecular Physical Data

KNN

k‐nearest neighbor method

LAN

local area network

LDA

linear discriminant analysis

LIMS

laboratory information management system

LISP

list processing language

LLM

linear learning machine

LSTM

long short‐term memory network

MANOVA

multivariate analysis of variance

MARS

multivariate adaptive regression splines

MCR

multivariate curve resolution

MS

mass spectrometry

MSDC

mass spectrometry data center

MSS

mean sum of squares

NIPALS

nonlinear iterative partial least squares

NIR

near infrared

NIST

National Institute of Standards and Technology

NLR

nonlinear regression

NMR

nuclear magnetic resonance

NPLS

nonlinear partial least squares

N‐PLS

N‐way partial least squares

OLS

ordinary least squares

OPLS

orthogonal partial least squares

PARAFAC

parallel factor analysis

PCA

principal component analysis

PCR

principal component regression

PLS

partial least squares

PRESS

predictive residual sum of squares

PROLOG

programming in logic

RAM

random access memory

RDA

regularized discriminant analysis

RE

real error

RNN

recurrent neural network

ROM

read‐only memory

RR

recovery rate

RSD

relative standard deviation

RSM

response surface method

SA

simulated annealing

SEC

standard error of calibration

SEP

standard error of prediction

SIMCA

soft independent modeling of class analogies

SMILES

simplified molecular input line entry specification

SNV

standard normal variate

SS

sum of squares

SVD

singular value decomposition

SVM

support vector machines

TS

Tabu search

TTFA

target transformation factor analysis

UV

ultraviolet

VIS

visible

VM

vector machines

WT

wavelet transformation.

Symbols

α

Significance level (risk), separation factor, complexity parameter

A

Area

b

Breadth (width), regression parameter

χ²

Quantile of chi‐squared distribution

cov

Covariance

C

Variance–covariance matrix

δ

Error, chemical shift

d

Distance measure, test statistic of the Kolmogorov–Smirnov test

D

Difference test statistic

D_k

Cook's test statistic

η

Learning coefficient

e

Error

E(.)

Expectation value

E

Residual matrix

f

Degree of freedom, function

f(x)

Probability density function

F

Quantile of Fisher distribution

F(⋅)

Function in frequency space

f(t)

Function in the time domain

F

Matrix of scores

G

Geometric mean

G(⋅)

Smoothing function in frequency space

g(t)

Smoothing function in the time domain

h

Communality

H

Harmonic mean

H

Hat matrix, Hadamard transformation matrix

H_0

Null hypothesis

H_1

Alternative hypothesis

H(⋅)

Filter function in frequency space

h(t)

Filter function in the time domain

I

Unit step function

I

Identity matrix

I_50

Interquartile range

J

Jacobian matrix

k

Kurtosis

k

Capacity factor

K_A

Protolysis (acid) constant

λ

Eigenvalue, Poisson parameter

L

Loading matrix

μ

Population mean

m

Membership function

m_r

Moment of distribution

nf

Neighborhood function

N

Analytical resolution, plate number

N(ν)

Noise

p

Portion

P

Probability

Q

Quartile, Dixon statistics

r

Correlation coefficient, radius

R

Correlation matrix

R_S

Chromatographic resolution

R²

Coefficient of determination

σ

Standard deviation

s

Estimate of standard deviation, skewness

s_r

Estimate of relative standard deviation

S

Similarity measure

τ

Time lag

t

Quantile of Student distribution

T

Test quantity of Grubbs' test

T

Matrix of principal component scores, transformation matrix

U

Matrix of left eigenvectors

R

Range, region in CART

V

Matrix of right eigenvectors

w

Singular value, weight

W

Warping path

x

(Independent) variable

x

Vector of independent variables

X

Matrix of independent variables

x̄

Arithmetic mean

ξ

Slack variable

y

(Dependent) variable

y*

Transformed (filtered) value

z

Standard normal deviate, signal position, objective function.

1What is Chemometrics?

Learning Objectives

To define chemometrics

To learn how to count with bits and how to perform arithmetic or logical operations in a computer

To understand the principal terminology for computer systems and the meaning of robotics and automation.

The development of the discipline of chemometrics is strongly related to the use of computers in chemistry. Some analytical groups in the 1970s were already working with statistical and mathematical methods that are nowadays ascribed to chemometrics. Those early investigations were tied to the use of mainframe computers.

The term chemometrics was introduced in 1972 by the Swede Svante Wold and the American Bruce R. Kowalski. The foundation of the International Chemometrics Society in 1974 led to the first description of this discipline. In the following years, several conference series were organized, for example, Computer Application in Analytics (COMPANA), Computer‐Based Analytical Chemistry (COBAC), and Chemometrics in Analytical Chemistry (CAC). Some journals devoted special sections to papers on chemometrics. Later, dedicated chemometric journals were started, such as the Journal of Chemometrics (Wiley) and Chemometrics and Intelligent Laboratory Systems (Elsevier).

An actual definition of chemometrics is:

the chemical discipline that uses mathematical and statistical methods, (a) to design or select optimal measurement procedures and experiments, and (b) to provide maximum chemical information by analyzing chemical data.

The discipline of chemometrics originates in chemistry. Typical applications of chemometric methods are the development of quantitative structure–activity relationships and the evaluation of analytical–chemical data. The data flood generated by modern analytical instrumentation is one reason that analytical chemists, in particular, develop applications of chemometric methods. Chemometric methods in analytics are a discipline that uses mathematical and statistical methods to obtain relevant information on material systems.

With the availability of personal computers at the beginning of the 1980s, a new age commenced for the acquisition, processing, and interpretation of chemical data. In fact, today, every scientist uses software, in one form or another, that is related to mathematical methods or processing of knowledge. As a consequence, the necessity emerges for a deeper understanding of those methods.

The education of chemists in mathematics and statistics is usually unsatisfactory. Therefore, one of the initial aims of chemometrics was to make complicated mathematical methods practicable. Meanwhile, commercial statistical and numerical software has simplified this process, so that all important chemometric methods can be taught in appropriate computer demonstrations.

Apart from the statistical–mathematical methods, the topics of chemometrics are also related to problems of the computer‐based laboratory, to methods for handling chemical or spectroscopic databases, and to methods of artificial intelligence.

In addition, chemometricians contribute to the development of all these methods. As a rule, these developments are dedicated to particular practical requirements, such as the automatic optimization of chromatographic separations or the prediction of the biological activity of a chemical compound.

1.1 The Computer‐Based Laboratory

Nowadays, the computer is an indispensable tool in research and development. The computer is linked to analytical instrumentation; it serves as a tool for acquiring data, word processing, and handling databases and quality assurance systems. In addition, the computer is the basis for modern communication techniques such as electronic mail or video conferences. In order to understand the important principles of computer usage, some fundamentals are considered here, that is, coding and processing of digital information, the main components of the computer, programming languages, computer networking, and automation processes.

Analog and Digital Data

The use of digital data provides several advantages over the use of analog data. Digital data are less noise sensitive: the only noise arises from round‐off errors due to the finite representation of the digits of a number. They are less prone to, for instance, electrical interference, and they are compatible with digital computers.

As a rule, primary data are generated as analog signals either in a discrete or a continuous mode (Figure 1.1). For example, monitoring the intensity of optical radiation by means of a photocell provides a continuous signal. Weak radiation, however, could be monitored by detecting individual photons with a photomultiplier.

Figure 1.1 Signal dependence on time of an analog (a) and a digital detector (b).

Usually, the analog signals generated are converted into digital data by an analog‐to‐digital converter (ADC), as explained below.

Binary versus Decimal Number System

In a digital measurement, the number of pulses occurring within a specified set of boundary conditions is counted. The easiest way to count is to have the pulses represented as binary numbers. In this way, only two electronic states are required, whereas representing the decimal numbers from 0 to 9 would need 10 different states. Typically, the binary numbers 0 and 1 are represented electronically by voltage signals of 0.5 and 5 V, respectively. The digits of a binary number are the coefficients of powers of 2, so that any number in the decimal system can be described.

Example 1.1 Binary Number Representation

The decimal number 77 is expressed as the binary number 1001101, that is,

1001101 = 1 × 2⁶ + 0 × 2⁵ + 0 × 2⁴ + 1 × 2³ + 1 × 2² + 0 × 2¹ + 1 × 2⁰
        = 64 + 0 + 0 + 8 + 4 + 0 + 1 = 77

Table 1.1 summarizes further relationships between binary and decimal numbers. Every binary number is composed of individual bits (binary digits). The digit lying farthest to the right is termed the least significant digit, and the one on the left is the most significant digit.

Table 1.1 Relationship between binary and decimal numbers.

Binary number

Decimal number

0

0

1

1

10

2

11

3

100

4

101

5

110

6

111

7

1000

8

1001

9

1010

10

1101

13

10 000

16

100 000

32

1 000 000

64
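The conversions in Example 1.1 and Table 1.1 can be reproduced in a few lines of Python. This is an illustrative sketch (the function names are not from the book); Python's built‐ins `bin()` and `int(..., 2)` give the same results.

```python
# Sketch of binary/decimal conversion as in Example 1.1 and Table 1.1.
# Function names are illustrative only.

def to_binary(n):
    """Decimal integer -> binary string, by repeated division by 2."""
    digits = ""
    while n > 0:
        digits = str(n % 2) + digits
        n //= 2
    return digits or "0"

def to_decimal(bits):
    """Binary string -> decimal integer, as a sum of powers of 2."""
    return sum(int(b) * 2 ** i for i, b in enumerate(reversed(bits)))

print(to_binary(77))          # 1001101, as in Example 1.1
print(to_decimal("1001101"))  # 77
```

For comparison, `bin(77)` returns `'0b1001101'` and `int('1001101', 2)` returns `77`.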

How are calculations done using binary numbers? Arithmetic operations are similar to, but simpler than, those for decimal numbers. For addition, for example, four combinations are feasible:

0 + 0 = 0
0 + 1 = 1
1 + 0 = 1
1 + 1 = 10

Note that for the addition of the binary numbers 1 plus 1, a 1 is carried over to the next higher power of 2.

Example 1.2 Calculation with Binary Numbers

Consider the addition of 21 + 5 in the decimal (a) and the binary (b) number system:

(a)   21        (b)   10101
     + 5             +  101
      26              11010
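The carry rule above can be sketched as a small bit‐by‐bit adder in Python (a didactic sketch; the function name is illustrative, and real programs would simply use integer addition):

```python
# A bit-by-bit adder following the carry rule above (didactic sketch).

def add_binary(a, b):
    n = max(len(a), len(b))
    a, b = a.zfill(n), b.zfill(n)   # pad the shorter number with leading zeros
    result, carry = "", 0
    for i in range(n - 1, -1, -1):  # from least to most significant digit
        s = int(a[i]) + int(b[i]) + carry
        result = str(s % 2) + result
        carry = s // 2              # 1 + 1 = 10: carry to the next power of 2
    return ("1" + result) if carry else result

print(add_binary("10101", "101"))  # 11010, i.e. 21 + 5 = 26
```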

Apart from arithmetic operations in the computer, logical reasoning is necessary too. This might be in the course of an algorithm or in connection with an expert system. Logical operations with binary numbers are summarized in Table 1.2.

Table 1.2 Truth values for logical connectives of predicates p and q based on binary numbers (1 true and 0 false).

p   q   p AND q   p OR q   IF p THEN q   NOT p
1   1      1        1           1          0
1   0      0        1           0          0
0   1      0        1           1          1
0   0      0        0           1          1
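The connectives of Table 1.2 can be checked with Python's bitwise and logical operators; note that the implication IF p THEN q is equivalent to (NOT p) OR q:

```python
# Table 1.2 as bitwise/logical operations in Python (1 = true, 0 = false).
rows = []
for p in (1, 0):
    for q in (1, 0):
        # columns: p, q, p AND q, p OR q, IF p THEN q, NOT p
        rows.append((p, q, p & q, p | q, int(not p or q), int(not p)))
        print(*rows[-1])
```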

It should be mentioned that a very compact representation of numbers is based on the hexadecimal number system. However, hexadecimal numbers are easily converted into binary data, so the details need not be explored here.

Digital and Analog Converters

Analog‐to‐Digital Converters (ADCs)

In order to benefit from the advantages of digital data evaluation, the analog signals are converted into digital ones. An analog signal consists of an infinitely dense sequence of signal values at, in theory, infinitely fine resolution. The conversion of analog into digital signals in the ADC therefore entails a definite reduction of information. For conversion, signal values are sampled at a predefined time interval and quantified on an n‐ary raster (Figure 1.2). The output signal is a code word consisting of n bits. Using n bits, 2ⁿ different levels can be coded; for example, an 8‐bit ADC has a resolution of 2⁸ = 256 amplitude levels.

Figure 1.2 Digitization of an analog signal by an analog‐to‐digital converter (ADC).

Digital‐to‐Analog Converters (DACs)

Converting digital into analog information is necessary if an external device is to be controlled or if the data are to be represented by an analog output unit. The resolution of the analog signal is determined by the number of bits processed in the converter. A 10‐bit DAC provides 2¹⁰ = 1024 different voltage increments. Its resolution is then 1/1024, or approximately 0.1%.
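The resolution figures quoted for the 8‐bit ADC and the 10‐bit DAC follow directly from 2ⁿ, as the following sketch shows (the helper function is illustrative only):

```python
# Levels and percentage resolution of an n-bit converter.

def resolution(n_bits):
    levels = 2 ** n_bits
    return levels, 100.0 / levels   # number of levels, resolution in percent

for n in (8, 10, 12, 16):
    levels, res = resolution(n)
    print(f"{n:2d} bit: {levels:6d} levels, resolution ~ {res:.4f} %")
```

For the 10‐bit DAC this gives 1024 levels and about 0.098%, i.e., the roughly 0.1% stated above.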

Computer Terminology

The representation of numbers in a computer by bits has already been considered. The combination of 8 bits is called a byte. A series of bytes arranged in sequence to represent a piece of data is termed a word. Typical word sizes are 8, 16, 32, or 64 bits, or 1, 2, 4, and 8 bytes.

Words are processed in registers. A sequence of operations in a register enables algorithms to be performed. One or several algorithms make up a computer program.

The physical components of a computer form the hardware. Hardware includes clocks, memory units, computer data storage, monitor, mouse, keyboard, graphics, and sound cards. Programs and instructions for the computer, including documentation and data, represent the software.

Components of Computers

Central Processing Units and Buses

A bus consists of a set of parallel conductors that forms a main transition path in a computer.

The heart of a computer is the central processing unit (CPU). In a microprocessor or minicomputer, this unit consists of a highly integrated chip.

The different components of a computer, its memory, and the peripheral devices, such as printers or scanners, are joined by buses. To guarantee rapid communication among the various parts of a computer, information is exchanged on the basis of a fixed word size, for example, 32 bits, simultaneously over parallel lines of the bus. A data bus serves the exchange of data into and out of the CPU. The origin and the destination of the data in the bus are specified by the address bus. For example, an address bus with 16 lines can address 2¹⁶ = 65 536 different registers or other locations in the computer or in its memory. Control and status information to and from the CPU are administered via the control bus. The peripheral devices are controlled by an external bus system, for example, an SPI (Serial Peripheral Interface), I2C (Inter‐Integrated Circuit), or UART (Universal Asynchronous Receiver–Transmitter) interface for serial data transfer, or the IEEE‐488 interface for parallel transfer of data.

Memory

The microcomputer or microprocessor typically contains two kinds of memory: random access memory (RAM) and read‐only memory (ROM). The term RAM is somewhat misleading and historically rooted, since random access is feasible for RAM and ROM alike. The RAM can be used to read and write information. In contrast, information in a ROM is written once, so that it can be read, but not reprogrammed. ROMs are needed in microcomputers or pocket calculators in order to run fixed programs, for example, for the calculation of logarithms or standard deviations.

Larger programs and data collections are stored in bulk storage devices. At the beginning of the computer age, magnetic tapes were the standard here. Nowadays, CDs, DVDs, Blu‐rays, and USB flash drives are used, providing a storage capacity of several gigabytes. In addition, every computer is equipped with a hard disk of at least several gigabytes. The access time to retrieve the stored information is on the order of a few milliseconds.

Input/Output Systems

Communication with the computer is carried out by input–output (I/O) operations. Typical input devices are keyboard, mouse, scanners, and the signals of an analytical instrument. Output devices are screens and printers, as well as flash drives and disks. To convert analog information into digital or vice versa, the aforementioned ADCs or DACs are used.

Programs

Programming a computer directly in terms of 0 and 1 states, or bits, is possible using machine code. Since this kind of programming is rather time‐consuming, higher‐level languages have been developed in which whole groups of bit operations are assembled. However, these so‐called assembler languages are still difficult to handle. Therefore, high‐level algorithmic languages, such as FORTRAN, BASIC, PASCAL, C++, or Python, are more common in analytical chemistry. With high‐level languages, the instructions for performing an algorithm can easily be formulated in a computer program. Thereafter, these instructions are translated into machine code by means of a compiler.

For logical programming, additional high‐level languages exist, for example, LISP (List Processing language) or PROLOG (Programming in Logic). Further developments are found in the so‐called Shells, which can be used directly for building expert systems.

Networking

A very effective communication between computers, analytical instruments, and databases is based on networks. There are local nets, for example, within an industrial laboratory, as well as national or worldwide networks. Local area networks (LANs) are used to transfer information about analysis samples, measurements, research projects, or in‐house databases. A typical LAN is demonstrated in Figure 1.3. It contains a laboratory information management system (LIMS), where all information about the sample or the progress of a project can be stored and further processed (cf. Section 7.1).

Figure 1.3 Local area network (LAN) to connect analytical instruments, a robot, and a laboratory information management system (LIMS).

Worldwide networking is feasible, for example, via Internet or CompuServe. These nets are used to exchange electronic mail (e‐mail) or data with universities, research institutions, or industries.

Robotics and Automation

Apart from acquiring and processing analytical data, the computer can also be used to control or supervise automatic procedures. To automate manual procedures, a robot is applied. A robot is a reprogrammable device that can perform a task more cheaply and effectively than a human being.

The typical geometric shapes of a robot arm are sketched in Figure 1.4. The anthropomorphic geometry (Figure 1.4a) is derived from the human torso; that is, there is a waist, a shoulder, an elbow, and a wrist. Although this type of robot is mainly found in the automobile industry, it can also be used for manipulation of liquid or solid samples.

Figure 1.4 Anthropomorphic (a) and cylindrical (b) geometry of robot arms.

In the chemical laboratory, the cylindrical geometry dominates (Figure 1.4b). The revolving robot arm can be moved in horizontal and vertical directions. The typical operations of a robot are as follows:

Manipulation

of test tubes or glassware around the robotic work area

Weighing

for the determination of a sample amount or for checking unit operations, for example, addition of solvent

Liquid handling,

in order to dilute or add reagent solutions

Conditioning

of a sample by heating or cooling

Separations

based on filtrations or extractions

Measurements

by analytical procedures, such as spectrophotometry or chromatography

Control and supervision

of the different analytical steps.

Programming of a robot is based on software specific to its manufacturer. The software consists of elements to control the peripheral devices (robot arm, balance, pumps), to switch the devices on and off, and to provide instructions on the basis of logical structures, for example, IF–THEN rules.

Alternatives for automation in a laboratory are discrete analyzers and flowing systems. By means of discrete analyzers, unit operations such as dilution, extraction, and dialysis can be automated. Continuous flow analyzers or flow injection analysis serve similar objectives for automation, for example, for the determination of clinical parameters in blood serum.

The transfer of manual operations to a robot or an automated system provides the following advantages:

High productivity and/or minimization of costs

Improved precision and trueness of results

Increased assurance for performing laboratory operations

Easier validation of the different steps of an analytical procedure.

The increasing degree of automation in the laboratory leads to more and more measurements that are available online on the computer and have to be further processed by chemometric data evaluation methods.

1.2 Statistics and Data Interpretation

Table 1.3 provides an overview of chemometric methods. The main emphasis is on statistical–mathematical methods. Random data are characterized and tested by the descriptive and inference methods of statistics, respectively. Their importance increases in connection with the aims of quality control and quality assurance. Signal processing is carried out by means of algorithms for smoothing, filtering, differentiation, and integration. Transformation methods such as the Fourier or Hadamard transformations also belong in this area.

Table 1.3 Chemometric methods for data evaluation and interpretation.

Descriptive and inference statistics

Signal processing

Experimental design

Modeling

Optimization

Pattern recognition

Classification

Artificial intelligence methods

Image processing

Information and system theory

Efficient experimentation is based on the methods of experimental design and their quantitative evaluation. The latter can be performed by means of mathematical models or graphical representations. Alternatively, sequential methods, such as the simplex method, are applied instead of these simultaneous methods of experimental optimization. There, the optimum conditions are found by a systematic search for the objective criterion, for example, the maximum yield of a chemical reaction, in the space of all experimental variables.

To find patterns in data and to assign samples, materials, or, in general, objects, to those patterns, multivariate methods of data analysis are applied. Recognition of patterns, classes, or clusters is feasible with projection methods, such as principal component analysis or factor analysis, or with cluster analysis. To construct class models for the classification of unknown objects, we will introduce discriminant analyses.

To characterize the information content of analytical procedures, information theory is used in chemometrics.

1.3 Computer‐Based Information Systems/Artificial Intelligence

A further subject of chemometrics is the computer‐based processing of chemical structures and spectra.

There, it might be necessary to extract a complete or partial structure from a collection of molecular structures or to compare an unknown spectrum with the spectra of a spectral library.

For both kinds of queries, methods for representation and manipulation of structures and spectra in databases are needed. In addition, problems of data exchange formats, for example, between a measured spectrum and a spectrum of a database, are to be decided.

If no comparable spectrum is found in a spectral library, then methods for spectra interpretation become necessary. For interpretation of atomic and molecular spectra, in principle, all the statistical methods for pattern recognition are appropriate (cf. Section 1.2). In addition, methods of artificial intelligence are used. They include methods of logical reasoning and tools for developing expert systems. Apart from the methods of classical logic in this context, methods of approximate reasoning and fuzzy logic can also be exploited. These interpretation systems constitute methods of knowledge processing in contrast to data processing based on mathematical–statistical methods.

Knowledge acquisition is mainly based on expert knowledge; for example, the infrared spectroscopist is asked to contribute his knowledge in the development of an interpretation system for infrared spectra. Additionally, methods are required for automatic knowledge acquisition in the form of machine learning.

Methods based on fuzzy theory, neural nets, and evolutionary strategies are denoted as soft computing.

The methods of artificial intelligence and machine learning are not restricted to the interpretation of spectra. They can also be used to develop expert systems, for example, for the analysis of drugs or the synthesis of an organic compound.

Novel methods are based on biological analogs, such as neural networks and evolutionary strategies, for example, genetic algorithms. Future areas of research for chemometricians will include the investigation of fractal structures in chemistry and of models based on the theory of chaos.

General Reading

1

Sharaf, M.A., Illman, D.L., and Kowalski, B.R. (1986).

Chemometrics, Chemical Analysis Series

, vol. 82. New York: John Wiley & Sons, Inc.

2

Massart, D.L., Vandeginste, B.G.M., Deming, S.N. et al. (1988).

Chemometrics–A Textbook

. Amsterdam: Elsevier.

3

Brown, S.D., Tauler, R., and Walczak, B. (eds) (2020).

Comprehensive Chemometrics – Chemical and Biochemical Data Analysis

, 2nd edn. Amsterdam: Elsevier.

4

Varmuza, K. and Filzmoser, P. (2009).

Introduction to Multivariate Statistical Analysis in Chemometrics

. Boca Raton, FL, Berlin: CRC Press.

Questions and Problems

Calculate the resolution for 10‐, 16‐, and 20‐bit analog‐to‐digital converters.

How many bits are stored in an 8‐byte word?

What is the difference between procedural and logical programming languages?

Discuss typical operations of an analytical robot.

2Basic Statistics

Learning Objectives

To introduce the fundamentals of descriptive and inference statistics

To highlight important distributions such as normal, Poisson, Student's

t

, Fisher's

F

, and

χ

2

To understand the measures for characterizing the location and dispersion of a data set

To discuss the Gaussian error propagation law

To learn statistical tests for comparison of data sets and for testing distributions or outliers

To distinguish between one‐ and two‐sided statistical tests at the lower and upper ends of a distribution

To estimate the effect of experimental factors on the basis of univariate and multivariate analyses of variance.

In analytical chemistry, statistics are needed to evaluate analytical data and measurements and to preprocess, reduce, and interpret the data.

As a rule, analytical data are to some degree uncertain. There are three sources of uncertainty:

Variability

Measurement uncertainty

Vagueness.

A high degree of variability in data is typically observed with data from living beings, reflecting the rich variability of nature. For example, the investigation of tissue samples provides a very variable pattern of individual compounds for each human individual.

Measurement uncertainty is connected with the impossibility of observing or measuring to an arbitrary level of precision and without systematic errors (bias). This is the type of uncertainty the analyst has to consider most frequently.

Vagueness is introduced by using a natural or professional language to describe an observation, for example, if the characterization of a property is uncertain. Typical vague descriptions represent sensory variables, such as sweet taste, raspberry‐colored appearance, or aromatic smell.

For description of the uncertainty due to variability and measurement uncertainty, statistical methods are used. Vague circumstances are characterized by fuzzy methods (cf. Section 8.3).

2.1 Descriptive Statistics

Sources of uncertainty in analytical measurements are random and systematic errors. Random errors are determined by the limited precision of measurements. They can be diminished by repetitive measurements. To characterize random errors, probability‐based approaches are used where the measurements are considered as random, independent events.
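As a small illustration of how repetitive measurements diminish random errors, the following sketch simulates a hypothetical measurement (true value 10.0, random error with standard deviation 0.5, both invented for the demonstration) and shows that the scatter of the mean shrinks roughly as 1/√n:

```python
import random
import statistics

random.seed(1)

# Hypothetical measurement: true value 10.0, random error with sd 0.5.
def mean_of_n(n):
    """Average of n simulated repetitive measurements."""
    return statistics.mean(random.gauss(10.0, 0.5) for _ in range(n))

# Scatter of the mean over many simulated series shrinks roughly as 1/sqrt(n).
for n in (1, 4, 16, 64):
    means = [mean_of_n(n) for _ in range(2000)]
    print(n, round(statistics.stdev(means), 4))
```

Averaging 64 repetitions reduces the random scatter by about a factor of 8 compared with a single measurement, in line with the 1/√n rule.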

Apart from descriptive statistics, there is also inference statistics (cf. Section 2.2).

Systematic errors (bias) represent a constant or multiplicative part of the experimental error. This error cannot be decreased by repetitive measurements. In analytics, the trueness of values, that is, the deviation of the mean from the true value, is related to a systematic error. Appropriate measurements with standards are used to enable the recognition of systematic errors in order to correct them in later measurements.