The third edition of this long-selling introductory textbook and ready reference covers all pertinent topics, from basic statistics via modeling and databases right up to the latest regulatory issues. The experienced and internationally recognized author, Matthias Otto, introduces the statistical-mathematical evaluation of chemical measurements, especially analytical ones, going on to provide a modern approach to signal processing, designing and optimizing experiments, pattern recognition and classification, as well as modeling simple and nonlinear relationships. Analytical databases are covered, as are applications of multiway analysis, artificial intelligence, fuzzy theory, neural networks, and genetic algorithms. The new edition has 10% new content to cover such recent developments as orthogonal signal correction and new data exchange formats, tree-based classification and regression, independent component analysis, ensemble methods, and neuro-fuzzy systems. It still retains, however, the proven features of previous editions: worked examples, questions and problems, additional information, and brief explanations in the margin.
Cover
Title Page
Copyright
List of Abbreviations
Symbols
Chapter 1: What is Chemometrics?
1.1 The Computer-Based Laboratory
1.2 Statistics and Data Interpretation
1.3 Computer-Based Information Systems/Artificial Intelligence
General Reading
Questions and Problems
Chapter 2: Basic Statistics
2.1 Descriptive Statistics
2.2 Statistical Tests
2.3 Analysis of Variance
General Reading
Questions and Problems
Chapter 3: Signal Processing and Time Series Analysis
3.1 Signal Processing
3.2 Time Series Analysis
General Reading
Questions and Problems
Chapter 4: Optimization and Experimental Design
4.1 Systematic Optimization
4.2 Objective Functions and Factors
4.3 Experimental Design and Response Surface Methods
4.4 Sequential Optimization: Simplex Method
General Reading
Questions and Problems
Chapter 5: Pattern Recognition and Classification
5.1 Preprocessing of Data
5.2 Unsupervised Methods
5.3 Supervised Methods
General Reading
Questions and Problems
Chapter 6: Modeling
6.1 Univariate Linear Regression
6.2 Multiple Linear Regression
6.3 Nonlinear Methods
General Reading
Questions and Problems
Chapter 7: Analytical Databases
7.1 Representation of Analytical Information
7.2 Library Search
7.3 Simulation of Spectra
General Reading
Questions and Problems
Chapter 8: Knowledge Processing and Soft Computing
8.1 Artificial Intelligence and Expert Systems
8.2 Neural Networks
8.3 Fuzzy Theory
8.4 Genetic Algorithms and Other Global Search Strategies
General Reading
Questions and Problems
Chapter 9: Quality Assurance and Good Laboratory Practice
9.1 Validation and Quality Control
9.2 Accreditation and Good Laboratory Practice
General Reading
Questions and Problems
Appendix
Statistical Distributions
Digital Filters
Experimental Designs
Matrix Algebra
Software
Index
End User License Agreement
Chapter 1: What is Chemometrics?
Figure 1.1 Signal dependence on time of an analog (a) and a digital detector (b).
Figure 1.2 Digitization of an analog signal by an analog-to-digital converter (ADC).
Figure 1.3 Local area network (LAN) to connect analytical instruments, a robot, and a laboratory information management system (LIMS).
Figure 1.4 Anthropomorphic (a) and cylindrical (b) geometry of robot arms.
Chapter 2: Basic Statistics
Figure 2.1 Histogram for the measurements of Table 2.1 and the theoretical distribution function according to a Gaussian distribution (solid line).
Figure 2.2 Probability density distribution function of the Gaussian distribution according to Eq. (2.1) with the mean μ and the standard deviation σ.
Figure 2.3 Probability function of the Poisson distribution according to Eq. (2.10) for two values of the parameter λ.
Figure 2.4 Illustration of the central limit theorem for the distribution of means taken from a binomial distribution X.
Figure 2.5 Box-and-whisker plot for the data in Table 2.1 with an additional outlier at an absorbance of 0.373.
Figure 2.6 Examples for the integration of the Gaussian distribution in the ranges for the deviate z from 0.25 to 1.25 (a) and for z = ±2 (b).
Figure 2.7 Distribution function (error integral) for the Gaussian distribution.
Figure 2.8 Illustration of critical areas for one-sided tests at upper (a) and lower (b) end and for two-sided tests (c).
Figure 2.9 Schematic plot of the observed frequency, h_i, and the theoretically expected frequency, f, according to a normal distribution.
Figure 2.10 Determination of the test statistic of the Kolmogorov–Smirnov test as the maximum difference d_max between the empirical cumulative frequency distribution of the data and the hypothetical distribution function F_0(x).
Figure 2.11 Errors of the first and second kind at larger (a) and smaller (b) standard deviation of the distribution.
Chapter 3: Signal Processing and Time Series Analysis
Figure 3.1 Moving-average filter for a filter width of 2m + 1 = 3, that is, m = 1. Note that for the extreme points, no filtered data can be calculated, since they are needed for computing the first and last average. • Original signal value, ○ filtered signal value.
Figure 3.2 Filtering of a discrete analytical signal consisting of k data points with different filters: (1) a 5-point moving-average filter, (2) an 11-point moving-average filter, (3) a 5-point Savitzky–Golay filter, and (4) interpolated signal.
Figure 3.3 Relative error for smoothing Gaussian and Lorentz peaks in dependence on the relative filter width (Eq. (3.3)).
Figure 3.4 Shape of a Gaussian peak (see Eq. (2.2)) and a Lorentz peak. b_1, b_2, and b_3 are constants.
Figure 3.5 Second derivative of a peak based on a Lorentz function.
Figure 3.6 Visual resolution of two Lorentz peaks with a resolution of 0.4 full width at half maximum (a) and after formation of the second derivative (b).
Figure 3.7 Trapezoidal rule for integration of a signal.
Figure 3.8 Fourier transformation: the sum signal (a) contains the sine functions with the periods t_1 = 1 s and t_2 = 1/3 s (b). The intensity is given in dependence on the frequency after Fourier transformation in (c).
Figure 3.9 Walsh function as the basis function for Hadamard transformation (HT).
Figure 3.10 Low-pass filter (a) and high-pass filter (b) in the frequency domain.
Figure 3.11 Decomposition (deconvolution) of a peak (solid line) into two underlying individual peaks (dashed lines) by means of Fourier transformation (FT).
Figure 3.12 Transformation of signals from 32 data points and back transformation by FT and HT using different numbers of coefficients. The numbers correspond to the remaining coefficients except the first coefficient, which represents the average (FT) or the sum (HT) of the spectrum.
Figure 3.13 Interpolating spline function.
Figure 3.14 Wavelets of classes Haar (a), Daubechies (b), Coiflet (c), Symmlet (d), Morlet (e), and Mexican hat (f).
Figure 3.15 Shifting the Daubechies-7 wavelet function ψ(x) (a) and by the value b (b).
Figure 3.16 Multilevel decomposition of a signal by discrete wavelet transformation (DWT).
Figure 3.17 Signal from mass spectrometry of Staphylococcus nuclease (SNase).
Figure 3.18 Decomposition of the signal in Figure 3.17 into approximation A_3 and details D_1 to D_3.
Figure 3.19 (a) Simulated chromatogram for a reference signal (solid line) and a sample signal (dotted line). (b) Alignment of reference and sample signal curves by dynamic time warping.
Figure 3.20 Time series for monthly recorded concentration of sulfur as sulfate.
Figure 3.21 Pointwise correlations for the time series in Figure 3.20 for different time lags of τ = 0 (a), 1 (b), 7 (c), and 12 (d) data points with the correlation coefficients r(τ) = 1.000, 0.243, 0.0209, and 0.365, respectively.
Figure 3.22 Autocorrelation function for the time series in Figure 3.20. The correlation at τ = 0 (cf. Figure 3.21) is not drawn in the figure.
Figure 3.23 Autocorrelation function of uncorrelated data.
Figure 3.24 Autocorrelation function for weakly correlated data according to a process of first order. The time series is based on glucose determinations in urine over time.
Figure 3.25 Autocorrelation function for a time series with drift, as found for measurements with a chemical sensor.
Figure 3.26 Schematic demonstration of cross-correlation between an input signal x and output signal y. The time lag can be derived from the shift in the correlation maximum.
Chapter 4: Optimization and Experimental Design
Figure 4.1 Response versus factors plot: in the case of response surface methods (RSMs), the response is described by a mathematical model (dotted contour lines of the response surface). By using search methods, the response is measured along a search path, here along a simplex path (cf. "Analytical Performance Characteristics" Section).
Figure 4.2 Sensitivity for a straight-line calibration as functional dependence of signal on concentration.
Figure 4.3 Random and systematic errors for measurements in the signal domain.
Figure 4.4 Illustration of the analytical resolution for differently separated peaks or bands.
Figure 4.5 Peak separation according to Kaiser.
Figure 4.6 Goal region for the simultaneous investigation of two objective criteria, for example, selectivity and analysis time. The bold line characterizes Pareto optimal points.
Figure 4.7 Experimental observations for constructing a calibration graph in systematic (a) and randomized order (b).
Figure 4.8 Full factorial design at two levels, 2^3 design. x_1, x_2, and x_3 represent the three factors.
Figure 4.9 Fractional factorial design at two levels, 2^(3−1) half-fraction design. x_1, x_2, and x_3 represent the three factors.
Figure 4.10 Factor effects in the kinetic–enzymatic oxidation of p-phenylenediamine (PPD) by the enzyme ceruloplasmin.
Figure 4.11 Full-factorial three-level design for two factors, 3^2 design.
Figure 4.12 Central composite design for three factors.
Figure 4.13 Box–Behnken design for three factors.
Figure 4.14 Mixture design for three factors at 10 levels based on a (3,3)-lattice design.
Figure 4.15 Surface (a) and contour (b) plots of enzymatic rate versus the factors PPD concentration and pH. The contour lines in (b) represent iso-rate lines.
Figure 4.16 Sequential optimization with single-factor-at-a-time strategy in the case of a response surface without factor interactions (a) and for a surface with interactions (b).
Figure 4.17 Fixed-size simplex according to Nelder and Mead along an unknown response surface.
Figure 4.18 Variable-size simplex.
Figure 4.19 Simplex search for optimum PPD concentration and pH value for the enzymatic determination of ceruloplasmin (cf. Example 4.7).
Chapter 5: Pattern Recognition and Classification
Figure 5.1 Demonstration of translation and scaling procedures: the original data in (a) are centered in (b) and autoscaled in (c). Notice that the autoscaling decreases the between-groups distance in the direction of greatest within-groups scatter and increases it in perpendicular direction in the sense of sphericization of groups.
Figure 5.2 Projection of a swarm of objects from the original two dimensions onto one dimension, that is, the score vector t_1, according to the criterion of maximum variance.
Figure 5.3 Scree plot for the principal component model of the hair data in Table 5.1.
Figure 5.4 Principal component scores for the hair data in Table 5.1.
Figure 5.5 Principal component loadings of the hair data in Table 5.1.
Figure 5.6 Biplot for the simultaneous characterization of the scores and loadings of two principal components of the hair data in Table 5.1.
Figure 5.7 Biplot for the simultaneous characterization of the scores and loadings of two principal components of the hair data in Table 5.1 computed on the basis of the NIPALS algorithm.
Figure 5.8 Simplified spectrochromatogram for an incompletely resolved peak in HPLC with diode array detection. Absorbance is given in milliabsorbance units.
Figure 5.9 Simplified UV spectra of the compounds benzo[k]fluoranthene (•), benzo[b]fluoranthene (), perylene (), and anthracene () as the targets.
Figure 5.10 Comparison of the target spectrum anthracene (•) with the predicted spectrum () in Eq. (5.83).
Figure 5.11 Schematic representation of evolving eigenvalues, λ, in dependence on intrinsically ordered data.
Figure 5.12 Source signals (a) and mixed signals (b).
Figure 5.13 Estimation of signals by PCA (a) and ICA (b).
Figure 5.14 Unfolding of three-way arrays for data analysis by conventional factorial methods.
Figure 5.15 The parallel factor analysis (PARAFAC) model for F factors.
Figure 5.16 Electrospray ionization–mass spectrometry (ESI-MS) spectrum of apomyoglobin.
Figure 5.17 Charge states for a single sample (a) derived from ESI-MS spectra in dependence on the pH value of solution. The four factors found after decomposing the array X into modes A (b), B (c), and C (d) by PARAFAC are also given.
Figure 5.18 Folding states of the protein myoglobin derived from PARAFAC decomposition of charge states observed in ESI-MS spectra in dependence on pH values.
Figure 5.19 Tucker3 model with different component numbers P, Q, and R in the three modes.
Figure 5.20 City-block distance (a) and Euclidean distance (b) for two features.
Figure 5.21 Dendrogram for the clinical–analytical data from Table 5.4.
Figure 5.22 Graphical method for grouping of the hair data in Table 5.1 for representatives of the three subject groups a, b, and c (for assignments cf. Table 5.8). (A) Stars, (B) sun-rays, and (C) Chernoff faces.
Figure 5.23 Chernoff faces for distinguishing serum samples of diseased and healthy patients on the basis of 20 clinical analyses.
Figure 5.24 Representation of iodine data from hair samples in Table 5.6; the two groups are labeled as open and filled circles.
Figure 5.25 Linear learning machine (LLM): representation of iodine data of Table 5.6 augmented by an additional dimension and separated by a straight-line boundary with the normal weight vector w.
Figure 5.26 Decision lines based on linear regression (LR) and linear discriminant analysis (LDA) for separating the objects represented by filled and empty circles.
Figure 5.27 Discriminant function of the LDA of the hair data in Table 5.1 based on the classification vector in Table 5.8. The centroids of the classes are labeled by a cross.
Figure 5.28 Differently spread (a) and differently directed (b) objects of the full and empty circled classes. (c) For case (a) the density function around the class centroids is given.
Figure 5.29 Separation boundary for classification of objects into two classes with k = 1.
Figure 5.30 k-NN classification by cross-validated models with one neighboring object (a) and four neighboring objects (b) for simulated data.
Figure 5.31 Fraction of misclassified objects in dependence on the number of neighbors used in k-NN classification models.
Figure 5.32 Soft independent modeling of class analogies (SIMCA) models for different numbers of significant principal components (PCs).
Figure 5.33 Separable case (a) and nonseparable (overlap) case (b) of the decision problem.
Figure 5.34 Partitioning in CART: panel (a) demonstrates the partitioning of a two-dimensional feature space into four regions by recursive binary splitting; panel (b) shows the resulting tree.
Figure 5.35 Decision boundaries for classification of four classes by CART in a single tree.
Figure 5.36 Dependence of the fraction of misclassifications in bagged CART models on the maximum number of splits (a) and the decision boundaries for 16 splits per layer (b).
Figure 5.37 Dependence of the fraction of misclassifications in boosted CART models on the maximum number of splits (a) and the decision boundaries for 14 splits per layer (b).
Figure 5.38 Simulated two-class data (a) and their classification by (b) LDA, (c) QDA, (d) SVM, (e) CART, and (f) k-NN.
Chapter 6: Modeling
Figure 6.1 Illustration of analysis of variance (ANOVA) for linear regression on the data in Table 6.2. See Table 6.2 for abbreviations.
Figure 6.2 Plot of the x–y data in Table 6.2 in connection with an ANOVA according to Table 6.1.
Figure 6.3 Confidence bands for the prediction of individual y values (broken lines, Eq. (6.25)) and for the mean from many y values (solid lines, Eq. (6.27)) for the data in Table 6.2 of Example 6.1.
Figure 6.4 Residual analysis in linear regression. (a) Time-dependent observations. (b) Heteroscedasticity. (c) Linear effects. (d) Quadratic effect.
Figure 6.5 x–y values for the case of heteroscedastic data.
Figure 6.6 Original NIR spectra (a) and OPLS-corrected spectra (b).
Figure 6.7 Straight-line modeling in the presence of an outlier and an influential observation.
Figure 6.8 NIR spectrum of wheat.
Figure 6.9 Recovery function for resubstitution of samples in case of inverse OLS calibration of protein (mass%) by means of NIR spectra.
Figure 6.10 Regression diagnostics for influential observations and outliers. (a) Cook's distance for recognition of influential observations. (b) Residual plot in dependence on the calculated y values. (c) Jackknifed residuals according to Eq. (6.102).
Figure 6.11 Error for calibration of 30 wheat samples by means of 176 wavelengths if principal component regression (PCR) (a) and partial least squares (PLS) regression (b) are used for data evaluation. The predictive residual sums of squares (PRESS) values are computed by Eq. (6.99). PRESS corresponds to the error due to resubstitution (Eq. (6.99)) and PRESS_CV to the error estimated by cross-validation (Eq. (6.101)).
Figure 6.12 Scheme of N-way PLS regression of responses, Y (K × M), on a three-way data array, X, after decomposition to one component t with its weighting vectors w.
Figure 6.13 Plot of the transformation of an x variable in the alternating conditional expectations (ACE) method. The plot is based on the data in Example 6.11, that is, the x variable corresponds here to a pH value.
Figure 6.14 Demonstration of the nonlinear relationship between latent variables for the univariate data in Example 6.10.
Figure 6.15 Modeling of the retention (k′) of anthranilic acid in HPLC in dependence on the pH value.
Figure 6.16 Recovery functions for cross-validated predictions of benzanthracene from UV spectra based on a single tree (a) and a bagged tree model (b).
Chapter 7: Analytical Databases
Figure 7.1 Structure of analytical databases.
Figure 7.2 Evaluation of the peak list of a spectrum.
Figure 7.3 Spheres around a carbon atom (bold face) as the basis for encoding the structure by the hierarchically ordered spherical description of environment (HOSE) code.
Figure 7.4 Chemical structure of phosgene represented as an undirected graph.
Figure 7.5 Example of a Markush structure with the general groups G_1 of phenyl, naphthenyl, N-pyridinyl, and G_2 of cycloalkyl.
Figure 7.6 Inverted list for an IR spectral library. ID: identity number of spectrum.
Figure 7.7 Connection of bits by exclusive OR (XOR).
Figure 7.8 Comparison of an unknown spectrum (A) with a candidate library spectrum (B) by the exclusive OR connective (XOR).
Figure 7.9 Set-oriented comparison of two spectra: (a) Unknown original spectrum. (b) Signal positions for the unknown spectrum. (c) Library spectrum with intervals for signal positions. (d) Comparison of (b) and (c) by intersection of the two sets (AND-connective).
Figure 7.10 Example of substructure search based on connection tables of the chemical structures (for representation of the connection table see Table 7.8).
Figure 7.11 Simulation of the ^1H NMR spectrum for cumene (above right) and 1,3,5-trimethylbenzene (below right) to enable verification of the measured spectrum (shown left) as being that of cumene.
Chapter 8: Knowledge Processing and Soft Computing
Figure 8.1 Modules of artificial intelligence (AI) research in analogy to mental faculties of human beings.
Figure 8.2 Representation of knowledge in the form of semantic nets (a) and frames (b).
Figure 8.3 Strategies for in-depth search (a) and in-breadth search (b).
Figure 8.4 Program to find an element in a list of elements by means of a list processing language (LISP) and a recursive programming in logic (PROLOG) procedure.
Figure 8.5 Structure of an expert system.
Figure 8.6 Analogy between a biological (a) and an artificial neuron (b).
Figure 8.7 Structure of an artificial neural network.
Figure 8.8 Operation of a single neuron.
Figure 8.9 Bidirectional associative memory (BAM) consisting of n input and p output neurons.
Figure 8.10 Insolubility of the XOR problem by using a network based on a single separation plane.
Figure 8.11 Solution of the XOR problem with a network that contains two neurons in the hidden layer.
Figure 8.12 Multilayer perceptron as basis for a backpropagation network.
Figure 8.13 Structure of self-organizing networks in linear arrangement (a) of the competitive layer and in Kohonen representation (b).
Figure 8.14 Neighborhood function in the form of a triangle (a) and a Mexican hat (b).
Figure 8.15 Artificial neural network for classification of four classes based on two-dimensional input data.
Figure 8.16 Decision boundaries of a feedforward neural network trained by a Bayesian regularization (a) and a conjugate gradient backpropagation algorithm (b).
Figure 8.17 Membership function for the detectability of elements by X-ray fluorescence analysis. Solid line: classical (crisp) set; broken line: fuzzy set.
Figure 8.18 Membership functions for characterization of uncertainty of experimental observations in the form of exponential (a), quadratic (b), and linear (c) functions.
Figure 8.19 Two-dimensional membership function for fitting a straight line.
Figure 8.20 Membership functions for the linguistic variable “high” according to Eq. (8.40) together with different modifiers.
Figure 8.21 Membership functions for truth values after E. Baldwin.
Figure 8.22 Intersection (a) and union (b) of two fuzzy sets and the cardinality of a fuzzy set (c).
Figure 8.23 Adaptive neuro-fuzzy inference system.
Figure 8.24 Simulated data (a) and assignment of cases in the test data set (b).
Figure 8.25 Example of a fuzzy difference of two numbers, that is, about 20 minus about 14 gives about 6.
Chapter 9: Quality Assurance and Good Laboratory Practice
Figure 9.1 Sequence of objective quantities in a control chart. The dotted lines characterize the lower and upper warning limits. The broken lines describe the lower and upper control limits.
Chapter 1: What is Chemometrics?
Table 1.1 Relationship between binary and decimal numbers
Table 1.2 Truth values for logical connectives of predicates p and q based on binary numbers
Table 1.3 Chemometric methods for data evaluation and interpretation
Chapter 2: Basic Statistics
Table 2.1 Spectrophotometric measurements (absorbances) of a sample solution from 15 repetitive measurements
Table 2.2 Frequency distribution of measurements from Table 2.1
Table 2.3 Descriptive statistics for the spectrophotometric measurements in Table 2.1
Table 2.4 Examples of error propagation for different dependencies of the analytical observation, y, on the factors, x
Table 2.5 Important areas according to the error integral
Table 2.6 Quantile of the one-sided Student's t distribution for three significance levels α and different degrees of freedom f. Note how the distribution approaches the Gaussian distribution if the degrees of freedom tend to infinity (cf. Table 2.5)
Table 2.7 Overview on hypothesis testing based on Student's t-test
Table 2.8 F-quantiles for α = 0.05 (normal) and α = 0.01 (bold) and for different degrees of freedom f_1 and f_2
Table 2.9 Relationship between testing hypotheses and the errors of the first and second kind
Table 2.10 Critical values for the Q-test at the 1% risk level
Table 2.11 Critical values for Grubbs' test at two significance levels
Table 2.12 Data scheme for a one-way analysis of variance
Table 2.13 Potassium concentration in mg·l^−1 from triple determinations in four different laboratories
Table 2.14 Results of a one-way analysis of variance for the potassium determinations as in Table 2.13
Table 2.15 Data design for a two-way analysis of variance
Table 2.16 Analytical determinations of manganese (percentage mass) in steel carried out in four different laboratories based on three analytical principles
Table 2.17 ANOVA for two-way analysis of variance of the data in Table 2.16
Table 2.18 Schematic representation of factors and the response for a multiway (multifactor) analysis of variance at factor levels 1–4
Chapter 3: Signal Processing and Time Series Analysis
Table 3.1 Coefficients of the Savitzky–Golay filter for smoothing based on a quadratic/cubic polynomial according to Eq. (3.2)
Table 3.2 Kalman filter algorithm
Table 3.3 Individual values for the time series in Figure 3.19
Chapter 4: Optimization and Experimental Design
Table 4.1 Selectivity measures in analytics
Table 4.2 Aggregation of performance characteristics to an objective function (explanation in text)
Table 4.3 Full factorial design at two levels, 2^3 design
Table 4.4 Factorial design for four factors at two levels
Table 4.5 Factorial design for five factors at two levels
Table 4.6 Plackett and Burman fractional factorial design for estimating the main effects of 11 factors at 2 levels
Table 4.7 A 2 × 2 Latin square design in different representations
Table 4.8 4 × 4 Latin square
Table 4.9 A 2^3 screening design and factor levels for estimation of the factors pH value, temperature (T), and p-phenylenediamine concentration (PPD)
Table 4.10 Central composite design for three factors consisting of a full-factorial two-level design and star design
Table 4.11 Box–Behnken design for three factors
Table 4.12 Number of factors and experimental points for the Box–Behnken design with three replications in the center of each design
Table 4.13 Factor levels and Box–Behnken design for studying the ceruloplasmin (CP) determination by RSM
Table 4.14 Full factorial 2^3 design arranged in two blocks
Table 4.15 Choice of initial simplexes for up to nine variables coded in the interval between 0 and 1
Chapter 5: Pattern Recognition and Classification
Table 5.1 Elemental contents of hair of different subjects in parts per million
Table 5.2 Eigenvalues and explained variances for the hair data in Table 5.1
Table 5.3 Absorbances in milliabsorbance units for the spectrochromatogram in Figure 5.8
Table 5.4 Concentrations of calcium and phosphate in six blood serum samples
Table 5.5 Parameters for hierarchical cluster analysis by means of the general distance formula after Lance and Williams in Eq. (5.101)
Table 5.8 Classification vector for the hair samples in Table 5.1
Table 5.6 Iodine content of hair samples from different patients
Table 5.7 Elemental content of an unknown hair sample in parts per million
Chapter 6: Modeling
Table 6.1 Computation of the sums of squares (SS) for a complete ANOVA in linear regression
Table 6.2 x–y data
Table 6.3 ANOVA table for linear regression of the data in Table 6.2
Table 6.4 ANOVA table for OLS calibration of NIR spectra in the analysis of protein in wheat
Table 6.5 pH dependence of the retention (k′) of anthranilic acid in HPLC
Table 6.6 Parameter for the model of retention (k′) of anthranilic acid in HPLC according to Eq. (6.112)
Table 6.7 Error for modeling the pH dependence of the retention (k′) of anthranilic acid in HPLC
Chapter 7: Analytical Databases
Table 7.1 Some analytical databases
Table 7.2 Important information of a JCAMP/DX exchange file for an IR spectrum
Table 7.3 Coding of spectra in the molecular database of the Organic Institute of ETH Zurich (OCETH)
Table 7.4 Symbols for the HOSE substructure code ordered by priority
Table 7.5 Representation of a chemical structure based on undirected graphs
Table 7.6 Connection matrix for the molecule phosgene (cf. Figure 7.4)
Table 7.7 Bond atoms and bonds in a connection table of phosgene (cf. Figure 7.4)
Table 7.8 A nonredundant connection table for phosgene (cf. Figure 7.4)
Table 7.9 Example relation for characterization of an analytical sample
Table 7.10 Hit list for comparison of UV spectra of corticoids in methanol/water (70/30%) eluent based on the correlation coefficient (Eq. (5.12) in Section 5.1)
Table 7.11 Molecular descriptors categorized by data type
Table 7.12 Molecular descriptors categorized by dimensionality of data
Chapter 8: Knowledge Processing and Soft Computing
Table 8.1 Examples of expert systems in analytics
Table 8.2 Transfer functions for neural nets
Table 8.3 Exclusive OR (XOR problem).
Table 8.4 Applications of neural networks in analytics
Table 8.5 Calculation of modifiers for the membership functions of the linguistic variables “high” according to Eq. (8.40)
Table 8.6 Applications of fuzzy theory in analytics
Table 8.7 Selection in the genetic algorithm in Example 8.11
Table 8.8 Crossovers in the genetic algorithm in Example 8.11
Table 8.9 Applications of genetic algorithms
Chapter 9: Quality Assurance and Good Laboratory Practice
Table 9.1 Control quantities in quality control chart
Table 9.2 D factors for calculation of the limits of the range chart for the probabilities P of 95% and 99%
Table 9.3 General criteria in the directive series EN 45000
Appendix
Table A.1 Probability density function (ordinate values) of the standardized normal distribution
Table A.2 Areas for the standard normal variate z (Eq. (2.28)) of the normal distribution
Table A.3 Two- and one-sided Student's t-distribution for different risk levels α and the degrees of freedom from f = 1 to f = 20
Table A.4 F-distribution for the risk levels α = 0.025 (lightface type) and α = 0.01 (boldface type) at the degrees of freedom f_1 and f_2
Table A.5 Chi-square distribution for different degrees of freedom f at different probabilities P, χ^2(P; f)
Table A.6 Kolmogorov–Smirnov test statistic d(1 − α, n) to test for a normal distribution at different significance levels α
Table A.7 Coefficients for computing first derivatives (Savitzky and Golay, 1964, see chapter 3.2 in General Reading Section)
Table A.8 Coefficients for computing second derivatives (Savitzky and Golay, 1964)
Table A.9 Two-level designs (half-cell designs) for three, four, and five factors
Table A.10 Central composite design for four factors with triplicate measurements in the center of the design
Table A.11 Box–Behnken design for four factors with triplicate measurements in the center of the design
Table A.12 Mixture designs (lattice designs) for three and four factors
Matthias Otto
Third Edition
Author
Matthias Otto
TU Bergakademie Freiberg
Inst. für Analytische Chemie
Leipziger Str. 29
09599 Freiberg
Germany
All books published by Wiley-VCH are carefully produced. Nevertheless, authors, editors, and publisher do not warrant the information contained in these books, including this book, to be free of errors. Readers are advised to keep in mind that statements, data, illustrations, procedural details or other items may inadvertently be inaccurate.
Library of Congress Card No.: applied for
British Library Cataloguing-in-Publication Data
A catalogue record for this book is available from the British Library.
Bibliographic information published by the Deutsche Nationalbibliothek
The Deutsche Nationalbibliothek lists this publication in the Deutsche Nationalbibliografie; detailed bibliographic data are available on the Internet at <http://dnb.d-nb.de>.
© 2017 Wiley-VCH Verlag GmbH & Co. KGaA, Boschstr. 12, 69469 Weinheim, Germany
All rights reserved (including those of translation into other languages). No part of this book may be reproduced in any form–by photoprinting, microfilm, or any other means–nor transmitted or translated into a machine language without written permission from the publishers. Registered names, trademarks, etc. used in this book, even when not specifically marked as such, are not to be considered unprotected by law.
Print ISBN: 978-3-527-34097-2
ePDF ISBN: 978-3-527-69936-0
ePub ISBN: 978-3-527-69938-4
Mobi ISBN: 978-3-527-69939-1
oBook ISBN: 978-3-527-69937-7
Cover Design Schulz Grafik-Design, Fußgönheim, Germany
ACE  alternating conditional expectations
ADC  analog-to-digital converter
ANFIS  adaptive neuro-fuzzy inference system
ANOVA  analysis of variance
CART  classification and regression trees
CAS  Chemical Abstracts Service
CPU  central processing unit
CRF  chromatographic response function
DAC  digital-to-analog converter
DTW  dynamic time warping
DWT  discrete wavelet transformation
EPA  Environmental Protection Agency
ESI  electrospray ionization
FA  factor analysis
FFT  fast Fourier transformation
FHT  fast Hadamard transformation
FT  Fourier transformation
GC  gas chromatography
HORD  hierarchically ordered ring description
HOSE  hierarchically ordered spherical description of environment
HPLC  high-performance liquid chromatography
HT  Hadamard transformation
I/O  input/output
ICA  independent component analysis
IND  indicator function
IR  infrared
ISO  International Organization for Standardization
JCAMP  Joint Committee on Atomic and Molecular Physical Data
KNN  k-nearest neighbor method
LAN  local area network
LDA  linear discriminant analysis
LIMS  laboratory information management system
LISP  list processing language
LLM  linear learning machine
MANOVA  multidimensional ANOVA
MARS  multivariate adaptive regression splines
MCR  multivariate curve resolution
MS  mass spectrometry
MSDC  mass spectrometry data center
MSS  mean sum of squares
NIPALS  nonlinear iterative partial least squares
NIR  near infrared
NIST  National Institute of Standards and Technology
NLR  nonlinear regression
NMR  nuclear magnetic resonance
NPLS  nonlinear partial least squares
N-PLS  N-way partial least squares
OLS  ordinary least squares
OPLS  orthogonal partial least squares
PARAFAC  parallel factor analysis
PCA  principal component analysis
PCR  principal component regression
PLS  partial least squares
PRESS  predictive residual sum of squares
PROLOG  programming in logic
RAM  random access memory
RDA  regularized discriminant analysis
RE  real error
ROM  read-only memory
RR  recovery rate
RSD  relative standard deviation
RSM  response surface method
SA  simulated annealing
SEC  standard error of calibration
SEP  standard error of prediction
SIMCA  soft independent modeling of class analogies
SMILES  simplified molecular input line entry specification
SS  sum of squares
SVD  singular value decomposition
SVM  support vector machines
TS  tabu search
TTFA  target transformation factor analysis
UV  ultraviolet
VIS  visible
VM  vector machines
WT  wavelet transformation
α  Significance level (risk), separation factor, complexity parameter
A  Area
b  Breadth (width), regression parameter
χ^2  Quantile of the chi-squared distribution
cov  Covariance
C  Variance–covariance matrix
δ  Error, chemical shift
d  Distance measure, test statistic of the Kolmogorov–Smirnov test
D  Difference test statistic
D_k  Cook's test statistic
η  Learning coefficient
e  Error
E(·)  Expectation value
E  Residual matrix
f  Degree of freedom, function
f(x)  Probability density function
F  Quantile of the Fisher distribution
F(·)  Function in frequency space
f(t)  Function in the time domain
F  Matrix of scores
G  Geometric mean
G(·)  Smoothing function in frequency space
g(t)  Smoothing function in the time domain
h  Communality
H  Harmonic mean
H  Hat matrix, Hadamard transformation matrix
H_0  Null hypothesis
H_1  Alternative hypothesis
H(·)  Filter function in frequency space
h(t)  Filter function in the time domain
I  Unit step function
I  Identity matrix
I_50  Interquartile range
J  Jacobian matrix
k  Kurtosis
k  Capacity factor
K_A  Protolysis (acid) constant
λ  Eigenvalue, Poisson parameter
L  Loading matrix
μ  Population mean
m  Membership function
m_r  Moment of distribution
nf  Neighborhood function
N  Analytical resolution, plate number
N(ν)  Noise
p  Portion
P  Probability
Q  Quartile, Dixon statistics
r  Correlation coefficient, radius
R  Correlation matrix
R_S  Chromatographic resolution
R^2  Coefficient of determination
σ  Standard deviation
s  Estimate of standard deviation, skewness
s_r  Estimate of relative standard deviation
S  Similarity measure
τ  Time lag
t  Quantile of the Student distribution
T  Test quantity of Grubbs' test
T  Matrix of principal component scores, transformation matrix
U  Matrix of left eigenvectors
R  Range, region in CART
V  Matrix of right eigenvectors
w  Singular value, weight
W  Warping path
x  (Independent) variable
x  Vector of independent variables
X  Matrix of dependent variables
x̄  Arithmetic mean
ξ  Slack variable
y  (Dependent) variable
y*  Transformed (filtered) value
z  Standard normal deviate, signal position, objective function
To define chemometrics
To learn how to count with bits and how to perform arithmetic or logical operations in a computer
To understand the principal terminology for computer systems and the meaning of robotics and automation.
The development of the discipline chemometrics is strongly related to the use of computers in chemistry. Some analytical groups in the 1970s were already working with statistical and mathematical methods that are nowadays ascribed to chemometrics. Those early investigations were connected to the use of mainframe computers.
The term chemometrics was introduced in 1972 by the Swede Svante Wold and the American Bruce R. Kowalski. The foundation of the International Chemometrics Society in 1974 led to the first description of this discipline. In the following years, several conference series were organized, for example, Computer Application in Analytics (COMPANA), Computer-Based Analytical Chemistry (COBAC), and Chemometrics in Analytical Chemistry (CAC). Some journals devoted special sections to papers on chemometrics. Later, novel chemometric journals were started, such as the Journal of Chemometrics (Wiley) and Chemometrics and Intelligent Laboratory Systems (Elsevier).
An actual definition of chemometrics is:
the chemical discipline that uses mathematical and statistical methods, (a) to design or select optimal measurement procedures and experiments, and (b) to provide maximum chemical information by analyzing chemical data.
The discipline of chemometrics originates in chemistry. Typical applications of chemometric methods are the development of quantitative structure–activity relationships and the evaluation of analytical–chemical data. The data flood generated by modern analytical instrumentation is one reason that analytical chemists in particular develop applications of chemometric methods. Chemometrics in analytics is the discipline that uses mathematical and statistical methods to obtain relevant information on material systems.
With the availability of personal computers at the beginning of the 1980s, a new age commenced for the acquisition, processing, and interpretation of chemical data. In fact, today, every scientist uses software, in one form or another, that is related to mathematical methods or to processing of knowledge. As a consequence, the necessity emerges for a deeper understanding of those methods.
The education of chemists in mathematics and statistics is usually unsatisfactory. Therefore, one of the initial aims of chemometrics was to make complicated mathematical methods practicable. Meanwhile, the commercialized statistical and numerical software simplifies this process, so that all important chemometric methods can be taught in appropriate computer demonstrations.
Apart from the statistical–mathematical methods, the topics of chemometrics are also related to problems of the computer-based laboratory, to methods for handling chemical or spectroscopic databases, and to methods of artificial intelligence.
In addition, chemometricians contribute to the development of all these methods. As a rule, these developments are dedicated to particular practical requirements, such as the automatic optimization of chromatographic separations or the prediction of the biological activity of a chemical compound.
Nowadays, the computer is an indispensable tool in research and development. The computer is linked to analytical instrumentation; it serves as a tool for acquiring data, word processing, and handling databases and quality assurance systems. In addition, the computer is the basis for modern communication techniques such as electronic mail or video conferencing. In order to understand the important principles of computer usage, some fundamentals are considered here, that is, coding and processing of digital information, the main components of the computer, programming languages, computer networking, and automation processes.
The use of digital data provides several advantages over the use of analog data. Digital data are less sensitive to noise: the only noise arises from round-off errors due to the finite representation of the digits of a number. They are also less prone to, for instance, electrical interferences, and they are compatible with digital computers.
As a rule, primary data are generated as analog signals either in a discrete or a continuous mode (Figure 1.1). For example, monitoring the intensity of optical radiation by means of a photocell provides a continuous signal. Weak radiation, however, could be monitored by detecting individual photons by a photomultiplier.
Figure 1.1 Signal dependence on time of an analog (a) and a digital detector (b).
Usually, the analog signals generated are converted into digital data by an analog-to-digital converter (ADC) as explained as follows.
In a digital measurement, the number of pulses occurring within a specified set of boundary conditions is counted. The easiest way to count is to have the pulses represented as binary numbers. In this way, only two electronic states are required. To represent the decimal numbers from 0 to 9, one would need 10 different states. Typically, the binary numbers 0 and 1 are represented electronically by voltage signals of 0.5 and 5 V, respectively. Binary numbers characterize coefficients of the power of 2, so that any number of the decimal system can be described.
The decimal number 77 is expressed as the binary number 1001101, that is,

1×2^6 + 0×2^5 + 0×2^4 + 1×2^3 + 1×2^2 + 0×2^1 + 1×2^0 = 64 + 0 + 0 + 8 + 4 + 0 + 1 = 77
Table 1.1 summarizes further relationships between binary and decimal numbers. Every binary number is composed of individual bits (binary digits). The digit lying farthest to the right is termed the least significant digit and the one on the left is the most significant digit.
Table 1.1 Relationship between binary and decimal numbers
Binary number  Decimal number
0  0
1  1
10  2
11  3
100  4
101  5
110  6
111  7
1000  8
1001  9
1010  10
1101  13
10000  16
100000  32
1000000  64
How are calculations done using binary numbers? Arithmetic operations are similar to, but simpler than, those for decimal numbers. For addition, for example, four combinations are feasible:
0 + 0 = 0
0 + 1 = 1
1 + 0 = 1
1 + 1 = 10
Note that for addition of the binary numbers 1 plus 1, a 1 is carried over to the next higher power of 2.
Consider addition of 21 + 5 in the case of a decimal (a) and of a binary number (b):
a. Decimal: 21 + 5 = 26
b. Binary: 10101 + 101 = 11010
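The carry rule can be verified with a small Python sketch (again our own illustration, with made-up names): it adds two binary strings digit by digit, carrying a 1 to the next higher power of 2 whenever the sum of a column exceeds 1:

def add_binary(a, b):
    """Add two binary numbers given as strings of 0s and 1s."""
    result = []
    carry = 0
    for i in range(max(len(a), len(b))):
        da = int(a[-1 - i]) if i < len(a) else 0  # digit of a at power 2**i
        db = int(b[-1 - i]) if i < len(b) else 0  # digit of b at power 2**i
        total = da + db + carry
        result.append(str(total % 2))  # digit kept at this position
        carry = total // 2             # 1 + 1 = 10: carry 1 to the next column
    if carry:
        result.append("1")
    return "".join(reversed(result))

print(add_binary("10101", "101"))  # 11010, that is, 21 + 5 = 26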
Apart from arithmetic operations in the computer, logical reasoning is necessary too. This might be in the course of an algorithm or in connection with an expert system. Logical operations with binary numbers are summarized in Table 1.2.
Table 1.2 Truth values for logical connectives of predicates p and q based on binary numbers
p  q  p AND q  p OR q  IF p THEN q  NOT p
1  1  1  1  1  0
1  0  0  1  0  —
0  1  0  1  1  1
0  0  0  0  1  —
1 true and 0 false.
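Table 1.2 can be reproduced with elementary bit operations. The Python sketch below is our own illustration (not from the text); it uses the fact that IF p THEN q, the material implication, is equivalent to (NOT p) OR q. The dashes in Table 1.2 appear here as explicit 0 and 1 values:

print("p q AND OR IF-THEN NOT-p")
for p in (1, 0):
    for q in (1, 0):
        p_and_q = p & q            # bitwise AND
        p_or_q = p | q             # bitwise OR
        if_p_then_q = (1 - p) | q  # implication expressed as (NOT p) OR q
        not_p = 1 - p              # negation of a single bit
        print(p, q, p_and_q, p_or_q, if_p_then_q, not_p)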
It should be mentioned that a very compact representation of numbers is based on the hexadecimal number system. However, hexadecimal numbers are easily converted into binary data, so the details need not be explored here.
In order to benefit from the advantages of digital data evaluation, the analog signals are converted into digital ones. An analog signal consists of an infinitely dense sequence of signal values at a theoretically infinitely small resolution. The conversion of analog into digital signals in the ADC results in a definite reduction of information. For conversion, signal values are sampled in a predefined time interval and quantified in an n-ary raster (Figure 1.2). The output signal is a code word consisting of n bits. Using n bits, 2^n different levels can be coded; for example, an 8-bit ADC has a resolution of 2^8 = 256 amplitude levels.
Figure 1.2 Digitization of an analog signal by an analog-to-digital converter (ADC).
Converting digital into analog information is necessary if an external device is to be controlled or if the data have to be represented by an analog output unit. The resolution of the analog signal is determined by the number of processed bits in the converter. A 10-bit DAC provides 2^10 = 1024 different voltage increments. Its resolution is then 1/1024 or approximately 0.1%.
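Both conversions can be illustrated numerically. The following Python sketch is our own example; the names and the assumed input range 0 to v_max are not from the text. It quantizes a voltage with an n-bit ADC and computes the relative resolution of an n-bit DAC:

def adc_code(voltage, v_max, n_bits):
    """Quantize a voltage in the range [0, v_max] to one of 2**n_bits levels."""
    levels = 2 ** n_bits
    code = round(voltage / v_max * (levels - 1))  # nearest quantization level
    return max(0, min(levels - 1, code))          # clip to the valid code range

def dac_resolution(n_bits):
    """Relative size of one voltage increment of an n-bit DAC, 1/2**n_bits."""
    return 1 / 2 ** n_bits

print(adc_code(2.5, 10.0, 8))  # 8-bit ADC with 2**8 = 256 levels gives code 64
print(dac_resolution(10))      # 10-bit DAC: about 0.001, that is, roughly 0.1%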
Representation of numbers in a computer by bits has already been considered. The combination of 8 bits is called a byte. A series of bytes arranged in sequence to represent a piece of data is termed a word. Typical word sizes are 8, 16, 32, or 64 bits or 1, 2, 4, and 8 bytes.
Words are processed in registers. A sequence of operations in a register enables algorithms to be performed. One or several algorithms make up a computer program.
The physical components of a computer form the hardware. Hardware includes the disk and hard drives, clocks, memory units, and registers for arithmetic and logical operations. Programs and instructions for the computer, including the tapes and disks for their storage, represent the software.
A bus consists of a set of parallel conductors that forms a main transition path in a computer.
The heart of a computer is the central processing unit (CPU). In a microprocessor or minicomputer, this unit consists of a highly integrated chip.
The different components of a computer, its memory, and the peripheral devices, such as printers or scanners, are joined by buses. To guarantee rapid communication among the various parts of a computer, information is exchanged on the basis of a defined word size, for example, 16 bits, simultaneously over parallel lines of the bus. A data bus serves the exchange of data into and out of the CPU. The origin and the destination of the data in the bus are specified by the address bus. For example, an address bus with 16 lines can address 2^16 = 65,536 different registers or other locations in the computer or in its memory. Control and status information to and from the CPU are administered via the control bus. The peripheral devices are controlled by an external bus system, for example, an RS-232 interface for serial data transfer or the IEEE-488 interface for parallel transfer of data.
The microcomputer or microprocessor typically contains two kinds of memory: random access memory (RAM) and read-only memory (ROM). The term RAM is somewhat misleading and historically rooted, since random access is feasible for RAM and ROM alike. RAM can be used to read and write information. In contrast, information in a ROM is written once, so that it can be read but not reprogrammed. ROMs are needed in microcomputers or pocket calculators in order to perform fixed programs, for example, for calculation of logarithms or standard deviations.
Larger programs and data collections are stored in bulk storage devices. In the beginning of the computer age, magnetic tapes were the standard here. Nowadays, CDs, DVDs, and Blu-ray discs are used, providing a storage capacity of several gigabytes. In addition, every computer is equipped with a hard disk of at least several gigabytes. The access time to retrieve the stored information is on the order of a few milliseconds.
Communication with the computer is carried out by input–output (I/O) operations. Typical input devices are keyboard, magnetic tapes and disks, and the signals of an analytical instrument. Output devices are screens, printers, and plotters, as well as tapes and disks. To convert analog information into digital or vice versa, the aforementioned ADCs or DACs are used.
Programming a computer at 0 and 1 states or bits is possible using machine code. Since this kind of programming is rather time consuming, higher level languages have been developed where whole groups of bit operations are assembled. However, these so-called assembler languages are still difficult to handle. Therefore, high-level algorithmic languages, such as FORTRAN, BASIC, PASCAL, or C, are more common in analytical chemistry. With high-level languages, the instructions for performing an algorithm can easily be formulated in a computer program. Thereafter, these instructions are translated into machine code by means of a compiler.
For logical programming, additional high-level languages exist, for example, LISP (list processing language) or PROLOG (programming in logic). Further developments are found in so-called shells, which can be used directly for building expert systems.
A very effective communication between computers, analytical instruments, and databases is based on networks. There are local nets, for example, within an industrial laboratory, as well as national or worldwide networks. Local area networks (LANs) are used to transfer information about analysis samples, measurements, research projects, or in-house databases. A typical LAN is demonstrated in Figure 1.3. It contains a laboratory information management system (LIMS), where all information about the sample or the progress of a project can be stored and further processed (cf. Section 7.1).
Figure 1.3 Local area network (LAN) to connect analytical instruments, a robot, and a laboratory information management system (LIMS).
Worldwide networking is feasible, for example, via the Internet or CompuServe. These nets are used to exchange electronic mail (e-mail) or data with universities, research institutions, or industry.
Apart from acquiring and processing analytical data, the computer can also be used to control or supervise automatic procedures. To automate manual procedures, a robot is applied. A robot is a reprogrammable device that can perform a task more cheaply and effectively than a human being.
Typical geometric shapes of a robot arm are sketched in Figure 1.4. The anthropomorphic geometry (Figure 1.4a) is derived from the human torso, that is, there is a waist, a shoulder, an elbow, and a wrist. Although this type of robot is mainly found in the automobile industry, it can also be used for manipulation of liquid or solid samples.
Figure 1.4 Anthropomorphic (a) and cylindrical (b) geometry of robot arms.
In the chemical laboratory, the cylindrical geometry dominates (Figure 1.4b). The revolving robot arm can be moved in horizontal and vertical directions. Typical operations of a robot are as follows:
Manipulation of test tubes or glassware around the robotic work area
Weighing, for the determination of a sample amount or for checking unit operations, for example, addition of solvent
Liquid handling, in order to dilute or add reagent solutions
Conditioning of a sample by heating or cooling
Separations based on filtrations or extractions
Measurements by analytical procedures, such as spectrophotometry or chromatography
Control and supervision of the different analytical steps.
Programming of a robot is based on software dedicated to the actual manufacturer. The software consists of elements to control the peripheral devices (robot arm, balance, pumps), to switch the devices on and off, and to provide instructions on the basis of logical structures, for example, IF–THEN rules.
Alternatives for automation in a laboratory are discrete analyzers and flowing systems. By means of discrete analyzers, unit operations such as dilution, extraction, and dialysis can be automated. Continuous flow analyzers or flow injection analysis serve similar objectives for automation, for example, for the determination of clinical parameters in blood serum.
The transfer of manual operations to a robot or an automated system provides the following advantages:
High productivity and/or minimization of costs
Improved precision and trueness of results
Increased assurance for performing laboratory operations
Easier validation of the different steps of an analytical procedure.
The increasing degree of automation in the laboratory leads to more and more measurements that are available online in the computer and have to be further processed by chemometric data evaluation methods.
Table 1.3 provides an overview of chemometric methods. The main emphasis is on statistical–mathematical methods. Random data are characterized and tested by the descriptive and inference methods of statistics, respectively. Their importance increases in connection with the aims of quality control and quality assurance. Signal processing is carried out by means of algorithms for smoothing, filtering, derivation, and integration. Transformation methods such as the Fourier or Hadamard transformations also belong in this area.
Table 1.3 Chemometric methods for data evaluation and interpretation
Descriptive and inference statistics
Signal processing
Experimental design
Modeling
Optimization
Pattern recognition
Classification
Artificial intelligence methods
Image processing
Information and system theory
Efficient experimentation is based on the methods of experimental design and its quantitative evaluation. The latter can be performed by means of mathematical models or graphical representations. Alternatively, sequential methods, such as the simplex method, are applied instead of these simultaneous methods of experimental optimization. There, the optimum conditions are found by systematically searching for the objective criterion, for example, the maximum yield of a chemical reaction, in the space of all experimental variables.
To find patterns in data and to assign samples, materials, or, in general, objects, to those patterns, multivariate methods of data analysis are applied. Recognition of patterns, classes, or clusters is feasible with projection methods, such as principal component analysis or factor analysis, or with cluster analysis. To construct class models for classification of unknown objects, we will introduce discriminant analyses.
To characterize the information content of analytical procedures, information theory is used in chemometrics.
A further subject of chemometrics is the computer-based processing of chemical structures and spectra.
There, it might be necessary to extract a complete or partial structure from a collection of molecular structures or to compare an unknown spectrum with the spectra of a spectral library.
For both kinds of queries, methods for representation and manipulation of structures and spectra in databases are needed. In addition, problems of data exchange formats, for example, between a measured spectrum and a spectrum of a database, are to be decided.
If no comparable spectrum is found in a spectral library, then methods for spectra interpretation become necessary. For interpretation of atomic and molecular spectra, in principle, all the statistical methods for pattern recognition are appropriate (cf. Section 1.2). In addition, methods of artificial intelligence are used. They include methods of logical reasoning and tools for developing expert systems. Apart from the methods of classical logic in this context, methods of approximate reasoning and of fuzzy logic can also be exploited. These interpretation systems constitute methods of knowledge processing in contrast to data processing based on mathematical–statistical methods.
Knowledge acquisition is mainly based on expert knowledge, for example, the infrared spectroscopist is asked to contribute his knowledge in the development of an interpretation system for infrared spectra. Additionally, methods are required for automatic knowledge acquisition in the form of machine learning.
Methods based on fuzzy theory, neural nets, and evolutionary strategies are denoted as soft computing.
The methods of artificial intelligence and machine learning are not restricted to the interpretation of spectra. They can also be used to develop expert systems, for example, for the analysis of drugs or the synthesis of an organic compound.
Novel methods are based on biological analogs, such as neural networks and evolutionary strategies, for example, genetic algorithms. Future areas of research for chemometricians will include the investigation of fractal structures in chemistry and of models based on the theory of chaos.
1. Sharaf, M.A., Illman, D.L., and Kowalski, B.R. (1986) Chemometrics, Chemical Analysis Series, vol. 82, John Wiley & Sons, Inc., New York.
2. Massart, D.L., Vandeginste, B.G.M., Deming, S.N., Michotte, Y., and Kaufman, L. (1988) Chemometrics: A Textbook, Elsevier, Amsterdam.
3. Brown, S.D., Tauler, R., and Walczak, B. (eds) (2009) Comprehensive Chemometrics: Chemical and Biochemical Data Analysis, 4 volumes, Elsevier, Amsterdam.
4. Varmuza, K. and Filzmoser, P. (2009) Introduction to Multivariate Statistical Analysis in Chemometrics, CRC Press, Boca Raton, FL.
1. Calculate the resolution for 10-, 16-, and 20-bit analog-to-digital converters.
2. How many bits are stored in an 8-byte word?
3. What is the difference between procedural and logical programming languages?
4. Discuss typical operations of an analytical robot.