Azure Machine Learning is a cloud service for accelerating and managing the machine learning (ML) project life cycle that ML professionals, data scientists, and engineers can use in their day-to-day workflows. This book covers the end-to-end ML process using Microsoft Azure Machine Learning, including data preparation, performing and logging ML training runs, designing training and deployment pipelines, and managing these pipelines via MLOps.
The first section shows you how to set up an Azure Machine Learning workspace; ingest and version datasets; and preprocess, label, and enrich these datasets for training. In the next two sections, you'll discover how to train ML models for embedding, classification, and regression. You'll explore advanced NLP techniques, traditional ML models such as boosted trees, modern deep neural networks, recommendation systems, reinforcement learning, and complex distributed ML training techniques, all using Azure Machine Learning.
The last section will teach you how to deploy the trained models as a batch pipeline or real-time scoring service using Docker, Azure Machine Learning clusters, Azure Kubernetes Services, and alternative deployment targets.
By the end of this book, you’ll be able to combine all the steps you’ve learned by building an MLOps pipeline.


Mastering Azure Machine Learning

Second Edition

Execute large-scale end-to-end machine learning with Azure

Christoph Körner

Marcel Alsdorf

BIRMINGHAM—MUMBAI

Mastering Azure Machine Learning

Second Edition

Copyright © 2022 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Publishing Product Manager: Ali Abidi

Senior Editor: Nathanya Dias

Content Development Editor: Manikandan Kurup

Technical Editor: Devanshi Ayare

Copy Editor: Safis Editing

Project Coordinator: Aparna Ravikumar Nair

Proofreader: Safis Editing

Indexer: Rekha Nair

Production Designer: Aparna Bhagat

Marketing Coordinators: Abeer Dawe, Shifa Ansari

First published: March 2020

Second edition: May 2022

Production reference: 1220422

Published by Packt Publishing Ltd.

Livery Place

35 Livery Street

Birmingham

B3 2PB, UK.

ISBN 978-1-80323-241-6

www.packt.com

Contributors

About the authors

Christoph Körner previously worked as a cloud solution architect for Microsoft, specializing in Azure-based big data and machine learning solutions, where he was responsible for designing end-to-end machine learning and data science platforms. He currently works for a large cloud provider on highly scalable distributed in-memory database services. Christoph has authored four books: Deep Learning in the Browser for Bleeding Edge Press, as well as Mastering Azure Machine Learning (first edition), Learning Responsive Data Visualization, and Data Visualization with D3 and AngularJS for Packt Publishing.

Marcel Alsdorf is a cloud solution architect with 5 years of experience at Microsoft consulting various companies on their cloud strategy. In this role, he focuses on supporting companies in their move toward being data-driven by analyzing their requirements and designing their data infrastructure in the areas of IoT and event streaming, data warehousing, and machine learning. On the side, he shares his technical and business knowledge as a coach in hackathons, as a mentor for start-ups and peers, and as a university lecturer. Before his current role, he worked as an FPGA engineer for the LHC project at CERN and as a software engineer in the banking industry.

I would like to thank Anthony Pino for the use of his housing dataset, Stefanie Grois for her hands-on insight into ML on IoT edge devices, and Henry Kröger for always being there. Further, a huge thanks goes to my friend and co-author Christoph, who let me take over his book. Finally, thanks to everyone who survived being around me during the work on this book.

About the reviewers

Nirbhay Anand has a master's degree in computer application and is also a Microsoft Certified Technology Specialist with 16 years of industry experience in software product development. He has developed software in different domains such as investment banking, manufacturing, supply chains, power forecasting, and railroads. He was head of delivery for a cosmetic company while working with C3IT Solutions Pvt. Ltd. He is associated with CloudMoyo, a leading cloud and analytics partner for Microsoft. CloudMoyo brings together powerful BI capabilities using the Azure data platform to transform complex data into business insights. He is currently working on products as a tech program manager.

I would like to thank my wife, Vijeta, and kids, Navya and Nitrika, for their support. I thank my friends too for their never-ending support.

Alexey Bokov is an experienced Azure architect and has been a Microsoft technical evangelist since 2011. He works closely with Microsoft top-tier customers all around the world to develop applications based on the Azure cloud platform. Building cloud-based applications in the most challenging scenarios is his passion, as well as helping the development community to upskill and learn new things by working hands-on and hacking. He is a long-time contributor, as a coauthor and reviewer, to many Azure books and is an occasional speaker at Kubernetes events.

Table of Contents

Preface

Section 1: Introduction to Azure Machine Learning

Chapter 1: Understanding the End-to-End Machine Learning Process

Grasping the idea behind ML

Problems and scenarios requiring ML

The history of ML

Understanding the inner workings of ML through the example of ANNs

Understanding the mathematical basis for statistical analysis and ML modeling

The case for statistics in ML

Basics of statistics

Understanding bias

Classifying ML algorithms

Analyzing errors and the quality of results of model training

Discovering the end-to-end ML process

Excavating data and sources

Preparing and cleaning data

Defining labels and engineering features

Training models

Deploying models

Developing and operating enterprise-grade ML solutions

Summary

Chapter 2: Choosing the Right Machine Learning Service in Azure

Choosing an Azure service for ML

Navigating the Azure AI landscape

Consuming a managed AI service

Building a custom AI service

What is the Azure Machine Learning service?

Managed ML services

Azure Cognitive Services

Custom Cognitive Services

Azure Applied AI Services

Custom ML services

Azure Machine Learning Studio (classic)

Azure Machine Learning designer

Azure Automated Machine Learning

Azure Machine Learning workspace

Custom compute services for ML

Azure Databricks

Azure Batch

Data Science Virtual Machines

Summary

Chapter 3: Preparing the Azure Machine Learning Workspace

Technical requirements

Deploying an Azure Machine Learning workspace

Understanding the available tooling for Azure deployments

Deploying the workspace

Exploring the Azure Machine Learning service

Analyzing the deployed services

Understanding the workspace interior

Surveying Azure Machine Learning Studio

Running ML experiments with Azure Machine Learning

Setting up a local environment

Enhancing a simple experiment

Logging metrics and tracking results

Scheduling the script execution

Running experiments on a cloud compute

Summary

Section 2: Data Ingestion, Preparation, Feature Engineering, and Pipelining

Chapter 4: Ingesting Data and Managing Datasets

Technical requirements

Choosing data storage solutions for Azure Machine Learning

Organizing data in Azure Machine Learning

Understanding the default storage accounts of Azure Machine Learning

Exploring options for storing training data in Azure

Creating a datastore and ingesting data

Creating Blob Storage and connecting it with the Azure Machine Learning workspace

Ingesting data into Azure

Using datasets in Azure Machine Learning

Tracking datasets in Azure Machine Learning

Accessing data during training

Using external datasets with open datasets

Summary

Chapter 5: Performing Data Analysis and Visualization

Technical requirements

Understanding data exploration techniques

Exploring and analyzing tabular datasets

Exploring and analyzing file datasets

Performing data analysis on a tabular dataset

Initial exploration and cleansing of the Melbourne Housing dataset

Running statistical analysis on the dataset

Finding and handling missing values

Calculating correlations and feature importance

Tracking figures from exploration in Azure Machine Learning

Understanding dimensional reduction techniques

Unsupervised dimensional reduction using PCA

Supervised dimensional reduction using LDA

Non-linear dimensional reduction using t-SNE

Generalizing t-SNE with UMAP

Summary

Chapter 6: Feature Engineering and Labeling

Technical requirements

Understanding and applying feature engineering

Classifying feature engineering techniques

Discovering feature transformation and extraction methods

Testing feature engineering techniques on a tabular dataset

Handling data labeling

Analyzing scenarios that require labels

Performing data labeling for image classification using the Azure Machine Learning labeling service

Summary

Chapter 7: Advanced Feature Extraction with NLP

Technical requirements

Understanding categorical data

Comparing textual, categorical, and ordinal data

Transforming categories into numeric values

Orthogonal embedding using one-hot encoding

Semantics and textual values

Building a simple bag-of-words model

A naïve bag-of-words model using counting

Tokenization – turning a string into a list of words

Stemming – the rule-based removal of affixes

Lemmatization – dictionary-based word normalization

A bag-of-words model in scikit-learn

Leveraging term importance and semantics

Generalizing words using n-grams and skip-grams

Reducing word dictionary size using SVD

Measuring the importance of words using TF-IDF

Extracting semantics using word embeddings

Implementing end-to-end language models

The end-to-end learning of token sequences

State-of-the-art sequence-to-sequence models

Text analytics using Azure Cognitive Services

Summary

Chapter 8: Azure Machine Learning Pipelines

Technical requirements

Using pipelines in ML workflows

Why build pipelines?

What are Azure Machine Learning pipelines?

Building and publishing an ML pipeline

Creating a simple pipeline

Connecting data inputs and outputs between steps

Publishing, triggering, and scheduling a pipeline

Parallelizing steps to speed up large pipelines

Reusing pipeline steps through modularization

Integrating pipelines with other Azure services

Building pipelines with Azure Machine Learning designer

Azure Machine Learning pipelines in Azure Data Factory

Azure Pipelines for CI/CD

Summary

Section 3: The Training and Optimization of Machine Learning Models

Chapter 9: Building ML Models Using Azure Machine Learning

Technical requirements

Working with tree-based ensemble classifiers

Understanding a simple decision tree

Combining classifiers with bagging

Optimizing classifiers with boosting rounds

Training an ensemble classifier model using LightGBM

LightGBM in a nutshell

Preparing the data

Setting up the compute cluster and execution environment

Building a LightGBM classifier

Scheduling the training script on the Azure Machine Learning cluster

Summary

Chapter 10: Training Deep Neural Networks on Azure

Technical requirements

Introduction to Deep Learning

Why Deep Learning?

From neural networks to deep learning

DL versus traditional ML

Using traditional ML with DL-based feature extractors

Training a CNN for image classification

Training a CNN from scratch in your notebook

Generating more input data using augmentation

Training on a GPU cluster using Azure Machine Learning

Improving your performance through transfer learning

Summary

Chapter 11: Hyperparameter Tuning and Automated Machine Learning

Technical requirements

Finding the optimal model parameters with HyperDrive

Sampling all possible parameter combinations using grid search

Testing random combinations using random search

Converging faster using early termination

Optimizing parameter choices using Bayesian optimization

Finding the optimal model with Automated Machine Learning

The unfair advantage of Automated Machine Learning

A classification example with Automated Machine Learning

Summary

Chapter 12: Distributed Machine Learning on Azure

Technical requirements

Exploring methods for distributed ML

Training independent models on small data in parallel

Training a model ensemble on large datasets in parallel

Fundamental building blocks for distributed ML

Speeding up deep learning with data-parallel training

Training large models with model-parallel training

Using distributed ML in Azure

Horovod – a distributed deep learning training framework

Implementing the HorovodRunner API for a Spark job

Training models with Horovod on Azure Machine Learning

Summary

Chapter 13: Building a Recommendation Engine in Azure

Technical requirements

Introduction to recommendation engines

A content-based recommender system

Measuring the similarity between items

Feature engineering for content-based recommenders

Content-based recommendations using gradient boosted trees

Collaborative filtering – a rating-based recommender system

What is a rating? Explicit feedback versus implicit feedback

Predicting the missing ratings to make a recommendation

Scalable recommendations using ALS factorization

Combining content and ratings in hybrid recommendation engines

Automatic optimization through reinforcement learning

Summary

Section 4: Machine Learning Model Deployment and Operations

Chapter 14: Model Deployment, Endpoints, and Operations

Technical requirements

Preparations for model deployments

Understanding the components of an ML model

Registering your models in a model registry

Auto-deployments of registered models

Customizing your deployment environment

Choosing a deployment target in Azure

Deploying ML models in Azure

Building a real-time scoring service

Deploying to Azure Kubernetes Services

Defining a schema for scoring endpoints

Managing model endpoints

Controlled rollouts and A/B testing

Implementing a batch-scoring pipeline

ML operations in Azure

Profiling models for optimal resource configuration

Collecting logs and infrastructure metrics

Tracking telemetry and application metrics

Detecting data drift

Summary

Chapter 15: Model Interoperability, Hardware Optimization, and Integrations

Technical requirements

Model interoperability with ONNX

What is model interoperability and how can ONNX help?

Converting models to ONNX format with ONNX frontends

Native scoring of ONNX models with ONNX backends

Hardware optimization with FPGAs

Understanding FPGAs

Comparing GPUs and FPGAs for deep neural networks

Running DNN inferencing on Intel FPGAs with Azure

Integrating ML models and endpoints with Azure services

Integrating with Azure IoT Edge

Integrating with Power BI

Summary

Chapter 16: Bringing Models into Production with MLOps

Technical requirements

Ensuring reproducible builds and deployments

Version-controlling your code

Registering snapshots of your data

Tracking your model metadata and artifacts

Scripting your environments and deployments

Validating the code, data, and models

Testing data quality with unit tests

Integration testing for ML

End-to-end testing using Azure Machine Learning

Continuous profiling of your model

Building an end-to-end MLOps pipeline

Setting up Azure DevOps

Continuous integration – building code with pipelines

Continuous deployment – deploying models with release pipelines

Summary

Chapter 17: Preparing for a Successful ML Journey

Remembering the importance of data

Starting with a thoughtful infrastructure

Automating recurrent tasks

Expecting constant change

Thinking about your responsibility

Interpreting a model

Fairness in model training

Handling PII data and compliance requirements

Summary

Other Books You May Enjoy

Section 1: Introduction to Azure Machine Learning

In this section, we will learn about the history of Machine Learning (ML), the scenarios in which to apply ML, the statistical knowledge necessary, and the steps and components required for running a custom end-to-end ML project. We will have a look at the available Azure services for ML and we will learn about the scenarios they are best suited for. Finally, we will introduce Azure Machine Learning, the main service we will utilize throughout the rest of the book. We will understand how to deploy this service and use it to run our first ML experiments in the cloud.

This section comprises the following chapters:

Chapter 1, Understanding the End-to-End Machine Learning Process

Chapter 2, Choosing the Right Machine Learning Service in Azure

Chapter 3, Preparing the Azure Machine Learning Workspace

Chapter 1: Understanding the End-to-End Machine Learning Process

Welcome to the second edition of Mastering Azure Machine Learning. In this first chapter, we want to give you an understanding of what kinds of problems require the use of machine learning (ML), how the full ML process unfolds, and what knowledge is required to navigate this vast terrain. You can view it as an introduction to ML and an overview of the book itself, where for most topics we will provide you with a reference to upcoming chapters so that you can easily find your way around the book.

In the first section, we will ask ourselves what ML is, when we should use it, and where it comes from. In addition, we will reflect on how ML is just another form of programming.

In the second section, we will lay the mathematical groundwork you require to process data, and we will understand that the data you work with probably cannot be fully trusted. Further, we will look at different classes of ML algorithms, how they are defined, and how we can define the performance of a trained model.

Finally, in the third section, we will have a look at the end-to-end process of an ML project. We will understand where to get data from, how to preprocess data, how to choose a fitting model, and how to deploy this model into production environments. This will also get us into the topic of ML operations, known as MLOps.

In this chapter, we will cover the following topics:

Grasping the idea behind ML

Understanding the mathematical basis for statistical analysis and ML modeling

Discovering the end-to-end ML process

Grasping the idea behind ML

The terms artificial intelligence (AI) and, to a lesser extent, ML are omnipresent in today's world. However, a lot of what is found under the term AI is often nothing more than a containerized ML solution, and to make matters worse, ML is sometimes unnecessarily used to solve something extremely simple.

Therefore, in this first section, let's understand the class of problems ML tries to solve, in which scenarios to use ML, and when not to use it.

Problems and scenarios requiring ML

If you look for a definition of ML, you will often find a description such as this: It is the study of self-improving machine algorithms using data. ML is basically described as an algorithm we are trying to evolve, which in turn can be seen as one complex mathematical function.

Any computer process today follows the simple structure of the input-process-output (IPO) model. We define allowed inputs, we define a process working with those inputs, and we define an output through the type of results the process will show us. A simple example would be a word processing application, where every keystroke will result in a letter shown as the output on the screen. A completely different process might run in parallel to that one, having a time-based trigger to store the text file periodically to a hard disk.

All these processes or algorithms have one thing in common—they were manually written by someone using a high-level programming language. It is clear which actions need to be done when someone presses a letter in a word processing application. Therefore, we can easily build a process in which we implement which input values should create which output values.

Now, let's look at a more complex problem. Imagine we have a picture of a dog and want an application to just say: This is a dog. This sounds simple enough, as we know the input picture of a dog and the output value dog. Unfortunately, our brain (our own machine) is far superior to the machines we build, especially when it comes to pattern recognition. For a computer, a picture is just a grid of pixels, each containing three color channels defined by an 8-bit or 10-bit value. Therefore, to the computer, an image is just a bunch of pixel vectors; in essence, a lot of numbers.

We could manually start writing an algorithm that clusters groups of pixels, looks for edges and points of interest, and eventually, with a lot of effort, we might succeed in building an algorithm that finds dogs in pictures. And then we get a picture of a cat.

It should be clear to you by now that we might run into a problem. Therefore, let's define one problem that ML solves, as follows:

Building the desired algorithm for a required solution programmatically is either extremely time-consuming, completely unfeasible, or impossible.

Taking this description, we can surely define good scenarios to use ML, be it finding objects in images and videos or understanding voices and extracting their intent from audio files. We will further understand what building ML solutions entails throughout this chapter (and the rest of the book, for that matter), but to make a simple statement, let's just acknowledge that building an ML model is also a time-consuming matter.

In that vein, it should be of utmost importance to avoid ML if we have the chance to do so. This might be an obvious statement, but as we (the authors) can attest, it is not for a lot of people. We have seen projects realized with ML where the output could be defined with a simple combination of if statements given some input vectors. In such scenarios, a solution could be obtained with a couple of hundred lines of code. Instead, months of training and testing an ML algorithm occurred, costing a lot of time and resources.

An example of this would be a company wanting to predict fraud (stolen money) committed by its own employees in a retail store. You might have heard that predicting fraud is a typical scenario for ML. Here, it was not necessary to use ML, as the company already knew the influencing factors (the length of time the cash drawer was open, error codes on return receipts, and so on) and therefore wanted to be alerted when certain combinations of these factors occurred. As they knew the factors already, they could have just written the code and been done with it.
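To make this concrete, a rule-based alert of this kind needs nothing more than a few conditions. The following sketch is purely hypothetical; the factor names and thresholds are illustrative assumptions, not the actual values from that project:

```python
# Hypothetical rule-based fraud alert. The factors and thresholds are
# illustrative assumptions, not the real project's values.
def should_alert(drawer_open_seconds: float, return_error_count: int) -> bool:
    # Alert when a long-open cash drawer coincides with repeated return errors
    if drawer_open_seconds > 120 and return_error_count >= 3:
        return True
    # Alert on an extreme value of either factor alone
    return drawer_open_seconds > 600 or return_error_count >= 10

print(should_alert(drawer_open_seconds=150, return_error_count=4))  # True
```

But what does this scenario tell us about ML?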

So far, we have looked at ML as a solution to solve a problem that, in essence, is too hard to code. Looking at the preceding scenario, you might understand another aspect or another class of problems that ML can solve. Therefore, let's add a second problem description, as follows:

Building the desired algorithm for a required solution is not feasible, as the influencing factors for the outcome of the desired outputs are only partially known or completely unknown.

Looking at this problem, you might now understand why ML relies so heavily on the field of statistics as, through the application of statistics, we can learn how data points influence one another, and therefore we might be able to solve such a problem. At the same time, we can build an algorithm that can find and predict the desired outcome.

In the previously mentioned scenario for detecting fraud, it might be prudent to still use ML, as it may be able to find a combination of influencing factors no one has thought about. But if this is not your set goal—as it was not in this case—you should not use ML for something that is easily written in code.

Now that we have discussed some of the problems solved by ML and have had a look at some scenarios for ML, let's have a look at how ML came to be.

The history of ML

To understand ML as a whole, we must first understand where it comes from. Therefore, let's delve into the history of ML. As with all events in history, different currents are happening simultaneously, adding pieces to the whole picture. We'll now look at a few important pillars that birthed the idea of ML as we know it today.

Learnings from neuroscience

A neuropsychologist named Donald O. Hebb published a book titled The Organization of Behavior in 1949. In this book, he described his theory of how neurons (neural cells) in our brain function, and how they contribute to what we understand as learning. This theory is known as Hebbian learning, and it makes the following proposition:

When an axon of cell A is near enough to excite cell B and repeatedly or persistently takes part in firing it, some growth process or metabolic change takes place in one or both cells such that A's efficiency, as one of the cells firing B, is increased.

This describes a process in which one cell (the initiating cell) repeatedly excites another, and through a hidden growth process, one or both of the cells are changed. This process is what we call learning.

To understand this a bit more visually, let's have a look at the biological structure of a neuron, as follows:

Figure 1.1 – Neuron in a biological neural network

What is visualized here? Firstly, on the left, we see the main body of the cell and its nucleus. The body receives input signals through dendrites that are connected to other neurons. In addition, there is a larger exit protruding from the body called the axon, which connects the main body through a chain of Schwann cells to the so-called axon terminal, which in turn connects again to other neurons.

Looking at this structure with some creativity, it certainly resembles what a function or an algorithm might be. We have input signals coming from external neurons, we have some hidden process happening with these signals, and we have an output in the form of an axon terminal that connects the results to other neurons, and therefore other processes again.

It would take another decade for someone to realize this connection.

Learnings from computer science

It is hard to talk about the history of ML in the context of computer science without mentioning one of the fathers of modern machines, Alan Turing. In a paper called Computing Machinery and Intelligence, published in 1950, Turing defined a test called the Imitation Game (later called the Turing test) to evaluate whether a machine can show behavior indistinguishable from that of a human. There are multiple iterations and variants of the test, but in essence, the idea is that a person would at no point in a conversation get the feeling they are not speaking with a human.

Certainly, this test is flawed, as there are ways to give relatively intelligent answers to questions while not being intelligent at all. If you want to learn more about this, have a look at ELIZA, built by Joseph Weizenbaum, which fooled some of its users into believing they were conversing with a human.

Nevertheless, this paper triggered one of the first discussions on what AI could be and what it means that a machine can learn.

Living in these exciting times, Arthur Samuel, a researcher working at International Business Machines Corporation (IBM) at that time, started developing a computer program that could make the right decisions in a game of checkers. In each move, he let the program evaluate a scoring function that tried to measure the chances of winning for each available move. Limited by the available resources at the time, it was not feasible to calculate all possible combinations of moves all the way to the end of the game.

This first step led to the definition of the so-called minimax algorithm and its accompanying search tree, which can commonly be used for any two-player adversarial game. Later, the alpha-beta pruning algorithm was added to automatically trim the tree from decisions that did not lead to better results than the ones already evaluated.

We are talking about Arthur Samuel, as it was he who coined the name machine learning, defining it as follows:

The field of study that gives computers the ability to learn without being explicitly programmed.

Combining these first ideas of building an evaluation function for training a machine and the research done by Donald O. Hebb in neuroscience, Frank Rosenblatt, a researcher at the Cornell Aeronautical Laboratory, invented a new linear classifier that he called a perceptron. Even though his progress in building this perceptron into hardware was relatively short-lived and would not live up to its potential, its original definition is nowadays the basis for every neuron in an artificial neural network (ANN).

Therefore, let's now dive deeper into understanding how ANNs work and what we can deduce about the inner workings of an ML algorithm from them.

Understanding the inner workings of ML through the example of ANNs

ANNs, as we know them today, are defined by the following two major components, one of which we learned about already:

The neural network: The base structure of the system. A perceptron is basically an NN with only one neuron. By now, this structure comes in multiple facets, often involving hidden layers of hundreds of neurons, in the case of deep neural networks (DNNs).

The backpropagation function: A rule for the system to learn and evolve. An idea conceived in the 1970s, it came into appreciation through the 1986 paper Learning Representations by Back-Propagating Errors by David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams.

To understand these two components and how they work in tandem with each other, let's have a deeper look at both.

The neural network

First, let's understand how a single neuron operates, which is very close to the idea of a perceptron defined by Rosenblatt. The following diagram shows the inner workings of such an artificial neuron:

Figure 1.2 – Neuron in an ANN

We can clearly see the similarities to a real neuron. We get inputs from the connected neurons, called x_1 through x_n. Each of those inputs is weighted with a corresponding weight w_i, and then, in the neuron itself, they are all summed up together with a bias b. This is often referred to as the net input function.

As the final operation, a so-called activation function is applied to this net input that decides how the output signal of the neuron should look. This function must be continuous and differentiable and should typically create results in the range of [0, 1] or [-1, 1] to keep results scaled. In addition, this function could be linear or non-linear in nature, even though using a linear activation function has its downfalls, as described next:

You cannot learn a non-linear relationship present in your data through a system of linear functions.

A multilayered network made up of nodes with only linear activation functions can be broken down to just one layer of nodes with one linear activation function, making the additional layers obsolete.

You cannot meaningfully use a linear activation function with backpropagation, as backpropagation requires calculating the derivative of this function, and the derivative of a linear function is just a constant. We will discuss backpropagation next.

Commonly used activation functions are sigmoid, hyperbolic tangent (tanh), rectified linear unit (ReLU), and softmax. Keeping this in mind, let's have a look at how we connect neurons together to achieve an ANN. A whole network is typically defined by three types of layers, as outlined here:

Input layer: Consists of neurons accepting singular input signals (not a weighted sum) to the network. Their weights might be constant or randomized depending on the application.

Hidden layer: Consists of the types of neurons we described before. They are defined by an activation function and the weights applied to the incoming signals. In DNNs, these layers typically represent specific transformation steps.

Output layer: Consists of neurons performing the final transformation of the data. They can behave like neurons in hidden layers, but they do not have to.

These together result in a typical ANN, as shown in the following diagram:

Figure 1.3 – ANN with one hidden layer

With this, we build a generic structure that can receive some input, realize some form of mathematical function through different layers of weights and activation functions, and in the end, hopefully show the correct output. This process of pushing information through the network from inputs to outputs is typically referred to as forward propagation. This, of course, only shows us what is happening with an input that passes through the network. The following question remains: How does it learn the desired function in the first place? The next section will answer this question.
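Before moving on, the following minimal NumPy sketch shows what forward propagation through a network like the one in Figure 1.3 boils down to; the layer sizes, the randomly initialized weights, and the choice of sigmoid activation are assumptions made purely for illustration:

```python
import numpy as np

def sigmoid(z):
    # A common activation function, squashing results into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(seed=0)

# Assumed layer sizes for illustration: 3 inputs, 4 hidden neurons, 1 output
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)  # hidden layer weights and biases
W2, b2 = rng.normal(size=(1, 4)), np.zeros(1)  # output layer weights and biases

def forward(x):
    hidden = sigmoid(W1 @ x + b1)     # net input function plus activation
    return sigmoid(W2 @ hidden + b2)  # final transformation in the output layer

print(forward(np.array([0.5, -1.0, 2.0])))
```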

The backpropagation function

The question that should have popped up in your mind by now is: How do we define the correct output? To have a way to change the behavior of the network, which mostly boils down to changing the values of the weights in the system, don't we need a way to quantify the error the system made?

Therefore, we need a function describing the error or loss, referred to as a loss function or error function. You might have even heard another name—a cost function. Let's define them next.

Loss Function versus Cost Function

A loss function (error function) computes the error for a single training example. A cost function, on the other hand, averages all loss function results for the entire training dataset.

This is the correct definition for those terms, but they are often used interchangeably. Just keep in mind that we are using some form of metric to measure the error we made or the distance we have from the correct results.

In classic backpropagation and other ML scenarios, the mean squared error (MSE) between the correct output y and the computed output ŷ is used to define the error or loss of the operation. The obvious target is now to minimize this error. Therefore, the actual task to perform is to find the global minimum of this function in n-dimensional space.
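Written out (this is the standard definition of the metric), the MSE over n training examples with correct outputs y_i and computed outputs ŷ_i is:

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$$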

To do this, we use something that is often referred to as an optimizer, defined next.

Optimizer (Objective Function)

An optimizer is a function that implements a specific way to reach the objective of minimizing the cost function.

One such optimizer is an iterative process called gradient descent. Its idea is visualized in the following screenshot:

Figure 1.4 – Gradient descent with loss function influenced by only one input (left: finding global minimum, right: stuck in local minimum)

In gradient descent, we try to navigate an n-dimensional loss function by taking reasonably sized steps, defined by a learning rate, with the goal of finding the global minimum while avoiding getting stuck in a local minimum.
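As a minimal illustration of this idea, the following sketch runs gradient descent on an assumed one-dimensional toy loss with a hand-picked learning rate (both are assumptions chosen for readability, not values from the book):

```python
# Gradient descent on an assumed toy loss L(w) = (w - 3)^2,
# whose derivative is dL/dw = 2 * (w - 3); the global minimum sits at w = 3.
def grad(w):
    return 2.0 * (w - 3.0)

w = 0.0              # arbitrary starting point
learning_rate = 0.1  # step size; too large overshoots, too small converges slowly
for _ in range(50):
    w -= learning_rate * grad(w)  # step against the direction of the gradient

print(round(w, 4))  # approaches 3.0
```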

Keeping this in mind and without going into too much detail, let's finish this thought by going through the steps the backpropagation algorithm performs on the neural network. These are set out here:

1. Pass a pair (x, y) through the network (forward propagation).
2. Compute the loss between the expected y and the computed ŷ.
3. Compute all derivatives for all functions and weights throughout the layers using the chain rule.
4. Update all weights, beginning from the back of the network to the front, with slightly changed weights defined by the optimizer.
5. Repeat until convergence is achieved (the weights no longer receive any meaningful updates).
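The following sketch walks through exactly these steps for the simplest possible case, a single sigmoid neuron with a squared-error loss; the toy dataset is an assumption made for illustration, and with only one neuron, the chain rule in Step 3 stays short:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Assumed toy dataset: label is 1 when the sum of the two inputs is positive
X = np.array([[1.0, 2.0], [-1.0, -2.0], [2.0, -1.0], [-2.0, 1.0]])
y = np.array([1.0, 0.0, 1.0, 0.0])

rng = np.random.default_rng(seed=0)
w, b = rng.normal(size=2), 0.0
lr = 0.5

for epoch in range(500):
    for xi, yi in zip(X, y):                  # Step 1: forward propagation
        y_hat = sigmoid(w @ xi + b)
        error = y_hat - yi                    # Step 2: loss is error**2
        dz = 2 * error * y_hat * (1 - y_hat)  # Step 3: chain rule
        w -= lr * dz * xi                     # Step 4: update weights and bias
        b -= lr * dz
# Step 5: repeat until convergence (a fixed epoch count here for brevity)

print([round(float(sigmoid(w @ xi + b)), 2) for xi in X])  # close to [1, 0, 1, 0]
```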

This is, in a nutshell, how an ANN learns. Be aware that it is vital to constantly change the (x, y) pairs in Step 1, as otherwise, you might push the network too far into memorizing the few pairs you constantly showed it. We will discuss the phenomenon of overfitting and underfitting later in this chapter.

As a final step in this section, let's now bring together what we have learned so far about ML and what this means for building software solutions in the future.

ML and Software 2.0

What we learned so far is that ML seems to be defined by a base structure with various knobs and levers (settings and values) that can be changed. In the case of ANNs, that would be the structure of the network itself and the weights, bias, and activation function we can set in some regard.

Accompanying this base structure is some sort of rule or function as to how these knobs and levers should be transformed through a learning process. In the case of ANNs, this is defined through the backpropagation function, which combines a loss function with an optimizer and some math.

In 2017, Andrej Karpathy, then the director of AI at Tesla, proposed that the aforementioned idea could be just another way of programming, which he called Software 2.0 (https://karpathy.medium.com/software-2-0-a64152b37c35).

Up to this point, writing software was about explaining to the machine precisely what it must do and what outcome it must produce through defining specific commands it had to follow. In this classical software development paradigm, we define algorithms by their code and let data run through it, typically written in a reasonably readable language.

Instead of doing that, another idea could be to define a program by a base structure, a way to evolve this structure, and the type of data it must process. In this case, we get something that is very hard for a human to understand (an ANN with weights, for example), but it might be much easier for a machine to work with.

So, we leave you at the end of this section with the thought that Andrej wanted to convey. Perhaps ML is just another form of programming machines.

Keeping all this in mind, let's now talk about math.

Understanding the mathematical basis for statistical analysis and ML modeling

Looking at what we have learned so far, it becomes abundantly clear that ML requires an ample understanding of mathematics. We already came across multiple mathematical functions we have to handle. Think about the activation function of neurons and the optimizer and loss functions for training. On top of that, we have not talked about the second aspect of our new programming paradigm—the data!

To choose the right ML algorithm and derive a good metric for a loss function, we have to take apart the data points we work with. In addition, we need to put the data points in relation to the domain we are working in. Therefore, when defining the role of a data scientist, you will often find a visual like this one:

Figure 1.5 – Requirements for data scientists

In this section, we will concentrate on what is referred to in Figure 1.5 as statistical research. We will understand why we need statistics and what base information we can derive from a given dataset, learn what bias is and ways to avoid that, mathematically classify possible ML algorithms, and finally, discuss how we choose useful metrics to define the performance of our trained models.

The case for statistics in ML

As we have seen, we require statistics to clean and analyze our given data. Therefore, let's start by asking: What do we understand from the term "statistics"?

Statistics is the science of collecting and analyzing a representative sample made up of a large quantity of numerical data with the purpose of inferring the statistical distribution of the underlying population.

A typical example of something such as this would be the prediction for the results of an election you see during the campaign or shortly after voting booths close. At those points in time, we do not know the precise result of the full population but we can acquire a sample, sometimes referred to as an observation. We get that by asking people for responses through a questionnaire. Then, based on this subset, we make a sound prediction for the full population by applying statistical methods.

We learned that in ML, we are trying to let the machine figure out a mathematical function that fits our problem, such as this:

y = f(x)

Thinking back to our ANN, x would be an input vector and y would be the resulting output vector. In ML jargon, they are known under different names, as seen next.

Features and Labels

One element of the input vector x is called a feature; the full output vector y is called the label. Often, we only deal with a one-dimensional label.

Now, to bring this together, when training an ML model, we typically only have a sample of the given world, and as with any other time you are dealing with only a sample or subset of reality, you want to pick highly representative features and samples of the underlying population.

So, what does this mean? Let's think of an example. Imagine you want to train a small robot car to automatically drive through a tunnel. First, we need to think about what our features and labels in this scenario are. As features, we probably need something that measures the distance from the edges of the car to the tunnel in each direction, as we do not want to drive into the sides of the tunnel. Let's assume we have some infrared sensors attached to the front, the sides, and the back of the vehicle. Then, the output of our program would probably control the steering and the speed of the vehicle, which would be our labels.

Given that, as a next step, we should think of a whole bunch of scenarios in which the vehicle could find itself. This might be a simple scenario where the vehicle sits straight, facing into the tunnel, or a bad scenario where the vehicle is nearly stuck in a corner and the tunnel turns left or right from that point on. In all these cases, we read out the values of our infrared sensors and then do the more complicated task of making an educated guess as to how the steering has to be changed and how the motor has to operate. Eventually, we end up with a bunch of example situations and corresponding actions to take, which becomes our training dataset. This can then be used to train an ANN so that the small car can learn how to follow a tunnel.

If you ever get the opportunity, try to perform this training. If you pick very good examples, you will understand the full power of ML, as you will most likely see something exciting, which I can attest to. In my setup, even though we never had a sample where we would instruct the vehicle to drive backward, the optimal function the machine trained had values where the vehicle learned to do exactly that.

In an example such as that, we would do everything from scratch and hopefully take representative samples by ourselves. In most cases you will encounter, the dataset already exists, and you need to figure out whether it is representative or whether we need to introduce additional data to achieve an optimal training result.

Therefore, let's have a look at some statistical properties you should familiarize yourself with.

Basics of statistics

We now understand that we need to be able to analyze the statistical properties of single features, derive their distribution, and analyze their relationship with other features and labels in the dataset.

Let's start with the properties of single features and their distribution. All the following operations require numerical data. This means that if you work with categorical data or something such as media files, you need to transform them into some form of numerical representation to get such results.

The following figure shows the main statistical properties you are after, their importance, and how you can calculate them:

Figure 1.6 – List of major statistical properties
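As a quick illustration of how such properties are computed in practice, here is a small sketch assuming the feature in question is available as a pandas Series (the values are made up):

```python
import pandas as pd

# Assumed example data; in practice this would be one numeric feature column
values = pd.Series([2.0, 3.5, 3.5, 4.0, 5.5, 7.0, 21.0])

print(values.mean())      # arithmetic mean
print(values.median())    # middle value, robust against outliers such as 21.0
print(values.std())       # standard deviation
print(values.quantile([0.25, 0.75]))  # quartiles
print(values.skew())      # skewness of the distribution
```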

From here onward, we can make the reasonable assumption that the underlying stochastic process follows a normal distribution. Be aware that this need not be the case, and therefore you should familiarize yourself with other distributions as well (see https://www.itl.nist.gov/div898/handbook/eda/section3/eda36.htm).

The following figure shows a visual representation of a standard normal distribution:

Figure 1.7 – Standard normal distribution and its properties

Now, the strength of this normal distribution is that, based on the mean μ and standard deviation σ, we can make assumptions about the probability of samples lying in a certain range. As shown in Figure 1.7, there is a probability of around 68.27% for a value to lie within a distance of 1σ from the mean, 95.45% for a distance of 2σ, and 99.73% for a distance of 3σ. Based on this, we can ask questions such as this:

How probable is it to find a value with a distance of 5σ from the mean?
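Under the standard normal assumption, this question can be answered directly, for instance with SciPy (a sketch, not tied to any particular dataset):

```python
from scipy.stats import norm

# Probability of observing a value at least 5 standard deviations away
# from the mean, under a standard normal distribution
p = 2 * norm.sf(5.0)  # sf(x) = 1 - cdf(x); doubled to cover both tails
print(p)              # roughly 5.7e-07, so such a value is extremely unlikely
```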

Through questions such as this, we can start assessing whether what we see in our data is a statistical anomaly of the distribution, a value that is simply false, or a sign that our suspected distribution is incorrect. This is done through a process called hypothesis testing, defined next.

Hypothesis Testing (Definition)

This is a method of testing whether the so-called null hypothesis H0, typically referring to the currently suspected distribution, is false. The null hypothesis means that the unlikely observation we encounter is pure chance. This hypothesis is rejected in favor of an alternative hypothesis H1 if the probability of the observation falls below a predefined significance level (typically 5%). The alternative hypothesis thus presumes that the observation we have is due to a real effect that is not taken into account in the initial distribution.

We will not go into further details on how to perform this test properly, but we urge you to familiarize yourself with this process thoroughly.

What we will talk about are the types of errors you can make in this process, as shown in the following figure:

Figure 1.8 – Type I and Type II errors

We define the errors you see in Figure 1.8 as follows:

Type I error: This denotes that we reject the null hypothesis H0 and the underlying distribution, even though it is correct. This is also referred to as a false-positive result or an alpha error.

Type II error: This denotes that we do not reject the null hypothesis H0 and the underlying distribution, even though the alternative hypothesis H1 is correct. This error is also referred to as a false-negative result or a beta error.

You might have heard the term false positive before. Often, it comes up when you take a medical test. A false positive would denote that you have a positive result from a test, even though you do not have the disease you are testing for. As a medical test is also a stochastic process, as with nearly everything else in our world, the term is correctly used in this scenario.

At the end of this section, when we talk about errors and metrics in ML model training, we will come back to these definitions. As a final step, let's discuss relationships among features and between features and labels. Such a relationship is referred to as a correlation.

There are multiple ways to calculate a correlation between two vectors a and b, but what they all have in common is that their results will fall in the range of [-1, 1]. The result of this operation can be broadly defined by the following three categories:

Negatively correlated: The result leans toward -1. When the values of vector a rise, the values of vector b fall, and vice versa.

Uncorrelated: The result leans toward 0. There is no real interaction between vectors a and b.

Positively correlated: The result leans toward 1. When the values of vector a rise, the values of vector b rise, and vice versa.
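As a small sketch of these three categories (with made-up vectors), NumPy's corrcoef computes the Pearson correlation coefficient:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
b = 2.0 * a + np.array([0.1, -0.2, 0.0, 0.2, -0.1])  # roughly linear in a
c = np.array([3.0, -1.0, 2.5, 0.0, 1.0])             # unrelated values

# Pearson correlation coefficients, always in the range [-1, 1]
print(np.corrcoef(a, b)[0, 1])   # close to 1: positively correlated
print(np.corrcoef(a, -b)[0, 1])  # close to -1: negatively correlated
print(np.corrcoef(a, c)[0, 1])   # closer to 0: (mostly) uncorrelated
```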

Through this, we can get an idea of relationships between data points, but please be aware of the differences between causation and correlation, as outlined next.

Causation versus Correlation

Even if two vectors are correlated with each other, it does not mean that one of them is the cause of the other; it simply means that one of them influences the other. It is not causation, as we probably do not see the full picture and every single influencing factor.

The mathematical theory we discussed so far should give you a good basis to build upon. In the next section, we