Azure Machine Learning is a cloud service for accelerating and managing the machine learning (ML) project life cycle that ML professionals, data scientists, and engineers can use in their day-to-day workflows. This book covers the end-to-end ML process using Microsoft Azure Machine Learning, including data preparation, performing and logging ML training runs, designing training and deployment pipelines, and managing these pipelines via MLOps.
The first section shows you how to set up an Azure Machine Learning workspace; ingest and version datasets; as well as preprocess, label, and enrich these datasets for training. In the next two sections, you'll discover how to enrich and train ML models for embedding, classification, and regression. You'll explore advanced NLP techniques, traditional ML models such as boosted trees, modern deep neural networks, recommendation systems, reinforcement learning, and complex distributed ML training techniques - all using Azure Machine Learning.
The last section will teach you how to deploy the trained models as a batch pipeline or real-time scoring service using Docker, Azure Machine Learning clusters, Azure Kubernetes Services, and alternative deployment targets.
By the end of this book, you’ll be able to combine all the steps you’ve learned by building an MLOps pipeline.
Page count: 812
Execute large-scale end-to-end machine learning with Azure
Christoph Körner
Marcel Alsdorf
BIRMINGHAM—MUMBAI
Copyright © 2022 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
Publishing Product Manager: Ali Abidi
Senior Editor: Nathanya Dias
Content Development Editor: Manikandan Kurup
Technical Editor: Devanshi Ayare
Copy Editor: Safis Editing
Project Coordinator: Aparna Ravikumar Nair
Proofreader: Safis Editing
Indexer: Rekha Nair
Production Designer: Aparna Bhagat
Marketing Coordinators: Abeer Dawe, Shifa Ansari
First published: March 2020
Second edition: May 2022
Production reference: 1220422
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham
B3 2PB, UK.
ISBN 978-1-80323-241-6
www.packt.com
Christoph Körner previously worked as a cloud solution architect for Microsoft, specializing in Azure-based big data and machine learning solutions, where he was responsible for designing end-to-end machine learning and data science platforms. He currently works for a large cloud provider on highly scalable distributed in-memory database services. Christoph has authored four books: Deep Learning in the Browser for Bleeding Edge Press, as well as Mastering Azure Machine Learning (first edition), Learning Responsive Data Visualization, and Data Visualization with D3 and AngularJS for Packt Publishing.
Marcel Alsdorf is a cloud solution architect with 5 years of experience at Microsoft consulting various companies on their cloud strategy. In this role, he focuses on supporting companies in their move toward being data-driven by analyzing their requirements and designing their data infrastructure in the areas of IoT and event streaming, data warehousing, and machine learning. On the side, he shares his technical and business knowledge as a coach in hackathons, as a mentor for start-ups and peers, and as a university lecturer. Before his current role, he worked as an FPGA engineer for the LHC project at CERN and as a software engineer in the banking industry.
I would like to thank Anthony Pino for the use of his housing dataset, Stefanie Grois for her hands-on insight into ML on IoT edge devices, and Henry Kröger for always being there. Further, a huge thanks goes to my friend and co-author Christoph, who let me take over his book. Finally, thanks to everyone who survived being around me during the work on this book.
Nirbhay Anand has a master's degree in computer application and is also a Microsoft Certified Technology Specialist with 16 years of industry experience in software product development. He has developed software in different domains such as investment banking, manufacturing, supply chains, power forecasting, and railroads. He was head of delivery for a cosmetic company while working with C3IT Solutions Pvt. Ltd. He is associated with CloudMoyo, a leading cloud and analytics partner for Microsoft. CloudMoyo brings together powerful BI capabilities using the Azure data platform to transform complex data into business insights. He is currently working on products as a tech program manager.
I would like to thank my wife, Vijeta, and kids, Navya and Nitrika, for their support. I thank my friends too for their never-ending support.
Alexey Bokov is an experienced Azure architect and has been a Microsoft technical evangelist since 2011. He works closely with Microsoft top-tier customers all around the world to develop applications based on the Azure cloud platform. Building cloud-based applications in the most challenging scenarios is his passion, as well as helping the development community to upskill and learn new things by working hands-on and hacking. He is a long-time contributor, as a coauthor and reviewer, to many Azure books and is an occasional speaker at Kubernetes events.
In this section, we will learn about the history of Machine Learning (ML), the scenarios in which to apply ML, the statistical knowledge necessary, and the steps and components required for running a custom end-to-end ML project. We will have a look at the available Azure services for ML and we will learn about the scenarios they are best suited for. Finally, we will introduce Azure Machine Learning, the main service we will utilize throughout the rest of the book. We will understand how to deploy this service and use it to run our first ML experiments in the cloud.
This section comprises the following chapters:
Chapter 1, Understanding the End-to-End Machine Learning Process
Chapter 2, Choosing the Right Machine Learning Service in Azure
Chapter 3, Preparing the Azure Machine Learning Workspace
Welcome to the second edition of Mastering Azure Machine Learning. In this first chapter, we want to give you an understanding of what kinds of problems require the use of machine learning (ML), how the full ML process unfolds, and what knowledge is required to navigate this vast terrain. You can view it as an introduction to ML and an overview of the book itself, where for most topics we will provide you with a reference to upcoming chapters so that you can easily find your way around the book.
In the first section, we will ask ourselves what ML is, when we should use it, and where it comes from. In addition, we will reflect on how ML is just another form of programming.
In the second section, we will lay the mathematical groundwork you require to process data, and we will understand that the data you work with probably cannot be fully trusted. Further, we will look at different classes of ML algorithms, how they are defined, and how we can define the performance of a trained model.
Finally, in the third section, we will have a look at the end-to-end process of an ML project. We will understand where to get data from, how to preprocess data, how to choose a fitting model, and how to deploy this model into production environments. This will also get us into the topic of ML operations, known as MLOps.
In this chapter, we will cover the following topics:
Grasping the idea behind ML
Understanding the mathematical basis for statistical analysis and ML modeling
Discovering the end-to-end ML process
The terms artificial intelligence (AI) and, to some extent, ML are omnipresent in today's world. However, a lot of what is found under the term AI is often nothing more than a containerized ML solution, and to make matters worse, ML is sometimes unnecessarily used to solve something extremely simple.
Therefore, in this first section, let's understand the class of problems ML tries to solve, in which scenarios to use ML, and when not to use it.
If you look for a definition of ML, you will often find a description such as this: It is the study of self-improving machine algorithms using data. ML is basically described as an algorithm we are trying to evolve, which in turn can be seen as one complex mathematical function.
Any computer process today follows the simple structure of the input-process-output (IPO) model. We define allowed inputs, we define a process working with those inputs, and we define an output through the type of results the process will show us. A simple example would be a word processing application, where every keystroke will result in a letter shown as the output on the screen. A completely different process might run in parallel to that one, having a time-based trigger to store the text file periodically to a hard disk.
All these processes or algorithms have one thing in common—they were manually written by someone using a high-level programming language. It is clear which actions need to be done when someone presses a letter in a word processing application. Therefore, we can easily build a process in which we implement which input values should create which output values.
Now, let's look at a more complex problem. Imagine we have a picture of a dog and want an application to just say: This is a dog. This sounds simple enough, as we know the input picture of a dog and the output value dog. Unfortunately, our brain (our own machine) is far superior to the machines we built, especially when it comes to pattern recognition. For a computer, a picture is just a grid of pixels, each containing three color channels defined by an 8-bit or 10-bit value. Therefore, to the computer, an image is just a collection of pixel vectors, which is, in essence, a lot of numbers.
We could manually start writing an algorithm that maybe clusters groups of pixels, looks for edges and points of interest, and eventually, with a lot of effort, we might succeed in having an algorithm that finds dogs in pictures. That is when we get a picture of a cat.
It should be clear to you by now that we might run into a problem. Therefore, let's define one problem that ML solves, as follows:
Building the desired algorithm for a required solution programmatically is either extremely time-consuming, completely unfeasible, or impossible.
Taking this description, we can surely define good scenarios to use ML, be it finding objects in images and videos or understanding voices and extracting their intent from audio files. We will further understand what building ML solutions entails throughout this chapter (and the rest of the book, for that matter), but to make a simple statement, let's just acknowledge that building an ML model is also a time-consuming matter.
In that vein, it should be of utmost importance to avoid ML if we have the chance to do so. This might be an obvious statement, but as we (the authors) can attest, it is not for a lot of people. We have seen projects realized with ML where the output could be defined with a simple combination of if statements given some input vectors. In such scenarios, a solution could be obtained with a couple of hundred lines of code. Instead, months of training and testing an ML algorithm occurred, costing a lot of time and resources.
An example of this would be a company wanting to predict fraud (stolen money) committed by their own employees in a retail store. You might have heard that predicting fraud is a typical scenario for ML. Here, it was not necessary to use ML, as the company already knew the influencing factors (length of time the cashier was open, error codes on return receipts, and so on) and therefore wanted to be alerted when certain combinations of these factors occurred. As they knew the factors already, they could have just written the code and be done with it. But what does this scenario tell us about ML?
So far, we have looked at ML as a solution to solve a problem that, in essence, is too hard to code. Looking at the preceding scenario, you might understand another aspect or another class of problems that ML can solve. Therefore, let's add a second problem description, as follows:
Building the desired algorithm for a required solution is not feasible, as the influencing factors for the outcome of the desired outputs are only partially known or completely unknown.
Looking at this problem, you might now understand why ML relies so heavily on the field of statistics as, through the application of statistics, we can learn how data points influence one another, and therefore we might be able to solve such a problem. At the same time, we can build an algorithm that can find and predict the desired outcome.
In the previously mentioned scenario for detecting fraud, it might be prudent to still use ML, as it may be able to find a combination of influencing factors no one has thought about. But if this is not your set goal—as it was not in this case—you should not use ML for something that is easily written in code.
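To make the rule-based alternative concrete, here is a minimal sketch of an alert built from already known factors using plain if statements. The field names and thresholds are hypothetical and serve only as an illustration; they are not taken from the project described above.

```python
from dataclasses import dataclass


@dataclass
class ShiftReport:
    """Hypothetical summary of one cashier shift (all fields are assumptions)."""
    minutes_drawer_open: float    # total time the cash drawer was open
    return_error_codes: int       # number of error codes on return receipts
    returns_without_receipt: int  # returns processed without a receipt


def needs_review(report: ShiftReport) -> bool:
    """Flag a shift when known risk factors occur in combination."""
    # Illustrative thresholds; a real system would get them from domain experts.
    drawer_open_too_long = report.minutes_drawer_open > 30
    many_return_errors = report.return_error_codes >= 3
    many_blind_returns = report.returns_without_receipt >= 5

    if drawer_open_too_long and many_return_errors:
        return True
    if many_return_errors and many_blind_returns:
        return True
    return False


if __name__ == "__main__":
    print(needs_review(ShiftReport(45, 4, 0)))  # True
    print(needs_review(ShiftReport(10, 1, 0)))  # False
```

A function like this is transparent and easy to test, but it can only react to combinations someone has already thought of, which is exactly the limitation ML may overcome.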
Now that we have discussed some of the problems solved by ML and have had a look at some scenarios for ML, let's have a look at how ML came to be.
To understand ML as a whole, we must first understand where it comes from. Therefore, let's delve into the history of ML. As with all events in history, different currents are happening simultaneously, adding pieces to the whole picture. We'll now look at a few important pillars that birthed the idea of ML as we know it today.
A neuropsychologist named Donald O. Hebb published a book titled The Organization of Behavior in 1949. In this book, he described his theory of how neurons (neural cells) in our brain function, and how they contribute to what we understand as learning. This theory is known as Hebbian learning, and it makes the following proposition:
When an axon of cell A is near enough to excite cell B and repeatedly or persistently takes part in firing it, some growth process or metabolic change takes place in one or both cells such that A's efficiency, as one of the cells firing B, is increased.
In essence, this describes a process in which one cell (the initiating cell) repeatedly excites another, and in which the initiating cell, and maybe even the receiving cell, is changed through a hidden process. This process is what we call learning.
To understand this a bit more visually, let's have a look at the biological structure of a neuron, as follows:
Figure 1.1 – Neuron in a biological neural network
What is visualized here? Firstly, on the left, we see the main body of the cell and its nucleus. The body receives input signals through dendrites that are connected to other neurons. In addition, there is a larger exit protruding from the body called the axon, which connects the main body through a chain of Schwann cells to the so-called axon terminal, which in turn connects again to other neurons.
Looking at this structure with some creativity, it certainly resembles what a function or an algorithm might be. We have input signals coming from external neurons, we have some hidden process happening with these signals, and we have an output in the form of an axon terminal that connects the results to other neurons, and therefore other processes again.
It would take roughly another decade for someone to make use of this connection.
It is hard to talk about the history of ML in the context of computer science without mentioning one of the fathers of modern machines, Alan Turing. In a paper called Computing Machinery and Intelligence published in 1950, Turing defines a test called the Imitation Game (later called the Turing test) to evaluate whether a machine exhibits behavior indistinguishable from that of a human. There are multiple iterations and variants of the test, but in essence, the idea is that a person would at no point in a conversation get the feeling they are not speaking with a human.
Certainly, this test is flawed, as there are ways to give relatively intelligent answers to questions while not being intelligent at all. If you want to learn more about this, have a look at ELIZA, a program built by Joseph Weizenbaum that convinced some of its users they were conversing with a human, despite relying on simple pattern matching.
Nevertheless, this paper triggered one of the first discussions on what AI could be and what it means that a machine can learn.
Living in these exciting times, Arthur Samuel, a researcher working at International Business Machines Corporation (IBM) at that time, started developing a computer program that could make the right decisions in a game of checkers. In each move, he let the program evaluate a scoring function that tried to measure the chances of winning for each available move. Limited by the available resources at the time, it was not feasible to calculate all possible combinations of moves all the way to the end of the game.
This first step led to the definition of the so-called minimax algorithm and its accompanying search tree, which can commonly be used for any two-player adversarial game. Later, the alpha-beta pruning algorithm was added to automatically trim the tree from decisions that did not lead to better results than the ones already evaluated.
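The idea of a search tree with pruning can be sketched in a few lines. The following is a generic illustration, not Samuel's checkers program: the game tree is assumed to be a nested list whose leaves are scores produced by some scoring function, seen from the maximizing player's point of view.

```python
import math
from typing import List, Union

# A tree is either a leaf score or a list of subtrees (an assumed toy encoding).
Tree = Union[int, float, List["Tree"]]


def alphabeta(node: Tree, maximizing: bool,
              alpha: float = -math.inf, beta: float = math.inf) -> float:
    """Return the minimax value of `node`, skipping branches that cannot matter."""
    if not isinstance(node, list):  # leaf: value of the scoring function
        return node
    if maximizing:
        best = -math.inf
        for child in node:
            best = max(best, alphabeta(child, False, alpha, beta))
            alpha = max(alpha, best)
            if alpha >= beta:  # the opponent would never allow this branch
                break
        return best
    best = math.inf
    for child in node:
        best = min(best, alphabeta(child, True, alpha, beta))
        beta = min(beta, best)
        if alpha >= beta:  # the maximizer already has a better option elsewhere
            break
    return best


if __name__ == "__main__":
    # Three possible moves, each answered by one of two opponent replies.
    print(alphabeta([[3, 5], [2, 9], [0, 1]], maximizing=True))  # 3
```

In this example, the leaves 9 and 1 are never evaluated: once the opponent can force a result of 2 or 0 in a branch, that branch cannot beat the 3 already secured by the first move.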
We are talking about Arthur Samuel, as it was he who coined the name machine learning, defining it as follows:
The field of study that gives computers the ability to learn without being explicitly programmed.
Combining these first ideas of building an evaluation function for training a machine and the research done by Donald O. Hebb in neuroscience, Frank Rosenblatt, a researcher at the Cornell Aeronautical Laboratory, invented a new linear classifier that he called a perceptron. Even though his progress in building this perceptron into hardware was relatively short-lived and would not live up to its potential, its original definition is nowadays the basis for every neuron in an artificial neural network (ANN).
Therefore, let's now dive deeper into understanding how ANNs work and what we can deduce about the inner workings of an ML algorithm from them.
ANNs, as we know them today, are defined by the following two major components, one of which we learned about already:
The neural network: The base structure of the system. A perceptron is basically an NN with only one neuron. By now, this structure comes in multiple facets, often involving hidden layers of hundreds of neurons, in the case of deep neural networks (DNNs).
The backpropagation function: A rule for the system to learn and evolve. An idea thought of in the 1970s came into appreciation through a paper called Learning Representations by Back-Propagating Errors by D. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams in 1986.
To understand these two components and how they work in tandem with each other, let's have a deeper look at both.
First, let's understand how a single neuron operates, which is very close to the idea of a perceptron defined by Rosenblatt. The following diagram shows the inner workings of such an artificial neuron:
Figure 1.2 – Neuron in an ANN
We can clearly see the similarities to a real neuron. We get inputs from the connected neurons called $x_1, \dots, x_n$. Each of those inputs is weighted with a corresponding weight $w_1, \dots, w_n$, and then, in the neuron itself, they are all summed up, including a bias. This is often referred to as the net input function.
As the final operation, a so-called activation function is applied to this net input that decides how the output signal of the neuron should look. This function must be continuous and differentiable and should typically create results in the range of [0, 1] or [-1, 1] to keep results scaled. In addition, this function could be linear or non-linear in nature, even though using a linear activation function has its drawbacks, as described next:
You cannot learn a non-linear relationship presented in your data through a system of linear functions.
A multilayered network made up of nodes with only linear activation functions can be broken down to just one layer of nodes with one linear activation function, making the additional layers redundant.
A linear activation function works poorly with backpropagation, as this requires calculating the derivative of this function, which for a linear function is a constant that no longer depends on the input. We will discuss backpropagation next.
Commonly used activation functions are sigmoid, hyperbolic tangent (tanh), rectified linear unit (ReLU), and softmax. Keeping this in mind, let's have a look at how we connect neurons together to achieve an ANN. A whole network is typically defined by three types of layers, as outlined here:
Input layer: Consists of neurons accepting singular input signals (not a weighted sum) to the network. Their weights might be constant or randomized depending on the application.
Hidden layer: Consists of the types of neurons we described before. They are defined by an activation function and given weights to the weighted sum of the input signals. In DNNs, these layers typically represent specific transformation steps.
Output layer: Consists of neurons performing the final transformation of the data. They can behave like neurons in hidden layers, but they do not have to.
These together result in a typical ANN, as shown in the following diagram:
Figure 1.3 – ANN with one hidden layer
With this, we build a generic structure that can receive some input, realize some form of mathematical function through different layers of weights and activation functions, and in the end, hopefully show the correct output. This process of pushing information through the network from inputs to outputs is typically referred to as forward propagation. This, of course, only shows us what is happening with an input that passes through the network. The following question remains: How does it learn the desired function in the first place? The next section will answer this question.
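Forward propagation as described above can be sketched in plain Python. This is a minimal illustration under stated assumptions: every neuron uses a sigmoid activation, the network has exactly one hidden layer, and the function names (`neuron`, `layer`, `forward`) are chosen for this example only.

```python
import math
from typing import List


def sigmoid(z: float) -> float:
    """Activation function that squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron(inputs: List[float], weights: List[float], bias: float) -> float:
    """Net input (weighted sum plus bias) followed by the activation function."""
    net_input = sum(w * x for w, x in zip(weights, inputs)) + bias
    return sigmoid(net_input)


def layer(inputs: List[float], weight_rows: List[List[float]],
          biases: List[float]) -> List[float]:
    """Apply several neurons to the same inputs; one weight row per neuron."""
    return [neuron(inputs, w, b) for w, b in zip(weight_rows, biases)]


def forward(inputs: List[float],
            hidden_w: List[List[float]], hidden_b: List[float],
            output_w: List[List[float]], output_b: List[float]) -> List[float]:
    """Push the inputs through the hidden layer and then the output layer."""
    hidden = layer(inputs, hidden_w, hidden_b)
    return layer(hidden, output_w, output_b)


if __name__ == "__main__":
    # Two inputs, two hidden neurons, one output neuron; weights are arbitrary.
    result = forward([1.0, 2.0],
                     hidden_w=[[0.5, -0.3], [0.1, 0.8]], hidden_b=[0.0, -0.5],
                     output_w=[[1.2, -0.7]], output_b=[0.1])
    print(result)
```

With arbitrary weights, the output is just some number between 0 and 1. The network only becomes useful once the weights are adjusted by a learning rule.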
The question that should have popped up in your mind by now is: How do we define the correct output? To have a way to change the behavior of the network, which mostly boils down to changing the values of the weights in the system, don't we need a way to quantify the error the system made?
Therefore, we need a function describing the error or loss, referred to as a loss function or error function. You might have even heard another name—a cost function. Let's define them next.
Loss Function versus Cost Function
A loss function (error function) computes the error for a single training example. A cost function, on the other hand, averages all loss function results for the entire training dataset.
This is the correct definition for those terms, but they are often used interchangeably. Just keep in mind that we are using some form of metric to measure the error we made or the distance we have from the correct results.
In classic backpropagation and other ML scenarios, the mean squared error (MSE) between the correct $y$ and the computed $\hat{y}$ is used to define the error or loss of the operation. The obvious target is to now minimize this error. Therefore, the actual task to perform is to find the global minimum of this function in n-dimensional space.
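The distinction between the loss of a single example and the cost over a dataset can be written down directly. The sketch below assumes one-dimensional labels; the function names and the sample values are made up for illustration.

```python
from typing import Sequence


def squared_error(y_true: float, y_pred: float) -> float:
    """Loss function: the error for a single training example."""
    return (y_true - y_pred) ** 2


def mse_cost(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Cost function: the average loss over the entire training dataset."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    losses = [squared_error(t, p) for t, p in zip(y_true, y_pred)]
    return sum(losses) / len(losses)


if __name__ == "__main__":
    correct = [1.0, 0.0, 1.0]
    computed = [0.9, 0.2, 0.6]
    print(mse_cost(correct, computed))  # approximately 0.07
```

Because the errors are squared, the prediction that is off by 0.4 contributes far more to the cost than the one that is off by 0.1.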
To find this minimum, we use something that is often referred to as an optimizer, defined next.
Optimizer
An optimizer is a function that implements a specific way to reach the objective of minimizing the cost function.
One such optimizer is an iterative process called gradient descent. Its idea is visualized in the following figure:
Figure 1.4 – Gradient descent with loss function influenced by only one input (left: finding global minimum, right: stuck in local minimum)
In gradient descent, we try to navigate an n-dimensional loss function by taking steps of a suitable size, defined by a learning rate, with the goal of finding the global minimum while avoiding getting stuck in a local minimum.
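The behavior described above, including getting stuck in a local minimum, can be reproduced on a one-dimensional function. The function $f(x) = x^4 - 3x^2 + x$ used below is an assumption chosen for this sketch because it has one global and one local minimum; it is not taken from the figure.

```python
def f(x: float) -> float:
    """Example loss with a global minimum near -1.30 and a local one near 1.13."""
    return x ** 4 - 3 * x ** 2 + x


def df(x: float) -> float:
    """Derivative of f, giving the slope at position x."""
    return 4 * x ** 3 - 6 * x + 1


def gradient_descent(start: float, learning_rate: float = 0.01,
                     steps: int = 1000) -> float:
    """Repeatedly step against the slope, scaled by the learning rate."""
    x = start
    for _ in range(steps):
        x -= learning_rate * df(x)
    return x


if __name__ == "__main__":
    left = gradient_descent(start=-2.0)
    right = gradient_descent(start=2.0)
    print(left, f(left))    # ends in the global minimum
    print(right, f(right))  # ends in the local minimum, with a higher loss
```

Both runs use the same learning rate and number of steps; only the starting point differs. This is why the initial weights of a network and the step size matter in practice.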
Keeping this in mind and without going into too much detail, let's finish this thought by going through the steps the backpropagation algorithm performs on the neural network. These are set out here:
1. Pass a pair $(x, y)$ through the network (forward propagation).
2. Compute the loss between the expected $y$ and the computed $\hat{y}$.
3. Compute all derivatives for all functions and weights throughout the layers using the chain rule.
4. Update all weights beginning from the back of the network to the front, with slightly changed weights defined by the optimizer.
5. Repeat until convergence is achieved (the weights are not receiving any meaningful updates anymore).
This is, in a nutshell, how an ANN learns. Be aware that it is vital to constantly change the pairs in Step 1, as otherwise, you might push the network too far into memorizing the few pairs you constantly showed it. We will discuss the phenomenon of overfitting and underfitting later in this chapter.
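The steps above can be sketched for the simplest possible case: a single sigmoid neuron, where the chain rule spans only one layer. This is an illustrative simplification under stated assumptions (squared error as the loss, plain gradient descent as the optimizer, a fixed number of epochs instead of a convergence check), not a full multi-layer implementation.

```python
import math
import random
from typing import List, Sequence, Tuple

Pair = Tuple[Sequence[float], float]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def predict(x: Sequence[float], weights: List[float], bias: float) -> float:
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def train(pairs: Sequence[Pair], learning_rate: float = 0.5,
          epochs: int = 2000, seed: int = 0) -> Tuple[List[float], float]:
    rng = random.Random(seed)
    weights = [0.0] * len(pairs[0][0])
    bias = 0.0
    pairs = list(pairs)
    for _ in range(epochs):            # Step 5: repeat (fixed epoch count here)
        rng.shuffle(pairs)             # vary the order of the pairs
        for x, y in pairs:
            y_hat = predict(x, weights, bias)        # Step 1: forward pass
            # Step 2: the loss is (y_hat - y) ** 2; Step 3: chain rule
            d_loss_d_yhat = 2.0 * (y_hat - y)
            d_yhat_d_net = y_hat * (1.0 - y_hat)     # derivative of sigmoid
            delta = d_loss_d_yhat * d_yhat_d_net
            # Step 4: move each weight a small step against its gradient
            weights = [w - learning_rate * delta * xi
                       for w, xi in zip(weights, x)]
            bias -= learning_rate * delta
    return weights, bias


if __name__ == "__main__":
    # Logical OR as a tiny training dataset.
    data = [((0.0, 0.0), 0.0), ((0.0, 1.0), 1.0),
            ((1.0, 0.0), 1.0), ((1.0, 1.0), 1.0)]
    w, b = train(data)
    for x, y in data:
        print(x, y, round(predict(x, w, b), 3))
```

In a network with hidden layers, Step 3 would propagate `delta` further back through each layer, which is where the name backpropagation comes from.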
As a final step in this section, let's now bring together what we have learned so far about ML and what this means for building software solutions in the future.
What we learned so far is that ML seems to be defined by a base structure with various knobs and levers (settings and values) that can be changed. In the case of ANNs, that would be the structure of the network itself and the weights, bias, and activation function we can set in some regard.
Accompanying this base structure is some sort of rule or function as to how these knobs and levers should be transformed through a learning process. In the case of ANNs, this is defined through the backpropagation function, which combines a loss function with an optimizer and some math.
In 2017, Andrej Karpathy, then the director of AI at Tesla, proposed that the aforementioned idea could be just another way of programming, which he called Software 2.0 (https://karpathy.medium.com/software-2-0-a64152b37c35).
Up to this point, writing software was about explaining to the machine precisely what it must do and what outcome it must produce through defining specific commands it had to follow. In this classical software development paradigm, we define algorithms by their code and let data run through it, typically written in a reasonably readable language.
Instead of doing that, another idea could be to define a program by a base structure, a way to evolve this structure, and the type of data it must process. In this case, we get something that is very hard for a human to understand (an ANN with weights, for example), but that might be much easier for a machine to work with.
So, we leave you at the end of this section with the thought that Andrej wanted to convey. Perhaps ML is just another form of programming machines.
Keeping all this in mind, let's now talk about math.
Looking at what we have learned so far, it becomes abundantly clear that ML requires an ample understanding of mathematics. We already came across multiple mathematical functions we have to handle. Think about the activation function of neurons and the optimizer and loss functions for training. On top of that, we have not talked about the second aspect of our new programming paradigm—the data!
To choose the right ML algorithm and derive a good metric for a loss function, we have to take apart the data points we work with. In addition, we need to relate the data points to the domain we are working in. Therefore, when defining the role of a data scientist, you will often find a visual like this one:
Figure 1.5 – Requirements for data scientists
In this section, we will concentrate on what is referred to in Figure 1.5 as statistical research. We will understand why we need statistics and what base information we can derive from a given dataset, learn what bias is and ways to avoid that, mathematically classify possible ML algorithms, and finally, discuss how we choose useful metrics to define the performance of our trained models.
As we have seen, we require statistics to clean and analyze our given data. Therefore, let's start by asking: What do we understand from the term "statistics"?
Statistics is the science of collecting and analyzing a representative sample made up of a large quantity of numerical data with the purpose of inferring the statistical distribution of the underlying population.
A typical example of something such as this would be the prediction for the results of an election you see during the campaign or shortly after voting booths close. At those points in time, we do not know the precise result of the full population but we can acquire a sample, sometimes referred to as an observation. We get that by asking people for responses through a questionnaire. Then, based on this subset, we make a sound prediction for the full population by applying statistical methods.
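The election example can be simulated to show how a sample stands in for the full population. Everything in this sketch is an assumption made for illustration: the population of 100,000 voters, the true share of 52%, and the sample size of 2,000.

```python
import random
from typing import List


def estimate_share(population: List[int], sample_size: int, seed: int = 0) -> float:
    """Estimate the share of 1s in the population from a random sample."""
    rng = random.Random(seed)
    sample = rng.sample(population, sample_size)  # drawn without replacement
    return sum(sample) / sample_size


if __name__ == "__main__":
    # 1 = votes for candidate A, 0 = votes for someone else (simulated data).
    population = [1] * 52_000 + [0] * 48_000
    true_share = sum(population) / len(population)
    estimate = estimate_share(population, sample_size=2_000)
    print(f"true share: {true_share:.3f}, estimate from sample: {estimate:.3f}")
```

The estimate will typically land within a few percentage points of the true share, provided the sample is drawn at random. A questionnaire that only reaches one group of voters would break this assumption.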
We learned that in ML, we are trying to let the machine figure out a mathematical function that fits our problem, such as this:
$$ f(x) = y $$
Thinking back to our ANN, $x$ would be an input vector and $y$ would be the resulting output vector. In ML jargon, they are known under a different name, as seen next.
Features and Labels
One element of the input vector x is called a feature; the full output vector is called the label. Often, we only deal with a one-dimensional label.
Now, to bring this together, when training an ML model, we typically only have a sample of the given world, and as with any other time you are dealing with only a sample or subset of reality, you want to pick highly representative features and samples of the underlying population.
So, what does this mean? Let's think of an example. Imagine you want to train a small robot car to be able to automatically drive through a tunnel. First, we need to think about what our features and labels in this scenario are. As features, we probably need something that measures the distance from the edges of the car to the tunnel in each direction, as we probably do not want to drive into the sides of the tunnel. Let's assume we have some infrared sensors attached to the front, the sides, and the back of the vehicle. Then, the output of our program would probably control the steering and the speed of the vehicle, which would be our labels.
Given that, as a next step, we should think of a whole bunch of scenarios in which the vehicle could find itself. This might be a simple scenario of the vehicle sitting straight-facing in the tunnel, or it could be a bad scenario where the vehicle is nearly stuck in a corner and the tunnel is going left or right from that point on. In all these cases, we read out the values of our infrared sensors and then do the more complicated tasks of making an educated guess as to how the steering has to be changed and how the motor has to operate. Eventually, we end up with a bunch of example situations and corresponding actions to take, which would be our training dataset. This can then be used to train an ANN so that the small car can learn how to follow a tunnel.
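Such a training dataset could be laid out as follows. This is only a sketch of the data structure: the units, value ranges, and all numbers are hand-made assumptions for illustration, not measurements from a real vehicle.

```python
from typing import List, NamedTuple


class Sample(NamedTuple):
    """One example situation and the action to take (hypothetical layout)."""
    features: List[float]  # infrared distances in cm: front, left, right, back
    label: List[float]     # steering in [-1, 1] (left to right), speed in [0, 1]


# Hand-made illustrative values for a few situations in the tunnel.
training_data = [
    Sample([120.0, 30.0, 30.0, 90.0], [0.0, 0.8]),   # centered, free ahead
    Sample([25.0, 8.0, 70.0, 60.0], [0.9, 0.2]),     # close to the left wall
    Sample([25.0, 70.0, 8.0, 60.0], [-0.9, 0.2]),    # close to the right wall
    Sample([10.0, 40.0, 12.0, 80.0], [-0.6, 0.1]),   # nearly stuck on the right
]

# Split into the input vectors x and the output vectors y used for training.
X = [sample.features for sample in training_data]
y = [sample.label for sample in training_data]

if __name__ == "__main__":
    print(len(X), "samples,", len(X[0]), "features,", len(y[0]), "label values")
```

Each row pairs four features with a two-dimensional label, so the function the network has to learn maps four numbers to two.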
If you ever get the opportunity, try to perform this training. If you pick very good examples, you will understand the full power of ML, as you will most likely see something exciting, which I can attest to. In my setup, even though we never had a sample where we would instruct the vehicle to drive backward, the optimal function the machine trained had values where the vehicle learned to do exactly that.
In an example such as that, we would do everything from scratch and hopefully take representative samples by ourselves. In most cases you will encounter, the dataset already exists, and you need to figure out whether it is representative or whether we need to introduce additional data to achieve an optimal training result.
Therefore, let's have a look at some statistical properties you should familiarize yourself with.
We now understand that we need to be able to analyze the statistical properties of single features, derive their distribution, and analyze their relationship with other features and labels in the dataset.
Let's start with the properties of single features and their distribution. All the following operations require numerical data. This means that if you work with categorical data or something such as media files, you need to transform them into some form of numerical representation to get such results.
The following screenshot shows the main statistical properties you are after, their importance, and how you can calculate them:
Figure 1.6 – List of major statistical properties
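As a minimal sketch, such properties of a single numerical feature can be calculated with Python's built-in statistics module. Which properties to include is an assumption here (the usual measures of location and spread), and the values in the list are invented.

```python
import statistics

# Hypothetical values of one numerical feature.
values = [2.0, 3.0, 3.0, 4.0, 5.0, 7.0, 11.0]

summary = {
    "min": min(values),
    "max": max(values),
    "mean": statistics.mean(values),          # arithmetic average
    "median": statistics.median(values),      # middle value of the sorted data
    "mode": statistics.mode(values),          # most frequent value
    "variance": statistics.pvariance(values), # population variance
    "std_dev": statistics.pstdev(values),     # population standard deviation
}

for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

Note that the mean (5.0) lies above the median (4.0) in this sample, which hints that the single large value of 11.0 skews the distribution to the right.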
From here onward, we make the reasonable assumption that the underlying stochastic process follows a normal distribution. Be aware that this need not be the case, and therefore you should make yourself comfortable with other distributions (see https://www.itl.nist.gov/div898/handbook/eda/section3/eda36.htm).
The following screenshot shows a visual representation of a standard normal distribution:
Figure 1.7 – Standard normal distribution and its properties
Now, the strength of this normal distribution is that, based on the mean μ and the standard deviation σ, we can make assumptions about the probability of samples falling within a certain range. As shown in Figure 1.7, there is a probability of around 68.27% for a value to lie within a distance of 1σ from the mean, 95.45% for a distance of 2σ, and 99.73% for a distance of 3σ. Based on this, we can ask questions such as this:
How probable is it to find a value with a distance of 5σ from the mean?
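The percentages quoted for Figure 1.7, and the answer to this question, can be checked numerically. The following is a minimal sketch using NormalDist from Python's statistics module, assuming that distances from the mean are measured in units of the standard deviation. The helper name prob_within is made up for this example.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0.0, sigma=1.0)


def prob_within(k: float) -> float:
    """Probability that a value lies within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3, 5):
    inside = prob_within(k)
    print(f"within {k} sigma: {inside:.4%}  outside: {1.0 - inside:.2e}")
```

The probability of landing outside five standard deviations comes out at well below one in a million, which is why such a value deserves a closer look.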
Through questions such as this, we can start assessing whether what we see in our data is a statistical anomaly of the distribution, is a value that is simply false, or whether our suspected distribution is incorrect. This is done through a process called hypothesis testing, defined next.
Hypothesis Testing (Definition)
This is a method of testing whether the so-called null hypothesis H₀, which typically refers to the currently suspected distribution, should be rejected. The null hypothesis states that the unlikely observation we encountered is pure chance. It is rejected in favor of an alternative hypothesis H₁ if the probability of seeing such an observation under H₀ falls below a predefined significance level (typically 5%). The alternative hypothesis thus presumes that the observation is due to a real effect that is not taken into account in the initial distribution.
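To give a feeling for the mechanics, here is a minimal sketch of one simple variant, a two-sided z-test on a single observation. It assumes that the null hypothesis is a normal distribution with known mean and standard deviation. The function name, the significance level constant, and the numbers are illustrative assumptions, not a complete testing procedure.

```python
from statistics import NormalDist

ALPHA = 0.05  # assumed significance level


def two_sided_p_value(observation: float, mean: float, std_dev: float) -> float:
    """Probability, under the null hypothesis, of a value at least this far from the mean."""
    z = (observation - mean) / std_dev  # distance from the mean in standard deviations
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


# Null hypothesis: values follow a normal distribution with mean 10 and standard deviation 1.
p = two_sided_p_value(observation=13.0, mean=10.0, std_dev=1.0)
reject_null = p < ALPHA

print(f"p-value: {p:.4f}, reject null hypothesis: {reject_null}")
```

An observation three standard deviations from the mean yields a p-value far below the assumed 5% level, so the null hypothesis is rejected in this sketch, whereas an observation of 10.5 would not lead to a rejection.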
We will not go into further details on how to perform this test properly, but we urge you to familiarize yourself with this process thoroughly.
What we will talk about is the types of errors you can make in this process, as shown in the following screenshot:
Figure 1.8 – Type I and Type II errors
We define the errors you see in Figure 1.8 as follows:
Type I error: This denotes that we reject the hypothesis H₀ and the underlying distribution, even though it is correct. This is also referred to as a false-positive result or an alpha error.
Type II error: This denotes that we do not reject the hypothesis H₀ and the underlying distribution, even though H₁ is correct. This error is also referred to as a false-negative result or a beta error.
You might have heard the term false positive before. Often, it comes up when you take a medical test. A false positive would denote that you have a positive result from a test, even though you do not have the disease you are testing for. As a medical test is also a stochastic process, as with nearly everything else in our world, the term is correctly used in this scenario.
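The medical test example can be turned into a small counting exercise. In the following sketch, the null hypothesis is assumed to be "the patient is healthy", so a positive test result corresponds to rejecting it. The patient data is invented for illustration.

```python
# True state of each hypothetical patient: True = has the disease.
has_disease = [False, False, False, True, True, False, True, False]
# Test outcome for the same patients: True = test came back positive.
test_positive = [False, True, False, True, False, False, True, False]

pairs = list(zip(has_disease, test_positive))

# Type I error (false positive): healthy patient, but the test is positive.
type_1 = sum(1 for sick, positive in pairs if positive and not sick)
# Type II error (false negative): sick patient, but the test is negative.
type_2 = sum(1 for sick, positive in pairs if sick and not positive)

healthy_total = sum(1 for sick in has_disease if not sick)
sick_total = sum(1 for sick in has_disease if sick)

false_positive_rate = type_1 / healthy_total
false_negative_rate = type_2 / sick_total

print(f"Type I errors: {type_1} (rate {false_positive_rate:.2f})")
print(f"Type II errors: {type_2} (rate {false_negative_rate:.2f})")
```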
At the end of this section, when we talk about errors and metrics in ML model training, we will come back to these definitions. As a final step, let's discuss relationships among features and between features and labels. Such a relationship is referred to as a correlation.
There are multiple ways to calculate a correlation between two vectors, but the common correlation coefficients all produce a result in the range [-1, 1]. The result can be broadly assigned to one of the following three categories:
Negatively correlated: The result leans toward -1. When the values of one vector rise, the values of the other fall, and vice versa.
Uncorrelated: The result leans toward 0. There is no apparent linear relationship between the two vectors.
Positively correlated: The result leans toward 1. When the values of one vector rise, the values of the other rise as well, and vice versa.
Through this, we can get an idea of relationships between data points, but please be aware of the difference between causation and correlation, as outlined next.
Causation versus Correlation
Even if two vectors are correlated with each other, it does not mean that one of them is the cause of the other. It simply means that their values tend to change together. We cannot conclude causation from this, as we probably do not see the full picture and every single influencing factor, and both vectors may depend on a third factor that we have not measured.
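Returning to the three correlation categories, the following minimal sketch computes the Pearson correlation coefficient by hand. Pearson is only one of the possible ways to calculate a correlation and is chosen here as an assumption. The function name and the example vectors are invented for illustration.

```python
import math


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation coefficient of two equally long vectors."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of products of the deviations from the respective means.
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sum((a - mean_x) ** 2 for a in x)
    spread_y = sum((b - mean_y) ** 2 for b in y)
    return covariance / math.sqrt(spread_x * spread_y)


x = [1.0, 2.0, 3.0, 4.0, 5.0]
rising = [2.0, 4.0, 6.0, 8.0, 10.0]    # grows with x: positively correlated
falling = [10.0, 8.0, 6.0, 4.0, 2.0]   # shrinks as x grows: negatively correlated
unrelated = [3.0, 1.0, 4.0, 1.0, 3.0]  # no linear trend: uncorrelated

for name, y in (("rising", rising), ("falling", falling), ("unrelated", unrelated)):
    print(f"{name}: {pearson(x, y):+.2f}")
```

A coefficient near +1 or -1 only says that the two vectors move together. As described above, it says nothing about whether one of them causes the other.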
The mathematical theory we discussed so far should give you a good basis to build upon. In the next section, we