Mastering Azure Machine Learning

Christoph Körner

Description

Master expert techniques for building automated and highly scalable end-to-end machine learning models and pipelines in Azure using TensorFlow, Spark, and Kubernetes



Key Features


Make sense of data on the cloud by implementing advanced analytics
Train and optimize advanced deep learning models efficiently on Spark using Azure Databricks
Deploy machine learning models for batch and real-time scoring with Azure Kubernetes Service (AKS)



Book Description


Today's increase in data volume requires distributed systems, powerful algorithms, and scalable cloud infrastructure to compute insights and to train and deploy machine learning (ML) models. This book will help you improve your knowledge of building ML models using Azure and end-to-end ML pipelines on the cloud.



The book starts with an overview of an end-to-end ML project and a guide on how to choose the right Azure service for different ML tasks. It then focuses on Azure Machine Learning and takes you through the process of data experimentation, data preparation, and feature engineering using Azure Machine Learning and Python. You'll learn advanced feature extraction techniques using natural language processing (NLP), classical ML techniques, and the secrets of both a great recommendation engine and a performant computer vision model using deep learning methods. You'll also explore how to train, optimize, and tune models using Azure Automated Machine Learning and HyperDrive, and perform distributed training on Azure. Then, you'll learn different deployment and monitoring techniques using Azure Kubernetes Service with Azure Machine Learning, along with the basics of MLOps—DevOps for ML—to automate your ML process as a CI/CD pipeline.



By the end of this book, you'll have mastered Azure Machine Learning and be able to confidently design, build and operate scalable ML pipelines in Azure.



What you will learn


Set up your Azure Machine Learning workspace for data experimentation and visualization
Perform ETL, data preparation, and feature extraction using Azure best practices
Implement advanced feature extraction using NLP and word embeddings
Train gradient-boosted tree ensembles, recommendation engines, and deep neural networks on Azure Machine Learning
Use hyperparameter tuning and Azure Automated Machine Learning to optimize your ML models
Employ distributed ML on GPU clusters using Horovod in Azure Machine Learning
Deploy, operate, and manage your ML models at scale
Automate your end-to-end ML process as CI/CD pipelines for MLOps



Who this book is for


This machine learning book is for data professionals, data analysts, data engineers, data scientists, or machine learning developers who want to master scalable cloud-based machine learning architectures in Azure. This book will help you use advanced Azure services to build intelligent machine learning applications. A basic understanding of Python and working knowledge of machine learning are mandatory.



Mastering Azure Machine Learning

Perform large-scale end-to-end advanced machine learning on the cloud with Microsoft Azure ML

Christoph Körner
Kaijisse Waaijer

BIRMINGHAM - MUMBAI

Mastering Azure Machine Learning

Copyright © 2020 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

 

Commissioning Editor: Sunith Shetty
Acquisition Editor: Poornima Kumari
Content Development Editor: Athikho Sapuni Rishana
Senior Editor: Ayaan Hoda
Technical Editor: Utkarsha S. Kadam
Copy Editor: Safis Editing
Project Coordinator: Aishwarya Mohan
Proofreader: Safis Editing
Indexer: Manju Arasan
Production Designer: Jyoti Chauhan

First published: April 2020

Production reference: 1290420

Published by Packt Publishing Ltd. Livery Place 35 Livery Street Birmingham B3 2PB, UK.

ISBN 978-1-78980-755-4

www.packt.com

 

Packt.com

Subscribe to our online digital library for full access to over 7,000 books and videos, as well as industry leading tools to help you plan your personal development and advance your career. For more information, please visit our website.

Why subscribe?

Spend less time learning and more time coding with practical eBooks and Videos from over 4,000 industry professionals

Improve your learning with Skill Plans built especially for you

Get a free eBook or video every month

Fully searchable for easy access to vital information

Copy and paste, print, and bookmark content

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.packt.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.

At www.packt.com, you can also read a collection of free technical articles, sign up for a range of free newsletters, and receive exclusive discounts and offers on Packt books and eBooks. 

About the authors

Christoph Körner recently worked as a cloud solution architect for Microsoft, specializing in Azure-based big data and machine learning solutions, where he was responsible for designing end-to-end machine learning and data science platforms. For the last few months, he has been working as a senior software engineer at HubSpot, building a large-scale analytics platform. Before Microsoft, Christoph was the technical lead for big data at T-Mobile, where his team designed, implemented, and operated large-scale data analytics and prediction pipelines on Hadoop. He has also authored three books: Deep Learning in the Browser (for Bleeding Edge Press), Learning Responsive Data Visualization, and Data Visualization with D3 and AngularJS (both for Packt).

 

Kaijisse Waaijer is an experienced technologist specializing in data platforms, machine learning, and the Internet of Things. Kaijisse currently works for Microsoft EMEA as a data platform consultant specializing in data science, machine learning, and big data. She constantly works with customers across multiple industries as their trusted tech advisor, helping them optimize their organizational data to create better outcomes and business insights that drive value using Microsoft technologies. Her true passion lies in trading systems automation and in applying deep learning and neural networks to achieve advanced levels of prediction and automation.

About the reviewers

Alexey Bokov is an experienced Azure architect and has been a Microsoft technical evangelist since 2011. He works closely with Microsoft's top-tier customers all around the world to develop applications based on the Azure cloud platform. Building cloud-based applications for challenging scenarios is his passion, along with helping the development community upskill and learn new things through hands-on exercises and hacking. He's a long-time contributor to, and coauthor and reviewer of, many Azure books, and, from time to time, is a speaker at Kubernetes events.

 

Marek Chmel is a Senior Cloud Solutions Architect for Data & Artificial Intelligence at Microsoft, and a speaker and trainer with more than 15 years' experience. He's a frequent conference speaker, focusing on SQL Server, Azure, and security topics. He has been a Data Platform MVP since 2012. He has earned numerous certifications, including MCSE: Data Management and Analytics, Azure Architect, Data Engineer and Data Scientist Associate, EC Council Certified Ethical Hacker, and several eLearnSecurity certifications. Marek earned his MSc degree in business and informatics from Nottingham Trent University. He started his career as a trainer for Microsoft Server courses and later worked as a Principal SharePoint Administrator and Principal Database Administrator.

Packt is searching for authors like you

If you're interested in becoming an author for Packt, please visit authors.packtpub.com and apply today. We have worked with thousands of developers and tech professionals, just like you, to help them share their insight with the global tech community. You can make a general application, apply for a specific hot topic that we are recruiting an author for, or submit your own idea.

Table of Contents

Title Page

Copyright and Credits

Mastering Azure Machine Learning

About Packt

Why subscribe?

Contributors

About the authors

About the reviewers

Packt is searching for authors like you

Preface

Who this book is for

What this book covers

To get the most out of this book

Download the example code files

Download the color images

Conventions used

Get in touch

Reviews

Section 1: Azure Machine Learning Services

Building an End-To-End Machine Learning Pipeline in Azure

Performing descriptive data exploration

Moving data to the cloud

Understanding missing values

Visualizing data distributions

Finding correlated dimensions

Measuring feature and target dependencies for regression

Visualizing feature and label dependency for classification

Exploring common techniques for data preparation

Labeling the training data

Normalization and transformation in machine learning

Encoding categorical variables

A feature engineering example using time-series data

Using NLP to extract complex features from text

Choosing the right ML model to train data

Choosing an error metric

The training and testing split

Achieving great performance using tree-based ensemble models

Modeling large and complex data using deep learning techniques

Optimization techniques

Hyperparameter optimization

Model stacking

AutoML

Deploying and operating models

Batch scoring using pipelines

Real-time scoring using a container-based web service

Tracking model performance, telemetry, and data skew

Summary

Choosing a Machine Learning Service in Azure

Demystifying the different Azure services for ML

Choosing an Azure service for ML

Choosing a compute target for an Azure ML service

Azure Cognitive Services and Custom Vision

Azure Cognitive Services

Custom Vision—customizing the Cognitive Services API

Azure ML tools with GUIs

Azure ML Studio (classic)

Azure Automated ML

Microsoft Power BI

The Azure ML service

Organizing experiments and models in Azure ML

Deployments through Azure ML

Summary

Section 2: Experimentation and Data Preparation

Data Experimentation and Visualization Using Azure

Preparing your Azure ML workspace

Setting up the ML Service workspace

Running a simple experiment with Azure ML

Logging metrics and tracking results

Scheduling and running scripts

Adding cloud compute to the workspace

Visualizing high-dimensional data

Tracking figures in experiments in Azure ML

Unsupervised dimensionality reduction with PCA

Using LDA for supervised projections

Non-linear dimension reduction with t-SNE

Generalizing t-SNE with UMAP

Summary

ETL, Data Preparation, and Feature Extraction

Managing data and dataset pipelines in the cloud

Getting data into the cloud

Organizing data in data stores and datasets

Managing data in Azure ML

Versioning datasets and dataset definitions

Taking data snapshots for reproducibility

The life cycle of a dataset

Exploring data registered in the Azure ML service

Exploring the datasets

Exploring the data

Preprocessing and feature engineering with Azure ML DataPrep  

Parsing different data formats

Loading delimiter-separated data

Parsing JSON data

Loading binary column-store data in Parquet format

Building a data transformation pipeline in Azure ML

Generating features through expression

Data type conversions

Deriving columns by example

Imputing missing values

Label and one-hot encoding

Transformations and scaling

Filtering columns and rows

Writing the processed data back to a dataset

Summary

Advanced Feature Extraction with NLP

Understanding categorical data

Comparing textual, categorical, and ordinal data

Transforming categories into numeric values

Orthogonal embedding using one-hot encoding

Categories versus text

Building a simple bag-of-words model

A naive bag-of-words model using counting

Tokenization – turning a string into a list of words

Stemming – rule-based removal of affixes

Lemmatization – dictionary-based word normalization

A bag-of-words model in scikit-learn

Leveraging term importance and semantics

Generalizing words using n-grams and skip-grams

Reducing word dictionary size using SVD

Measuring the importance of words using tf-idf

Extracting semantics using word embeddings

Implementing end-to-end language models

End-to-end learning of token sequences

State-of-the-art sequence-to-sequence models

Text analytics using Azure Cognitive Services

Summary

Section 3: Training Machine Learning Models

Building ML Models Using Azure Machine Learning

Working with tree-based ensemble classifiers

Understanding a simple decision tree

Advantages of a decision tree

Disadvantages of a decision tree

Combining classifiers with bagging

Optimizing classifiers with boosting rounds

Training an ensemble classifier model using LightGBM

LightGBM in a nutshell

Preparing the data

Setting up the compute cluster and execution environment

Building a LightGBM classifier

Scheduling the training script on the Azure ML cluster

Summary

Training Deep Neural Networks on Azure

Introduction to deep learning

Why DL?

From neural networks to DL

Comparing classical ML and DL

Training a CNN for image classification

Training a CNN from scratch in your notebook

Generating more input data using augmentation

Moving training to a GPU cluster using Azure ML compute

Improving your performance through transfer learning

Summary

Hyperparameter Tuning and Automated Machine Learning

Hyperparameter tuning to find the optimal parameters

Sampling all possible parameter combinations using grid search

Trying random combinations using random search

Converging faster using early termination

The median stopping policy

The truncation selection policy

The bandit policy

A HyperDrive configuration with termination policy

Optimizing parameter choices using Bayesian optimization

Finding the optimal model with AutoML

Advantages and benefits of AutoML

A classification example

Summary

Distributed Machine Learning on Azure ML Clusters

Exploring methods for distributed ML

Training independent models on small data in parallel

Training a model ensemble on large datasets in parallel

Fundamental building blocks for distributed ML

Speeding up DL with data-parallel training

Training large models with model-parallel training

Using distributed ML in Azure

Horovod—a distributed DL training framework

Implementing the HorovodRunner API for a Spark job

Running Horovod on Azure ML compute

Summary

Building a Recommendation Engine in Azure

Introduction to recommender engines

Content-based recommendations

Measuring similarity between items

Feature engineering for content-based recommenders

Content-based recommendations using gradient boosted trees

Collaborative filtering—a rating-based recommendation engine

What is a rating? Explicit feedback as opposed to implicit feedback

Predicting the missing ratings to make a recommendation

Scalable recommendations using ALS factorization

Combining content and ratings in hybrid recommendation engines

Building a state-of-the-art recommender using the Matchbox Recommender

Automatic optimization through reinforcement learning

An example using Azure Personalizer in Python

Summary

Section 4: Optimization and Deployment of Machine Learning Models

Deploying and Operating Machine Learning Models

Deploying ML models in Azure

Understanding the components of an ML model

Registering your models in a model registry

Customizing your deployment environment

Choosing a deployment target in Azure

Building a real-time scoring service

Implementing a batch scoring pipeline

Inference optimizations and alternative deployment targets

Profiling models for optimal resource configuration

Portable scoring through the ONNX runtime

Fast inference using FPGAs in Azure

Alternative deployment targets

Monitoring Azure ML deployments

Collecting logs and infrastructure metrics

Tracking telemetry and application metrics

Summary

MLOps - DevOps for Machine Learning

Ensuring reproducible builds and deployments

Version-controlling your code

Registering snapshots of your data

Tracking your model metadata and artifacts

Scripting your environments and deployments

Validating your code, data, and models

Rethinking unit testing for data quality

Integration testing for ML

End-to-end testing using Azure ML

Continuous profiling of your model

Summary

What's Next?

Understanding the importance of data

The future of ML is automated

Change is the only constant – preparing for change

Focusing first on infrastructure and monitoring

Controlled rollouts and A/B testing

Summary

Other Books You May Enjoy

Leave a review - let other readers know what you think

Preface

Making use of the current increase in the volume of data being generated globally requires distributed systems, powerful algorithms, and scalable cloud infrastructure to compute insights and to train and deploy models. This book will help you improve your knowledge of building machine learning (ML) models using Azure and end-to-end ML pipelines on the cloud.

The book starts with an overview of an end-to-end ML project and a guide on how to choose the right Azure service for different ML tasks. It then focuses on Azure ML and takes you through the process of data experimentation, data preparation, and feature engineering. You'll learn advanced feature extraction techniques using natural language processing (NLP), classical ML techniques such as ensemble learning, and the secrets of both a great recommendation engine and a performant computer vision model using deep learning methods. You'll also explore how to train, optimize, and tune models using Azure AutoML and HyperDrive, and perform distributed training on Azure ML. Next, you'll learn different deployment and monitoring techniques using Azure Kubernetes Service with Azure ML, along with the basics of MLOps—DevOps for ML.

By the end of this book, you’ll have mastered ML development and be able to confidently build and operate scalable ML pipelines in Azure.

Who this book is for

This ML book is for data professionals, data analysts, data engineers, data scientists, or ML developers who want to master scalable, cloud-based ML architectures in Azure. This book will help you use advanced Azure services to build intelligent ML applications. A basic understanding of Python and a working knowledge of ML are mandatory.

What this book covers

Chapter 1, Building an End-to-End Machine Learning Pipeline in Azure, covers all the required components for running a custom end-to-end ML pipeline in Azure. Some sections might be a recap of your existing knowledge with useful practical tips, step-by-step guidelines, and pointers to Azure services to perform ML at scale. You can see it as an overview of the book, after which we will dive into every section in great detail with many practical examples and code throughout the remaining chapters of the book.

Chapter 2, Choosing a Machine Learning Service in Azure, helps us find out how best to navigate the available ML services in Azure and how to select the right one for your goal. Finally, we will explain why Azure ML is the best choice for building custom ML models. This is the service that we will use throughout the book to implement an end-to-end ML pipeline.

Chapter 3, Data Experimentation and Visualization Using Azure, takes a look at how to implement data experimentation and perform data visualizations with Azure ML. First, you will learn how to prepare and interact with your ML workspace. Once set up, you will be able to perform and track experiments in Azure, as well as trained models, plots, metrics, and snapshots of your code. This can all be done from your authoring Python environment, for example, Jupyter using Azure ML's Compute Instance or any Python interpreter running in PyCharm, VS Code, and so on. You will see many popular embeddings and visualization techniques including PCA, LDA, t-SNE, and UMAP in action.

Chapter 4, ETL, Data Preparation, and Feature Extraction, explores data preparation and Extract, Transform, and Load (ETL) techniques within Azure ML using Azure DataPrep. We will start by looking behind the scenes of datasets and data stores, the abstractions for physical data storage systems. Then, you will use Azure DataPrep to implement many popular preprocessing and feature engineering techniques such as imputing missing values, transformations, data type conversions, and many more. This will help you to implement a scalable ETL pipeline using Azure ML.

Chapter 5, Advanced Feature Extraction with NLP, takes us one step further to extract features from textual and categorical data – a problem that users often face when training ML models. This chapter will describe the foundations of feature extraction with NLP. This will help the reader to create semantic embeddings from categorical and textual data using techniques including n-grams, Bag of Words, TF-IDF, Word2Vec, and more.

Chapter 6, Building ML Models Using Azure Machine Learning, teaches you how to use ensembling techniques to build a traditional ML model in Azure. This chapter focuses on decision tree-based ensemble learning with popular state-of-the-art boosting and bagging techniques through the use of LightGBM in Azure ML. This will help you to apply the concepts of bagging and boosting to ML models.

Chapter 7, Training Deep Neural Networks on Azure, covers training more complex parametric models using deep learning for better generalization over large data. First, we will give a short and practical overview of when and why deep learning works well and how it differs from traditional ML approaches. We will focus more on understanding rational, practical principles rather than taking a theoretical approach. Then, we will train a Convolutional Neural Network (CNN) on Azure ML using Keras.

Chapter 8, Hyperparameter Tuning and Automated Machine Learning, looks at optimizing the training process in order to take away some of the error-prone human choices from ML. These tuning tricks will help you to train better models, faster and more efficiently. First, we will look at hyperparameter tuning (also called HyperDrive in Azure ML), a standard technique for optimizing all parameter choices in an ML process. By evaluating different sampling techniques for hyperparameter tuning such as random sampling, grid sampling, and Bayesian optimization, you will learn how to efficiently manage the trade-offs between runtime and model performance. In the second half of this chapter, we will generalize from hyperparameter optimization to automating the complete end-to-end ML training process using Automated ML, which is often referred to as AutoML. Using AutoML, we can straightforwardly optimize preprocessing, feature engineering, model selection, hyperparameter tuning, and model stacking all together in one simple abstract pipeline.

Chapter 9, Distributed Machine Learning on Azure ML Clusters, takes a look into distributed and parallel computing algorithms and frameworks for efficiently training ML models in parallel on GPUs. The goal of this chapter is to build an environment in Azure where you can speed up the training process of classical ML and deep learning models by adding more machines to your training environment and hence scaling out the cluster.

Chapter 10, Building a Recommendation Engine in Azure, dives into traditional and modern recommendation engines that often combine the technologies and techniques covered in the previous chapters. We will take a quick look at the different types of recommendation engines, what data is needed for each type, and what can be recommended using these different approaches, such as content-based recommendations and rating-based recommendation engines. We will combine both techniques into a single hybrid recommender and learn about state-of-the-art techniques for modern recommendation engines. You will implement two hybrid recommenders using Azure ML, one using Python and one using Azure ML Designer—the GUI of Azure ML.

Chapter 11, Deploying and Operating Machine Learning Models, tackles the next step after training a recommender engine or any of the previously trained ML models: we are going to package the model and execution runtime, register both in a model registry, and deploy them to an execution environment. We will auto-deploy models from Azure ML to Azure Kubernetes Service with only a few lines of code. You will also learn about monitoring your target environments using out-of-the-box custom metrics.

Chapter 12, MLOps – DevOps for Machine Learning, considers how we've put emphasis throughout the book on the possibility of scripting every step of the ML training and deployment process, either through bash, PowerShell, the Python SDK, or any other library wrapping the Azure ML REST service. This is true for creating environments, starting and scaling clusters, submitting experiments, performing parameter optimization, and deploying full-fledged scoring services on Kubernetes. In this chapter, we will reuse all these concepts to build a version-controlled, reproducible, and automated ML training and deployment process as a Continuous Integration/Continuous Deployment (CI/CD) pipeline in Azure.

Chapter 13, What's Next?, concludes all previous chapters and provides a rough outlook for the future. This chapter also provides ideas on how to continue working with ML in Azure, and which trends and references to watch in the future.

To get the most out of this book

Most code examples in this book require an Azure subscription to execute the code. You can create an Azure account for free and receive USD 200 of credits to use within 30 days using the sign-up page at https://azure.microsoft.com/en-us/free/.

The easiest way to get started is by creating an Azure ML Workspace (Basic or Enterprise) and subsequently creating a Compute Instance of VM type STANDARD_D3_V2 in your workspace. The Compute Instance gives you access to a JupyterLab or Jupyter Notebook environment with all essential libraries pre-installed and works great for the authoring and execution of experiments.

Rather than running all experiments on Azure, you can also run some of the code examples—especially the authoring code—on your local machine. To do so, you need a Python runtime—preferably an interactive runtime such as JupyterLab or Jupyter Notebook—with the Azure ML SDK installed. We recommend using Python>=3.6.1.

You can find more information about installing the SDK at https://docs.microsoft.com/en-us/python/api/overview/azure/ml/install?view=azure-ml-py.
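
To quickly verify your local setup, the following minimal sketch connects to an existing Azure ML workspace using the Python SDK. It assumes that the azureml-sdk package is installed and that you have downloaded the workspace's config.json file from the Azure portal into your working directory:

# A minimal sketch, assuming azureml-sdk is installed (pip install azureml-sdk)
# and a config.json downloaded from the Azure portal is in the working directory.
from azureml.core import Workspace

# Reads the subscription ID, resource group, and workspace name from config.json
ws = Workspace.from_config()
print(ws.name, ws.resource_group, ws.location, sep="\t")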

We will use the following library versions throughout the book unless stated otherwise. You can also find a detailed description of all the libraries used in each chapter in the GitHub repository for this book (link available in the next section).

Library         Version
azureml-sdk     1.3.0
pandas          0.23.4
numpy           1.16.2
scikit-learn    0.20.3
tensorflow      1.13.2
keras           2.3.1
seaborn         0.10.0
matplotlib      3.2.1

 

If you are using the digital version of this book, we advise you to type the code yourself or access the code via the GitHub repository (link available in the next section). Doing so will help you avoid any potential errors related to the copying and pasting of code.

To get the most out of this book, you should have experience in programming in Python and have a basic understanding of popular ML and data manipulation libraries such as TensorFlow, Keras, Scikit, and Pandas.

Download the example code files

You can download the example code files for this book from your account at www.packt.com. If you purchased this book elsewhere, you can visit www.packtpub.com/support and register to have the files emailed directly to you.

You can download the code files by following these steps:

1. Log in or register at www.packt.com.

2. Select the Support tab.

3. Click on Code Downloads.

4. Enter the name of the book in the Search box and follow the onscreen instructions.

Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:

WinRAR/7-Zip for Windows

Zipeg/iZip/UnRarX for Mac

7-Zip/PeaZip for Linux

The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Mastering-Azure-Machine-Learning. In case there's an update to the code, it will be updated on the existing GitHub repository.

We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

 

Download the color images

We also provide a PDF file that has color images of the screenshots/diagrams used in this book. You can download it here: https://static.packt-cdn.com/downloads/9781789807554_ColorImages.pdf.

Get in touch

Feedback from our readers is always welcome.

General feedback: If you have questions about any aspect of this book, mention the book title in the subject of your message and email us at [email protected].

Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packtpub.com/support/errata, selecting your book, clicking on the Errata Submission Form link, and entering the details.

Piracy: If you come across any illegal copies of our works in any form on the Internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.

If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.

Reviews

Please leave a review. Once you have read and used this book, why not leave a review on the site that you purchased it from? Potential readers can then see and use your unbiased opinion to make purchase decisions, we at Packt can understand what you think about our products, and our authors can see your feedback on their book. Thank you!

For more information about Packt, please visit packt.com.

Section 1: Azure Machine Learning Services

In the first part of the book, the reader will come to understand the steps and requirements of an end-to-end machine learning pipeline and will be introduced to the different Azure Machine Learning services. The reader will learn how to choose a machine learning service for a specific machine learning task.

This section comprises the following chapters:

Chapter 1, Building an End-to-End Machine Learning Pipeline in Azure

Chapter 2, Choosing a Machine Learning Service in Azure

Building an End-To-End Machine Learning Pipeline in Azure

This first chapter covers all the required components for running a custom end-to-end machine learning (ML) pipeline in Azure. Some sections might be a recap of your existing knowledge with useful practical tips, step-by-step guidelines, and pointers to using Azure services to perform ML at scale. You can see it as an overview of the book, where we will dive into each section in great detail with many practical examples and a lot of code during the remaining chapters of the book.

First, we will look at data experimentation techniques as a step-by-step process for analyzing common insights, such as missing values, data distribution, feature importance, and two-dimensional embedding techniques to estimate the expected model performance of a classification task. In the second section, we will use these insights about the data to perform data preprocessing and feature engineering, such as normalization, the encoding of categorical and temporal variables, and transforming text columns into meaningful features using natural language processing (NLP).

In the subsequent sections, we will recap the analytical process of training an ML model by selecting a model, an error metric, and a train-testing split, and performing cross-validation. Then, we will learn about techniques that help to improve the prediction performance of a single model through hyperparameter tuning, model stacking, and automated machine learning (AutoML). Finally, we will cover the most common techniques for model deployments, such as online real-time scoring and batch scoring.

The following topics will be covered in this chapter:

Performing descriptive data exploration

Common techniques for data preparation

Choosing the right ML model to train data

Optimization techniques

Deploying and operating models

Performing descriptive data exploration

Descriptive data exploration is, without a doubt, one of the most important steps in an ML project. If you want to clean data and build derived features or select an ML algorithm to predict a target variable in your dataset, then you need to understand your data first. Your data will define many of the necessary cleaning and preprocessing steps; it will define which algorithms you can choose and it will ultimately define the performance of your predictive model.

Hence, data exploration should be considered an important analytical step to understanding whether your data is informative to build an ML model in the first place. By analytical step, we mean that the exploration should be done as a structured analytical process rather than a set of experimental tasks. Therefore, we will go through a checklist of data exploration tasks that you can perform as an initial step in every ML project—before starting any data cleaning, preprocessing, feature engineering, or model selection.

Once the data is provided, we will work through the following data exploration checklist and try to get as many insights as possible about the data and its relation to the target variable:

Analyze the data distribution and check for the following:

Data types (continuous, ordinal, nominal, or text)

Mean, median, and percentiles

Data skew

Outliers and minimum and maximum values

Null and missing values

Most common values

The number of unique values (in categorical features)

Correlations (in continuous features)

Analyze how the target variable is influenced by the features and check for the following:

The regression coefficient (in regression)

Feature importance (in classification)

Categorical values with high error rates (in binary classification)

Analyze the difficulty of your prediction task.

By applying these steps, you will be able to understand the data and gain knowledge about the required preprocessing tasks for your data—features and target variables. Along with that, it will give you a good estimate of what difficulties you can expect in your prediction task, which is essential for judging required algorithms and validation strategies. You will also gain an insight into what possible feature engineering methods could apply to your dataset and have a better understanding of how to select a good error metric.

You can use a representative subset of the data and extrapolate your hypothesis and insights to the whole dataset.
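
As a rough illustration of this checklist, the following sketch runs the most common checks with pandas on a generic dataset. The file name is a placeholder and should be replaced with your own data:

# A minimal sketch of the exploration checklist, assuming a hypothetical CSV file.
import pandas as pd

df = pd.read_csv("training_data.csv")
numeric = df.select_dtypes(include="number")

print(df.dtypes)            # data types (continuous, ordinal, nominal, or text)
print(numeric.describe())   # mean, percentiles, minimum and maximum values
print(numeric.skew())       # data skew
print(df.isnull().sum())    # null and missing values
print(df.mode().iloc[0])    # most common values
print(df.nunique())         # number of unique values
print(numeric.corr())       # correlations between continuous features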

Moving data to the cloud

Before we can start exploring the data, we need to make it available in our cloud environment. While this seems like a trivial task, efficiently accessing data from a new environment inside a corporate network is not always easy. Also, uploading, copying, and distributing the same data to many Virtual Machines (VMs) and data science environments is not sustainable and doesn't scale well. For data exploration, we only need a significant subset of the data that can easily be connected to all other environments—rather than live access to a production database or data warehouse.

There is nothing wrong with uploading Comma-Separated Values (CSV) or Tab-Separated Values (TSV) files to your experimentation environment or accessing data via Java Database Connectivity (JDBC) from the source system. However, there are a few easy tricks to optimize your workflow.

First, we will choose a data format optimized for data exploration. In the exploration phase, we need to glance at the source data multiple times and explore the values, feature dimensions, and target variables. Hence, using a human-readable text format is usually very practical. In order to parse it efficiently, a delimiter-separated file, such as CSV, is strongly recommended. CSV can be parsed efficiently and you can open and browse it using any text editor.

Another small tweak that will bring you a significant performance improvement is compressing the file using Gzip before uploading it to the cloud. This will make uploads, loading, and downloads of this file much faster, while the compute resources spent on decompression are minimal. Thanks to the nature of the tabular data, the compression ratio will be very high. Most analytical frameworks for data processing, such as pandas and Spark, can read and parse Gzipped files natively, which requires minimal-to-no code changes. In addition, this only adds a small extra step for reading and analyzing the file manually with an editor.

Once your training data is compressed, it's recommended to upload the Gzipped CSV file to an Azure Storage container; a good choice would be Azure Blob storage. When the data is stored in Blob storage, it can be conveniently accessed from any other services within Azure, as well as from your local machine. This means if you scale your experimentation environment from an Azure notebook to a compute cluster, your code for accessing and reading the data will stay the same.
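
As a minimal sketch of this workflow, the following snippet compresses a CSV file with Gzip using pandas and uploads it to the workspace's default Blob storage container through the Azure ML SDK. The file and folder names are placeholders, and an Azure ML workspace with a local config.json is assumed:

# Hypothetical file and folder names; assumes pandas and azureml-sdk are installed.
import pandas as pd
from azureml.core import Workspace

df = pd.read_csv("training_data.csv")
df.to_csv("training_data.csv.gz", index=False, compression="gzip")  # Gzip-compressed CSV

ws = Workspace.from_config()
datastore = ws.get_default_datastore()  # the default Blob storage container of the workspace
datastore.upload_files(
    files=["training_data.csv.gz"],
    target_path="datasets/training",
    overwrite=True,
    show_progress=True,
)

Reading the compressed file back with pandas requires no extra code, as pd.read_csv("training_data.csv.gz") decompresses the file transparently.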

A fantastic cross-platform GUI tool to interact with many different Azure Storage services is Azure Storage Explorer. Using this tool, it is very easy to efficiently upload small and large files to Blob storage. It also allows you to generate direct links to your files with an embedded access key. This technique is simple yet also super effective when uploading hundreds of terabytes (TBs) from your local machine to the cloud. We will discuss this in much more detail in Chapter 4, ETL, Data Preparation, and Feature Extraction.

Understanding missing values

Once the data is uploaded to the cloud—for example, using Azure Storage Explorer and Azure Blob storage for your files—we can bring up a Notebook environment and start exploring the data. The goal is to thoroughly explore your data in an analytical process to understand the distribution of each dimension of your data. This is essential for choosing appropriate data preprocessing, feature engineering, and ML algorithms for your use case.

Please keep in mind that not only the feature dimensions but also the target variable needs to be preprocessed and thoroughly analyzed.

Analyzing each dimension of a dataset with more than 100 feature dimensions is an extremely time-consuming task. However, instead of randomly exploring feature dimensions, you can analyze the dimensions ordered by feature importance and hence significantly reduce your time working through the data. As in many other areas of computer science, it is good to apply an 80/20 principle to the initial data exploration and so only use 20% of the features to achieve 80% of the performance. This sets you up for a great start and you can always come back later to add more dimensions if needed.

The first thing to look for in a new dataset is missing values for each feature dimension. This will help you to gain a deeper understanding of the dataset and what actions could be taken to resolve those. It's not uncommon to remove missing values or impute them with zeros at the beginning of a project—however, this approach bears the risk of not properly analyzing missing values in the first place. 

Missing values can be disguised as valid numeric or categorical values. Typical examples are minimum or maximum values, -1, 0, or NaN. Hence, if you find the values 32,767 (= 2^15 - 1) or 65,535 (= 2^16 - 1) appearing multiple times in an integer data column, they might well be missing values disguised as the maximum signed or unsigned 16-bit integer representation. Always assume that your data contains missing values and outliers in different shapes and representations. Your task is to uncover, find, and clean them.

Any prior knowledge about the data or domain will give you a competitive advantage when working with the data. The reason for this is that you will be able to understand missing values, outliers, and extremes in relation to the data and domain—which will help you to perform better imputation, cleansing, or transformation. As the next step, you should look for these outliers in your data, specifically for the following values:

The absolute number (or percentage) of the null values (look for Null, "Null", "", NaN, and so on)

The absolute number (or percentage) of minimum and maximum values

The absolute number (or percentage) of the most common value (MODE)

The absolute number (or percentage) of value 0

The absolute number (or percentage) of unique values

Once you have identified these values, we can use different preprocessing techniques to impute missing values and normalize or exclude dimensions with outliers. You will find many of these techniques, such as group mean imputation, in action in Chapter 4, ETL, Data Preparation, and Feature Extraction.
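
A minimal sketch of these checks with pandas could look as follows; the file name is hypothetical, and the list of suspicious sentinel values should be adapted to your domain:

# A sketch of per-column missing-value and outlier-candidate counts (hypothetical file name).
import pandas as pd

df = pd.read_csv("training_data.csv")
n = len(df)

for col in df.columns:
    series = df[col]
    print(col)
    print("  null values:   ", series.isnull().sum() / n)
    print("  zeros:         ", (series == 0).sum() / n)
    print("  most common:   ", (series == series.mode().iloc[0]).sum() / n)
    print("  unique values: ", series.nunique())
    if pd.api.types.is_numeric_dtype(series):
        # candidates for disguised missing values at the minimum/maximum
        print("  at minimum:    ", (series == series.min()).sum() / n)
        print("  at maximum:    ", (series == series.max()).sum() / n)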

Visualizing data distributions

Knowing the outliers, you can finally approach exploring the value distribution of your dataset. This will help you understand which transformation and normalization techniques should be applied during data preparation. Common distribution statistics to look for in a continuous variable are the following:

The mean or median value

The minimum and maximum value

The 25th, 50th (median), and 75th percentiles

The data skew

Common techniques for visualizing these distributions are boxplots, density plots, or histograms. The following figure shows these different visualization techniques plotted per target class for a multi-class recognition dataset. Each of those methods has advantages and disadvantages—boxplots show all relevant metrics, while being a bit harder to read; density plots show very smooth shapes, while hiding some of the outliers; and histograms don't let you spot the median and percentiles easily, while giving you a good estimate for the data skew:

 

From the preceding visualization techniques, only histograms work well for categorical data (both nominal and ordinal)—however, you could also look at the number of values per category. In a binary classification task, another nice approach is to display the value distribution against the target rate. The following figure shows the version number of Windows Defender against the malware detection rate (for non-touch devices) from the Microsoft malware detection dataset:

Many statistical ML algorithms require that the data is normally distributed and hence needs to be normalized or standardized. Knowing the data distribution helps you to choose which transformations need to be applied during data preparation. In practice, it is often the case that data needs to be transformed, scaled, or normalized.
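
As an illustration, the following sketch draws a boxplot, a density plot, and a histogram of a single numeric feature per target class using seaborn and matplotlib. The column names "feature" and "label" are placeholders for your own dataset:

# A sketch of the three distribution plots, assuming hypothetical column names.
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

df = pd.read_csv("training_data.csv")

fig, axes = plt.subplots(1, 3, figsize=(15, 4))
sns.boxplot(x="label", y="feature", data=df, ax=axes[0])                 # boxplot per class
for cls, group in df.groupby("label"):
    sns.kdeplot(group["feature"], ax=axes[1], label=str(cls))            # density plot per class
    axes[2].hist(group["feature"], bins=30, alpha=0.5, label=str(cls))   # histogram per class
axes[1].legend()
axes[2].legend()
plt.show()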

Finding correlated dimensions

Another common task in data exploration is looking for correlations in the dataset. This will help you dismiss feature dimensions that are highly correlated and therefore might influence your ML model. In linear regression models, for example, two highly correlated independent variables will lead to large coefficients with opposite signs that ultimately cancel each other out. A much more stable regression model can be found by removing one of the correlated dimensions.

The Pearson correlation coefficient, for example, is a popular technique used to measure the linear relationship between two variables on a scale from -1 (strongly negatively correlated) to 1 (strongly positively correlated). A value of 0 indicates no linear relationship between the two variables.

The following figure shows an example of a correlation matrix of the Boston housing price dataset, consisting of only continuous variables. The correlations range from -1 to 1 and are colored accordingly. The last row shows us the linear correlation between each feature dimension and the target variable. We can immediately tell that the median value (MEDV) of owner-occupied homes and the lower status (LSTAT) percentage of the population are negatively correlated:

It is worth mentioning that most correlation coefficients can only be computed between numerical values. Ordinal variables can be integer-encoded, for example, so that a meaningful correlation coefficient can still be computed. For nominal data, you need to fall back on different methods, such as Cramér's V, to compute correlation. It is worth noting that the input data doesn't need to be normalized (linearly scaled) before computing the correlation coefficient.
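
A small sketch of such a correlation matrix using pandas and seaborn is shown below; the file name is a placeholder:

# Pearson correlation matrix of the numeric columns, plotted as a heatmap.
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

df = pd.read_csv("training_data.csv")  # hypothetical file name
corr = df.select_dtypes(include="number").corr(method="pearson")

sns.heatmap(corr, vmin=-1, vmax=1, cmap="RdBu_r", annot=True, fmt=".2f")
plt.title("Pearson correlation matrix")
plt.show()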

Measuring feature and target dependencies for regression

Once we have analyzed missing values, data distribution, and correlations, we can start analyzing the relationship between the features and the target variable. This will give us a good indication of the difficulty of the prediction problem and hence the expected baseline performance—which is essential for prioritizing feature engineering efforts and choosing an appropriate ML model. Another great benefit of measuring this dependency is ranking the feature dimensions by the impact on the target variable, which you can use as a priority list for data exploration and preprocessing.

In a regression task, the target variable is numerical or ordinal. Therefore, we can compute the correlation coefficient between the individual features and the target variable to compute the linear dependency between the feature and the target. High correlation, and so a high absolute correlation coefficient, indicates a strong linear relationship exists. This gives us a great place to start for further exploration. However, in many practical problems, it is rare to see a high (linear) correlation between the feature and target variables.

One can also visualize this dependency between the feature and the target variable using a scatter or regression plot. The following figure shows a regression plot between the feature average number of rooms per dwelling (RM) and the target variable median value of owner-occupied homes (MEDV) from the UCI Boston housing dataset. If the regression line is at 45 degrees, then we have a perfect linear correlation:

Another great approach to determining this dependency is to fit a linear or logistic regression model to the training data. The resulting model coefficients now give a good explanation of the relationship—the higher the coefficient, the larger the linear (for linear regression) or marginal (for logistic regression) dependency on the target variable. Hence, sorting by coefficients results in a list of features ordered by importance. Depending on the regression type, the input data should be normalized or standardized.

The following figure shows an example of the correlation coefficients (the first column) of a fitted Ordinary Least Squares (OLS) regression model:

While the resulting R-squared metric (not shown) might not be good enough for a baseline model, the ordering of the coefficients can help us to prioritize further data exploration, preprocessing, and feature engineering.
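
The following sketch shows one way to obtain such a coefficient-based ranking with scikit-learn: standardize the features, fit a linear regression model, and sort the features by the absolute value of their coefficients. It uses load_boston, which exists in the scikit-learn version used in this book but has been removed from recent releases, where you would need to load the Boston housing data from another source:

# A sketch of ranking features by regression coefficients (Boston housing dataset).
import pandas as pd
from sklearn.datasets import load_boston          # removed in scikit-learn >= 1.2
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

data = load_boston()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target                                   # MEDV, median value of owner-occupied homes

X_std = StandardScaler().fit_transform(X)         # standardize so coefficients are comparable
model = LinearRegression().fit(X_std, y)

coefficients = pd.Series(model.coef_, index=X.columns)
print(coefficients.reindex(coefficients.abs().sort_values(ascending=False).index))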

Visualizing feature and label dependency for classification

In a classification task with a multi-class nominal target variable, we can't use the regression coefficients without further preprocessing the data. Another popular method that works well out of the box is fitting a simple tree-based classifier to the training data. Depending on the size of the training data, we could use a decision tree or a tree-based ensemble classifier, such as random forest or gradient-boosted trees. Doing so results in a feature importance ranking of the feature dimensions according to the chosen split criterion. In the case of splitting by entropy, the features would be sorted by information gain and hence, indicate which variables carry the most information about the target.

The following graph shows the feature importance fitted by a tree-based ensemble classifier using the entropy criterion from the UCI wine recognition dataset:

The lines represent variations in the information gain of features between individual trees. This output is a great first step to further data analysis and exploration in order of feature importance.
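
A minimal sketch of such a feature importance ranking on the UCI wine recognition dataset, using a random forest with the entropy criterion, could look like this; the variation between individual trees is estimated from the per-tree importances:

# Feature importance from a tree-based ensemble classifier (entropy criterion).
import numpy as np
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier

data = load_wine()
clf = RandomForestClassifier(n_estimators=100, criterion="entropy", random_state=42)
clf.fit(data.data, data.target)

importances = pd.Series(clf.feature_importances_, index=data.feature_names)
variation = np.std([tree.feature_importances_ for tree in clf.estimators_], axis=0)

ranking = pd.DataFrame({"importance": importances, "variation": variation})
print(ranking.sort_values("importance", ascending=False))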

Here is another popular approach to discovering the separability of your dataset. The following graphs—one that is linearly separable (left) and one that is not separable (right)—show a dataset with three classes:

You can see this when looking at the three clusters and the overlaps between these clusters. Having clearly separated clusters means that a trained classification model will perform very well on this dataset. On the other hand, when we know that the data is not linearly separable, we know that this task will require advanced feature engineering and modeling to produce good results.

The preceding figure showed two datasets in two dimensions; we actually used the first two feature dimensions for visualization. However, most high-dimensional data cannot be easily and accurately visualized in two dimensions. To achieve this, we need a projection or embedding technique to embed the feature space in two dimensions. Many linear and non-linear embedding techniques to produce two-dimensional projections of data exist; here are the most common ones:

Principal Component Analysis (PCA)

Linear Discriminant Analysis (LDA)

t-Distributed Stochastic Neighbor Embedding (t-SNE)

Uniform Manifold Approximation and Projection (UMAP)

In the following figure, the LDA (left) and t-SNE (right) embeddings for the 13-dimensional UCI wine recognition dataset (https://archive.ics.uci.edu/ml/datasets/wine) are shown. In the LDA embedding, we can see that all the classes should be linearly separable. That's a lot we have learned from using two lines of code to plot the embedding before we even start with model selection or training:

Both the LDA and t-SNE embeddings are extremely helpful for judging the separability of the individual classes and hence the difficulty of your classification task. It's always good to assess how well a particular model will perform on your data before you start selecting and training a specific algorithm. You will learn more about these techniques in Chapter 3, Data Experimentation and Visualization Using Azure.
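
A minimal sketch of both embeddings on the wine recognition dataset with scikit-learn is shown below; the t-SNE perplexity is a common default and may need tuning for your own data:

# LDA (supervised) and t-SNE (non-linear) embeddings of the 13-dimensional wine dataset.
import matplotlib.pyplot as plt
from sklearn.datasets import load_wine
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

data = load_wine()
X = StandardScaler().fit_transform(data.data)
y = data.target

X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)
X_tsne = TSNE(n_components=2, perplexity=30, random_state=42).fit_transform(X)

fig, axes = plt.subplots(1, 2, figsize=(12, 5))
axes[0].scatter(X_lda[:, 0], X_lda[:, 1], c=y)
axes[0].set_title("LDA embedding")
axes[1].scatter(X_tsne[:, 0], X_tsne[:, 1], c=y)
axes[1].set_title("t-SNE embedding")
plt.show()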

Exploring common techniques for data preparation

After the data experimentation phase, you should have gathered enough knowledge to start preprocessing the data. This process is also often referred to as feature engineering. When coming from multiple sources, such as applications, databases, or warehouses, as well as external sources, your data cannot be analyzed or interpreted immediately.

It is, therefore, of utmost importance to preprocess data before you choose a model for your problem. In addition to this, there are different steps involved in data preparation, which depend on the data that is available to you, the problem you want to solve, and, with that, the ML algorithms that could be used for it.

You might ask yourself why data preparation is so important. The answer is that the preparation of your data might lead to improvements in model accuracy when done properly. This could be because relationships within your data have been simplified by the preparation. By experimenting with data preparation, you will also be able to boost the model's accuracy later on. Usually, data scientists spend a significant amount of their time on data preparation, feature engineering, and understanding their data. In addition to this, data preparation is important for generating insights.

Data preparation means collecting data, cleaning it up, transforming the data, and consolidating it. By doing this, you can enrich your data, transform it, and as mentioned previously, improve the accuracy of your model. In fact, in many cases, an ML model's performance can be improved significantly through better feature engineering.

The challenges that come along with data preparation are, for example, the different file formats, the data types, inconsistency in data, limited or too much access to data, and sometimes even insufficient infrastructure around data integration. Another difficult problem is converting text, such as nominal or ordinal categories or free text, into a numeric value.

Currently, data preparation is typically viewed and performed through extract, transform, and load (ETL) tools. It is of utmost importance that data within organizations is aligned and transformed using data standards. Effective integration of various data sources should be done by aligning the data, transforming it, and then promoting the development and adoption of data standards. All this effectively helps in managing the volume, variety, veracity, and velocity of the data.

In the following subsections, some of the key techniques in data preparation, such as labeling, storing, encoding, and normalizing data, as well as feature extraction, will be shown in more depth.

Labeling the training data

Let's start with a bummer; the first step in the data preparation journey is labeling, also called annotation. It is a bummer because it is the least exciting part of an ML project, yet one of the most important tasks in the whole process. Garbage in, garbage out—it's that simple. The ultimate goal is to feed high-quality training data into the ML algorithms, which is why labeling training data is so important.

While proper labels greatly help to improve prediction performance, the labeling process will also help you to study the dataset in greater detail. However, let me clarify that labeling data requires deep insight and understanding of the context of the dataset and the prediction process. If we were aiming to predict breast cancer using CT scans, we would also need to understand how breast cancer can be detected in CT images in order to label the data.

Mislabeling the training data has a couple of consequences, such as label noise, which you want to avoid, as it will degrade the performance of every downstream step in the ML pipeline, such as feature selection and feature ranking, and ultimately the model performance. Learning relies crucially on the accuracy of the labels in the training dataset. However, we should always take label noise into account when aiming for a specific target metric because it's highly unlikely that all the provided labels will be absolutely precise and accurate.

In some cases, your labeling methodology is dependent on the chosen ML approach for a prediction problem. A good example is the difference between object detection and segmentation, both of which require completely differently labeled data. As labeling for segmentation is much more time-consuming than labeling for object detection or even classification, it is also an important trade-off to make before starting an ML project.

There are some techniques you can use to speed up the labeling process, which are hopefully provided by your labeling system:

Supervised learning

: Through supervised learning, an ML model could recommend the correct labels for your data at labeling time. You can then decide whether you use the predicted label or choose a different or modified label. This works very well with object detection and image data.

Active learning

: Another technique to accelerate the labeling process is to let a semi-supervised learning process learn and predict based on a few manually labeled samples. Using those labeled samples, the model automatically proposes new labels that can either be accepted or modified. Each confirmed label further fine-tunes the model so that it proposes better labels over time.


Unsupervised learning

: By clustering similar data samples together, the labeling environment can prioritize which data points should be labeled next. Using these insights, it can propose a diverse set of samples from the training data for manual labeling.

Labeling is a necessary, long, and cost-intensive step in an ML process. There are techniques to facilitate labeling; however, they still require domain knowledge to be applied properly. If there is any chance that you can collect labeled data through your application directly, you are very lucky and should start collecting this data. A good example is collecting training data for a click-through rate model of search results based on the actual results and clicks of real users.
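To make the active learning idea above a bit more concrete, here is a minimal sketch of uncertainty-based sampling using scikit-learn. The dataset is synthetic and the pool/seed split and batch size are purely illustrative assumptions, not a prescribed workflow:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for a real dataset: 20 labeled seed samples, 980 unlabeled
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X_seed, y_seed = X[:20], y[:20]
X_pool, y_pool = X[20:], y[20:]          # y_pool simulates the human annotator

model = LogisticRegression(max_iter=1000)

for labeling_round in range(3):
    model.fit(X_seed, y_seed)

    # Propose labels for the unlabeled pool and measure the model's confidence
    proba = model.predict_proba(X_pool)
    confidence = proba.max(axis=1)

    # Send the 10 least confident samples to a human for manual labeling
    query_idx = np.argsort(confidence)[:10]
    X_seed = np.vstack([X_seed, X_pool[query_idx]])
    y_seed = np.concatenate([y_seed, y_pool[query_idx]])  # human-provided labels

    # Remove the newly labeled samples from the unlabeled pool
    mask = np.ones(len(X_pool), dtype=bool)
    mask[query_idx] = False
    X_pool, y_pool = X_pool[mask], y_pool[mask]

With each round, the model is retrained on the growing labeled set, so the labels it proposes should gradually improve.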

Normalization and transformation in machine learning

Normalization is a common data preprocessing technique where the data is scaled to a different value range through a (linear) transformation. For many statistical ML models, the training data needs to follow a certain distribution and so it needs to first be normalized along all its dimensions. The following are some of the most commonly used methods to normalize data:

Scaling to unit length, or standard scaling

Minimum/maximum scaling

Mean normalization

Quantile normalization

In addition to these, in some fields, such as chemistry, data is normalized so that it forms a valid probability density function, that is, so that its values integrate to 1. For data following an exponential or Poisson distribution, you could normalize using the coefficient of variation, which is well suited to distributions over positive values.

In ML algorithms such as Support Vector Machines (SVM), logistic regression, and neural networks, a very common normalization technique is standardization, which standardizes features by giving them a 0 mean and unit variance. This is often referred to as a standard scaler.
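As a rough illustration, here is how standard scaling and minimum/maximum scaling could look with scikit-learn; the array is made up purely for demonstration:

import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Two made-up numeric features on very different scales
X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 500.0],
              [4.0, 1000.0]])

# Standardization: zero mean and unit variance per feature
X_std = StandardScaler().fit_transform(X)

# Minimum/maximum scaling: each feature mapped to the [0, 1] range
X_minmax = MinMaxScaler().fit_transform(X)

print(X_std.mean(axis=0), X_std.std(axis=0))        # approximately [0, 0] and [1, 1]
print(X_minmax.min(axis=0), X_minmax.max(axis=0))   # [0, 0] and [1, 1]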

Besides linear transformations, it's also quite common to apply non-linear transformations to your data for the same reason as normalization, namely to satisfy the distributional assumptions of a model. If your data is skewed, you can use power or log transformations to bring the distributions closer to normal. This is very important, for example, for linear models, which rely on normality assumptions. For highly skewed data, you can also apply these transformations multiple times. For value ranges that contain 0, it's common to apply a log(1 + x) transformation instead to avoid numerical instability.
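As a small sketch, a skewed made-up feature can be transformed with NumPy's log1p or with scikit-learn's PowerTransformer (a Yeo-Johnson power transform):

import numpy as np
from sklearn.preprocessing import PowerTransformer

# Made-up, heavily right-skewed feature that contains zeros
x = np.array([0.0, 1.0, 2.0, 5.0, 20.0, 100.0, 1000.0]).reshape(-1, 1)

# log(1 + x) handles zeros without producing -inf
x_log = np.log1p(x)

# Yeo-Johnson power transform, which also works with zero and negative values
x_power = PowerTransformer(method="yeo-johnson").fit_transform(x)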

Encoding categorical variables

With a real-world dataset, you will quickly reach the limits of normalization and transformation as the data for these transformations needs to be continuous. The same is true for many statistical ML algorithms, such as linear regression, SVM, or neural networks; the input data needs to be numerical. Hence, in order to work with categorical data, we need to look at different numerical encoding techniques.

We differentiate between three different types of categorical data: ordinal, nominal, and textual (for example, free text). We make this distinction between nominal and textual data as textual data is often used to extract semantic meaning whereas nominal categorical data is often just encoded. 

There are various types of numerical encoding techniques you could use. They are listed here:

Label encoding

: This is where each category is mapped to a number or label. The labels for the categories are not related to each other; therefore, categories that are related will lose this information after encoding.

One-hot encoding

: Another popular approach is dummy coding, also called one-hot encoding. Each category is replaced with an orthogonal feature vector, where the dimension of the feature vector is dependent on the number of distinct values. This approach is not efficient with high cardinality features.

Bin encoding

: Even though bin (binary) encoding is quite similar to one-hot encoding, it differs by storing categories as binary bitstrings. Each category is mapped to an integer, the integer is written in binary, and each binary digit gets its own column. This results in some information loss; however, you end up with far fewer dimensions. Because it creates fewer columns, learning is faster and more memory efficient.

Count encoding

: In count encoding, we replace the categorical values with the relative or absolute count of the value over the whole training set. This is a common technique for encoding large amounts of unique labels.

Target encoding

: In this encoding methodology, we replace the categorical value with the mean value of the target variable of this category. This is also effective with high-cardinality features.

Hashing encoding

: This is used when there are many high-cardinality categorical features. A hash function maps the many distinct values into a small, finite set of values. Different values can map to the same hash, which is called a collision.

We will take a closer look at some of these encoding techniques in Chapter 5, Advanced Feature Extraction with NLP.
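To illustrate a few of the encodings listed above, the following sketch uses pandas on a made-up dataset; the column names and values are purely illustrative:

import pandas as pd

# Made-up training data with a categorical feature and a numeric target
df = pd.DataFrame({
    "city":   ["Berlin", "Vienna", "Berlin", "Zurich", "Vienna", "Berlin"],
    "target": [1, 0, 1, 0, 1, 0],
})

# Label encoding: each category is mapped to an arbitrary integer
df["city_label"] = df["city"].astype("category").cat.codes

# One-hot (dummy) encoding: one binary column per distinct category
one_hot = pd.get_dummies(df["city"], prefix="city")

# Count encoding: replace the category with its absolute frequency
df["city_count"] = df["city"].map(df["city"].value_counts())

# Target encoding: replace the category with the mean target value per category
df["city_target"] = df["city"].map(df.groupby("city")["target"].mean())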

A feature engineering example using time-series data

Feature engineering depends strongly on the domain of your dataset. When dealing with demographic or geographic data, you can model personas and demographic metrics, or join geographic attributes such as proximity to a large city or a border, GDP, and so on. Let's look at an example using time-series data, which is extremely common in real-world scenarios.

Many real-world datasets have a temporal dependency and so they store the date and time in at least one of the dimensions of the training data. This date-time field can be treated either as an encoded categorical variable or as an ordinal variable, depending on its distribution.

Depending on the distribution and patterns in the date-time data, you want to transform the date-time field into different values that encode a specific property of the current date or time. The following are a few features that can be extracted from date-time variables:

The absolute time

The hour of the day

The day of the week

The day of the month

The day of the year

The month of the year

If you see a periodic relationship in a dimension over time, you can also encode cyclic features of the time dimension. This can be achieved, for example, by normalizing the hour of the day and computing its sine and cosine components. Using this technique, the resulting values capture the cyclic nature of the encoded date-time dimension, so that hour 23 and hour 0 end up close to each other.
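As a short sketch, assuming a pandas DataFrame with a made-up timestamp column, the date parts listed above and a cyclic encoding of the hour could be computed like this:

import numpy as np
import pandas as pd

# Made-up hourly timestamps
df = pd.DataFrame({"timestamp": pd.date_range("2021-01-01", periods=48, freq="H")})

# Simple date-time features
df["hour"] = df["timestamp"].dt.hour
df["day_of_week"] = df["timestamp"].dt.dayofweek
df["day_of_month"] = df["timestamp"].dt.day
df["month"] = df["timestamp"].dt.month

# Cyclic encoding of the hour of the day: 23:00 and 00:00 become neighbors
df["hour_sin"] = np.sin(2 * np.pi * df["hour"] / 24)
df["hour_cos"] = np.cos(2 * np.pi * df["hour"] / 24)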

Another great way of improving your model's performance is to enrich your training data with additional data sources. This works really well on the date-time dimension as you can, for example, join public holidays, public events, or other types of events by date. This lets you create features such as the following (a short sketch follows the list):

The number of days until or since the next or last campaign

The number of days until or since the next or last holiday

Mark a date as a public holiday

Mark a date as a major sporting event
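Here is a minimal sketch of such an enrichment; the holiday dates are made up and would, in practice, come from a calendar table or an external dataset:

import pandas as pd

# Made-up daily records and a tiny, hypothetical holiday calendar
df = pd.DataFrame({"date": pd.date_range("2021-12-20", periods=14, freq="D")})
holidays = pd.to_datetime(["2021-12-25", "2021-12-26", "2022-01-01"])

# Mark a date as a public holiday
df["is_holiday"] = df["date"].isin(holidays)

# Number of days until the next holiday and since the last holiday
def days_until_next(date):
    future = holidays[holidays >= date]
    return (future.min() - date).days if len(future) > 0 else None

def days_since_last(date):
    past = holidays[holidays <= date]
    return (date - past.max()).days if len(past) > 0 else None

df["days_until_next_holiday"] = df["date"].apply(days_until_next)
df["days_since_last_holiday"] = df["date"].apply(days_since_last)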

As you can see, there are many ways to transform and encode date-time variables. You are encouraged to dive into the raw data and look for visual patterns that the ML model should be able to pick up. Whenever you deal with a date-time dimension, there is room for creative feature engineering.