Hands-On Artificial Intelligence for Cybersecurity E-Book

Alessandro Parisi


Publisher: Packt Publishing
Category: Science and New Technologies
Language: English
Year of publication: 2019

Description

Build smart cybersecurity systems with the power of machine learning and deep learning to protect your corporate assets

Key Features

Identify and predict security threats using artificial intelligence

Develop intelligent systems that can detect unusual and suspicious patterns and attacks

Learn how to test the effectiveness of your AI cybersecurity algorithms and tools

Book Description

Today's organizations spend billions of dollars globally on cybersecurity. Artificial intelligence has emerged as a great solution for building smarter and safer security systems that allow you to predict and detect suspicious network activity, such as phishing or unauthorized intrusions.

This cybersecurity book presents and demonstrates popular and successful AI approaches and models that you can adapt to detect potential attacks and protect your corporate systems. You'll learn about the role of machine learning and neural networks, as well as deep learning in cybersecurity, and you'll also learn how you can infuse AI capabilities into building smart defensive mechanisms. As you advance, you'll be able to apply these strategies across a variety of applications, including spam filters, network intrusion detection, botnet detection, and secure authentication.

By the end of this book, you'll be ready to develop intelligent systems that can detect unusual and suspicious patterns and attacks, thereby developing strong network security defenses using AI.

What you will learn

Detect email threats such as spamming and phishing using AI

Categorize APT, zero-days, and polymorphic malware samples

Overcome antivirus limits in threat detection

Predict network intrusions and detect anomalies with machine learning

Verify the strength of biometric authentication procedures with deep learning

Evaluate cybersecurity strategies and learn how you can improve them

Who this book is for

If you're a cybersecurity professional or ethical hacker who wants to build intelligent systems using the power of machine learning and AI, you'll find this book useful. Familiarity with cybersecurity concepts and knowledge of Python programming are essential to get the most out of this book.

Details

You can read this e-book in the Legimi apps or in any app that supports the following format:

EPUB

Page count: 396

Sample

Hands-On Artificial Intelligence for Cybersecurity

Implement smart AI systems for preventing cyber attacks and detecting threats and network anomalies

Alessandro Parisi

BIRMINGHAM - MUMBAI

Hands-On Artificial Intelligence for Cybersecurity

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Commissioning Editor: Pravin Dhandre
Acquisition Editor: Yogesh Deokar
Content Development Editor: Manorama Haridas
Technical Editor: Vibhuti Gawde
Copy Editor: Safis Editing
Project Coordinator: Kirti Pisat
Proofreader: Safis Editing
Indexer: Rekha Nair
Production Designer: Deepika Naik

First published: August 2019

Production reference: 1010819

Published by Packt Publishing Ltd. Livery Place 35 Livery Street Birmingham B3 2PB, UK.

ISBN 978-1-78980-402-7

www.packtpub.com

Packt.com

Subscribe to our online digital library for full access to over 7,000 books and videos, as well as industry leading tools to help you plan your personal development and advance your career. For more information, please visit our website.

Why subscribe?

Spend less time learning and more time coding with practical eBooks and Videos from over 4,000 industry professionals

Improve your learning with Skill Plans built especially for you

Get a free eBook or video every month

Fully searchable for easy access to vital information

Copy and paste, print, and bookmark content

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.packt.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.

At www.packt.com, you can also read a collection of free technical articles, sign up for a range of free newsletters, and receive exclusive discounts and offers on Packt books and eBooks.

Contributors

About the author

Alessandro Parisi has been an IT professional for over 20 years, acquiring significant experience as a security data scientist, and as an AI cybersecurity and blockchain specialist. He has experience of operating within organizational and decisional contexts characterized by high complexity. Over the years, he has helped companies to adopt AI and blockchain DLT technologies as strategic tools in protecting sensitive corporate assets. He holds an MSc in economics and statistics.

To Ilaria, for all her love...and patience!

About the reviewers

Chiheb Chebbi is a Tunisian InfoSec enthusiast, author, and technical reviewer with experience in various aspects of information security, focusing on the investigation of advanced cyber attacks and research into cyber espionage. His core interests are penetration testing, machine learning, and threat hunting. He has been included in many halls of fame. The proposals outlined in his talks have been accepted by many world-class information security conferences.

I dedicate this book to every person who makes the security community awesome and fun!

Dr. Madiha Jafri received a BSc in computer engineering, an MSc in electrical engineering and a PhD in electrical and computer engineering from Old Dominion University in 2003, 2004, and 2007, respectively. Funded by NASA, she developed the use of artificial intelligence to predict electromagnetic interference patterns on commercial aircraft. Dr. Jafri has been with Lockheed Martin since June 2007 as a cybersecurity manager and cryptography expert, designing complex cryptographic solutions. She is now a senior scientist with a focus in artificial intelligence and cybersecurity.

Packt is searching for authors like you

If you're interested in becoming an author for Packt, please visit authors.packtpub.com and apply today. We have worked with thousands of developers and tech professionals, just like you, to help them share their insight with the global tech community. You can make a general application, apply for a specific hot topic that we are recruiting an author for, or submit your own idea.

Title Page

Hands-On Artificial Intelligence for Cybersecurity

About Packt

Why subscribe?

Contributors

About the author

About the reviewers

Packt is searching for authors like you

Preface

Who this book is for

What this book covers

To get the most out of this book

Download the example code files

Download the color images

Conventions used

Get in touch

Reviews

Section 1: AI Core Concepts and Tools of the Trade

Introduction to AI for Cybersecurity Professionals

Applying AI in cybersecurity

Evolution in AI: from expert systems to data mining

A brief introduction to expert systems

Reflecting the indeterministic nature of reality

Going beyond statistics toward machine learning

Mining data for models

Types of machine learning

Supervised learning

Unsupervised learning

Reinforcement learning

Algorithm training and optimization

How to find useful sources of data

Quantity versus quality

Getting to know Python's libraries

Supervised learning example – linear regression

Unsupervised learning example – clustering

Simple NN example – perceptron

AI in the context of cybersecurity

Summary

Setting Up Your AI for Cybersecurity Arsenal

Getting to know Python for AI and cybersecurity

Python libraries for AI

NumPy as an AI building block

NumPy multidimensional arrays

Matrix operations with NumPy

Implementing a simple predictor with NumPy

Scikit-learn

Matplotlib and Seaborn

Pandas

Python libraries for cybersecurity

Pefile

Volatility

Installing Python libraries

Enter Anaconda – the data scientist's environment of choice

Anaconda Python advantages

Conda utility

Installing packages in Anaconda

Creating custom environments

Some useful Conda commands

Python on steroids with parallel GPU

Playing with Jupyter Notebooks

Our first Jupyter Notebook

Exploring the Jupyter interface

What's in a cell?

Useful keyboard shortcuts

Choose your notebook kernel

Getting your hands dirty

Installing DL libraries

Deep learning pros and cons for cybersecurity

TensorFlow

Keras

PyTorch

PyTorch versus TensorFlow

Summary

Section 2: Detecting Cybersecurity Threats with AI

Ham or Spam? Detecting Email Cybersecurity Threats with AI

Detecting spam with Perceptrons

Meet NNs at their purest – the Perceptron

It's all about finding the right weight!

Spam filters in a nutshell

Spam filters in action

Detecting spam with linear classifiers

How the Perceptron learns

A simple Perceptron-based spam filter

Pros and cons of Perceptrons

Spam detection with SVMs

SVM optimization strategy

SVM spam filter example

Image spam detection with SVMs

How did SVM come into existence?

Phishing detection with logistic regression and decision trees

Regression models

Introducing linear regression models

Linear regression with scikit-learn

Linear regression – pros and cons

Logistic regression

A phishing detector with logistic regression

Logistic regression pros and cons

Making decisions with trees

Decision trees rationales

Phishing detection with decision trees

Decision trees – pros and cons

Spam detection with Naive Bayes

Advantages of Naive Bayes for spam detection

Why Naive Bayes?

NLP to the rescue

NLP steps

A Bayesian spam detector with NLTK

Summary

Malware Threat Detection

Malware analysis at a glance

Artificial intelligence for malware detection

Malware goes by many names

Malware analysis tools of the trade

Malware detection strategies

Static malware analysis

Static analysis methodology

Difficulties of static malware analysis

How to perform static analysis

Hardware requirements for static analysis

Dynamic malware analysis

Anti-analysis tricks

Getting malware samples

Hacking the PE file format

The PE file format as a potential vector of infection

Overview of the PE file format

The DOS header and DOS stub

The PE header structure

The data directory

Import and export tables

Extracting malware artifacts in a dataset

Telling different malware families apart

Understanding clustering algorithms

From distances to clusters

Clustering algorithms

Evaluating clustering with the Silhouette coefficient

K-Means in depth

K-Means steps

K-Means pros and cons

Clustering malware with K-Means

Decision tree malware detectors

Decision trees classification strategy

Detecting malwares with decision trees

Decision trees on steroids – random forests

Random Forest Malware Classifier

Detecting metamorphic malware with HMMs

How malware circumvents detection?

Polymorphic malware detection strategies

HMM fundamentals

HMM example

Advanced malware detection with deep learning

NNs in a nutshell

CNNs

From images to malware

Why should we use images for malware detection?

Detecting malware from images with CNNs

Summary

Network Anomaly Detection with AI

Network anomaly detection techniques

Anomaly detection rationales

Intrusion Detection Systems 

Host Intrusion Detection Systems

Network Intrusion Detection Systems

Anomaly-driven IDS

Turning service logs into datasets

Advantages of integrating network data with service logs

How to classify network attacks

Most common network attacks

Anomaly detection strategies

Anomaly detection assumptions and challenges

Detecting botnet topology

What is a botnet?

The botnet kill chain

Different ML algorithms for botnet detection

Gaussian anomaly detection

The Gaussian distribution

Anomaly detection using the Gaussian distribution

Gaussian anomaly detection example

False alarm management in anomaly detection

Receiver operating characteristic analysis

Summary

Section 3: Protecting Sensitive Information and Assets

Securing User Authentication

Authentication abuse prevention

Are passwords obsolete?

Common authentication practices

How to spot fake logins

Fake login management – reactive versus predictive

Predicting the unpredictable

Choosing the right features

Preventing fake account creation

Account reputation scoring

Classifying suspicious user activity

Supervised learning pros and cons

Clustering pros and cons

User authentication with keystroke recognition

Coursera Signature Track

Keystroke dynamics

Anomaly detection with keystroke dynamics

Keystroke detection example code

User detection with multilayer perceptrons

Biometric authentication with facial recognition

Facial recognition pros and cons

Eigenfaces facial recognition

Dimensionality reduction with principal component analysis (PCA)

Principal component analysis

Variance, covariance, and the covariance matrix

Eigenvectors and Eigenvalues

Eigenfaces example

Summary

Fraud Prevention with Cloud AI Solutions

Introducing fraud detection algorithms

Dealing with credit card fraud

Machine learning for fraud detection

Fraud detection and prevention systems

Expert-driven predictive models

Data-driven predictive models

FDPS – the best of both worlds

Learning from unbalanced and non-stationary data

Dealing with unbalanced datasets

Dealing with non-stationary datasets

Predictive analytics for credit card fraud detection

Embracing big data analytics in fraud detection

Ensemble learning

Bagging (bootstrap aggregating)

Boosting algorithms

Stacking

Bagging example

Boosting with AdaBoost

Introducing the gradient

Gradient boosting

eXtreme Gradient Boosting (XGBoost)

Sampling methods for unbalanced datasets

Oversampling with SMOTE

Sampling examples

Getting to know IBM Watson Cloud solutions

Cloud computing advantages

Achieving data scalability

Cloud delivery models

Empowering cognitive computing

Importing sample data and running Jupyter Notebook in the cloud

Credit card fraud detection with IBM Watson Studio

Predicting with RandomForestClassifier

Predicting with GradientBoostingClassifier

Predicting with XGBoost

Evaluating the quality of our predictions

F1 value

ROC curve

AUC (Area Under the ROC curve)

Comparing ensemble classifiers

The RandomForestClassifier report

The GradientBoostingClassifier report

The XGBClassifier report

Improving predictions accuracy with SMOTE

Summary

GANs - Attacks and Defenses

GANs in a nutshell

A glimpse into deep learning

Artificial neurons and activation functions

From artificial neurons to neural networks

Getting to know GANs

Generative versus discriminative networks

The Nash equilibrium

The math behind GANs

How to train a GAN

An example of a GAN – emulating MNIST handwritten digits

GAN Python tools and libraries

Neural network vulnerabilities

Deep neural network attacks

Adversarial attack methodologies

Adversarial attack transferability

Defending against adversarial attacks

CleverHans library of adversarial examples

EvadeML-Zoo library of adversarial examples

Network attack via model substitution

Substitute model training

Generating the synthetic dataset

Fooling malware detectors with MalGAN

IDS evasion via GAN

Introducing IDSGAN

Features of IDSGAN

The IDSGAN training dataset

Generator network

Discriminator network

Understanding IDSGAN's algorithm training

Facial recognition attacks with GAN

Facial recognition vulnerability to adversarial attacks

Adversarial examples against FaceNet

Launching the adversarial attack against FaceNet's CNN

Summary

Section 4: Evaluating and Testing Your AI Arsenal

Evaluating Algorithms

Best practices of feature engineering 

Better algorithms or more data?

The very nature of raw data

Feature engineering to the rescue

Dealing with raw data

Data binarization

Data binning

Logarithmic data transformation

Data normalization

Min–max scaling

Variance scaling 

How to manage categorical variables

Ordinal encoding

One-hot encoding

Dummy encoding

Feature engineering examples with sklearn

Min–max scaler

Standard scaler

Power transformation

Ordinal encoding with sklearn

One-hot encoding with sklearn

Evaluating a detector's performance with ROC

ROC curve and AUC measure

Examples of ROC metrics

ROC curve example

AUC score example

Brier score example

How to split data into training and test sets

Algorithm generalization error

Algorithm learning curves

Using cross validation for algorithms

K-folds cross validation pros and cons

K-folds cross validation example

Summary

Assessing Your AI Arsenal

Evading ML detectors

Understanding RL

RL feedback and state transition

Evading malware detectors with RL

Black-box attacks with RL

Challenging ML anomaly detection

Incident response and threat mitigation

Empowering detection systems with human feedback

Testing for data and model quality

Assessing data quality

Biased datasets

Unbalanced and mislabeled datasets

Missing values in datasets

Missing values example

Assessing model quality

Fine-tuning hyperparameters

Model optimization with cross validation

Ensuring security and reliability

Ensuring performance and scalability

Ensuring resilience and availability

Ensuring confidentiality and privacy

Summary

Other Books You May Enjoy

Leave a review - let other readers know what you think

Preface

Organizations today are spending billions of dollars globally on cybersecurity. Artificial Intelligence (AI) has emerged as a great solution for building smarter and safer security systems that allow you to predict and detect suspicious activities in your network, such as phishing or unauthorized intrusions.

This cybersecurity book presents and demonstrates popular and successful AI approaches and models that you can adopt to detect potential attacks and protect your corporate systems. You'll understand the roles of machine learning (ML), neural networks (NNs), and deep learning in cybersecurity, and learn how you can infuse AI capabilities when building smart defensive mechanisms. As you advance, you'll be able to apply these strategies across a variety of applications, including spam filters, network intrusion detection, botnet detection, and secure authentication.

By the end of this book, you'll be ready to develop intelligent systems that can detect unusual and suspicious patterns and attacks, thereby developing strong network security defenses using AI.

Who this book is for

If you're a cybersecurity professional or ethical hacker who wants to build intelligent systems using the power of ML and AI, you'll find this book useful.

What this book covers

Chapter 1, Introduction to AI for Cybersecurity Professionals, distinguishes between the various branches of AI, focusing on the pros and cons of the different approaches to automated learning in the field of cybersecurity. This chapter also covers the different strategies for training the algorithms and optimizing them. The main concepts of AI will be shown in action using Jupyter Notebooks. The tools used in this chapter are Jupyter Notebooks, NumPy, and scikit-learn, and the datasets used are scikit-learn datasets and CSV samples.

Chapter 2, Setting Up Your AI for Cybersecurity Arsenal, introduces the main software requirements and their configurations. We will learn how to build a knowledge base of malicious code samples to feed into AI algorithms. Jupyter Notebooks will be introduced for the interactive execution of Python tools and commands. The tools used in this chapter are Anaconda and Jupyter Notebooks. No dataset is used here.

Chapter 3, Ham or Spam? Detecting Email Cybersecurity Threats with AI, covers detecting security threats that use email as an attack vector. Different detection strategies, ranging from linear classifiers and Bayesian filters to more sophisticated solutions (such as decision trees, logistic regression, and natural language processing (NLP)), will be illustrated. The examples will make use of Jupyter Notebooks to allow the reader to interact more closely with the different solutions illustrated. The tools used in this chapter are Jupyter Notebooks, scikit-learn, and NLTK. The datasets used in this regard are the Kaggle spam dataset, CSV spam samples, and honeypot phishing samples.

Chapter 4, Malware Threat Detection, explains how the wide diffusion of malware and ransomware, together with the rapid mutation of the same threats into different variants (polymorphic and metamorphic malware), has rendered obsolete the traditional detection solutions based on signatures and the hashing of image files, upon which common antivirus software is based. The examples will show the different malware analysis strategies that use ML algorithms. The tools used in this chapter are Jupyter Notebooks, scikit-learn, and TensorFlow. The datasets/samples used in this regard include theZoo malware samples.

Chapter 5, Network Anomaly Detection with AI, explains how the current level of interconnection between different devices has attained such complexity that it casts serious doubt on the effectiveness of traditional concepts such as perimeter security. In cyberspace, in fact, the attack surface grows exponentially, and it is therefore essential to have automated tools for the detection of network anomalies and for learning about new potential threats. The tools used in this chapter are Jupyter Notebooks, pandas, scikit-learn, and Keras. The datasets used in this regard are Kaggle datasets, KDD Cup 1999, CIDDS, CICIDS2017, and service and IDS log files.

Chapter 6, Securing User Authentication, shows how AI plays an increasingly important role in protecting sensitive user-related information, including the credentials used to access network accounts and applications, in order to prevent abuses such as identity theft.

Chapter 7, Fraud Prevention with Cloud AI Solutions, covers many of the security attacks and data breaches suffered by corporations. Such breaches aim to compromise sensitive information, such as customers' credit card data. These attacks are often conducted in stealth mode, meaning that it is difficult to detect them using traditional methods. The tools used in this chapter are IBM Watson Studio, IBM Cloud Object Storage, Jupyter Notebooks, scikit-learn, and Apache Spark. The dataset used here is the Kaggle Credit Card Fraud Detection dataset.

Chapter 8, GANs – Attacks and Defenses, introduces Generative Adversarial Networks (GANs), which represent the most advanced example of NNs that deep learning makes available to us. In the context of cybersecurity, GANs can be used for legitimate purposes, as in the case of authentication procedures, but they can also be exploited to violate these procedures. The tools used in this chapter are CleverHans, the Adversarial Machine Learning (AML) library, EvadeML-Zoo, TensorFlow, and Keras. The datasets used are example images of faces created entirely by using a GAN.

Chapter 9, Evaluating Algorithms, shows how to evaluate the effectiveness of the various alternative solutions using appropriate analysis metrics. The tools used in this chapter are scikit-learn, NumPy, and Matplotlib. The scikit-learn datasets are used in this regard.

Chapter 10, Assessing Your AI Arsenal, covers the techniques that attackers exploit to evade AI-based tools. Only by testing against these techniques is it possible to obtain a realistic picture of the effectiveness and reliability of the solutions adopted. In addition, the aspects related to the scalability of the solutions must be taken into consideration, and then monitored continuously to guarantee reliability. The tools used in this chapter are scikit-learn, Foolbox, EvadeML, Deep-pwning, TensorFlow, and Keras. The MNIST and scikit-learn datasets are used in this regard.

To get the most out of this book

Familiarity with cybersecurity concepts and knowledge of Python programming are essential in order to get the most out of this book.

Download the example code files

You can download the example code files for this book from your account at www.packt.com. If you purchased this book elsewhere, you can visit www.packt.com/support and register to have the files emailed directly to you.

You can download the code files by following these steps:

Visit www.packt.com.

Select the SUPPORT tab.

Click on Code Downloads & Errata.

Enter the name of the book in the search box and follow the onscreen instructions.

Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:

WinRAR/7-Zip for Windows

Zipeg/iZip/UnRarX for Mac

7-Zip/PeaZip for Linux

The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Hands-On-Artificial-Intelligence-for-Cybersecurity. In case there's an update to the code, it will be updated on the existing GitHub repository.

We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Download the color images

We also provide a PDF file that has color images of the screenshots/diagrams used in this book. You can download it here: http://www.packtpub.com/sites/default/files/downloads/9781789804027_ColorImages.pdf.

Get in touch

Feedback from our readers is always welcome.

General feedback: If you have questions about any aspect of this book, mention the book title in the subject of your message and email us at [email protected].

Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packt.com/submit-errata, select your book, click on the Errata Submission Form link, and enter the details.

Piracy: If you come across any illegal copies of our works in any form on the internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.

If you are interested in becoming an author: If there is a topic that you have expertise in, and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.

Reviews

Please leave a review. Once you have read and used this book, why not leave a review on the site that you purchased it from? Potential readers can then see and use your unbiased opinion to make purchase decisions, we at Packt can understand what you think about our products, and our authors can see your feedback on their book. Thank you!

For more information about Packt, please visit packt.com.

Section 1: AI Core Concepts and Tools of the Trade

In this section, the fundamental concepts of AI will be introduced, including an analysis of the different types of algorithms and the strategies best suited to their use in cybersecurity.

This section contains the following chapters:

Chapter 1, Introduction to AI for Cybersecurity Professionals

Chapter 2, Setting Up Your AI for Cybersecurity Arsenal

Introduction to AI for Cybersecurity Professionals

In this chapter, we'll distinguish between the various branches of Artificial Intelligence (AI), focusing on the pros and cons of the different approaches of automated learning in the field of cybersecurity.

We will introduce different strategies for training and optimizing the various algorithms, and we'll also look at the main concepts of AI in action using Jupyter Notebooks and the scikit-learn Python library.

This chapter will cover the following topics:

Applying AI in cybersecurity

The evolution from expert systems to data mining and AI

The different forms of automated learning

The characteristics of algorithm training and optimization

Beginning with AI via Jupyter Notebooks

Introducing AI in the context of cybersecurity

Applying AI in cybersecurity

The application of AI to cybersecurity is an experimental research area that's not without problems, which we will try to explain in this chapter. However, it is undeniable that the results achieved so far are promising, and that in the near future these methods of analysis will become common practice, with clear and positive consequences for the cybersecurity profession, both in terms of new job opportunities and new challenges.

When it comes to applying AI to cybersecurity, the reactions from insiders are often ambivalent. Skepticism alternates with conservative attitudes, partly caused by the fear that machines will supplant human operators, despite the high technical and professional skills that those operators have acquired through years of hard work.

However, in the near future, companies and organizations will increasingly need to invest in automated analysis tools that enable a rapid and adequate response to current and future cybersecurity challenges. Therefore, the scenario that is looming is actually a combination of skills, rather than a clash between human operators and machines. It is therefore likely that AI within the field of cybersecurity will take charge of the dirty work, that is, the selection of potentially suspect cases, leaving the most advanced tasks to security analysts and letting them investigate in more depth the threats that deserve the most attention.

Evolution in AI: from expert systems to data mining

To understand the advantages associated with the adoption of AI in the field of cybersecurity, it is necessary to introduce the logic underlying the different methodological approaches that characterize AI.

We will start with a brief historical analysis of the evolution of AI in order to fully evaluate the potential benefits of applying it in the field of cybersecurity.

A brief introduction to expert systems

One of the first attempts at automated learning consisted of defining a rule-based decision system for a given application domain, covering all the possible ramifications and concrete cases that could be found in the real world. In this way, all the possible options were hardcoded within the automated learning solutions and verified by experts in the field.

The fundamental limitation of such expert systems was that they reduced every decision to a Boolean value (a binary choice), thus limiting the ability to adapt the solutions to the different nuances of real-world use cases.

In fact, expert systems do not learn anything new compared to hardcoded solutions; they limit themselves to looking for the right answer within a (potentially very large) knowledge base that cannot adapt to new problems that were not addressed previously.
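To make this limitation concrete, here is a minimal sketch of a rule-based check, written in Python (the language used throughout this book). The rules, the function name expert_system_is_spam, and the two sample messages are hypothetical and chosen purely for illustration; they are not taken from any real product.

```python
# Hypothetical hand-written rules standing in for an expert's knowledge base.
RULES = [
    ("suspicious_phrase", lambda msg: "wire transfer" in msg.lower()),
    ("raw_ip_link", lambda msg: "http://192." in msg),
]


def expert_system_is_spam(message: str) -> bool:
    """Return True if any hardcoded rule fires; the verdict is strictly Boolean."""
    return any(rule(message) for _name, rule in RULES)


known_case = "Urgent: confirm the wire transfer today"
# A slight variation that no rule anticipates.
novel_case = "Urgent: confirm the w1re tr4nsfer today"

print(expert_system_is_spam(known_case))
print(expert_system_is_spam(novel_case))
```

The second message differs from the first only by two obfuscated words, yet it passes unnoticed because nothing in the knowledge base covers it. The only remedy is for a human to add another rule by hand.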

Reflecting the indeterministic nature of reality

The concrete cases that we come across in the real world cannot simply be represented using just true/false classification models: although experts in the sector strive to list all possible cases, there is always something in reality that escapes classification. It is therefore necessary to make the best use of the data at our disposal in order to let latent tendencies and anomalous cases (such as outliers) emerge, making use of statistical and probabilistic models that can more appropriately reflect the indeterministic nature of reality.

Going beyond statistics toward machine learning

Although the introduction of statistical models broke through the limitations of expert systems, the underlying rigidity of the approach remained, because statistical models, like rule-based decisions, were in fact established in advance and could not be modified to adapt to new data. For example, one of the most commonly used statistical models is the Gaussian distribution. The statistician could decide that the data comes from a Gaussian distribution, and try to estimate the parameters that characterize the hypothetical distribution that best describes the data being analyzed, without taking alternative models into consideration.
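As a minimal sketch of this predefined-model approach, the following snippet assumes that a small sample is Gaussian and estimates the two parameters of the distribution (mean and standard deviation) by maximum likelihood. The sample values are hypothetical.

```python
import numpy as np

# Hypothetical sample, for example response times in milliseconds.
samples = np.array([102.0, 98.0, 101.0, 97.0, 103.0, 99.0, 100.0, 100.0])

# Maximum-likelihood estimates under the Gaussian assumption.
mu = samples.mean()
sigma = samples.std()  # population standard deviation (ddof=0)


def gaussian_pdf(x, mu, sigma):
    """Density of a Gaussian with the given mean and standard deviation."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))


print(mu, sigma)
print(gaussian_pdf(100.0, mu, sigma), gaussian_pdf(105.0, mu, sigma))
```

The model family is fixed before any data is seen: only the two parameters are fitted, and nothing in the procedure checks whether a different distribution would describe the data better.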

To overcome these limits, it was therefore necessary to adopt an iterative approach, which allowed the introduction of machine learning (ML) algorithms capable of generalizing descriptive models starting from the available data. These algorithms autonomously generate their own features, without limiting themselves to predefined target functions, and adapt them as the training process evolves.

Mining data for models

The difference in approach compared to the predefined static models is also reflected in the research field known as data mining.

The data mining process can be defined as the discovery of adequate representative models, starting with the data. In this case too, instead of adopting pre-established statistical models, we can use ML algorithms based on the training data to identify the most suitable predictive model (this is especially true when we are not able to understand the nature of the data at our disposal).

However, the algorithmic approach is not always adequate. When the nature of the data is clear and conforms to known models, there is no advantage in using ML algorithms instead of pre-defined models. The next step, which absorbs and extends the advantages of the previous approaches, adding the ability to manage cases not covered in the training data, leads us to AI.

AI is a wider field of research than ML and can manage data of a more generic and abstract nature, thus enabling the transfer of common solutions to different types of data without the need for complete retraining. In this way, it is possible, for example, to recognize objects in color images, starting with models originally obtained from black and white samples.

Therefore, AI is considered a broad field of research that includes ML. In turn, ML includes deep learning (DL), which is an ML method based on artificial neural networks, as shown in the following diagram:

Types of machine learning

The process of automated learning from data can take different forms, with different characteristics and predictive abilities.

In the case of ML (which, as we have seen, is a branch of research belonging to AI), it is common to distinguish between the following types of ML:

Supervised learning

Unsupervised learning

Reinforcement learning

The differences between these learning modalities are attributable to the type of result (output) that we intend to achieve, based on the nature of the input required to produce it.

Supervised learning

In the case of supervised learning, algorithm training is conducted using an input dataset for which the type of output that we have to obtain is already known.

In practice, the algorithms must be trained to identify the relationships between the variables being analyzed, trying to optimize the learning parameters on the basis of the target variables (also called labels) that, as mentioned, are already known.

Classification algorithms are an example of supervised learning algorithms; they are widely used in the field of cybersecurity for spam classification.

A spam filter is in fact trained by submitting an input dataset to the algorithm containing many examples of emails that have already been previously classified as spam (the emails were malicious or unwanted) or ham (the emails were genuine and harmless).

The classification algorithm of the spam filter must therefore learn to classify the new emails it will receive in the future, assigning them to the spam or ham class based on the training previously performed on the input dataset of already classified emails.

Regression algorithms are another example of supervised algorithms. The main supervised algorithms are the following:
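The training procedure just described can be sketched with scikit-learn. The emails and labels below are hypothetical toy data, and the combination of a bag-of-words vectorizer with a multinomial naive Bayes classifier is only one possible choice, assumed here for brevity.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical, hand-written training emails with their known labels.
training_emails = [
    "win free prize now",
    "free money click now",
    "claim your free prize",
    "meeting agenda for monday",
    "please review the attached report",
    "lunch on friday with the team",
]
training_labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

# Word counts are used as features for a multinomial naive Bayes classifier.
spam_filter = make_pipeline(CountVectorizer(), MultinomialNB())
spam_filter.fit(training_emails, training_labels)

# Emails the filter has never seen before.
new_emails = ["free prize money", "monday team meeting"]
predictions = spam_filter.predict(new_emails)
print(list(predictions))
```

A realistic filter would need thousands of labeled messages; six examples are enough only to show the fit and predict steps.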

Regression (linear and logistic)

k-Nearest Neighbors (k-NNs)

Support vector machines (SVMs)

Decision trees and random forests

Neural networks (NNs)

Unsupervised learning

In the case of unsupervised learning, the algorithms must try to group the data independently, without the aid of a previous classification provided by the analyst. In the context of cybersecurity, unsupervised learning algorithms are important for identifying new (not previously detected) forms of malware attacks, fraud, and email spamming campaigns.

Here are some examples of unsupervised algorithms:

Dimensionality reduction:

Principal component analysis (PCA)

Kernel PCA

Clustering:

k-means

Hierarchical cluster analysis (HCA)
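As a minimal sketch of unsupervised learning, the following snippet applies k-means from scikit-learn to a handful of unlabeled points. The coordinates are hypothetical, and the number of clusters (two) is an assumption made by the analyst, not something the algorithm discovers.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical unlabeled observations with two numeric features each.
points = np.array([
    [1.0, 1.1], [0.9, 1.0], [1.2, 0.8],   # one dense group
    [8.0, 8.1], [7.9, 8.3], [8.2, 7.8],   # another dense group
])

# No labels are passed: the algorithm groups the points by proximity.
model = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = model.fit_predict(points)
print(labels)
```

The cluster numbers themselves are arbitrary; what matters is which points end up grouped together.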

Reinforcement learning

In the case of reinforcement learning (RL), a different learning strategy is followed, which emulates the trial and error approach. The algorithm draws information from the feedback obtained during the learning path, with the aim of maximizing the reward finally obtained based on the number of correct decisions that it has selected.

In practice, the learning process takes place in an unsupervised manner, with the particularity that a positive reward is assigned to each correct decision (and a negative reward for incorrect decisions) taken at each step of the learning path. At the end of the learning process, the decisions of the algorithm are reassessed based on the final reward achieved.
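The reward mechanism can be sketched with the tabular Q-learning update on a toy problem. Everything in the snippet is an assumption made for illustration: a corridor of five positions, a reward of +1 for reaching the last position and -0.1 for every other move, and a systematic sweep over all state and action pairs in place of the random exploration that is normally used.

```python
import numpy as np

N_STATES = 5          # positions 0..4; position 4 is the goal (terminal)
ACTIONS = (-1, +1)    # index 0 moves left, index 1 moves right
ALPHA, GAMMA = 0.5, 0.9   # learning rate and discount factor


def step(state, action_index):
    """Apply an action and return (next_state, reward)."""
    next_state = min(max(state + ACTIONS[action_index], 0), N_STATES - 1)
    reward = 1.0 if next_state == N_STATES - 1 else -0.1
    return next_state, reward


q_table = np.zeros((N_STATES, len(ACTIONS)))

for sweep in range(200):
    for state in range(N_STATES - 1):              # the terminal state is never updated
        for action_index in range(len(ACTIONS)):
            next_state, reward = step(state, action_index)
            # Q-learning target: immediate reward plus discounted best future value.
            target = reward + GAMMA * q_table[next_state].max()
            q_table[state, action_index] += ALPHA * (target - q_table[state, action_index])

# Greedy policy for the non-terminal states (1 means "move right").
policy = q_table[:-1].argmax(axis=1)
print(policy)
```

After training, the greedy policy moves right from every position, because that choice accumulates the highest reward.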

Given its dynamic nature, it is no coincidence that RL is more similar to the general approach adopted by AI than to the common algorithms developed in ML.

The following are some examples of RL algorithms:

Markov process

Q-learning

Temporal difference (TD) methods

Monte Carlo methods

In particular, Hidden Markov Models (HMMs), which make use of the Markov process, are extremely important in the detection of polymorphic malware threats.

Algorithm training and optimization

When preparing automated learning procedures, we will often face a series of challenges. We need to recognize and overcome these challenges in order to avoid compromising the reliability of the procedures themselves, and to avoid drawing erroneous or hasty conclusions that, in the context of cybersecurity, can have devastating consequences.

One of the main problems that we often face, especially in the case of the configuration of threat detection procedures, is the management of false positives; that is, cases detected by the algorithm and classified as potential threats, which in reality are not. We will discuss false positives and ML evaluation metrics in more depth in Chapter 7, Fraud Prevention with Cloud AI Solutions, and Chapter 9, Evaluating Algorithms.

The management of false positives is particularly burdensome in the case of detection systems aimed at countering network threats, given that the number of events detected is often so high that it absorbs and saturates all the human resources dedicated to threat detection activities.

On the other hand, even correct (true positive) reports, if excessive in number, contribute to overloading the analysts, distracting them from priority tasks. The need to optimize the learning procedures therefore emerges in order to reduce the number of cases that need to be analyzed in depth by the analysts.
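To make the notion of false positives concrete, the following sketch counts the four possible outcomes of a detector on ten events. The labels and predictions are hypothetical (1 marks a threat, 0 a harmless event).

```python
def confusion_counts(actual, predicted):
    """Count true/false positives and negatives for binary labels (1 = threat)."""
    pairs = list(zip(actual, predicted))
    return {
        "tp": sum(1 for a, p in pairs if a == 1 and p == 1),
        "fp": sum(1 for a, p in pairs if a == 0 and p == 1),
        "fn": sum(1 for a, p in pairs if a == 1 and p == 0),
        "tn": sum(1 for a, p in pairs if a == 0 and p == 0),
    }


# Hypothetical ground truth and detector output for ten events.
actual = [1, 0, 0, 0, 1, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]

counts = confusion_counts(actual, predicted)
# Share of raised alerts that were real threats.
precision = counts["tp"] / (counts["tp"] + counts["fp"])
# Share of harmless events that were wrongly flagged.
false_positive_rate = counts["fp"] / (counts["fp"] + counts["tn"])

print(counts, precision, false_positive_rate)
```

In this toy case, three of the four alerts raised are false positives, so an analyst would spend most of the time on events that are not threats.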

This optimization activity often starts with the selection and cleaning of the data submitted to the algorithms.

How to find useful sources of data

In the case of anomaly detection, for example, particular attention must be paid to the data being analyzed. An effective anomaly detection activity presupposes that the training data does not contain the anomalies sought but, on the contrary, reflects the normal situation of reference.

If, on the other hand, the training data were contaminated with the anomalies being investigated, the anomaly detection activity would lose much of its reliability and utility, in accordance with the principle commonly known as GIGO, which stands for garbage in, garbage out.
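The effect of contaminated training data can be illustrated with a minimal sketch. It assumes a very simple detector that flags any value lying more than three standard deviations from the training mean; the traffic figures and the threshold are hypothetical.

```python
import numpy as np


def fit_baseline(values):
    """Summarize the training data by its mean and standard deviation."""
    values = np.asarray(values, dtype=float)
    return values.mean(), values.std()


def is_anomalous(value, baseline, threshold=3.0):
    """Flag values lying more than `threshold` standard deviations from the mean."""
    mean, std = baseline
    return abs(value - mean) > threshold * std


# Hypothetical requests per minute observed during normal operation.
clean_traffic = [100, 102, 98, 101, 99, 100, 103, 97]
# The same data with two attack bursts left in the training set.
contaminated_traffic = clean_traffic + [400, 420]

clean_baseline = fit_baseline(clean_traffic)
contaminated_baseline = fit_baseline(contaminated_traffic)

print(is_anomalous(400, clean_baseline))         # flagged
print(is_anomalous(400, contaminated_baseline))  # missed
```

The bursts left in the training set inflate both the mean and the standard deviation, so the same burst is no longer far enough from the baseline to be reported.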

Given the increasing availability of raw data in real time, the preliminary cleaning of data is often considered a challenge in itself. In fact, it's often necessary to conduct a preliminary skim of the data, eliminating irrelevant or redundant information. We can then present the data to the algorithms in a correct form, which can improve their ability to learn, adapting the form of the data to the type of algorithm used.

For example, a classification algorithm will be able to identify a more representative and more effective model when the input data is presented in a grouped form, or is linearly separable. In the same way, the presence of variables (also known as dimensions) containing empty fields weighs down the computational effort of the algorithm and produces less reliable predictive models due to the phenomenon known as the curse of dimensionality.

This occurs when the number of features, that is, dimensions, increases without adding relevant information, simply resulting in the data being dispersed across a larger search space.
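The dispersion effect can be observed with a small numerical experiment. The sketch below is only illustrative: it draws the same number of uniformly random points in spaces of increasing dimension and measures how far each point is, on average, from its nearest neighbor. The point count, the dimensions, and the seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(seed=0)


def mean_nearest_neighbor_distance(n_points, n_dims):
    """Average distance from each random point to its closest neighbor."""
    points = rng.random((n_points, n_dims))
    # Pairwise Euclidean distances between all points.
    diffs = points[:, None, :] - points[None, :, :]
    distances = np.sqrt((diffs ** 2).sum(axis=-1))
    np.fill_diagonal(distances, np.inf)  # ignore the distance of a point to itself
    return distances.min(axis=1).mean()


# Same number of samples, increasingly many dimensions.
results = {n_dims: mean_nearest_neighbor_distance(200, n_dims) for n_dims in (2, 10, 50)}
for n_dims, distance in results.items():
    print(n_dims, round(distance, 3))
```

With a fixed number of samples, the nearest neighbor moves further away as dimensions are added, which is why algorithms that rely on proximity need far more data in high-dimensional spaces.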

Also, the sources from which we draw our test cases (samples) are important. Think, for example, of a case in which we have to predict the malicious behavior of an unknown executable. The problem in question is reduced to the definition of a classification model for the executable, which must be assigned to one of two categories: genuine or malicious.

To achieve such a result, we need to train our classification algorithm by providing it with a number of examples of executables that are considered malicious as an input dataset.

Quantity versus quality

When it all boils down to quantity versus quality, we are immediately faced with the following two problems:

What types of malware can we consider most representative of the most probable risks and threats to our company?

How many example cases (samples) should we collect and submit to the algorithms in order to obtain a reliable result, in terms of both effectiveness and efficiency in predicting future threats?

The answers to the two questions are closely related to the knowledge that the analyst has of the specific organizational realm in which they must operate.

All this could lead the analyst to believe that creating a honeypot to gather malicious samples in the wild, which are then fed to the algorithms as training samples, would be more representative of the level of risk to which the organization is exposed than using datasets of generic threats. At the same time, the number of examples to be submitted to the algorithm is determined by the characteristics of the data itself. The data can, in fact, present a prevalence of cases of a certain type (skewness), to the detriment of other types. This distorts the predictions of the algorithm toward the most numerous classes, when in reality the most relevant information for our investigation is represented by a class with a smaller number of cases.

In conclusion, it will not be a matter of simply choosing the best algorithm for our goals (which often does not exist), but mainly of selecting the most representative cases (samples) to be submitted to a set of algorithms, which we will try to optimize based on the results obtained.
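The distortion caused by skewed classes can be shown with a minimal sketch. The class proportions are hypothetical, and the "model" is deliberately trivial: it always predicts the majority class.

```python
from collections import Counter

# Hypothetical skewed dataset: 95 benign samples and only 5 malicious ones.
labels = ["benign"] * 95 + ["malicious"] * 5
print(Counter(labels))

# A trivial model that always predicts the majority class.
predictions = ["benign"] * len(labels)

# Fraction of all samples classified correctly.
accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)

# Fraction of malicious samples that were actually detected.
malicious_total = labels.count("malicious")
malicious_found = sum(p == y == "malicious" for p, y in zip(predictions, labels))
recall = malicious_found / malicious_total

print(accuracy, recall)
```

The accuracy looks excellent, yet not a single malicious sample is detected, which is exactly the class that matters for the investigation.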

Getting to know Python's libraries

In the following sections, we will explore the concepts presented so far through sample code that makes use of a series of Python libraries that are among the most well known and widespread in the field of ML:

NumPy (version 1.13.3)

pandas (version 0.20.3)

Matplotlib (version 2.0.2)

scikit-learn (version 0.20.0)

Seaborn (version 0.8.0)

The sample code will be shown here in the form of snippets, along with screenshots representing their output. Do not worry if not all of the implementation details are clear to you at first glance; we will have the opportunity to understand the implementation aspects of every single algorithm throughout the book.
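Before running the snippets, it can be useful to check which of these libraries are installed and in which version. The following sketch relies only on the standard library (`importlib.metadata`, available from Python 3.8); the distribution names passed to it are assumed to match the package names used by pip.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names of the libraries listed above.
LIBRARIES = ["numpy", "pandas", "matplotlib", "scikit-learn", "seaborn"]


def installed_versions(names):
    """Map each library name to its installed version, or None if it is missing."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


for name, installed in installed_versions(LIBRARIES).items():
    print(name, installed or "not installed")
```

Versions newer than those listed will generally differ in some details of their APIs, so small adjustments to the sample code may be needed.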

AI in the context of cybersecurity

With the exponential increase in the spread of threats associated with the daily diffusion of new malware, it is practically impossible to think of dealing effectively with these threats using only analysis conducted by human operators. It is necessary to introduce algorithms that allow us to automate that introductory phase of analysis known as triage, that is to say, to conduct a preliminary screening of the threats to be submitted to the attention of the cybersecurity professionals, allowing us to respond in a timely and effective manner to ongoing attacks.

We need to be able to respond in a dynamic fashion, adapting to the changes in the context related to the presence of unprecedented