Build smart cybersecurity systems with the power of machine learning and deep learning to protect your corporate assets
Book Description
Today's organizations spend billions of dollars globally on cybersecurity. Artificial intelligence has emerged as a great solution for building smarter and safer security systems that allow you to predict and detect suspicious network activity, such as phishing or unauthorized intrusions.
This cybersecurity book presents and demonstrates popular and successful AI approaches and models that you can adapt to detect potential attacks and protect your corporate systems. You'll learn about the role of machine learning and neural networks, as well as deep learning in cybersecurity, and you'll also learn how you can infuse AI capabilities into building smart defensive mechanisms. As you advance, you'll be able to apply these strategies across a variety of applications, including spam filters, network intrusion detection, botnet detection, and secure authentication.
By the end of this book, you'll be ready to develop intelligent systems that can detect unusual and suspicious patterns and attacks, thereby developing strong network security defenses using AI.
Who this book is for
If you're a cybersecurity professional or ethical hacker who wants to build intelligent systems using the power of machine learning and AI, you'll find this book useful. Familiarity with cybersecurity concepts and knowledge of Python programming are essential to get the most out of this book.
Page count: 396
Copyright © 2019 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
Commissioning Editor: Pravin Dhandre
Acquisition Editor: Yogesh Deokar
Content Development Editor: Manorama Haridas
Technical Editor: Vibhuti Gawde
Copy Editor: Safis Editing
Project Coordinator: Kirti Pisat
Proofreader: Safis Editing
Indexer: Rekha Nair
Production Designer: Deepika Naik
First published: August 2019
Production reference: 1010819
Published by Packt Publishing Ltd. Livery Place 35 Livery Street Birmingham B3 2PB, UK.
ISBN 978-1-78980-402-7
www.packtpub.com
Packt.com
Subscribe to our online digital library for full access to over 7,000 books and videos, as well as industry leading tools to help you plan your personal development and advance your career. For more information, please visit our website.
Spend less time learning and more time coding with practical eBooks and Videos from over 4,000 industry professionals
Improve your learning with Skill Plans built especially for you
Get a free eBook or video every month
Fully searchable for easy access to vital information
Copy and paste, print, and bookmark content
Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.packt.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.
At www.packt.com, you can also read a collection of free technical articles, sign up for a range of free newsletters, and receive exclusive discounts and offers on Packt books and eBooks.
Alessandro Parisi has been an IT professional for over 20 years, acquiring significant experience as a security data scientist, and as an AI cybersecurity and blockchain specialist. He has experience of operating within organizational and decisional contexts characterized by high complexity. Over the years, he has helped companies to adopt AI and blockchain DLT technologies as strategic tools in protecting sensitive corporate assets. He holds an MSc in economics and statistics.
Chiheb Chebbi is a Tunisian InfoSec enthusiast, author, and technical reviewer with experience in various aspects of information security, focusing on the investigation of advanced cyber attacks and research into cyber espionage. His core interests are penetration testing, machine learning, and threat hunting. He has been included in many halls of fame. The proposals outlined in his talks have been accepted by many world-class information security conferences.
Dr. Madiha Jafri received a BSc in computer engineering, an MSc in electrical engineering and a PhD in electrical and computer engineering from Old Dominion University in 2003, 2004, and 2007, respectively. Funded by NASA, she developed the use of artificial intelligence to predict electromagnetic interference patterns on commercial aircraft. Dr. Jafri has been with Lockheed Martin since June 2007 as a cybersecurity manager and cryptography expert, designing complex cryptographic solutions. She is now a senior scientist with a focus in artificial intelligence and cybersecurity.
If you're interested in becoming an author for Packt, please visit authors.packtpub.com and apply today. We have worked with thousands of developers and tech professionals, just like you, to help them share their insight with the global tech community. You can make a general application, apply for a specific hot topic that we are recruiting an author for, or submit your own idea.
Title Page
Copyright and Credits
Hands–On Artificial Intelligence for Cybersecurity
About Packt
Why subscribe?
Contributors
About the author
About the reviewers
Packt is searching for authors like you
Preface
Who this book is for
What this book covers
To get the most out of this book
Download the example code files
Download the color images
Conventions used
Get in touch
Reviews
Section 1: AI Core Concepts and Tools of the Trade
Introduction to AI for Cybersecurity Professionals
Applying AI in cybersecurity
Evolution in AI: from expert systems to data mining
A brief introduction to expert systems
Reflecting the indeterministic nature of reality
Going beyond statistics toward machine learning
Mining data for models
Types of machine learning
Supervised learning
Unsupervised learning
Reinforcement learning
Algorithm training and optimization
How to find useful sources of data
Quantity versus quality
Getting to know Python's libraries
Supervised learning example – linear regression
Unsupervised learning example – clustering
Simple NN example – perceptron
AI in the context of cybersecurity
Summary
Setting Up Your AI for Cybersecurity Arsenal
Getting to know Python for AI and cybersecurity
Python libraries for AI
NumPy as an AI building block
NumPy multidimensional arrays
Matrix operations with NumPy
Implementing a simple predictor with NumPy
Scikit-learn
Matplotlib and Seaborn
Pandas
Python libraries for cybersecurity
Pefile
Volatility
Installing Python libraries
Enter Anaconda – the data scientist's environment of choice
Anaconda Python advantages
Conda utility
Installing packages in Anaconda
Creating custom environments
Some useful Conda commands
Python on steroids with parallel GPU
Playing with Jupyter Notebooks
Our first Jupyter Notebook
Exploring the Jupyter interface
What's in a cell?
Useful keyboard shortcuts
Choose your notebook kernel
Getting your hands dirty
Installing DL libraries
Deep learning pros and cons for cybersecurity
TensorFlow
Keras
PyTorch
PyTorch versus TensorFlow
Summary
Section 2: Detecting Cybersecurity Threats with AI
Ham or Spam? Detecting Email Cybersecurity Threats with AI
Detecting spam with Perceptrons
Meet NNs at their purest – the Perceptron
It's all about finding the right weight!
Spam filters in a nutshell
Spam filters in action
Detecting spam with linear classifiers
How the Perceptron learns
A simple Perceptron-based spam filter
Pros and cons of Perceptrons
Spam detection with SVMs
SVM optimization strategy
SVM spam filter example
Image spam detection with SVMs
How did SVM come into existence?
Phishing detection with logistic regression and decision trees
Regression models
Introducing linear regression models
Linear regression with scikit-learn
Linear regression – pros and cons
Logistic regression
A phishing detector with logistic regression
Logistic regression pros and cons
Making decisions with trees
Decision trees rationales
Phishing detection with decision trees
Decision trees – pros and cons
Spam detection with Naive Bayes
Advantages of Naive Bayes for spam detection
Why Naive Bayes?
NLP to the rescue
NLP steps
A Bayesian spam detector with NLTK
Summary
Malware Threat Detection
Malware analysis at a glance
Artificial intelligence for malware detection
Malware goes by many names
Malware analysis tools of the trade
Malware detection strategies
Static malware analysis
Static analysis methodology
Difficulties of static malware analysis
How to perform static analysis
Hardware requirements for static analysis
Dynamic malware analysis
Anti-analysis tricks
Getting malware samples
Hacking the PE file format
The PE file format as a potential vector of infection
Overview of the PE file format
The DOS header and DOS stub
The PE header structure
The data directory
Import and export tables
Extracting malware artifacts in a dataset
Telling different malware families apart
Understanding clustering algorithms
From distances to clusters
Clustering algorithms
Evaluating clustering with the Silhouette coefficient
K-Means in depth
K-Means steps
K-Means pros and cons
Clustering malware with K-Means
Decision tree malware detectors
Decision trees classification strategy
Detecting malwares with decision trees
Decision trees on steroids – random forests
Random Forest Malware Classifier
Detecting metamorphic malware with HMMs
How malware circumvents detection?
Polymorphic malware detection strategies
HMM fundamentals
HMM example
Advanced malware detection with deep learning
NNs in a nutshell
CNNs
From images to malware
Why should we use images for malware detection?
Detecting malware from images with CNNs
Summary
Network Anomaly Detection with AI
Network anomaly detection techniques
Anomaly detection rationales
Intrusion Detection Systems 
Host Intrusion Detection Systems
Network Intrusion Detection Systems
Anomaly-driven IDS
Turning service logs into datasets
Advantages of integrating network data with service logs
How to classify network attacks
Most common network attacks
Anomaly detection strategies
Anomaly detection assumptions and challenges
Detecting botnet topology
What is a botnet?
The botnet kill chain
Different ML algorithms for botnet detection
Gaussian anomaly detection
The Gaussian distribution
Anomaly detection using the Gaussian distribution
Gaussian anomaly detection example
False alarm management in anomaly detection
Receiver operating characteristic analysis
Summary
Section 3: Protecting Sensitive Information and Assets
Securing User Authentication
Authentication abuse prevention
Are passwords obsolete?
Common authentication practices
How to spot fake logins
Fake login management – reactive versus predictive
Predicting the unpredictable
Choosing the right features
Preventing fake account creation
Account reputation scoring
Classifying suspicious user activity
Supervised learning pros and cons
Clustering pros and cons
User authentication with keystroke recognition
Coursera Signature Track
Keystroke dynamics
Anomaly detection with keystroke dynamics
Keystroke detection example code
User detection with multilayer perceptrons
Biometric authentication with facial recognition
Facial recognition pros and cons
Eigenfaces facial recognition
Dimensionality reduction with principal component analysis (PCA)
Principal component analysis
Variance, covariance, and the covariance matrix
Eigenvectors and Eigenvalues
Eigenfaces example
Summary
Fraud Prevention with Cloud AI Solutions
Introducing fraud detection algorithms
Dealing with credit card fraud
Machine learning for fraud detection
Fraud detection and prevention systems
Expert-driven predictive models
Data-driven predictive models
FDPS – the best of both worlds
Learning from unbalanced and non-stationary data
Dealing with unbalanced datasets
Dealing with non-stationary datasets
Predictive analytics for credit card fraud detection
Embracing big data analytics in fraud detection
Ensemble learning
Bagging (bootstrap aggregating)
Boosting algorithms
Stacking
Bagging example
Boosting with AdaBoost
Introducing the gradient
Gradient boosting
eXtreme Gradient Boosting (XGBoost)
Sampling methods for unbalanced datasets
Oversampling with SMOTE
Sampling examples
Getting to know IBM Watson Cloud solutions
Cloud computing advantages
Achieving data scalability
Cloud delivery models
Empowering cognitive computing
Importing sample data and running Jupyter Notebook in the cloud
Credit card fraud detection with IBM Watson Studio
Predicting with RandomForestClassifier
Predicting with GradientBoostingClassifier
Predicting with XGBoost
Evaluating the quality of our predictions
F1 value
ROC curve
AUC (Area Under the ROC curve)
Comparing ensemble classifiers
The RandomForestClassifier report
The GradientBoostingClassifier report
The XGBClassifier report
Improving predictions accuracy with SMOTE
Summary
GANs - Attacks and Defenses
GANs in a nutshell
A glimpse into deep learning
Artificial neurons and activation functions
From artificial neurons to neural networks
Getting to know GANs
Generative versus discriminative networks
The Nash equilibrium
The math behind GANs
How to train a GAN
An example of a GAN–emulating MNIST handwritten digits
GAN Python tools and libraries
Neural network vulnerabilities
Deep neural network attacks
Adversarial attack methodologies
Adversarial attack transferability
Defending against adversarial attacks
CleverHans library of adversarial examples
EvadeML-Zoo library of adversarial examples
Network attack via model substitution
Substitute model training
Generating the synthetic dataset
Fooling malware detectors with MalGAN
IDS evasion via GAN
Introducing IDSGAN
Features of IDSGAN
The IDSGAN training dataset
Generator network
Discriminator network
Understanding IDSGAN's algorithm training
Facial recognition attacks with GAN
Facial recognition vulnerability to adversarial attacks
Adversarial examples against FaceNet
Launching the adversarial attack against FaceNet's CNN
Summary
Section 4: Evaluating and Testing Your AI Arsenal
Evaluating Algorithms
Best practices of feature engineering 
Better algorithms or more data?
The very nature of raw data
Feature engineering to the rescue
Dealing with raw data
Data binarization
Data binning
Logarithmic data transformation
Data normalization
Min–max scaling
Variance scaling 
How to manage categorical variables
Ordinal encoding
One-hot encoding
Dummy encoding
Feature engineering examples with sklearn
Min–max scaler
Standard scaler
Power transformation
Ordinal encoding with sklearn
One-hot encoding with sklearn
Evaluating a detector's performance with ROC
ROC curve and AUC measure
Examples of ROC metrics
ROC curve example
AUC score example
Brier score example
How to split data into training and test sets
Algorithm generalization error
Algorithm learning curves
Using cross validation for algorithms
K-folds cross validation pros and cons
K-folds cross validation example
Summary
Assessing your AI Arsenal
Evading ML detectors
Understanding RL
RL feedback and state transition
Evading malware detectors with RL
Black-box attacks with RL
Challenging ML anomaly detection
Incident response and threat mitigation
Empowering detection systems with human feedback
Testing for data and model quality
Assessing data quality
Biased datasets
Unbalanced and mislabeled datasets
Missing values in datasets
Missing values example
Assessing model quality
Fine-tuning hyperparameters
Model optimization with cross validation
Ensuring security and reliability
Ensuring performance and scalability
Ensuring resilience and availability
Ensuring confidentiality and privacy
Summary
Other Books You May Enjoy
Leave a review - let other readers know what you think
Organizations today are spending billions of dollars globally on cybersecurity. Artificial Intelligence (AI) has emerged as a great solution for building smarter and safer security systems that allow you to predict and detect suspicious activities in your network, such as phishing or unauthorized intrusions.
This cybersecurity book presents and demonstrates popular and successful AI approaches and models that you can adopt to detect potential attacks and protect your corporate systems. You'll understand the roles of machine learning (ML), neural networks (NNs), and deep learning in cybersecurity, and learn how you can infuse AI capabilities when building smart defensive mechanisms. As you advance, you'll be able to apply these strategies across a variety of applications, including spam filters, network intrusion detection, botnet detection, and secure authentication.
By the end of this book, you'll be ready to develop intelligent systems that can detect unusual and suspicious patterns and attacks, thereby developing strong network security defenses using AI.
If you're a cybersecurity professional or ethical hacker who wants to build intelligent systems using the power of ML and AI, you'll find this book useful.
Chapter 1, Introduction to AI for Cybersecurity Professionals, distinguishes between the various branches of AI, focusing on the pros and cons of the various approaches to automated learning in the field of cybersecurity. This chapter also covers the different strategies for training the algorithms and optimizing them. The main concepts of AI will be shown in action using Jupyter Notebooks. The tools used in this chapter are Jupyter Notebooks, NumPy, and scikit-learn, and the datasets used are scikit-learn datasets and CSV samples.
Chapter 2, Setting Up Your AI for Cybersecurity Arsenal, introduces the main software requirements and their configurations. We will learn to populate a knowledge base with samples of malicious code to feed into AI algorithms. Jupyter Notebooks will be introduced for the interactive execution of Python tools and commands. The tools used in this chapter are Anaconda and Jupyter Notebooks. No dataset is used here.
Chapter 3, Ham or Spam? Detecting Email Cybersecurity Threats with AI, covers the detection of security threats that use email as an attack vector. Different detection strategies, ranging from linear classifiers and Bayesian filters to more sophisticated solutions (such as decision trees, logistic regression, and natural language processing (NLP)), will be illustrated. The examples make use of Jupyter Notebooks to allow the reader to interact more closely with the different solutions illustrated. The tools used in this chapter are Jupyter Notebooks, scikit-learn, and NLTK. The datasets used in this regard are the Kaggle spam dataset, CSV spam samples, and honeypot phishing samples.
Chapter 4, Malware Threat Detection, explains how the wide diffusion of malware and ransomware, together with the rapid mutation of the same threats into different variants (polymorphic and metamorphic malware), has rendered obsolete traditional detection solutions based on signatures and the hashing of image files. It is upon these techniques that common antivirus software is based. The examples will show the different malware analysis strategies that use ML algorithms. The tools used in this chapter are Jupyter Notebooks, scikit-learn, and TensorFlow. Datasets/samples used in this regard include theZoo malware samples.
Chapter 5, Network Anomaly Detection with AI, explains how the current level of interconnection between different devices has attained such complexity that it leads to serious doubts about the effectiveness of traditional concepts such as perimeter security. In cyberspace, in fact, the attack surface grows exponentially, and it is therefore essential to have automated tools for the detection of network anomalies and for learning about new potential threats. The tools used in this chapter are Jupyter Notebooks, pandas, scikit-learn, and Keras. The datasets used in this regard are Kaggle datasets, KDD 1990, CIDDS, CICIDS2017, services, and IDS log files.
Chapter 6, Securing User Authentication, shows how AI plays an increasingly important role in protecting sensitive user-related information, including the credentials used to access network accounts and applications, in order to prevent abuse such as identity theft.
Chapter 7, Fraud Prevention with Cloud AI Solutions, covers many of the security attacks and data breaches suffered by corporations. Such breaches aim to compromise sensitive information, such as customers' credit card details. These attacks are often conducted in stealth mode, meaning that it is difficult to detect them using traditional methods. The tools used in this chapter are IBM Watson Studio, IBM Cloud Object Storage, Jupyter Notebooks, scikit-learn, and Apache Spark. The dataset used here is the Kaggle Credit Card Fraud Detection dataset.
Chapter 8, GANs – Attacks and Defenses, introduces Generative Adversarial Networks (GANs) that represent the most advanced example of NNs that deep learning makes available to us. In the context of cybersecurity, GANs can be used for legitimate purposes, as in the case of authentication procedures, but they can also be exploited to violate these procedures. The tools used in this chapter are CleverHans, the Adversarial Machine Learning (AML) library, EvadeML-Zoo, TensorFlow, and Keras. The datasets used are example images of faces created entirely by using a GAN.
Chapter 9, Evaluating Algorithms, shows how to evaluate the effectiveness of the various alternative solutions using appropriate analysis metrics. The tools used in this chapter are scikit-learn, NumPy, and Matplotlib. scikit datasets are used in this regard.
Chapter 10, Assessing Your AI Arsenal, covers the techniques that attackers exploit to evade the tools. Only in this way is it possible to obtain a realistic picture of the effectiveness and reliability of the solutions adopted. In addition, the aspects related to the scalability of the solutions must be taken into consideration, and then monitored continuously to guarantee reliability. The tools used in this chapter are scikit-learn, Foolbox, EvadeML, Deep-pwning, TensorFlow, and Keras. The MNIST and scikit datasets are used in this regard.
Familiarity with cybersecurity concepts and knowledge of Python programming are essential in order to get the most out of this book.
You can download the example code files for this book from your account at www.packt.com. If you purchased this book elsewhere, you can visit www.packt.com/support and register to have the files emailed directly to you.
You can download the code files by following these steps:
Log in or register at www.packt.com.
Select the SUPPORT tab.
Click on Code Downloads & Errata.
Enter the name of the book in the Search box and follow the onscreen instructions.
Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:
WinRAR/7-Zip for Windows
Zipeg/iZip/UnRarX for Mac
7-Zip/PeaZip for Linux
The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Hands-On-Artificial-Intelligence-for-Cybersecurity. In case there's an update to the code, it will be updated on the existing GitHub repository.
We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!
We also provide a PDF file that has color images of the screenshots/diagrams used in this book. You can download it here: http://www.packtpub.com/sites/default/files/downloads/9781789804027_ColorImages.pdf.
Feedback from our readers is always welcome.
General feedback: If you have questions about any aspect of this book, mention the book title in the subject of your message and email us at [email protected].
Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packt.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details.
Piracy: If you come across any illegal copies of our works in any form on the internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.
If you are interested in becoming an author: If there is a topic that you have expertise in, and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.
Please leave a review. Once you have read and used this book, why not leave a review on the site that you purchased it from? Potential readers can then see and use your unbiased opinion to make purchase decisions, we at Packt can understand what you think about our products, and our authors can see your feedback on their book. Thank you!
For more information about Packt, please visit packt.com.
In this section, the fundamental concepts of AI will be introduced, including an analysis of the different types of algorithms and the most suitable strategies for using them in cybersecurity.
This section contains the following chapters:
Chapter 1, Introduction to AI for Cybersecurity Professionals
Chapter 2, Setting Up Your AI for Cybersecurity Arsenal
In this chapter, we'll distinguish between the various branches of Artificial Intelligence (AI), focusing on the pros and cons of the different approaches to automated learning in the field of cybersecurity.
We will introduce different strategies for training and optimizing the various algorithms, and we'll also look at the main concepts of AI in action using Jupyter Notebooks and the scikit-learn Python library.
This chapter will cover the following topics:
Applying AI in cybersecurity
The evolution from expert systems to data mining and AI
The different forms of automated learning
The characteristics of algorithm training and optimization
Beginning with AI via Jupyter Notebooks
Introducing AI in the context of cybersecurity
The application of AI to cybersecurity is an experimental research area that's not without problems, which we will try to explain in this chapter. However, it is undeniable that the results achieved so far are promising, and that in the near future these methods of analysis will become common practice, with clear and positive consequences for the cybersecurity profession, in terms of both new job opportunities and new challenges.
When dealing with the topic of applying AI to cybersecurity, the reactions from insiders are often ambivalent. In fact, reactions of skepticism alternate with conservative attitudes, partly caused by the fear that machines will supplant human operators, despite the high technical and professional skills of humans, acquired from years of hard work.
However, in the near future, companies and organizations will increasingly need to invest in automated analysis tools that enable a rapid and adequate response to current and future cybersecurity challenges. The scenario that is looming is therefore a combination of skills, rather than a clash between human operators and machines. It is likely that AI will take charge of the dirty work in the field of cybersecurity, that is, the selection of potentially suspect cases, leaving the most advanced tasks to the security analysts and letting them investigate in more depth the threats that deserve the most attention.
To understand the advantages associated with the adoption of AI in the field of cybersecurity, it is necessary to introduce the logic underlying the different methodological approaches that characterize AI.
We will start with a brief historical analysis of the evolution of AI in order to fully evaluate the potential benefits of applying it in the field of cybersecurity.
One of the first attempts at automated learning consisted of defining a rule-based decision system for a given application domain, covering all the possible ramifications and concrete cases that could be found in the real world. In this way, all the possible options were hardcoded within the automated learning solutions, and were verified by experts in the field.
The fundamental limitation of such expert systems consisted of the fact that they reduced the decisions to Boolean values (which reduce everything down to a binary choice), thus limiting the ability to adapt the solutions to the different nuances of real-world use cases.
In fact, expert systems do not learn anything new compared to hardcoded solutions, but limit themselves to looking for the right answer within a (potentially very large) knowledge base that is not able to adapt to new problems that were not addressed previously.
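The limitation described above can be illustrated with a minimal sketch of a rule-based decision system in Python. The rules, field names, and thresholds below are hypothetical and chosen only for illustration; they are not taken from any real product.

```python
# Hypothetical hardcoded knowledge base: each rule is a (name, predicate) pair.
# Field names and thresholds are illustrative assumptions.
RULES = [
    ("too_many_failed_logins", lambda event: event.get("failed_logins", 0) > 5),
    ("blacklisted_country", lambda event: event.get("country") in {"XX", "YY"}),
]


def is_suspicious(event):
    """Return True if any hardcoded rule fires; the answer is strictly Boolean."""
    return any(predicate(event) for _, predicate in RULES)


# A case foreseen by the experts is flagged.
known_case = {"failed_logins": 9, "country": "IT"}

# A case nobody encoded (a hypothetical burst of password resets) slips through,
# because the system only looks up answers and never learns new rules.
unforeseen_case = {"failed_logins": 0, "country": "IT", "password_resets": 40}

print(is_suspicious(known_case))       # True
print(is_suspicious(unforeseen_case))  # False
```

The second event is not flagged even though it may well be anomalous: handling it would require an expert to add a new rule by hand.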
The concrete cases that we come across in the real world cannot simply be represented using true/false classification models: although experts in the sector strive to list all possible cases, there is always something in reality that escapes classification. It is therefore necessary to make the best use of the data at our disposal in order to let latent tendencies and anomalous cases (such as outliers) emerge, making use of statistical and probabilistic models that can more appropriately reflect the indeterministic nature of reality.
Although the introduction of statistical models broke through the limitations of expert systems, the underlying rigidity of the approach remained, because statistical models, like rule-based decisions, were in fact established in advance and could not be modified to adapt to new data. For example, one of the most commonly used statistical models is the Gaussian distribution. The statistician could decide that the data comes from a Gaussian distribution and try to estimate the parameters that characterize the hypothetical distribution that best describes the data being analyzed, without taking alternative models into consideration.
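As a rough sketch of this workflow, the following snippet assumes in advance that the data is Gaussian and only estimates the two parameters (mean and standard deviation) from the samples. The sample values and the three-standard-deviations threshold are assumptions made for illustration; only the Python standard library is used.

```python
from statistics import NormalDist

# Hypothetical measurements (for example, request sizes in KB), invented for illustration.
samples = [48.0, 52.0, 50.0, 49.0, 51.0, 50.0, 47.0, 53.0]

# The model family (Gaussian) is fixed beforehand; only its parameters are estimated.
model = NormalDist.from_samples(samples)
print(model.mean, model.stdev)


def z_score(x):
    """Distance of x from the estimated mean, in units of standard deviation."""
    return (x - model.mean) / model.stdev


# Under the assumed model, values far from the mean are treated as outliers.
# The threshold of 3 standard deviations is a common convention, assumed here.
print(abs(z_score(50.5)) > 3)  # False: a typical value
print(abs(z_score(90.0)) > 3)  # True: an extreme value
```

If the data did not actually follow a Gaussian distribution, the estimated parameters would still be computed, but the outlier decisions built on them could be misleading. That is the rigidity the text refers to.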
To overcome these limits, it was necessary to adopt an iterative approach, which allowed the introduction of machine learning (ML) algorithms capable of generalizing descriptive models from the available data. Such algorithms autonomously generate their own features and, rather than being limited to predefined target functions, adapt them as the training process evolves.
The difference in approach compared to the predefined static models is also reflected in the research field known as data mining.
The data mining process can be defined as the discovery of adequate representative models, starting from the data. In this case too, instead of adopting pre-established statistical models, we can use ML algorithms based on the training data to identify the most suitable predictive model (this is all the more true when we are not able to understand the nature of the data at our disposal).
However, the algorithmic approach is not always adequate. When the nature of the data is clear and conforms to known models, there is no advantage in using ML algorithms instead of pre-defined models. The next step, which absorbs and extends the advantages of the previous approaches, adding the ability to manage cases not covered in the training data, leads us to AI.
AI is a wider field of research than ML and can manage data of a more generic and abstract nature, thus enabling the transfer of common solutions to different types of data without the need for complete retraining. In this way, it is possible, for example, to recognize objects in color images, starting with a model originally obtained from black and white samples.
Therefore, AI is considered a broad field of research that includes ML. In turn, ML includes deep learning (DL), which is an ML method based on artificial neural networks, as shown in the following diagram:
The process of machine learning from data can take different forms, with different characteristics and predictive abilities.
In the case of ML (which, as we have seen, is a branch of research belonging to AI), it is common to distinguish between the following types of ML:
Supervised learning
Unsupervised learning
Reinforcement learning
The differences between these learning modalities are attributable to the type of result (output) that we intend to achieve, based on the nature of the input required to produce it.
In the case of supervised learning, algorithm training is conducted using an input dataset for which the output that we have to obtain is already known.
In practice, the algorithms must be trained to identify the relationships between the variables in the training data, optimizing the learning parameters on the basis of the target variables (also called labels) that, as mentioned, are already known.
Classification algorithms are one example of supervised learning; in the field of cybersecurity, they are widely used for spam classification.
A spam filter is in fact trained by submitting an input dataset to the algorithm containing many examples of emails that have already been previously classified as spam (the emails were malicious or unwanted) or ham (the emails were genuine and harmless).
The classification algorithm of the spam filter must therefore learn to classify the new emails it will receive in the future, assigning them to the spam or ham class based on the training previously performed on the input dataset of already classified emails.
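The training step just described can be sketched with a tiny naive Bayes classifier written in plain Python. Naive Bayes is our choice for illustration only (this passage does not name a specific algorithm), and the four labeled emails are hypothetical.

```python
import math
from collections import Counter

# Hypothetical, hand-labeled training emails (illustrative only).
training = [
    ("win money now claim your prize", "spam"),
    ("cheap pills win big money", "spam"),
    ("meeting agenda for monday", "ham"),
    ("please review the project report", "ham"),
]

def train(samples):
    """Count words per class and emails per class."""
    word_counts = {"spam": Counter(), "ham": Counter()}
    class_counts = Counter()
    for text, label in samples:
        class_counts[label] += 1
        word_counts[label].update(text.split())
    return word_counts, class_counts

def classify(text, word_counts, class_counts):
    """Return the class with the highest log-posterior, using add-one smoothing."""
    vocab = set(word_counts["spam"]) | set(word_counts["ham"])
    total_emails = sum(class_counts.values())
    scores = {}
    for label in class_counts:
        total_words = sum(word_counts[label].values())
        score = math.log(class_counts[label] / total_emails)
        for word in text.split():
            # Smoothing keeps a word unseen in one class from zeroing its probability.
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

word_counts, class_counts = train(training)
print(classify("claim your money prize", word_counts, class_counts))
print(classify("project meeting on monday", word_counts, class_counts))
```

The first message is assigned to spam and the second to ham, purely on the basis of the word frequencies learned from the labeled examples.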
Regression algorithms are another example of supervised learning. The main supervised algorithms are the following:
Regression (linear and logistic)
k-Nearest Neighbors (k-NNs)
Support vector machines (SVMs)
Decision trees and random forests
Neural networks (NNs)
In the case of unsupervised learning, the algorithms must try to classify the data independently, without the aid of a previous classification provided by the analyst. In the context of cybersecurity, unsupervised learning algorithms are important for identifying new (not previously detected) forms of malware attacks, frauds, and email spamming campaigns.
Here are some examples of unsupervised algorithms:
Dimensionality reduction:
Principal component analysis (PCA)
Kernel PCA
Clustering:
k-means
Hierarchical cluster analysis (HCA)
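As a minimal sketch of how a clustering algorithm groups data without labels, the following plain-Python k-means separates six points into two groups. The feature values and the choice of initial centroids are assumptions made for this illustration.

```python
import math

# Hypothetical features per connection: (kilobytes sent, duration in seconds).
points = [(1.0, 2.0), (1.5, 1.8), (1.2, 2.2), (8.0, 9.0), (8.5, 9.5), (9.0, 8.8)]

def kmeans(data, centroids, iterations=10):
    """Plain k-means: assign points to the nearest centroid, then recompute centroids."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in data:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster (unchanged if the cluster is empty).
        centroids = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters

# Initial centroids are an arbitrary choice for this sketch: the first and last points.
centroids, clusters = kmeans(points, [points[0], points[-1]])
print(centroids)
print(clusters)
```

No label is supplied at any point; the two groups emerge only from the distances between the samples.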
In the case of reinforcement learning (RL), a different learning strategy is followed, which emulates the trial-and-error approach: information is drawn from the feedback obtained along the learning path, with the aim of maximizing the final reward, which depends on the number of correct decisions that the algorithm has made.
In practice, the learning process takes place in an unsupervised manner, with the particularity that a positive reward is assigned to each correct decision (and a negative reward for incorrect decisions) taken at each step of the learning path. At the end of the learning process, the decisions of the algorithm are reassessed based on the final reward achieved.
Given its dynamic nature, it is no coincidence that RL is more similar to the general approach adopted by AI than to the common algorithms developed in ML.
The following are some examples of RL algorithms:
Markov process
Q-learning
Temporal difference (TD) methods
Monte Carlo methods
In particular, Hidden Markov Models (HMM) (which make use of the Markov process) are extremely important in the detection of polymorphic malware threats.
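The reward-driven learning described above for RL can be sketched with tabular Q-learning on a toy problem. The five-state corridor, the reward of 1 at the goal, and the values of the learning rate and discount factor are all assumptions chosen for illustration.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

N_STATES, GOAL = 5, 4        # corridor of states 0..4; reaching state 4 ends an episode
ACTIONS = (-1, +1)           # index 0 = step left, index 1 = step right
ALPHA, GAMMA = 0.5, 0.9      # learning rate and discount factor (arbitrary choices)

# One row per state, one estimated value per action.
q = [[0.0, 0.0] for _ in range(N_STATES)]

for _ in range(500):
    state = 0
    while state != GOAL:
        action = random.randrange(2)  # explore with a purely random policy
        next_state = min(max(state + ACTIONS[action], 0), GOAL)
        reward = 1.0 if next_state == GOAL else 0.0  # feedback arrives only at the goal
        # Q-learning update: move the estimate toward reward + discounted best future value.
        target = reward + GAMMA * max(q[next_state])
        q[state][action] += ALPHA * (target - q[state][action])
        state = next_state

# The learned policy picks, in each state, the action with the higher estimated value.
policy = ["right" if q[s][1] > q[s][0] else "left" for s in range(GOAL)]
print(policy)
```

Although the agent is never told which action is correct, the feedback propagated backward from the goal is enough for the greedy policy to choose the step toward the goal in every state.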
When preparing automated learning procedures, we will often face a series of challenges. We need to recognize and overcome these challenges in order to avoid compromising the reliability of the procedures themselves, and so avoid drawing erroneous or hasty conclusions that, in the context of cybersecurity, can have devastating consequences.
One of the main problems that we often face, especially in the case of the configuration of threat detection procedures, is the management of false positives; that is, cases detected by the algorithm and classified as potential threats, which in reality are not. We will discuss false positives and ML evaluation metrics in more depth in Chapter 7, Fraud Prevention with Cloud AI Solutions, and Chapter 9, Evaluating Algorithms.
The management of false positives is particularly burdensome in the case of detection systems aimed at countering network threats, given that the number of events detected is often so high that it absorbs and saturates all the human resources dedicated to threat detection activities.
On the other hand, even correct (true positive) reports, if in excessive numbers, contribute to functionally overloading the analysts, distracting them from priority tasks. The need to optimize the learning procedures therefore emerges in order to reduce the number of cases that need to be analyzed in depth by the analysts.
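To quantify the burden that false positives place on analysts, we can count the four possible outcomes of a detector and derive a few basic ratios. The following sketch uses ten hypothetical events; the labels and predictions are invented for illustration.

```python
# Hypothetical ground truth and detector output for ten events (1 = threat, 0 = benign).
actual    = [0, 0, 0, 0, 0, 0, 1, 1, 0, 0]
predicted = [0, 1, 0, 0, 1, 0, 1, 0, 0, 1]

pairs = list(zip(actual, predicted))
tp = sum(1 for a, p in pairs if a == 1 and p == 1)  # threats correctly reported
fp = sum(1 for a, p in pairs if a == 0 and p == 1)  # benign events reported as threats
fn = sum(1 for a, p in pairs if a == 1 and p == 0)  # threats that were missed
tn = sum(1 for a, p in pairs if a == 0 and p == 0)  # benign events correctly ignored

precision = tp / (tp + fp)            # share of alerts that are real threats
recall = tp / (tp + fn)               # share of real threats that raised an alert
false_positive_rate = fp / (fp + tn)  # share of benign events that raised an alert

print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print(f"precision={precision:.2f} recall={recall:.2f} FPR={false_positive_rate:.3f}")
```

In this toy case, only one alert in four corresponds to a real threat, so three quarters of the analysts' time would be spent on events that are in fact harmless.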
This optimization activity often starts with the selection and cleaning of the data submitted to the algorithms.
In the case of anomaly detection, for example, particular attention must be paid to the data being analyzed. Effective anomaly detection presupposes that the training data does not contain the anomalies sought but, on the contrary, reflects the normal reference situation.
If, on the other hand, the training data were contaminated with the anomalies being investigated, the anomaly detection activity would lose much of its reliability and utility, in accordance with the principle commonly known as GIGO, which stands for garbage in, garbage out.
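The effect of contaminated training data can be illustrated with a deliberately simple detector that flags any value lying more than three standard deviations from the training mean. The traffic figures and the three-sigma threshold are assumptions made for this sketch, not a recommended configuration.

```python
from statistics import mean, stdev

def fit(training):
    """Learn the 'normal' profile as a mean and a standard deviation."""
    return mean(training), stdev(training)

def is_anomaly(value, model, k=3.0):
    """Flag a value that lies more than k standard deviations from the mean."""
    mu, sigma = model
    return abs(value - mu) > k * sigma

# Hypothetical requests-per-minute figures observed during normal operation.
clean = [100, 102, 98, 101, 99, 103, 97, 100]
# The same data, contaminated with three readings taken during an attack.
contaminated = clean + [500, 520, 480]

clean_model = fit(clean)
contaminated_model = fit(contaminated)

suspicious = 300
print(is_anomaly(suspicious, clean_model))         # flagged by the clean model
print(is_anomaly(suspicious, contaminated_model))  # missed by the contaminated model
```

The anomalies in the training set inflate both the mean and the standard deviation, so the contaminated model treats a reading of 300 as normal.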
Given the increasing availability of raw data in real time, often the preliminary cleaning of data is considered a challenge in itself. In fact, it's often necessary to conduct a preliminary skim of the data, eliminating irrelevant or redundant information. We can then present the data to the algorithms in a correct form, which can improve their ability to learn, adapting to the form of data on the basis of the type of algorithm used.
For example, a classification algorithm will be able to identify a more representative and more effective model when the input data is presented in a grouped form or is linearly separable. In the same way, the presence of variables (also known as dimensions) containing empty fields weighs down the computational effort of the algorithm and produces less reliable predictive models, due to the phenomenon known as the curse of dimensionality.
This occurs when the number of features (that is, dimensions) increases without adding relevant information, so that the data is simply dispersed across a larger search space.
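One way to observe this dispersion is to measure how much the nearest and farthest neighbors of a point differ as dimensions are added. The following sketch uses uniformly random points, which is an assumption made for illustration; the function name relative_contrast is our own.

```python
import math
import random

random.seed(42)  # fixed seed so the run is repeatable

def relative_contrast(dim, n_points=200):
    """(farthest - nearest) / nearest distance from a random query to random points."""
    query = [random.random() for _ in range(dim)]
    distances = [
        math.dist(query, [random.random() for _ in range(dim)])
        for _ in range(n_points)
    ]
    return (max(distances) - min(distances)) / min(distances)

low = relative_contrast(2)     # two informative-looking dimensions
high = relative_contrast(500)  # five hundred dimensions of pure noise
print(f"contrast in 2 dimensions:   {low:.2f}")
print(f"contrast in 500 dimensions: {high:.2f}")
```

In high dimensions, all points end up at nearly the same distance from the query, so distance-based algorithms have far less signal with which to tell neighbors apart.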
Also, the sources from which we draw our test cases (samples) are important. Think, for example, of a case in which we have to predict the malicious behavior of an unknown executable. The problem in question is reduced to the definition of a classification model for the executable, which must be assigned to one of two categories: genuine or malicious.
To achieve such a result, we need to train our classification algorithm by providing it, as an input dataset, with a number of examples of executables that are considered malicious.
When it all boils down to quantity versus quality, we are immediately faced with the following two problems:
What types of malware can we consider most representative of the most probable risks and threats to our company?
How many example cases (samples) should we collect and administer to the algorithms in order to obtain a reliable result in terms of both effectiveness and predictive efficiency of future threats?
The answers to the two questions are closely related to the knowledge that the analyst has of the specific organizational realm in which they must operate.
All this could lead the analyst to believe that creating a honeypot to gather malicious samples in the wild, to be fed to the algorithms as training samples, would be more representative of the level of risk to which the organization is exposed than using datasets of generic threats. At the same time, the number of examples to be submitted to the algorithm is determined by the characteristics of the data itself. The data can, in fact, present a prevalence of cases (skewness) of a certain type to the detriment of other types. This distorts the predictions of the algorithm toward the most numerous classes when, in reality, the most relevant information for our investigation is represented by a class with a smaller number of cases.
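The distortion caused by skewed classes can be shown with a few lines of arithmetic. In the following sketch, the 95/5 split is hypothetical, and the weighting formula is the common inverse-frequency heuristic (to our knowledge, the same one that scikit-learn applies when class_weight is set to "balanced").

```python
# Hypothetical dataset: 95 genuine executables (0) and 5 malicious ones (1).
labels = [0] * 95 + [1] * 5

# A degenerate "classifier" that always predicts the majority class.
predictions = [0] * len(labels)

accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
detected = sum(1 for p, y in zip(predictions, labels) if p == 1 and y == 1)
recall = detected / sum(labels)
print(f"accuracy={accuracy:.2f}, recall on malicious class={recall:.2f}")

# Inverse-frequency weights: n_samples / (n_classes * samples_in_class).
n_samples, n_classes = len(labels), 2
weights = {c: n_samples / (n_classes * labels.count(c)) for c in (0, 1)}
print(weights)
```

An accuracy of 95% hides the fact that not a single malicious sample was detected; weighting the rare class more heavily is one possible way to counter this bias during training.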
In conclusion, it will not simply be a matter of choosing the best algorithm for our goals (which often does not exist), but mainly of selecting the most representative cases (samples) to be submitted to a set of algorithms, which we will try to optimize based on the results obtained.
In the following sections, we will explore the concepts presented so far, presenting some sample code that makes use of a series of Python libraries that are among the most well known and widespread in the field of ML:
NumPy (version 1.13.3)
pandas (version 0.20.3)
Matplotlib (version 2.0.2)
scikit-learn (version 0.20.0)
Seaborn (version 0.8.0)
The sample code will be shown here in the form of snippets, along with screenshots representing their output. Do not worry if not all of the implementation details are clear to you at first glance; we will have the opportunity to understand the implementation aspects of every single algorithm throughout the book.
With the exponential increase in the spread of threats associated with the daily diffusion of new malware, it is practically impossible to think of dealing effectively with these threats using only analysis conducted by human operators. It is necessary to introduce algorithms that allow us to automate that introductory phase of analysis known as triage, that is to say, to conduct a preliminary screening of the threats to be submitted to the attention of the cybersecurity professionals, allowing us to respond in a timely and effective manner to ongoing attacks.
We need to be able to respond in a dynamic fashion, adapting to the changes in the context related to the presence of unprecedented