Master the intricacies of Apache Storm and develop real-time stream processing applications with ease
If you are a Java developer who wants to enter into the world of real-time stream processing applications using Apache Storm, then this book is for you. No previous experience in Storm is required as this book starts from the basics. After finishing this book, you will be able to develop not-so-complex Storm applications.
Apache Storm is a real-time Big Data processing framework that processes large amounts of data reliably, guaranteeing that every message will be processed. Storm allows you to scale your data as it grows, making it an excellent platform to solve your big data problems. This extensive guide will help you understand right from the basics to the advanced topics of Storm.
The book begins with a detailed introduction to real-time processing and where Storm fits in to solve these problems. You'll get an understanding of deploying Storm on clusters by writing a basic Storm Hello World example. Next we'll introduce you to Trident and you'll get a clear understanding of how you can develop and deploy a Trident topology. We cover topics such as monitoring, Storm parallelism, scheduling, and log processing in an easy-to-understand manner. You will also learn how to integrate Storm with other well-known Big Data technologies such as HBase, Redis, Kafka, and Hadoop to realize the full potential of Storm.
With real-world examples and clear explanations, this book will ensure that you gain a thorough mastery of Apache Storm. You will be able to use this knowledge to develop efficient, distributed real-time applications to cater to your business needs.
This easy-to-follow guide is full of examples and real-world applications to help you get an in-depth understanding of Apache Storm. This book covers the basics thoroughly and also delves into the intermediate and slightly advanced concepts of application development with Apache Storm.
BIRMINGHAM - MUMBAI
Copyright © 2017 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, nor its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
First published: August 2017
Production reference: 1140817
ISBN 978-1-78712-563-6
www.packtpub.com
Author
Ankit Jain
Copy Editor
Safis Editing
Reviewers
Doug Ortiz
Oleg Okun
Project Coordinator
Nidhi Joshi
Commissioning Editor
Veena Pagare
Proofreader
Safis Editing
Acquisition Editor
Divya Poojari
Indexer
Tejal Daruwale Soni
Content Development Editor
Mayur Pawanikar
Graphics
Tania Dutta
Technical Editor
Dinesh Pawar
Production Coordinator
Arvindkumar Gupta
Ankit Jain holds a bachelor's degree in computer science and engineering. He has six years' experience in designing and architecting solutions for the big data domain and has been involved with several complex engagements. His technical strengths include Hadoop, Storm, S4, HBase, Hive, Sqoop, Flume, Elasticsearch, machine learning, Kafka, Spring, Java, and J2EE.
He also shares his thoughts on his personal blog. You can follow him on Twitter at @mynameisanky. He spends most of his time reading books and playing with different technologies. When not at work, he spends time with his family and friends watching movies and playing games.
Doug Ortiz is a senior big data architect at ByteCubed who has been architecting, developing, and integrating enterprise solutions throughout his whole career. Organizations that leverage his skill set have been able to rediscover and reuse their underutilized data via existing and emerging technologies such as the Microsoft BI stack, Hadoop, NoSQL databases, SharePoint, and related toolsets and technologies.
He is the founder of Illustris, LLC and can be reached at [email protected].
Interesting aspects of his profession:
He has experience integrating multiple platforms and products
He holds big data and data science certifications, as well as R and Python certifications
He helps organizations gain a deeper understanding of, and derive more value from, their current investments in data and existing resources, turning them into useful sources of information
He has improved, salvaged, and architected projects by utilizing unique and innovative techniques
His hobbies are yoga and scuba diving.
Oleg Okun is a machine learning expert and the author/editor of four books, numerous journal articles, and conference papers. His career spans more than a quarter of a century. He was employed in both academia and industry in his mother country, Belarus, and abroad (Finland, Sweden, and Germany). His work experience includes document image analysis, fingerprint biometrics, bioinformatics, online/offline marketing analytics, credit scoring analytics, and text analytics.
He is interested in all aspects of distributed machine learning and the Internet of Things. Oleg currently lives and works in Hamburg, Germany.
For support files and downloads related to your book, please visit www.PacktPub.com. Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com, and as a print book customer, you are entitled to a discount on the eBook copy.
Get in touch with us at [email protected] for more details. At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.
https://www.packtpub.com/mapt
Get the most in-demand software skills with Mapt. Mapt gives you full access to all Packt books and video courses, as well as industry-leading tools to help you plan your personal development and advance your career.
Fully searchable across every book published by Packt
Copy and paste, print, and bookmark content
On demand and accessible via a web browser
Thanks for purchasing this Packt book. At Packt, quality is at the heart of our editorial process. To help us improve, please leave us an honest review on this book's Amazon page at https://www.amazon.com/dp/1787125637.
If you'd like to join our team of regular reviewers, you can e-mail us at [email protected]. We award our regular reviewers with free eBooks and videos in exchange for their valuable feedback. Help us be relentless in improving our products.
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Downloading the color images of this book
Errata
Piracy
Questions
Real-Time Processing and Storm Introduction
Apache Storm
Features of Storm
Storm components
Nimbus
Supervisor nodes
The ZooKeeper cluster
The Storm data model
Definition of a Storm topology
Operation modes in Storm
Programming languages
Summary
Storm Deployment, Topology Development, and Topology Options
Storm prerequisites
Installing Java SDK 7
Deployment of the ZooKeeper cluster
Setting up the Storm cluster
Developing the hello world example
The different options of the Storm topology
Deactivate
Activate
Rebalance
Kill
Dynamic log level settings
Walkthrough of the Storm UI
Cluster Summary section
Nimbus Summary section
Supervisor Summary section
Nimbus Configuration section
Topology Summary section
Dynamic log level settings
Updating the log level from the Storm UI
Updating the log level from the Storm CLI
Summary
Storm Parallelism and Data Partitioning
Parallelism of a topology
Worker process
Executor
Task
Configure parallelism at the code level
Worker process, executor, and task distribution
Rebalance the parallelism of a topology
Rebalance the parallelism of a SampleStormClusterTopology topology
Different types of stream grouping in the Storm cluster
Shuffle grouping
Field grouping
All grouping
Global grouping
Direct grouping
Local or shuffle grouping
None grouping
Custom grouping
Guaranteed message processing
Tick tuple
Summary
Trident Introduction
Trident introduction
Understanding Trident's data model
Writing Trident functions, filters, and projections
Trident function
Trident filter
Trident projection
Trident repartitioning operations
Utilizing shuffle operation
Utilizing partitionBy operation
Utilizing global operation
Utilizing broadcast operation
Utilizing batchGlobal operation
Utilizing partition operation
Trident aggregator
partitionAggregate
aggregate
ReducerAggregator
Aggregator
CombinerAggregator
persistentAggregate
Aggregator chaining
Utilizing the groupBy operation
When to use Trident
Summary
Trident Topology and Uses
Trident groupBy operation
groupBy before partitionAggregate
groupBy before aggregate
Non-transactional topology
Trident hello world topology
Trident state
Distributed RPC
When to use Trident
Summary
Storm Scheduler
Introduction to Storm scheduler
Default scheduler
Isolation scheduler
Resource-aware scheduler
Component-level configuration
Memory usage example
CPU usage example
Worker-level configuration
Node-level configuration
Global component configuration
Custom scheduler
Configuration changes in the supervisor node
Configuration setting at component level
Writing a custom supervisor class
Converting component IDs to executors
Converting supervisors to slots
Registering a CustomScheduler class
Summary
Monitoring of Storm Cluster
Cluster statistics using the Nimbus thrift client
Fetching information with Nimbus thrift
Monitoring the Storm cluster using JMX
Monitoring the Storm cluster using Ganglia
Summary
Integration of Storm and Kafka
Introduction to Kafka
Kafka architecture
Producer
Replication
Consumer
Broker
Data retention
Installation of Kafka brokers
Setting up a single node Kafka cluster
Setting up a three node Kafka cluster
Multiple Kafka brokers on a single node
Share ZooKeeper between Storm and Kafka
Kafka producers and publishing data into Kafka
Kafka Storm integration
Deploy the Kafka topology on Storm cluster
Summary
Storm and Hadoop Integration
Introduction to Hadoop
Hadoop Common
Hadoop Distributed File System
Namenode
Datanode
HDFS client
Secondary namenode
YARN
ResourceManager (RM)
NodeManager (NM)
ApplicationMaster (AM)
Installation of Hadoop
Setting passwordless SSH
Getting the Hadoop bundle and setting up environment variables
Setting up HDFS
Setting up YARN
Write Storm topology to persist data into HDFS
Integration of Storm with Hadoop
Setting up Storm-YARN
Storm-Starter topologies on Storm-YARN
Summary
Storm Integration with Redis, Elasticsearch, and HBase
Integrating Storm with HBase
Integrating Storm with Redis
Integrating Storm with Elasticsearch
Integrating Storm with Esper
Summary
Apache Log Processing with Storm
Apache log processing elements
Producing Apache log in Kafka using Logstash
Installation of Logstash
What is Logstash?
Why are we using Logstash?
Installation of Logstash
Configuration of Logstash
Why are we using Kafka between Logstash and Storm?
Splitting the Apache log line
Identifying country, operating system type, and browser type from the log file
Calculate the search keyword
Persisting the process data
Kafka spout and define topology
Deploy topology
MySQL queries
Calculate the page hit from each country
Calculate the count for each browser
Calculate the count for each operating system
Summary
Twitter Tweet Collection and Machine Learning
Exploring machine learning
Twitter sentiment analysis
Using Kafka producer to store the tweets in a Kafka cluster
Kafka spout, sentiments bolt, and HDFS bolt
Summary
Real-time data processing is no longer a luxury exercised by a few big companies but has become a necessity for businesses that want to compete, and Apache Storm is one of the de facto standards for developing real-time processing pipelines. The key features of Storm are that it is horizontally scalable, is fault tolerant, and provides guaranteed message processing. Storm can solve various types of analytics problems: machine learning, log processing, graph analysis, and so on.
Mastering Storm will serve both as a getting-started guide for inexperienced developers and as a reference for implementing advanced use cases with Storm for experienced developers. In the first two chapters, you will learn the basics of a Storm topology and the various components of a Storm cluster. In the later chapters, you will learn how to build a Storm application that can interact with various other big data technologies and how to create transactional topologies. Finally, the last two chapters cover case studies for log processing and machine learning. We also cover how you can use the Storm scheduler to assign dedicated work to dedicated machines.
Chapter 1, Real-Time Processing and Storm Introduction, gives an introduction to Storm and its components.
Chapter 2, Storm Deployment, Topology Development, and Topology Options, covers deploying Storm into the cluster, deploying the sample topology on a Storm cluster, how we can monitor the Storm pipeline using the Storm UI, and how we can dynamically change the log level settings.
Chapter 3, Storm Parallelism and Data Partitioning, covers the parallelism of topology, how to configure parallelism at the code level, guaranteed message processing, and Storm internally generated tuples.
Chapter 4, Trident Introduction, covers an introduction to Trident, an understanding of the Trident data model, and how we can write Trident filters and functions. This chapter also covers repartitioning and aggregation operations on Trident tuples.
Chapter 5, Trident Topology and Uses, introduces Trident tuple grouping, non-transactional topology, and a sample Trident topology. The chapter also introduces Trident state and distributed RPC.
Chapter 6, Storm Scheduler, covers different types of scheduler available in Storm: the default scheduler, isolation scheduler, resource-aware scheduler, and custom scheduler.
Chapter 7, Monitoring of the Storm Cluster, covers monitoring Storm by writing custom monitoring UIs using the stats published by Nimbus. We explain the integration of Ganglia with Storm using JMXTrans. This chapter also covers how we can configure Storm to publish JMX metrics.
Chapter 8, Integration of Storm and Kafka, shows the integration of Storm with Kafka. This chapter starts with an introduction to Kafka, covers the installation of Kafka, and ends with the integration of Storm with Kafka to solve a real-world problem.
Chapter 9, Storm and Hadoop Integration, covers an overview of Hadoop, writing the Storm topology to publish data into HDFS, an overview of Storm-YARN, and deploying the Storm topology on YARN.
Chapter 10, Storm Integration with Redis, Elasticsearch, and HBase, teaches you how to integrate Storm with various other big data technologies.
Chapter 11, Apache Log Processing with Storm, covers a sample log processing application in which we parse Apache web server logs and generate some business information from log files.
Chapter 12, Twitter Tweets Collection and Machine Learning, walks you through a case study implementing a machine learning topology in Storm.
All of the code in this book has been tested on CentOS 6.5. It will run on other variants of Linux and Windows as well with appropriate changes in commands.
We have tried to keep the chapters self-contained, and the setup and installation of all the software used in each chapter are included in the chapter itself. These are the software packages used throughout the book:
CentOS 6.5
Oracle JDK 8
Apache ZooKeeper 3.4.6
Apache Storm 1.0.2
Eclipse or Spring Tool Suite
Elasticsearch 2.4.4
Hadoop 2.2.2
Logstash 5.4.1
Kafka 0.9.0.1
Esper 5.3.0
If you are a Java developer and are keen to enter into the world of real-time stream processing applications using Apache Storm, then this book is for you. No previous experience in Storm is required as this book starts from the basics. After finishing this book, you will be able to develop not-so-complex Storm applications.
In this book, you will find a number of text styles that distinguish between different kinds of information. Here are some examples of these styles and an explanation of their meaning. Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: "Add the following line in the storm.yaml file of the Nimbus machine to enable JMX on the Nimbus node."
A block of code is set as follows:
<dependency>
  <groupId>org.apache.storm</groupId>
  <artifactId>storm-core</artifactId>
  <version>1.0.2</version>
  <scope>provided</scope>
</dependency>
Any command-line input or output is written as follows:
cd $ZK_HOME/conf
touch zoo.cfg
New terms and important words are shown in bold. Words that you see on the screen, for example, in menus or dialog boxes, appear in the text like this: "Now, click on the Connect button to view the metrics of the supervisor node."
Feedback from our readers is always welcome. Let us know what you think about this book-what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of. To send us general feedback, simply e-mail [email protected], and mention the book's title in the subject of your message. If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.
Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.
You can download the example code files for this book from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support, and register to have the files e-mailed directly to you. You can download the code files by following these steps:
Log in or register to our website using your e-mail address and password.
Hover the mouse pointer on the SUPPORT tab at the top.
Click on Code Downloads & Errata.
Enter the name of the book in the Search box.
Select the book for which you're looking to download the code files.
Choose from the drop-down menu where you purchased this book from.
Click on Code Download.
Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:
WinRAR / 7-Zip for Windows
Zipeg / iZip / UnRarX for Mac
7-Zip / PeaZip for Linux
The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Mastering-Apache-Storm. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!
We also provide you with a PDF file that has color images of the screenshots/diagrams used in this book. The color images will help you better understand the changes in the output. You can download this file from https://www.packtpub.com/sites/default/files/downloads/MasteringApacheStorm_ColorImages.pdf.
Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books-maybe a mistake in the text or the code-we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title. To view the previously submitted errata, go to https://www.packtpub.com/books/content/support, and enter the name of the book in the search field. The required information will appear under the Errata section.
Piracy of copyrighted material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy. Please contact us at [email protected] with a link to the suspected pirated material. We appreciate your help in protecting our authors and our ability to bring you valuable content.
If you have a problem with any aspect of this book, you can contact us at [email protected], and we will do our best to address the problem.
With the exponential growth in the amount of data being generated and advanced data-capturing capabilities, enterprises are facing the challenge of making sense out of this mountain of raw data. On the batch processing front, Hadoop has emerged as the go-to framework to deal with big data. Until recently, there has been a void when one looks for frameworks to build real-time stream processing applications. Such applications have become an integral part of a lot of businesses as they enable them to respond swiftly to events and adapt to changing situations. Examples of this are monitoring social media to analyze public response to any new product that you launch and predicting the outcome of an election based on the sentiments of election-related posts.
Organizations are collecting a large volume of data from external sources and want to evaluate/process the data in real time to get market trends, detect fraud, identify user behavior, and so on. The need for real-time processing is increasing day by day and we require a real-time system/platform that should support the following features:
Scalable: The platform should be horizontally scalable without any downtime.
Fault tolerance: The platform should be able to process the data even after some of the nodes in a cluster go down.
No data loss: The platform should provide guaranteed processing of messages.
High throughput: The system should be able to support millions of records per second and messages of any size.
Easy to operate: The system should have easy installation and operation. Also, the expansion of the cluster should be an easy process.
Multiple languages: The platform should support multiple languages. The end user should be able to write code in different languages, for example Python, Scala, or Java. Also, it should be possible to run code written in different languages inside one cluster.
Cluster isolation: The system should support isolation so that dedicated processes can be assigned to dedicated machines for processing.
Apache Storm has emerged as the platform of choice for industry leaders to develop distributed, real-time, data processing platforms. It provides a set of primitives that can be used to develop applications that can process a very large amount of data in real time in a highly scalable manner.
Storm is to real-time processing what Hadoop is to batch processing. It is open source software, managed by the Apache Software Foundation. It has been deployed to meet real-time processing needs by companies such as Twitter, Yahoo!, and Flipboard. Storm was first developed by Nathan Marz at BackType, a company that provided social search applications. Later, BackType was acquired by Twitter, and Storm became a critical part of their infrastructure. Storm can be used for the following use cases:
Stream processing: Storm is used to process streams of data and update a variety of databases in real time. This processing occurs in real time and the processing speed needs to match the input data speed.
Continuous computation: Storm can do continuous computation on data streams and stream the results to clients in real time. This might require processing each message as it comes in or creating small batches over a short time. An example of continuous computation is streaming trending topics on Twitter into browsers.
Distributed RPC: Storm can parallelize an intense query so that you can compute it in real time.
Real-time analytics: Storm can analyze and respond to data that comes from different data sources as it arrives, in real time.
In this chapter, we will cover the following topics:
What is Storm?
Features of Storm
Architecture and components of a Storm cluster
Terminologies of Storm
Programming language
Operation modes
The following are some of the features of Storm that make it a perfect solution to process streams of data in real time:
Fast: Storm has been reported to process up to 1 million tuples/records per second per node.
Horizontally scalable: Being fast is a necessary feature to build a high volume/velocity data processing platform, but a single node will have an upper limit on the number of events that it can process per second. A node represents a single machine in your setup that executes Storm applications. Storm, being a distributed platform, allows you to add more nodes to your Storm cluster and increase the processing capacity of your application. Also, it is linearly scalable, which means that you can double the processing capacity by doubling the nodes.
Fault tolerant: Units of work are executed by worker processes in a Storm cluster. When a worker dies, Storm will restart that worker, and if the node on which the worker is running dies, Storm will restart that worker on some other node in the cluster. This feature will be covered in more detail in Chapter 3, Storm Parallelism and Data Partitioning.
Guaranteed data processing: Storm provides strong guarantees that each message entering a Storm process will be processed at least once. In the event of failures, Storm will replay the lost tuples/records. Also, it can be configured so that each message will be processed only once.
Easy to operate: Storm is simple to deploy and manage. Once the cluster is deployed, it requires little maintenance.
Programming language agnostic: Even though the Storm platform runs on the Java virtual machine (JVM), the applications that run over it can be written in any programming language that can read and write to standard input and output streams.
A Storm cluster follows a master-slave model where the master and slave processes are coordinated through ZooKeeper. The following are the components of a Storm cluster.
The Nimbus node is the master in a Storm cluster. It is responsible for distributing the application code across various worker nodes, assigning tasks to different machines, monitoring tasks for any failures, and restarting them as and when required.
Nimbus is stateless and stores all of its data in ZooKeeper. There is a single active Nimbus node in a Storm cluster, and if it goes down, a passive node takes over as the active node. Nimbus is designed to be fail-fast, so when the active Nimbus dies, a passive node becomes active, or the down node can be restarted without having any effect on the tasks already running on the worker nodes. This is unlike Hadoop, where if the JobTracker dies, all the running jobs are left in an inconsistent state and need to be executed again. Storm workers can keep working smoothly even if all the Nimbus nodes go down, but users cannot submit new jobs to the cluster, and the cluster will not be able to reassign failed workers to other nodes.
Supervisor nodes are the worker nodes in a Storm cluster. Each supervisor node runs a supervisor daemon that is responsible for creating, starting, and stopping worker processes to execute the tasks assigned to that node. Like Nimbus, a supervisor daemon is also fail-fast and stores all of its states in ZooKeeper so that it can be restarted without any state loss. A single supervisor daemon normally handles multiple worker processes running on that machine.
In any distributed application, various processes need to coordinate with each other and share some configuration information. ZooKeeper is an application that provides all these services in a reliable manner. As a distributed application, Storm also uses a ZooKeeper cluster to coordinate various processes. All of the states associated with the cluster and the various tasks submitted to Storm are stored in ZooKeeper. Nimbus and supervisor nodes do not communicate directly with each other, but through ZooKeeper. As all data is stored in ZooKeeper, both Nimbus and the supervisor daemons can be killed abruptly without adversely affecting the cluster.
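As a rough illustration of how this coordination is wired up, the following is a minimal storm.yaml sketch; the hostnames, ports, and local directory are placeholders, and the full cluster setup is covered in Chapter 2, Storm Deployment, Topology Development, and Topology Options. It points both Nimbus and the supervisor daemons at the same ZooKeeper ensemble:
storm.zookeeper.servers:
  - "zk1.example.com"
  - "zk2.example.com"
  - "zk3.example.com"
storm.zookeeper.port: 2181
nimbus.seeds: ["nimbus.example.com"]
storm.local.dir: "/var/storm"
supervisor.slots.ports:
  - 6700
  - 6701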
The following is an architecture diagram of a Storm cluster:
The basic unit of data that can be processed by a Storm application is called a tuple. Each tuple consists of a predefined list of fields. The value of each field can be a byte, char, integer, long, float, double, Boolean, or byte array. Storm also provides an API to define your own datatypes, which can be serialized as fields in a tuple.
A tuple is dynamically typed, that is, you just need to define the names of the fields in a tuple and not their datatype. The choice of dynamic typing helps to simplify the API and makes it easy to use. Also, since a processing unit in Storm can process multiple types of tuples, it's not practical to declare field types.
Each of the fields in a tuple can be accessed by its name, getValueByField(String), or its positional index, getValue(int), in the tuple. Tuples also provide convenient methods such as getIntegerByField(String) that save you from typecasting the objects. For example, if you have a Fraction (numerator, denominator) tuple, representing fractional numbers, then you can get the value of the numerator by either using getIntegerByField("numerator") or getInteger(0).
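As a minimal sketch of this API, the following bolt reads the fields of the hypothetical Fraction (numerator, denominator) tuple described above, once by field name and once by positional index; only the Tuple accessors are Storm API calls, the bolt itself is purely illustrative:
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Tuple;

// Illustrative bolt that consumes Fraction(numerator, denominator) tuples
public class FractionPrinterBolt extends BaseBasicBolt {
    @Override
    public void execute(Tuple tuple, BasicOutputCollector collector) {
        // Access by field name; no typecasting required
        int numerator = tuple.getIntegerByField("numerator");
        // Access by positional index; field 1 is the denominator
        int denominator = tuple.getInteger(1);
        System.out.println(numerator + "/" + denominator);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // This bolt only prints, so it declares no output fields
    }
}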
You can see the full set of operations supported by org.apache.storm.tuple.Tuple in the Java doc that is located at https://storm.apache.org/releases/1.0.2/javadocs/org/apache/storm/tuple/Tuple.html.