Master the intricacies of Apache Storm and develop real-time stream processing applications with ease
If you are a Java developer who wants to enter into the world of real-time stream processing applications using Apache Storm, then this book is for you. No previous experience in Storm is required as this book starts from the basics. After finishing this book, you will be able to develop not-so-complex Storm applications.
Apache Storm is a real-time Big Data processing framework that processes large amounts of data reliably, guaranteeing that every message will be processed. Storm allows you to scale your data as it grows, making it an excellent platform to solve your big data problems. This extensive guide will help you understand right from the basics to the advanced topics of Storm.
The book begins with a detailed introduction to real-time processing and where Storm fits in to solve these problems. You'll get an understanding of deploying Storm on clusters by writing a basic Storm Hello World example. Next we'll introduce you to Trident and you'll get a clear understanding of how you can develop and deploy a Trident topology. We cover topics such as monitoring, Storm parallelism, scheduling, and log processing in an easy-to-understand manner. You will also learn how to integrate Storm with other well-known Big Data technologies such as HBase, Redis, Kafka, and Hadoop to realize the full potential of Storm.
With real-world examples and clear explanations, this book will ensure that you gain a thorough mastery of Apache Storm. You will be able to use this knowledge to develop efficient, distributed real-time applications to cater to your business needs.
This easy-to-follow guide is full of examples and real-world applications to help you get an in-depth understanding of Apache Storm. This book covers the basics thoroughly and also delves into the intermediate and slightly advanced concepts of application development with Apache Storm.
BIRMINGHAM - MUMBAI
Copyright © 2017 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, nor its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
First published: August 2017
Production reference: 1140817
ISBN 978-1-78712-563-6
www.packtpub.com
Author
Ankit Jain
Copy Editor
Safis Editing
Reviewers
Doug Ortiz
Oleg Okun
Project Coordinator
Nidhi Joshi
Commissioning Editor
Veena Pagare
Proofreader
Safis Editing
Acquisition Editor
Divya Poojari
Indexer
Tejal Daruwale Soni
Content Development Editor
Mayur Pawanikar
Graphics
Tania Dutta
Technical Editor
Dinesh Pawar
Production Coordinator
Arvindkumar Gupta
Ankit Jain holds a bachelor's degree in computer science and engineering. He has six years' experience in designing and architecting solutions for the big data domain and has been involved with several complex engagements. His technical strengths include Hadoop, Storm, S4, HBase, Hive, Sqoop, Flume, Elasticsearch, machine learning, Kafka, Spring, Java, and J2EE.
He also shares his thoughts on his personal blog. You can follow him on Twitter at @mynameisanky. He spends most of his time reading books and playing with different technologies. When not at work, he spends time with his family and friends watching movies and playing games.
Doug Ortiz is a senior big data architect at ByteCubed who has been architecting, developing, and integrating enterprise solutions throughout his whole career. Organizations that leverage his skill set have been able to rediscover and reuse their underutilized data via existing and emerging technologies such as the Microsoft BI stack, Hadoop, NoSQL databases, SharePoint, and related toolsets and technologies.
He is the founder of Illustris, LLC and can be reached at [email protected].
Interesting aspects of his profession:
He has experience integrating multiple platforms and products
He holds big data and data science certifications, as well as R and Python certifications
He helps organizations gain a deeper understanding of, and derive more value from, their current investments in data and existing resources, turning them into useful sources of information
He has improved, salvaged, and architected projects by utilizing unique and innovative techniques
His hobbies are yoga and scuba diving.
Oleg Okun is a machine learning expert and the author/editor of four books, numerous journal articles, and conference papers. His career spans more than a quarter of a century. He was employed in both academia and industry in his mother country, Belarus, and abroad (Finland, Sweden, and Germany). His work experience includes document image analysis, fingerprint biometrics, bioinformatics, online/offline marketing analytics, credit scoring analytics, and text analytics.
He is interested in all aspects of distributed machine learning and the Internet of Things. Oleg currently lives and works in Hamburg, Germany.
For support files and downloads related to your book, please visit www.PacktPub.com. Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com, and as a print book customer, you are entitled to a discount on the eBook copy.
Get in touch with us at [email protected] for more details. At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.
https://www.packtpub.com/mapt
Get the most in-demand software skills with Mapt. Mapt gives you full access to all Packt books and video courses, as well as industry-leading tools to help you plan your personal development and advance your career.
Fully searchable across every book published by Packt
Copy and paste, print, and bookmark content
On demand and accessible via a web browser
Thanks for purchasing this Packt book. At Packt, quality is at the heart of our editorial process. To help us improve, please leave us an honest review on this book's Amazon page at https://www.amazon.com/dp/1787125637.
If you'd like to join our team of regular reviewers, you can e-mail us at [email protected]. We award our regular reviewers with free eBooks and videos in exchange for their valuable feedback. Help us be relentless in improving our products.
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Downloading the color images of this book
Errata
Piracy
Questions
Real-Time Processing and Storm Introduction
Apache Storm
Features of Storm
Storm components
Nimbus
Supervisor nodes
The ZooKeeper cluster
The Storm data model
Definition of a Storm topology
Operation modes in Storm
Programming languages
Summary
Storm Deployment, Topology Development, and Topology Options
Storm prerequisites
Installing Java SDK 7
Deployment of the ZooKeeper cluster
Setting up the Storm cluster
Developing the hello world example
The different options of the Storm topology
Deactivate
Activate
Rebalance
Kill
Dynamic log level settings
Walkthrough of the Storm UI
Cluster Summary section
Nimbus Summary section
Supervisor Summary section
Nimbus Configuration section
Topology Summary section
Dynamic log level settings
Updating the log level from the Storm UI
Updating the log level from the Storm CLI
Summary
Storm Parallelism and Data Partitioning
Parallelism of a topology
Worker process
Executor
Task
Configure parallelism at the code level
Worker process, executor, and task distribution
Rebalance the parallelism of a topology
Rebalance the parallelism of a SampleStormClusterTopology topology
Different types of stream grouping in the Storm cluster
Shuffle grouping
Field grouping
All grouping
Global grouping
Direct grouping
Local or shuffle grouping
None grouping
Custom grouping
Guaranteed message processing
Tick tuple
Summary
Trident Introduction
Trident introduction
Understanding Trident's data model
Writing Trident functions, filters, and projections
Trident function
Trident filter
Trident projection
Trident repartitioning operations
Utilizing shuffle operation
Utilizing partitionBy operation
Utilizing global operation
Utilizing broadcast operation
Utilizing batchGlobal operation
Utilizing partition operation
Trident aggregator
partitionAggregate
aggregate
ReducerAggregator
Aggregator
CombinerAggregator
persistentAggregate
Aggregator chaining
Utilizing the groupBy operation
When to use Trident
Summary
Trident Topology and Uses
Trident groupBy operation
groupBy before partitionAggregate
groupBy before aggregate
Non-transactional topology
Trident hello world topology
Trident state
Distributed RPC
When to use Trident
Summary
Storm Scheduler
Introduction to Storm scheduler
Default scheduler
Isolation scheduler
Resource-aware scheduler
Component-level configuration
Memory usage example
CPU usage example
Worker-level configuration
Node-level configuration
Global component configuration
Custom scheduler
Configuration changes in the supervisor node
Configuration setting at component level
Writing a custom supervisor class
Converting component IDs to executors
Converting supervisors to slots
Registering a CustomScheduler class
Summary
Monitoring of Storm Cluster
Cluster statistics using the Nimbus thrift client
Fetching information with Nimbus thrift
Monitoring the Storm cluster using JMX
Monitoring the Storm cluster using Ganglia
Summary
Integration of Storm and Kafka
Introduction to Kafka
Kafka architecture
Producer
Replication
Consumer
Broker
Data retention
Installation of Kafka brokers
Setting up a single node Kafka cluster
Setting up a three node Kafka cluster
Multiple Kafka brokers on a single node
Share ZooKeeper between Storm and Kafka
Kafka producers and publishing data into Kafka
Kafka Storm integration
Deploy the Kafka topology on Storm cluster
Summary
Storm and Hadoop Integration
Introduction to Hadoop
Hadoop Common
Hadoop Distributed File System
Namenode
Datanode
HDFS client
Secondary namenode
YARN
ResourceManager (RM)
NodeManager (NM)
ApplicationMaster (AM)
Installation of Hadoop
Setting passwordless SSH
Getting the Hadoop bundle and setting up environment variables
Setting up HDFS
Setting up YARN
Write Storm topology to persist data into HDFS
Integration of Storm with Hadoop
Setting up Storm-YARN
Storm-Starter topologies on Storm-YARN
Summary
Storm Integration with Redis, Elasticsearch, and HBase
Integrating Storm with HBase
Integrating Storm with Redis
Integrating Storm with Elasticsearch
Integrating Storm with Esper
Summary
Apache Log Processing with Storm
Apache log processing elements
Producing Apache log in Kafka using Logstash
Installation of Logstash
What is Logstash?
Why are we using Logstash?
Installation of Logstash
Configuration of Logstash
Why are we using Kafka between Logstash and Storm?
Splitting the Apache log line
Identifying country, operating system type, and browser type from the log file
Calculate the search keyword
Persisting the process data
Kafka spout and define topology
Deploy topology
MySQL queries
Calculate the page hit from each country
Calculate the count for each browser
Calculate the count for each operating system
Summary
Twitter Tweet Collection and Machine Learning
Exploring machine learning
Twitter sentiment analysis
Using Kafka producer to store the tweets in a Kafka cluster
Kafka spout, sentiments bolt, and HDFS bolt
Summary
Real-time data processing is no longer a luxury exercised by a few big companies but has become a necessity for businesses that want to compete, and Apache Storm is one of the de facto standards for developing real-time processing pipelines. The key features of Storm are that it is horizontally scalable, is fault tolerant, and provides guaranteed message processing. Storm can solve various types of analytics problems: machine learning, log processing, graph analysis, and so on.
Mastering Storm will serve both as a getting-started guide for inexperienced developers and as a reference for implementing advanced use cases with Storm for experienced developers. In the first two chapters, you will learn the basics of a Storm topology and the various components of a Storm cluster. In the later chapters, you will learn how to build a Storm application that can interact with various other big data technologies and how to create transactional topologies. Finally, the last two chapters cover case studies for log processing and machine learning. We also cover how you can use the Storm scheduler to assign dedicated work to dedicated machines.
Chapter 1, Real-Time Processing and Storm Introduction, gives an introduction to Storm and its components.
Chapter 2, Storm Deployment, Topology Development, and Topology Options, covers deploying Storm into the cluster, deploying the sample topology on a Storm cluster, how we can monitor the Storm pipeline using the Storm UI, and how we can dynamically change the log level settings.
Chapter 3, Storm Parallelism and Data Partitioning, covers the parallelism of topology, how to configure parallelism at the code level, guaranteed message processing, and Storm internally generated tuples.
Chapter 4, Trident Introduction, covers an introduction to Trident, an understanding of the Trident data model, and how we can write Trident filters and functions. This chapter also covers repartitioning and aggregation operations on Trident tuples.
Chapter 5, Trident Topology and Uses, introduces Trident tuple grouping, non-transactional topology, and a sample Trident topology. The chapter also introduces Trident state and distributed RPC.
Chapter 6, Storm Scheduler, covers different types of scheduler available in Storm: the default scheduler, isolation scheduler, resource-aware scheduler, and custom scheduler.
Chapter 7, Monitoring of the Storm Cluster, covers monitoring Storm by writing custom monitoring UIs using the stats published by Nimbus. We explain the integration of Ganglia with Storm using JMXTrans. This chapter also covers how we can configure Storm to publish JMX metrics.
Chapter 8, Integration of Storm and Kafka, shows the integration of Storm with Kafka. This chapter starts with an introduction to Kafka, covers the installation of Kafka, and ends with the integration of Storm with Kafka to solve a real-world problem.
Chapter 9, Storm and Hadoop Integration, covers an overview of Hadoop, writing the Storm topology to publish data into HDFS, an overview of Storm-YARN, and deploying the Storm topology on YARN.
Chapter 10, Storm Integration with Redis, Elasticsearch, and HBase, teaches you how to integrate Storm with various other big data technologies.
Chapter 11, Apache Log Processing with Storm, covers a sample log processing application in which we parse Apache web server logs and generate some business information from log files.
Chapter 12, Twitter Tweets Collection and Machine Learning, walks you through a case study implementing a machine learning topology in Storm.
All of the code in this book has been tested on CentOS 6.5. It will run on other variants of Linux and Windows as well with appropriate changes in commands.
We have tried to keep the chapters self-contained, and the setup and installation of all the software used in each chapter are included in the chapter itself. These are the software packages used throughout the book:
CentOS 6.5
Oracle JDK 8
Apache ZooKeeper 3.4.6
Apache Storm 1.0.2
Eclipse or Spring Tool Suite
Elasticsearch 2.4.4
Hadoop 2.2.2
Logstash 5.4.1
Kafka 0.9.0.1
Esper 5.3.0
If you are a Java developer and are keen to enter into the world of real-time stream processing applications using Apache Storm, then this book is for you. No previous experience in Storm is required as this book starts from the basics. After finishing this book, you will be able to develop not-so-complex Storm applications.
In this book, you will find a number of text styles that distinguish between different kinds of information. Here are some examples of these styles and an explanation of their meaning. Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: "Add the following line in the storm.yaml file of the Nimbus machine to enable JMX on the Nimbus node."
A block of code is set as follows:
<dependency>
  <groupId>org.apache.storm</groupId>
  <artifactId>storm-core</artifactId>
  <version>1.0.2</version>
  <scope>provided</scope>
</dependency>
Any command-line input or output is written as follows:
cd $ZK_HOME/conf
touch zoo.cfg
New terms and important words are shown in bold. Words that you see on the screen, for example, in menus or dialog boxes, appear in the text like this: "Now, click on the Connect button to view the metrics of the supervisor node."
Feedback from our readers is always welcome. Let us know what you think about this book-what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of. To send us general feedback, simply e-mail [email protected], and mention the book's title in the subject of your message. If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.
Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.
You can download the example code files for this book from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support, and register to have the files e-mailed directly to you. You can download the code files by following these steps:
Log in or register to our website using your e-mail address and password.
Hover the mouse pointer on the SUPPORT tab at the top.
Click on Code Downloads & Errata.
Enter the name of the book in the Search box.
Select the book for which you're looking to download the code files.
Choose from the drop-down menu where you purchased this book from.
Click on Code Download.
Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:
WinRAR / 7-Zip for Windows
Zipeg / iZip / UnRarX for Mac
7-Zip / PeaZip for Linux
The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Mastering-Apache-Storm. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!
We also provide you with a PDF file that has color images of the screenshots/diagrams used in this book. The color images will help you better understand the changes in the output. You can download this file from https://www.packtpub.com/sites/default/files/downloads/MasteringApacheStorm_ColorImages.pdf.
Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books-maybe a mistake in the text or the code-we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title. To view the previously submitted errata, go to https://www.packtpub.com/books/content/support, and enter the name of the book in the search field. The required information will appear under the Errata section.
Piracy of copyrighted material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy. Please contact us at [email protected] with a link to the suspected pirated material. We appreciate your help in protecting our authors and our ability to bring you valuable content.
If you have a problem with any aspect of this book, you can contact us at [email protected], and we will do our best to address the problem.
With the exponential growth in the amount of data being generated and advanced data-capturing capabilities, enterprises are facing the challenge of making sense out of this mountain of raw data. On the batch processing front, Hadoop has emerged as the go-to framework to deal with big data. Until recently, there has been a void when one looks for frameworks to build real-time stream processing applications. Such applications have become an integral part of a lot of businesses as they enable them to respond swiftly to events and adapt to changing situations. Examples of this are monitoring social media to analyze public response to any new product that you launch and predicting the outcome of an election based on the sentiments of election-related posts.
Organizations are collecting a large volume of data from external sources and want to evaluate/process the data in real time to get market trends, detect fraud, identify user behavior, and so on. The need for real-time processing is increasing day by day and we require a real-time system/platform that should support the following features:
Scalable: The platform should be horizontally scalable without any downtime.
Fault tolerance: The platform should be able to process the data even after some of the nodes in a cluster go down.
No data loss: The platform should provide guaranteed processing of messages.
High throughput: The system should be able to support millions of records per second and messages of any size.
Easy to operate: The system should have easy installation and operation. Also, the expansion of the cluster should be an easy process.
Multiple languages: The platform should support multiple languages. The end user should be able to write code in different languages, for example Python, Scala, or Java. Also, it should be possible to run code written in different languages inside one cluster.
Cluster isolation: The system should support isolation so that dedicated processes can be assigned to dedicated machines for processing.
Apache Storm has emerged as the platform of choice for industry leaders to develop distributed, real-time, data processing platforms. It provides a set of primitives that can be used to develop applications that can process a very large amount of data in real time in a highly scalable manner.
Storm is to real-time processing what Hadoop is to batch processing. It is open source software, managed by the Apache Software Foundation. It has been deployed to meet real-time processing needs by companies such as Twitter, Yahoo!, and Flipboard. Storm was first developed by Nathan Marz at BackType, a company that provided social search applications. Later, BackType was acquired by Twitter, and Storm became a critical part of their infrastructure. Storm can be used for the following use cases:
Stream processing: Storm is used to process streams of data and update a variety of databases in real time. This processing occurs in real time and the processing speed needs to match the input data speed.
Continuous computation: Storm can do continuous computation on data streams and stream the results to clients in real time. This might require processing each message as it comes in or creating small batches over a short time. An example of continuous computation is streaming trending topics on Twitter into browsers.
Distributed RPC: Storm can parallelize an intense query so that you can compute it in real time.
Real-time analytics: Storm can analyze and respond to data that comes from different data sources as it arrives, in real time.
In this chapter, we will cover the following topics:
What is Storm?
Features of Storm
Architecture and components of a Storm cluster
Terminologies of Storm
Programming language
Operation modes
The following are some of the features of Storm that make it a perfect solution to process streams of data in real time:
Fast: Storm has been reported to process up to 1 million tuples/records per second per node.
Horizontally scalable: Being fast is a necessary feature to build a high volume/velocity data processing platform, but a single node will have an upper limit on the number of events that it can process per second. A node represents a single machine in your setup that executes Storm applications. Storm, being a distributed platform, allows you to add more nodes to your Storm cluster and increase the processing capacity of your application. Also, it is linearly scalable, which means that you can double the processing capacity by doubling the nodes.
Fault tolerant: Units of work are executed by worker processes in a Storm cluster. When a worker dies, Storm will restart that worker, and if the node on which the worker is running dies, Storm will restart that worker on some other node in the cluster. This feature will be covered in more detail in Chapter 3, Storm Parallelism and Data Partitioning.
Guaranteed data processing: Storm provides strong guarantees that each message entering a Storm process will be processed at least once. In the event of failures, Storm will replay the lost tuples/records. Also, it can be configured so that each message will be processed only once.
Easy to operate: Storm is simple to deploy and manage. Once the cluster is deployed, it requires little maintenance.
Programming language agnostic: Even though the Storm platform runs on the Java virtual machine (JVM), the applications that run over it can be written in any programming language that can read and write to standard input and output streams.
A Storm cluster follows a master-slave model where the master and slave processes are coordinated through ZooKeeper. The following are the components of a Storm cluster.
The Nimbus node is the master in a Storm cluster. It is responsible for distributing the application code across various worker nodes, assigning tasks to different machines, monitoring tasks for any failures, and restarting them as and when required.
Nimbus is stateless and stores all of its data in ZooKeeper. There is a single active Nimbus node in a Storm cluster, and if it goes down, a passive node takes over as the active node. Nimbus is designed to be fail-fast, so when the active Nimbus dies, a passive node becomes active, or the down node can be restarted without having any effect on the tasks already running on the worker nodes. This is unlike Hadoop, where if the JobTracker dies, all the running jobs are left in an inconsistent state and need to be executed again. Storm workers can keep working smoothly even if all the Nimbus nodes go down, but users cannot submit new jobs to the cluster, and the cluster will not be able to reassign failed workers to other nodes.
Supervisor nodes are the worker nodes in a Storm cluster. Each supervisor node runs a supervisor daemon that is responsible for creating, starting, and stopping worker processes to execute the tasks assigned to that node. Like Nimbus, a supervisor daemon is also fail-fast and stores all of its states in ZooKeeper so that it can be restarted without any state loss. A single supervisor daemon normally handles multiple worker processes running on that machine.
In any distributed application, various processes need to coordinate with each other and share some configuration information. ZooKeeper is an application that provides all these services in a reliable manner. As a distributed application, Storm also uses a ZooKeeper cluster to coordinate various processes. All of the states associated with the cluster and the various tasks submitted to Storm are stored in ZooKeeper. Nimbus and supervisor nodes do not communicate directly with each other, but through ZooKeeper. As all data is stored in ZooKeeper, both Nimbus and the supervisor daemons can be killed abruptly without adversely affecting the cluster.
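As a rough illustration of how this coordination is wired up, the following is a minimal storm.yaml sketch; the hostnames, ports, and local directory are placeholders, and the full cluster setup is covered in Chapter 2, Storm Deployment, Topology Development, and Topology Options. It points both Nimbus and the supervisor daemons at the same ZooKeeper ensemble:
storm.zookeeper.servers:
  - "zk1.example.com"
  - "zk2.example.com"
  - "zk3.example.com"
storm.zookeeper.port: 2181
nimbus.seeds: ["nimbus.example.com"]
storm.local.dir: "/var/storm"
supervisor.slots.ports:
  - 6700
  - 6701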
The following is an architecture diagram of a Storm cluster:
The basic unit of data that can be processed by a Storm application is called a tuple. Each tuple consists of a predefined list of fields. The value of each field can be a byte, char, integer, long, float, double, Boolean, or byte array. Storm also provides an API to define your own datatypes, which can be serialized as fields in a tuple.
A tuple is dynamically typed, that is, you just need to define the names of the fields in a tuple and not their datatype. The choice of dynamic typing helps to simplify the API and makes it easy to use. Also, since a processing unit in Storm can process multiple types of tuples, it's not practical to declare field types.
Each of the fields in a tuple can be accessed by its name, getValueByField(String), or its positional index, getValue(int), in the tuple. Tuples also provide convenient methods such as getIntegerByField(String) that save you from typecasting the objects. For example, if you have a Fraction (numerator, denominator) tuple, representing fractional numbers, then you can get the value of the numerator by either using getIntegerByField("numerator") or getInteger(0).
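As a minimal sketch of this API, the following bolt reads the fields of the hypothetical Fraction (numerator, denominator) tuple described above, once by field name and once by positional index; only the Tuple accessors are Storm API calls, the bolt itself is purely illustrative:
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Tuple;

// Illustrative bolt that consumes Fraction(numerator, denominator) tuples
public class FractionPrinterBolt extends BaseBasicBolt {
    @Override
    public void execute(Tuple tuple, BasicOutputCollector collector) {
        // Access by field name; no typecasting required
        int numerator = tuple.getIntegerByField("numerator");
        // Access by positional index; field 1 is the denominator
        int denominator = tuple.getInteger(1);
        System.out.println(numerator + "/" + denominator);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // This bolt only prints, so it declares no output fields
    }
}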
You can see the full set of operations supported by org.apache.storm.tuple.Tuple in the Java doc that is located at https://storm.apache.org/releases/1.0.2/javadocs/org/apache/storm/tuple/Tuple.html.