Perform advanced data manipulation tasks using pandas and become an expert data analyst.
Key Features
Book Description
pandas is a popular Python library used by data scientists and analysts worldwide to manipulate and analyze their data. This book presents useful data manipulation techniques in pandas to perform complex data analysis in various domains.
An update to our highly successful previous edition with new features, examples, updated code, and more, this book is an in-depth guide to getting the most out of pandas for data analysis. Designed for both intermediate users and seasoned practitioners, it teaches you advanced data manipulation techniques, such as multi-indexing, modifying data structures, and sampling your data, which allow for powerful analysis and help you gain accurate insights from your data. With the help of this book, you will apply pandas to different domains, such as Bayesian statistics, predictive analytics, and time series analysis, using an example-based approach. And not just that; you will also learn how to prepare powerful, interactive business reports in pandas using Jupyter Notebook.
By the end of this book, you will learn how to perform efficient data analysis using pandas on complex data, and become an expert data analyst or data scientist in the process.
What you will learn
Who this book is for
This book is for data scientists, analysts and Python developers who wish to explore advanced data analysis and scientific computing techniques using pandas. Some fundamental understanding of Python programming and familiarity with the basic data analysis concepts is all you need to get started with this book.
Page count: 545
Year of publication: 2019
Copyright © 2019 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
Commissioning Editor: Sunith Shetty
Acquisition Editor: Amey Varangaonkar
Content Development Editor: Roshan Kumar
Senior Editor: Ayaan Hoda
Technical Editor: Utkarsha S. Kadam
Copy Editor: Safis Editing
Project Coordinator: Kirti Pisat
Proofreader: Safis Editing
Indexer: Tejal Daruwale Soni
Production Designer: Deepika Naik
First published: June 2015
Second edition: October 2019
Production reference: 1251019
Published by Packt Publishing Ltd. Livery Place 35 Livery Street Birmingham B3 2PB, UK.
ISBN 978-1-78934-323-6
www.packt.com
Packt.com
Subscribe to our online digital library for full access to over 7,000 books and videos, as well as industry-leading tools to help you plan your personal development and advance your career. For more information, please visit our website.
Spend less time learning and more time coding with practical eBooks and Videos from over 4,000 industry professionals
Improve your learning with Skill Plans built especially for you
Get a free eBook or video every month
Fully searchable for easy access to vital information
Copy and paste, print, and bookmark content
Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.packt.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.
At www.packt.com, you can also read a collection of free technical articles, sign up for a range of free newsletters, and receive exclusive discounts and offers on Packt books and eBooks.
Ashish Kumar is a seasoned data science professional, a published author, and a thought leader in the field of data science and machine learning. An IIT Madras graduate and a Young India Fellow, he has around 7 years of experience in implementing and deploying data science and machine learning solutions for challenging industry problems in both hands-on and leadership roles. Natural language processing, IoT analytics, R Shiny product development, and ensemble ML methods are his core areas of expertise. He is fluent in Python and R and teaches a popular ML course at Simplilearn. When not crunching data, Ashish sneaks off to the nearest hip beach and enjoys the company of his Kindle. He also trains and mentors data science aspirants and fledgling start-ups.
Jamshaid Sohail is a data scientist who is highly passionate about data science, machine learning, deep learning, big data, and other related fields. He spends his free time learning more about the data science field and its emerging tools and technologies, and he is always looking for new ways to share his knowledge with others and add value to their lives. He attended Cambridge University for a summer course in computer science, where he studied under great professors and would like to impart that knowledge to others. He has extensive experience as a data scientist in a US-based company. In short, he is extremely delighted to educate and share knowledge with other people.
If you're interested in becoming an author for Packt, please visit authors.packtpub.com and apply today. We have worked with thousands of developers and tech professionals, just like you, to help them share their insight with the global tech community. You can make a general application, apply for a specific hot topic that we are recruiting an author for, or submit your own idea.
Title Page
Copyright and Credits
Mastering pandas Second Edition
About Packt
Why subscribe?
Contributors
About the author
About the reviewer
Packt is searching for authors like you
Preface
Who this book is for
What this book covers
To get the most out of this book
Download the example code files
Download the color images
Conventions used
Get in touch
Reviews
Section 1: Overview of Data Analysis and pandas
Introduction to pandas and Data Analysis
Motivation for data analysis
We live in a big data world
The four V's of big data
Volume of big data
Velocity of big data
Variety of big data
Veracity of big data
So much data, so little time for analysis
The move towards real-time analytics
Data analytics pipeline
How Python and pandas fit into the data analytics pipeline
What is pandas?
Where does pandas fit in the pipeline?
Benefits of using pandas
History of pandas
Usage pattern and adoption of pandas
pandas on the technology adoption curve
Popular applications of pandas
Summary
References
Installation of pandas and Supporting Software
Selecting a version of Python to use
Standalone Python installation
Linux
Installing Python from a compressed tarball
Windows
Core Python installation
Installing third-party Python and packages 
macOS/X
Installation using a package manager
Installation of Python and pandas using Anaconda
What is Anaconda?
Why Anaconda?
Installing Anaconda
Windows Installation
macOS Installation
Linux Installation
Cloud installation
Other numeric and analytics-focused Python distributions
Dependency packages for pandas
Review of items installed with Anaconda
JupyterLab
GlueViz
Walk-through of Jupyter Notebook and Spyder
Jupyter Notebook
Spyder
Cross tooling – combining pandas awesomeness with R, Julia, H2O.ai, and Azure ML Studio
pandas with R
pandas with Azure ML Studio
pandas with Julia
pandas with H2O
Command line tricks for pandas
Options and settings for pandas
Summary
Further reading
Section 2: Data Structures and I/O in pandas
Using NumPy and Data Structures with pandas
NumPy ndarrays
NumPy array creation
Array of ones and zeros
Array based on a numerical range
Random and empty arrays
Arrays based on existing arrays
NumPy data types
NumPy indexing and slicing
Array slicing
Array masking
Complex indexing
Copies and views
Operations
Basic operators
Mathematical operators
Statistical operators
Logical operators
Broadcasting
Array shape manipulation
Reshaping
Transposing
Ravel
Adding a new axis
Basic linear algebra operations
Array sorting
Implementing neural networks with NumPy
Practical applications of multidimensional arrays
Selecting only one channel
Selecting the region of interest of an image
Multiple channel selection and suppressing other channels
Data structures in pandas
Series
Series creation
Using an ndarray
Using a Python dictionary
Using a scalar value
Operations on Series
Assignment
Slicing
Other operations
DataFrames
DataFrame creation
Using a dictionary of Series
Using a dictionary of ndarrays/lists
Using a structured array
Using a list of dictionaries
Using a dictionary of tuples for multilevel indexing
Using a Series
Operations on pandas DataFrames
Column selection
Adding a new column
Deleting columns
Alignment of DataFrames
Other mathematical operations
Panels
Using a 3D NumPy array with axis labels
Using a Python dictionary of DataFrame objects
Using the DataFrame.to_panel method
Other operations
Summary
References
I/Os of Different Data Formats with pandas
Data sources and pandas methods
CSV and TXT
Reading CSV and TXT files
Reading a CSV file
Specifying column names for a dataset
Reading from a string of data
Skipping certain rows
Row index
Reading a text file
Subsetting while reading
Reading thousand format numbers as numbers
Indexing and multi-indexing
Reading large files in chunks
Handling delimiter characters in column data
Writing to a CSV
Excel
URL and S3
HTML
Writing to an HTML file
JSON
Writing a JSON to a file
Reading a JSON
Writing JSON to a DataFrame
Subsetting a JSON
Looping over JSON keys
Reading HDF formats
Reading feather files
Reading parquet files
Reading a SQL file
Reading a SAS/Stata file
Reading from Google BigQuery
Reading from a clipboard
Managing sparse data
Writing JSON objects to a file
Serialization/deserialization
Writing to exotic file types
to_pickle()
to_parquet()
to_hdf()
to_sql()
to_feather()
to_html()
to_msgpack()
to_latex()
to_stata()
to_clipboard()
GeoPandas
What is geospatial data?
Installation and dependencies
Working with GeoPandas
GeoDataFrames
Open source APIs – Quandl
read_sql_query
pandas plotting
Andrews curves
Parallel plot
Radviz plots
Scatter matrix plot
Lag plot
Bootstrap plot
pandas-datareader
Yahoo Finance
World Bank
Summary
Section 3: Mastering Different Data Operations in pandas
Indexing and Selecting in pandas
Basic indexing
Accessing attributes using the dot operator
Range slicing
Labels, integer, and mixed indexing
Label-oriented indexing
Integer-oriented indexing
The .iat and .at operators
Mixed indexing with the .ix operator
Multi-indexing
Swapping and re-ordering levels
Cross-sections
Boolean indexing
The isin, any, and all methods
Using the where() method
Operations on indexes
Summary
Grouping, Merging, and Reshaping Data in pandas
Grouping data
The groupby operation
Using groupby with a MultiIndex
Using the aggregate method
Applying multiple functions
The transform() method
Filtering
Merging and joining
The concat function
Using append
Appending a single row to a DataFrame
SQL-like merging/joining of DataFrame objects
The join function
Pivots and reshaping data
Stacking and unstacking
The stack() function
The unstack() function
Other methods for reshaping DataFrames
Using the melt function
The pandas.get_dummies() function
pivot table
Transpose in pandas
Squeeze
nsmallest and nlargest
Summary
Special Data Operations in pandas
Writing and applying one-liner custom functions
lambda and apply
Handling missing values
Sources of missing values
Data extraction 
Data collection 
Data missing at random 
Data not missing at random 
Different types of missing values
Miscellaneous analysis of missing values
Strategies for handling missing values
Deletion 
Imputation
Interpolation 
KNN 
A survey of methods on series
The items() method
The keys() method
The pop() method
The apply() method
The map() method
The drop() method
The equals() method
The sample() method
The ravel() function
The value_counts() function
The interpolate() function
The align() function
pandas string methods
upper(), lower(), capitalize(), title(), and swapcase()
contains(), find(), and replace()
strip() and split()
startswith() and endswith()
The is...() functions
Binary operations on DataFrames and series
Binning values
Using mathematical methods on DataFrames
The abs() function
corr() and cov()
cummax(), cummin(), cumsum(), and cumprod()
The describe() function
The diff() function
The rank() function
The quantile() function
The round() function
The pct_change() function
min(), max(), median(), mean(), and mode()
all() and any()
The clip() function
The count() function
Summary
Time Series and Plotting Using Matplotlib
Handling time series data
Reading in time series data
Assigning date indexes and subsetting in time series data
Plotting the time series data
Resampling and rolling of the time series data
Separating timestamp components
DateOffset and TimeDelta objects
Time series-related instance methods
Shifting/lagging
Frequency conversion
Resampling of data
Aliases for time series frequencies
Time series concepts and datatypes
Period and PeriodIndex
PeriodIndex
Conversion between time series datatypes
A summary of time series-related objects
Interconversions between strings and timestamps
Data-processing techniques for time series data
Data transformation
Plotting using matplotlib
Summary
Section 4: Going a Step Beyond with pandas
Making Powerful Reports in Jupyter Using pandas
pandas styling
In-built styling options
User-defined styling options
Navigating Jupyter Notebook
Exploring the menu bar of Jupyter Notebook
Edit mode and command mode
Mouse navigation
Jupyter Notebook Dashboard
ipywidgets
Interactive visualizations
Writing mathematical equations in Jupyter Notebook
Formatting text in Jupyter Notebook
Headers
Bold and italics
Alignment
Font color
Bulleted lists
Tables
HTML
Citation
Miscellaneous operations in Jupyter Notebook
Loading an image
Hyperlinks
Writing to a Python file
Running a Python file
Loading a Python file
Internal Links
Sharing Jupyter Notebook reports
Using NbViewer
Using the browser
Using Jupyter Hub
Summary
A Tour of Statistics with pandas and NumPy
Descriptive statistics versus inferential statistics
Measures of central tendency and variability
Measures of central tendency
The mean
The median
The mode
Computing the measures of central tendency of a dataset in Python
Measures of variability, dispersion, or spread
Range
Quartile
Deviation and variance
Hypothesis testing – the null and alternative hypotheses
The null and alternative hypotheses
The alpha and p-values
Type I and Type II errors
Statistical hypothesis tests
Background
The z-test
The t-test
Types of t-tests
A t-test example
chi-square test
ANOVA test
Confidence intervals
An illustrative example
Correlation and linear regression
Correlation
Linear regression
An illustrative example
Summary
A Brief Tour of Bayesian Statistics and Maximum Likelihood Estimates
Introduction to Bayesian statistics
The mathematical framework for Bayesian statistics
Bayes' theory and odds
Applications of Bayesian statistics
Probability distributions
Fitting a distribution
Discrete probability distributions
Discrete uniform distribution
The Bernoulli distribution
The binomial distribution
The Poisson distribution
The geometric distribution
The negative binomial distribution
Continuous probability distributions
The continuous uniform distribution
The exponential distribution
The normal distribution
Bayesian statistics versus frequentist statistics
What is probability?
How the model is defined
Confidence (frequentist) versus credible (Bayesian) intervals
Conducting Bayesian statistical analysis
Monte Carlo estimation of the likelihood function and PyMC
Bayesian analysis example – switchpoint detection
Maximum likelihood estimate
MLE calculation examples
Uniform distribution
Poisson distribution
References
Summary
Data Case Studies Using pandas
End-to-end exploratory data analysis
Data overview
Feature selection
Feature extraction
Data aggregation
Web scraping with Python
Web scraping using pandas
Web scraping using BeautifulSoup
Data validation
Data overview
Structured databases versus unstructured databases
Validating data types
Validating dimensions
Validating individual entries
Using pandas indexing
Using loops
Summary
The pandas Library Architecture
Understanding the pandas file hierarchy
Description of pandas modules and files
pandas/core
pandas/io
pandas/tools
pandas/util
pandas/tests
pandas/compat
pandas/computation
pandas/plotting
pandas/tseries
Improving performance using Python extensions
Summary
pandas Compared with Other Tools
Comparison with R
Data types in R
R lists
R DataFrames
Slicing and selection
Comparing R-matrix and NumPy array
Comparing R lists and pandas series
Specifying a column name in R
Specifying a column name in pandas
R DataFrames versus pandas DataFrames
Multi-column selection in R
Multi-column selection in pandas
Arithmetic operations on columns
Aggregation and GroupBy
Aggregation in R
The pandas GroupBy operator
Comparing matching operators in R and pandas
R %in% operator
pandas isin() function
Logical subsetting
Logical subsetting in R
Logical subsetting in pandas
Split-apply-combine
Implementation in R
Implementation in pandas
Reshaping using melt
R melt function
The pandas melt function
Categorical data
R example using cut()
The pandas solution
Comparison with SQL
SELECT
SQL
pandas
Where
SQL
pandas
SQL
pandas
SQL
pandas
group by
SQL
pandas
SQL
pandas
SQL
pandas
update
SQL
pandas
delete
SQL
pandas
JOIN
SQL
pandas
SQL
pandas
SQL
pandas
Comparison with SAS
Summary
A Brief Tour of Machine Learning
The role of pandas in machine learning
Installation of scikit-learn
Installing via Anaconda
Installing on Unix (Linux/macOS)
Installing on Windows
Introduction to machine learning
Supervised versus unsupervised learning
Illustration using document classification
Supervised learning
Unsupervised learning
How machine learning systems learn
Application of machine learning – Kaggle Titanic competition
The Titanic: Machine Learning from Disaster problem
The problem of overfitting
Data analysis and preprocessing using pandas
Examining the data
Handling missing values
A naive approach to the Titanic problem
The scikit-learn ML/classifier interface
Supervised learning algorithms
Constructing a model using Patsy for scikit-learn
General boilerplate code explanation
Logistic regression
Support vector machine
Decision trees
Random forest
Unsupervised learning algorithms
Dimensionality reduction
K-means clustering
XGBoost case study
Entropy
Summary
Other Books You May Enjoy
Leave a review - let other readers know what you think
pandas is a popular Python library used by data scientists and analysts worldwide to manipulate and analyze their data. This book presents useful data manipulation techniques in pandas for performing complex data analysis in various domains. It provides features and capabilities that make data analysis much easier and faster than with many other popular languages, such as Java, C, C++, and Ruby.
This book is for data scientists, analysts, and Python developers who wish to explore advanced data analysis and scientific computing techniques using pandas. Some fundamental understanding of Python programming and familiarity with basic data analysis concepts is all you need to get started with this book.
Chapter 1, Introduction to pandas and Data Analysis, will introduce pandas and explain where it fits in the data analysis pipeline. We will also look into some of the popular applications of pandas and how Python and pandas can be used for data analysis.
Chapter 2, Installation of pandas and Supporting Software, will deal with the installation of Python (if necessary), the pandas library, and all necessary dependencies for the Windows, macOS X, and Linux platforms. We will also look into the command-line tricks and options and settings for pandas as well.
Chapter 3, Using NumPy and Data Structures with pandas, will give a quick tour of the power of NumPy and provide a glimpse of how it makes life easier when working with pandas. We will also be implementing a neural network with NumPy and exploring some of the practical applications of multi-dimensional arrays.
Chapter 4, I/O of Different Data Formats with pandas, will teach you how to read and write commonplace formats, such as comma-separated value (CSV), with all the options, as well as more exotic file formats, such as URL, JSON, and XML. We will also create files in those formats from data objects and create niche plots from within pandas.
Chapter 5, Indexing and Selecting in pandas, will show you how to access and select data from pandas data structures. We will look in detail at basic indexing, label indexing, integer indexing, mixed indexing, and the operation of indexes.
Chapter 6, Grouping, Merging, and Reshaping Data in pandas, will examine the various functions that enable us to rearrange data, by having you utilize such functions on real-world datasets. We will also learn about grouping, merging, and reshaping data.
Chapter 7, Special Data Operations in pandas, will discuss and elaborate on the methods, syntax, and usage of some of the special data operations in pandas.
Chapter 8, Time Series and Plotting Using Matplotlib, will look at how to handle time series and dates. We will also take a tour of some topics that are necessary for you to know about in order to develop your expertise in using pandas.
Chapter 9, Making Powerful Reports Using pandas in Jupyter, will look into the application of a range of styling, as well as the formatting options that pandas has. We will also learn how to create dashboards and reports in the Jupyter Notebook.
Chapter 10, A Tour of Statistics with pandas and NumPy, will delve into how pandas, together with supporting statistical packages, can be used to perform statistical calculations.
Chapter 11, A Brief Tour of Bayesian Statistics and Maximum Likelihood Estimates, will examine an alternative approach to statistics, which is the Bayesian approach. We will also look into the key statistical distributions and see how we can use various statistical packages to generate and plot distributions in matplotlib.
Chapter 12, Data Case Studies Using pandas, will discuss how we can solve real-life data case studies using pandas. We will look into web scraping with Python and data validation as well.
Chapter 13, The pandas Library Architecture, will discuss the architecture and code structure of the pandas library. This chapter will also briefly demonstrate how you can improve performance using Python extensions.
Chapter 14, pandas Compared with Other Tools, will focus on comparing pandas, with R and other tools such as SQL and SAS. We will also look into slicing and selection as well.
Chapter 15, A Brief Tour of Machine Learning, will conclude the book by giving a brief introduction to the scikit-learn library for doing machine learning and show how pandas fits within that framework.
The following software will be used while we execute the code:
Windows/macOS/Linux
Python 3.6
pandas
IPython
R
scikit-learn
For hardware, there are no specific requirements. Python and pandas can run on a Mac, Linux, or Windows machine.
You can download the example code files for this book from your account at www.packt.com. If you purchased this book elsewhere, you can visit www.packt.com/support and register to have the files emailed directly to you.
You can download the code files by following these steps:
Log in or register at www.packt.com.
Select the SUPPORT tab.
Click on Code Downloads & Errata.
Enter the name of the book in the Search box and follow the onscreen instructions.
Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:
WinRAR/7-Zip for Windows
Zipeg/iZip/UnRarX for Mac
7-Zip/PeaZip for Linux
The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Mastering-Pandas-Second-Edition. In case there's an update to the code, it will be updated on the existing GitHub repository.
We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!
We also provide a PDF file that has color images of the screenshots/diagrams used in this book. You can download it here: https://static.packt-cdn.com/downloads/9781789343236_ColorImages.pdf.
There are a number of text conventions used throughout this book.
CodeInText: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: "Python has a built-in array module to create arrays."
A block of code is set as follows:
source_python("titanic.py")titanic_in_r <- get_data_head("titanic.csv")
Any command-line input or output is written as follows:
python --version
Bold: Indicates a new term, an important word, or words that you see onscreen. For example, words in menus or dialog boxes appear in the text like this. Here is an example: "Any notebooks in other directories could be transferred to the current working directory of the Jupyter Notebook through the Upload option."
Feedback from our readers is always welcome.
General feedback: If you have questions about any aspect of this book, mention the book title in the subject of your message and email us at [email protected].
Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packt.com/submit-errata, select your book, click on the Errata Submission Form link, and enter the details.
Piracy: If you come across any illegal copies of our works in any form on the Internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.
If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.
Please leave a review. Once you have read and used this book, why not leave a review on the site that you purchased it from? Potential readers can then see and use your unbiased opinion to make purchase decisions, we at Packt can understand what you think about our products, and our authors can see your feedback on their book. Thank you!
For more information about Packt, please visit packt.com.
In this section, we give you a quick overview of the concepts of the data analysis process and where pandas fits into that picture. You will also learn how to install and set up the pandas library, along with the other supporting libraries and environments required to build an enterprise-grade data analysis pipeline.
This section comprises the following chapters:
Chapter 1, Introduction to pandas and Data Analysis
Chapter 2, Installation of pandas and Supporting Software
We start the book and this chapter by discussing the contemporary data analytics landscape and how pandas fits into that landscape. pandas is the go-to tool for data scientists for data pre-processing tasks. We will learn about the technicalities of pandas in the later chapters. This chapter covers the context, origin, history, market share, and current standing of pandas.
The chapter has been divided into the following sections:
Motivation for data analysis
How Python and pandas can be used for data analysis
Description of the pandas library
Benefits of using pandas
In this section, we discuss the trends that are making data analysis an increasingly important field in today's fast-moving technological landscape.
The term big data has become one of the hottest technology buzzwords in the past two years. We now increasingly hear about big data in various media outlets, and big data start-ups have increasingly been attracting venture capital. A good example in the area of retail is Target Corporation, which has invested substantially in big data and is now able to identify potential customers by using big data to analyze people's shopping habits online; refer to a related article at http://nyti.ms/19LT8ic.
Loosely speaking, big data refers to the phenomenon wherein the amount of data exceeds the capability of the recipients of the data to process it. Here is an article on big data that sums it up nicely: https://www.oracle.com/in/big-data/guide/what-is-big-data.html.
A good way to start thinking about the complexities of big data is through its four dimensions, or the four V's of big data. This model was first introduced as the three V's by Gartner analyst Doug Laney in 2001. The three V's stood for Volume, Velocity, and Variety, and the fourth V, Veracity, was added later by IBM. Gartner's official definition of big data is framed in terms of these dimensions.
The volume of data in the big data age is simply mind-boggling. According to IBM, by 2020, the total amount of data on the planet will have ballooned to 40 zettabytes. You heard that right! 40 zettabytes is 43 trillion gigabytes. For more information on this, refer to the Wikipedia page on the zettabyte: http://en.wikipedia.org/wiki/Zettabyte.
To get a handle on how much data this is, consider an EMC press release published in 2010, which described what 1 zettabyte is approximately equal to.
The growth rate of data has been fuelled largely by a few factors, such as the following:
The rapid growth of the internet.
The conversion from analog to digital media, coupled with an increased ability to capture and store data, which in turn has been made possible with cheaper and better storage technology. There has been a proliferation of digital data input devices, such as cameras and wearables, and the cost of huge data storage has fallen rapidly. Amazon Web Services is a prime example of the trend toward much cheaper storage.
The internetification of devices, or rather the Internet of Things, is the phenomenon wherein common household devices, such as our refrigerators and cars, will be connected to the internet. This phenomenon will only accelerate the above trend.
From a purely technological point of view, velocity refers to the throughput of big data, or how fast the data is coming in and is being processed. This has ramifications on how fast the recipient of the data needs to process it to keep up. Real-time analytics is one attempt to handle this characteristic. Tools that can enable this include Amazon Web Services Elastic MapReduce.
At a more macro level, the velocity of data can also be regarded as the increased speed at which data and information can now be transferred and processed faster and at greater distances than ever before.
The proliferation of high-speed data and communication networks coupled with the advent of cell phones, tablets, and other connected devices are primary factors driving information velocity. Some measures of velocity include the number of tweets per second and the number of emails per minute.
The variety of big data comes from having a multiplicity of data sources that generate data and the different formats of data that are produced.
This results in a technological challenge for the recipients of the data who have to process it. Digital cameras, sensors, the web, cell phones, and so on are some of the data generators that produce data in differing formats, and the challenge is being able to handle all these formats and extract meaningful information from the data. The ever-changing nature of data formats with the dawn of the big data era has led to a revolution in the database technology industry with the rise of NoSQL databases, which handle what is known as unstructured data or rather data whose format is fungible or constantly changing.
The fourth characteristic of big data—veracity, which was added later—refers to the need to validate or confirm the correctness of the data, or the fact that the data represents the truth. The sources of data must be verified and errors kept to a minimum. According to an estimate by IBM, poor data quality costs the US economy about $3.1 trillion a year. For example, medical errors cost the United States $19.5 billion in 2008; you can refer to a related article at http://www.wolterskluwerlb.com/health/resource-center/articles/2012/10/economics-health-care-quality-and-medical-errors for more information.
The following link provides an infographic by IBM that summarizes the four V's of big data: https://www.ibmbigdatahub.com/infographic/four-vs-big-data.
Data analytics has been described by Eric Schmidt, the former CEO of Google, as the Future of Everything. For more information, check out a YouTube video called Why Data Analytics is the Future of Everything at https://www.youtube.com/watch?v=9hDnO_ykC7Y.
The volume and velocity of data will continue to increase in the big data age. Companies that can efficiently collect, filter, and analyze data that results in information that allows them to better meet the needs of their customers in a much quicker timeframe will gain a significant advantage over their competitors. For example, data analytics (the Culture of Metrics) plays a very key role in the business strategy of Amazon. For more information, refer to the Amazon.com case study by Smart Insights at http://bit.ly/1glnA1u.
As technologies and tools have evolved to meet the ever-increasing demands of business, there has been a move towards what is known as real-time analytics. More information on this is available from Intel in their Insight Everywhere whitepaper at http://intel.ly/1899xqo.
In the big data internet era, here are some examples of real-time analytics on big data:
Online businesses demand instantaneous insights into how the new products/features they have introduced online are doing and can adjust their online product mix accordingly. Amazon is a prime example of this with their Customers Who Viewed This Item Also Viewed feature.
In finance, risk management and trading systems demand almost instantaneous analysis in order to make effective decisions based on data-driven insights.
Data modeling is the process of using data to build predictive models. Data can also be used for descriptive and prescriptive analysis. But before we make use of data, it has to be fetched from several sources, stored, assimilated, cleaned, and engineered to suit our goal. The sequential operations that need to be performed on data are akin to a manufacturing pipeline, where each subsequent step adds value to the potential end product and each progression requires a new person or skill set.
The various steps in a data analytics pipeline are as follows:
Step 1: Extract Data
Step 2: Transform Data
Step 3: Load Data
Step 4: Read & Process Data
Step 5: Exploratory Data Analysis
Step 6: Create Features
Step 7: Build Predictive Models
Step 8: Validate Models
Step 9: Build Products
These steps can be combined into three high-level categories: data engineering, data science, and product development.
Data Engineering: Step 1 to Step 3 in the preceding list fall into this category. It deals with sourcing data from a variety of sources, creating a suitable database and table schema, and loading the data into a suitable database. There can be many approaches to this step, depending on the following:
Type of data: Structured (tabular data) versus unstructured (such as images and text) versus semi-structured (such as JSON and XML)
Velocity of data upgrade: Batch processing versus real-time data streaming
Volume of data: Distributed (or cluster-based) storage versus single-instance databases
Variety of data: Document storage, blob storage, or data lake
Data Science: Step 4 to Step 8 in the preceding list fall into the category of data science. This is the phase where the data is made usable and used to predict the future, learn patterns, and extrapolate these patterns. Data science can further be sub-divided into two phases.
Step 4 to Step 6 comprise the first phase, wherein the goal is to understand the data better and make it usable. Making the data usable requires considerable effort to clean it by removing invalid characters and missing values. It also involves understanding the nitty-gritty of the data at hand—what is the distribution of the data, what is the relationship between different data variables, is there a causatory relationship between the input and outcome variables, and so on. It also involves exploring numerical transformations (features) that might explain this causation (between the input and outcome variables) better. This phase entails the real forensic effort that goes into the ultimate use of data. To use an analogy, bamboo seeds remain buried in the soil for years with no signs of a sapling growing, and then suddenly a sapling grows, and within months a full bamboo tree is ready. This phase of data science is akin to the underground preparation the bamboo seeds undergo before their rapid growth. It is like the stealth mode of a start-up, wherein a lot of time and effort is committed. And this is where the pandas library, the protagonist of this book, finds its raison d'être and sweet spot.
Step 7 to Step 8 constitute the part where patterns (the parameters of a mathematical expression) are learned from historic data and extrapolated to future data. It involves a lot of experimentation and iterations to get to the optimal results. But if Step 4 to Step 6 have been done with the utmost care, this phase can be implemented pretty quickly thanks to the number of packages in Python, R, and many other data science tools. Of course, it requires a sound understanding of the math and algorithms behind the applied model in order to tweak its parameters to perfection.
Product Development: This is the phase where all the hard work bears fruit and all the insights, results, and patterns are served to the users in a way that they can consume, understand, and act upon. It might range from building a dashboard on data with additional derived fields to an API that calls a trained model and returns an output on incoming data. A product can also be built to encompass all the stages of the data pipeline, from extracting the data to building a predictive model or creating an interactive dashboard.
Apart from these steps in the pipeline, there are some additional steps that might come into the picture. This is due to the highly evolving nature of the data landscape. For example, deep learning, which is used extensively to build intelligent products around image, text, and audio data, often requires the training data to be labeled into a category or augmented if the quantity is too small to create an accurate model.
For example, an object detection task on video data might require the creation of training data for object boundaries and object classes using some tools, or even manually. Data augmentation helps with image data by creating slightly perturbed data (rotated or grained images, for example) and adding it to training data. For a supervised learning task, labels are mandatory. This label is generally generated together with the data. For example, to train a churn model, a dataset with customer descriptions and when they churned out is required. This information is generally available in the company's CRM tool.
The Python programming language is one of the fastest-growing languages today in the emerging field of data science and analytics. Python was created by Guido van Rossum in 1991, and its key features include the following:
Interpreted rather than compiled
Dynamic type system
Pass by value with object references
Modular capability
Comprehensive libraries
Extensibility with respect to other languages
Object orientation
Most of the major programming paradigms: procedural, object-oriented, and, to a lesser extent, functional
For more information, refer to the following article on Python at https://www.python.org/about/.
Among the characteristics that make Python popular for data science are its very user-friendly (human-readable) syntax, the fact that it is interpreted rather than compiled (leading to faster development times), its very comprehensive libraries for parsing and analyzing data, and its capacity for numerical and statistical computations. Python has libraries that provide a complete toolkit for data science and analysis. The major ones are as follows:
NumPy: The general-purpose array functionality with an emphasis on numeric computation
SciPy: Numerical computing
Matplotlib: Graphics
pandas: Series and data frames (1D and 2D array-like types)
Scikit-learn: Machine learning
NLTK: Natural language processing
statsmodels: Statistical analysis
For this book, we will be focusing on the fourth library in the preceding list, pandas.
The pandas we are going to obsess over in this book are not the cute and lazy animals that also do kung fu when needed.
pandas is a high-performance open source library for data analysis in Python developed by Wes McKinney in 2008. pandas stands for panel data, a reference to the tabular format in which it processes the data. It is available for free and is distributed with a 3-Clause BSD License under the open source initiative.
Over the years, it has become the de facto standard library for data analysis using Python. There has been great adoption of the tool, and there is a large community behind it (1,200+ contributors, 17,000+ commits, 23 versions, and 15,000+ stars); rapid iterations, new features, and enhancements are continuously made.
Some key features of pandas include the following:
It can process a variety of datasets in different formats: time series, tabular heterogeneous, and matrix data.
It facilitates loading/importing data from varied sources, such as CSV and databases such as SQL.
It can handle myriad operations on datasets: subsetting, slicing, filtering, merging, groupBy, re-ordering, and re-shaping.
It can deal with missing data according to rules defined by the user/developer, such as ignore, convert to 0, and so on.
It can be used for parsing and munging (conversion) of data as well as modeling and statistical analysis.
It integrates well with other Python libraries such as statsmodels, SciPy, and scikit-learn.
It delivers fast performance and can be sped up even more by making use of Cython (C extensions to Python).
For more information, go through the official pandas documentation at http://pandas.pydata.org/pandas-docs/stable/.
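To give a flavor of these features, here is a minimal sketch (not from the book's code bundle; the sales.csv file and its date, region, and revenue columns are hypothetical) that loads a CSV, fills missing values according to a user-defined rule, filters rows, and aggregates with groupby:
import pandas as pd

# Read a CSV file (sales.csv and its columns are assumed for illustration)
df = pd.read_csv("sales.csv", parse_dates=["date"])

# Handle missing data with a user-defined rule: treat missing revenue as 0
df["revenue"] = df["revenue"].fillna(0)

# Subset, filter, and group the data
recent = df[df["date"] >= "2019-01-01"]
print(recent.groupby("region")["revenue"].sum())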
As discussed in the previous section, pandas can be used to perform Step 4 to Step 6 in the pipeline, and Step 4 to Step 6 are the backbone of any data science process, application, or product. Step 1 to Step 6 can all be carried out with pandas methods: Step 4 to Step 6 are where pandas is the primary tool, while Step 1 to Step 3 can also be accomplished in pandas in one way or another.
pandas is an indispensable library if you're working with data, and it would be near impossible to find code for data modeling that doesn't import pandas into the working environment. Its easy-to-use Python syntax and the availability of a spreadsheet-like data structure called a dataframe make it amenable even to users who are comfortable with Excel and reluctant to move away from it. At the same time, it is loved by scientists and researchers for handling exotic file formats such as Parquet, Feather, and many more. It can read data in batch mode without clogging all of the machine's memory. No wonder the famous news aggregator Quartz called it the most important tool in data science.
pandas is suited well for the following types of dataset:
Tabular with heterogeneous type columns
Ordered and unordered time series
Matrix/array data with labeled or unlabeled rows and columns
pandas can perform the following operations on data with finesse:
Easy handling of missing and NaN data
Addition and deletion of columns
Automatic and explicit data alignment with labels
GroupBy for aggregating and transforming data using split-apply-combine
Converting differently indexed Python or NumPy data to DataFrame
Slicing, indexing, hierarchical indexing, and subsetting of data
Merging, joining, and concatenating data
I/O methods for flat files, HDF5, feather, and parquet formats
Time series functionality
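As a brief illustration of a few of these operations, the following sketch (the dates, column names, and values are invented for illustration) shows missing-data handling, column addition and deletion, shifting, and automatic label alignment on a small time series:
import pandas as pd
import numpy as np

# A small DataFrame indexed by dates (an ordered time series)
idx = pd.date_range("2019-01-01", periods=4, freq="D")
df = pd.DataFrame({"price": [10.0, np.nan, 12.5, 13.0]}, index=idx)

# Handle missing/NaN data by forward-filling the last valid observation
df["price"] = df["price"].fillna(method="ffill")

# Add and then delete a derived column (a lagged series)
df["price_lag1"] = df["price"].shift(1)
del df["price_lag1"]

# Automatic data alignment on labels: the Series align on their date index,
# and dates present in only one of them become NaN
other = pd.Series([1.0, 2.0], index=idx[:2])
print(df["price"] + other)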
The basic version of pandas was built in 2008 by Wes McKinney, an MIT grad with heavy quantitative finance experience. Now a celebrity in his own right, thanks to his open source contributions and the wildly popular book Python for Data Analysis, he was reportedly frustrated with the time he had to waste doing simple data manipulation tasks at his job, such as reading a CSV file, with the popular tools of the time. He said he quickly fell in love with Python for its intuitive and accessible nature after not finding Excel and R suitable for his needs. But he found that it was missing key features that would make it the go-to tool for data analysis—for example, an intuitive format for dealing with spreadsheet data or for creating new calculated columns from existing columns.
According to an interview he gave to Quartz, the design considerations and vision that he had in mind while creating the tool were the following:
Quality of data is far more important than any fancy analysis
Treating in-memory data like a SQL table or an Excel spreadsheet
Intuitive analysis and exploration with minimal and elegant code
Easier compatibility with other libraries used for the same or different steps in the data pipeline
After building the basic version, he went on to pursue a PhD at Duke University but dropped out in a quest to make the tool he had created a cornerstone for data science and Python. With his dedicated contribution, together with the release of popular Python visualization libraries such as Matplotlib, followed by machine learning libraries such as Scikit-Learn and interactive user interfaces such as Jupyter and Spyder, pandas and eventually Python became the hottest tool in the armory of any data scientist.
Wes is heavily invested in the constant improvement of the tool he created from scratch. He coordinates the development of new features and the improvement of existing ones. The data science community owes him big time.
The popularity of Python has skyrocketed over the years, especially after 2012; a lot of this can be attributed to the popularity of pandas. Python-related questions make up around 12% of the total questions asked from high-income countries on Stack Overflow, a popular platform for developers to ask questions and get answers from other people in the community about how to get things done and fix bugs in different programming languages. Given that there are hundreds of programming languages, one language occupying 12% of market share is an extraordinary achievement:
According to this survey conducted by Kaggle, 60% of the respondents said that they were aware of or have used Python for their data science jobs.
According to the data recorded by Stack Overflow about the types of question asked on their platform, Python and pandas have registered steady growth year on year, while some of the other programming languages, such as Java and C, have declined in popularity and are playing catch-up. Python has almost caught up with the number of questions asked about Java on the platform, while the number for Java has shown a negative trend. pandas has been showing constant growth in numbers.
The following chart is based on data gathered from the SQL API exposed by Stack Overflow. The y axis represents the number of questions asked about that topic on Stack Overflow in a particular year:
Google Trends also shows a surge in popularity for pandas, as demonstrated in the following chart. The numbers represent interest in pandas relative to the highest point (historically) on the chart for the given region and time period.
The geographical split of the popularity of pandas is even more interesting. The highest interest has come from China, which might be an indicator of the high adoption of open source tools and/or a very high inclination towards building powerful tech for data science:
Apart from the popularity with its users, pandas (owing to its open source origins) also has a thriving community that is committed to constantly improving it and making it easier for the users to get answers about the issues. The following chart shows the weekly modifications (additions/deletions) to the pandas source code by the contributors:
According to a popular framework called the Gartner Hype Cycle, there are five phases in the proliferation and adoption of technologies:
Technology trigger
Peak of inflated expectations
Trough of disillusionment
Slope of enlightenment
Plateau of productivity
The following link contains a chart that shows different technologies and the stage they are at on the technology adoption curve: https://blogs-images.forbes.com/gartnergroup/files/2012/09/2012Emerging-Technologies-Graphic4.gif.
As can be seen, Predictive Analytics has already reached the steady plateau of productivity, which is where the optimum and stable return on investment can be extracted from a technology. Since pandas is an essential component of most predictive analytics initiatives, it is safe to say that pandas has reached the plateau of productivity.
pandas is built on top of NumPy. Some of the noteworthy uses of pandas, apart from virtually every other data science project of course, are the following:
pandas is a dependency of statsmodels (http://www.statsmodels.org/stable/index.html), making it a significant part of Python's numerical computing ecosystem.
pandas has been used extensively in the production of many financial applications.
We live in a big data era characterized by the four V's: volume, velocity, variety, and veracity. The volume and velocity of data are set to increase for the foreseeable future. Companies that can harness and analyze big data to extract information, and take actionable decisions based on this information, will be the winners in the marketplace. Python is a fast-growing, user-friendly, extensible language that is very popular for data analysis.
pandas is a core library of the Python toolkit for data analysis. It provides features and capabilities that make data analysis much easier and faster than with many other popular languages, such as Java, C, C++, and Ruby.
Thus, given the strengths of Python outlined in this chapter as a choice for the analysis of data and the popularity it has gained from users, contributors, and industry leaders, data analysis practitioners utilizing Python should become adept at pandas in order to become more effective. This book aims to help you achieve this goal.
In the next chapter, we proceed towards this goal by first setting up the infrastructure required to run pandas on your computer. We will also see different ways and scenarios in which pandas can be used and run.
https://activewizards.com/blog/top-20-python-libraries-for-data-science-in-2018/
https://qz.com/1126615/the-story-of-the-most-important-tool-in-data-science/
Before we can start work on pandas for doing data analysis, we need to make sure that the software is installed and the environment is in proper working order. This chapter deals with the installation of Python (if necessary), the pandas library, and all necessary dependencies for the Windows, macOS/X, and Linux platforms. The topics we address include, among other things, selecting a version of Python, installing Python, and installing pandas.
The steps outlined in the following section should work for the most part, but your mileage may vary depending upon the setup. On different operating system versions, the scripts may not always work perfectly, and the third-party software packages already in the system may sometimes conflict with the instructions provided.
The following topics will be covered in this chapter:
Selecting a version of Python to use
Installation of Python and pandas using Anaconda
Dependency packages for pandas
Review of items installed with Anaconda
Cross tooling—combining the pandas awesomeness with R, Julia, H2O.ai, and Azure ML Studio
Command line tricks for pandas
Options and settings for pandas
This is a classic battle among Python developers—Python 2.7.x or Python 3.x—which is better? Until a year or so back, Python 2.7.x topped the charts, the reason being that it was the stable version. In 2016, more than 70% of projects used Python 2.7. This number began to fall, and by 2017 it was 63%. This shift in trend was due to the announcement that Python 2.7 would not be maintained after January 1, 2020, meaning that there would be no more bug fixes or new releases beyond that date. Some libraries released since this announcement are only compatible with Python 3.x, and several businesses have started migrating towards Python 3.x. Hence, as of 2018, Python 3.x is the preferred version.
The main differences between Python 2.x and Python 3.x include better Unicode support in Python 3, print and exec being changed to functions, and the / operator performing true division by default rather than integer division. For more details, see What's New in Python 3.0 at http://docs.python.org/3/whatsnew/3.0.html.
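As a quick Python 3 illustration of two of these differences (print as a function and the division behavior), consider the following snippet; in Python 2, print was a statement and 7 / 2 evaluated to 3:
# Python 3: print is a built-in function rather than a statement
print("Hello, pandas")

# Python 3: / performs true division, // performs floor (integer) division
print(7 / 2)   # 3.5
print(7 // 2)  # 3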