Mastering pandas

Ashish Kumar
Description

Perform advanced data manipulation tasks using pandas and become an expert data analyst.




Key Features



  • Manipulate and analyze your data expertly using the power of pandas


  • Work with missing data and time series data and become a true pandas expert


  • Get expert tips and techniques to make your data analysis tasks easier



Book Description



pandas is a popular Python library used by data scientists and analysts worldwide to manipulate and analyze their data. This book presents useful data manipulation techniques in pandas to perform complex data analysis in various domains.






An update to our highly successful previous edition, with new features, examples, updated code, and more, this book is an in-depth guide to getting the most out of pandas for data analysis. Designed for intermediate users and seasoned practitioners alike, it teaches you advanced data manipulation techniques, such as multi-indexing, modifying data structures, and sampling your data, which allow for powerful analysis and help you gain accurate insights. Using an example-based approach, you will apply pandas to different domains, such as Bayesian statistics, predictive analytics, and time series analysis. You will also learn how to prepare powerful, interactive business reports in pandas using Jupyter Notebook.






By the end of this book, you will be able to perform efficient data analysis on complex data using pandas, becoming an expert data analyst or data scientist in the process.




What you will learn



  • Speed up your data analysis by importing data into pandas


  • Keep relevant data points by selecting subsets of your data


  • Create a high-quality dataset by cleaning data and fixing missing values


  • Compute actionable analytics with grouping and aggregation in pandas


  • Master time series data analysis in pandas


  • Make powerful reports in pandas using Jupyter notebooks



Who this book is for



This book is for data scientists, analysts, and Python developers who wish to explore advanced data analysis and scientific computing techniques using pandas. Some fundamental understanding of Python programming and familiarity with basic data analysis concepts is all you need to get started with this book.

You can read this e-book in Legimi apps or any app that supports the following format:

EPUB

Page count: 545

Year of publication: 2019




Mastering pandas Second Edition

A complete guide to pandas, from installation to advanced data analysis techniques

Ashish Kumar

BIRMINGHAM - MUMBAI

Mastering pandas Second Edition

Copyright © 2019 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

 

Commissioning Editor: Sunith Shetty
Acquisition Editor: Amey Varangaonkar
Content Development Editor: Roshan Kumar
Senior Editor: Ayaan Hoda
Technical Editor: Utkarsha S. Kadam
Copy Editor: Safis Editing
Project Coordinator: Kirti Pisat
Proofreader: Safis Editing
Indexer: Tejal Daruwale Soni
Production Designer: Deepika Naik

First published: June 2015
Second edition: October 2019

Production reference: 1251019

Published by Packt Publishing Ltd., Livery Place, 35 Livery Street, Birmingham B3 2PB, UK.

ISBN 978-1-78934-323-6

www.packt.com

 

Packt.com

Subscribe to our online digital library for full access to over 7,000 books and videos, as well as industry-leading tools to help you plan your personal development and advance your career. For more information, please visit our website.

Why subscribe?

Spend less time learning and more time coding with practical eBooks and Videos from over 4,000 industry professionals

Improve your learning with Skill Plans built especially for you

Get a free eBook or video every month

Fully searchable for easy access to vital information

Copy and paste, print, and bookmark content

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.packt.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.

At www.packt.com, you can also read a collection of free technical articles, sign up for a range of free newsletters, and receive exclusive discounts and offers on Packt books and eBooks. 

Contributors

About the author

Ashish Kumar is a seasoned data science professional, a published author, and a thought leader in the field of data science and machine learning. An IIT Madras graduate and a Young India Fellow, he has around 7 years of experience in implementing and deploying data science and machine learning solutions for challenging industry problems, in both hands-on and leadership roles. His core areas of expertise include natural language processing, IoT analytics, R Shiny product development, and ensemble ML methods. He is fluent in Python and R and teaches a popular ML course at Simplilearn. When not crunching data, Ashish sneaks off to the nearest hip beach and enjoys the company of his Kindle. He also trains and mentors data science aspirants and fledgling start-ups.

 

I owe all of my career accomplishments to my beloved grandfather. I thank my mom, my siblings Sanjeev, Ritesh, Rahul, Sandeep, Ritesh, and my sister-in-law Anamika for all they have done for me. My long-term friends Pranav, Ajit, Vidya, Adarsh, Ashweetha, and Simer have been my support system. I am indebted to  Ram Sukumar and Zeena Johar for their guidance. I want to thank Ramya S, S Abdullah, Sandhya S, and Kirthi T for their help on this book.

About the reviewer

Jamshaid Sohail is a data scientist who is highly passionate about data science, machine learning, deep learning, big data, and other related fields. He spends his free time learning more about the data science field and learning how to use its emerging tools and technologies. He is always looking for new ways to share his knowledge with other people and add value to other people's lives. He has also attended Cambridge University for a summer course in computer science, where he studied under great professors; he would like to impart this knowledge to others. He has extensive experience as a data scientist in a US-based company. In short, he would be extremely delighted to educate and share knowledge with other people.

 

 

 

Packt is searching for authors like you

If you're interested in becoming an author for Packt, please visit authors.packtpub.com and apply today. We have worked with thousands of developers and tech professionals, just like you, to help them share their insight with the global tech community. You can make a general application, apply for a specific hot topic that we are recruiting an author for, or submit your own idea.

Table of Contents

Title Page

Copyright and Credits

Mastering pandas Second Edition

About Packt

Why subscribe?

Contributors

About the author

About the reviewer

Packt is searching for authors like you

Preface

Who this book is for

What this book covers

To get the most out of this book

Download the example code files

Download the color images

Conventions used

Get in touch

Reviews

Section 1: Overview of Data Analysis and pandas

Introduction to pandas and Data Analysis

Motivation for data analysis

We live in a big data world

The four V's of big data

Volume of big data

Velocity of big data

Variety of big data

Veracity of big data

So much data, so little time for analysis

The move towards real-time analytics

Data analytics pipeline

How Python and pandas fit into the data analytics pipeline

What is pandas?

Where does pandas fit in the pipeline?

Benefits of using pandas

History of pandas

Usage pattern and adoption of pandas

pandas on the technology adoption curve

Popular applications of pandas

Summary

References

Installation of pandas and Supporting Software

Selecting a version of Python to use

Standalone Python installation

Linux

Installing Python from a compressed tarball

Windows

Core Python installation

Installing third-party Python and packages 

macOS/X

Installation using a package manager

Installation of Python and pandas using Anaconda

What is Anaconda?

Why Anaconda?

Installing Anaconda

Windows Installation

macOS Installation

Linux Installation

Cloud installation

Other numeric and analytics-focused Python distributions

Dependency packages for pandas

Review of items installed with Anaconda

JupyterLab

GlueViz

Walk-through of Jupyter Notebook and Spyder

Jupyter Notebook

Spyder

Cross tooling – combining pandas awesomeness with R, Julia, H2O.ai, and Azure ML Studio

Pandas with R

pandas with Azure ML Studio

pandas with Julia

pandas with H2O

Command line tricks for pandas

Options and settings for pandas

Summary

Further reading

Section 2: Data Structures and I/O in pandas

Using NumPy and Data Structures with pandas

NumPy ndarrays

NumPy array creation

Array of ones and zeros

Array based on a numerical range

Random and empty arrays

Arrays based on existing arrays

NumPy data types

NumPy indexing and slicing

Array slicing

Array masking

Complex indexing

Copies and views

Operations

Basic operators

Mathematical operators

Statistical operators

Logical operators

Broadcasting

Array shape manipulation

Reshaping

Transposing

Ravel

Adding a new axis

Basic linear algebra operations

Array sorting

Implementing neural networks with NumPy

Practical applications of multidimensional arrays

Selecting only one channel

Selecting the region of interest of an image

Multiple channel selection and suppressing other channels

Data structures in pandas

Series

Series creation

Using an ndarray

Using a Python dictionary

Using a scalar value

Operations on Series

Assignment

Slicing

Other operations

DataFrames

DataFrame creation

Using a dictionary of Series

Using a dictionary of ndarrays/lists

Using a structured array

Using a list of dictionaries

Using a dictionary of tuples for multilevel indexing

Using a Series

Operations on pandas DataFrames

Column selection

Adding a new column

Deleting columns

Alignment of DataFrames

Other mathematical operations

Panels

Using a 3D NumPy array with axis labels

Using a Python dictionary of DataFrame objects

Using the DataFrame.to_panel method

Other operations

Summary

References

I/Os of Different Data Formats with pandas

Data sources and pandas methods

CSV and TXT

Reading CSV and TXT files

Reading a CSV file

Specifying column names for a dataset

Reading from a string of data

Skipping certain rows

Row index

Reading a text file

Subsetting while reading

Reading thousand format numbers as numbers

Indexing and multi-indexing

Reading large files in chunks

Handling delimiter characters in column data

Writing to a CSV

Excel

URL and S3

HTML

Writing to an HTML file

JSON

Writing a JSON to a file

Reading a JSON

Writing JSON to a DataFrame

Subsetting a JSON

Looping over JSON keys

Reading HDF formats

Reading feather files

Reading parquet files

Reading a SQL file

Reading a SAS/Stata file

Reading from Google BigQuery

Reading from a clipboard

Managing sparse data

Writing JSON objects to a file

Serialization/deserialization

Writing to exotic file types

to_pickle()

to_parquet()

to_hdf()

to_sql()

to_feather()

to_html()

to_msgpack()

to_latex()

to_stata()

to_clipboard()

GeoPandas

What is geospatial data?

Installation and dependencies

Working with GeoPandas

GeoDataFrames

Open source APIs – Quandl

read_sql_query

Pandas plotting

Andrews curves

Parallel plot

Radviz plots

Scatter matrix plot

Lag plot

Bootstrap plot

pandas-datareader

Yahoo Finance

World Bank

Summary

Section 3: Mastering Different Data Operations in pandas

Indexing and Selecting in pandas

Basic indexing

Accessing attributes using the dot operator

Range slicing

Labels, integer, and mixed indexing

Label-oriented indexing

Integer-oriented indexing

The .iat and .at operators

Mixed indexing with the .ix operator

Multi-indexing

Swapping and re-ordering levels

Cross-sections

Boolean indexing

The isin and any/all methods

Using the where() method

Operations on indexes

Summary

Grouping, Merging, and Reshaping Data in pandas

Grouping data

The groupby operation

Using groupby with a MultiIndex

Using the aggregate method

Applying multiple functions

The transform() method

Filtering

Merging and joining

The concat function

Using append

Appending a single row to a DataFrame

SQL-like merging/joining of DataFrame objects

The join function

Pivots and reshaping data

Stacking and unstacking

The stack() function

The unstack() function

Other methods for reshaping DataFrames

Using the melt function

The pandas.get_dummies() function

pivot table

Transpose in pandas

Squeeze

nsmallest and nlargest

Summary

Special Data Operations in pandas

Writing and applying one-liner custom functions

lambda and apply

Handling missing values

Sources of missing values

Data extraction 

Data collection 

Data missing at random 

Data not missing at random 

Different types of missing values

Miscellaneous analysis of missing values

Strategies for handling missing values

Deletion 

Imputation

Interpolation 

KNN 

A survey of methods on series

The items() method

The keys() method

The pop() method

The apply() method

The map() method

The drop() method

The equals() method

The sample() method

The ravel() function

The value_counts() function

The interpolate() function

The align() function

pandas string methods

upper(), lower(), capitalize(), title(), and swapcase()

contains(), find(), and replace()

strip() and split()

startswith() and endswith()

The is...() functions

Binary operations on DataFrames and series

Binning values

Using mathematical methods on DataFrames

The abs() function

corr() and cov()

cummax(), cummin(), cumsum(), and cumprod()

The describe() function

The diff() function

The rank() function

The quantile() function

The round() function

The pct_change() function

min(), max(), median(), mean(), and mode()

all() and any()

The clip() function

The count() function

Summary

Time Series and Plotting Using Matplotlib

Handling time series data

Reading in time series data

Assigning date indexes and subsetting in time series data

Plotting the time series data

Resampling and rolling of the time series data

Separating timestamp components

DateOffset and TimeDelta objects

Time series-related instance methods

Shifting/lagging

Frequency conversion

Resampling of data

Aliases for time series frequencies

Time series concepts and datatypes

Period and PeriodIndex

PeriodIndex

Conversion between time series datatypes

A summary of time series-related objects

Interconversions between strings and timestamps

Data-processing techniques for time series data

Data transformation

Plotting using matplotlib

Summary

Section 4: Going a Step Beyond with pandas

Making Powerful Reports In Jupyter Using pandas

pandas styling

In-built styling options

User-defined styling options

Navigating Jupyter Notebook

Exploring the menu bar of Jupyter Notebook

Edit mode and command mode

Mouse navigation

Jupyter Notebook Dashboard

Ipywidgets

Interactive visualizations

Writing mathematical equations in Jupyter Notebook

Formatting text in Jupyter Notebook

Headers

Bold and italics

Alignment

Font color

Bulleted lists

Tables

HTML

Citation

Miscellaneous operations in Jupyter Notebook

Loading an image

Hyperlinks

Writing to a Python file

Running a Python file

Loading a Python file

Internal Links

Sharing Jupyter Notebook reports

Using NbViewer

Using the browser

Using Jupyter Hub

Summary

A Tour of Statistics with pandas and NumPy

Descriptive statistics versus inferential statistics

Measures of central tendency and variability

Measures of central tendency

The mean

The median

The mode

Computing the measures of central tendency of a dataset in Python

Measures of variability, dispersion, or spread

Range

Quartile

Deviation and variance

Hypothesis testing – the null and alternative hypotheses

The null and alternative hypotheses

The alpha and p-values

Type I and Type II errors

Statistical hypothesis tests

Background

The z-test

The t-test

Types of t-tests

A t-test example

chi-square test

ANOVA test

Confidence intervals

An illustrative example

Correlation and linear regression

Correlation

Linear regression

An illustrative example

Summary

A Brief Tour of Bayesian Statistics and Maximum Likelihood Estimates

Introduction to Bayesian statistics

The mathematical framework for Bayesian statistics

Bayes' theory and odds

Applications of Bayesian statistics

Probability distributions

Fitting a distribution

Discrete probability distributions

Discrete uniform distribution

The Bernoulli distribution

The binomial distribution

The Poisson distribution

The geometric distribution

The negative binomial distribution

Continuous probability distributions

The continuous uniform distribution

The exponential distribution

The normal distribution

Bayesian statistics versus frequentist statistics

What is probability?

How the model is defined

Confidence (frequentist) versus credible (Bayesian) intervals

Conducting Bayesian statistical analysis

Monte Carlo estimation of the likelihood function and PyMC

Bayesian analysis example – switchpoint detection

Maximum likelihood estimate

MLE calculation examples

Uniform distribution

Poisson distribution

References

Summary

Data Case Studies Using pandas

End-to-end exploratory data analysis

Data overview

Feature selection

Feature extraction

Data aggregation

Web scraping with Python

Web scraping using pandas

Web scraping using BeautifulSoup

Data validation

Data overview

Structured databases versus unstructured databases

Validating data types

Validating dimensions

Validating individual entries

Using pandas indexing

Using loops

Summary

The pandas Library Architecture

Understanding the pandas file hierarchy

Description of pandas modules and files

pandas/core

pandas/io

pandas/tools

pandas/util

pandas/tests

pandas/compat

pandas/computation

pandas/plotting

pandas/tseries

Improving performance using Python extensions

Summary

pandas Compared with Other Tools

Comparison with R

Data types in R

R lists

R DataFrames

Slicing and selection

Comparing R-matrix and NumPy array

Comparing R lists and pandas series

Specifying a column name in R

Specifying a column name in pandas

R DataFrames versus pandas DataFrames

Multi-column selection in R

Multi-column selection in pandas

Arithmetic operations on columns

Aggregation and GroupBy

Aggregation in R

The pandas GroupBy operator

Comparing matching operators in R and pandas

R %in% operator

Pandas isin() function

Logical subsetting

Logical subsetting in R

Logical subsetting in pandas

Split-apply-combine

Implementation in R

Implementation in pandas

Reshaping using melt

R melt function

The pandas melt function

Categorical data

R example using cut()

The pandas solution

Comparison with SQL

SELECT

SQL

pandas

Where

SQL

pandas

SQL

pandas

SQL

pandas

group by

SQL

pandas

SQL

pandas

SQL

pandas

update

SQL

pandas

delete

SQL

pandas

JOIN

SQL

pandas

SQL

pandas

SQL

pandas

Comparison with SAS

Summary

A Brief Tour of Machine Learning

The role of pandas in machine learning

Installation of scikit-learn

Installing via Anaconda

Installing on Unix (Linux/macOS)

Installing on Windows

Introduction to machine learning

Supervised versus unsupervised learning

Illustration using document classification

Supervised learning

Unsupervised learning

How machine learning systems learn

Application of machine learning – Kaggle Titanic competition

The Titanic: Machine Learning from Disaster problem

The problem of overfitting

Data analysis and preprocessing using pandas

Examining the data

Handling missing values

A naive approach to the Titanic problem

The scikit-learn ML/classifier interface

Supervised learning algorithms

Constructing a model using Patsy for scikit-learn

General boilerplate code explanation

Logistic regression

Support vector machine

Decision trees

Random forest

Unsupervised learning algorithms

Dimensionality reduction

K-means clustering

XGBoost case study

Entropy

Summary

Other Books You May Enjoy

Leave a review - let other readers know what you think

Preface

pandas is a popular Python library used by data scientists and analysts worldwide to manipulate and analyze their data. This book presents useful data manipulation techniques in pandas for performing complex data analysis in various domains. It provides features and capabilities that make data analysis much easier and faster than with many other popular languages, such as Java, C, C++, and Ruby. 

Who this book is for

This book is for data scientists, analysts, and Python developers who wish to explore advanced data analysis and scientific computing techniques using pandas. Some fundamental understanding of Python programming and familiarity with basic data analysis concepts is all you need to get started with this book.

What this book covers

Chapter 1, Introduction to pandas and Data Analysis, will introduce pandas and explain where it fits in the data analysis pipeline. We will also look into some of the popular applications of pandas and how Python and pandas can be used for data analysis. 

Chapter 2, Installation of pandas and Supporting Software, will deal with the installation of Python (if necessary), the pandas library, and all necessary dependencies for the Windows, macOS X, and Linux platforms. We will also look into the command-line tricks and options and settings for pandas as well.

Chapter 3, Using NumPy and Data Structures with pandas, will give a quick tour of the power of NumPy and provide a glimpse of how it makes life easier when working with pandas. We will also be implementing a neural network with NumPy and exploring some of the practical applications of multi-dimensional arrays.

Chapter 4, I/O of Different Data Formats with pandas, will teach you how to read and write commonplace formats, such as comma-separated values (CSV), with all their options, as well as how to work with data from URLs and more exotic formats, such as JSON and XML. We will also create files in those formats from data objects and create niche plots from within pandas.

Chapter 5, Indexing and Selecting in pandas, will show you how to access and select data from pandas data structures. We will look in detail at basic indexing, label indexing, integer indexing, mixed indexing, and the operation of indexes. 

Chapter 6, Grouping, Merging, and Reshaping Data in pandas, will examine the various functions that enable us to rearrange data, by having you utilize such functions on real-world datasets. We will also learn about grouping, merging, and reshaping data. 

Chapter 7, Special Data Operations in pandas, will discuss and elaborate on the methods, syntax, and usage of some of the special data operations in pandas.

Chapter 8, Time Series and Plotting Using Matplotlib, will look at how to handle time series and dates. We will also take a tour of some topics that are necessary for you to know about in order to develop your expertise in using pandas.

Chapter 9, Making Powerful Reports Using pandas in Jupyter, will look into the application of a range of styling, as well as the formatting options that pandas has. We will also learn how to create dashboards and reports in the Jupyter Notebook.

Chapter 10, A Tour of Statistics with pandas and NumPy, will delve into how pandas, together with supporting packages, can be used to perform statistical calculations.

Chapter 11, A Brief Tour of Bayesian Statistics and Maximum Likelihood Estimates, will examine an alternative approach to statistics, which is the Bayesian approach. We will also look into the key statistical distributions and see how we can use various statistical packages to generate and plot distributions in matplotlib.

Chapter 12, Data Case Studies Using pandas, will discuss how we can solve real-life data case studies using pandas. We will look into web scraping with Python and data validation as well.

Chapter 13, The pandas Library Architecture, will discuss the architecture and code structure of the pandas library. This chapter will also briefly demonstrate how you can improve performance using Python extensions.

Chapter 14, pandas Compared with Other Tools, will focus on comparing pandas with R and other tools, such as SQL and SAS. We will also look into slicing and selection.

Chapter 15, Brief Tour of Machine Learning, will conclude the book by giving a brief introduction to the scikit-learn library for doing machine learning and show how pandas fits within that framework.

To get the most out of this book

The following software will be used while we execute the code:

Windows/macOS/Linux

Python 3.6

pandas

IPython

R

scikit-learn

For hardware, there are no specific requirements. Python and pandas can run on a Mac, Linux, or Windows machine.

Download the example code files

You can download the example code files for this book from your account at www.packt.com. If you purchased this book elsewhere, you can visit www.packt.com/support and register to have the files emailed directly to you.

You can download the code files by following these steps:

1. Log in or register at www.packt.com.

2. Select the SUPPORT tab.

3. Click on Code Downloads & Errata.

4. Enter the name of the book in the Search box and follow the onscreen instructions.

Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:

WinRAR/7-Zip for Windows

Zipeg/iZip/UnRarX for Mac

7-Zip/PeaZip for Linux

The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Mastering-Pandas-Second-Edition. In case there's an update to the code, it will be updated on the existing GitHub repository.

We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Download the color images

We also provide a PDF file that has color images of the screenshots/diagrams used in this book. You can download it here: https://static.packt-cdn.com/downloads/9781789343236_ColorImages.pdf.

Conventions used

There are a number of text conventions used throughout this book.

CodeInText: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: "Python has a built-in array module to create arrays."

A block of code is set as follows:

source_python("titanic.py")titanic_in_r <- get_data_head("titanic.csv")

Any command-line input or output is written as follows:

python --version

Bold: Indicates a new term, an important word, or words that you see onscreen. For example, words in menus or dialog boxes appear in the text like this. Here is an example: "Any notebooks in other directories could be transferred to the current working directory of the Jupyter Notebook through the Upload option."

Warnings or important notes appear like this.
Tips and tricks appear like this.

Get in touch

Feedback from our readers is always welcome.

General feedback: If you have questions about any aspect of this book, mention the book title in the subject of your message and email us at [email protected].

Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packt.com/submit-errata, select your book, click on the Errata Submission Form link, and enter the details.

Piracy: If you come across any illegal copies of our works in any form on the Internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.

If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.

Reviews

Please leave a review. Once you have read and used this book, why not leave a review on the site that you purchased it from? Potential readers can then see and use your unbiased opinion to make purchase decisions, we at Packt can understand what you think about our products, and our authors can see your feedback on their book. Thank you!

For more information about Packt, please visit packt.com.

Section 1: Overview of Data Analysis and pandas

In this section, we give you a quick overview of the concepts of the data analysis process and where pandas fits into that picture. You will also learn how to install and set up the pandas library, along with the other supporting libraries and environments required to build an enterprise-grade data analysis pipeline.

This section comprises the following chapters:

Chapter 1, Introduction to pandas and Data Analysis

Chapter 2, Installation of pandas and Supporting Software

Introduction to pandas and Data Analysis

We start the book and this chapter by discussing the contemporary data analytics landscape and how pandas fits into that landscape. pandas is the go-to tool for data scientists for data pre-processing tasks. We will learn about the technicalities of pandas in the later chapters. This chapter covers the context, origin, history, market share, and current standing of pandas.

The chapter is divided into the following sections:

Motivation for data analysis

How Python and pandas can be used for data analysis

Description of the pandas library

Benefits of using pandas

Motivation for data analysis

In this section, we discuss the trends that are making data analysis an increasingly important field in today's fast-moving technological landscape.

We live in a big data world

The term big data has become one of the hottest technology buzzwords in the past two years. We now increasingly hear about big data in various media outlets, and big data start-ups have increasingly been attracting venture capital. A good example in the area of retail is Target Corporation, which has invested substantially in big data and is now able to identify potential customers by using big data to analyze people's shopping habits online; refer to a related article at http://nyti.ms/19LT8ic.

Loosely speaking, big data refers to the phenomenon wherein the amount of data exceeds the capability of the recipients of the data to process it. Here is an article on big data that sums it up nicely: https://www.oracle.com/in/big-data/guide/what-is-big-data.html.

The four V's of big data

A good way to start thinking about the complexities of big data is through its four dimensions, or the four V's of big data. This model was first introduced as the three V's by Gartner analyst Doug Laney in 2001. The three V's stood for Volume, Velocity, and Variety, and the fourth V, Veracity, was added later by IBM. Gartner's official definition states the following:

"Big data is high volume, high velocity, and/or high variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization."                         Laney, Douglas. "The Importance of 'Big Data': A Definition", Gartner

Volume of big data

The volume of data in the big data age is simply mind-boggling. According to IBM, by 2020, the total amount of data on the planet will have ballooned to 40 zettabytes. You heard that right! 40 zettabytes is 43 trillion gigabytes. For more information on this, refer to the Wikipedia page on the zettabyte: http://en.wikipedia.org/wiki/Zettabyte.

To get a handle on how much data this is, let me refer to an EMC press release published in 2010, which stated what 1 zettabyte was approximately equal to:

"The digital information created by every man, woman and child on Earth 'Tweeting' continuously for 100 years " or "75 billion fully-loaded 16 GB Apple iPads, which would fill the entire area of Wembley Stadium to the brim 41 times, the Mont Blanc Tunnel 84 times, CERN's Large Hadron Collider tunnel 151 times, Beijing National Stadium 15.5 times or the Taipei 101 Tower 23 times..."                                                                                                                                                                        EMC study projects 45× data growth by 2020

The growth rate of data has been fuelled largely by a few factors, such as the following:

The rapid growth of the internet.

The conversion from analog to digital media, coupled with an increased ability to capture and store data, which in turn has been made possible with cheaper and better storage technology. There has been a proliferation of digital data input devices, such as cameras and wearables, and the cost of huge data storage has fallen rapidly. Amazon Web Services is a prime example of the trend toward much cheaper storage.

The internetification of devices, or rather the Internet of Things, is the phenomenon wherein common household devices, such as our refrigerators and cars, will be connected to the internet. This phenomenon will only accelerate the above trend.

Velocity of big data

From a purely technological point of view, velocity refers to the throughput of big data, or how fast the data is coming in and is being processed. This has ramifications on how fast the recipient of the data needs to process it to keep up. Real-time analytics is one attempt to handle this characteristic. Tools that can enable this include Amazon Web Services Elastic MapReduce.

At a more macro level, the velocity of data can also be regarded as the increased speed at which data and information can now be transferred and processed, over greater distances than ever before.

The proliferation of high-speed data and communication networks coupled with the advent of cell phones, tablets, and other connected devices are primary factors driving information velocity. Some measures of velocity include the number of tweets per second and the number of emails per minute.

Variety of big data

The variety of big data comes from having a multiplicity of data sources that generate data and the different formats of data that are produced.

This results in a technological challenge for the recipients of the data who have to process it. Digital cameras, sensors, the web, cell phones, and so on are some of the data generators that produce data in differing formats, and the challenge is being able to handle all these formats and extract meaningful information from the data. The ever-changing nature of data formats with the dawn of the big data era has led to a revolution in the database technology industry with the rise of NoSQL databases, which handle what is known as unstructured data or rather data whose format is fungible or constantly changing. 

Veracity of big data

The fourth characteristic of big data, veracity, which was added later, refers to the need to validate or confirm the correctness of the data, or the fact that the data represents the truth. The sources of data must be verified and errors kept to a minimum. According to an estimate by IBM, poor data quality costs the US economy about $3.1 trillion a year. For example, medical errors cost the United States $19.5 billion in 2008; you can refer to a related article at http://www.wolterskluwerlb.com/health/resource-center/articles/2012/10/economics-health-care-quality-and-medical-errors for more information.

The following link provides an infographic by IBM that summarizes the four V's of big data: https://www.ibmbigdatahub.com/infographic/four-vs-big-data.

So much data, so little time for analysis

Data analytics has been described by Eric Schmidt, the former CEO of Google, as the Future of Everything. For more information, check out a YouTube video called Why Data Analytics is the Future of Everything at https://www.youtube.com/watch?v=9hDnO_ykC7Y.

The volume and velocity of data will continue to increase in the big data age. Companies that can efficiently collect, filter, and analyze data, turning it into information that allows them to meet the needs of their customers more quickly, will gain a significant advantage over their competitors. For example, data analytics (the Culture of Metrics) plays a key role in the business strategy of Amazon. For more information, refer to the Amazon.com case study by Smart Insights at http://bit.ly/1glnA1u.

The move towards real-time analytics

As technologies and tools have evolved to meet the ever-increasing demands of business, there has been a move towards what is known as real-time analytics. More information on this is available from Intel in their Insight Everywhere whitepaper at http://intel.ly/1899xqo.

In the big data internet era, here are some examples of real-time analytics on big data:

Online businesses demand instantaneous insights into how the new products/features they have introduced online are doing and can adjust their online product mix accordingly. Amazon is a prime example of this with their Customers Who Viewed This Item Also Viewed feature.

In finance, risk management and trading systems demand almost instantaneous analysis in order to make effective decisions based on data-driven insights.

Data analytics pipeline

Data modeling is the process of using data to build predictive models. Data can also be used for descriptive and prescriptive analysis. But before we make use of data, it has to be fetched from several sources, stored, assimilated, cleaned, and engineered to suit our goal. The sequential operations that need to be performed on data are akin to a manufacturing pipeline, where each subsequent step adds value to the potential end product and each progression requires a new person or skill set.

The various steps in a data analytics pipeline (illustrated in the book as a diagram titled Steps in data analytics pipeline) are as follows:

1. Extract Data
2. Transform Data
3. Load Data
4. Read & Process Data
5. Exploratory Data Analysis
6. Create Features
7. Build Predictive Models
8. Validate Models
9. Build Products

These steps can be combined into three high-level categories: data engineering, data science, and product development.

Data Engineering: Step 1 to Step 3 in the preceding list fall into this category. It deals with sourcing data from a variety of sources, creating a suitable database and table schema, and loading the data into a suitable database. There can be many approaches to this step depending on the following:

Type of data: Structured (tabular data) versus unstructured (such as images and text) versus semi-structured (such as JSON and XML)

Velocity of data upgrade: Batch processing versus real-time data streaming

Volume of data: Distributed (or cluster-based) storage versus single-instance databases

Variety of data: Document storage, blob storage, or data lake

Data Science: Step 4 to Step 8 in the preceding list fall into the category of data science. This is the phase where the data is made usable and used to predict the future, learn patterns, and extrapolate these patterns. Data science can further be sub-divided into two phases.

Step 4 to Step 6 comprise the first phase, wherein the goal is to understand the data better and make it usable. Making the data usable requires considerable effort to clean it by removing invalid characters and missing values. It also involves understanding the nitty-gritty of the data at hand: what the distribution of the data is, what the relationships between different variables are, whether there is a causal relationship between the input and outcome variables, and so on. It also involves exploring numerical transformations (features) that might explain this causation better. This phase entails the real forensic effort that goes into the ultimate use of data. To use an analogy, bamboo seeds remain buried in the soil for years with no signs of a sapling growing, and then suddenly a sapling grows, and within months a full bamboo tree is ready. This phase of data science is akin to the underground preparation the bamboo seeds undergo before their rapid growth. It is like the stealth mode of a start-up, wherein a lot of time and effort is committed. And this is where the pandas library, the protagonist of this book, finds its raison d'être and sweet spot.
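To make this phase concrete, here is a minimal sketch of the kind of cleaning and exploration pandas enables. The file name sales.csv and the price and quantity columns are illustrative assumptions, not examples from this book:

import pandas as pd

# Load a dataset; the file name and columns (price, quantity) are assumptions
df = pd.read_csv("sales.csv")

# Understand the data: shape, types, and summary statistics
print(df.shape)
print(df.dtypes)
print(df.describe())

# Quantify missing values per column
print(df.isnull().sum())

# Clean: drop rows missing the outcome variable, fill the rest with a default
df = df.dropna(subset=["price"])
df["quantity"] = df["quantity"].fillna(0)

# Inspect the relationship between numeric variables
print(df[["price", "quantity"]].corr())

# A derived feature (numerical transformation) that may explain the outcome better
df["revenue"] = df["price"] * df["quantity"]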

Step 7 to Step 8 constitute the part where patterns (the parameters of a mathematical expression) are learned from historic data and extrapolated to future data. It involves a lot of experimentation and iterations to get to the optimal results. But if Step 4 to Step 6 have been done with the utmost care, this phase can be implemented pretty quickly thanks to the number of packages in Python, R, and many other data science tools. Of course, it requires a sound understanding of the math and algorithms behind the applied model in order to tweak its parameters to perfection.

Product Development: This is the phase where all the hard work bears fruit and all the insights, results, and patterns are served to the users in a way that they can consume, understand, and act upon. It might range from building a dashboard on data with additional derived fields to an API that calls a trained model and returns an output on incoming data. A product can also be built to encompass all the stages of the data pipeline, from extracting the data to building a predictive model or creating an interactive dashboard.

Apart from these steps in the pipeline, there are some additional steps that might come into the picture. This is due to the highly evolving nature of the data landscape. For example, deep learning, which is used extensively to build intelligent products around image, text, and audio data, often requires the training data to be labeled into a category or augmented if the quantity is too small to create an accurate model.

For example, an object detection task on video data might require the creation of training data for object boundaries and object classes using some tools, or even manually. Data augmentation helps with image data by creating slightly perturbed data (rotated or grained images, for example) and adding it to training data. For a supervised learning task, labels are mandatory. This label is generally generated together with the data. For example, to train a churn model, a dataset with customer descriptions and when they churned out is required. This information is generally available in the company's CRM tool.

How Python and pandas fit into the data analytics pipeline

The Python programming language is one of the fastest-growing languages today in the emerging field of data science and analytics. Python was created by Guido van Rossum in 1991, and its key features include the following:

Interpreted rather than compiled

Dynamic type system

Pass by value with object references

Modular capability

Comprehensive libraries

Extensibility with respect to other languages

Object orientation

Most of the major programming paradigms: procedural, object-oriented, and, to a lesser extent, functional

For more information, refer to the following article on Python at https://www.python.org/about/.

Among the characteristics that make Python popular for data science are its very user-friendly (human-readable) syntax, the fact that it is interpreted rather than compiled (leading to faster development time), its very comprehensive libraries for parsing and analyzing data, and its capacity for numerical and statistical computations. Python has libraries that provide a complete toolkit for data science and analysis. The major ones are as follows:

NumPy: The general-purpose array functionality with an emphasis on numeric computation

SciPy: Numerical computing

Matplotlib: Graphics

pandas: Series and data frames (1D and 2D array-like types)

Scikit-learn: Machine learning

NLTK: Natural language processing

Statstool: Statistical analysis

For this book, we will be focusing on the fourth library in the preceding list, pandas.

What is pandas?

The pandas we are going to obsess over in this book are not the cute and lazy animals that also do kung fu when needed.

pandas is a high-performance open source library for data analysis in Python developed by Wes McKinney in 2008. pandas stands for panel data, a reference to the tabular format in which it processes the data. It is available for free and is distributed with a 3-Clause BSD License under the open source initiative.

Over the years, it has become the de facto standard library for data analysis using Python. There has been great adoption of the tool, and there is a large community behind it (1,200+ contributors, 17,000+ commits, 23 versions, and 15,000+ stars); rapid iterations, new features, and enhancements are continuously made.

Some key features of pandas include the following:

It can process a variety of datasets in different formats: time series, tabular heterogeneous, and matrix data.

It facilitates loading/importing data from varied sources, such as CSV and databases such as SQL.

It can handle myriad operations on datasets: subsetting, slicing, filtering, merging, groupBy, re-ordering, and re-shaping.

It can deal with missing data according to rules defined by the user/developer, such as ignore, convert to 0, and so on.

It can be used for parsing and munging (conversion) of data as well as modeling and statistical analysis.

It integrates well with other Python libraries such as statsmodels, SciPy, and scikit-learn.

It delivers fast performance and can be sped up even more by making use of Cython (C extensions to Python).
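To give a flavor of these features, the following minimal, self-contained sketch uses made-up data to create a Series and a DataFrame, fill missing values according to a simple rule, and aggregate with groupby:

import numpy as np
import pandas as pd

# A Series is a labeled 1D array
s = pd.Series([10, 20, 30], index=["a", "b", "c"])

# A DataFrame is a labeled, spreadsheet-like 2D structure
df = pd.DataFrame({
    "city": ["Mumbai", "Delhi", "Mumbai", "Delhi"],
    "sales": [250, 180, np.nan, 310],
})

# Handle missing data according to a user-defined rule (here, convert to 0)
df["sales"] = df["sales"].fillna(0)

# Subset, filter, and aggregate with groupBy
high_sales = df[df["sales"] > 200]
totals_by_city = df.groupby("city")["sales"].sum()

print(s)
print(high_sales)
print(totals_by_city)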

For more information, go through the official pandas documentation at http://pandas.pydata.org/pandas-docs/stable/.

Where does pandas fit in the pipeline?

As discussed in the previous section, pandas can be used to perform Step 4 to Step 6 in the pipeline. And Step 4 to Step 6 are the backbone of any data science process, application, or product:

Where does pandas fit in the data analytics pipeline?

Step 1 to Step 6 can all be performed in pandas using one method or another. Step 4 to Step 6 are the primary tasks for pandas, while Step 1 to Step 3 can also be done in some way or other with it.

pandas is an indispensable library if you're working with data, and it would be nearly impossible to find data modeling code that doesn't import pandas into the working environment. The easy-to-use Python syntax and the availability of a spreadsheet-like data structure called a DataFrame make it amenable even to users who are otherwise too comfortable with Excel to move away from it. At the same time, it is loved by scientists and researchers for handling exotic file formats such as parquet, feather, and many more. It can also read data in batch mode without clogging all of the machine's memory. No wonder the famous news aggregator Quartz called it the most important tool in data science.
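As an illustration of reading data in batches, the following sketch processes a large CSV file chunk by chunk instead of loading it all at once; the file name and chunk size are assumptions for illustration:

import pandas as pd

# Process a large CSV in 100,000-row batches instead of loading it all at once
total_rows = 0
for chunk in pd.read_csv("very_large_file.csv", chunksize=100_000):
    # Each chunk is an ordinary DataFrame; aggregate incrementally
    total_rows += len(chunk)

print(total_rows)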

pandas is suited well for the following types of dataset:

Tabular with heterogeneous type columns

Ordered and unordered time series

Matrix/array data with labeled or unlabeled rows and columns

pandas can perform the following operations on data with finesse:

Easy handling of missing and NaN data

Addition and deletion of columns

Automatic and explicit data alignment with labels

GroupBy for aggregating and transforming data using split-apply-combine

Converting differently indexed Python or NumPy data to DataFrame

Slicing, indexing, hierarchical indexing, and subsetting of data

Merging, joining, and concatenating data

I/O methods for flat files, HDF5, feather, and parquet formats

Time series functionality
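A short sketch of a few of these operations on made-up data follows; the column names, labels, and dates are illustrative assumptions:

import pandas as pd

# SQL-like merging/joining of two DataFrames on a key column
left = pd.DataFrame({"key": ["a", "b", "c"], "x": [1, 2, 3]})
right = pd.DataFrame({"key": ["b", "c", "d"], "y": [20, 30, 40]})
merged = pd.merge(left, right, on="key", how="inner")

# Automatic data alignment with labels: non-matching labels become NaN
s1 = pd.Series([1, 2, 3], index=["a", "b", "c"])
s2 = pd.Series([10, 20, 30], index=["b", "c", "d"])
aligned_sum = s1 + s2

# Time series functionality: a date index with monthly resampling
ts = pd.Series(range(90), index=pd.date_range("2019-01-01", periods=90, freq="D"))
monthly = ts.resample("M").sum()

print(merged)
print(aligned_sum)
print(monthly)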

History of pandas

The basic version of pandas was built in 2008 by Wes McKinney, an MIT grad with heavy quantitative finance experience. Now a celebrity in his own right, thanks to his open source contributions and the wildly popular book Python for Data Analysis, he was reportedly frustrated with the time he had to waste on simple data manipulation tasks at his job, such as reading a CSV file, with the popular tools of that time. He said he quickly fell in love with Python for its intuitive and accessible nature after not finding Excel and R suitable for his needs. But he found that it was missing key features that would make it the go-to tool for data analysis, for example, an intuitive format for dealing with spreadsheet data or for creating new calculated columns from existing columns.

According to an interview he gave to Quartz, the design considerations and vision that he had in mind while creating the tool were the following:

Quality of data is far more important than any fancy analysis

Treating in-memory data like a SQL table or an Excel spreadsheet

Intuitive analysis and exploration with minimal and elegant code

Easier compatibility with other libraries used for the same or different steps in the data pipeline

After building the basic version, he went on to pursue a PhD at Duke University but dropped out in a quest to make the tool he had created a cornerstone for data science and Python. With his dedicated contribution, together with the release of popular Python visualization libraries such as Matplotlib, followed by machine learning libraries such as Scikit-Learn and interactive user interfaces such as Jupyter and Spyder, pandas and eventually Python became the hottest tool in the armory of any data scientist.

Wes is heavily invested in the constant improvement of the tool he created from scratch. He coordinates the development of new features and the improvement of existing ones. The data science community owes him big time.

Usage pattern and adoption of pandas

The popularity of Python has skyrocketed over the years, especially after 2012; a lot of this can be attributed to the popularity of pandas. Python-related questions make up around 12% of the total questions asked from high-income countries on Stack Overflow, a popular platform for developers to ask questions and get answers from other people in the community about how to get things done and fix bugs in different programming languages. Given that there are hundreds of programming languages, one language occupying 12% of market share is an extraordinary achievement:

The most popular data analytics tools based on a survey of Kaggle users conducted in 2017-18

According to this survey conducted by Kaggle, 60% of the respondents said that they were aware of or have used Python for their data science jobs.

According to the data recorded by Stack Overflow about the types of question asked on their platform, Python and pandas have registered steady growth year on year, while some of the other programming languages, such as Java and C, have declined in popularity and are playing catch-up. Python has almost caught up with the number of questions asked about Java on the platform, while the number for Java has shown a negative trend. pandas has been showing constant growth in numbers.

The following chart is based on data gathered from the SQL API exposed by Stack Overflow. The y axis represents the number of questions asked about that topic on Stack Overflow in a particular year:

Popularity of tools across years based on the # questions asked on Stack Overflow

Google Trends also shows a surge in popularity for pandas, as demonstrated in the following chart. The numbers represent search interest for pandas relative to the highest point (historically) on the chart for the given region and time period.

Popularity of pandas based on data from Google Trends

The geographical split of the popularity of pandas is even more interesting. The highest interest has come from China, which might be an indicator of the high adoption of open source tools and/or a very high inclination towards building powerful tech for data science:

Popularity of pandas across geographies based on Google Trends data

Apart from the popularity with its users, pandas (owing to its open source origins) also has a thriving community that is committed to constantly improving it and making it easier for the users to get answers about the issues. The following chart shows the weekly modifications (additions/deletions) to the pandas source code by the contributors:

Number of additions/deletions done to the pandas source code by contributors

pandas on the technology adoption curve

According to a popular framework called Gartner Hype Cycle, there are five phases in the process of the proliferation and adoption of technologies:

Technology trigger

Peak of inflated expectations

Trough of disillusionment

Slope of enlightenment

Plateau of productivity

The following link contains a chart that shows different technologies and the stage they are at on the technology adoption curve: https://blogs-images.forbes.com/gartnergroup/files/2012/09/2012Emerging-Technologies-Graphic4.gif.

As can be seen, Predictive Analytics has already reached the steady plateau of productivity, which is where the optimum and stable return on investment can be extracted from a technology. Since pandas is an essential component of most predictive analytics initiatives, it is safe to say that pandas has reached the plateau of productivity.

Popular applications of pandas

pandas is built on top of NumPy. Some of the noteworthy uses of pandas, apart from virtually every other data science project of course, are the following:

pandas is a dependency of statsmodels (http://www.statsmodels.org/stable/index.html), making it a significant part of Python's numerical computing ecosystem.

pandas has been used extensively in the production of many financial applications.

Summary

We live in a big data era characterized by the four V's: volume, velocity, variety, and veracity. The volume and velocity of data are set to increase for the foreseeable future. Companies that can harness and analyze big data to extract information and make actionable decisions based on this information will be the winners in the marketplace. Python is a fast-growing, user-friendly, extensible language that is very popular for data analysis.

pandas is a core library of the Python toolkit for data analysis. It provides features and capabilities that make data analysis much easier and faster than with many other popular languages, such as Java, C, C++, and Ruby.

Thus, given the strengths of Python outlined in this chapter as a choice for the analysis of data and the popularity it has gained from users, contributors, and industry leaders, data analysis practitioners utilizing Python should become adept at pandas in order to become more effective. This book aims to help you achieve this goal.

In the next chapter, we proceed towards this goal by first setting up the infrastructure required to run pandas on your computer. We will also see different ways and scenarios in which pandas can be used and run.

References

https://activewizards.com/blog/top-20-python-libraries-for-data-science-in-2018/

https://qz.com/1126615/the-story-of-the-most-important-tool-in-data-science/

Installation of pandas and Supporting Software

Before we can start working with pandas for data analysis, we need to make sure that the software is installed and the environment is in proper working order. This chapter deals with the installation of Python (if necessary), the pandas library, and all the necessary dependencies for the Windows, macOS, and Linux platforms. The topics we address include, among other things, selecting a version of Python, installing Python, and installing pandas.

The steps outlined in the following section should work for the most part, but your mileage may vary depending upon the setup. On different operating system versions, the scripts may not always work perfectly, and the third-party software packages already in the system may sometimes conflict with the instructions provided.

The following topics will be covered in this chapter:

Selecting a version of Python to use

Installation of Python and pandas using Anaconda

Dependency packages for pandas

Review of items installed with Anaconda

Cross tooling – combining pandas awesomeness with R, Julia, H2O.ai, and Azure ML Studio

Command line tricks for pandas

Options and settings for pandas

Selecting a version of Python to use

This is a classic battle among Python developers: Python 2.7.x or Python 3.x, which is better? Until a year ago, Python 2.7.x topped the charts, the reason being that it was a stable version: more than 70% of projects used Python 2.7 in 2016. This number began to fall, and by 2017 it was 63%. This shift in trends was driven by the announcement that Python 2.7 would not be maintained beyond January 1, 2020, meaning that there would be no more bug fixes or new releases after that date. Some libraries released since this announcement are only compatible with Python 3.x. Several businesses have started migrating towards Python 3.x. Hence, as of 2018, Python 3.x is the preferred version.

For further information, please see https://wiki.python.org/moin/Python2orPython3.

The main differences between Python 2.x and Python 3.x include better Unicode support in Python 3, print and exec being changed to functions, and true division for integers. For more details, see What's New in Python 3.0 at http://docs.python.org/3/whatsnew/3.0.html.
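The following Python 3 snippet illustrates a few of these differences; the Python 2 behaviors noted in the comments are included for comparison only:

# Python 3: print is a function rather than a statement
print("Hello, pandas")

# Python 3: / always performs true (float) division;
# in Python 2, 7 / 2 between integers evaluated to 3
print(7 / 2)     # 3.5
print(7 // 2)    # floor division gives 3 in both versions

# Python 3: strings are Unicode by default
print(len("naïve"))    # 5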