29,99 €
This book is designed for aspiring data scientists and those involved in data cleaning. It covers features of NumPy and Pandas, along with creating databases and tables in MySQL. It also addresses various data wrangling tasks using Python scripts and awk-based shell scripts. Companion files with code are available from the publisher.
Understanding data cleaning and manipulation is vital for data scientists. This book provides a comprehensive introduction to essential tools and techniques. From Python basics to advanced data wrangling, it equips readers with the skills needed to manage and clean data effectively.
The journey begins with an introduction to Python and progresses through working with data, Pandas, and SQL. It also covers Java, JSON, XML, and specific data cleaning tasks. The book culminates with detailed data wrangling techniques, ensuring readers gain practical, hands-on experience in data management.
Das E-Book können Sie in Legimi-Apps oder einer beliebigen App lesen, die das folgende Format unterstützen:
Seitenzahl: 344
Veröffentlichungsjahr: 2024
LICENSE, DISCLAIMER OF LIABILITY, AND LIMITED WARRANTY
By purchasing or using this book and its companion files (the “Work”), you agree that this license grants permission to use the contents contained herein, but does not give you the right of ownership to any of the textual content in the book or ownership to any of the information, files, or products contained in it. This license does not permit uploading of the Work onto the Internet or on a network (of any kind) without the written consent of the Publisher. Duplication or dissemination of any text, code, simulations, images, etc. contained herein is limited to and subject to licensing terms for the respective products, and permission must be obtained from the Publisher or the owner of the content, etc., in order to reproduce or network any portion of the textual material (in any media) that is contained in the Work.
MERCURY LEARNING AND INFORMATION (“MLI” or “the Publisher”) and anyone involved in the creation, writing, production, accompanying algorithms, code, or computer programs (“the software”), and any accompanying Web site or software of the Work, cannot and do not warrant the performance or results that might be obtained by using the contents of the Work. The author, developers, and the Publisher have used their best efforts to insure the accuracy and functionality of the textual material and/or programs contained in this package; we, however, make no warranty of any kind, express or implied, regarding the performance of these contents or programs. The Work is sold “as is” without warranty (except for defective materials used in manufacturing the book or due to faulty workmanship).
The author, developers, and the publisher of any accompanying content, and anyone involved in the composition, production, and manufacturing of this work will not be liable for damages of any kind arising out of the use of (or the inability to use) the algorithms, source code, computer programs, or textual material contained in this publication. This includes, but is not limited to, loss of revenue or profit, or other incidental, physical, or consequential damages arising out of the use of this Work.
The sole remedy in the event of a claim of any kind is expressly limited to replacement of the book and only at the discretion of the Publisher. The use of “implied warranty” and certain “exclusions” vary from state to state, and might not apply to the purchaser of this product.
Companion files also available for downloading from the publisher by writing to [email protected].
Using Pandas, SQL, and Java
Oswald Campesato
MERCURY LEARNING AND INFORMATION
Dulles, Virginia
Boston, Massachusetts
New Delhi
Copyright ©2023 by MERCURY LEARNING AND INFORMATION LLC. All rights reserved.
This publication, portions of it, or any accompanying software may not be reproduced in any way, stored in a retrieval system of any type, or transmitted by any means, media, electronic display or mechanical display, including, but not limited to, photocopy, recording, Internet postings, or scanning, without prior permission in writing from the publisher.
Publisher: David Pallai
MERCURY LEARNING AND INFORMATION
22841 Quicksilver Drive
Dulles, VA 20166
www.merclearning.com
1-800-232-0223
O. Campesato. Data Wrangling Using Pandas, SQL, and Java.
ISBN: 978-1-68392-904-8
The publisher recognizes and respects all marks used by companies, manufacturers, and developers as a means to distinguish their products. All brand names and product names mentioned in this book are trademarks or service marks of their respective companies. Any omission or misuse (of any kind) of service marks or trademarks, etc. is not an attempt to infringe on the property of others.
Library of Congress Control Number: 2022945211
222324321 Printed on acid-free paper in the United States of America.
Our titles are available for adoption, license, or bulk purchase by institutions, corporations, etc. For additional information, please contact the Customer Service Dept. at 800-232-0223(toll free).
All of our titles are available in digital format at academiccourseware.com and other digital vendors. Companion files for this title are available by writing to the publisher [email protected]. The sole obligation of MERCURY LEARNING AND INFORMATION to the purchaser is to replace the book, based on defective materials or faulty workmanship, but not based on the operation or functionality of the product.
I’d like to dedicate this book to my parents —may this bring joy and happiness into their lives.
Preface
Chapter 1: Introduction to Python
Tools for Python
easy_install and pip
virtualenv
IPython
Python Installation
Setting the PATH Environment Variable (Windows Only)
Launching Python on Your Machine
The Python Interactive Interpreter
Python Identifiers
Lines, Indentation, and Multi-Lines
Quotation and Comments
Saving Your Code in a Module
Some Standard Modules
The help() and dir() Functions
Compile Time and Runtime Code Checking
Simple Data Types
Working with Numbers
Working with Other Bases
The chr() Function
The round() Function in Python
Formatting Numbers in Python
Working with Fractions
Unicode and UTF-8
Working with Unicode
Working with Strings
Comparing Strings
Formatting Strings in Python
Uninitialized Variables and the Value None
Slicing and Splicing Strings
Testing for Digits and Alphabetic Characters
Search and Replace a String in Other Strings
Remove Leading and Trailing Characters
Printing Text Without NewLine Characters
Text Alignment
Working with Dates
Converting Strings to Dates
Exception Handling
Handling User Input
Command-Line Arguments
Summary
Chapter 2: Working with Data
Dealing with Data: What Can Go Wrong?
What is Data Drift?
What are Datasets?
Data Preprocessing
Data Types
Preparing Datasets
Discrete Data vs. Continuous Data
“Binning” Continuous Data
Scaling Numeric Data via Normalization
Scaling Numeric Data via Standardization
Scaling Numeric Data via Robust Standardization
What to Look for in Categorical Data
Mapping Categorical Data to Numeric Values
Working with Dates
Working with Currency
Working with Outliers and Anomalies
Outlier Detection/Removal
Finding Outliers with NumPy
Finding Outliers with Pandas
Calculating Z-Scores to Find Outliers
Finding Outliers with SkLearn (Optional)
Working with Missing Data
Imputing Values: When is Zero a Valid Value?
Dealing with Imbalanced Datasets
What is SMOTE?
SMOTE Extensions
The Bias-Variance Tradeoff
Types of Bias in Data
Analyzing Classifiers (Optional)
What is LIME?
What is ANOVA?
Summary
Chapter 3: Introduction to Pandas
What is Pandas?
Pandas Data Frames
Data Frames and Data Cleaning Tasks
A Pandas Data Frame Example
Describing a Pandas Data Frame
Pandas Boolean Data Frames
Transposing a Pandas Data Frame
Pandas Data Frames and Random Numbers
Converting Categorical Data to Numeric Data
Merging and Splitting Columns in Pandas
Combining Pandas Data Frames
Data Manipulation with Pandas Data Frames
Pandas Data Frames and CSV Files
Useful Options for the Pandas read_csv() Function
Reading Selected Rows from CSV Files
Pandas Data Frames and Excel Spreadsheets
Useful Options for Reading Excel Spreadsheets
Select, Add, and Delete Columns in Data Frames
Handling Outliers in Pandas
Pandas Data Frames and Simple Statistics
Finding Duplicate Rows in Pandas
Finding Missing Values in Pandas
Missing Values in an Iris-Based Dataset
Sorting Data Frames in Pandas
Working with groupby() in Pandas
Aggregate Operations with the titanic.csv Dataset
Working with apply() and mapapply() in Pandas
Useful One-line Commands in Pandas
Working with JSON-based Data
Python Dictionary and JSON
Python, Pandas, and JSON
Summary
Chapter 4: RDBMS and SQL
What is an RDBMS?
What Relationships Do Tables Have in an RDBMS?
Features of an RDBMS
What is ACID?
When Do We Need an RDBMS?
The Importance of Normalization
A Four-Table RDBMS
Detailed Table Descriptions
The customers Table
The purchase_orders Table
The line_items Table
The item_desc Table
What is SQL?
DCL, DDL, DQL, DML, and TCL
SQL Privileges
Properties of SQL Statements
The CREATE Keyword
What is MySQL?
What about MariaDB?
Installing MySQL
Data Types in MySQL
The CHAR and VARCHAR Data Types
String-based Data Types
FLOAT and DOUBLE Data Types
BLOB and TEXT Data Types
MySQL Database Operations
Creating a Database
Display a List of Databases
Display a List of Database Users
Dropping a Database
Exporting a Database
Renaming a Database
The INFORMATION_SCHEMA Table
The PROCESSLIST Table
SQL Formatting Tools
Summary
Chapter 5: Java, JSON, and XML
Working with Java and MySQL
Performing the Set-up Steps
Creating a MySQL Database in Java
Creating a MySQL Table in Java
Inserting Data into a MySQL Table in Java
Deleting Data and Dropping MySQL Tables in Java
Selecting Data from a MySQL Table in Java
Updating Data in a MySQL Table in Java
Working with JSON, MySQL, and Java
Select JSON-based Data from a MySQL Table in Java
Working with XML, MySQL, and Java
What is XML?
What is an XML Schema?
When are XML Schemas Useful?
Create a MySQL Table for XML Data in Java
Read an XML Document in Java
Read an XML Document as a String in Java
Insert XML-based Data into a MySQL Table in Java
Select XML-based Data from a MySQL Table in Java
Parse XML-based String Data from a MySQL Table in Java
Working with XML Schemas
Summary
Chapter 6: Data Cleaning Tasks
What is Data Cleaning?
Data Cleaning for Personal Titles
Data Cleaning in SQL
Replace NULL with 0
Replace NULL Values with Average Value
Replace Multiple Values with a Single Value
Handle Mismatched Attribute Values
Convert Strings to Date Values
Data Cleaning from the Command Line (Optional)
Working with the sed Utility
Working with Variable Column Counts
Truncating Rows in CSV Files
Generating Rows with Fixed Columns with the awk Utility
Converting Phone Numbers
Converting Numeric Date Formats
Converting Alphabetic Date Formats
Working with Date and Time Date Formats
Working with Codes, Countries, and Cities
Data Cleaning on a Kaggle Dataset
Summary
Chapter 7: Data Wrangling
What is Data Wrangling?
Data Transformation: What Does This Mean?
CSV Files with Multi-Row Records
Pandas Solution (1)
Pandas Solution (2)
CSV Solution
CSV Files, Multi-row Records, and the awk Command
Quoted Fields Split on Two Lines (Optional)
Overview of the Events Project
Why This Project?
Project Tasks
Generate Country Codes
Prepare a List of Cities in Countries
Generating City Codes from Country Codes: awk
Generating City Codes from Country Codes: Python
Generating SQL Statements for the city_codes Table
Generating a CSV File for Band Members (Java)
Generating a CSV File for Band Members (Python)
Generating a Calendar of Events (COE)
Project Automation Script
Project Follow-up Comments
Summary
Appendix A: Working with awk
The awk Command
Built-in Variables That Control awk
How Does the awk Command Work?
Aligning Text with the printf() Statement
Conditional Logic and Control Statements
The while Statement
A for Loop in awk
A for Loop with a break Statement
The next and continue Statements
Deleting Alternate Lines in Datasets
Merging Lines in Datasets
Printing File Contents as a Single Line
Joining Groups of Lines in a Text File
Joining Alternate Lines in a Text File
Matching with Meta Characters and Character Sets
Printing Lines Using Conditional Logic
Splitting Filenames with awk
Working with Postfix Arithmetic Operators
Numeric Functions in awk
One-line awk Commands
Useful Short awk Scripts
Printing the Words in a Text String in awk
Count Occurrences of a String in Specific Rows
Printing a String in a Fixed Number of Columns
Printing a Dataset in a Fixed Number of Columns
Aligning Columns in Datasets
Aligning Columns and Multiple Rows in Datasets
Removing a Column from a Text File
Subsets of Column-aligned Rows in Datasets
Counting Word Frequency in Datasets
Displaying Only “Pure” Words in a Dataset
Working with Multi-line Records in awk
A Simple Use Case
Another Use Case
Summary
Index
This book contains a fast-paced introduction to as much relevant information about managing data that can be reasonably included in a book of this size. However, you will be exposed to a variety of features of NumPy and Pandas, how to create databases and tables in MySQL, and how to perform many data cleaning tasks and data wrangling.
Some topics are presented in a cursory manner, which is for two main reasons. First, it’s important that you be exposed to these concepts. In some cases, you will find topics that might pique your interest, and hence motivate you to learn more about them through self-study; in other cases, you will probably be satisfied with a brief introduction. In other words, you decide whether to delve into more detail regarding the topics in this book.
Second, a full treatment of all the topics that are covered in this book would significantly increase its size, and few people have the time to read technical tomes.
This book is intended primarily for people who plan to become data scientists as well as anyone who needs to perform data cleaning tasks. This book is also intended to reach an international audience of readers with highly diverse backgrounds in various age groups. Hence, this book uses standard English rather than colloquial expressions that might be confusing to those readers. People learn by different types of imitation, which includes reading, writing, or hearing new material. This book takes these points into consideration to provide a comfortable and meaningful learning experience for the intended readers.
The first chapter briefly introduces Python, followed by Chapter 2, which delves into processing different data types in a dataset, along with normalization, standardization, and handling missing data. You will learn about outliers and how to detect them via z-scores and quantile transformation. Then you will learn about SMOTE for handling imbalanced datasets.
Chapter 3 introduces Pandas, which is a powerful Python library that enables you to read the contents of CSV files (and other text files) into data frames (somewhat analogous to Excel spreadsheets), where you can programmatically slice-and-dice the data to conform to your requirements.
Since large quantities of data are stored in the form structured data in relational databases, Chapter 4 introduces you to SQL concepts and how to perform basic operations in MySQL, such as working with databases.
Chapter 5 contains Java-based code samples for creating and accessing data in a MySQL database. Chapter 6 introduces you to data cleaning, along with various techniques for handling different scenarios, such as missing data and outliers.
The seventh chapter of this book explains data wrangling, and contains Python scripts and awk-based shell scripts to solve various tasks. Finally, there is an appendix for awk, which will assist you in understanding the awk-based scripts in Chapter 7.
Most of the code samples are short (usually less than one page and sometimes less than half a page), and if need be, you can easily and quickly copy/paste the code into a new Jupyter notebook. For the Python code samples that reference a CSV file, you need an additional code snippet in the corresponding Jupyter notebook to access the CSV file. Moreover, the code samples execute quickly, so you won’t need to avail yourself of the free GPU that is provided in Google Colaboratory.
If you do decide to use Google Colaboratory, you can easily copy/paste the Python code into a notebook, and use the upload feature to upload existing Jupyter notebooks. Keep in mind the following point: if the Python code references a CSV file, make sure that you include the appropriate code snippet (as explained in Chapter 1) to access the CSV file in the corresponding Jupyter notebook in Google Colaboratory.
Once again, the answer depends on the extent to which you plan to become involved in data analytics. For example, if you plan to study machine learning, then you will probably learn how to create and train a model, which is a task that is performed after data cleaning tasks. In general, you will probably need to learn everything that you encounter in this book if you are planning to become a machine learning engineer.
Some programmers learn well from prose, others learn well from sample code (and lots of it), which means that there’s no single style that can be used for everyone.
Moreover, some programmers want to run the code first, see what it does, and then return to the code to delve into the details (and others use the opposite approach).
Consequently, there are various types of code samples in this book: some are short, some are long, and other code samples “build” from earlier code samples.
Current knowledge of Python 3.x is the most helpful skill. Knowledge of other programming languages (such as Java) can also be helpful because of the exposure to programming concepts and constructs. The less technical knowledge that you have, the more diligence will be required to understand the various topics that are covered.
If you want to be sure that you can grasp the material in this book, glance through some of the code samples to get an idea of how much is familiar to you and how much is new for you.
The primary purpose of the code samples in this book is to show you Python-based libraries for solving a variety of data-related tasks in conjunction with acquiring a rudimentary understanding of statistical concepts. Clarity has a higher priority than writing more compact code that is more difficult to understand (and possibly more prone to bugs). If you decide to use any of the code in this book in a production website, you should subject that code to the same rigorous analysis as the other parts of your code base.
Although the answer to this question is more difficult to quantify, it’s very important to have strong desire to learn about data cleaning and wrangling, along with the motivation and discipline to read and understand the code samples.
If you are a Mac user, there are three ways to do so. The first method is to use Finder to navigate to Applications>Utilities and then double click on the Utilities application. Next, if you already have a command shell available, you can launch a new command shell by typing the following command:
open /Applications/Utilities/Terminal.app
A second method for Mac users is to open a new command shell on a MacBook from a command shell that is already visible simply by clicking command+n in that command shell, and your Mac will launch another command shell.
If you are a PC user, you can install Cygwin (open source: https://cygwin.com/), which simulates bash commands, or use another toolkit such as MKS (a commercial product). Please read the online documentation that describes the download and installation process. Note that custom aliases are not automatically set if they are defined in a file other than the main start-up file (such as .bash_login).
All the code samples and figures in this book may be obtained by writing to the publisher at [email protected].
The answer to this question varies widely, mainly because the answer depends heavily on your objectives. If you are interested primarily in NLP, then you can learn more advanced concepts, such as attention, transformers, and the BERT-related models.
If you are primarily interested in machine learning, there are some subfields of machine learning, such as deep learning and reinforcement learning (and deep reinforcement learning) that might appeal to you. Fortunately, there are many resources available, and you can perform an Internet search for those resources. One other point: the aspects of machine learning for you to learn depend on who you are. The needs of a machine learning engineer, data scientist, manager, student, or software developer are all different.