Python Tools for Data Scientists Pocket Primer - Mercury Learning and Information - E-Book

Description

This book, part of the best-selling Pocket Primer series, offers a comprehensive introduction to essential Python tools for data scientists. It begins with an overview of Python basics, followed by in-depth coverage of NumPy and Pandas, focusing on their features and applications. The text also addresses the critical tasks of writing regular expressions and performing data cleaning.
Further sections delve into data visualization techniques and the use of Sklearn and SciPy, providing practical knowledge and skills for handling complex data analysis tasks. This structured approach ensures that readers gain a complete understanding of the tools and techniques necessary for effective data science.
Designed to be accessible yet thorough, this book includes numerous code samples to reinforce learning. Companion files with source code are available for download, making it an invaluable resource for anyone looking to master Python for data science and enhance their data analysis capabilities.

Page count: 341

Year of publication: 2024


PYTHON TOOLS FOR DATA SCIENTISTS

Pocket Primer

LICENSE, DISCLAIMER OF LIABILITY, AND LIMITED WARRANTY

By purchasing or using this book and companion files (the “Work”), you agree that this license grants permission to use the contents contained herein, including the disc, but does not give you the right of ownership to any of the textual content in the book / disc or ownership to any of the information or products contained in it. This license does not permit uploading of the Work onto the Internet or on a network (of any kind) without the written consent of the Publisher. Duplication or dissemination of any text, code, simulations, images, etc. contained herein is limited to and subject to licensing terms for the respective products, and permission must be obtained from the Publisher or the owner of the content, etc., in order to reproduce or network any portion of the textual material (in any media) that is contained in the Work.

MERCURY LEARNING AND INFORMATION (“MLI” or “the Publisher”) and anyone involved in the creation, writing, or production of the companion disc, accompanying algorithms, code, or computer programs (“the software”), and any accompanying Web site or software of the Work, cannot and do not warrant the performance or results that might be obtained by using the contents of the Work. The author, developers, and the Publisher have used their best efforts to ensure the accuracy and functionality of the textual material and/or programs contained in this package; we, however, make no warranty of any kind, express or implied, regarding the performance of these contents or programs. The Work is sold “as is” without warranty (except for defective materials used in manufacturing the book or due to faulty workmanship).

The author, developers, and the publisher of any accompanying content, and anyone involved in the composition, production, and manufacturing of this work will not be liable for damages of any kind arising out of the use of (or the inability to use) the algorithms, source code, computer programs, or textual material contained in this publication. This includes, but is not limited to, loss of revenue or profit, or other incidental, physical, or consequential damages arising out of the use of this Work.

The sole remedy in the event of a claim of any kind is expressly limited to replacement of the book and/or disc, and only at the discretion of the Publisher. The use of “implied warranty” and certain “exclusions” vary from state to state, and might not apply to the purchaser of this product.

Companion files for this title are available by writing to the publisher at [email protected].

PYTHON TOOLS FOR DATA SCIENTISTS

Pocket Primer

Oswald Campesato

Copyright ©2023 by MERCURY LEARNING AND INFORMATION LLC. All rights reserved.

This publication, portions of it, or any accompanying software may not be reproduced in any way, stored in a retrieval system of any type, or transmitted by any means, media, electronic display or mechanical display, including, but not limited to, photocopy, recording, Internet postings, or scanning, without prior permission in writing from the publisher.

Publisher: David Pallai
MERCURY LEARNING AND INFORMATION
22841 Quicksilver Drive
Dulles, VA [email protected]

O. Campesato. Python Tools for Data Scientists Pocket Primer.
ISBN: 978-1-68392-823-2

The publisher recognizes and respects all marks used by companies, manufacturers, and developers as a means to distinguish their products. All brand names and product names mentioned in this book are trademarks or service marks of their respective companies. Any omission or misuse (of any kind) of service marks or trademarks, etc. is not an attempt to infringe on the property of others.

Library of Congress Control Number: 2022943452

22 23 24 3 2 1

This book is printed on acid-free paper in the United States of America.

Our titles are available for adoption, license, or bulk purchase by institutions, corporations, etc. For additional information, please contact the Customer Service Dept. at 800-232-0223 (toll free).

All of our titles are available in digital format at academiccourseware.com and other digital vendors. Companion files (figures and code listings) for this title are available by contacting [email protected]. The sole obligation of MERCURY LEARNING AND INFORMATION to the purchaser is to replace the disc, based on defective materials or faulty workmanship, but not based on the operation or functionality of the product.

I’d like to dedicate this book to my parents – may this bring joy and happiness into their lives.

CONTENTS

Preface

Chapter 1: Introduction to Python

Tools for Python

easy_install and pip

virtualenv

Python Installation

Setting the PATH Environment Variable (Windows Only)

Launching Python on Your Machine

The Python Interactive Interpreter

Python Identifiers

Lines, Indentations, and Multi-Lines

Quotation and Comments in Python

Saving Your Code in a Module

Some Standard Modules in Python

The help() and dir() Functions

Compile Time and Runtime Code Checking

Simple Data Types in Python

Working with Numbers

Working with Other Bases

The chr() Function

The round() Function in Python

Formatting Numbers in Python

Unicode and UTF-8

Working with Unicode

Listing 1.1: Unicode1.py

Working with Strings

Comparing Strings

Listing 1.2: Compare.py

Formatting Strings in Python

Uninitialized Variables and the Value None in Python

Slicing and Splicing Strings

Testing for Digits and Alphabetic Characters

Listing 1.3: CharTypes.py

Search and Replace a String in Other Strings

Listing 1.4: FindPos1.py

Listing 1.5: Replace1.py

Remove Leading and Trailing Characters

Listing 1.6: Remove1.py

Printing Text without NewLine Characters

Text Alignment

Working with Dates

Listing 1.7: Datetime2.py

Listing 1.8: datetime2.out

Converting Strings to Dates

Listing 1.9: String2Date.py

Exception Handling in Python

Listing 1.10: Exception1.py

Handling User Input

Listing 1.11: UserInput1.py

Listing 1.12: UserInput2.py

Listing 1.13: UserInput3.py

Command-Line Arguments

Listing 1.14: Hello.py

Summary

Chapter 2: Introduction to NumPy

What is NumPy?

Useful NumPy Features

What are NumPy Arrays?

Listing 2.1: nparray1.py

Working with Loops

Listing 2.2: loop1.py

Appending Elements to Arrays (1)

Listing 2.3: append1.py

Appending Elements to Arrays (2)

Listing 2.4: append2.py

Multiplying Lists and Arrays

Listing 2.5: multiply1.py

Doubling the Elements in a List

Listing 2.6: double_list1.py

Lists and Exponents

Listing 2.7: exponent_list1.py

Arrays and Exponents

Listing 2.8: exponent_array1.py

Math Operations and Arrays

Listing 2.9: mathops_array1.py

Working with “−1” Sub-ranges with Vectors

Listing 2.10: npsubarray2.py

Working with “−1” Sub-ranges with Arrays

Listing 2.11: np2darray2.py

Other Useful NumPy Methods

Arrays and Vector Operations

Listing 2.12: array_vector.py

NumPy and Dot Products (1)

Listing 2.13: dotproduct1.py

NumPy and Dot Products (2)

Listing 2.14: dotproduct2.py

NumPy and the Length of Vectors

Listing 2.15: array_norm.py

NumPy and Other Operations

Listing 2.16: otherops.py

NumPy and the reshape() Method

Listing 2.17: numpy_reshape.py

Calculating the Mean and Standard Deviation

Listing 2.18: sample_mean_std.py

Code Sample with Mean and Standard Deviation

Listing 2.19: stat_values.py

Trimmed Mean and Weighted Mean

Working with Lines in the Plane (Optional)

Plotting Randomized Points with NumPy and Matplotlib

Listing 2.20: np_plot.py

Plotting a Quadratic with NumPy and Matplotlib

Listing 2.21: np_plot_quadratic.py

What is Linear Regression?

What is Multivariate Analysis?

What about Non-Linear Datasets?

The MSE (Mean Squared Error) Formula

Other Error Types

Non-Linear Least Squares

Calculating the MSE Manually

Find the Best-Fitting Line in NumPy

Listing 2.22: find_best_fit.py

Calculating MSE by Successive Approximation (1)

Listing 2.23: plain_linreg1.py

Calculating MSE by Successive Approximation (2)

Listing 2.24: plain_linreg2.py

Google Colaboratory

Uploading CSV Files in Google Colaboratory

Listing 2.25: upload_csv_file.ipynb

Summary

Chapter 3: Introduction to Pandas

What is Pandas?

Pandas Options and Settings

Pandas Data Frames

Data Frames and Data Cleaning Tasks

Alternatives to Pandas

A Pandas Data Frame with a NumPy Example

Listing 3.1: pandas_df.py

Describing a Pandas Data Frame

Listing 3.2: pandas_df_describe.py

Pandas Boolean Data Frames

Listing 3.3: pandas_boolean_df.py

Transposing a Pandas Data Frame

Pandas Data Frames and Random Numbers

Listing 3.4: pandas_random_df.py

Listing 3.5: pandas_combine_df.py

Reading CSV Files in Pandas

Listing 3.6: sometext.txt

Listing 3.7: read_csv_file.py

The loc() and iloc() Methods in Pandas

Converting Categorical Data to Numeric Data

Listing 3.8: cat2numeric.py

Listing 3.9: shirts.csv

Listing 3.10: shirts.py

Matching and Splitting Strings in Pandas

Listing 3.11: shirts_str.py

Converting Strings to Dates in Pandas

Listing 3.12: string2date.py

Merging and Splitting Columns in Pandas

Listing 3.13: employees.csv

Listing 3.14: emp_merge_split.py

Combining Pandas Data Frames

Listing 3.15: concat_frames.py

Data Manipulation with Pandas Data Frames (1)

Listing 3.16: pandas_quarterly_df1.py

Data Manipulation with Pandas Data Frames (2)

Listing 3.17: pandas_quarterly_df2.py

Data Manipulation with Pandas Data Frames (3)

Listing 3.18: pandas_quarterly_df3.py

Pandas Data Frames and CSV Files

Listing 3.19: weather_data.py

Listing 3.20: people.csv

Listing 3.21: people_pandas.py

Managing Columns in Data Frames

Switching Columns

Appending Columns

Deleting Columns

Inserting Columns

Scaling Numeric Columns

Listing 3.22: numbers.csv

Listing 3.23: scale_columns.py

Managing Rows in Pandas

Selecting a Range of Rows in Pandas

Listing 3.24: duplicates.csv

Listing 3.25: row_range.py

Finding Duplicate Rows in Pandas

Listing 3.26: duplicates.py

Listing 3.27: drop_duplicates.py

Inserting New Rows in Pandas

Listing 3.28: emp_ages.csv

Listing 3.29: insert_row.py

Handling Missing Data in Pandas

Listing 3.30: employees2.csv

Listing 3.31: missing_values.py

Multiple Types of Missing Values

Listing 3.32: employees3.csv

Listing 3.33: missing_multiple_types.py

Test for Numeric Values in a Column

Listing 3.34: test_for_numeric.py

Replacing NaN Values in Pandas

Listing 3.35: missing_fill_drop.py

Sorting Data Frames in Pandas

Listing 3.36: sort_df.py

Working with groupby() in Pandas

Listing 3.37: groupby1.py

Working with apply() and mapapply() in Pandas

Listing 3.38: apply1.py

Listing 3.39: apply2.py

Listing 3.40: mapapply1.py

Listing 3.41: mapapply2.py

Handling Outliers in Pandas

Listing 3.42: outliers_zscores.py

Pandas Data Frames and Scatterplots

Listing 3.43: pandas_scatter_df.py

Pandas Data Frames and Simple Statistics

Listing 3.44: housing.csv

Listing 3.45: housing_stats.py

Aggregate Operations in Pandas Data Frames

Listing 3.46: aggregate1.py

Aggregate Operations with the titanic.csv Dataset

Listing 3.47: aggregate2.py

Save Data Frames as CSV Files and Zip Files

Listing 3.48: save2csv.py

Pandas Data Frames and Excel Spreadsheets

Listing 3.49: write_people_xlsx.py

Listing 3.50: read_people_xslx.py

Working with JSON-based Data

Python Dictionary and JSON

Listing 3.51: dict2json.py

Python, Pandas, and JSON

Listing 3.52: pd_python_json.py

Useful One-line Commands in Pandas

What is Method Chaining?

Pandas and Method Chaining

Pandas Profiling

Listing 3.53: titanic.csv

Listing 3.54: profile_titanic.py

Summary

Chapter 4: Working with Sklearn and Scipy

What is Sklearn?

Sklearn Features

The Digits Dataset in Sklearn

Listing 4.1: load_digits1.py

Listing 4.2: load_digits2.py

Listing 4.3: sklearn_digits.py

The train_test_split() Class in Sklearn

Selecting Columns for X and y

What is Feature Engineering?

The Iris Dataset in Sklearn (1)

Listing 4.4: sklearn_iris1.py

Sklearn, Pandas, and the Iris Dataset

Listing 4.5: pandas_iris.py

The Iris Dataset in Sklearn (2)

Listing 4.6: sklearn_iris2.py

The Faces Dataset in Sklearn (Optional)

Listing 4.7: sklearn_faces.py

What is SciPy?

Installing SciPy

Permutations and Combinations in SciPy

Listing 4.8: scipy_perms.py

Listing 4.9: scipy_combinatorics.py

Calculating Log Sums

Listing 4.10: scipy_matrix_inv.py

Calculating Polynomial Values

Listing 4.11: scipy_poly.py

Calculating the Determinant of a Square Matrix

Listing 4.12: scipy_determinant.py

Calculating the Inverse of a Matrix

Listing 4.13: scipy_matrix_inv.py

Calculating Eigenvalues and Eigenvectors

Listing 4.14: scipy_eigen.py

Calculating Integrals (Calculus)

Listing 4.15: scipy_integrate.py

Calculating Fourier Transforms

Listing 4.16: scipy_fourier.py

Flipping Images in SciPy

Listing 4.17: scipy_flip_image.py

Rotating Images in SciPy

Listing 4.18: scipy_rotate_image.py

Google Colaboratory

Uploading CSV Files in Google Colaboratory

Listing 4.19: upload_csv_file.ipynb

Summary

Chapter 5: Data Cleaning Tasks

What is Data Cleaning?

Data Cleaning for Personal Titles

Data Cleaning in SQL

Replace NULL with 0

Replace NULL Values with the Average Value

Listing 5.1: replace_null_values.sql

Replace Multiple Values with a Single Value

Listing 5.2: reduce_values.sql

Handle Mismatched Attribute Values

Listing 5.3: type_mismatch.sql

Convert Strings to Date Values

Listing 5.4: str_to_date.sql

Data Cleaning from the Command Line (optional)

Working with the sed Utility

Listing 5.5: delimiter1.txt

Listing 5.6: delimiter1.sh

Working with Variable Column Counts

Listing 5.7: variable_columns.csv

Listing 5.8: variable_columns.sh

Listing 5.9: variable_columns2.sh

Truncating Rows in CSV Files

Listing 5.10: variable_columns3.sh

Generating Rows with Fixed Columns with the awk Utility

Listing 5.11: FixedFieldCount1.sh

Listing 5.12: employees.txt

Listing 5.13: FixedFieldCount2.sh

Converting Phone Numbers

Listing 5.14: phone_numbers.txt

Listing 5.15: phone_numbers.sh

Converting Numeric Date Formats

Listing 5.16: dates.txt

Listing 5.17: dates.sh

Listing 5.18: dates2.sh

Converting Alphabetic Date Formats

Listing 5.19: dates2.txt

Listing 5.20: dates3.sh

Working with Date and Time Date Formats

Listing 5.21: date-times.txt

Listing 5.22: date-times-padded.sh

Working with Codes, Countries, and Cities

Listing 5.23: country_codes.csv

Listing 5.24: add_country_codes.sh

Listing 5.25: countries_cities.csv

Listing 5.26: split_countries_codes.sh

Listing 5.27: countries_cities2.csv

Listing 5.28: split_countries_codes2.sh

Data Cleaning on a Kaggle Dataset

Listing 5.29: convert_marketing.sh

Summary

Chapter 6: Data Visualization

What is Data Visualization?

Types of Data Visualization

What is Matplotlib?

Diagonal Lines in Matplotlib

Listing 6.1: diagonallines.py

A Colored Grid in Matplotlib

Listing 6.2: plotgrid2.py

Randomized Data Points in Matplotlib

Listing 6.3: lin_plot_reg.py

A Histogram in Matplotlib

Listing 6.4: histogram1.py

A Set of Line Segments in Matplotlib

Listing 6.5: line_segments.py

Plotting Multiple Lines in Matplotlib

Listing 6.6: plt_array2.py

Trigonometric Functions in Matplotlib

Listing 6.7: sincos.py

Display IQ Scores in Matplotlib

Listing 6.8: iq_scores.py

Plot a Best-Fitting Line in Matplotlib

Listing 6.9: plot_best_fit.py

The Iris Dataset in SkLearn

Listing 6.10: sklearn_iris1.py

SkLearn, Pandas, and the Iris Dataset

Listing 6.11: pandas_iris.py

Working with Seaborn

Features of Seaborn

Seaborn Built-in Datasets

Listing 6.12: seaborn_tips.py

The Iris Dataset in Seaborn

Listing 6.13: seaborn_iris.py

The Titanic Dataset in Seaborn

Listing 6.14: seaborn_titanic_plot.py

Extracting Data from the Titanic Dataset in Seaborn (1)

Listing 6.15: seaborn_titanic.py

Extracting Data from the Titanic Dataset in Seaborn (2)

Listing 6.16: seaborn_titanic2.py

Visualizing a Pandas Dataset in Seaborn

Listing 6.17: pandas_seaborn.py

Data Visualization in Pandas

Listing 6.18: pandas_viz1.py

What is Bokeh?

Listing 6.19: bokeh_trig.py

Summary

Appendix A: Working with Data

What are Datasets?

Data Preprocessing

Data Types

Preparing Datasets

Discrete Data vs. Continuous Data

“Binning” Continuous Data

Scaling Numeric Data via Normalization

Scaling Numeric Data via Standardization

What to Look for in Categorical Data

Mapping Categorical Data to Numeric Values

Working with Dates

Working with Currency

Missing Data, Anomalies, and Outliers

Missing Data

Anomalies and Outliers

Outlier Detection

What is Data Drift?

What is Imbalanced Classification?

What is SMOTE?

SMOTE Extensions

Analyzing Classifiers (Optional)

What is LIME?

What is ANOVA?

The Bias-Variance Trade-Off

Types of Bias in Data

Summary

Appendix B: Working with awk

The awk Command

Built-in Variables that Control awk

How Does the awk Command Work?

Aligning Text with the printf Statement

Listing B.1: columns2.txt

Listing B.2: AlignColumns1.sh

Conditional Logic and Control Statements

The while Statement

A for loop in awk

Listing B.3: Loop.sh

A for loop with a break Statement

The next and continue Statements

Deleting Alternate Lines in Datasets

Listing B.4: linepairs.csv

Listing B.5: deletelines.sh

Merging Lines in Datasets

Listing B.6: columns.txt

Listing B.7: ColumnCount1.sh

Printing File Contents as a Single Line

Joining Groups of Lines in a Text File

Listing B.8: digits.txt

Listing B.9: digits.sh

Joining Alternate Lines in a Text File

Listing B.10: columns2.txt

Listing B.11: JoinLines.sh

Listing B.12: JoinLines2.sh

Listing B.13: JoinLines2.sh

Matching with Meta Characters and Character Sets

Listing B.14: Patterns1.sh

Listing B.15: columns3.txt

Listing B.16: MatchAlpha1.sh

Printing Lines Using Conditional Logic

Listing B.17: products.txt

Splitting Filenames with awk

Listing B.18: SplitFilename2.sh

Working with Postfix Arithmetic Operators

Listing B.19: mixednumbers.txt

Listing B.20: AddSubtract1.sh

Numeric Functions in awk

One Line awk Commands

Useful Short awk Scripts

Listing B.21: data.txt

Printing the Words in a Text String in awk

Listing B.22: Fields2.sh

Count Occurrences of a String in Specific Rows

Listing B.23: data1.csv

Listing B.24: data2.csv

Listing B.25: checkrows.sh

Printing a String in a Fixed Number of Columns

Listing B.26: FixedFieldCount1.sh

Printing a Dataset in a Fixed Number of Columns

Listing B.27: VariableColumns.txt

Listing B.28: Fields3.sh

Aligning Columns in Datasets

Listing B.29: mixed-data.csv

Listing B.30: mixed-data.sh

Aligning Columns and Multiple Rows in Datasets

Listing B.31: mixed-data2.csv

Listing B.32: aligned-data2.csv

Listing B.33: mixed-data2.sh

Removing a Column from a Text File

Listing B.34: VariableColumns.txt

Listing B.35: RemoveColumn.sh

Subsets of Column-aligned Rows in Datasets

Listing B.36: sub-rows-cols.txt

Listing B.37: sub-rows-cols.sh

Counting Word Frequency in Datasets

Listing B.38: WordCounts1.sh

Listing B.39: WordCounts2.sh

Listing B.40: columns4.txt

Displaying Only “Pure” Words in a Dataset

Listing B.41: onlywords.sh

Working with Multi-line Records in awk

Listing B.42: employees.txt

Listing B.43: employees.sh

A Simple Use Case

Listing B.44: quotes3.csv

Listing B.45: delim1.sh

Another Use Case

Listing B.46: dates2.csv

Listing B.47: string2date2.sh

Summary

Index

PREFACE

What is the Primary Value Proposition for this Book?

This book contains a fast-paced introduction to as much relevant information about Python tools for data scientists as can reasonably be included in a book of this size. If you are a novice, this book will give you a starting point from which you can decide which Python technologies you want to explore in greater detail.

You will be exposed to features of NumPy and Pandas, how to write regular expressions, and how to perform data cleaning tasks. Some topics are presented in a cursory manner, for two main reasons. First, it’s important that you be exposed to these concepts. In some cases, you will find topics that pique your interest and motivate you to learn more about them through self-study; in other cases, you will probably be satisfied with a brief introduction. In other words, you decide whether to delve deeply into each of the topics in this book.

Second, a full treatment of all the topics that are covered in this book would significantly increase its size, and few people are interested in reading technical tomes with 500 or more pages.

However, it’s important for you to decide if this approach is suitable for your needs and learning style. If not, you can select one or more of the plethora of data analytics books that are available.

The Target Audience

This book is intended primarily for people who have worked with Python and are interested in learning about several important Python libraries. Moreover, this book is also intended to reach an international audience of readers with highly diverse backgrounds in various age groups. Consequently, this book uses standard English rather than colloquial expressions that might be confusing to those readers. As you know, many people learn by different types of imitation, which includes reading, writing, or hearing new material. This book takes these points into consideration to provide a comfortable and meaningful learning experience for the intended readers.

What Will I Learn from This Book?

The first chapter contains a quick tour of basic Python. Next, Chapter 2 introduces you to NumPy, followed by Chapter 3 for Pandas. Chapter 4 provides a high-level view of Sklearn, which is an extremely powerful Python library that is central to many machine learning tasks, along with an introduction to SciPy.

Chapter 5 contains an assortment of data cleaning tasks that are solved via Python as well as the awk programming language. Chapter 6 delves into data visualization with Matplotlib, Seaborn, and Bokeh. Next, one appendix explores issues that can arise with data, followed by an appendix for awk.

Why is an Appendix for awk Included in This Book?

While many data cleaning tasks can be performed via Python, sometimes it’s much easier to perform data cleaning via awk. If you have not worked with awk, it’s a venerable Unix utility that was developed almost 50 years ago by Aho, Weinberger, and Kernighan (the latter is a coauthor of the famous K&R book for C).

Incidentally, most of the Python code samples are short (usually less than one page and sometimes less than half a page), and if need be, you can easily and quickly copy/paste the code into a new Jupyter notebook. For the Python code samples that reference a CSV file, you do not need any additional code in the corresponding Jupyter notebook to access the CSV file. Moreover, the code samples execute quickly, so you won’t need to avail yourself of the free GPU that is provided in Google Colaboratory.

If you do decide to use Google Colaboratory, you can easily copy/paste the Python code into a notebook, and also use the upload feature to upload existing Jupyter notebooks. Keep in mind the following point: if the Python code references a CSV file, make sure that you include the appropriate code snippet (as explained in Chapter 2) to access the CSV file in the corresponding Jupyter notebook in Google Colaboratory.

Do I Need to Learn the Theory Portions of this Book?

Once again, the answer depends on the extent to which you plan to become involved in data analytics. For example, if you plan to study machine learning, then you will probably learn how to create and train a model, which is a task that is performed after data cleaning tasks. In general, you will probably need to learn everything that you encounter in this book if you are planning to become a machine learning engineer.

Why Does This Book Include Sklearn Material?

The amount of Sklearn material in this book is minimal because this book is not about machine learning. The Sklearn material is located in Chapter 4, where you will learn about some of the Sklearn built-in datasets. If you decide to delve into machine learning, you will have already been introduced to some aspects of Sklearn.

Getting the Most from This Book

Some programmers learn well from prose, others learn well from sample code (and lots of it), which means that there’s no single style that can be used for everyone.

Moreover, some programmers want to run the code first, see what it does, and then return to the code to delve into the details (and others use the opposite approach).

Consequently, there are various types of code samples in this book: some are short, some are long, and other code samples “build” from earlier code samples.

What Do I Need to Know for This Book?

Current knowledge of Python 3.x is the most helpful skill. Knowledge of other programming languages (such as Java) can also be helpful because of the exposure to programming concepts and constructs. The less technical knowledge that you have, the more diligence will be required to understand the various topics that are covered.

If you want to be sure that you can grasp the material in this book, glance through some of the code samples to get an idea of how much is familiar to you and how much is new for you.

Do the Companion Files Obviate the Need for This Book?

The companion files contain all the code samples, which saves you the time and effort of the error-prone process of manually typing code into a text file. However, there are situations in which you might not have easy access to the companion files. Furthermore, the code samples in the book are accompanied by explanations that are not available in the companion files.

Does This Book Contain Production-level Code Samples?

The primary purpose of the code samples in this book is to show you Python-based libraries for solving a variety of data-related tasks in conjunction with acquiring a rudimentary understanding of statistical concepts. Clarity has a higher priority than writing more compact code that is more difficult to understand (and possibly more prone to bugs). If you decide to use any of the code in this book in a production website, you ought to subject that code to the same rigorous analysis as the other parts of your code base.

What are the Non-Technical Prerequisites for This Book?

Although the answer to this question is more difficult to quantify, it’s very important to have a strong desire to learn about data analytics, along with the motivation and discipline to read and understand the code samples.

How Do I Set Up a Command Shell?

If you are a Mac user, there are three ways to do so. The first method is to use Finder to navigate to Applications > Utilities and then double-click the Terminal application. Second, if you already have a command shell available, you can launch a new command shell by typing the following command:

open /Applications/Utilities/Terminal.app

A third method for Mac users is to open a new command shell from a command shell that is already visible simply by pressing command+n in that command shell, and your MacBook will launch another command shell.

If you are a PC user, you can install Cygwin (open source, https://cygwin.com/), which simulates bash commands, or use another toolkit such as MKS (a commercial product). Please read the online documentation that describes the download and installation process. Note that custom aliases are not automatically set if they are defined in a file other than the main start-up file (such as .bash_login).

Companion Files

All the code samples and figures in this book may be obtained by writing to the publisher at [email protected].

What are the “Next Steps” After Finishing This Book?

The answer to this question varies widely, mainly because the answer depends heavily on your objectives. If you are interested primarily in NLP, then you can start by learning the fundamentals of NLP and then proceed to more advanced concepts, such as attention, transformers, and the BERT-related models.

If you are primarily interested in machine learning, there are some subfields of machine learning, such as deep learning and reinforcement learning (and deep reinforcement learning) that might appeal to you. Fortunately, there are many resources available, and you can perform an Internet search for those resources. One other point: the aspects of machine learning for you to learn depend on who you are. The needs of a machine learning engineer, data scientist, manager, student or software developer are all different.

CHAPTER 1

INTRODUCTION TO PYTHON

This chapter contains an introduction to Python, with information about useful tools for installing its modules, working with its basic constructs, and managing some data types.

The first part of this chapter covers a Python installation, some environment variables, and usage of the interpreter. It includes code samples and shows how to save code in text files that you can launch from the command line. The second part of this chapter shows you how to work with simple data types, such as numbers, fractions, and strings. The third part of this chapter discusses exceptions and how to use them in scripts.

NOTE

The scripts in this book are for Python 3.x.

Tools for Python

The Anaconda Python distribution is available for Windows, Linux, and Mac, and is downloadable from http://continuum.io/downloads.

Anaconda is well suited for modules such as NumPy and SciPy, and it appears to be the better choice if you are a Windows user.

easy_install and pip

Both easy_install and pip are easy to use when you need to install Python modules. Use either tool with the following syntax (note that easy_install has since been deprecated in setuptools, and pip is the recommended installer):

easy_install <module-name>

pip install <module-name>

NOTE

Python-based modules are easy to install. Modules with code written in C are usually faster, but more difficult in terms of installation.
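Before installing a module, it can be useful to check whether it is already available. The following is a minimal sketch (not one of the book's listings) that uses the standard library function importlib.util.find_spec(); the helper name is_installed is hypothetical, and the module names are only examples.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(module_name) is not None


# "json" ships with Python; "numpy" is present only if it has been installed.
for name in ("json", "numpy"):
    if is_installed(name):
        print(name, "-> installed")
    else:
        print(name, "-> missing (try: pip install " + name + ")")
```

The check relies on find_spec() returning None for a top-level module that cannot be found, so nothing is imported as a side effect.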

virtualenv

The virtualenv tool enables you to create isolated Python environments:

http://www.virtualenv.org/en/latest/virtualenv.html

virtualenv addresses the problem of preserving the correct dependencies and versions (and indirectly permissions) for different applications. (If you are a Python novice, you might not need virtualenv right now.) The next section shows you how to check whether Python is installed on your machine, and also where you can download Python.
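As an illustration of what an isolated environment is, the sketch below uses venv, the standard library counterpart of virtualenv (an assumption made for this example, since the text discusses virtualenv itself). The helper names in_virtual_env and create_env and the directory name demo_env are hypothetical.

```python
import sys
import tempfile
import venv
from pathlib import Path


def in_virtual_env():
    """Return True when the running interpreter belongs to a virtual environment."""
    return sys.prefix != sys.base_prefix


def create_env(target_dir):
    """Create an isolated environment in target_dir and return its config file path."""
    # with_pip=False keeps creation fast; pass True to bootstrap pip as well.
    venv.create(target_dir, with_pip=False)
    return Path(target_dir) / "pyvenv.cfg"


# Build a throwaway environment in a temporary directory, then discard it.
with tempfile.TemporaryDirectory() as tmp:
    cfg = create_env(Path(tmp) / "demo_env")
    print("inside a virtual environment:", in_virtual_env())
    print("config file created:", cfg.exists())
```

Each environment has its own site-packages directory, which is why modules installed in one environment do not affect another.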

Python Installation

Before you download anything, check whether Python is already installed on your machine (which is likely if you have a MacBook or a Linux machine) by typing the following command in a command shell:

python -V

The output for the MacBook used in this book is

Python 3.9.1

NOTE

Install Python 3.9.1 (or as close as possible to this version) on your machine so that you will have the same version of Python that was used to test the scripts in this book.
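You can also perform this version check from inside a script. The sketch below assumes that Python 3.9 is the minimum version you want to enforce (the book only recommends staying close to 3.9.1); REQUIRED and version_ok are names chosen for this example.

```python
import sys

REQUIRED = (3, 9)  # assumed minimum, based on the version used in this book

def version_ok(version_info=sys.version_info, required=REQUIRED):
    """Return True if the given (major, minor, ...) version is at least `required`."""
    return tuple(version_info[:2]) >= required

print("Running Python", sys.version.split()[0])
print("Meets requirement:", version_ok())
```

Comparing tuples works because Python compares them element by element, so (3, 12) is correctly treated as newer than (3, 9).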

If you need to install Python on your machine, navigate to the Python home page and select the downloads link or navigate directly to this website:

http://www.python.org/download/

In addition, PythonWin is available for Windows, and its home page is online:

http://www.cgl.ucsf.edu/Outreach/pc204/pythonwin.html

Use any text editor that can create and edit Python scripts and save them as plain text files (don’t use Microsoft Word).

After you have Python installed and configured on your machine, you are ready to work with the Python scripts in this book.

Setting the PATH Environment Variable (Windows Only)

The PATH environment variable specifies a list of directories that are searched whenever you specify an executable program from the command line. To set up your environment so that the Python executable is available in every command shell, follow the instructions found online:

http://www.blog.pythonlibrary.org/2011/11/24/python-101-setting-up-python-on-windows/
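Once Python runs, you can inspect PATH from Python itself. The following sketch prints each directory on PATH and then asks the standard shutil.which() function where an executable named python would be found; the result depends entirely on your machine and may be None.

```python
import os
import shutil

# PATH is a single string; os.pathsep is ';' on Windows and ':' elsewhere.
path_dirs = os.environ.get("PATH", "").split(os.pathsep)
for directory in path_dirs:
    print(directory)

# shutil.which returns the full path of the first match on PATH, or None.
print("python found at:", shutil.which("python"))
```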

Launching Python on Your Machine

There are three different ways to launch Python:

Use the Python Interactive Interpreter.

Launch Python scripts from the command line.

Use an IDE.

The next section shows you how to launch the interpreter from the command line. Later in this chapter, we show how to launch scripts from the command line and discuss IDEs.

NOTE

The emphasis in this book is on launching scripts from the command line or entering code in the interpreter.

The Python Interactive Interpreter

Python Identifiers

A Python identifier is the name of a variable, function, class, module, or other object, and a valid identifier conforms to the following rules:

It starts with a letter (A to Z or a to z) or an underscore (_).

The first character is followed by zero or more letters, underscores, and digits (0 to 9).

NOTE

Python identifiers cannot contain characters such as @, $, and %.

Python is a case-sensitive language, so “Abc” and “abc” are different identifiers.

In addition, Python has the following naming conventions:

Class names start with an uppercase letter and all other identifiers with a lowercase letter.

An initial underscore is used for private identifiers.

Two initial underscores are used for strongly private identifiers.

An identifier with two initial underscores and two trailing underscores indicates a language-defined special name.
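The identifier rules can be checked directly in code. This short sketch uses the built-in str.isidentifier() method together with keyword.iskeyword() from the standard library, because reserved words such as class pass the character rules but still cannot be used as names. The helper is_valid_name and the sample strings are chosen for illustration only.

```python
import keyword

def is_valid_name(name):
    """True if `name` is a legal identifier and not a reserved keyword."""
    return name.isidentifier() and not keyword.iskeyword(name)

candidates = ("Abc", "abc", "_private", "__init__", "2fast", "total$", "class")
for candidate in candidates:
    print(candidate, "->", is_valid_name(candidate))
```

Note that Python 3 also accepts many non-ASCII letters in identifiers, so isidentifier() is slightly more permissive than the rules listed above.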

Lines, Indentations, and Multi-Lines

Quotation and Comments in Python

Saving Your Code in a Module

Some Standard Modules in Python

The Python Standard Library provides many modules that can simplify your own scripts. A list of the Standard Library modules is available online:

http://www.python.org/doc/

Some of the most important modules include cgi, math, os, pickle, random, re, socket, sys, time, and urllib.

The code samples in this book use the modules math, os, random, and re. You need to import these modules in order to use them in your code. For example, the following code block shows you how to import standard modules:

import re

import sys

import time

The code samples in this book import one or more of the preceding modules, as well as other Python modules.
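As a brief illustration, the following sketch imports the four modules mentioned above and calls one function from each. The input values (the path, the seed, and the sample string) are arbitrary choices made for this example.

```python
import math
import os
import random
import re

root = math.sqrt(16)                            # square root, returned as a float
filename = os.path.basename("/tmp/a.txt")       # final component of a path
random.seed(42)                                 # fixed seed so runs are repeatable
roll = random.randint(1, 6)                     # integer from 1 to 6 inclusive
digits = re.findall(r"\d+", "abc 123 def 45")   # all runs of digits

print(root, filename, roll, digits)
```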

The help() and dir() Functions

An Internet search for Python-related topics usually returns a number of links with useful information. Alternatively, you can check the official documentation site: docs.python.org.

In addition, the help() and dir() functions are accessible from the interpreter. The help() function displays documentation strings, whereas the dir() function displays defined symbols. For example, if you type help(sys), you see documentation for the sys module, whereas dir(sys) displays a list of the defined symbols.

Type the following command in the interpreter to display the string-related methods:

>>> dir(str)

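The preceding command generates output similar to the following. This listing comes from an older Python release, so your output will differ: Python 3 omits entries such as __getslice__ and decode and adds others.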

['__add__', '__class__', '__contains__', '__delattr__','__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__getnewargs__', '__getslice__', '__gt__', '__hash__', '__init__', '__le__', '__len__', '__lt__', '__mod__', '__mul__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__rmod__', '__rmul__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '_formatter_field_name_split', '_formatter_parser', 'capitalize', 'center', 'count', 'decode', 'encode', 'endswith', 'expandtabs', 'find', 'format', 'index', 'isalnum', 'isalpha', 'isdigit', 'islower', 'isspace', 'istitle', 'isupper', 'join', 'ljust', 'lower', 'lstrip', 'partition', 'replace', 'rfind', 'rindex', 'rjust', 'rpartition', 'rsplit', 'rstrip', 'split', 'splitlines', 'startswith', 'strip', 'swapcase', 'title', 'translate', 'upper', 'zfill']

The preceding list gives you a consolidated list of string methods. Although it is clear that the upper() method returns an uppercase copy of a string, the purpose of other methods, such as partition() or zfill(), is not immediately apparent. The preceding list provides a starting point for finding out more about various string methods that are not discussed in this chapter.

Note that dir() called without arguments does not list the names of built-in functions and variables. You can obtain this information from the standard module builtins (named __builtin__ in Python 2), which is available in the interpreter under the name __builtins__:

>>> dir(__builtins__)
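In a script, importing the builtins module explicitly is the more dependable way to get the same list, because the exact nature of __builtins__ is an implementation detail. The sketch below keeps only the names that start with a lowercase letter, which are mostly functions and types; that filter is a convenience chosen for this example.

```python
import builtins

# Skip dunder names and exception classes by keeping lowercase names only.
names = [n for n in dir(builtins) if n[0].islower()]

print(len(names), "names, for example:", names[:5])
print("max" in names, "filter" in names, "map" in names)
```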

The following command shows you how to get more information about a function:

help(str.lower)

The output from the preceding command in Python 3 is similar to the following:

Help on method_descriptor:

lower(self, /)

Return a copy of the string converted to lowercase.

Check the online documentation and also experiment with help() and dir() when you need additional information about a particular function or module.
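The same information is available programmatically, which is handy when the dir() output is long. This sketch filters dir(str) down to the names without a leading underscore and reads the docstring that help() displays; the variable name public_methods is chosen for this example.

```python
# Names without a leading underscore are the everyday string methods.
public_methods = [name for name in dir(str) if not name.startswith("_")]
print(public_methods)

# help() builds its text from the docstring, which is also available directly.
print(str.lower.__doc__)
```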

Compile Time and Runtime Code Checking

Simple Data Types in Python

Python supports primitive data types, such as numbers (integers, floating point numbers, and numbers written in exponential notation) and strings, as well as dates (through the standard datetime module). It also supports more complex data types, such as lists (or arrays), tuples, and dictionaries, all of which are discussed later in this chapter. The next several sections discuss some of the primitive data types, along with code snippets that show you how to perform operations on those data types.
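As a quick orientation, the following sketch creates one value of each kind mentioned above and prints its type. The sample values are arbitrary, and the date comes from the standard datetime module rather than from a built-in type.

```python
from datetime import date

samples = [
    42,                 # integer
    3.14,               # floating point number
    1.5e3,              # exponential notation, still a float (1500.0)
    "hello",            # string
    date(2024, 1, 31),  # date, provided by the datetime module
    [1, 2, 3],          # list
    (1, 2, 3),          # tuple
    {"a": 1},           # dictionary
]

for value in samples:
    print(type(value).__name__, value)
```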

Working with Numbers

Working with Other Bases

The chr() Function