SQL Pocket Primer - Mercury Learning and Information - E-Book

SQL Pocket Primer E-Book

Mercury Learning and Information

0,0
29,99 €

-100%
Sammeln Sie Punkte in unserem Gutscheinprogramm und kaufen Sie E-Books und Hörbücher mit bis zu 100% Rabatt.
Mehr erfahren.
Beschreibung

As part of the best-selling Pocket Primer series, this book is designed for data scientists and machine learning engineers seeking to deepen their SQL knowledge using MySQL as the primary RDBMS. It features Python-based code samples for accessing data from MySQL tables in Pandas data frames and Java-based samples for data access in MySQL, along with handling XML and JSON documents.
The book also introduces NoSQL, with an overview of MongoDB, and SQLite, an open-source RDBMS for mobile devices. The final section covers diverse topics like normalization, schemas, database optimization, and performance. This comprehensive approach ensures a well-rounded understanding of SQL and related technologies, enhancing data manipulation and database management skills.
Numerous code samples and listings throughout the book support the various topics covered. Companion files with source code and figures are available from the publisher, making this an essential resource for advancing your SQL proficiency and database management expertise.

Das E-Book können Sie in Legimi-Apps oder einer beliebigen App lesen, die das folgende Format unterstützen:

EPUB
MOBI

Seitenzahl: 410

Veröffentlichungsjahr: 2024

Bewertungen
0,0
0
0
0
0
0
Mehr Informationen
Mehr Informationen
Legimi prüft nicht, ob Rezensionen von Nutzern stammen, die den betreffenden Titel tatsächlich gekauft oder gelesen/gehört haben. Wir entfernen aber gefälschte Rezensionen.



SQL

Pocket Primer

LICENSE, DISCLAIMER OF LIABILITY, AND LIMITED WARRANTY

By purchasing or using this book and companion files (the “Work”), you agree that this license grants permission to use the contents contained herein, including the disc, but does not give you the right of ownership to any of the textual content in the book/disc or ownership to any of the information or products contained in it. This license does not permit uploading of the Work onto the Internet or on a network (of any kind) without the written consent of the Publisher. Duplication or dissemination of any text, code, simulations, images, etc. contained herein is limited to and subject to licensing terms for the respective products, and permission must be obtained from the Publisher or the owner of the content, etc., in order to reproduce or network any portion of the textual material (in any media) that is contained in the Work.

MERCURY LEARNING AND INFORMATION (“MLI” or “the Publisher”) and anyone involved in the creation, writing, or production of the companion disc, accompanying algorithms, code, or computer programs (“the software”), and any accompanying Web site or software of the Work, cannot and do not warrant the performance or results that might be obtained by using the contents of the Work. The author, developers, and the Publisher have used their best efforts to ensure the accuracy and functionality of the textual material and/or programs contained in this package; we, however, make no warranty of any kind, express or implied, regarding the performance of these contents or programs. The Work is sold “as is” without warranty (except for defective materials used in manufacturing the book or due to faulty workmanship).

The author, developers, and the publisher of any accompanying content, and anyone involved in the composition, production, and manufacturing of this work will not be liable for damages of any kind arising out of the use of (or the inability to use) the algorithms, source code, computer programs, or textual material contained in this publication. This includes, but is not limited to, loss of revenue or profit, or other incidental, physical, or consequential damages arising out of the use of this Work.

The sole remedy in the event of a claim of any kind is expressly limited to replacement of the book and/or disc, and only at the discretion of the Publisher. The use of “implied warranty” and certain “exclusions” vary from state to state, and might not apply to the purchaser of this product.

Companion files for this title are available by writing to the publisher at [email protected].

SQL

Pocket Primer

Oswald Campesato

MERCURY LEARNING AND INFORMATION

Dulles, Virginia

Boston, Massachusetts

New Delhi

Copyright ©2022 by MERCURY LEARNING AND INFORMATION LLC. All rights reserved.

This publication, portions of it, or any accompanying software may not be reproduced in any way, stored in a retrieval system of any type, or transmitted by any means, media, electronic display or mechanical display, including, but not limited to, photocopy, recording, Internet postings, or scanning, without prior permission in writing from the publisher.

Publisher: David Pallai

MERCURY LEARNING AND INFORMATION

22841 Quicksilver Drive

Dulles, VA 20166

[email protected]

www.merclearning.com

800-232-0223

O. Campesato. SQL Pocket Primer.

ISBN: 978-1-68392-814-0

The publisher recognizes and respects all marks used by companies, manufacturers, and developers as a means to distinguish their products. All brand names and product names mentioned in this book are trademarks or service marks of their respective companies. Any omission or misuse (of any kind) of service marks or trademarks, etc. is not an attempt to infringe on the property of others.

Library of Congress Control Number: 2022930720

222324321 This book is printed on acid-free paper in the United States of America.

Our titles are available for adoption, license, or bulk purchase by institutions, corporations, etc. For additional information, please contact the Customer Service Dept. at 800-232-0223(toll free).

All of our titles are available in digital format at academiccourseware.com and other digital vendors. Companion files (figures and code listings) for this title are available by contacting [email protected]. The sole obligation of MERCURY LEARNING AND INFORMATION to the purchaser is to replace the disc, based on defective materials or faulty workmanship, but not based on the operation or functionality of the product.

I’d like to dedicate this book to my parents – may this bring joy and happiness into their lives.

CONTENTS

Preface

Chapter 1: Introduction to RDBMSs and MySQL

What is MySQL?

What about MariaDB?

Installing MySQL

Useful Links for MySQL

What is an RDBMS?

What Relationships Do Tables Have in an RDBMS?

Features of an RDBMS

What is ACID?

When Do We Need an RDBMS?

Transferring Money Between Bank Accounts

The Importance of Normalization

A Four-Table RDBMS

Detailed Table Descriptions

The Customers Table

The purchase_orders Table

The line_items Table

The item_desc Table

SQL Statements for the Impatient (Optional)

What About an Item Inventory Table?

The Role of SQL

DCL, DDL, DQL, DML, and TCL

SQL Privileges

Properties of SQL Statements

The CREATE Keyword

Data Types in MySQL

The CHAR and VARCHAR Data Types

String-Based Data Types

FLOAT and DOUBLE Data Types

BLOB and TEXT Data Types

MySQL Database Operations

Creating a Database

Display a List of Databases

Display a List of Database Users

Dropping a Database

Exporting a Database

Renaming a Database

Show Database Tables

The INFORMATION_SCHEMA Table

The PROCESSLIST Table

SQL Formatting Tools

Summary

Chapter 2: Working with SQL and MySQL

Drop Database Tables

Create Database Tables

Manually Creating Tables for mytools.com

Creating Tables via a SQL Script for mytools.com

Creating Tables with Japanese Text

Creating Tables from the Command Line

Defining Table Attributes

Working with Aliases in SQL

Alter Database Tables with the ALTER Keyword

Add a Column to a Database Table

Drop a Column from a Database Table

Change the Data Type of a Column

What are Referential Constraints?

Combining Data for a Table Update (Optional)

Merging Data Columns in Multiple CSV Files via Pandas

Concatenating Data from Multiple CSV Files

Appending Table Data from CSV Files via SQL

Inserting Data into Database Tables

Populating Tables from Text Files

Working with Simple SELECT Statements

Duplicate Versus Distinct Rows

Unique Rows Versus Distinct Rows

The EXISTS Keyword

The LIMIT Keyword

DELETE, TRUNCATE, and DROP in SQL

SELECT, DELETE, and LIMIT Combinations

More Options for the DELETE Statement in SQL

Creating Tables from Existing Tables in SQL

Working with Temporary Tables in SQL

Creating Copies of Existing Tables in SQL

What is a SQL Index?

Types of Indexes

Creating an Index

Disabling and Enabling an Index

View and Drop Indexes

Overhead of Indexes

Considerations for Defining Indexes

When to Disable Indexes on a Table

Selecting Columns for an Index

Finding Columns Included in Indexes

Enhancing the mytools Database (Optional)

Entity Relationships

Summary

Chapter 3: Joins, Views, and Subqueries

Query Execution Order in SQL

Joining Tables in SQL

Types of SQL JOIN Statements

Examples of SQL JOIN Statements

An INNER JOIN Statement

A LEFT JOIN Statement

A RIGHT JOIN Statement

A CROSS JOIN Statement

MySQL NATURAL JOIN Statement

An INNER JOIN to Delete Duplicate Attributes

JOIN Statements on Tables with International Text

What is a View?

Creating a View

Dropping a View in SQL

Advantages of Views in SQL Statements

Views Involving a Single Table

Views Involving Multiple Tables

Updatable Views

Keys, Primary Keys, and Foreign Keys

Foreign Keys versus Primary Keys

A MySQL Example of Foreign Keys

Working with Subqueries in SQL

Two Types of Subqueries

A Subquery to Find Customers Without Purchase Orders

Subqueries with IN and NOT IN Clause

Subqueries with SOME, ALL, ANY Clause

Subqueries with the MAX() and AVG() Functions

Find Tallest Students in Each Classroom via a Subquery

SQL and Histograms

What are GROUP BY, ORDER BY, and HAVING Clauses?

Displaying Duplicate Attribute Values

Examples of the SQL GROUP BY and ORDER BY Clause

SQL Histograms on a Table Copy

Combine GROUP BY and ROLLUP Clause

The 2021 Olympics Medals and the ROLLUP Keyword

The 2021 Olympics Medals and the RANK Operator

The PARTITION BY Clause

GROUP BY, HAVING, and ORDER BY Clause

Combined GROUP BY, HAVING, and ORDER BY Clause

Updating the item_desc Table from the new_items Table

A SQL Query Involving a Four-Table Join

Operations with Dates in SQL

Day and Month Components of Dates in SQL

Rounding Dates in SQL

Working with Date Ranges

Tables Containing Modification Times

Arithmetic Operations with Dates

Date Components and Date Formats

Finding the Week in Date Values

Displaying Weekly Revenue

Assorted SQL Operators

Working with Column Aliases

SQL Variables

SQL Summary Reports

Simple SQL Reports

Calculating SubTotals

Calculating “Running” (Cumulative) Totals

Summary

Chapter 4: Assorted SQL Functions

Numeric Functions in SQL

Calculated Columns

The round(), ceil(), and floor() Functions

SQL Queries with the rand() Function

Log, Exponential, and Trig Functions in SQL

Scalar Functions in SQL

Aggregate Functions in SQL

SQL Queries with the max() and min() Functions

Find Maximum Values with SQL Subqueries

Simplify SQL Queries Containing Subqueries

Find Top-Ranked Numeric Values

Find the Second and Third Largest Values in a Column

Find the Top Three Values in a Column

Find Values with the OFFSET Keyword

String Functions in SQL

SQL Queries with the SUBSTRING() Function

The SUBSTRING() Function in SQL

Boolean Operators in SQL

The IN Keyword

Set Operators in SQL

AND, OR, and NOT Operators in SQL

Working with Arithmetic Operators

Arithmetic Aggregate Operators in SQL

Finding Average Values

SELECT Clauses with Multiple Aggregate Functions

The ORDER BY Clause in SQL

ORDER BY with Aggregate Functions

Largest Distinct Values and Frequency of Values

Character Functions and String Operators

SQL Character Functions

String Operators in SQL

The MATCH() Function and Text Search

CTEs and the “with” Keyword in MySQL (Version 8)

The with Keyword and a Recursive SQL Query

CTEs and the Mean, Stddev, and Z-scores

Linear Regression in SQL

Window Functions

Types of Window Functions in SQL

The SQL CASE Clause

Working with NULL Values in SQL

Miscellaneous One-Liners

Working with the CAST() Function in SQL

Summary

Chapter 5: NoSQL, SQLite, and Python

Non-Relational Database Systems

Advantages of Non-Relational Databases

What is NoSQL?

What is NewSQL?

RDBMSs Versus NoSQL: Which One to Use?

Good Data Types for NoSQL

Some Guidelines for Selecting a Database

NoSQL Databases

What is MongoDB?

Features of MongoDB

Installing MongoDB

Launching MongoDB

Useful Mongo APIs

Meta Characters in Mongo Queries

MongoDB Collections and Documents

Document Format in MongoDB

Create a MongoDB Collection

Working with MongoDB Collections

Find all Android Phones

Find All Android Phones in 2018

Insert a New Item (document)

Update an Existing Item (document)

Calculate the Average Price for Each Brand

Calculate the Average Price for Each Brand in 2019

Import Data with mongoimport

What is Fugue?

What is Compass?

What is PyMongo?

MySQL, SQLAlchemy, and Pandas

What is SQLAlchemy?

Read MySQL Data via SQLAlchemy

Export SQL Data from Pandas to Excel

MySQL and Connector/Python

Establishing a Database Connection

Reading Data from a Database Table

Creating a Database Table

What is SQLite?

SQLite Features

SQLite Installation

SQLiteStudio Installation

DB Browser for SQLite Installation

SQLiteDict (Optional)

Summary

Chapter 6: Miscellaneous Topics

Managing Users

Listing Current Users

Creating and Altering MySQL Users

Dropping MySQL Users

What are Roles?

Create Roles and Grant Privileges

Revoke Roles and Drop Roles

What is a User-Defined Function?

What is a Stored Procedure?

IN and OUT Parameters in Stored Procedures

A Simple Stored Procedure

What is a Stored Function?

A Simple Stored Function

What are SQL Triggers?

A Simple MySQL Trigger

MySQL Engines

What is Normalization?

What is Denormalization?

What are Schemas?

MySQL Workbench

Exporting a Schema in Workbench

Creating a Schema in Workbench

ERM and Tools

What is a Transaction?

The COMMIT and ROLLBACK Statements

The SAVEPOINT Statement

Database Optimization and Performance

Performance Tuning Considerations

SQL Query Optimization

Analyzing SQL Queries for Their Performance

Performance Tuning Tools

Cost-Based Optimizers (Optional)

Table Fragmentation

Table Partitioning

What is an EXPLAIN Plan?

Explain Analyze

Scaling an RDBMS

What is SQL Tuning?

What is Sharding?

RDBMS Support for Sharding

What is Federation?

Database Replication

Distributed Databases, Scalability, and the CAP Theorem

Master-Slave Replication

The CAP Theorem

What are Consistency Patterns?

MySQL Command Line Utilities

Database Backups, Restoring Data, and Upgrades

MySQL and JSON Data

Data Cleaning in SQL

Replace NULL with 0

Replace NULL Values with Average Value

Replace Multiple Values with a Single Value

Handle Mismatched Attribute Values

Convert Strings to Date Values

Data Cleaning From the Command Line (Optional)

Working with the sed Utility

Working with the awk Utility

Next Steps

Summary

Appendix: Introduction to Probability and Statistics

Index

PREFACE

What is the Value Proposition for This Book?

This book is primarily for data scientists and machine learning engineers who want to expand their current knowledge of SQL using MySQL as the primary RDBMS. While this book does contain relevant information for novices in other fields, the structure of this book differs from typical database books.

In addition, this book attempts to balance depth and breadth, along with a decent number of SQL statements to illustrate the important features of SQL. Although it’s not possible to describe the exact set of features that constitute basic, intermediate, and advanced SQL queries (i.e., opinions will differ), this book contains SQL examples that belong to each of those three groups.

At the same time, remember that some topics in the final chapter are presented in a cursory manner, which is for two main reasons. First, although you don’t need an in-depth understanding of every facet of SQL and RDBMSs, it’s important that you be aware of these concepts if you plan to become highly proficient in managing database data. In addition, you will be in a better position to plan an itinerary for the set of topics that you will learn at some point in the future.

Second, a full treatment of every topic in this book would significantly increase the page count, and it’s debatable whether all the additional details would be beneficial to you as a machine learning engineer or a data scientist.

The Target Audience

As you read in the previous section, this book is meant primarily for machine learning engineers and data scientists who already have a basic understanding of SQL, which means that they have executed some SQL statements in a database such as MySQL. As such, they will learn more details about SQL and MySQL so they can manage data in database tables. Moreover, the knowledge that they gain while working with MySQL can easily transfer to other RDBMSs such as ORACLE.

In addition, this book is intended to reach an international audience of readers, so this book uses standard English rather than colloquial expressions. As you know, many people learn by different types of imitation, which includes reading, writing, or hearing new material. This book takes these points into consideration in order to provide a comfortable and meaningful learning experience for the intended readers.

What’s Different About This SQL Book?

Before delving into the differences, it’s worth noting that this book covers many topics that you will find in database books of comparable length. At a minimum, any RDBMS book needs to include SQL, along with examples of how to select, delete, update, and insert data into a database table. Other mandatory topics include an explanation of views, indexes, joining tables, subqueries, normalization, and database schemas.

However, this book differs from generic database books because there are topics that are relevant to this target audience, which are not necessary for readers of generic database books. Some of those additional topics are discussed in chapter 6 (miscellaneous topics).

Another difference is a portion of Chapter 5, which contains Python-based code samples to access data from a MySQL table in a Pandas data frame. A third difference is the inclusion of the appendix that contains an introduction to probability and statistics, and a discussion of of entropy, cross-entropy, and KL divergence. Thus, it’s the collective set of differences that differentiate this book from generic SQL books.

What Will I Learn From This Book?

The first chapter contains a short introduction to RDBMSs and MySQL, along with information about installing MySQL. In addition, you will see SQL statements for creating, dropping, and exporting a database. Although other books sometimes defer these operations to later chapters, they are easy to perform with empty or very small databases that do not contain any critical data. Therefore, you don’t have to worry about making costly mistakes because of a blunder in a SQL query.

The second chapter delves into creating database tables and various ways to populate them with data. This chapter also describes various ways of deleting data from database tables, followed by a discussion of indexes on tables and why they are important.

The third chapter explains the concept of “joining” database tables, followed by a discussion of views: what they are, what advantages they provide, and how to create them over a single table or multiple tables. You will also learn how to work with subqueries in SQL. In addition, this chapter introduces you to the notion of normalization, along with a clear and compelling reason for adopting database normalization.

The fourth chapter is primarily about SQL functions, which involves numeric functions such as ceil(), floor(), and random(). Aggregate functions are also discussed, followed by string-oriented SQL functions such as the substring() function. This chapter contains an assortment of SQL statements, some of which involve various combinations of GROUP BY, HAVING, and ORDER BY.

The fifth chapter introduces NoSQL, followed by an overview of MongoDB, which is a popular NoSQL database. Next you will learn about SQLite, which is an open-source RDBMS that is available on mobile devices.

Chapter six contains a diverse set of miscellaneous topics, such as normalization, schemas, database optimization, and performance. Then you will be introduced to EXPLAIN plans, SQL tuning, managing users, roles, stored procedures, and triggers.

A Simple Way to Create the Entire mytools Database

As a convenience, Chapter 6 contains the SQL file mytools.sql that contains all the tables that are defined in this book. Moreover, the SQL file also contains the data for all the database tables. Of course, you can launch the individual SQL files for each of the tables if you prefer to do so the “long way”.

You can import the complete mytools database by starting MySQL and then issuing the following command from the command line in the directory that contains mytools.sql:

mysql -u root -p mytools < mytools.sql

NOTEIf you encounter issues when you launch the preceding command, read the section in Chapter 6 regardingMySQL Workbenchthat enables you to import databases and export databases.

What Do I Need to Know for This Book?

Although this is an introductory book with minimal prerequisites, obviously you will benefit from having existing knowledge of various topics. Specifically, some knowledge of SQL will facilitate learning the SQL-related concepts more quickly. In addition, knowledge of Java is helpful for Appendix A, as well as some familiarity with XML and JSON. Familiarity with normalization will help you understand the relationships among the tables in the fictitious application that is discussed in Chapter 1 and Chapter 2.

If you want to be sure that you can grasp the material in this book, glance through some of the code samples to get an idea of how much is familiar to you and how much is new for you.

Do the Companion Files Obviate the Need for This Book?

The companion files contains all the code samples to save you time and effort from the error-prone process of manually typing code into a text file. Furthermore, there are situations in which you might not have easy access to the companion files. In addition, the code samples in the book provide explanations that are not available on the companion files.

Does This Book Contain Production-Level Code Samples?

The primary purpose of the code samples in this book is to provide a variety of SQL statements that enable you to perform common and useful tasks in MySQL. Clarity has higher priority than writing more compact code that is more difficult to understand (and possibly more prone to bugs). If you decide to use any of the code in this book in a production website, you ought to subject that code to the same rigorous analysis as the other parts of your code base.

What Are the Non-Technical Prerequisites for This Book?

Although the answer to this question is more difficult to quantify, it’s very important to have strong desire to learn about data analytics, along with the motivation and discipline to read and understand the code samples.

How Do I Set Up a Command Shell?

If you are a Mac user, there are three ways to do so. The first method is to use Finder to navigate to Applications > Utilities and then double click on the Utilities application. Next, if you already have a command shell available, you can launch a new command shell by typing the following command:

open /Applications/Utilities/Terminal.app

A second method for Mac users is to open a new command shell on a MacBook from a command shell that is already visible simply by clicking command+n in that command shell, and your Mac will launch another command shell.

If you are a PC user, you can install Cygwin (open source https://cywin.com) which simulates bash commands, or use another toolkit such as MKS (a commercial product). Please read the online documentation that describes the download and installation process. Note that custom aliases are not automatically set if they are defined in a file other than the main start-up file (such as .bash_login).

Companion Files

All of the code samples and figures in this book may be obtained by writing to the publisher at [email protected].

What Are the “Next Steps” After Finishing This Book?

The answer to this question varies, mainly because the answer depends heavily on your objectives. If you are interested primarily in working with structured data, then you can look for online resources that delve into more advanced topics.

If you are primarily interested in machine learning, then you have several options: NLP (natural language processing), deep learning, and reinforcement learning (and also deep reinforcement learning).

Fortunately, you can perform an Internet search to find many resources. One other point: the aspects of machine learning for you to learn depend on who you are: the needs of a machine learning engineer, data scientist, manager, student or software developer are all different.

CHAPTER 1

INTRODUCTION TO RDBMSs AND MYSQL

This chapter introduces you to RDBMSs and various SQL concepts, along with a quick introduction to MySQL. MySQL is used in most of this book because it is a robust RDBMS that is available as a free download from an ORACLE website. Current trends suggest that MySQL will continue its dominant role for the foreseeable future. Moreover, virtually everything that you learn about MySQL in this chapter transfers to other RDBMSs, such as PostgreSQL and ORACLE.

This chapter describes a fictitious website that enables users to register themselves for the purpose of purchasing various home improvement tools (hammers, wrenches, and so forth). Instead of SQL statements, you will learn about the tables that are required, their relationships, and the structure of those tables. You will also see some SQL INSERT statements for inserting data into database tables. Although we have yet to create any database tables, these SQL statements are intuitive and easy to grasp. Then, in Chapter 2, you will see the SQL statements that create the tables that are discussed in this chapter.

The first part of this chapter introduces the concept of an RDBMS, and the rationale for using an RDBMS. In particular, you will see an example of an RDBMS with a single table, two tables, and four tables (and much larger RDBMSs exist). This section also introduces the notion of database normalization, and how it assists you in maintaining data integrity (“single source of truth”) in an RDBMS.

The second part of this chapter describes the structure of the tables in a four-table database that keeps track of customer purchases of home improvement tools that consumers can purchase through the associated Web page. You will also see the different relationships among pairs of tables, and how a one-to-many relationship enables you to find all the line items that belong to a given purchase order.

The third portion of this chapter contains a brief introduction to SQL and some basic examples of SQL queries (more details are in Chapter 2). You will also learn about the terminology for various types of SQL statements that can be classified as DCL (Data Control Language, DDL (Data Definition Language), DQL (Data Query Language), or DML (Data Manipulation Language).

The fourth portion of this chapter discusses SQL data types, and the fifth portion discusses database operations, such as creating, dropping, and renaming a database in MySQL. The final portion discusses two useful built-in tables that enable you to find the columns of a given table and the status of SQL statements.

There are several points to keep in mind before reading this chapter. First, the style for this chapter (and also the next chapter) is a “top-down” approach whereby high-level details are described and then hands-on coding details are discussed. However, you are free to reverse the order in which you read the first two chapters, if you prefer a “bottom-up” approach whereby you first learn more details regarding SQL statements and then learn about a use case in this chapter.

Second, there is an important detail that is mentioned in the preface that is worth repeating here: this book is primarily for data scientists who want to increase their knowledge of SQL to manage data in a database. Although this book can be useful for any motivated beginner, its primary purpose is different from books that prepare readers to become database administrators (DBAs).

Third, there is a section in the middle of this chapter that shows you the SQL statements that create several tables, along with details of purchase orders. This section is a preview of what you will learn in subsequent chapters, and it’s intended primarily for readers who already have a good understanding of SQL statements. However, if you are unfamiliar with the syntax of the SQL statements in that section, there’s no need to worry: you can return to this section after reading subsequent chapters that explain the details of the SQL syntax and functionality.

WHAT IS MYSQL?

MySQL is an open source database that is portable and provides many features that are available in commercial databases. Oracle is the steward of the MySQL database, and you can download MySQL 8.0 from the following site:

https://www.mysql.com/downloads/

MySQL is a highly popular database that is used by many companies, including Amazon, Google, LinkedIn, Netflix, and Twitter. MySQL is written in C++, whereas the user-level interaction is through SQL. Other add-ons for MySQL can be purchased from Oracle, as well as free third-party tools are available for monitoring and managing MySQL databases.

If you prefer, MySQL also provides a GUI interface for performing database-related operations. The code samples in this book have been written for MySQL 8, which provides the following new features beyond earlier versions:

A transactional data dictionary

Improved support for BLOB, TEXT, GEOMETRY, and JSON data types

Support for CTEs (common table expressions)

Support for window functions

As you will see in Chapter 6, MySQL supports pluggable storage engines, such as InnoDB (the most commonly used MySQL storage engine). In addition, Facebook developed an open source storage engine called MyRocks that has better compression and performance, so it might be worth while to explore the advantage of MyRocks over the other storage engines for MySQL.

What about MariaDB?

MySQL began as an open source project, and retained its name after the Oracle acquisition. Shortly thereafter, the MariaDB database was created, which is a “fork” of the MySQL database. Although MariaDB supports all the features of MySQL, there are important differences between MySQL and MariaDB that you can read about online:

https://mariadb.com/kb/en/mariadb-vs-mysql-compatibility/

Installing MySQL

Download the MySQL distribution for your machine and perform the installation procedure. After you complete the installation, log into MySQL as root with the following command, which will prompt you for the root password:

$ mysql -u root -p

If you installed MySQL via a DMG file, then the root password is the same as the password for your machine.

Useful Links for MySQL

This section contains various links that may be useful as you read the chapters of this book. Although SQL is not discussed in detail until the next chapter, the SQL links are included here for your convenience.

MySQL won the DBMS of the Year award in 2019:

https://db-engines.com/en/blog_post/83

The following link contains the list of platforms that support MySQL:

https://www.mysql.com/de/support/supportedplatforms/database.html

The following link contains a comparison between SQL and MySQL:

https://www.softwaretestinghelp.com/sql-vs-mysql-vs-sql-server/

A comparison of MySQL, Microsoft SQL Server, and PostgreSQL is available online:

https://db-engines.com/en/system/Microsoft+SQL+Server%3BMySQL%3BPostgreSQL

The latest version of SQL is SQL:2016:

https://en.wikipedia.org/wiki/SQL:2016

The following website contains details regarding MySQL Standards Compliance:

https://dev.mysql.com/doc/refman/8.0/en/compatibility.html

The following website describes MySQL Extensions to Standard SQL:

https://dev.mysql.com/doc/refman/8.0/en/extensions-to-ansi.html

The following website is a FAQ for MySQL 8.0, along with download links for the MySQL manual in multiple formats:

https://dev.mysql.com/doc/refman/8.0/en/faqs.html

WHAT IS AN RDBMS?

RDBMS is an initialism for Relational DataBase Management System. RDBMSs store data in tables that contain labeled attributes (sometimes called columns) that have a specific data type. Examples of an RDBMS include MySQL, ORACLE, and IBM DB2. While an RDBMS is software that manages data, a DBMS is the underlying “store” where the data resides.

Although relational databases often provide a very good solution for managing data, speed and scalability might be an issue in some cases. Chapter 5 discusses NoSQL databases, such as MongoDB, that might be more suitable for speed and scalability.

What Relationships Do Tables Have in an RDBMS?

While an RDBMS can consist of a single table, it often comprises multiple tables that can have various types of associations with each other. For example, when you buy various items at a food store, your receipt consists of one purchase order that contains one or more “line items,” where each line item indicates the details of a particular item that you purchased. This is called a one-to-many relationship between a purchase order (which is stored in a purchase_orders table) and the line items (stored in a line_items table) for each item that you purchased.

Another example involves students and courses: each student is enrolled in one or more courses, which is a one-to-many relationship from students to courses. Moreover, each course contains one or more students, so there is a one-to-many relationship from courses to students. Hence, the students and course tables have a many-to-many relationship.

A third example is an employees table, where each row contains information about one employee. If each row includes the id of the manager of the given employee, then the employees table is a self-referential table because finding the manager of the employee involves searching the employees table with the manager’s id that is stored in the given employee record. However, if the rows in an employees table do not contain information about an employee’s manager, then the table is not self-referential.

In addition to table definitions, a database frequently contains indexes, primary keys, and foreign keys that facilitate searching for data in tables and also connecting a row in a given table with its logically related row (or rows) in another table. For example, if we have the id value for a particular purchase order in the purchase_orders table, we can find all the line items (i.e., the items that were purchased) in the line_items table that contain the same purchase order id.

Features of an RDBMS

An RDBMS provides a convenient way to store data, often associated with some type of application. For example, later you will see the details of a four-table RDBMS that keeps track of tools that are purchased via a Web-based application. From a high-level perspective, an RDBMS provides the following characteristics:

a database contains one or more tables

data is stored in tables

data records have the same structure

well-suited for vertical scaling

support for ACID (explained below)

Another useful concept is a logical schema that consists of the collection of tables and their relationships (along with indexes, views, triggers, and so forth) in an RDBMS. The schema is used for generating a physical schema, which consists of all the SQL statements that are required to create the specified tables and their relationships.

For example, Chapter 6 contains a SQL file mytools.sql that contains the definition of every entity in the mytools database, as well as the directory mytools-sql-files-20211120 that contains a SQL file for every table in the mytools database. Moreover, Chapter 6 describes two techniques for exporting a MySQL database. After the tables have been generated, you can begin inserting data and then managing the consistency of the data.

What is ACID?

ACID is an acronym for Atomicity, Consistency, Isolation, and Durability, which refers to properties of RDBMS transactions, as summarized below.

Atomicity

means that each transaction is all-or-nothing, so if a transaction fails, the system is rolled back to its previous state.

Consistency

means that successful transactions always result in a valid database state.

Isolation

means that executing transactions concurrently or serially will result in the state.

Durability

means that a committed transaction will remain in the same state.

RDBMSs support ACID, whereas NoSQL databases usually do not support ACID.

WHEN DO WE NEED AN RDBMS?

The short answer is that an RDBMS is useful when we need to store records of events that have occurred, which can be involve simple item purchases as well as complex multi-table financial transactions.

An RDBMS allows you to define a collection of tables that contain rows of data, where a row contains one or more attributes (informally called fields). A row of data is a record of an event that occurred at a specific point in time, which can involve more than one table, and can also involve some type of transaction.

Transferring Money Between Bank Accounts

Consider a simple money transfer between two bank accounts in which you want to transfer $100 from a savings account to a checking account. The process involves two steps:

debiting (subtracting) the savings account by $100 and

crediting (adding) the checking account with $100.

However, if a system failure occurs after step 1 and before step 2 can be completed, you have lost $100. Obviously, steps 1 and 2 must be treated as an atomic transaction, which means that the transaction is successful only when both steps have completed successfully. If the transaction is unsuccessful, the transaction is “rolled back” so the system is returned to the state prior to transferring money between the two accounts.

As you learned earlier in this chapter, RDBMSs support ACID, which ensures that the previous transaction (i.e., transferring money between accounts) is treated as an atomic transaction.

Although atomic transactions are indispensable in financial systems, they might not be as critical for other systems. For example, a database that contains a lone events table in which each row contains information about a single event that you created by some process (such as a registration form) whenever a new event occurs in a system. Although this is conceptually simple, notice that the following attributes are relevant for each row in the events table: event_id, event_time, event_title, event_duration, and event_location, and possibly additional attributes.

As another example, displaying a set of pictures might not show the pictures in the correct order (e.g., based on their creation time). However, a failure in the event creation is not as critical as a failure in a financial system, and displaying images in the wrong sequence will probably be rectified when the page is refreshed.

THE IMPORTANCE OF NORMALIZATION

This section contains an introduction to the concept of normalization. As a starting point, consider an RDBMS that stores records for the temperature of a room during a time interval (such as a day, a week, or some other time interval). We just need one device_temperature table where each row contains the temperature of a room at a specific time. In the case of the IoT (Internet of Things), the temperature is recorded during regular time intervals (such as minute-by-minute or hourly).

If you need to track only one room, the device_temperature table is probably sufficient. However, if you need to track multiple devices in a room, then it’s convenient to create a second table called device_details that contains attributes for each device, such as device_id, device_name, device_year, device_price, and device_warranty.

Whenever we want the details of a temperature-related event, we need information from both tables, which consists of one row in the device_temperature table and its associated row in the device_details table. The way to perform the two-table connection is simple: each row in the device_details table contains a device_id that uniquely identifies the given row. Moreover, the samedevice_id appears in any row of the device_temperature table that refers to the given device.

The preceding two-table structure is a minimalistic example of something called database normalization, whose purpose is to reduce data redundancy in database tables. Normalization can result in a slower performance during the execution of some types of SQL statements (e.g., those that contain a JOIN keyword).

If you are new to the concept of database normalization, you might be thinking that normalization increases complexity and reduces performance without providing tangible benefits. While this is a valid thought, the trade-off is worthwhile because normalization enables you to maintain data consistency.

For example, suppose that every record in the purchase_orders table contains all the details of the customer who made the associated purchase. As a result, we can eliminate the customers table. However, if we ever need to update the address of a particular customer, we need to update all the rows in thepurchase_orderstable that contain that customer. By contrast, if we maintain acustomerstable, then updating a customer’s address involves changing a single row in thecustomerstable.

Normalization enables us to avoid data duplication so that there is a single “source of truth” in the event that information (such as a customer’s address) must be updated. From another perspective, data duplication means that the same data appears in two (or possibly more) locations, and if an update is not applied to all those locations, the database data is in an inconsistent state. Depending on the nature of the application, the consequences of inconsistent data can range from minor to catastrophic.

Always remember the following point: whenever you need to update the same data that resides in two different locations, you increase the risk of a data inconsistency, which can adversely affect the data integrity.

As another example, suppose that a site sells widgets online. At a minimum, the associated database needs the following four tables:

customer_details

purchase_orders

po_line_items

item_desc

The preceding scenario is explored in greater detail in the next section that specifies the attributes of each of the preceding tables.

A FOUR-TABLE RDBMS

Suppose that www.mytools.com sells tools (the details of which are not important). For simplicity, let’s pretend that an actual website is available at the preceding URL and it contains the following sections:

new user register registration

existing user log in

input fields for selecting items for purchase (and the quantities)

For example, the registered user John Smith wants to purchase one hammer, two screwdrivers, and three wrenches. The website needs to provide users with the ability to search for products by their type (e.g., a hammer, a screwdriver, or a wrench) and then display a list of matching products. Each product in that list would also contain an SKU, which is an industry-standard labeling mechanism for products (just like ISBNs for identifying books).

The preceding functionality is necessary in order to develop a website that enables users to purchase products. However, the purpose of this section is to describe a set of tables (and their relationships to each other) in an RDBMS, so we will assume that the necessary Web-based features are available at our URL.

Let’s describe a use case that contains the sequence of steps that are performed on behalf of an existing customer John Smith (whose customer ID is 1000), who wants to purchase 1 hammer, 2 screwdrivers, and 3 wrenches:

Step 1: Customer John Smith (with cust_id 1000) initiates a new purchase.

Step 2: A new purchase order is created with the value 12500 for po_id.

Step 3: John Smith selects 1 hammer, 2 screwdrivers, and 3 wrenches.

Step 4: The associated prices for the items are $20.00, $16.00, and $30.00.

Step 5: The subtotals for the items are $20.00, $16.00, and $30.00.

Step 6: A 10% tax for the items is $2.00, $1.60, and $3.00.

Step 7: The total cost of this purchase order is $72.60.

There are additional steps that you could perform. For example, Step 8 would allow John Smith to remove an item, increase/decrease the quantity for each selected item, delete items, or cancel the purchase order. Step 9 would enable John Smith to make a payment. Once again, for the sake of simplicity, we will assume that Step 8 and Step 9 are available in an enhanced version of this Web application.

Note that Step 8 involves updating several of our tables with the details of the purchase order. Step 9 creates a time stamp for the date when the purchase order was created, as well as the status of the purchase order (“paid” versus “pending”). The status of a purchase order is used to generated reports to display the customers whose payment is overdue (and perhaps also send them friendly reminders). Sometimes companies have a reward-based system whereby customers who have paid on time can collect credits that can be applied to other purchases (which is essentially a discount mechanism).

DETAILED TABLE DESCRIPTIONS

If you visualize the use case described in the previous section, you can probably see that we need

a table for storing customer-specific information

a table to store purchase orders (which is somehow linked to the associated customer)

a table that contains the details of the items and quantity that are purchased (which are commonly called “line items”)

a table that contains information about each tool (which includes the name, the description, and the price of the tool).

Hence, the RDBMS for our website requires the following tables:

customers

purchase_orders

line_items

item_desc

The following subsections describe the contents of the preceding tables, along with the relationships among these tables.

The Customers Table

Although there are different ways to specify the attributes of the customers table, you need enough information to uniquely identify each customer in the table. By analogy, the following information (except for cust_id) is required to send physical mail to a person:

cust_id

first_name

last_name

home_address

city

state

zip_code

We will create the customers table with the attributes in the preceding list. Although we’ll defer the discussion of keys to a later chapter, it’s obvious that we need a mechanism for uniquely identifying every customer. In this table, notice that the cust_id attribute uniquely identifies every customer, and therefore it’s a key for this table. Other examples of keys for database tables include

social security numbers for people

student id numbers for students

course id numbers for classes

drivers’ licenses

Whenever we need to refer to the details of a particular customer, we will use the associated value of cust_id to retrieve those details from the row in the customers table that has the associated cust_id.

The preceding paragraph describes the essence of linking related tables T1 and T2 in an RDBMS: the key in T1 is stored as an attribute value in T2. If we need to access related information in table T3, then we store the key in T2 as an attribute value in T3.

Note that a customers table in a production system would contain additional attributes, such as the following:

title (Mr, Mrs, Ms, and so forth)

shipping_address

cell_phone

For the sake of simplicity, we’ll use the initial set of attributes to define the customers table. Later, you can add the new attributes to the four-table schema to make the system more like a real system.

Suppose that the following information pertains to customer John Smith, who has been assigned a cust_id of 1000:

cust_id: 1000

first_name: John

last_name: Smith

home_address: 1000 Appian Way

city: Sunnyvale

state: California

zip_code:95959

Whenever John Smith makes a new purchase, we will use the cust_id value of 1000 to create a new row for this customer in the purchase_orders table. Then whenever we need to find the purchase orders associated with John Smith, we simply look for the rows in the purchase_orders table whose cust_id value equals 1000.

The purchase_orders Table

When existing customers visit the website, they must log into the system, after which they can initiate a new purchase. After they select one or more items, the system creates a purchase order to insert as a new row in the purchase_orders table, and a new row in the line_items table for each item that was selected. While you might be tempted to place all the customers’ details in the new row, we will identify the customer by the associated cust_id and use this value instead.

However, we must create a new row in the customers table whenever new users register at the website. Repeat customers are identified by an existing cust_id that must be determined by searching the customers table with the information that the customer types into the input fields of the main webpage.

The customers table contains a key attribute; similarly, the purchase_orders table contains an attribute that we call po_id (you are free to use a different string) in order to associate a purchase order for a given customer.

Keep in mind the following detail: a row with a given po_id also contains the cust_id value of the customer (in the customers table) who initiated the current purchase order. Although there are multiple ways to define a set of suitable attributes, let’s use the following set of attributes for the purchase_orders table:

cust_id

po_id

purchase_date

For example, suppose that customer John Smith, whose cust_id is 1000, purchases some tools on December 01, 2021. Although there are dozens of different date formats that are supported in RDBMS, we use the YYYY-MM-DD format (which you can change to suit your particular needs). Then the new row for John Smith in the purchase_orders looks like this, where the po_id value was arbitrarily assigned:

cust_id: 1000

po_id: 12500

purchase_date: 2021-12-01

As mentioned earlier, a purchase order involves one or more items, each of which is stored in the line_items table that is discussed in the next section.

The line_items Table

As a concrete example, suppose that customer John Smith requested 1 hammer, 2 screwdrivers, and 3 wrenches in his most recent purchase order. Each of these purchased items requires a row in the line_items table that

is identified by a

line_id

value

specifies the quantity of each purchased item

contains the value for the associated

po_id

in the

purchase_orders

table

contains the value for the associated

item_id

in the

item_desc

table

For simplicity, let’s assign the values 5001, 5002, and 5003 to the line_id attribute for the three new rows in the line_items table that represent the hammer, screwdriver, and wrench items in the current purchase order. A line_item row might look like the following code:

po_id: 12500

line_id: 5001

item_id: 100 <= we'll discuss this soon

item_count: 1

item_price: 20.00

item_tax: 2.00

item_subtotal: 22.00

Notice there is no cust_id in the preceding line_item: that’s because of the top-down approach for retrieving data. Specifically, we start with a particular cust_id that we use to find a list of purchase orders in the purchase_orders table that belong to the given cust_id. For each purchase order in the purchase_orders table, we perform a search for the associated line items in the line_items table. We can repeat the preceding sequence of steps for each customer in a list of cust_id values.

Let us return to the line_item details. We need to reference each purchased item by its associated identifier in the item_desc table. Once again, we arbitrarily assign item_id values of 100, 200, and 300, respectively, for the hammer, screwdriver, and wrench items. The actual values will undoubtedly be different in your application, so there is no special significance to the numbers 100, 200, and 300.

The three rows in the line_items table (that belong to the same purchase order) look like this (we’ll look at the corresponding SQL statements later):

po_id: 12500

line_id: 5001

item_id: 100

item_count: 1

item_price: 20.00

item_tax: 2.00

item_subtotal: 22.00

po_id: 12500

line_id: 5002

item_id: 200

item_count: 2

item_price: 8.00

item_tax: 1.60

item_subtotal: 17.60

po_id: 12500

line_id: 5003

item_id: 300

item_count: 3

item_price: 10.00

item_tax: 3.00

item_subtotal: 33.00

The item_desc Table

Recall that the customers table contains information about each customer, and a new row is created each time that a new customer creates an account for our Web application. In a somewhat analogous fashion, the item_desc table contains information about each item (aka product) that can be purchased from our website. If our website becomes popular, the contents of the item_desc table contents are updated more frequently than the customers table, typically in the following situations:

A new tool (aka product) is available for purchase

An existing tool is no longer available for purchase

Thus, the item_desc table contains all the details for every tool that is available for sale, and it’s the “source of truth” for the tools that customers can purchase from the website. At a minimum, this table contains three fields (we’ll discuss the SQL statement for creating and populating this table later):

SELECT *

FROM item_desc;

+---------+-------------+------------+

| item_id | item_desc   | item_price |

+---------+-------------+------------+

|     100 | hammer      |      20.00 |

|     200 | screwdriver |       8.00 |

|     300 | wrench      |      10.00 |

+---------+-------------+------------+

3 rows in set (0.001 sec)

There is one more important detail to discuss: if an item is no longer for sale, can we simply drop its row from the item_desc