29,99 €
As part of the best-selling Pocket Primer series, this book is designed for data scientists and machine learning engineers seeking to deepen their SQL knowledge using MySQL as the primary RDBMS. It features Python-based code samples for accessing data from MySQL tables in Pandas data frames and Java-based samples for data access in MySQL, along with handling XML and JSON documents.
The book also introduces NoSQL, with an overview of MongoDB, and SQLite, an open-source RDBMS for mobile devices. The final section covers diverse topics like normalization, schemas, database optimization, and performance. This comprehensive approach ensures a well-rounded understanding of SQL and related technologies, enhancing data manipulation and database management skills.
Numerous code samples and listings throughout the book support the various topics covered. Companion files with source code and figures are available from the publisher, making this an essential resource for advancing your SQL proficiency and database management expertise.
Das E-Book können Sie in Legimi-Apps oder einer beliebigen App lesen, die das folgende Format unterstützen:
Seitenzahl: 410
Veröffentlichungsjahr: 2024
LICENSE, DISCLAIMER OF LIABILITY, AND LIMITED WARRANTY
By purchasing or using this book and companion files (the “Work”), you agree that this license grants permission to use the contents contained herein, including the disc, but does not give you the right of ownership to any of the textual content in the book/disc or ownership to any of the information or products contained in it. This license does not permit uploading of the Work onto the Internet or on a network (of any kind) without the written consent of the Publisher. Duplication or dissemination of any text, code, simulations, images, etc. contained herein is limited to and subject to licensing terms for the respective products, and permission must be obtained from the Publisher or the owner of the content, etc., in order to reproduce or network any portion of the textual material (in any media) that is contained in the Work.
MERCURY LEARNING AND INFORMATION (“MLI” or “the Publisher”) and anyone involved in the creation, writing, or production of the companion disc, accompanying algorithms, code, or computer programs (“the software”), and any accompanying Web site or software of the Work, cannot and do not warrant the performance or results that might be obtained by using the contents of the Work. The author, developers, and the Publisher have used their best efforts to ensure the accuracy and functionality of the textual material and/or programs contained in this package; we, however, make no warranty of any kind, express or implied, regarding the performance of these contents or programs. The Work is sold “as is” without warranty (except for defective materials used in manufacturing the book or due to faulty workmanship).
The author, developers, and the publisher of any accompanying content, and anyone involved in the composition, production, and manufacturing of this work will not be liable for damages of any kind arising out of the use of (or the inability to use) the algorithms, source code, computer programs, or textual material contained in this publication. This includes, but is not limited to, loss of revenue or profit, or other incidental, physical, or consequential damages arising out of the use of this Work.
The sole remedy in the event of a claim of any kind is expressly limited to replacement of the book and/or disc, and only at the discretion of the Publisher. The use of “implied warranty” and certain “exclusions” vary from state to state, and might not apply to the purchaser of this product.
Companion files for this title are available by writing to the publisher at [email protected].
Oswald Campesato
MERCURY LEARNING AND INFORMATION
Dulles, Virginia
Boston, Massachusetts
New Delhi
Copyright ©2022 by MERCURY LEARNING AND INFORMATION LLC. All rights reserved.
This publication, portions of it, or any accompanying software may not be reproduced in any way, stored in a retrieval system of any type, or transmitted by any means, media, electronic display or mechanical display, including, but not limited to, photocopy, recording, Internet postings, or scanning, without prior permission in writing from the publisher.
Publisher: David Pallai
MERCURY LEARNING AND INFORMATION
22841 Quicksilver Drive
Dulles, VA 20166
www.merclearning.com
800-232-0223
O. Campesato. SQL Pocket Primer.
ISBN: 978-1-68392-814-0
The publisher recognizes and respects all marks used by companies, manufacturers, and developers as a means to distinguish their products. All brand names and product names mentioned in this book are trademarks or service marks of their respective companies. Any omission or misuse (of any kind) of service marks or trademarks, etc. is not an attempt to infringe on the property of others.
Library of Congress Control Number: 2022930720
222324321 This book is printed on acid-free paper in the United States of America.
Our titles are available for adoption, license, or bulk purchase by institutions, corporations, etc. For additional information, please contact the Customer Service Dept. at 800-232-0223(toll free).
All of our titles are available in digital format at academiccourseware.com and other digital vendors. Companion files (figures and code listings) for this title are available by contacting [email protected]. The sole obligation of MERCURY LEARNING AND INFORMATION to the purchaser is to replace the disc, based on defective materials or faulty workmanship, but not based on the operation or functionality of the product.
I’d like to dedicate this book to my parents – may this bring joy and happiness into their lives.
Preface
Chapter 1: Introduction to RDBMSs and MySQL
What is MySQL?
What about MariaDB?
Installing MySQL
Useful Links for MySQL
What is an RDBMS?
What Relationships Do Tables Have in an RDBMS?
Features of an RDBMS
What is ACID?
When Do We Need an RDBMS?
Transferring Money Between Bank Accounts
The Importance of Normalization
A Four-Table RDBMS
Detailed Table Descriptions
The Customers Table
The purchase_orders Table
The line_items Table
The item_desc Table
SQL Statements for the Impatient (Optional)
What About an Item Inventory Table?
The Role of SQL
DCL, DDL, DQL, DML, and TCL
SQL Privileges
Properties of SQL Statements
The CREATE Keyword
Data Types in MySQL
The CHAR and VARCHAR Data Types
String-Based Data Types
FLOAT and DOUBLE Data Types
BLOB and TEXT Data Types
MySQL Database Operations
Creating a Database
Display a List of Databases
Display a List of Database Users
Dropping a Database
Exporting a Database
Renaming a Database
Show Database Tables
The INFORMATION_SCHEMA Table
The PROCESSLIST Table
SQL Formatting Tools
Summary
Chapter 2: Working with SQL and MySQL
Drop Database Tables
Create Database Tables
Manually Creating Tables for mytools.com
Creating Tables via a SQL Script for mytools.com
Creating Tables with Japanese Text
Creating Tables from the Command Line
Defining Table Attributes
Working with Aliases in SQL
Alter Database Tables with the ALTER Keyword
Add a Column to a Database Table
Drop a Column from a Database Table
Change the Data Type of a Column
What are Referential Constraints?
Combining Data for a Table Update (Optional)
Merging Data Columns in Multiple CSV Files via Pandas
Concatenating Data from Multiple CSV Files
Appending Table Data from CSV Files via SQL
Inserting Data into Database Tables
Populating Tables from Text Files
Working with Simple SELECT Statements
Duplicate Versus Distinct Rows
Unique Rows Versus Distinct Rows
The EXISTS Keyword
The LIMIT Keyword
DELETE, TRUNCATE, and DROP in SQL
SELECT, DELETE, and LIMIT Combinations
More Options for the DELETE Statement in SQL
Creating Tables from Existing Tables in SQL
Working with Temporary Tables in SQL
Creating Copies of Existing Tables in SQL
What is a SQL Index?
Types of Indexes
Creating an Index
Disabling and Enabling an Index
View and Drop Indexes
Overhead of Indexes
Considerations for Defining Indexes
When to Disable Indexes on a Table
Selecting Columns for an Index
Finding Columns Included in Indexes
Enhancing the mytools Database (Optional)
Entity Relationships
Summary
Chapter 3: Joins, Views, and Subqueries
Query Execution Order in SQL
Joining Tables in SQL
Types of SQL JOIN Statements
Examples of SQL JOIN Statements
An INNER JOIN Statement
A LEFT JOIN Statement
A RIGHT JOIN Statement
A CROSS JOIN Statement
MySQL NATURAL JOIN Statement
An INNER JOIN to Delete Duplicate Attributes
JOIN Statements on Tables with International Text
What is a View?
Creating a View
Dropping a View in SQL
Advantages of Views in SQL Statements
Views Involving a Single Table
Views Involving Multiple Tables
Updatable Views
Keys, Primary Keys, and Foreign Keys
Foreign Keys versus Primary Keys
A MySQL Example of Foreign Keys
Working with Subqueries in SQL
Two Types of Subqueries
A Subquery to Find Customers Without Purchase Orders
Subqueries with IN and NOT IN Clause
Subqueries with SOME, ALL, ANY Clause
Subqueries with the MAX() and AVG() Functions
Find Tallest Students in Each Classroom via a Subquery
SQL and Histograms
What are GROUP BY, ORDER BY, and HAVING Clauses?
Displaying Duplicate Attribute Values
Examples of the SQL GROUP BY and ORDER BY Clause
SQL Histograms on a Table Copy
Combine GROUP BY and ROLLUP Clause
The 2021 Olympics Medals and the ROLLUP Keyword
The 2021 Olympics Medals and the RANK Operator
The PARTITION BY Clause
GROUP BY, HAVING, and ORDER BY Clause
Combined GROUP BY, HAVING, and ORDER BY Clause
Updating the item_desc Table from the new_items Table
A SQL Query Involving a Four-Table Join
Operations with Dates in SQL
Day and Month Components of Dates in SQL
Rounding Dates in SQL
Working with Date Ranges
Tables Containing Modification Times
Arithmetic Operations with Dates
Date Components and Date Formats
Finding the Week in Date Values
Displaying Weekly Revenue
Assorted SQL Operators
Working with Column Aliases
SQL Variables
SQL Summary Reports
Simple SQL Reports
Calculating SubTotals
Calculating “Running” (Cumulative) Totals
Summary
Chapter 4: Assorted SQL Functions
Numeric Functions in SQL
Calculated Columns
The round(), ceil(), and floor() Functions
SQL Queries with the rand() Function
Log, Exponential, and Trig Functions in SQL
Scalar Functions in SQL
Aggregate Functions in SQL
SQL Queries with the max() and min() Functions
Find Maximum Values with SQL Subqueries
Simplify SQL Queries Containing Subqueries
Find Top-Ranked Numeric Values
Find the Second and Third Largest Values in a Column
Find the Top Three Values in a Column
Find Values with the OFFSET Keyword
String Functions in SQL
SQL Queries with the SUBSTRING() Function
The SUBSTRING() Function in SQL
Boolean Operators in SQL
The IN Keyword
Set Operators in SQL
AND, OR, and NOT Operators in SQL
Working with Arithmetic Operators
Arithmetic Aggregate Operators in SQL
Finding Average Values
SELECT Clauses with Multiple Aggregate Functions
The ORDER BY Clause in SQL
ORDER BY with Aggregate Functions
Largest Distinct Values and Frequency of Values
Character Functions and String Operators
SQL Character Functions
String Operators in SQL
The MATCH() Function and Text Search
CTEs and the “with” Keyword in MySQL (Version 8)
The with Keyword and a Recursive SQL Query
CTEs and the Mean, Stddev, and Z-scores
Linear Regression in SQL
Window Functions
Types of Window Functions in SQL
The SQL CASE Clause
Working with NULL Values in SQL
Miscellaneous One-Liners
Working with the CAST() Function in SQL
Summary
Chapter 5: NoSQL, SQLite, and Python
Non-Relational Database Systems
Advantages of Non-Relational Databases
What is NoSQL?
What is NewSQL?
RDBMSs Versus NoSQL: Which One to Use?
Good Data Types for NoSQL
Some Guidelines for Selecting a Database
NoSQL Databases
What is MongoDB?
Features of MongoDB
Installing MongoDB
Launching MongoDB
Useful Mongo APIs
Meta Characters in Mongo Queries
MongoDB Collections and Documents
Document Format in MongoDB
Create a MongoDB Collection
Working with MongoDB Collections
Find all Android Phones
Find All Android Phones in 2018
Insert a New Item (document)
Update an Existing Item (document)
Calculate the Average Price for Each Brand
Calculate the Average Price for Each Brand in 2019
Import Data with mongoimport
What is Fugue?
What is Compass?
What is PyMongo?
MySQL, SQLAlchemy, and Pandas
What is SQLAlchemy?
Read MySQL Data via SQLAlchemy
Export SQL Data from Pandas to Excel
MySQL and Connector/Python
Establishing a Database Connection
Reading Data from a Database Table
Creating a Database Table
What is SQLite?
SQLite Features
SQLite Installation
SQLiteStudio Installation
DB Browser for SQLite Installation
SQLiteDict (Optional)
Summary
Chapter 6: Miscellaneous Topics
Managing Users
Listing Current Users
Creating and Altering MySQL Users
Dropping MySQL Users
What are Roles?
Create Roles and Grant Privileges
Revoke Roles and Drop Roles
What is a User-Defined Function?
What is a Stored Procedure?
IN and OUT Parameters in Stored Procedures
A Simple Stored Procedure
What is a Stored Function?
A Simple Stored Function
What are SQL Triggers?
A Simple MySQL Trigger
MySQL Engines
What is Normalization?
What is Denormalization?
What are Schemas?
MySQL Workbench
Exporting a Schema in Workbench
Creating a Schema in Workbench
ERM and Tools
What is a Transaction?
The COMMIT and ROLLBACK Statements
The SAVEPOINT Statement
Database Optimization and Performance
Performance Tuning Considerations
SQL Query Optimization
Analyzing SQL Queries for Their Performance
Performance Tuning Tools
Cost-Based Optimizers (Optional)
Table Fragmentation
Table Partitioning
What is an EXPLAIN Plan?
Explain Analyze
Scaling an RDBMS
What is SQL Tuning?
What is Sharding?
RDBMS Support for Sharding
What is Federation?
Database Replication
Distributed Databases, Scalability, and the CAP Theorem
Master-Slave Replication
The CAP Theorem
What are Consistency Patterns?
MySQL Command Line Utilities
Database Backups, Restoring Data, and Upgrades
MySQL and JSON Data
Data Cleaning in SQL
Replace NULL with 0
Replace NULL Values with Average Value
Replace Multiple Values with a Single Value
Handle Mismatched Attribute Values
Convert Strings to Date Values
Data Cleaning From the Command Line (Optional)
Working with the sed Utility
Working with the awk Utility
Next Steps
Summary
Appendix: Introduction to Probability and Statistics
Index
This book is primarily for data scientists and machine learning engineers who want to expand their current knowledge of SQL using MySQL as the primary RDBMS. While this book does contain relevant information for novices in other fields, the structure of this book differs from typical database books.
In addition, this book attempts to balance depth and breadth, along with a decent number of SQL statements to illustrate the important features of SQL. Although it’s not possible to describe the exact set of features that constitute basic, intermediate, and advanced SQL queries (i.e., opinions will differ), this book contains SQL examples that belong to each of those three groups.
At the same time, remember that some topics in the final chapter are presented in a cursory manner, which is for two main reasons. First, although you don’t need an in-depth understanding of every facet of SQL and RDBMSs, it’s important that you be aware of these concepts if you plan to become highly proficient in managing database data. In addition, you will be in a better position to plan an itinerary for the set of topics that you will learn at some point in the future.
Second, a full treatment of every topic in this book would significantly increase the page count, and it’s debatable whether all the additional details would be beneficial to you as a machine learning engineer or a data scientist.
As you read in the previous section, this book is meant primarily for machine learning engineers and data scientists who already have a basic understanding of SQL, which means that they have executed some SQL statements in a database such as MySQL. As such, they will learn more details about SQL and MySQL so they can manage data in database tables. Moreover, the knowledge that they gain while working with MySQL can easily transfer to other RDBMSs such as ORACLE.
In addition, this book is intended to reach an international audience of readers, so this book uses standard English rather than colloquial expressions. As you know, many people learn by different types of imitation, which includes reading, writing, or hearing new material. This book takes these points into consideration in order to provide a comfortable and meaningful learning experience for the intended readers.
Before delving into the differences, it’s worth noting that this book covers many topics that you will find in database books of comparable length. At a minimum, any RDBMS book needs to include SQL, along with examples of how to select, delete, update, and insert data into a database table. Other mandatory topics include an explanation of views, indexes, joining tables, subqueries, normalization, and database schemas.
However, this book differs from generic database books because there are topics that are relevant to this target audience, which are not necessary for readers of generic database books. Some of those additional topics are discussed in chapter 6 (miscellaneous topics).
Another difference is a portion of Chapter 5, which contains Python-based code samples to access data from a MySQL table in a Pandas data frame. A third difference is the inclusion of the appendix that contains an introduction to probability and statistics, and a discussion of of entropy, cross-entropy, and KL divergence. Thus, it’s the collective set of differences that differentiate this book from generic SQL books.
The first chapter contains a short introduction to RDBMSs and MySQL, along with information about installing MySQL. In addition, you will see SQL statements for creating, dropping, and exporting a database. Although other books sometimes defer these operations to later chapters, they are easy to perform with empty or very small databases that do not contain any critical data. Therefore, you don’t have to worry about making costly mistakes because of a blunder in a SQL query.
The second chapter delves into creating database tables and various ways to populate them with data. This chapter also describes various ways of deleting data from database tables, followed by a discussion of indexes on tables and why they are important.
The third chapter explains the concept of “joining” database tables, followed by a discussion of views: what they are, what advantages they provide, and how to create them over a single table or multiple tables. You will also learn how to work with subqueries in SQL. In addition, this chapter introduces you to the notion of normalization, along with a clear and compelling reason for adopting database normalization.
The fourth chapter is primarily about SQL functions, which involves numeric functions such as ceil(), floor(), and random(). Aggregate functions are also discussed, followed by string-oriented SQL functions such as the substring() function. This chapter contains an assortment of SQL statements, some of which involve various combinations of GROUP BY, HAVING, and ORDER BY.
The fifth chapter introduces NoSQL, followed by an overview of MongoDB, which is a popular NoSQL database. Next you will learn about SQLite, which is an open-source RDBMS that is available on mobile devices.
Chapter six contains a diverse set of miscellaneous topics, such as normalization, schemas, database optimization, and performance. Then you will be introduced to EXPLAIN plans, SQL tuning, managing users, roles, stored procedures, and triggers.
As a convenience, Chapter 6 contains the SQL file mytools.sql that contains all the tables that are defined in this book. Moreover, the SQL file also contains the data for all the database tables. Of course, you can launch the individual SQL files for each of the tables if you prefer to do so the “long way”.
You can import the complete mytools database by starting MySQL and then issuing the following command from the command line in the directory that contains mytools.sql:
mysql -u root -p mytools < mytools.sql
NOTEIf you encounter issues when you launch the preceding command, read the section in Chapter 6 regardingMySQL Workbenchthat enables you to import databases and export databases.
Although this is an introductory book with minimal prerequisites, obviously you will benefit from having existing knowledge of various topics. Specifically, some knowledge of SQL will facilitate learning the SQL-related concepts more quickly. In addition, knowledge of Java is helpful for Appendix A, as well as some familiarity with XML and JSON. Familiarity with normalization will help you understand the relationships among the tables in the fictitious application that is discussed in Chapter 1 and Chapter 2.
If you want to be sure that you can grasp the material in this book, glance through some of the code samples to get an idea of how much is familiar to you and how much is new for you.
The companion files contains all the code samples to save you time and effort from the error-prone process of manually typing code into a text file. Furthermore, there are situations in which you might not have easy access to the companion files. In addition, the code samples in the book provide explanations that are not available on the companion files.
The primary purpose of the code samples in this book is to provide a variety of SQL statements that enable you to perform common and useful tasks in MySQL. Clarity has higher priority than writing more compact code that is more difficult to understand (and possibly more prone to bugs). If you decide to use any of the code in this book in a production website, you ought to subject that code to the same rigorous analysis as the other parts of your code base.
Although the answer to this question is more difficult to quantify, it’s very important to have strong desire to learn about data analytics, along with the motivation and discipline to read and understand the code samples.
If you are a Mac user, there are three ways to do so. The first method is to use Finder to navigate to Applications > Utilities and then double click on the Utilities application. Next, if you already have a command shell available, you can launch a new command shell by typing the following command:
open /Applications/Utilities/Terminal.app
A second method for Mac users is to open a new command shell on a MacBook from a command shell that is already visible simply by clicking command+n in that command shell, and your Mac will launch another command shell.
If you are a PC user, you can install Cygwin (open source https://cywin.com) which simulates bash commands, or use another toolkit such as MKS (a commercial product). Please read the online documentation that describes the download and installation process. Note that custom aliases are not automatically set if they are defined in a file other than the main start-up file (such as .bash_login).
All of the code samples and figures in this book may be obtained by writing to the publisher at [email protected].
The answer to this question varies, mainly because the answer depends heavily on your objectives. If you are interested primarily in working with structured data, then you can look for online resources that delve into more advanced topics.
If you are primarily interested in machine learning, then you have several options: NLP (natural language processing), deep learning, and reinforcement learning (and also deep reinforcement learning).
Fortunately, you can perform an Internet search to find many resources. One other point: the aspects of machine learning for you to learn depend on who you are: the needs of a machine learning engineer, data scientist, manager, student or software developer are all different.
This chapter introduces you to RDBMSs and various SQL concepts, along with a quick introduction to MySQL. MySQL is used in most of this book because it is a robust RDBMS that is available as a free download from an ORACLE website. Current trends suggest that MySQL will continue its dominant role for the foreseeable future. Moreover, virtually everything that you learn about MySQL in this chapter transfers to other RDBMSs, such as PostgreSQL and ORACLE.
This chapter describes a fictitious website that enables users to register themselves for the purpose of purchasing various home improvement tools (hammers, wrenches, and so forth). Instead of SQL statements, you will learn about the tables that are required, their relationships, and the structure of those tables. You will also see some SQL INSERT statements for inserting data into database tables. Although we have yet to create any database tables, these SQL statements are intuitive and easy to grasp. Then, in Chapter 2, you will see the SQL statements that create the tables that are discussed in this chapter.
The first part of this chapter introduces the concept of an RDBMS, and the rationale for using an RDBMS. In particular, you will see an example of an RDBMS with a single table, two tables, and four tables (and much larger RDBMSs exist). This section also introduces the notion of database normalization, and how it assists you in maintaining data integrity (“single source of truth”) in an RDBMS.
The second part of this chapter describes the structure of the tables in a four-table database that keeps track of customer purchases of home improvement tools that consumers can purchase through the associated Web page. You will also see the different relationships among pairs of tables, and how a one-to-many relationship enables you to find all the line items that belong to a given purchase order.
The third portion of this chapter contains a brief introduction to SQL and some basic examples of SQL queries (more details are in Chapter 2). You will also learn about the terminology for various types of SQL statements that can be classified as DCL (Data Control Language, DDL (Data Definition Language), DQL (Data Query Language), or DML (Data Manipulation Language).
The fourth portion of this chapter discusses SQL data types, and the fifth portion discusses database operations, such as creating, dropping, and renaming a database in MySQL. The final portion discusses two useful built-in tables that enable you to find the columns of a given table and the status of SQL statements.
There are several points to keep in mind before reading this chapter. First, the style for this chapter (and also the next chapter) is a “top-down” approach whereby high-level details are described and then hands-on coding details are discussed. However, you are free to reverse the order in which you read the first two chapters, if you prefer a “bottom-up” approach whereby you first learn more details regarding SQL statements and then learn about a use case in this chapter.
Second, there is an important detail that is mentioned in the preface that is worth repeating here: this book is primarily for data scientists who want to increase their knowledge of SQL to manage data in a database. Although this book can be useful for any motivated beginner, its primary purpose is different from books that prepare readers to become database administrators (DBAs).
Third, there is a section in the middle of this chapter that shows you the SQL statements that create several tables, along with details of purchase orders. This section is a preview of what you will learn in subsequent chapters, and it’s intended primarily for readers who already have a good understanding of SQL statements. However, if you are unfamiliar with the syntax of the SQL statements in that section, there’s no need to worry: you can return to this section after reading subsequent chapters that explain the details of the SQL syntax and functionality.
MySQL is an open source database that is portable and provides many features that are available in commercial databases. Oracle is the steward of the MySQL database, and you can download MySQL 8.0 from the following site:
https://www.mysql.com/downloads/
MySQL is a highly popular database that is used by many companies, including Amazon, Google, LinkedIn, Netflix, and Twitter. MySQL is written in C++, whereas the user-level interaction is through SQL. Other add-ons for MySQL can be purchased from Oracle, as well as free third-party tools are available for monitoring and managing MySQL databases.
If you prefer, MySQL also provides a GUI interface for performing database-related operations. The code samples in this book have been written for MySQL 8, which provides the following new features beyond earlier versions:
A transactional data dictionary
Improved support for BLOB, TEXT, GEOMETRY, and JSON data types
Support for CTEs (common table expressions)
Support for window functions
As you will see in Chapter 6, MySQL supports pluggable storage engines, such as InnoDB (the most commonly used MySQL storage engine). In addition, Facebook developed an open source storage engine called MyRocks that has better compression and performance, so it might be worth while to explore the advantage of MyRocks over the other storage engines for MySQL.
MySQL began as an open source project, and retained its name after the Oracle acquisition. Shortly thereafter, the MariaDB database was created, which is a “fork” of the MySQL database. Although MariaDB supports all the features of MySQL, there are important differences between MySQL and MariaDB that you can read about online:
https://mariadb.com/kb/en/mariadb-vs-mysql-compatibility/
Download the MySQL distribution for your machine and perform the installation procedure. After you complete the installation, log into MySQL as root with the following command, which will prompt you for the root password:
$ mysql -u root -p
If you installed MySQL via a DMG file, then the root password is the same as the password for your machine.
This section contains various links that may be useful as you read the chapters of this book. Although SQL is not discussed in detail until the next chapter, the SQL links are included here for your convenience.
MySQL won the DBMS of the Year award in 2019:
https://db-engines.com/en/blog_post/83
The following link contains the list of platforms that support MySQL:
https://www.mysql.com/de/support/supportedplatforms/database.html
The following link contains a comparison between SQL and MySQL:
https://www.softwaretestinghelp.com/sql-vs-mysql-vs-sql-server/
A comparison of MySQL, Microsoft SQL Server, and PostgreSQL is available online:
https://db-engines.com/en/system/Microsoft+SQL+Server%3BMySQL%3BPostgreSQL
The latest version of SQL is SQL:2016:
https://en.wikipedia.org/wiki/SQL:2016
The following website contains details regarding MySQL Standards Compliance:
https://dev.mysql.com/doc/refman/8.0/en/compatibility.html
The following website describes MySQL Extensions to Standard SQL:
https://dev.mysql.com/doc/refman/8.0/en/extensions-to-ansi.html
The following website is a FAQ for MySQL 8.0, along with download links for the MySQL manual in multiple formats:
https://dev.mysql.com/doc/refman/8.0/en/faqs.html
RDBMS is an initialism for Relational DataBase Management System. RDBMSs store data in tables that contain labeled attributes (sometimes called columns) that have a specific data type. Examples of an RDBMS include MySQL, ORACLE, and IBM DB2. While an RDBMS is software that manages data, a DBMS is the underlying “store” where the data resides.
Although relational databases often provide a very good solution for managing data, speed and scalability might be an issue in some cases. Chapter 5 discusses NoSQL databases, such as MongoDB, that might be more suitable for speed and scalability.
While an RDBMS can consist of a single table, it often comprises multiple tables that can have various types of associations with each other. For example, when you buy various items at a food store, your receipt consists of one purchase order that contains one or more “line items,” where each line item indicates the details of a particular item that you purchased. This is called a one-to-many relationship between a purchase order (which is stored in a purchase_orders table) and the line items (stored in a line_items table) for each item that you purchased.
Another example involves students and courses: each student is enrolled in one or more courses, which is a one-to-many relationship from students to courses. Moreover, each course contains one or more students, so there is a one-to-many relationship from courses to students. Hence, the students and course tables have a many-to-many relationship.
A third example is an employees table, where each row contains information about one employee. If each row includes the id of the manager of the given employee, then the employees table is a self-referential table because finding the manager of the employee involves searching the employees table with the manager’s id that is stored in the given employee record. However, if the rows in an employees table do not contain information about an employee’s manager, then the table is not self-referential.
In addition to table definitions, a database frequently contains indexes, primary keys, and foreign keys that facilitate searching for data in tables and also connecting a row in a given table with its logically related row (or rows) in another table. For example, if we have the id value for a particular purchase order in the purchase_orders table, we can find all the line items (i.e., the items that were purchased) in the line_items table that contain the same purchase order id.
An RDBMS provides a convenient way to store data, often associated with some type of application. For example, later you will see the details of a four-table RDBMS that keeps track of tools that are purchased via a Web-based application. From a high-level perspective, an RDBMS provides the following characteristics:
a database contains one or more tables
data is stored in tables
data records have the same structure
well-suited for vertical scaling
support for ACID (explained below)
Another useful concept is a logical schema that consists of the collection of tables and their relationships (along with indexes, views, triggers, and so forth) in an RDBMS. The schema is used for generating a physical schema, which consists of all the SQL statements that are required to create the specified tables and their relationships.
For example, Chapter 6 contains a SQL file mytools.sql that contains the definition of every entity in the mytools database, as well as the directory mytools-sql-files-20211120 that contains a SQL file for every table in the mytools database. Moreover, Chapter 6 describes two techniques for exporting a MySQL database. After the tables have been generated, you can begin inserting data and then managing the consistency of the data.
ACID is an acronym for Atomicity, Consistency, Isolation, and Durability, which refers to properties of RDBMS transactions, as summarized below.
Atomicity
means that each transaction is all-or-nothing, so if a transaction fails, the system is rolled back to its previous state.
Consistency
means that successful transactions always result in a valid database state.
Isolation
means that executing transactions concurrently or serially will result in the state.
Durability
means that a committed transaction will remain in the same state.
RDBMSs support ACID, whereas NoSQL databases usually do not support ACID.
The short answer is that an RDBMS is useful when we need to store records of events that have occurred, which can be involve simple item purchases as well as complex multi-table financial transactions.
An RDBMS allows you to define a collection of tables that contain rows of data, where a row contains one or more attributes (informally called fields). A row of data is a record of an event that occurred at a specific point in time, which can involve more than one table, and can also involve some type of transaction.
Consider a simple money transfer between two bank accounts in which you want to transfer $100 from a savings account to a checking account. The process involves two steps:
debiting (subtracting) the savings account by $100 and
crediting (adding) the checking account with $100.
However, if a system failure occurs after step 1 and before step 2 can be completed, you have lost $100. Obviously, steps 1 and 2 must be treated as an atomic transaction, which means that the transaction is successful only when both steps have completed successfully. If the transaction is unsuccessful, the transaction is “rolled back” so the system is returned to the state prior to transferring money between the two accounts.
As you learned earlier in this chapter, RDBMSs support ACID, which ensures that the previous transaction (i.e., transferring money between accounts) is treated as an atomic transaction.
Although atomic transactions are indispensable in financial systems, they might not be as critical for other systems. For example, a database that contains a lone events table in which each row contains information about a single event that you created by some process (such as a registration form) whenever a new event occurs in a system. Although this is conceptually simple, notice that the following attributes are relevant for each row in the events table: event_id, event_time, event_title, event_duration, and event_location, and possibly additional attributes.
As another example, displaying a set of pictures might not show the pictures in the correct order (e.g., based on their creation time). However, a failure in the event creation is not as critical as a failure in a financial system, and displaying images in the wrong sequence will probably be rectified when the page is refreshed.
This section contains an introduction to the concept of normalization. As a starting point, consider an RDBMS that stores records for the temperature of a room during a time interval (such as a day, a week, or some other time interval). We just need one device_temperature table where each row contains the temperature of a room at a specific time. In the case of the IoT (Internet of Things), the temperature is recorded during regular time intervals (such as minute-by-minute or hourly).
If you need to track only one room, the device_temperature table is probably sufficient. However, if you need to track multiple devices in a room, then it’s convenient to create a second table called device_details that contains attributes for each device, such as device_id, device_name, device_year, device_price, and device_warranty.
Whenever we want the details of a temperature-related event, we need information from both tables, which consists of one row in the device_temperature table and its associated row in the device_details table. The way to perform the two-table connection is simple: each row in the device_details table contains a device_id that uniquely identifies the given row. Moreover, the samedevice_id appears in any row of the device_temperature table that refers to the given device.
The preceding two-table structure is a minimalistic example of something called database normalization, whose purpose is to reduce data redundancy in database tables. Normalization can result in a slower performance during the execution of some types of SQL statements (e.g., those that contain a JOIN keyword).
If you are new to the concept of database normalization, you might be thinking that normalization increases complexity and reduces performance without providing tangible benefits. While this is a valid thought, the trade-off is worthwhile because normalization enables you to maintain data consistency.
For example, suppose that every record in the purchase_orders table contains all the details of the customer who made the associated purchase. As a result, we can eliminate the customers table. However, if we ever need to update the address of a particular customer, we need to update all the rows in thepurchase_orderstable that contain that customer. By contrast, if we maintain acustomerstable, then updating a customer’s address involves changing a single row in thecustomerstable.
Normalization enables us to avoid data duplication so that there is a single “source of truth” in the event that information (such as a customer’s address) must be updated. From another perspective, data duplication means that the same data appears in two (or possibly more) locations, and if an update is not applied to all those locations, the database data is in an inconsistent state. Depending on the nature of the application, the consequences of inconsistent data can range from minor to catastrophic.
Always remember the following point: whenever you need to update the same data that resides in two different locations, you increase the risk of a data inconsistency, which can adversely affect the data integrity.
As another example, suppose that a site sells widgets online. At a minimum, the associated database needs the following four tables:
customer_details
purchase_orders
po_line_items
item_desc
The preceding scenario is explored in greater detail in the next section that specifies the attributes of each of the preceding tables.
Suppose that www.mytools.com sells tools (the details of which are not important). For simplicity, let’s pretend that an actual website is available at the preceding URL and it contains the following sections:
new user register registration
existing user log in
input fields for selecting items for purchase (and the quantities)
For example, the registered user John Smith wants to purchase one hammer, two screwdrivers, and three wrenches. The website needs to provide users with the ability to search for products by their type (e.g., a hammer, a screwdriver, or a wrench) and then display a list of matching products. Each product in that list would also contain an SKU, which is an industry-standard labeling mechanism for products (just like ISBNs for identifying books).
The preceding functionality is necessary in order to develop a website that enables users to purchase products. However, the purpose of this section is to describe a set of tables (and their relationships to each other) in an RDBMS, so we will assume that the necessary Web-based features are available at our URL.
Let’s describe a use case that contains the sequence of steps that are performed on behalf of an existing customer John Smith (whose customer ID is 1000), who wants to purchase 1 hammer, 2 screwdrivers, and 3 wrenches:
Step 1: Customer John Smith (with cust_id 1000) initiates a new purchase.
Step 2: A new purchase order is created with the value 12500 for po_id.
Step 3: John Smith selects 1 hammer, 2 screwdrivers, and 3 wrenches.
Step 4: The associated prices for the items are $20.00, $16.00, and $30.00.
Step 5: The subtotals for the items are $20.00, $16.00, and $30.00.
Step 6: A 10% tax for the items is $2.00, $1.60, and $3.00.
Step 7: The total cost of this purchase order is $72.60.
There are additional steps that you could perform. For example, Step 8 would allow John Smith to remove an item, increase/decrease the quantity for each selected item, delete items, or cancel the purchase order. Step 9 would enable John Smith to make a payment. Once again, for the sake of simplicity, we will assume that Step 8 and Step 9 are available in an enhanced version of this Web application.
Note that Step 8 involves updating several of our tables with the details of the purchase order. Step 9 creates a time stamp for the date when the purchase order was created, as well as the status of the purchase order (“paid” versus “pending”). The status of a purchase order is used to generated reports to display the customers whose payment is overdue (and perhaps also send them friendly reminders). Sometimes companies have a reward-based system whereby customers who have paid on time can collect credits that can be applied to other purchases (which is essentially a discount mechanism).
If you visualize the use case described in the previous section, you can probably see that we need
a table for storing customer-specific information
a table to store purchase orders (which is somehow linked to the associated customer)
a table that contains the details of the items and quantity that are purchased (which are commonly called “line items”)
a table that contains information about each tool (which includes the name, the description, and the price of the tool).
Hence, the RDBMS for our website requires the following tables:
customers
purchase_orders
line_items
item_desc
The following subsections describe the contents of the preceding tables, along with the relationships among these tables.
Although there are different ways to specify the attributes of the customers table, you need enough information to uniquely identify each customer in the table. By analogy, the following information (except for cust_id) is required to send physical mail to a person:
cust_id
first_name
last_name
home_address
city
state
zip_code
We will create the customers table with the attributes in the preceding list. Although we’ll defer the discussion of keys to a later chapter, it’s obvious that we need a mechanism for uniquely identifying every customer. In this table, notice that the cust_id attribute uniquely identifies every customer, and therefore it’s a key for this table. Other examples of keys for database tables include
social security numbers for people
student id numbers for students
course id numbers for classes
drivers’ licenses
Whenever we need to refer to the details of a particular customer, we will use the associated value of cust_id to retrieve those details from the row in the customers table that has the associated cust_id.
The preceding paragraph describes the essence of linking related tables T1 and T2 in an RDBMS: the key in T1 is stored as an attribute value in T2. If we need to access related information in table T3, then we store the key in T2 as an attribute value in T3.
Note that a customers table in a production system would contain additional attributes, such as the following:
title (Mr, Mrs, Ms, and so forth)
shipping_address
cell_phone
For the sake of simplicity, we’ll use the initial set of attributes to define the customers table. Later, you can add the new attributes to the four-table schema to make the system more like a real system.
Suppose that the following information pertains to customer John Smith, who has been assigned a cust_id of 1000:
cust_id: 1000
first_name: John
last_name: Smith
home_address: 1000 Appian Way
city: Sunnyvale
state: California
zip_code:95959
Whenever John Smith makes a new purchase, we will use the cust_id value of 1000 to create a new row for this customer in the purchase_orders table. Then whenever we need to find the purchase orders associated with John Smith, we simply look for the rows in the purchase_orders table whose cust_id value equals 1000.
When existing customers visit the website, they must log into the system, after which they can initiate a new purchase. After they select one or more items, the system creates a purchase order to insert as a new row in the purchase_orders table, and a new row in the line_items table for each item that was selected. While you might be tempted to place all the customers’ details in the new row, we will identify the customer by the associated cust_id and use this value instead.
However, we must create a new row in the customers table whenever new users register at the website. Repeat customers are identified by an existing cust_id that must be determined by searching the customers table with the information that the customer types into the input fields of the main webpage.
The customers table contains a key attribute; similarly, the purchase_orders table contains an attribute that we call po_id (you are free to use a different string) in order to associate a purchase order for a given customer.
Keep in mind the following detail: a row with a given po_id also contains the cust_id value of the customer (in the customers table) who initiated the current purchase order. Although there are multiple ways to define a set of suitable attributes, let’s use the following set of attributes for the purchase_orders table:
cust_id
po_id
purchase_date
For example, suppose that customer John Smith, whose cust_id is 1000, purchases some tools on December 01, 2021. Although there are dozens of different date formats that are supported in RDBMS, we use the YYYY-MM-DD format (which you can change to suit your particular needs). Then the new row for John Smith in the purchase_orders looks like this, where the po_id value was arbitrarily assigned:
cust_id: 1000
po_id: 12500
purchase_date: 2021-12-01
As mentioned earlier, a purchase order involves one or more items, each of which is stored in the line_items table that is discussed in the next section.
As a concrete example, suppose that customer John Smith requested 1 hammer, 2 screwdrivers, and 3 wrenches in his most recent purchase order. Each of these purchased items requires a row in the line_items table that
is identified by a
line_id
value
specifies the quantity of each purchased item
contains the value for the associated
po_id
in the
purchase_orders
table
contains the value for the associated
item_id
in the
item_desc
table
For simplicity, let’s assign the values 5001, 5002, and 5003 to the line_id attribute for the three new rows in the line_items table that represent the hammer, screwdriver, and wrench items in the current purchase order. A line_item row might look like the following code:
po_id: 12500
line_id: 5001
item_id: 100 <= we'll discuss this soon
item_count: 1
item_price: 20.00
item_tax: 2.00
item_subtotal: 22.00
Notice there is no cust_id in the preceding line_item: that’s because of the top-down approach for retrieving data. Specifically, we start with a particular cust_id that we use to find a list of purchase orders in the purchase_orders table that belong to the given cust_id. For each purchase order in the purchase_orders table, we perform a search for the associated line items in the line_items table. We can repeat the preceding sequence of steps for each customer in a list of cust_id values.
Let us return to the line_item details. We need to reference each purchased item by its associated identifier in the item_desc table. Once again, we arbitrarily assign item_id values of 100, 200, and 300, respectively, for the hammer, screwdriver, and wrench items. The actual values will undoubtedly be different in your application, so there is no special significance to the numbers 100, 200, and 300.
The three rows in the line_items table (that belong to the same purchase order) look like this (we’ll look at the corresponding SQL statements later):
po_id: 12500
line_id: 5001
item_id: 100
item_count: 1
item_price: 20.00
item_tax: 2.00
item_subtotal: 22.00
po_id: 12500
line_id: 5002
item_id: 200
item_count: 2
item_price: 8.00
item_tax: 1.60
item_subtotal: 17.60
po_id: 12500
line_id: 5003
item_id: 300
item_count: 3
item_price: 10.00
item_tax: 3.00
item_subtotal: 33.00
Recall that the customers table contains information about each customer, and a new row is created each time that a new customer creates an account for our Web application. In a somewhat analogous fashion, the item_desc table contains information about each item (aka product) that can be purchased from our website. If our website becomes popular, the contents of the item_desc table contents are updated more frequently than the customers table, typically in the following situations:
A new tool (aka product) is available for purchase
An existing tool is no longer available for purchase
Thus, the item_desc table contains all the details for every tool that is available for sale, and it’s the “source of truth” for the tools that customers can purchase from the website. At a minimum, this table contains three fields (we’ll discuss the SQL statement for creating and populating this table later):
SELECT *
FROM item_desc;
+---------+-------------+------------+
| item_id | item_desc | item_price |
+---------+-------------+------------+
| 100 | hammer | 20.00 |
| 200 | screwdriver | 8.00 |
| 300 | wrench | 10.00 |
+---------+-------------+------------+
3 rows in set (0.001 sec)
There is one more important detail to discuss: if an item is no longer for sale, can we simply drop its row from the item_desc