Data analytics is becoming increasingly important in our daily lives. This book offers a comprehensive view of data analytics skills, starting with a primer on statistics and progressing to the application of these methods. The text includes various formulas and algorithms used in data analytics, which can be applied in any software to achieve desired results. Through numerous demonstrations, it provides clear instruction on how to incorporate data analytics into critical thinking.
The book covers a range of methods and techniques, supplemented with case studies specific to project managers, systems engineers, and cybersecurity professionals. Each profession can practice data analytics relevant to their fields. The main objective is to refresh statistical knowledge necessary for building data analytics models and to foster analytical thinking essential across these professions.
From introducing statistics and data to reviewing central tendency measures and probability, the book moves to more complex topics like effect size, analysis methods, and data presentation. By the end of the course, readers will be well-versed in data analytics, ready to apply these skills effectively in their respective fields, enhancing decision-making and analytical thinking.
Number of pages: 219
Year of publication: 2024
DATA ANALYTICS
LICENSE, DISCLAIMER OF LIABILITY, AND LIMITED WARRANTY
By purchasing or using this book (the “Work”), you agree that this license grants permission to use the contents contained herein, but does not give you the right of ownership to any of the textual content in the book or ownership to any of the information, files, or products contained in it. This license does not permit uploading of the Work onto the Internet or on a network (of any kind) without the written consent of the Publisher. Duplication or dissemination of any text, code, simulations, images, etc. contained herein is limited to and subject to licensing terms for the respective products, and permission must be obtained from the Publisher or the owner of the content, etc., in order to reproduce or network any portion of the textual material (in any media) that is contained in the Work.
MERCURY LEARNING AND INFORMATION (“MLI” or “the Publisher”) and anyone involved in the creation, writing, production, accompanying algorithms, code, or computer programs (“the software”), and any accompanying Web site or software of the Work, cannot and do not warrant the performance or results that might be obtained by using the contents of the Work. The author, developers, and the Publisher have used their best efforts to ensure the accuracy and functionality of the textual material and/or programs contained in this package; we, however, make no warranty of any kind, express or implied, regarding the performance of these contents or programs. The Work is sold “as is” without warranty (except for defective materials used in manufacturing the book or due to faulty workmanship).
The author, developers, and the publisher of any accompanying content, and anyone involved in the composition, production, and manufacturing of this work will not be liable for damages of any kind arising out of the use of (or the inability to use) the algorithms, source code, computer programs, or textual material contained in this publication. This includes, but is not limited to, loss of revenue or profit, or other incidental, physical, or consequential damages arising out of the use of this Work.
The sole remedy in the event of a claim of any kind is expressly limited to replacement of the book and only at the discretion of the Publisher. The use of “implied warranty” and certain “exclusions” vary from state to state and might not apply to the purchaser of this product.
Copyright ©2021 by MERCURY LEARNING AND INFORMATION LLC. All rights reserved.
This publication, portions of it, or any accompanying software may not be reproduced in any way, stored in a retrieval system of any type, or transmitted by any means, media, electronic display, or mechanical display, including, but not limited to, photocopy, recording, Internet postings, or scanning, without prior permission in writing from the publisher.
Publisher: David Pallai
MERCURY LEARNING AND INFORMATION
22841 Quicksilver Drive
Dulles, VA 20166
www.merclearning.com
(800) 232-0223
C. Greco. Data Analytics: Systems Engineering • Cybersecurity • Project Management
ISBN: 978-1-6839264-81
The publisher recognizes and respects all marks used by companies, manufacturers, and developers as a means to distinguish their products. All brand names and product names mentioned in this book are trademarks or service marks of their respective companies. Any omission or misuse (of any kind) of service marks or trademarks, etc. is not an attempt to infringe on the property of others.
The use of US government Websites and data are permitted under the rules governed at this site: https://www.usa.gov/government-works. Using the information on these sites does not mean this book is endorsed in any way by any agency of the US Federal Government. The opinions in this text are only those of the author.
Library of Congress Control Number: 2020952869
21 22 23 3 2 1 This book is printed on acid-free paper in the United States of America.
Our titles are available for adoption, license, or bulk purchase by institutions, corporations, etc.
For additional information, please contact the Customer Service Dept. at (800) 232-0223 (toll free). Digital versions of our titles are available at: www.academiccourseware.com and other electronic vendors.
The sole obligation of MERCURY LEARNING AND INFORMATION to the purchaser is to replace the book and/or disc, based on defective materials or faulty workmanship, but not based on the operation or functionality of the product.
CONTENTS
Preface
Acknowledgments
1 INTRODUCTION TO STATISTICS FOR DATA ANALYSTS
1.1 Objectives
1.2 The Three Professions
2 WHAT IS DATA?
2.1 Data Types
2.1.1 Quantitative values
2.1.2 Qualitative values
2.1.3 Application of Each Type of Data
3 STATISTICS REVIEW – MEASURES OF THE CENTRAL TENDENCY
3.1 Mean
3.1.0 Averaging with the PERT Method
3.1.1 Geometric Mean
3.2 Median
3.3 Mode
3.4 Data Skew
3.4.1 Kurtosis
3.5 Measures of Variation
3.5.1 Variance
3.5.2 Standard Deviation
3.5.2.1 Real-World Use of the Standard Deviation
3.6 Standard Normal Curve vs. Normal Curve
3.7 Other Measures of Variation
3.7.1 Mean Absolute Deviation
3.7.2 Median Absolute Deviation
3.7.3 Still More Tests for Variation
3.7.3.1 Range
3.7.3.2 Inter-Quartile Range (IQR)
3.7.3.3 Percentile
3.7.4 Five Number Summary
4 PROBABILITY PRIMER
4.1 Addition Method in Probability
4.2 Multiplication Property of Probability
4.3 Bayesian Probability
5 OCCAM’S RAZOR AND DATA ANALYTICS
5.1 Data Origination
6 DATA ANALYSIS TOOLS
6.1 Microsoft Excel
6.2 R Stats
6.3 Open Office
6.4 Minitab
6.5 Tableau, SPSS, QLIK, and others
6.6 Geospatial Statistical Systems
6.6.1 ARCGIS
6.6.2 QGIS
7 EFFECT SIZE
7.1 Correlation
7.1.1 Correlation does not mean causation, but…
8 ANALYSIS PROCESS METHODS
8.1 CRISP-DM Method
8.1.1 Understand the Organization
8.1.2 Understanding the Data
8.1.3 Preparing the Data
8.1.4 Analyze and Interpret the Data
8.1.5 Evaluate the Analysis
8.1.6 Communicate and Deploy the Results
8.2 Alternative Method
8.2.1 Framing the Question
8.2.2 Understanding Data
8.2.3 Choose a Method
8.2.4 Calculate the Statistics
8.2.5 Interpret the Statistics
8.2.6 Test the Significance of the Statistics
8.2.7 Question the Results
9 DATA ANALYTICS THINKING
9.1 Elements of Data Analytic Thinking
9.1.1 Data Structure
9.1.2 Analysis Elements Inside Data
9.1.3 Analysis Elements Outside Data
9.2 There is a “Why” in Analysis
9.2.1 The “V’s” in Data
9.2.1.1 Data Velocity
9.2.1.2 Data Variety
9.2.1.3 Data Volume
9.2.1.4 Data Vulnerability
9.3 Risk
9.3.1 Probability of Risk
9.3.2 Risk Impact
9.3.3 The Risk Chart
10 WHERE’S THE DATA?
10.1 Data Locations
10.2 How Much Data?
10.3 Sampling
10.3.1 Random Sampling
10.3.2 Systematic Sampling
10.3.3 Sampling Bias
10.3.3.1 Mitigating Data Bias
10.3.4 Determinism
10.3.4.1 Lift
10.3.4.2 Leverage
10.3.4.3 Support
10.3.4.4 Strength
11 DATA PRESENTATION
11.1 The Good, the Bad, and the OMG
11.2 Real-World Example from a Project Management Perspective
12 GEOSPATIAL DATA ANALYTICS
12.1 Geospatial Mean Center
12.1.1 Real-World Application of Geospatial Mean
12.2 Standard Distance
12.3 Standard Deviational Ellipse
12.4 Geary’s C
13 ADDITIONAL DATA ANALYTICS METHODS
13.1 Entropy
13.2 Effect Size, Part 2
13.3 Modeling and Simulation
13.3.1 Model Type
13.3.2 Simulation
14 SUMMARY
15 CASE STUDIES
15.1 Case Study Scenario
15.2 Case Study: Description of Data
15.3 Case Study: Normal Curve
15.4 Case Study: Variation Measures
15.5 Case Study: Probability
15.6 Case Study: Occam’s Razor
APPENDIX
RECOMMENDED SOLUTIONS FOR CASE STUDIES
A Introduction
A.1 Recommended Approach for Case Study 15.2
A.2 Recommended Approach for Case Study 15.3
A.3 Recommended Approach for Case Study 15.4
A.4 Recommended Approach for Case Study 15.5
A.5 Recommended Approach for Case Study 15.6
REFERENCES
INDEX
PREFACE
Data analytics, in some form, has been practiced since the beginning of human existence. People may say that we are just maturing in this endeavor, but time has shown that humans are more than ready to gather data and analyze it to achieve greatness. When Wilbur and Orville Wright flew in 1903, they did not do so on a whim, but gathered data to see how and when their airplane would fly. There was a book written several decades ago about the owner of a TV dinner brand who made a million dollars by understanding the demographics of the time. The author saw the data on homemakers who were tired of making dinner every night. The TV dinner was a way to provide a solution to this issue. As I recall, the work of selling the product was hard, but the outcome was successful. Another book that shows how central data is to human existence is Outliers: The Story of Success by Malcolm Gladwell [Gladwell-2008]. Gladwell highlights the idea that creativity, linked with data analysis, can lead the user and analyst to new perspectives, and he points out the successes of this activity.
If you have not heard Gladwell’s talk on Pepsi, watch it online at https://www.youtube.com/watch?v=VkhFh5Ms1vc. It will forever change how you think about the use of data in any form.
Tool-centric learning has to be mitigated since “buttonology” is not the way to really teach analytics (or any other subject for that matter – other than those classes that teach the tool details). Therefore, this book is helpful to an instructor of the craft. Please buy this book and use it as a classroom study guide. Please enjoy and share the content in this book with your students and other teachers, and feel free to provide feedback. I would love to hear from you about the book and how to make it better. You can contact me through my website, www.grectech.com, and I would be happy to answer your questions. Thank you in advance for your help and support.
One last item, there are screenshots of US government websites that are permitted under the rules governed at this site: https://www.usa.gov/government-works. Using these sites and the information on them does not mean this book is endorsed by any agency of the US Federal Government. The opinions in this text are only those of the author.
ACKNOWLEDGMENTS
In this, and every other endeavor, there are people who are the unsung heroes who help and provide support in the background, but in the end, are never listed on the cover of the book. Of course, without my family supporting me, this would have been a failure. My wife, daughters, son-in-law, and grandson are all part of the family unit that helped me get through this. It is not a sprint but a marathon, and thanks to you all, I ended up finishing the race. My siblings are part of this support unit and I thank them for the consistent positive support they give me. In addition, I want to thank my friends, like Greg, Barbara, Rich, and others, whose consistent kindness and generosity make me want to share my work. My company motto is “Learn, Offer, Value, and Educate.” You likely figured out that the first letters of each word spell “LOVE,” and that is what makes us all a success. I would also like to thank Jennifer Blaney and the copy editors for their diligence and work on this product.
CHAPTER 1
INTRODUCTION TO STATISTICS FOR DATA ANALYSTS
Data analytics is creeping into the lexicon of our daily language. If the remarks heard throughout the day include “big data” or “AI” or “algorithm,” then data analytics is probably involved. This book gives the reader a perspective as to the overall data analytics skill set, starting with a primer on statistics, and works toward the application of those methods. However, and most importantly, this book includes the use of data that is available to anyone to analyze, and it can be used as a reference for more complex data analytics tasks. As the reader will experience in this book, there are a variety of formulas and algorithms used in the data analytics process. These formulas are there so that no matter what technology the reader uses, they will be able to plug the formula into the software application to get the answer they need. There are several demonstrations of this process in the book, so please do not worry about knowing the formulas in detail as much as being able to recognize and use them.
Every statistical formula contains some symbols that may not be recognized by the reader, and this is OK. An explanation of those symbols is included in the formula description so the reader can understand what the formula is trying to calculate. Think of the formulas as a universal key to a problem. Once the formula is translated into “buttonology” or the procedure for the software, the reader will be able to achieve the result for that specific analysis. Although data science tools are not covered in detail in this book, some are introduced. Many who are reading this text have been inundated with suggestions for using one or two common statistical tools, but this book is more “tool agnostic” than “tool endorsing.” By using formulas, the tool chosen makes no difference. The main point is that no matter what tool the reader uses, the formula can be used with that tool.
1.1 Objectives
This section describes the objectives of this book, starting with the most fundamental of definitions. There are analysts who do not understand what data entails, and thereby jump to conclusions without really understanding what the data implies or contains. In law enforcement, there is a modus operandi (or method of operation, otherwise known as the “MO”) that criminals use in order to steal, scam, or otherwise fool the law-abiding citizen. It is the same way with data, since data has trending or anomalous events that reveal what the analyst is trying to answer.
After understanding the definition of data, we review statistical fundamentals. This is a perfunctory approach to the fundamentals, but introduces certain concepts that were probably not raised in your “Introduction to Statistics” courses.
The “Occam’s Razor” approach to data analysis is a very quick overview of keeping it simple. Named after William of Occam, this adage is the quintessential principle to follow to keep data analytics simple but effective. There are colleagues who have addressed the yearning of students to go straight to regression analysis instead of trying a simpler approach to data analytics. A story making the rounds through the statistics realm is one where a junior analyst comes to the senior analyst and expresses excitement over using regression analysis to find an answer. The student is met with the following from the senior analyst:
“Why would you use such a complex method to find the answer? There has got to be a simpler way to figure this out!”
The student was hurt.
“That is just mean, sir.”
The senior analyst replied,
“You are correct! The mean or average is a much better indication in this case!”
Simple should be the watchword of the analyst. This section will get into a little more detail on this concept.
Data origination and data validity raise the specter of what is called data dilution. In the old days, there was a drink made with a packet of artificial color, sugar, and water. The more water added, the more the drink tasted like red (or green or orange) water and not the flavor intended. That is the same with data dilution. If the data is transformed three or four times, it becomes a shadow of what it once was, especially if the last analyst did not create a data dictionary and left it up to the next analyst to figure out what the data meant in the first place!
1.2 The Three Professions
This book presents a variety of methods and techniques, as well as case studies, to enrich the knowledge of data analytics for project managers, systems engineers, and cybersecurity professionals. There are many times when a project manager needs to analyze data obtained from a product owner or show a stakeholder the consequences of actions implemented during the project. The development of any product process can be richly accompanied by data that the systems engineers need to analyze. The cybersecurity professional is inundated with data regarding the confidentiality, integrity, and availability of the software or hardware application. This can range from the port use to the access frequency. In all these cases, data analytics skills can, and should, be applied to each of the professions. This text separates the case studies so that each profession can practice some straightforward data analytics. The main purpose of this text is not to bore the professional with elementary statistics but to refresh the knowledge necessary to build models for data analytics. Along with that, this book encompasses the analytics thinking that is essential to all the professions. Without this basic critical thinking skill, the project manager, systems engineer, or cybersecurity professional will spin their wheels in useless analytics that waste time and money. The one section that covers Occam’s Razor helps the analyst go straight to the simple rather than the complex. Not many people go to the Grand Canyon to see the composition of the rock formations. They go for the view. Think of the stakeholder like that tourist. Give them an overview of the data. If they want more, then massage the requirements to do so.
CHAPTER 2
WHAT IS DATA?
A data analytics book should begin with exactly what “data” refers to and what to do with it. The working definition for this book is that data refers to information gathered for analysis and subsequent discussion. The reader is not reading this book to determine the factual verification of the data, but to use the data with conventional techniques or methods. Performing the process of the verification and validation of datasets is prudent and diligent, but the preliminary stages of analysis take place after the verification and validation process. For instance, if the reader obtains data from a federal website (like www.cdc.gov for the Centers for Disease Control), then the validation and verification do not need to be completed, since that data is already prepared and distributed via the Cross Industry Standard Process for Data Mining (CRISP-DM) method. Current data analytics training for government analysts includes the CRISP-DM methods.
2.1 Data Types
There are two types of data: quantitative and qualitative. These two types are broken down further into discrete or continuous (for quantitative) and nominal or ordinal (for qualitative). Categorical data may refer to a range of numbers that could be considered quantitative, but for the most part, categorical data is qualitative in nature for the purposes of this book (and from experience). For instance, in a dataset that includes all tornados that occurred in 1950 (from the website www.ncdc.noaa.gov), the damage in dollars was categorized with K or M to denote thousands or millions, respectively. By assigning that designator, the data originators changed a quantitative measure to a qualitative measure, even though the field focuses on dollars. In this instance, by changing the type of data, the originators limited that data description because the mean and standard deviation cannot be calculated on that field without changing it to a quantitative measure. That is why it is important to consider the analysis value at the outset of a data analysis task.
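The conversion back to a quantitative measure can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the damage field is assumed to be stored as text such as "250K" or "2.5M", and the sample values and the function name damage_to_dollars are hypothetical rather than taken from the actual dataset.

```python
from statistics import mean, stdev

# Multipliers for the designators described in the text
# (K = thousands, M = millions).
MULTIPLIERS = {"K": 1_000, "M": 1_000_000}


def damage_to_dollars(value: str) -> float:
    """Convert a damage string such as '250K' or '2.5M' into dollars."""
    text = value.strip().upper()
    suffix = text[-1]
    if suffix in MULTIPLIERS:
        return float(text[:-1]) * MULTIPLIERS[suffix]
    # Assumption: a value without a designator is already in dollars.
    return float(text)


# Hypothetical sample values, for illustration only.
raw_damage = ["250K", "2.5M", "25K", "0.5M"]
dollars = [damage_to_dollars(v) for v in raw_damage]

# Once the field is numeric, descriptive statistics become possible.
print(dollars)
print(mean(dollars))
print(stdev(dollars))
```

Once the field holds numbers rather than labeled text, the mean and standard deviation can be calculated in any tool, which is the point of deciding on the analysis value at the outset.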
2.1.1 Quantitative values
The quantitative values in our lives present us with meaning. Getting an 80 on a test may not mean anything, but 80 out of 100 gives us a better understanding. In most situations, more information may lead to a better understanding. Understanding also applies to perspective. If the analyst can understand the perspective of the data, then that analyst can find meaning as the data applies to other parts of the narrative. For example, during the COVID-19 pandemic, the number of cases (or deaths) was often discussed, but this number was not related to entire populations. At one point, there were over 100,000 cases of COVID-19 reported in the US. Taken by itself, this number is frightening, but taken as a percentage of the total US population, which is 320,000,000, the number is put into context. (That percentage, by the way, is 100000/320000000 or .0003125 or .031%, which is a very small percentage when we take into consideration the entire population of the US.) If that is the situation, then that would mean quantitative measures are better than qualitative values. Of course, there is a way to transform qualitative to quantitative, which is discussed later in this book.
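The percentage above can be reproduced with a short calculation. The following is a minimal sketch; the function name share_of_population is a hypothetical helper, and the population and case figures are the ones quoted in the text.

```python
def share_of_population(count: int, population: int) -> float:
    """Return count expressed as a percentage of population."""
    return count / population * 100


US_POPULATION = 320_000_000  # population figure used in the text
cases = 100_000              # case count used in the text

pct = share_of_population(cases, US_POPULATION)

# Print the proportion and the percentage, rounded for readability.
print(f"proportion: {cases / US_POPULATION:.7f}")
print(f"percentage: {pct:.3f}%")
```

Keeping the proportion and the percentage as separate, labeled values helps avoid the common slip of quoting one as the other.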
There are times when quantitative refers to different types of numbers. For instance, discrete quantitative measures denote those numbers that are integers. If the data describes how many TVs there are in a home, then that would be a discrete measure. There is no way to have 1.5 TVs in a home. By having discrete measures, there are certain statistical methods that may apply while others do not apply. Continuous quantitative data covers quantities that can take any value within a range, including fractions, such as a height or a temperature. It is essential that the analyst determine the type of quantitative measure prior to applying certain methods to this data. For example, in the CRISP-DM process, one of the first steps is to understand the data. As obvious as this sounds, some analysts do not understand the data that they are analyzing, but conduct tests and build models on the data regardless of the analytical requirements. Although this is relatively common, the results from such models and tests may not be the results that align with verified and valid requirements.
Please remember that quantitative data is important to determine measurement. For instance, during the COVID-19 outbreak, it was common to report the number of individuals who tested positive for the virus. This was an important measurement, but alone it did not give the relative impact on either the United States or the world. Another example of the quantitative nature of the statistics related to the virus is that, at one point, there were 3,000 individuals who tested positive for the virus in the US. This number means 3,000 people out of the entire US population of 320,000,000 tested positive for COVID-19. From a percentage viewpoint, the number of people testing positive is about .0009%, which is a proportion of about .000009! This number is so small that it would be considered minuscule in other situations. At that point in time, roughly .0009% of the US population had tested positive for this virus. Notice the phrase “tested positive” rather than “had the virus.” Analysts need to be clear about this phrasing, since there have been indications that there are false negatives in the virus testing, as well as false positives¹. What a false negative means is that the test may not be able to detect the existence of COVID-19 in the individuals who have it. Understanding the data and determining the type of quantitative measure used is the analyst’s responsibility. However, this decision may have implications far beyond the narrow application of methods on a single source of data.
2.1.2 Qualitative values
Many readers have been asked to fill in surveys on services received from restaurants, hotels, and other businesses. These surveys usually have a range of numbers (e.g., from 1 to 10) that associate numbers with feelings. These measures are qualitative values transformed into quantitative values to measure customer service. After all, the ability to quantify “like” or “dislike” is very challenging. The first part of the challenge is that one person’s “like” might be another person’s “dislike.” Unfortunately, because we are accustomed to these somewhat arbitrary ways of rating personal preferences, people accept the range of numbers as something that can be transformed from feelings. Qualitative data can be placed into two types: nominal and ordinal. For instance, nominal values can be “male” or “female” or “yes” and “no.” Ordinal values describe a sequence, such as “no high school” to “graduate school.” In these instances, as in all instances of qualitative measures, these measures can be changed into quantitative measures. For example, substituting a “0” for male and “1” for female is acceptable as long as the methods used to analyze do not include descriptive statistics (like mean and median) because it would look like females have more value than males (because they are 1 more than males on the numeric scale).
When two choices are used for qualitative data, it is considered a “binary” choice. If the choices include discrete numbers, then we use the term binomial. This type of data considers a problem like a multiple-choice test that has a right or wrong answer. This book does not discuss these types of data analytics.
Qualitative data includes the binary data mentioned above as well as categorical data. Categories are data that usually have a range of values and are used to denote certain types of behavior or characteristics. The two types of categorical data are nominal and ordinal. Nominal data (nominal refers to the “name”) is a label of data that does not consider the ordering of the data, such as labeling male and female characteristics². In this way, labeling can be accomplished without any worry about which category goes first. Ordinal data is not just labeled, but placed in a sequence, such as “high school,” “some college,” or “college graduate.” In this case, there is a sequence and a range for the data. The real issue here is arbitrarily trying to make nominal data numerical data because that could lead to certain types of conclusions. For instance, say that an analyst assigns “1” to male and “0” to female. A less experienced (or inexperienced) analyst tries to do a descriptive analysis with this data and finds (surprise!) that the males have a larger mean than the females. Well, that was a mistake waiting to happen. If anything, the proportion would be a much better way of analyzing this data, in the same way that having nominal data for Democrats and Republicans would be something of an issue in this instance. What if the analyst assigns a “1” to a high school graduate, and a “5” to a post-grad student? This could lead to a very skewed analysis and show that high school graduates are not as important as post-grad students.
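A small Python sketch shows the proportion approach for nominal data, along with one cautious way of using the sequence in ordinal data without treating the codes as measurements. The sample responses, the function name proportions, and the category order are hypothetical and serve only as an illustration.

```python
from collections import Counter

# Hypothetical nominal responses, for illustration only.
responses = ["male", "female", "female", "male", "female"]


def proportions(labels):
    """Return each label's share of the total; no numeric coding is needed."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}


shares = proportions(responses)
print(shares)

# Ordinal data has a sequence, so categories can be sorted by rank, but the
# gaps between ranks are not real distances and should not be averaged.
EDUCATION_ORDER = ["high school", "some college", "college graduate"]
education = ["some college", "high school", "college graduate", "some college"]

ranked = sorted(education, key=EDUCATION_ORDER.index)
# Middle category of the sorted list (the upper middle when the count is even).
median_category = ranked[len(ranked) // 2]
print(median_category)
```

Reporting a share per label, or the middle category of a sorted list, avoids implying that one group is "worth" more than another just because of the number assigned to it.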
2.1.3 Application of Each Type of Data
Now that each type of data has been reviewed, when would an analyst use them and for what purpose? Quantitative data is used to determine dataset configurations (such as skew or kurtosis) as well as trends and other models, while qualitative is used to satisfy a specific need, such as a survey or other requirement. Quantitative data also satisfies a specific need, but