Is the Brexit vote successful big data politics or the end of democracy? Why do airlines overbook, and why do banks get it wrong so often? How does big data enable Netflix to forecast a hit, CERN to find the Higgs boson and medics to discover if red wine really is good for you? And how are companies using big data to benefit from smart meters, use advertising that spies on you and develop the gig economy, where workers are managed by the whim of an algorithm? The volumes of data we now access can give unparalleled abilities to make predictions, respond to customer demand and solve problems. But Big Brother's shadow hovers over it. Though big data can set us free and enhance our lives, it has the potential to create an underclass and a totalitarian state. With big data ever-present, you can't afford to ignore it. Acclaimed science writer Brian Clegg - a habitual early adopter of new technology (and the owner of the second-ever copy of Windows in the UK) - brings big data to life.
How the Information Revolution is Transforming Our Lives
BRIAN CLEGG
For Gillian, Chelsea and Rebecca
I’ve had a long relationship with data and information. When I was at school we didn’t have any computers, but patient teachers helped us to punch cards by hand, which were sent off by post to London and we’d get a print-out about a week later. This taught me the importance of accuracy in coding – so thanks to Oliver Ridge, Neil Sheldon and the Manchester Grammar School. I also owe a lot to my colleagues at British Airways, who took some nascent skills and turned me into a data professional; particular mention is needed for Sue Aggleton, John Carney and Keith Rapley. And, as always, thanks to the brilliant team at Icon Books who were involved in producing this series, notably Duncan Heath, Simon Flynn, Robert Sharman and Andrew Furlow.
1
It’s hard to avoid ‘big data’. The words are thrown at us in news reports and from documentaries all the time. But we’ve lived in an information age for decades. What has changed?
Take a look at a success story of the big data age: Netflix. Once a DVD rental service, the company has transformed itself as a result of big data – and the change is far more than simply moving from DVDs to the internet. Providing an on-demand video service inevitably involves handling large amounts of data. But so did renting DVDs. All a DVD does is store gigabytes of data on an optical disc. In either case we’re dealing with data processing on a large scale. But big data means far more than this. It’s about making use of the whole spectrum of data that is available to transform a service or organisation.
Netflix demonstrates how an on-demand video company can put big data at its heart. Services like Netflix involve more two-way communication than a conventional broadcast. The company knows who is watching what, when and where. Its systems can cross-index measures of a viewer’s interests, along with their feedback. We as viewers see the outcome of this analysis in the recommendations Netflix makes, and sometimes they seem odd, because the system is attempting to predict the likes and dislikes of a single individual. But from the Netflix viewpoint, there is a much greater and more effective benefit in matching preferences across large populations: it can transform the process by which new series are commissioned.
Take, for instance, the first Netflix commission to break through as a major series: House of Cards. Had this been a project for a conventional network, the broadcaster would have produced a pilot, tried it out on various audiences, perhaps risked funding a short season (which could be cancelled part way through) and only then committed to the series wholeheartedly. Netflix short-circuited this process thanks to big data.
The producers behind the series, Mordecai Wiczyk and Asif Satchu, had toured the US networks in 2011, trying to get funding to produce a pilot. However, there hadn't been a successful political drama since The West Wing finished in 2006, and the people controlling the money felt that House of Cards was too high a risk. But Netflix knew from their mass of customer data that they had a large customer base who appreciated the humour and darkness of the original BBC drama the show was based on, which was already in the Netflix library. Equally, Netflix had a lot of customers who liked the work of director David Fincher and actor Kevin Spacey, who became central to the making of the series.
With strong evidence that they had a ready audience, rather than commission a pilot, Netflix put $100 million up front for the first two series, totalling 26 episodes. This meant that the makers of House of Cards could confidently paint on a much larger canvas and give the series far more depth than it might otherwise have had. And the outcome was a huge success. Not every Netflix drama can be as successful as House of Cards. But many have paid off, and even when take-up is slower, as with the 2016 Netflix drama The Crown, which was given a similar high-cost two-season start, shows have far longer to succeed than when conventionally broadcast. The model has already delivered several major triumphs, with decisions driven by big data rather than the gut feel of industry executives, who are infamous for getting it wrong far more frequently than they get it right.
The ability to understand the potential audience for a new series was not the only way that big data helped make House of Cards a success. Clever use of data meant, for instance, that different trailers for the series could be made available to different segments of the Netflix audience. And crucially, rather than release the series episode by episode, a week at a time as a conventional network would, Netflix made the whole season available at once. With no advertising to require an audience to be spread across time, Netflix could put viewing control in the hands of the audience. This has since become the most common release strategy for streaming series, and it’s a model that is only possible because of the big data approach.
Big data is not all about business, though. Among other things, it has the potential to transform policing by predicting likely crime locations; to animate a still photograph; to provide the first ever vehicle for genuine democracy; to predict the next New York Times bestseller; to give us an understanding of the fundamental structure of nature; and to revolutionise medicine.
Less attractively, it means that corporations and governments have the potential to know far more about you, whether to sell to you or to attempt to control you. Don’t doubt it – big data is here to stay, making it essential to understand both the benefits and the risks.
Just as happened with Netflix's analysis of the potential House of Cards audience, the power of big data derives from collecting vast quantities of information and analysing it in ways that humans could never achieve without computers, in an attempt to perform the apparently impossible.
Data has been with us a long time. We are going to reach back 6,000 years to the beginnings of agricultural societies to see the concept of data being introduced. Over time, through accounting and the written word, data became the backbone of civilisation. We will see how data evolved in the seventeenth and eighteenth centuries to become a tool that attempts to open a window on the future. But the attempt was always restricted by the narrow scope of the data available and by the limitations of our ability to analyse it. Now, for the first time, big data is opening up a new world. Sometimes it's flashy, as with devices like Amazon's Echo that we interact with using speech alone. Sometimes it's under the surface, as happened with supermarket loyalty cards. What's clear is that the applications of big data are multiplying rapidly and possess huge potential to impact us for better or worse.
How can there be so much latent power in something so basic as data? To answer that we need to get a better feel for what big data really is and how it can be used. Let’s start with that ‘d’ word.
2
According to the dictionary, ‘data’ derives from the plural of the Latin ‘datum’, meaning ‘the thing that’s given’. Most scientists pretend that we speak Latin, and tell us that ‘data’ should be a plural, saying ‘the data are convincing’ rather than ‘the data is convincing.’ However, the usually conservative Oxford English Dictionary admits that using data as a singular mass noun – referring to a collection – is now ‘generally considered standard’. It certainly sounds less stilted, so we will treat data as singular.
‘The thing that’s given’ itself seems rather cryptic. Most commonly it refers to numbers and measurements, though it could be anything that can be recorded and made use of later. The words in this book, for instance, are data.
You can see data as the base of a pyramid of understanding, with information and knowledge built on top of it.
From data we construct information. This puts collections of related data together to tell us something meaningful about the world. If the words in this book are data, the way I’ve arranged the words into sentences, paragraphs and chapters makes them information. And from information we construct knowledge. Our knowledge is an interpretation of information to make use of it – by reading the book, and processing the information to shape ideas, opinions and future actions, you develop knowledge.
In another example, data might be a collection of numbers. Organising them into a table showing, say, the quantity of fish in a certain sea area, hour by hour, would give you information. And someone using this information to decide when would be the best time to go fishing would possess knowledge.
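To make the pyramid concrete, here is a minimal sketch in Python using the fishing example above; the hourly counts are invented purely for illustration.

```python
# A minimal sketch of the data -> information -> knowledge pyramid, using
# the fishing example above. The hourly counts are invented for illustration.

# Data: raw numbers with no context
raw_counts = [4, 7, 15, 22, 18, 9]

# Information: the same numbers organised into something meaningful -
# fish observed in a sea area, hour by hour
hours = ['06:00', '07:00', '08:00', '09:00', '10:00', '11:00']
fish_by_hour = dict(zip(hours, raw_counts))

# Knowledge: interpreting the information to guide action - here,
# choosing the most promising hour to go fishing
best_hour = max(fish_by_hour, key=fish_by_hour.get)
print(f"Best hour to fish: {best_hour} "
      f"({fish_by_hour[best_hour]} fish observed)")
```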
Since human civilisation began we have enhanced our technology to handle data and climb this pyramid. This began with clay tablets, used in Mesopotamia at least 4,000 years ago. The tablets allowed data to be practically and useably retained, rather than held in the head or scratched on a cave wall. These were portable data stores. At around the same time, the first data processor was developed in the simple but surprisingly powerful abacus. First using marks or stones in columns, then beads on wires, these devices enabled simple numeric data to be handled. But despite an increasing ability to manipulate data over the centuries, the implications of big data only became apparent at the end of the nineteenth century as a result of the problem of keeping up with a census.
In the early days of the US census, the increasing quantity of data being stored and processed looked likely to overwhelm the resources available to deal with it. The whole process seemed doomed. There was a ten-year period between censuses – but as population and complexity of data grew, it took longer and longer to tabulate the census data. Soon, a census would not be completely analysed before the next one came round. This problem was solved by mechanisation. Electro-mechanical devices enabled punched cards, each representing a slice of the data, to be automatically manipulated far faster than any human could achieve.
By the late 1940s, with the advent of electronic computers, the equipment reached the second stage of the pyramid. Data processing gave way to information technology. There had been information storage since the invention of writing. A book is an information store that spans space and time. But the new technology enabled that information to be manipulated as never before. The new non-human computers (the term originally referred to mathematicians undertaking calculations on paper) could not only handle data but could turn it into information.
For a long while it seemed as if the final stage of automating the pyramid – turning information into valuable knowledge – would require ‘knowledge-based systems’. These computer programs attempted to capture the rules humans used to apply knowledge and interpret data. But good knowledge-based systems proved elusive for three reasons. Firstly, human experts were in no hurry to make themselves redundant and were rarely fully cooperative. Secondly, human experts often didn’t know how they converted information into knowledge and couldn’t have expressed the rules for the IT people even had they wanted to. And finally, the aspects of reality being modelled this way proved far too complex to achieve a useful outcome.
The real world is often chaotic in a mathematical sense. This doesn’t mean that what happens is random – quite the opposite. Rather, it means that there are so many interactions between the parts of the world being studied that a very small change in the present situation can make a huge change to a future outcome. Predicting the future to any significant extent becomes effectively impossible.
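A standard way to see this sensitivity in action, though not an example from this book, is the logistic map, one of the simplest chaotic systems. The sketch below starts two runs from values that differ by one part in a billion and watches them drift apart.

```python
# A standard illustration of mathematical chaos (not an example from this
# book): the logistic map x -> r*x*(1-x) with r = 4. Two starting values
# differing by one part in a billion soon lead to completely different
# futures, which is why long-range prediction of chaotic systems fails.

def logistic_step(x, r=4.0):
    return r * x * (1 - x)

x_a = 0.400000000      # one starting point
x_b = 0.400000001      # an almost identical starting point

for step in range(1, 41):
    x_a = logistic_step(x_a)
    x_b = logistic_step(x_b)
    if step % 10 == 0:
        print(f"step {step:2d}: {x_a:.6f} vs {x_b:.6f} "
              f"(difference {abs(x_a - x_b):.6f})")
```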
Now, though, as we undergo another computer revolution through the availability of the internet and mobile computing, big data is providing an alternative, more pragmatic approach to taking on the top level of the data–information–knowledge pyramid. A big data system takes large volumes of data – data that is usually fast flowing and unstructured – and makes use of the latest information technologies to handle and analyse this data in a less rigid, more responsive fashion. Until recently this was impossible. Handling data on this scale wasn’t practical, so those who studied a field would rely on samples.
A familiar use of sampling is in opinion polls, where pollsters try to deduce the attitudes of a population from a small subset. That small group is carefully selected (in a good poll) to be representative of the whole population, but there is always assumption and guesswork involved. As recent elections have shown, polls can never provide more than a good guess of the outcome. The 2010 UK general election? The polls got it wrong. The 2015 UK general election? The polls got it wrong. The 2016 Brexit referendum and US presidential election – you guessed it. We'll look at why polls seem to be failing so often a little later (see page 23), but big data gets around the polling problem by taking in everyone – and the technology we now have available means that we can access the data continuously, rather than through the clumsy, slow mechanisms of an old-school big data exercise like a census or general election.
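To get a feel for why a well-run poll is still only a good guess, here is a small simulation; the 52/48 split and the 1,000-person sample size are assumptions chosen purely for illustration.

```python
import random

# A sketch of why a poll is only ever a good guess. Assume (purely for
# illustration) that the 'true' population splits 52% to 48%, and that
# each poll samples 1,000 people at random.
random.seed(1)
TRUE_SUPPORT = 0.52
SAMPLE_SIZE = 1000

for poll in range(1, 6):
    in_favour = sum(random.random() < TRUE_SUPPORT for _ in range(SAMPLE_SIZE))
    print(f"Poll {poll}: {100 * in_favour / SAMPLE_SIZE:.1f}% support")
```

Each run lands somewhere near the true figure, but a couple of percentage points either way is routine, which is more than enough to call a close vote the wrong way.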
For lovers of data, each of past, present and future has a particular nuance. Traditionally, data from the past has been the only certainty. The earliest data seems to have been primarily records of past events to support agriculture and trade. It was the bean counters who first understood the value of data. What they worked with then wasn’t always very approachable, though, because the very concept of number was in a state of flux.
Look back, for instance, to the mighty city state of Uruk, founded around 6,000 years ago in what is now Iraq. The people of Uruk were soon capturing data about their trades, but they hadn’t realised that numbers could be universal. We take this for granted, but it isn’t necessarily obvious. So, if you were an Uruk trader and you wanted to count cheese, fresh fish and grain, you would use a totally different number system to someone counting animals, humans or dried fish. Even so, data comes hand in hand with trade, as it does with the establishment of states. The word ‘statistics’ has the same origin as ‘state’ – originally it was data about a state. Whether data was captured for trade or taxation or provision of amenities, it was important to know about the past.
In a sense, this dependence on past data was not so much a perfect solution as a pragmatic reflection of the possible. The ideal was to also know about the present. But this was only practical for local transactions until the mechanisms for big data became available towards the end of the twentieth century. Even now, many organisations pretend that the present doesn’t exist.
It is interesting to compare the approach of a business driven by big data, such as a supermarket, with a less data-capable organisation like a book publisher. Someone in the head office of a major supermarket can tell you what is selling across their entire array of shops, minute by minute throughout the day. He or she can instantly communicate demand to suppliers, and by the end of the day, the present's data has become part of the big data source for the next day. Publishing (as seen by an author) is very different.
Typically, an author receives a summary of sales for, say, the six months from January to June at the end of September and will be paid for this in October. It’s not that on-the-day sales systems don’t exist, but nothing is integrated. It doesn’t help that publishing operates a data-distorting approach of ‘sale or return’, whereby books are listed as being ‘sold’ when they are shipped to a bookstore, but can then be returned for a refund at any time in the future. This is an excellent demonstration of why we struggle to cope with data from the present – the technology might be there, but commercial agreements are rooted in the past, and changing to a big data approach is a significant challenge. And that’s just advancing from the past to the present – the future is a whole different ball game.
It wasn’t until the seventeenth century that there was a conscious realisation that data collected from the past could have an application to the future. I’m stressing that word ‘conscious’ because it’s something we have always done as humans. We use data from experience to help us prepare for future possibilities. But what was new was to consciously and explicitly use data this way.
It began in seventeenth-century London with a button maker called John Graunt. Out of scientific curiosity, Graunt got his hands on ‘bills of mortality’ – documents summarising the details of deaths in London between 1604 and 1661. Graunt was not just interested in studying these numbers, but combined what he could glean from them with as many other data sources as he could – scrappy details, for instance, of births. As a result, he could make an attempt both to see how the population of London was varying (there was no census data) and to see how different factors might influence life expectancy.
It was this combination of data from the past and speculation about the future that helped a worldwide industry begin in London coffee houses, based on the kind of calculations that Graunt had devised. In a way, it was like the gambling that had taken place for millennia. But the difference was that the data was consciously studied and used to devise plans. This new, informed type of gambling became the insurance business. But this was just the start of our insatiable urge to use data to quantify the future.
There was nothing new, of course, about wanting to foretell what would happen. Who doesn’t want to know what’s in store for them, who will win a war, or which horse will win the 2.30 at Chepstow? Augurs, astrologers and fortune tellers have done steady business for millennia. Traditionally, though, the ability to peer into the future relied on imaginary mystical powers. What Graunt and the other early statisticians did was offer the hope of a scientific view of the future. Data was to form a glowing chain, linking what had been to what was to come.
This was soon taken far beyond the quantification of life expectancies, useful though that might be for the insurance business. The science of forecasting, the prediction of the data of the future, was essential for everything from meteorology to estimating sales volumes. Forecasting literally means to throw or project something ahead. By collecting data from the past, and as much as possible about the present, the idea of the forecast was to ‘throw’ numbers into the future – to push aside the veil of time with the help of data.
The quality of such attempts has always been very variable. Moaning about the accuracy of weather forecasts has been a national hobby in the UK since they started in The Times in the 1860s, though they are now far better than they were 40 years ago, for reasons we will discover in a moment. We find it very difficult to accept how qualitatively different data from the past and data on the future are. After all, they are sets of numbers and calculations. It all seems very scientific. We have a natural tendency to give each equal weighting, sometimes with hilarious consequences.
Take, for example, a business mainstay, the sales forecast. This is a company’s attempt to generate data on future sales based on what has happened before. In every business, on a regular basis, those numbers are inaccurate. And when this happens, companies traditionally hold a post-mortem on ‘what went wrong’ with their business. This post-mortem process blithely ignores the reality that the forecast, almost by definition, was going to be wrong. What happened is that the forecast did not match the sales, but the post-mortem attempts to establish why the sales did not match the forecast. The reason behind this confusion is a common problem whenever we deal with statistics. We are over-dependent on patterns.
Patterns are the principal mechanism used to understand the world. Without making deductions from patterns to identify predators and friends, food or hazards, we wouldn’t last long. If every time a large object with four wheels came hurtling towards us down a road we had to work out if it was a threat, we wouldn’t survive crossing the road. We recognise a car or a lorry, even though we’ve never seen that specific example in that specific shape and colour before. And we act accordingly. For that matter, science is all about using patterns – without patterns we would need a new theory for every atom, every object, every animal, to explain their behaviour. It just wouldn’t work.
This dependence on patterns is fine, but we are so finely tuned to recognise things through pattern that we are constantly being fooled. When the 1976 Viking 1 probe took detailed photographs of the surface of Mars, it sent back an image that our pattern-recognising brains instantly told us was a face, a carving on a vast scale. More recent pictures have shown this was an illusion, caused by shadows when the Sun was at a particular angle. The rocky outcrop bears no resemblance to a face – but it’s almost impossible not to see one in the original image. There’s even a word for seeing an image of something that isn’t there: pareidolia. Similarly the whole business of forecasting is based on patterns – it is both its strength and its ultimate downfall.
The ‘face on Mars’ as photographed in 2001, with the 1976 image inset.
If there are no patterns at all in the historical data we have available, we can’t say anything useful about the future. A good example of data without any patterns – specifically designed to be that way – is the balls drawn in a lottery. Currently, the UK Lotto game features 59 balls. If the mechanism of the draw is undertaken properly, there is no pattern to the way these balls are drawn week on week. This means that it is impossible to forecast what will happen in the next draw. But logic isn’t enough to stop people trying.
Take a look on the lottery’s website and you will find a page giving the statistics on each ball. For example, a table shows how many times each number has been drawn. At the time of writing, the 59-ball draw has been run 116 times. The most frequently drawn balls were 14 (drawn nineteen times) and 41 (drawn seventeen times). Despite there being no connection between them, it’s almost impossible to stop a pattern-seeking brain from thinking ‘Hmm, that’s interesting. Why are the two most frequently drawn numbers reversed versions of each other?’
The least frequently drawn numbers were 6, 48 and 45, with only five draws each. This is just the nature of randomness. Random things don’t occur evenly, but have clusters and gaps. When this is portrayed in a simple, physical fashion it is obvious. Imagine tipping a can of ball bearings on to the floor. We would be very suspicious if they were all evenly spread out on a grid – we expect clusters and gaps. But move away from such an example and it’s hard not to feel that there must be a cause for such a big gap between nineteen draws of ball 14 and just five draws of ball 6.
Once such pattern sickness has set in, we find it hard to resist its powerful symptoms. The reason the lottery company provides these statistics is that many people believe that a ball that has not been drawn often recently is ‘overdue’. It isn’t. There is no connection to link one draw with another. The lottery does not have a memory. We can’t use the past here to predict the future. But still we attempt to do so. It is almost impossible to avoid the self-deception that patterns force on us.
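A few lines of code are enough to see how uneven genuine randomness looks. The sketch below simulates 116 draws of six balls from 59, echoing the Lotto figures quoted above; the simulated counts are illustrative and have nothing to do with the real draw history.

```python
import random
from collections import Counter

# A sketch of how genuinely random draws still produce clusters and gaps.
# Simulates 116 draws of six balls from 59 (illustrative only, not the
# real Lotto history).
random.seed(42)
counts = Counter()
for _ in range(116):
    counts.update(random.sample(range(1, 60), 6))

ordered = counts.most_common()
print("Most drawn: ", ordered[:3])    # clusters...
print("Least drawn:", ordered[-3:])   # ...and gaps, with no cause behind either
```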
Other forecasts are less cut and dried than attempting to predict the results of the lottery. In most systems, whether it’s the weather, the behaviour of the stock exchange or sales of wellington boots, the future isn’t entirely detached from the past. Here there is a connection that can be explored. We can, to some degree, use data to make meaningful forecasts. But still we need to be careful to understand the limitations of the forecasting process.
The easiest way to use data to predict the future is to assume things will stay the same as yesterday. This simplest of methods can work surprisingly well, and requires minimal computing power. I can forecast that the Sun will rise tomorrow morning (or, if you’re picky, that the Earth will rotate such that the Sun appears to rise) and the chances are high that I will be right. Eventually my prediction will be wrong, but it is unlikely to be so in the lifetime of anyone reading this book.
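In code, this ‘persistence’ forecast is about as simple as a predictive model can get; the temperature series below is invented purely for illustration.

```python
# A minimal sketch of the 'persistence' forecast described above: assume
# tomorrow will look exactly like today. The temperature series is
# invented purely for illustration.
observed = [14.2, 15.1, 15.0, 13.8, 14.5]   # recent daily temperatures (C)

def persistence_forecast(history):
    """Predict the next value by assuming nothing changes."""
    return history[-1]

print(f"Forecast for tomorrow: {persistence_forecast(observed)} C")
```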