Big Data's Big 5
Yes, we've been hit over the head enough times with the phrase "big data" to be aware of its presence, even though we've been up to our armpits in streams of huge unstructured datasets for years.
Those of you who are analysts or data scientists will have already picked up a set of tools that help you find hidden information buried deep in the data. Those tools may be languages (for example R), statistical tests (t-test, Analysis of Variance) and/or data mining techniques (clustering).
But there's a set of theorems, laws and simulations from the world of mathematics that can help you to solve more problems faster. As an added upside, you can increase your value - not that I am suggesting that a true artist, such as yourself, is concerned with anything as tacky as salary, of course.
The Reg has selected five such examples that we think are the most compelling for our purposes from the field of maths. Over the next few weeks we shall be looking at them from a high level to discover how they can potentially enhance and add value to what you do.
The five we will be looking at are:
- Benford’s Law: Numbers can be distributed in very unintuitive ways. Most fraudsters don’t understand that so their frauds can stick out like a sore thumb – as long as you know about Benford’s work.
- The German Tank Problem (and its solution): This lets you estimate data that people don’t want you to have.
- Nyquist–Shannon sampling theorem: Now this does sound obscure because it is about the minimum sampling rate of a continuous wave, but in practice it will tell you how frequently you need to collect that big data from sensors like smart meters.
- Simpson’s paradox: If you don’t know about it, one day it will bite you.
- Monte Carlo simulations: One of the best and yet least-used tools in a data scientist’s toolbox. They let you solve problems that probability calculations simply can’t touch.
For each one I’ll first give you a type of problem that can arise and then show you why the theorem helps to solve it. No difficult sums will be harmed in the making of this series.
So there you are, working with sales data and you have been given the job of detecting fraudulent transactions. A huge number of transactions are in the system and you have reason to believe that those originating from a particular country and credited to a particular sales person (J Smith) are fraudulent.
Your colleague: “OK, let’s check the mean and standard deviation of the transactions we suspect against those of the rest. Hmmm. No significant difference. Maybe we were wrong about poor old J Smith. She is kind to cats after all, she has about 12 rescued moggies that she looks after; perhaps we should look elsewhere for the evil perp.”
You: “Fair enough, but let’s do one more check. Take the values of all of the suspect transactions and select just the leading digit from each. Then count the number of ones, the number of twos and so on (up to nine) and plot these as a frequency distribution.”
Your colleague: “OK, if it makes you happy, but you owe me a pint if this doesn’t show anything.”
Later that same day.
Your colleague: “There is no pattern here, the distribution is essentially flat. So J Smith is off the hook and you owe me a pint.”
You: “Au contraire my fine colleague, we need to find new homes for those felines and you owe me a pint.”
J Smith is about to be banged to rights... because she’d never heard of Benford’s Law.
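For those who want to try the check themselves, here is a minimal Python sketch of the tallying step described above. The function names and the invoice totals are invented purely for illustration; nothing here comes from a real dataset.

```python
from collections import Counter

def leading_digit(value):
    """Return the first non-zero digit of a number, or None for zero."""
    if value == 0:
        return None
    # Scientific notation always puts the leading digit first, e.g. '4.712e+02'.
    return int(f"{abs(value):.15e}"[0])

def leading_digit_shares(values):
    """Share of the (non-zero) values that start with each digit 1-9."""
    digits = [d for d in map(leading_digit, values) if d is not None]
    if not digits:
        raise ValueError("need at least one non-zero value")
    counts = Counter(digits)
    return {d: counts[d] / len(digits) for d in range(1, 10)}

# Hypothetical invoice totals, made up to show the mechanics only.
sample = [471.20, 1830.00, 96.45, 12.99, 0.87, 1045.10, 230.00, 19.50]
shares = leading_digit_shares(sample)

for digit, share in shares.items():
    print(f"{digit}: {share:.1%}")
```

Eight values prove nothing, of course; the check only becomes meaningful with a decent number of transactions.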
Benford’s Law (AKA First-Digit Law)
Benford’s Law comes to us courtesy of Frank Benford, a physicist at GE Research Laboratories who, in the 1920s, began looking into digit frequencies when he noticed his logarithm table books were unevenly worn. His law essentially says that the leading digits of numbers collected “from the wild” – real life – are not evenly distributed. Rather, they follow a predictable distribution where there are more ones than twos, more twos than threes and so on up to nine.
The differences are non-trivial. On average about 30 per cent of the numbers will start with a one, only about eight per cent with a five and a mere 4.6 per cent with a nine.
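Those percentages come from the usual statement of the law, which the article does not spell out: the expected share of numbers with leading digit d is log10(1 + 1/d). A short sketch (the function name is my own) reproduces them:

```python
import math

def benford_expected():
    """Expected share of each leading digit 1-9 under Benford's Law."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}

expected = benford_expected()
for digit, share in expected.items():
    print(f"{digit}: {share:.1%}")
```

The nine shares sum to exactly one, because the logarithms telescope to log10(10).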
We would, of course, have to check the distribution of invoice totals from the same country credited to other sales people, but I would confidently expect those to follow a Benford distribution.
So, what is meant by “wild collected” numbers and why do we get such an odd distribution?
Wild collected numbers
If you plot random numbers, they DO come out as a flat distribution. Here I have plotted the leading digit of around 600 random numbers.
Now you might think that numbers collected by actual observation of the real world (like the lengths of rivers, or their areas, or molecular masses of compounds or death rates or the heights of cities above sea level) would show the same distribution of leading integers, but in general they don’t; they show a distribution that approximates to a Benford distribution.
At this point you might be wondering if this is to do with the units in which you choose to measure, but no, this phenomenon is unit-independent. You can plot the leading digit of the height of each city above sea level in inches, feet, metres or cubits; it doesn’t matter, it still comes out as a Benford distribution.
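You can convince yourself of the unit-independence with a quick experiment. The sketch below does not use real city heights; as a stand-in it generates synthetic "heights in metres" spread evenly across four orders of magnitude (an assumption that makes the data Benford-like to begin with), then converts them to feet and inches and compares the share of leading ones.

```python
import random
from collections import Counter

def leading_digit(value):
    """First non-zero digit of a positive number."""
    return int(f"{abs(value):.15e}"[0])

def digit_shares(values):
    """Share of values starting with each digit 1-9."""
    counts = Counter(leading_digit(v) for v in values)
    return {d: counts[d] / len(values) for d in range(1, 10)}

rng = random.Random(42)  # fixed seed so the run is repeatable

# Assumption: synthetic heights from 1 m to 10 km, evenly spread on a log scale.
heights_m = [10 ** rng.uniform(0, 4) for _ in range(50_000)]
heights_ft = [h * 3.28084 for h in heights_m]   # metres to feet
heights_in = [h * 39.3701 for h in heights_m]   # metres to inches

shares_m = digit_shares(heights_m)
shares_ft = digit_shares(heights_ft)
shares_in = digit_shares(heights_in)

for label, shares in [("metres", shares_m), ("feet", shares_ft), ("inches", shares_in)]:
    print(f"{label:>6}: share starting with 1 = {shares[1]:.3f}")
```

All three shares should sit close to 0.301, whichever unit you pick.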
Random numbers aren't natural... and that's important
As another example, if you take a copy of a magazine like Reader’s Digest and read it through, noting down every number that is mentioned in the text, a Benford distribution is highly likely to appear before your eyes. Below is some real data collected by Benford himself in 1938 from a copy of Reader’s Digest.
The bottom line is that random numbers don’t follow a Benford distribution but numbers that originate in the real world do. It isn’t an absolute rule, but it is a very good generalisation.
But why? Why on Earth would numbers be distributed like that?
This can be answered mathematically or by trying to give you an intuitive feel for why this happens. I prefer the latter, so think about a river. It starts at a spring and runs to the coast, so there is a linear distance between the two points. These linear distances will show a flat distribution of leading integers. So why do real rivers show a Benford distribution?
The reason is that as the water first makes its way to the sea, it hits real world obstacles. Perhaps a big rock around which it has to flow, then later a plain where it meanders. In other words, that original distance inevitably increases with each obstacle and by a different amount each time. The rock makes a very small difference, the plain a much larger one.
To try to model this we can start with a set of random numbers that represent those initial linear lengths of a set of rivers; the distribution of the leading digits is flat.
Then we extend the river several times, each extension being a random per cent figure of the length of the river in question. (So, if we apply five extensions, the first might increase the length of the river by four per cent, the next by 17 per cent and so on.) The point here is that values in the wild are the outcome of combinations of factors: invoices are frequently made up of multiple items, plants grow to different heights depending on soil, climate, shelter, disease and so on.
The figures below are the results after five extensions and already we can see a Benford distribution.
Of course, we are not trying to model how rivers really form; we are illustrating a more fundamental property of numbers in general. If you make a series of percentage increases (or decreases) to a set of numbers, they will approximate to a Benford distribution.
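The river experiment is easy to re-run. The sketch below is one possible reading of it rather than the exact set-up behind the figures: the starting range (1 to 1,000), the size of each extension (a random increase of between 0 and 100 per cent) and the sample size are all assumptions of mine. It prints the leading-digit shares at the start, after five extensions and after 50.

```python
import random
from collections import Counter

def leading_digit(value):
    """First non-zero digit of a positive number."""
    return int(f"{abs(value):.15e}"[0])

def digit_shares(values):
    """Share of values starting with each digit 1-9."""
    counts = Counter(leading_digit(v) for v in values)
    return {d: counts[d] / len(values) for d in range(1, 10)}

def extend(lengths, steps, max_increase, rng):
    """Apply `steps` random percentage increases to every length."""
    result = []
    for length in lengths:
        for _ in range(steps):
            # Each obstacle lengthens the river by a different random fraction.
            length *= 1 + rng.uniform(0, max_increase)
        result.append(length)
    return result

rng = random.Random(2024)  # fixed seed so the run is repeatable

# Assumed starting point: 20,000 straight-line lengths, uniform on 1 to 1,000.
start = [rng.uniform(1, 1000) for _ in range(20_000)]
after_5 = extend(start, steps=5, max_increase=1.0, rng=rng)
after_50 = extend(start, steps=50, max_increase=1.0, rng=rng)

shares_start = digit_shares(start)
shares_5 = digit_shares(after_5)
shares_50 = digit_shares(after_50)

for label, shares in [("start", shares_start), ("5 steps", shares_5), ("50 steps", shares_50)]:
    row = "  ".join(f"{shares[d]:.3f}" for d in range(1, 10))
    print(f"{label:>8}: {row}")
```

With these assumptions the starting shares should be roughly flat, five extensions should already tilt the balance towards ones and away from nines, and by 50 extensions the shares should sit very close to the theoretical Benford values. Smaller percentage changes need more steps to get there.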
The reason for this property of numbers is that numbers with different leading integers respond differently when changed in size. For example, if you take a number that starts with a one (say, 100) and increase it by 20 per cent, it becomes 120, which still begins with a one. But if you take a number beginning with nine (say, 900), then a 20 per cent increase makes it 1,080 – which also begins with a one.
To put that another way:
A number starting with one needs to increase by 100 per cent to start with a two, a number starting with five needs to increase by only 20 per cent to start with a six, and a number starting with nine needs a measly 11.1 per cent increase to roll over and start with a one again. So the proportion of numbers starting with one goes up as we make changes, while the proportion starting with nine goes down and the rest change proportionately.
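The arithmetic generalises neatly: to get from leading digit d to the next one, a number has to grow by a factor of (d + 1)/d, which is an increase of 100/d per cent. A tiny sketch (the function name is my own) tabulates it:

```python
def growth_needed(digit):
    """Percentage increase that takes a round number with this leading
    digit (e.g. 500) to the next leading digit (e.g. 600)."""
    return 100 / digit

for d in range(1, 10):
    print(f"leading {d}: needs +{growth_needed(d):.1f} per cent")
```

The higher the leading digit, the less growth it takes to leave it behind, which is why numbers spend so much of their time starting with a one.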
So, once you can see the pattern, you realise that a Benford distribution isn’t an oddity, it is an inevitability. Is there a good, solid, mathematical underpinning to this?
Of course. Good resources for further study can be found by typing “Benford’s Law” into your preferred search engine and scanning the pages presented for mathematical equations.
But you really don’t need to understand the underlying mathematics in order to use and apply Benford’s Law. You just have to be able to see why the distribution is inevitable.
But what use is this knowledge?
Well, it’s good for fraud detection: you can ask the Arizona State Treasurer if you don’t believe me. In an example cited by accountancy journals for years after, a state employee was found guilty of trying to defraud the State of Arizona of around $1.8m in 1993. The staffer reportedly kept most of the fraudulent transactions just below a $100,000 limit, with an unusually large number starting with sevens, eights and nines. This resulted in a very non-Benford distribution and the example is held up as a classic case study in the effectiveness of using Benford’s Law to detect accounting fraud.
But fraud detection is simply one of many potential uses. Now that you know that wild collected numbers usually show this distribution, you can look for sets of numbers that deviate. For example, a colleague of mine found this leading integer distribution in some direct debit/catalogue payments.
There is no suggestion that this is fraudulent but it did tell him that the matter was worthy of further investigation because “something” is actively responsible for this deviation. And that something might just be the nugget of information that good data scientists are expected to find.
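If you would rather not judge the deviation by eye, one common option (my suggestion, not something the examples above relied on) is a chi-squared goodness-of-fit statistic against the Benford shares. The sketch below assumes you already have a count of transactions per leading digit; both sets of counts are hypothetical, and the 5 per cent cut-off is simply a conventional choice.

```python
import math

# Expected Benford share for each leading digit.
BENFORD = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

# Chi-squared critical value for 8 degrees of freedom at the 5 per cent level.
CRITICAL_5PCT_DF8 = 15.507

def chi_squared_vs_benford(counts):
    """Chi-squared statistic comparing observed digit counts with Benford."""
    total = sum(counts.get(d, 0) for d in range(1, 10))
    stat = 0.0
    for d, share in BENFORD.items():
        expected = total * share
        stat += (counts.get(d, 0) - expected) ** 2 / expected
    return stat

# Hypothetical counts: one perfectly flat set, one built to look Benford-like.
flat_counts = {d: 100 for d in range(1, 10)}
benford_like_counts = {d: round(900 * BENFORD[d]) for d in range(1, 10)}

for label, counts in [("flat", flat_counts), ("Benford-like", benford_like_counts)]:
    stat = chi_squared_vs_benford(counts)
    verdict = "investigate further" if stat > CRITICAL_5PCT_DF8 else "nothing to see here"
    print(f"{label}: chi-squared = {stat:.1f} -> {verdict}")
```

Treat a high score as a prompt to dig, not as proof of anything. With very large datasets even tiny, harmless deviations will push the statistic over the threshold.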
And that is the real take-home message. If you see a Benford distribution, then it really is a case of “move along folks, nothing to see here”. If you see anything else, investigate further. ®