Data Rocks was born out of a desire to help people do and see more with their data. One of the first values I wrote down when defining a brand identity was “embrace complexity”. It sounds counterintuitive: how can I embrace complexity while trying to help others to see and do more with data? Allow me to explain.

Life is complex. We, as humans, are complex creatures. The systems and societies we created are all composed of a delicate balance of wonderful little things. We can’t strip ourselves entirely from all of this. While there sure is an optimal level of complexity, and too much of it can be a bad thing, the opposite is also true: stripping everything we do down to the most elemental factors sounds like an attractive idea, but oversimplification is a dangerous thing.

Ignoring complexity, especially when trying to make sense of intricate processes full of moving parts, can lead to errors, biases, and incorrect assumptions, which misguide us into making decisions based on insufficient information. Complexity doesn’t have to be *complicated*. It is just a fact we need to embrace. We can strive to make the objects of our analysis more accessible and less *complicated*. However, we must be mindful to simplify only to the extent that we can still identify the wonderful little things such an object is made of, at the level of detail required to do so. We have to learn to embrace, rather than fear, the proper levels of complexity to better make sense of our world.

There are many ways we can do it when trying to interpret our world, work and life through the lens of data. We can do it using storytelling, such as when we create a tale explaining how a complex natural phenomenon happens; we can use illustration, such as when we break down a sequence of events in a diagram; but the most common way our modern societies choose to compile, analyse, and derive conclusions from complex events, processes and objects is by trying to condense concepts, assumptions and complexity into neat numbers, through the use of Statistics.

## The dreadful feeling of having to deal with data

One of the most common archetypes I’ve come across during my data journey is that of the data avoider or data denier. There seems to be a deeply rooted mix of fear and hatred for any numbers, no matter how helpful. They dread having to face a number and make sense of it so deeply that they instantly enter fight-or-flight mode whenever a metric, KPI, or simple average is brought up.

This archetype tends to have significant reactions when faced with the dreadful task of decoding a number and making sense of it into reality. “The data is wrong”, “Statistics lie”, “You can’t trust numbers, they’re always deceiving you”, “where did this number come from? Why should I go with it?”, “My experience is enough to make my own decisions; you won’t convince me with fancy numbers”. They’re right. I won’t.

No number is bad or lying on its own. People lie. People lie with tools. __People sometimes lie with charts__, and they sometimes lie with numbers. How can we tell the difference? Well, it takes effort and a leap of faith to accept that, sometimes, things aren’t as straightforward as we’d like them to be - and that’s ok.

The first step to demystifying something that scares us is understanding how it works. But understanding Statistics often comes associated with the (incorrect) stigma of “not being a numbers person”. Let me tell you: I am anything but a numbers person. I’m first a humanist, second a systems thinker, third a designer and only then a data person. Through many mistakes, I learned how to embrace and not fear the complexities of what I do every day, but that did not come naturally. I didn’t have a good start with mathematics when I was a kid. In fact, despite being a stellar student in almost any other subject, Math was the one thing that almost held me back at school. So, how did the change happen?

To me, it came in the shape of a good Statistics teacher. Not one I met in school or University, but a manager I once had. I got this job as a Junior Demand Planning Analyst at a large Fast-Moving Consumer Goods (FMCG) company. I had an affinity with technology and was able to use it to help me *skip* most of the statistical and mathematical skills that could be required to perform my tasks. Hooray, Excel and its magic. But when it came to Planning, new challenges came up: I had to create *regression models.* I had to use Math and Statistics to figure out a way to *forecast* demand and stock projections into the future. Not only that, I had to be able to then show other teams how my team performed with those forecasts: how far off were we? What was the __Weighted Mean Absolute Percentage Error__ between what actually happened and what we had forecast? How could we improve the models to make them more accurate in the next cycle? Panic mode was on.

It was all a bit over my head, but not being someone to give up that easily, I pushed through anyway. My luck was to work in a team in partnership with a manager who knew this stuff *really well.* Not only that, he was a Mathematician with a master's in Statistics, an absolute Supply Chain nerd and *loved* talking about it. I liked learning, and he enjoyed teaching, so we partnered up. I learned so much about *practical examples* of how I could apply Statistics theory to what I was trying to do. I’ve always dreaded the thought of having to learn another theorem or formula, but after a while, *I was enjoying it.*

So then it dawned on me: my dread didn’t come from the subject itself but from how I was previously taught about it. Having someone who was able to translate all of that theory into practical knowledge that I could actually *use* to achieve things was the missing key. I’ll forever be grateful for this moment in my career - it defined much of what I did next. I knew that for every data question I had, there was probably at least one theory somewhere I could draw from to help me solve it. I learned to *embrace complexity* instead of fearing it.

## Stripping Statistics naked

Not everyone will meet a good manager-turned-friend who happens to be a Stats enthusiast that can help them nerd about it to the point where they stop hating it. For all of you out there, I have just the right book: “__Naked Statistics: stripping the dread from the data__” by __Charles Wheelan__. It is an introductory book on Statistics intended for non-experts. The book covers a wide range of statistical concepts and techniques, using real-world examples and anecdotes to illustrate how statistics can be used in everyday life.

__Wheelan__ includes thought exercises, illustrations, and several examples from sports, medicine, entertainment and everyday life, making sense of them through Statistics. It is not a textbook, and it doesn’t feel like one: it is a fluid read for curious people to learn more about how they can make sense of all the numbers they see every day. Who was the best baseball player of all time? How does Netflix know what movies I like? Should I switch doors in the Monty Hall problem? How did the Financial Crisis happen? What do all those Data Analysts mean when they yell “garbage-in, garbage-out”? The author goes through each one of these examples, plus more, distilling them in detail and using them as the backbone for a very engaging read where he introduces the reader to multiple statistical situations and how they can be solved.

The author explains in his introduction that he is not setting out to turn the reader into a stats expert, but to help them better understand what all these numbers surrounding and overwhelming us mean, and how to make sense of it all through simple concepts that he too did not learn until much later in life. It is a compassionate read for everyone struggling to make sense of all the metrics they see every day in their work, lives, the news or social media.

__The book__ is split into three main sections, in which __Wheelan__ deep dives into fundamental concepts with the help of humour, wit and plenty of real-life applications.

In the first section, titled Descriptive Statistics, __Wheelan__ introduces basic concepts such as mean, median, and mode and measures of variability such as standard deviation and variance. He also discusses graphical representations of data, such as histograms and scatter plots, and how they can be used to identify patterns and relationships in data.

In the second section, he then talks about Inferential Statistics, covering topics such as probability theory, hypothesis testing, and confidence intervals. He explains how these tools can be used to draw conclusions about a population based on a sample.

He dedicates his third section to explaining Regression Analysis, delving into the world of linear regression and how it can be used to model relationships between variables. He also covers more advanced topics, such as multiple regression and logistic regression, and discusses how these techniques can be used to make predictions and forecast future outcomes.

It may sound overwhelming, but it isn’t - everything is distilled to help us overcome the dread of seeing such terms.

## Simpson’s Paradox and reference group fallacies

Have you ever looked at a dashboard and seen a metric that seemed off, even though the numbers looked correct? This can happen when we make certain assumptions about comparing data without considering the whole picture. Two statistical concepts can help us understand this issue: *Simpson’s Paradox* and the __reference class problem__, also known as the *reference group fallacy*.

Let's start with Simpson's Paradox. Imagine you're tracking the percentage of sales leads that your team has converted into actual sales. Each salesperson has a target to convert 60% of their leads into sales. You have two salespeople in your team: Bob's conversion rate is 45%, and Alice's is 75%. It looks like Alice is a better salesperson than Bob. You’re also wondering if Bob alone is responsible for your team not meeting the overall 60% target:

##### Sales Team Table:

Salesperson | Leads | Sales | Conversion |
---|---|---|---|
Alice | 1000 | 750 | 75% |
Bob | 2000 | 900 | 45% |
Total | 3000 | 1650 | 55% |

But then you start doing what every analytics tool enables you to do nowadays: drill down so you can find the root cause. Why is Bob not meeting his targets? You decide to check the figures by Region of sales. You filter West first:

##### West region Table:

Salesperson | Leads | Sales | Conversion |
---|---|---|---|
Alice | 920 | 735 | 80% |
Bob | 450 | 387 | 86% |
West Region Total | 1370 | 1122 | 82% |

Wait. Not only is Bob converting more than Alice in the West, both seem to be doing way above the 60% target. Hmm... weird. Let’s check the other region then. You clear filters and choose North:

##### North Region Table:

Salesperson | Leads | Sales | Conversion |
---|---|---|---|
Alice | 80 | 15 | 19% |
Bob | 1550 | 513 | 33% |
North Region Total | 1630 | 528 | 32% |

This is not looking good. The North region has terrible conversion, but amazingly, even if you average the numbers from both regions you won’t get your overall total (hint: don’t average percentages like this, it is incorrect!). And there’s more: *Bob is better than Alice in both regions individually.* How come the total figures point to a different trend?

This is Simpson's Paradox in action. The Region filter in this case is called a __lurking variable__. This is a variable that modifies both the numerator and the denominator of the percentage - it changes the reference of the number of leads and the reference of the number of sales at the same time, creating completely independent groups to feed the percentage.
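The reversal can be checked directly from the counts in the three tables above. The following is a minimal Python sketch, not part of the original analysis: the dictionary layout and the function names (`regional_rate`, `overall_rate`, `naive_average`) are made up for illustration, while the lead and sale counts are copied from the tables.

```python
# Lead and sale counts copied from the West and North tables above.
# Keys are (salesperson, region); values are (leads, sales).
data = {
    ("Alice", "West"): (920, 735),
    ("Alice", "North"): (80, 15),
    ("Bob", "West"): (450, 387),
    ("Bob", "North"): (1550, 513),
}
REGIONS = ("West", "North")


def regional_rate(person, region):
    """Conversion rate for one salesperson inside one region."""
    leads, sales = data[(person, region)]
    return sales / leads


def overall_rate(person):
    """Pool the raw counts across regions first, then divide."""
    leads = sum(l for (p, _), (l, _s) in data.items() if p == person)
    sales = sum(s for (p, _), (_l, s) in data.items() if p == person)
    return sales / leads


def naive_average(person):
    """Simple mean of the regional percentages (the shortcut to avoid)."""
    rates = [regional_rate(person, region) for region in REGIONS]
    return sum(rates) / len(rates)


for person in ("Alice", "Bob"):
    by_region = ", ".join(
        f"{region} {regional_rate(person, region):.0%}" for region in REGIONS
    )
    print(
        f"{person}: {by_region}, "
        f"pooled {overall_rate(person):.0%}, "
        f"naive average {naive_average(person):.0%}"
    )
```

Bob comes out ahead in each region, yet Alice comes out ahead once the counts are pooled. The simple average of Bob's two regional percentages lands near 60%, far from his pooled 45%, because it ignores that most of his leads came from the North.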

To avoid this issue, it's essential to establish a *common baseline for context when comparing groups with different proportions and sizes.* This can help us establish appropriate targets and capture each individual's performance more accurately. The percentages are all correct, but Bob and Alice worked very different universes of leads: Bob took 1550 of his 2000 leads from the low-converting North, while Alice took 920 of her 1000 from the high-converting West. Once you slice the data, those different proportions and sizes are what drive the overall percentage. This is a common trap when using percentages and targets: it sounds like you are considering proportionality to assess targets, but you are falling into a mathematical pitfall.
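One possible way to build such a common baseline is direct standardisation: apply each person's regional conversion rates to the same mix of leads and compare the results. This is a technique I am swapping in for illustration, not something the example above prescribes. In the sketch below, the choice of the team's total leads per region as the shared mix is an assumption, and the names `team_leads` and `standardised_rate` are hypothetical.

```python
# Regional conversion rates, computed from the counts in the tables above.
rates = {
    "Alice": {"West": 735 / 920, "North": 15 / 80},
    "Bob": {"West": 387 / 450, "North": 513 / 1550},
}

# Assumed common baseline: the whole team's leads per region
# (the regional totals from the tables above).
team_leads = {"West": 1370, "North": 1630}


def standardised_rate(person_rates, lead_mix):
    """Conversion rate a person would show if they had worked `lead_mix`.

    Each regional rate is weighted by the shared number of leads in that
    region, so both people are judged against the same reference group.
    """
    expected_sales = sum(
        person_rates[region] * leads for region, leads in lead_mix.items()
    )
    return expected_sales / sum(lead_mix.values())


for person, person_rates in rates.items():
    print(f"{person}: {standardised_rate(person_rates, team_leads):.0%}")
```

On the same lead mix, Bob's standardised rate is higher than Alice's, which agrees with what the regional tables show. A different baseline would give different absolute numbers, so the mix you choose should be stated alongside the result.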

__Simpson’s Paradox__ is just one issue that may arise when slicing percentages with filters. It is an effect of something called the __reference class problem__. This happens when we filter the data in a way that changes the reference group used to calculate a percentage, similar to above. For example, if you want to calculate the percentage of sales by region, but you filter the data to show only a specific product line, *the reference group* used to calculate the percentage will change. This can lead to confusion or incorrect interpretation of the data, especially when we have dashboards with multiple metrics and countless cross-filters.

Here are a few questions you can ask yourself to help prevent this:

What are the numerator and denominator of my percentage? What is the context of the metric? Are both parts of the calculation changing as I filter the data?

What am I comparing these cut percentages with? Are the groups equivalent?

What is the context of my filter or comparison? How fair is the comparison when the context is applied?

And here’s a neat video about the Simpson’s Paradox and decision-making, giving a different example:

### Should you read it?

Yes! Absolutely! Even if you are well versed in Statistics or Mathematics, this book will give you a good perspective on the communication aspects of it: how to explain these concepts to others. __Wheelan__ is a master at helping his readers embrace complexity instead of avoiding or fearing it. My only criticism is the heavy hand with sports analogies, but I’ll concede that this area has abundant, rich, relatable examples to borrow from. He also adds a good dose of humour to help you along. An easy read, __Naked Statistics__ will surely be the Stats teacher/friend you always wished you had.

Always check your local library first to see if any of the books I recommend are available. If they’re not, consider donating a copy!

