Editor's Note: Matthew Lane is a Ph.D. candidate in mathematics at UCLA and is the founder of Math Goes Pop!, a blog focused on the surprisingly rich intersection between mathematics and popular culture. You can follow him on Twitter at @mmmaaatttttt.
Whether you are trying to make the best decisions for your fantasy baseball league, looking to capitalize on an opportunity in a fluctuating stock market or simply filtering through the results of a Google search, it is hard to deny that we are surrounded by more data now than ever before. Organizing all of that data and drawing conclusions from it can be a challenge, but thankfully mathematics can, in many cases, rise to the occasion.
The application of mathematics to such a rapidly increasing pool of data, however, is not without controversy. For example, in February The New York Times published an investigation by Charles Duhigg about the value of consumer data to major corporations, and how those corporations can use your data in ways that feel instinctively creepy.
Duhigg focuses on Target and its desire to identify pregnant women based on their shopping habits.
Target may have been too successful in this goal; as Duhigg writes, one teenage customer in Minneapolis was "outed" as being pregnant by a coupon mailer sent to her house. Her father was justifiably upset to see Target offering his underage daughter discounts on diapers and cribs, though he was perhaps more upset when he discovered Target knew more about his daughter's personal life than he did.
How is it possible for Target (or any company, for that matter) to draw such accurate conclusions about its customers from their shopping behavior? While the exact procedure is undoubtedly complex (Duhigg mentions Target performing some mathematical wizardry to assign each female customer a "pregnancy score"), the fundamental ideas are worth considering. In essence, what we are looking for is a way to assign a probability to an outcome (e.g., pregnancy) that is flexible and can change as we learn more information (e.g., shopping habits). One way to do this is to apply Bayes' Theorem, a powerful result that allows us to update the probability of some hypothesis as we obtain more information.
Let's consider the theorem in the context of Target and pregnancy. Suppose Target knows that at any given time, roughly 2% of its female shoppers are pregnant. A certain product, maybe it's a type of lotion, is particularly popular among pregnant women; to pull some numbers out of a hat, let's say that based on past consumer behavior, Target knows eight out of every 100 pregnant women will buy this lotion, while only one out of every 100 nonpregnant women will buy it. If you are a woman who buys this lotion, then essentially you are increasing the likelihood you are pregnant in the eyes of Target.
Bayes' Theorem tells us how to explicitly compute the probability that a woman is pregnant if she buys this item, by considering the ratio of the number of women who are both pregnant and buy the lotion to the number of women who simply buy the lotion (regardless of whether they may be pregnant).
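To make that ratio concrete, here is a short worked version using the made-up numbers above. The group of 10,000 women is my own choice for illustration, as is the notation. Of 10,000 female shoppers, 200 are pregnant (2%) and 16 of those buy the lotion (8% of 200). Of the 9,800 who are not pregnant, 98 buy it (1% of 9,800). So 114 women buy the lotion in all, and 16 of them are pregnant. Written as a formula, the same calculation looks like this:

```latex
\[
P(\text{pregnant} \mid \text{lotion})
  = \frac{P(\text{lotion} \mid \text{pregnant})\, P(\text{pregnant})}
         {P(\text{lotion} \mid \text{pregnant})\, P(\text{pregnant})
          + P(\text{lotion} \mid \text{not pregnant})\, P(\text{not pregnant})}
  = \frac{0.08 \times 0.02}{0.08 \times 0.02 + 0.01 \times 0.98}
  = \frac{0.0016}{0.0114}
  \approx 0.14
\]
```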
With the numbers given above, the likelihood that a woman who buys the lotion is pregnant rises from 2% to just over 14%; that's a sevenfold increase in the probability, and all just from the purchase of a single item! Though 14% isn't particularly high, when these updates are compounded across dozens of different purchases, suddenly it seems much more reasonable that a retailer could accurately infer a great deal about a customer based on his or her purchase history.
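For readers who like to tinker, the calculation can be sketched in a few lines of Python. This is only an illustration, not Target's actual method: the function name `update` is mine, and the second half rests on two loud assumptions, namely that three hypothetical products each behave exactly like the lotion (8% versus 1%) and that purchases are independent of one another once we know whether a woman is pregnant.

```python
def update(prior, p_buy_if_pregnant, p_buy_if_not):
    """Bayes' Theorem: probability of pregnancy given that the item was bought."""
    pregnant_and_buys = prior * p_buy_if_pregnant
    not_pregnant_and_buys = (1 - prior) * p_buy_if_not
    return pregnant_and_buys / (pregnant_and_buys + not_pregnant_and_buys)


# The lotion example from the article: 2% base rate, 8% vs. 1% purchase rates.
after_lotion = update(0.02, 0.08, 0.01)
print(f"After the lotion: {after_lotion:.1%}")

# Hypothetical compounding: three purchases, each assumed to have the same
# 8% vs. 1% rates and assumed independent given pregnancy status.
belief = 0.02
history = []
for _ in range(3):
    belief = update(belief, 0.08, 0.01)  # yesterday's answer is today's prior
    history.append(belief)
print([f"{b:.1%}" for b in history])
```

Under those assumptions, the estimate climbs from about 14% after one such purchase to roughly 57% after two and 91% after three, which gives a feel for how quickly a handful of telling purchases can add up.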
From a mathematical standpoint, the above example is no different from a woman taking a particularly crude pregnancy test (I say "crude" because any pregnancy test that detects pregnancy in only eight out of every 100 pregnant women probably isn't going to sell too well). Purchasing the lotion is the same as having a positive test result, while not purchasing the lotion is the same as a negative test result. The difference, of course, is that when a woman takes a pregnancy test, she is opting in, and her test results will be as confidential as she wants them to be. The same cannot be said for a pregnancy test administered without her consent through an analysis of what she buys.
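The test analogy can be sketched the same way. In the language of medical testing, the 8% figure plays the role of the test's sensitivity and the 1% figure its false positive rate; those labels, and the function name below, are my own framing rather than anything from Duhigg's reporting. The sketch also shows what a "negative result" (not buying the lotion) does to the estimate.

```python
def probability_after_test(prior, sensitivity, false_positive_rate, positive):
    """Probability of pregnancy after a positive or negative 'test' result."""
    if positive:
        # Chance of a positive result if pregnant / if not pregnant.
        if_pregnant, if_not = sensitivity, false_positive_rate
    else:
        # Chance of a negative result if pregnant / if not pregnant.
        if_pregnant, if_not = 1 - sensitivity, 1 - false_positive_rate
    pregnant_and_result = prior * if_pregnant
    not_pregnant_and_result = (1 - prior) * if_not
    return pregnant_and_result / (pregnant_and_result + not_pregnant_and_result)


bought = probability_after_test(0.02, 0.08, 0.01, positive=True)
did_not_buy = probability_after_test(0.02, 0.08, 0.01, positive=False)
print(f"Bought the lotion: {bought:.1%}")
print(f"Did not buy it:    {did_not_buy:.2%}")
```

With these numbers, skipping the lotion barely moves the needle (the estimate dips from 2% to a little under 1.9%), which is another way of seeing why this "test" is so crude: a negative result tells Target almost nothing.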
The moral here is that the very same mathematics, like the power of the Force in "Star Wars," can be used for quite different purposes. Learning your daughter is pregnant when she shows you a positive test result is very different from learning she is pregnant from a retailer that sent her a coupon book for maternity clothes. It's natural to feel uncomfortable at the thought of large companies trying to infer personal details about you, but I'd encourage you not to think of mathematics itself as the problem. Indeed, this onslaught of data also has many wonderful applications (see, for example, the hunt for an Earth-like planet outside our solar system).
Instead, it is the process of data collection itself that ought to raise people's eyebrows. So, if your eyebrows are higher than normal, it may help to try to shop more locally and pay with cash. Of course, taking a few math classes to try to better understand what these companies are doing couldn't hurt either.
As a side note, the Joint Policy Board for Mathematics has designated April as Mathematics Awareness Month. Every year a different subject is explored through the lens of mathematics; some of my favorite topics in recent years include the intersection of mathematics and sports (2010), voting (2008), and art (2003).
This year, the theme is "Mathematics, Statistics, and the Data Deluge." While it's a bit of a mouthful, it is certainly topical.