Driving down the highway in upstate Michigan yesterday, I saw a State Police billboard that stated (I paraphrase):

> Drivers who text are 23 times more likely to get into a crash.

This raises two questions for me:

- What exactly does one mean by "23 times more likely"?
- For all intents and purposes, is "23 times more likely" really something very significant?

These two questions may require some explaining.

The first question really boils down to: "how does one measure the likelihood of an event?" For many people (yours truly included), the first thing that "likelihood" brings to mind is "probability", as in "the likelihood that flipping a fair coin gives you heads is one in two," or "the likelihood of rolling a 7 on a pair of dice is one in six." Comparing these likelihoods, we might say that "getting a head from flipping a coin is three times more likely than rolling a 7 on a pair of dice."

However, this method of measuring likelihood lacks a certain symmetry.
The probability of *not* getting a head from flipping a coin is also one in two, and the probability of *not* getting a 7 from rolling two dice is five in six, so while "getting a head from flipping a coin is three times more likely than rolling a 7 on a pair of dice", one cannot as naturally assert "*not* getting a head from flipping a coin is three times *less* likely than *not* rolling a 7 on a pair of dice."
(If you compute it, the ratio is actually 5/3, or only about 1.67 times less likely.)
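A quick sanity check of that arithmetic, sketched in Python (the numbers are just the coin and dice probabilities from above):

```python
# Probabilities of the two events under discussion.
p_head = 1 / 2    # heads on a fair coin
p_seven = 1 / 6   # rolling a 7 with two dice

# Ratio of the probabilities of the events themselves: 3.
ratio_events = p_head / p_seven

# Ratio of the probabilities of the complements: 5/3, not 3.
ratio_complements = (1 - p_seven) / (1 - p_head)

print(ratio_events, ratio_complements)
```

The two ratios disagree, which is precisely the asymmetry being described.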

A much more symmetric way of measuring likelihood is actually present in common parlance, and is *de rigueur* in the gambling industry. This is the notion of **odds**.
Whereas **probability** expresses the likelihood as the ratio of "number of expected events" over "number of total trials", **odds** expresses it as the ratio of "number of expected events" over "number of expected non-events".
For example, the **odds** of getting a head on a coin toss is one-to-one; the odds of rolling a 7 on a pair of dice is one-to-five.
(For gambling or insurance or any applications where it is necessary to set a price based on expected values, the odds notation makes the computation easier.)

Using the odds notation, we may say that "getting a head from flipping a coin is **five** times more likely than rolling a 7 on a pair of dice."

What's nice about the odds notation is that the odds of *not* getting a head on a coin toss is also one-to-one; while not getting a 7 on a roll of two dice is simply five-to-one.
So automatically we also have "*not* getting a head from flipping a coin is five times *less* likely than not rolling a 7 on a pair of dice" --- that is to say, the expected symmetry is restored.
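This symmetry is easy to verify mechanically; here is a minimal sketch in Python (the `odds` helper is my own, not a standard function):

```python
from fractions import Fraction

def odds(p):
    """Odds of an event with probability p, as events : non-events."""
    return Fraction(p) / (1 - Fraction(p))

o_head = odds(Fraction(1, 2))   # 1, i.e. one-to-one
o_seven = odds(Fraction(1, 6))  # 1/5, i.e. one-to-five

# Ratio of odds for the events themselves: 5.
assert o_head / o_seven == 5

# Ratio of odds for the complements: exactly the reciprocal, 1/5,
# which is the symmetry the probability notation lacks.
assert odds(Fraction(1, 2)) / odds(Fraction(5, 6)) == Fraction(1, 5)
```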

In addition to the obvious numeric difference between the two methods of expression, and the issue of symmetry, there are a few reasons why one may prefer using **odds** to compare relative likelihoods.

- In "odds", a sure thing (an event that happens 100% of the time) is "infinitely more likely" than any event that has some degree of randomness. Compare this to the "probability" notion: saying that the chance of getting a head on a two-headed coin is only twice that on a fair coin seems an understatement.
- For rare events, the two notions roughly agree. If an event has probability $P$, the odds of it happening can be expressed as the ratio \[ O = \frac{P}{1-P} \] which we can expand as the Maclaurin series \[ O = P + P^2 + P^3 + \ldots \] so when $P$ is very small, the odds and probability are practically the same, which also agrees with our intuition that a disease that strikes 25 people per million is 5 times more likely than one that strikes 5 people per million.
- Finally, and most interestingly for statistical data analysis, using odds has the advantage that one can quantify the relative likelihood of occurrences **without** actually determining the individual probabilities of the events.
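The second point, the near-agreement for rare events, can be seen numerically; a small sketch (the probabilities below, including the 25-per-million and 5-per-million disease rates, follow the examples in the text):

```python
def odds(p):
    """Odds corresponding to probability p."""
    return p / (1 - p)

# As P shrinks, the gap between odds and probability (O - P = P^2 + ...)
# becomes negligible relative to P itself.
for p in [0.3, 0.05, 25e-6, 5e-6]:
    print(f"P = {p:.2e}  O = {odds(p):.6e}  relative gap = {odds(p) / p - 1:.2e}")

# Consequently the ratio of odds for two rare diseases is essentially
# the intuitive ratio of probabilities:
print(odds(25e-6) / odds(5e-6))  # very close to 5
```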

(The difference between using odds versus probability [the latter is often called "risk" in the medical/insurance industries, where the events discussed are often detrimental to human beings] is discussed in detail in this article.)

Let me get to point 3 a bit later, and return to my initial questions.

Independently of what "23 times more likely" means, it is also worthwhile thinking about whether that "23 times more likely" will actually impact us significantly. To take a facetious example: the odds of winning the EuroMillions lottery are about twice those of PowerBall (and the EuroMillions payout is tax exempt, unlike US lottery jackpots), but I am not going to make a life decision about where to live based on that statistic. The overall odds of dying in a car accident (as a car occupant) in the US, over the course of one year, are estimated at about 1 in 45,000, but this is of course averaged over all mortality data. Considering that around 38% of teens admit to texting while driving, it is hard to say whether the 1 in 45,000 figure is closer to the odds corresponding to "safe driving" behavior or to "dangerous driving" behavior. (Policy-wise I think this is a moot point: the "cost" of not texting while driving is sufficiently low that one should refrain regardless of the actual impact on overall mortality rates; this post is mostly an intellectual exercise.)

So what better way to find the actual estimates than by tracking down the original study?

As an aside: this 23-fold-increase-in-risk figure is so popular that tracking down the original study turned out to be a nontrivial task.
A bit of digging finally revealed the source of this number: a study contracted by the Federal Motor Carrier Safety Administration that was conducted by the Virginia Tech Transportation Institute, which tracked the driving behavior and incidents of commercial drivers (think truckers with on-board monitoring software and cabin-videos) to gather the statistics.
The report itself is 285 pages long and I am sure makes fascinating reading for people of certain persuasions, but *unfortunately* does not contain the numbers I was actually looking for!

And here we come back to point 3 above, on the benefits of using odds rather than probability to determine relative likelihood. (In passing: yes, the 23-fold increase determined by the VTTI report is an increase in *odds*, which allows them to compute the relative odds between the two events without determining the absolute odds of either event.)
To see how something like this can work:

Assume our population/trials can be divided into four categories:

- Drivers in accidents who text while driving (class AT)
- Drivers in accidents who do not text (class AN)
- Drivers who text and do not get into accidents (class ST)
- Drivers who do not text and do not get into accidents (class SN)

To compute the ratio of probabilities, we would need to compute

\[ \frac{\frac{AT}{AT + ST}}{\frac{AN}{AN + SN}} \]

while to compute the ratio of odds, we need to compute

\[ \frac{\frac{AT}{ST}}{\frac{AN}{SN}} \]
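With made-up counts for the four classes (the numbers below are purely illustrative and not from the VTTI study), both formulas are one-liners:

```python
# Hypothetical counts, for illustration only:
AT, AN = 30, 10        # drivers in accidents: texting / not texting
ST, SN = 2_000, 8_000  # drivers not in accidents: texting / not texting

# Ratio of probabilities ("relative risk"):
risk_ratio = (AT / (AT + ST)) / (AN / (AN + SN))

# Ratio of odds ("odds ratio"):
odds_ratio = (AT / ST) / (AN / SN)

# Because accidents are rare here, the two are close, per the earlier point
# about rare events; they are not identical.
print(risk_ratio, odds_ratio)
```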

Now suppose we have a large potential data set and we wish to estimate the ratio by sampling. To compute the ratio of probabilities, we would need to isolate the class T of texting drivers and randomly sample a subset to see the likelihood of getting into an accident, and then do the same for the class N of non-texting drivers.

To compute the ratio of odds, however, we can perform a trick: since we are looking at a ratio of ratios, we can rewrite the desired relative odds as

\[ \frac{AT}{AN} \cdot \frac{SN}{ST} \]

The first ratio can be obtained by isolating the class A of all accidents and looking at the ratio of those who text versus those who don't, while the second can be obtained by isolating the class S of all non-accidents, taking a random sample, and seeing the ratio of those who text to those who don't.

This is important when it is far easier to separate the data into class S versus class A than into class T versus class N. (For example, in observational medical studies it is generally not feasible to control for the disease itself; in these cases it is in fact impossible to determine the precise number in class ST, but the ratio of ST to SN can be found by a baseline study of the general population.)
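The sampling trick above can be simulated; here is a sketch assuming a hypothetical population (the counts are invented, and the true relative odds is 3 × 4 = 12 by construction), where the analyst can only draw samples separately from class A and class S:

```python
import random

random.seed(0)

# Hypothetical population truth, unknown to the analyst:
population_A = ["T"] * 300 + ["N"] * 100        # drivers in accidents
population_S = ["T"] * 20_000 + ["N"] * 80_000  # accident-free drivers

# Case-control style sampling: draw from A and from S independently;
# the absolute probability of an accident never enters the picture.
sample_A = random.sample(population_A, 200)
sample_S = random.sample(population_S, 1_000)

at_over_an = sample_A.count("T") / sample_A.count("N")  # estimates AT/AN
sn_over_st = sample_S.count("N") / sample_S.count("T")  # estimates SN/ST

# Estimate of the relative odds (AT/AN) * (SN/ST); true value is 12 here.
print(at_over_an * sn_over_st)
```

Note that neither sample alone says anything about how likely an accident is; only their product, the relative odds, is recoverable, which is exactly the trade-off made in case-control style designs.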