Chapter 11 Joint Distributions Tell You Everything
11.1 Joint Distributions
Do not let the fancy calligraphy notation of $\mathcal{P}$ scare you;
it’s equivalent to last chapter’s $P$. A frustration of
learning math is that there are multiple conventions for naming the same
things. Let’s start getting used to this so we can read further notes
beyond this book. The fancy $\mathcal{P}$ is just a function that takes
input and spits out a probability. In this textbook and others, you will
see all sorts of equivalent notation for a probability function,
including $\mathcal{P}(\cdot)$, $\Pr(\cdot)$, $P(\cdot)$, and $p(\cdot)$, where $\cdot$ is replaced by a set of random
variable values. For joint distributions, the input is values of random variables. $\mathcal{P}(X_1, \ldots, X_n)$ should be thought of as
you supplying realizations $x_1, \ldots, x_n$ and the function returns a
probability. The use of the capital $X$’s in $\mathcal{P}(X_1, \ldots, X_n)$ refers to the
fact that a distribution gives a probability for all
possible realizations of the random variables.
The most complete method of reasoning about sets of random variables is by having a joint probability distribution. A joint probability distribution, $\mathcal{P}(X_1, \ldots, X_n)$, assigns a probability value to all possible assignments or realizations of sets of random variables. The goal of this chapter is to 1) introduce you to the notation of joint probability distributions and 2) convince you that, given a joint probability distribution, you can answer some very useful questions using probability.
Consider the graphical model from Shenoy and Shenoy (2000), depicted in Figure 11.1. (Shenoy, Catherine, and Prakash P. Shenoy. 2000. Bayesian Network Models of Portfolio Risk and Return. The MIT Press.)
Figure 11.1: Model of how stock prices for oil companies are influenced by other factors.
In the diagram, there are four random variables: 1) Interest Rate, 2) Stock Market, 3) Oil Industry, and 4) Stock Price (assume for an oil company). The arrows between the random variables tell a story of precedence and causal direction: interest rate influences the stock market, which then, in combination with the state of the oil industry, will determine the stock price (we will learn more about these arrows in the Graphical Models chapter). For simplicity and to gain intuition about joint distributions, assume that each of these four random variables is binary-valued, meaning they can each take two possible assignments:
Random Variable
Set of Possible Values (i.e. )
Thus, our probability space has $2^4 = 16$ values corresponding to the possible assignments to these four variables. So, a joint distribution must be able to assign probability to these 16 combinations. Here is one possible joint distribution:
Note that probability notation on the Internet or in books does not
conform to just one convention. I will mix conventions in this book, not
to confuse you, but to give you exposure to other conventions you might
encounter when going forward on your learning journey outside of this
text. Every time I introduce a slightly changed notation, I will add
comments in the margin introducing it.
0.016
0.168
0.04
0.045
0.018
0.189
0.012
0.0135
0.004
0.042
0.04
0.045
0.012
0.126
0.108
0.1215
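Notation note: $\mathcal{P}(X, Y) = P(X \text{ and } Y)$. Each defines a
function where you supply realizations $x$ and $y$ and the probability function will
return $P(X = x \text{ and } Y = y)$.
For example, let $X$ be the outcome
of a dice roll and $Y$ be the outcome
of a coin flip. Hence, you can supply potential outcomes, like $\mathcal{P}(6, \text{Heads})$, which means $P(X = 6 \text{ and } Y = \text{Heads})$, and the function output
would be $\frac{1}{12}$ for a fair die and a fair coin (if you were
to do the math).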
Collectively, the above 16 probabilities represent the joint distribution, meaning you plug in values for all four random variables and it gives you a probability. For example, the last of the 16 assignments listed above yields a probability assignment of 12.15%.
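To make the "plug in values, get a probability" idea concrete, here is a minimal Python sketch that stores the 16 probabilities as a lookup table. It assumes the values are listed in lexicographic order over the four variables (first variable changing slowest), and it labels each variable's two possible values by position (0 for the first listed value, 1 for the second). Both that ordering and the names `PROBS`, `joint`, and `prob` are illustrative assumptions, not something fixed by the text.

```python
from itertools import product

# The 16 probabilities, in the order they are listed in the text.
PROBS = [
    0.016, 0.168, 0.04, 0.045,
    0.018, 0.189, 0.012, 0.0135,
    0.004, 0.042, 0.04, 0.045,
    0.012, 0.126, 0.108, 0.1215,
]

# ASSUMPTION: rows are in lexicographic order, so the first variable changes
# slowest and the fourth changes fastest. Values are labeled by position:
# 0 = first listed value, 1 = second listed value.
joint = dict(zip(product((0, 1), repeat=4), PROBS))


def prob(assignment):
    """Return the joint probability of one full assignment (a 4-tuple of 0/1)."""
    return joint[tuple(assignment)]


# A valid joint distribution assigns probabilities that sum to 1.
total = sum(joint.values())
print(f"number of assignments: {len(joint)}")
print(f"total probability: {total:.4f}")

# The last listed assignment is the 12.15% example from the text.
print(f"P(last assignment) = {prob((1, 1, 1, 1))}")
```

A dictionary keyed by full assignments mirrors the definition directly: one entry per combination, so the table has 16 entries for four binary variables.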
11.2 Marginal Distributions
One might also be curious about probability assignments for just a subset of the random variables. This smaller subset of variables can be called marginal variables and their probability distribution is called a marginal distribution. For example, the marginal distribution for oil industry is notated as $\mathcal{P}(\text{Oil Industry})$ and represents a probability distribution over just one of the four variables, ignoring the others. The marginal distribution can be derived from the joint distribution using the formula:
$$\mathcal{P}(\text{Oil Industry}) = \sum_{\text{Interest Rate}} \sum_{\text{Stock Market}} \sum_{\text{Stock Price}} \mathcal{P}(\text{Interest Rate}, \text{Stock Market}, \text{Oil Industry}, \text{Stock Price})$$
More generally speaking, a marginal distribution is a compression of
information where only information regarding the marginal variables is
maintained. Take a set of random variables, $\mathbf{X}$ (e.g. all four variables in the stock price model), and a subset of those
variables, $\mathbf{Y}$ (e.g. just Oil Industry). And using standard mathematical
convention, let $\mathbf{Z} = \mathbf{X} \setminus \mathbf{Y}$
be the set of random variables in $\mathbf{X}$
that are not in $\mathbf{Y}$ (i.e. the three remaining variables). Assuming discrete
random variables, then the marginal distribution $\mathcal{P}(\mathbf{Y})$ is calculated from the joint
distribution via $\mathcal{P}(\mathbf{Y} = \mathbf{y}) = \sum_{\mathbf{z}} \mathcal{P}(\mathbf{Y} = \mathbf{y}, \mathbf{Z} = \mathbf{z})$. Effectively, when the joint probability
distribution is in tabular form, one just sums up the probabilities in
each row where $\mathbf{Y} = \mathbf{y}$.
Think of a marginal distribution as a function of the marginal variables. Given realizations of the marginal variables, the function returns a probability. Applying the above formula yields a tabular representation of the marginal distribution (Table 11.1).
Table 11.1: A marginal distribution shown in table form.
Realization ()
0.016 + 0.168 + 0.04 + 0.045 + 0.004 + 0.042 + 0.04 + 0.045 = 0.4
0.018 + 0.189 + 0.012 + 0.0135 + 0.012 + 0.126 + 0.108 + 0.1215 = 0.6
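The row sums in Table 11.1 can be reproduced with a small marginalization helper. This is a sketch under the assumption that the 16 joint probabilities are listed in lexicographic order and that values are labeled by position (0 or 1); the function name `marginal` and its `keep` parameter are hypothetical. Under that assumption, Table 11.1 corresponds to keeping the variable in the second position, the one whose value changes every four rows.

```python
from itertools import product

# The 16 joint probabilities, in the order they are listed in the text.
PROBS = [
    0.016, 0.168, 0.04, 0.045,
    0.018, 0.189, 0.012, 0.0135,
    0.004, 0.042, 0.04, 0.045,
    0.012, 0.126, 0.108, 0.1215,
]

# ASSUMPTION: lexicographic row order; values labeled 0 (first) and 1 (second).
joint = dict(zip(product((0, 1), repeat=4), PROBS))


def marginal(joint, keep):
    """Sum the joint over every variable whose position is not in `keep`."""
    result = {}
    for assignment, p in joint.items():
        key = tuple(assignment[i] for i in keep)
        result[key] = result.get(key, 0.0) + p
    return result


# Keep only the variable in position 1 (zero-based) and sum out the rest.
table_11_1 = marginal(joint, keep=(1,))
for (value,), p in sorted(table_11_1.items()):
    print(f"value {value}: {p:.4f}")
# prints:
# value 0: 0.4000
# value 1: 0.6000
```

Passing more than one position in `keep`, for example `keep=(0, 3)`, gives a joint marginal over several variables using the same summation.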
Exercise 11.1 Suppose we are only interested in the Oil Company Stock Price. Given the probabilities in the above joint distribution, what is the marginal distribution for Stock Price (i.e. the probability of each of its two possible values)?
Exercise 11.2 Suppose we are interested in both the Stock Market and the Oil Industry. We can find the marginal distribution for these two variables, $\mathcal{P}(\text{Stock Market}, \text{Oil Industry})$. This is sometimes called a joint marginal distribution: joint referring to the presence of multiple variables and marginal referring to the notion that this is a subset of the original joint distribution. So, given the probabilities in the above joint distribution, what is the marginal distribution for these two variables, i.e. give a probability function for $\mathcal{P}(\text{Stock Market}, \text{Oil Industry})$?
11.3 Conditional Distributions
Conditional distributions can be used to model scenarios where a subset of the random variables are known (e.g. data) and the remaining subset is of interest (e.g. model parameters). For example, we might be interested in getting the conditional distribution of Stock Price given that we know the Interest Rate. The notation for this is $\mathcal{P}(\text{Stock Price} \mid \text{Interest Rate})$ and can be calculated using the definition of conditional probability:
$$\mathcal{P}(Y \mid X) = \frac{\mathcal{P}(X, Y)}{\mathcal{P}(X)}$$
Think of a conditional distribution as a function of the variables to
the left of the conditioning pipe ($\mid$). Because the right-side
variables are assumed given, you already know their values
with 100% certainty. You supply realizations of the
left-side variables, and the function returns a probability.
For our specific problem:
$$\mathcal{P}(\text{Stock Price} \mid \text{Interest Rate}) = \frac{\mathcal{P}(\text{Stock Price}, \text{Interest Rate})}{\mathcal{P}(\text{Interest Rate})}$$
To calculate conditional probabilities when already given the joint distribution, use a two-step process:
First, to simplify the problem, calculate the numerator, i.e. the marginal distribution for Stock Price and Interest Rate, and rid ourselves of the variables that we are not interested in. To get the marginal distribution, just aggregate the rows in the joint distribution as done in the previous section on marginal distributions:
0.086
0.4155
0.164
0.3345
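Then, calculate any conditional distribution of interest by plugging in the given value for Interest Rate and all of the possible Stock Price values. For example, conditioning on one known Interest Rate value $r$ means we need to be able to find a probability for each of the two Stock Price outcomes given that we know $\text{Interest Rate} = r$. Hence, for each outcome $s$ we calculate
$$\mathcal{P}(\text{Stock Price} = s \mid \text{Interest Rate} = r) = \frac{\mathcal{P}(\text{Stock Price} = s, \text{Interest Rate} = r)}{\mathcal{P}(\text{Interest Rate} = r)}$$
where the denominator is the sum of the two numerator values that share the same $r$, which yields the conditional distribution for Stock Price.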
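The two-step process can be sketched in Python as follows. The sketch assumes the 16 joint probabilities are listed in lexicographic order, with Interest Rate in the first position and Stock Price in the last (the order in which the variables are introduced), and it labels values by position (0 or 1) instead of by name. The function `conditional` and its parameters are hypothetical names chosen for illustration.

```python
from itertools import product

# The 16 joint probabilities, in the order they are listed in the text.
PROBS = [
    0.016, 0.168, 0.04, 0.045,
    0.018, 0.189, 0.012, 0.0135,
    0.004, 0.042, 0.04, 0.045,
    0.012, 0.126, 0.108, 0.1215,
]

# ASSUMPTION: lexicographic row order; position 0 is Interest Rate and
# position 3 is Stock Price; values labeled 0 (first) and 1 (second).
joint = dict(zip(product((0, 1), repeat=4), PROBS))


def conditional(joint, target, given, given_value):
    """Return P(target | given = given_value) as a dict over target values.

    `target` and `given` are variable positions (0 to 3).
    """
    # Step 1: the numerator, i.e. the joint marginal of (given, target)
    # restricted to the rows where the given variable has the known value.
    numerator = {}
    for assignment, p in joint.items():
        if assignment[given] == given_value:
            t = assignment[target]
            numerator[t] = numerator.get(t, 0.0) + p
    # Step 2: divide by the denominator P(given = given_value), which is
    # the sum of the numerator over all target values.
    denominator = sum(numerator.values())
    return {t: p / denominator for t, p in numerator.items()}


cond = conditional(joint, target=3, given=0, given_value=0)
for value, p in sorted(cond.items()):
    print(f"P(target = {value} | given = 0) = {p:.4f}")
```

Under the stated ordering assumption, step 1 for `given_value=0` produces the first two of the four marginal values listed in the text (0.086 and 0.4155), and the resulting conditional probabilities sum to 1.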
Exercise 11.3 Now, suppose we learn that Interest Rate is low. What is the conditional distribution for Stock Price (i.e. the probability of each Stock Price outcome given a low Interest Rate)?
11.4 MAP Estimates
Sometimes, we are not interested in a complete probability distribution, but rather seek a high-probability assignment to some subset of variables. For this, we can use a MAP query (maximum a posteriori query). A MAP query finds the most likely assignment of all non-evidentiary variables (i.e. unknown values). Basically, you search the joint distribution for the largest probability value. For example, the maximum a posteriori estimate of stock price given a known interest rate value $r$ would be given by the following formula:
$$\arg\max_{s} \mathcal{P}(\text{Stock Price} = s \mid \text{Interest Rate} = r)$$
which in natural language asks for the argument (i.e. the realization of stock price) that maximizes the conditional probability $\mathcal{P}(\text{Stock Price} \mid \text{Interest Rate} = r)$. From above, we have the conditional probability of each of the two stock price outcomes, and hence the MAP estimate is the outcome with the larger conditional probability.
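A MAP query of this kind is just an argmax over a conditional distribution. The sketch below starts from the four marginal values listed in the conditional distribution discussion (0.086, 0.4155, 0.164, 0.3345) and assumes they are ordered with Interest Rate changing slowest and Stock Price fastest, with values labeled by position (0 or 1). The names `NUMERATOR` and `map_estimate` are hypothetical.

```python
# Joint marginal of (Interest Rate, Stock Price), keyed by positional labels.
# ASSUMPTION: the four values are listed with Interest Rate changing slowest.
NUMERATOR = {
    (0, 0): 0.086,
    (0, 1): 0.4155,
    (1, 0): 0.164,
    (1, 1): 0.3345,
}


def map_estimate(numerator, given_value):
    """Return (most likely target value, its conditional probability)."""
    # Keep only the entries consistent with the evidence.
    row = {t: p for (g, t), p in numerator.items() if g == given_value}
    # Dividing by the denominator rescales every entry equally, so the
    # argmax can be taken over the numerator values directly.
    denominator = sum(row.values())
    best = max(row, key=row.get)
    return best, row[best] / denominator


best_value, best_prob = map_estimate(NUMERATOR, given_value=0)
print(f"MAP estimate: value {best_value} with probability {best_prob:.4f}")
# prints: MAP estimate: value 1 with probability 0.8285
```

Because the denominator is the same for every candidate value, the MAP estimate never requires computing the full conditional distribution; the normalization is only needed to report the probability itself.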
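Exercise 11.4 Let $\mathbf{X}$ be the set of all four random variables. What is $\arg\max_{\mathbf{x}} \mathcal{P}(\mathbf{X} = \mathbf{x})$, the most likely joint assignment?
Exercise 11.5 Now, suppose we learn that Interest Rate is low. Let $\mathbf{Z}$ be the set of the three remaining random variables. What is $\arg\max_{\mathbf{z}} \mathcal{P}(\mathbf{Z} = \mathbf{z} \mid \text{Interest Rate} = \text{low})$, the most likely joint assignment?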
11.5 Limitations of Joint Distributions
Why don’t we just use joint probability distributions all the time? Despite the expressive power of having a joint probability distribution, they are not that easy to directly construct due to the curse of dimensionality. As the number of random variables being considered in a dataset grows, the number of potential probability assignments grows too. Even in the era of big data, this curse of dimensionality still exists. Generally speaking, an exponential increase is required in the size of the dataset as each new descriptive feature is added.
Let’s assume we have $n$ random variables with each having $k$ values. Thus, the joint distribution requires $k^n$ probabilities. Even if $k = 2$ and $n = 34$, this leads to 17,179,869,184 possibilities (over 17 billion). To make this concrete, a typical car purchase decision might easily look at 34 different variables (e.g. make, model, color, style, financing, etc.). So, modeling this decision would require a very large joint distribution, one whose size dwarfs the amount of data that is available. As a point of comparison, well under 100 million motor vehicles were sold worldwide in 2019, i.e. less than one data point per possible combination of features. Despite this “curse”, we will learn to get around it with more compact representations of joint distributions. These representations will require less data, but will still yield the power to answer queries of interest, just as if we had access to the full joint distribution.
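The arithmetic behind this comparison is easy to check. The following sketch computes the number of entries in a joint distribution table and compares it with the 100 million upper bound on vehicle sales quoted above; the function name `joint_table_size` is a hypothetical helper.

```python
def joint_table_size(n, k):
    """Number of probabilities in a joint distribution of n variables
    that each take k possible values."""
    return k ** n


size = joint_table_size(n=34, k=2)
vehicles_sold = 100_000_000  # upper bound quoted in the text for 2019

print(f"{size:,} possible assignments")
print(f"{vehicles_sold / size:.4f} data points per assignment, at most")
# prints:
# 17,179,869,184 possible assignments
# 0.0058 data points per assignment, at most
```

Each additional binary variable doubles the table, so the ratio of data to table entries halves with every feature added.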
11.6 Some Additional Exercises
In its quest to find properties suitable for development, Starbucks looks at three key variables: visibility, traffic, and competition. An analyst with the real estate division of Starbucks put together the following joint distribution to represent all of the candidate properties that Starbucks is currently considering for new locations:
Visibility   Traffic   Competition   P(V,T,C)
Low          Low       Low           0.05
Low          Low       High          0.05
Low          High      Low           0.05
Low          High      High          0.14
Medium       Low       Low           0.03
Medium       Low       High          0.01
Medium       High      Low           0.02
Medium       High      High          0.15
High         Low       Low           0.05
High         Low       High          0.20
High         High      Low           0.05
High         High      High          0.20
Exercise 11.6 Given the above joint distribution, what is the marginal probability that visibility is Medium? (report your probability as a decimal to the hundredth place. For example 15.8% should be entered 0.16)
Exercise 11.7 Given the above joint distribution, what is the conditional probability that visibility is high given that traffic is high? (enter your probability as a decimal to the hundredth place. For example 15.8% should be entered 0.16)