# What happens when you iterate Bayesian Inference with the same data set?

I’ve recently been reviewing Bayesian networks with an eye to learning STAN. One question which occurred to me is the following. Suppose we are interested in the probability distribution $P(\mu)$ over parameters $\mu\in X$ (with state space $X$). We acquire some data $D$, and wish to use it to infer $P(\mu)$. Note that $D$ refers to the specific realized data, not the event space from which it is drawn.

Let’s assume that (1) we have a prior $P(\mu)$, (2) the likelihood $P(D|\mu)$ is easy to compute or sample, and (3) the normalization $P(D)\equiv \sum_{\mu\in X} P(D|\mu)P(\mu)$ is not expensive to compute or adequately approximate.

The usual Bayesian approach involves updating the prior to a posterior via Bayes’ theorem: $P(\mu|D)= \frac{P(D|\mu)P(\mu)}{P(D)}$. However, there also is another view we may take. We need not restrict ourselves to a single Bayesian update. It is perfectly reasonable to ask whether multiple updates using the same $D$ would yield a more useful result.

Such a tactic is not as ridiculous or unjustified as it first seems. In many cases, the Bayesian posterior is highly sensitive to a somewhat arbitrary choice of prior $P(\mu)$. The latter frequently is dictated by practical considerations rather than arising naturally from the problem at hand. For example, we often use the likelihood function’s conjugate prior to ensure that the posterior will be of the same family. Even in this case, the posterior depends heavily on the precise choice of $P(\mu)$.

Though we must be careful interpreting the results, there very well may be applications in which an iterated approach is preferable. For example, it is conceivable that multiple iterations could dilute the dependence on $P(\mu)$, emphasizing the role of $D$ instead. We can seek inspiration in the stationary distributions of Markov chains, where the choice of initial distribution becomes irrelevant. As a friend of mine likes to say before demolishing me at chess: let’s see where this takes us. Spoiler: infinite iteration “takes us” to maximum-likelihood selection.

An iterated approach does not violate any laws of probability. Bayes’ theorem is based on the defining property $P(\mu,D)= P(D|\mu)P(\mu)= P(\mu|D)P(D)$. Our method is conceptually equivalent to performing successive experiments which happen to produce the same data $D$ each time, reinforcing our certainty around it. Although its genesis is different, the calculation is the same. I.e., any inconsistency or inapplicability must arise through interpretation rather than calculation. The results of an iterated calculation may be inappropriate for certain purposes (such as estimating error bars), but could prove useful for others.

In fact, one could argue there only are two legitimate approaches when presented with a one-time data set $D$. We could apply it once or an infinite number of times. Anything else would amount to an arbitrary choice of the number of iterations.

It is easy to analyze the infinite iteration process. For simplicity, we’ll consider the case of a discrete, finite state space $X$. $D$ is a fixed set of data values for our problem, so we are not concerned with the space or distribution from which it is drawn. $P(D)$ is a derived normalization factor, nothing more.

Let’s introduce some notation:

– Let $n\equiv |X|$, and denote the elements of $X$ by $\mu_1\dots \mu_n$.
– We could use $n$-vectors to denote probability or conditional probability distributions over $X$ (with the $i^{th}$ component the probability of $\mu_i$), but it turns out to be simpler to use diagonal $n\times n$ matrices.
– $P(\mu)$ is an $n$-vector, which we’ll write as a diagonal $n\times n$ matrix $v$ with $v_{ii}\equiv P(\mu_i)$.
– We’ll denote by $D^k$ the data set $D$ repeated $k$ times, i.e. the equivalent of having performed an experiment $k$ times and obtained $D$ each time.
– $P(\mu|D)$ is an $n$-vector, which we’ll write as a diagonal $n\times n$ matrix $v'$ with $v'_{ii}\equiv P(\mu_i|D)$.
– Where multiple updates are involved, we denote the final posterior $P(\mu|D^k)$ via an $n\times n$ diagonal matrix $v^{(k)}$, with $v^{(k)}_{ii}\equiv P(\mu_i|D^k)$. Note that $v'= v^{(1)}$ and $v= v^{(0)}$.
– $P(D|\mu)$ is an $n$-vector of probabilities as well, but we’ll treat it too as a diagonal $n\times n$ matrix $M$ with $M_{ii}\equiv P(D|\mu_i)$.
– $P(D)=\sum_{i=1}^n P(D|\mu_i)P(\mu_i)$ is a scalar. In our notation, $P(D)= \text{tr}~ M v$.

A single Bayesian update takes the form $v'= M v/(\text{tr}~ M v)$. What happens if we repeat this? A second iteration substitutes $v'$ for $v$, and we get $v^{(2)}= M v'/(\text{tr}~ M v')$. This is homogeneous of degree $0$ in $v'$, so the $(\text{tr}~ M v)$ normalization factor in $v'$ disappears. We thus have $v^{(2)}= M^2 v /(\text{tr}~ M^2 v)$. The same reasoning extends to $v^{(k)}= M^k v/(\text{tr}~ M^k v)$.
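The iterated update is easy to check numerically. Here is a minimal sketch with made-up numbers (a hypothetical $n=3$ state space with an invented prior and likelihood); since all our matrices are diagonal, plain vectors of diagonal entries suffice:

```python
import numpy as np

# Hypothetical numbers for illustration only: n = 3 parameter values.
prior = np.array([0.5, 0.3, 0.2])        # v_ii = P(mu_i)
likelihood = np.array([0.2, 0.7, 0.1])   # M_ii = P(D | mu_i)

def iterate_posterior(v, M, k):
    """Apply k Bayesian updates with the same data: v^(k) = M^k v / tr(M^k v)."""
    w = (M ** k) * v
    return w / w.sum()

print(iterate_posterior(prior, likelihood, 1))   # the ordinary single-update posterior
print(iterate_posterior(prior, likelihood, 50))  # approaches [0, 1, 0]: the max-likelihood value
```

Even with a prior that favors $\mu_1$, fifty iterations concentrate essentially all posterior mass on $\mu_2$, the maximum-likelihood value.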

It now is easy to see what is happening. Suppose $n=2$, and let $M_{11}>M_{22}$. Our expression for $P(\mu_1|D^k)$ after $k$ iterations is $v^{(k)}_{11}= \frac{M^k_{11} v_{11}}{M^k_{11} v_{11} + M^k_{22} v_{22}}$.

This has the form $\frac{a^k x}{a^k x + b^k y}$, which can be written $1/(1+\frac{b^k y}{a^k x})$. We know that $b<a$, so as long as $x\ne 0$ we have $\lim_{k\rightarrow\infty} \frac{b^k y}{a^k x}= 0$. Specifically, for $\epsilon>0$ we have $\frac{b^k y}{a^k x}<\epsilon$ for $k>\frac{\ln\epsilon + \ln \frac{x}{y}}{\ln \frac{b}{a}}$. Note that the denominator is negative since $a>b$ and the numerator is negative for small enough $\epsilon$.
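As a quick numeric check of this bound, here is a sketch with illustrative values (the particular $a$, $b$, $x$, $y$, and $\epsilon$ are made up, not from the text):

```python
import math

# Illustrative numbers: a > b are the two likelihood values,
# x and y the corresponding prior probabilities.
a, b, x, y = 0.7, 0.2, 0.5, 0.5
eps = 1e-6

# Threshold from the bound: k > (ln(eps) + ln(x/y)) / ln(b/a).
k_threshold = (math.log(eps) + math.log(x / y)) / math.log(b / a)
k = math.ceil(k_threshold) + 1

ratio = (b ** k * y) / (a ** k * x)
assert ratio < eps  # the suppressed term has indeed dropped below eps
print(k, ratio)
```

With these numbers the threshold is about $11$ iterations, after which the subdominant term is below $10^{-6}$.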

We therefore have shown that (in this simple case) $\lim_{k\rightarrow\infty} v^{(k)}_{11}= 1$. If we perform the same analysis for $v^{(k)}_{22}$, we get $v^{(k)}_{22}= \frac{M^k_{22} v_{22}}{M^k_{11} v_{11} + M^k_{22} v_{22}}$, which corresponds to $1/(1+\frac{a^k x}{b^k y})$. The denominator diverges as $k\rightarrow\infty$, so $\lim_{k\rightarrow\infty} v^{(k)}_{22}= 0$.

This trivially extends to $n>2$. As $k\rightarrow\infty$, all but the dominant $M_{ii}$ are exponentially suppressed. The net effect of infinite iteration is to pick out the maximum likelihood value. I.e., we select the $\mu_i$ corresponding to the maximum $M_{ii}$, and all posterior probability is concentrated on it. Put another way, the limit of iterated posteriors is $P(\mu_i|D^\infty)= 1$ for $i= \mathrm{argmax}_j~P(D|\mu_j)$ and $0$ for all other $i$.

What if the maximum $M_{ii}$ is degenerate? Let’s again consider the simple $n=2$ case, but now with $M_{11}= M_{22}>0$. It is easy to see what happens in this case. $a/b=1$, so $v^{(k)}_{11}= \frac{v_{11}}{v_{11}+v_{22}}$ and $v^{(k)}_{22}= \frac{v_{22}}{v_{11}+v_{22}}$ for every $k$. Note that $v_{11}+v_{22}=1$ here, but we stated the denominator explicitly to facilitate visualization of the extension to $n>2$.

This extension is straightforward. We pick out the maximum likelihood values $\mu_i$, and they are assigned their prior probabilities, renormalized. Suppose there are $m\le n$ degenerate maximum $M_{ii}$’s, with indices $i_1\dots i_m$ (each $i_j\in 1\dots n$). The limit of iterated posteriors is $P(\mu_{i_j}|D^\infty)= \frac{P(\mu_{i_j})}{\sum_{l=1}^m P(\mu_{i_l})}$. This reduces to our previous result when $m=1$.
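A brief numerical sketch of the degenerate case (again with invented numbers, here $n=4$ with a two-fold degenerate maximum):

```python
import numpy as np

# Hypothetical example: mu_1 and mu_2 tie for maximum likelihood.
prior = np.array([0.4, 0.1, 0.3, 0.2])
likelihood = np.array([0.6, 0.6, 0.2, 0.1])

def iterate_posterior(v, M, k):
    """v^(k) = M^k v / tr(M^k v), with diagonal entries stored as vectors."""
    w = (M ** k) * v
    return w / w.sum()

# Mass concentrates on the tied maxima in proportion to their priors:
# 0.4 / (0.4 + 0.1) = 0.8 and 0.1 / (0.4 + 0.1) = 0.2.
print(iterate_posterior(prior, likelihood, 100))
```

The non-maximal values are pruned, while the two tied values keep their prior ratio of $4:1$.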

Note that we must ensure $v_{ii}\ne 0$ for the maximum likelihood $\mu_i$’s. I.e., we cannot have a $0$ prior for any of the maximum likelihood values. If we wish to exclude $\mu_i$’s from consideration, we should do so before the calculation, thus eliminating the corresponding $P(D|\mu_i)$’s from contention for the maximum likelihood.

Expanding $X$ to a countably infinite set poses no problem. In the continuous case, we must work with intervals (or measurable sets) rather than point values. For any $\epsilon>0$ and any set of nonzero measure containing all the maximum likelihood values, there will be some $k$ that concentrates all but $\epsilon$ of the posterior probability in that set.

Note that $k$ depends on the choice of measurable set, and care must be taken when considering limits of such sets. For example, let $p\equiv \max_{\mu} P(D|\mu)$ be the maximum likelihood probability. If we consider an interval $I\equiv (p-\delta/2,p+\delta/2)$ as our maximum likelihood set, then the maximum likelihood “value” is the (measurable) set $V\equiv P(D|\mu)^{-1}(I)$. For any $\epsilon$, we have a $k$ as discussed above, such that $P(\mu\notin V|D^j)<\epsilon$ for $j>k$. However, for a fixed $\epsilon$, that $k$ will vary with $\delta$. Put another way, we cannot simply assume uniform convergence.

We can view infinite iteration as a modification of the prior. Specifically, it is tantamount to pruning the prior of all non-maximum-likelihood values and renormalizing it accordingly. The posterior then is equal to the prior under subsequent single-$D$ steps (i.e. it is a fixed point distribution). Alternatively, we can view the whole operation as a single $D^\infty$ update. In that case, we keep the original prior and view the posterior as the aforementioned pruned version of the prior.

There are two takeaways here:

1. The infinite iteration approach simply amounts to maximum-likelihood selection. It selects the maximum likelihood value(s) from the known $P(D|\mu)$ and maintains their relative prior probabilities, suitably renormalized. Equivalently, it prunes all the non-maximum-likelihood values.
2. The resulting posterior still depends on the choice of prior unless the maximum likelihood value is unique, in which case that value has probability $1$.

Unlike stationary distributions of Markov chains, the result is not guaranteed to be independent of our arbitrary initial choice — in this case, the prior $P(\mu)$. Though true independence only is achieved when there is a unique maximum likelihood value, the dependence is reduced significantly even when there is not. The posterior depends only on those prior values corresponding to maximum likelihood $\mu$’s. All others are irrelevant. The maximum likelihood values typically form a tiny subset of $\mu$’s, thus eliminating most dependence on the prior. Note that such degeneracy (as well as the values themselves) is solely determined by the likelihood function.

# Trash Talk

These days, every minor institutional faux pas draws a melodramatic fawning apology utterly devoid of a modicum of self-respect and expressed through the metallic insincerity of boilerplate buzzwords. By now, one or more generations have grown up bombarded with such nonsense. We only can imagine what they must be like at home…

Hey Bob, listen, it’s no big deal, but could you take out the trash when it’s your turn? It’s really been piling up.

I’ve heard you loud and clear.

Fantastic.

My top priority has been fostering a community which values inclusiveness, mutual respect, and constructive engagement. A place where all perspectives, values, diverse viewpoints, and lifestyles are cherished.

Um, ok. Sure.

I realize I’ve fallen far short of my high ideals in this regard, and promise to do better.

Great. So … you’ll take out the trash?

But it’s not enough to be sorry. I know there has been an inexcusable breach of trust, and that my actions have caused deep hurt and lasting anguish.

If you feel bad, you could, like, take out the trash.

I can do better. I will not be complacent in the face of such a challenge. This is an opportunity for reflection and learning, to grow into a better version of myself.

Really, it’s not that big a deal. You just take the trash and put it in the bin.

Change is necessary, and the first step toward such change is to understand the scope of the problem.

That’s easy. The schedule is on the fridge. Just, you know, do it.

Toward this end, I have identified several important steps.

There’s really just one.

First, I enrolled in a sixteen-week sensitivity training course, mandatory for me, myself, and I.

Is that the reason you didn’t do any other chores for the last sixteen weeks?

I also hired an outside firm to thoroughly investigate my past behaviors and recommend a path forward. You may have noticed them here and there recently.

You mean that guy who crashed on the couch and ate all my Doritos? I thought he was a friend of yours.

After a rigorous investigation, we have concluded that all policies and procedures were followed and there was no misconduct.

You’re not going to take out the trash, are you?

The repercussions of trashgate are ongoing, and I will not rest on my laurels. I can do better, and I will do better.

Can part of your “not resting” involve moving trash from the kitchen to the bin?

That I did not intend my actions to be offensive is no excuse for the anxiety and pain they have caused.

It doesn’t smell great, and can attract roaches.

Nor do those actions reflect who I am as a person.

Pretty sure they do.

However, in the face of the continuing public reaction, my involvement can only serve to distract from our community’s valuable mission.

I think I know where this is going.

In consultation with myself, I have concluded that the best way for us all to move forward is for me to step down from my trash removal responsibilities.

You know, you could have just refused up front.

Although my formal role has diminished, I will remain active in other aspects of our vibrant and innovative community.

In other words, you’ll continue to use the foosball table.

I only hope these steps can bring some small measure of closure to those who have suffered through my thoughtless actions.

The only closure we need is of the trash bin.