
Semidirect Products, Group Extensions, Split Exact Sequences, and all that

July 8, 2023 K.M. Halpern

UPDATED SUBSTANTIALLY 7/6/2023
[Original Version Posted 1/22/2019]

PDF Notes: Semidirect Products, Group Extensions, Split Exact Sequences, and all that

If you suffer physics-brain and don’t know anything about semidirect products or group extensions, but get the sense direct products just aren’t cutting it — these are the notes for you.

Update: On reexamination, my original post on the subject was too brief and simplistic. Among other things, it neglected to explicitly construct the multiplication on a group extension. In the process of doing so, I identified numerous other shortcomings and omissions. As a result, I’ve replaced it with a far more thorough discussion. Unfortunately, at 50 pages it has expanded beyond the point where the primitive wordpress support for latex suffices. The notes now are provided as a pdf (linked at the top and bottom of this post). This has the added benefit of allowing me to color-code proofs, comments, and examples, which makes the key elements easier to read.

The purpose of this post is to cure one aspect of physics-brain. In a typical physics education, we learn a sloppy version of linear algebra, basic group theory, and (certain) differential equations many times, and little else math-related. As a result, we develop numerous bad habits and suffer from a complete ignorance of certain important areas of mathematics. If two groups are being combined, it’s via a direct product. After all, what else is there? Unfortunately, this only works until it doesn’t, which is pretty much anywhere it matters. For example, even O(3,1) is not a direct product of 2 copies of Z2 and SO+(3,1). You’ve probably heard of O(3,1). It matters.

Many of the things we naively assume would be direct products actually are semidirect products or general group extensions, and the result can be a lot of confusion when things don’t work as expected. Besides the construction of basic physical groups such as the Poincare group, this also arises in the study of quantum mechanics. Part of the reason we can pretend to work in Hilbert space when the actual state space is a projective Hilbert space is that the projective representations of a group lift to unitary representations of a different group. The latter is a particular group extension (in fact, a central extension) of the original group.

Ignorance of semidirect products and group extensions is quite understandable in a physicist, but also easily cured. These notes are an attempt to do so based on my experience trying to cure myself. They are self-contained, and the only prerequisite is a little group theory. Though I allude to topology in a couple of places, no knowledge of it is necessary.

These notes cover the following:

Normal subgroups and quotient groups, the isomorphism thms, various other group-related concepts, and inner/outer automorphisms.
Exact sequences, short-exact-sequences (SES’s), splitting of SES’s, central extensions, and the isomorphism classes of SES’s (as well as how they interact with splitting).
Introduction to the hierarchy of direct products, semidirect products, and group extensions. We take 3 views of each: (1) an external view in which we build a new group from two distinct groups, (2) an internal view in which we consider the relationship between an existing group and its subgroups, and (3) an SES-view in which we frame things in terms of short exact sequences. Because the external-view is conceptually the most challenging, we give a preview of its regimen.
The direct product in all 3 views.
Semidirect products in all 3 views.
General group extensions in all 3 views.
A brief comparison of direct products, semidirect products, and group extensions in all 3 views.
An addendum in which we explicitly construct the multiplication on a group extension in gory detail.

A few caveats. There are lots of detailed proofs in the notes. There also probably are typos and possibly errors. I hope to correct and clarify as needed, so please report these to me if you encounter any. The notes were written in passes and phases. There is a lot of intentional replication of effort in the proofs (where it serves the purpose of clarity), but also probably some unintentional replication (for example, if I consolidated some results in a proposition in one place, but failed to use those results in another). Nonetheless, I believe they should be quite informative and are on balance both correct and quite comprehensible. In particular, I tried to avoid “simplifying” things in subsequent passes as my understanding improved (though I certainly did correct things as needed). The danger in doing so is that I would lose sight of the things which confused me when first encountering the concepts, and one purpose of these notes is to address such sources of confusion. After enough passes and simplifications, the entire piece would end up a one page terse Bourbaki-like statement, along with a vague hint at the possibility of an idea of a proof-outline, which anyone worth their salt is expected to easily be able to expand into a 40-page proof. In math, the best way to learn a subject is to already know it. In physics, we take a gentler approach.

There are a couple of key omissions (neither mission-central) which I hope to address in future posts. These concern (1) the relationship between semidirect products/group extensions and fiber bundles/principal bundles and (2) a detailed discussion of group cohomology and the classification of group extensions.

PDF Notes: Semidirect Products, Group Extensions, Split Exact Sequences, and all that


What happens when you iterate Bayesian Inference with the same data set?

November 28, 2021 K.M. Halpern

I’ve recently been reviewing Bayesian networks with an eye to learning STAN. One question which occurred to me is the following. Suppose we are interested in the probability distribution $P(\mu)$ over parameters $\mu\in X$ (with state space $X$ ). We acquire some data $D$ , and wish to use it to infer $P(\mu)$ . Note that $D$ refers to the specific realized data, not the event space from which it is drawn.

Let’s assume that (1) we have a prior $P(\mu)$ , (2) the likelihood $P(D|\mu)$ is easy to compute or sample, and (3) the normalization $P(D)\equiv \sum_{\mu\in X} P(D|\mu)P(\mu)$ is not expensive to compute or adequately approximate.

The usual Bayesian approach involves updating the prior to a posterior via Bayes’ thm: $P(\mu|D)= \frac{P(D|\mu)P(\mu)}{P(D)}$ . However, there also is another view we may take. We need not restrict ourselves to a single Bayesian update. It is perfectly reasonable to ask whether multiple updates using the same $D$ would yield a more useful result.

Such a tactic is not as ridiculous or unjustified as it first seems. In many cases, the Bayesian posterior is highly sensitive to a somewhat arbitrary choice of prior $P(\mu)$ . The latter frequently is dictated by practical considerations rather than arising naturally from the problem at hand. For example, we often use the likelihood function’s conjugate prior to ensure that the posterior will be of the same family. Even in this case, the posterior depends heavily on the precise choice of $P(\mu)$ .

Though we must be careful interpreting the results, there very well may be applications in which an iterated approach is preferable. For example, it is conceivable that multiple iterations could dilute the dependence on $P(\mu)$ , emphasizing the role of $D$ instead. We can seek inspiration in the stationary distributions of Markov chains, where the choice of initial distribution becomes irrelevant. As a friend of mine likes to say before demolishing me at chess: let’s see where this takes us. Spoiler: infinite iteration “takes us” to maximum-likelihood selection.

An iterated approach does not violate any laws of probability. Bayes’ thm is based on the defining property $P(\mu,D)= P(D|\mu)P(\mu)= P(\mu|D)P(D)$ . Our method is conceptually equivalent to performing successive experiments which happen to produce the same data $D$ each time, reinforcing our certainty around it. Although its genesis is different, the calculation is the same. I.e., any inconsistency or inapplicability must arise through interpretation rather than calculation. The results of an iterated calculation may be inappropriate for certain purposes (such as estimating error bars, etc), but could prove useful for others.

In fact, one could argue there only are two legitimate approaches when presented with a one-time data set $D$ . We could apply it once or an infinite number of times. Anything else would amount to an arbitrary choice of the number of iterations.

It is easy to analyze the infinite iteration process. For simplicity, we’ll consider the case of a discrete, finite state space $X$ . $D$ is a fixed set of data values for our problem, so we are not concerned with the space or distribution from which it is drawn. $P(D)$ is a derived normalization factor, nothing more.

Let’s introduce some notation:

– Let $n\equiv |X|$ , and denote the elements of $X$ by $\mu_1\dots \mu_n$ .
– We could use $n$ -vectors to denote probability or conditional probability distributions over $X$ (with the $i^{th}$ component the probability of $\mu_i$ ), but it turns out to be simpler to use diagonal $n\times n$ matrices.
– $P(\mu)$ is an $n$ -vector, which we’ll write as a diagonal $n\times n$ matrix $v$ with $v_{ii}\equiv P(\mu_i)$ .
– We’ll denote by $D^k$ the data set $D$ repeated $k$ times. I.e., the equivalent of having performed an experiment $k$ times and obtained $D$ each time.
– $P(\mu|D)$ is an $n$ -vector, which we’ll write as a diagonal $n\times n$ matrix $v'$ with $v'_{ii}\equiv P(\mu_i|D)$ .
– Where multiple updates are involved, we denote the final posterior $P(\mu|D^k)$ via an $n\times n$ diagonal matrix $v^{(k)}$ , with $v^{(k)}_{ii}\equiv P(\mu_i|D^k)$ . Note that $v'= v^{(1)}$ and $v= v^{(0)}$ .
– $P(D|\mu)$ is an $n$ -vector of probabilities as well, but we’ll also treat it as a diagonal $n\times n$ matrix $M$ with $M_{ii}\equiv P(D|\mu_i)$ .
– $P(D)=\sum_{i=1}^n P(D|\mu_i)P(\mu_i)$ is a scalar. In our notation, $P(D)= \text{tr}~ M v$ .

A single Bayesian update takes the form $v'= M v/(\text{tr}~ M v)$ . What happens if we repeat this? A second iteration substitutes $v'$ for $v$ , and we get $v^{(2)}= M v'/(\text{tr}~ M v')$ . This is homogeneous of degree $0$ in $v'$ , so the $(\text{tr}~ M v)$ normalization factor in $v'$ disappears. We thus have $v^{(2)}= M^2 v /(\text{tr}~ M^2 v)$ . The same reasoning extends to $v^{(k)}= M^k v/(\text{tr}~ M^k v)$ .
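The closed form can be checked numerically. The following is a minimal sketch, not part of the original analysis: it stores the diagonals of $v$ and $M$ as plain Python lists, and the function names and the example numbers are hypothetical choices made for illustration.

```python
def bayes_update(prior, likelihood):
    """One update: v'_i = M_i v_i / tr(M v), with the diagonals stored as lists."""
    weights = [l * p for l, p in zip(likelihood, prior)]
    total = sum(weights)  # this is P(D) = tr(M v)
    return [w / total for w in weights]


def iterate_updates(prior, likelihood, k):
    """Apply the same update k times, reusing the same likelihood (same D)."""
    v = list(prior)
    for _ in range(k):
        v = bayes_update(v, likelihood)
    return v


def closed_form(prior, likelihood, k):
    """Closed form v^(k) = M^k v / tr(M^k v)."""
    weights = [(l ** k) * p for l, p in zip(likelihood, prior)]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical example values (not from the post).
prior = [0.7, 0.2, 0.1]
likelihood = [0.2, 0.5, 0.3]

stepwise = iterate_updates(prior, likelihood, 5)
direct = closed_form(prior, likelihood, 5)
```

Up to floating-point rounding, `stepwise` and `direct` agree, which is the degree-$0$ homogeneity argument in numerical form: the intermediate normalizations cancel.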

It now is easy to see what is happening. Suppose $n=2$ , and let $M_{11}>M_{22}$ . Our expression for $P(\mu_1|D)$ after $k$ iterations is $v^{(k)}_1= \frac{M^k_{11} v_{11}}{M^k_{11} v_{11} + M^k_{22} v_{22}}$ .

This has the form $\frac{a^k x}{a^k x + b^k y}$ , which can be written $1/(1+\frac{b^k y}{a^k x})$ . We know that $b<a$ , so as long as $x\ne 0$ we have $\lim_{k\rightarrow\infty} \frac{b^k y}{a^k x}= 0$ . Specifically, for $\epsilon>0$ we have $\frac{b^k y}{a^k x}<\epsilon$ for $k>\frac{\ln\epsilon + \ln \frac{x}{y}}{\ln \frac{b}{a}}$ . Note that the denominator is negative since $a>b$ and the numerator is negative for small enough $\epsilon$ .
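The bound on $k$ can be turned into a small helper. This is a sketch under the same assumptions as the text ($a>b>0$ and $x,y>0$); the function names and the example values are hypothetical.

```python
import math


def suppression_ratio(a, b, x, y, k):
    """The quantity (b^k y) / (a^k x) that must fall below epsilon."""
    return (b / a) ** k * (y / x)


def min_iterations(a, b, x, y, eps):
    """Smallest integer k >= 0 with (b/a)^k * (y/x) < eps.

    Assumes a > b > 0, x > 0, y > 0 and eps > 0.
    """
    # k must strictly exceed (ln eps + ln(x/y)) / ln(b/a); ln(b/a) is negative.
    bound = (math.log(eps) + math.log(x / y)) / math.log(b / a)
    return max(0, math.floor(bound) + 1)


# Hypothetical example: likelihoods a, b and prior weights x, y.
a, b, x, y, eps = 0.5, 0.3, 0.2, 0.8, 1e-3
k = min_iterations(a, b, x, y, eps)
```

Here `suppression_ratio(a, b, x, y, k)` is below `eps`, while the same ratio at `k - 1` is not. Note that a prior which favors the lower-likelihood value (large $y/x$) only delays the suppression by an additive number of iterations.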

We therefore have shown that (in this simple case), $\lim_{k\rightarrow\infty} v^{(k)}_1= 1$ . If we perform the same analysis for $v^{(k)}_2$ , we get $v^{(k)}_2= \frac{M^k_{22} v_{22}}{M^k_{11} v_{11} + M^k_{22} v_{22}}$ , which corresponds to $1/(1+\frac{a^k x}{b^k y})$ . The denominator diverges as $k\rightarrow\infty$ , and the limit is $0$ . We therefore see that $\lim_{k\rightarrow\infty} v^{(k)}_2= 0$ .

This trivially extends to $n>2$ . As $k\rightarrow\infty$ , all but the dominant $M_{ii}$ are exponentially suppressed. The net effect of infinite iteration is to pick out the maximum likelihood value. I.e., we select the $\mu_i$ corresponding to the maximum $M_{ii}$ . All posterior probability is concentrated in that. Put another way, the limit of iterated posteriors is $P(\mu_i|D^\infty)= 1$ for $i=\text{argmax}_j~P(D|\mu_j)$ and $0$ for all others.

What if the maximum $M_{ii}$ is degenerate? Let’s again consider the simple $n=2$ case, but now with $M_{11}= M_{22}>0$ . It is easy to see what happens in this case. $a/b=1$ , so $v^{(k)}_1= \frac{v_{11}}{v_{11}+v_{22}}$ and $v^{(k)}_2= \frac{v_{22}}{v_{11}+v_{22}}$ . Note that $v_{11}+v_{22}=1$ here, but we stated the denominator explicitly to facilitate visualization of the extension to $n>2$ .

This extension is straightforward. We pick out the maximum likelihood values $\mu_i$ , and they are assigned their prior probabilities, renormalized. Suppose there are $m\le n$ degenerate maximum $M_{ii}$'s, with indices $i_1\dots i_m$ (each $i_j\in 1\dots n$ ). The limit of iterated posteriors is $P(\mu_{i_j}|D^\infty)= \frac{P(\mu_{i_j})}{\sum_{l=1}^m P(\mu_{i_l})}$ for $j=1\dots m$ , and $0$ for all other $\mu_i$ . This reduces to our previous result when $m=1$ .
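The degenerate case can be illustrated with a short sketch. It is not from the original post: the function names, the tie tolerance `tol`, and the example numbers are hypothetical. Likelihoods are divided by their maximum before exponentiation purely to avoid floating-point underflow; this rescaling cancels in the normalization.

```python
def pruned_prior(prior, likelihood, tol=1e-12):
    """Limit of iterated posteriors: prior restricted to the max-likelihood entries, renormalized."""
    top = max(likelihood)
    # tol is a hypothetical tolerance for deciding that two likelihoods tie.
    kept = [p if abs(l - top) <= tol else 0.0 for p, l in zip(prior, likelihood)]
    total = sum(kept)
    if total == 0:
        raise ValueError("prior gives zero mass to every maximum-likelihood value")
    return [p / total for p in kept]


def iterated_posterior(prior, likelihood, k):
    """v^(k) = M^k v / tr(M^k v), computed with likelihoods rescaled by their maximum."""
    top = max(likelihood)
    weights = [((l / top) ** k) * p for l, p in zip(likelihood, prior)]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical example: the first two values tie for maximum likelihood.
prior = [0.1, 0.3, 0.6]
likelihood = [0.4, 0.4, 0.1]

limit = pruned_prior(prior, likelihood)
approx = iterated_posterior(prior, likelihood, 200)
```

In this example the third value starts with most of the prior mass but ends with essentially none, while the two tied values keep their relative prior weights of 1:3.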

Note that we must ensure $v_{ii}\ne 0$ for the maximum likelihood $\mu_i$'s. I.e., we cannot have a $0$ prior for any of the maximum likelihood values. If we wish to exclude $\mu_i$'s from consideration, we should do so before the calculation, thus eliminating the corresponding $P(D|\mu_i)$'s from contention for the maximum likelihood.

Expanding $X$ to a countable set poses no problem. In the continuous case, we must work with intervals (or measurable sets) rather than point values. For any $\epsilon>0$ and any set of nonzero measure containing all the maximum likelihood values, there will be some $k$ that concentrates all but $\epsilon$ of the posterior probability in that set.

Note that $k$ depends on the choice of measurable set, and care must be taken when considering limits of such sets. For example, let $p\equiv \max_{\mu} P(D|\mu)$ be the maximum likelihood probability. If we consider an interval $I\equiv (p-\delta/2,p+\delta/2)$ as our maximum likelihood set, then the maximum likelihood “value” is the (measurable) set $V\equiv P(D|\mu)^{-1}(I)$ . For any $\epsilon$ , we have a $k$ as discussed above, such that $P(\mu\notin V|D^j)<\epsilon$ for $j>k$ . However, for a fixed $\epsilon$ , that $k$ will vary with $\delta$ . Put another way, we cannot simply assume uniform convergence.

We can view infinite iteration as a modification of the prior. Specifically, it is tantamount to pruning the prior of all non-maximum-likelihood values and renormalizing it accordingly. The posterior then is equal to the prior under subsequent single- $D$ steps (i.e. it is a fixed point distribution). Alternatively, we can view the whole operation as a single $D^\infty$ update. In that case, we keep the original prior and view the posterior as the aforementioned pruned version of the prior.

There are two takeaways here:

1. The infinite iteration approach simply amounts to maximum-likelihood selection. It selects the maximum likelihood value(s) from the known $P(D|\mu)$ and maintains their relative prior probabilities, suitably renormalized. Equivalently, it prunes all the non-maximum-likelihood values.
2. The resulting posterior still depends on the choice of prior unless the maximum likelihood value is unique, in which case that value has probability $1$ .

Unlike stationary distributions of Markov chains, the result is not guaranteed to be independent of our arbitrary initial choice — in this case, the prior $P(\mu)$ . Though true independence only is achieved when there is a unique maximum likelihood value, the dependence is reduced significantly even when there is not. The posterior depends only on those prior values corresponding to maximum likelihood $\mu$ ‘s. All others are irrelevant. The maximum likelihood values typically form a tiny subset of $\mu$ ‘s, thus eliminating most dependence on the prior. Note that such degeneracy (as well as the values themselves) is solely determined by the likelihood function.


Be Careful Interpreting Covid-19 Rapid Home Test Results

October 3, 2021 K.M. Halpern

Now that Covid-19 rapid home tests are widely available, it is important to consider how to interpret their results. In particular, I’m going to address two common misconceptions.

To keep things grounded, let’s use some actual data. We’ll assume a false positive rate of 1% and a false negative rate of 35%. These numbers are consistent with a March, 2021 metastudy [1]. We’ll denote the false positive rate $E_p=0.01$ and the false negative rate $E_n=0.35$ .

It may be tempting to assume from these numbers that a positive rapid covid test result means you’re 99% likely to be infected, and a negative result means you’re 65% likely not to be. Neither need be the case. In particular,

A positive result does increase the probability you have Covid, but by how much depends on your prior. This in turn depends on how you are using the test. Are you just randomly testing yourself, or do you have some strong reason to believe you may be infected?
A negative result has little practical impact on the probability you have Covid.

These may seem counterintuitive or downright contradictory. Nonetheless, both are true. They follow from Bayes’ thm.

Note that when I say that the test increases or decreases “the probability you have Covid,” I refer to knowledge not fact. You either have or do not have Covid, and taking the test obviously does not change this fact. The test simply changes your knowledge of it.

Also note that the limitations on inference I will describe do not detract from the general utility of such tests. Used correctly, they can be extremely valuable. Moreover, from a behavioral standpoint, even a modest non-infinitesimal probability of being infected may be enough to motivate medical review, further testing, or self-quarantine.

Let’s denote by $C$ the event of having Covid, and by $T$ the event of testing positive for it. $P(C)$ is the prior probability of having covid. It is your pre-test estimate based on everything you know. For convenience, we’ll often use $\mu$ to denote $P(C)$ .

If you have no information and are conducting a random test, then it may be reasonable to use the general local infection rate as $P(C)$ . If you have reason to believe yourself infected, a higher rate (such as the fraction of symptomatic people who test positive in your area) may be more suitable. $P(C)$ should reflect the best information you have prior to taking the test.

The test adds information to your prior $P(C)$ , updating it to a posterior probability of infection $P(C|O)$ , where $O$ denotes the outcome of the test: either $T$ or $\neg T$ .

In our notation, $P(\neg T|C)= E_n$ and $P(T|\neg C)= E_p$ . These numbers are properties of the test, independent of the individuals being tested. For example, the manufacturer could test 1000 swabs known to be infected with covid from a petri dish, and $E_n$ would be the number which tested negative divided by 1000. Similarly, they could test 1000 clean swabs, and $E_p$ would be the number which tested positive divided by 1000.

What we care about are the posterior probabilities: (1) the probability $P(C|T)$ that you are infected given that you tested positive, and (2) the probability that you are not infected given that you tested negative $P(\neg C|\neg T)$ . I.e. the probabilities that the test correctly reflects your infection status.

Bayes’ Thm tells us that $P(A|B)= \frac{P(B|A)P(A)}{P(B)}$ , a direct consequence of the fact that $P(A|B)P(B)= P(B|A)P(A)= P(A\cap B)$ .

If you test positive, what is the probability you have Covid? $P(C|T)= \frac{P(T|C)P(C)}{P(T|C)P(C)+P(T|\neg C)P(\neg C)}$ , which is $\frac{(1-E_n)\mu}{(1-E_n)\mu+E_p(1-\mu)}$ . The prior of infection was $\mu$ , so you have improved your knowledge by a factor of $\frac{(1-E_n)}{(1-E_n)\mu+E_p(1-\mu)}$ . For $\mu$ small relative to $E_p$ , this is approximately $\frac{1-E_n}{E_p}$ , which is $65$ for our values of $E_p$ and $E_n$ .
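The positive-test update is easy to evaluate directly. The following is a minimal sketch using the post's $E_p=0.01$ and $E_n=0.35$ as defaults; the function names and the sample priors are hypothetical.

```python
def posterior_given_positive(mu, e_p=0.01, e_n=0.35):
    """P(C|T) for prior mu, false positive rate e_p and false negative rate e_n."""
    true_pos = (1 - e_n) * mu     # P(T|C) P(C)
    false_pos = e_p * (1 - mu)    # P(T|not C) P(not C)
    return true_pos / (true_pos + false_pos)


def improvement_factor(mu, e_p=0.01, e_n=0.35):
    """Ratio of posterior to prior after a positive result."""
    return posterior_given_positive(mu, e_p, e_n) / mu


# Hypothetical priors, chosen only to show how the factor varies.
for mu in (1e-6, 0.01, 0.2):
    print(mu, posterior_given_positive(mu), improvement_factor(mu))
```

For very small priors the factor approaches $(1-E_n)/E_p$, and it shrinks toward $1$ as the prior grows, since a posterior can never exceed $1$.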

Suppose you randomly tested yourself in MA. According to data from Johns Hopkins [2], at the time of this writing there have been around 48,000 new cases reported in MA over the last 28 days. MA has a population of around 7,000,000. It is reasonable to assume that the actual case rate is twice that reported (in the early days of Covid, the unreported factor was much higher, but let’s assume it presently is only $1\times$ ).

Let’s also assume that any given case tests positive for 14 days. I.e., 24,000 of those cases would test positive at any given time in the 4 week period (of course, not all fit neatly into the 28 day window, but if we assume similar rates before and after, this approach is fine). Including the unreported cases, we then have 48,000 active cases at any given time. We thus have a state-wide infection rate of $\frac{48000}{7000000}\approx 0.00685$ , or about 0.7%. We will define $\mu_{MA}\equiv 0.00685$ .
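The arithmetic behind this prior can be written out as a short sketch. The variable names are hypothetical; the numbers are the ones assumed in the text (48,000 reported cases over 28 days, a population of 7,000,000, one unreported case per reported case, and 14 days of testing positive).

```python
reported_cases_28d = 48_000   # reported new cases over the last 28 days
population = 7_000_000        # approximate MA population
unreported_factor = 1.0       # assumed unreported cases per reported case
days_positive = 14            # assumed days a case tests positive
window_days = 28

# Reported cases that would test positive at any given time in the window.
reported_active = reported_cases_28d * days_positive / window_days

# Add the unreported cases to get all active cases.
total_active = reported_active * (1 + unreported_factor)

mu_ma = total_active / population
```

Each of the assumptions enters linearly, so doubling the unreported factor or the positive-test duration changes the prior proportionally.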

Using this prior, a positive test means you are $45\times$ more likely to be infected post-test than pre-test. This seems significant! Unfortunately, the actual probability is $P(C|T)= 0.31$ .

This may seem terribly counterintuitive. After all, the test had a 1% false positive rate. Shouldn’t you be 99% certain you have Covid if you test positive? Well, suppose a million people take the test. With a 0.00685 unconditional probability of infection, we expect 6850 of those people to be infected. $E_n=0.35$ , so only 4453 of those will test positive.

However, even with a tiny false positive rate of $E_p=0.01$ , 9932 people who are not infected also will test positive. The problem is that there are so many more uninfected people being tested that $E_p=0.01$ still generates lots of false positives. If you test positive, you could be in the 9932 people or the 4453 people. Your probability of being infected is $\frac{4453}{9932+4453}= 0.31$ .
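The head count above can be reproduced with a few lines. This is a sketch with hypothetical function and key names; it computes expected (fractional) counts, which the text rounds to whole people.

```python
def expected_counts(n_people, mu, e_p=0.01, e_n=0.35):
    """Expected numbers of true and false positives when n_people are tested."""
    infected = n_people * mu
    uninfected = n_people - infected
    return {
        "infected": infected,
        "true_positives": infected * (1 - e_n),
        "false_positives": uninfected * e_p,
    }


counts = expected_counts(1_000_000, 0.00685)

# Share of positive tests that come from infected people, i.e. P(C|T).
share_infected = counts["true_positives"] / (
    counts["true_positives"] + counts["false_positives"]
)
```

The false positives outnumber the true positives by more than two to one here, which is why `share_infected` lands near 0.31 rather than 0.99.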

Returning to the general case, suppose you test negative. What is the probability you do not have Covid? $P(\neg C|\neg T)= \frac{P(\neg T|\neg C)P(\neg C)}{P(\neg T|\neg C)P(\neg C)+P(\neg T|C)P(C)}= \frac{(1-E_p)(1-\mu)}{(1-E_p)(1-\mu)+E_n\mu}$ . For small $\mu$ this is approximately $1$ unless $E_p$ is very close to $1$ . Specifically, it expands to $1-\frac{E_n}{(1-E_p)}\mu+O(\mu^2)$ .

Under $\mu_{MA}$ as the prior, the probability of being uninfected post-test is 0.99757 vs 0.9932 pre-test. For all practical purposes, our knowledge has not improved.
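The negative-test numbers can be checked the same way. This is a minimal sketch with hypothetical function names; it evaluates both the exact expression and the first-order expansion in $\mu$.

```python
def uninfected_given_negative(mu, e_p=0.01, e_n=0.35):
    """P(not C | not T) for prior mu."""
    true_neg = (1 - e_p) * (1 - mu)   # P(not T|not C) P(not C)
    false_neg = e_n * mu              # P(not T|C) P(C)
    return true_neg / (true_neg + false_neg)


def first_order_approx(mu, e_p=0.01, e_n=0.35):
    """Expansion 1 - E_n/(1 - E_p) * mu, dropping O(mu^2) terms."""
    return 1 - e_n / (1 - e_p) * mu


mu_ma = 0.00685
exact = uninfected_given_negative(mu_ma)
approx = first_order_approx(mu_ma)
pre_test = 1 - mu_ma
```

The gain `exact - pre_test` is well under half a percentage point, which is the sense in which the negative result adds little at this prior.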

This too may seem counterintuitive. As an analogy, suppose in some fictional land earthquakes are very rare. Half of them are preceded by a strong tremor the day before (and such a tremor always heralds a coming earthquake), but the other half are unheralded.

If you feel a strong tremor, then you know with certainty that an earthquake is coming the next day. Suppose you don’t feel a strong tremor. Does that mean you should be more confident that an earthquake won’t hit the next day? Not really. Your chance of an earthquake has decreased by about a factor of two, but earthquakes were very rare to begin with, so the default prediction that there wouldn’t be one only is marginally changed by the absence of a tremor the day before.
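The analogy can be made quantitative. The sketch below assumes a hypothetical daily earthquake probability of 0.001 (the text says only that earthquakes are very rare); the function name is also hypothetical.

```python
def quake_given_no_tremor(p_quake, p_tremor_given_quake=0.5):
    """P(quake tomorrow | no tremor today), assuming tremors never occur without a quake."""
    unheralded = (1 - p_tremor_given_quake) * p_quake   # quake coming, but no tremor
    no_quake = 1 - p_quake                              # no quake, hence no tremor
    return unheralded / (unheralded + no_quake)


p_quake = 0.001  # hypothetical prior probability of an earthquake tomorrow
p_after = quake_given_no_tremor(p_quake)

confidence_before = 1 - p_quake   # confidence in "no quake" before observing anything
confidence_after = 1 - p_after    # confidence in "no quake" after a tremor-free day
```

Under these assumptions the conditional chance of a quake drops to roughly half its prior value, yet the confidence that no quake is coming moves by only about 0.0005.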

Of course, $\mu_{MA}$ generally is not the correct prior to use. If you take the test randomly or for no particular reason, then your local version of $\mu_{MA}$ may be suitable. However, if you have a reason to take the test then your $\mu$ is likely to be much higher.

Graphs 1 and 2 below illustrate the information introduced by a positive or negative test result as a function of the choice of prior. In each, the difference in probability is the distance between the posterior and prior graphs. The prior obviously is a straight line since we are plotting it against itself (as the $x$ -axis). Note that graph 1 has an abbreviated $x$ -axis because $P(C|T)$ plateaus quickly.

From graph 1, it is clear that except for small priors (such as the general infection rate in an area with very low incidence), a positive result adds a lot of information. For $\mu>0.05$ , it provides near certainty of infection.

From graph 2, we see that a negative result never adds terribly much information. When the prior is 1 or 0, we already know the answer, and the Bayesian update does nothing. The largest gain is a little over 0.2, but that’s only attained when the prior is quite high. In fact, there’s not much improvement at all until the prior is over 0.1. If you’re 10% sure you already have covid, a home test will help but you probably should see a doctor anyway.
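The shape described for graph 2 can be explored without the plot. The following sketch scans a grid of priors and records the gain in the probability of being uninfected after a negative result; the function name and the grid spacing are hypothetical choices.

```python
def gain_from_negative(mu, e_p=0.01, e_n=0.35):
    """P(not C | not T) minus the prior probability of being uninfected, 1 - mu."""
    true_neg = (1 - e_p) * (1 - mu)
    false_neg = e_n * mu
    posterior = true_neg / (true_neg + false_neg)
    return posterior - (1 - mu)


# Priors 0.00, 0.01, ..., 1.00 (a hypothetical grid).
priors = [i / 100 for i in range(101)]
gains = [gain_from_negative(mu) for mu in priors]

best_gain = max(gains)
best_prior = priors[gains.index(best_gain)]
```

On this grid the gain vanishes at both endpoints and `best_gain` is attained at a prior above one half, in line with the remark that the largest gain requires a high prior.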

Note that these considerations are less applicable to PCR tests, which can have sufficiently small $E_p$ and $E_n$ to result in near-perfect information for any realistic prior.

One last point should be addressed. How can tests with manufacturer-specific false positive and false negative rates depend on your initial guess at your infection probability? If you pick an unconditional local infection rate as your prior, how could they depend on the choice of locale (such as MA in our example)? That seems to make no sense. What if we use a smaller locale or a bigger one?

The answer is that the outcome of the test does not depend on such things. It is a chemical test being performed on a particular sample from a particular person. Like any other experiment, it yields a piece of data. The difference arises in what use we make of that data. Bayesian probability tells us how to incorporate the information into our previous knowledge, converting a prior to a posterior. This depends on that knowledge — i.e. the prior. How we interpret the result depends on our assumptions.

A couple of caveats to our analysis:

The irrelevance of a negative result only applies when you have no prior information other than some (low) general infection rate. If you do have symptoms or have recently been exposed or have any other reason to employ a higher prior probability of infection, then a negative result can convey significantly more information. Our dismissal of its worth was contingent on a very low prior.
Even in the presence of a very low prior probability of infection, general testing of students or other individuals is not without value. Our discussion applies only to the interpretation of an individual test result. In aggregate, the use of such tests still would produce a reasonable amount of information. Even if only a few positive cases are caught as a result and the overall exposure rate is lowered only a little, the effect can be substantial. Pathogen propagation is a highly nonlinear process, and a small change in one of the parameters can have a very large effect. One caution, however. If the results aren’t understood for what they are, overconfidence can result. The aggregate use of testing can have a substantial negative effect if it results in relaxation of other precautions due to overconfidence resulting from a misunderstanding of the information content of those test results.

References:

[1] Rapid, point‐of‐care antigen and molecular‐based tests for diagnosis of SARS‐CoV‐2 infection — Dinnes, et al. Note that “specificity” refers to $1-E_p$ and “sensitivity” refers to $1-E_n$ . See Wikipedia for further details.

[2] Johns Hopkins Covid-19 Dashboard


Fun with Voting in Cambridge, MA

September 30, 2021 K.M. Halpern

My city of Cambridge, MA is one of a few municipalities which employ ranked choice voting for City Council elections. Unlike most cities, the Mayor is chosen by the City Council and is largely a ceremonial position. Most real power resides with the City Manager, who is appointed for an indefinite term by the City Council. This means that City Councils which get to appoint a new City Manager exert an inordinate influence over the future course of the city. One such point is fast approaching. Unfortunately, given the present and probable near-term composition of the City Council, the decision likely will be based on considerations other than aptitude. However, putting aside my city’s somber prognosis, the upcoming City Council election is a good opportunity to discuss an unusual method of voting and some of its shortcomings.

Ordinary winner-takes-all elections dominate the popular consciousness. National elections are of this nature. It would not be inaccurate to observe that such an approach reflects the general weltanschauung of our culture. However, there are many other voting methods. In fact, voting theory is a vibrant field of research. Together with its sibling, auction theory, it forms part of the subject commonly known as “social choice theory”.

As an aside, I recently published a paper, Social Choice using Moral Metrics, in that field. It focuses on measuring distances between behaviors, rather than on voting systems per se. Back in 2008, I also wrote a voting theory piece about swing votes and block voting. What I termed “influence” in it is more commonly referred to as “voting power”. Neither is related to what I discuss in this post, but I encourage the interested reader to peruse them.

It may be argued that certain voting methods are fairer than others, by one or another definition of fairness. Particular flavors sometimes are advocated by those disenchanted with an existing method or those with an agenda to see some particular group gain influence. Calls for change sometimes arise in response to highly-visible anomalies, election outcomes which appear egregiously unfair even to disinterested eyes.

In elections with a large field of candidates or those in which a number of positions are simultaneously filled (such as the Cambridge City Council election), winner-takes-all voting may not be suitable or may give rise to such anomalies.

California’s recall system is an example. The ballot in that case has 2 questions: (1) whether to recall the governor and (2) who should replace him. The first question is winner-takes-all for the governor alone. If he loses, the 2nd question is winner-takes-all for the other candidates. It is quite possible for a candidate to be chosen who easily would have lost to the recalled governor one-on-one. In 2003, 44.6% of voters voted not to recall Governor Davis. He thus was recalled, and Schwarzenegger then won with 48.58% of the votes for replacement. It is highly unlikely that in a head-to-head gubernatorial election, Republican Schwarzenegger would have beaten Democrat Davis in the heavily blue state. However, Davis was excluded from this 2nd contest and Schwarzenegger was deemed preferable to the alternatives by most voters.

Arrow’s Theorem

It is natural to ask whether any voting system is unimpeachably fair, indicting the use of other systems as anachronistic or disingenuous. Arrow famously proved that, under even a small set of fairness constraints and for a broad class of voting systems, it is impossible to find one. Loosely speaking, when more than 2 candidates are present, no method of aggregating the rankings of candidates by voters into a single outcome ranking can simultaneously satisfy three conditions: (1) if every voter prefers candidate x to candidate y, then x outranks y in the outcome, (2) no single voter’s preference determines the outcome (i.e. no dictator), and (3) if each voter ranks x relative to y (i.e. above or below it) the same way in elections A and B (though the order can differ between voters, of course), then the outcomes of A and B do too. I.e., if voters change their overall ranking of x and y or the relative placement of other candidates, but don’t change whether x is preferred to y or vice versa, then whether x outranks y or vice versa in the outcome is unchanged.

It is quite plausible to add more fairness conditions, but most plausible definitions of fairness would require at least these three conditions to hold. Arrow showed that there is no ranked voting system (including “preponderance of the votes”) in which unfair anomalies cannot arise.

As an aside, if one were to relax a condition, the most palatable clearly would be (3). It is conceivable that a “fair” aggregation method may allow the overall ranking of candidates to affect a pairwise order in the outcome. However, this generally is deemed undesirable.

As with complexity results in computer science (CS) or Gödel’s incompleteness theorem in logic, the theoretical existence of hard or problematic cases does not necessarily pose a practical obstacle. In CS, an algorithm with worst-case exponential complexity may be far more useful than one with linear complexity in real-world applications. For example, the latter could have a huge constant cost (often referred to as a “galactic algorithm”) and the former could be exponential only in an infinitesimal fraction of cases or under circumstances which never arise in practice. Gödel’s theorem does have real-world examples (i.e. non-meta-theorems), but (at this point) they remain rare.

Though nowhere near as profound, Arrow’s theorem invites similar skepticism. The impossibility of a preference system which excludes all anomalies does not mean such anomalies arise in practice, or that a system which excludes all realistic anomalies cannot be found. Unfortunately (or fortunately, depending on one’s perspective), such anomalies do arise in practice. Worse, the systems in question often are of significant social import and subject to intense scrutiny. The anomalies which do arise can be quite visible and politically troublesome.

Social choice theory exhibits another critical difference from CS and logic, one which merits additional caution. The goal of logic, mathematics, and theoretical computer science generally is to understand which problems are solvable and how best to solve them. Anomalies are viewed as pathological and undesirable. They sometimes serve as useful counterexamples, guiding researchers to better understanding and helping them improve their tools. However, they are to be avoided in real-world applications. If a pathological case arises in such a context, alternate machinery must be employed or the framework modified to exclude it.

This need not be the case in social choice theory. Not everyone’s goal is aligned, or social choice would be unnecessary. With elections, there could be adverse incentives. It may be possible to game an election by identifying and exploiting anomalies endemic to the specific system involved. There also may be groups who strongly prefer that anomalies arise, either for purposes of fomenting discord or if those anomalies serve them well. For this reason, dismissing anomalies as almost impossible under some assumed prior may be naive. The prior must incorporate human behavior, and this very well could concentrate probability around the anomalies. Put another way, if we naively model the probability of anomalies arising using an assumption of ideal behavior we risk ignoring the very real possibility that participants will engineer or utilize anomalies.

This issue is related to Gibbard’s theorem, which loosely states that under even weaker conditions than Arrow’s theorem (at least 3 candidates and no dictator), there is no ideal ballot which reflects a voter’s preferences. Put another way, the voting system can be gamed. In fact, a voter may need to game it (perhaps in response to polls or other information) in order to best reflect their individual preferences. The optimal ballot ranking to enact a voter’s preferences may not be their actual preference ranking of candidates.

The Rules in Cambridge

What does all this have to do with the Cambridge elections? Cambridge employs a particular system of ranked choice voting, which they refer to as “Proportional Representation”. This often is portrayed as fairer, more democratic, and so on. I am going to offer an example of an egregious anomaly which can result. I do this not in the expectation that it will arise or be exploited. Nor do I hope to change a voting method that is, all things considered, quite reasonable. Rather, the anomaly serves as an illustrative example of the inherent problem with claiming that one voting system is “fairer” than another.

First, I’ll describe the precise rules of the Cambridge election, as best I understand them. See MA Election Laws, section 9 for details. State law governs the general rules for proportional representation voting in any Massachusetts municipalities which choose to employ it. Only certain parameters and details of execution are left to local discretion.

The City Council consists of 9 individuals, and the entire body is elected once every 2 years. Voters are presented with a list of candidates and may select a 1st choice, a 2nd choice, and so on. I do not recall the maximum number of choices which can be made, but let us suppose it is not limited. The anomaly arises whether or not this is the case. Note that a given voter is not required to rank all the candidates. They could select only their top 3 choices, for example. Whether or not a full ranking by each voter is required does not affect the anomaly.

First some definitions. N will denote the total number of ballots (i.e. the number of voters who participate in the election). At the time of writing, the minimum number of signatures to get on the ballot is 50. We’ll call this ‘M’, because State law gives it a role in the algorithm. Q=(N/10)+1 will be the “quota”, the minimum number of ballots a candidate needs to win.

Why not choose Q=N/9? The type of voting system we’re describing is sometimes referred to as “single-transferable-vote” (STV) because of the use of spillovers (described below). There are two common quota methods for determining STV winners: (1) “Hare” corresponds to Q=N/9, and (2) “Droop” corresponds to Q=(N/10)+1. In each case, we round up if needed. The two methods generally result in the same outcome or differ only in how the last winner is chosen. Each has benefits and drawbacks vis-a-vis what is deemed fair in terms of proportional representation. Among other things, the Droop quota tends to favor large parties over small. It also is the smallest quota which guarantees no more than 9 winners. As we will see, neither method guarantees a full complement of 9 winners. Regardless, the Droop quota is that used by Cambridge.
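To make the two quota formulas concrete, here is a minimal Python sketch. The function names and the `seats` parameter are illustrative labels rather than anything taken from the statute. The rounding follows the "round up if needed" convention stated above. Other write-ups of the Droop quota round N/10 down before adding 1, which can differ by one ballot when N is not a multiple of 10.

```python
def hare_quota(n_ballots: int, seats: int = 9) -> int:
    """Hare quota: N / seats, rounded up (integer ceiling division)."""
    return -(-n_ballots // seats)


def droop_quota(n_ballots: int, seats: int = 9) -> int:
    """Droop quota as written in the text: N / (seats + 1), rounded up, plus 1."""
    return -(-n_ballots // (seats + 1)) + 1


n = 10_000
print("Hare:", hare_quota(n))
print("Droop:", droop_quota(n))
# Ten candidates cannot all reach the Droop quota: that would need more ballots than exist.
print("10 x Droop exceeds N:", 10 * droop_quota(n) > n)
```

With N=10000 this gives a Hare quota of 1112 and a Droop quota of 1001. Ten candidates each meeting the Droop quota would require 10010 ballots, which is why no more than 9 can win.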

Once the ballots have been collected, a sequence of steps is performed by computer. An order of polling places is determined randomly by the city beforehand. Within each polling place, ballots are sorted by the choice of 1st place candidate (and then presumably randomly within each such cohort). The ballots then go through a series of stages. The first stage is special.

Stage 1: Any candidate who reaches Q votes is declared a winner. Subsequent 1st place votes for them are passed to the next ranked candidate on the ballot who has not already been declared a winner. Ex. if a ballot is reached with x, y, and z as the 1st, 2nd, and 3rd candidates, and both x and y already have been declared winners, it would go to z. If no non-winner choice remains on the ballot, it is swapped with a ballot that already was consumed by the winner and has non-winner choices on it. This minimizes the number of discarded ballots. Note that it always pays for a voter to rank a lot of choices, because otherwise some other voter may have their preference registered instead. It’s not clear from the law in what order the 1st place candidates’ ballots should be sorted, but we’ll assume randomly. It does not matter for the anomaly we will discuss. As the sorting proceeds, any candidate with Q votes (by spillover from other candidates or by being 1st on their own) is declared a winner, and any remaining votes for them spill over as described.
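The spillover rule itself is simple enough to sketch in a few lines of Python. This is only an illustration of the "next ranked candidate not already declared a winner" step. The function name and the use of a set of excluded candidates are my own assumptions, and the ballot-swapping refinement is not modeled.

```python
def next_eligible(ballot, excluded):
    """Return the highest-ranked candidate on the ballot who is not excluded.

    `ballot` lists candidates from 1st choice downward. `excluded` holds the
    candidates who can no longer receive the ballot (e.g. declared winners).
    Returns None if nobody on the ballot remains eligible.
    """
    for candidate in ballot:
        if candidate not in excluded:
            return candidate
    return None


# The example from the text: x and y already have won, so the ballot goes to z.
print(next_eligible(["x", "y", "z"], {"x", "y"}))
# A ballot with no eligible choice left yields None (swap or discard, per the rules).
print(next_eligible(["x", "y"], {"x", "y"}))
```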

Once this process has been completed, almost every ballot has been assigned to some candidate (i.e. either consumed by a winner or spilled over to a remaining candidate). Because of the ballot-swapping mechanism described, it is unlikely (but still possible) for ballots to have been discarded due to lack of non-winner alternatives. Each winner has consumed precisely Q ballots, and each remaining candidate has fewer than Q ballots. In what follows we use “higher-ranked” to refer to the preferred candidates on a ballot. In practice, this means they have been assigned a lower number. I.e., the 1st place candidate on a ballot is “higher-ranked” than the 2nd place candidate.

At this point, any candidate with fewer than M ballots (in our case 50) is declared to have lost. Their ballots are transferred in the same manner as before to the remaining candidates. Note that this form of elimination only takes place in this first round, since the number of ballots assigned to a candidate cannot decrease in subsequent rounds.

Stages 2+: If 9 candidates have been declared winners, the process ends. Otherwise, the trailing candidate is declared to have lost, and their votes are transferred (one by one) to the remaining candidates in the same manner as before, but with one important change. Unlike in the first round, if no remaining non-winner candidates are listed on a ballot, it is discarded rather than swapped with another. As before, any candidate who reaches Q votes is declared a winner and can accrue no more votes. There are some tie-breaker rules associated with determining who is the trailing candidate at the end of a given round, but we won’t go into those. If at any time, the number of winners plus remaining candidates is 9, all remaining candidates are declared winners. The round ends when every ballot in play either has been spilled over (once) or discarded. Those ballots not discarded or consumed by winners and those candidates not eliminated then proceed to the next round.

Note that a spillover never can result in a ballot being assigned to a higher-ranked candidate. For example, suppose a ballot already has been assigned to the 3rd listed candidate on it. This only could happen if there was a reason to skip the top 2. This means they either already were declared winners or already were eliminated. Nor do any swaps (possible only in the 1st round) affect this. Any subsequent spillovers must go to lower-ranked candidates, or the ballot would have been handed to a higher-ranked candidate already.

Note that unless every voter ranks every candidate, it is possible for some ballots to be discarded. This is highly unlikely in the first round, because swapping is allowed. However, in subsequent rounds ballots may be discarded if they list no candidates which remain in play (i.e. that have not already been declared winners or eliminated). Though there is a theoretical bound on the number of possible discarded ballots, it can be high.

It is quite possible for an insufficient number of winners to be declared. This is no surprise. If every voter lists the same three candidates, but no others, then only three candidates will win. Insufficient ranking by voters can lead to inadequate outcomes.

Unless the field of candidates is reduced below 9 in the first round (i.e. too few candidates meet the 50 vote threshold), there ultimately will be 9 winners. However, some may not get many votes. If every voter ranks every candidate, then all winners will meet quota. If not, some candidates may win without meeting quota by dint of being the last ones uneliminated.

A number of obvious anomalies come to mind. For example, if everyone votes for x, y, and z as the top 3 candidates but there is a huge field of candidates for 4th place (so that each gets 51 spillover votes), then the remaining candidates won’t be eliminated in the first round. The remaining 6 winners then will be selected by the tie-breaker procedure (which we didn’t elaborate on). Fair yes, desirable no. However, such anomalies can be accounted voter-failures. If each voter ranks the whole field of candidates, they won’t arise.

One important thing to note is that the election method described does not obey the conditions of Arrow’s theorem. The procedure is not even deterministic, and certainly does not satisfy the 3rd fairness condition. It is quite possible for a change in the ranking of candidate z on individual ballots to affect the order of x relative to y in the outcome even if the order of x relative to y is unchanged on those individual ballots. As an extreme example, suppose x is 1st and y is 2nd on 50 ballots and y is 1st and x is 2nd on 50 ballots, and suppose z is 3rd on all of these. If one of the 1st 50 ballots moves z to the top, x will be eliminated in the 1st round. If one of the 2nd 50 ballots moves z to the top y will be eliminated in the 1st round. In neither case did the ranking of x relative to y change on any ballots. Some anomalies arise for similar reasons to those involved in Arrow’s theorem, but others arise for different reasons.

The Anomaly

Let us now consider the specific anomaly we set out to discuss. Suppose there are 10000 ballots and 9 positions to be filled. We require 1001 votes for a candidate to win, but we’ll call it 1000 to simplify calculation. Suppose that candidate x is ranked 1st on all 10000 ballots, candidate y is ranked 3rd on all 10000 ballots, and 100 other candidates (which we’ll call z1-z100) are ranked 2nd on 100 ballots each.

Everyone agrees that candidates x and y should be on the City Council. They both rank in the top 3 choices for everyone. However, candidate y is eliminated in the first round. All the spillover votes from candidate x go to candidates z1-z100. The number could vary for each, depending on the order in which ballots are processed. For example, it is possible that each of z1-z100 is assigned 90 spillover votes from candidate x. It also is possible that z1-z90 would accrue 100 spillover votes each, and the rest would get 0 and be eliminated.

At the end of round 1, x is declared a winner and consumes 1000 votes, y has 0 votes, and z1-z100 each have between 0 and 100 votes. At least 81 of them have enough to survive the 50 vote test (9000 ballots spill over and no z can receive more than 100, so at most 19 of them can end up with 49 or fewer). However, y is eliminated. The remaining z’s then proceed through a series of elimination and spillover rounds (with possible tie-breakers for the trailing candidate if needed) until only 8 of the z’s remain. These then are declared winners.

The result is 1 winner everyone wants, 8 winners few people agree on, and the conspicuous loss of the 2nd candidate everyone wants.
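For anyone who wants to watch the anomaly happen, below is a rough Python simulation of this specific scenario. It is a sketch under simplifying assumptions, not an implementation of the statute: the quota is taken as 1000 (as in the text), ballots are processed in a single random order, candidates below the 50 ballot threshold are eliminated together at the end of stage 1, ties for trailing candidate are broken at random, and ballot swapping is omitted because it never triggers here. The function and variable names are hypothetical.

```python
import random


def simulate_anomaly(seed=0, quota=1000, min_ballots=50, seats=9):
    rng = random.Random(seed)
    z_names = [f"z{i}" for i in range(1, 101)]
    # 100 ballots per z: x is 1st, that z is 2nd, y is 3rd (10000 ballots total).
    ballots = [("x", z, "y") for z in z_names for _ in range(100)]
    rng.shuffle(ballots)  # stand-in for the random processing order

    tally = {name: 0 for name in ["x", "y"] + z_names}
    winners = []

    # Stage 1: each ballot goes to its highest-ranked candidate who has not yet won.
    for ballot in ballots:
        for choice in ballot:
            if choice not in winners:
                tally[choice] += 1
                if tally[choice] == quota:
                    winners.append(choice)
                break

    # End of stage 1: everyone below the threshold is eliminated in one step.
    # Their ballots rank only x (a winner) and candidates eliminated in that
    # same step, so nothing transfers.
    survivors = [c for c in tally if c not in winners and tally[c] >= min_ballots]

    # Stages 2+: drop the trailing candidate until winners + survivors fill the seats.
    # A dropped z's ballots list no candidate still in play, so they are discarded.
    while len(winners) + len(survivors) > seats:
        lowest = min(tally[c] for c in survivors)
        trailing = rng.choice([c for c in survivors if tally[c] == lowest])
        survivors.remove(trailing)

    return winners + survivors, tally


final, tally = simulate_anomaly(seed=1)
print("winners:", final)
print("ballots ever assigned to y:", tally["y"])
```

Whatever the seed, y never receives a single ballot and is eliminated, while x plus 8 of the z’s take the seats. Which 8 depends entirely on the processing order and the tie-breaks.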

This is just one fun example of how well-intentioned voting systems can result in highly-undesirable outcomes.


The (quasi)-Duality of the Lie Derivative and Exterior Derivative

February 15, 2021 K.M. Halpern

Lecture1 Lecture2 Lecture3 Lecture4 Lecture5

This is a short set of notes that covers a couple of aspects of duality in differential geometry and algebraic topology. It grew out of an enigmatic comment I encountered, to the effect that the Lie and exterior derivatives were almost-dual in some sense. I wanted to ferret out what this meant, which turned out to be more involved than anticipated. Along the way, I decided to explore something else I never had properly understood: the nature of integration from a topological perspective. This led to an exploration of the equivalence of de Rham and singular cohomology.

The notes are in the form of five sets of slides. Originally, they comprised four presentations I gave in a math study group. On tidying, the last set grew unwieldy, so I broke it into two.

Lecture1: Review of DG and AT. Types of derivatives on ${M}$, de Rham Complex, review of some diff geom, Lie deriv and bracket, chain complexes, chain maps, homology, cochain complexes, cohomology, tie in to cat theory.
Lecture2: The integral as a map, Stokes’ thm, de Rham’s thm, more about Lie derivs.
Lecture3: Recap of de Rham cohomology, review of relevant algebra, graded algebras, tensor algebra, exterior algebra, derivations, uniqueness results for derivations, the interior product.
Lecture4: Cartan’s formula, tensor vs direct product, element-free def of LA, Lie coalgebras
Lecture5: Quick recap, relation between struct constants of LA and LCA, the choice of ground ring or field, duality of Lie deriv and exterior deriv.

These notes grew organically, so the order of presentation may seem a bit … unplanned. The emphases and digressions reflect issues I encountered, and may be peculiar to my own learning process and the many gaps in my physicist-trained math background. Others may not share the same points of confusion, or require the same background explanations. They were designed for my own use at some future point when I’ve completely forgotten the material and need a bespoke refresher. I.e., a week from now.

Although I’ve tried to polish the notes to stand on their own, there are some allusions to earlier material studied in the group. In particular, certain abbreviations are used. Here is a (hopefully) complete list:

DG: Differential Geometry
AT: Algebraic Topology
DR: de Rham
${P}$: Used for a Principal bundle. Not really used here, but mentioned in passing.
PB: Principal Bundle. Not really used here, but mentioned in passing.
AB: Associated Bundle. Not really used here, but mentioned in passing.
LG: Lie Group. Mentioned in passing.
LA: Lie Algebra
LCA: Lie Coalgebra (defined here).
v.f.: Vector fields
v.s.: Vector space

The 1st 2 lectures focus on the equivalence of de Rham and singular cohomologies via a duality embodied in the integral map, and enforced by Stokes’ and de Rham’s thms. The last 3 lectures focus on the quasi-duality between the Lie derivative and exterior derivative. By quasi-duality we don’t mean to downplay its legitimacy. I didn’t go through all sorts of contortions to call a square a circle just because it sounds elegant. There is a true duality, and a beautiful one. But saying that it is directly between the Lie and exterior derivs is slightly misleading.

These notes were constructed over a period of time, and focus on the specific topic of interest. They are by no means comprehensive. Although edited to correct earlier misconceptions based on later understanding (as well as errors pointed out by the math group), the order of development has not been changed. They were written by someone learning the subject matter as he learned it. They may have some mistakes, there may be some repetition of points, and they are not designed from the ground up with a clear vision. Nonetheless, they may prove helpful in clarifying certain points or as a springboard for further study.

These notes explain the following:

${\int}$ as a map from the de Rham complex to the singular cochain complex
Stokes’ thm as a relationship between de Rham cohomology and singular cohomology
The various types of derivations/anti-derivations encountered in differential geometry
A review of graded algebras, tensor algebras, exterior algebras, derivations, and anti-derivations.
A review of Lie Derivatives, as well as Cartan’s formula
A discussion of what the duality of ${{\mathcal{L}}}$ and ${d}$ means
A discussion of the two views one can take of ${T(M)}$ and ${\Lambda(M)}$: as ${\infty}$-dimensional vector spaces over ${\mathbb{R}}$ or as finite-basis modules over the smooth fns on ${M}$. The former is useful for abstract formulation while the latter is what we calculate with in DG. The transition between the two can be a source of confusion.
A discussion of why derivations and anti-derivations are the analogues of linearity when we move from one view to the other.

The notes draw from many sources, including Bott & Tu, Kobayashi & Nomizu, and various discussions on stackexchange. A list of references is included at the end of the last set of slides.


The Truth about Stock Prices: 12 Myths

September 10, 2020 K.M. Halpern

No-fee trading has invited a huge influx of people new to trading. In this article, I will discuss the basics of “price formation”, the mechanism by which stock prices are determined.

Like most people, for much of my life I assumed that every stock has a well-defined “price” at any given point in time. You could buy or sell at that price, and the price would move based on activity. If it went up you made money, if it went down you lost money. Trading was easy: you just bought the stocks you thought would go up and sold the ones you thought would go down.

Unfortunately, my blissful naivete was cut short. After a youthful indiscretion, I ended up doing five years at the Massachusetts Institute of Technology. When the doors finally slammed shut behind me, I emerged with little more than a bus ticket and some physics-department issued clothes. Nobody reputable would hire a man with a checkered background doing physics, so I ended up with the only sort open to hard cases: Wall Street.

I caught the eye of a particularly unsavory boss one day, and he recruited me into a gang doing stat arb at a place called Morgan Stanley. I tried to get out, but they kept pulling me back in. It took six years to find a way out, but even then freedom proved elusive. I was in and out of corporations for the next few years, and even did some contract work for a couple of big hedge funds. Only in the confusion of 2008 did I finally manage to cut ties and run. But the scars are still there. The scars never go away.

On the plus side, I did learn a bit about market microstructure. Along the way I came to understand that my original view of prices was laughably simplistic. My hope is that I can help some misguided kid somewhere avoid my own missteps. If I can save even one reader, the effort put into this post will have been repaid a thousand times over. Mainly because I didn’t put much effort into it.

Rather than a detailed exposition on market microstructure (which varies from exchange to exchange, but has certain basic principles), I will go through a number of possible misconceptions. Hopefully, this will be of some small help to new traders who wish to better understand the dynamics of the stock market. At the very least, it will make you sound smart at cocktail parties. It also may help the occasional reader avoid such minor faux pas as redditing “hey guys, why don’t we all collude to manipulate stock prices in clear violation of SEC regulations, and to such an absurd degree that it will be impossible for regulators NOT to crucify us.” But hey, what’s the worst that could result from the public subversion of a number of powerful, well-connected hedge funds and the defiant proclamation that this was intentional?

Now to the important bit. Because we live in America, and everybody sues everyone for everything, I’ll state the obvious. Before you do anything, make sure you know what you are doing. If you read it here, that doesn’t mean it’s right or current. Yes, I worked in high frequency statistical arbitrage for some time. However, my specific knowledge may be dated. Though the general principles I describe still apply, you should confirm anything I say before relying heavily on it. In particular, I am no tax expert. Be sure to consult an accountant, a lawyer, a doctor, a rabbi, and a plumber before attempting anything significant. And if you do, please send me their info. It’s really hard to find a good accountant, lawyer, doctor, rabbi, or plumber.

Don’t take anything I say (or anyone else says) as gospel. I’ve tried to be as accurate as possible, but that doesn’t mean there aren’t technical errors. As always, the onus is on you to take care of your own money. When I first started out on Wall Street, I was in awe of traders. Then I got to know some. In my first job, somebody helpfully explained why people on Wall Street were paid more than in other professions. They weren’t paid to be infallible and never make mistakes; they were paid to be attentive and diligent enough to catch any mistakes they did make.

This sounded nice, but turned out to be a load of malarkey. The highly-paid professionals on Wall Street are the same bunch of knuckleheads as in any other profession, but with better credentials. However, this cuts both ways. Many people have a view, promulgated by movies and television, that bankers are unscrupulous, boiler-room shysters. These certainly exist, but mostly amongst the armies of low-paid retail brokers, or in certain very disreputable areas such as commercial banking. The real Wall Street is quite different. The individuals I worked with were highly ethical, and the environment was far more collegial and honest than academia. And this was in the late 90’s and early 2000’s, before academia really went to pot. The few knives I had to pull out of my back were (with one exception) gleefully inserted by fellow former-physicists. Fortunately, while physicists know a lot about the kinematics of knives, they know very little about anatomy. I emerged unscathed, and even got a few free knives out of it — which I promptly sold to some folks in Academia, where such things always are in high demand.

Despite its inapplicability to actual employee behavior, the point about mistakes is a good one. It is impossible to avoid making mistakes, but if you value your money you should carefully triple-check everything. This goes doubly for any work done by an accountant, financial adviser, or other “professional” you ill-advisedly employ. They probably know less than you do, and certainly care less than you do about your money.

The best advice I can offer is to inform yourself and be careful. Do research, check, recheck, and recheck again before committing to a trade. In my personal trading, I’ve never lost out by being too slow or cautious. But I have been hammered by being too hasty.

Now to the possible misconceptions. I’ll call them “myths” because that’s what popular websites do, so obviously it’s the right thing to do, and I prefer to do the right thing because the wrong thing rarely works.

Myth 1: There is a “price” for a stock at any given point in time. When a stock is traded during market hours, there is no such thing as its “price”. There is a bid (the highest offer to buy) and an ask (the lowest offer to sell). Often, the “price” people refer to is the last trade price (the price at which the last actual transaction occurred, regardless of its size). Sometimes the midpoint (bid+ask)/2 or weighted midpoint (bid x bidsize + ask x asksize)/(bidsize + asksize) is used. For algorithmic trading, more complicated limit-book centroids sometimes are computed as well. The “closing price” generally refers to the last trade price of the day. This is what appears in newspapers.
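Here is a small Python sketch of the two midpoint formulas just mentioned. The function names and the sample quote (a $100 bid for 200 shares against a $101 ask for 50 shares) are purely illustrative. Note that conventions for the weighted midpoint vary, and some desks weight each price by the opposite side's size. The code follows the formula exactly as written above.

```python
def midpoint(bid, ask):
    """Plain midpoint of the quote."""
    return (bid + ask) / 2


def weighted_midpoint(bid, bid_size, ask, ask_size):
    """Size-weighted midpoint, using the formula given in the text."""
    return (bid * bid_size + ask * ask_size) / (bid_size + ask_size)


# Illustrative quote: bid $100 x 200 shares, ask $101 x 50 shares.
bid, bid_size, ask, ask_size = 100.0, 200, 101.0, 50
print("midpoint:", midpoint(bid, ask))
print("weighted midpoint:", weighted_midpoint(bid, bid_size, ask, ask_size))
```

For this quote the plain midpoint is $100.50, while the weighted version comes out near $100.20 because the larger bid size pulls it toward the bid. Neither number is a price at which you can necessarily trade.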

Myth 2: I can place a limit order at any price I want. No, you cannot. Stocks (and options) trade at defined ticks. The “tick” or “tick size” is the space between allowed prices, and may itself vary with price. For example, the tick size in stock ZZZ could be $0.01 for prices below $1.00 and $0.05 otherwise. Often, ticks are things like 1/8 or 1/16 rather than multiples of $0.01. The tick size rules vary per exchange (or per security type on a given exchange) rather than per stock. In our example, any stock’s price could have allowable values of …, $0.98, $0.99, $1.00, $1.05, $1.10, … on the exchange in question.
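As a sketch of what a tick rule means in practice, the Python below encodes the hypothetical ZZZ rule from this example ($0.01 below $1.00, $0.05 otherwise). The function names are made up for illustration, and real exchanges publish their own tick tables. `Decimal` is used so that prices like 0.05 are represented exactly.

```python
from decimal import Decimal


def tick_size(price: Decimal) -> Decimal:
    """Hypothetical rule from the example: $0.01 below $1.00, $0.05 otherwise."""
    return Decimal("0.01") if price < Decimal("1.00") else Decimal("0.05")


def is_valid_price(price: Decimal) -> bool:
    """A price is allowed only if it is a whole number of ticks."""
    return price % tick_size(price) == 0


def round_down_to_tick(price: Decimal) -> Decimal:
    """Nearest allowed price at or below the requested one."""
    tick = tick_size(price)
    return (price // tick) * tick


for raw in ["0.99", "1.03", "1.05"]:
    p = Decimal(raw)
    print(raw, is_valid_price(p), round_down_to_tick(p))
```

Under this rule a limit order at $1.03 is not allowed. The closest permitted prices are $1.00 and $1.05.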

Myth 3: Limit Orders always are better than market orders. Limit orders offer greater control over the execution price, but they may not be filled or may result in adverse selection. Suppose ZZZ is trading with a bid of $100, an ask of $101, and a tick size of $0.50. Alice places a buy limit order at $100.5. It is quite possible that it quickly will be filled, giving her $0.50 better execution than a market order.

But suppose it is not filled right away. If the stock goes up, Alice has incurred what is called “opportunity cost.” The $0.50 attempted savings now translates into having to pay a higher price or forego ownership of the stock. It’s like waiting for the price of a home to go down, only to see it go up. If you want the home (and still can afford it), you now must pay more.

Ok, but why not just leave the limit order out there indefinitely? Surely it will get filled at some point as the stock bounces around. And if not, there is no harm. You don’t end up with the stock, but haven’t lost any money. In fact, why not put a limit order at $98? If it gets executed, that’s a $2.00 price improvement!

The problem is adverse selection. Such a limit order would get filled when the stock is falling. Sure, a temporary dip could catch it. But a major decline also could. The order is likely to be filled under precisely the conditions when Alice would not want it to be. At that point, she may be able to buy the stock for $97 or $96 — if buying it remains desirable at all. In the presence of an “alpha” (loosely speaking, a statistical signal which a trader believes has some predictive power for future stock movements), it may pay to place such limit orders —but that is a specific execution strategy based on a specific model. In general, there is no free money to be had. You either incur the transaction cost of crossing the spread (i.e. paying the ask), or risk both the opportunity cost of losing out on a desirable trade and the possibility of adverse selection which lands you with the stock at the worst possible time.

Well, it isn’t strictly true there is no free money to be had. There is free money to be made, but only by market makers, uniquely positioned to accept large volumes of orders. In this, they are not unlike the exchanges themselves. You and I do not possess the technology, capital, or customer flow to make money that way.

Myth 4: I can buy or sell any quantity at the stated price. There are a couple of reasons this is not true. The “stated price” usually is the last trade price, and there is no guarantee you can buy at that same price. Just because a house down the block sold for X doesn’t mean you can buy an identical one now for X. In illiquid stocks (and quite often with options), the last trade may have taken place some time ago and be stale relative to the current quote.

In principle, you can buy at the current ask or sell at the current bid. However, even this is not guaranteed. The bid and ask can move quickly, and it may be difficult to catch them. But there also is another critical issue at play. The bid and ask are not for unlimited quantities of stock. Each has an associated size, the total number of shares being sold or sought at that price. To understand this, it is necessary to explain how an order actually is executed — and that requires the notion of a “limit book” (aka “order book”).

Most data vendors and websites will display a “quote” (aka “composite quote”) for each stock. This consists of a bid, an ask, a bid-size, and an ask-size. Although some websites may omit the sizes, they are considered part of the quote. Suppose the quote for ZZZ has a bid of $100 for 200 shares, an ask of $101 for 50 shares, and the relevant tick-size is $0.50. Then the spread is two ticks (101-100)/0.50, and the midpoint is $100.50. It isn’t necessarily the case that there is one trader offering to buy 200 shares at $100 and another offering to sell 50 shares at $101. The sizes may be aggregates of multiple orders at those price levels.

The composite quote actually is a window into a larger constellation of orders known as the limit book. The limit book consists of a set of orders at various price levels. For example, the limit book for ZZZ could have orders at $101, $101.5, $102, and $104 on the ask side, with a queue of specific orders at each level. The composite quote simply is the highest bid, the lowest ask, and the aggregate size for each.

Suppose Bob puts in a market order to buy 100 shares of ZZZ. This is matched against the orders at the lowest ask level ($101 in this case) in their order of priority (usually the time-order in which they were received). Since there only are 50 shares at $101, the exchange matches Bob against all the sell-orders at $101. It then matches the remaining 50 shares against the second ask level ($101.5) and higher until it matches them all. If it fails to match them all, Bob will have a partial fill, and the remainder of the order will be cancelled (since it was a market order). Each “fill” is a match against a specific sell-order, and a given trade can result in many fills. This is part of why your broker may sometimes send a bunch of trade confirmations for a single order on your part.
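The matching process can be sketched in Python as a walk up the ask side of the book. This is a simplification under stated assumptions: each price level is collapsed into one aggregate size rather than a queue of individual orders, the sizes beyond the 50 shares at $101 are invented for illustration, and `match_market_buy` is a hypothetical name rather than any exchange's API.

```python
def match_market_buy(ask_levels, quantity):
    """Walk the ask side from the lowest price up, filling as much as possible.

    `ask_levels` is a list of (price, aggregate_size) pairs.
    Returns (fills, unfilled), where fills is a list of (price, shares).
    """
    fills = []
    remaining = quantity
    for price, size in sorted(ask_levels):
        if remaining == 0:
            break
        take = min(size, remaining)
        fills.append((price, take))
        remaining -= take
    return fills, remaining


# Ask prices from the example; only the 50 shares at $101 come from the quote.
asks = [(101.0, 50), (101.5, 30), (102.0, 40), (104.0, 100)]
fills, unfilled = match_market_buy(asks, 100)
print(fills)
print("unfilled:", unfilled)
```

With these made-up sizes, Bob's 100 shares are filled across three price levels: 50 at $101, 30 at $101.5, and 20 at $102. A real book would produce one fill per resting order, so the confirmation count can be higher still.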

For highly liquid stocks, no order you or I are likely to place will execute past the inner quote. However, that quote can move quickly and the price at which a market order is executed may not be what you think. Brokers also execute order flow internally, or sell flow to other institutions, which then match it against other customers or their own orders. To you it looks the same (and may actually improve your execution in some cases), but your trade may never make it to the exchange. This is fine, since you’re not a member of the exchange; your broker is.

Note the risk of a market order, especially for illiquid stocks. Suppose the 2nd ask level was $110 rather than $101.5. In that case, Bob would have bought 50 shares at $101 and 50 shares at $110. A limit order slightly past the ask would have avoided this. For example, if he wanted to ensure execution (if possible) but avoid such ridiculous levels, he could place an immediate-or-cancel limit order (which, unlike a strict fill-or-kill, allows partial fills) at $102. This would ensure that he doesn’t pay more than $102, but he may only get a partial fill.

For stocks (other than penny-stocks), limit orders rarely are necessary as protection, though they may be desirable for other purposes. But when trading options, a limit order always should be used. If the quote is moving around a lot, this can be a good way to control worst-case execution (but in exchange for some opportunity cost). Options are a bit odd, since brokers often will write them on the spot in response to an order. You just need to figure out what their automated price-level is. Sometimes it is the midpoint, sometimes slightly higher. You almost always can do better than the innermost ask for small volume. For higher volume, you should buy slowly (over a day or two) to avoid moving the market too much — though it may be impossible if you effectively have the broker as your only counterparty. But back to Bob and ZZZ!

Now suppose that Bob places a limit order to buy 50 shares at $100.5, right in the middle of the current spread. There now is a new highest bid level: $100.5, and Bob is the sole order at that level. Any market sell order will match against him first, and this may happen so fast that the quote never noticeably changes. But if not, the new bid and bid-size will be $100.5 and 50 shares. If, instead, he placed his buy order at $100, he would join the other bids at $100 as the last in the queue at that level.

What if he places it at $101 instead? If there were 25 shares available at that ask level, he would match those 25 shares. He now would have a bid for the remaining 25 shares at $101. This would be the new best bid, and the quote would change accordingly. The new best ask would be $101.5. Finally, suppose he placed the limit order at $110 instead. This effectively would be a market order, and would match against the $101 and $101.5 levels as before. Note that he would not get filled at $110 in this example. If there were 25 shares each at $101 and $101.5, he would be filled at those levels and his $110 limit order would have the same effect as a $101.5 limit order.
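The three limit-order cases above (inside the spread, at the ask, and far through the ask) can be sketched as follows. This is a simplified illustration: the function name `limit_buy` and the book layout are assumptions, each level is tracked only by its aggregate size, and queue priority within a level is ignored.

```python
def limit_buy(bids, asks, limit_price, quantity):
    """Cross a buy limit order against the asks, then rest any remainder.

    bids, asks: dict mapping price -> aggregate size (mutated in place).
    Returns the list of fills as (price, shares).
    """
    fills = []
    remaining = quantity
    for price in sorted(asks):                 # walk asks from the lowest up
        if price > limit_price or remaining == 0:
            break                              # never pay more than the limit
        take = min(asks[price], remaining)
        fills.append((price, take))
        remaining -= take
        asks[price] -= take
        if asks[price] == 0:
            del asks[price]                    # level exhausted
    if remaining > 0:                          # unfilled part becomes a resting bid
        bids[limit_price] = bids.get(limit_price, 0) + remaining
    return fills

def fresh_book():
    # 25 shares at each of the first two ask levels, as in the example.
    return {100.0: 200}, {101.0: 25, 101.5: 25}

bids_a, asks_a = fresh_book()
fills_a = limit_buy(bids_a, asks_a, 100.5, 50)   # inside the spread: no fills

bids_b, asks_b = fresh_book()
fills_b = limit_buy(bids_b, asks_b, 101.0, 50)   # at the ask: partial fill

bids_c, asks_c = fresh_book()
fills_c = limit_buy(bids_c, asks_c, 110.0, 50)   # far through the ask

print(fills_a, max(bids_a))
print(fills_b, max(bids_b), min(asks_b))
print(fills_c)
```

The $110 order never trades at $110 because the loop fills at the resting sellers' prices, not at Bob's limit.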

The limit book constantly is changing and, to make things worse, there often is hidden size. On many exchanges, it’s quite possible for the limit book to show 25 shares available at $101 and yet fill Bob for all 50 at that level. There could be hidden shares which automatically replenish the sell-order but are not visible in the feed. This is intentional. Most of the time, we only have access to simple data: the current quote and the last trade price.

Note that the crossing procedure described is performed automatically almost everywhere these days. Most exchanges run “ECNs”, electronic communication networks. An algorithm accepts orders which conform to the tick-size and other exchange rules, crossing them or adjusting the limit book accordingly. This is conceptually simple, but the software is rather involved. Because of the critical nature of an exchange, the technology has to be robust. It must be able to receive high volumes of orders with minimal latency; process them, cross them, and update the limit book; transmit limit-book, quote, and trade information to data customers; manage back-end and regulatory tasks such as clearing trades, reporting them, and processing payments; and do all this at extremely high speed, across many stocks and feeds concurrently, and with significant resilience. It definitely beats a bunch of screaming people and trade slip confetti.

Myth 5: The price at the close of Day 1 is the price at the open of Day 2. This clearly is not true, and often the overnight move is huge and predicated on different dynamics than intra-day moves. There are two effects involved. Some exchanges make provision for after-market and pre-open trading, but the main effect is the opening auction. Whenever there is a gap in trading, the new trading session begins with an opening auction. Orders accumulate prior to this, populating the limit book. However, no fills can occur. This means that the two sides of the limit book can overlap, with some bids higher than some asks. This never happens during regular trading because of the crossing procedure described earlier, and this situation must be cleaned up before ordinary trading can begin.

The opening auction is an unambiguous procedure for matching orders until the two sides of the book do not overlap. It is executed automatically by algorithm. The closing price on a given day is the last trade price of that day. It often takes a while for data to trickle in, so this gets adjusted a little after the actual close but usually is fairly stable. The prices one sees at the start of the day involve a flurry of fills from the uncrossing. This may create its own minor chaos, but the majority of the overnight price move is reflected in the orders themselves. Basically, it can be thought of as a queue of traders waiting to get their orders in. There also are certain institutional effects near the open and close because large funds must meet certain portfolio constraints. Note that the opening auction happens any time there is a halt to trading. Most opening auctions are associated with the morning open, but some exchanges (notably the Tokyo Stock Exchange) have a lunch break. Extreme price moves also can trigger a temporary trading halt. In each case, there is an opening auction before trading restarts.

Myth 6: The price fluctuations of a stock reflect market sentiment. That certainly can be a factor, often the dominant one. However, short-term price fluctuations also may be caused by mere market microstructure.

The price we see in most charts and feeds is the last trade price, so let’s go with that. Similar considerations hold for the quote midpoint, bid, ask, or any other choice of “price” that is being tracked.

When you buy at the ask, some or all of the sell-orders at that ask-level of the limit book are filled. There may be hidden size which immediately appears, or someone may happen to jump in (or adjust a higher sell-order down). But in general, this is not the case. The composite quote moves, as do all quote-based metrics. The last trade price also reflects your trade, at least until the next trade occurs.

Consider an unrealistic but illustrative example: ZZZ has a market cap of a billion dollars. Bob and Alice are sitting at home, trading. The rest of the market, including all the major institutions which own stock in ZZZ, are sitting back waiting for some news or simply have no desire to trade ZZZ at that time. They don’t participate in trading, and have no orders outstanding. So it’s just Alice and Bob. ZZZ has a last trade price of $100, Bob has a limit order to buy 1 share at $100, and Alice has a limit order to sell 1 share at $101. These orders form both the quote and the entirety of the limit book (in this case).

Bob gets enthusiastic, and crosses the spread. The price now is $101, that at which his trade transacted. Both see that the “price” just went up, and view the stock as upward-bound. Alice has some more to sell, and decides to raise her ask. She places a sell limit order for 1 share at $102. The ask now is 1x$102. Bob bites, crossing the spread and transacting at $102. The “price” now is $102. The pattern repeats with Alice always increasing the ask by $1 and Bob always biting after a minute or so. The closing price that day is $150.

Two people have traded a total of 50 shares over the course of that day. Has the price of a billion dollar company really risen 50%? True, this is a ridiculous example. In reality, the limit book would be heavily populated even if there was little active trading, and other participants wouldn’t sit idly by while these two knuckleheads (well, one knucklehead, since Alice actually does pretty well) go at it. But the concept it illustrates is an important one. Analogous things can happen in other ways. Numerous small traders can push the price of a stock way up, while larger traders don’t participate. In penny stocks, this sort of thing actually can happen (though usually not in such an extreme manner). When a stock’s price changes dramatically, it is important to look at the trading volume and (if possible) who is trading. When such low-volume price moves occur, it is not a foregone conclusion that the price will revert immediately or in the near term. Institutional traders aren’t necessarily skilled or wise, and can get caught up in a frenzy or react to it — so such effects can have real market impact. However, most of the time they tend to be transient.

Myth 7: Shorting is an abstraction, and is just like buying negative shares. In many cases, it effectively behaves like this for the trader. However, the actual process is more complicated. “Naked shorts” generally are not allowed, though they can arise in anomalous circumstances. When you sell short, you are not simply assigned a negative number of shares, which settles accordingly. You are borrowing specific shares of stock from a specific person who has a long position. The matching process is called a “locate” and is conducted at your broker’s level if possible or at the exchange level if the broker has no available candidates. There is an exception for market-makers and for brokers when a stock is deemed “easy to borrow”, meaning it is highly liquid and there will be no problem covering the short if necessary. Brokers maintain dynamic “easy to borrow” and “hard to borrow” lists for this purpose.

From the standpoint of a trader, there are two situations in which a short may not behave as expected. Suppose Bob sells short 100 shares of ZZZ stock, and the broker locates it with Alice. Alice owns 100 shares, and the broker effectively lends these to Bob. If Alice decides to sell her shares, Bob now needs to return the shares he borrowed and be assigned new ones. Normally, this is transparent to Bob. But if replacement shares cannot be located, he must exit his short position. The short sale is contingent on the continuing existence of located shares.

Because of the borrowing aspect, Bob’s broker also must ensure he has sufficient funds to cover any losses as ZZZ rises. This requires a margin. If ZZZ goes up, Bob may have to put up additional capital or exit his position (and take the loss). In principle, a short can result in an unlimited loss. In practice, Bob would fail a margin call before then. I.e., Bob cannot simply “wait out” a loss as he could with a long position.

If — as you should — you view the value of your position as always marked-to-market, then (aside from transaction cost or tax concerns) you never should hold a position just to wait out a loss. Most people don’t think or act this way, and there sometimes are legitimate reasons not to. For example, a long term investment generally shouldn’t be adjusted unless new information arrives (though that information may regard other stocks or externalities which necessitate an overall portfolio adjustment). One could argue that short term random fluctuations do not constitute new information, and without an alpha model one should not trade on them. This is a reasonable view. However, the ability to avoid doing so is not symmetric. Because of the issues mentioned, short positions may be harder to sustain than long ones.

The next couple of myths involve some tax lingo. In what follows “STCG” refers to “Short Term Capital Gain” and “LTCG” refers to “Long Term Capital Gain”. “STCL” and “LTCL” refer to the corresponding losses (i.e. negative gains).

Myth 8: Shares are fungible. When you sell them, it doesn’t matter which ones you sell. This is true from the standpoint of stock trading, but not taxes. Most brokers allow you to specify the specific shares (the “lots”) you wish to sell, though the means of doing so may not be obvious. However, for almost all purposes two main choices suffice: LIFO and FIFO. Most of the time, FIFO is the default. With many brokers, you can change this default for your account, as well as override it for individual trades. Let’s look at the difference between FIFO and LIFO.

Suppose Bob bought 100 shares of ZZZ at $50 3 years ago and bought another 100 shares of ZZZ at $75 6 months ago. ZZZ now is at $100, and he decides to sell 100 shares. If he sells the first 100 shares, a LTCG of $5000 ($10000 – $5000) is generated, but if he sells the second 100 shares a STCG of $2500 ($10000 – $7500) is generated. The implications of such gains can be significant, and are discussed below. The specifics of Bob’s situation will determine which sale is more advantageous — or less disadvantageous.

The first choice corresponds to FIFO accounting: first in, first out. The second corresponds to LIFO: last in, first out. One usually (but not always) benefits from FIFO, which is why this is the default. Note that FIFO and LIFO are relative to a given brokerage account, since a broker only knows about your positions with it. If Bob had an earlier position with broker B, broker A does not know about it and cannot sell it. In that case, Bob must keep track of these things. FIFO and LIFO are relative to the specific account in question, but the tax consequences for Bob are determined across all brokerage accounts. We’ll see what this means in a moment.

All capital gains are relative to “basis” (or “tax basis”), generally the amount you paid for the stock when you bought it. In the example above, the basis for the first lot was $5000 and the basis for the second was $7500. This was why the LTCG from the first was $5000, while the STCG from the second was $2500. With stocks (but not necessarily mutual funds), a tax event only occurs when you close your position. If you hold the shares for 10 years, only in year 10 is a capital gains tax event generated. This can allow some strategic planning, and part of your overall investment strategy may involve choosing to sell in a low-income year. Note that dividends are taxed when you receive them, regardless of whether they are cash or stock dividends or whether you chose to reinvest them. Also note that some mutual funds generate tax events from their own internal trading. You could be taxed on these (STCG or LTCG), and it is best to research the tax consequences of a fund before investing in it.

If you transfer stocks between accounts (usually done when transferring a whole account to a new broker), their tax basis is preserved. No tax events are generated. Note that the transfer must be done right. If you manually close your old positions and open new ones (with enough time between), you may generate a tax event. But if you perform an official “transfer” (usually initiated with your destination broker), the basis is preserved and no tax event occurs. Whether your broker will know that basis is another question. Not every broker’s technology or commitment to customer convenience is up to snuff. It is a good practice to keep your own careful records of all your trading activity.

When would LIFO be preferable? There are various cases, but the most common is to take a STCL to offset STCGs. STCGs tend to be taxed at a much higher rate than LTCGs, so taking a loss against them often is the desirable thing to do. In Bob’s case, if the price had gone down to $25 instead of up to $100, he could sell at a loss and use that loss to offset gains from some other stocks. He would have to specify LIFO to sell the newer lot and generate the STCL.
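Bob's two lots make a convenient test case for the FIFO/LIFO distinction. The sketch below is illustrative only: the function name `sell_lots`, the specific dates, and the "more than 365 days" holding-period test are assumptions (the real rule counts calendar years and has its own edge cases), and nothing here is tax advice.

```python
from datetime import date

def sell_lots(lots, shares, price, sale_date, method="FIFO"):
    """Sell `shares` from `lots` and split the gain into short and long term.

    lots: list of (purchase_date, shares, cost_per_share).
    method: "FIFO" sells the oldest lots first, "LIFO" the newest first.
    """
    ordered = sorted(lots, key=lambda lot: lot[0], reverse=(method == "LIFO"))
    gains = {"short_term": 0.0, "long_term": 0.0}
    remaining = shares
    for purchase_date, lot_shares, cost in ordered:
        if remaining == 0:
            break
        sold = min(lot_shares, remaining)
        held_days = (sale_date - purchase_date).days
        # Simplified holding-period test (an assumption, see lead-in).
        kind = "long_term" if held_days > 365 else "short_term"
        gains[kind] += sold * (price - cost)   # negative values are losses
        remaining -= sold
    return gains

# Hypothetical dates: one lot 3 years old, one 6 months old.
today = date(2023, 7, 1)
lots = [(date(2020, 7, 1), 100, 50.0), (date(2023, 1, 1), 100, 75.0)]

fifo_up = sell_lots(lots, 100, 100.0, today, "FIFO")    # LTCG of $5000
lifo_up = sell_lots(lots, 100, 100.0, today, "LIFO")    # STCG of $2500
fifo_down = sell_lots(lots, 100, 25.0, today, "FIFO")   # LTCL of $2500
lifo_down = sell_lots(lots, 100, 25.0, today, "LIFO")   # STCL of $5000
print(fifo_up, lifo_up, fifo_down, lifo_down)
```

In the $25 scenario, LIFO produces the larger loss and makes it short term, which is the case described above.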

Myth 9: A “no-fee” trading account is better than one with fees. The cost to a trader involves several components. The main three are broker fees, exchange fees, and “execution”. “No-fee” refers to the broker fee. Unless many small trades are being executed with high frequency, the broker fee tends to be small. The exchange fees are passed along to you, even for “no-fee” accounts. The “execution” is the bulk of the cost. No or low-fee brokers often cross flow internally or sell flow to high-frequency firms which effectively front-run you. Market orders see slightly worse execution than they could, and limit orders get filled with slightly lower frequency than they could (or are deferred, causing slight adverse selection). These effects are not huge, but something to be aware of.

Suppose Alice buys 100 shares of ZZZ at $100. Broker X is no-fee, and Broker Y charges a fee of $7.95 per trade but has 10 bp (0.1%) better execution than Broker X on average. That 10 bp is just a price improvement of $0.10, and amounts to $10. Alice does better with Broker Y than Broker X. This benefit may seem to apply only to large trades, but it also applies to stocks with large spreads. For illiquid stocks (including penny stocks) the price improvement can be much more significant. There are trading styles (lots of small trades in highly liquid stocks) where no-fee sometimes trumps better execution, but most often it does not.
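The arithmetic is simple enough to check directly. In this sketch the function names are hypothetical, and "10 bp better execution" is modeled as Broker Y filling 10 bp below the price Broker X would have filled at. That is one reasonable reading of the example, not a statement about how any real broker behaves.

```python
def total_cost(shares, quoted_price, fee, improvement_bp=0.0):
    """Cash paid to buy `shares`, assuming the fill beats the quote by `improvement_bp`."""
    fill_price = quoted_price * (1 - improvement_bp / 10_000)
    return shares * fill_price + fee

def break_even_shares(quoted_price, fee, improvement_bp):
    """Trade size at which the price improvement exactly pays for the fee."""
    return fee / (quoted_price * improvement_bp / 10_000)

cost_x = total_cost(100, 100.0, fee=0.0)                      # no-fee Broker X
cost_y = total_cost(100, 100.0, fee=7.95, improvement_bp=10)  # Broker Y
saving = cost_x - cost_y
threshold = break_even_shares(100.0, 7.95, 10)

print(round(saving, 2))
print(round(threshold, 1))
```

Under these assumptions Alice saves about $2.05 with Broker Y, and the fee pays for itself on any trade of roughly 80 shares or more at this price.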

Myth 10: Taxes are something your accountant figures out, and shouldn’t affect your trading. Selling at the best price is all that matters. Taxes can eat a lot of your profit, and should be a primary consideration. Tax planning involves choosing accounts to trade in (401K or other tax-deferred vs regular), realizing losses to offset gains, and choosing assets with low turnover. As mentioned, some mutual funds can generate capital gains through their internal trading. In extreme cases, you could pay significant tax on a losing position in one.

Why are taxes so important to trading? The main reason is that there can be a 25% (or more) difference in tax rate between a LTCG and a STCG. STCGs often are taxed punitively, or at best are treated like ordinary income. Here in MA, the state tax alone is 12% for STCGs vs 5% for LTCGs. Federally, STCGs are treated as ordinary income while LTCGs have their own lower rate.

A STCG is a gain on a position held for one year or less, while a LTCG is a gain on a position held for more than one year. Note that it is the individual positions that matter. If Bob owns 200 shares of ZZZ, bought in two batches, then each batch has its own basis and its own purchase date. Also note that most stock option positions result in a STCG or STCL. A STCG only can be offset by a STCL, but a LTCG can be offset by a LTCL or STCL. Clearly, STCLs are more valuable than LTCLs. They can be rolled to subsequent years under some circumstances, but may be automatically wasted against LTCGs if you are not careful.

A good understanding of these details can save a lot of money. To understand the impact, suppose Alice has a (state+federal) 20% LTCG marginal tax rate and a 45% STCG marginal tax rate. She makes $10,000 on a trade, not offset by any loss. If it is a LTCG, she pays $2000 in taxes and keeps $8000. If it is a STCG, she pays $4500 and keeps $5500. That’s an additional $2500 out of her pocket. Since the markets pay us to take risk, she must take more risk or tie up more capital to make the same $8000 of after-tax profit. How much more capital? Not just the missing 25%, because the extra profit will be taxed at 45% as well. We solve 0.55x = 8000, to get x = 14,545. Alice must tie up 45% more capital or (loosely speaking) take 45% more risk to walk away with the same after-tax profit.
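Here is the same calculation as a few lines of code, in case you want to plug in your own rates. The function names are hypothetical and the 20% and 45% marginal rates are just the ones assumed for Alice above.

```python
def after_tax(profit, rate):
    """Profit kept after paying tax at the given marginal rate."""
    return profit * (1 - rate)

def pre_tax_needed(target_after_tax, rate):
    """Pre-tax profit required to keep `target_after_tax` at the given rate."""
    return target_after_tax / (1 - rate)

ltcg_rate, stcg_rate = 0.20, 0.45

kept_ltcg = after_tax(10_000, ltcg_rate)          # what Alice keeps on a LTCG
kept_stcg = after_tax(10_000, stcg_rate)          # what she keeps on a STCG
needed = pre_tax_needed(kept_ltcg, stcg_rate)     # solves 0.55 x = 8000
extra_fraction = needed / 10_000 - 1              # extra pre-tax profit required

print(round(kept_ltcg), round(kept_stcg))
print(round(needed), round(100 * extra_fraction))
```

The extra fraction comes out to roughly 45%, matching the figure in the text. That the number happens to equal the STCG rate is a coincidence of this example.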

Myth 11: Options are like leveraged stock. No. This is untrue for many reasons, but I’ll point out one specific issue. Options can be thought of as volatility bets. Yes, the Black Scholes formula depends on the stock price in a nonlinear manner, and yes the Black Scholes model significantly underestimates tail risk. But for many purposes, it pays to think of options as predominantly volatility-based. Let’s return to our absurd but illustrative earlier scenario involving Bob bidding himself up and Alice happily making money off him.

As before, they trade ZZZ stock and are the only market participants but don’t know it. They run up their positions as before, with Bob buying a share from Alice at $100, then $101, up to $109. He now owns 10 shares. Both are so excited to be trading, they fall over backward in their chairs and bang their heads. Alice goes from pessimistic to optimistic, while Bob goes from optimistic to pessimistic. He wants to unload some of his stock, and offers to sell a share at $109. Alice now is optimistic, so she buys. He tries again, but gets no bite so he lowers the price to $108. Alice thinks this is a good deal and snaps it up. Bob sees the price dropping and decides to get out while he can. He offers at $107, Alice buys. And so on. At $100 he has sold his last share. Both are back where they started, as is the last reported trade price of ZZZ. At this point, both lean back in relief and their chairs topple over again. Now they’re back to their old selves, and they repeat the original pattern, with Alice selling to Bob at $100, $101, etc. Their chairs are very unstable, and this pattern repeats several times during the day. The last leg of the day is a downward one.

The day’s trading involves ZZZ stock price see-sawing between 100 and 109, and the price ends where it started. Consider somebody trading the options market (maybe Alice and Bob are the only active stock traders that day because everybody else is focusing on the options market). The price of ZZZ is unchanged between the open and close, but the prices of most ZZZ call and put options have risen dramatically. Option prices are driven by several things: the stock price, the strike price, the time to expiry, and the volatility. If the stock price rises dramatically, put options will go down but not as much as the price change would seem to warrant. This is because the volatility has increased. In our see-saw case, the volatility rose even when the stock price remained the same.
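A rough way to see the point is to compare the see-saw price path with a flat one. The sketch below computes realized volatility as the sample standard deviation of trade-to-trade log returns. That choice, the function name, and the three repetitions of the pattern are assumptions for illustration. Option prices actually depend on implied volatility, for which realized volatility is only a crude stand-in.

```python
import math
import statistics

def realized_vol(prices):
    """Sample standard deviation of log returns between consecutive trades."""
    returns = [math.log(b / a) for a, b in zip(prices, prices[1:])]
    return statistics.stdev(returns)

up_leg = list(range(100, 110))         # Bob buys at 100, 101, ..., 109
down_leg = list(range(109, 99, -1))    # Bob sells at 109, 108, ..., 100
see_saw = (up_leg + down_leg) * 3      # the pattern repeats, ending on a down leg
flat = [100] * len(see_saw)            # same number of trades, no movement

net_change = see_saw[-1] - see_saw[0]
print(net_change)
print(realized_vol(see_saw) > realized_vol(flat))
```

Both paths open and close at 100, so a chart of daily closes cannot tell them apart, but only the flat one has zero volatility.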

Myth 12: There are 12 myths.

Two-Envelope Problems

August 13, 2019 K.M. Halpern

Let’s visit a couple of fun and extremely counterintuitive problems which sit in the same family. The first appears to be a “paradox,” and illustrates a subtle fallacy. The second is an absolutely astonishing (and legitimate) algorithm for achieving better than 50-50 odds of picking the higher of two unknown envelopes. Plenty of articles have discussed who discovered what ad nauseam so we’ll just dive into the problems.

— The Two Envelope Paradox: Optimizing Expected Return —

First, consider the following scenario. Suppose you are shown two identical envelopes, each containing some amount of money unknown to you. You are told that one contains double the money in the other (but not which is which or what the amounts are) and are instructed to choose one. The one you select is placed in front of you and its contents are revealed. You then are given a second choice: keep it or switch envelopes. You will receive the amount in the envelope you choose. Your goal is to maximize your expected payment.

Our intuition tells us that no information has been provided by opening the envelope. After all, we didn’t know the two values beforehand so learning one of them tells us nothing. The probability of picking the higher envelope should be ${1/2}$ regardless of whether we switch or not. But you weren’t asked to improve on the probability, just to maximize your expected payment. Consider the following 3 arguments:

1. Let the amount in the envelope you initially chose be ${z}$ . If it is wrong to switch then the other envelope contains ${z/2}$ , but if it is right to switch it contains ${2z}$ . There are even odds of either, so your expectation if you switch is ${1.25z}$ . This is better than the ${z}$ you get by sticking with the initial envelope, so it always is better to switch!
2. Since we don’t know anything about the numbers involved, opening the first envelope gives us no information, so ignore that value. Call the amount in the other envelope ${z'}$ . If it is wrong to switch then the envelope you chose contains ${2z'}$ , and if right to switch it contains ${0.5z'}$ . If you switch, you get ${z'}$ but if you don’t your expectation is ${1.25z'}$ . So it always is better NOT to switch!
3. Call the amounts in the two envelopes ${x}$ and ${2x}$ (though you don’t know which envelope contains which). You pick one, but there is equal probability of it being either ${x}$ or ${2x}$ . The expected reward thus is ${1.5x}$ . If you switch, the same holds true for the other envelope. So you still have an expected reward of ${1.5x}$ . It doesn’t matter what you do.

Obviously, something is wrong with our logic. One thing that is clear is that we’re mixing apples and oranges with these arguments. Let’s be a bit more consistent with our terminology. Let’s call the value that is in the opened envelope ${z}$ and the values in the two envelopes ${x}$ and ${2x}$ . We don’t know which envelope contains each, though. When we choose the first envelope, we observe a value ${z}$ . This value may be ${x}$ or ${2x}$ .

In the 3rd argument, ${P(z=x)= P(z=2x)= 0.5}$ . If we switch, then ${\langle V \rangle= P(z=x)2x+P(z=2x)x = 1.5x}$ . If we keep the initial envelope then ${\langle V \rangle= P(z=x)x+P(z=2x)2x = 1.5x}$ . Whether we switch or not, the expected value is ${1.5x}$ though we do not know what this actually is. It could correspond to ${1.5z}$ or ${0.75z}$ . We must now draw an important distinction. It is correct that ${P(z=x)= P(z=2x)= 0.5}$ for the known ${z}$ and given our definition of ${x}$ as the minimum of the two envelopes. However, we cannot claim that ${1.5x}$ is ${1.5z}$ or ${0.75z}$ with equal probability! That would be tantamount to claiming that the envelopes contain the pairs ${(z/2,z)}$ or ${(z,2z)}$ with equal probability. We defined ${x}$ to be the minimum value so the first equality holds, but we would need to impose a constraint on the distribution over that minimum value itself in order for the second one to hold. This is a subtle point and we will return to it shortly. Suffice it to say that if we assume such a thing we are led right to the same fallacy the first two arguments are guilty of.

Obviously, the first two arguments can’t both be correct. Their logic is the same and therefore they must both be wrong. But how? Before describing the problems, let’s consider a slight variant in which you are NOT shown the contents of the first envelope before being asked to switch. It may seem strange that right after you’ve chosen, you are given the option to switch when no additional information has been presented. Well, this really is the same problem. With no a priori knowledge of the distribution over ${x}$ , it is immaterial whether the first envelope is opened or not before the 2nd choice is made. This gives us a hint as to what is wrong with the first two arguments.

There actually are two probability distributions at work here, and we are confounding them. The first is the underlying distribution on ordered pairs or, equivalently, the distribution of the lower element ${x}$ . Let us call it ${P(x)}$ . It determines which two numbers ${(x,2x)}$ we are dealing with. We do not know ${P(x)}$ .

The second relevant distribution is over how two given numbers (in our case ${(x,2x)}$ ) are deposited in the envelopes (or equivalently, how the player orders the envelopes by choosing one first). This distribution unambiguously is 50-50.

The problem arises when we implicitly assume a form for ${P(x)}$ or attempt to infer information about it from the revealed value ${z}$ . Without a priori knowledge of ${P(x)}$ , being shown ${z}$ makes no difference at all. Arguments which rely solely on the even-odds of the second distribution are fine, but arguments which implicitly involve ${P(x)}$ run into trouble.

The first two arguments make precisely this sort of claim. They implicitly assume that the pairs ${(z/2,z)}$ or ${(z,2z)}$ can occur with equal probability. Suppose they couldn’t. For simplicity (and without reducing the generality of the problem), let’s assume that the possible values in the envelopes are constrained to ${2^n}$ with ${n\in Z}$ . The envelopes thus contain ${(2^n,2^{n+1})}$ for some integer ${n}$ (though we don’t know which envelope contains which value). For convenience, let’s work in terms of ${\log_2}$ of the values involved (taking care to use ${2^n}$ when computing expectations).

In these terms, the two envelopes contain ${(n,n+1)}$ for some ${n=\log_2(x)}$ (defined to be the lesser of the two). We open one, and see ${m=\log_2(z)}$ . If it is the upper then the pair is ${(m-1,m)}$ , otherwise the pair is ${(m,m+1)}$ . To claim that these have equal probabilities means that ${n=m-1}$ and ${n=m}$ are equally probable. We made this assumption independent of the value of ${m}$ , so it would require that all pairs ${(n,n+1)}$ be equally probable.

So what? Why not just assume a uniform distribution? Well, for one thing, we should be suspicious that we require an assumption about ${P(x)}$ . The 3rd argument requires no such assumption. Even if we were to assume a form for ${P(x)}$ , we can’t assume it is uniform. Not just can’t as in “shouldn’t”, but can’t as in “mathematically impossible.” It is not possible to construct a uniform distribution on ${Z}$ .

Suppose we sought to circumvent this issue by constraining ourselves to some finite range ${[M,N]}$ , which we supposedly know or assume a priori. We certainly can impose a uniform distribution on it. Each pair ${(n,n+1)}$ has probability ${1/(N-M)}$ with ${n\in [M,N-1]}$ , since there are ${N-M}$ such pairs. But now we’ve introduced additional information (in the form of ${N}$ and ${M}$ ), and it no longer is surprising that we can do better than even-odds! We always would switch unless the first envelope contained ${N}$ . There is no contradiction between the first two arguments because we have a priori knowledge and are acting on it. We no longer are true to the original game.

Rather than dwell on this particular case, let’s solve the more general case of a given ${P(x)}$ (or in terms of ${\log_2}$ , ${P(n)}$ ). For any ${n}$ drawn according to ${P(n)}$ , the envelopes contain ${(n,n+1)}$ in some order and it is equally likely that ${m=n}$ and ${m=n+1}$ . If we know ${P}$ we can bet accordingly since it contains information. In that case, knowing ${m}$ (i.e. ${z}$ ) helps us. Let’s suppose we don’t know ${P}$ . Then it still does not matter whether we observe the value ${z}$ , because we don’t know the underlying distribution!

There only are two deterministic strategies: always keep, always switch. Why? Suppose that the drawn value is ${n}$ (unknown to us) and the observed value is ${m}$ . Note that these don’t require actual knowledge of the ${m}$ value, just that it has been fixed by the process of opening the envelope. Since we don’t know the underlying distribution, our strategy will be independent of the actual value. Given that the value doesn’t matter, we have nothing to do but always keep or always switch.

First consider the expected value with the always-keep strategy:

$\displaystyle \langle V_K \rangle= \sum_{n=-\infty}^\infty P(n) [P(m=n|n) 2^n + P(m=n+1|n) 2^{n+1}]$

I.e. we sum over all possible ordered pairs ${(n,n+1)}$ and then allow equal probability ${P(m=n+1|n)=P(m=n|n)=0.5}$ for either of the two envelope orders. So we have ${\langle V_K \rangle= \sum P(n) (2^n+2^{n+1})/2 = 3 \langle 2^{n-1} \rangle}$ . We immediately see that for this to be defined the probability distribution must drop faster than ${2^{-n}}$ as ${n}$ gets large! We already have a constraint on the possible forms for ${P}$ .

Next consider the always-switch strategy. It’s easy to see that we get the same result:

$\displaystyle \langle V_S \rangle= \sum_{n=-\infty}^\infty P(n) [P(m=n|n) 2^{n+1} + P(m=n+1|n) 2^{n}]$

and since ${P(m=n|n)= P(m=n+1|n)}$ we get the same answer.

But let’s be extra pedantic, and connect this to the original formulation of the first two arguments. I.e., we should do it in terms of ${m}$ , the observed value.

$\displaystyle \langle V_S \rangle= \sum_m P(m) [P(n=m|m) 2^{m+1} + P(n=m-1|m) 2^{m-1}]$

We observe that ${P(n=m|m)= P(m|n=m)P(n=m)/P(m)}$ and ${P(n=m-1|m)= P(m|n=m-1)P(n=m-1)/P(m)}$ . We know that ${P(m|n=m)= P(m|n=m-1)= 0.5}$ . Plugging these in, we get

$\displaystyle \langle V_S \rangle= \sum_m [0.5 P(n=m) 2^{m+1} + 0.5 P(n=m-1) 2^{m-1}]$

The first term gives us ${\sum_n P(n) 2^n}$ . We can rewrite the index on the 2nd sum to get ${\sum_n P(n) 2^{n-1}}$ , which gives us ${\langle V_S \rangle= \sum_n P(n) (2^n + 2^{n-1})}$ , the exact same expression as before!

How does this apply to the ${[M,N]}$ ranged example we gave before? When we discussed it, we considered the case where the underlying distribution was known. In that and all other cases, a better than even-odds strategy based on such knowledge can be computed. In our actual formulation of the game, we don’t know ${P(n)}$ and there’s no reason it couldn’t be uniform on some unknown interval ${[M,N]}$ . Suppose it was. It still seems from our earlier discussion as if we’d do better by always switching. We don’t. The average amount thrown away by incorrectly switching when ${m=N}$ exactly offsets the average gain from switching in all other cases. We do no better by switching than by keeping.
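As a quick sanity check, here is a minimal sketch that computes the expected payoffs exactly for a uniform ${P(n)}$ on a small range. The function name and the choice ${M=0}$ , ${N=4}$ are ours, purely for illustration. It compares always-keep, always-switch, and the informed rule (switch unless the opened exponent equals ${N}$ ), which is only available if we actually know ${N}$ .

```python
from fractions import Fraction

def expected_values(M, N):
    """Exact expected payoffs when n is uniform on the integers M..N-1
    and the envelopes hold 2**n and 2**(n+1)."""
    pairs = range(M, N)                 # each n gives the pair (n, n+1)
    p_n = Fraction(1, len(pairs))       # uniform P(n)
    half = Fraction(1, 2)               # chance of opening either envelope
    keep = switch = informed = Fraction(0)
    for n in pairs:
        lo, hi = Fraction(2) ** n, Fraction(2) ** (n + 1)
        keep += p_n * half * (lo + hi)      # opened lo or hi, kept it
        switch += p_n * half * (hi + lo)    # opened lo or hi, took the other
        # Informed rule (requires knowing N): keep only when m == N.
        got_if_opened_lo = hi               # m = n < N, so we switch
        got_if_opened_hi = hi if n + 1 == N else lo
        informed += p_n * half * (got_if_opened_lo + got_if_opened_hi)
    return keep, switch, informed

keep, switch, informed = expected_values(0, 4)
print(keep, switch, informed)
```

The two blind strategies agree exactly, while the informed rule does better. The improvement comes entirely from knowing ${N}$ .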

We thus see that without knowing the underlying distribution ${P(x)}$ , the switching and keeping strategies have the same expected reward. Of the three arguments we originally proposed, the first 2 were flawed in that they assume a particular, and impossible, underlying distribution for ${x}$ .

At the beginning of our discussion, we mentioned that our intuition says you cannot do better than 50-50 probability-wise. Let us set aside expected rewards and focus solely on probabilities. We now see how you actually can do better than 50-50, contrary to all intuition!

— Achieving better than 50-50 Odds with Two Envelopes —

Next let’s consider a broader class of two-envelope problems, but purely from the standpoint of probabilities. Now the two envelopes can contain any numbers; one need not be double the other. As before, we may choose an envelope, it is opened, and we are offered the opportunity to keep it or switch. Unlike before, our goal now is to maximize the probability of picking the larger envelope.

Since we are dealing with probabilities rather than expectation values, we don’t care what two numbers the envelopes contain. In fact, they need not be numbers at all, as long as they are distinct and comparable (i.e. ${a<b}$ or ${b<a}$ but not both). To meaningfully analyze the problem we require a slightly stronger assumption, though: specifically that the set from which they are drawn (without repetition) possesses a strict linear ordering. However, it need not even possess any algebraic structure or a metric. Since we are not concerned with expectation values, no such additional structure is necessary.

Our intuition immediately tells us that nothing can be gained by switching. In fact, nothing we do should have any impact on the outcome. After all, the probability of initially picking correctly is ${1/2}$ . Switching adds no information and lands us with an identical ${1/2}$ probability. And that is that, right? It turns out that, contrary to our very strong intuition about the problem, there is in fact a way to improve those odds. To accomplish this, we’ll need to introduce a source of randomness. For convenience of exposition we’ll assume the envelopes contain real numbers, and revisit the degree to which we can generalize the approach later.

The procedure is as follows:

Pick any continuous probability distribution ${P}$ which has support on all of ${R}$ (i.e. ${p(x)>0}$ for all real ${x}$ ). Many common distributions (normal, Cauchy, logistic, etc.) are fine. Distributions supported only on part of the line (such as the beta or exponential) are not.
Choose an envelope and open it. We’ll denote its value ${z}$ .
Sample some value ${d}$ from our distribution ${P}$ . If ${z>d}$ stick with the initial choice, otherwise switch. We’ll refer to ${z>d}$ or ${z<d}$ because the probability that ${z=d}$ has measure ${0}$ and safely can be ignored.
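The three steps above can be sketched as a small simulation. This is only an illustration under assumptions of our own: the envelope values ${(0,1)}$ , the standard normal for ${P}$ , and the function names are all arbitrary choices.

```python
import random

def play_once(x, y, rng, sample_d):
    """One round: envelopes hold x < y; return True if we end with y."""
    z, other = (x, y) if rng.random() < 0.5 else (y, x)  # fair initial pick
    d = sample_d(rng)                  # draw d from our own distribution P
    final = z if z > d else other      # keep if z > d, otherwise switch
    return final == y

def win_rate(x, y, trials=200_000, seed=0):
    """Fraction of rounds in which we end up holding the larger value."""
    rng = random.Random(seed)
    sample_d = lambda r: r.gauss(0.0, 1.0)   # P = standard normal (assumed)
    wins = sum(play_once(x, y, rng, sample_d) for _ in range(trials))
    return wins / trials

rate = win_rate(0.0, 1.0)
print(rate)
```

For this pair the observed rate sits noticeably above one half. How far above depends on how much of ${P}$ happens to fall between the two values.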

At first, second, and ${n^{th}}$ glance, this seems pointless. It feels like all we’ve done is introduce a lot of cruft which will have no effect. We can go stand in a corner flipping a coin, play Baccarat at the local casino, cast the bones, or anything else we want, and none of that can change the probability that we’re equally likely to pick the lower envelope as the higher one initially — and thus equally likely to lose as to gain by switching. With no new information, there can be no improvement. Well, let’s hold that thought and do the calculation anyway. Just for fun.

First some terminology. We’ll call the value in the opened envelope ${z}$ , and the value in the other envelope ${z'}$ . The decision we must make is whether to keep ${z}$ or switch to the unknown ${z'}$ . We’ll denote by ${x}$ and ${y}$ the values in the two envelopes in order. I.e., ${x<y}$ by definition. In terms of ${z}$ and ${z'}$ we have ${x= \min(z,z')}$ and ${y= \max(z,z')}$ . We’ll denote our contrived distribution ${P}$ in the abstract, with pdf ${p(v)}$ and cdf ${F(v)=\int_{-\infty}^v p(v') dv'}$ .

Let’s examine the problem from a Bayesian perspective. There is a 50-50 chance that ${(z,z')=(x,y)}$ or ${(z,z')=(y,x)}$ . So ${p(z=x)=p(z=y)=0.5}$ . There are no subtleties lurking here. We’ve assumed nothing about the underlying distribution over ${(x,y)}$ . Whatever ${(x,y)}$ the envelopes contain, we are equally likely to initially pick the one with ${x}$ or the one with ${y}$ .

Once the initial envelope has been opened, and the value ${z}$ revealed, we sample ${d}$ from our selected distribution ${P}$ and clearly have ${p(d<x)=F(x)}$ and ${p(d<y)=F(y)}$ and ${p(d<z)=F(z)}$ . The latter forms the criterion by which we will keep ${z}$ or switch to ${z'}$ . Please note that in what follows, ${d}$ is not a free variable, but rather a mere notational convenience. Something like ${p(x<d)}$ is just notation for “the probability the sampled value is greater than ${x}$ .” We can apply Bayes’ law to get (with all probabilities conditional on some unknown choice of ${(x,y)}$ ):

$\displaystyle p(z=x|d<z)= \frac{p(d<z|z=x)p(z=x)}{p(d<z)}$

What we really care about is the ratio:

$\displaystyle \frac{p(z=x | d<z)}{p(z=y | d<z)}= \frac{p(d<z|z=x)p(z=x)}{p(d<z|z=y)p(z=y)}= \frac{F(x)}{F(y)}<1$

Here, we’ve observed that ${p(d<z|z=x)= p(d<x)= F(x)}$ and ${F(x)<F(y)}$ since by assumption ${x<y}$ and ${F}$ is monotonically increasing (we assumed its support is all of ${R}$ ). I.e., if ${d<z}$ there is a greater probability that ${z=y}$ than ${z=x}$ . We shouldn’t switch. A similar argument shows we should switch if ${d>z}$ .

So what the heck has happened, and where did the new information come from? What happened is that we actually know one piece of information we had not used: that the interval ${(x,y)}$ has nonzero probability measure. I.e. there is some “space” between ${x}$ and ${y}$ . We don’t know the underlying distribution but we can pretend we do. Our strategy will be worse than if we did know the underlying ${p(x)}$ , of course. We’ll return to this shortly, but first let’s revisit the assumptions which make this work. We don’t need the envelopes to contain real numbers, but we do require the following of the values in the envelopes:

The set of possible values forms a measurable set with a strict linear ordering.
Between any two elements there is a volume with nonzero probability. Actually, this only is necessary if we require a nonzero improvement for any ${(x,y)}$ . If we only require an improvement on average we don’t need it. But in that scenario, the host can contrive to use a distribution which neutralizes our strategy and returns us to 50-50 odds.

What difference does ${P}$ itself make? We don’t have any way to choose an “optimal” distribution because that would require placing the bulk of probability where we think ${x}$ and ${y}$ are likely to lie. I.e. we would require prior knowledge. All we can guarantee is that we can improve things by some (perhaps tiny) amount. We’ll compute how much (for a given true underlying distribution) shortly.

Let’s assume that ${Q(x,y)}$ is the true underlying distribution over ${(x,y)}$ . We won’t delve into what it means to “know” ${Q}$ since we are handed the envelopes to begin with. Perhaps the game is played many times with values drawn according to ${Q}$ or maybe it is a one-time affair with ${(x,y)}$ fixed (i.e. ${Q}$ a ${\delta}$ -distribution). Ultimately, such considerations just would divert us to the standard core philosophical questions of probability theory. Suffice to say that there exists some ${Q(x,y)}$ . By definition ${Q(x,y)=0}$ unless ${x<y}$ . For convenience, we’ll define a symmetrized version as well: ${q(a,b)\equiv Q(a,b)+Q(b,a)}$ . We don’t employ a factor of ${1/2}$ since the two terms are nonzero on disjoint domains.

Given ${Q}$ , what gain do we get from a particular choice of ${P}$ ?

$\displaystyle \begin{array}{rcl} P(win)= \int_{x<y} dx dy Q(x,y)[p(z=x|(x,y))p(x<d) \\ + p(z=y|(x,y))p(d<y)] \end{array}$

I.e., the probability we keep ${z}$ when it is ${y}$ and switch when it is ${x}$ . Clearly, ${p(z=x|(x,y))= p(z=y|(x,y))= 0.5}$ since those are the immutable 50-50 envelope ordering probabilities. After a little rearrangement, we get:

$\displaystyle P(win)= \frac{1}{2} + \frac{1}{2}\langle F(y) - F(x) \rangle_Q$

Our gain is half the mean value of ${F(y)-F(x)}$ over the joint distribution ${Q(x,y)}$ . The more probability ${P}$ jams between ${x}$ and ${y}$ , the more we gain should that ${(x,y)}$ arise. But without knowledge of the underlying joint distribution ${Q(x,y)}$ , we have no idea how best to pick ${P}$ . All we can do is guarantee some improvement.
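For a single fixed pair (i.e. ${Q}$ a ${\delta}$ -distribution), the integrand reduces to ${0.5\,(1-F(x)) + 0.5\,F(y)}$ , which we can evaluate directly. The sketch below does so for a standard normal ${P}$ . The two pairs are arbitrary illustrative choices.

```python
from statistics import NormalDist

def p_win(x, y, cdf):
    """Win probability for a fixed pair x < y, given the cdf F of P."""
    switch_when_low = 0.5 * (1.0 - cdf(x))   # opened x, d > x, so we switch
    keep_when_high = 0.5 * cdf(y)            # opened y, d < y, so we keep
    return switch_when_low + keep_when_high

F = NormalDist(mu=0.0, sigma=1.0).cdf       # P = standard normal (assumed)
near = p_win(0.0, 1.0, F)   # pair sitting where P has most of its mass
far = p_win(3.0, 4.0, F)    # pair out in the tail of P
print(near, far)
```

Both values exceed one half, but the pair in the tail gains very little. That is the sense in which the improvement is guaranteed yet possibly tiny.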

How well can we do if we actually know ${Q}$ ? Well, there are two ways to use such information. We could stick to our strategy and try to pick an optimal ${P}$ , or we could seek to use knowledge of ${Q}$ directly. In order to do the former, we need to exercise a little care. ${Q}$ is a two-dimensional distribution while ${P}$ is one-dimensional. How would we use ${Q}$ to pick ${P}$ ? Well, this is where we make use of the observed ${z}$ .

In our previous discussion of the ${(x,2x)}$ envelope switching fallacy, the value of ${z}$ turned out to be a red-herring. Here it is not. Observing ${z}$ is essential here, but only for computation of probabilities. As mentioned, we assume no algebraic properties and are computing no expectations. We already know that the observation of ${z}$ is critical, since our algorithm pivots on a comparison between ${z}$ and our randomly sampled value ${d}$ . Considering our ultimate goal (keep or switch), it is clear what we need from ${Q}$ : a conditional probability that ${z'>z}$ . However, we cannot directly use ${Q(y|x)}$ because we defined ${x<y}$ . We want ${p(z'|z)}$ and we don’t know whether ${z<z'}$ or ${z'<z}$ . Let’s start by computing the probability of ${z}$ (being the observed value) and of ${z,z'}$ (being the observed and unobserved values).

The probability of observing ${z}$ and the other envelope having ${z'}$ is the probability that the relevant ordered pair was chosen for the two envelopes multiplied by the ${1/2}$ probability that we initially opened the envelope containing the value corresponding to our observed ${z}$ rather than the other one.

$\displaystyle p(z,z')= Q(\min(z,z'),\max(z,z'))/2= q(z,z')/2$

To get ${p(z)}$ we integrate this. ${p(z)= \frac{1}{2}\int Q(z,y)dy + \frac{1}{2}\int Q(x,z)dx}$ . This is a good point to introduce two quantities which will be quite useful going forward.

$\displaystyle I_1(z)\equiv \int_{-\infty}^z Q(x,z) dx$

$\displaystyle I_2(z)\equiv \int_z^\infty Q(z,y) dy$

In terms of these,

$\displaystyle p(z)= \frac{1}{2}[I_1(z)+I_2(z)]$

There’s nothing special about calling the variables ${x}$ or ${y}$ in the integrals and it is easy to see (since each only covers half the domain) that we get what we would expect:

$\displaystyle p(z)= \frac{1}{2}\int q(w,z)dw$

What we want is the distribution ${p(z'|z)= p(z,z'|z)= p(z,z')/p(z)= q(z,z')/(2p(z))}$ . This gives us:

$\displaystyle p(z'|z)= \frac{q(z,z')}{\int q(w,z)dw}= \frac{q(z,z')}{I_1(z)+I_2(z)}$

Finally, this gives us the desired quantity ${p(z'>z)= \int_{z'>z} dz' p(z'|z)}$ . It is easy to see that:

$\displaystyle p(z'<z)= \frac{I_1(z)}{I_1(z)+I_2(z)}$

$\displaystyle p(z'>z)= \frac{I_2(z)}{I_1(z)+I_2(z)}$
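To make ${I_1}$ and ${I_2}$ concrete, here is a minimal numerical sketch for one hypothetical ${Q}$ of our own choosing: two independent Uniform(0,1) draws sorted so that ${x<y}$ , giving ${Q(x,y)=2}$ on ${0<x<y<1}$ . The crude midpoint integrator is just for illustration.

```python
def Q(x, y):
    """Hypothetical joint density: two independent Uniform(0,1) draws,
    sorted so that x < y. Zero unless 0 < x < y < 1."""
    return 2.0 if 0.0 < x < y < 1.0 else 0.0

def integrate(f, a, b, steps=10_000):
    """Midpoint rule, adequate for this piecewise-constant example."""
    h = (b - a) / steps
    return h * sum(f(a + (i + 0.5) * h) for i in range(steps))

def I1(z):
    """Weight of pairs whose larger value is z (Q vanishes below 0)."""
    return integrate(lambda x: Q(x, z), 0.0, z)

def I2(z):
    """Weight of pairs whose smaller value is z (Q vanishes above 1)."""
    return integrate(lambda y: Q(z, y), z, 1.0)

def p_other_larger(z):
    """p(z' > z) = I2 / (I1 + I2)."""
    return I2(z) / (I1(z) + I2(z))

z = 0.3
print(I1(z), I2(z), p_other_larger(z))
```

For this ${Q}$ the integrals have the closed forms ${I_1(z)=2z}$ and ${I_2(z)=2(1-z)}$ , so the chance that the other envelope is larger is simply ${1-z}$ .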

As an example, consider the previous ${(x,2x)}$ case, where one envelope holds twice what the other does. We observe ${z}$ , and ${z'}$ must be either ${2z}$ or ${z/2}$ , though we don’t know with what probabilities. If we are given the underlying distribution on ${x}$ , say ${P_2(x)}$ , we can figure that out. ${Q(x,y)= P_2(x)\delta(y-2x)}$ and ${q}$ is the symmetrized version. ${\int q(w,z)dw= \int dw [Q(w,z)+Q(z,w)]= (P_2(z/2)+P_2(z))}$ . So ${p(z)= \frac{1}{2}(P_2(z/2)+P_2(z))}$ . This is just what we’d expect: the observed ${z}$ is either the larger value (so ${x=z/2}$ ) or the smaller one (so ${x=z}$ ). We’re really dealing with discrete values and are being sloppy (which ends us up with a ratio of infinities from the ${\delta}$ function when computing probability ratios, but we’ll ignore that here). The relevant probability ratio clearly is ${P_2(z/2)/P_2(z)}$ . From a purely probability standpoint, we should switch if ${P_2(z)>P_2(z/2)}$ . If we reimpose the algebraic structure and try to compute expectations (as in the previous problem) we would get an expected value of ${z}$ from keeping and an expected value of ${z[P_2(z/2)/2 + 2P_2(z)]/[P_2(z/2)+P_2(z)]}$ from switching. Whether this is less than or greater than ${z}$ depends on the distribution ${P_2}$ .

Returning to our analysis, let’s see how often we are right about switching if we know the actual distribution ${Q}$ and use that knowledge directly. The strategy is obvious. Using our above formulae, we can compute ${p(z'<z)}$ directly. To optimize our probability of winning, we observe ${z}$ then we switch iff ${I_1(z)<I_2(z)}$ . If there is additional algebraic structure and expectations can be defined, then an analogous calculation gives whatever switching criterion maximizes the relevant expectation value.

In terms of probabilities, full knowledge of ${Q}$ is the best we can do. The probability we act correctly is:

$\displaystyle \begin{array}{rcl} P'(win)= \int dz\, p(z) \frac{[\theta(I_1(z)-I_2(z)) I_1(z) + \theta(I_2(z)-I_1(z))I_2(z)]}{I_1(z)+I_2(z)} \\ = \frac{1}{2}\int dz \max(I_1(z),I_2(z)) \end{array}$

$\displaystyle P'(win|z)= \frac{\max(I_1(z),I_2(z))}{I_1(z)+I_2(z)}$

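If ${I_1}$ and ${I_2}$ cross exactly once (with ${I_2}$ larger at small ${z}$ and ${I_1}$ larger at large ${z}$ ), we have a cutoff value ${\hat z}$ (defined by ${I_1({\hat z})= I_2({\hat z})}$ ) below which we should switch and above which we should not. In general there can be several crossings, and the rule remains to switch wherever ${I_1(z)<I_2(z)}$ .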
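Here is a rough check of the known- ${Q}$ strategy for one hypothetical case of our own choosing: the pair is two independent Uniform(0,1) draws, sorted, for which ${I_1(z)=2z}$ and ${I_2(z)=2(1-z)}$ on ${0<z<1}$ and the cutoff is ${\hat z=1/2}$ . The sketch weights ${P'(win|z)}$ by ${p(z)}$ numerically and compares the result with a direct simulation of the rule "switch iff ${I_1(z)<I_2(z)}$ ".

```python
import random

def I1(z):
    return 2.0 * z            # closed form for the hypothetical Q

def I2(z):
    return 2.0 * (1.0 - z)    # closed form for the hypothetical Q

def exact_win_probability(steps=100_000):
    """Integrate p(z) * max(I1, I2) / (I1 + I2) over 0 < z < 1."""
    h = 1.0 / steps
    total = 0.0
    for i in range(steps):
        z = (i + 0.5) * h
        p_z = 0.5 * (I1(z) + I2(z))
        total += h * p_z * max(I1(z), I2(z)) / (I1(z) + I2(z))
    return total

def simulated_win_rate(trials=200_000, seed=1):
    """Play the game repeatedly using the known-Q switching rule."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        x, y = sorted((rng.random(), rng.random()))
        z, other = (x, y) if rng.random() < 0.5 else (y, x)
        final = other if I1(z) < I2(z) else z   # switch iff I1(z) < I2(z)
        wins += final == y
    return wins / trials

exact = exact_win_probability()
simulated = simulated_win_rate()
print(exact, simulated)
```

Both numbers come out at about three quarters for this ${Q}$ , which is the ceiling against which any contrived ${P}$ should be judged in this example.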

How do we do with our invented ${P}$ instead? We could recast our earlier formula for ${P(win)}$ into our current notation, but it’s easier to compute directly. For given ${z}$ , the actual probability of needing to switch is ${I_2(z)/(I_1(z)+I_2(z))}$ . Based on our algorithm, we will do so with probability ${P(z<d)= 1-F(z)}$ . The probability of not needing to switch is ${I_1(z)/(I_1(z)+I_2(z))}$ and we keep with probability ${P(z>d)= F(z)}$ . I.e., our probability of success for given ${z}$ is:

$\displaystyle P(win|z)= \frac{I_1(z)F(z) + I_2(z)(1-F(z))}{I_1(z)+I_2(z)}$

For any given ${z}$ , this is of the form ${\alpha r + (1-\alpha)(1-r)}$ where ${r= F(z)}$ and ${\alpha= I_1(z)/(I_1(z)+I_2(z))}$ . Being linear in ${r}$ , it is maximized at one end or the other. So it obviously is best to have ${F(z)=0}$ when ${z<{\hat z}}$ and ${F(z)=1}$ when ${z>{\hat z}}$ . This would be discontinuous, but we could come up with a smoothed step function (e.g. a logistic function) which is differentiable but arbitrarily sharp. The gist is that we want all the probability in ${P}$ concentrated around ${\hat z}$ . Unfortunately, we have no idea where ${\hat z}$ is!

Out of curiosity, what if we instead pick ${P}$ to be the conditional distribution ${p(z'|z)}$ itself once we’ve observed ${z}$ ? We’ll necessarily do worse than by direct comparison using ${Q}$ (the max formula above), but how much worse? Well, ${p(z'|z)= q(z,z')/(I_1(z)+I_2(z))}$ . Integrating over ${z'<z}$ we have ${F(z)= \int_{-\infty}^z p(z'|z) dz'= I_1(z)/(I_1(z)+I_2(z))}$ . I.e., we end up with ${(I_1^2(z)+I_2^2(z))/(I_1(z)+I_2(z))^2}$ as our probability of success. If we had used ${1-p(z'|z)}$ for our ${P}$ we would get ${2I_1(z)I_2(z)/(I_1(z)+I_2(z))^2}$ instead. Neither is optimal in general.
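To see how the choice of ${F}$ plays out numerically, the sketch below averages ${P(win|z)}$ over ${p(z)}$ for a hypothetical ${Q}$ (two sorted independent Uniform(0,1) draws, so ${I_1(z)=2z}$ , ${I_2(z)=2(1-z)}$ and ${\hat z=1/2}$ ). It tries logistic cdfs of increasing sharpness centered on ${\hat z}$ , a sharp logistic centered far from it, and the conditional choice ${F(z)=I_1(z)/(I_1(z)+I_2(z))}$ , which is just ${z}$ here. All of these are illustrative assumptions.

```python
import math

def win_probability(F, steps=20_000):
    """Average of [I1*F + I2*(1-F)] / (I1+I2) over p(z), for the
    hypothetical Q with I1(z) = 2z and I2(z) = 2(1-z) on 0 < z < 1."""
    h = 1.0 / steps
    total = 0.0
    for i in range(steps):
        z = (i + 0.5) * h
        i1, i2 = 2.0 * z, 2.0 * (1.0 - z)
        p_z = 0.5 * (i1 + i2)
        total += h * p_z * (i1 * F(z) + i2 * (1.0 - F(z))) / (i1 + i2)
    return total

def logistic_cdf(center, scale):
    """Return the cdf of a logistic distribution as a function of z."""
    return lambda z: 1.0 / (1.0 + math.exp(-(z - center) / scale))

sharpening = [win_probability(logistic_cdf(0.5, s)) for s in (1.0, 0.1, 0.01)]
misplaced = win_probability(logistic_cdf(5.0, 0.1))
conditional = win_probability(lambda z: z)   # F(z) = I1/(I1+I2) = z here
print(sharpening, misplaced, conditional)
```

Sharpening the step around ${\hat z}$ pushes the result toward the known- ${Q}$ optimum of ${3/4}$ for this example, a sharp step in the wrong place leaves us essentially at ${1/2}$ , and the conditional choice lands at ${2/3}$ .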

Next, let’s look at the problem from an information theory standpoint. As mentioned, there are two sources of entropy: (1) the choice of the underlying pair ${(x,y)}$ (with ${x<y}$ by definition) and (2) the selection ${(z,z')=(x,y)}$ or ${(z,z')=(y,x)}$ determined by our initial choice of an envelope. The latter is a fair coin toss with no information and maximum entropy. The information content of the former depends on the (true) underlying distribution.

Suppose we have perfect knowledge of the underlying distribution. Then any given ${z}$ arises with probability ${p(z)=\frac{1}{2}[I_1(z)+I_2(z)]}$ . Given that ${z}$ , we have a Bernoulli random variable ${p(z'>z)}$ given by ${I_2(z)/(I_1(z)+I_2(z))}$ . The entropy of that specific coin toss (i.e. the conditional entropy of the Bernoulli distribution ${p(z'> z|z)}$ ) is

$\displaystyle H(z'>z|z)= \frac{-I_1(z)\ln I_1(z) - I_2(z)\ln I_2(z) + (I_1(z)+I_2(z))\ln [I_1(z)+I_2(z)]}{I_1(z)+I_2(z)}$

With our contrived distribution ${P}$ , we implicitly are operating as if ${p(z'>z)= 1-F(z)}$ . This yields a conditional entropy:

$\displaystyle H'(z'>z|z)= -(1-F(z))\ln (1-F(z)) - F(z)\ln F(z)$

There is a natural measure of the information cost of assuming an incorrect distribution. It is the Kullback-Leibler divergence (also known as the relative entropy). While it wouldn’t make sense to compute it between ${Q}$ and ${P}$ (which are, among other things, of different dimension), we certainly can compare the cost for given ${z}$ of the difference in our Bernoulli random variables for switching, and then integrate over ${z}$ to get an average cost in nats (we use natural logarithms throughout). Let’s denote by ${q(z'>z)}$ the probability based on the true distribution and keep ${p(z'>z)}$ for the contrived one. I.e. ${q(z'>z)= I_2(z)/(I_1(z)+I_2(z))}$ and ${p(z'>z)= 1-F(z)}$ . For given ${z}$ , the K-L divergence is:

$\displaystyle D(Q || P, z)= \frac{-I_2(z)\ln [(I_1(z)+I_2(z))(1-F(z))/I_2(z)] - I_1(z)\ln [(I_1(z)+I_2(z))F(z)/I_1(z)]}{I_1(z)+I_2(z)}$
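The per- ${z}$ divergence is just the K-L divergence between two Bernoulli variables, so it is easy to evaluate. The sketch below does this for a hypothetical ${Q}$ (two sorted independent Uniform(0,1) draws, for which ${q(z'>z)=1-z}$ and ${p(z)=1}$ on ${0<z<1}$ ) and averages over ${z}$ . The two choices of ${F}$ are our own illustrations, and the result is in nats since natural logarithms are used.

```python
import math

def kl_bernoulli(q, p):
    """D(q || p) in nats between Bernoulli(q) and Bernoulli(p)."""
    total = 0.0
    for a, b in ((q, p), (1.0 - q, 1.0 - p)):
        if a > 0.0:                     # 0 * ln(0/b) is taken as 0
            total += a * math.log(a / b)
    return total

def mean_divergence(F, steps=10_000):
    """Average divergence over p(z) = 1 on (0,1) for the hypothetical Q."""
    h = 1.0 / steps
    total = 0.0
    for i in range(steps):
        z = (i + 0.5) * h
        q = 1.0 - z          # true p(z' > z) = I2 / (I1 + I2)
        p = 1.0 - F(z)       # what the contrived P implicitly assumes
        total += h * kl_bernoulli(q, p)
    return total

matched = mean_divergence(lambda z: z)   # F(z) agrees with the truth
mismatched = mean_divergence(
    lambda z: 1.0 / (1.0 + math.exp(-(z - 0.5) / 0.1)))  # logistic F
print(matched, mismatched)
```

The divergence vanishes when ${1-F(z)}$ coincides with the true switching probability and is strictly positive otherwise.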

Integrating this against ${p(z)}$ , we get the mean cost in nats of being wrong.

$\displaystyle \begin{array}{rcl} \langle D(Q || P) \rangle= \frac{1}{2}\int dz [-(I_1(z)+I_2(z))\ln [I_1(z)+I_2(z)] - I_2(z)\ln (1-F(z)) \\ -I_1(z)\ln F(z) + I_1(z)\ln I_1(z) + I_2(z)\ln I_2(z)] \end{array}$

The first term is, up to an additive constant ${\ln 2}$ , simply ${H(z)}$ , the entropy of our actual distribution over ${z}$ . In fact, the first term and last 2 terms together we recognize as ${-\langle H(z'>z|z) \rangle}$ , minus the mean Bernoulli entropy of the actual distribution. In these terms, we have:

$\displaystyle \langle D(Q || P) \rangle= -\langle H(z'>z|z) \rangle + \langle \frac{ -I_2(z)\ln(1-F(z)) - I_1(z)\ln F(z)}{I_1(z)+I_2(z)} \rangle$

where the expectations are over the unconditional actual distribution ${p(z)}$ . The 2nd expectation on the right is the cross-entropy between the actual Bernoulli variable and the one implied by ${P}$ , and it carries the cost of being wrong about ${P}$ . It never is smaller than ${\langle H(z'>z|z) \rangle}$ , and the two coincide (so that there is no entropy cost) only when ${1-F(z)= I_2(z)/(I_1(z)+I_2(z))}$ wherever ${p(z)>0}$ . Note that this differs from the ${P}$ which maximizes our probability of winning. A sharp step at ${\hat z}$ wins most often, but it is badly calibrated as a model of the switching probabilities.

As an aside, this sort of probabilistic strategy should not be confused with the mixed strategies of game theory. In our case, a mixed strategy would be an a priori choice ${aK+(1-a)S}$ where ${K}$ is the always-keep strategy, ${S}$ is the always-switch strategy, and ${0\le a\le 1}$ is the probability of employing the always-keep strategy. A player would flip a biased coin with Bernoulli probability ${a}$ and choose one of the two strategies based on it. That has nothing to do with the measure-theory approach we’re taking here. In particular, a mixed strategy makes no use of the observed value ${z}$ or its relation to the randomly sampled value. Any mixed strategy gives even-odds because the two underlying deterministic strategies both have even-odds.

knotes, Notes

Differential Entropy

January 31, 2012 K.M. Halpern

A discussion of some of the subtleties of differential entropy. This also contains a review of discrete entropy, various entropy-related information quantities such as mutual information, and a listing of various axiomatic formulations.

Read the Notes (PDF)

Musing, Notes

The Optics of Camera Lens Stacks (Analysis)

August 24, 2011 K.M. Halpern

This first appeared on my tech blog. I like to play around with various configurations of camera lenses. This partly is because I prefer to save money by using existing lenses where possible, and partly because I have a neurological condition (no doubt with some fancy name in the DSM-IV) that compels me to try to figure things out. I spent 5 years at an institute because of this problem and eventually got dumped on the street with nothing but a PhD in my pocket. So let this be a warning: keep your problem secret and don’t seek help.

A typical DSLR (or SLR) owner has a variety of lenses. Stacking these in various ways can achieve interesting effects, simulate expensive lenses (which may internally be similar to such a stack), or obtain very high magnifications. Using 3 or 4 lenses, a telextender, a closeup lens, and maybe some extension rings (along with whatever inexpensive adapter rings are needed), a wide variety of combinations can be constructed. In another entry, I’ll offer a companion piece of freeware that enumerates the possible configurations and computes their optical properties.

In the present piece, I examine the theory behind the determination of those properties for any particular setup. Given a set of components (possibly reversed) and some readily available information about them and the camera, we deduce appropriate optical matrices, construct an effective matrix for the system, and extract the overall optical properties – such as focal length, nearest object distance, and maximum magnification. We account for focal play and zoom ranges as needed.

The exposition is self-contained, although this is not a course on optics and I simply list basic results. Rather, I focus on the application of matrix optics to real camera lenses. I also include a detailed example of a calculation.

As far as I am aware, this is the only treatment of its kind. Many articles discuss matrix methods or the practical aspects of reversing lenses for macro photography. However, I have yet to come across a discussion of how to deduce the matrix for a camera lens and vice-versa.

After reading the piece, you may wonder whether it is worth the effort to perform such a calculation. Wouldn’t it be easier to simply try the configurations? To modify the common adage, a month on the computer can often save an hour in the lab. The short answer is yes and no. No I’m not an economist, why do you ask?

If you have a specific configuration in mind, then trying it is easier. However, if you have a set of components and want to determine which of the hundreds of possible configurations are candidates for a given use (just because the calculation works, doesn’t mean the optical quality is decent), or which additional components one could buy to make best use of each dollar, or which adapter rings are needed, or what end of the focal ranges to use, then the calculation is helpful. Do I recommend doing it by hand? No. I even used a perl script to generate the results for the example. As mentioned, a freeware program to accomplish this task in a more robust manner will be forthcoming. Think of the present piece as the technical manual for it.

Read the Paper (PDF)

knotes, Notes

Cardinality

August 2, 2011 K.M. Halpern

A compilation of useful results involving cardinal numbers (small ones, not huge ones) and arithmetic, along with the cardinalities of certain useful sets. There’s also a small section on bases of infinite-dimensional vector spaces. Proofs and justifications for many of the results are included in an appendix.

Read the Paper
