# What happens when you iterate Bayesian Inference with the same data set?

I’ve recently been reviewing Bayesian networks with an eye to learning Stan. One question which occurred to me is the following. Suppose we are interested in the probability distribution $P(\mu)$ over parameters $\mu\in X$ (with state space $X$). We acquire some data $D$, and wish to use it to infer $P(\mu)$. Note that $D$ refers to the specific realized data, not the event space from which it is drawn.

Let’s assume that (1) we have a prior $P(\mu)$, (2) the likelihood $P(D|\mu)$ is easy to compute or sample, and (3) the normalization $P(D)\equiv \sum_{\mu\in X} P(D|\mu)P(\mu)$ is not expensive to compute or adequately approximate.

The usual Bayesian approach involves updating the prior to a posterior via Bayes’ thm: $P(\mu|D)= \frac{P(D|\mu)P(\mu)}{P(D)}$. However, there also is another view we may take. We need not restrict ourselves to a single Bayesian update. It is perfectly reasonable to ask whether multiple updates using the same $D$ would yield a more useful result.

Such a tactic is not as ridiculous or unjustified as it first seems. In many cases, the Bayesian posterior is highly sensitive to a somewhat arbitrary choice of prior $P(\mu)$. The latter frequently is dictated by practical considerations rather than arising naturally from the problem at hand. For example, we often use the likelihood function’s conjugate prior to ensure that the posterior will be of the same family. Even in this case, the posterior depends heavily on the precise choice of $P(\mu)$.

Though we must be careful interpreting the results, there very well may be applications in which an iterated approach is preferable. For example, it is conceivable that multiple iterations could dilute the dependence on $P(\mu)$, emphasizing the role of $D$ instead. We can seek inspiration in the stationary distributions of Markov chains, where the choice of initial distribution becomes irrelevant. As a friend of mine likes to say before demolishing me at chess: let’s see where this takes us. Spoiler: infinite iteration “takes us” to maximum-likelihood selection.

An iterated approach does not violate any laws of probability. Bayes’ thm is based on the defining property $P(\mu,D)= P(D|\mu)P(\mu)= P(\mu|D)P(D)$. Our method is conceptually equivalent to performing successive experiments which happen to produce the same data $D$ each time, reinforcing our certainty around it. Although its genesis is different, the calculation is the same. I.e., any inconsistency or inapplicability must arise through interpretation rather than calculation. The results of an iterated calculation may be inappropriate for certain purposes (such as estimating error bars, etc), but could prove useful for others.

In fact, one could argue there only are two legitimate approaches when presented with a one-time data set $D$. We could apply it once or an infinite number of times. Anything else would amount to an arbitrary choice of the number of iterations.

It is easy to analyze the infinite iteration process. For simplicity, we’ll consider the case of a discrete, finite state space $X$. $D$ is a fixed set of data values for our problem, so we are not concerned with the space or distribution from which it is drawn. $P(D)$ is a derived normalization factor, nothing more.

Let’s introduce some notation:

– Let $n\equiv |X|$, and denote the elements of $X$ by $\mu_1\dots \mu_n$.
– We could use $n$-vectors to denote probability or conditional probability distributions over $X$ (with the $i^{th}$ component the probability of $\mu_i$), but it turns out to be simpler to use diagonal $n\times n$ matrices.
– $P(\mu)$ is an $n$-vector, which we’ll write as a diagonal $n\times n$ matrix $v$ with $v_{ii}\equiv P(\mu_i)$.
– We’ll denote by $D^k$ the data set $D$ repeated $k$ times. I.e., the equivalent of having performed an experiment $k$ times and obtained $D$ each time.
– $P(\mu|D)$ is an $n$-vector, which we’ll write as a diagonal $n\times n$ matrix $v'$ with $v'_{ii}\equiv P(\mu_i|D)$.
– Where multiple updates are involved, we denote the final posterior $P(\mu|D^k)$ via an $n\times n$ diagonal matrix $v^{(k)}$, with $v^{(k)}_{ii}\equiv P(\mu_i|D^k)$. Note that $v'= v^{(1)}$ and $v= v^{(0)}$.
– $P(D|\mu)$ is an $n$-vector of probabilities as well, but we’ll also treat it as a diagonal $n\times n$ matrix $M$ with $M_{ii}\equiv P(D|\mu_i)$.
– $P(D)=\sum_{i=1}^n P(D|\mu_i)P(\mu_i)$ is a scalar. In our notation, $P(D)= \text{tr}~ M v$.

A single Bayesian update takes the form $v'= M v/(\text{tr}~ M v)$. What happens if we repeat this? A second iteration substitutes $v'$ for $v$, and we get $v^{(2)}= M v'/(\text{tr}~ M v')$. This is homogeneous of degree $0$ in $v'$, so the $(\text{tr}~ M v)$ normalization factor in $v'$ disappears. We thus have $v^{(2)}= M^2 v /(\text{tr}~ M^2 v)$. The same reasoning extends to $v^{(k)}= M^k v/(\text{tr}~ M^k v)$.
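Here is a minimal Python sketch of the update rule above, with the diagonal matrices stored as plain lists of their diagonal entries. The prior and likelihood numbers are made up purely for illustration. It checks that applying the single update $k$ times agrees with the closed form $v^{(k)}= M^k v/(\text{tr}~ M^k v)$.

```python
from math import isclose

def bayes_update(prior, likelihood):
    """One Bayesian update: v' = M v / tr(M v), entry by entry."""
    weights = [l * p for l, p in zip(likelihood, prior)]
    total = sum(weights)  # this is P(D) = tr(M v)
    return [w / total for w in weights]

def iterated_posterior(prior, likelihood, k):
    """Closed form after k updates with the same data: M^k v / tr(M^k v)."""
    weights = [(l ** k) * p for l, p in zip(likelihood, prior)]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical values, not taken from any real problem.
prior = [0.7, 0.2, 0.1]          # P(mu_i)
likelihood = [0.10, 0.30, 0.25]  # P(D | mu_i)

step_by_step = prior
for _ in range(5):
    step_by_step = bayes_update(step_by_step, likelihood)

closed_form = iterated_posterior(prior, likelihood, 5)
agree = all(isclose(a, b) for a, b in zip(step_by_step, closed_form))
```

For very large $k$ the powers $M_{ii}^k$ underflow in floating point, so a more careful version would work with log-likelihoods instead.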

It now is easy to see what is happening. Suppose $n=2$, and let $M_{11}>M_{22}$. Our expression for $P(\mu_1|D)$ after $k$ iterations is $v^{(k)}_1= \frac{M^k_{11} v_{11}}{M^k_{11} v_{11} + M^k_{22} v_{22}}$.

This has the form $\frac{a^k x}{a^k x + b^k y}$, which can be written $1/(1+\frac{b^k y}{a^k x})$. We know that $b<a$, so as long as $x\ne 0$ we have $\lim_{k\rightarrow\infty} \frac{b^k y}{a^k x}= 0$. Specifically, for $\epsilon>0$ we have $\frac{b^k y}{a^k x}<\epsilon$ for $k>\frac{\ln\epsilon + \ln \frac{x}{y}}{\ln \frac{b}{a}}$. Note that the denominator is negative since $a>b$ and the numerator is negative for small enough $\epsilon$.
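The bound on $k$ can be turned into a small helper. This is only a sketch: the function name and the sample numbers are hypothetical, and it assumes $a>b>0$ and $x,y>0$ so that every logarithm is defined.

```python
from math import floor, log

def min_iterations(a, b, x, y, eps):
    """Smallest integer k >= 0 with (b**k * y) / (a**k * x) < eps.

    Assumes a > b > 0 and x, y > 0. The condition is equivalent to
    k > (ln eps + ln(x/y)) / ln(b/a), so we take the next integer up.
    """
    bound = (log(eps) + log(x / y)) / log(b / a)
    return max(0, floor(bound) + 1)

def ratio(a, b, x, y, k):
    """The suppressed term b^k y / (a^k x) after k iterations."""
    return (b ** k * y) / (a ** k * x)

# Hypothetical numbers: likelihoods a, b and priors x, y.
a, b, x, y, eps = 0.30, 0.25, 0.2, 0.1, 1e-6
k = min_iterations(a, b, x, y, eps)
below = ratio(a, b, x, y, k) < eps
```

The closer $b/a$ is to $1$, the larger the required $k$, which is why nearly degenerate likelihoods converge slowly.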

We therefore have shown that (in this simple case) $\lim_{k\rightarrow\infty} v^{(k)}_1= 1$. If we perform the same analysis for $v^{(k)}_2$, we get $v^{(k)}_2= \frac{M^k_{22} v_{22}}{M^k_{11} v_{11} + M^k_{22} v_{22}}$, which corresponds to $1/(1+\frac{a^k x}{b^k y})$. The denominator diverges as $k\rightarrow\infty$, and the limit is $0$. We therefore see that $\lim_{k\rightarrow\infty} v^{(k)}_2= 0$.

This trivially extends to $n>2$. As $k\rightarrow\infty$, all but the dominant $M_{ii}$ are exponentially suppressed. The net effect of infinite iteration is to pick out the maximum likelihood value. I.e., we select the $\mu_i$ corresponding to the maximum $M_{ii}$. All posterior probability is concentrated in that. Put another way, the limit of iterated posteriors is $P(\mu_i|D^\infty)= 1$ for $i=\text{argmax}_j~P(D|\mu_j)$ and $0$ for all others.

What if the maximum $M_{ii}$ is degenerate? Let’s again consider the simple $n=2$ case, but now with $M_{11}= M_{22}>0$. It is easy to see what happens in this case. $a/b=1$, so $v^{(k)}_1= \frac{v_{11}}{v_{11}+v_{22}}$ and $v^{(k)}_2= \frac{v_{22}}{v_{11}+v_{22}}$. Note that $v_{11}+v_{22}=1$ here, but we stated the denominator explicitly to facilitate visualization of the extension to $n>2$.

This extension is straightforward. We pick out the maximum likelihood values $\mu_i$, and they are assigned their prior probabilities, renormalized. Suppose there are $m\le n$ degenerate maximum $M_{ii}$’s, with indices $i_1\dots i_m$ (each $i_j\in 1\dots n$). The limit of iterated posteriors is $P(\mu_{i_j}|D^\infty)= \frac{P(\mu_{i_j})}{\sum_{l=1}^m P(\mu_{i_l})}$. This reduces to our previous result when $m=1$.
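The degenerate limit can be sketched directly as "prune the prior to the maximum likelihood values and renormalize". The function name and numbers below are hypothetical. Degeneracy is detected by exact equality of likelihoods, which is an assumption that suits a toy example (a tolerance may be preferable with real floating-point data).

```python
from math import isclose

def limiting_posterior(prior, likelihood):
    """Limit of v^(k) as k -> infinity for a finite state space.

    Keeps the prior on the maximum-likelihood entries, zeroes the rest,
    and renormalizes. Ties are detected by exact equality (an assumption).
    """
    top = max(likelihood)
    kept = [p if l == top else 0.0 for l, p in zip(likelihood, prior)]
    total = sum(kept)
    if total == 0:
        raise ValueError("maximum-likelihood values need nonzero prior")
    return [w / total for w in kept]

# Hypothetical example with a two-fold degenerate maximum.
prior = [0.5, 0.3, 0.2]
likelihood = [0.4, 0.4, 0.1]
limit = limiting_posterior(prior, likelihood)  # roughly [0.625, 0.375, 0.0]

# Compare against a large but finite number of iterations.
k = 200
weights = [(l ** k) * p for l, p in zip(likelihood, prior)]
finite_k = [w / sum(weights) for w in weights]
close = all(isclose(a, b, abs_tol=1e-12) for a, b in zip(limit, finite_k))
```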

Note that we must ensure $v_{ii}\ne 0$ for the maximum likelihood $\mu_i$’s. I.e., we cannot have a $0$ prior for any of the maximum likelihood values. If we wish to exclude $\mu_i$’s from consideration, we should do so before the calculation, thus eliminating the corresponding $P(D|\mu_i)$’s from contention for the maximum likelihood.

Expanding $X$ to a countable set poses no problem. In the continuous case, we must work with intervals (or measurable sets) rather than point values. For any $\epsilon>0$ and any set of nonzero measure containing all the maximum likelihood values, there will be some $k$ that concentrates all but $\epsilon$ of the posterior probability in that set.

Note that $k$ depends on the choice of measurable set, and care must be taken when considering limits of such sets. For example, let $p\equiv \max_{\mu} P(D|\mu)$ be the maximum likelihood probability. If we consider an interval $I\equiv (p-\delta/2,p+\delta/2)$ as our maximum likelihood set, then the maximum likelihood “value” is the (measurable) set $V\equiv P(D|\mu)^{-1}(I)$. For any $\epsilon$, we have a $k$ as discussed above, such that $P(\mu\notin V|D^j)<\epsilon$ for $j>k$. However, for a fixed $\epsilon$, that $k$ will vary with $\delta$. Put another way, we cannot simply assume uniform convergence.

We can view infinite iteration as a modification of the prior. Specifically, it is tantamount to pruning the prior of all non-maximum-likelihood values and renormalizing it accordingly. The posterior then is equal to the prior under subsequent single-$D$ steps (i.e. it is a fixed point distribution). Alternatively, we can view the whole operation as a single $D^\infty$ update. In that case, we keep the original prior and view the posterior as the aforementioned pruned version of the prior.

There are two takeaways here:

1. The infinite iteration approach simply amounts to maximum-likelihood selection. It selects the maximum likelihood value(s) from the known $P(D|\mu)$ and maintains their relative prior probabilities, suitably renormalized. Equivalently, it prunes all the non-maximum-likelihood values.
2. The resulting posterior still depends on the choice of prior unless the maximum likelihood value is unique, in which case that value has probability $1$.

Unlike stationary distributions of Markov chains, the result is not guaranteed to be independent of our arbitrary initial choice (in this case, the prior $P(\mu)$). Though true independence only is achieved when there is a unique maximum likelihood value, the dependence is reduced significantly even when there is not. The posterior depends only on those prior values corresponding to maximum likelihood $\mu$’s. All others are irrelevant. The maximum likelihood values typically form a tiny subset of $\mu$’s, thus eliminating most dependence on the prior. Note that such degeneracy (as well as the values themselves) is solely determined by the likelihood function.

# Fun with Voting in Cambridge, MA

My city of Cambridge, MA is one of a few municipalities which employs ranked choice voting for City Council elections. Unlike most cities, the Mayor is chosen by the City Council and is largely a ceremonial position. Most real power resides with the City Manager, who is appointed for an indefinite term by the City Council. This means that City Councils which get to appoint a new City Manager exert an inordinate influence over the future course of the city. One such point is fast approaching. Unfortunately, given the present and probable near-term composition of the City Council, the decision likely will be based on considerations other than aptitude. However, putting aside my city’s somber prognosis, the upcoming City Council election is a good opportunity to discuss an unusual method of voting and some of its shortcomings.

Ordinary winner-takes-all elections dominate the popular consciousness. National elections are of this nature. It would not be inaccurate to observe that such an approach reflects the general weltanschauung of our culture. However, there are many other voting methods. In fact, voting theory is a vibrant field of research. Together with its sibling, auction theory, it forms part of the subject commonly known as “social choice theory”.

As an aside, I recently published a paper in that field, Social Choice using Moral Metrics. It focuses on measuring distances between behaviors, rather than on voting systems per se. Back in 2008, I also wrote a voting theory piece about swing votes and block voting. What I termed “influence” in it is more commonly referred to as “voting power”. Neither is related to what I discuss in this post, but I encourage the interested reader to peruse them.

It may be argued that certain voting methods are fairer than others, by one or another definition of fairness. Particular flavors sometimes are advocated by those disenchanted with an existing method or by those with an agenda to see some particular group gain influence. Calls for change sometimes arise in response to highly-visible anomalies, election outcomes which appear egregiously unfair even to disinterested eyes.

In elections with a large field of candidates or those in which a number of positions are simultaneously filled (such as the Cambridge City Council election), winner-takes-all voting may not be suitable or may give rise to such anomalies.

California’s recall system is an example. The ballot in that case has 2 questions: (1) whether to recall the governor and (2) who should replace him. The first question is winner-takes-all for the governor alone. If he loses, the 2nd question is winner-takes-all for the other candidates. It is quite possible for a candidate to be chosen who easily would have lost to the recalled governor one-on-one. In 2003, only 44.6% of voters voted not to recall Governor Davis. He thus was recalled, and Schwarzenegger then won with 48.58% of the votes for replacement. It is highly unlikely that in a head-to-head gubernatorial election, Republican Schwarzenegger would have beaten Democrat Davis in the heavily blue state. However, Davis was excluded from this 2nd contest and Schwarzenegger was preferred to the alternatives by a plurality of voters.

## Arrow’s Theorem

It is natural to ask whether any voting system is unimpeachably fair, indicting the use of other systems as anachronistic or disingenuous. Arrow famously proved that, under even a small set of fairness constraints and for a broad class of voting systems, it is impossible to find one. Loosely speaking, when more than 2 candidates are present, no method of aggregating the rankings of candidates by voters into a single outcome ranking can simultaneously satisfy three conditions: (1) if every voter prefers candidate x to candidate y, then x outranks y in the outcome, (2) no single voter’s preference determines the outcome (i.e. no dictator), and (3) if each voter ranks x relative to y (i.e. above or below it) the same way in elections A and B (though the order can differ between voters, of course), then the outcomes of A and B do too. I.e., if voters change their overall ranking of x and y or the relative placement of other candidates, but don’t change whether x is preferred to y or vice versa, then whether x outranks y or vice versa in the outcome is unchanged.

It is quite plausible to add more fairness conditions, but most plausible definitions of fairness would require at least these three conditions to hold. Arrow showed that there is no ranked voting system (including “preponderance of the votes”) in which unfair anomalies cannot arise.

As an aside, if one were to relax a condition, the most palatable clearly would be (3). It is conceivable that a “fair” aggregation method may allow the overall ranking of candidates to affect a pairwise order in the outcome. However, this generally is deemed undesirable.

As with complexity results in computer science (CS) or Gödel’s incompleteness theorem in logic, the theoretical existence of hard or problematic cases does not necessarily pose a practical obstacle. In CS, an algorithm with worst-case exponential complexity may be far more useful than one with linear complexity in real-world applications. For example, the latter could have a huge constant cost (often referred to as a “galactic algorithm”) and the former could be exponential only in an infinitesimal fraction of cases or under circumstances which never arise in practice. Gödel’s theorem does have real-world examples (i.e. non-meta-theorems), but (at this point) they remain rare.

Though nowhere near as profound,  Arrow’s theorem invites similar skepticism.  The impossibility of a preference system which excludes all anomalies does not mean such anomalies arise in practice, or that a system which excludes all realistic anomalies cannot be found.   Unfortunately (or fortunately, depending on one’s perspective), such anomalies do arise in practice.  Worse,  the systems in question often are of significant social import and subject to intense scrutiny.  The anomalies which do arise can be quite visible and politically troublesome.

Social choice theory exhibits another critical difference from CS and logic, one which merits additional caution.  The goal of logic, mathematics, and theoretical computer science generally is to understand which problems are solvable and how best to solve them.  Anomalies are viewed as pathological and undesirable.  They sometimes serve as useful counterexamples, guiding researchers to better understanding and helping them improve their tools.   However, they are to be avoided in real-world applications.   If a pathological case arises in such a context, alternate machinery must be employed or the framework modified to exclude it.

This need not be the case in social choice theory. Not everyone’s goals are aligned, or social choice would be unnecessary. With elections, there could be adverse incentives. It may be possible to game an election by identifying and exploiting anomalies endemic to the specific system involved. There also may be groups who strongly prefer that anomalies arise, either for purposes of fomenting discord or because those anomalies serve them well. For this reason, dismissing anomalies as almost impossible under some assumed prior may be naive. The prior must incorporate human behavior, and this very well could concentrate probability around the anomalies. Put another way, if we naively model the probability of anomalies arising using an assumption of ideal behavior, we risk ignoring the very real possibility that participants will engineer or utilize anomalies.

This issue is related to Gibbard’s theorem, which loosely states that under even weaker conditions than Arrow’s theorem (at least 3 candidates and no dictator), there is no ideal ballot which reflects a voter’s preferences. Put another way, the voting system can be gamed. In fact, a voter may need to game it (perhaps in response to polls or other information) in order to best reflect their individual preferences. The optimal ballot ranking to enact a voter’s preferences may not be their actual preference ranking of candidates.

## The Rules in Cambridge

What does all this have to do with the Cambridge elections? Cambridge employs a particular system of ranked choice voting, which they refer to as “Proportional Representation”. This often is portrayed as fairer, more democratic, and so on. I am going to offer an example of an egregious anomaly which can result. I do this not in the expectation that it will arise or be exploited. Nor do I hope to change a voting method that is, all things considered, quite reasonable. Rather, the anomaly serves as an illustrative example of the inherent problem with claiming that one voting system is “fairer” than another.

First, I’ll describe the precise rules of the Cambridge election, as best I understand them. See MA Election Laws, section 9 for details.  State law governs the general rules for proportional representation voting in any Massachusetts municipalities which choose to employ it.  Only certain parameters and details of execution are left to local discretion.

The City Council consists of 9 individuals, and the entire body is elected once every 2 years. Voters are presented with a list of candidates and may select a 1st choice, a 2nd choice, and so on.  I do not recall the maximum number of choices which can be made, but let us suppose it is not limited. The anomaly arises whether or not this is the case. Note that a given voter is not required to rank all the candidates. They could select only their top 3 choices, for example. Whether or not a full ranking by each voter is required does not affect the anomaly.

First some definitions. N will denote the total number of ballots (i.e. the number of voters who participate in the election).  At the time of writing, the minimum number of signatures to get on the ballot is 50.  We’ll call this ‘M’, because State law gives it a role in the algorithm. Q=(N/10)+1 will be the “quota”, the minimum number of ballots a candidate needs to win.

Why not choose Q=N/9? The type of voting system we’re describing is sometimes referred to as “single-transferable-vote” (STV) because of the use of spillovers (described below). There are two common quota methods for determining STV winners: (1) “Hare” corresponds to Q=N/9, and (2) “Droop” corresponds to Q=(N/10)+1. In each case, we round up if needed. The two methods generally result in the same outcome or differ only in how the last winner is chosen. Each has benefits and drawbacks vis-a-vis what is deemed fair in terms of proportional representation. Among other things, the Hare quota tends to favor small parties, whereas the Droop quota tends to favor large ones. The Droop quota also is the smallest quota which guarantees no more than 9 winners. As we will see, neither method guarantees a full complement of 9 winners. Regardless, the Droop quota is that used by Cambridge.
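As a quick illustration, the two quotas can be computed as follows. This sketch follows the rounding described above (round up if needed). The function names are hypothetical, and note that many STV references instead define the Droop quota by rounding N/10 down before adding 1, which can differ by one when N is not a multiple of 10.

```python
def hare_quota(num_ballots, seats):
    """Hare quota as described above: N / seats, rounded up."""
    return -(-num_ballots // seats)  # integer ceiling division

def droop_quota(num_ballots, seats):
    """Droop quota as described above: N / (seats + 1) + 1, rounded up."""
    return -(-num_ballots // (seats + 1)) + 1

N, SEATS = 10000, 9  # hypothetical turnout; 9 seats as in Cambridge
hare = hare_quota(N, SEATS)
droop = droop_quota(N, SEATS)

# With the Droop quota, seats + 1 winners would need more ballots than exist.
too_many_winners_impossible = (SEATS + 1) * droop > N
```

With these definitions, 10000 ballots and 9 seats give a Hare quota of 1112 and a Droop quota of 1001.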

Once the ballots have been collected, a sequence of steps is performed by computer. An order of polling places is determined randomly by the city beforehand. Within each polling place, ballots are sorted by the choice of 1st place candidate (and then presumably randomly within each such cohort).  The ballots then go through a series of stages.  The first stage is special.

Stage 1: Any candidate who reaches Q votes is declared a winner. Subsequent 1st place votes for them are passed to the next ranked candidate on the ballot who has not already been declared a winner. Ex. if a ballot is reached with x, y, and z as the 1st, 2nd, and 3rd candidates, and both x and y already have been declared winners, it would go to z. If no non-winner choice remains on the ballot, it is swapped with a ballot that already was consumed by the winner and has non-winner choices on it. This minimizes the number of discarded ballots. Note that it always pays for a voter to rank a lot of choices, because otherwise some other voter may have their preference registered instead. It’s not clear from the law in what order the 1st place candidates’ ballots should be sorted, but we’ll assume randomly. It does not matter for the anomaly we will discuss. As the sorting proceeds, any candidate with Q votes (by spillover from other candidates or by being 1st on their own) is declared a winner, and any remaining votes for them spill over as described.

Once this process has been completed, almost every ballot has been assigned to some candidate (i.e. either consumed by a winner or spilled over to a remaining candidate). Because of the ballot-swapping mechanism described, it is unlikely (but still possible) for ballots to have been discarded due to lack of non-winner alternatives. Each winner has consumed precisely Q ballots, and each remaining candidate has fewer than Q ballots. In what follows we use “higher-ranked” to refer to the preferred candidates on a ballot. In practice, this means they have been assigned a lower number. I.e., the 1st place candidate on a ballot is “higher-ranked” than the 2nd place candidate.

At this point, any candidate with fewer than M ballots (in our case 50) is declared to have lost. Their ballots are transferred in the same manner as before to the remaining candidates. Note that this form of elimination only takes place in this first round, since the number of ballots assigned to a candidate cannot decrease in subsequent rounds.

Stages 2+: If 9 candidates have been declared winners, the process ends. Otherwise, the trailing candidate is declared to have lost, and their votes are transferred (one by one) to the remaining candidates in the same  manner as before, but with one important change. Unlike in the first round, if no remaining non-winner candidates are listed on a ballot, it is discarded rather than swapped with another. As before, any candidate who reaches Q votes is declared a winner and can accrue no more votes. There are some tie-breaker rules associated with determining who is the trailing candidate at the end of a given round, but we won’t go into those. If at any time, the number of winners plus remaining candidates is 9, all remaining candidates are declared winners. The round ends when every ballot in play either has been spilled over (once) or discarded. Those ballots not discarded or consumed by winners and those candidates not eliminated then proceed to the next round.

Note that a spillover never can result in a ballot being assigned to a higher-ranked candidate. For example, suppose a ballot already has been assigned to the 3rd listed candidate on it. This only could happen if there was a reason to skip the top 2. This means they either already were declared winners or already were eliminated. Nor do any swaps (possible only in the 1st round) affect this. Any subsequent spillovers must go to lower-ranked candidates, or the ballot would have been handed to a higher-ranked candidate already.

Note that unless every voter ranks every candidate, it is possible for some ballots to be discarded. This is highly unlikely in the first round, because swapping is allowed. However, in subsequent rounds ballots may be discarded if they list no candidates which remain in play (i.e. that have not already been declared winners or eliminated). Though there is a theoretical bound on the number of possible discarded ballots, it can be high.

It is quite possible for an insufficient number of winners to be declared. This is no surprise. If every voter lists the same three candidates, but no others, then only three candidates will win. Insufficient ranking by voters can lead to inadequate outcomes.

Unless the field of candidates is reduced below 9 in the first round (i.e. too few candidates meet the 50 vote threshold), there ultimately will be 9 winners. However, some may not get many votes. If every voter ranks every candidate, then all winners will meet quota. If not, some candidates may win without meeting quota by dint of being the last ones uneliminated.

A number of obvious anomalies come to mind. For example, if everyone votes for x, y, and z as the top 3 candidates but there is a huge field of candidates for 4th place (so that each gets 51 spillover votes), then the remaining candidates won’t be eliminated in the first round. The remaining 6 winners then will be selected by the tie-breaker procedure (which we didn’t elaborate on). Fair yes, desirable no. However, such anomalies can be counted as voter failures. If each voter ranks the whole field of candidates, they won’t arise.

One important thing to note is that the election method described does not obey the conditions of Arrow’s theorem. The procedure is not even deterministic, and certainly does not satisfy the 3rd fairness condition. It is quite possible for a change in the ranking of candidate z on individual ballots to affect the order of x relative to y in the outcome even if the order of x relative to y is unchanged on those individual ballots. As an extreme example, suppose x is 1st and y is 2nd on 50 ballots and y is 1st and x is 2nd on 50 ballots, and suppose z is 3rd on all of these.   If one of the 1st 50 ballots moves z to the top, x will be eliminated in the 1st round.  If one of the 2nd 50 ballots moves z to the top y will be eliminated in the 1st round.  In neither case did the ranking of x relative to y change on any ballots.  Some anomalies arise for similar reasons to those involved in Arrow’s theorem, but others arise for different reasons.

## The Anomaly

Let us now consider the specific anomaly we set out to discuss. Suppose there are 10000 ballots and 9 positions to be filled. We require 1001 votes for a candidate to win, but we’ll call it 1000 to simplify calculation. Suppose that candidate x is ranked 1st on all 10000 ballots, candidate y is ranked 3rd on all 10000 ballots, and 100 other candidates (which we’ll call z1-z100) are ranked 2nd on 100 ballots each.

Everyone agrees that candidates x and y should be on the City Council. They both rank in the top 3 choices for everyone. However, candidate y is eliminated in the first round. All the spillover votes from candidate x go to candidates z1-z100. The number could vary for each, depending on the order in which ballots are processed.  For example, it is possible that each of z1-z100 is assigned 90 spillover votes from candidate x.  It also is possible that z1-z90 would accrue 100 spillover votes each, and the rest would get 0 and be eliminated.

At the end of round 1, x is declared a winner and consumes 1000 votes, y has 0 votes, and z1-z100 each have between 0 and 100 votes. At least 81 of them have enough to survive the 50 vote test (if only 80 survived, the z’s could hold at most 80×100 + 20×49 = 8980 of the 9000 spillover votes). However, y is eliminated. The remaining z’s then proceed through a series of elimination and spillover rounds (with possible tie-breakers for the trailing candidate if needed) until only 8 of the z’s remain. These then are declared winners.

The result is 1 winner everyone wants, 8 winners few people agree on, and the conspicuous loss of the 2nd candidate everyone wants.
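The first round of this scenario is simple enough to simulate. The sketch below is a deliberately stripped-down model, not the full Cambridge procedure: it uses the simplified quota of 1000, assumes a random processing order, and omits ballot swapping (which never triggers here, since every ballot lists non-winner choices). All names are hypothetical.

```python
import random

def first_stage(num_ballots=10000, num_z=100, quota=1000, threshold=50, seed=0):
    """Simulate round 1 only: x fills its quota, the rest spill to 2nd choices."""
    rng = random.Random(seed)
    per_z = num_ballots // num_z
    # Every ballot ranks x 1st, some z 2nd, and y 3rd; each z is 2nd on 100 ballots.
    ballots = [("x", f"z{i}", "y") for i in range(1, num_z + 1) for _ in range(per_z)]
    rng.shuffle(ballots)  # assumption: ballots are processed in random order

    tally = {"x": 0, "y": 0}
    tally.update({f"z{i}": 0 for i in range(1, num_z + 1)})
    for first, second, third in ballots:
        if tally[first] < quota:
            tally[first] += 1   # consumed by x, who has not yet met quota
        else:
            tally[second] += 1  # x already won, so the ballot spills to the z
    # Candidates other than the winner with fewer than `threshold` ballots are out.
    survivors = [c for c, n in tally.items() if c != "x" and n >= threshold]
    return tally, survivors

tally, survivors = first_stage()
y_eliminated = "y" not in survivors
```

Whatever the processing order, no spillover ever reaches y, because no z can accumulate more than 100 ballots and so none meets quota in this round.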

This is just one fun example of how well-intentioned voting systems can result in highly-undesirable outcomes.

# The (quasi)-Duality of the Lie Derivative and Exterior Derivative

This is a short set of notes that covers a couple of aspects of duality in differential geometry and algebraic topology. It grew out of an enigmatic comment I encountered, to the effect that the Lie and exterior derivatives were almost-dual in some sense. I wanted to ferret out what this meant, which turned out to be more involved than anticipated. Along the way, I decided to explore something else I never had properly understood: the nature of integration from a topological perspective. This led to an exploration of the equivalence of de Rham and singular cohomology.

The notes are in the form of five sets of slides. Originally, they comprised four presentations I gave in a math study group. On tidying, the last set grew unwieldy, so I broke it into two.

• Lecture1: Review of DG and AT. Types of derivatives on ${M}$, de Rham Complex, review of some diff geom, Lie deriv and bracket, chain complexes, chain maps, homology, cochain complexes, cohomology, tie in to cat theory.
• Lecture2: The integral as a map, Stokes’ thm, de Rham’s thm, more about Lie derivs.
• Lecture3: Recap of de Rham cohomology, review of relevant algebra, graded algebras, tensor algebra, exterior algebra, derivations, uniqueness results for derivations, the interior product.
• Lecture4: Cartan’s formula, tensor vs direct product, element-free def of LA, Lie coalgebras
• Lecture5: Quick recap, relation between struct constants of LA and LCA, the choice of ground ring or field, duality of Lie deriv and exterior deriv.

These notes grew organically, so the order of presentation may seem a bit … unplanned. The emphases and digressions reflect issues I encountered, and may be peculiar to my own learning process and the many gaps in my physicist-trained math background. Others may not share the same points of confusion, or require the same background explanations. They were designed for my own use at some future point when I’ve completely forgotten the material and need a bespoke refresher. I.e., a week from now.

Although I’ve tried to polish the notes to stand on their own, there are some allusions to earlier material studied in the group. In particular, certain abbreviations are used. Here is a (hopefully) complete list:

• DG: Differential Geometry
• AT: Algebraic Topology
• DR: de Rham
• ${P}$: Used for a Principal bundle. Not really used here, but mentioned in passing.
• PB: Principal Bundle. Not really used here, but mentioned in passing.
• AB: Associated Bundle. Not really used here, but mentioned in passing.
• LG: Lie Group. Mentioned in passing.
• LA: Lie Algebra
• LCA: Lie Coalgebra (defined here).
• v.f.: Vector fields
• v.s.: Vector space

The 1st 2 lectures focus on the equivalence of de Rham and singular cohomologies via a duality embodied in the integral map, and enforced by Stokes’ and de Rham’s thms. The last 3 lectures focus on the quasi-duality between the Lie derivative and exterior derivative. By quasi-duality we don’t mean to downplay its legitimacy. I didn’t go through all sorts of contortions to call a square a circle just because it sounds elegant. There is a true duality, and a beautiful one. But saying that it is directly between the Lie and exterior derivs is slightly misleading.

These notes were constructed over a period of time, and focus on the specific topic of interest. They are by no means comprehensive. Although edited to correct earlier misconceptions based on later understanding (as well as errors pointed out by the math group), the order of development has not been changed. They were written by someone learning the subject matter as he learned it. They may have some mistakes, there may be some repetition of points, and they are not designed from the ground up with a clear vision. Nonetheless, they may prove helpful in clarifying certain points or as a springboard for further study.

These notes explain the following:

• ${\int}$ as a map from the de Rham complex to the singular cochain complex
• Stokes’ thm as a relationship between de Rham cohomology and singular cohomology
• The various types of derivations/anti-derivations encountered in differential geometry
• A review of graded algebras, tensor algebras, exterior algebras, derivations, and anti-derivations.
• A review of Lie Derivatives, as well as Cartan’s formula
• A discussion of what the duality of ${{\mathcal{L}}}$ and ${d}$ means
• A discussion of the two views one can take of ${T(M)}$ and ${\Lambda(M)}$: as ${\infty}$-dimensional vector spaces over ${\mathbb{R}}$ or as finite-basis modules over the smooth fns on M. The former is useful for abstract formulation while the latter is what we calculate with in DG. The transition between the two can be a source of confusion.
• A discussion of why derivations and anti-derivations are the analogues of linearity when we move from one view to the other.

The notes draw from many sources, including Bott & Tu, Kobayashi & Nomizu, and various discussions on stackexchange. A list of references is included at the end of the last set of slides.

# The Truth about Stock Prices: 12 Myths

No-fee trading has invited a huge influx of people new to trading. In this article, I will discuss the basics of “price formation”, the mechanism by which stock prices are determined.

Like most people, for much of my life I assumed that every stock has a well-defined “price” at any given point in time. You could buy or sell at that price, and the price would move based on activity. If it went up you made money, if it went down you lost money. Trading was easy: you just bought the stocks you thought would go up and sold the ones you thought would go down.

Unfortunately, my blissful naivete was cut short. After a youthful indiscretion, I ended up doing five years at the Massachusetts Institute of Technology. When the doors finally slammed shut behind me, I emerged with little more than a bus ticket and some physics-department issued clothes. Nobody reputable would hire a man with a checkered background doing physics, so I ended up with the only sort open to hard cases: Wall Street.

I caught the eye of a particularly unsavory boss one day, and he recruited me into a gang doing stat arb at a place called Morgan Stanley. I tried to get out, but they kept pulling me back in. It took six years to find a way out, but even then freedom proved elusive. I was in and out of corporations for the next few years, and even did some contract work for a couple of big hedge funds. Only in the confusion of 2008 did I finally manage to cut ties and run. But the scars are still there. The scars never go away.

On the plus side, I did learn a bit about market microstructure. Along the way I came to understand that my original view of prices was laughably simplistic. My hope is that I can help some misguided kid somewhere avoid my own missteps. If I can save even one reader, the effort put into this post will have been repaid a thousand times over. Mainly because I didn’t put much effort into it.

Rather than a detailed exposition on market microstructure (which varies from exchange to exchange, but has certain basic principles), I will go through a number of possible misconceptions. Hopefully, this will be of some small help to new traders who wish to better understand the dynamics of the stock market. At the very least, it will make you sound smart at cocktail parties. It also may help the occasional reader avoid such minor faux pas as redditing “hey guys, why don’t we all collude to manipulate stock prices in clear violation of SEC regulations, and to such an absurd degree that it will be impossible for regulators NOT to crucify us.” But hey, what’s the worst that could result from the public subversion of a number of powerful, well-connected hedge funds and the defiant proclamation that this was intentional?

Now to the important bit. Because we live in America, and everybody sues everyone for everything, I’ll state the obvious. Before you do anything, make sure you know what you are doing. If you read it here, that doesn’t mean it’s right or current. Yes, I worked in high frequency statistical arbitrage for some time. However, my specific knowledge may be dated. Though the general principles I describe still apply, you should confirm anything I say before relying heavily on it. In particular, I am no tax expert. Be sure to consult an accountant, a lawyer, a doctor, a rabbi, and a plumber before attempting anything significant. And if you do, please send me their info. It’s really hard to find a good accountant, lawyer, doctor, rabbi, or plumber.

Don’t take anything I say (or anyone else says) as gospel. I’ve tried to be as accurate as possible, but that doesn’t mean there aren’t technical errors. As always, the onus is on you to take care of your own money. When I first started out on Wall Street, I was in awe of traders. Then I got to know some. In my first job, somebody helpfully explained why people on Wall Street were paid more than in other professions. They weren’t paid to be infallible and never make mistakes; they were paid to be attentive and diligent enough to catch any mistakes they did make.

This sounded nice, but turned out to be a load of malarkey. The highly-paid professionals on Wall Street are the same bunch of knuckleheads as in any other profession, but with better credentials. However, this cuts both ways. Many people have a view, promulgated by movies and television, that bankers are unscrupulous, boiler-room shysters. These certainly exist, but mostly amongst the armies of low-paid retail brokers, or in certain very disreputable areas such as commercial banking. The real Wall Street is quite different. The individuals I worked with were highly ethical, and the environment was far more collegial and honest than academia. And this was in the late 90’s and early 2000’s, before academia really went to pot. The few knives I had to pull out of my back were (with one exception) gleefully inserted by fellow former-physicists. Fortunately, while physicists know a lot about the kinematics of knives, they know very little about anatomy. I emerged unscathed, and even got a few free knives out of it — which I promptly sold to some folks in Academia, where such things always are in high demand.

Despite its inapplicability to actual employee behavior, the point about mistakes is a good one. It is impossible to avoid making mistakes, but if you value your money you should carefully triple-check everything. This goes doubly for any work done by an accountant, financial adviser, or other “professional” you ill-advisedly employ. They probably know less than you do, and certainly care less than you do about your money.

The best advice I can offer is to inform yourself and be careful. Do research, check, recheck, and recheck again before committing to a trade. In my personal trading, I’ve never lost out by being too slow or cautious. But I have been hammered by being too hasty.

Now to the possible misconceptions. I’ll call them “myths” because that’s what popular websites do, so obviously it’s the right thing to do, and I prefer to do the right thing because the wrong thing rarely works.

Myth 1: There is a “price” for a stock at any given point in time. When a stock is traded during market hours, there is no such thing as its “price”. There is a bid (the highest offer to buy) and an ask (the lowest offer to sell). Often, the “price” people refer to is the last trade price (the price at which the last actual transaction occurred, regardless of its size). Sometimes the midpoint (bid+ask)/2 or weighted midpoint (bid x bidsize + ask x asksize)/(bidsize + asksize) is used. For algorithmic trading, more complicated limit-book centroids sometimes are computed as well. The “closing price” generally refers to the last trade price of the day. This is what appears in newspapers.
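To make the two midpoint definitions concrete, here is a minimal Python sketch. The `Quote` class, its field names, and the sample numbers are illustrative assumptions rather than any data vendor's format; the weighted midpoint follows the formula exactly as written above.

```python
from dataclasses import dataclass


@dataclass
class Quote:
    bid: float       # highest offer to buy
    ask: float       # lowest offer to sell
    bid_size: int    # shares sought at the bid
    ask_size: int    # shares offered at the ask


def midpoint(q: Quote) -> float:
    """Simple midpoint: (bid + ask) / 2."""
    return (q.bid + q.ask) / 2


def weighted_midpoint(q: Quote) -> float:
    """Size-weighted midpoint, using the formula as written in the text."""
    return (q.bid * q.bid_size + q.ask * q.ask_size) / (q.bid_size + q.ask_size)


# Hypothetical quote: bid $100 for 200 shares, ask $101 for 50 shares.
quote = Quote(bid=100.0, ask=101.0, bid_size=200, ask_size=50)
print(midpoint(quote))           # 100.5
print(weighted_midpoint(quote))  # 100.2
```

With four times as much size on the bid, this weighting pulls the value toward the bid. None of these numbers is "the price"; each is just one summary of the same quote.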

Myth 2: I can place a limit order at any price I want. No, you cannot. Stocks (and options) trade at defined ticks. The “tick” or “tick size” is the space between allowed prices, and may itself vary with price. For example, the tick size in stock ZZZ could be $0.01 for prices below $1.00 and $0.05 otherwise. Often, ticks are things like 1/8 or 1/16 rather than multiples of $0.01. The tick size rules vary per exchange (or per security type on a given exchange) rather than per stock. In our example, any stock’s price could have allowable values of …, $0.98, $0.99, $1.00, $1.05, $1.10, … on the exchange in question.

Myth 3: Limit Orders always are better than market orders. Limit orders offer greater control over the execution price, but they may not be filled or may result in adverse selection. Suppose ZZZ is trading with a bid of $100, an ask of $101, and a tick size of $0.50. Alice places a buy limit order at $100.5. It is quite possible that it quickly will be filled, giving her $0.50 better execution than a market order.

But suppose it is not filled right away. If the stock goes up, Alice has incurred what is called “opportunity cost.” The $0.50 attempted savings now translates into having to pay a higher price or forgo ownership of the stock. It’s like waiting for the price of a home to go down, only to see it go up. If you want the home (and still can afford it), you now must pay more.

Ok, but why not just leave the limit order out there indefinitely? Surely it will get filled at some point as the stock bounces around. And if not, there is no harm. You don’t end up with the stock, but haven’t lost any money. In fact, why not put a limit order at $98? If it gets executed, that’s a $2.00 price improvement!

The problem is adverse selection. Such a limit order would get filled when the stock is falling. Sure, a temporary dip could catch it. But a major decline also could. The order is likely to be filled under precisely the conditions when Alice would not want it to be. At that point, she may be able to buy the stock for $97 or $96, if buying it remains desirable at all.

In the presence of an “alpha” (loosely speaking, a statistical signal which a trader believes has some predictive power for future stock movements), it may pay to place such limit orders, but that is a specific execution strategy based on a specific model. In general, there is no free money to be had. You either incur the transaction cost of crossing the spread (i.e. paying the ask), or risk both the opportunity cost of losing out on a desirable trade and the possibility of adverse selection which lands you with the stock at the worst possible time.

Well, it isn’t strictly true there is no free money to be had. There is free money to be made, but only by market makers, uniquely positioned to accept large volumes of orders. In this, they are not unlike the exchanges themselves. You and I do not possess the technology, capital, or customer flow to make money that way.

Myth 4: I can buy or sell any quantity at the stated price. There are a couple of reasons this is not true. The “stated price” usually is the last trade price, and there is no guarantee you can buy at that same price. Just because a house down the block sold for X doesn’t mean you can buy an identical one now for X. In illiquid stocks (and quite often with options), the last trade may have taken place some time ago and be stale relative to the current quote.

In principle, you can buy at the current ask or sell at the current bid. However, even this is not guaranteed. The bid and ask can move quickly, and it may be difficult to catch them. But there also is another critical issue at play. The bid and ask are not for unlimited quantities of stock. Each has an associated size, the total number of shares being sold or sought at that price. To understand this, it is necessary to explain how an order actually is executed, and that requires the notion of a “limit book” (aka “order book”).

Most data vendors and websites will display a “quote” (aka “composite quote”) for each stock. This consists of a bid, an ask, a bid-size, and an ask-size. Although some websites may omit the sizes, they are considered part of the quote.

Suppose the quote for ZZZ has a bid of $100 for 200 shares, an ask of $101 for 50 shares, and the relevant tick-size is $0.50. Then the spread is two ticks, (101 - 100)/0.50, and the midpoint is $100.50. It isn’t necessarily the case that there is one trader offering to buy 200 shares at $100 and another offering to sell 50 shares at $101. The sizes may be aggregates of multiple orders at those price levels.

The composite quote actually is a window into a larger constellation of orders known as the limit book. The limit book consists of a set of orders at various price levels. For example, the limit book for ZZZ could have orders at $101, $101.5, $102, and $104 on the ask side, with a queue of specific orders at each level. The composite quote simply is the highest bid, the lowest ask, and the aggregate size for each.

Suppose Bob puts in a market order to buy 100 shares of ZZZ. This is matched against the orders at the lowest ask level ($101 in this case) in their order of priority (usually the time-order in which they were received). Since there only are 50 shares at $101, the exchange matches Bob against all the sell-orders at $101. It then matches the remaining 50 shares against the second ask level ($101.5) and higher until it matches them all. If it fails to match them all, Bob will have a partial fill, and the remainder of the order will be cancelled (since it was a market order). Each “fill” is a match against a specific sell-order, and a given trade can result in many fills. This is part of why your broker may sometimes send a bunch of trade confirmations for a single order on your part.

For highly liquid stocks, no order you or I are likely to place will execute past the inner quote. However, that quote can move quickly and the price at which a market order is executed may not be what you think. Brokers also execute order flow internally, or sell flow to other institutions, which then match it against other customers or their own orders. To you it looks the same (and may actually improve your execution in some cases), but your trade may never make it to the exchange. This is fine, since you’re not a member of the exchange; your broker is.

Note the risk of a market order, especially for illiquid stocks. Suppose the 2nd ask level was $110 rather than $101.5. In that case, Bob would have bought 50 shares at $101 and 50 shares at $110. A limit order slightly past the ask would have avoided this. For example, if he wanted to ensure execution (if possible) but avoid such ridiculous levels, he could place a fill-or-kill (but not all-or-none) order at $102. This would ensure that he doesn’t pay more than $102, but he may only get a partial fill.
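The matching procedure, and the way a limit caps the damage, can be sketched in a few lines of Python. This is a simplified illustration, not how any real exchange engine is written: it collapses each ask level into a single aggregate size (ignoring the queue of individual orders and any hidden size), and the function name `match_buy` and the sample book are hypothetical.

```python
from typing import List, Optional, Tuple

# Each ask level is (price, shares available), sorted from lowest price up.
AskLevel = Tuple[float, int]


def match_buy(asks: List[AskLevel], quantity: int,
              limit: Optional[float] = None) -> Tuple[List[AskLevel], int]:
    """Match a buy order against ask levels, lowest price first.

    limit=None models a market order; otherwise nothing fills above `limit`.
    Returns (fills, unfilled shares). The book itself is not modified.
    """
    fills: List[AskLevel] = []
    remaining = quantity
    for price, available in asks:
        if remaining == 0:
            break
        if limit is not None and price > limit:
            break  # every remaining level is priced above the limit
        take = min(remaining, available)
        if take > 0:
            fills.append((price, take))
            remaining -= take
    return fills, remaining


# Hypothetical illiquid book: 50 shares at $101, then nothing until $110.
asks = [(101.0, 50), (110.0, 75)]

print(match_buy(asks, 100))               # ([(101.0, 50), (110.0, 50)], 0)
print(match_buy(asks, 100, limit=102.0))  # ([(101.0, 50)], 50)
```

The market order sweeps into the $110 level, while the $102 limit stops after the first 50 shares and leaves the other 50 unfilled.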

For stocks (other than penny-stocks), limit orders rarely are necessary as protection, though they may be desirable for other purposes. But when trading options, a limit order always should be used. If the quote is moving around a lot, this can be a good way to control worst-case execution (but in exchange for some opportunity cost). Options are a bit odd, since brokers often will write them on the spot in response to an order. You just need to figure out what their automated price-level is. Sometimes it is the midpoint, sometimes slightly higher. You almost always can do better than the innermost ask for small volume. For higher volume, you should buy slowly (over a day or two) to avoid moving the market too much — though it may be impossible if you effectively have the broker as your only counterparty. But back to Bob and ZZZ!

Now suppose that Bob places a limit order to buy 50 shares at $100.5, right in the middle of the current spread. There now is a new highest bid level: $100.5, and Bob is the sole order at that level. Any market sell order will match against him first, and this may happen so fast that the quote never noticeably changes. But if not, the new bid and bidsize will be $100.5 and 50 shares. If instead, he placed his buy order at $100, he would join the other bids at $100 as the last in the queue at that level.

What if he places it at $101 instead? If there were 25 shares available at that ask level, he would match those 25 shares. He now would have a bid for the remaining 25 shares at $101. This would be the new best bid, and the quote would change accordingly. The new best ask would be $101.5. Finally, suppose he placed the limit order at $110 instead. This effectively would be a market order, and would match against the $101 and $101.5 levels as before. Note that he would not get filled at $110 in this example. If there were 25 shares each at $101 and $101.5, he would be filled at those levels and his $110 limit order would have the same effect as a $101.5 limit order.

The limit book constantly is changing and, to make things worse, there often is hidden size. On many exchanges, it’s quite possible for the limit book to show 25 shares available at $101 and yet fill Bob for all 50 at that level. There could be hidden shares which automatically replenish the sell-order but are not visible in the feed. This is intentional. Most of the time, we only have access to simple data: the current quote and the last trade price.

Note that the crossing procedure described is performed automatically almost everywhere these days. Most exchanges run “ECNs”, electronic communication networks. An algorithm accepts orders which conform to the tick-size and other exchange rules, crossing them or adjusting the limit book accordingly. This is conceptually simple, but the software is rather involved. Because of the critical nature of an exchange, the technology has to be robust. It must be able to receive high volumes of orders with minimal latency; process them, cross them, and update the limit book; transmit limit-book, quote, and trade information to data customers; manage back-end and regulatory tasks such as clearing trades, reporting them, and processing payments; and do all this at extremely high speed, across many stocks and feeds concurrently, and with significant resilience. It definitely beats a bunch of screaming people and trade slip confetti.

Myth 5: The price at the close of Day 1 is the price at the open of Day 2. This clearly is not true, and often the overnight move is huge and predicated on different dynamics than intra-day moves. There are two effects involved. Some exchanges make provision for after-market and pre-open trading, but the main effect is the opening auction.

Whenever there is a gap in trading, the new trading session begins with an opening auction. Orders accumulate prior to this, populating the limit book. However, no fills can occur. This means that the two sides of the limit book can overlap, with some bids higher than some asks. This never happens during regular trading because of the crossing procedure described earlier, and this situation must be cleaned up before ordinary trading can begin. The opening auction is an unambiguous procedure for matching orders until the two sides of the book do not overlap. It is executed automatically by algorithm.

The closing price on a given day is the last trade price of that day. It often takes a while for data to trickle in, so this gets adjusted a little after the actual close but usually is fairly stable. The prices one sees at the start of the day involve a flurry of fills from the uncrossing. This may create its own minor chaos, but the majority of the overnight price move is reflected in the orders themselves. Basically, it can be thought of as a queue of traders waiting to get their orders in. There also are certain institutional effects near the open and close because large funds must meet certain portfolio constraints.

Note that the opening auction happens any time there is a halt to trading. Most opening auctions are associated with the morning open, but some exchanges (notably the Tokyo Stock Exchange) have a lunch break. Extreme price moves also can trigger a temporary trading halt. In each case, there is an opening auction before trading restarts.

Myth 6: The price fluctuations of a stock reflect market sentiment. That certainly can be a factor, often the dominant one. However, short-term price fluctuations also may be caused by mere market microstructure. The price we see in most charts and feeds is the last trade price, so let’s go with that. Similar considerations hold for the quote midpoint, bid, ask, or any other choice of “price” that is being tracked.

When you buy at the ask, some or all of the sell-orders at that ask-level of the limit book are filled. There may be hidden size which immediately appears, or someone may happen to jump in (or adjust a higher sell-order down). But in general, this is not the case. The composite quote moves, as do all quote-based metrics. The last trade price also reflects your trade, at least until the next trade occurs.

Consider an unrealistic but illustrative example: ZZZ has a market cap of a billion dollars. Bob and Alice are sitting at home, trading. The rest of the market, including all the major institutions which own stock in ZZZ, are sitting back waiting for some news or simply have no desire to trade ZZZ at that time. They don’t participate in trading, and have no orders outstanding. So it’s just Alice and Bob. ZZZ has a last trade price of $100, Bob has a limit order to buy 1 share at $100, and Alice has a limit order to sell 1 share at $101. These orders form both the quote and the entirety of the limit book (in this case).

Bob gets enthusiastic, and crosses the spread. The price now is $101, that at which his trade transacted. Both see that the “price” just went up, and view the stock as upward-bound. Alice has some more to sell, and decides to raise her ask. She places a sell limit order for 1 share at $102. The ask now is 1 x $102. Bob bites, crossing the spread and transacting at $102. The “price” now is $102. The pattern repeats with Alice always increasing the ask by $1 and Bob always biting after a minute or so. The closing price that day is $150. Two people have traded a total of 50 shares over the course of that day. Has the price of a billion dollar company really risen 50%?

True, this is a ridiculous example. In reality, the limit book would be heavily populated even if there was little active trading, and other participants wouldn’t sit idly by while these two knuckleheads (well, one knucklehead, since Alice actually does pretty well) go at it. But the concept it illustrates is an important one. Analogous things can happen in other ways. Numerous small traders can push the price of a stock way up, while larger traders don’t participate. In penny stocks, this sort of thing actually can happen (though usually not in such an extreme manner). When a stock’s price changes dramatically, it is important to look at the trading volume and (if possible) who is trading.

When such low-volume price moves occur, it is not a foregone conclusion that the price will revert immediately or in the near term. Institutional traders aren’t necessarily skilled or wise, and can get caught up in a frenzy or react to it, so such effects can have real market impact. However, most of the time they tend to be transient.

Myth 7: Shorting is an abstraction, and is just like buying negative shares. In many cases, it effectively behaves like this for the trader. However, the actual process is more complicated. “Naked shorts” generally are not allowed, though they can arise in anomalous circumstances. When you sell short, you are not simply assigned a negative number of shares, which settles accordingly. You are borrowing specific shares of stock from a specific person who has a long position. The matching process is called a “locate” and is conducted at your broker’s level if possible or at the exchange level if the broker has no available candidates. There is an exception for market-makers and for brokers when a stock is deemed “easy to borrow”, meaning it is highly liquid and there will be no problem covering the short if necessary. Brokers maintain dynamic “easy to borrow” and “hard to borrow” lists for this purpose.

From the standpoint of a trader, there are two situations in which a short may not behave as expected. Suppose Bob sells short 100 shares of ZZZ stock, and the broker locates it with Alice. Alice owns 100 shares, and the broker effectively lends these to Bob. If Alice decides to sell her shares, Bob now needs to return the shares he borrowed and be assigned new ones. Normally, this is transparent to Bob. But if replacement shares cannot be located, he must exit his short position. The short sale is contingent on the continuing existence of located shares.

Because of the borrowing aspect, Bob’s broker also must ensure he has sufficient funds to cover any losses as ZZZ rises. This requires a margin. If ZZZ goes up, Bob may have to put up additional capital or exit his position (and take the loss). In principle, a short can result in an unlimited loss. In practice, Bob would fail a margin call before then. I.e., Bob cannot simply “wait out” a loss as he could with a long position.

If, as you should, you view the value of your position as always marked-to-market, then (aside from transaction cost or tax concerns) you never should hold a position just to wait out a loss. Most people don’t think or act this way, and there sometimes are legitimate reasons not to. For example, a long term investment generally shouldn’t be adjusted unless new information arrives (though that information may regard other stocks or externalities which necessitate an overall portfolio adjustment). One could argue that short term random fluctuations do not constitute new information, and without an alpha model one should not trade on them. This is a reasonable view. However, the ability to avoid doing so is not symmetric. Because of the issues mentioned, short positions may be harder to sustain than long ones.

The next couple of myths involve some tax lingo. In what follows “STCG” refers to “Short Term Capital Gain” and “LTCG” refers to “Long Term Capital Gain”. “STCL” and “LTCL” refer to the corresponding losses (i.e. negative gains).

Myth 8: Shares are fungible. When you sell them, it doesn’t matter which ones you sell. This is true from the standpoint of stock trading, but not taxes. Most brokers allow you to specify the specific shares (the “lots”) you wish to sell, though the means of doing so may not be obvious. However, for almost all purposes two main choices suffice: LIFO and FIFO. Most of the time, FIFO is the default. With many brokers, you can change this default for your account, as well as override it for individual trades.

Let’s look at the difference between FIFO and LIFO. Suppose Bob bought 100 shares of ZZZ at $50 three years ago and bought another 100 shares of ZZZ at $75 six months ago. ZZZ now is at $100, and he decides to sell 100 shares. If he sells the first 100 shares, a LTCG of $5000 ($10000 – $5000) is generated, but if he sells the second 100 shares a STCG of $2500 ($10000 – $7500) is generated. The implications of such gains can be significant, and are discussed below. The specifics of Bob’s situation will determine which sale is more advantageous, or less disadvantageous.

The first choice corresponds to FIFO accounting: first in, first out. The second corresponds to LIFO: last in, first out. One usually (but not always) benefits from FIFO, which is why this is the default. Note that FIFO and LIFO are relative to a given brokerage account, since a broker only knows about your positions with it. If Bob had an earlier position with broker B, broker A does not know about it and cannot sell it. In that case, Bob must keep track of these things. FIFO and LIFO are relative to the specific account in question, but the tax consequences for Bob are determined across all brokerage accounts. We’ll see what this means in a moment.
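Here is a rough Python sketch of how the choice of lot changes the gain in Bob's example. Everything in it is an illustrative assumption: the `Lot` class, the `sell` function, the specific dates, and especially the "held more than 365 days" test, which is a simplification of the real holding-period rule. It is not tax software.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Tuple


@dataclass
class Lot:
    bought: date
    shares: int
    price: float  # purchase price per share (the basis per share)


def sell(lots: List[Lot], shares: int, price: float, sold: date,
         method: str = "FIFO") -> List[Tuple[str, float]]:
    """Return (kind, gain) for each lot touched; kind is 'LT' or 'ST'.

    Assumes a gain is long-term when the lot was held more than 365 days.
    Does not modify `lots`, and ignores any shares beyond the lots given.
    """
    # FIFO sells the oldest lot first; LIFO sells the newest lot first.
    ordered = sorted(lots, key=lambda lot: lot.bought, reverse=(method == "LIFO"))
    results = []
    remaining = shares
    for lot in ordered:
        if remaining == 0:
            break
        take = min(remaining, lot.shares)
        kind = "LT" if (sold - lot.bought).days > 365 else "ST"
        results.append((kind, take * (price - lot.price)))
        remaining -= take
    return results


# Bob's two lots, with hypothetical dates three years and six months back.
today = date(2021, 2, 1)
lots = [Lot(date(2018, 2, 1), 100, 50.0), Lot(date(2020, 8, 1), 100, 75.0)]

print(sell(lots, 100, 100.0, today, "FIFO"))  # [('LT', 5000.0)]
print(sell(lots, 100, 100.0, today, "LIFO"))  # [('ST', 2500.0)]
```

The same 100 shares sold at the same price produce either a long-term gain of $5000 or a short-term gain of $2500, purely as a function of which lot is chosen.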

All capital gains are relative to “basis” (or “tax basis”), generally the amount you paid for the stock when you bought it. In the example above, the basis for the first lot was $5000 and the basis for the second was $7500. This was why the LTCG from the first was $5000, while the STCG from the second was $2500. With stocks (but not necessarily mutual funds), a tax event only occurs when you close your position. If you hold the shares for 10 years, only in year 10 is a capital gains tax event generated. This can allow some strategic planning, and part of your overall investment strategy may involve choosing to sell in a low-income year. Note that dividends are taxed when you receive them, regardless of whether they are cash or stock dividends or whether you chose to reinvest them. Also note that some mutual funds generate tax events from their own internal trading. You could be taxed on these (STCG or LTCG), and it is best to research the tax consequences of a fund before investing in it.

If you transfer stocks between accounts (usually done when transferring a whole account to a new broker), their tax basis is preserved. No tax events are generated. Note that the transfer must be done right. If you manually close your old positions and open new ones (with enough time between), you may generate a tax event. But if you perform an official “transfer” (usually initiated with your destination broker), the basis is preserved and no tax event occurs. Whether your broker will know that basis is another question. Not every broker’s technology or commitment to customer convenience is up to snuff. It is a good practice to keep your own careful records of all your trading activity.

When would LIFO be preferable? There are various cases, but the most common is to take a STCL to offset STCGs. STCGs tend to be taxed at a much higher rate than LTCGs, so taking a loss against them often is the desirable thing to do. In Bob’s case, if the price had gone down to $25 instead of up to $100, he could sell at a loss and use that loss to offset gains from some other stocks. He would have to specify LIFO to sell the newer lot and generate the STCL.

Myth 9: A “no-fee” trading account is better than one with fees. The cost to a trader involves several components. The main three are broker fees, exchange fees, and “execution”. “No-fee” refers to the broker fee. Unless many small trades are being executed with high frequency, the broker fee tends to be small. The exchange fees are passed along to you, even for “no-fee” accounts. The “execution” is the bulk of the cost. No or low-fee brokers often cross flow internally or sell flow to high-frequency firms which effectively front-run you. Market orders see slightly worse execution than they could, and limit orders get filled with slightly lower frequency than they could (or are deferred, causing slight adverse selection). These effects are not huge, but something to be aware of.

Suppose Alice buys 100 shares of ZZZ at $100. Broker X is no-fee, and Broker Y charges a fee of $7.95 per trade but has 10 bp (0.1%) better execution than Broker X on average. That 10 bp is just a price improvement of $0.10 per share, and amounts to $10 on the trade. Alice does better with Broker Y than Broker X. This benefit may seem to apply only to large trades, but it also applies to stocks with large spreads. For illiquid stocks (including penny stocks) the price improvement can be much more significant. There are trading styles (lots of small trades in highly liquid stocks) where no-fee sometimes trumps better execution, but most often it does not.
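
The arithmetic in Alice's example is easy to check. Here is a minimal sketch; the function `net_cost` and its parameter names are made up for illustration, and it assumes the price improvement applies to the full notional of the trade.

```python
def net_cost(shares, price, fee, improvement_bp):
    """Total cost of buying `shares` at `price`, given a flat broker fee
    and a price improvement (in basis points) on the quoted price."""
    notional = shares * price
    improvement = notional * improvement_bp / 10_000
    return notional + fee - improvement

# Broker X: no fee, baseline execution. Broker Y: $7.95 fee, 10 bp better.
broker_x = net_cost(100, 100.0, fee=0.00, improvement_bp=0)
broker_y = net_cost(100, 100.0, fee=7.95, improvement_bp=10)

# Positive means Broker Y was cheaper for Alice.
saving_with_y = broker_x - broker_y
print(round(saving_with_y, 2))
```

The saving shrinks as the trade gets smaller, which is why the style of trading matters.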

Myth 10: Taxes are something your accountant figures out, and shouldn’t affect your trading. Selling at the best price is all that matters. Taxes can eat a lot of your profit, and should be a primary consideration. Tax planning involves choosing accounts to trade in (401K or other tax-deferred vs regular), realizing losses to offset gains, and choosing assets with low turnover. As mentioned, some mutual funds can generate capital gains through their internal trading. In extreme cases, you could pay significant tax on a losing position in one.

Why are taxes so important to trading? The main reason is that there can be a 25% (or more) difference in tax rate between a LTCG and a STCG. STCGs often are taxed punitively, or at best are treated like ordinary income. Here in MA, the state tax alone is 12% for STCGs vs 5% for LTCGs. Federally, STCGs are treated as ordinary income while LTCGs have their own lower rate.

STCGs are defined as positions held for under one year, while LTCGs are held for over one year. Note that it is the individual positions that matter. If Bob owns 200 shares of ZZZ, bought in two batches, then each batch has its own basis and its own purchase date. Also note that most stock option positions result in a STCG or STCL. A STCG only can be offset by a STCL, but a LTCG can be offset by a LTCL or STCL. Clearly, STCLs are more valuable than LTCLs. They can be rolled to subsequent years under some circumstances, but may be automatically wasted against LTCGs if you are not careful.

A good understanding of these details can save a lot of money. To understand the impact, suppose Alice has a (state+federal) 20% LTCG marginal tax rate and a 45% STCG marginal tax rate. She makes $10,000 on a trade, not offset by any loss. If it is a LTCG, she pays $2000 in taxes and keeps $8000. If it is a STCG, she pays $4500 and keeps $5500. That’s an additional $2500 out of her pocket. Since the markets pay us to take risk, she must take more risk or tie up more capital to make the same $8000 of after-tax profit. How much more capital? Not just the missing 25%, because the extra profit will be taxed at 45% as well. We solve 0.55x = 8000 to get x ≈ 14,545. Alice must tie up 45% more capital or (loosely speaking) take 45% more risk to walk away with the same after-tax profit.

Myth 11: Options are like leveraged stock. No. This is untrue for many reasons, but I’ll point out one specific issue. Options can be thought of as volatility bets. Yes, the Black-Scholes formula depends on the stock price in a nonlinear manner, and yes the Black-Scholes model significantly underestimates tail risk. But for many purposes, it pays to think of options as predominantly volatility-based.

Let’s return to our absurd but illustrative earlier scenario involving Bob bidding himself up and Alice happily making money off him. As before, they trade ZZZ stock and are the only market participants but don’t know it. They run up their positions as before, with Bob buying a share from Alice at $100, then $101, up to $109. He now owns 10 shares. Both are so excited to be trading, they fall over backward in their chairs and bang their heads. Alice goes from pessimistic to optimistic, while Bob goes from optimistic to pessimistic. He wants to unload some of his stock, and offers to sell a share at $109. Alice now is optimistic, so she buys. He tries again, but gets no bite so he lowers the price to $108. Alice thinks this is a good deal and snaps it up. Bob sees the price dropping and decides to get out while he can. He offers at $107, Alice buys. And so on. At $100 he has sold his last share. Both are back where they started, as is the last reported trade price of ZZZ. At this point, both lean back in relief and their chairs topple over again. Now they’re back to their old selves, and they repeat the original pattern, with Alice selling to Bob at $100, $101, etc. Their chairs are very unstable, and this pattern repeats several times during the day. The last leg of the day is a downward one.

The day’s trading involves ZZZ stock price see-sawing between 100 and 109, and the price ends where it started. Consider somebody trading the options market (maybe Alice and Bob are the only active stock traders that day because everybody else is focusing on the options market). The price of ZZZ is unchanged between the open and close, but the prices of most ZZZ call and put options have risen dramatically. Option prices are driven by several things: the stock price, the strike price, the time to expiry, and the volatility. If the stock price rises dramatically, put options will go down but not as much as the price change would seem to warrant. This is because the volatility has increased. In our see-saw case, the volatility rose even when the stock price remained the same.

Myth 12: There are 12 myths.

# Two-Envelope Problems

Let’s visit a couple of fun and extremely counterintuitive problems which sit in the same family. The first appears to be a “paradox,” and illustrates a subtle fallacy. The second is an absolutely astonishing (and legitimate) algorithm for achieving better than 50-50 odds of picking the higher of two unknown envelopes. Plenty of articles have discussed who discovered what ad nauseam so we’ll just dive into the problems.

— The Two Envelope Paradox: Optimizing Expected Return —

First, consider the following scenario. Suppose you are shown two identical envelopes, each containing some amount of money unknown to you. You are told that one contains double the money in the other (but not which is which or what the amounts are) and are instructed to choose one. The one you select is placed in front of you and its contents are revealed. You then are given a second choice: keep it or switch envelopes. You will receive the amount in the envelope you choose. Your goal is to maximize your expected payment.

Our intuition tells us that no information has been provided by opening the envelope. After all, we didn’t know the two values beforehand so learning one of them tells us nothing. The probability of picking the higher envelope should be ${1/2}$ regardless of whether we switch or not. But you weren’t asked to improve on the probability, just to maximize your expected payment. Consider the following 3 arguments:

• Let the amount in the envelope you initially chose be ${z}$. If it is wrong to switch then the other envelope contains ${z/2}$, but if it is right to switch it contains ${2z}$. There are even odds of either, so your expectation if you switch is ${1.25z}$. This is better than the ${z}$ you get by sticking with the initial envelope, so it always is better to switch!
• Since we don’t know anything about the numbers involved, opening the first envelope gives us no information — so ignore that value. Call the amount in the other envelope ${z'}$. If it is wrong to switch then the envelope you chose contains ${2z'}$, and if right to switch it contains ${0.5z'}$. If you switch, you get ${z'}$ but if you don’t your expectation is ${1.25z'}$. So it always is better NOT to switch!
• Call the amounts in the two envelopes ${x}$ and ${2x}$ (though you don’t know which envelope contains which). You pick one, but there is equal probability of it being either ${x}$ or ${2x}$. The expected reward thus is ${1.5x}$. If you switch, the same holds true for the other envelope. So you still have an expected reward of ${1.5x}$. It doesn’t matter what you do.

Obviously, something is wrong with our logic. One thing that is clear is that we’re mixing apples and oranges with these arguments. Let’s be a bit more consistent with our terminology. Let’s call the value that is in the opened envelope ${z}$ and the values in the two envelopes ${x}$ and ${2x}$. We don’t know which envelope contains each, though. When we choose the first envelope, we observe a value ${z}$. This value may be ${x}$ or ${2x}$.

In the 3rd argument, ${P(z=x)= P(z=2x)= 0.5}$. If we switch, then ${\langle V \rangle= P(z=x)2x+P(z=2x)x = 1.5x}$. If we keep the initial envelope then ${\langle V \rangle= P(z=x)x+P(z=2x)2x = 1.5x}$. Whether we switch or not, the expected value is ${1.5x}$ though we do not know what this actually is. It could correspond to ${1.5z}$ or ${0.75z}$. We must now draw an important distinction. It is correct that ${P(z=x)= P(z=2x)= 0.5}$ for the known ${z}$ and given our definition of ${x}$ as the minimum of the two envelopes. However, we cannot claim that ${1.5x}$ is ${1.5z}$ or ${0.75z}$ with equal probability! That would be tantamount to claiming that the envelopes contain the pairs ${(z/2,z)}$ or ${(z,2z)}$ with equal probability. We defined ${x}$ to be the minimum value so the first equality holds, but we would need to impose a constraint on the distribution over that minimum value itself in order for the second one to hold. This is a subtle point and we will return to it shortly. Suffice it to say that if we assume such a thing we are led right to the same fallacy the first two arguments are guilty of.

Obviously, the first two arguments can’t both be correct. Their logic is the same and therefore they must both be wrong. But how? Before describing the problems, let’s consider a slight variant in which you are NOT shown the contents of the first envelope before being asked to switch. It may seem strange that right after you’ve chosen, you are given the option to switch when no additional information has been presented. Well, this really is the same problem. With no a priori knowledge of the distribution over ${x}$, it is immaterial whether the first envelope is opened or not before the 2nd choice is made. This gives us a hint as to what is wrong with the first two arguments.

There actually are two probability distributions at work here, and we are confounding them. The first is the underlying distribution on ordered pairs or, equivalently, the distribution of the lower element ${x}$. Let us call it ${P(x)}$. It determines which two numbers ${(x,2x)}$ we are dealing with. We do not know ${P(x)}$.

The second relevant distribution is over how two given numbers (in our case ${(x,2x)}$) are deposited in the envelopes (or equivalently, how the player orders the envelopes by choosing one first). This distribution unambiguously is 50-50.

The problem arises when we implicitly assume a form for ${P(x)}$ or attempt to infer information about it from the revealed value ${z}$. Without a priori knowledge of ${P(x)}$, being shown ${z}$ makes no difference at all. Arguments which rely solely on the even-odds of the second distribution are fine, but arguments which implicitly involve ${P(x)}$ run into trouble.

The first two arguments make precisely this sort of claim. They implicitly assume that the pairs ${(z/2,z)}$ or ${(z,2z)}$ can occur with equal probability. Suppose they couldn’t. For simplicity (and without reducing the generality of the problem), let’s assume that the possible values in the envelopes are constrained to ${2^n}$ with ${n\in Z}$. The envelopes thus contain ${(2^n,2^{n+1})}$ for some integer ${n}$ (though we don’t know which envelope contains which value). For convenience, let’s work in terms of ${\log_2}$ of the values involved (taking care to use ${2^n}$ when computing expectations).

In these terms, the two envelopes contain ${(n,n+1)}$ for some ${n=\log_2(x)}$ (defined to be the lesser of the two). We open one, and see ${m=\log_2(z)}$. If it is the upper then the pair is ${(m-1,m)}$, otherwise the pair is ${(m,m+1)}$. To claim that these have equal probabilities means that ${n=m-1}$ and ${n=m}$ are equally probable. We made this assumption independent of the value of ${m}$, so it would require that all pairs ${(n,n+1)}$ be equally probable.

So what? Why not just assume a uniform distribution? Well, for one thing, we should be suspicious that we require an assumption about ${P(x)}$. The 3rd argument requires no such assumption. Even if we were to assume a form for ${P(x)}$, we can’t assume it is uniform. Not just can’t as in “shouldn’t”, but can’t as in “mathematically impossible.” It is not possible to construct a uniform distribution on ${Z}$.

Suppose we sought to circumvent this issue by constraining ourselves to some finite range ${[M,N]}$, which we supposedly know or assume a priori. We certainly can impose a uniform distribution on it. Each pair ${(n,n+1)}$ has probability ${1/(N-M)}$ with ${n\in [M,N-1]}$, since there are ${N-M}$ such pairs. But now we’ve introduced additional information (in the form of ${N}$ and ${M}$), and it no longer is surprising that we can do better than even-odds! We always would switch unless the first envelope contained ${N}$. There is no contradiction between the first two arguments because we have a priori knowledge and are acting on it. We no longer are true to the original game.

Rather than dwell on this particular case, let’s solve the more general case of a given ${P(x)}$ (or in terms of ${\log_2}$, ${P(n)}$). For any ${n}$ drawn according to ${P(n)}$, the envelopes contain ${(n,n+1)}$ in some order and it is equally likely that ${m=n}$ and ${m=n+1}$. If we know ${P}$ we can bet accordingly since it contains information. In that case, knowing ${m}$ (i.e. ${z}$) helps us. Let’s suppose we don’t know ${P}$. Then it still does not matter whether we observe the value ${z}$, because we don’t know the underlying distribution!

There only are two deterministic strategies: always keep, always switch. Why? Suppose that the drawn value is ${n}$ (unknown to us) and the observed value is ${m}$. Note that these don’t require actual knowledge of the ${m}$ value, just that it has been fixed by the process of opening the envelope. Since we don’t know the underlying distribution, our strategy will be independent of the actual value. Given that the value doesn’t matter, we have nothing to do but always keep or always switch.

First consider the expected value with the always-keep strategy:

$\displaystyle \langle V_K \rangle= \sum_{n=-\infty}^\infty P(n) [P(m=n|n) 2^n + P(m=n+1|n) 2^{n+1}]$

I.e. we sum over all possible ordered pairs ${(n,n+1)}$ and then allow equal probability ${P(m=n+1|n)=P(m=n|n)=0.5}$ for either of the two envelope orders. So we have ${\langle V_K \rangle= \sum P(n) (2^n+2^{n+1})/2 = 3 \langle 2^{n-1} \rangle}$. We immediately see that for this to be defined the probability distribution must drop faster than ${2^{-n}}$ as ${n}$ gets large! We already have a constraint on the possible forms for ${P}$.

Next consider the always-switch strategy. It’s easy to see that we get the same result:

$\displaystyle \langle V_S \rangle= \sum_{n=-\infty}^\infty P(n) [P(m=n|n) 2^{n+1} + P(m=n+1|n) 2^{n}]$

and since ${P(m=n|n)= P(m=n+1|n)}$ we get the same answer.
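
If you'd rather see it than believe it, the two sums are easy to evaluate exactly for any concrete ${P(n)}$. Below is a minimal sketch; the particular distribution (weights proportional to ${3^{-|n|}}$ on a finite range) and the function name `expected_values` are assumptions chosen purely for illustration.

```python
from fractions import Fraction

def expected_values(P):
    """Return (always-keep, always-switch) expected payouts.

    P maps the lower exponent n to its probability. The envelopes hold
    2**n and 2**(n+1), and we open either one with probability 1/2.
    """
    half = Fraction(1, 2)
    keep = switch = Fraction(0)
    for n, prob in P.items():
        low, high = Fraction(2) ** n, Fraction(2) ** (n + 1)
        keep += prob * (half * low + half * high)    # we end with what we opened
        switch += prob * (half * high + half * low)  # we end with the other one
    return keep, switch

# Hypothetical P(n): weights proportional to 3**(-|n|) on n = -5..5
weights = {n: Fraction(1, 3 ** abs(n)) for n in range(-5, 6)}
norm = sum(weights.values())
P = {n: w / norm for n, w in weights.items()}

v_keep, v_switch = expected_values(P)
three_mean = 3 * sum(p * Fraction(2) ** (n - 1) for n, p in P.items())
print(v_keep == v_switch == three_mean)
```

Exact rational arithmetic is used so that the equality is not blurred by floating-point rounding.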

But let’s be extra pedantic, and connect this to the original formulation of the first two arguments. I.e., we should do it in terms of ${m}$, the observed value.

$\displaystyle \langle V_S \rangle= \sum_m P(m) [P(n=m|m) 2^{m+1} + P(n=m-1|m) 2^{m-1}]$

We observe that ${P(n=m|m)= P(m|n=m)P(n=m)/P(m)}$ and ${P(n=m-1|m)= P(m|n=m-1)P(n=m-1)/P(m)}$. We know that ${P(m|n=m)= P(m|n=m-1)= 0.5}$. Plugging these in, we get

$\displaystyle \langle V_S \rangle= \sum_m [0.5 P(n=m) 2^{m+1} + 0.5 P(n=m-1) 2^{m-1}]$

The first term gives us ${\sum_n P(n) 2^n}$. We can rewrite the index on the 2nd sum to get ${\sum_n P(n) 2^{n-1}}$, which gives us ${\langle V_S \rangle= \sum_n P(n) (2^n + 2^{n-1})}$, the exact same expression as before!

How does this apply to the ${[M,N]}$ ranged example we gave before? When we discussed it, we considered the case where the underlying distribution was known. In that and all other cases, a better than even-odds strategy based on such knowledge can be computed. In our actual formulation of the game, we don’t know ${P(n)}$ and there’s no reason it couldn’t be uniform on some unknown interval ${[M,N]}$. Suppose it was. It still seems from our earlier discussion as if we’d do better by always switching. We don’t. The average amount thrown away by incorrectly switching when ${m=N}$ exactly offsets the average gain from switching in all other cases. We do no better by switching than by keeping.
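
The ${[M,N]}$ example is small enough to enumerate exactly. The sketch below assumes a uniform distribution over the pairs ${(n,n+1)}$ with ${n}$ from ${M}$ to ${N-1}$; the range and the name `expected_payout` are hypothetical. It compares the two blind strategies with one that knows ${N}$.

```python
from fractions import Fraction

def expected_payout(M, N, switch_rule):
    """Expected payout when the pair (n, n+1) is uniform over n = M..N-1.

    switch_rule(m) returns True if we switch after observing exponent m.
    """
    pairs = range(M, N)
    total = Fraction(0)
    for n in pairs:
        # Either envelope is opened first with probability 1/2.
        for m, other in ((n, n + 1), (n + 1, n)):
            final = other if switch_rule(m) else m
            total += Fraction(2) ** final
    return total / (2 * len(pairs))

M, N = 0, 10  # hypothetical range
always_keep = expected_payout(M, N, lambda m: False)
always_switch = expected_payout(M, N, lambda m: True)
informed = expected_payout(M, N, lambda m: m != N)  # relies on knowing N

print(always_keep == always_switch, informed > always_keep)
```

The informed rule wins only because it is handed ${N}$. Without it, the loss from switching away from ${N}$ cancels the gains everywhere else.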

We thus see that without knowing the underlying distribution ${P(x)}$, the switching and keeping strategies have the same expected reward. Of the three arguments we originally proposed, the first 2 were flawed in that they assume a particular, and impossible, underlying distribution for ${x}$.

At the beginning of our discussion, we mentioned that our intuition says you cannot do better than 50-50 probability-wise. Let us set aside expected rewards and focus solely on probabilities. We now see how you actually can do better than 50-50, contrary to all intuition!

— Achieving better than 50-50 Odds with Two Envelopes —

Next let’s consider a broader class of two-envelope problems, but purely from the standpoint of probabilities. Now the two envelopes can contain any numbers; one need not be double the other. As before, we may choose an envelope, it is opened, and we are offered the opportunity to keep it or switch. Unlike before, our goal now is to maximize the probability of picking the larger envelope.

Since we are dealing with probabilities rather than expectation values, we don’t care what two numbers the envelopes contain. In fact, they need not be numbers at all, as long as they are distinct and comparable (i.e. ${a<b}$ or ${b<a}$ but not both). To meaningfully analyze the problem we require a slightly stronger assumption, though: specifically that the set from which they are drawn (without repetition) possesses a strict linear ordering. However, it need not even possess any algebraic structure or a metric. Since we are not concerned with expectation values, no such additional structure is necessary.

Our intuition immediately tells us that nothing can be gained by switching. In fact, nothing we do should have any impact on the outcome. After all, the probability of initially picking correctly is ${1/2}$. Switching adds no information and lands us with an identical ${1/2}$ probability. And that is that, right? It turns out that, contrary to our very strong intuition about the problem, there is in fact a way to improve those odds. To accomplish this, we’ll need to introduce a source of randomness. For convenience of exposition we’ll assume the envelopes contain real numbers, and revisit the degree to which we can generalize the approach later.

The procedure is as follows:

• Pick any continuous probability distribution ${P}$ which has support on all of ${R}$ (i.e. ${p(x)>0}$ for all real ${x}$). Many common distributions (normal, Cauchy, logistic, etc.) are fine. Ones with restricted support, such as the beta or exponential, are not.
• Choose an envelope and open it. We’ll denote its value ${z}$.
• Sample some value ${d}$ from our distribution ${P}$. If ${z>d}$ stick with the initial choice, otherwise switch. We’ll refer only to ${z>d}$ or ${z<d}$ because the probability that ${z=d}$ has measure ${0}$ and safely can be ignored.
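
The procedure above is short enough to simulate. Here is a minimal sketch, assuming a normal distribution for ${P}$ (any full-support choice would do) and a fixed, hypothetical pair of envelope values. The names `play` and `win_rate` are made up for illustration.

```python
import random

def play(x, y, rng, mu=0.0, sigma=1.0):
    """One round with envelope values x < y. Returns True if we end with y.

    The threshold d is drawn from a normal distribution, which is an
    assumption of this sketch rather than a requirement of the method.
    """
    z, other = (x, y) if rng.random() < 0.5 else (y, x)  # fair envelope pick
    d = rng.gauss(mu, sigma)
    final = z if z > d else other  # keep if z > d, otherwise switch
    return final == y

def win_rate(x, y, rounds=100_000, seed=0):
    """Fraction of simulated rounds in which we end with the larger value."""
    rng = random.Random(seed)
    return sum(play(x, y, rng) for _ in range(rounds)) / rounds

print(win_rate(-0.5, 0.5))   # the normal puts real mass between these values
print(win_rate(10.0, 11.0))  # almost no mass here, so the edge is negligible
```

The second call shows the catch: the improvement is real but can be far too small to see in a simulation when our distribution puts almost nothing between the two values.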

At first, second, and ${n^{th}}$ glance, this seems pointless. It feels like all we’ve done is introduce a lot of cruft which will have no effect. We can go stand in a corner flipping a coin, play Baccarat at the local casino, cast the bones, or anything else we want, and none of that can change the probability that we’re equally likely to pick the lower envelope as the higher one initially — and thus equally likely to lose as to gain by switching. With no new information, there can be no improvement. Well, let’s hold that thought and do the calculation anyway. Just for fun.

First some terminology. We’ll call the value in the opened envelope ${z}$, and the value in the other envelope ${z'}$. The decision we must make is whether to keep ${z}$ or switch to the unknown ${z'}$. We’ll denote by ${x}$ and ${y}$ the values in the two envelopes in order. I.e., ${x<y}$ by definition. In terms of ${z}$ and ${z'}$ we have ${x= \min(z,z')}$ and ${y= \max(z,z')}$. We’ll denote our contrived distribution ${P}$ in the abstract, with pdf ${p(v)}$ and cdf ${F(v)=\int_{-\infty}^v p(v') dv'}$.

Let’s examine the problem from a Bayesian perspective. There is a 50-50 chance that ${(z,z')=(x,y)}$ or ${(z,z')=(y,x)}$. So ${p(z=x)=p(z=y)=0.5}$. There are no subtleties lurking here. We’ve assumed nothing about the underlying distribution over ${(x,y)}$. Whatever ${(x,y)}$ the envelopes contain, we are equally likely to initially pick the one with ${x}$ or the one with ${y}$.

Once the initial envelope has been opened, and the value ${z}$ revealed, we sample ${d}$ from our selected distribution ${P}$ and clearly have ${p(d<x)=F(x)}$ and ${p(d<y)=F(y)}$ and, for the observed value, ${p(d<z)=F(z)}$. The latter forms the criterion by which we will keep ${z}$ or switch to ${z'}$. Please note that in what follows, ${d}$ is not a free variable, but rather a mere notational convenience. Something like ${p(x<d)}$ is just notation for “the probability the sampled value is greater than ${x}$.” We can apply Bayes’ law to get (with all probabilities conditional on some unknown choice of ${(x,y)}$):

$\displaystyle p(z=x|d<z)= \frac{p(d<z|z=x)p(z=x)}{p(d<z)}= \frac{F(x)}{2p(d<z)}$

What we really care about is the ratio, in which the common denominator cancels:

$\displaystyle \frac{p(z=x | d<z)}{p(z=y | d<z)}= \frac{p(d<z|z=x)p(z=x)}{p(d<z|z=y)p(z=y)}= \frac{p(d<x)}{p(d<y)}= \frac{F(x)}{F(y)} < 1$

Here, we’ve observed that ${p(d<z|z=x)= p(d<x)= F(x)}$ and ${F(x)<F(y)}$ since by assumption ${x<y}$ and ${F}$ is strictly monotonically increasing (we assumed its support is all of ${R}$). I.e., if ${d<z}$ there is a greater probability that ${z=y}$ than ${z=x}$. We shouldn’t switch. A similar argument shows we should switch if ${d>z}$.

So what the heck has happened, and where did the new information come from? What happened is that we actually know one piece of information we had not used: that the interval ${(x,y)}$ has nonzero probability measure. I.e. there is some “space” between ${x}$ and ${y}$. We don’t know the underlying distribution but we can pretend we do. Our strategy will be worse than if we did know the underlying ${p(x)}$, of course. We’ll return to this shortly, but first let’s revisit the assumptions which make this work. We don’t need the envelopes to contain real numbers, but we do require the following of the values in the envelopes:

• The set of possible values forms a measurable set with a strict linear ordering.
• Between any two elements there is a volume with nonzero probability. Actually, this only is necessary if we require a nonzero improvement for any ${(x,y)}$. If we only require an improvement on average we don’t need it. But in that scenario, the host can contrive to use a distribution which neutralizes our strategy and returns us to 50-50 odds.

What difference does ${P}$ itself make? We don’t have any way to choose an “optimal” distribution because that would require placing the bulk of probability where we think ${x}$ and ${y}$ are likely to lie. I.e. we would require prior knowledge. All we can guarantee is that we can improve things by some (perhaps tiny) amount. We’ll compute how much (for a given true underlying distribution) shortly.

Let’s assume that ${Q(x,y)}$ is the true underlying distribution over ${(x,y)}$. We won’t delve into what it means to “know” ${Q}$ since we are handed the envelopes to begin with. Perhaps the game is played many times with values drawn according to ${Q}$ or maybe it is a one-time affair with ${(x,y)}$ fixed (i.e. ${Q}$ a ${\delta}$-distribution). Ultimately, such considerations just would divert us to the standard core philosophical questions of probability theory. Suffice to say that there exists some ${Q(x,y)}$. By definition ${Q(x,y)=0}$ unless ${x<y}$. For convenience, we’ll define a symmetrized version as well: ${q(a,b)\equiv Q(a,b)+Q(b,a)}$. We don’t employ a factor of ${1/2}$ since the two terms are nonzero on disjoint domains.

Given ${Q}$, what gain do we get from a particular choice of ${P}$?

$\displaystyle \begin{array}{rcl} P(win) &=& \int_{x<y} dx\,dy\, Q(x,y) [p(z=y|(x,y))\,p(d<y) + p(z=x|(x,y))\,p(d>x)] \\ &=& \int_{x<y} dx\,dy\, Q(x,y) [p(z=y|(x,y))\,F(y) + p(z=x|(x,y))\,(1-F(x))] \end{array}$

I.e., the probability we keep ${z}$ when it is ${y}$ and switch when it is ${x}$. Clearly, ${p(z=x|(x,y))= p(z=y|(x,y))= 0.5}$ since those are the immutable 50-50 envelope ordering probabilities. After a little rearrangement, we get:

$\displaystyle P(win)= \frac{1}{2} + \frac{1}{2}\langle F(y) - F(x) \rangle_Q$

Our gain is half the mean value of ${F(y)-F(x)}$ over the joint distribution ${Q(x,y)}$. The more probability ${P}$ jams between ${x}$ and ${y}$, the more we gain should that ${(x,y)}$ arise. But without knowledge of the underlying joint distribution ${Q(x,y)}$, we have no idea how best to pick ${P}$. All we can do is guarantee some improvement.
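
To get a feel for the size of the gain, we can evaluate the win probability directly for a small discrete ${Q}$. This is a minimal sketch that works from first principles (we win by keeping ${y}$ when ${d<y}$, or by switching away from ${x}$ when ${d>x}$). The three pairs in `Q`, the standard normal choice for ${P}$, and the function names are all assumptions for illustration.

```python
import math

def normal_cdf(v, mu=0.0, sigma=1.0):
    """CDF of a normal distribution, via the error function."""
    return 0.5 * (1.0 + math.erf((v - mu) / (sigma * math.sqrt(2.0))))

def win_probability(Q, F=normal_cdf):
    """Probability of ending with the larger value under the threshold rule.

    Q maps pairs (x, y) with x < y to probabilities. Each envelope is
    opened first with probability 1/2. We win by keeping y (probability
    F(y)) or by switching away from x (probability 1 - F(x)).
    """
    return sum(q * (0.5 * F(y) + 0.5 * (1.0 - F(x))) for (x, y), q in Q.items())

# Hypothetical host distribution over three pairs
Q = {(-1.0, 1.0): 0.5, (0.0, 0.1): 0.3, (5.0, 6.0): 0.2}
p_win = win_probability(Q)
print(p_win)
```

Pairs that straddle the bulk of ${P}$ contribute most of the edge; pairs far out in the tail contribute almost nothing.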

How well can we do if we actually know ${Q}$? Well, there are two ways to use such information. We could stick to our strategy and try to pick an optimal ${P}$, or we could seek to use knowledge of ${Q}$ directly. In order to do the former, we need to exercise a little care. ${Q}$ is a two-dimensional distribution while ${P}$ is one-dimensional. How would we use ${Q}$ to pick ${P}$? Well, this is where we make use of the observed ${z}$.

In our previous discussion of the ${(x,2x)}$ envelope switching fallacy, the value of ${z}$ turned out to be a red-herring. Here it is not. Observing ${z}$ is essential here, but only for computation of probabilities. As mentioned, we assume no algebraic properties and are computing no expectations. We already know that the observation of ${z}$ is critical, since our algorithm pivots on a comparison between ${z}$ and our randomly sampled value ${d}$. Considering our ultimate goal (keep or switch), it is clear what we need from ${Q}$: a conditional probability that ${z'>z}$. However, we cannot directly use ${Q(y|x)}$ because we defined ${x<y}$. We want ${p(z'|z)}$ and we don’t know whether ${z<z'}$ or ${z'<z}$. Let’s start by computing the probability of ${z}$ (being the observed value) and of ${z,z'}$ (being the observed and unobserved values).

The probability of observing ${z}$ and the other envelope having ${z'}$ is the probability that the relevant ordered pair was chosen for the two envelopes multiplied by the ${1/2}$ probability that we initially opened the envelope containing the value corresponding to our observed ${z}$ rather than the other one.

$\displaystyle p(z,z')= Q(\min(z,z'),\max(z,z'))/2= q(z,z')/2$

To get ${p(z)}$ we integrate this. ${p(z)= \frac{1}{2}\int Q(z,y)dy + \frac{1}{2}\int Q(x,z)dx}$. This is a good point to introduce two quantities which will be quite useful going forward.

$\displaystyle I_1(z)\equiv \int_{-\infty}^z Q(x,z) dx$

$\displaystyle I_2(z)\equiv \int_z^\infty Q(z,y) dy$

In terms of these,

$\displaystyle p(z)= \frac{1}{2}[I_1(z)+I_2(z)]$

There’s nothing special about calling the variables ${x}$ or ${y}$ in the integrals and it is easy to see (since each only covers half the domain) that we get what we would expect:

$\displaystyle p(z)= \frac{1}{2}\int q(w,z)dw$

What we want is the distribution ${p(z'|z)= p(z,z')/p(z)= q(z,z')/(2p(z))}$. This gives us:

$\displaystyle p(z'|z)= \frac{q(z,z')}{\int q(w,z)dw}= \frac{q(z,z')}{I_1(z)+I_2(z)}$

Finally, this gives us the desired quantity ${p(z'>z)= \int_{z'>z} dz' p(z'|z)}$. It is easy to see that:

$\displaystyle p(z'<z)= \frac{I_1(z)}{I_1(z)+I_2(z)}$

$\displaystyle p(z'>z)= \frac{I_2(z)}{I_1(z)+I_2(z)}$
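
For a discrete ${Q}$ the integrals become sums and these quantities can be computed by hand or in a few lines. The sketch below assumes a small hypothetical ${Q}$ over integer pairs; the function names are made up for illustration.

```python
from fractions import Fraction

def I1(Q, z):
    """Weight of pairs in which z is the larger value: sum over x of Q(x, z)."""
    return sum(q for (x, y), q in Q.items() if y == z)

def I2(Q, z):
    """Weight of pairs in which z is the smaller value: sum over y of Q(z, y)."""
    return sum(q for (x, y), q in Q.items() if x == z)

def prob_other_is_larger(Q, z):
    """p(z' > z) for an observed z that actually can occur under Q."""
    a, b = I1(Q, z), I2(Q, z)
    return b / (a + b)

# Hypothetical discrete Q over ordered pairs (x, y) with x < y
Q = {
    (1, 2): Fraction(1, 2),
    (2, 3): Fraction(1, 4),
    (2, 5): Fraction(1, 8),
    (3, 5): Fraction(1, 8),
}

for z in (1, 2, 3, 5):
    print(z, prob_other_is_larger(Q, z))
```

Seeing the smallest possible value means the other envelope is certainly larger, and seeing the largest means it certainly is not. The interesting cases are the values in between.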

As an example, consider the previous ${(x,2x)}$ case, where one envelope holds twice what the other does. We observe ${z}$, and ${z'}$ must be either ${2z}$ or ${z/2}$, though we don’t know with what probabilities. If we are given the underlying distribution on ${x}$, say ${P_2(x)}$, we can figure that out. ${Q(x,y)= P_2(x)\delta(y-2x)}$ and ${q}$ is the symmetrized version. ${\int q(w,z)dw= \int dw [Q(w,z)+Q(z,w)]= (P_2(z/2)+P_2(z))}$: either ${z}$ is the larger value (so ${x=z/2}$) or it is the smaller one (so ${x=z}$). So ${p(z)= \frac{1}{2}(P_2(z/2)+P_2(z))}$. This is just what we’d expect, though we’re really dealing with discrete values and are being sloppy (which ends us up with a ratio of infinities from the ${\delta}$ function when computing probability ratios, but we’ll ignore that here). The relevant probability ratio clearly is ${P_2(z/2)/P_2(z)}$. From a purely probability standpoint, we should switch if ${P_2(z)>P_2(z/2)}$. If we reimpose the algebraic structure and try to compute expectations (as in the previous problem) we would get an expected value of ${z}$ from keeping and an expected value of ${z[P_2(z/2)/2 + 2P_2(z)]/[P_2(z/2)+P_2(z)]}$ from switching. Whether this is less than or greater than ${z}$ depends on the distribution ${P_2}$.

Returning to our analysis, let’s see how often we are right about switching if we know the actual distribution ${Q}$ and use that knowledge directly. The strategy is obvious. Using our above formulae, we can compute ${p(z'<z)}$ directly. To optimize our probability of winning, we observe ${z}$ then we switch iff ${I_1(z)<I_2(z)}$. If there is additional algebraic structure and expectations can be defined, then an analogous calculation gives whatever switching criterion maximizes the relevant expectation value.

In terms of probabilities, full knowledge of ${Q}$ is the best we can do. The probability we act correctly is:

$\displaystyle \begin{array}{rcl} P'(win) &=& \int dz\, p(z) \frac{[\theta(I_1(z)-I_2(z)) I_1(z) + \theta(I_2(z)-I_1(z))I_2(z)]}{I_1(z)+I_2(z)} \\ &=& \frac{1}{2}\int dz \max(I_1(z),I_2(z)) \end{array}$

$\displaystyle P'(win|z)= \frac{\max(I_1(z),I_2(z))}{I_1(z)+I_2(z)}$
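
Here is a discrete sketch of that best case, in which every observable ${z}$ is weighted by its probability ${p(z)= \frac{1}{2}[I_1(z)+I_2(z)]}$ and we switch iff ${I_2(z)>I_1(z)}$. The example ${Q}$ and the name `optimal_win_probability` are hypothetical.

```python
from fractions import Fraction

def optimal_win_probability(Q):
    """Best achievable probability of ending with the larger value.

    Q maps ordered pairs (x, y), x < y, to probabilities (a discrete
    stand-in for the continuous case). For each observable z we make the
    better of the two choices, which is right with probability
    max(I1, I2) / (I1 + I2).
    """
    values = {v for pair in Q for v in pair}
    total = Fraction(0)
    for z in values:
        i1 = sum(q for (x, y), q in Q.items() if y == z)  # z is the larger
        i2 = sum(q for (x, y), q in Q.items() if x == z)  # z is the smaller
        p_z = Fraction(i1 + i2) / 2                       # chance of observing z
        total += p_z * Fraction(max(i1, i2)) / (i1 + i2)
    return total

# Hypothetical discrete Q over ordered pairs (x, y) with x < y
Q = {
    (1, 2): Fraction(1, 2),
    (2, 3): Fraction(1, 4),
    (2, 5): Fraction(1, 8),
    (3, 5): Fraction(1, 8),
}
best = optimal_win_probability(Q)
print(best)
```

Even with full knowledge of ${Q}$ we are not guaranteed to win, because an intermediate value such as 2 can be either the larger or the smaller of its pair.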

If ${I_1}$ and ${I_2}$ cross only once (for instance, if ${I_1(z)/I_2(z)}$ increases with ${z}$), we have a cutoff value ${\hat z}$ (defined by ${I_1({\hat z})= I_2({\hat z})}$) below which we should switch and above which we should not.

How do we do with our invented ${P}$ instead? We could recast our earlier formula for ${P(win)}$ into our current notation, but it’s easier to compute directly. For given ${z}$, the actual probability of needing to switch is ${I_2(z)/(I_1(z)+I_2(z))}$. Based on our algorithm, we will do so with probability ${P(z<d)= 1-F(z)}$. The probability of not needing to switch is ${I_1(z)/(I_1(z)+I_2(z))}$ and we keep with probability ${P(z>d)= F(z)}$. I.e., our probability of success for given ${z}$ is:

$\displaystyle P(win|z)= \frac{I_1(z)F(z) + I_2(z)(1-F(z))}{I_1(z)+I_2(z)}$

For any given ${z}$, this is of the form ${\alpha r + (1-\alpha)(1-r)}$ where ${r= F(z)}$ and ${\alpha= I_1(z)/(I_1(z)+I_2(z))}$. The optimal solutions lie at one end or the other. So it obviously is best to have ${F(z)=0}$ when ${z<{\hat z}}$ and ${F(z)=1}$ when ${z>{\hat z}}$. This would be discontinuous, but we could come up with a smoothed step function (ex. a logistic function) which is differentiable but arbitrarily sharp. The gist is that we want all the probability in ${F}$ concentrated around ${\hat z}$. Unfortunately, we have no idea where ${\hat z}$ is!

Out of curiosity, what if we pick instead ${P}$ to be the conditional distribution ${p(z'|z)}$ itself once we’ve observed ${z}$? We’ll necessarily do worse than by direct comparison using ${Q}$ (the max formula above), but how much worse? Well, ${p(z'|z)= q(z,z')/(I_1(z)+I_2(z))}$. Integrating over ${z'<z}$ we have ${F(z)= \int_{-\infty}^z p(z'|z) dz'= I_1(z)/(I_1(z)+I_2(z))}$. I.e., we end up with ${(I_1^2(z)+I_2^2(z))/(I_1(z)+I_2(z))^2}$ as our probability of success. If we had used ${1-p(z'|z)}$ for our ${P}$ instead we would get ${2I_1(z)I_2(z)/(I_1(z)+I_2(z))^2}$ instead. Neither is optimal in general.
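
These three options are easy to compare at a single ${z}$. In the sketch below, `alpha` stands for ${I_1(z)/(I_1(z)+I_2(z))}$ and `r` for ${F(z)}$; the value ${3/4}$ is an arbitrary choice for illustration, as is the function name `success`.

```python
from fractions import Fraction

def success(alpha, r):
    """P(win | z) when z is the larger value with probability alpha and
    we keep it with probability r = F(z)."""
    return alpha * r + (1 - alpha) * (1 - r)

alpha = Fraction(3, 4)                # hypothetical I1 / (I1 + I2) at some z
best = success(alpha, 1)              # step-function F: always keep, since alpha > 1/2
matched = success(alpha, alpha)       # F(z) = alpha
inverted = success(alpha, 1 - alpha)  # F(z) = 1 - alpha

print(best, matched, inverted)
```

Since the expression is linear in `r`, the maximum always sits at `r = 0` or `r = 1`, which is why the matched choice falls short.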

Next, let’s look at the problem from an information theory standpoint. As mentioned, there are two sources of entropy: (1) the choice of the underlying pair ${(x,y)}$ (with ${x<y}$ by definition) and (2) the selection ${(z,z')=(x,y)}$ or ${(z,z')=(y,x)}$ determined by our initial choice of an envelope. The latter is a fair coin toss with no information and maximum entropy. The information content of the former depends on the (true) underlying distribution.

Suppose we have perfect knowledge of the underlying distribution. Then any given ${z}$ arises with probability ${p(z)=\frac{1}{2}[I_1(z)+I_2(z)]}$. Given that ${z}$, we have a Bernoulli random variable ${p(z'>z)}$ given by ${I_2(z)/(I_1(z)+I_2(z))}$. The entropy of that specific coin toss (i.e. the conditional entropy of the Bernoulli distribution ${p(z'> z|z)}$) is

$\displaystyle H(z'>z|z)= \frac{-I_1(z)\ln I_1(z) - I_2(z)\ln I_2(z) + (I_1(z)+I_2(z))\ln [I_1(z)+I_2(z)]}{I_1(z)+I_2(z)}$

With our contrived distribution ${P}$, we are implicitly operating as if ${p(z'>z)= 1-F(z)}$. This yields a conditional entropy:

$\displaystyle H'(z'>z|z)= -(1-F(z))\ln (1-F(z)) - F(z)\ln F(z)$

There is a natural measure of the information cost of assuming an incorrect distribution. It is the Kullback-Leibler Divergence (also known as the relative entropy). While it wouldn’t make sense to compute it between ${Q}$ and ${P}$ (which are, among other things, of different dimension), we certainly can compare the cost for given ${z}$ of the difference in our Bernoulli random variables for switching, and then integrate over ${z}$ to get an average cost in nats (since we use natural logarithms). Let’s denote by ${q(z'>z)}$ the probability based on the true distribution and keep ${p(z'>z)}$ for the contrived one. I.e. ${q(z'>z)= I_2(z)/(I_1(z)+I_2(z))}$ and ${p(z'>z)= 1-F(z)}$. For given ${z}$, the K-L divergence is:

$\displaystyle D(Q || P, z)= \frac{-I_2(z)\ln [(I_1(z)+I_2(z))(1-F(z))/I_2(z)] - I_1(z)\ln [(I_1(z)+I_2(z))F(z)/I_1(z)]}{I_1(z)+I_2(z)}$

Weighting this by ${p(z)}$ and integrating, we get the mean cost (in nats) of being wrong.

$\displaystyle \begin{array}{rcl} \langle D(Q || P) \rangle= \frac{1}{2}\int dz [-(I_1(z)+I_2(z))\ln [I_1(z)+I_2(z)] - I_2(z)\ln (1-F(z)) \\ -I_1(z)\ln F(z) + I_1(z)\ln I_1(z) + I_2(z)\ln I_2(z)] \end{array}$

Up to an additive constant ${\ln 2}$ (from the factor of ${\frac{1}{2}}$ in ${p(z)}$), the first term is ${H(z)}$, the entropy of our actual distribution over ${z}$. More usefully, the first term and last 2 terms together we recognize as ${-\langle H(z'>z|z) \rangle}$, minus the mean Bernoulli entropy of the actual distribution. In these terms, we have:

$\displaystyle \langle D(Q || P) \rangle= -\langle H(z'>z|z) \rangle + \langle \frac{ -I_2(z)\ln(1-F(z)) - I_1(z)\ln F(z)}{I_1(z)+I_2(z)} \rangle$

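where the expectations are over the unconditional actual distribution ${p(z)}$. The 2nd expectation on the right is the mean cross-entropy of the contrived Bernoulli variable relative to the true one, and it carries the cost of being wrong about ${P}$. The divergence vanishes exactly when ${1-F(z)= I_2(z)/(I_1(z)+I_2(z))}$ for (almost) every ${z}$. A sharp step at ${\hat z}$ achieves this only if the true switch variable is itself deterministic on either side of ${\hat z}$; in that case the term on the right approaches ${0}$ and there is no entropy cost.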
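
The Bernoulli quantities in this discussion are easy to check numerically. The following is a minimal Python sketch, with hypothetical function names and made-up values of ${I_1(z)}$, ${I_2(z)}$, and ${F(z)}$ at a single ${z}$. It works in nats (natural logarithms) and illustrates the general identity that the K-L divergence equals the cross-entropy minus the true entropy.

```python
import math


def bernoulli_entropy(q):
    """Entropy (in nats) of a Bernoulli variable with success probability q."""
    return -sum(p * math.log(p) for p in (q, 1.0 - q) if p > 0.0)


def cross_entropy(q, p):
    """Cross-entropy (in nats) of an assumed Bernoulli(p) under the true Bernoulli(q).

    Assumes 0 < p < 1 so that both logarithms are finite.
    """
    return -(q * math.log(p) + (1.0 - q) * math.log(1.0 - p))


def kl_divergence(q, p):
    """K-L divergence D(q || p) in nats between two Bernoulli variables."""
    total = 0.0
    for a, b in ((q, p), (1.0 - q, 1.0 - p)):
        if a > 0.0:
            total += a * math.log(a / b)
    return total


# Made-up values at a single z: I_1(z), I_2(z), and the contrived F(z).
i1, i2, f = 0.3, 0.1, 0.6
q_true = i2 / (i1 + i2)   # true probability that z' > z
p_assumed = 1.0 - f       # probability implied by the contrived distribution

d = kl_divergence(q_true, p_assumed)
print(d, cross_entropy(q_true, p_assumed) - bernoulli_entropy(q_true))
```

The two printed numbers agree up to rounding, and the divergence drops to zero when `p_assumed` equals `q_true`.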

As an aside, this sort of probabilistic strategy should not be confused with the mixed strategies of game theory. In our case, a mixed strategy would be an a priori choice ${aK+(1-a)S}$ where ${K}$ is the always-keep strategy, ${S}$ is the always-switch strategy, and ${0\le a\le 1}$ is the probability of employing the always-keep strategy. A player would flip a biased coin with Bernoulli probability ${a}$ and choose one of the two strategies based on it. That has nothing to do with the measure-theory approach we’re taking here. In particular, a mixed strategy makes no use of the observed value ${z}$ or its relation to the randomly sampled value. Any mixed strategy gives even odds because the two underlying deterministic strategies both have even odds.

# Semidirect Products, Split Exact Sequences, and all that

One of the things I’ve butted heads with in studying Lie Groups is the semidirect product and its relationship to split exact sequences. It quickly became apparent that this was a pretty sizeable hole in my basic knowledge, so I decided to clarify this stuff once and for all.

— Normal Subgroups and Quotient Groups —

First, a brief refresher on Normal subgroups and Quotient groups. We are given a group ${G}$ and subgroup ${H\subseteq G}$.

• Left cosets are written ${gH}$ and right cosets are written ${Hg}$. Each is a set of elements in ${G}$. Not all left cosets are distinct, but any two are either equal or disjoint. Ditto for right cosets.
• The left (right) cosets form a partition of ${G}$, but they do not in general form a group. We can try to imbue them with a suitable product, but there are obstructions to the group axioms. For example ${g^{-1}H}$ is not a useful inverse since ${(gh)^{-1}= h^{-1}g^{-1}}$ lies in the right coset ${Hg^{-1}}$ rather than the left coset ${g^{-1}H}$, so neither left cosets nor right cosets multiply as desired. More generally, a product of an element of ${gH}$ and an element of ${g'H}$ need not lie in ${(gg')H}$, so the obvious product of cosets is not well defined.
• We define the Quotient Set ${G/H}$ to be the set of left cosets. As mentioned, it is not a group in general. There is an equivalent definition for right cosets, written ${H\setminus{}G}$, but it doesn’t appear often. In most cases we care about, the two are the same.
• It is easy to see that the condition which removes the obstruction is that ${gH=Hg}$ for all ${g}$. Equivalently, ${gHg^{-1}=H}$ for all ${g}$. If this holds, the cosets form a group. Often the stated condition is that the sets of left and right cosets are the same. But ${g\in gH,Hg}$ so this is the same exact condition.
• ${H}$ is a Normal Subgroup if it obeys the conditions which make the cosets into a group.
• Usually a Normal Subgroup is denoted ${N}$, and we write ${N\triangleleft G}$ (or ${N\trianglelefteq G}$).
• For a Normal subgroup ${N}$, the Quotient Set ${Q=G/N}$ has (by definition) the natural structure of a group. It is called the Quotient Group.
• We have two natural maps associated with a Normal Subgroup:
• ${N\xrightarrow{i} G}$ is an inclusion (i.e. injective), defined by ${h\rightarrow h}$ (where the righthand ${h}$ is viewed in ${G}$). This is a homomorphism defined for any subgroup, not just normal ones.
• ${G\xrightarrow{q} Q}$ is the quotient map (surjective), defined by ${g\rightarrow gN}$ (with the righthand viewed as a coset, i.e. an element of ${G/N}$). This map is defined for any subgroup, with ${Q}$ the Quotient Set. For Normal Subgroups, it is a group homomorphism.
• We know there is a copy of ${N}$ in ${G}$. Though ${Q}$ is derived from ${G}$ and ${N}$, and possesses no new info, there may or may not be a copy of it in ${G}$. Two natural questions are when that is the case, and how ${G}$, ${N}$, and ${Q}$ are related in general.
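
The coset conditions in this list can be checked by brute force on a small example. Below is a minimal Python sketch using ${S_3}$, with permutations stored as tuples. The choice of ${S_3}$ and all function names are mine, purely for illustration.

```python
from itertools import permutations


def compose(p, q):
    """Return the permutation p∘q (apply q first, then p); permutations are tuples."""
    return tuple(p[q[i]] for i in range(len(q)))


G = list(permutations(range(3)))  # the symmetric group S_3


def left_coset(g, H):
    return frozenset(compose(g, h) for h in H)


def right_coset(g, H):
    return frozenset(compose(h, g) for h in H)


def is_normal(H):
    """H is normal iff gH == Hg for every g in G."""
    return all(left_coset(g, H) == right_coset(g, H) for g in G)


A3 = [(0, 1, 2), (1, 2, 0), (2, 0, 1)]  # the rotations: a normal subgroup
H2 = [(0, 1, 2), (1, 0, 2)]             # generated by one transposition: not normal

print(is_normal(A3), is_normal(H2))
print(len({left_coset(g, A3) for g in G}))  # number of distinct cosets of A3
```

Here the rotations form a Normal Subgroup with two cosets (so the quotient has order 2), while the subgroup generated by a single transposition fails the ${gH=Hg}$ test.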

Let’s also recall the First Isomorphism Theorem for groups. Given any two groups ${G}$ and ${H}$ and a homomorphism ${\phi:G\rightarrow H}$, the following hold:

• ${\ker \phi}$ is a Normal Subgroup of ${G}$
• ${\mathop{\text{im}} \phi}$ is a subgroup of ${H}$
• ${\mathop{\text{im}} \phi}$ is isomorphic to the Quotient Group ${G/\ker\phi}$.
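
To see the theorem at work, here is a small Python sketch on ${Z_{12}}$. The homomorphism (multiplication by 3) is an arbitrary illustrative choice of mine, not something from the discussion above.

```python
n = 12
G = range(n)  # Z_12 under addition mod 12


def phi(x):
    """A homomorphism Z_12 -> Z_12 (multiplication by 3); an illustrative choice."""
    return (3 * x) % n


kernel = sorted(x for x in G if phi(x) == 0)
image = sorted({phi(x) for x in G})

# Group the elements of G by their image: each class is a coset of the kernel.
cosets = {}
for x in G:
    cosets.setdefault(phi(x), []).append(x)

print(kernel)
print(image)
print(cosets)
```

Each coset of the kernel maps to exactly one element of the image, which is the bijection underlying ${\mathop{\text{im}} \phi\approx G/\ker\phi}$. In particular the image has ${12/3=4}$ elements.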

Again, we have to ask: since ${\ker\phi}$ is a Normal Subgroup of ${G}$, and ${\mathop{\text{im}}\phi}$ is isomorphic to the Quotient Group ${G/\ker\phi}$ which “sort of” may have an image in ${G}$, is it meaningful to write something like ${G\stackrel{?}{=} \ker\phi \oplus \mathop{\text{im}} \phi}$ (playing fast and loose with notation)? The answer is no, it’s more complicated.

— Exact Sequences —

Next, a very brief review of exact sequences. We’ll use ${1}$ for the trivial group. The usual convention is to use ${1}$ for general groups and ${0}$ for Abelian groups. An exact sequence is a sequence of homomorphisms between groups ${\cdots \rightarrow G_n \xrightarrow{f_n} G_{n-1}\xrightarrow{f_{n-1}} \cdots}$ where ${\mathop{\text{im}} f_n= \ker f_{n-1}}$ for every pair. Here are some basic properties:

• ${1\rightarrow A \xrightarrow{f} B\cdots}$ means that ${f}$ is injective.
• ${\cdots A\xrightarrow{f} B\rightarrow 1}$ means that ${f}$ is surjective.
• ${1\rightarrow A\rightarrow B\rightarrow 1}$ means ${A\approx B}$ (the middle map is an isomorphism).
• Short Exact Sequence (SES): This is defined as an exact sequence of the form: ${1\rightarrow A\xrightarrow{f} B\xrightarrow{g} C\rightarrow 1}$.
• For an SES, ${f}$ is injective, ${g}$ is surjective, and ${C\approx B/\mathop{\text{im}} f}$.
• SES’s arise all the time when dealing with groups, and the critical question is whether they “split”.

We’re now ready to define Split SES’s.

• Right Split SES: There exists a homomorphism ${h:C\rightarrow B}$ such that ${g\circ h=Id_C}$. Basically, we can move to ${B}$ and back from ${C}$ without losing info — which means ${C}$ is in some sense a subgroup of ${B}$.
• Left Split SES: There exists a homomorphism ${h:B\rightarrow A}$ such that ${h\circ f=Id_A}$. Basically, we can move to ${B}$ and back from ${A}$ without losing info — which means ${A}$ is in some sense a subgroup of ${B}$.
• These two conditions are not in general equivalent, or even equally restrictive. The Left Split condition is far more constraining than the Right Split one in general. The direction of the homomorphisms in the SES introduces an asymmetry. [My note: it seems likely that the two are dual in some sense.]
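
For small finite groups, both splitting conditions can be tested by brute force: enumerate every homomorphism in the required direction and look for a one-sided inverse. The Python sketch below does this for two SES's of my own choosing, ${1\rightarrow Z_2\rightarrow Z_4\rightarrow Z_2\rightarrow 1}$ and ${1\rightarrow Z_2\rightarrow Z_2\oplus Z_2\rightarrow Z_2\rightarrow 1}$. All function names are hypothetical.

```python
from itertools import product


def homomorphisms(src, src_op, dst, dst_op):
    """Yield every homomorphism src -> dst as a dict, by brute force over all maps."""
    src = list(src)
    for values in product(list(dst), repeat=len(src)):
        h = dict(zip(src, values))
        if all(h[src_op(a, b)] == dst_op(h[a], h[b]) for a in src for b in src):
            yield h


def right_splits(B, B_op, C, C_op, g):
    """True if some homomorphism h: C -> B satisfies g∘h = Id_C."""
    return any(all(g(h[c]) == c for c in C)
               for h in homomorphisms(C, C_op, B, B_op))


def left_splits(A, A_op, B, B_op, f):
    """True if some homomorphism h: B -> A satisfies h∘f = Id_A."""
    return any(all(h[f(a)] == a for a in A)
               for h in homomorphisms(B, B_op, A, A_op))


Z2 = [0, 1]
add2 = lambda a, b: (a + b) % 2

# 1 -> Z_2 -> Z_4 -> Z_2 -> 1 with f(a) = 2a and g(b) = b mod 2.
Z4 = [0, 1, 2, 3]
add4 = lambda a, b: (a + b) % 4
z4_right = right_splits(Z4, add4, Z2, add2, lambda b: b % 2)
z4_left = left_splits(Z2, add2, Z4, add4, lambda a: 2 * a)

# 1 -> Z_2 -> Z_2 x Z_2 -> Z_2 -> 1 with f(a) = (a, 0) and g(a, c) = c.
V = list(product(Z2, Z2))
addV = lambda x, y: ((x[0] + y[0]) % 2, (x[1] + y[1]) % 2)
v_right = right_splits(V, addV, Z2, add2, lambda x: x[1])
v_left = left_splits(Z2, add2, V, addV, lambda a: (a, 0))

print(z4_right, z4_left, v_right, v_left)
```

The ${Z_4}$ sequence splits on neither side, while the ${Z_2\oplus Z_2}$ sequence splits on both. This only scales to tiny groups, since the number of candidate maps grows exponentially.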

— External vs Internal View —

We’re going to describe 3 types of group operations: the direct product, semi-direct product, and group extension. Each has a particular relationship to Normality and SES’s. There are two equivalent ways to approach this, depending whether we prefer to define a binary operation between two distinct groups or to consider the relationship amongst subgroups of a given group.

• External view: We define a binary operation on two distinct, unrelated groups. Two groups go in, and another group comes out.
• Internal view: We define a relationship between a group and various groups derived from it (ex. Normal or Quotient).
• These approaches are equivalent. The Internal view describes the relationship amongst the two groups involved in the External view and their issue. Conversely, the derived groups in the Internal view may be recombined via the External view operation.

We must be a little careful with notation and terminology. When we use the symbol ${HK}$, it can mean one of two things.

• Case 1: ${H}$ and ${K}$ are distinct groups. ${HK}$ is just the set of all pairs of elements ${(h,k)}$. I.e. it is the direct product set (but not group).
• Case 2: ${H}$ and ${K}$ are subgroups of a common group ${G}$ (or have some natural implicit isomorphisms to such subgroups). In this case, ${HK}$ is the set of all elements in ${G}$ obtained as a product of an element of ${H}$ and an element of ${K}$ under the group multiplication.
• Note that we may prefer cases where two subgroups cover ${G}$, but there are plenty of other possibilities. For example, consider ${Z_{30}}$ (the integers mod 30). This has several obvious subgroups (${Z_2}$, ${Z_3}$, ${Z_5}$, ${Z_6}$, ${Z_{10}}$, ${Z_{15}}$). ${Z_2}$ and ${Z_3}$ only intersect on ${0}$ (the additive identity). However, the two do not cover (or even generate) the group! Similarly, ${Z_2}$ and ${Z_{10}}$ do not cover the group (or even generate it) but intersect on a nontrivial subset!
• Going the other way, we’ll say that ${G=HK}$ if ${H}$ and ${K}$ are subgroups and every element ${g}$ can be written as ${hk}$ for some ${h\in H}$ and ${k\in K}$. Note that ${H}$ and ${K}$ need not be disjoint (or even cover ${G}$ set-wise).

Another potentially confusing point should be touched on. When we speak of “disjoint” subgroups ${H}$ and ${K}$ we mean that ${H\cap K=\{e\}}$, NOT that it is the null set. I.e., ${H\cap K= 1}$, the trivial group.
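
The ${Z_{30}}$ example above can be verified directly. This is a minimal Python sketch in additive notation (so “${HK}$” becomes the set of sums ${h+k}$). The helper names are hypothetical.

```python
n = 30


def subgroup(order):
    """The unique subgroup of Z_30 with the given order (order must divide 30)."""
    step = n // order
    return {(step * k) % n for k in range(order)}


def sum_set(H, K):
    """The set 'HK' in additive notation: all h + k with h in H and k in K."""
    return {(h + k) % n for h in H for k in K}


Z2, Z3, Z5, Z10 = subgroup(2), subgroup(3), subgroup(5), subgroup(10)

# Trivial intersection, but the sums only reach 6 of the 30 elements.
print(sorted(Z2 & Z3), sorted(sum_set(Z2, Z3)))
# Nontrivial intersection, and the sums only reach 10 elements.
print(sorted(Z2 & Z10), len(sum_set(Z2, Z10)))
# All three prime-order subgroups together do reach every element.
print(len(sum_set(Z2, sum_set(Z3, Z5))))
```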

— Semidirect Product —

The semidirect product may seem a bit arbitrary at first but, as we will see, it is a natural part of a progression which begins with the Direct Product. Here are the two ways of defining it.

• External view (aka Outer Semidirect Product): Given two groups ${H}$ and ${K}$ and a homomorphism ${\phi:K\rightarrow Aut(H)}$, we define a new group ${H\rtimes K}$. We’ll denote by ${\phi_k(h)}$ the effect of the automorphism ${\phi(k)}$ on ${h}$ (and thus an element of ${H}$). Set-wise, ${H\rtimes K}$ is just ${H\times K}$ (i.e. all pairs ${(h,k)}$). The identity is ${(e,e)}$. Multiplication on ${H\rtimes K}$ is defined as ${(h,k)(h',k')= (h\phi_k(h'),kk')}$. The inverse is ${(h,k)^{-1}= (\phi_{k^{-1}}(h^{-1}),k^{-1})}$.
• Internal view (aka Inner Semidirect Product): Given a group ${G}$ and two disjoint subgroups ${N}$ and ${K}$, such that ${G=NK}$ and ${N}$ is a Normal Subgroup, ${G}$ is called the Semidirect product ${N\rtimes K}$. The normality of ${N}$ (together with ${G=NK}$ and disjointness) constrains ${K}$ to be isomorphic to the Quotient Group ${G/N}$.

• There are (potentially) many Semidirect products of two given groups, obtained via different choices of ${\phi}$. The notation is deceptive because it hides our choice of ${\phi}$. Given any ${H,K,\phi}$ there exists a Semidirect product ${H\rtimes K}$. The various Semidirect products may be isomorphic to one another, but in general need not be. I.e., a given ${H}$ and ${K}$ may have multiple distinct semidirect products. This actually happens. Wikipedia mentions that there are 4 non-isomorphic semidirect products of ${C_8}$ and ${C_2}$ (the former being the Normal Subgroup in each case). One is a Direct Product, and the other 3 are not.
• It also is possible for a given group ${G}$ to arise from several distinct Semidirect products (of different pairs of groups). Again from Wikipedia, there is a group of order 24 which can be written as 4 distinct semidirect products of groups.
• Yet another oddity is that a seemingly nontrivial ${H\rtimes K}$ can be isomorphic to ${H\oplus K}$.
• If ${\phi}$ is trivial (i.e. every ${k}$ maps to the identity map on ${H}$), then ${G=H\oplus K}$.
• To go from the External view to the Internal one, we note that, by construction, ${H}$ is a Normal Subgroup of ${G=H\rtimes K}$ and ${K}$ is the Quotient Group ${G/H}$. To be precise, the Normal Subgroup is ${(H,e)}$, which is isomorphic to ${H}$, and the Quotient Group ${G/(H,e)}$ is isomorphic to ${K}$.
• To go from the Internal view to the External one, we choose ${\phi_k(h)= khk^{-1}}$ as our function. I.e., ${\phi}$ is just conjugation by the relevant element.
• It may seem like there is an imbalance here. For a specific choice of Normal Subgroup ${N}$, the External view offers complete freedom of ${\phi}$, while the Internal view has a fixed ${\phi}$. Surely the latter is a special case of the former. The fallacy in this is that we must consider the pair ${(G,N)}$. We very well could have non-isomorphic ${G,G'}$ with Normal Subgroups ${N,N'}$ where ${N\approx N'}$. I.e. they are the same Normal Subgroup, but with different parent groups. We then would have different ${\phi}$’s via our Internal view procedure. The correspondence is between ${(H,K,\phi)}$ and ${(G,N,K)}$ choices. Put differently, the freedom in ${\phi}$ loosely corresponds to a freedom in ${G}$.
• Note that, given ${G}$ and a Normal Subgroup ${N}$ (with the automatic Quotient Group ${G/N}$), we do NOT necessarily have a Semidirect product relationship. The condition of the Semidirect product is stricter than this. As we will see it requires not just isomorphism, but a specific isomorphism, between ${K}$ and ${G/N}$. Equivalently, it requires a Right-Split SES (as we will discuss).
• The multiplication defined in the External view may seem very strange and unintuitive. In essence, here is what’s happening. For a direct product, ${H}$ and ${K}$ are independent of one another. Each half of the pair acts only on its own elements. For a semidirect product, the non-normal half ${K}$ can twist the normal half ${H}$. Each element of ${K}$ can alter ${H}$ in some prescribed fashion, embodied in ${\phi(k)}$. So ${K}$ is unaffected by ${H}$ but ${H}$ can be twisted by ${K}$.
• It is interesting to compare the basic idea to that of a Fiber bundle. There, the fiber can twist (via a group of homeomorphisms) as we move around the base space. Here, the normal subgroup can twist as we move around the non-normal part. Each generalizes a direct product and measures our need to depart from it.
• The semidirect product of two groups is Abelian iff it’s just a direct product of abelian groups.
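
The External view multiplication is easiest to absorb by building one small example. The Python sketch below constructs ${Z_3\rtimes Z_2}$ with ${Z_2}$ acting on ${Z_3}$ by inversion (an illustrative choice of ${\phi}$ on my part; the result is a copy of ${S_3}$). Both factors are written additively, and the function names are hypothetical.

```python
from itertools import product

H = [0, 1, 2]  # Z_3, written additively (the normal factor)
K = [0, 1]     # Z_2, written additively


def phi(k, h):
    """phi_k(h): the nontrivial element of Z_2 acts on Z_3 by inversion h -> -h."""
    return h % 3 if k == 0 else (-h) % 3


def mul(x, y):
    """(h, k)(h', k') = (h + phi_k(h'), k + k'), in additive notation."""
    (h, k), (h2, k2) = x, y
    return ((h + phi(k, h2)) % 3, (k + k2) % 2)


def inv(x):
    """(h, k)^{-1} = (phi_{k^{-1}}(h^{-1}), k^{-1})."""
    h, k = x
    k_inv = (-k) % 2
    return (phi(k_inv, (-h) % 3), k_inv)


G = list(product(H, K))
identity = (0, 0)

associative = all(mul(mul(a, b), c) == mul(a, mul(b, c)) for a in G for b in G for c in G)
inverses_ok = all(mul(x, inv(x)) == identity == mul(inv(x), x) for x in G)
abelian = all(mul(a, b) == mul(b, a) for a in G for b in G)

# Conjugation keeps (H, e) inside itself, but moves (e, K) outside itself.
normal_H = all(mul(mul(g, (h, 0)), inv(g))[1] == 0 for g in G for h in H)
normal_K = all(mul(mul(g, (0, k)), inv(g))[0] == 0 for g in G for k in K)

print(associative, inverses_ok, abelian, normal_H, normal_K)
```

The group axioms hold, the product is nonabelian, the copy of ${H}$ is Normal, and the copy of ${K}$ is not. Replacing `phi` with the trivial action (always returning `h % 3`) would give the Direct Product instead.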

— Group Extensions —

As with Semidirect products, there are 2 ways to view these. To make matters confusing, the notation speaks to an Internal view, while the term “extension” speaks to an External view.

• External view: Given groups ${A}$ and ${C}$, we say that ${B}$ is an extension of ${C}$ by ${A}$ if there is a SES ${1\rightarrow A\rightarrow B\rightarrow C\rightarrow 1}$.
• Internal view: Given a group ${G}$ and Normal Subgroup ${N\triangleleft G}$, we say that ${G}$ is an extension of ${Q}$ by ${N}$, where ${Q=G/N}$ is the Quotient Group.
• Note that the two are equivalent. If ${B}$ is an extension of ${C}$ by ${A}$, then ${A}$ is Normal in ${B}$ and ${C}$ is isomorphic to the Quotient Group ${B/A}$.
• Put simply, the most general form of the Group, Normal Subgroup, induced Quotient Group trio is the Group Extension.

— Direct Products, Semidirect Products, and Group Extensions —

In the External view, we’ve mentioned three means of getting a group ${B}$ from two groups ${A}$ and ${C}$:

• Direct Product: ${B=A\oplus C}$. This is unique.
• Semidirect Product: ${B=A\rtimes C}$. There may be multiple of these, corresponding to different ${\phi}$’s.
• Group Extension: A group ${B}$ for which there are 2 homomorphisms forming a SES ${1\rightarrow A\rightarrow B\rightarrow C\rightarrow 1}$. There may be many of these, corresponding to different choices of the two homomorphisms.

Equivalently, we have several ways of describing the relationship between two subgroups ${H,K\subseteq G}$ which are disjoint (i.e. ${H\cap K=\{e\}}$).

• Direct Product: ${G=H\oplus K}$ requires that both be Normal Subgroups.
• Semidirect Product: ${G=H\rtimes K}$ requires that ${H}$ be normal (in which case, ${Q=G/H}$, and ${\phi}$ is determined by it). For a given ${H}$ there may be multiple, corresponding to different ${G}$’s.
• Group Extension: Both ${H}$ and ${K}$ sit in ${G}$ to some extent. ${H}$ must be Normal.

Note that not every possible relationship amongst groups is captured by these. For example, we could have two non-normal subgroups or two homomorphisms which don’t form an SES, or no relationship at all.

An excellent hierarchy of conditions was provided by Arturo Magidin in answer to someone’s question on Math StackExchange. I roughly replicate it here. Unlike him, I’ll be sloppy and not distinguish between subgroups and groups isomorphic to subgroups.

• Direct Product (${G=H\oplus K}$): ${H,K}$ both Normal Subgroups. ${H,K}$ disjoint. ${G=HK}$
• Semidirect Products (${G=H\rtimes K}$): ${H}$ Normal Subgroup, ${K}$ Subgroup. ${H,K}$ disjoint. ${G=HK}$. I.e., we lose Normality of ${K}$.
• Group Extension (${G}$ is an extension of ${K}$ by ${H}$): ${H}$ Normal Subgroup, ${G/H\approx K}$. I.e. ${K}$ remains the Quotient Group (as before), but the Quotient Group may no longer be a subgroup of ${G}$ at all!

Now is a good time to mention the relationship between the various SES Splitting conditions:

• For all groups: Left Split is equivalent to ${B=A\oplus C}$, and they imply Right Split. (LS=DP) => RS always.
• For abelian groups, the converse holds and Right split implies Left Split and Direct Sum. I.e. the conditions are equivalent. LS=DP=RS for Abelian.
• For nonabelian groups: Right Split implies ${B=A\rtimes C}$ (with ${\phi}$ depending on the SES map). We’ll discuss this shortly.

Back to the hierarchy, now from a SES standpoint:

• Most general case: There is no SES at all. Given groups ${A,B,C}$, there may be no nontrivial homomorphisms between them. If there are homomorphisms, there may be none which form an SES. Consider a general pair of homomorphisms ${f:A\rightarrow B}$ and ${g:B\rightarrow C}$, with no assumptions. We may turn to the first isomorphism theorem for help, but that does us no good. The first isomorphism theorem says that ${\ker f \triangleleft A}$ and ${\mathop{\text{im}} f\approx A/\ker f}$, and ${\ker g \triangleleft B}$ and ${\mathop{\text{im}} g\approx B/\ker g}$. This places no constraints on ${A}$ or ${C}$.
• Group Extension: Any SES defines a group extension. They are the same thing.
• Semidirect Product: Any SES which right-splits corresponds to a Semidirect Product (with the right-split map determining ${\phi}$).
• Direct Product: Any SES which left-splits (and thus right-splits too) corresponds to a direct product.

So, when we see the standard SES: ${1\rightarrow N\rightarrow G\rightarrow G/N\rightarrow 1}$, this is a group extension. Only if it right splits can we write ${G= N\rtimes G/N}$, and only if it left splits can we write ${G= N\oplus G/N}$.
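
To make the hierarchy concrete, here are three standard small examples, one per level. These are my choice of illustrations, offered as a sketch and not as part of the argument above.

• ${1\rightarrow Z_2\rightarrow Z_4\rightarrow Z_2\rightarrow 1}$ is a Group Extension which does not split. The only element of order 2 in ${Z_4}$ lies in the image of the first map, so no homomorphism ${Z_2\rightarrow Z_4}$ can be a right inverse to the quotient map. Thus ${Z_4}$ is neither a Semidirect nor a Direct Product of ${Z_2}$ with ${Z_2}$.
• ${1\rightarrow Z_3\rightarrow S_3\rightarrow Z_2\rightarrow 1}$ right-splits (send the generator of ${Z_2}$ to any transposition), so ${S_3= Z_3\rtimes Z_2}$. It cannot left-split, since that would force ${S_3\approx Z_3\oplus Z_2}$, which is Abelian.
• ${1\rightarrow Z_2\rightarrow Z_2\oplus Z_2\rightarrow Z_2\rightarrow 1}$ (inclusion of the first factor, then projection onto the second) left-splits via projection onto the first factor, so it corresponds to a Direct Product.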

— Some Notes —

• Group Extensions are said to be equivalent if their ${B}$’s are isomorphic and there exists an isomorphism between them which makes a diamond diagram commute. It is perfectly possible for the ${B}$’s to be isomorphic but for two SES’s not to be equivalent extensions.
• Subtlety referred to above. A quotient group need not be isomorphic to a subgroup of ${G}$. It only is defined when ${N}$ is normal, and there automatically is a surjective homomorphism ${G\rightarrow Q}$. But we don’t have an injective homomorphism ${Q\rightarrow G}$, which is what would be needed for it to be isomorphic to a subgroup of ${G}$. This is precisely what the right-split furnishes. In that case, it is indeed a subgroup of ${G}$. The semidirect product may be thought of as the statement that ${Q}$ is a subgroup of ${G}$.
• In the definition of right split and left split, the crucial aspect of the “inverse” maps is that they be homomorphisms. A simple injective (for right-split, or surjective for left-split) map is not enough!
• It is sometimes said that the concept of subgroup is dual to the concept of quotient group. This is intuitive in the following sense. A subgroup can be thought of as an injective homomorphism. By the SES for normal/quotient groups, we can think of a quotient group as a surjective homomorphism. Since injections and surjections are categorically dual, it makes sense to think of quotient groups and subgroups as similarly dual. Whether the more useful duality is subgroup ${\leftrightarrow}$ quotient group or normal subgroup ${\leftrightarrow}$ quotient group is unclear to me.

# Differential Entropy

This originally appeared on my tech blog. It’s primarily notes on the subtleties of differential entropy, but also contains a review of discrete entropy, various entropy-related information quantities such as mutual information, and a listing of various axiomatic formulations.