# The Truth about Stock Prices: 13 Myths

No-fee trading has invited a huge influx of people new to trading. In this article, I discuss the basics of “price formation,” the mechanism by which stock prices are determined.

Like many of us, for much of my life I assumed that a stock had a well-defined “price” at any given point in time. You could buy or sell at that price, and the price would move based on this activity. If it went up you made money, if it went down you lost money. Trading was easy: you just picked the stocks which would go up.

Unfortunately, a youthful indiscretion landed me doing five years at the Massachusetts Institute of Technology. When the doors finally slammed shut behind me, I emerged with little more than a bus ticket and some physics-department issued clothes. Women crossed themselves, and men looked away. Nobody reputable would hire a man with such a checkered past, and the PhD tats didn’t help. So I ended up with the only sort of people interested in such hard cases: Wall Street.

After a couple of years, I caught the eye of a particularly unsavory boss, and he recruited me into a crew doing stat arb at a place called Morgan Stanley. It took me five years to find a way out, and even then the way was fraught with peril. I tried to get out, but they kept pulling me back in. I was in and out of corporations for the next few years, and even did some contract work for a couple of big hedge funds. Only in the turf wars of 2008 did I manage to cut ties for good. The big boys were so busy bankrupting one another, who’d notice one missing guy? The scars are still there, and I always keep an eye on the street. Who knows when a van full of Harvard MBAs will come for me.

On the plus side, I did learn a bit about market microstructure. As it happens, my erstwhile view of prices was naive in many ways.

Rather than a detailed exposition on market microstructure (which varies from exchange to exchange, but has certain basic principles), I will go through a number of possible misconceptions. Hopefully, this will be of some small help to new traders who wish to better understand the dynamics of the stock market. At the very least, it will make you sound smart at cocktail parties.

Because we live in America, and everybody sues everyone about everything, I’ll state the obvious. Before you do anything, make sure you know what you are doing. If you read it here, that doesn’t mean it’s right or current. Yes, I worked in high frequency statistical arbitrage for many years. However, my specific knowledge may be dated. You should confirm anything before relying heavily on it. In particular, I am no tax expert. Be sure to consult an accountant, a lawyer, a doctor, a rabbi, and a plumber before attempting anything significant. And if you do, please send me their info. It’s really hard to find a good accountant, lawyer, doctor, rabbi, or plumber.

Seriously, don’t take anything I say (or anyone else says) as gospel. I’ve tried to be as accurate as possible, but that doesn’t mean there aren’t technical errors. As always, the onus is on you to take care of your own money. As someone pointed out when I started, back when the traders still seemed like superstars: they weren’t paid the big bucks not to make mistakes, but to catch those mistakes before they became problems. My advice, and the best I can give, is that you inform yourself, do research, check, recheck, and recheck again before committing to a trade. In my personal trading I’ve never missed out by being slow and cautious, but I have gotten hammered by being hasty.

Now to the possible misconceptions. I’ll call them “myths” because that’s what popular websites do, so obviously it’s the right thing to do, and I prefer to do the right thing because the wrong thing is wrong.

In what follows “STCG” refers to “Short Term Capital Gain” and “LTCG” refers to “Long Term Capital Gain”. “STCL” and “LTCL” refer to the corresponding losses (i.e. negative gains).

Myth 1: There is a “price” for a stock at any given point in time. When a stock is traded during market hours, there is no single price. There is a bid (the highest priced buy-order) and an ask (the lowest priced sell-order). Often, what people call “the price” is the last trade price. However, sometimes it is the midpoint (bid+ask)/2 or (bid*bidsize+ask*asksize)/(bidsize+asksize), and sometimes more complicated limit-book centroids are used.
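
These conventions are easy to state in code. Here is a minimal Python sketch of the midpoint and size-weighted midpoint just mentioned; the quote values are made up for illustration.

```python
# Common "price" summaries computed from a quote snapshot.
# The bid/ask prices and sizes below are hypothetical, not real market data.

def midpoint(bid: float, ask: float) -> float:
    """Simple midpoint of the best bid and best ask."""
    return (bid + ask) / 2

def weighted_midpoint(bid: float, bid_size: int, ask: float, ask_size: int) -> float:
    """Size-weighted midpoint: (bid*bidsize + ask*asksize) / (bidsize + asksize)."""
    return (bid * bid_size + ask * ask_size) / (bid_size + ask_size)

bid, ask = 99.98, 100.02
print(round(midpoint(bid, ask), 4))                     # 100.0
print(round(weighted_midpoint(bid, 300, ask, 100), 4))  # 99.99: pulled toward the bigger size
```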

Myth 2: I can put a limit order for any price I want. Stocks (and options) trade at defined ticks. A tick is the spacing between allowed prices, and may itself vary with price. For example, the tick-size in stock ZZZ could be 0.01 for prices below 1.00 and 0.05 above that. Historically, ticks were fractions like 1/8 or 1/16 rather than multiples of 0.01. The tick-size rules are per-exchange (or per-security-type on a given exchange) rather than per-stock. In our example, any stock’s price would have allowable values of …, 0.98, 0.99, 1.00, 1.05, 1.10, …
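
The tick constraint amounts to snapping prices to a grid. Here is a small sketch of the hypothetical ZZZ rule above (0.01 ticks below 1.00, 0.05 at or above); real tick schedules differ by venue.

```python
# Snap a desired price to the nearest allowed tick, with a price-dependent
# tick size. The 0.01 / 0.05 schedule is the made-up ZZZ example from the text.

def round_to_tick(price: float) -> float:
    tick = 0.01 if price < 1.00 else 0.05
    return round(round(price / tick) * tick, 2)

print(round_to_tick(0.987))  # 0.99
print(round_to_tick(1.02))   # 1.0
print(round_to_tick(1.13))   # 1.15
```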

Myth 3: Limit Orders are better than market orders. Limit orders offer greater control over execution price, but they may not be filled or may result in adverse selection. Suppose ZZZ is trading with a bid of 100 and an ask of 101, with a tick size of 0.50. Alice places a limit order at 100.5 for 100 shares. It is quite possible that it will be filled right away, giving her at least 0.50 of execution improvement (per share) over a market order. But what if it is not filled? If the stock goes up, Alice has incurred what is called “opportunity cost.” Rather than 0.50 in savings she now must pay a higher price or forego the purchase. Why not just leave the limit order out there? Surely it will get filled as the stock bounces around. In fact, why not put a limit order at 98? If it gets executed, it’s free money. The problem is adverse selection. The limit order most likely would get filled when the market was dropping. Sure, it could catch a temporary dip, but it also could be caught during a major decline. Statistically, the order will be filled at 98 precisely when Alice does not want it to be: either she could have bought at 97 or 96, or she is now stuck with a falling stock. In the presence of an alpha (a statistical signal informing a predicted return) a noncompetitive bid may at times be appropriate, but in general there is no “free money.”
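
The "no free money" point can be checked with a toy simulation: in a drift-free random walk, a resting limit at 98 fills only after the market has already fallen, and the average mark after the fill is no better than the fill price. Everything here (step size, horizon, trial count) is invented for illustration.

```python
# Monte Carlo sketch: a deep buy limit order in a symmetric random walk.
# Conditional on being filled, the expected subsequent price equals the fill
# price (a martingale), so the "free money" at 98 averages out to nothing.
import random

random.seed(7)
START, LIMIT, STEPS, TRIALS = 100.0, 98.0, 200, 10_000
pnl_when_filled = []

for _ in range(TRIALS):
    price, fill_price = START, None
    for _ in range(STEPS):
        price += random.choice((-0.25, 0.25))        # symmetric tick-sized moves
        if fill_price is None and price <= LIMIT:
            fill_price = price                       # resting buy order fills here
    if fill_price is not None:
        pnl_when_filled.append(price - fill_price)   # mark-to-market at day end

avg = sum(pnl_when_filled) / len(pnl_when_filled)
print(f"fills: {len(pnl_when_filled)} of {TRIALS}, avg post-fill P&L: {avg:.3f}")
```

With a downward drift the filled orders do systematically worse, which is the adverse-selection half of the argument.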

Myth 6: The price moves because when someone buys people get optimistic and when someone sells people get pessimistic. That certainly can happen. However, there is a more basic reason the price moves. When you buy at the ask, some or all of the sell-orders at that ask-level are filled. There may be hidden size which immediately appears or someone may jump in (or adjust a higher sell-order down), but generally the quote changes. Your trade also is registered as the last trade.
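
The mechanics here can be sketched with a toy ask-side book (levels and sizes invented): a market buy eats the resting size at the best ask, and when a level is exhausted the quote moves to the next level.

```python
# Toy sketch of why the quote moves when someone buys: a market buy consumes
# the resting size at the best ask, and the next level becomes the new ask.
# The book levels here are hypothetical.
from collections import OrderedDict

asks = OrderedDict([(101.0, 300), (101.5, 500)])  # best ask first: price -> shares

def market_buy(asks, shares):
    """Fill a market buy against the ask side; return the last trade price."""
    last_trade = None
    while shares > 0 and asks:
        price, size = next(iter(asks.items()))
        fill = min(shares, size)
        last_trade = price
        shares -= fill
        if fill == size:
            del asks[price]            # level exhausted: the quote moves up
        else:
            asks[price] = size - fill
    return last_trade

last = market_buy(asks, 400)           # takes all 300 @ 101.0, then 100 @ 101.5
print(last, next(iter(asks.items())))  # last trade 101.5, new best ask 101.5 x 400
```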

Myth 8: Shorting is just like buying negative shares, and the only difference is the sign. In many cases, it effectively behaves like this for the trader. However, the actual process is more complicated. “Naked shorts” generally are not allowed, though they can arise in anomalous circumstances. When you sell short, you are not simply assigned a negative number of shares and your PnL computed accordingly. You borrow specific shares of stock from a specific person. The matching process is called a “locate”, and is conducted at the broker-level if possible or at the exchange-level if the broker has no available candidates. There is an exception for market-makers and for brokers when a stock is deemed “easy to borrow”, meaning it is highly liquid and there will be no problem covering the short if necessary. Brokers maintain dynamic “easy to borrow” and “hard to borrow” stock lists for this purpose. There are situations in which a short may not behave as expected. Suppose Bob sells short 100 shares of ZZZ stock, and the broker locates it with Alice. Alice owns 100 shares, and the locate assigns these to Bob. If Alice decides to sell them at some point, Bob needs to be assigned new shares [note that this has no effect on the person Bob sold short to, just Bob]. If these cannot be located, he must exit his position. The short sale is contingent on the continuing existence of located shares. Moreover, if the market goes up a lot Bob may have to put up additional capital for the cost of covering at the higher price. In principle, a short can result in an unlimited loss. In practice, Bob would be closed out by margin call before then.

Myth 9: Shares are fungible. When you sell them, it doesn’t matter which ones you sell. They are fungible from the standpoint of stock trading (aside from the short-selling locates just discussed), but not from a tax standpoint. Most brokers allow you to choose which specific shares you are selling. Suppose Bob bought 100 shares of ZZZ at 50 three years ago and bought another 100 shares of ZZZ at 75 six months ago. ZZZ now is 100 and he decides to sell 100 shares. Selling the first 100 shares generates a LTCG of 5,000, whereas selling the second 100 shares generates a STCG of 2,500. The tax implications can be significant, and are discussed further below. The specifics of Bob’s situation will determine which sale is more advantageous (or less disadvantageous). Brokers generally default to FIFO accounting, meaning that the first shares bought are the first shares sold. Most brokers allow alternatives to be specified at the time of the trade, however. These may include LIFO (last shares bought are first shares sold) or direct specification of the shares themselves. Note that such accounting only applies within a given brokerage account, whereas the tax consequences are determined across all brokerage accounts.
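
Bob's example is easy to tabulate. The dates below are hypothetical, chosen so one lot is past the one-year boundary and the other is not (the boundary is simplified to 365 days here).

```python
# Specific-lot accounting for Bob's hypothetical position: each lot carries its
# own cost basis and purchase date, which determine the gain and its term.
from datetime import date

# (purchase date, shares, cost basis per share)
lots = [(date(2021, 6, 1), 100, 50.0), (date(2023, 12, 1), 100, 75.0)]
sale_price, sale_date = 100.0, date(2024, 6, 1)

results = []
for bought, shares, basis in lots:
    gain = shares * (sale_price - basis)
    # "more than one year" boundary, simplified to 365 days for the sketch
    term = "LTCG" if (sale_date - bought).days > 365 else "STCG"
    results.append((term, gain))
    print(f"lot of {bought}: {term} of {gain:,.0f}")
```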

Myth 11: Taxes are just nuisances, and the price is what really matters. Taxes can eat a lot of your profit, and should be a primary consideration. Tax planning involves choosing accounts to trade in (401K or other tax-deferred vs regular), realizing losses to offset gains, and choosing assets with low turnover. Note that some mutual funds can generate weird capital gains through their internal trading. In extreme cases, someone could pay significant tax on a losing position in one. Why are taxes so important to trading? The main reason is that there can be a 25% (or more) difference in tax rate between a LTCG and a STCG. STCGs often are taxed punitively, and at best treated like ordinary income. Here in MA, the state tax alone is 12% for STCGs vs 5% for LTCGs. Federally, STCGs are treated as ordinary income while long term gains have their own rate. STCGs currently are defined as gains on positions held for one year or less, while LTCGs come from positions held for more than one year. Note that it is the individual positions that matter. If Bob owns 200 shares of ZZZ, bought in two batches, then each batch has its own cost basis (the price he paid for it) and purchase date. Also note that most options positions expire in less than a year and would result in a STCG or STCL. A STCL first offsets STCGs, and a LTCL first offsets LTCGs; only a net loss in one category carries over against the other. Needless to say, STCLs are valuable (unpleasant since they’re losses, but valuable from a tax standpoint). They can be rolled to subsequent years under some circumstances, but may be automatically wasted against LTCGs if you are not careful. A good understanding of these details can save a lot of money. To illustrate the impact, suppose Alice has a (state+federal) 20% LTCG rate (marginal) and a 45% STCG rate (marginal). She makes 10,000 on a trade, not offset by any loss. If it is a LTCG, she pays 2,000 in taxes and keeps 8,000. If it is a STCG, she pays 4,500 and keeps 5,500. That’s an additional 2,500 out of her pocket.
Since the markets pay us to take risk, she must take more risk or tie up more capital to make the same 8,000 of after-tax profit. How much more risk or capital? Not just the missing 25%, because the extra profit will be taxed at the 45% rate as well. We solve 0.55*x = 0.8*10,000 to get 14,545. Alice must take (loosely speaking) 45% more risk or tie up 45% more capital to take home the same amount. Note that the appearance of 45% both here and as the tax rate is coincidental.
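
The arithmetic above, in a few lines (the rates are the assumed marginal rates from the example, not anyone's actual tax situation):

```python
# How large a short-term gain matches a 10,000 long-term gain after tax,
# at the hypothetical marginal rates used in the text.
ltcg_rate, stcg_rate, gain = 0.20, 0.45, 10_000

after_tax_lt = gain * (1 - ltcg_rate)            # 8,000 kept if long term
required_st = after_tax_lt / (1 - stcg_rate)     # pre-tax STCG with same take-home
print(round(after_tax_lt), round(required_st))   # 8000 14545
```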

Myth 12: Options act like leveraged stock. This is untrue for many reasons, but I’ll point out one specific issue. Options can be thought of as volatility bets. Yes, the Black Scholes formula depends on the stock price in a nonlinear manner, and yes the Black Scholes formula significantly underestimates tail risk. But for many purposes, it pays to think of options as predominantly volatility-based. Let’s return to our absurd but illustrative scenario involving Alice and Bob and their ridiculous behavior. As before, they trade ZZZ stock and are the only market participants. Alice sells to Bob until the price reaches 110, then decides she misses her 10 shares of stock. Bob too has an epiphany. He decides he hates ZZZ stock. They now switch roles, but Bob gets her strategy backward, so the price goes down with each share rather than up: he sells her a share at 110, then 109, then 108, down to 101. Now he’s out of shares and they both have another revelation. They return to their original roles, and up the price goes. The day’s trading involves ZZZ stock price see-sawing between 101 and 110 in this fashion. Neither makes a net profit, and the price ends where it started (well, 101 vs 100 but that’s not important here). Consider somebody trading the options market (we said Alice and Bob were the only stock traders, but there could be a thriving options market). At the start of the day and the end of the day, the price is pretty much at the same level. However, the price of both call and put options has risen dramatically. Options prices are driven by several things: the stock price, the strike price, the time to expiry, and the volatility. If the stock price rises dramatically, put options will go down, but not as much as the price change would seem to warrant. This is because volatility has increased. In our see-saw case, everything was constant (approximately) except the volatility.
The stock price is unchanged, but the option prices have changed dramatically.
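
The see-saw story can be made concrete with the Black-Scholes formula itself. The sketch below (European options, no dividends, hypothetical inputs) shows that with the spot pinned at the same level, raising volatility raises the price of both the call and the put.

```python
# Minimal Black-Scholes pricer illustrating options as volatility bets:
# same spot, same strike, same expiry -- doubling sigma raises both the
# call and the put. All inputs are made up.
from math import log, sqrt, exp, erf

def norm_cdf(x: float) -> float:
    return 0.5 * (1 + erf(x / sqrt(2)))

def bs_price(S, K, T, r, sigma, call=True):
    """Black-Scholes price of a European option (no dividends)."""
    d1 = (log(S / K) + (r + sigma ** 2 / 2) * T) / (sigma * sqrt(T))
    d2 = d1 - sigma * sqrt(T)
    if call:
        return S * norm_cdf(d1) - K * exp(-r * T) * norm_cdf(d2)
    return K * exp(-r * T) * norm_cdf(-d2) - S * norm_cdf(-d1)

S = K = 100.0
T, r = 0.25, 0.0
for sigma in (0.10, 0.20):
    c = bs_price(S, K, T, r, sigma, call=True)
    p = bs_price(S, K, T, r, sigma, call=False)
    print(f"sigma={sigma:.2f}: call={c:.2f} put={p:.2f}")
```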

Myth 13: There are 13 myths. If you spend your time puzzling over this rather than trading, you will end up with the same amount of money (on average and minus transaction costs). Which leaves the market to me, so I can run the stock up to infinity, then sell to the one unwary buyer who gets greedy and dips his toe in at the wrong time. All your base are belong to me.

# Two-Envelope Problems

Let’s visit a couple of fun and extremely counterintuitive problems which sit in the same family. The first appears to be a “paradox,” and illustrates a subtle fallacy. The second is an absolutely astonishing (and legitimate) algorithm for achieving better than 50-50 odds of picking the higher of two unknown envelopes. Plenty of articles have discussed who discovered what ad nauseam, so we’ll just dive into the problems.

— The Two Envelope Paradox: Optimizing Expected Return —

First, consider the following scenario. Suppose you are shown two identical envelopes, each containing some amount of money unknown to you. You are told that one contains double the money in the other (but not which is which or what the amounts are) and are instructed to choose one. The one you select is placed in front of you and its contents are revealed. You then are given a second choice: keep it or switch envelopes. You will receive the amount in the envelope you choose. Your goal is to maximize your expected payment.

Our intuition tells us that no information has been provided by opening the envelope. After all, we didn’t know the two values beforehand so learning one of them tells us nothing. The probability of picking the higher envelope should be ${1/2}$ regardless of whether we switch or not. But you weren’t asked to improve on the probability, just to maximize your expected payment. Consider the following 3 arguments:

• Let the amount in the envelope you initially chose be ${z}$. If it is wrong to switch then the other envelope contains ${z/2}$, but if it is right to switch it contains ${2z}$. There are even odds of either, so your expectation if you switch is ${1.25z}$. This is better than the ${z}$ you get by sticking with the initial envelope, so it always is better to switch!
• Since we don’t know anything about the numbers involved, opening the first envelope gives us no information — so ignore that value. Call the amount in the other envelope ${z'}$. If it is wrong to switch then the envelope you chose contains ${2z'}$, and if right to switch it contains ${0.5z'}$. If you switch, you get ${z'}$ but if you don’t your expectation is ${1.25z'}$. So it always is better NOT to switch!
• Call the amounts in the two envelopes ${x}$ and ${2x}$ (though you don’t know which envelope contains which). You pick one, but there is equal probability of it being either ${x}$ or ${2x}$. The expected reward thus is ${1.5x}$. If you switch, the same holds true for the other envelope. So you still have an expected reward of ${1.5x}$. It doesn’t matter what you do.
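
A quick Monte Carlo makes the third argument concrete: whatever distribution the lower amount ${x}$ is drawn from (the log-uniform choice below is arbitrary), always-keep and always-switch have the same average payoff.

```python
# Simulate the two-envelope game: draw the lower amount x, put (x, 2x) into
# envelopes in random order, and compare always-keep vs always-switch.
import random

random.seed(1)
N = 100_000
keep = switch = 0.0
for _ in range(N):
    x = 2 ** random.uniform(0, 10)   # lower amount; this distribution is arbitrary
    pair = [x, 2 * x]
    random.shuffle(pair)             # 50-50 which envelope is picked first
    first, other = pair
    keep += first                    # always-keep payoff
    switch += other                  # always-switch payoff
print(keep / N, switch / N)          # both approach 1.5 * E[x]
```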

Obviously, something is wrong with our logic. One thing that is clear is that we’re mixing apples and oranges with these arguments. Let’s be a bit more consistent with our terminology. Let’s call the value that is in the opened envelope ${z}$ and the values in the two envelopes ${x}$ and ${2x}$. We don’t know which envelope contains each, though. When we choose the first envelope, we observe a value ${z}$. This value may be ${x}$ or ${2x}$.

In the 3rd argument, ${P(z=x)= P(z=2x)= 0.5}$. If we switch, then ${\langle V \rangle= P(z=x)2x+P(z=2x)x = 1.5x}$. If we keep the initial envelope then ${\langle V \rangle= P(z=x)x+P(z=2x)2x = 1.5x}$. Whether we switch or not, the expected value is ${1.5x}$ though we do not know what this actually is. It could correspond to ${1.5z}$ or ${0.75z}$. We must now draw an important distinction. It is correct that ${P(z=x)= P(z=2x)= 0.5}$ for the known ${z}$ and given our definition of ${x}$ as the minimum of the two envelopes. However, we cannot claim that ${1.5x}$ is ${1.5z}$ or ${0.75z}$ with equal probability! That would be tantamount to claiming that the envelopes contain the pairs ${(z/2,z)}$ or ${(z,2z)}$ with equal probability. We defined ${x}$ to be the minimum value so the first equality holds, but we would need to impose a constraint on the distribution over that minimum value itself in order for the second one to hold. This is a subtle point and we will return to it shortly. Suffice it to say that if we assume such a thing we are led right to the same fallacy the first two arguments are guilty of.

Obviously, the first two arguments can’t both be correct. Their logic is the same and therefore they must both be wrong. But how? Before describing the problems, let’s consider a slight variant in which you are NOT shown the contents of the first envelope before being asked to switch. It may seem strange that right after you’ve chosen, you are given the option to switch when no additional information has been presented. Well, this really is the same problem. With no a priori knowledge of the distribution over ${x}$, it is immaterial whether the first envelope is opened or not before the 2nd choice is made. This gives us a hint as to what is wrong with the first two arguments.

There actually are two probability distributions at work here, and we are confounding them. The first is the underlying distribution on ordered pairs or, equivalently, the distribution of the lower element ${x}$. Let us call it ${P(x)}$. It determines which two numbers ${(x,2x)}$ we are dealing with. We do not know ${P(x)}$.

The second relevant distribution is over how two given numbers (in our case ${(x,2x)}$) are deposited in the envelopes (or equivalently, how the player orders the envelopes by choosing one first). This distribution unambiguously is 50-50.

The problem arises when we implicitly assume a form for ${P(x)}$ or attempt to infer information about it from the revealed value ${z}$. Without a priori knowledge of ${P(x)}$, being shown ${z}$ makes no difference at all. Arguments which rely solely on the even-odds of the second distribution are fine, but arguments which implicitly involve ${P(x)}$ run into trouble.

The first two arguments make precisely this sort of claim. They implicitly assume that the pairs ${(z/2,z)}$ or ${(z,2z)}$ can occur with equal probability. Suppose they couldn’t. For simplicity (and without reducing the generality of the problem), let’s assume that the possible values in the envelopes are constrained to ${2^n}$ with ${n\in Z}$. The envelopes thus contain ${(2^n,2^{n+1})}$ for some integer ${n}$ (though we don’t know which envelope contains which value). For convenience, let’s work in terms of ${log_2}$ of the values involved (taking care to use ${2^n}$ when computing expectations).

In these terms, the two envelopes contain ${(n,n+1)}$ for some ${n=\log_2(x)}$ (defined to be the lesser of the two). We open one, and see ${m=\log_2(z)}$. If it is the upper then the pair is ${(m-1,m)}$, otherwise the pair is ${(m,m+1)}$. To claim that these have equal probabilities means that ${n=m-1}$ and ${n=m}$ are equally probable. We made this assumption independent of the value of ${m}$, so it would require that all pairs ${(n,n+1)}$ be equally probable.

So what? Why not just assume a uniform distribution? Well, for one thing, we should be suspicious that we require an assumption about ${P(x)}$. The 3rd argument requires no such assumption. Even if we were to assume a form for ${P(x)}$, we can’t assume it is uniform. Not just can’t as in “shouldn’t”, but can’t as in “mathematically impossible.” It is not possible to construct a uniform distribution on ${Z}$.

Suppose we sought to circumvent this issue by constraining ourselves to some finite range ${[M,N]}$, which we supposedly know or assume a priori. We certainly can impose a uniform distribution on it. Each pair ${(n,n+1)}$ has probability ${1/(N-M)}$ with ${n\in [M,N-1]}$. But now we’ve introduced additional information (in the form of ${N}$ and ${M}$), and it no longer is surprising that we can do better than even-odds! We always would switch unless the first envelope contained ${N}$. There is no contradiction between the first two arguments because we have a priori knowledge and are acting on it. We no longer are true to the original game.
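
We can verify the ranged claim by brute force. With pairs ${(n,n+1)}$ and ${n}$ uniform on ${[M,N-1]}$, the strategy "keep only when the opened envelope shows ${N}$, otherwise switch" wins with probability ${1/2 + 1/(2(N-M))}$. The range below is an arbitrary example.

```python
# Exhaustively enumerate the ranged two-envelope game: each pair (n, n+1) with
# n in [M, N-1] is equally likely, as are the two envelope orders.
M, N = 0, 10                                     # hypothetical known range
wins = total = 0
for n in range(M, N):                            # the pair is (n, n+1)
    for m, hidden in ((n, n + 1), (n + 1, n)):   # the two envelope orders
        total += 1
        chosen = m if m == N else hidden         # keep only at the top, else switch
        wins += chosen == max(n, n + 1)
print(wins, total, wins / total)                 # 11 20 0.55
```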

Rather than dwell on this particular case, let’s solve the more general case of a given ${P(x)}$ (or in terms of ${log_2}$, ${P(n)}$). For any ${n}$ drawn according to ${P(n)}$, the envelopes contain ${(n,n+1)}$ in some order and it is equally likely that ${m=n}$ and ${m=n+1}$. If we know ${P}$ we can bet accordingly since it contains information. In that case, knowing ${m}$ (i.e. ${z}$) helps us. Let’s suppose we don’t know ${P}$. Then it still does not matter whether we observe the value ${z}$, because we don’t know the underlying distribution!

There only are two deterministic strategies: always keep, always switch. Why? Suppose that the drawn value is ${n}$ (unknown to us) and the observed value is ${m}$. Note that this doesn’t require actual knowledge of the value ${m}$, just that it has been fixed by the process of opening the envelope. Since we don’t know the underlying distribution, our strategy will be independent of the actual value. Given that the value doesn’t matter, we have nothing to do but always keep or always switch.

First consider the expected value with the always-keep strategy:

$\displaystyle \langle V_K \rangle= \sum_{n=-\infty}^\infty P(n) [P(m=n|n) 2^n + P(m=n+1|n) 2^{n+1}]$

I.e. we sum over all possible ordered pairs ${(n,n+1)}$ and then allow equal probability ${P(m=n+1|n)=P(m=n|n)=0.5}$ for either of the two envelope orders. So we have ${\langle V_K \rangle= \sum P(n) (2^n+2^{n+1})/2 = 3 \langle 2^{n-1} \rangle}$. We immediately see that for this to be defined the probability distribution must fall off faster than ${2^{-n}}$ as ${n}$ gets large! We already have a constraint on the possible forms for ${P}$.

Next consider the always-switch strategy. It’s easy to see that we get the same result:

$\displaystyle \langle V_S \rangle= \sum_{n=-\infty}^\infty P(n) [P(m=n|n) 2^{n+1} + P(m=n+1|n) 2^{n}]$

and since ${P(m=n|n)= P(m=n+1|n)}$ we get the same answer.

But let’s be extra pedantic, and connect this to the original formulation of the first two arguments. I.e., we should do it in terms of ${m}$, the observed value.

$\displaystyle \langle V_S \rangle= \sum_m P(m) [P(n=m|m) 2^{m+1} + P(n=m-1|m) 2^{m-1}]$

We observe that ${P(n=m|m)= P(m|n=m)P(n=m)/P(m)}$ and ${P(n=m-1|m)= P(m|n=m-1)P(n=m-1)/P(m)}$. We know that ${P(m|n=m)= P(m|n=m-1)= 0.5}$. Plugging these in, we get

$\displaystyle \langle V_S \rangle= \sum_m [0.5 P(n=m) 2^{m+1} + 0.5 P(n=m-1) 2^{m-1}]$

The first term gives us ${\sum_n P(n) 2^n}$. We can rewrite the index on the 2nd sum to get ${\sum_n P(n) 2^{n-1}}$, which gives us ${\langle V_S \rangle= \sum_n P(n) (2^n + 2^{n-1})}$, the exact same expression as before!

How does this apply to the ${[M,N]}$ ranged example we gave before? When we discussed it, we considered the case where the underlying distribution was known. In that and all other cases, a better than even-odds strategy based on such knowledge can be computed. In our actual formulation of the game, we don’t know ${P(n)}$ and there’s no reason it couldn’t be uniform on some unknown interval ${[M,N]}$. Suppose it was. It still seems from our earlier discussion as if we’d do better by always switching. We don’t. The average amount thrown away by incorrectly switching when ${m=N}$ exactly offsets the average gain from switching in all other cases. We do no better by switching than by keeping.
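
Here is the exact calculation for the uniform-on-${[M,N]}$ case when we do not act on the boundary: enumerating all equally likely pairs and orderings, always-keep and always-switch have identical expectations. The range is an arbitrary example.

```python
# Exact expectations for the uniform ranged game: P(n) uniform on [M, N-1],
# envelopes (2^n, 2^(n+1)) in either order with equal probability.
M, N = 0, 10  # illustrative range

keep = switch = 0.0
for n in range(M, N):                            # each pair (n, n+1) equally likely
    for first, other in ((2 ** n, 2 ** (n + 1)), (2 ** (n + 1), 2 ** n)):
        keep += first                            # payoff of always-keep
        switch += other                          # payoff of always-switch
keep /= 2 * (N - M)                              # normalize over 2*(N-M) equally
switch /= 2 * (N - M)                            # likely (pair, order) cases
print(keep, switch)                              # identical expectations
```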

We thus see that without knowing the underlying distribution ${P(x)}$, the switching and keeping strategies have the same expected reward. Of the three arguments we originally proposed, the first 2 were flawed in that they assume a particular, and impossible, underlying distribution for ${x}$.

At the beginning of our discussion, we mentioned that our intuition says you cannot do better than 50-50 probability-wise. Let us set aside expected rewards and focus solely on probabilities. We now see how you actually can do better than 50-50, contrary to all intuition!

— Achieving better than 50-50 Odds with Two Envelopes —

Next let’s consider a broader class of two-envelope problems, but purely from the standpoint of probabilities. Now the two envelopes can contain any numbers; one need not be double the other. As before, we may choose an envelope, it is opened, and we are offered the opportunity to keep it or switch. Unlike before, our goal now is to maximize the probability of picking the larger envelope.

Since we are dealing with probabilities rather than expectation values, we don’t care what two numbers the envelopes contain. In fact, they need not be numbers at all — as long as they are distinct and comparable (i.e. ${a<b}$ or ${b<a}$, but not both). To meaningfully analyze the problem we require a slightly stronger assumption, though: specifically, that the set from which they are drawn (without repetition) possesses a strict linear ordering. However, it need not possess any algebraic structure or a metric. Since we are not concerned with expectation values, no such additional structure is necessary.

Our intuition immediately tells us that nothing can be gained by switching. In fact, nothing we do should have any impact on the outcome. After all, the probability of initially picking correctly is ${1/2}$. Switching adds no information and lands us with an identical ${1/2}$ probability. And that is that, right? It turns out that, contrary to our very strong intuition about the problem, there is in fact a way to improve those odds. To accomplish this, we’ll need to introduce a source of randomness. For convenience of exposition we’ll assume the envelopes contain real numbers, and revisit the degree to which we can generalize the approach later.

The procedure is as follows:

• Pick any continuous probability distribution ${P}$ which has support on all of ${R}$ (i.e. ${p(x)>0}$ for all real ${x}$). Many common distributions (normal, logistic, Cauchy, etc) qualify; note that distributions with restricted support (beta, exponential, etc) do not.
• Choose an envelope and open it. We’ll denote its value ${z}$.
• Sample some value ${d}$ from our distribution ${P}$. If ${z>d}$ stick with the initial choice, otherwise switch. We need only consider ${z>d}$ and ${z<d}$ because the event ${z=d}$ has measure ${0}$ and safely can be ignored.
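
A simulation of this procedure (with fixed but arbitrary envelope values, and a normal distribution standing in for ${P}$; both choices are purely illustrative) shows the win rate is indeed above one half:

```python
# Monte Carlo of the randomized switching rule: keep the opened envelope iff
# its value z exceeds a threshold d drawn from a full-support distribution.
import random

random.seed(3)
x, y = 17.0, 23.0                  # hidden envelope values (x < y), arbitrary
N = 200_000
wins = 0
for _ in range(N):
    z, other = (x, y) if random.random() < 0.5 else (y, x)
    d = random.gauss(20.0, 10.0)   # sample from our contrived P
    chosen = z if z > d else other # keep iff z > d, otherwise switch
    wins += chosen == y
print(wins / N)                    # noticeably above 0.5 (about 0.62 here)
```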

At first, second, and ${n^{th}}$ glance, this seems pointless. It feels like all we’ve done is introduce a lot of cruft which will have no effect. We can go stand in a corner flipping a coin, play Baccarat at the local casino, cast the bones, or anything else we want, and none of that can change the probability that we’re equally likely to pick the lower envelope as the higher one initially — and thus equally likely to lose as to gain by switching. With no new information, there can be no improvement. Well, let’s hold that thought and do the calculation anyway. Just for fun.

First some terminology. We’ll call the value in the opened envelope ${z}$, and the value in the other envelope ${z'}$. The decision we must make is whether to keep ${z}$ or switch to the unknown ${z'}$. We’ll denote by ${x}$ and ${y}$ the values in the two envelopes in order. I.e., ${x<y}$ by definition. In terms of ${z}$ and ${z'}$ we have ${x= \min(z,z')}$ and ${y= \max(z,z')}$. We’ll denote our contrived distribution ${P}$ in the abstract, with pdf ${p(v)}$ and cdf ${F(v)=\int_{-\infty}^v p(v') dv'}$.

Let’s examine the problem from a Bayesian perspective. There is a 50-50 chance that ${(z,z')=(x,y)}$ or ${(z,z')=(y,x)}$. So ${p(z=x)=p(z=y)=0.5}$. There are no subtleties lurking here. We’ve assumed nothing about the underlying distribution over ${(x,y)}$. Whatever ${(x,y)}$ the envelopes contain, we are equally likely to initially pick the one with ${x}$ or the one with ${y}$.

Once the initial envelope has been opened, and the value ${z}$ revealed, we sample ${d}$ from our selected distribution ${P}$ and clearly have ${p(d<x)= F(x)}$, ${p(d<y)= F(y)}$, and ${p(d<z)= F(z)}$. The latter forms the criterion by which we will keep ${z}$ or switch to ${z'}$. Please note that in what follows, ${d}$ is not a free variable, but rather a mere notational convenience. Something like ${p(x<d)}$ is just notation for “the probability the sampled value is greater than ${x}$.” We can apply Bayes’ law to get (with all probabilities conditional on some unknown choice of ${(x,y)}$):

$\displaystyle p(z=x|d<z)= \frac{p(d<z|z=x)\,p(z=x)}{p(d<z)}$

What we really care about is the ratio:

$\displaystyle \frac{p(z=x | d<z)}{p(z=y | d<z)}= \frac{p(d<z|z=x)}{p(d<z|z=y)}= \frac{F(x)}{F(y)} < 1$

Here, we’ve observed that ${p(d<z|z=x)= F(x)}$ and ${p(d<z|z=y)= F(y)}$, and ${F(x)<F(y)}$ since by assumption ${x<y}$ and ${F}$ is monotonically increasing (we assumed its support is all of ${R}$). I.e., if ${d<z}$ there is a greater probability that ${z=y}$ than ${z=x}$. We shouldn’t switch. A similar argument shows we should switch if ${d>z}$.

So what the heck has happened, and where did the new information come from? What happened is that we actually know one piece of information we had not used: that the interval ${(x,y)}$ has nonzero probability measure. I.e. there is some “space” between ${x}$ and ${y}$. We don’t know the underlying distribution but we can pretend we do. Our strategy will be worse than if we did know the underlying ${p(x)}$, of course. We’ll return to this shortly, but first let’s revisit the assumptions which make this work. We don’t need the envelopes to contain real numbers, but we do require the following of the values in the envelopes:

• The set of possible values forms a measurable set with a strict linear ordering.
• Between any two possible values there is a set of nonzero probability measure under ${P}$. Actually, this only is necessary if we require a nonzero improvement for any ${(x,y)}$. If we only require an improvement on average we don’t need it. But in that scenario, the host can contrive to use a distribution which neutralizes our strategy and returns us to 50-50 odds.

What difference does ${P}$ itself make? We don’t have any way to choose an “optimal” distribution because that would require placing the bulk of probability where we think ${x}$ and ${y}$ are likely to lie. I.e. we would require prior knowledge. All we can guarantee is that we can improve things by some (perhaps tiny) amount. We’ll compute how much (for a given true underlying distribution) shortly.

Let’s assume that ${Q(x,y)}$ is the true underlying distribution over ${(x,y)}$. We won’t delve into what it means to “know” ${Q}$ since we are handed the envelopes to begin with. Perhaps the game is played many times with values drawn according to ${Q}$ or maybe it is a one-time affair with ${(x,y)}$ fixed (i.e. ${Q}$ a ${\delta}$-distribution). Ultimately, such considerations just would divert us to the standard core philosophical questions of probability theory. Suffice to say that there exists some ${Q(x,y)}$. By definition ${Q(x,y)=0}$ unless ${x<y}$. For convenience, we’ll define a symmetrized version as well: ${q(a,b)\equiv Q(a,b)+Q(b,a)}$. We don’t employ a factor of ${1/2}$ since the two terms are nonzero on disjoint domains.

Given ${Q}$, what gain do we get from a particular choice of ${P}$?

$\displaystyle P(win)= \int_{x<y} dx\, dy\, Q(x,y)\left[ p(z=y|(x,y))F(y) + p(z=x|(x,y))(1-F(x)) \right]$

I.e., the probability we keep ${z}$ when it is ${y}$ and switch when it is ${x}$. Clearly, ${p(z=x|(x,y))= p(z=y|(x,y))= 0.5}$ since those are the immutable 50-50 envelope ordering probabilities. After a little rearrangement, we get:

$\displaystyle P(win)= \frac{1}{2} + \frac{1}{2}\langle F(y) - F(x) \rangle_Q$

Our gain is half the mean value of ${F(y)-F(x)}$ over the joint distribution ${Q(x,y)}$. The more probability ${P}$ jams between ${x}$ and ${y}$, the more we gain should that ${(x,y)}$ arise. But without knowledge of the underlying joint distribution ${Q(x,y)}$, we have no idea how best to pick ${P}$. All we can do is guarantee some improvement.
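To put a number on the gain, here is a sketch with illustrative choices of my own (envelope values ${x=1}$, ${y=2}$, and ${F}$ the standard normal CDF); the factor of ${\frac{1}{2}}$ comes from averaging over which envelope we opened first:

```python
from math import erf, sqrt

def normal_cdf(t):
    # CDF of N(0,1) via the error function
    return 0.5 * (1.0 + erf(t / sqrt(2.0)))

def p_win(x, y):
    # P(win) = 1/2 + (1/2)(F(y) - F(x)) for a fixed pair x < y; the extra 1/2
    # comes from averaging over which envelope we opened first
    return 0.5 + 0.5 * (normal_cdf(y) - normal_cdf(x))

print(p_win(1.0, 2.0))  # about 0.568: a small but strictly positive edge
```

Note how the edge shrinks as the pair moves into a tail of ${F}$ — exactly the “we don’t know where to put the probability” problem discussed above.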

How well can we do if we actually know ${Q}$? Well, there are two ways to use such information. We could stick to our strategy and try to pick an optimal ${P}$, or we could seek to use knowledge of ${Q}$ directly. In order to do the former, we need to exercise a little care. ${Q}$ is a two-dimensional distribution while ${P}$ is one-dimensional. How would we use ${Q}$ to pick ${P}$? Well, this is where we make use of the observed ${z}$.

In our previous discussion of the ${(x,2x)}$ envelope switching fallacy, the value of ${z}$ turned out to be a red herring. Here it is not: our algorithm pivots on a comparison between ${z}$ and our randomly sampled value ${d}$. But we use ${z}$ only for the computation of probabilities — as mentioned, we assume no algebraic properties and compute no expectations. Considering our ultimate goal (keep or switch), it is clear what we need from ${Q}$: a conditional probability that ${z'>z}$. However, we cannot directly use ${Q(y|x)}$ because we defined ${x<y}$, and we don’t know whether ${z<z'}$ or ${z'<z}$. Let’s start by computing the probability of ${z}$ (being the observed value) and of ${z,z'}$ (being the observed and unobserved values).

The probability of observing ${z}$ and the other envelope having ${z'}$ is the probability that the relevant ordered pair was chosen for the two envelopes multiplied by the ${1/2}$ probability that we initially opened the envelope containing the value corresponding to our observed ${z}$ rather than the other one.

$\displaystyle p(z,z')= Q(min(z,z'),max(z,z'))/2= q(z,z')/2$

To get ${p(z)}$ we integrate this: ${p(z)= \frac{1}{2}\int Q(z,y)dy + \frac{1}{2}\int Q(x,z)dx}$. This is a good point to introduce two quantities which will be quite useful going forward.

$\displaystyle I_1(z)\equiv \int_{-\infty}^z Q(x,z) dx$

$\displaystyle I_2(z)\equiv \int_z^\infty Q(z,y) dy$

In terms of these,

$\displaystyle p(z)= \frac{1}{2}[I_1(z)+I_2(z)]$

There’s nothing special about calling the variables ${x}$ or ${y}$ in the integrals and it is easy to see (since each only covers half the domain) that we get what we would expect:

$\displaystyle p(z)= \frac{1}{2}\int q(w,z)dw$

What we want is the distribution ${p(z'|z)= p(z,z')/p(z)= q(z,z')/(2p(z))}$. This gives us:

$\displaystyle p(z'|z)= \frac{q(z,z')}{\int q(w,z)dw}= \frac{q(z,z')}{I_1(z)+I_2(z)}$

Finally, this gives us the desired quantity ${p(z'>z)= \int_{z'>z} dz' p(z'|z)}$. It is easy to see that:

$\displaystyle p(z'<z)= \frac{I_1(z)}{I_1(z)+I_2(z)}$

$\displaystyle p(z'>z)= \frac{I_2(z)}{I_1(z)+I_2(z)}$

As an example, consider the previous ${(x,2x)}$ case — where one envelope holds twice what the other does. We observe ${z}$, and ${z'}$ must be either ${2z}$ or ${z/2}$, though we don’t know with what probabilities. If we are given the underlying distribution on ${x}$, say ${P_2(x)}$, we can figure that out. ${Q(x,y)= P_2(x)\delta(y-2x)}$ and ${q}$ is the symmetrized version. We’re really dealing with discrete values and being sloppy with the ${\delta}$ functions (which would leave us with ratios of infinities when computing probability ratios, so we treat things discretely here). Conditioned on observing ${z}$, the pair was ${(z,2z)}$ with prior weight ${P_2(z)}$ or ${(z/2,z)}$ with prior weight ${P_2(z/2)}$, so ${\sum_w q(w,z)= P_2(z/2)+P_2(z)}$ and ${p(z)= \frac{1}{2}(P_2(z/2)+P_2(z))}$. This is just what we’d expect. The relevant probability ratio clearly is ${P_2(z/2)/P_2(z)}$. From a purely probability standpoint, we should switch if ${P_2(z)>P_2(z/2)}$. If we reimpose the algebraic structure and try to compute expectations (as in the previous problem), we would get an expected value of ${z}$ from keeping and an expected value of ${z\,\frac{P_2(z/2)/2 + 2P_2(z)}{P_2(z/2)+P_2(z)}}$ from switching. Whether this is less than or greater than ${z}$ depends on the distribution ${P_2}$.
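As a concrete check, here is a sketch with a hypothetical geometric prior of my own choosing (the smaller value is ${2^k}$ with probability ${2^{-(k+1)}}$). Conditioned on the observed ${z}$, the other envelope is ${2z}$ with weight proportional to ${P_2(z)}$ (the pair was ${(z,2z)}$) and ${z/2}$ with weight proportional to ${P_2(z/2)}$ (the pair was ${(z/2,z)}$):

```python
def p2(x):
    # Hypothetical prior (my own illustrative assumption): the smaller
    # envelope value is 2**k with probability 2**-(k+1), k = 0, 1, 2, ...
    k = 0
    while (1 << k) < x:
        k += 1
    return 0.5 ** (k + 1) if (1 << k) == x else 0.0

def should_switch(z):
    # The other envelope is 2z with weight proportional to P2(z) and
    # z/2 with weight proportional to P2(z/2); switch iff 2z is likelier
    return p2(z) > p2(z / 2)

print(should_switch(1), should_switch(4))  # True False
```

With this prior, only the smallest possible observation ${z=1}$ warrants a switch (since ${z/2}$ is then impossible); every larger ${z}$ is more likely to be the bigger envelope.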

Returning to our analysis, let’s see how often we are right about switching if we know the actual distribution ${Q}$ and use that knowledge directly. The strategy is obvious. Using our above formulae, we can compute ${p(z'<z)}$ and ${p(z'>z)}$ directly. To optimize our probability of winning, we observe ${z}$ and then switch iff ${I_1(z)<I_2(z)}$. If there is additional algebraic structure and expectations can be defined, then an analogous calculation gives whatever switching criterion maximizes the relevant expectation value.

In terms of probabilities, full knowledge of ${Q}$ is the best we can do. The probability we act correctly is:

$\displaystyle \begin{array}{rcl} P'(win)= \int dz\, p(z)\, \frac{\theta(I_1(z)-I_2(z)) I_1(z) + \theta(I_2(z)-I_1(z))I_2(z)}{I_1(z)+I_2(z)} \\ = \frac{1}{2}\int dz\, \max(I_1(z),I_2(z)) \end{array}$

$\displaystyle P'(win|z)= \frac{\max(I_1(z),I_2(z))}{I_1(z)+I_2(z)}$

Since ${I_1}$ and ${I_2}$ are monotonic (one increasing, the other decreasing), we have a cutoff value ${\hat z}$ (defined by ${I_1({\hat z})= I_2({\hat z})}$) below which we should switch and above which we should not.

How do we do with our invented ${P}$ instead? We could recast our earlier formula for ${P(win)}$ into our current notation, but it’s easier to compute directly. For given ${z}$, the actual probability of needing to switch is ${I_2(z)/(I_1(z)+I_2(z))}$, and our algorithm switches with probability ${P(d>z)= 1-F(z)}$. The probability of not needing to switch is ${I_1(z)/(I_1(z)+I_2(z))}$, and we keep with probability ${P(d<z)= F(z)}$. I.e., our probability of success for given ${z}$ is:

$\displaystyle P(win|z)= \frac{I_1(z)F(z) + I_2(z)(1-F(z))}{I_1(z)+I_2(z)}$

For any given ${z}$, this is of the form ${\alpha r + (1-\alpha)(1-r)}$ where ${r= F(z)}$ and ${\alpha= I_1(z)/(I_1(z)+I_2(z))}$. The optimal solutions lie at one end or the other. So it obviously is best to have ${F(z)=0}$ when ${z<{\hat z}}$ and ${F(z)=1}$ when ${z>{\hat z}}$. This would be discontinuous, but we could come up with a smoothed step function (ex. a logistic function) which is differentiable but arbitrarily sharp. The gist is that we want all the probability in ${F}$ concentrated around ${\hat z}$. Unfortunately, we have no idea where ${\hat z}$ is!

Out of curiosity, what if we instead pick ${P}$ to be the conditional distribution ${p(z'|z)}$ itself once we’ve observed ${z}$? We’ll necessarily do worse than by direct comparison using ${Q}$ (the max formula above), but how much worse? Well, ${p(z'|z)= q(z,z')/(I_1(z)+I_2(z))}$. Integrating over ${z'<z}$, we have ${F(z)= \int_{-\infty}^z p(z'|z) dz'= I_1(z)/(I_1(z)+I_2(z))}$. I.e., we end up with ${(I_1^2(z)+I_2^2(z))/(I_1(z)+I_2(z))^2}$ as our probability of success. If we had used ${1-p(z'|z)}$ for our ${P}$ instead, we would get ${2I_1(z)I_2(z)/(I_1(z)+I_2(z))^2}$. Neither is optimal in general.
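These success probabilities are easy to check numerically, with sums standing in for the integrals. A sketch using an arbitrary discrete ${Q}$ of my own choosing:

```python
# Hypothetical discrete joint distribution Q over ordered pairs x < y
# (an arbitrary example; sums stand in for the integrals in the text)
Q = {(1, 2): 0.2, (1, 3): 0.1, (2, 3): 0.4, (2, 5): 0.3}
values = sorted({v for pair in Q for v in pair})

def I1(z):  # weight of pairs in which z is the larger value
    return sum(p for (x, y), p in Q.items() if y == z)

def I2(z):  # weight of pairs in which z is the smaller value
    return sum(p for (x, y), p in Q.items() if x == z)

# Optimal strategy (switch iff I1(z) < I2(z)): P'(win) = (1/2) sum_z max(I1, I2)
optimal = 0.5 * sum(max(I1(z), I2(z)) for z in values)

# P chosen as p(z'|z): success (I1^2 + I2^2)/(I1 + I2)^2 given z,
# averaged over p(z) = (I1 + I2)/2
conditional = sum(
    0.5 * (I1(z) + I2(z)) * (I1(z) ** 2 + I2(z) ** 2) / (I1(z) + I2(z)) ** 2
    for z in values
)

print(optimal, conditional)  # optimal beats the p(z'|z) choice; both beat 1/2
```

For this particular ${Q}$ the optimal strategy wins 90% of the time while the ${p(z'|z)}$-based choice wins about 84% — both comfortably above even odds, with the ordering the analysis predicts.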

Next, let’s look at the problem from an information theory standpoint. As mentioned, there are two sources of entropy: (1) the choice of the underlying pair ${(x,y)}$ (with ${x<y}$ by definition) and (2) the selection ${(z,z')=(x,y)}$ or ${(z,z')=(y,x)}$ determined by our initial choice of an envelope. The latter is a fair coin toss with no information and maximum entropy. The information content of the former depends on the (true) underlying distribution.

Suppose we have perfect knowledge of the underlying distribution. Then any given ${z}$ arises with probability ${p(z)=\frac{1}{2}[I_1(z)+I_2(z)]}$. Given that ${z}$, we have a Bernoulli random variable ${p(z'>z)}$ given by ${I_2(z)/(I_1(z)+I_2(z))}$. The entropy of that specific coin toss (i.e. the conditional entropy of the Bernoulli distribution ${p(z'> z|z)}$) is

$\displaystyle H(z'>z|z)= \frac{-I_1(z)\ln I_1(z) - I_2(z)\ln I_2(z) + (I_1(z)+I_2(z))\ln [I_1(z)+I_2(z)]}{I_1(z)+I_2(z)}$

With our contrived distribution ${P}$, we are implicitly operating as if ${p(z'>z)= 1-F(z)}$. This yields a conditional entropy:

$\displaystyle H'(z'>z|z)= -(1-F(z))\ln (1-F(z)) - F(z)\ln F(z)$

There is a natural measure of the information cost of assuming an incorrect distribution: the Kullback-Leibler divergence (also known as the relative entropy). While it wouldn’t make sense to compute it between ${Q}$ and ${P}$ (which are, among other things, of different dimension), we certainly can compare the cost for given ${z}$ of the difference in our Bernoulli random variables for switching — and then integrate over ${z}$ to get an average cost in bits. Let’s denote by ${q(z'>z)}$ the probability based on the true distribution and keep ${p(z'>z)}$ for the contrived one. I.e. ${q(z'>z)= I_2(z)/(I_1(z)+I_2(z))}$ and ${p(z'>z)= 1-F(z)}$. For given ${z}$, the K-L divergence is:

$\displaystyle D(Q || P, z)= \frac{-I_2(z)\ln [(I_1(z)+I_2(z))(1-F(z))/I_2(z)] - I_1(z)\ln [(I_1(z)+I_2(z))F(z)/I_1(z)]}{I_1(z)+I_2(z)}$

Integrating this, we get the mean cost in bits of being wrong.

$\displaystyle \begin{array}{rcl} \langle D(Q || P) \rangle= \frac{1}{2}\int dz [-(I_1(z)+I_2(z))\ln [I_1(z)+I_2(z)] - I_2(z)\ln (1-F(z)) \\ -I_1(z)\ln F(z) + I_1(z)\ln I_1(z) + I_2(z)\ln I_2(z)] \end{array}$

The first term is essentially ${H(z)}$, the entropy of our actual distribution over ${z}$ (up to an additive constant, since ${p(z)= \frac{1}{2}[I_1(z)+I_2(z)]}$). More usefully, the first term and last 2 terms together we recognize as ${-\langle H(z'>z|z) \rangle}$, minus the mean Bernoulli entropy of the actual distribution. In these terms, we have:

$\displaystyle \langle D(Q || P) \rangle= -\langle H(z'>z|z) \rangle + \langle \frac{ -I_2(z)\ln(1-F(z)) - I_1(z)\ln F(z)}{I_1(z)+I_2(z)} \rangle$

where the expectations are over the unconditional actual distribution ${p(z)}$. The 2nd expectation on the right is a cross-entropy; it represents the cost of being wrong about ${P}$. It is always at least ${\langle H(z'>z|z) \rangle}$, and the divergence vanishes exactly when ${1-F(z)}$ matches the true switching probability ${I_2(z)/(I_1(z)+I_2(z))}$ for every ${z}$ — i.e. when ${P}$ reproduces the true conditional switching odds.
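The per-${z}$ divergence formula can be checked against a direct Bernoulli computation. A sketch with arbitrary illustrative values of ${I_1}$, ${I_2}$, and ${F(z)}$ (all hypothetical):

```python
from math import log

def kl_formula(I1, I2, F):
    # Per-z K-L divergence in terms of I1(z), I2(z), F(z), as in the text
    return (-I2 * log((I1 + I2) * (1 - F) / I2)
            - I1 * log((I1 + I2) * F / I1)) / (I1 + I2)

def kl_bernoulli(q, p):
    # Direct relative entropy between Bernoulli(q) and Bernoulli(p)
    return q * log(q / p) + (1 - q) * log((1 - q) / (1 - p))

I1, I2, F = 0.3, 0.5, 0.6          # arbitrary illustrative values
q = I2 / (I1 + I2)                 # true switching probability q(z' > z)
p = 1 - F                          # contrived switching probability under P
print(kl_formula(I1, I2, F), kl_bernoulli(q, p))  # the two agree
```

The two computations agree (and are positive whenever ${p\neq q}$), confirming that the formula is just the Bernoulli relative entropy in disguise.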

As an aside, this sort of probabilistic strategy should not be confused with the mixed strategies of game theory. In our case, a mixed strategy would be an a priori choice ${aK+(1-a)S}$ where ${K}$ is the always-keep strategy, ${S}$ is the always-switch strategy, and ${0\le a\le 1}$ is the probability of employing the always-keep strategy. A player would flip a biased coin with Bernoulli probability ${a}$ and choose one of the two strategies based on it. That has nothing to do with the measure-theoretic approach we’re taking here. In particular, a mixed strategy makes no use of the observed value ${z}$ or its relation to the randomly sampled value. Any mixed strategy gives even odds because the two underlying deterministic strategies both have even odds.

# Semidirect Products, Split Exact Sequences, and all that

One of the things I’ve butted heads with in studying Lie Groups is the semidirect product and its relationship to split exact sequences. It quickly became apparent that this was a pretty sizeable hole in my basic knowledge, so I decided to clarify this stuff once and for all.

— Normal Subgroups and Quotient Groups —

First, a brief refresher on Normal subgroups and Quotient groups. We are given a group ${G}$ and subgroup ${H\subseteq G}$.

• Left cosets are written ${gH}$ and right cosets are written ${Hg}$. Each is a set of elements in ${G}$. Not all left cosets are distinct, but any two are either equal or disjoint. Ditto for right cosets.
• The left (right) cosets form a partition of ${G}$, but they do not in general form a group. We can try to imbue them with a suitable product, but there are obstructions to the group axioms. For example ${g^{-1}H}$ is not a useful inverse since ${(gh)^{-1}= h^{-1}g^{-1}}$, so neither left cosets nor right cosets multiply as desired. More generally ${(gg')H}$ does not consist of a product of an element of ${gH}$ and an element of ${g'H}$.
• We define the Quotient Set ${G/H}$ to be the set of left cosets. As mentioned, it is not a group in general. There is an equivalent definition for right cosets, written ${H\setminus{}G}$, but it doesn’t appear often. In most cases we care about, the two are the same.
• It is easy to see that the condition which removes the obstruction is that ${gH=Hg}$ for all ${g}$. Equivalently, ${gHg^{-1}=H}$ for all ${g}$. If this holds, the cosets form a group. Often the stated condition is that the sets of left and right cosets are the same. But ${g\in gH}$ and ${g\in Hg}$, so this is exactly the same condition.
• ${H}$ is a Normal Subgroup if it obeys the conditions which make the cosets into a group.
• Usually a Normal Subgroup is denoted ${N}$, and we write ${N\triangleleft G}$ (or ${N\trianglelefteq G}$).
• For a Normal subgroup ${N}$, the Quotient Set ${Q=G/N}$ has (by definition) the natural structure of a group. It is called the Quotient Group.
• We have two natural maps associated with a Normal Subgroup:
• ${N\xrightarrow{i} G}$ is an inclusion (i.e. injective), defined by ${h\rightarrow h}$ (where the righthand ${h}$ is viewed in ${G}$). This is a homomorphism defined for any subgroup, not just normal ones.
• ${G\xrightarrow{q} Q}$ is the quotient map (surjective), defined by ${g\rightarrow gN}$ (with the righthand viewed as a coset, i.e. an element of ${G/N}$). This map is defined for any subgroup, with ${Q}$ the Quotient Set. For Normal Subgroups, it is a group homomorphism.
• We know there is a copy of ${N}$ in ${G}$. Though ${Q}$ is derived from ${G}$ and ${N}$, and possesses no new info, there may or may not be a copy of it in ${G}$. Two natural questions are when that is the case, and how ${G}$, ${N}$, and ${Q}$ are related in general.

Let’s also recall the First Isomorphism Theorem for groups. Given any two groups ${G}$ and ${H}$ and a homomorphism ${\phi:G\rightarrow H}$, the following hold:

• ${\ker \phi}$ is a Normal Subgroup of ${G}$
• ${\mathop{\text{im}} \phi}$ is a subgroup of ${H}$
• ${\mathop{\text{im}} \phi}$ is isomorphic to the Quotient Group ${G/\ker\phi}$.

Again, we have to ask: since ${\ker\phi}$ is a Normal Subgroup of ${G}$, and ${\mathop{\text{im}}\phi}$ is isomorphic to the Quotient Group ${G/\ker\phi}$ which “sort of” may have an image in ${G}$, is it meaningful to write something like ${G\stackrel{?}{=} \ker\phi \oplus \mathop{\text{im}} \phi}$ (playing fast and loose with notation)? The answer is no; it’s more complicated.

— Exact Sequences —

Next, a very brief review of exact sequences. We’ll use ${1}$ for the trivial group. The usual convention is to use ${1}$ for general groups and ${0}$ for Abelian groups. An exact sequence is a sequence of homomorphisms between groups ${\cdots \rightarrow G_n \xrightarrow{f_n} G_{n-1}\xrightarrow{f_{n-1}} \cdots}$ where ${\mathop{\text{im}} f_n= \ker f_{n-1}}$ for every pair. Here are some basic properties:

• ${1\rightarrow A \xrightarrow{f} B\cdots}$ means that ${f}$ is injective.
• ${\cdots A\xrightarrow{f} B\rightarrow 1}$ means that ${f}$ is surjective.
• ${1\rightarrow A\rightarrow B\rightarrow 1}$ means ${A\approx B}$ (the map between them is an isomorphism).
• Short Exact Sequence (SES): This is defined as an exact sequence of the form: ${1\rightarrow A\xrightarrow{f} B\xrightarrow{g} C\rightarrow 1}$.
• For an SES, ${f}$ is injective, ${g}$ is surjective, and ${C\approx B/\mathop{\text{im}} f}$
• SES’s arise all the time when dealing with groups, and the critical question is whether they “split”.

We’re now ready to define Split SES’s.

• Right Split SES: There exists a homomorphism ${h:C\rightarrow B}$ such that ${g\circ h=Id_C}$. Basically, we can move from ${C}$ into ${B}$ and back without losing info — which means ${C}$ is in some sense a subgroup of ${B}$.
• Left Split SES: There exists a homomorphism ${h:B\rightarrow A}$ such that ${h\circ f=Id_A}$. Basically, we can move from ${A}$ into ${B}$ and back without losing info — which means ${B}$ retracts onto its subgroup ${A}$.
• These two conditions are not in general equivalent, or even comparably restrictive. The Left Split condition is far more constraining than the Right Split one in general. The direction of the homomorphisms in the SES introduces an asymmetry. [My note: it seems likely that the two are dual in some sense.]

— External vs Internal View —

We’re going to describe 3 types of group operations: the direct product, semidirect product, and group extension. Each has a particular relationship to Normality and SES’s. There are two equivalent ways to approach this, depending on whether we prefer to define a binary operation between two distinct groups or to consider the relationship amongst subgroups of a given group.

• External view: We define a binary operation on two distinct, unrelated groups. Two groups go in, and another group comes out.
• Internal view: We define a relationship between a group and various groups derived from it (ex. Normal or Quotient).
• These approaches are equivalent. The Internal view describes the relationship amongst the two groups involved in the External view and their issue. Conversely, the derived groups in the Internal view may be recombined via the External view operation.

We must be a little careful with notation and terminology. When we use the symbol ${HK}$, it can mean one of two things.

• Case 1: ${H}$ and ${K}$ are distinct groups. ${HK}$ is just the set of all pairs of elements ${(h,k)}$. I.e. it is the direct product set (but not group).
• Case 2: ${H}$ and ${K}$ are subgroups of a common group ${G}$ (or have some natural implicit isomorphisms to such subgroups). In this case, ${HK}$ is the set of all elements in ${G}$ obtained as a product of an element of ${H}$ and an element of ${K}$ under the group multiplication.
• Note that we may prefer cases where two subgroups cover ${G}$, but there are plenty of other possibilities. For example, consider ${Z_{30}}$ (the integers mod 30). This has several obvious subgroups (${Z_2}$, ${Z_3}$, ${Z_5}$, ${Z_6}$, ${Z_{10}}$, ${Z_{15}}$). ${Z_2}$ and ${Z_3}$ only intersect on ${0}$ (the additive identity). However, the two do not cover (or even generate) the group! Similarly, ${Z_2}$ and ${Z_{10}}$ do not cover the group (or even generate it) but intersect on a nontrivial subset!
• Going the other way, we’ll say that ${G=HK}$ if ${H}$ and ${K}$ are subgroups and every element ${g}$ can be written as ${hk}$ for some ${h\in H}$ and ${k\in K}$. Note that ${H}$ and ${K}$ need not be disjoint (or even cover ${G}$ set-wise).
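The ${Z_{30}}$ claims above are easy to verify by brute force. A sketch, written additively (`subgroup` is my own helper computing the subgroup generated by a set of elements):

```python
def subgroup(generators, n=30):
    # Subgroup of Z_n generated by a set of elements (additive closure)
    elems = {0}
    while True:
        new = {(a + g) % n for a in elems for g in generators} - elems
        if not new:
            return elems
        elems |= new

Z2, Z3, Z10 = subgroup({15}), subgroup({10}), subgroup({3})

print(len(subgroup(Z2 | Z3)))  # 6: together they generate only the Z_6 part
print(sorted(Z2 & Z10))        # [0, 15]: a nontrivial intersection
```

${Z_2}$ and ${Z_3}$ intersect trivially yet generate only a ${Z_6}$ inside ${Z_{30}}$, while ${Z_2}$ and ${Z_{10}}$ overlap nontrivially — both failure modes mentioned above.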

Another potentially confusing point should be touched on. When we speak of “disjoint” subgroups ${H}$ and ${K}$ we mean that ${H\cap K=\{e\}}$, NOT that it is the null set. I.e., ${H\cap K= 1}$, the trivial group.

— Semidirect Product —

The semidirect product may seem a bit arbitrary at first but, as we will see, it is a natural part of a progression which begins with the Direct Product. Here are the two ways of defining it.

• External view (aka Outer Semidirect Product): Given two groups ${H}$ and ${K}$ and a map ${\phi:K\rightarrow Aut(H)}$, we define a new group ${H\rtimes K}$. We’ll denote by ${\phi_k(h)}$ the effect of the automorphism ${\phi(k)}$ on ${h}$ (and thus an element of ${H}$). Set-wise, ${H\rtimes K}$ is just ${H\times K}$ (i.e. all pairs ${(h,k)}$). The identity is ${(e,e)}$. Multiplication on ${H\rtimes K}$ is defined as ${(h,k)(h',k')= (h\phi_k(h'),kk')}$. The inverse is ${(h,k)^{-1}= (\phi_{k^{-1}}(h^{-1}),k^{-1})}$.
• Internal view (aka Inner Semidirect Product): Given a group ${G}$ and two disjoint subgroups ${N}$ and ${K}$, such that ${G=NK}$ and ${N}$ is a Normal Subgroup, ${G}$ is called the Semidirect product ${N\rtimes K}$. The normality of ${N}$ constrains ${K}$ to be isomorphic to the Quotient Group ${G/N}$.
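The External-view multiplication rule is easy to implement directly. A sketch for finite groups (the helper and the ${C_4\rtimes C_2}$ example are my own illustrations; the nontrivial ${\phi}$ here is inversion, which yields the dihedral group of order 8):

```python
from itertools import product

def semidirect(H, K, op_h, op_k, phi):
    """Outer semidirect product: elements are pairs (h, k), with
    (h, k)(h', k') = (h . phi(k)(h'), k k')."""
    elements = list(product(H, K))
    def mul(a, b):
        (h1, k1), (h2, k2) = a, b
        return (op_h(h1, phi(k1)(h2)), op_k(k1, k2))
    return elements, mul

# C4 x| C2 with the inversion automorphism h -> -h: the dihedral group of order 8
C4, C2 = [0, 1, 2, 3], [0, 1]
add4 = lambda a, b: (a + b) % 4
add2 = lambda a, b: (a + b) % 2
phi = lambda k: (lambda h: h if k == 0 else (-h) % 4)

elements, mul = semidirect(C4, C2, add4, add2, phi)
print(len(elements))                              # 8 elements
print(mul((1, 0), (0, 1)), mul((0, 1), (1, 0)))   # (1, 1) vs (3, 1): non-abelian
```

Swapping in the trivial ${\phi}$ (every ${k}$ acting as the identity) makes the same code produce the abelian direct product ${C_4\oplus C_2}$, which is the point of the progression described below.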

• There are (potentially) many Semidirect products of two given groups, obtained via different choices of ${\phi}$. The notation is deceptive because it hides our choice of ${\phi}$. Given any ${H,K,\phi}$ there exists a Semidirect product ${H\rtimes K}$. The various Semidirect products may be isomorphic to one another, but in general need not be. I.e., a given ${H}$ and ${K}$ may have multiple distinct semidirect products. This actually happens. Wikipedia mentions that there are 4 non-isomorphic semidirect products of ${C_8}$ and ${C_2}$ (the former being the Normal Subgroup in each case). One is a Direct Product, and the other 3 are not.
• It also is possible for a given group ${G}$ to arise from several distinct Semidirect products (of different pairs of groups). Again from Wikipedia, there is a group of order 24 which can be written as 4 distinct semidirect products of groups.
• Yet another oddity is that a seemingly nontrivial ${H\rtimes K}$ can be isomorphic to ${H\oplus K}$.
• If ${\phi}$ is trivial (i.e. every ${k}$ maps to the identity map on ${H}$), then ${G=H\oplus K}$.
• To go from the External view to the Internal one, we note that, by construction, ${H}$ is a Normal Subgroup of ${H\rtimes K}$ and ${K}$ is the Quotient Group ${G/H}$. To be precise, the Normal Subgroup is ${(H,e)}$, which is isomorphic to ${H}$, and the Quotient Group ${G/(H,e)}$ is isomorphic to ${K}$.
• To go from the Internal view to the External one, we choose ${\phi_k(h)= khk^{-1}}$ as our function. I.e., ${\phi}$ is just conjugation by the relevant element.
• It may seem like there is an imbalance here. For a specific choice of Normal Subgroup ${N}$, the External view offers complete freedom of ${\phi}$, while the Internal view has a fixed ${\phi}$. Surely the latter is a special case of the former. The fallacy in this is that we must consider the pair ${(G,N)}$. We very well could have non-isomorphic ${G,G'}$ with Normal Subgroups ${N,N'}$ where ${N\approx N'}$. I.e. they are the same Normal Subgroup, but with different parent groups. We then would have different ${\phi}$‘s via our Internal view procedure. The correspondence is between ${(H,K,\phi)}$ and ${(G,N,K)}$ choices. Put differently, the freedom in ${\phi}$ loosely corresponds to a freedom in ${G}$.
• Note that, given ${G}$ and a Normal Subgroup ${N}$ — with the automatic Quotient Group ${G/N}$ — we do NOT necessarily have a Semidirect product relationship. The condition of the Semidirect product is stricter than this. As we will see, it requires not just an isomorphism, but a specific isomorphism, between ${K}$ and ${G/N}$. Equivalently, it requires a Right-Split SES (as we will discuss).
• The multiplication defined in the External view may seem very strange and unintuitive. In essence, here is what’s happening. For a direct product, ${H}$ and ${K}$ are independent of one another. Each half of the pair acts only on its own elements. For a semidirect product, the non-normal half ${K}$ can twist the normal half ${H}$. Each element of ${K}$ can alter ${H}$ in some prescribed fashion, embodied in ${\phi(k)}$. So ${K}$ is unaffected by ${H}$ but ${H}$ can be twisted by ${K}$.
• It is interesting to compare the basic idea to that of a Fiber bundle. There, the fiber can twist (via a group of homeomorphisms) as we move around the base space. Here, the normal subgroup can twist as we move around the non-normal part. Each generalizes a direct product and measures our need to depart from it.
• The semidirect product of two groups is Abelian iff it’s just a direct product of abelian groups.
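The quoted count of four semidirect products of ${C_8}$ and ${C_2}$ can be made plausible by enumerating the candidate ${\phi}$‘s: a homomorphism ${\phi:C_2\rightarrow Aut(C_8)}$ must send the generator to an automorphism squaring to the identity, and ${Aut(C_8)}$ consists of multiplication by the units mod 8. A sketch (that the four resulting groups are pairwise non-isomorphic requires a separate check):

```python
from math import gcd

# Aut(C8): written additively, automorphisms of Z/8 are h -> u*h for units u
units = [u for u in range(1, 8) if gcd(u, 8) == 1]

# phi must send the generator of C2 to an automorphism of order dividing 2,
# i.e. a unit with u*u = 1 (mod 8)
involutions = [u for u in units if (u * u) % 8 == 1]

print(units)        # [1, 3, 5, 7]
print(involutions)  # all four qualify: four candidate phi's
```

The unit ${u=1}$ is the trivial ${\phi}$ giving the direct product, matching the statement that one of the four is ${C_8\oplus C_2}$.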

— Group Extensions —

As with Semidirect products, there are 2 ways to view these. To make matters confusing, the notation speaks to an Internal view, while the term “extension” speaks to an External view.

• External view: Given groups ${A}$ and ${C}$, we say that ${B}$ is an extension of ${C}$ by ${A}$ if there is a SES ${1\rightarrow A\rightarrow B\rightarrow C\rightarrow 1}$.
• Internal view: Given a group ${G}$ and Normal Subgroup ${N\triangleleft G}$, we say that ${G}$ is an extension of ${Q}$ by ${N}$, where ${Q=G/N}$ is the Quotient Group.
• Note that the two are equivalent. If ${B}$ is an extension of ${C}$ by ${A}$, then ${A}$ is Normal in ${B}$ and ${C}$ is isomorphic to the Quotient Group ${B/A}$.
• Put simply, the most general form of the Group, Normal Subgroup, induced Quotient Group trio is the Group Extension.

— Direct Products, Semidirect Products, and Group Extensions —

In the External view, we’ve mentioned three means of getting a group ${B}$ from two groups ${A}$ and ${C}$:

• Direct Product: ${B=A\oplus C}$. This is unique.
• Semidirect Product: ${B=A\rtimes C}$. There may be multiple of these, corresponding to different ${\phi}$‘s.
• Group Extension: A group ${B}$ for which there are 2 homomorphisms forming a SES ${1\rightarrow A\rightarrow B\rightarrow C\rightarrow 1}$. There may be many of these, corresponding to different choices of the two homomorphisms.

Equivalently, we have several ways of describing the relationship between two subgroups ${H,K\subseteq G}$ which are disjoint (i.e. ${H\cap K=\{e\}}$).

• Direct Product: ${G=H\oplus K}$ requires that both be Normal Subgroups.
• Semidirect Product: ${G=H\rtimes K}$ requires that ${H}$ be normal (in which case, ${Q=G/H}$, and ${\phi}$ is determined by it). For a given ${H}$ there may be multiple, corresponding to different ${G}$‘s.
• Group Extension: ${H}$ must be Normal and sits in ${G}$ as a subgroup; ${K}$ is isomorphic to the Quotient Group ${G/H}$ but need not correspond to any subgroup of ${G}$.

Note that not every possible relationship amongst groups is captured by these. For example, we could have two non-normal subgroups or two homomorphisms which don’t form an SES, or no relationship at all.

An excellent hierarchy of conditions was provided by Arturo Magidin in answer to someone’s question on Stackoverflow. I roughly replicate it here. Unlike him, I’ll be sloppy and not distinguish between subgroups and groups isomorphic to subgroups.

• Direct Product (${G=H\oplus K}$): ${H,K}$ both Normal Subgroups. ${H,K}$ disjoint. ${G=HK}$
• Semidirect Products (${G=H\rtimes K}$): ${H}$ Normal Subgroup, ${K}$ Subgroup. ${H,K}$ disjoint. ${G=HK}$. I.e., we lose Normality of ${K}$.
• Group Extension (${G}$ is extension of ${H}$ by ${K}$): ${H}$ Normal Subgroup, ${G/H\approx K}$. I.e. ${K}$ remains the Quotient Group (as before), but the Quotient Group may no longer be a subgroup of ${G}$ at all!

Now is a good time to mention the relationship between the various SES Splitting conditions:

• For all groups: Left Split is equivalent to ${B=A\oplus C}$, and they imply Right Split. (LS=DP) => RS always.
• For abelian groups, the converse holds and Right split implies Left Split and Direct Sum. I.e. the conditions are equivalent. LS=DP=RS for Abelian.
• For nonabelian groups: Right Split implies ${B=A\rtimes C}$ (with ${\phi}$ depending on the SES map). We’ll discuss this shortly.

Back to the hierarchy, now from a SES standpoint:

• Most general case: There is no SES at all. Given groups ${A,B,C}$, there may be no homomorphisms between them. If there are homomorphisms, there may be none which form an SES. Consider a general pair of homomorphisms ${f:A\rightarrow B}$ and ${g:B\rightarrow C}$, with no assumptions. We may turn to the first isomorphism theorem for help, but that does us no good. The first isomorphism theorem says that ${\ker f \triangleleft A}$ and ${\mathop{\text{im}} f\approx A/\ker f}$, and ${\ker g \triangleleft B}$ and ${\mathop{\text{im}} g\approx B/\ker g}$. This places no constraints tying ${A}$, ${B}$, and ${C}$ together.
• Group Extension: Any SES defines a group extension. They are the same thing.
• Semidirect Product: Any SES which right-splits corresponds to a Semidirect Product (with the right-split map determining ${\phi}$).
• Direct Product: Any SES which left-splits (and thus right-splits too) corresponds to a direct product.

So, when we see the standard SES: ${1\rightarrow N\rightarrow G\rightarrow G/N\rightarrow 1}$, this is a group extension. Only if it right splits can we write ${G= N\rtimes G/N}$, and only if it left splits can we write ${G= N\oplus G/N}$.
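A concrete instance: for ${S_3}$ we have the SES ${1\rightarrow A_3\rightarrow S_3\rightarrow C_2\rightarrow 1}$, which right-splits (any transposition generates a section of the sign map, so ${S_3= A_3\rtimes C_2}$) but does not left-split (otherwise ${S_3}$ would be the abelian direct product ${A_3\oplus C_2}$). A brute-force sketch:

```python
from itertools import permutations, product

# S3 as permutations of (0, 1, 2); composition (p*q)(i) = p[q[i]]
S3 = list(permutations(range(3)))
compose = lambda p, q: tuple(p[q[i]] for i in range(3))

def sign(p):  # parity of a permutation: 0 = even, 1 = odd
    return sum(1 for i in range(3) for j in range(i + 1, 3) if p[i] > p[j]) % 2

# Right split of 1 -> A3 -> S3 -> C2 -> 1: send the generator of C2 to a
# transposition t; since t*t = e and sign(t) = 1, it is a section of sign.
t = (1, 0, 2)
right_splits = compose(t, t) == (0, 1, 2) and sign(t) == 1

# A left split would be a homomorphism S3 -> Z3 restricting to an isomorphism
# on A3.  Brute force over all 3^6 maps: only the trivial homomorphism exists.
homs = []
for vals in product(range(3), repeat=len(S3)):
    h = dict(zip(S3, vals))
    if all(h[compose(p, q)] == (h[p] + h[q]) % 3 for p in S3 for q in S3):
        homs.append(h)

print(right_splits, len(homs))  # True 1: right-splits, but cannot left-split
```

Every homomorphism ${S_3\rightarrow Z_3}$ must kill the commutator subgroup ${A_3}$, so none can be injective on ${A_3}$ — the brute force simply confirms this.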

— Some Notes —

• Group Extensions are said to be equivalent if their ${B}$‘s are isomorphic and there exists an isomorphism between them which makes a diamond diagram commute. It is perfectly possible for the ${B}$‘s to be isomorphic but for two SES’s not to be equivalent extensions.
• Subtlety referred to above: a Quotient Group need not be isomorphic to a subgroup of ${G}$. It is only defined when ${N}$ is normal, and there automatically is a surjective homomorphism ${G\rightarrow Q}$. But we don’t have an injective homomorphism ${Q\rightarrow G}$, which is what would be needed for it to be isomorphic to a subgroup of ${G}$. This is precisely what the right-split furnishes. In that case, ${Q}$ is indeed (isomorphic to) a subgroup of ${G}$. The semidirect product may be thought of as the statement that ${Q}$ is a subgroup of ${G}$.
• In the definition of right split and left split, the crucial aspect of the “inverse” maps is that they be homomorphisms. A simple injective (for right-split, or surjective for left-split) map is not enough!
• It is sometimes said that the concept of subgroup is dual to the concept of quotient group. This is intuitive in the following sense. A subgroup can be thought of as an injective homomorphism. By the SES for normal/quotient groups, we can think of a quotient group as a surjective homomorphism. Since injections and surjections are categorically dual, it makes sense to think of quotient groups and subgroups as similarly dual. Whether the more useful duality is subgroup / quotient group or normal subgroup / quotient group is unclear to me.

# Differential Entropy

This originally appeared on my tech blog. It’s primarily notes on the subtleties of differential entropy, but also contains a review of discrete entropy, various entropy-related information quantities such as mutual information, and a listing of various axiomatic formulations.