Statistics, Foundations of


Thorny conceptual issues arise at every turn in the ongoing debate between the three major schools of statistical theory: the Bayesian (B), likelihood (L), and frequentist (F). (F) rather uneasily combines the Neyman-Pearson-Wald conception of statistics as "the science of decision making under uncertainty" with Ronald A. Fisher's theories of estimation and significance testing, viewed by him as inferential. However, in keeping with his frequentist conception of probability, Fisher viewed the inferential theory of Thomas Bayes and Pierre Simon de Laplace as applicable only where the needed prior probability inputs are grounded in observed relative frequencies. Maximum likelihood estimates and significance tests were intended as substitutes for Bayesian inference in all other cases. (F), (B) and (L) all provide a framework for comparatively appraising statistical hypotheses, but Fisher questioned whether one can fruitfully assimilate the weighing of evidence to decision making.

Given the response probabilities for a diagnostic test shown in Table 1:

                 Positive    Negative
Infected (h)       0.95        0.05
Uninfected (k)     0.02        0.98

one may, following Richard M. Royall (1997, p. 2), usefully distinguish three questions of evidence, belief, and decision when a subject (S) tests positive:

  • Q1. Is this result evidence that S has the disease?
  • Q2. What degree of belief that S has the disease is warranted?
  • Q3. Should S be treated for the disease?

(L) addresses only Q1 and does so by what Ian Hacking (1965) dubs the law of likelihood (LL):

evidence e supports hypothesis h over k if and only if P (e |h ) > P (e |k ); moreover, the likelihood ratio (LR), P (e |h ) : P (e |k ), measures the strength of the support e accords h over k.

The LL follows from Bayes's fundamental rule for revising a probability assignment given new data. Indeed, Laplace arrived (independently) at this rule by appeal to the intuition that the updated odds in favor of h against k in light of e should be the product of the initial odds by the LR (Hald 1998, p. 158):
(1)      P (h |e )/P (k |e ) = [P (h )/P (k )] × [P (e |h )/P (e |k )]
If the rival (mutually exclusive) hypotheses h and k are treated as exhaustive, so that their probabilities sum to one, then (1) yields the usual form of Bayes's rule:
(2)      P (h |e ) = P (e |h )P (h )/P (e )
with P (e ) usually given in the general case by the partitioning formula:
(3)      P (e ) = P (e |h 1)P (h 1) + … + P (e |h n)P (h n)
with the (mutually exclusive) considered hypotheses h 1, …, h n treated as exhaustive.

One also sees how (B) answers Q2 by multiplying the initial odds, based on what is known about the incidence of the disease, by the LR of 95/2 provided by a positive reaction. If the incidence of the disease is even as low as 1 per 1,000, the posttest (or "posterior") probability of infection may still lie well below 50 percent. Notice, too, that knowledge of the infection rate may rest on the same sort of empirical frequency data that underwrites the conditional probabilities of Table 1. When this is true, (L) and (F) have no qualms about applying (2) to answer Q2. They do not question the validity of (2), only whether the initial probabilities needed to apply it can be freed of the taint of subjectivism.
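To make the arithmetic concrete, the short Python sketch below (an illustration added here, not part of Royall's discussion) applies rules (1) and (2) to the figures of Table 1, taking the 1-in-1,000 incidence rate mentioned above as the prior:

    # Posterior probability of infection after a positive test (Bayes's rule).
    # Sensitivity and false-positive rate are taken from Table 1; the 1-in-1,000
    # incidence is the illustrative prior mentioned in the text.
    p_pos_given_infected = 0.95    # P(positive | infected)
    p_pos_given_clean    = 0.02    # P(positive | uninfected)
    prior = 0.001                  # P(infected), assumed incidence of 1 per 1,000

    likelihood_ratio = p_pos_given_infected / p_pos_given_clean   # 47.5 = 95/2
    prior_odds = prior / (1 - prior)                               # 1 : 999
    posterior_odds = prior_odds * likelihood_ratio                 # rule (1)
    posterior = posterior_odds / (1 + posterior_odds)              # rule (2)

    print(f"LR = {likelihood_ratio:.1f}, P(infected | positive) = {posterior:.3f}")
    # roughly 0.045, i.e., well below 50 percent, as the text notes

The posterior of roughly 4.5 percent illustrates how a test with impressive response probabilities can still leave infection improbable when the disease is rare.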

The Likelihood Principle

Statistical hypotheses typically assign values to one or more parameters of an assumed probability model of the experiment, for example, to the mean of a normal distribution or the probability of success in a sequence of Bernoulli trials. If θ is such a parameter and X the experimental random variable then
P (x |θ )
is called the sampling distribution when considered as a function of the observation x and the likelihood function qua function of θ.

The case of randomly sampling an urn with replacement, with p the population proportion of white balls, affords a simple illustration. Then the probability of x white and n − x black in a sample of n is given by the binomial (sampling) distribution:
P (x |p ) = C (n, x ) p ^x (1 − p )^(n − x)
For comparing two hypotheses about p by the LR, the binomial coefficients cancel and so one may ignore them and define the likelihood function for this experiment by:
L (p ) = p ^x (1 − p )^(n − x)
The value of p, which maximizes L (p ), is called the maximum likelihood (ML) estimate of p and is easily found, by calculus, to be x/n, the observed sample proportion (of white balls) or successes.
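As a quick numerical check, one can evaluate L (p ) on a grid and confirm that the maximum falls at the sample proportion; the sketch below (illustrative only) uses the counts x = 12 successes in n = 30 trials from the Jay/May example that follows:

    # Binomial likelihood L(p) = p**x * (1 - p)**(n - x), maximized at p = x/n.
    n, x = 30, 12

    def L(p):
        return p**x * (1 - p)**(n - x)

    grid = [i / 1000 for i in range(1, 1000)]
    p_hat = max(grid, key=L)
    print(p_hat, x / n)   # both close to 0.4, the observed sample proportion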

Consider, next, a second experiment in which one samples until the first success is observed. This happens on trial n with probability p (1 − p )^(n − 1), since n − 1 failures must precede the first success. More generally, if one samples until the r th success is observed, this happens on trial n with probability:
C (n − 1, r − 1) p ^r (1 − p )^(n − r )
which reduces to p (1 − p )^(n − 1) when r = 1. This sampling distribution is called the negative binomial (or waiting time) distribution; it gives rise to the same likelihood function as the first experiment.

Now suppose Jay elects to observe n = 30 trials and finds x = 12 successes, while May elects to sample until she finds r = 12 successes but that happens to occur on the thirtieth trial. In a literal sense, both experimenters have observed the same thing: twelve successes in thirty Bernoulli trials. One would think they would then draw the same conclusions. (F) violates this prescription, called the likelihood principle (LP). In so doing (F) allows the experimenter's intentions about when to stop sampling to influence the evidential import of what is observed. It also makes the import of the outcome observed dependent on the entire sample space, hence, on outcomes that might have been but were not observed (see de Groot 1986, p. 417). By the same token, the unbiased estimators favored by (F), those centered on the true value of the parameter, violate the LP (p. 417), since this concept depends on all possible values of the estimator. Thus, the unbiased estimates of p are, respectively, x /n and (r − 1)/(n − 1) for the two previous experiments. The LP virtually defines the difference between (B) and (L), on the one hand, and (F), on the other.
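The point is easily verified numerically. The sketch below (illustrative only) computes the LR of p = ¾ against p = ¼ under both sampling rules and shows that the binomial coefficients cancel, leaving identical ratios:

    # Likelihood principle illustration: Jay's binomial experiment (n = 30 fixed,
    # x = 12 successes) and May's negative binomial experiment (r = 12 successes,
    # observed on trial n = 30) yield proportional likelihood functions, so the
    # likelihood ratio between any two values of p is the same for both.
    from math import comb

    n, x, r = 30, 12, 12

    def binom_lik(p):
        return comb(n, x) * p**x * (1 - p)**(n - x)

    def negbinom_lik(p):
        return comb(n - 1, r - 1) * p**r * (1 - p)**(n - r)

    p0, p1 = 0.25, 0.75
    print(binom_lik(p1) / binom_lik(p0))        # LR for Jay
    print(negbinom_lik(p1) / negbinom_lik(p0))  # identical LR for May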

In effect, (B) and (L) charge (F) with inconsistency, with basing different assessments of the evidence (or different decisions to accept or reject hypotheses) on equivalent outcomes, for two outcomes are accounted equivalent by the LP if they define the same likelihood function. This charge of inconsistency can be carried to a higher metalevel since (F) accepts Bayes's rule (2), and with it the LP, when the prior probabilities are known from past frequency data. Hence (F) finds itself in the odd position of accepting or rejecting the LP according as the prior probabilities are "known" or "unknown." Charges of inconsistency are the weapon of choice in the ongoing battles between the three schools, beginning with the charge that Bayes's postulate for assigning a uniform distribution to a parameter about which nothing is known leads to inconsistent assignments. In the sequel, one will explore how consistency may be used instead to forge agreement.

Fisherian Significance Tests

Fisher (1935, chapter 2, the locus classicus ) presented significance tests as analogues of the logicians' modus tollens : if A then B, not-B / therefore not-A. When the probability, P (e |h 0), falls below α, one counts e as evidence against h 0, the smaller α, the stronger the evidence. As Fisher describes it, the logic is "that of a simple disjunction: Either an exceedingly rare chance has occurred, or the theory is not true." Using (2), the probabilistic analogue of modus tollens is:
P (A |not-B ) = P (A ) × P (not-B |A )/P (not-B )
which shows that for not-B to seriously infirm A requires, not merely that P (not-B |A ) be small, but that it be small relative to P (not-B ), so that some alternative to A must accord not-B a higher probability.

Much of Fisher's practice conforms to this precept. In his famous example of the tea-tasting lady (1935), the lady claims that she can tell whether tea or milk was infused first in a mixture of the two. To test her claim she is asked to classify eight cups of which four are tea-first and the other four milk-first, but, of course, she does not know which four. The relevant statistic is the number R of correct classifications and its sampling distribution on the null hypothesis that she lacks such ability is:
P (R = 2r |h 0) = C (4, r )C (4, 4 − r )/C (8, 4),   r = 0, 1, …, 4
Notice, the probability that R = r on the alternative hypothesis of skill cannot be computed so that likelihood ratios do not exist. All that one has to work with is an intuitive rank ordering of the outcomes with larger values of R more indicative of skill. What P (R ≥ r*|h 0) measures may be verbalized as "the probability of obtaining, by chance, agreement with the hypothesis of skill as good as that observed" (Fisher 1935, p. 13). Although Fisher rejected the implication that by "disproving" the null hypothesis one "demonstrates" the alternative (p. 16), he also says that "we should admit that the lady had made good her claim" (p. 14) if she classified all eight cups correctly. He argues that one can (effectively) disprove the null hypothesis because it is "exact," while the alternative of skill is vague. However, this does not preclude one from adopting the natural view of most researchers that a significant result is evidence in favor of the alternative hypothesis. The null hypothesis is then cast in the subtly different role of a fixed point of comparison that permits computation of the relevant chance probability (Rosenkrantz 1977, chapter 9).
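A sketch of the null distribution, on the assumption that R counts all correctly classified cups (so that R = 2r when r of her four "tea-first" labels are right), is given below; in particular, perfect classification has chance probability 1/70, or about 1.4 percent:

    # Null distribution for the tea-tasting lady: if she merely guesses which 4
    # of the 8 cups are tea-first, and r of her 4 "tea-first" labels are right,
    # she classifies R = 2r cups correctly, with hypergeometric probability
    # C(4, r) * C(4, 4 - r) / C(8, 4).
    from math import comb

    def p_correct(r):                      # r = correctly labeled tea-first cups
        return comb(4, r) * comb(4, 4 - r) / comb(8, 4)

    for r in range(5):
        print(2 * r, "correct:", p_correct(r))

    print("P(all eight correct) =", p_correct(4))   # 1/70, about 0.014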

This is, in fact, the logic of most nonparametric tests, the Wilcoxon rank sum test for comparing two treatments being paradigmatic (see Hodges and Lehmann 1970, §§12.3–12.4, especially p. 333). Table 2 compares the survival times (in years) following a heart attack of t = 6 patients receiving a new treatment and s = 4 controls receiving the standard treatment, with their ranks in parentheses.

Treated     7.3 (4)   17.2 (1)   6.1 (6)   11.4 (3)   15.8 (2)   5.2 (7)
Controls    1.4 (9)    0.6 (10)  5.0 (8)    6.7 (5)

The sum, W t, of the ranks of the t treated patients is a suitable test statistic, and under the null hypothesis that the new treatment is no better than the old, all assignments of ranks 1 through 10 to the six treated patients are equiprobable. Hence, the proportion of possible rank assignments that give a rank sum as small as the observed value, W t = 1 + 2 + 3 + 4 + 6 + 7 = 23, measures the strength of the evidence, the smaller this proportion the stronger the evidence of improved efficacy. Since only three other possible rank assignments give a sum as small as the observed value of W t, the relevant proportion is 4/210 = .019, or about 2 percent.
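The exact enumeration is short enough to carry out directly; the sketch below reproduces the 4/210 figure by brute force:

    # Exact permutation computation of the Wilcoxon rank-sum tail probability for
    # Table 2: of the C(10, 6) = 210 equally likely ways to assign ranks 1-10 to
    # the six treated patients, how many give a rank sum as small as W_t = 23?
    from itertools import combinations

    observed = 4 + 1 + 6 + 3 + 2 + 7                   # ranks of treated patients
    assignments = list(combinations(range(1, 11), 6))
    as_small = sum(1 for a in assignments if sum(a) <= observed)
    print(observed, len(assignments), as_small, as_small / len(assignments))
    # 23, 210, 4, about 0.019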

This same form of argument also enjoys widespread currency in the sciences, as when an anthropologist maintains that certain cultural commonalities are too numerous and striking to be ascribed to parallel development and point instead to contact between two civilizations, or when an evolutionist argues that the structural similarities between two organs that do not even perform the same function in two species are homologous and not merely analogous, hence indicative of common ancestry. Indeed, the rationale behind the principle of parsimony (that a phylogeny is more plausible if it requires fewer evolutionary changes) is this same piling up of otherwise improbable coincidences. And how improbable that various methods of reconstructing a phylogeny (for example, the ordering of fish, amphibians, reptiles, and mammals) based on the fossil record, homologies, serology, or DNA and protein sequencing should all agree if the phyla in question were separately created?

Fisher's foremost contribution to the design of experiments, randomization, also fits this logic (Fisher 1935, pp. 17–21, 41–44, 62–66). If, for example, the treated subjects of Table 2 were all younger than the controls, they might be expected to live longer in any case. However if, after controlling for such plainly relevant differences, the patients were assigned at random to the two groups, the chances are just one in 210 that all treated subjects will share some hidden trait conducive to longevity that is lacking in the controls, thus removing any suspicion of selection bias. In addition, randomization underwrites the probability model of the experiment from which the sampling distribution of the chosen test statistic, W t, is deduced (for a more leisurely discussion of randomization, see Hodges and Lehmann 1970, §12.1).

Since significance tests apply, on this reading, only when the likelihood function does not exist, they can be viewed as complements rather than alternatives to the methods of (B) or (L). Seen in this positive light, significance tests have a deeper Bayesian rationale. For the paucity of possible outcomes that a model (with zero or more adjustable parameters) accommodates measures the support in its favor when the observed outcome belongs to this accommodated set (Rosenkrantz 1977, chapter 5). Echoing I. J. Good (who echoed Fisher), to garner support requires not just accuracy but improbable accuracy.

Moreover, the present formulation resolves many of the controversies that have swirled about significance testing (see Morrison and Henkel 1970), above all, the question whether a significant outcome with a small sample constitutes stronger evidence against null than one with a large sample (see Royall 1997, pp. 70–71). If, in fact, the chance probability of agreement with the causal hypothesis of interest is the same in both cases, the evidence in favor of that causal hypothesis is also equally strong.

All these advantages notwithstanding, significant test results are still most widely viewed as evidence against the null hypothesis and, indeed, without reference to alternative hypotheses (see Fisher 1935, pp. 15–16; 1956, pp. 40–42; and for a critique of this viewpoint, Royall 1997, chapter 3). Thus, one classifies the observed outcome as evidence for or against h 0 not by comparing its probability on h 0 to its probability on alternative hypotheses but by comparing its probability on h 0 with that of other possible outcomes.

Neyman-Pearson Theory

In the late 1920s Jerzy Neyman and Egon S. Pearson (henceforth, NP) set forth a new approach to the testing of statistical hypotheses. Although initially presented as a refinement of Fisherian significance testing, NP actually addressed the different problem of testing one hypothesis against one or more alternatives in situations where the likelihoods do exist. In such cases, Fisher's practice, in accord with (L), was to compare the relevant hypotheses by their likelihoods. NP proposed, instead, to lay down in advance a rule of rejection, that is, a critical region R of the space of outcomes such that the tested hypothesis is rejected just in case the outcome actually observed falls in R.

In the simplest case of testing one point hypothesis, h 0 :θ = θ 0 against another, h 1 :θ = θ 1, called simple dichotomy, one can err not only by rejecting h 0 when it is true but also by accepting h 0 when the alternative hypothesis, h 1, is true. Plainly, one cannot reduce both these error probabilities,
α = P (X ∈ R |h 0)
and
β = P (X ∉ R |h 1)
without increasing the sample size. NP's recommended procedure was to so label the hypotheses that rejecting h 0 is the more serious error, fix α at a tolerable level, α 0, called the size or significance level of the test, and then among all tests of this size, α ≤ α 0, choose the one that minimizes β, or, equivalently, maximizes the power 1 − β. The test is thus chosen as the solution of a well-defined optimization problem, a feature modeled on Fisher's approach to estimation. The fundamental lemma of NP theory then affirms the existence of a unique solution, that is, the existence of a most powerful test of a given size. Finally, test statistics could then be compared in terms of their power. The overall effect was to unify point estimation, interval estimation (confidence intervals), and testing under the broader rubric of "decision making under uncertainty," a viewpoint made explicit in the later work of Abraham Wald. In this scheme of things, estimates, confidence intervals, and tests are to be judged solely in terms of such performance characteristics as their mean squared error or their error probabilities. That is, arguably, the feature of the approach that continues to exercise the most powerful influence on the orthodox (i.e., frequentist) school (see Hodges and Lehmann 1970, chapters 11–13; de Groot 1986, chapter 7).

These developments occurred in such rapid succession that they have yet to be fully digested. NP had uppermost in mind massed tests like screening a population for a disease, testing a new drug, or industrial sampling inspection where the same practical decision, such as classifying a patient as infected or uninfected, must be faced repeatedly. For such situations, a reliable rule that controls for the probability of error seemed preferable to an explicitly (Bayesian) decision theoretic treatment that would require prior probabilities that the statistician could not base on any objective rule, as well as on loss or utility functions that would vary even more from one policy maker to another. To be sure, one might know the distribution of the proportion of defectives from past experience with a manufacturing process and be able to supply objective cost functions, but such cases would be uncommon.

But even in cases where an assembly line approach seems appropriate, NP's recommended procedure is open to question. If the more serious type 1 error is deemed, say, a hundred times more serious than the less serious type 2 error, should one not prefer a test whose probability of committing the more serious error is correspondingly less than its probability of committing the less serious error? In short, why not minimize the weighted sum, 100α + β ? After all, the result of fixing α at some tolerable level, then minimizing β, might be to drive β much lower than α, which is wasteful, or else to drive β so high as to render the test powerless. This point is not merely academic, for a random sample of some seventy-one clinical trials revealed that overemphasis on controlling type 1 error probability led to a 10 percent risk of missing a 50 percent therapeutic improvement (Good 1983, p. 144).

To minimize the total risk, aα + bβ, one finds, writing f i(x ) = P (X = x |h i), i = 0, 1, that
aα + bβ = a Σx ∈R f 0(x ) + b [1 − Σx ∈R f 1(x )] = b + Σx ∈R [a f 0(x ) − b f 1(x )]
Hence, the total risk is minimized by making a f 0(x ) − b f 1(x ) < 0 for all x ∈ R. Then h 0 is rejected when
f 1(x ) : f 0(x ) > a : b
which says: Reject h 0 (in favor of h 1) when the LR in favor of h 1 exceeds the relative seriousness, a : b, of the two kinds of error. More advanced readers will recognize this as a Bayesian decision rule for the special case of constant regret functions, appropriate in situations where "a miss is as good as a mile," and equal prior probabilities. In the general case, one may interpret a : b as the product of the prior odds by the ratio of the regrets. The fundamental lemma then drops out as an easy corollary (de Groot 1986, p. 444), where the most powerful test of size α has critical region, R = {x : f 1(x ) : f 0(x ) > k }, with k the least number for which P (X ∈ R |h 0) ≤ α. The main virtue of this approach, however, is that it allows one to adjust the sample size so as to achieve a tolerable level of overall risk. Roughly speaking, one goes on sampling until the marginal cost of one more item exceeds the marginal risk reduction.
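To see the rule in action, the sketch below (an added illustration, with a hypothetical seriousness ratio a : b = 10 : 1) minimizes the total risk for the two-outcome diagnostic test of Table 1 and confirms that the LR rule picks out the least risky critical region:

    # Minimizing the total risk a*alpha + b*beta by the likelihood-ratio rule:
    # reject h0 exactly for those outcomes x with f1(x)/f0(x) > a/b.  The outcome
    # probabilities come from Table 1; the ratio a : b = 10 : 1 is hypothetical.
    f0 = {"positive": 0.02, "negative": 0.98}   # P(x | h0), uninfected
    f1 = {"positive": 0.95, "negative": 0.05}   # P(x | h1), infected
    a, b = 10, 1                                # relative seriousness of the errors

    R = {x for x in f0 if f1[x] / f0[x] > a / b}      # LR critical region
    alpha = sum(f0[x] for x in R)                     # P(X in R | h0)
    beta = sum(f1[x] for x in set(f0) - R)            # P(X not in R | h1)
    print(R, alpha, beta, a * alpha + b * beta)

    # Every other critical region gives a total risk at least as large:
    for other in [set(), {"positive"}, {"negative"}, {"positive", "negative"}]:
        risk = a * sum(f0[x] for x in other) + b * sum(f1[x] for x in set(f0) - other)
        print(other, risk)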

NP's decision theoretic formulation notwithstanding, users of statistical tests have continued to interpret them as evidence and to view NP tests as a refinement of Fisher's significance tests. One reason for this is that NP continued to use the language of hypothesis testing, of accepting or rejecting hypotheses. A more important reason is that in many, if not most, scientific inquiries, practical decisions are nowhere in view. Even where questions of public policy impinge, as in the smoking-cancer or charter school controversies, it is deemed necessary to first weigh the evidence before deciding what policy or legislation to adopt. The tendency of NP is to subsume the individual test under a rule of specifiable reliability. Rejection of h 0 at a 5 percent level does not mean that the probability is 0.05 that a type 1 error was committed in this case, much less that h 0 has probability 0.05 given the outcome. The error probability refers to the procedure, not the result. However, this raises new concerns.

Consider a test of normal means of common (known) variance, σ ², h 0: μ = μ 0 versus h 1: μ = μ 1. The optimal 5 percent test rejects h 0 when x̄ ≥ μ 0 + 1.645σ /√n, where n is the sample size and x̄ = (x 1 + … + x n)/n is the sample mean. For as Carl Friedrich Gauss first showed, x̄ ~ N (μ, σ ²/n ), that is, the sample mean of independent and identically distributed normal variates, X i ~ N (μ, σ ²), is normally distributed about their common mean, μ, with variance σ ²/n, or precision n /σ ², n times that of a single measurement. For example, if μ 0 = 0, μ 1 = σ ² = 1, and n = 30, so that σ /√n = 0.18, then h 0 is rejected when x̄ ≥ .30. However, x̄ = .30 is .70/.18 = 3.89 standard deviation units below the mean of μ = 1 posited by h 1, and thus much closer to μ 0 = 0. It is strange that such an observation should be interpreted as strong evidence against h 0. Indeed, the LR given a random sample of n measurements is:
f 1/f 0 = exp{[Σi (x i − μ 0)² − Σi (x i − μ 1)²]/(2σ ²)}
which, using Σi x i = nx̄, simplifies further to:
(4)      f 1/f 0 = exp{n [x̄ (μ 1 − μ 0) − ½(μ 1² − μ 0²)]/σ ²}
And with the values chosen, this specializes at the boundary point, x̄ = 1.645σ /√n, to
f 1/f 0 = exp(1.645√n − 0.5n )
which tends to zero as n → ∞. Even at a modest n = 30 one finds:
f 1/f 0 = exp(1.645√30 − 15) ≈ 0.0025 = 1/400
or an LR in favor of the rejected h 0 of roughly 400:1.
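The boundary calculation is easily reproduced; the sketch below (an added illustration) evaluates exp(1.645√n − 0.5n ), the specialization of (4) at the rejection boundary, for several sample sizes:

    # LR of h1 (mu = 1) against h0 (mu = 0), sigma = 1, evaluated at the boundary
    # of the one-sided 5 percent rejection region, x_bar = 1.645 / sqrt(n).
    from math import exp, sqrt

    def lr_at_boundary(n):
        x_bar = 1.645 / sqrt(n)
        return exp(n * (x_bar - 0.5))

    for n in (10, 30, 100):
        print(n, lr_at_boundary(n))
    # For n = 30 the ratio is about 0.0025, i.e., roughly 400 : 1 in favor of the
    # "rejected" h0, and it shrinks toward zero as n grows.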

Thus, one has a recognizable subset of the critical region, namely outcomes at or near the boundary, which more and more strongly favor the rejected hypothesis. The 5 percent significance level is achieved by a surreptitious averaging, for the critical region is built up by incorporating outcomes that give LR's greater than a critical value, starting with the largest LR and continuing until the size of the test is .05. Those first included give evidence against h 0 stronger than the significance level indicates, but the last few included often favor h 0. Better disguised examples of this phenomenon drawn from actual frequentist practice are given in chapter 9 of Jaynes (1983, especially pp. 182f), a critical comparison of orthodox and Bayesian methods that focuses on actual performance. For other criticisms of NP along these lines, see Fisher (1959, chapter 4), and John Kalbfleisch and D.A. Sprott, both of which repay careful study.

It is clear as well that NP violates the LP. In the example of binomial versus negative binomial given earlier, Jay's most powerful 5 percent test of h 0: p = ¼ against h 1: p = ¾ rejects h 0 when X ≥ 12 successes occur in the n = 30 trials, while May's best 5 percent test rejects h 0 when n ≤ 29, that is, when the twelfth success occurs on or before the twenty-ninth trial. Hence, they reach opposite conclusions when Jay records twelve successes and May obtains the twelfth success on the thirtieth trial. Notice, too, the outcomes 12 and 13 of Jay's experiment both favor h 0, even though the error probabilities of Jay's test are eminently satisfactory, with α ≤ .05 and β = .0001.

In keeping with the LP, it seems perfectly permissible to stop sampling as soon as the accumulated data are deemed sufficiently strong evidence for or against the tested hypothesis. This is, after all, the idea behind Wald's extension of NP theory to sequential tests (see Hodges and Lehmann 1970, §6.10). Could it really make a difference whether one had planned beforehand to stop when the sample proportion of defectives exceeds B or falls below A or decided this on the spur of the moment? To continue sampling till the bitter end in keeping with a preset sample size may place experimental subjects in needless jeopardy or even cause their death (for a chilling real-life example, see Royall 1997, §4.6). Thus, the ongoing debate over optional stopping raises serious ethical, as well as methodological, concerns.

(B) and (L) also permit enlarging a promising study to solidify the evidence, but because this can only increase the type 1 error probabilities, NP disallows it. This further points to the need to separate the presampling design of an experiment from the postsampling analysis of the resulting data.

But what about the fraud who resolves to go on sampling until some targeted null hypothesis is rejected? The reply to this objection to optional stopping is that while such deception is, indeed, possible using standard NP tests, for the power of such a test, as illustrated earlier, approaches one as the sample size increases, the chances of such deception using a likelihood criterion are remote. Using the familiar mathematics of gambler's ruin (de Groot 1986, §2.4), one can show, for example, that the probability of achieving an LR of 32 in favor of a cure rate of 75 percent for a new drug against the 25 percent rate of the drug currently in use, which requires an excess of s − t ≥ 4 cures over noncures, is given by:
P = [(q /p )^m − 1]/[(q /p )^(m + 4) − 1]
with q = 1 − p, which increases rapidly to its limit of 1/81 as m → ∞.
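The limiting value of 1/81 is easily checked by simulation; the sketch below (an added illustration) tracks, under the 25 percent cure rate of the drug in use, how often the running LR in favor of the 75 percent rate, namely 3 raised to the excess of cures over noncures, ever reaches 32 within a long but finite run of trials:

    # Monte Carlo check on the "fraudulent optional stopping" bound: under a
    # 25 percent cure rate, how often does the running LR in favor of a
    # 75 percent cure rate (3**(cures - noncures)) ever reach 32?
    import random

    random.seed(1)

    def ever_reaches_32(max_trials=400):
        excess = 0                      # cures minus noncures
        for _ in range(max_trials):
            excess += 1 if random.random() < 0.25 else -1
            if excess >= 4:             # 3**4 = 81 > 32, so LR >= 32
                return True
        return False

    runs = 20_000
    hits = sum(ever_reaches_32() for _ in range(runs))
    print(hits / runs)   # close to 1/81, about 0.0123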

In espousing an evidential interpretation of NP, Egon S. Pearson speaks of "a class of results which makes us more and more inclined to reject the hypothesis tested in favor of alternatives which differ from it by increasing amounts" (1966, p. 173). Deborah G. Mayo, who defends an evidential version of NP, remarks that "one plausible measure of this inclination is the likelihood" (1996, p. 389), but Pearson rejects this on the grounds that "if we accept the criterion suggested by the method of likelihood it is still necessary to determine its sampling distribution in order to control the error involved in rejecting a true hypothesis" (quoted by Mayo 1996, p. 393). What Pearson, Mayo, and others fail to appreciate, however, is the possibility of retaining the law of likelihood while still assessing and controlling beforehand the probability of obtaining misleading results.

If an LR, L = f 1/f 0, greater than L* is accounted strong evidence in favor of h 1 against h 0, then one may compute P (f 1/f 0 ≥ L*|h 0) as readily as one computes α = P (X ∈ R |h 0), and in place of β = P (X ∉ R |h 1) one may compute P (f 1/f 0 < L*|h 1), which is the probability of misleading evidence against h 1. (It should be emphasized that it is the evidence itself that is misleading, not one's interpretation of it.)

An important general result, noted independently by C. A. B. Smith and Alan Birnbaum, affirms that the probability of obtaining an LR of at least k in favor of h 1 when h 0 holds is at most 1/k :
(5)      P (f 1/f 0 ≥ k |h 0) ≤ k ^(−1)
For if S is the subset of outcomes for which the LR is at least k, then
P (S |h 0) = Σx ∈S f 0(x ) ≤ Σx ∈S f 1(x )/k ≤ 1/k
Naturally, this universal bound can be considerably sharpened in special cases, as in the example of a would-be fraud. A specially important case is that of testing hypotheses about a normal mean of known variance with LR given by (4). If the distance Δ = |μ 1 − μ 0| is measured in units of the standard deviation of x̄, Δ = cσ /√n, one finds:
f 1/f 0 ≥ k if and only if x̄ ≥ (μ 0 + μ 1)/2 + σ ² ln k /(n Δ)
whence
P (f 1/f 0 ≥ k |h 0) = Φ (−(c /2 + ln k /c ))
with Φ (x ) the (cumulative) normal distribution. Hence, the probability of misleading evidence in this case is a maximum when c /2 + ln k /c is a minimum. By calculus this happens when c = √(2 ln k ), in which case c = c /2 + ln k /c. Thus,
(6)      max P (f 1/f 0 ≥ k |h 0) = Φ (−√(2 ln k ))
For example, for k = 8, Φ (−√(2 ln 8)) = .021, while for k = 32, Φ (−√(2 ln 32)) = .0043, which improve considerably on the universal bounds of 1/8 and 1/32. In fact, the ratio Φ (−√(2 ln k ))/k ^(−1) is easily seen to be decreasing, so that the relative improvement over the universal bound is greater for larger k. Royall (1997) greatly extends the reach of (6) by invoking the fact that the log-likelihood is asymptotically normal about its maximum (the ML estimate of the parameter) with precision given by the Fisher information, with an analogous result for the multiparameter case (Lindley 1965, §7.1; Hald 1998, p. 694).

The upshot is that one can retain the law of likelihood and the likelihood principle and still control for the probability of misleading evidence, the feature that lent NP so much of its initial appeal. This "Royall road" opens the way to further reconciliation of (F) with (B) and (L) and to the removal of many perplexing features of NP significance tests (Royall 1997, chapter 5). In retrospect, one sees that the significance level was made to play a dual role in NP theory as both an index of the evidence against null (Fisher's interpretation) and the relative frequency of erroneous rejections of the tested hypothesis. Fisher vigorously rejected the latter interpretation of significance levels and offered a pertinent counterexample (1956, pp. 93–96). He even says, "[T]he infrequency with which, in particular circumstances, decisive evidence is obtained, should not be confused with the force, or cogency, of such evidence" (p. 96).

NP's ban on optional stopping as well as on what Pearson brands "the dangerous practice of basing the choice of test on inspection of the observations" (1966, p. 127) is rooted in a conception of testing as subsumption under a reliable rule. One's particular experiment is viewed as one trial of a repeatable sequence of identical experiments in which the considered hypotheses and a division of the outcomes into those supporting and those not supporting the tested hypothesis are specified in advance (compare Fisher 1956, pp. 81–82, who rejects this formulation in no uncertain terms). Thus, it is considered cheating to publish the error probabilities computed for a post facto test as if that test had been predesignated. See Mayo (1996, chapter 9) for numerous statements and illustrations of this stance, especially when she maintains, "Using the computed significance level in post-designated cases conflicts with the intended interpretation and use of significance levels (as error probabilities)" (p. 317). Most textbooks are curiously silent on this issue (see Hodges and Lehmann 1970, chapters 11, 13; de Groot 1986, chapter 8), but Mayo's strictures seem to be widely shared by users of statistical tests. The question is whether a statistician, even an orthodox statistician, can function within the confines of such a strict predesignationism.

From Fisher on, modern statisticians have emphasized the importance of checking the assumptions of one's model, and, of course, these are not the object of one's test. Moreover, the most sensitive test of such common assumptions as independence, normality, or equality of variances, is often suggested by the deviations observed in one's data, thus violating Pearson's proscription. But, ironically, the most telling counterexamples come from the bible of NP theory, Erich Lehmann's classic, Testing Statistical Hypotheses (1959, p. 7). In testing a hypothesis about a normal mean of unknown variance, one cannot tell how large a sample is needed for a sharp result until one has estimated the variance. Or, again, if X is uniformly distributed in a unit interval of unknown location, one can stop sampling if the first two observations are (very nearly) a unit distance apart, but if the first n observations all lie within a tiny distance of each other, no more has been learned than the first two observations convey and one must go on sampling. In these workaday examples of Lehmann's, optional stopping is not optional; it is the only option.

Obviously, the issue just raised has strong links to the philosophy of science that holds that "evidence predicted by a hypothesis counts more in its support than evidence that accords with a hypothesis constructed after the fact" (Mayo 1996, p. 251). It would be digressive to enter into this issue here, so one must refer to Mayo (chapter 8) for further discussion and references, and to Stephen G. Brush (1994).

Goodness-of-Fit Tests

Karl Pearson's goodness-of-fit test (de Groot 1986, §§9.1–9.4; Hodges and Lehmann 1970, §11.3) rejects a multinomial model h 0 of categorical data when the deviation between observed (n i) and predicted category counts (np i) is improbably large conditional on h 0. The measure of deviation employed by Pearson is the chi-squared measure:
(7)      X ² = Σi (n i − np i)²/(np i) = n Σi (f i − p i)²/p i
with f i = n i/n. Pearson showed that if h 0 is true, X ² has, asymptotically, a chi-squared distribution with v = k − 1 degrees of freedom. The mean and variance are v and 2v and a rule of thumb is that roughly 90 to 95 percent of the probability mass of the chi-squared distribution lies to the left of the mean plus two standard deviations. These and other mathematically convenient features are, essentially, the only thing that recommends this particular measure of deviation (see the two texts just cited and Jaynes 2003, p. 299).

On the surface, Pearson's chi-squared test appears to test the goodness-of-fit of a model without reference to alternatives. (B) offers a less well known test whose rationale is best brought out by considering Jaynes's example of a thick coin (2003, p. 300) that may land on its edge with a probability of .002 and is otherwise balanced (h 0). In n = 29 tosses, D = (n 1, n 2, n 3) = (14, 14, 1) is observed, that is, the coin lands on its edge once and lands heads and tails equally often, in an almost "best possible" agreement with h 0. However, X ² = 15.33, which is more than six standard deviations beyond the mean of 2. Defenders of the test will be quick to point out that the chi-square approximation to the distribution of X ² breaks down when one or more of the expected counts is less than 5, but that is not the problem here. For one can use brute force to compute P (X ² ≥ 15.33|h 0) exactly, since the only outcomes that give a smaller value of X ² are (l, 29 − l, 0) and (29 − l, l, 0) with 4 ≤ l ≤ 14. The sum of their probabilities on h 0 is 0.9435, whence P (X ² ≥ 15.33|h 0) = 0.0565. Hence, Pearson's test just fails by a whisker to reject h 0 at the 5 percent significance level conventionally associated with strong evidence against h 0. The source of the trouble is that X ² wrongly orders the possible outcomes; some accounted less deviant than (14, 14, 1) are actually less probable on h 0. Ideally, outcomes less probable on h 0 should be accounted more deviant.
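The exact tail probability can be checked by the brute-force enumeration just described; the sketch below recovers both X ² = 15.33 and P (X ² ≥ 15.33|h 0) ≈ 0.0565:

    # Exact tail probability of Pearson's X^2 for Jaynes's thick-coin example:
    # h0 puts probability 0.499 on each face and 0.002 on the edge, n = 29 tosses.
    from math import factorial

    p = (0.499, 0.499, 0.002)
    n = 29

    def prob(counts):                     # trinomial probability under h0
        n1, n2, n3 = counts
        coef = factorial(n) // (factorial(n1) * factorial(n2) * factorial(n3))
        return coef * p[0]**n1 * p[1]**n2 * p[2]**n3

    def chisq(counts):
        return sum((c - n * pi)**2 / (n * pi) for c, pi in zip(counts, p))

    observed = (14, 14, 1)
    x2_obs = chisq(observed)
    tail = sum(prob((n1, n2, n - n1 - n2))
               for n1 in range(n + 1) for n2 in range(n + 1 - n1)
               if chisq((n1, n2, n - n1 - n2)) >= x2_obs)
    print(x2_obs, tail)   # about 15.33 and 0.0565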

Given data D = (n 1, …, n k), one might ask a somewhat different question than the one Pearson asked, namely: How much support is apt to be gained in passing to some alternative hypothesis? For as Fisher and others emphasize, before rejecting a model as ill fitting one should attempt to find a plausible alternative that fits the data better. Plausibility aside, there is always one alternative hypothesis, call it the tailored hypothesis, that fits D better than h 0 by positing the observed relative frequencies, f i = n i/n, as its category probabilities. In effect, one wants to test the given model against the ideally best-fitting alternative, and this prompts one to look at the LR in favor of F = (f 1, …, f k) against the probability distribution P = (p 1, …, p k) of h 0, namely, Πi (f i/p i)^(n i), or, better, at its logarithm, Σi n i ln(f i/p i), which is additive in independent samples. This proves to be n times
(8)      H (F, P ) = Σi f i ln(f i/p i)
which may be viewed as a measure of the nearness of F to P. Though (8) was used by Alan Turing and his chief statistical assistant, I. J. Good, during World War II, Solomon Kullback, another wartime code breaker, was the first to publish a systematic treatment of its properties and applications to statistics, dubbing it discrimination information (see the entry on information theory). Since F is tailored to achieve perfect fit, H (F, P ) sets an upper limit to how much one can improve the fit to the data by scrapping h 0 in favor of a simple or composite alternative hypothesis (Jaynes 2003, pp. 293–297).

Happily, ψ = 2nH (F, P ) is also asymptotically distributed as χ²k −1, the chi-square variate with k − 1 d.f. (degrees of freedom). This hints that Pearson's X ² approximates ψ (Jaynes 1983, pp. 262–263). For example, Mendel's predicted phenotypic ratios of AB:Ab:aB:ab = 9:3:3:1 for a hybrid cross, AaBb × AaBb, gave rise to counts of 315, 101, 108, and 32 among n = 556 offspring. This gives X ² = .4700 and ψ = .4754. But when the expected category counts include a small value or the deviations are large, the approximation degrades, and with it the performance of Pearson's test. Thus, in Jaynes's (2003) thick coin example, X ² rates the outcomes (l, 29 − l, 0) and (29 − l, l, 0) for 4 ≤ l ≤ 8 as less deviant than (14, 14, 1) even though they are also less probable on h 0; by contrast, ψ errs only in failing to count (9, 20, 0) and (20, 9, 0) as less deviant than (14, 14, 1). Hence, the exact probability that ψ is less than its value of 3.84 at (14, 14, 1) is twice the sum of the probabilities (on h 0) of the outcomes (l, 29 − l, 0) for 10 ≤ l ≤ 14, or 0.7640, whence P (ψ ≥ 3.84|h 0) = .2360. Clearly, the ψ -test gives no reason to believe support can be much increased by passing to an alternative hypothesis, but it will be instructive to carry the analysis a step further.
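Both statistics are easily computed for Mendel's counts; the sketch below reproduces the figures X ² ≈ 0.470 and ψ ≈ 0.475:

    # Pearson's X^2 and psi = 2 n H(F, P) for Mendel's 9:3:3:1 data.
    from math import log

    observed = [315, 101, 108, 32]
    probs = [9/16, 3/16, 3/16, 1/16]
    n = sum(observed)

    x2 = sum((o - n * p)**2 / (n * p) for o, p in zip(observed, probs))
    psi = 2 * sum(o * log(o / (n * p)) for o, p in zip(observed, probs))
    print(round(x2, 4), round(psi, 4))   # about 0.470 and 0.475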

The only plausible alternative that presents itself is the composite hypothesis, H: p 1 = p 2 = ½(1 − θ), p 3 = θ (0 < θ < 1), which includes h 0 as the special case θ = .002. Since one d.f. is lost for each parameter estimated from the data in using Pearson's test (de Groot 1986, §9.2), this is one way of trading off the improved accuracy that results when a parameter is added against the loss of simplicity. It is insensitive, however, to whatever constraints may govern the parameters. A Bayesian treatment tests h 0 against the composite alternative H − h 0 (i.e., H exclusive of the value θ = .002) and goes by averaging the likelihoods of the special cases of H − h 0 against a uniform prior of θ over its allowed range, unless more specific knowledge of θ is available. (The effect is to exact a maximum penalty for the given complication of h 0.) On canceling the multinomial coefficient and using the beta integral (v.s.), the ratio of the likelihoods reduces to:
P (D |h 0)/P (D |H − h 0) = (0.499)^28 (0.002)/∫0^1 [(1 − θ)/2]^28 θ d θ = (29)(30)(0.998)^28 (0.002) ≈ 1.6
Thus, the data D = (14, 14, 1) favors h 0 over the composite alternative, and this remains true, albeit less strongly, if one integrates, say, from 0 to 0.1. By contrast, the chi-square test favors H − h 0 over h 0 by mere dint of the fact that the composite hypothesis includes the tailored hypothesis as a special case, namely, θ = 1/29, for then the value of X ² is zero. Thus, any complication of an original model that happens to include the tailored hypothesis will be preferred to the original model.
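The beta-integral calculation can be checked directly; the sketch below computes the ratio of the likelihood under h 0 to the average likelihood under the uniform-prior alternative and recovers a value of about 1.6 in favor of h 0:

    # Bayes factor for the thick-coin data D = (14, 14, 1): likelihood under h0
    # (edge probability 0.002) divided by the average likelihood under the
    # composite alternative p1 = p2 = (1 - theta)/2, p3 = theta, theta uniform
    # on (0, 1).  The multinomial coefficient cancels, and the beta integral
    # gives  int_0^1 theta*(1 - theta)**28 dtheta = 1/(29*30).
    num = 0.499**28 * 0.002                 # likelihood under h0
    den = (1 / 2**28) * (1 / (29 * 30))     # average likelihood under the alternative
    print(num / den)                        # about 1.6: the data favor h0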

Notice, the parameter distribution must reflect only what is known before sampling. Unfortunately, more cannot be said about the different ways (F) and (L) handle the problem of trading off the improved accuracy gained in complicating a model, retaining the original model as a special case, against the loss of simplicity as compared to the Bayesian method just illustrated of averaging the likelihoods. For more on this, see Roger D. Rosenkrantz (1977, chapters 5, 7, and 11) and Arnold Zellner, Hugo A. Keuzenkamp, and Michael McAleer (2001) for other approaches.

Probability as Logic

Bayesians view probability as the primary (or primitive) concept and induction or inference as derived (see Finetti 1938/1980, p. 194). They emphasize that their methods, properly applied, have never been rejected on the basis of their actual performance (Jaynes 1983, chapter 9; 2003, p. 143). As a corollary, they maintain that the canons of scientific method and inductive reasoning have a Bayesian rationale, while this is vigorously contested by frequentists (e.g., Mayo 1996, chapters 3 and 11). In particular, Bayesians evolved a mathematical analysis of inductive reasoning with its source in the original memoir of Thomas Bayes that includes purported solutions of the notorious problem of induction by Laplace (see Hald 1998, chapter 15) and de Finetti (1937/1981), as well as the equally notorious paradoxes of confirmation (see Good 1983, chapter 11; Rosenkrantz 1977, chapter 2).

Plainly, one's view of statistics is highly colored by one's interpretation of probability. The approaches of Fisher, Neyman, and Pearson, as well as that of most (L) proponents, like Royall, are grounded in a frequency interpretation that equates probabilities with asymptotically stable relative frequencies. The criticisms of the frequency theory, nicely summed up by L. J. Savage (1954, pp. 61–62), are, first, that it is limited (and limiting) in refusing to treat as meaningful the probabilities of singular or historical events, or (in most cases) scientific theories or hypotheses, like the hypothesis that smoking causes lung cancer, and, second, that it is circular. The model of random independent (Bernoulli) trials considered earlier is often held to justify the definition of probability as a limiting relative frequency, but all the relevant theorem (Bernoulli's law of large numbers) does is assign a high probability to the proposition that the observed relative frequency will lie within any preassigned error of the true probability of success in a sufficiently long sequence of such trials.

Savage's criticism along these lines is more subtle. Bayes saw that a distinctly inverse or inductive inference is needed to infer probabilities from observed frequency behavior. Thus, even Bayesians, like Good or Rudolf Carnap, who admit physical probabilities, insist that epistemic probabilities are needed to measure or infer the values of physical probabilities. A more sophisticated view is that physical probabilities arise from the absence of microscopic control over the outcome of one's experiment (see the final section).

Modern Bayesians have sought deeper foundations for probability qua degree of belief and the rules governing it in the bedrock of consistency. It is not merely "common sense reduced to a calculus" (Laplace) but a "logic of consistency" (F. P. Ramsey). Needed, in particular, is a warrant for (2), for it is in Bayesian eyes the basic (not to say the "bayesic") mode of learning from experience. Epistemologists of the naturalist school seriously question this, as when Ronald N. Giere contends that "there are many different logically possible ways of 'conditionalizing' on the evidence, and no a priori way of singling out one way as uniquely rational" (1985, p. 336). Rather than multiply one's initial odds by the LR, why not by some positive power of the LR? At any rate, this marks a major parting of the ways in contemporary epistemology.

One Bayesian response has been to argue that alternatives to the usual rules of probability open one to sure loss in a betting context, to a so-called Dutch book. However, this justification imports strategic or game theoretic considerations of doubtful relevance, which is why Bruno de Finetti (1972), an early sponsor of the argument, turns, instead, to the concept of a proper scoring rule, a means of evaluating the accuracy of a probabilistic forecast that offers forecasters no incentive to announce degrees of prediction different from their actual degrees of belief. (It is rumored that some weather forecasters overstate the probability of a storm, for example, to guard against blame for leaving the citizenry unwarned and unprepared.) This move to scoring rules opens the way to a means-end justification of (2) as the rule that leaves one, on average, closest to the truth after sampling.

By far the most direct way of sustaining Ramsey's declaration that "the laws of probability are laws of consistency" is that developed by the physicist Richard T. Cox (1946). Besides a minimal requirement of agreement with common sense, his main appeal is to a requirement of consistency (CON), that two ways of doing a calculation permitted by the rules must yield the same result. In particular, one must assign a given proposition the same probability in two equivalent versions of a problem.

In a nutshell, Cox's argument for the product rule, P (AB|C ) = P (A|BC )P (B|C ), from which (2) is immediate, exploits the associativity of conjunction.

First phase: Letting AB |C denote the plausibility of the conjunction AB supposing that C, show that AB |C depends on (and only on) A |BC and B |C, so that
(i)      AB |C = F (A |BC, B |C )
Moreover, by the requirement of agreement with qualitative common sense, the function F (x, y ) must be continuous and monotonically increasing in both arguments, x and y.

Second phase: Using first one side then the other of the equivalence of (AB )D and A (BD ):
ABD |C = F (AB |DC, D |C ) = F (F (A |BDC, B |DC ), D |C )
ABD |C = F (A |BDC, BD |C ) = F (A |BDC, F (B |DC, D |C ))
leading by (CON) to the associativity functional equation first studied by Niels Henrik Abel in 1826:
(ii)      F (F (x, y ), z ) = F (x, F (y, z ))
Cox solved (ii) by assuming that, in addition, F (x, y ) is differentiable. An elementary approach sketched by C. Ray Smith and Gary J. Erickson (1990) based on functional iteration, due to J. Aczel, dispenses with this assumption and leads to the solution: w (F (x, y )) = w (x )w (y ), with w continuous and monotonic, hence to
(iii)      w (AB |C ) = w (A |BC )w(B |C )

Third phase: Specializing (iii) to the cases where A is certain or impossible given C, one deduces that w (A |A ) = 1 and w (¬A |A ) = 0 or ∞. But these two choices lead to equivalent theories, so one may as well assume that w (¬A |A ) = 0 in line with the usual convention.

Cox (1946) gives a similar derivation of the negation rule, P (A ) + P (¬A ) = 1, and in conjunction with the product rule just derived, this yields the sum rule as follows:
P (A ∨ B |C ) = 1 − P (¬A ¬B |C ) = 1 − P (¬A |C )P (¬B |¬A C )
            = 1 − P (¬A |C )[1 − P (B |¬A C )] = P (A |C ) + P (¬A B |C )
            = P (A |C ) + P (B |C )P (¬A |BC ) = P (A |C ) + P (B |C )[1 − P (A |BC )]
            = P (A |C ) + P (B |C ) − P (AB |C )
Notice, Cox's derivation is restricted to finite algebras of sets, though not to finite sample spaces.

Non-Bayesian methods (or surrogates) of inference, which ipso facto violate one or more of Cox's desiderata, tend to break down in extreme cases. For example, unbiased estimates can yield values of the estimated parameter that are deductively excluded and frequentist confidence intervals can include impossible values of the parameter. A weaker but more general result to account for this affirms that one maximizes one's expected score after sampling (under any proper scoring rule) with (2) in preference to any other inductive rule (Rosenkrantz 1992, p. 535). This optimality theorem, which seems to have many discoverers, affords a purely cognitive justification of (2) as the optimally efficient means to one's cognitive end of making inferences that leave one as close to the truth as possible. This rationale has been extended by inductive logicians to the justification of more specialized predictive rules that are seen as optimal for universes or populations of specifiable orderliness (see Festa 1993).

An interesting implication of the optimality theorem is that it pays to sample, or that informed forecasts are better than those that lack or waste given information. To see this, compare (2) to the impervious rule that fixes updated probabilities at their initial values. Moreover, since the utility scoring rule, S (R, h i) = U (a R, h i), is proper, where a R maximizes expected utility against the probability distribution, R = (r 1, …, r n), over states of nature, one can expect higher utility after sampling as well, a result first given by Good (1983, chapter 17). Thus, both cognitive and utilitarian ends are encompassed.

The optimality theorem presents Bayesian conditioning as the solution of a well-defined optimization problem, thus connecting it to related results on optimal searching and sorting and continuing the tradition of Fisher, Neyman, Pearson, and Wald of viewing rules of estimation, statistical tests, and decision functions (strategies) as solutions of well-posed optimization problems.

The Controversial Status of Prior Probabilities

Objections to (B) center on the alleged impossibility of objectively representing complete ignorance by a uniform probability distribution (Fisher 1956, chapter 2; Mayo 1996, pp. 72ff; Royall 1997, chapter 8). For if one is ignorant of V (volume), then, equally, one is ignorant of D = 1/V (density), but a uniform distribution of V entails a nonuniform distribution of D and vice versa, since equal intervals of V correspond to unequal intervals of D, so it appears one is landed in a contradiction (for some of the tangled history of this charge of noninvariance, see Hald 1998, §15.6; Zabell 1988).

Bayesian subjectivists also deny that any precise meaning can be attached to ignorance (Savage 1954, pp. 64–66), but often avail themselves of uniform priors when the prior information is diffuse (e.g., Lindley 1965, p. 18). This affords a reasonably good approximation to any prior that is relatively flat in the region of high likelihood and not too large outside that region, provided there is such a region (or, in other words, that the evidence is not equally diffuse). For a precise statement, proof, and discussion of this so-called principle of stable estimation, see Ward Edwards, Harold Lindman and Leonard J. Savage (1965, pp. 527–534), as well as Dennis V. Lindley (1965, §5.2) for the important special case of sampling a normal population.

Bayesians have also used Harold Jeffreys's log-uniform prior with density
(9)      p (θ |I 0) ∝ θ ^(−1)
for a positive variate or parameter, θ > 0, where I 0 represents a diffuse state of prior knowledge. (9) is equivalent to assigning ln θ a uniform distribution, whence the name log-uniform. If θ is known to lie within finite bounds, a ≤ θ ≤ b, the density (9) becomes
(9a)      p (θ |I 0) = 1/(θ ln R 0),   a ≤ θ ≤ b
where R 0 = b /a, hence, the probability that θ lies in a subinterval, [c, d ] of [a, b ] is given by:
(9b)      P (c ≤ θ ≤ d ) = ln(d /c )/ln R 0
It follows that θ is log-uniformly distributed in [a, b ] if and only if, for any integer k, θ ^k is log-uniformly distributed in [a ^k, b ^k], since
ln(d ^k/c ^k)/ln(b ^k/a ^k) = k ln(d /c )/[k ln(b /a )] = ln(d /c )/ln(b /a )
This at once resolves the objection from the (alleged) arbitrariness of parameterization mentioned at the outset. For V (volume) is a positive quantity, hence, the appropriate prior is, not uniform, but log-uniform, and it satisfies the required invariance: all (positive or negative) powers of V, including V ^(−1), have the same (log-uniform) distribution.

Its invariance would be enough to recommend (9), but Jeffreys provided further justifications (for his interesting derivation of 1932, see Jaynes 2003, p. 498). He did not, however, derive (9) from a basic principle clearly capable of broad generalization (Kendall and Stuart 1967, p. 152). Nevertheless, his insistence that parameters with the same formal properties be assigned the same prior distribution hinted at a deeper grounding (a Tieferlegung). And while the leaders of the Bayesian revival of the 1950s, Savage, Good, and Lindley, did not find in Jeffreys's assorted derivations of (9) a principle definite enough to qualify as a postulate of rationality, they did clearly believe that given states of partial knowledge are better represented by some priors than by others, which they denigrated as pig-headed (Lindley 1965, p. 18) or highly opinionated (e.g., Zabell 1988, p. 157). Such out-of-court priors might be highly concentrated in the face of meager information or import a dependence between two parameters (de Groot 1986, p. 405). There matters stood when Jaynes published his fundamental paper, "Prior Probabilities," in 1968 (chapter 7 of Jaynes 1983).

Bayesian subjectivists are as committed to consistency as Bayesian objectivists, and to assign different probabilities to equivalent propositions or to the same proposition in two equivalent formulations of a problem is to commit the most obvious inconsistency. Savage (1954, p. 57), for one, viewed it as unreasonable to not remove an inconsistency, once detected.

Consider a horse race about which one knows only the numbers (better, the labels) of the entries. Since the labels convey no information (or so one is assuming), any relabeling of the horses leads to an equivalent problem, and the only distribution invariant under all permutations of the labels is, of course, the uniform distribution. Thus reinvented as an equivalence principle, Laplace's hoary principle of indifference is given a new lease on life: The vague notion of indifference between events or possibilities gives way to the relatively precise notion of indifference between problems (Jaynes 1983, p. 144). Two versions of a problem that differ only in details left unspecified in the statement of the problem are ipso facto equivalent (p. 144). In this restricted form Laplace's principle can be applied to the data or sampling distributions to which (F) and (L) are confined as well as to the prior distributions on which (B) relies. Indeed, from this point of view, "exactly the same principles are needed to assign either sampling distributions or prior probabilities, and one man's sampling probability is another man's prior probability" (Jaynes 2003, p. 89).

Invariance also plays a leading role in frequentist accounts of estimation and testing (Lehmann 1959, chapter 6). In testing a bivariate distribution of shots at a target for central symmetry, Lehmann notes, the test itself should exhibit such symmetry, for if not, "acceptance or rejection will depend on the choice of [one's coordinate] system, which under the assumptions made is quite arbitrary and has no bearing on the problem" (p. 213).

To see how the principle can be used to arrive at a sampling distribution, consider, again, Frank Wilcoxon's statistic, W t, for the sum of the ranks of the t treated subjects, with W c the corresponding statistic for the c controls, where t + c = N. Clearly, it is a matter of arbitrary convention whether subjects who show a greater response are assigned a higher or lower number as rank. In Table 2, the inverse ranks of the t = 6 treated subjects are, respectively, 7, 10, 5, 8, 9, and 4, where each rank and its inverse sum to N + 1 = 11. This inversion of the ranks leaves the problem unchanged. On the null hypothesis, h 0, that the treatment is without effect, both W t and the corresponding statistic, W ′t, for the sum of the inverse ranks, are sums of t numbers picked at random from the numbers 1 through N. Hence, W t and W ′t have the same distribution, which we write (using ≅ for "has the same distribution as") as:
W t ≅ W ′t
This is the invariance step where the Jaynesian principle of indifference is applied. Furthermore, since W t + W ′t = t (N + 1), it follows that
W t ≅ t (N + 1) − W t
whence
W t − t (N + 1)/2 ≅ t (N + 1)/2 − W t
which implies that W t is symmetrically distributed about its mean. Next, recenter the distribution by subtracting the minimum rank sum of 1 + 2 + … + t = t (t + 1)/2 from W t, that is, define:
U t = W t − t (t + 1)/2
and, similarly,
U c = W c − c (c + 1)/2
for the controls. Then both U t and U c range from 0 to tc, have mean ½tc, and inherit the symmetry of W t and W c about their mean, which suggests, but does not prove, that U t ≅ U c. This follows from
U t − ½tc = W t − t (N + 1)/2 ≅ t (N + 1)/2 − W t
using
t (t + 1)/2 + ½tc = t (N + 1)/2
and the symmetry of W t, while at the same time,
W t + W c = N (N + 1)/2
and
U c − ½tc = W c − c (N + 1)/2 = t (N + 1)/2 − W t
so that U t − ½tc ≅ U c − ½tc, or U t ≅ U c. Finally, from the common distribution of U t and U c, which is easily tabulated for small values of t and c using an obvious recurrence, and for large values using a normal approximation (Hodges and Lehmann 1970, chapter 12, especially p. 349), the distributions of W t and W c, with either convention governing the ranks, can be obtained.

Consider, next, Jaynes's (1983, p. 126) derivation of the distribution of the rate parameter, λ, of the Poisson distribution (POIS):
P (n |λ, t ) = (λt )^n e ^(−λt )/n !
which gives the probability that n events (e.g., accidents, cell divisions, or arrivals of customers) occur in an interval of time of length t. Nothing being said about the time scale, two versions of the problem that differ in their units of time are equivalent. Then the times t and t ' in the two versions are related by
(i)      t = qt '
so that corresponding pairs (λ, t ) and (λ', t ') satisfy λt = λ't ', or
(ii)      λ' = q λ
Indeed, (ii) is what defines λ as a scale parameter. Then d λ′ = qd λ, that is, corresponding small intervals of λ also differ by the scale conversion factor. Hence, if f (λ)d λ and g (λ′)d λ′ are the probabilities of lying in corresponding small intervals, d λ and d λ′, then (step 1):
(iii)      f (λ)d λ = g (λ′)d λ′
since one is observing the same process in the two time frames, or, using (ii),
(iv)      f (λ) = qg (q λ)
Now (step 2) invoke the consistency requirement to affirm that f = g, leading to the functional equation
f (λ) = qf (q λ)
whose (unique) solution (step 3) is readily seen to be f (λ) ∝ 1/λ, the log-uniform distribution of Jeffreys. Thus, if all one knows about a parameter is that it is a scale parameter, then consistency demands that one assigns it a scale-invariant distribution. Following Jaynes (2003, §17.3), it is instructive to compare the estimates of λ and powers thereof to which the log-uniform distribution leads with the unbiased estimates favored by frequentist theory.

Using the gamma integral,
∫0^∞ λ^k e ^(−λ) d λ = k !
for integers k = 1, 2, 3, …, one sees that the rate parameter, λ, is also the mean and variance of POIS. Hence, the mean of any (integer) power of λ after observing n incidents in a chosen unit interval of time (used now in place of an interval of length t ) is given by:

In particular, the posterior mean of λ is n + 1, that of λ1is n 1, and that of λ2 is (n + 2)(n + 1), so that the variance of the posterior distribution is equal to n + 1, the same as that of λ, itself a kind of invariance. (F) favors using unbiased statistics (estimators) to estimate a parameter and then among them, choosing the one of minimum variance. That is, on the analogy to target shooting, one uses statistics centered on the bull's eye and most tightly concentrated there (Hodges and Lehmann 1970, chapter 8; de Groot 1986, §7.7). However, as Jaynes shows (2003, §17.3) this "nice property" is not so nice. For while the unbiased estimator of λ is n, which is reasonable and close to its (B) counterpart, the only unbiased estimator f (n ) of λ2 when n is the number of incidents recorded in the unit of time, is
f (n ) = n (n 1)
and f (n ) = 0 otherwise. Thus, when n = 1 incident is observed, the unbiased estimate of λ2 is zero, which entails that λ = 0. That is, one is led to an estimate of λ2 that is deductively excluded by the observation. (It only gets betteror worse!when one looks at higher powers of λ.) Moreover, no unbiased estimator of λ1 exists. In essence, unbiased estimators are seen to be strongly dependent on which power of the unknown parameter one chooses to estimate, Bayes estimators (equating these with the mean of the posterior distribution) only weakly so.
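A short simulation (with a hypothetical true rate λ = 1.5) makes the pathology concrete: n(n − 1) does average to λ² over repeated samples, yet every sample with n = 1 reports the impossible estimate λ² = 0.

```python
import numpy as np

rng = np.random.default_rng(0)
lam = 1.5                                    # hypothetical true rate
n = rng.poisson(lam, size=200_000)           # counts in repeated unit intervals

unbiased = n * (n - 1)                       # the unique unbiased estimator of lam**2
print(round(unbiased.mean(), 3), lam**2)     # averages to ~2.25: unbiased, as claimed
print(unbiased[n == 1].max())                # 0: every n = 1 sample estimates lam**2 = 0,
                                             # though observing one incident excludes lam = 0
```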

It is also well known that, for any distribution, the sample variance, Σ(x_i − x̄)²/n, is a biased estimator of the population variance, σ², while Σ(x_i − x̄)²/(n − 1) is unbiased. If, however, one's goal is to minimize the mean-squared error, E_θ[(θ̂ − θ)²], of one's estimate θ̂ of θ (de Groot 1986, p. 412), the avowed goal of (F), then it can be shown that the biased estimator, Σ(x_i − x̄)²/(n + 1), of a normal population variance has, for every value of σ², a smaller MSE than either of the two estimators just given (de Groot 1986, pp. 414–415). Hence, the unbiased sample variance is dominated by a biased one; it is, in this precise sense of decision theory, inadmissible. Thus, the two leading (F) criteria of unbiasedness and admissibility are seen to conflict. This insight of Charles Stein's shows, too, that an unbiased estimator is by no means certain to have lower MSE than a biased one, for the MSE is a sum of two terms, the squared bias and the variance, and in the case at hand, the biased sample variance more than makes up in its smaller variance what it gives up in bias (for more on this, including the waste of information that often accompanies unbiased estimation, see Jaynes 2003, pp. 511ff).
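The point is easy to verify numerically. The sketch below simulates normal samples (n = 10 and σ² = 4 are arbitrary choices) and compares the bias and mean-squared error of the sum of squared deviations divided by n − 1, n, and n + 1.

```python
import numpy as np

rng = np.random.default_rng(1)
n, sigma2, reps = 10, 4.0, 200_000
x = rng.normal(0.0, np.sqrt(sigma2), size=(reps, n))
ss = ((x - x.mean(axis=1, keepdims=True)) ** 2).sum(axis=1)   # sum of squared deviations

for divisor in (n - 1, n, n + 1):
    est = ss / divisor
    print(divisor, round(est.mean() - sigma2, 3), round(((est - sigma2) ** 2).mean(), 3))
# Dividing by n - 1 gives zero bias, but dividing by n + 1 gives the smallest MSE.
```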

If the density of a variate, X, can be written:

(10)      f(x|μ, σ) = σ^{−1}φ((x − μ)/σ)

then μ is called a location parameter and σ a scale parameter of the distribution. For changes in μ translate the density curve along the x-axis without changing its shape, while changes in σ alter the shape (or spread) without changing the location. The exemplars are, of course, the mean and standard deviation of a normal distribution. Pretty clearly, Jaynes's derivation of the log-uniform distribution of a Poisson rate applies to any scale parameter (1983, pp. 125–127). That is the justification Jeffreys lacked, though anticipating it in his requirement that formally identical parameters should have the same distribution. The essential point is that not every transformation of a parameter leads to an equivalent problem. Even a subjectivist with no prior information about the population proportion p of some trait would balk at having his or her beliefs represented by a uniform prior of some high power of p.

Notice, the range of ln σ for 0 < σ < ∞ is the whole real line, as is that of a uniform prior of a variate that can assume any real value. Such functions are, of course, non-integrable (nonnormalizable) and are termed improper. They cause no trouble (they lead to a normalizable posterior density) when the likelihood function tails off sufficiently fast, as it will when the sample information is non-negligible. In sampling a normal population of known precision, h = σ^{−2}, a normal prior, N(μ_0, h_0), of the unknown mean, μ, combines with the normal likelihood based on a random sample of size n to yield a normal posterior density, N(μ_1, h_1), with precision given by h_1 = h_0 + nh, the sum of the prior and the sample precision, and mean:

(11)      μ_1 = (h_0μ_0 + nhx̄)/(h_0 + nh)

a precision-weighted average of the prior mean and the sample mean (Lindley 1965, §5.1; Edwards, Lindman, and Savage 1965, pp. 535–538). Small prior precision, h_0, represents a poverty of prior information about the mean, and letting it approach zero yields a uniform prior as a limiting case. Then the posterior mean, μ_1, becomes the sample mean. This is a way of realizing Fisher's ideal of "allowing the data to speak for themselves" and can be applied in the spirit of the "jury principle" when the experimenter is privy to prior information not widely shared by the relevant research community. Priors that achieve this neutrality are termed uninformative or reference priors (see Loredo 1990, p. 119).
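In code, update (11) is a one-liner; the numbers below are hypothetical and only illustrate how a small prior precision lets the sample mean dominate.

```python
def normal_posterior(mu0, h0, xbar, n, h):
    """Posterior N(mu1, h1) for a normal mean: known data precision h = 1/sigma**2,
    normal prior N(mu0, h0) parameterized by its precision h0, sample mean xbar of size n."""
    h1 = h0 + n * h                          # precisions add
    mu1 = (h0 * mu0 + n * h * xbar) / h1     # precision-weighted average, equation (11)
    return mu1, h1

# Hypothetical numbers: a vague prior (small h0) lets the data dominate.
print(normal_posterior(mu0=0.0, h0=0.01, xbar=5.2, n=25, h=1.0))
# In the limit h0 -> 0 (uniform prior) the posterior mean is exactly the sample mean.
print(normal_posterior(mu0=0.0, h0=0.0, xbar=5.2, n=25, h=1.0))
```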

This example of closure, a normal prior combining with the (normal) likelihood to yield a normal postsampling distribution, is prototypic, and one speaks of the relevant distribution as conjugate to the given likelihood function or data distribution. Other examples (de Groot 1986, pp. 321–327) include the beta:

f_β(p|a, b)dp = B(a, b)^{−1}p^{a−1}(1 − p)^{b−1}dp

with B(a, b) = Γ(a)Γ(b)/Γ(a + b) and Γ(n) = (n − 1)! when n is an integer, which combines with a binomial likelihood, L(p) = p^r(1 − p)^s, to yield a beta posterior density, f_β(p|a + r, b + s); or, again, the gamma distribution with density

f_γ(λ|a, b) = [b^a/Γ(a)]λ^{a−1}e^{−bλ}

which combines with a Poisson likelihood to yield a gamma posterior density (de Groot 1986, p. 323). In general, any (one-parameter) data distribution of the form:

(12)      f(x|θ) = F(x)G(θ)exp[u(x)φ(θ)]

will combine with a prior of the form, p(θ|I)dθ ∝ G(θ)^a exp(bφ(θ))dθ, to yield a density of the same so-called Koopman–Darmois form (Lindley 1965, p. 55). These are precisely the data distributions that admit a fixed set of sufficient statistics, namely, estimators of the unknown parameter(s) that yield the same posterior distribution as the raw data (Lindley 1965, §5.5; de Groot 1986, §6.7; or, for more advanced readers, Jaynes 2003, chapter 8).
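The beta-binomial case can be verified directly: multiplying the beta prior by the binomial likelihood and dividing by the claimed beta posterior should leave a constant, the evidence P(e). The prior parameters and data below are made up for illustration.

```python
import numpy as np
from scipy.stats import beta, binom

a, b = 2.0, 3.0                  # hypothetical beta prior parameters
n, r = 10, 7                     # hypothetical data: r successes, s = n - r failures
s = n - r

p = np.linspace(0.01, 0.99, 99)
unnormalized = beta.pdf(p, a, b) * binom.pmf(r, n, p)   # prior times likelihood
posterior = beta.pdf(p, a + r, b + s)                   # claimed conjugate posterior

ratio = unnormalized / posterior                        # should be the constant P(e)
assert np.allclose(ratio, ratio[0])
print(round(ratio[0], 4), (a + r) / (a + b + n))        # evidence and posterior mean
```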

The parameters of a conjugate prior represent a quantity of information. For example, for the beta prior, a + b may be the size of a pilot sample or a virtual sample. By letting these parameters approach zero, one obtains an uninformed prior in the limit that represents, so to speak, the empty state of prior knowledge. The log-uniform prior (9) of a normal variance can be obtained in this way from the conjugate chi-squared prior (Lindley 1965, §5.3, p. 32), thus complementing its derivation as the distribution of a scale parameter about which nothing else is assumed.

In all the cases considered, the improper prior arises as a well-defined limit of proper priors. When this finite sets policy, which Jaynes traces to Gauss, is violated, paradoxes result, that is, in Jaynesian parlance, "errors so pervasive as to become institutionalized" (2003, p. 485). Such paradoxes can be manufactured at will in accordance with the following prescription:

  1. Start with a mathematically well-defined problem involving a finite set, a discrete or a normalizable distribution, where the correct solution is not in doubt;
  2. Pass to a limit without specifying how the limit is approached;
  3. Ask a question whose answer depends on how that limit is approached.

Jaynes adds that "as long as we look only at the limit, and not the limiting process, the source of the error is concealed from view" (p. 485).

Jaynes launches his deep-probing analysis of these paradoxes with the following exemplar, a proof that an infinite series, S = Σa_n, converges to any real number x one cares to name. Denoting the partial sums, s_n = a_1 + a_2 + ⋯ + a_n with s_0 = 0, one has for n ≥ 1:

a_n = (s_n − x) − (s_{n−1} − x)

and so the series becomes

S = (s_1 − x) + (s_2 − x) + (s_3 − x) + ⋯

− (s_0 − x) − (s_1 − x) − (s_2 − x) − ⋯

Since the terms s_1 − x, s_2 − x, … all cancel out, one arrives at S = −(s_0 − x) = x.

Apart from assuming convergence, the fallacy here lies in treating the series as if it were a finite sum. The nonconglomerability paradox, which purports to show that the average, P(A|I), of a bounded infinite set of conditional probabilities, P(A|C_jI), can lie outside those bounds, also turns on the misguided attempt to assign these probabilities directly on an infinite matrix rather than approaching them as well-defined limits of the same probabilities on finite submatrices (Jaynes 2003, §15.3). Jaynes goes on to consider countable additivity, the Borel-Kolmogorov paradox, which involves conditioning on a set of measure zero, and the marginalization paradoxes aimed at discrediting improper priors. These paradoxes have little to do with prior probabilities per se and everything to do with ambiguities in the foundations of continuous probability theory.

Leaving these subtle fallacies to one side, one can apply Jaynes's policy of starting with finite sets and then passing to well-defined limits to another old chestnut, the water-and-wine paradox, in which one is told only that the ratio of water (H) to wine (W) in a mixture lies between 1 and 2. Then the inverse ratio of wine to water lies between ½ and 1, and, in the usual way, a uniform distribution of one ratio induces a nonuniform distribution of the other. One can eliminate ambiguity, however, by quantizing the problem. There are, after all, just a finite number N of molecules of liquid, of which N_H are water molecules and N_W are wine molecules. Then the inequality, 1 ≤ N_H:N_W ≤ 2, is equivalent to N_W ≤ N_H ≤ 2N_W, and so the admissible pairs (N_H, N_W) are:

{(N_H, N − N_H): ½N ≤ N_H ≤ ⅔N}

Moreover, this remains true when one starts with the other (equivalent) version of the problem in which the given is the inequality, ½ ≤ N_W:N_H ≤ 1, governing the inverse ratio. One then assigns equal probabilities to these (⅔ − ½)N = N/6 allowed pairs. Then to find, for example, the probability that ½ ≤ N_W:N_H ≤ ¾, one takes the ratio of the number of allowed pairs meeting this condition, which is equivalent to

(4/7)N ≤ N_H ≤ ⅔N

to the total number, N/6, of allowed pairs to find, not ½, but

(⅔ − 4/7)N ÷ (N/6) = 4/7 = .571

which is surprisingly close to:

[ln(¾) − ln(½)]/ln 2 = .585

Or, again, the probability that N_W:N_H lies between 5/8 and 11/13 is found to be 23/52 = .442, which is close to [ln(11/13) − ln(5/8)]/ln 2 = .437. Thus, by assigning equal probabilities in a discrete version of the problem (the only invariant assignment), one appears to be led once more to the log-uniform prior.
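The discrete figures above are easily reproduced; the sketch below uses N = 1,000,000 molecules (any large N will do) and compares the quantized probabilities with the log-uniform values quoted in the text.

```python
import numpy as np

def quantized_probability(lo, hi, N=1_000_000):
    """P(lo <= N_W/N_H <= hi) when all admissible pairs with N_W <= N_H <= 2*N_W
    are assigned equal probability (N molecules in all)."""
    nh = np.arange(1, N)
    nw = N - nh
    admissible = (nw <= nh) & (nh <= 2 * nw)
    hit = admissible & (lo * nh <= nw) & (nw <= hi * nh)
    return hit.sum() / admissible.sum()

def log_uniform_probability(lo, hi):
    return (np.log(hi) - np.log(lo)) / np.log(2.0)      # log-uniform on [1/2, 1]

for lo, hi in [(0.5, 0.75), (5 / 8, 11 / 13)]:
    print(round(quantized_probability(lo, hi), 3), round(log_uniform_probability(lo, hi), 3))
# roughly .571 vs .585 and .442 vs .437, the figures quoted in the text
```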

Another familiar puzzle of geometric probability is Joseph Bertrand's chord paradox, which asks for the probability that a chord of a circle of radius R drawn at random exceeds the side s = √3·R of the inscribed equilateral triangle. Depending on how one defines "drawn at random," different answers result, and Bertrand himself seems to have attached no deeper significance to the example than that "la question est mal posée."

Like the water-and-wine example, this puzzle is more redolent of the faculty lounge than the laboratory, so following Jaynes (1983, chapter 8; 2003, §12.4.4), one can connect it to the real world by giving it a physical embodiment in which broom straws are dropped onto a circular target from a great enough height to preclude skill. Nothing being said about the exact size or location of the target circle, the implied translation and scale invariance uniquely determine a density:

(13)      f(r, θ)dr dθ = dr dθ/(2πR),  0 ≤ r ≤ R, 0 ≤ θ < 2π
for the center (r, θ ) of the chord in polar coordinates. And since

it follows that annuli whose inner and outer radii, r 1 and r 2, stand in the same ratio should experience the same frequency of hits by the center of a chord. With L = the length of a chord whose center is at (r, θ ), the relative length, x = L/2R, of a chord has the induced density:
(13a)      p(x)dx = x(1 − x²)^{−½}dx,  0 < x ≤ 1

finally, since L = √3·R is the side-length of the inscribed equilateral triangle, the probability sought is:

P(L > √3·R) = ∫_{√3/2}^{1} x(1 − x²)^{−½}dx = ½∫_0^{¼} u^{−½}du = ½

with u = 1 − x².

All these predictions of Jaynes's solution can be put to the test (for one such test and its outcome, see Jaynes 1983, p. 143). In particular, (13) tells one to which hypothesis space a uniform distribution should be assigned to get an empirically correct result, namely, to the linear distance between the centers of the chord and circle. There is no claim, however, to be able to derive empirically correct distributions a priori, much less to conjure them out of ignorance. All that has been shown is that any distribution other than (13) must violate one or more of the posited invariances. If, for example, the target circle is slightly displaced in the grid of straight lines defined by a rain of straws, then the proportion of hits predicted by that other distribution will be different for the two circles. However if, Jaynes argues (p. 142), the straws are tossed in a manner that precludes even the skill needed to make them fall across the circle, then, surely, the thrower will lack the microscopic control needed to produce a different distribution on two circles that differ just slightly in size or location.
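One can also rehearse the broom-straw experiment in simulation. The sketch below models a rain of straws as random lines whose signed distance from the circle's center is uniform over a region much larger than the target (an assumption meant to capture the translation invariance just described; the direction is irrelevant by rotational symmetry). It recovers the 1/2 prediction and the uniform distribution of the chord's midpoint distance.

```python
import numpy as np

rng = np.random.default_rng(2)
R, A, n = 1.0, 50.0, 2_000_000
# Straws modeled as lines with signed distance d from the circle's center uniform
# on (-A, A); keep only the straws that actually cross the circle (|d| < R).
d = rng.uniform(-A, A, size=n)
r = np.abs(d[np.abs(d) < R])                 # midpoint distance of each chord

chord = 2.0 * np.sqrt(R**2 - r**2)
print(round(np.mean(chord > np.sqrt(3) * R), 3))   # ~ 0.5, the predicted probability
print(np.round(np.histogram(r, bins=5, range=(0, R))[0] / r.size, 3))  # ~ uniform in r
```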

The broom straw experiment, which readers are urged to repeat for themselves, is highly typical of those to which one is tempted to ascribe physical probabilities or objective chances, for example, the chance of 1/2 that the chord fixed by a straw that falls across the circle exceeds the side of the inscribed triangle. However, as Zabell (1988, pp. 156–157) asks, if there is a "propensity" or "dispositional property" present, of what is it a property? Surely not of the straws, nor, he argues, of the manner in which they are tossed. A skilled practitioner of these arts can make a coin or a die show a predominance of heads or sixes (see Jaynes 2003, chapter 10). Nor is it at all helpful to speak of identical trials of the experiment, for if truly identical, they will yield the same result every time. Zabell concludes that "the suggested chance setup is in fact nothing other than a sequence of objectively differing trials which we are subjectively unable to distinguish between." However, one may well be able to distinguish between different throws of a dart in terms of how tightly one gripped it, for example, without being able to produce different distributions on slightly differing targets. It is the absence of such skill that seems to matter, and that feature of the chance setup is objective. On this basis, Jaynes is led to characterize the resulting invariant distribution as "by far the most likely to be observed experimentally in the sense that it requires by far the least skill" (1983, p. 133).

For a different example, consider the law of first digits. Naive application of the principle of indifference at the level of events leads to an assignment of equal probabilities to the hypotheses, h_d, d = 1, 2, …, 9, that d is the first significant digit of an entry, X, in a table of numerical data. Nothing being said about the scale units employed, the implied scale invariance implies a log-uniform distribution of X with normalization constant, 1/ln 10, since a = 10^k ≤ X < 10^{k+1} = b forces ∫_a^b x^{−1}dx = ln 10 (which is independent of k). Hence, d is the first significant digit with probability:

(14)      p_d = log_10(1 + 1/d)

so that p_1 = log_10 2 = .301, …, p_9 = 1 − log_10 9 = .046. Known earlier to Simon Newcomb, (14) was rediscovered in 1938, though not explained, by Frank Benford, who tested it against twenty tables ranging from the surface areas of rivers and lakes to the specific heats of thousands of compounds. Surprisingly, Benford found that (14) even applies to populations of towns or to street addresses, which are certainly not ratio scaled. The explanation lies in the recent discovery of T. P. Hill (1995) that "base invariance implies Benford's law." That is, (14) is invariant under any change of the base b > 1 of the number system. Moreover, since scale invariance implies base invariance (but not conversely), the scale-invariant tables for which (14) holds are a proper subset of the base-invariant ones. Indeed, Hill derives a more general form of (14) that applies to initial blocks of k ≥ 1 digits of real numbers expressed in any base, namely:

(14a)      P_b(d_1, d_2, …, d_k) = log_b[1 + (Σ_{i=1}^{k} d_i·b^{k−i})^{−1}]

where d_i is the ith significant digit of x in base b. For example, for base ten, and k = 2, the probability that the first two digits are 3 and 7 is log_10[1 + (37)^{−1}] = .01158, while, as one may verify, the probability that the second digit is d is given by:

(14b)      P(D_2 = d) = Σ_{j=1}^{9} log_10[1 + (10j + d)^{−1}]
Hill's derivation of (14) is a beautiful and instructive exercise in measure-theoretic probability, but the main point to register here is that (14) is not the chance distribution of any readily conceivable physical process or random experiment. One can be just as certain, though, that any list or table of numbers that violates (14) must yield different frequencies of first (second, …) digits when the scale or number system is changed. More generally, the output of a deterministic process, like that which generates the digits of π or random numbers, for that matter, can be as random as one likes under the most stringent criterion or definition of randomness. These categories, so commonly contrasted, are not mutually exclusive. However, it is far from clear how to characterize an intrinsically random physical process in a way that is free of circularity and amenable to experimental confirmation (Jaynes 2003, §10.5). Jaynes views such random processes as mythic products of what he labels the "Mind Projection Fallacy."
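As a quick illustration (not drawn from the sources cited), the first digits of the powers of 2, a purely deterministic sequence, already track (14) closely, and (14b) is a one-line sum.

```python
import math
from collections import Counter

benford = {d: math.log10(1 + 1 / d) for d in range(1, 10)}          # equation (14)

first = [int(str(2 ** k)[0]) for k in range(1, 5001)]               # first digits of 2**k
freq = Counter(first)
for d in range(1, 10):
    print(d, round(benford[d], 3), round(freq[d] / len(first), 3))  # theory vs. frequency

# Second-digit law (14b): sum over the possible leading digits.
p2 = [sum(math.log10(1 + 1 / (10 * j + d)) for j in range(1, 10)) for d in range(10)]
print([round(p, 3) for p in p2])
```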

Bayes Equivalence

Part of the motivation of the frequency theory was to develop objective means of assessing the evidence from an experiment, leaving readers of the report free to supply their own priors or utility functions. However, this ideal of separating evidence from opinion is unrealizable because, first, the support of a composite hypothesis or model with adjustable parameters depends on the weights assigned its various simple components, and, second, because of the presence of so-called nuisance parameters.

For an example of the former (Royall 1997, pp. 18–19), one can compare the hypothesis (H) that the proportion p of red balls in an urn is either ¼ or ¾ with the simple hypothesis (k) that p = ¼, given that a ball drawn at random is red. The bearing of this outcome is wholly dependent on the relative weights assigned the two simple components of H, namely, p = ¼ and p = ¾. Or, again, how does drawing an ace of clubs bear on the hypothesis that the deck is a trick deck (fifty-two copies of the same card) versus the hypothesis that it is a normal deck? If one's intuition is that a single card can tell one nothing, then one is implicitly assigning equal probabilities to all fifty-two components of the trick deck hypothesis, but if, for example, one has information that most trick decks are composed of aces or picture cards, then drawing that ace of clubs will favor the trick deck hypothesis by a factor ranging from 1 to 52, depending on those weights.
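A minimal sketch of the trick-deck arithmetic, with a hypothetical weighting that spreads the prior weight of the composite hypothesis over sixteen aces and picture cards:

```python
# Likelihood of drawing the ace of clubs under the composite "trick deck" hypothesis,
# averaged over its fifty-two simple components (index 0 = the ace-of-clubs deck).
def p_ace_given_trick(weights):
    # only "52 copies of the ace of clubs" yields the ace of clubs, with probability 1
    return weights[0]

p_ace_given_normal = 1 / 52

equal = [1 / 52] * 52                        # equal weights: the draw is uninformative
print(p_ace_given_trick(equal) / p_ace_given_normal)         # Bayes factor 1.0

face = [1 / 16 if i < 16 else 0.0 for i in range(52)]        # weight only 16 aces/pictures
print(p_ace_given_trick(face) / p_ace_given_normal)          # Bayes factor 52/16 = 3.25

# Putting all the weight on the ace-of-clubs deck gives the extreme factor of 52.
```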

In practice, (F) resorts to comparing two models by the ratios of their maximum likelihoods, as in the orthodox t-test for comparing two normal means (de Groot 1986, §8.6). This is often a good approximation to the Bayes factor, the ratio of average likelihoods, when the two models are of roughly equal simplicity (Rosenkrantz 1977, p. 99), but this practice is otherwise highly biased (in the colloquial sense) in favor of the more complicated hypothesis, as in the trick deck example.

An equally formidable bar to the separation of sample information and prior information is the presence of parameters other than the one of interest. In testing the equality of two normal means, the difference, x̄ − ȳ, between the sample means signifies different things depending on one's beliefs about the variances of the two populations. An even simpler example is random sampling of an urn of size N without replacement. If interest centers on the number R of red balls in the urn and N is also unknown, then an outcome, D = (n, r), of r red in a sample of n, will mean different things depending on one's prior beliefs about the relationship, if any, between R and N. If, for example, extensive previous experience renders it almost certain that the incidence of a certain birth defect lies well below one in a thousand, then a sample of modest size in which such a defect occurs, for example, (n, r) = (500, 1), tells one not merely that N ≥ 500, the sample size, but (almost surely) that N ≥ 1,000. Even so simple a problem as this appears to lie entirely beyond the scope of (F) or (L), but as Jaynes amply demonstrates, this shopworn topic of introductory probability-statistics courses takes on a rich new life when the inverse problem of basing inferences about N and R on observed samples is considered and different kinds of prior information are incorporated in the resulting data analysis (2003, chapters 3 and 6).

In general, (B) handles nuisance parameters by marginalization, that is, by finding the joint posterior density, say, p(θ_1, θ_2|DI) for the case of two parameters, and then integrating with respect to θ_2:

p(θ_1|DI) = ∫ p(θ_1, θ_2|DI)dθ_2

the discrete analogue being P(A|DI) = Σ_i P(AB_i|DI) for mutually exclusive and exhaustive B_i's. Thus, intuition expects that a more focused belief state will result when there is prior knowledge of θ_2 than when its value is completely unknown before sampling.

Consider the case of sampling a normal population when nothing is known about θ_1 = μ and θ_2 = σ², so that their joint prior is the Jeffreys prior:

p(θ_1, θ_2|I) = p(θ_1|I)p(θ_2|I) ∝ θ_2^{−1}

while the (normal) likelihood is

L(θ_1, θ_2) ∝ θ_2^{−n/2} exp{−[vs² + n(x̄ − θ_1)²]/2θ_2}

using the obvious identity, Σ(x_i − θ_1)² = Σ(x_i − x̄ + x̄ − θ_1)² = Σ(x_i − x̄)² + n(x̄ − θ_1)², with s² = Σ(x_i − x̄)²/v and v = n − 1. Multiplying this by the prior yields the joint postsampling density

p(θ_1, θ_2|DI) ∝ θ_2^{−(n+2)/2} exp{−[vs² + n(x̄ − θ_1)²]/2θ_2}

up to a normalization constant. Then using

(*)      ∫_0^∞ θ^{−u} exp(−A/θ)dθ = Γ(u − 1)/A^{u−1}

obtained from the gamma integral by the substitution, x = A/θ, the marginal posterior density of θ_1 is:

p(θ_1|DI) ∝ [vs² + n(x̄ − θ_1)²]^{−n/2}

using (*) with A = vs² + n(x̄ − θ_1)² and u = (n + 2)/2 (the factor of 2 in the exponent only alters the constant of proportionality), whence

p(θ_1|DI) ∝ (1 + t²/v)^{−½(v+1)}

with t = n^{½}(x̄ − θ_1)/s. To find the normalization constant, one integrates on the right using the substitution, t = v^{½}x^{½}(1 − x)^{−½}, with dt = ½v^{½}x^{−½}(1 − x)^{−3/2}dx and 1 + t²/v = (1 − x)^{−1}, to obtain:

∫_{−∞}^{∞}(1 + t²/v)^{−½(v+1)}dt = v^{½}B(½, ½v)

using the beta integral,

B(a, b) = ∫_0^1 x^{a−1}(1 − x)^{b−1}dx = Γ(a)Γ(b)/Γ(a + b)

Hence, the posterior density of the mean, using Γ(½) = √π, is given by:

(15)      p(θ_1|DI)dθ_1 = [Γ(½(v + 1))/(Γ(½v)√(vπ))](1 + t²/v)^{−½(v+1)}dt

which is the density of the t-distribution with v = n − 1 degrees of freedom.

Thus, one has arrived in a few lines of routine calculation at the posterior (marginal) density of the mean when the variance is (completely) unknown. Were the variance known, the uniform prior of the mean leads, as was seen earlier, to a normal posterior distribution about the sample mean, x̄, with variance σ²/n, or in symbols:

n^{½}(θ_1 − x̄)/σ ∼ N(0, 1)

while the result, n^{½}(θ_1 − x̄)/s, of replacing the population s.d., σ, by the sample s.d., s, when the former is unknown, has the t-distribution with v = n − 1 d.f. The density (15) is, like the normal density, bell-shaped and symmetric, but has larger tails (i.e., does not approach its asymptote, the x-axis, as rapidly as the normal curve) and is thus less concentrated. For example, for v = 10 degrees of freedom (d.f.), the 95 percent central region of the t-distribution is (−2.228, 2.228) while that of the normal is (−1.96, 1.96) ≈ (−2, 2). Thus, Bayesian updating confirms one's intuition that the postsampling belief function should be less concentrated when the variance is (completely) unknown than when it is known. Moreover, the t-distribution approaches normality rather rapidly as v → ∞, and so the difference in the two states of prior knowledge is quickly swamped by a large sample. Already at v = 20, the 95 percent central region of (15) is (−2.09, 2.09), and by v = 100 it has shrunk to (−1.98, 1.98), almost indistinguishable from the normal.
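The comparison of central regions is a two-line computation with standard t and normal quantiles (the degrees of freedom below are chosen to match the figures just quoted):

```python
from scipy.stats import norm, t

z = norm.ppf(0.975)                          # 1.960 for the known-variance (normal) case
for v in (5, 10, 20, 100):
    print(v, round(t.ppf(0.975, df=v), 3))   # 2.571, 2.228, 2.086, 1.984
print(round(z, 3))
```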

Because of a mathematical quirk, (F) interval estimates (confidence intervals) for a normal mean with variance unknown are numerically indistinguishable from (B) interval estimates (credence intervals) obtained from the posterior density, although their interpretation is radically different. For a normal distribution, the sample mean, x̄, and the sample variance, s_n² = n^{−1}Σ(x_i − x̄)², are independent (for a proof, see de Groot 1986, §7.3). As the normal distribution is the only one for which this independence of sample mean and sample variance obtains, it may justly be called a quirk. One shows, next, that if Y ∼ N(0, 1) and Z ∼ χ²_n, then

U = Y/(Z/n)^{½}

has the t-distribution (15) with n degrees of freedom (§7.4). Now Y = n^{½}(x̄ − μ)/σ ∼ N(0, 1) is standard normal, and it can be shown (de Groot 1986, pp. 391–392) that Z = ns_n²/σ² ∼ χ²(n − 1), hence

U = Y/[Z/(n − 1)]^{½} = n^{½}(x̄ − μ)/σ′

has the t-distribution with n − 1 d.f. The crucial point is that σ² cancels out when one divides Y by Z^{½}, and so the distribution of U does not depend on the unknown variance. The nuisance is literally eliminated. Finally, (F) estimates of μ can be obtained from the distribution of U since −c ≤ U ≤ c just in case x̄ − cσ′/n^{½} ≤ μ ≤ x̄ + cσ′/n^{½}, writing σ′ = s_n[n/(n − 1)]^{½}. Notice, however, the different interpretation. One thinks of (x̄ − cσ′/n^{½}, x̄ + cσ′/n^{½}) as a random interval that contains μ with the specified probability, or long-run relative frequency in an imagined sequence of repetitions of the experiment.

The first thing that strikes one is how much more complicated this derivation of the sampling distribution of the relevant statistic is than the (B) derivation of the postsampling distribution (15) of μ. Even the modern streamlined derivation given in de Groot's (1986) text occupies nearly ten pages. William Sealy Gosset guessed the distribution by an inspired piece of mathematical detective work (for some of the relevant history, see Hald 1998, §27.5). The first rigorous proof was given in 1912 by a bright Cambridge undergraduate named R. A. Fisher. Gosset began his 1908 paper by noting that earlier statisticians had simply assumed "a normal distribution about the mean of the sample with standard deviation equal to s/√n" but that for smaller and smaller samples "the value of the s.d. found from the sample becomes itself subject to an increasing error, until judgments reached in this way may become altogether misleading" (Hald 1998, p. 665). Fisher never tired of extolling "Student" (Gosset's pen name) for his great discovery, as well he might, for it is safe to say that without it, the (F) approach to statistics would never have gotten off the ground. For (F) would not then have been able to address the inferential problems associated with sampling a normal population for the vital case of small samples of unknown precision.

In essence, the pre-Gosset practice of replacing σ in n^{½}(x̄ − μ)/σ by its ML estimate, s_n, would be about the only option open to (F) or (L) if this nuisance parameter could not be eliminated. However, that is to treat the unknown parameter as if it were known to be equal to its estimated value, precisely what Gosset's predecessors had done. Complete ignorance of σ should result in a fuzzier belief state than when it is known (compare Royall 1997, p. 158). The only remedy (L) offers (p. 158) when nuisance parameters really are a nuisance (and cannot be eliminated) is to use the maximum of the likelihood function taken over all possible values of the relevant nuisance parameter(s). However, this is to equate a model with its best-fitting special case, which is to favor the more complicated of two models being compared. Moreover, in the real world outside of textbooks, normal samples do not come earmarked "variance known" or "variance unknown."

To borrow "Example 1" from Jaynes (1983, p. 157), the mean life of nine units supplied by manufacturer A is 42 hours with s.d. 7.48, while that of four units supplied by B is 50 hours with s.d. 6.48. (F) proceeds in such cases to test the null hypothesis that the two s.d.'s are equal using the F-test originated by Fisher (de Groot 1986, §8.8). When the null hypothesis is accepted, (F) then treats the two s.d.'s as equal and proceeds to a two sample t-test of the equality of the means (de Groot 1986, §8.9), which is predicated on the equality of the two (unknown) variances. In the present example, the hypothesis of equal s.d.'s is accepted at the 5 percent significance level, but then the two-sample t-test (unaccountably) accepts the hypothesis that the two means are equal at a 10 percent level. Jaynes calculates odds of 11.5 to 1 that B's components have a greater mean life, and without assuming equality of the variances. Then he asks, "Which statistician would you hire?"

The (F) solution extends to the case where independent samples are drawn from two normal populations of unknown variance, provided the variances are known to be equal or to stand in a given ratio. However, when the variances are known (or assumed) to be unequal, (F) fragments into a number of competing solutions with no general agreement as to which is best (Lindley 1965, pp. 94–95; Kendall and Stuart 1967, pp. 139ff). W.-U. Behrens proposed a solution in 1929 that Fisher rederived a few years later using his highly controversial fiducial argument (see the entry on R. A. Fisher). As Harold Jeffreys points out (1939, p. 115), the Behrens solution follows in a few lines from (2) using the Jeffreys prior for the unknown parameters (also see Lindley 1965, §6.3).

However, what of the intermediate cases, where the variances are not known to be equal (as the two-sample problem assumes) and not known to be unequal (as the Behrens-Fisher problem assumes)? In his definitive treatment, G. Larry Bretthorst (1993) takes up three problems: (1) determine if the two samples are from the same normal population, (2) if not, find how they differ, and (3) estimate the magnitude of the difference. Thus, if they differ, is it the mean or the variance (or both)?

Consider, once more, Jaynes's example. Bretthorst finds a probability of 0.58 that the s.d.'s, σ_1 and σ_2, are the same, given the sample s.d.'s of 7.48 and 6.48, the inconclusive verdict intuition expects. (F) is limited to an unnuanced approach where one or the other of these alternatives must be assumed. The posterior distribution Bretthorst computes is a weighted average of those premised on equal and unequal population variances and thus lies between them (Bretthorst 1993, p. 190, and figure 1). By marginalization, it yields a 72 percent probability that the parent means are different. The analysis is based on independent uniform and log-uniform priors for the means and variances truncated, respectively, at 34 = 46 − 12 and 58 = 46 + 12, and at σ_L = 3 and σ_H = 10. This is not just for the sake of greater realism but to ensure that the posterior density is normalizable (p. 191). Doubling the range of the means lowers the probability that the parent populations differ from 0.83 to 0.765, a change of roughly eight percent, while doubling the range of the s.d.'s makes about a 2 percent difference. Hence, the inference appears to be reasonably robust. Finally, the Bayesian solution smoothly extends the partial solutions (F) offers when the variances are unknown; the (F) solutions appear as the limiting cases of the (B) solution when the probability that the variances are equal is either zero or one. This makes it hard for an (F) theorist to reject the Bayesian solution.

The (F) solutions also correspond to (B) solutions based on an uninformative prior. This Bayes equivalence of (F) interval estimates or tests is more widespread than one might suppose (Jaynes 1983, pp. 168–171, 175), but is by no means universal. Generalizing from the case of known variances, it would seem to hold when sufficient estimators of the parameter(s) of interest exist, no nuisance parameters are present, and prior knowledge is vague or insubstantial.

Confidence intervals for a binomial success rate, θ, are harder to construct than the CI's for a normal mean because here the population variance, nθ(1 − θ), depends on the parameter being estimated. The solution is to find, for each value of θ, values p_L(θ) and p_H(θ), such that

P(p ≤ p_L|θ) = ½(1 − α)

and

P(p ≥ p_H|θ) = ½(1 − α)

as nearly as possible, where p is the proportion of successes in n trials. In other words, one finds a direct 100α% confidence interval for p for each value of the unknown success rate θ. Then the corresponding CI for θ comprises all those values whose direct CI contains the observed proportion p. For an example (n = 20) and a chart, see Kendall and Stuart (1967, pp. 103–105), whose obscure exposition makes this rather convoluted method seem even more mysterious. Plainly, finding such CI's is an undertaking, involving round-off errors and approximations. By contrast, the Bayesian posterior density, given in the original memoir of Bayes, based on the uniform prior of θ, for r successes in n trials is

f(θ|r, n) = [(n + 1)!/(r!(n − r)!)]θ^r(1 − θ)^{n−r}

with mean (r + 1)/(n + 2) and variance, f(1 − f)/(n + 3), where f = r/n. Hence, the Bayesian credence intervals assume the simple form,

f ± k[f(1 − f)/(n + 3)]^{½}
where k is 1.645, 1.96, and 2.57 for the 90, 95, and 99 percent intervals (using the normal approximation to the beta distribution). Jaynes finds (1983, p. 171) these Bayesian intervals are numerically indistinguishable from the CI's of the same confidence coefficient, leading him to wryly observe that the Bayesian solution Fisher denigrated as "founded on an error" delivers exactly the same interval estimates as the (F) solution at a fraction of the computational and mathematical effort. The reason for the equivalence is that, despite its great difference in motivation and interpretation, the (F) method of confidence intervals is based in this case on a sufficient statistic, the observed relative frequency f of success.
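For a concrete (hypothetical) data set, r = 17 successes in n = 50 trials, the approximate Bayesian interval above, the exact beta-posterior interval, and the Clopper-Pearson confidence interval can be compared in a few lines; the Clopper-Pearson endpoints are computed here via the standard beta-quantile identity rather than by the search described in Kendall and Stuart.

```python
import numpy as np
from scipy.stats import beta

n, r = 50, 17                        # hypothetical data
f = r / n

half = 1.96 * np.sqrt(f * (1 - f) / (n + 3))            # approximate Bayesian interval
print(round(f - half, 3), round(f + half, 3))

print(np.round(beta.ppf([0.025, 0.975], r + 1, n - r + 1), 3))   # exact Beta(r+1, n-r+1) interval

# Clopper-Pearson 95 percent confidence interval via the usual beta-quantile identity:
lo = beta.ppf(0.025, r, n - r + 1)
hi = beta.ppf(0.975, r + 1, n - r)
print(round(lo, 3), round(hi, 3))    # all three intervals are close
```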

As Jaynes notes, the official doctrine of (F) is that CI's need not be based on sufficient statistics (Kendall and Stuart 1967, p. 153), and, indeed, the advertised confidence coefficient is valid regardless. Bayesian credence intervals, being based on the likelihood function, automatically take into account all the relevant information contained in the data, whether or not sufficient statistics exist. Thus, (F) methods not based on a sufficient statistic must perforce be wasting information, and the result one expects, given the optimality theorem, is a degradation of performance. The point is that the data may contain additional information that leads one to recognize that the advertised confidence coefficient is invalid (Loredo 1990, p. 117). The next several examples illustrate this and related points in rather striking fashion.

For a simple example (de Groot 1986, p. 400), let independent observations X_1 and X_2 be taken from a uniform distribution on the interval, (θ − ½, θ + ½), with θ unknown. Then if Y_1 = min(X_1, X_2) and Y_2 = max(X_1, X_2), we have:

P(Y_1 ≤ θ ≤ Y_2) = P(X_1 ≤ θ)P(X_2 ≥ θ) + P(X_2 ≤ θ)P(X_1 ≥ θ)

= ½·½ + ½·½ = ½

Thus, if Y_1 = y_1 and Y_2 = y_2 is observed, (y_1, y_2) is a 50 percent CI for θ. However, what if y_2 − y_1 is close to 1? Then (y_1, y_2) is virtually certain to contain θ; indeed, one easily checks that it is certain to contain θ when y_2 − y_1 ≥ ½. Thus, one has a recognizable subset of the outcome space on which the 50 percent confidence coefficient is misleadingly conservative.
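A simulation (with an arbitrary true θ) confirms both the advertised 50 percent coverage and the recognizable subset: whenever y_2 − y_1 ≥ ½ the interval is certain to contain θ, so on the complementary subset the conditional coverage drops to about one-third.

```python
import numpy as np

rng = np.random.default_rng(3)
theta = 7.3                                        # arbitrary true value
x = rng.uniform(theta - 0.5, theta + 0.5, size=(200_000, 2))
y1, y2 = x.min(axis=1), x.max(axis=1)

covers = (y1 <= theta) & (theta <= y2)
wide = (y2 - y1) >= 0.5
print(round(covers.mean(), 3))                     # ~ 0.50, the advertised coverage
print(covers[wide].mean(), round(covers[~wide].mean(), 3))   # 1.0 versus ~ 0.33
```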

For an example of the opposite kind, where confidence is misplaced, one can turn to "Example 5" of Jaynes (1983, p. 172f). A chemical inhibitor that protects against failures wears off after an unknown time θ and decay is exponential (with mean one) beyond that point, so that a failure occurs at a time x with probability

f(x|θ) = exp(θ − x)h(x, θ)

where h(x, θ) = 1 if θ < x and is otherwise zero. Since this data distribution for n failure times factors as

f_n(x_1, …, x_n|θ) = exp[−Σx_i]·[e^{θ}]^n h(y_1, θ)

the factorization criterion (de Groot 1986, p. 358) shows that Y_1 = min(X_1, …, X_n) is a sufficient statistic. (Intuitively, the least time to a failure contains all the information in the n recorded failure times relevant to the grace period θ.) With a uniform distribution of θ (which enters here as a positive location parameter), the posterior density of θ is proportional to exp[n(θ − y_1)], for θ < y_1, and yields, for three observations, (X_1, X_2, X_3) = (12, 14, 16), a 90 percent credence interval of 11.23 < θ < 12.0, in good accord with qualitative intuition. However, (F) doctrine directs one to an unbiased estimator, and the point of the example is to show what can happen when a CI is not based on a sufficient statistic. Since

E(X̄|θ) = θ + 1

an unbiased estimator of θ is given by θ* = X̄ − 1. Notice, however, that this can be negative for permitted (positive) failure times, even though θ is necessarily nonnegative. The shortest 90 percent CI based on this statistic's sampling distribution (found by computer, using an approximation) is θ* − 0.8529 < θ < θ* + 0.8264, or, since θ* = 13 for the three observations,
12.1471 < θ < 13.8264
This consists entirely of values deductively excluded by the data! By contrast, the CI based on the sufficient statistic, the least of the failure times, is indistinguishable from its (B) counterpart.
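The Bayesian numbers in this example are easy to reproduce: the 90 percent interval follows from solving 1 − exp[n(a − y_1)] = 0.9 for its lower endpoint a, while the unbiased point estimate already lands above the deductive bound y_1.

```python
import numpy as np

x = np.array([12.0, 14.0, 16.0])
n, y1 = len(x), x.min()

a = y1 - np.log(10) / n                # P(a < theta < y1) = 1 - exp[n(a - y1)] = 0.9
print(round(a, 2), y1)                 # 11.23 < theta < 12.0, the Bayesian interval

theta_star = x.mean() - 1.0            # the unbiased estimator
print(theta_star, theta_star > y1)     # 13.0, True: it already exceeds the deductive bound
```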

Thus, Fisher was right to insist that his fiducial intervals be based on sufficient statistics. But, unfortunately, sufficient statistics do not always exist. A famous example is provided by the Cauchy distribution (the special case, v = 1, of Gosset's t-distribution), with density:

f(x|θ) = 1/{π[1 + (x − θ)²]}

with θ a location parameter to be estimated. The Cauchy distribution has the peculiarity that the mean of any finite number of observations has the same (Cauchy) distribution as a single observation. Given, say, two observations, X_1 and X_2, the sampling distributions of either one or their mean, θ* = ½(X_1 + X_2), are all the same, and so, if one's choice of estimator is to be guided solely by the sampling distributions of the candidates, as (F) doctrine dictates, then any of these statistics is as good as another for the purpose of estimating θ. However, would anyone be willing to use just the first observation and throw away the second? Or doubt that their mean is a better estimator of θ than either observation taken alone? In fact, the mean is the optimal Bayes estimator for any loss function that is a monotonically increasing function of the absolute error, |θ̂ − θ|, in the sense that it minimizes one's expected loss after sampling. (Lacking a prior for θ, (F) lacks any such clear-cut criterion of optimality.) Now, besides their mean, the two observations provide further information in the form of their range or half-range, Y = ½(X_1 − X_2). Jaynes then calculates the conditional distribution of θ* given Y, from which he calculates the probability that the 90 percent CI contains the true value of θ, given the value of the half-range Y (1983, p. 279). The calculations show that for samples of small range, the .90 confidence coefficient is conservative: The CI for y ≤ 4 will cover the true θ more than 95 percent of the time. However, for samples of wide range, y ≥ 10, which comprise about 6.5 percent of the total, the CI covers θ less than 12 percent of the time.
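These conditional coverages are straightforward to check by simulation; the sketch below uses the exact 90 percent half-width tan(0.45π) ≈ 6.31 for a Cauchy sampling distribution and conditions on the half-range. It reproduces the qualitative pattern described above (the precise percentages quoted are Jaynes's, not guaranteed by this sketch).

```python
import numpy as np

rng = np.random.default_rng(4)
theta, reps = 0.0, 2_000_000
x = theta + rng.standard_cauchy(size=(reps, 2))
est = x.mean(axis=1)                         # theta*, itself Cauchy about theta
y = np.abs(x[:, 0] - x[:, 1]) / 2            # half-range

c = np.tan(0.45 * np.pi)                     # 90 percent half-width for a Cauchy, ~6.31
covers = np.abs(est - theta) < c
print(round(covers.mean(), 3))               # ~ 0.90 unconditionally
print(round(covers[y <= 4].mean(), 3))       # conservative for samples of small range
print(round((y >= 10).mean(), 3), round(covers[y >= 10].mean(), 3))  # rare, poorly covered
```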

By abandoning the principle of being guided only by the sampling distribution, (F) can also avail itself of the conditional distribution and base different estimates of θ on different values of Y, choosing for each observed y the shortest CI that, within that y -subclass, covers the true θ 90 percent of the time. For samples of narrow range, this delivers much shorter intervals than the standard 90 percent CI, while for samples of wide range, it covers the true θ more often with a join of two separate intervals. The resulting rule is uniformly reliable in never under or overstating its probability of covering the true θ, but by now one will have guessed that the uniformly reliable rule is the Bayesian rule!

A recurring theme of Jaynes's writings is that the (F) devotees of error probabilities and performance characteristics have never bothered to investigate the performance of the Bayesian solutions they denigrate as "founded on an error" or to compare their performance with their own preferred solutions. (B) methods based on uninformed priors capture Fisher's desideratum of "allowing the data to speak for themselves" as evidenced by their agreement with (F) methods based on sufficient statistics. It is then rather an onerous thesis to maintain that they fail to do this in cases where (F) lacks a solution or where, as has just been seen, the (F) solution not so based leads to palpably absurd results or misleading statements of confidence. One can also sometimes criticize a frequentist solution as equivalent to a Bayesian solution based on an absurdly opinionated prior (see Jaynes 1983, p. 103).

Jaynes explains why (F) methods inevitably waste information as follows: "Orthodoxy requires us to choose a single estimator, b(D) ≡ b(X_1, …, X_n), before we have seen the data, and then use only b(D) for the estimation" (2003, p. 510). The observed value of this statistic then places one on a manifold (or subspace) of n-dimensional space of dimension n − 1. If position on this manifold is irrelevant for θ, then b(D) is a sufficient statistic, but if not, then D contains additional information relevant to θ that is not conveyed by specifying b(D). (B) is then able to choose the optimal estimator for the present data set. The sampling distribution of b(D) is simply not relevant, since one is free to choose different estimators or different CI's for different data sets.

Informed Priors and Entropy

Of the many approaches to constructing uninformed priors, group invariance has been stressed because of its intimate ties to consistency. The same rationale underwrites a powerful extension of (2) to a more general rule of minimal belief change that goes by minimizing the cross-entropy deviation from an initial (pre) distribution among all those satisfying empirically given distributional constraints (see the entry on information theory). Recall, the cross entropy or discrimination information of a distribution P = (p_1, …, p_n) with respect to Q = (q_1, …, q_n) is defined by

H(P; Q) = Σ_i p_i ln(p_i/q_i)

And when Q = (n^{−1}, …, n^{−1}) is a uniform distribution, the rule (MINXENT) of minimizing cross entropy specializes to the rule (MAXENT) of maximizing the (Shannon) entropy,

H(P) = −Σ_i p_i ln p_i
which is a measure of the uncertainty embodied in P. Entropy figures centrally in Claude Shannon's mathematical theory of communication (information theory), and looks to be a fundamental concept of probability theory as well. Thus, sufficient statistics, informally defined as "preserving all the information in the data relevant to inferences about θ," do actually preserve information in the sense of entropy (Jaynes 2003, §14.2). Also see Jaynes (§17.4) for further links between sufficiency, entropy, Fisher information, and the Cramer-Rao inequality.

When the psi-test discussed earlier leads one to expect a significant improvement in support by moving to an alternative (and possibly more complicated) model, MINXENT can lead one to it, as in the example of a biased die discussed in the information theory entry. Thus, MINXENT literally enables one to carve a model out of empirically given measurements or mean values. Jaynes's original (1957) application to equilibrium thermodynamics (Jaynes 1983, chapters 16) with later extensions to nonequilibrium thermodynamics (chapter 10, §D) remains the exemplar, but a veritable floodtide of additional applications to all areas of scientific research has since followed, as recorded in the proceedings of workshops on Bayesian and maximum entropy methods held annually since 1981. The inferential problems this opens to attack lie even further beyond the range of (F) or (L).
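For the flavor of such a calculation, the sketch below solves the familiar biased-die problem; the constraint used here, a mean of 4.5 spots, is the textbook value and may differ from the numbers in the entry referred to above. MAXENT subject to a mean constraint yields an exponential-family distribution whose multiplier is found numerically.

```python
import numpy as np
from scipy.optimize import brentq

faces = np.arange(1, 7)

def mean_given(lam):
    w = np.exp(lam * faces)                 # MAXENT solution has the form p_k proportional to exp(lam * k)
    return (faces * w).sum() / w.sum()

target = 4.5                                # the prescribed mean number of spots
lam = brentq(lambda l: mean_given(l) - target, -5.0, 5.0)
p = np.exp(lam * faces)
p /= p.sum()
print(np.round(p, 3))                       # approx [0.054, 0.079, 0.114, 0.165, 0.240, 0.347]
print(round((faces * p).sum(), 3))          # 4.5: the constraint is met
```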

Moreover, many classical models like the exponential or Gaussian arise most naturally as maxent distributions. Thus, the exponential, with density, f(x|θ) = θ exp(−θx), is the maxent distribution of a positive continuous X of known mean; the normal (or Gaussian) is that of a distribution whose first two moments are known. Jaynes (2003, p. 208) makes a serious case that this best accounts for the ubiquity of the Gaussian as a distribution of errors or noise, so that it is neither "an experimental fact" nor a "mathematical theorem," but simply the most honest representation of what is typically known about one's errors, namely, their "scale" and that positive and negative ones tend to cancel each other.

MAXENT functions primarily, though, as a means of arriving at informed priors. The superiority of a Bayes solution will be more manifest, in general, when substantial prior knowledge is formally incorporated in the analysis. Research might disclose, for example, that horse 1 finished ahead of horse 2 in two-thirds of the races both entered. If that is all that is known, then one's prior for tonight's race must satisfy p 1 = 2p 2. (How should this information affect the odds on the other horses?) In the inventory example of Jaynes (2003, §14.7), successive pieces of information, bearing on the decision which of three available colors to paint the day's run of 200 widgets so as to ensure twenty-four-hour delivery, are assimilated, starting with the current stocks of each color, the average number of each color sold per day, the average size of an individual order for each color, and so on. This is not just an amusing and instructive example of entropy maximization, but, evidently, one with serious practical applications.
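The parenthetical question can be answered by maximizing entropy under the single constraint p_1 = 2p_2; assuming, purely for illustration, a three-horse field, the third horse's probability is nudged slightly above 1/3.

```python
import numpy as np
from scipy.optimize import minimize

def neg_entropy(p):
    return np.sum(p * np.log(p))            # minimizing this maximizes Shannon entropy

cons = [{"type": "eq", "fun": lambda p: p.sum() - 1.0},
        {"type": "eq", "fun": lambda p: p[0] - 2.0 * p[1]}]    # the constraint p1 = 2*p2
res = minimize(neg_entropy, x0=np.array([0.4, 0.2, 0.4]),
               bounds=[(1e-9, 1.0)] * 3, constraints=cons)
print(np.round(res.x, 3))   # approx [0.436, 0.218, 0.346]: horse 3 ends up above 1/3
```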

At the other extreme of uninformativeness, the Bayesian econometrician Arnold Zellner has used entropy to define a maximal data informative prior (MDIP) as one that maximizes
∫ I(θ)p(θ)dθ − ∫ p(θ)ln p(θ)dθ

the difference between the average information in the data density and the (variable) prior density, where

I(θ) = ∫ f(x|θ)ln f(x|θ)dx
measures the information conveyed by the data density, f (x |θ ). Such a prior also maximizes the expected log-ratio of the likelihood function to the prior density (for a number of examples and yet another derivation of the Jeffreys log-uniform prior, see Zellner and Min 1993).

To Lindley's oft-repeated question, "Why should one's knowledge or ignorance of a quantity depend on the experiment being used to determine it?" Jaynes answers that the prior should be based on all the available prior information and "the role a parameter plays in a sampling distribution is always a part of that information" (1983, p. 352). However, it should not depend on the size of the sample contemplated (pp. 379–382).

Apart from the satisfaction of seeing that variously motivated lines of attack all lead to the same priors in the best understood cases, like location or scale parameters or regression coefficients (for which see Jaynes 1983, pp. 195–196), different methods can be expected to generalize in different ways when harder problems are addressed. Obviously, there is room for much creative thought here in what might be described as the new epistemology, the endeavor to accurately represent whatever is known in probabilistic terms, what Jaynes calls "that great neglected half of probability theory." Such research can be expected to further the development of artificial intelligence and the formation of consensus priors in decision theoretic or policymaking contexts.

Summary: A Bayesian Revolution?

Is the much heralded Bayesian revolution a fait accompli ? In his account of scientific revolutions, Thomas Kuhn may have erred in some of the details but certainly convinced his readers that there is a pattern here, something to construct a theory of. That applies, in particular, to revolutionary theory change, the overthrow of an old paradigm in favor of the new. Brush (1994, p. 137) touches on several of the reasons that usually enter in speaking of the acceptance of wave mechanics. Based on what has been surveyed in this entry, Bayesians would lodge the following parallel claims:

  • (B) offers simpler solutions to the salient problems of the old (F) paradigm
  • (B) offers a unified approach to all inferential problemsindeed, to all three problems of evidence, belief, and decision mentioned at the outset
  • (B) is able to pose and solve problems of obvious importance that lie beyond the range of (F) or (L), among them problems involving nuisance parameters and those amenable to MINXENT.

(B) also lays claim to greater resolving power in the detection of periodicities in time series or in separating periodicities from trends (Jaynes 2003, p. 125 and chapter 17).

(B) views (2) as embodying the entire logic of science. It has demonstrated that (2), as well as its extension to MINXENT, is anchored in the bedrock of consistency and that the price of inconsistency is inefficiency, the waste of information present in the data. Finally, (B) claims to be able to ascertain the limits of validity of the methods of (F) by viewing them as approximations to Bayesian methods. That, too, is highly characteristic of the claims a new paradigm lodges against the old. In any case, many time-honored procedures of (F), like significance tests or chi-square tests, retain an honorable place in the Bayesian corpus as approximate Bayes procedures, and where the elements needed for a Bayesian solution are lacking, one may use Bayesian logic to find a useful surrogate.

Critics will allege that Bayesians have not solved the "problem of the hypothesis space," namely, to which hypotheses should one assign probabilities? Jaynesians admit they have not solved this problem, but neither has anyone else. Jaynes's point, rather, is that the only way to discover that we have not gone to a deep enough hypothesis space is to draw inferences from the one we have. We learn most when our predictions fail, but to be certain that failed predictions reflect inadequacies of our hypothesis space rather than poor reasoning, "those inferences [must be] our best inferences, which make full use of all the information we have" (Jaynes 2003, p. 326).

See also Experimentation and Instrumentation; Probability and Chance.

Bibliography

Bretthorst, G. Larry. "On the Difference in Means." In Physics and Probability: Essays in Honor of Edwin T. Jaynes, edited by W. T. Grandy Jr. and P. W. Milonni. New York: Cambridge University Press, 1993.

Brush, Stephen G. "Dynamics of Theory Change: The Role of Predictions." PSA 1994. Vol. 2, edited by David Hull, Micky Forbes, and Richard Burian. East Lansing, MI: Philosophy of Science Association, 1994.

Cox, Richard T. "Probability, Frequency, and Reasonable Expectation." American Journal of Physics 14 (1946): 1–13.

De Finetti, Bruno. "La Prévision: ses lois logiques, ses sources subjectives." Annales de l'Institut Henri Poincaré 7 (1937): 1–68. Translated as "Prevision: Its Logical Laws, Its Subjective Sources." In Studies in Subjective Probability. 2nd ed., edited by Henry Kyburg and Howard Smokler. New York: Wiley, 1981.

De Finetti, Bruno. "Sur la condition d'équivalence partielle." Actualités scientifiques et industrielles, No. 739 (Colloque Genève d'Octobre 1937 sur la Théorie des Probabilités, 6ième partie). Paris: Hermann, 1938. Translated as "On the Condition of Partial Exchangeability." In Studies in Inductive Logic and Probability, edited by Richard C. Jeffrey. Berkeley: University of California Press, 1980.

De Finetti, Bruno. Probability, Induction, and Statistics. New York: Wiley, 1972.

De Groot, Morris. Probability and Statistics. 2nd ed. Reading, MA: Addison-Wesley, 1986.

Edwards, Ward, Harold Lindman, and Leonard J. Savage. "Bayesian Statistical Inference for Psychological Research." In Readings in Mathematical Psychology. Vol. 2, edited by R. Duncan Luce, R. Bush, and Eugene Galanter. New York: Wiley, 1965.

Festa, Roberto. Optimum Inductive Methods. Dordrecht, Netherlands: Kluwer Academic, 1993.

Fisher, Ronald A. The Design of Experiments. Edinburgh, Scotland: Oliver and Boyd, 1935.

Fisher, Ronald A. Statistical Methods and Scientific Inference. 2nd ed. Edinburgh, Scotland: Oliver and Boyd, 1959. All of Fisher's three main books on statistics have been reprinted by Oxford University Press (2003) under the title Statistical Methods, Experimental Design and Statistical Inference.

Giere, Ronald N. "Philosophy of Science Naturalized." Philosophy of Science 52 (1985): 331–356.

Good, I.J. Good Thinking. Minneapolis: University of Minnesota Press, 1983.

Hacking, Ian. Logic of Statistical Inference. New York: Cambridge University Press, 1965.

Hald, Anders. A History of Mathematical Statistics from 1750 to 1930. New York: Wiley, 1998.

Hill, T. P. "Base Invariance Implies Benford's Law." Proceedings of the American Mathematical Society 123 (1995): 887–895.

Hodges, J. L., and Erich Lehmann. Basic Concepts of Probability and Statistics. 2nd ed. San Francisco: Holden-Day, 1970.

Jaynes, E. T. Papers on Probability, Statistics, and Statistical Physics, edited by R. D. Rosenkrantz. Dordrecht, Netherlands: D. Reidel, 1983.

Jaynes, E. T. Probability Theory: The Logic of Science, edited by G. Larry Bretthorst. New York: Cambridge University Press, 2003.

Jeffreys, Harold. Theory of Probability. Oxford, U.K.: Clarendon Press, 1939.

Kalbfleish, John G., and D. A. Sprott. "On Tests of Significance." In Foundations of Probability Theory, Statistical Inference, and Statistical Theories of Science. Vol. 2, edited by William L. Harper and Clifford A. Hooker. Dordrecht, Netherlands: D. Reidel, 1976.

Kendall, Maurice G., and Alan Stuart. Advanced Theory of Statistics. 2nd ed. New York: Hafner, 1967.

Lehmann, Erich. Testing Statistical Hypotheses. New York: Wiley, 1959.

Lindley, Dennis V. Introduction to Probability and Statistics. Vol. 2, Inference. Cambridge, U.K.: Cambridge University Press, 1965.

Loredo, T. J. "From Laplace to Supernova SN 1987A: Bayesian Inference in Astrophysics." In Maximum Entropy and Bayesian Methods, edited by Paul Fougere. Dordrecht, Netherlands: Kluwer Academic, 1990.

Mayo, Deborah G. Error and the Growth of Experimental Knowledge. Chicago: University of Chicago Press, 1996.

Morrison, D., and R. Henkel, eds. The Significance Test Controversy. Chicago: Aldine Press, 1970.

Pearson, Egon S. The Selected Papers of E. S. Pearson. Berkeley: University of California Press, 1966.

Rosenkrantz, R. D. Inference, Method, and Decision. Dordrecht, Netherlands: D. Reidel, 1977.

Rosenkrantz, R. D. "The Justification of Induction." Philosophy of Science 59 (1992): 527–539.

Royall, Richard M. Statistical Evidence: A Likelihood Paradigm. London: Chapman and Hall, 1997.

Savage, L.J. The Foundations of Statistics. New York: Wiley, 1954.

Smith, C. Ray, and Gary J. Erickson. "Probability Theory and the Associativity Equation." In Maximum Entropy and Bayesian Methods, edited by Paul Fougere. Dordrecht, Netherlands: Kluwer Academic, 1990.

Zabell, S. L. "Symmetry and Its Discontents." In Causation, Chance, and Credence, edited by Brian Skyrms and W. L. Harper, 155–190. Dordrecht, Netherlands: Kluwer Academic, 1988.

Zellner, Arnold, and Chung-ki Min. "Bayesian Analysis, Model Selection, and Prediction." In Physics and Probability: Essays in Honor of Edwin T. Jaynes, edited by W. T. Grandy Jr. and P. W. Milonni. New York: Cambridge University Press, 1993.

Zellner, Arnold, Hugo A. Keuzenkamp, and Michael McAleer, eds. Simplicity, Inference, and Modeling: Keeping It Sophisticatedly Simple. New York: Cambridge University Press, 2001.

Roger D. Rosenkrantz
