1 Hypothesis testing

1.1 χ2 and Pearson’s tests

It is possible to estimate the probability that a set of lab data does not contradict the assumed form of the PDF for a random variable such as a measured quantity. The fundamental notion is that one can find a statistic κ representing the deviation between the data set and the assumed theoretical distribution. This statistic is then computed for the data set, resulting in some value κd, and the probability that κ would exceed this value is some confidence limit d:

$$ \wp(\kappa > \kappa_d) = d $$
Suppose that an experiment produces a collection of data
$$ x_1,\ x_2,\ \cdots,\ x_N $$
and we wish to determine whether or not this data backs up a theoretical prediction. Suppose also that the theoretical values of xi are
$$ \mu_1,\ \mu_2,\ \cdots,\ \mu_N $$
given by some curve that we hope the data fits fairly well. Statistical analysis of the validity of the hypothesis (that the data fits the theoretical curve) is obtained by assuming that each measurement xi is a random variable with a Normal distribution about the theoretical value, and so the probability that an experiment will produce measurement xi is
$$ \frac{d\wp_{x_i}(\xi_i)}{d\xi_i} = \frac{1}{\sqrt{2\pi\sigma_i^2}}\, e^{-\frac{(\xi_i-\mu_i)^2}{2\sigma_i^2}} $$
What is an appropriate σi? From our discussion of the various PDFs seen in physics, we realize that a good experiment should involve computing experimental values from averages of many data, and so we could use the standard deviations of the means of the xi, or settle on a standard, such as the variance of a Poisson distribution, for which the variance equals the mean: σi2 = ai = μi.
Then the probability that one experiment will result in the full data set
$$ \{x_1,\ x_2,\ \cdots,\ x_N\} $$
with ξi ≤ xi ≤ ξi + dξi is
$$ \frac{d^N\wp}{d\xi_1\, d\xi_2 \cdots d\xi_N} = \prod_{i=1}^{N}\frac{d\wp_{x_i}(\xi_i)}{d\xi_i} = \prod_{i=1}^{N}\frac{1}{\sqrt{2\pi\sigma_i^2}}\, e^{-\frac{(\xi_i-\mu_i)^2}{2\sigma_i^2}} = \frac{1}{(2\pi)^{N/2}\,\sigma_1\cdots\sigma_N}\, e^{-\frac{\chi^2}{2}} $$
where we define a new random variable chi-squared
$$ \chi^2 = \sum_{i=1}^{N}\frac{(\xi_i-\mu_i)^2}{\sigma_i^2} = \sum_{i=1}^{N}\frac{(\xi_i-\mu_i)^2}{\mu_i} $$
We can change variables into polar coordinates in the N-dimensional data space, integrate out the N - 1 angular variables, and compute the probability distribution for chi-squared, which is the probability that an experiment will have an outcome in which the variable χ2 is between χ02 and χ02 + dχ2
$$ \int \frac{d^N\wp}{d\xi_1\, d\xi_2 \cdots d\xi_N}\, d\xi_1\, d\xi_2 \cdots d\xi_N = \int C\, e^{-\frac{\chi^2}{2}}\, \chi^{N-1}\, d\chi $$
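(for example, in three dimensions dξ1 dξ2 dξ3 = χ2 dχ sinθ dθ dφ with ξ1 = χ sinθ cosφ, ξ2 = χ sinθ sinφ, and ξ3 = χ cosθ), and we can integrate over all of N-dimensional space to find the normalization constant C:
$$ 1 = \int_0^{\infty} C\, e^{-\frac{\chi^2}{2}}\, (\chi^2)^{\frac{N-1}{2}}\, \frac{d\chi^2}{2\chi} = C\, 2^{\frac{N}{2}-1}\, \Gamma\!\left(\frac{N}{2}\right) $$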
We have arrived at the PDF for this variable
$$ f_{\chi^2,N}(\xi) = \frac{1}{2^{N/2}\,\Gamma\!\left(\frac{N}{2}\right)}\, e^{-\frac{\xi}{2}}\, \xi^{\frac{N}{2}-1} $$
and so the probability of the statistic χ2 being greater than some value χd2 (for example that calculated from your data) is
$$ \wp(\chi^2 \ge \chi_d^2, N) = \int_{\chi_d^2}^{\infty} \frac{1}{2^{N/2}\,\Gamma\!\left(\frac{N}{2}\right)}\, e^{-\frac{\xi}{2}}\, \xi^{\frac{N}{2}-1}\, d\xi $$

[Figure: plot of the χ2 probability density]

Think about it: this represents the probability that a given set of data will have a deviation from the theoretically predicted form greater than the deviation measured for your data.
If this probability is nearly one, your fit is very good.
From the figure (or extremizing the integrand) you can see that the most probable value for ξ = χ2 is N - 2, for N statistical degrees of freedom.
I suppose that it is up to you how low you wish to go in regards to declaring a fit good.
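The tail integral above can be evaluated numerically. The following is a minimal Python sketch, not the program used to produce the tables in these notes. The names chi2_pdf and chi2_tail, the choice of composite Simpson's rule, and the step count are assumptions made for illustration. It computes ℘(χ2 ≥ χd2, N) as one minus the integral of the PDF from 0 to χd2, which is only sensible for N ≥ 2 (for N = 1 the integrand is singular at the origin) and for moderate values of χd2.

```python
import math

def chi2_pdf(xi, n):
    """PDF of xi = chi^2 for n degrees of freedom."""
    if xi < 0.0:
        return 0.0
    if xi == 0.0:
        # Limit at the origin: 1/2 for n = 2, zero for n > 2.
        return 0.5 if n == 2 else 0.0
    log_f = (-0.5 * xi + (0.5 * n - 1.0) * math.log(xi)
             - 0.5 * n * math.log(2.0) - math.lgamma(0.5 * n))
    return math.exp(log_f)

def chi2_tail(chi2_d, n, steps=2000):
    """P(chi^2 >= chi2_d): 1 minus a Simpson integral of the PDF on [0, chi2_d]."""
    if n < 2:
        raise ValueError("this sketch assumes n >= 2")
    if chi2_d <= 0.0:
        return 1.0
    h = chi2_d / steps  # steps must be even for Simpson's rule
    total = chi2_pdf(0.0, n) + chi2_pdf(chi2_d, n)
    for i in range(1, steps):
        weight = 4.0 if i % 2 else 2.0
        total += weight * chi2_pdf(i * h, n)
    return 1.0 - total * h / 3.0

print(chi2_tail(14.5, 9))
```

Evaluating chi2_pdf on either side of ξ = N - 2 is a quick way to see that the density peaks there.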

Consider now the simple problem of determining whether or not, and to what degree of confidence, a set of data points {(x1,y1),(x2,y2),⋅⋅⋅,(xN,yN)} conforms to a hypothesis such as a linear fit, y = ax + b. We take the following steps.
1. Find a curve of best fit y = ybest(x). This is done by minimizing the deviation of the data from the curve with respect to the parameters of the curve. The deviation is

$$ \sigma^2 = \sum_{i=1}^{N} \left(y_i - y_{best}(x_i)\right)^2 $$

2. We next construct a standard statistic, the χd2 for our data, and determine the probability that for the χ2 probability distribution, the value of χ2 should exceed our experimental value χd2. We decide upon an acceptable significance level.
A good fit will have a small χ2 (a perfect fit has no deviation from theoretical), and so will have ℘(χ2 ≥ χd2) close to 1. Generally we are willing to accept much less than this. I think that a very reasonable goodness of fit is that χ2 < N - 2, the most likely value of the variable.

Example
We have a set of data that we hope conforms to the theoretical curve

y  = 1 ⋅ x +  0
The set is {(2.0,2.0),(3.0,3.1),(4.0,3.9),(5.0,5.0),(6.0,6.0)}. The fit is clearly very good. Let’s apply the test. Our chi-squared statistic is
$$ \chi^2 = \frac{(2.0-2.0)^2}{2.0} + \frac{(3.1-3.0)^2}{3.0} + \frac{(3.9-4.0)^2}{4.0} + \frac{(5.0-5.0)^2}{5.0} + \frac{(6.0-6.0)^2}{6.0} = 0.00583 $$
We have five data points and expended no degrees of freedom to get the slope, so we need
$$ \wp(\chi^2 \ge 0.00583) = \alpha \quad \text{for} \quad k = N - 1 = 5 - 1 = 4 \ \text{degrees of freedom} $$
A short computation gives us

χd2         ℘(χ2 ≥ χd2)
0.001000    1.000000
0.050950    0.999681
0.100900    0.998769
0.150850    0.997295
0.200800    0.995285

which is pretty much as we expected; a good fit between experiment and theory will have α close to one.
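The arithmetic of this example can be checked with a few lines of Python. This is an illustrative sketch under stated assumptions: the names theory and chi2_tail_even are hypothetical, and the tail probability uses the closed form that the χ2 tail integral takes for an even number of degrees of freedom, rather than whatever routine produced the table above.

```python
import math

# Data from the example and the hypothesis y = 1*x + 0.
data = [(2.0, 2.0), (3.0, 3.1), (4.0, 3.9), (5.0, 5.0), (6.0, 6.0)]

def theory(x):
    return 1.0 * x + 0.0

# Chi-squared with sigma_i^2 = mu_i, as in the definition of the statistic.
chi2 = sum((y - theory(x)) ** 2 / theory(x) for x, y in data)

def chi2_tail_even(chi2_d, k):
    """P(chi^2 >= chi2_d) for an even number k of degrees of freedom.

    For even k the tail integral reduces to
    exp(-z) * sum_{j=0}^{k/2-1} z^j / j!, with z = chi2_d / 2.
    """
    if k <= 0 or k % 2 != 0:
        raise ValueError("the closed form used here needs even k > 0")
    z = 0.5 * chi2_d
    return math.exp(-z) * sum(z ** j / math.factorial(j) for j in range(k // 2))

k = len(data) - 1
alpha = chi2_tail_even(chi2, k)
print(f"chi2 = {chi2:.5f}, k = {k}, alpha = {alpha:.6f}")
```

Calling chi2_tail_even with the tabulated values of χd2 and k = 4 is a way to cross-check the short table.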

Example
Suppose we take the exact same data, and run a linear regression on it to find the slope a and intercept b of the line that best fits the data, and use that line as the theoretical curve. We are now measuring a goodness of fit rather than testing a hypothesis. We obtain a line of best fit y = 0.039999 + 0.990000x, and construct the chi-squared statistic

x          y          y - (ax + b)    (y - (ax + b))^2 / (ax + b)
2.000000   2.000000   -0.020000       0.000198
3.000000   3.100000    0.090000       0.002691
4.000000   3.900000   -0.100000       0.002500
5.000000   5.000000    0.010000       0.000020
6.000000   6.000000    0.020000       0.000067

obtaining χ2 = 0.005476. The goodness of fit for a χ2 distribution with k = N - 1 - 2 = 2 degrees of freedom (we have exhausted two degrees of freedom from our data to get slope and intercept) is

$$ \wp(\chi^2 \ge 0.005476) = \int_{0.005476}^{\infty} \frac{1}{2^{2/2}\,\Gamma\!\left(\frac{2}{2}\right)}\, x^{\frac{2}{2}-1}\, e^{-\frac{x}{2}}\, dx = \alpha = 0.997266 $$
using a 16 point Gaussian quadrature. Again no surprises, α is nearly one corresponding to a good fit.
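The regression and the statistic in this example can be reproduced with the sketch below. It assumes ordinary least squares for the fit and uses the fact that for k = 2 the tail integral is exactly exp(-χ2/2), so no quadrature is needed; the variable names are illustrative and this is not the 16 point Gaussian quadrature mentioned above.

```python
import math

data = [(2.0, 2.0), (3.0, 3.1), (4.0, 3.9), (5.0, 5.0), (6.0, 6.0)]
n = len(data)

# Ordinary least squares for y = a*x + b (a = slope, b = intercept).
x_mean = sum(x for x, _ in data) / n
y_mean = sum(y for _, y in data) / n
s_xy = sum((x - x_mean) * (y - y_mean) for x, y in data)
s_xx = sum((x - x_mean) ** 2 for x, _ in data)
a = s_xy / s_xx
b = y_mean - a * x_mean

# Chi-squared against the fitted line, with sigma_i^2 = fitted value.
chi2 = sum((y - (a * x + b)) ** 2 / (a * x + b) for x, y in data)

# k = N - 1 - 2 = 2; for k = 2 the tail integral is exactly exp(-chi2 / 2).
k = n - 1 - 2
alpha = math.exp(-0.5 * chi2)
print(f"a = {a:.6f}, b = {b:.6f}, chi2 = {chi2:.6f}, alpha = {alpha:.6f}")
```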

Based on these two examples, we will call α computed in this way our confidence in the agreement between experiment and theory for our data set. In the last example we are 99.7% confident that the hypothesis is supported by our data.

Example
Consider the data for a nuclear counting experiment, each count taken over 10 s,

counts j        0    1    2    3    4    5    6    7    8    9   10
frequency mj   57  205  383  525  532  408  273  139   45   27   16

The total number of events counted is 𝒩 = ∑j mj = 2608. We suppose that this data agrees with the proposition that

$$ m_j = \mathcal{N}\,\wp(j,a) = \mathcal{N}\,\frac{a^j}{j!}\,e^{-a} $$
and wish to test this hypothesis, being willing to call a confidence of 10% an acceptable basis for concluding that the hypothesis is supported.
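[Figure: histogram-style plot of the best-fit Poissonian with the actual data superimposed]
First we determine the Poissonian that best fits our data, aest = ã, which in this case is the unbiased estimator
$$ a_{est} = \sum_{j=0}^{\infty} \frac{j\, m_j}{\mathcal{N}} = 3.870. $$
This is illustrated, along with the actual data superimposed, as a histogram-style plot in the figure.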
We can see by visual inspection that the hypothesis is actually pretty well supported by this data, but we should still obtain some sort of numerical estimate of how well it is supported. To do this, we compute from the best-fit Poissonian the theoretical frequencies mj,theor = 𝒩℘(j,aest), and construct a χ2 exponent using the fact that a Poissonian has mean
$$ \bar{j} = \sum_{j=0}^{\infty} j\,\wp(j,a) = e^{-a}\sum_{j=0}^{\infty} j\,\frac{a^j}{j!} = e^{-a}\, a\,\frac{d}{da}\,e^{a} = a $$
and variance
$$ \sigma^2 = \sum_{j=0}^{\infty} (j-a)^2\, e^{-a}\,\frac{a^j}{j!} = a $$
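The two sums above are easy to confirm numerically. The sketch below is only a check, not part of the derivation: the helper name poisson_probs is hypothetical, and truncating the infinite sums at j = 100 is an assumption that is harmless for a mean near 3.87.

```python
import math

def poisson_probs(a, j_max):
    """Return [p(0), ..., p(j_max)] for a Poisson distribution with mean a."""
    probs = [math.exp(-a)]
    for j in range(1, j_max + 1):
        probs.append(probs[-1] * a / j)  # p(j) = p(j-1) * a / j
    return probs

a = 3.870
probs = poisson_probs(a, 100)  # sums truncated at j = 100 (assumption)
mean = sum(j * p for j, p in enumerate(probs))
variance = sum((j - a) ** 2 * p for j, p in enumerate(probs))
print(mean, variance)
```

Both printed numbers should agree with a to within rounding, which is the property σ2 = μ used in the denominator of the χ2 statistic.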
The Pearson χ2 statistic is the sum of the squared deviations of the data from the theoretical prediction, each divided by the predicted frequency,
$$ \chi_d^2 = \sum_{j=0}^{L} \frac{\left(m_j - \mathcal{N}\wp(j)\right)^2}{\mathcal{N}\wp(j)} $$
in which the data has been sorted into N = L + 1 classes (here N = 11).

 j   ℘(j,aest)   𝒩℘(j,aest)    mj - 𝒩℘(j,aest)   (mj - 𝒩℘(j,aest))^2 / 𝒩℘(j,aest)
 0   0.020858     54.398627      2.601373          0.124399
 1   0.080722    210.522688     -5.522688          0.144878
 2   0.156197    407.361402    -24.361402          1.456883
 3   0.201494    525.496208     -0.496208          0.000469
 4   0.194945    508.417582     23.582418          1.093846
 5   0.150888    393.515208     14.484792          0.533167
 6   0.097323    253.817309     19.182691          1.449766
 7   0.053805    140.324712     -1.324712          0.012506
 8   0.026028     67.882080    -22.882080          7.713222
 9   0.011192     29.189294     -2.189294          0.164204
10   0.004331     11.296257      4.703743          1.958631
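The table can be regenerated with the following sketch. It takes 𝒩 = 2608 and aest = 3.870 exactly as quoted in the text (note that the tabulated frequencies themselves add up to 2610, so these two inputs are treated as given rather than recomputed). The helper name poisson and the print layout are illustrative assumptions.

```python
import math

# Observed frequencies m_j for j = 0..10, as tabulated in the text.
m = [57, 205, 383, 525, 532, 408, 273, 139, 45, 27, 16]
n_total = 2608   # script-N as quoted in the text
a_est = 3.870    # best-fit Poisson parameter as quoted in the text

def poisson(j, a):
    """Poisson probability of j counts for mean a."""
    return a ** j * math.exp(-a) / math.factorial(j)

rows = []
for j, m_j in enumerate(m):
    expected = n_total * poisson(j, a_est)   # theoretical frequency
    diff = m_j - expected
    rows.append((j, poisson(j, a_est), expected, diff, diff ** 2 / expected))

# Pearson statistic: sum of the last column.
chi2_d = sum(row[4] for row in rows)
for row in rows:
    print("%2d %.6f %11.6f %11.6f %.6f" % row)
print("chi2_d = %.6f" % chi2_d)
```

The j = 8 class contributes more than half of the total, which is worth keeping in mind when judging the fit.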





In this case we find that χd2 = 14.651970. Pearson’s test consists of computing the probability that a standard χ2 variable will exceed this value, namely we compute

$$ \alpha = \wp(\chi^2 \ge \chi_d^2) $$
for a χ2 probability function with k degrees of freedom,
$$ k = N - r - 1 $$
and if this is greater than our significance level, then the deviations of the data from the proposed theoretical curve are deemed insignificant; we have good agreement and a valid hypothesis. The number r is the number of parameters for a theoretical best fit hypothesis curve that we have computed from the data, in this case r = 1. Therefore we need to compute for a k = 9 χ2 distribution the probability that the statistic will exceed χd2 = 14.651970,
$$ \wp(\chi^2 \ge 14.651970) = \frac{1}{\Gamma\!\left(\frac{9}{2}\right) 2^{9/2}} \int_{14.651970}^{\infty} \chi^{\frac{9}{2}-1}\, e^{-\frac{\chi}{2}}\, d\chi $$
A very short table of χ2 for nine degrees of freedom computed with the program chi2_table_gen provided in the support software section is as follows:

χd2          ℘(χ2 ≥ χd2)
13.500000    0.141256
13.750000    0.131500
14.000000    0.122325
14.250000    0.113706
14.500000    0.105618
14.750000    0.098036
15.000000    0.090936
15.250000    0.084294
15.500000    0.078086
15.750000    0.072289

Since we will accept a confidence of 10%, and the probability that χ2 exceeds our value 14.651970 is approximately 0.1 (or 10%) from the table, according to Sveshnikov’s statement of the Goodness of Fit Criterion, we could conclude that deviation from a predicted Poisson distribution is insignificant; we have a reasonably good fit under our acceptance criteria.
Why were we willing to go so low? We only have 2608 events, a pretty small set from which to build up a frequency plot. If we had tens of thousands, we could get a much smoother frequency plot that is less sensitive to the statistical fluctuations that overwhelm small data sets. Large data sets are good, small are not; you should always be thinking about the Central Limit theorem.
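Rather than interpolating in the table, the tail probability for k = 9 can be evaluated directly. The sketch below is one possible way to do this and is not the chi2_table_gen program referred to above: it assumes the standard power series for the regularized lower incomplete gamma function, and the name chi2_tail is hypothetical.

```python
import math

def chi2_tail(chi2_d, k):
    """P(chi^2 >= chi2_d) for k degrees of freedom, via the power series
    of the regularized lower incomplete gamma function P(k/2, chi2_d/2)."""
    if chi2_d <= 0.0:
        return 1.0
    a, z = 0.5 * k, 0.5 * chi2_d
    term = 1.0 / a
    total = term
    n = 0
    # gamma(a, z) = z^a e^-z * sum_n z^n / (a (a+1) ... (a+n))
    while term > 1e-17 * total:
        n += 1
        term *= z / (a + n)
        total += term
    lower = total * math.exp(-z + a * math.log(z) - math.lgamma(a))
    return 1.0 - lower

n_classes, r = 11, 1
k = n_classes - r - 1
alpha = chi2_tail(14.651970, k)
print(f"k = {k}, alpha = {alpha:.6f}, accept at 10%: {alpha > 0.10}")
```

The result should fall between the table entries for 14.500000 and 14.750000.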

1.2 Suggested reading

Material relevant to hypothesis testing and the χ2 test can be found in A. A. Sveshnikov Problems in Probability Theory, Mathematical Statistics and the Theory of Random Functions, Dover (1968). The radioactive counting example is Example 43.1 of this reference.

1.3 Problems

21. Examine the formula for the chi-squared statistic. Which experimental data points will dominate this formula (contribute most to it)? Is that realistic or appropriate in some way? Explain this carefully since this issue is actually quite important.

22. Prove that the mode of the chi-squared distribution for N degrees of freedom is N - 2.

23. Examine the decay-counting example due to Sveshnikov studied in this section. Which data points are most strongly affected by the experimenter forgetting to account for ambient background radiation? Explain the relevance of this question to the problem of setting up a reasonable value of α to use in deciding on whether or not a hypothesis is supported by experimental data.

1.4 An experiment in hypothesis testing

The purpose of this experiment is to derive an understanding of the fact that a result of a measurement is a random number distributed about some mean (average), with some standard deviation. In addition we will test this principle as a hypothesis using the χ2-test.

The apparatus is simply a collection of ten pennies.

When a handful of pennies is tossed, there is no way that you could predict with complete certainty what the outcome will be, but we accept as a hypothesis that the outcome of the measurement is a random number that has some distribution about a mean value. In this case the distribution is very simple: it is binomial. When you toss a single penny, you have a probability of p that it will turn up “heads”, 1 - p for “tails”. We certainly know p = 0.5, but let’s keep it variable. The generating function for an N-coin toss is

$$ \varphi(h,t) = \left(ph + (1-p)t\right)^N = \sum_{m=0}^{N} \binom{N}{m}\, p^m (1-p)^{N-m}\, h^m t^{N-m} $$
and the probability of m heads (and N - m tails) in the set is the coefficient of h^m t^(N-m),
$$ \wp(m) = \binom{N}{m}\, p^m (1-p)^{N-m} = \frac{1}{2^N}\binom{N}{m} \quad \left(p = \tfrac{1}{2}\right) $$
These numbers form our theoretical probabilities of getting m heads in a ten-coin toss (N = 10)
$$ \wp_{theor}(m) = \frac{\binom{10}{m}}{2^{10}} $$
What is the mean of this distribution of probabilities? You have computed this before
$$ \bar{h} = \sum_{m=0}^{N} \binom{N}{m}\, p^m (1-p)^{N-m}\, m\,(1)^{m-1} = \frac{d}{dh}\varphi(h,1)\Big|_{h=1} = \frac{d}{dh}\left(ph + 1 - p\right)^N\Big|_{h=1} = pN = 5 $$
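The theoretical probabilities and their mean can be tabulated in a few lines. This is a small illustrative sketch; the variable names are assumptions, and p is kept as a parameter in the same spirit as the text.

```python
import math

N = 10
p = 0.5

# Probability of m heads in an N-coin toss: C(N, m) p^m (1 - p)^(N - m).
p_theor = [math.comb(N, m) * p ** m * (1.0 - p) ** (N - m) for m in range(N + 1)]

# Mean number of heads, which should equal p * N.
mean_heads = sum(m * prob for m, prob in enumerate(p_theor))

for m, prob in enumerate(p_theor):
    print(f"{m:2d}  {prob:.6f}")
print("mean =", mean_heads)
```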

1.4.1 Experimental procedure

Perform 20 tosses of all of the coins at once into a box, each time recording the number of heads.

Now sort your data; determine the number of tosses N(m) that resulted in m heads, for m = 0,1,2,⋅⋅⋅,10, and compute the experimental probability of getting m heads;

$$ \wp_{exp}(m) = \frac{N(m)}{20} $$

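Do these experimental measurements of the probabilities confirm the hypothesis that the probability is given by $\wp_{theor}(m) = \binom{10}{m}/2^{10}$? To decide this, compute χ2 for your data, using Pearson’s statistic on the counts as in the nuclear counting example,

$$ \chi^2_{exp} = \sum_{m=0}^{10} \frac{\left(N(m) - 20\,\wp_{theor}(m)\right)^2}{20\,\wp_{theor}(m)} = 20 \sum_{m=0}^{10} \frac{\left(\wp_{exp}(m) - \wp_{theor}(m)\right)^2}{\wp_{theor}(m)} $$
Your results confirm the hypothesis with a 95% certainty if
$$ \wp(\chi^2 \ge \chi^2_{exp}) = 0.95 $$
You should compute the unbiased estimator of the average of your data
$$ \bar{h}_{exp} = \sum_{m=0}^{10} \wp_{exp}(m)\, m = \sum_{m=0}^{10} \frac{N(m)}{20}\, m $$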
How does this compare with the theoretical mean? Technically this is a measurement of a derived quantity; you do not measure the mean directly, you measure items upon which it depends, and calculate it from these values. You are measuring the mean value of the theoretical probability distribution for ten-coin tosses.

Perform the χ2 Pearson test. With what confidence does your data support the hypothesis? This experiment is a grossly simplified version of the nuclear counting experiment that you will eventually perform. I would suggest writing up your experiment and producing a lab report, with cover page, table of raw data, tables of processed data, calculations and conclusions. Include a graph of the binomial distribution (theoretical distribution) with your data points superimposed on it.
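Before tossing real pennies, the data reduction can be rehearsed on simulated tosses. The sketch below is an assumption-laden stand-in for the experiment, not a replacement for it: the pseudo-random generator and its seed are arbitrary, the statistic is Pearson's statistic in the count form used in the nuclear counting example, and the degrees of freedom follow k = N - r - 1 with 11 classes and r = 0 fitted parameters.

```python
import math
import random

N_COINS, N_TOSSES = 10, 20
rng = random.Random(12345)  # fixed seed (arbitrary) so the run is repeatable

# Simulated experiment: number of heads in each toss of ten fair coins.
heads = [sum(rng.random() < 0.5 for _ in range(N_COINS)) for _ in range(N_TOSSES)]

counts = [heads.count(m) for m in range(N_COINS + 1)]   # N(m)
p_exp = [c / N_TOSSES for c in counts]
p_theor = [math.comb(N_COINS, m) / 2 ** N_COINS for m in range(N_COINS + 1)]

# Pearson statistic in count form: sum (N(m) - 20 p)^2 / (20 p).
chi2_exp = sum((c - N_TOSSES * p) ** 2 / (N_TOSSES * p)
               for c, p in zip(counts, p_theor))
mean_exp = sum(m * pe for m, pe in enumerate(p_exp))

# 11 classes, no fitted parameters: k = 11 - 0 - 1 = 10 (even), so the tail
# probability has the closed form exp(-z) * sum_{j<k/2} z^j / j!, z = chi2/2.
k = (N_COINS + 1) - 0 - 1
z = 0.5 * chi2_exp
alpha = math.exp(-z) * sum(z ** j / math.factorial(j) for j in range(k // 2))
print(f"mean = {mean_exp:.3f}, chi2 = {chi2_exp:.4f}, alpha = {alpha:.4f}")
```

Replacing the simulated list heads with the twenty recorded head counts gives the same reduction for real data.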