#!/usr/bin/env python # coding: utf-8 # In[2]: from IPython.display import Image Image('../../../python_for_probability_statistics_and_machine_learning.jpg') # [Python for Probability, Statistics, and Machine Learning](https://www.springer.com/fr/book/9783319307152) # In[3]: from __future__ import division get_ipython().run_line_magic('pylab', 'inline') # It is sometimes very difficult to unequivocally attribute outcomes to causal # factors. For example, did your experiment generate the outcome you were hoping # for or not? Maybe something did happen, but the effect is not pronounced # enough to separate it from inescapable measurement errors or other # factors in the ambient environment? Hypothesis testing is a powerful # statistical method to address these questions. Let's begin by again # considering our coin-tossing experiment with unknown parameter $p$. Recall # that the individual coin-flips are Bernoulli distributed. The first step is # to establish separate hypotheses. First, $H_0$ is the so-called null # hypothesis. In our case this can be # # $$ # H_0 \colon \theta < \frac{1}{2} # $$ # and the alternative hypothesis is then # $$ # H_1 \colon \theta \geq \frac{1}{2} # $$ # With this set up, the question now boils down to figuring out which # hypothesis the data is most consistent with. To choose between these, we need # a statistical test that is a function, $G$, of the sample set # $\mathbf{X}_n=\left\{ X_i \right\}_n $ into the real line, where $X_i$ is the # heads or tails outcome ($X_i \in \lbrace 0,1 \rbrace$). In other words, we # compute $G(\mathbf{X}_n)$ and check if it exceeds a threshold $c$. If not, then # we declare $H_0$ (otherwise, declare $H_1$). Notationally, this is the # following: # $$ # \begin{align*} # G(\mathbf{X}_n) < c & \Rightarrow H_0 \\\ # G(\mathbf{X}_n) \geq c & \Rightarrow H_1 # \end{align*} # $$ # In summary, we have the observed data $\mathbf{X}_n$ and a function # $G$ that maps that data onto the real line. Then, using the # constant $c$ as a threshold, the inequality effectively divides the real line # into two parts, one corresponding to each of the hypotheses. # # Whatever this test $G$ is, it will make mistakes of two types --- false # negatives and false positives. The false positives arise from the case where we # declare $H_0$ when the test says we should declare $H_1$. This is # summarized in the Table ref{tbl:decision}. # #

# # $$ # \begin{table} # \footnotesize # \centering # \begin{tabular}{l|p{1.3in}|p{1.3in}} # \multicolumn{1}{c}{ } & \multicolumn{1}{c}{Declare $H_0$ } & \multicolumn{1}{c}{ Declare $H_1$ } \\ # \hline # $H_0\:$ True & Correct & False positive (Type I error) \\ # $H_1\:$ True & False negative (Type II error) & Correct (true-detect) \\ # \hline # \end{tabular} # \caption{Truth table for hypotheses testing.} # \label{tbl:decision} \tag{1} # \end{table} # $$ # For this example, here are the false positives (aka false alarms): # $$ # P_{FA} = \mathbb{P}\left( G(\mathbf{X}_n) > c \mid \theta \leq \frac{1}{2} \right) # $$ # Or, equivalently, # $$ # P_{FA} = \mathbb{P}\left( G(\mathbf{X}_n) > c \mid H_0 \right) # $$ # Likewise, the other error is a false negative, which we can write # analogously as # $$ # P_{FN} = \mathbb{P}\left( G(\mathbf{X}_n) < c \vert H_1\right) # $$ # By choosing some acceptable values for either of these errors, # we can solve for the other one. The practice is usually to pick a value of # $P_{FA}$ and then find the corresponding value of $P_{FN}$. Note that it is # traditional in engineering to speak about *detection probability*, which is # defined as # $$ # P_{D} = 1- P_{FN} = \mathbb{P}\left( G(\mathbf{X}_n) > c \mid H_1\right) # $$ # In other words, this is the probability of declaring $H_1$ when the # test exceeds the threshold. This is otherwise known as the *probability of a # true detection* or *true-detect*. # # ## Back to the Coin Flipping Example # # In our previous maximum likelihood discussion, we wanted to derive an # estimator for the *value* of the probability of heads for the coin # flipping experiment. For hypthesis testing, we want to ask a softer # question: is the probability of heads greater or less than $\nicefrac{1}{2}$? As we # just established, this leads to the two hypotheses: # $$ # H_0 \colon \theta < \frac{1}{2} # $$ # versus, # $$ # H_1 \colon \theta > \frac{1}{2} # $$ # Let's assume we have five observations. Now we need the $G$ function # and a threshold $c$ to help pick between the two hypotheses. Let's count the # number of heads observed in five observations as our # criterion. Thus, we have # $$ # G(\mathbf{X}_5) := \sum_{i=1}^5 X_i # $$ # and, suppose further that we pick $H_1$ only if exactly five out of # five observations are heads. We'll call this the *all-heads* test. # # Now, because all of the $X_i$ are random variables, so is $G$ and we must # find the corresponding probability mass function for $G$. Assuming the # individual coin tosses are independent, the probability of five heads is $\theta^5$. # This means that the probability of rejecting the $H_0$ hypothesis (and choosing # $H_1$, because there are only two choices here) based on the unknown underlying # probability is $\theta^5$. In the parlance, this is known and the *power function* # as in denoted by $\beta$ as in # $$ # \beta(\theta) = \theta^5 # $$ # Let's get a quick plot this in [Figure](#fig:Hypothesis_testing_001). # # # In[4]: get_ipython().run_line_magic('matplotlib', 'inline') from matplotlib.pylab import subplots import numpy as np fig,ax=subplots() fig.set_size_inches((6,3)) xi = np.linspace(0,1,50) _=ax.plot(xi, (xi)**5,'-k',label='all heads') _=ax.set_xlabel(r'$\theta$',fontsize=22) _=ax.plot(0.5,(0.5)**5,'ko') fig.tight_layout() #fig.savefig('fig-statistics/Hypothesis_Testing_001.png') # # #

# #

Power function for the all-heads test. The dark circle indicates the value of the function indicating $\alpha$.

# # # # # Now, we have the following false alarm probability, # $$ # P_{FA} = \mathbb{P}( G(\mathbf{X}_n)= 5 \vert H_0) =\mathbb{P}( \theta^5 \vert H_0) # $$ # Notice that this is a function of $\theta$, which means there are # many false alarm probability values that correspond to this test. To be on the # conservative side, we'll pick the supremum (i.e., maximum) of this function, # which is known as the *size* of the test, traditionally denoted by $\alpha$, # $$ # \alpha = \sup_{\theta \in \Theta_0} \beta(\theta) # $$ # with domain $\Theta_0 = \lbrace \theta < 1/2 \rbrace$ which in our case is # $$ # \alpha = \sup_{\theta < \frac{1}{2}} \theta^5 = \left(\frac{1}{2}\right)^5 = 0.03125 # $$ # Likewise, for the detection probability, # $$ # \mathbb{P}_{D}(\theta) = \mathbb{P}( \theta^5 \vert H_1) # $$ # which is again a function of the parameter $\theta$. The problem with # this test is that the $P_{D}$ is pretty low for most of the domain of # $\theta$. For instance, values in the nineties for $P_{D}$ # only happen when $\theta > 0.98$. In other words, if the coin produces # heads 98 times out of 100, then we can detect $H_1$ reliably. Ideally, we want # a test that is zero for the domain corresponding to $H_0$ (i.e., $\Theta_0$) and # equal to one otherwise. Unfortunately, even if we increase the length of the # observed sequence, we cannot escape this effect with this test. You can try # plotting $\theta^n$ for larger and larger values of $n$ to see this. # # ### Majority Vote Test # # Due to the problems with the detection probability in the all-heads test, maybe # we can think of another test that will have the performance we want? Suppose we # reject $H_0$ if the majority of the observations are heads. Then, using the # same reasoning as above, we have # $$ # \beta(\theta) = \sum_{k=3}^5 \binom{5}{k} \theta^k(1-\theta)^{5-k} # $$ # [Figure](#fig:Hypothesis_testing_002) shows the power function # for both the majority vote and the all-heads tests. # In[5]: fig,ax=subplots() fig.set_size_inches((6,3)) from sympy.abc import theta,k # get some variable symbols import sympy as S xi = np.linspace(0,1,50) expr=S.Sum(S.binomial(5,k)*theta**(k)*(1-theta)**(5-k),(k,3,5)).doit() _=ax.plot(xi, (xi)**5,'-k',label='all heads') _=ax.plot(xi, S.lambdify(theta,expr)(xi),'--k',label='majority vote') _=ax.plot(0.5, (0.5)**5,'ko') _=ax.plot(0.5, S.lambdify(theta,expr)(0.5),'ko') _=ax.set_xlabel(r'$\theta$',fontsize=22) _=ax.legend(loc=0) fig.tight_layout() #fig.savefig('fig-statistics/Hypothesis_Testing_002.png') # # #

# #

Compares the power function for the all-heads test with that of the majority-vote test.

# # # # # In this case, the new test has *size* # $$ # \alpha = \sup_{\theta < \frac{1}{2}} \theta^{5} + 5 \theta^{4} \left(- \theta + 1\right) + 10 \theta^{3} \left(- \theta + 1\right)^{2} = \frac{1}{2} # $$ # As before we only get to upwards of 90% for detection # probability only when the underlying parameter $\theta > 0.75$. # Let's see what happens when we consider more than five samples. For # example, let's suppose that we have $n=100$ samples and we want to # vary the threshold for the majority vote test. For example, let's have # a new test where we declare $H_1$ when $k=60$ out of the 100 trials # turns out to be heads. What is the $\beta$ function in this case? # $$ # \beta(\theta) = \sum_{k=60}^{100} \binom{100}{k} \theta^k(1-\theta)^{100-k} # $$ # This is too complicated to write by hand, but the statistics module # in Sympy has all the tools we need to compute this. # In[6]: from sympy.stats import P, Binomial theta = S.symbols('theta',real=True) X = Binomial('x',100,theta) beta_function = P(X>60) print beta_function.subs(theta,0.5) # alpha 0.0176001001088524 print beta_function.subs(theta,0.70) 0.979011423996075 # These results are much better than before because the $\beta$ # function is much steeper. If we declare $H_1$ when we observe 60 out of 100 # trials are heads, then we wrongly declare heads approximately 1.8% of the # time. Otherwise, if it happens that the true value for $p>0.7$, we will # conclude correctly approximately 97% of the time. A quick simulation can sanity # check these results as shown below: # In[7]: from scipy import stats rv=stats.bernoulli(0.5) # true p = 0.5 # number of false alarms ~ 0.018 print sum(rv.rvs((1000,100)).sum(axis=1)>60)/1000. # The above code is pretty dense so let's unpack it. In the first line, we use the `scipy.stats` module to define the # Bernoulli random variable for the coin flip. Then, we use the `rvs` method of # the variable to generate 1000 trials of the experiment where each trial # consists of 100 coin flips. This generates a $1000 \times 100$ matrix where the # rows are the individual trials and the columns are the outcomes of each # respective set of 100 coin flips. The `sum(axis=1)` part computes the sum across the # columns. Because the values of the embedded matrix are only `1` or `0` this # gives us the count of flips that are heads per row. The next `>60` part # computes the boolean 1000-long vector of values that are bigger than 60. The # final `sum` adds these up. Again, because the entries in the array are `True` # or `False` the `sum` computes the count of times the number of heads has # exceeded 60 per 100 coin flips in each of 1000 trials. Then, dividing this # number by 1000 gives a quick approximation of false alarm probability we # computed above for this case where the true value of $p=0.5$. # # ## Receiver Operating Characteristic # # Because the majority vote test is a binary test, we can compute the *Receiver # Operating Characteristic* (ROC) which is the graph of the $(P_{FA}, # P_D)$. The term comes from radar systems but is a very general method for # consolidating all of these issues into a single graph. Let's consider a typical # signal processing example with two hypotheses. In $H_0$, there is noise but no # signal present at the receiver, # $$ # H_0 \colon X = \epsilon # $$ # where $\epsilon \sim \mathcal{N}(0,\sigma^2)$ represents additive # noise. In the alternative hypothesis, there is a deterministic signal at the receiver, # $$ # H_1 \colon X = \mu + \epsilon # $$ # Again, the problem is to choose between these two hypotheses. For # $H_0$, we have $X \sim \mathcal{N}(0,\sigma^2)$ and for $H_1$, we have $ X \sim # \mathcal{N}(\mu,\sigma^2)$. Recall that we only observe values for $x$ and # must pick either $H_0$ or $H_1$ from these observations. Thus, we need a # threshold, $c$, to compare $x$ against in order to distinguish the two # hypotheses. [Figure](#fig:Hypothesis_testing_003) shows the probability density # functions under each of the hypotheses. The dark vertical line is the threshold # $c$. The gray shaded area is the probability of detection, $P_D$ and the shaded # area is the probability of false alarm, $P_{FA}$. The test evaluates every # observation of $x$ and concludes $H_0$ if $x --> # #

# #

The two density functions for the $H_0$ and $H_1$ hypotheses. The shaded gray area is the detection probability and the shaded blue area is the probability of false alarm. The vertical line is the decision threshold.

# # # # # **Programming Tip.** # # The shading shown in [Figure](#fig:Hypothesis_testing_003) comes from # Matplotlib's `fill_between` function. This function has a `where` keyword # argument to specify which part of the plot to apply shading with specified # `color` keyword argument. Note there is also a `fill_betweenx` function that # fills horizontally. The `text` function can place formatted # text anywhere in the plot and can utilize basic \LaTeX{} formatting. # See the IPython notebook corresponding to this section for the source code. # # # # As we slide the threshold left and right along the horizontal axis, we naturally change the corresponding areas under # each of the curves shown in [Figure](#fig:Hypothesis_testing_003) and thereby # change the values of $P_D$ and $P_{FA}$. The contour that emerges from sweeping # the threshold this way is the ROC as shown in [Figure](#fig:Hypothesis_testing_004). This figure also shows the diagonal line which # corresponds to making decisions based on the flip of a fair coin. Any # meaningful test must do better than coin flipping so the more the ROC bows up # to the top left corner of the graph, the better. Sometimes ROCs are quantified # into a single number called the *area under the curve* (AUC), which varies from # 0.5 to 1.0 as shown. In our example, what separates the two probability density # functions is the value of $\mu$. In a real situation, this would be determined # by signal processing methods that include many complicated trade-offs. The key # idea is that whatever those trade-offs are, the test itself boils down to the # separation between these two density functions --- good tests separate the two # density functions and bad tests do not. Indeed, when there is no separation, we # arrive at the diagonal-line coin-flipping situation we just discussed. # # What values for $P_D$ and $P_{FA}$ are considered *acceptable* depends on the # application. For example, suppose you are testing for a fatal disease. It could # be that you are willing to except a relatively high $P_{FA}$ value if that # corresponds to a good $P_D$ because the test is relatively cheap to administer # compared to the alternative of missing a detection. On the other hand, # may be a false alarm triggers an expensive response, so that minimizing # these alarms is more important than potentially missing a detection. These # trade-offs can only be determined by the application and design factors. # # # #

# #

The Receiver Operating Characteristic (ROC) corresponding to [Figure](#fig:Hypothesis_testing_003).

# # # # # ## P-Values # # There are a lot of moving parts in hypothesis testing. What we need # is a way to consolidate the findings. The idea is that we want to find # the minimum level at which the test rejects $H_0$. Thus, the p-value # is the probability, under $H_0$, that the test-statistic is at least # as extreme as what was actually observed. Informally, this means # that smaller values imply that $H_0$ should be rejected, although # this doesn't mean that large values imply that $H_0$ should be # retained. This is because a large p-value can arise from either $H_0$ # being true or the test having low statistical power. # # If $H_0$ is true, the p-value is uniformly distributed in the interval $(0,1)$. # If $H_1$ is true, the distribution of the p-value will concentrate closer to # zero. For continuous distributions, this can be proven rigorously and implies # that if we reject $H_0$ when the corresponding p-value is less than $\alpha$, # then the probability of a false alarm is $\alpha$. Perhaps it helps to # formalize this a bit before computing it. Suppose $\tau(X)$ is a test # statistic that rejects $H_0$ as it gets bigger. Then, for each sample $x$, # corresponding to the data we actually have on-hand, we define # $$ # p(x) = \sup_{\theta \in \Theta_0} \mathbb{P}_{\theta}(\tau(X) > \tau(x)) # $$ # This equation states that the supremum (i.e., maximum) # probability that the test statistic, $\tau(X)$, exceeds the value for # the test statistic on this particular data ($\tau(x)$) over the # domain $\Theta_0$ is defined as the p-value. Thus, this embodies a # worst-case scenario over all values of $\theta$. # # Here's one way to think about this. Suppose you rejected $H_0$, and someone # says that you just got *lucky* and somehow just drew data that happened to # correspond to a rejection of $H_0$. What p-values provide is a way to address # this by capturing the odds of just a favorable data-draw. Thus, suppose that # your p-value is 0.05. Then, what you are showing is that the odds of just # drawing that data sample, given $H_0$ is in force, is just 5%. This means that # there's a 5% chance that you somehow lucked out and got a favorable draw of # data. # # Let's make this concrete with an example. Given, the majority-vote rule above, # suppose we actually do observe three of five heads. Given the $H_0$, the # probability of observing this event is the following: # $$ # p(x) =\sup_{\theta \in \Theta_0} \sum_{k=3}^5\binom{5}{k} \theta^k(1-\theta)^{5-k} = \frac{1}{2} # $$ # For the all-heads test, the corresponding computation is the following: # $$ # p(x) =\sup_{\theta \in \Theta_0} \theta^5 = \frac{1}{2^5} = 0.03125 # $$ # From just looking at these p-values, you might get the feeling that the second # test is better, but we still have the same detection probability issues we # discussed above; so, p-values help in summarizing some aspects of our # hypothesis testing, but they do *not* summarize all the salient aspects of the # *entire* situation. # # ## Test Statistics # # As we have seen, it is difficult to derive good test statistics for hypothesis # testing without a systematic process. The Neyman-Pearson Test is derived from # fixing a false-alarm value ($\alpha$) and then maximizing the detection # probability. This results in the Neyman-Pearson Test, # $$ # L(\mathbf{x}) = \frac{f_{X|H_1}(\mathbf{x})}{f_{X|H_0}(\mathbf{x})} \stackrel[H_0]{H_1}{\gtrless} \gamma # $$ # where $L$ is the likelihood ratio and where the threshold # $\gamma$ is chosen such that # $$ # \int_{x:L(\mathbf{x})>\gamma} f_{X|H_0}(\mathbf{x}) d\mathbf{x}=\alpha # $$ # The Neyman-Pearson Test is one of a family of tests that use # the likelihood ratio. # # **Example.** Suppose we have a receiver and we want to distinguish # whether just noise ($H_0$) or signal pluse noise ($H_1$) is received. # For the noise-only case, we have $x\sim \mathcal{N}(0,1)$ and for the # signal pluse noise case we have $x\sim \mathcal{N}(1,1)$. In other # words, the mean of the distribution shifts in the presence of the # signal. This is a very common problem in signal processing and # communications. The Neyman-Pearson Test then boils down to the # following, # $$ # L(x)= e^{-\frac{1}{2}+x}\stackrel[H_0]{H_1}{\gtrless}\gamma # $$ # Now we have to find the threshold $\gamma$ that solves the # maximization problem that characterizes the Neyman-Pearson Test. Taking # the natural logarithm and re-arranging gives, # $$ # x\stackrel[H_0]{H_1}{\gtrless} \frac{1}{2}+\log\gamma # $$ # The next step is find $\gamma$ corresponding to the desired # $\alpha$ by computing it from the following, # $$ # \int_{1/2+\log\gamma}^{\infty} f_{X|H_0}(x)dx = \alpha # $$ # For example, taking $\alpha=1/100$, gives # $\gamma\approx 6.21$. To summarize the test in this case, we have, # $$ # x\stackrel[H_0]{H_1}{\gtrless} 2.32 # $$ # Thus, if we measure $X$ and see that its value # exceeds the threshold above, we declare $H_1$ and otherwise # declare $H_0$. The following code shows how to # solve this example using Sympy and Scipy. First, we # set up the likelihood ratio, # In[8]: import sympy as S from sympy import stats s = stats.Normal('s',1,1) # signal+noise n = stats.Normal('n',0,1) # noise x = S.symbols('x',real=True) L = stats.density(s)(x)/stats.density(n)(x) # Next, to find the $\gamma$ value, # In[9]: g = S.symbols('g',positive=True) # define gamma v=S.integrate(stats.density(n)(x), (x,S.Rational(1,2)+S.log(g),S.oo)) # **Programming Tip.** # # Providing additional information regarding the Sympy variable by using the # keyword argument `positive=True` helps the internal simplification algorithms # work faster and better. This is especially useful when dealing with complicated # integrals that involve special functions. Furthermore, note that we used the # `Rational` function to define the `1/2` fraction, which is another way of # providing hints to Sympy. Otherwise, it's possible that the floating-point # representation of the fraction could disguise the simple fraction and # thereby miss internal simplification opportunities. # # # # We want to solve for `g` in the above expression. Sympy has some # built-in numerical solvers as in the following, # In[10]: print S.nsolve(v-0.01,3.0) # approx 6.21 # Note that in this situation it is better to use the numerical # solvers because Sympy `solve` may grind along for a long time to # resolve this. # # ### Generalized Likelihood Ratio Test # # The likelihood ratio test can be generalized using the following statistic, # $$ # \Lambda(\mathbf{x})= \frac{\sup_{\theta\in\Theta_0} L(\theta)}{\sup_{\theta\in\Theta} L(\theta)}=\frac{L(\hat{\theta}_0)}{L(\hat{\theta})} # $$ # where $\hat{\theta}_0$ maximizes $L(\theta)$ subject to # $\theta\in\Theta_0$ and $\hat{\theta}$ is the maximum likelihood estimator. # The intuition behind this generalization of the Likelihood Ratio Test is that # the denomimator is the usual maximum likelihood estimator and the numerator is # the maximum likelihood estimator, but over a restricted domain ($\Theta_0$). # This means that the ratio is always less than unity because the maximum # likelihood estimator over the entire space will always be at least as maximal # as that over the more restricted space. When this $\Lambda$ ratio gets small # enough, it means that the maximum likelihood estimator over the entire domain # ($\Theta$) is larger which means that it is safe to reject the null hypothesis # $H_0$. The tricky part is that the statistical distribution of $\Lambda$ is # usually eye-wateringly difficult. Fortunately, Wilks Theorem says that with # sufficiently large $n$, the distribution of $-2\log\Lambda$ is approximately # chi-square with $r-r_0$ degrees of freedom, where $r$ is the number of free # parameters for $\Theta$ and $r_0$ is the number of free parameters in # $\Theta_0$. With this result, if we want an approximate test at level # $\alpha$, we can reject $H_0$ when $-2\log\Lambda \ge \chi^2_{r-r_0}(\alpha)$ # where $\chi^2_{r-r_0}(\alpha)$ denotes the $1-\alpha$ quantile of the # $\chi^2_{r-r_0}$ chi-square distribution. However, the problem with this # result is that there is no definite way of knowing how big $n$ should be. The # advantage of this generalized likelihood ratio test is that it # can test multiple hypotheses simultaneously, as illustrated # in the following example. # # **Example.** Let's return to our coin-flipping example, except now we have # three different coins. The likelihood function is then, # $$ # L(p_1,p_2,p_3) = \texttt{binom}(k_1;n_1,p_1)\texttt{binom}(k_2;n_2,p_2)\texttt{binom}(k_3;n_3,p_3) # $$ # where $\texttt{binom}$ is the binomial distribution with # the given parameters. For example, # $$ # \texttt{binom}(k;n,p) =\sum_{k=0}^n \binom{n}{k} p^k(1-p)^{n-k} # $$ # The null hypothesis is that all three coins have the # same probability of heads, $H_0:p=p_1=p_2=p_3$. The alternative hypothesis is # that at least one of these probabilites is different. Let's consider the # numerator of the $\Lambda$ first, which will give us the maximum likelihood # estimator of $p$. Because the null hypothesis is that all the $p$ values are # equal, we can just treat this as one big binomial distribution with # $n=n_1+n_2+n_3$ and $k=k_1+k_2+k_3$ is the total number of heads observed for # any coin. Thus, under the null hypothesis, the distribution of $k$ is binomial # with parameters $n$ and $p$. Now, what is the maximum likelihood estimator for # this distribution? We have worked this problem before and have the following, # $$ # \hat{p}_0= \frac{k}{n} # $$ # In other words, the maximum likelihood estimator under the null # hypothesis is the proportion of ones observed in the sequence of $n$ trials # total. Now, we have to substitute this in for the likelihood under the null # hypothesis to finish the numerator of $\Lambda$, # $$ # L(\hat{p}_0,\hat{p}_0,\hat{p}_0) = \texttt{binom}(k_1;n_1,\hat{p}_0)\texttt{binom}(k_2;n_2,\hat{p}_0)\texttt{binom}(k_3;n_3,\hat{p}_0) # $$ # For the denomimator of $\Lambda$, which represents the case of maximizing over # the entire space, the maximum likelihood estimator for each separate binomial # distribution is likewise, # $$ # \hat{p}_i= \frac{k_i}{n_i} # $$ # which makes the likelihood in the denominator the following, # $$ # L(\hat{p}_1,\hat{p}_2,\hat{p}_3) = \texttt{binom}(k_1;n_1,\hat{p}_1)\texttt{binom}(k_2;n_2,\hat{p}_2)\texttt{binom}(k_3;n_3,\hat{p}_3) # $$ # for each of the $i\in \lbrace 1,2,3 \rbrace$ binomial distributions. Then, the # $\Lambda$ statistic is then the following, # $$ # \Lambda(k_1,k_2,k_3) = \frac{L(\hat{p}_0,\hat{p}_0,\hat{p}_0)}{L(\hat{p}_1,\hat{p}_2,\hat{p}_3)} # $$ # Wilks theorems states that $-2\log\Lambda$ is chi-square # distributed. We can compute this example with the statistics tools in Sympy and # Scipy. # In[11]: from scipy.stats import binom, chi2 import numpy as np # some sample parameters p0,p1,p2 = 0.3,0.4,0.5 n0,n1,n2 = 50,180,200 brvs= [ binom(i,j) for i,j in zip((n0,n1,n2),(p0,p1,p2))] def gen_sample(n=1): 'generate samples from separate binomial distributions' if n==1: return [i.rvs() for i in brvs] else: return [gen_sample() for k in range(n)] # **Programming Tip.** # # Note the recursion in the definition of the `gen_sample` function where a # conditional clause of the function calls itself. This is a quick way to reusing # code and generating vectorized output. Using `np.vectorize` is another way, but # the code is simple enough in this case to use the conditional clause. In # Python, it is generally bad for performance to have code with nested recursion # because of how the stack frames are managed. However, here we are only # recursing once so this is not an issue. # # # # Next, we compute the logarithm of the numerator of the $\Lambda$ # statistic, # In[12]: from __future__ import division np.random.seed(1234) # In[13]: k0,k1,k2 = gen_sample() print k0,k1,k2 pH0 = sum((k0,k1,k2))/sum((n0,n1,n2)) numer = np.sum([np.log(binom(ni,pH0).pmf(ki)) for ni,ki in zip((n0,n1,n2),(k0,k1,k2))]) print numer # Note that we used the null hypothesis estimate for the $\hat{p}_0$. # Likewise, for the logarithm of the denominator we have the following, # In[14]: denom = np.sum([np.log(binom(ni,pi).pmf(ki)) for ni,ki,pi in zip((n0,n1,n2),(k0,k1,k2),(p0,p1,p2))]) print denom # Now, we can compute the logarithm of the $\Lambda$ statistic as # follows and see what the corresponding value is according to Wilks theorem, # In[15]: chsq=chi2(2) logLambda =-2*(numer-denom) print logLambda print 1- chsq.cdf(logLambda) # Because the value reported above is less than the 5% significance # level, we reject the null hypothesis that all the coins have the same # probability of heads. Note that there are two degrees of freedom because the # difference in the number of parameters between the null hypothesis ($p$) and # the alternative ($p_1,p_2,p_3$) is two. We can build a quick Monte # Carlo simulation to check the probability of detection for this example using # the following code, which is just a combination of the last few code blocks, # In[16]: c= chsq.isf(.05) # 5% significance level out = [] for k0,k1,k2 in gen_sample(100): pH0 = sum((k0,k1,k2))/sum((n0,n1,n2)) numer = np.sum([np.log(binom(ni,pH0).pmf(ki)) for ni,ki in zip((n0,n1,n2),(k0,k1,k2))]) denom = np.sum([np.log(binom(ni,pi).pmf(ki)) for ni,ki,pi in zip((n0,n1,n2),(k0,k1,k2),(p0,p1,p2))]) out.append(-2*(numer-denom)>c) print np.mean(out) # estimated probability of detection # The above simulation shows the estimated probability of # detection, for this set of example parameters. This relative low # probability of detection means that while the test is unlikely (i.e., # at the 5% significance level) to mistakenly pick the null hypothesis, # it is likewise missing many of the $H_1$ cases (i.e., low probability # of detection). The trade-off between which is more important is up to # the particular context of the problem. In some situations, we may # prefer additional false alarms in exchange for missing fewer $H_1$ # cases. # # # ### Permutation Test # # # # # # # # The Permutation Test is good way to test whether or not # samples samples come from the same distribution. For example, suppose that # $$ # X_1, X_2, \ldots, X_m \sim F # $$ # and also, # $$ # Y_1, Y_2, \ldots, Y_n \sim G # $$ # That is, $Y_i$ and $X_i$ come from different distributions. Suppose # we have some test statistic, for example # $$ # T(X_1,\ldots,X_m,Y_1,\ldots,Y_n) = \vert\overline{X}-\overline{Y}\vert # $$ # Under the null hypothesis for which $F=G$, any of the # $(n+m)!$ permutations are equally likely. Thus, suppose for # each of the $(n+m)!$ permutations, we have the computed # statistic, # $$ # \lbrace T_1,T_2,\ldots,T_{(n+m)!} \rbrace # $$ # Then, under the null hypothesis, each of these values is equally # likely. The distribution of $T$ under the null hypothesis is the *permutation # distribution* that puts weight $1/(n+m)!$ on each $T$-value. Suppose $t_o$ is # the observed value of the test statistic and assume that large $T$ rejects the # null hypothesis, then the p-value for the permutation test is the following, # $$ # P(T>t_o)= \frac{1}{(n+m)!} \sum_{j=1}^{(n+m)!} I(T_j>t_o) # $$ # where $I()$ is the indicator function. For large $(n+m)!$, we can # sample randomly from the set of all permutations to estimate this p-value. # # **Example.** Let's return to our coin-flipping example from last time, but # now we have only two coins. The hypothesis is that both coins # have the same probability of heads. We can use the built-in # function in Numpy to compute the random permutations. # In[17]: x=binom(10,0.3).rvs(5) # p=0.3 y=binom(10,0.5).rvs(3) # p=0.5 z = np.hstack([x,y]) # combine into one array t_o = abs(x.mean()-y.mean()) out = [] # output container for k in range(1000): perm = np.random.permutation(z) T=abs(perm[:len(x)].mean()-perm[len(x):].mean()) out.append((T>t_o)) print 'p-value = ', np.mean(out) # Note that the size of total permutation space is # $8!=40320$ so we are taking relatively few (i.e., 100) random # permutations from this space. # # ### Wald Test # # The Wald Test is an asympotic test. Suppose we have $H_0:\theta=\theta_0$ and # otherwise $H_1:\theta\ne\theta_0$, the corresponding statistic is defined as # the following, # $$ # W=\frac{\hat{\theta}_n-\theta_0}{\texttt{se}} # $$ # where $\hat{\theta}$ is the maximum likelihood estimator and # $\texttt{se}$ is the standard error, # $$ # \texttt{se} = \sqrt{\mathbb{V}(\hat{\theta}_n)} # $$ # Under general conditions, $W\overset{d}{\to} \mathcal{N}(0,1)$. # Thus, an asympotic test at level $\alpha$ rejects when $\vert W\vert> # z_{\alpha/2}$ where $z_{\alpha/2}$ corresponds to $\mathbb{P}(\vert # Z\vert>z_{\alpha/2})=\alpha$ with $Z \sim \mathcal{N}(0,1)$. For our favorite # coin-flipping example, if $H_0:\theta=\theta_0$, then # $$ # W = \frac{\hat{\theta}-\theta_0}{\sqrt{\hat{\theta}(1-\hat{\theta})/n}} # $$ # We can simulate this using the following code at the usual # 5% significance level, # In[18]: from scipy import stats theta0 = 0.5 # H0 k=np.random.binomial(1000,0.3) theta_hat = k/1000. # MLE W = (theta_hat-theta0)/np.sqrt(theta_hat*(1-theta_hat)/1000) c = stats.norm().isf(0.05/2) # z_{alpha/2} print abs(W)>c # if true, reject H0 # This rejects $H_0$ because the true $\theta=0.3$ and the null hypothesis # is that $\theta=0.5$. Note that $n=1000$ in this case which puts us well inside the # asympotic range of the result. We can re-do this example to estimate # the detection probability for this example as in the following code, # In[19]: theta0 = 0.5 # H0 c = stats.norm().isf(0.05/2.) # z_{alpha/2} out = [] for i in range(100): k=np.random.binomial(1000,0.3) theta_hat = k/1000. # MLE W = (theta_hat-theta0)/np.sqrt(theta_hat*(1-theta_hat)/1000.) out.append(abs(W)>c) # if true, reject H0 print np.mean(out) # detection probability # ## Testing Multiple Hypotheses # # Thus far, we have focused primarily on two competing hypotheses. Now, we # consider multiple comparisons. The general situation is the following. We test # the null hypothesis against a sequence of $n$ competing hypotheses $H_k$. We # obtain p-values for each hypothesis so now we have multiple p-values to # consider $\lbrace p_k \rbrace$. To boil this sequence down to a single # criterion, we can make the following argument. Given $n$ independent hypotheses # that are all untrue, the probability of getting at least one false alarm is the # following, # $$ # P_{FA} = 1-(1-p_0)^n # $$ # where $p_0$ is the individual p-value threshold (say, 0.05). The # problem here is that $P_{FA}\rightarrow 1$ as $n\rightarrow\infty$. If we want # to make many comparisons at once and control the overall false alarm rate the # overall p-value should be computed under the assumption that none of the # competing hypotheses is valid. The most common way to address this is with the # Bonferroni correction which says that the individual significance level should # be reduced to $p/n$. Obviously, this makes it much harder to declare # significance for any particular hypothesis. The natural consequence of this # conservative restriction is to reduce the statistical power of the experiment, # thus making it more likely the true effects will be missed. # # In 1995, Benjamini and Hochberg devised a simple method that tells which # p-values are statistically significant. The procedure is to sort the list of # p-values in ascending order, choose a false-discovery rate (say, $q$), and then # find the largest p-value in the sorted list such that $p_k \le k q/n$, where # $k$ is the p-value's position in the sorted list. Finally, declare that $p_k$ # value and all the others less than it statistically significant. This procedure # guarantees that the proportion of false-positives is less than $q$ (on # average). The Benjamini-Hochberg procedure (and its derivatives) is fast and # effective and is widely used for testing hundreds of primarily false hypotheses # when studying genetics or diseases. Additionally, this # procedure provides better statistical power than the Bonferroni correction. # # # # # # # # # In this section, we discussed the structure of statistical hypothesis testing # and defined the various terms that are commonly used for this process, along # with the illustrations of what they mean in our running coin-flipping example. # From an engineering standpoint, hypothesis testing is not as common as # confidence-intervals and point estimates. On the other hand, hypothesis testing # is very common in social and medical science, where one must deal with # practical constraints that may limit the sample size or other aspects of the # hypothesis testing rubric. In engineering, we can usually have much more # control over the samples and models we employ because they are typically # inanimate objects that can be measured repeatedly and consistently. This is # obviously not so with human studies, which generally have other ethical and # legal considerations.