We can interpret data by assuming a specific structure for our outcome and using statistical methods to confirm or reject that assumption. The assumption is called a hypothesis, and the statistical tests used for this purpose are called statistical hypothesis tests.
In applied machine learning, whenever we want to make claims about the distribution of data, or about whether one set of results is different from another, we must rely on statistical hypothesis tests.
Hypothesis testing is a statistical approach used with experimental data to make decisions. It determines whether the experiment performed offers enough evidence to reject a proposition.
A null hypothesis states that there is no significant difference in a given set of observations.
The default assumption of a statistical test is called the null hypothesis, and we can compute and interpret statistical measures to decide whether or not the null hypothesis should be rejected.
When choosing between models based on their estimated skill, we are interested in whether there is a real, statistically significant difference between the two models, as in the sketch below.
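As a minimal sketch of this idea, the snippet below applies SciPy's paired t-test (scipy.stats.ttest_rel) to made-up cross-validation scores for two hypothetical models; the scores, the fold count, and the choice of test are illustrative assumptions, not a prescription.

```python
from scipy.stats import ttest_rel

# Hypothetical cross-validation scores for two models (assumed values, for illustration only)
scores_model_a = [0.81, 0.79, 0.84, 0.80, 0.82, 0.78, 0.83, 0.80, 0.79, 0.81]
scores_model_b = [0.84, 0.83, 0.85, 0.82, 0.86, 0.81, 0.84, 0.83, 0.82, 0.85]

# Null hypothesis: the mean skill of the two models is the same.
# A paired test is used here because both models are assumed to be scored on the same folds.
statistic, p_value = ttest_rel(scores_model_a, scores_model_b)

alpha = 0.05  # significance threshold
print(f"t-statistic = {statistic:.3f}, p-value = {p_value:.4f}")
if p_value < alpha:
    print("Reject the null hypothesis: the difference in skill is statistically significant.")
else:
    print("Fail to reject the null hypothesis: no significant difference detected.")
```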
Imagine that you and your friend play a game: if a coin lands on heads, you win $5, and if it lands on tails, he wins $5.
Let’s say the first two coin tosses landed on tails, meaning your friend won $10. Should you be worried that he’s using a rigged coin? Well, the probability of a fair coin landing on tails twice in a row is 0.5² = 25% (see above), which is not unlikely.
What if the coin landed on tails six times in a row? The probability of that occurring with a fair coin is 0.5⁶ ≈ 1.56% (see above), which is highly unlikely. At this point, it would be fair to suspect that the coin is rigged. Typically, one sets a threshold, usually 5%, to decide whether an event occurred by chance or not (if you have learned this before, this threshold is known as alpha, the significance level!).
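To make this concrete, here is a minimal sketch in plain Python that reproduces the 25% and ~1.56% figures under the null hypothesis of a fair coin (P(tails) = 0.5) and applies the 5% threshold:

```python
# Null hypothesis: the coin is fair, so each toss lands on tails with probability 0.5.
alpha = 0.05  # significance threshold

for num_tails in (2, 6):
    # Probability of this many tails in a row under the null hypothesis
    probability = 0.5 ** num_tails
    verdict = "reject the null (the coin looks rigged)" if probability < alpha else "fail to reject the null"
    print(f"{num_tails} tails in a row: p = {probability:.4f} -> {verdict}")
```

Running this prints p = 0.25 for two tails (not suspicious) and p ≈ 0.0156 for six tails, which falls below the 5% threshold.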
Now, let's use the coin example again so that we can understand these terms better: