Hypothesis Tests

Although most chemical measurements are quantitative, most questions we ask are qualitative. For example, measurements may be employed in court cases. The general public, a lawyer or a legislator is not really interested in whether a spectral peak has an area of 22.7 AU*nm; he or she is interested in how this can be interpreted. Is there evidence that an athlete has a significant level of dope in their urine? Does it look as if the concentration of a carcinogenic chemical in the air of a factory is above a safe limit? Is the orange juice adulterated? At the scene of the crime, does it look as if the dead person was poisoned?

What normally happens is that we make a number of measurements, and from these we try to determine some qualitative information. What we are really after is whether the measurements fit a hypothesis. As a forensic scientist we may wish to look at a blood sample and measure whether a poison is in the blood and if so whether its level is sufficiently above safe and natural limits (some poisons are quite subtle and hard to detect), and so whether the chemical evidence is consistent with a crime. Hypothesis tests are used to formulate this information statistically, and can also be called significance tests.

What we do first is to set up a hypothesis. Many people start with the null hypothesis: for example, in the absence of any specific evidence, let us assume that the level of a stimulant in an athlete's urine is not significant. To disprove our null hypothesis, we require strong evidence that there really is a lot of stimulant in the sample. This is not always so easy to prove. For example, many chemicals also have a natural origin, so one has to demonstrate that the compound is at a higher level than would naturally be found. To do this one may take a set of urine samples from a series of individuals without the stimulant added, perform some measurements (which will have a spread of results due to each person's metabolism) and then see whether the level of stimulant in the contested sample differs significantly. In addition there are other factors such as breakdown of metabolites and long-term storage of samples, which matter, for example, if we wish to perform more tests in a few months' time because the athlete takes us to court to contest the decision.

A significance test usually has a probability associated with it. We may want to be more than 99 % sure that the concentration of the metabolite is so high that it could not have occurred through normal metabolic processes, so that the chance we are falsely accusing the athlete is only 1 %; in other words we are 99 % sure that the null hypothesis is wrong. Some people use different significance levels, and a lot depends on whether we want to do screening, where a lower level might simply be used to eliminate the samples that appear all right, or really watertight confirmatory analysis, where a higher level might be advisable.

An understanding of the normal distribution can help us here. Let us say we knew that the concentration of a compound in normal subjects had an average of 5.32 ppm and a standard deviation of 0.91 ppm. We could reformulate the problem as follows. If we assume the errors are normally distributed, we can use a normal distribution (Section 3.4) and find that 99 % of measurements should be less than the mean plus 2.33 standard deviations [for Excel users the function NORMINV(0.99,0,1) can be used for this calculation], so if a reading is more than 5.32 + 2.33 × 0.91 = 7.44 ppm, it could only have arisen by chance in the background population in 1 % of the cases.

Of course life is not always so simple. For example, the original set of samples might be small and so the mean and standard deviation may not have been estimated very well. In order to make these adjustments special statistical tables are used to provide probabilities corrected for sample size, as will be discussed below. We would probably use what is called a t-test if there are not many samples; however, if there are a large number of samples in the original data this tends to a normal distribution. Note that the t-test still assumes that the errors are normally distributed in a background population, but if we measure only a few samples, this perfect distribution will become slightly distorted, and there will be significant chances that the mean and standard deviation of the samples differ from the true ones.
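The threshold calculation above can be reproduced outside a spreadsheet. The following is a minimal sketch in Python, assuming only the standard library; `NormalDist().inv_cdf` plays the role of Excel's NORMINV here, and the variable names are illustrative rather than taken from the text.

```python
from statistics import NormalDist

# Background population values quoted in the text (ppm).
mean_ppm = 5.32
sd_ppm = 0.91

# Number of standard deviations below which 99 % of a standard normal
# distribution lies; the counterpart of NORMINV(0.99,0,1).
z_99 = NormalDist().inv_cdf(0.99)

# Reading above which only 1 % of background measurements are expected.
threshold_ppm = mean_ppm + z_99 * sd_ppm

print(f"z for 99 %: {z_99:.2f}")
print(f"threshold: {threshold_ppm:.2f} ppm")
```

Any other probability level can be obtained by changing the argument of `inv_cdf`, which is the flexibility that printed tables lack.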

There are a number of such hypothesis tests (most based on errors being normally distributed), the chief ones being the F- and t-tests. These tests are either one- or two-tailed. A two-tailed test might provide us with information about whether two measurements differ significantly, for example whether a measurement is significantly different from a standard sample. In this case, we do not mind whether it is less or more. For example we might be looking at a food from a given supplier. From several years' experience we know the concentrations expected of various constituents. If we are testing an unknown sample (an example may be customs officers inspecting imports with a given label of authenticity), we would be surprised if the concentrations fall outside given limits, but deviations both above and below the expected range are equally interesting. A one-tailed test just looks at deviations in a single direction. The example of a potentially doped athlete is a case in point: we may be interested only if it appears that the level of dope is above a given level; we are not going to prosecute the athlete if it is too low. One-tailed tests are quite common when looking at significance of errors, and in ANOVA (Section 2.3).
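The difference between one- and two-tailed tests can be seen in the critical values of the normal distribution. The sketch below is illustrative only: the 5 % significance level is an assumption chosen for the example, not a value from the text.

```python
from statistics import NormalDist

alpha = 0.05  # illustrative significance level (an assumption)

standard_normal = NormalDist()

# One-tailed: all of alpha sits in a single (here the upper) tail.
z_one_tailed = standard_normal.inv_cdf(1 - alpha)

# Two-tailed: alpha is split equally between the lower and upper tails,
# so each tail holds alpha / 2 and the limit moves further out.
z_two_tailed = standard_normal.inv_cdf(1 - alpha / 2)

print(f"one-tailed limit: {z_one_tailed:.3f} standard deviations")
print(f"two-tailed limit: +/-{z_two_tailed:.3f} standard deviations")
```

For the same significance level the two-tailed limit is wider, because a deviation in either direction counts against the null hypothesis.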

The use of statistical tests in a variety of applications has developed over more than 100 years, and many scientists of yesteryear would be armed with statistical tables that had been laboriously calculated by hand. The vast majority of texts still advocate the use of such tables, and it is certainly important to be able to appreciate how to use them, but like books of logarithm tables and slide rules it is likely that their days are numbered, although it may take several years before this happens. We have already introduced the normal distribution table (Section 3.4) and will present some further tables below. It is important to recognize that common spreadsheets and almost all good scientific software will generate this information easily and more flexibly because any probability level can be used (tables are often restricted to specific probabilities). Because of the widespread availability of Excel, we will also show how this information can be obtained using spreadsheet commands.


Consider trying to determine whether a river downstream from a factory is polluted. One way of doing this might be to measure the concentration of a heavy metal, perhaps 10 times over a working week. Let us say we choose Cd, and do this one week in January and find that the mean is 13.1 ppb. We then speak to the factory manager who says that he or she will attempt to reduce the emissions within a month. In February we return and we take eight measurements and find that their mean is now 10.4 ppb. Are we confident that this really indicates a reduction in Cd content?

There are several factors that influence the concentration estimates, such as the quality of our instrument and measurement process, and the sampling process (Section 3.2). A difference in means may not necessarily imply that we have detected a significant reduction in the amount of Cd. Can we be sure that the new mean really represents a lower concentration?

Each series of measurements has a set of errors associated with it, and the trick is to look also at the standard deviation of the measurements. If, for example, our measurements have a standard deviation of 0.1 ppb, a reduction in the mean by 2.7 ppb is huge (27 standard deviations) and we can be really confident that the new value is lower. If, however, the standard deviation is 3 ppb, we may not have very much confidence.

To understand how we can determine the significance, return to the example of the normal distribution: we have already come across a simple example, of looking at the chances that a measurement can arise from a population that forms a known normal distribution. We assume that the overall distribution is known very well, for example we may have recorded 100 measurements and are extremely confident about the mean, standard deviation and normality, but our new test sample consists only of a single measurement. Perhaps we performed 100 tests in January to be sure, and then come out to obtain a single sample in February just to check the situation. Then all we do is use the normal distribution. If the mean and standard deviation of the original readings were 13.1 and 1.3, February's single reading of 10.4 is (13.1 - 10.4)/1.3 = 2.077 standard deviations below the mean. Using the Excel function NORMDIST(-2.077,0,1,TRUE), which returns the cumulative probability, we find that only 1.9 % of the measurements are expected to fall more than 2.077 standard deviations below the mean (this can also be verified from normal distribution tables). In other words, we would expect a reading of 10.4 or less to arise only about once in 50 times in the original population sampled in January, assuming the null hypothesis that there is no significant difference. This is quite low and we can, therefore, on the available evidence be more than 98 % sure that the new measurement is significantly lower than the previous measurements. We are, in fact, approximately using what is often called a one-tailed t-test for significance, as described below.
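The single-reading calculation can be sketched in Python as follows. This assumes a standard deviation of 1.3 ppb, the value used in the division above, and uses the standard library's cumulative normal distribution in place of the spreadsheet function; the variable names are illustrative.

```python
from statistics import NormalDist

january_mean = 13.1       # ppb, from the well-characterised January data
january_sd = 1.3          # ppb, assumption: the value used in the division
february_reading = 10.4   # ppb, the single new measurement

# Distance of the new reading from the January mean, in standard deviations.
z = (february_reading - january_mean) / january_sd

# Cumulative probability of a reading this low or lower under the
# null hypothesis that nothing has changed since January.
p_lower = NormalDist().cdf(z)

print(f"z = {z:.3f}")
print(f"probability of a reading this low or lower: {p_lower:.3f}")
```

Note that the cumulative probability is required here; the height of the density curve at z does not answer the question.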

In practice we usually do not have time to perform hundreds of measurements for our reference, and we usually take several measurements of the new test population. This means that we may not be completely confident of either of the means (in January or February), both of which carry a level of uncertainty. The difference in means depends on several factors:

• Whether there is actually a true underlying difference.

• How many samples were used to obtain statistics on both the original and the new data.

• What the standard deviations of the measurements are.

Then the means of the two datasets can be compared, and a significance attached to the difference. This is the principle of the t-test.

Let us assume that we obtain the following information.

January. We take 10 measurements with a mean of 13.1, and a standard deviation of 1.8.

February. We take 8 measurements with a mean of 10.4 and a standard deviation of 2.1.

How can we use a statistical test of significance? The following steps are performed:

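First we determine what is called a 'pooled variance'. This is the weighted average of the variances of both groups (the variance is the square of the standard deviation: see Section 3.3) and is defined by:

$$ s^2 = \frac{(n_1 - 1)\, s_1^2 + (n_2 - 1)\, s_2^2}{n_1 + n_2 - 2} $$

where n1 and n2 are the numbers of measurements in the two groups and s1 and s2 are their standard deviations.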

This equation takes into account the fact that different numbers of samples may be used in each set of measurements, so it would be unfair to weight a standard deviation obtained using four measurements as equally useful as one arising from 40. In our case the 'pooled variance' is 3.75, the corresponding standard deviation being 1.94. Next we calculate the difference between the means (m1 - m2), or 2.7. This should be calculated as a positive number, subtracting the smaller mean from the larger. The third step is to calculate the t-statistic, which is the ratio between the two values above (the difference between the means and the pooled standard deviation s), adjusted according to sample size, calculated by:

$$ t = \frac{m_1 - m_2}{s \sqrt{1/n_1 + 1/n_2}} $$

In our case this gives t = 2.939.

A final step is used to convert this to a probability. If we are interested only in whether the new value is significantly lower than the previous one, we obtain this using a one-tailed t-test: our question is not whether the new value of Cd is different from the previous one, but whether it is significantly lower. We are interested only in whether the pollution has decreased, as this is the claim of the factory manager. We first have to determine the number of degrees of freedom for the overall distribution, which equals n1 + n2 - 2 or 16. Then, in Excel, simply calculate TDIST(2.939,16,1) to give an answer of 0.0048. This implies that, if there were really no reduction in the Cd content, a difference this large would arise by chance less than 0.5 % of the time, so we can be confident that there has been a genuine reduction.
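The whole procedure, from pooled variance to one-tailed probability, can be sketched in Python using only the standard library. Because the standard library has no t-distribution, the tail probability is obtained here by numerically integrating the t density with Simpson's rule; this is an illustrative substitute for the spreadsheet's TDIST function, and the helper names `t_pdf` and `t_upper_tail` are hypothetical.

```python
import math

def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))

def t_upper_tail(t, df, steps=2000):
    """One-tailed probability P(T > t) for t >= 0 (Simpson's rule, even steps)."""
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    # Half of the distribution lies above zero; subtract the area from 0 to t.
    return 0.5 - total * h / 3

# Summary statistics from the text: number, mean and standard deviation.
n1, m1, s1 = 10, 13.1, 1.8   # January
n2, m2, s2 = 8, 10.4, 2.1    # February

df = n1 + n2 - 2
pooled_variance = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / df
pooled_sd = math.sqrt(pooled_variance)

# Difference between the means taken as a positive number.
t_stat = abs(m1 - m2) / (pooled_sd * math.sqrt(1 / n1 + 1 / n2))
p_one_tailed = t_upper_tail(t_stat, df)

print(f"pooled variance = {pooled_variance:.2f}, pooled sd = {pooled_sd:.2f}")
print(f"t = {t_stat:.3f} with {df} degrees of freedom")
print(f"one-tailed probability = {p_one_tailed:.4f}")
```

With a statistics package installed, the numerical integration would normally be replaced by that package's t-distribution function; it is written out here only to keep the sketch self-contained.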

What happens if we are not certain that there is a difference between the means? There may, of course, really be no difference, but it also may lie in our measurement technique (too much uncertainty) or the number of samples we have recorded (take more measurements).
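The effect of taking more measurements can be illustrated by holding the means and standard deviations of the Cd example fixed and varying the number of measurements. The sample sizes below are hypothetical, as is the simplification that both groups have the same size; the function name `pooled_t` is illustrative.

```python
import math

def pooled_t(m1, s1, n1, m2, s2, n2):
    """t-statistic for two groups, based on the pooled standard deviation."""
    pooled_variance = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)
    return abs(m1 - m2) / math.sqrt(pooled_variance * (1 / n1 + 1 / n2))

# Hypothetical numbers of measurements per group (not from the text).
sample_sizes = (3, 10, 30)

# Same means and standard deviations as in the Cd example.
t_values = [pooled_t(13.1, 1.8, n, 10.4, 2.1, n) for n in sample_sizes]

for n, t in zip(sample_sizes, t_values):
    print(f"{n:2d} measurements per group: t = {t:.2f}")
```

The same difference in means gives a larger t-statistic as the number of measurements grows, which is why an inconclusive result may simply call for more data.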

The t-test can be used in many different ways. A common situation is where there is a well established standard (perhaps an international reference sample) and we want to see whether our sample or analytical method results in a significantly different value from the reference method, to see whether our technique really is measuring the same thing. In this case we use the two-tailed t-test. We would be concerned if our mean is either significantly higher or lower than the reference. If you use Excel, and the TDIST function, replace the '1' by a '2' in the last parameter.

Another common reason for using a t-test is to compare analytical techniques or technicians or even laboratories. Sometimes people have what is called 'bias', that is their technique consistently over- or underestimates a value. This might involve problems with poorly calibrated apparatus, for example all balances should be carefully checked regularly, or it may involve problems with baselines. In many laboratories if a new technique is to be adopted it is good practice to first check its results are compatible with historic data.

In many books, t-test statistics are often presented in tabular form, an example being Table 3.4 of the one-tailed t-distribution. This presents the t-statistic for several probability levels. If, for example, we calculate a t-statistic of 2.650 with 13 degrees of freedom we can be 99 % certain that the means differ (using a one-tailed test; the probabilities in Table 3.4 are doubled for a two-tailed test). Notice that for an infinite number of degrees of freedom the t-statistic is the same as the normal distribution: see the normal distribution table (Table 3.2), for example a 95 % probability is around 1.645 standard deviations from the mean, corresponding to the t-statistic at the 5 % probability level in Table 3.4.
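A tabulated entry can be checked numerically. The sketch below recomputes the one-tailed probability for t = 2.650 with 13 degrees of freedom, shows the corresponding two-tailed probability (twice the one-tailed value), and illustrates the approach to the normal distribution at large degrees of freedom. The tail probability is obtained by Simpson's rule integration of the t density, an illustrative stand-in for a library function; `t_upper_tail` is a hypothetical helper name and 10000 degrees of freedom is an arbitrary stand-in for "infinite".

```python
import math
from statistics import NormalDist

def t_upper_tail(t, df, steps=2000):
    """One-tailed probability P(T > t) for t >= 0 (Simpson's rule, even steps)."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))

    def pdf(x):
        return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))

    h = t / steps
    total = pdf(0.0) + pdf(t)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    return 0.5 - total * h / 3

# Entry quoted in the text: t = 2.650 with 13 degrees of freedom.
p_one_tailed = t_upper_tail(2.650, 13)
p_two_tailed = 2 * p_one_tailed   # both tails count in a two-tailed test

# Large degrees of freedom: the t-distribution approaches the normal.
z_95 = NormalDist().inv_cdf(0.95)
p_large_df = t_upper_tail(z_95, 10000)

print(f"one-tailed p = {p_one_tailed:.4f}, two-tailed p = {p_two_tailed:.4f}")
print(f"normal 95 % point = {z_95:.3f}, t tail at 10000 df = {p_large_df:.4f}")
```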

The t-test for significance is often very useful, but it is always important not to apply such tests blindly. Data do have to be reasonably homogeneous and fit an approximate normal distribution. Often this is so, but a major experimental problem, to be discussed in Section 3.12, involves outliers, or atypical points. These could arise from rogue measurements and can

Table 3.4 One-tailed t-test

