If you have been following this guide from page one, you will know that the following output and interpretation relate to the Mann-Whitney U test results when your two distributions have a different shape, such that you are comparing mean ranks rather than medians. This is what happens when your data have violated Assumption 4 of the Mann-Whitney U test. The output is also based on the use of the Legacy Dialogs > 2 Independent Samples procedure in SPSS Statistics. If you have used the Nonparametric Tests > Independent Samples procedure in SPSS Statistics, or you need to know how to interpret medians because your data have met Assumption 4 of the Mann-Whitney U test, we explain how to do this in our enhanced Mann-Whitney U test guide, which you can access by subscribing to Laerd Statistics.

In the SPSS Statistics output below, we show you how to report the Mann-Whitney U test using mean ranks. To do this, SPSS Statistics produces three tables of output:

Descriptives

The Descriptive Statistics table looks as follows:

Published with written permission from SPSS Statistics, IBM Corporation.

Although we have shown you how to get SPSS Statistics to provide descriptive statistics for the Mann-Whitney U test, they are not actually very useful, for two reasons. Firstly, in order to compare the groups, we need the individual group values, not the amalgamated ones. This table does not provide this vital information, so we cannot compare any possible differences between the exercise and diet groups. Secondly, we chose the Mann-Whitney U test because one of the individual groups (the exercise group) was not normally distributed. However, we have not tested whether the amalgamation of the two groups results in the larger group being normally distributed. Therefore, we do not know whether to report the mean and standard deviation or the median and interquartile range (IQR). The IQR spans the 25th to 75th percentiles and acts as a surrogate for the standard deviation we would otherwise report if the data were normally distributed. For these reasons, we recommend that you ignore this table.
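The per-group values that the amalgamated table omits can be computed outside SPSS Statistics. The following is a minimal sketch in Python with NumPy, not part of the SPSS Statistics procedure; the cholesterol values and the helper name `describe` are hypothetical and are not the data used in this guide.

```python
import numpy as np

# Hypothetical cholesterol concentrations (mmol/L); NOT the data from the guide.
groups = {
    "diet": np.array([6.1, 6.4, 5.9, 6.8, 7.0, 6.3, 6.6, 5.8]),
    "exercise": np.array([5.2, 5.7, 5.5, 6.0, 5.1, 7.4, 5.3, 5.6]),
}

def describe(values):
    """Return the median and the 25th and 75th percentiles of one group."""
    q1, median, q3 = np.percentile(values, [25, 50, 75])
    return {"median": float(median), "q1": float(q1), "q3": float(q3)}

# One summary per group, rather than one for the pooled data.
summary = {name: describe(values) for name, values in groups.items()}
for name, stats in summary.items():
    print(f"{name}: median {stats['median']:.2f}, "
          f"IQR {stats['q1']:.2f} to {stats['q3']:.2f}")
```

Reporting the median with the two quartiles for each group separately avoids having to decide whether the pooled data are normally distributed.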

Ranks Table

The Ranks table is the first table that provides information regarding the output of the actual Mann-Whitney U test. It shows mean rank and sum of ranks for the two groups tested (i.e., the exercise and diet groups):


The table above is very useful because it indicates which group can be considered as having the higher cholesterol concentrations overall; namely, the group with the higher mean rank. In this case, the diet group had the higher cholesterol concentrations.
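To make the mean rank and sum of ranks concrete, here is a minimal sketch of how they arise, assuming Python with SciPy rather than SPSS Statistics. The data are hypothetical (not the values behind the table above): both groups are pooled, ranked together, and the ranks are then split back by group.

```python
import numpy as np
from scipy.stats import rankdata

# Hypothetical cholesterol concentrations (mmol/L); NOT the data from the guide.
diet = np.array([6.1, 6.4, 5.9, 6.8, 7.0, 6.3, 6.6, 5.8])
exercise = np.array([5.2, 5.7, 5.5, 6.0, 5.1, 7.4, 5.3, 5.6])

# Rank all observations together; tied values would share their average rank.
pooled = np.concatenate([diet, exercise])
ranks = rankdata(pooled)

# Split the pooled ranks back into the two groups.
diet_ranks = ranks[:len(diet)]
exercise_ranks = ranks[len(diet):]

for name, r in (("diet", diet_ranks), ("exercise", exercise_ranks)):
    print(f"{name}: N = {len(r)}, mean rank = {r.mean():.2f}, "
          f"sum of ranks = {r.sum():.1f}")
```

The two sums of ranks always add up to N(N + 1)/2 for N pooled observations, which is a quick check that a ranks table has been read correctly.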

Test Statistics Table

This table shows us the actual significance value of the test. Specifically, the Test Statistics table provides the test statistic (the U statistic) as well as the asymptotic significance (2-tailed) p-value.


From these data, it can be concluded that cholesterol concentration in the diet group was statistically significantly higher than in the exercise group (U = 110, p = .014). Depending on the size of your groups, SPSS Statistics will produce both exact and asymptotic statistical significance levels. Understanding which one to use is explained in our enhanced guide.
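For readers who want to see the exact and asymptotic significance levels side by side, the following is a hedged sketch using `scipy.stats.mannwhitneyu` in Python. It is an illustration on hypothetical data, not a reproduction of the SPSS Statistics output above, and it assumes a SciPy version that supports the `method` argument (1.7 or later, to the best of our knowledge).

```python
import numpy as np
from scipy.stats import mannwhitneyu

# Hypothetical cholesterol concentrations (mmol/L); NOT the data from the guide.
diet = np.array([6.1, 6.4, 5.9, 6.8, 7.0, 6.3, 6.6, 5.8])
exercise = np.array([5.2, 5.7, 5.5, 6.0, 5.1, 7.4, 5.3, 5.6])

# Exact p-value: feasible here because the samples are small and have no ties.
exact = mannwhitneyu(diet, exercise, alternative="two-sided", method="exact")

# Asymptotic p-value: normal approximation to the distribution of U.
asymptotic = mannwhitneyu(diet, exercise, alternative="two-sided",
                          method="asymptotic")

# SciPy returns U for the first sample; many packages report the smaller
# of the two possible U values, so both are shown.
n_pairs = len(diet) * len(exercise)
u_first = exact.statistic
u_smaller = min(u_first, n_pairs - u_first)

print(f"U (diet) = {u_first:.0f}, smaller U = {u_smaller:.0f}")
print(f"exact p = {exact.pvalue:.3f}, asymptotic p = {asymptotic.pvalue:.3f}")
```

With small, untied samples the exact p-value is usually preferred; the asymptotic value becomes a reasonable approximation as group sizes grow.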

In our enhanced Mann-Whitney U test guide, we show you: (a) how to use SPSS Statistics to determine whether your two distributions have the same shape or a different shape; (b) the two procedures – Nonparametric Tests > Independent Samples and Legacy Dialogs > 2 Independent Samples – that you can use to carry out a Mann-Whitney U test; (c) how to use SPSS Statistics to generate medians for the Mann-Whitney U test if your two distributions have the same shape; and (d) how to fully write up the results of the Mann-Whitney U test procedure whether you are comparing mean ranks or medians. We do this using the Harvard and APA styles. You can access our enhanced Mann-Whitney U test guide, as well as all of our SPSS Statistics content, by subscribing to Laerd Statistics, or learn more about our enhanced content in general on our Features: Overview page.

The Mann-Whitney (or Wilcoxon-Mann-Whitney) test is sometimes used for comparing the efficacy of two treatments in clinical trials. It is often presented as an alternative to a t test when the data are not normally distributed. Whereas a t test is a test of population means, the Mann-Whitney test is commonly regarded as a test of population medians. This is not strictly true, and treating it as such can lead to inadequate analysis of data.

Summary points

The Mann-Whitney test is used as an alternative to a t test when the data are not normally distributed
The test can detect differences in shape and spread as well as just differences in medians
Differences in population medians are often accompanied by equally important differences in shape
Researchers should describe the clinically important features of data and not just quote a P value

Use of Mann-Whitney test

The Mann-Whitney test is a test of both location and shape. Given two independent samples, it tests whether one variable tends to have values higher than the other. As Altman states, one form of the test statistic is an estimate of the probability that one variable is less than the other, although this statistic is not output by many statistical packages. In the case where the only distributional difference is a shift in location, this can indeed be described as a difference in medians. Hence, for example, the online help facility in Minitab 10.51 states that the Mann-Whitney test is “a two-sample rank test for the difference between two population medians . . . It assumes that the data are independent random samples from two populations that have the same shape.” Figure 1 shows two distributions for which this is the case. One distribution is shifted 0.75 units to the right: the medians differ by 0.75 units but the shapes are identical.

![Figure 1](https://i0.wp.com/www.ncbi.nlm.nih.gov/pmc/articles/PMC1120984/bin/hara5169.f1.jpg)

Figure 1. Two distributions with a difference in median but no difference in shape and spread
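The form of the test statistic that estimates the probability of one variable being less than the other can be recovered from U by dividing by the number of pairs. The sketch below assumes Python with SciPy; the data and the helper name `prob_less` are hypothetical and serve only to show that the pairwise definition and the U-based calculation agree.

```python
import numpy as np
from scipy.stats import mannwhitneyu

def prob_less(x, y):
    """Estimate P(X < Y) by comparing every pair, counting ties as one half."""
    x = np.asarray(x)[:, None]
    y = np.asarray(y)[None, :]
    return float(np.mean((x < y) + 0.5 * (x == y)))

# Hypothetical samples, for illustration only.
diet = np.array([6.1, 6.4, 5.9, 6.8, 7.0, 6.3, 6.6, 5.8])
exercise = np.array([5.2, 5.7, 5.5, 6.0, 5.1, 7.4, 5.3, 5.6])

# U for the first sample counts the pairs in which diet exceeds exercise.
u_diet = mannwhitneyu(diet, exercise).statistic
from_u = u_diet / (len(diet) * len(exercise))

# The same quantity obtained directly from all pairwise comparisons.
from_pairs = prob_less(exercise, diet)

print(f"P(exercise < diet) from U: {from_u:.3f}, from pairs: {from_pairs:.3f}")
```

A value of 0.5 would mean that neither group tends to have higher values; values near 0 or 1 indicate strong separation, whatever the shapes of the two distributions.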

Theoretically, in large samples the Mann-Whitney test can detect differences in spread even when the medians are very similar. However, an alternative form of the test is better than the standard Mann-Whitney test for this purpose. The alternative test, however, is not very efficient when population medians are unequal and is not widely available in statistical packages.

Differences in population medians are often accompanied by other differences in spread and shape. Moreover, the difference in medians may not be the most striking or the most clinically important difference. It is important to look at distributional differences and discuss them. Figure 2 shows an example in which the median values are 0.65 and 1.14 units. The distribution with the larger median also has larger spread. The spread is shown clearly in figure 3, which shows box plots of samples of 25 drawn from these two distributions. (The P value from the Mann-Whitney test is 0.02.) If the difference is assumed to be merely a difference in medians, other clinically important information could be ignored.

![Figure 2](https://i0.wp.com/www.ncbi.nlm.nih.gov/pmc/articles/PMC1120984/bin/hara5169.f2.jpg)

Figure 2. Two distributions with different medians and different shapes. The distribution with the larger median also has a greater spread

![Figure 3](https://i0.wp.com/www.ncbi.nlm.nih.gov/pmc/articles/PMC1120984/bin/hara5169.f3.jpg)

Figure 3. Box plots of samples of size 25 drawn from the distributions in figure 2. Vertical lines indicate the medians and boxes the interquartile range
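A situation like the one in figures 2 and 3 can be imitated with a short simulation. This is only a sketch: the lognormal family, the `sigma` values, and the seed are assumptions made for illustration, and only the two medians (0.65 and 1.14) and the sample size of 25 are taken from the text. It will not reproduce the P value quoted above.

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(2001)  # fixed seed so the sketch is repeatable

# Assumed lognormal populations: the median of a lognormal is exp(mean),
# so the medians match the text; the spreads (sigma) are invented.
narrow = rng.lognormal(mean=np.log(0.65), sigma=0.4, size=25)
wide = rng.lognormal(mean=np.log(1.14), sigma=0.8, size=25)

def five_number(values):
    """Minimum, lower quartile, median, upper quartile and maximum."""
    return np.percentile(values, [0, 25, 50, 75, 100])

result = mannwhitneyu(narrow, wide, alternative="two-sided")

# Report the shape of each sample alongside the test, not just the P value.
for label, sample in (("narrow", narrow), ("wide", wide)):
    print(label, np.round(five_number(sample), 2))
print(f"U = {result.statistic:.0f}, p = {result.pvalue:.3f}")
```

Printing the five-number summaries next to the test result makes a difference in spread visible even when the reader sees only the numbers and not a box plot.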

Methods

I examined the use of the Mann-Whitney test in papers published in the BMJ between September 1999 and August 2000. I did an online search of the electronic text of the journal using the keywords Wilcoxon, Mann, and Whitney. I identified five papers that had used the Mann-Whitney test but where, in my judgment, the information given suggested that there might be important distributional differences other than a shift in location. These are described briefly below.

Examples

Grande et al studied the impact on place of death of a hospital at home service for palliative care. The authors noted a significant difference among patients randomised to hospital at home care: “Patients in the hospital at home group who were admitted to the service survived significantly longer after referral than hospital at home patients who were not admitted (16 v 8 days).” There were 112 patients admitted to the service (median survival 16 days, interquartile range 5-42.5) and 73 patients who were not admitted (8, 3-18 days). The striking feature of these three summary statistics (the median and the two quartiles) is that each value in the admitted group is about twice the corresponding value in the group that was not admitted. This suggests that the difference between the two distributions might not be just a shift of 8 days: the difference might be multiplicative, not additive; that is, patients who were admitted might survive twice as long as those who were not admitted.
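The additive versus multiplicative reading can be checked with simple arithmetic on the quoted summary statistics. The sketch below uses only the medians and quartiles given above; the variable names are ours. A pure shift would give roughly equal differences, whereas a pure scaling would give roughly equal ratios.

```python
# Survival after referral (days), as quoted from Grande et al.
admitted = {"q1": 5.0, "median": 16.0, "q3": 42.5}
not_admitted = {"q1": 3.0, "median": 8.0, "q3": 18.0}

differences = {}
ratios = {}
for key in ("q1", "median", "q3"):
    # Additive view: how many days longer did the admitted group survive?
    differences[key] = admitted[key] - not_admitted[key]
    # Multiplicative view: how many times longer?
    ratios[key] = admitted[key] / not_admitted[key]
    print(f"{key}: difference = {differences[key]:.1f} days, "
          f"ratio = {ratios[key]:.2f}")
```

The differences range from 2 to 24.5 days, while the ratios stay between about 1.7 and 2.4, which is more consistent with a scaling of survival times than with a constant shift.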

Williams et al did a cost effectiveness study of open access follow up for inflammatory bowel disease. One of the measures was the total cost of secondary care, and this was compared for two groups: open access and routine visit. The mean (SD) cost was £582 (£807.94) for the 77 patients in the open access group and £611 (£475.47) for the 78 patients in the routine visit group. Although the mean is higher in the second group, the standard deviation is much higher in the first. There must, therefore, have been some very large values in the first group. Without further information it is difficult to be sure, but there seem to be distributional differences between the two groups. The choice of a Mann-Whitney test for these economic data has been criticised elsewhere. If total expenditure is the aspect of prime interest then a t test would have been more appropriate. If the interest lay in the distributions, it is unlikely that the medians alone would adequately have described the differences.
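Two quick calculations on the quoted means and standard deviations illustrate this reasoning. The sketch below is our illustration, not an analysis from either paper: it computes the coefficient of variation for each group, then shows how a t test that does not assume equal variances (Welch's test, our choice) could be run from summary statistics alone with `scipy.stats.ttest_ind_from_stats`.

```python
from scipy.stats import ttest_ind_from_stats

# Total cost of secondary care (pounds), as quoted from Williams et al.
open_access = {"mean": 582.0, "sd": 807.94, "n": 77}
routine = {"mean": 611.0, "sd": 475.47, "n": 78}

# Costs cannot be negative, so a standard deviation larger than the mean
# (coefficient of variation above 1) points to a long right tail.
cv_open = open_access["sd"] / open_access["mean"]
cv_routine = routine["sd"] / routine["mean"]
print(f"CV open access = {cv_open:.2f}, CV routine = {cv_routine:.2f}")

# Welch's t test from summary statistics (equal_var=False is our assumption).
welch = ttest_ind_from_stats(
    mean1=open_access["mean"], std1=open_access["sd"], nobs1=open_access["n"],
    mean2=routine["mean"], std2=routine["sd"], nobs2=routine["n"],
    equal_var=False,
)
print(f"t = {welch.statistic:.2f}, p = {welch.pvalue:.2f}")
```

The comparison of means addresses total expenditure; it says nothing about the difference in shape that the coefficients of variation hint at, which would need the raw data to describe properly.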

Lux et al studied responses of local research ethics committees. A conclusion was that “The required number of complete copies of protocols and documents . . . was significantly lower for the local committees that used a fast track system.” The 44 committees in the fast track group required a median of three copies (95% percentiles 2 and 13) compared with 11 (1 and 15) copies for the 55 committees in the standard group. Not only are the medians different, the distributions must also be different. About half of the fast track committees asked for two or three copies, whereas about half of the other committees asked for 11-15 copies. These differences, which the authors did not comment on, relate to shape as well as location of the distributions.

Macleod et al studied women with breast cancer from affluent and deprived areas. One of their conclusions is “The time between the date of the referral letter and the first clinic was one day shorter in women from affluent areas.” The median (interquartile range) time was 6 (1-13) days in the affluent area and 7 (4-20) days in the deprived area. Although the medians differ by one day, the summary statistics suggest that the data for the deprived group are more right skewed, and differences between the two groups might be much more pronounced for the higher waiting times. It would have been helpful to discuss this in the paper.
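One way to put a number on "more right skewed" when only the median and quartiles are published is the quartile (Bowley) skewness coefficient. This measure is our choice for illustration and is not used in the papers discussed; the sketch applies it to the quoted figures.

```python
def bowley_skewness(q1, median, q3):
    """Quartile skewness: 0 when the quartiles are symmetric about the median,
    positive when the upper half of the box is longer than the lower half."""
    return (q3 + q1 - 2 * median) / (q3 - q1)

# Days from referral letter to first clinic, as quoted from Macleod et al.
affluent = {"q1": 1, "median": 6, "q3": 13}
deprived = {"q1": 4, "median": 7, "q3": 20}

skew_affluent = bowley_skewness(**affluent)
skew_deprived = bowley_skewness(**deprived)

# Compare the gap between groups at the median and at the upper quartile.
median_gap = deprived["median"] - affluent["median"]
upper_quartile_gap = deprived["q3"] - affluent["q3"]

print(f"skewness: affluent = {skew_affluent:.2f}, deprived = {skew_deprived:.2f}")
print(f"gap at median = {median_gap} day(s), "
      f"gap at upper quartile = {upper_quartile_gap} days")
```

The coefficient is about 0.17 for the affluent group and about 0.63 for the deprived group, and the gap between the groups grows from 1 day at the median to 7 days at the upper quartile.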

A similar feature is even more evident in data from a study of pain in blood glucose testing. A visual analogue scale was used to record pain at the ear or thumb. The authors report “The median pain score was 2 mm in the ear group and 8.5 mm in the thumb group . . . the difference in median pain score is small.” Although this is true, the box plots in the paper show that the spread of scores in the thumb group is much greater than for the ear group. In particular, at least three out of 30 people in the thumb group report a score that is at least twice the highest value in the ear group. Overall, values seem much higher in the thumb group. This is important because patients are likely to be more concerned with the worst pain they might experience than the median value.

Recommendations

Researchers should take care to describe their data and to be clear about the features that are most clinically important. They should use the statistical test that is most relevant for their hypotheses, and describe the features of the data that are likely to have caused a hypothesis to be rejected. As is always the case, it is not sufficient merely to report a P value. In the case of the Mann-Whitney test, differences in spread may sometimes be as clinically important as differences in medians, and these need to be made clear to the reader.