Mon 28 Feb 2005
One statistical technique is Cochran’s Q test, which is designed to assess differences across conditions in which there is a dichotomous outcome. For usability data, users might generate success data in multiple conditions or across multiple tasks. Cochran’s Q can test whether the responses across the conditions differ significantly from each other. It can be effective for data sets with a small number of users. In fact, for designs with dichotomous responses (0’s and 1’s), a sample size of 16 or greater allows the analyst to safely conduct a traditional analysis of variance (ANOVA). Cochran’s Q is effective with smaller samples, especially when the probability of a response (success, for example) is approximately .5.
So far, so good.
Each user generates a dichotomous response, success or failure, for each task. It should also be noted that Nielsen had coded some tasks in his data as partial successes, which have been recoded here as failures for purposes of exposition. This example speaks to the value of a formal statistical test, in that the pattern in the data may not be obvious from simple success rates, especially with only four users. The analyst may want to determine whether the six tasks differ among themselves.
A formal statistical test can be conducted by positing the following null hypothesis:
$\pi_1 = \pi_2 = \pi_3 = \pi_4 = \pi_5 = \pi_6$
where $\pi_k$ is the probability of success on the kth task. Note that in this case there are six tasks, so k runs from 1 to 6.
In Cochran’s Q test, the null hypothesis is that the probability of the target response is equal across all groups. For those unfamiliar with formal statistical tests, the question being asked is the following: “What is the probability I would have obtained my result assuming the null hypothesis is true?” If the obtained results are likely under the null hypothesis, then the analyst does not reject it and concludes there is no evidence of a difference among the groups.
I’m already a bit lost here…
Assuming S represents the n users, each completing a different tasks, with each task denoted as a level of the factor A, the Q statistic is defined as:
$Q = SS_A / MS_{A/S}$
where $SS_A$ (the sum of squares for factor A) is computed by taking the mean at the jth level of A, subtracting the grand mean from it, and then squaring that quantity. Note that “j” is an index for the a levels of the factor A. For Nielsen’s data, a=6 because there are six tasks. This squared quantity is computed for each subject and summed across subjects and levels, which amounts to multiplying the sum over the a levels by n. See Appendix A for the equations that go into the Q statistic. $MS_{A/S}$ (mean square of A within S) is simply the variance of the scores within a subject across the levels of A, averaged across subjects. In other words, it is the variance of each user’s a scores, averaged across the n users. Here n refers to the number of users; for Nielsen’s data, n=4. Recall that factor A is the different tasks, and that the n users compose the S source of variability.
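To make the definition concrete, here is a minimal Python sketch of the ratio described above. The data matrix is made up for illustration (it is not Nielsen's data), the function and variable names are hypothetical, and the within-user variance is assumed to use an a-1 denominator, which is the choice that makes the ratio agree with the usual row-and-column-totals formula for Cochran's Q.

```python
def cochran_q(data):
    """Cochran's Q as SS_A / MS_A/S.

    data: one row per user, one 0/1 entry per task.
    """
    n = len(data)       # number of users (S)
    a = len(data[0])    # number of tasks (levels of factor A)

    grand_mean = sum(sum(row) for row in data) / (n * a)
    task_means = [sum(row[j] for row in data) / n for j in range(a)]

    # SS_A: squared deviation of each task mean from the grand mean,
    # summed over tasks and multiplied by the number of users.
    ss_a = n * sum((m - grand_mean) ** 2 for m in task_means)

    # MS_A/S: variance of each user's a scores (assumed a-1 denominator),
    # averaged over the n users.
    variances = []
    for row in data:
        row_mean = sum(row) / a
        variances.append(sum((y - row_mean) ** 2 for y in row) / (a - 1))
    ms_a_s = sum(variances) / n

    if ms_a_s == 0:
        raise ValueError("Q is undefined: no user varies across tasks")
    return ss_a / ms_a_s


def cochran_q_totals(data):
    """Cross-check using the classical formula based on row and column totals."""
    a = len(data[0])
    col_totals = [sum(row[j] for row in data) for j in range(a)]
    row_totals = [sum(row) for row in data]
    total = sum(row_totals)
    numerator = (a - 1) * (a * sum(c * c for c in col_totals) - total ** 2)
    denominator = a * total - sum(r * r for r in row_totals)
    return numerator / denominator


# Hypothetical data: 4 users (rows) by 6 tasks (columns), 1 = success.
hypothetical_data = [
    [1, 1, 0, 1, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [1, 1, 1, 1, 0, 0],
    [1, 0, 0, 1, 0, 1],
]

print(round(cochran_q(hypothetical_data), 3))         # 10.588
print(round(cochran_q_totals(hypothetical_data), 3))  # 10.588
```

Both functions return the same value on this made-up matrix, which is a quick sanity check that the variance-ratio description and the totals-based formula describe the same statistic.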
The Q statistic is distributed as a χ² statistic with degrees of freedom equal to the number of levels (e.g., tasks) in the experiment minus one (i.e., a-1). It should be noted that the spreadsheet contains the critical values of the χ² distribution for degrees of freedom ranging from 1 to 20 (most usability tests will not have more than 20 tasks). The critical values cut off the 95th and 90th percentiles of the χ² distribution, so that the Type I error rate will be at the conventional .05 or at the more liberal .1. The more liberal criterion of .1 might be indicated for usability testing with few users.
That settles it: I’m quitting linguistics to do a PhD in baloney!
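For readers without the spreadsheet, the critical values can be approximated with nothing but the Python standard library. This is a rough sketch, not the spreadsheet's method: the function names are hypothetical, the χ² distribution function is evaluated with a standard series for the incomplete gamma function, and the critical value is found by bisection. The observed Q used at the end is a made-up value for illustration.

```python
import math


def chi2_cdf(x, df):
    """P(X <= x) for a chi-square variable with df degrees of freedom.

    Uses the power series for the regularized lower incomplete gamma
    function P(df/2, x/2).
    """
    if x <= 0:
        return 0.0
    s, z = df / 2.0, x / 2.0
    term = 1.0 / s
    total = term
    k = 1
    while term > 1e-15 * total:
        term *= z / (s + k)
        total += term
        k += 1
    return total * math.exp(-z + s * math.log(z) - math.lgamma(s))


def chi2_critical(df, alpha):
    """Value cutting off the upper alpha tail, found by bisection."""
    lo, hi = 0.0, 200.0   # wide enough bracket for df between 1 and 20
    for _ in range(100):
        mid = (lo + hi) / 2
        if chi2_cdf(mid, df) < 1 - alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


# Table of critical values for 1 to 20 degrees of freedom.
for df in range(1, 21):
    print(df, round(chi2_critical(df, 0.05), 3), round(chi2_critical(df, 0.10), 3))

# Decision for a made-up observed Q with six tasks (a - 1 = 5 degrees of freedom).
observed_q, df = 10.59, 5
print(observed_q > chi2_critical(df, 0.05))  # False: not significant at .05
print(observed_q > chi2_critical(df, 0.10))  # True: significant at .10
```

With five degrees of freedom the cutoffs are about 11.07 at .05 and 9.24 at .10, so a Q between those two values is exactly the situation in which the choice between the conventional and the more liberal criterion changes the conclusion.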

























