Because there are various measures for comparing methods, researchers in human-computer interaction generally find it difficult to draw solid conclusions about a particular method. We consider some candidate measures of effectiveness (e.g., thoroughness, validity, reliability) and then summarize studies that have compared usability evaluation methods (UEMs) using one or more of these measures. We find that studies do not always provide the descriptive statistics needed to draw solid conclusions, especially with respect to validity. In addition, studies do not always compare UEMs against a standard yardstick, such as end-user testing, to establish an appropriate validity score. Finally, we suggest some possible ways to address criterion deficiency and criterion contamination: two important considerations for researchers attempting to optimize the balance between ultimate and actual criteria.