
Journal Information

  • Journal name: Applied Psychological Measurement
  • Frequency: Eight issues a year, 2008-
  • NLM title: -
  • ISO abbreviation: -
  • ISSN: -

209 results
  • A Power Formula for the Mantel–Haenszel Test for Differential Item Functioning
    • Author: Zhushan Li
    • Journal: Applied Psychological Measurement
    • 2015, Issue 5
    Abstract: The asymptotic power of the Mantel–Haenszel (MH) test for differential item functioning (DIF) is derived. The formula describes the behavior of the power when the number of items is large, so that the measured latent trait can be considered as the matching variable in the MH test. As the derived formula shows, the power is related to the sample size, the effect size of DIF, the item response function (IRF), and the distribution of the latent trait in the reference and focal groups. The formula provides an approximation of the power of the MH test and thus a practical guideline for DIF detection. It also suggests analytical explanations for the behavior of the MH test observed in many previous simulation studies. Based on the formula, this study shows how to conduct sample size calculations. The power of the MH test under practical models such as the two-parameter logistic (2PL) and three-parameter logistic (3PL) item response theory (IRT) models is discussed.
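The abstract references the MH statistic and DIF effect sizes without reproducing them. As a minimal companion sketch (not the article's power derivation), here is the continuity-corrected MH chi-square and the ETS delta effect size in the Holland–Thayer form; the function name and interface are illustrative:

```python
import numpy as np

def mantel_haenszel_dif(A, B, C, D):
    """Mantel-Haenszel DIF statistic from K 2x2 tables (one per score level).

    A, B: reference-group correct/incorrect counts per score level.
    C, D: focal-group correct/incorrect counts per score level.
    Returns the continuity-corrected MH chi-square and the ETS delta-scale
    effect size, Delta-MH = -2.35 * ln(alpha_MH).
    """
    A, B, C, D = map(np.asarray, (A, B, C, D))
    n_ref, n_foc = A + B, C + D          # group sizes per score level
    m1, m0 = A + C, B + D                # correct / incorrect margins
    N = n_ref + n_foc                    # total per score level

    expected_A = n_ref * m1 / N
    var_A = n_ref * n_foc * m1 * m0 / (N**2 * (N - 1))
    chi2 = (abs(A.sum() - expected_A.sum()) - 0.5) ** 2 / var_A.sum()

    alpha_mh = (A * D / N).sum() / (B * C / N).sum()  # common odds ratio
    delta_mh = -2.35 * np.log(alpha_mh)               # ETS delta metric
    return chi2, delta_mh
```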
  • Utilizing Response Times in Computerized Classification Testing
    Abstract: A well-known approach in computerized mastery testing is to combine the Sequential Probability Ratio Test (SPRT) stopping rule with item selection to maximize Fisher information at the mastery threshold. This article proposes a new approach in which a time limit is defined for the test and examinees’ response times are considered in both item selection and test termination. Item selection is performed by maximizing Fisher information per time unit, rather than Fisher information itself. The test is terminated once the SPRT makes a classification decision, the time limit is exceeded, or no remaining item has a high enough probability of being answered before the time limit. In a simulation study, the new procedure showed a substantial reduction in average testing time while slightly improving classification accuracy compared with the original method. In addition, the new procedure reduced the percentage of examinees who exceeded the time limit.
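A minimal sketch of the selection and stopping logic described above, assuming a 2PL IRF and a lognormal response-time model in the style of van der Linden (the article's exact RT model is not given in the abstract); the function names and the `speed` parameter are illustrative:

```python
import numpy as np

def p_2pl(theta, a, b):
    """2PL item response function."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def select_item(theta_hat, speed, a, b, beta, sigma, available, time_left):
    """Pick the unused item maximizing Fisher information per expected second,
    excluding items unlikely to fit in the remaining time.
    Assumes log T ~ N(beta_i - speed, sigma_i^2) for response times."""
    p = p_2pl(theta_hat, a, b)
    info = a**2 * p * (1 - p)                       # 2PL Fisher information
    exp_time = np.exp(beta - speed + sigma**2 / 2)  # lognormal mean RT
    feasible = available & (exp_time <= time_left)
    if not feasible.any():
        return None                                 # nothing fits: terminate
    return int(np.argmax(np.where(feasible, info / exp_time, -np.inf)))

def sprt_decision(u, a, b, theta0, theta1, alpha=0.05, beta_err=0.05):
    """SPRT decision from 0/1 responses u to the administered items (a, b)."""
    def loglik(th):
        p = p_2pl(th, a, b)
        return np.sum(u * np.log(p) + (1 - u) * np.log(1 - p))
    lr = loglik(theta1) - loglik(theta0)
    if lr >= np.log((1 - beta_err) / alpha):
        return "master"
    if lr <= np.log(beta_err / (1 - alpha)):
        return "non-master"
    return "continue"
```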
  • Item Response Theory Models for Carry-Over Effects Across Scales
    Abstract: It is common in educational and psychological tests or social surveys that the same statement is judged on multiple scales. These multiple responses are linked by the same statement, which may cause local dependence. Considering the way a statement is judged on multiple scales, a new class of item response theory (IRT) models is developed to account for the nonrecursive carry-over effect, in which a response can be affected only by its preceding response rather than by a subsequent response. The parameters of the models can be estimated with the freeware WinBUGS. Two simulation studies were conducted to evaluate the parameter recovery of the new models and the consequences of model misspecification. Results showed that the parameters of the new models were recovered fairly well; fitting unnecessarily complicated models to data that did not have the carry-over effect did little harm to parameter estimation; and ignoring the carry-over effect by fitting standard IRT models yielded biased estimates for the item parameters, the correlation between latent traits, and the test reliability. Two empirical examples with parallel design and sequential design are provided to demonstrate the implications and applications of the new models.
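The abstract does not give the model's parameterization, so the following is only a speculative illustration of the nonrecursive carry-over structure: a hypothetical binary IRF in which the preceding response shifts the logit by a carry-over parameter delta (delta = 0 recovers an ordinary 2PL); the article's actual models may differ:

```python
import numpy as np

def p_carry_over(theta, a, b, delta, prev_response):
    """Hypothetical carry-over IRF for judging the same statement on a
    second scale: the previous response (0/1) shifts the logit by delta,
    so the effect runs only forward (nonrecursive).  Illustration only."""
    logit = a * (theta - b) + delta * prev_response
    return 1.0 / (1.0 + np.exp(-logit))
```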
  • How Item Residual Heterogeneity Affects Tests of Differential Item Functioning
    Abstract: Differential item functioning (DIF) occurs when people with the same proficiency have different probabilities of giving a certain response to an item. The present study focused on an assumption implicit in popular methods for DIF testing that has received little attention in the published literature (item residual homogeneity). The assumption is explained, a strategy for detecting violations of it (i.e., item residual heterogeneity) is illustrated with empirical data, and simulations are carried out to evaluate the performance of binary logistic regression, two-group item response theory (IRT), and the Mantel–Haenszel (MH) test in the presence of item residual heterogeneity. Results indicated that heterogeneity inflated Type I error and attenuated power for logistic regression, and attenuated power and produced biased estimates of the latent focal group mean and standard deviation for two-group IRT. The MH test was robust to item residual heterogeneity, probably because it does not use the logistic function.
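For reference, a sketch of the binary logistic regression DIF test evaluated in the study, testing uniform DIF for one grouping variable via a likelihood-ratio comparison (the usual matching-score and 0/1 group coding; the article's full specification may also include an interaction term for nonuniform DIF):

```python
import numpy as np
import statsmodels.api as sm
from scipy.stats import chi2

def logistic_dif_test(item, total, group):
    """Likelihood-ratio test for uniform DIF with binary logistic regression.

    Compares  logit P(item=1) = b0 + b1*total   against
              logit P(item=1) = b0 + b1*total + b2*group,
    where `total` is the matching score and `group` is 0/1
    (reference/focal).  Returns the LR statistic and its p value (df=1).
    """
    X0 = sm.add_constant(np.column_stack([total]))
    X1 = sm.add_constant(np.column_stack([total, group]))
    m0 = sm.Logit(item, X0).fit(disp=0)
    m1 = sm.Logit(item, X1).fit(disp=0)
    lr = 2 * (m1.llf - m0.llf)
    return lr, chi2.sf(lr, df=1)
```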
  • An Investigation of the Impact of Guessing on Coefficient α and Reliability
    • Author: Insu Paek
    • Journal: Applied Psychological Measurement
    • 2015, Issue 4
    Abstract: Guessing is known to influence the test reliability of multiple-choice tests. Although many studies have examined the impact of guessing, they used rather restrictive assumptions (e.g., parallel test assumptions, homogeneous inter-item correlations, homogeneous item difficulty, and homogeneous guessing levels across items) to evaluate the relation between guessing and test reliability. Based on the item response theory (IRT) framework, this study investigated the extent of the impact of guessing on reliability under more realistic conditions where item difficulty, item discrimination, and guessing levels vary across items, with three different test lengths (TL). By accommodating multiple item characteristics simultaneously, this study also examined interaction effects between guessing and the other variables entered in the simulation. The simulation of the more realistic conditions and the calculations of reliability and classical test theory (CTT) item statistics were facilitated by expressing CTT item statistics, coefficient α, and reliability in terms of IRT model parameters. In addition to the general negative impact of guessing on reliability, results showed interaction effects between TL and guessing and between guessing and test difficulty.
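A sketch of the kind of computation the abstract describes: expressing coefficient α in terms of IRT (here 3PL) item parameters by integrating the IRFs over a trait distribution. The grid quadrature and the N(0,1) trait assumption are illustrative choices:

```python
import numpy as np

def alpha_from_3pl(a, b, c, n_quad=81):
    """Coefficient alpha implied by 3PL item parameters, theta ~ N(0,1).

    CTT moments via grid quadrature over theta:
    p_i = E[P_i(theta)], cov(X_i, X_j) = E[P_i P_j] - p_i p_j (i != j),
    then alpha = n/(n-1) * (1 - sum(item variances) / total variance).
    Raising the guessing parameters c compresses p_i toward the guessing
    floor and weakens inter-item covariances, which lowers alpha.
    """
    theta = np.linspace(-5, 5, n_quad)
    w = np.exp(-theta**2 / 2)
    w /= w.sum()                                     # normal weights
    P = c[:, None] + (1 - c[:, None]) / (
        1 + np.exp(-a[:, None] * (theta - b[:, None])))
    p = P @ w                                        # classical difficulty
    cov = (P * w) @ P.T - np.outer(p, p)             # inter-item covariances
    item_var = p * (1 - p)                           # binary item variances
    np.fill_diagonal(cov, item_var)
    n = len(a)
    return n / (n - 1) * (1 - item_var.sum() / cov.sum())
```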
  • Stochastic Curtailment in Adaptive Mastery Testing
    Abstract: A well-known stopping rule in adaptive mastery testing is to terminate the assessment once the examinee’s ability confidence interval lies entirely above or below the cut-off score. This article proposes new procedures that seek to improve such a variable-length stopping rule by coupling it with curtailment and stochastic curtailment. Under the new procedures, test termination can occur earlier if the probability is high enough that the current classification decision would remain the same should the test continue. Computation of this probability utilizes the normality of an asymptotically equivalent version of the maximum likelihood ability estimate. In two simulation sets, the new procedures showed a substantial reduction in average test length while maintaining classification accuracy similar to the original method.
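A simplified illustration of stochastic curtailment under asymptotic normality of the ability estimate; the projection (treating the current estimate as the true ability) and the thresholds are assumptions of this sketch, not the article's exact procedure:

```python
import numpy as np
from scipy.stats import norm

def stochastically_curtail(theta_hat, info_now, info_remaining, cutoff,
                           z=1.96, gamma=0.95):
    """Simplified stochastic-curtailment check for adaptive mastery testing.

    Treats the current ML estimate as the true ability and projects the
    final estimate as N(theta_hat, 1/I_total).  Stops early if the
    probability is at least gamma that the final confidence interval
    would still lie entirely on the current side of the cutoff.
    """
    se_total = 1.0 / np.sqrt(info_now + info_remaining)
    p_pass = norm.cdf((theta_hat - cutoff) / se_total - z)  # CI above cutoff
    p_fail = norm.cdf((cutoff - theta_hat) / se_total - z)  # CI below cutoff
    if p_pass >= gamma:
        return "stop: classify as master"
    if p_fail >= gamma:
        return "stop: classify as non-master"
    return "continue testing"
```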
  • Comparing Surface and Underlying Local Dependence Levels via Multivariate Correlations
    Abstract: Item response theory (IRT) is a set of psychometric models used in the social and behavioral sciences. As part of applying these models in practice, a number of assumptions are made. A large literature exists assessing the extent to which these assumptions are satisfied in a given data set. One of these assumptions, local independence, is the focus of the research described here. When the local independence assumption is violated, there is said to be local dependence (LD). Several different models of LD have been proposed, and a number of studies have been conducted examining the performance of different methods at detecting LD. Underlying LD (ULD) and surface LD (SLD) were proposed as two possible mechanisms underlying observed LD in an early exploration of detection procedures. In a number of previous studies, ULD appears to be more difficult to detect than SLD. In this article, the authors demonstrate a procedure for examining the comparability of induced LD and present results that suggest a re-interpretation of existing studies on LD detection.
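For context, one standard LD diagnostic is Yen's Q3, the correlation matrix of item residuals after removing the modeled trait effect; this sketch is background for LD detection in general, not the comparability procedure the article demonstrates:

```python
import numpy as np

def q3_matrix(responses, theta_hat, a, b):
    """Yen's Q3 local-dependence diagnostic under a 2PL model.

    responses: N x n 0/1 matrix; theta_hat: length-N ability estimates;
    a, b: length-n item parameters.  Large off-diagonal values in the
    returned n x n residual-correlation matrix flag locally dependent
    item pairs.
    """
    P = 1.0 / (1.0 + np.exp(-a * (theta_hat[:, None] - b)))  # N x n IRFs
    resid = responses - P                                    # item residuals
    return np.corrcoef(resid, rowvar=False)
```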
  • Reliability-Based Feature Weighting for Automated Essay Scoring
    • Author: Yigal Attali
    • Journal: Applied Psychological Measurement
    • 2015, Issue 4
    Abstract: From their earliest origins, automated essay scoring systems have strived to emulate human essay scores and viewed them as their ultimate validity criterion. Consequently, the importance (or weight) and even the identity of the computed essay features in the composite machine score were determined by statistical techniques that sought to optimally predict human scores from essay features. However, machine evaluation of essays is fundamentally different from human evaluation and therefore is not likely to measure the same set of writing skills. As a consequence, feature weights of human-prediction machine scores (reflecting their importance in the composite scores) are bound to reflect statistical artifacts. This article suggests alternative feature weighting schemes based on the premise of maximizing the reliability and internal consistency of the composite score. The article shows, in the context of a large-scale writing assessment, that these alternative weighting schemes are significantly different from human-prediction weights and give rise to comparable or even superior reliability and validity coefficients.
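A sketch of one reliability-based weighting scheme consistent with the abstract's premise: numerically maximizing the coefficient α of the weighted composite given the feature covariance matrix. The optimizer and the normalization are illustrative choices, not the article's specific schemes:

```python
import numpy as np
from scipy.optimize import minimize

def max_alpha_weights(S):
    """Feature weights maximizing coefficient alpha of the composite.

    For weights w and feature covariance matrix S (n x n),
    alpha(w) = n/(n-1) * (1 - sum_i w_i^2 S_ii / (w' S w)).
    alpha is invariant to rescaling w, so only the direction matters.
    """
    n = S.shape[0]
    diag = np.diag(S)

    def neg_alpha(w):
        return -(n / (n - 1)) * (1 - (w**2 * diag).sum() / (w @ S @ w))

    res = minimize(neg_alpha, x0=np.ones(n), method="Nelder-Mead")
    w = res.x / res.x.sum()          # normalize for interpretability
    return w, -neg_alpha(res.x)      # weights and achieved alpha
```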
  • Evaluation Methods for Composite Reliability, Classification Consistency, and Classification Accuracy of Mixed-Format Licensure Tests
    Abstract: The purpose of this study was to propose extensions of reliability estimation methods that could be used to determine the conditions under which single scoring of constructed-response (CR) items is as effective as double scoring in mixed-format licensure tests. Multivariate generalizability theory methods traditionally used to estimate overall composite score reliability were extended with simulations so that classification consistency and classification accuracy estimates could also be obtained. Composite score reliabilities, classification consistencies, and accuracies were estimated based on the double and single scoring of the CR items of three licensure tests. They were also estimated in decision studies considering varied testing situations, such as different numbers of CR items and different CR section weights.
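A minimal sketch of the simulation extension described above, assuming normal true scores and parallel observed scores whose variance components would come from the G-study; the cut score and distributional form are illustrative:

```python
import numpy as np

def classification_consistency(mu, var_true, var_error, cut,
                               n_sim=100_000, seed=0):
    """Monte Carlo classification consistency and accuracy for a composite.

    True scores T ~ N(mu, var_true); two parallel observed scores
    X_k = T + E_k with E_k ~ N(0, var_error).  Consistency is the chance
    two administrations classify the same way; accuracy is the chance an
    observed classification matches the true-score classification.
    """
    rng = np.random.default_rng(seed)
    T = rng.normal(mu, np.sqrt(var_true), n_sim)
    X1 = T + rng.normal(0, np.sqrt(var_error), n_sim)
    X2 = T + rng.normal(0, np.sqrt(var_error), n_sim)
    consistency = np.mean((X1 >= cut) == (X2 >= cut))
    accuracy = np.mean((X1 >= cut) == (T >= cut))
    return consistency, accuracy
```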
  • A Framework for Anchor Methods and an Iterative Forward Approach for DIF Detection
    Abstract: In differential item functioning (DIF) analysis, a common metric is necessary to compare item parameters between groups of test-takers. In the Rasch model, the same restriction is placed on the item parameters in each group to define a common metric. However, how the items in the restriction, termed anchor items, are appropriately selected remains a major challenge. This article proposes a conceptual framework for categorizing anchor methods: the anchor class, which describes characteristics of the anchor methods, and the anchor selection strategy, which guides how the anchor items are determined. Furthermore, the new iterative forward anchor class is proposed. Several anchor classes are implemented with different anchor selection strategies and compared in an extensive simulation study. The results show that the new anchor class combined with the single-anchor selection strategy is superior in situations where no prior knowledge about the direction of DIF is available.
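A simplified sketch of the iterative forward idea for Rasch difficulties estimated separately in two groups; the published method grows the anchor using significance tests rather than the fixed length used here, so treat this only as an outline:

```python
import numpy as np

def iterative_forward_anchor(b_ref, b_foc, max_anchor=None):
    """Simplified iterative forward anchor selection for Rasch DIF analysis.

    Starts from the item whose difficulty difference is closest to the
    median shift, then grows the anchor one item at a time: at each step
    the focal metric is aligned on the current anchor and the item with
    the smallest absolute aligned difference is added.
    """
    n = len(b_ref)
    max_anchor = max_anchor or max(1, n // 4)
    diff = b_foc - b_ref
    anchor = [int(np.argmin(np.abs(diff - np.median(diff))))]
    while len(anchor) < max_anchor:
        shift = diff[anchor].mean()       # alignment from current anchor
        aligned = np.abs(diff - shift)
        aligned[anchor] = np.inf          # exclude items already chosen
        anchor.append(int(np.argmin(aligned)))
    return sorted(anchor)
```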
  • On-the-Fly Assembled Multistage Adaptive Testing
    Abstract: Recently, multistage testing (MST) has been adopted by several important large-scale testing programs and has become popular among practitioners and researchers. Stemming from the decades-long history of computerized adaptive testing (CAT), the rapidly growing MST alleviates several major problems of earlier CAT applications. Nevertheless, MST is only one among all possible solutions to these problems. This article presents a new adaptive testing design, “on-the-fly assembled multistage adaptive testing” (OMST), which combines the benefits of CAT and MST and offsets their limitations. Moreover, OMST also provides some unique advantages over both CAT and MST. A simulation study was conducted to compare OMST with MST and CAT, and the results demonstrated the promising features of OMST. Finally, the “Discussion” section provides suggestions on possible future adaptive testing designs based on the OMST framework, which could provide great flexibility for adaptive tests in the digital future and open an avenue for all types of hybrid designs based on the different needs of specific tests.
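A minimal sketch of the core OMST step: assembling the next stage on the fly as the highest-information unused items at the interim ability estimate. Operational OMST additionally handles content constraints and exposure control, which this sketch omits:

```python
import numpy as np

def omst_next_module(theta_hat, a, b, administered, module_size=5):
    """Assemble the next OMST stage on the fly: the module_size unused
    items with maximum 2PL Fisher information at the interim estimate.

    administered: boolean mask (or index array) of items already used.
    Returns item indices, best first.
    """
    p = 1.0 / (1.0 + np.exp(-a * (theta_hat - b)))
    info = a**2 * p * (1 - p)              # 2PL Fisher information
    info = info.copy()
    info[administered] = -np.inf           # exclude used items
    return np.argsort(info)[-module_size:][::-1]
```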
  • Comparing Two Algorithms for Calibrating a Restricted Non-Compensatory Multidimensional IRT Model
    Abstract: The non-compensatory class of multidimensional item response theory (MIRT) models frequently represents the cognitive processes underlying a series of test items better than the compensatory class of MIRT models. Nevertheless, few researchers have used non-compensatory MIRT in modeling psychological data. One reason for this lack of use is that non-compensatory MIRT item parameters are notoriously difficult to estimate accurately. In this article, we propose methods to improve the estimability of a specific non-compensatory model. To initiate the discussion, we address the non-identifiability of the explored non-compensatory MIRT model by suggesting that practitioners use an item-dimension constraint matrix (namely, a Q-matrix) that results in model identifiability. We then compare two promising algorithms for high-dimensional model calibration, Markov chain Monte Carlo (MCMC) and Metropolis–Hastings Robbins–Monro (MH-RM), and discuss, via analytical demonstrations, the challenges in estimating model parameters. Based on simulation studies, we show that when the dimensions are not highly correlated, and when the Q-matrix displays appropriate structure, the non-compensatory MIRT model can be accurately calibrated (using the aforementioned methods) with as few as 1,000 people. Based on the simulations, we conclude that the MCMC algorithm is better able to estimate model parameters across a variety of conditions, whereas the MH-RM algorithm should be used with caution when a test displays complex structure and when the latent dimensions are highly correlated.
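For reference, the standard non-compensatory MIRT response probability takes a product-of-2PL-terms form over the dimensions the Q-matrix assigns to an item, so a deficit on one required dimension cannot be offset by another (the specific restricted model explored in the article may add further parameters):

```python
import numpy as np

def noncompensatory_irf(theta, a, b, q):
    """Non-compensatory MIRT response probability for one item.

    theta: (D,) abilities; a, b: (D,) slopes and difficulties;
    q: 0/1 Q-matrix row of length D marking the required dimensions.
    P = product over required dimensions of 2PL terms.
    """
    terms = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    return float(np.prod(np.where(q == 1, terms, 1.0)))
```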
  • A Note on the Comparability of Parameter Estimates
    Abstract: The use of mixture item response theory modeling is typically exemplified by comparing item profiles across different latent groups. Comparisons of item profiles presuppose that all model parameter estimates across latent classes are on a common scale. This note discusses the conditions and model constraint issues involved in establishing a common scale across latent classes.
  • Comparing Simple Scoring With IRT Scoring of Personality Measures
    Abstract: This article analyzes data from U.S. Navy sailors (N = 8,956), with the central measure being the Navy Computer Adaptive Personality Scales (NCAPS). Analyses and results from this article extend and qualify those from previous research efforts by examining the properties of the NCAPS and its adaptive structure in more detail. Specifically, this article examines item exposure rates, the efficiency of item use based on item response theory (IRT)–based Expected A Posteriori (EAP) scoring, and a comparison of IRT-EAP scoring with much more parsimonious scoring methods that appear to work just as well (stem-level scoring and dichotomous scoring). The cutting-edge nature of adaptive personality testing will necessitate a series of future efforts like this one: continually examining the benefits of adaptive scoring schemes and novel measurement methods while pushing testing technology further ahead.
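A sketch of quadrature-based EAP scoring with a standard normal prior, the IRT scoring the article compares against simpler stem-level and dichotomous scoring; the binary 2PL form here is illustrative, since NCAPS items are adaptive personality items rather than simple dichotomous ones:

```python
import numpy as np

def eap_score(u, a, b, n_quad=61):
    """Expected A Posteriori (EAP) trait estimate by quadrature.

    u: 0/1 responses; a, b: 2PL parameters of the administered items;
    prior: standard normal on theta.  Returns the posterior mean.
    """
    theta = np.linspace(-4, 4, n_quad)
    prior = np.exp(-theta**2 / 2)                       # N(0,1) up to a constant
    P = 1.0 / (1.0 + np.exp(-a[:, None] * (theta - b[:, None])))
    like = np.prod(np.where(u[:, None] == 1, P, 1 - P), axis=0)
    post = like * prior
    post /= post.sum()                                  # normalize posterior
    return float(theta @ post)
```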
  • A Review of the BMIRT Toolkit
    Abstract: A software review was conducted for BMIRT (Bayesian Multivariate IRT), which implements a variety of multidimensional models using Markov chain Monte Carlo (MCMC) methods. The review describes its basic functionality, the implementation of the program, and the operating environment.
  • MCMC GGUM
  • New Item Selection Methods for Cognitive Diagnosis Computerized Adaptive Testing
    Abstract: This article introduces two new item selection methods, the modified posterior-weighted Kullback–Leibler index (MPWKL) and the generalized deterministic inputs, noisy “and” gate (G-DINA) model discrimination index (GDI), that can be used in cognitive diagnosis computerized adaptive testing. The efficiency of the new methods is compared with the posterior-weighted Kullback–Leibler (PWKL) item selection index using a simulation study in the context of the G-DINA model. The impact of item quality, generating models, and test termination rules on attribute classification accuracy or test length is also investigated. The results of the study show that the MPWKL and GDI perform very similarly and have higher correct attribute classification rates or shorter mean test lengths compared with the PWKL. In addition, the GDI has the shortest implementation time among the three indices. The proportion of item usage with respect to the required attributes across the different conditions is also tracked and discussed.
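The GDI has a compact form: the posterior-weighted variance of an item's success probabilities across attribute profiles, which is cheap to compute and is why it runs fastest of the three indices. A minimal sketch with an illustrative interface:

```python
import numpy as np

def gdi(posterior, p_correct):
    """G-DINA discrimination index (GDI) for one candidate item.

    posterior: current posterior over the 2^K attribute profiles;
    p_correct: the item's P(X=1 | profile) for each profile.
    Returns the posterior-weighted variance of p_correct.
    """
    p_bar = posterior @ p_correct
    return posterior @ (p_correct - p_bar) ** 2

def select_next_item(posterior, P, administered):
    """Pick the unused item with the largest GDI.  P: items x profiles."""
    scores = np.array([gdi(posterior, P[j]) for j in range(P.shape[0])])
    scores[administered] = -np.inf
    return int(np.argmax(scores))
```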
  • A Bayesian Approach Based on Monte Carlo Methods for Measuring Agreement on Qualitative Scales
    Abstract: Agreement analysis has been an active research area whose techniques have been widely applied in psychology and other fields. However, statistical agreement among raters has mainly been considered from a classical statistics point of view. Bayesian methodology is a viable alternative that allows the inclusion of subjective initial information coming from expert opinions, personal judgments, or historical data. A Bayesian approach is proposed that provides a unified Monte Carlo–based framework for estimating all types of measures of agreement on a qualitative scale of response. The approach is conceptually simple and has a low computational cost. Both informative and non-informative scenarios are considered. When no initial information is available, the results are in line with the classical methodology while providing more information about the measures of agreement. For the informative case, some guidelines are presented for eliciting the prior distribution. The approach is applied to two examples related to schizophrenia diagnosis and sensory analysis.
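One concrete instance of such a unified Monte Carlo framework: putting a Dirichlet prior on the cells of a K×K rater agreement table and propagating posterior draws through Cohen's kappa. Kappa is just one of the agreement measures the framework covers, and the prior value here is an illustrative choice:

```python
import numpy as np

def bayes_kappa(counts, alpha_prior=1.0, n_draws=10_000, seed=0):
    """Monte Carlo posterior for Cohen's kappa on a KxK agreement table.

    counts: KxK table of joint rater classifications.  With a Dirichlet
    prior over the cell probabilities (alpha_prior = 1 is non-informative;
    larger values encode prior information), each posterior draw yields
    one kappa value, giving a full posterior distribution.
    """
    rng = np.random.default_rng(seed)
    k = counts.shape[0]
    draws = rng.dirichlet(counts.ravel() + alpha_prior, n_draws)
    kappas = np.empty(n_draws)
    for i, flat in enumerate(draws):
        p = flat.reshape(k, k)
        po = np.trace(p)                 # observed agreement
        pe = p.sum(1) @ p.sum(0)         # chance agreement
        kappas[i] = (po - pe) / (1 - pe)
    return kappas                        # summarize via mean / credible interval
```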
  • Examining Potential Boundary Bias Effects in Kernel Smoothing for Equating
    Abstract: Test equating is a method of making the test scores from different test forms of the same assessment comparable. In the equating process, an important step involves continuizing the discrete score distributions. In traditional observed-score equating, this step is achieved using linear interpolation (or an unscaled uniform kernel). In the kernel equating (KE) process, this continuization involves Gaussian kernel smoothing. It has been suggested that the choice of bandwidth in kernel smoothing controls the trade-off between variance and bias. In the literature on estimating density functions using kernels, it has also been suggested that the weight of the kernel depends on the sample size, and therefore the resulting continuous distribution exhibits bias at the endpoints, where the samples are usually smaller. The purpose of this article is (a) to explore the potential effects of atypical scores (spikes) at the extreme ends (high and low) on the KE method in distributions with different degrees of asymmetry using the randomly equivalent groups equating design (Study I), and (b) to introduce the Epanechnikov and adaptive kernels as potential alternative approaches to reducing boundary bias in smoothing (Study II). The beta-binomial model is used to simulate observed scores reflecting a range of different skewed shapes.
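For reference, a sketch of the Gaussian-kernel continuization step of kernel equating in the von Davier, Holland, and Thayer form, which preserves the mean and variance of the discrete score distribution; boundary bias in this smoothing is what motivates the Epanechnikov and adaptive kernels of Study II:

```python
import numpy as np
from scipy.stats import norm

def ke_continuized_cdf(x, scores, r, h):
    """Gaussian-kernel continuized CDF of a discrete score distribution.

    scores: possible score points; r: their probabilities; h: bandwidth.
    The linear a_h adjustment keeps the smoothed distribution's mean and
    variance equal to those of the discrete distribution.
    """
    mu = scores @ r
    var = ((scores - mu) ** 2) @ r
    a = np.sqrt(var / (var + h**2))                  # shrinkage factor a_h
    z = (x - a * scores - (1 - a) * mu) / (a * h)
    return r @ norm.cdf(z)
```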
  • Evaluating Person Fit for Cognitive Diagnostic Assessment
    Abstract: Methods for evaluating person fit in cognitive diagnostic assessment are an important area of research, because failing to detect misfitting responses can lead to the misinterpretation of students’ attribute profiles, which may result in faulty remediation decisions. This article examines ways of detecting person misfit for cognitive diagnostic assessments. The authors first investigated whether the well-known lz statistic, developed under the framework of item response theory, can be extended for use in the context of cognitive diagnostic models. The authors also introduce a new person fit statistic, the response conformity index (RCI), developed for detecting misfitting response patterns in cognitive diagnostic assessments. Both simulation and real data studies are conducted to compare the detection rates of lz and the new statistic.
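A sketch of the lz statistic the authors start from, in its standard IRT form; extending it to cognitive diagnostic models replaces the θ-based response probabilities with probabilities implied by the attribute profile:

```python
import numpy as np

def lz_statistic(u, p):
    """Standardized person-fit statistic lz.

    u: 0/1 responses; p: model-implied P(correct) for each item at the
    person's ability estimate.  lz = (l0 - E[l0]) / sqrt(Var[l0]);
    large negative values flag misfitting response patterns.
    """
    logit = np.log(p / (1 - p))
    l0 = np.sum(u * np.log(p) + (1 - u) * np.log(1 - p))      # log-likelihood
    e_l0 = np.sum(p * np.log(p) + (1 - p) * np.log(1 - p))    # its expectation
    var_l0 = np.sum(p * (1 - p) * logit**2)                   # its variance
    return (l0 - e_l0) / np.sqrt(var_l0)
```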
