You are here: Home > NIH Literature > Applied Psychological Measurement

Journal Information

  • Journal title: -
  • Frequency: Eight no. a year, 2008-
  • NLM title: -
  • ISO abbreviation: -
  • ISSN: -

209 results
  • Comparing Attitudes Across Groups: An Analysis of an IRT-Based Item-Fit Statistic
    Abstract: Questionnaires for the assessment of attitudes and other psychological traits are crucial in educational and psychological research, and item response theory (IRT) has become a viable tool for scaling such data. Many international large-scale assessments aim at comparing these constructs across countries, and the invariance of measures across countries is thus required. In its most recent cycle, the Programme for International Student Assessment (PISA 2015) implemented an innovative approach for testing the invariance of IRT-scaled constructs in the context questionnaires administered to students, parents, school principals, and teachers. On the basis of a concurrent calibration with equal item parameters across all groups (i.e., languages within countries), a group-specific item-fit statistic (root mean square deviance [RMSD]) was used as a measure of the invariance of item parameters for individual groups. The present simulation study examines the statistic’s distribution under different types and extents of (non)invariance in polytomous items. Responses to five 4-point Likert-type items were generated under the generalized partial credit model (GPCM) for 1,000 simulees in each of 50 groups. For one of the five items, either location or discrimination parameters were drawn from a normal distribution. In addition to the type of noninvariance, the extent of noninvariance was varied by manipulating the variance of these distributions. The results indicate that the RMSD statistic is better at detecting noninvariance related to between-group differences in item location than in item discrimination. The study’s findings may be used as a starting point for sensitivity analyses aimed at defining cutoff values for determining (non)invariance.
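The RMSD idea the abstract describes can be illustrated for the simpler dichotomous case: compare an item's observed response curve against the model-implied one, weighting squared deviations by the empirical ability distribution. The sketch below is only an assumption-laden illustration (the function name and the quantile-binning scheme are mine, not the PISA implementation, which handles polytomous GPCM items):

```python
import numpy as np

def rmsd_item_fit(theta, responses, irf, n_bins=10):
    """Density-weighted root mean squared deviation between the observed
    and model-implied item response curves (dichotomous sketch)."""
    edges = np.quantile(theta, np.linspace(0, 1, n_bins + 1))
    bins = np.clip(np.digitize(theta, edges[1:-1]), 0, n_bins - 1)
    counts = np.bincount(bins, minlength=n_bins).astype(float)
    observed = np.bincount(bins, weights=responses, minlength=n_bins) / counts
    centers = np.array([theta[bins == b].mean() for b in range(n_bins)])
    weights = counts / counts.sum()
    return np.sqrt(np.sum(weights * (observed - irf(centers)) ** 2))

# An invariant 2PL item: the assumed IRF matches the generating one,
# so the RMSD should stay near zero (sampling noise only).
rng = np.random.default_rng(0)
theta = rng.standard_normal(5000)
irf = lambda t: 1.0 / (1.0 + np.exp(-1.2 * (t - 0.3)))
x = (rng.random(5000) < irf(theta)).astype(float)
fit = rmsd_item_fit(theta, x, irf)
```

A noninvariant group (e.g., a shifted location parameter in the generating IRF) would push `fit` well above this baseline, which is the behavior the simulation study exploits.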
  • GGUM: An R Package for Fitting the Generalized Graded Unfolding Model
    Abstract: In this article, the newly created GGUM R package is presented. This package finally brings the generalized graded unfolding model (GGUM) to the front stage for practitioners and researchers. It expands the possibilities of fitting this type of item response theory (IRT) model to settings that, up to now, were not possible (thus going beyond the limitations imposed by the widespread GGUM2004 software). The outcome is therefore a unique piece of software, not limited by the dimensions of the data matrix or the operating system used. It includes various routines for fitting the model, checking model fit, plotting the results, and interacting with GGUM2004 for those interested. The software should be of interest to anyone interested in IRT in general or in ideal point models in particular.
  • Measurement Efficiency for Fixed-Precision Multidimensional Computerized Adaptive Tests: Comparing Health Measurement and Educational Testing Using Example Item Banks
    Abstract: It is currently not entirely clear to what degree the research on multidimensional computerized adaptive testing (CAT) conducted in the field of educational testing can be generalized to fields such as health assessment, where CAT design factors differ considerably from those typically used in educational testing. In this study, the impact of a number of important design factors on CAT performance is systematically evaluated, using realistic example item banks for two main scenarios: health assessment (polytomous items, small to medium item bank sizes, high discrimination parameters) and educational testing (dichotomous items, large item banks, small to medium discrimination parameters). Measurement efficiency is evaluated for both between-item multidimensional CATs and separate unidimensional CATs for each latent dimension. The study focuses on fixed-precision (variable-length) CATs because fixed-precision testing is both feasible and desirable in health settings, whereas most CAT research so far has focused on fixed-length testing. This study shows that the benefits associated with fixed-precision multidimensional CAT hold under a wide variety of circumstances.
  • Using Multivariate Generalizability Theory to Assess the Quality of Subscores
    Abstract: Conventional methods for evaluating the utility of subscores rely on reliability and correlation coefficients. However, correlations can overlook a notable source of variability: variation in subtest means/difficulties. Brennan introduced a reliability index for score profiles based on multivariate generalizability theory, designated G, which is sensitive to variation in subtest difficulty. However, there has been little, if any, research evaluating the properties of this index. A series of simulation experiments, as well as analyses of real data, were conducted to investigate G under various conditions of subtest reliability, subtest correlations, and variability in subtest means. Three pilot studies evaluated G in the context of a single group of examinees. Results of the pilots indicated that G indices were typically low; across the 108 experimental conditions, G ranged from .23 to .86, with an overall mean of .63. The findings were consistent with previous research, indicating that subscores often do not have interpretive value. Importantly, there were many conditions for which the correlation-based method known as proportional reduction in mean squared error (PRMSE; Haberman, 2006) indicated that subscores were worth reporting, but for which values of G fell into the .50s, .60s, and .70s. The main study investigated G within the context of score profiles for examinee subgroups. Again, not only were G indices generally low, but it was also found that G can be sensitive to subgroup differences when PRMSE is not. Analyses of real data and subsequent discussion address how G can supplement PRMSE in characterizing the quality of subscores.
  • Detecting Differential Item Functioning Using Odds Ratios
    Abstract: Differential item functioning (DIF) makes test scores incomparable and substantially threatens test validity. Although conventional approaches, such as the logistic regression (LR) and Mantel–Haenszel (MH) methods, have worked well, they are vulnerable to high percentages of DIF items in a test and to missing data. This study developed a simple but effective method to detect DIF using the odds ratio (OR) of two groups’ responses to a studied item. The OR method uses all available information from examinees’ responses, and it can eliminate the potential influence of bias in the total scores. Through a series of simulation studies in which the DIF pattern, impact, sample size (equal/unequal), purification procedure (with/without), percentage of DIF items, and proportion of missing data were manipulated, the performance of the OR method was evaluated and compared with the LR and MH methods. The results showed that the OR method without a purification procedure outperformed the LR and MH methods in controlling false positive rates and yielding high true positive rates when tests had a high percentage of DIF items favoring the same group. In addition, only the OR method was feasible when tests adopted an item matrix sampling design. The effectiveness of the OR method was illustrated with an empirical example.
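The core odds-ratio computation can be sketched in a few lines. Note that this is only the simplest unconditional 2×2-table version with a 0.5 continuity correction (function name and details are my assumptions); the published method additionally exploits all available response information rather than a single raw table:

```python
import numpy as np

def log_odds_ratio(ref, foc):
    """Log odds ratio of a correct (1) vs. incorrect (0) response between
    a reference and a focal group, with a Haldane-Anscombe 0.5 correction
    to guard against zero cells, plus its Wald standard error."""
    a, b = ref.sum() + 0.5, (1 - ref).sum() + 0.5   # reference: right / wrong
    c, d = foc.sum() + 0.5, (1 - foc).sum() + 0.5   # focal: right / wrong
    log_or = np.log(a * d / (b * c))
    se = np.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return log_or, se

# Two groups answering an item of equal difficulty: no DIF expected.
rng = np.random.default_rng(1)
ref = (rng.random(1000) < 0.7).astype(float)
foc = (rng.random(1000) < 0.7).astype(float)
log_or, se = log_odds_ratio(ref, foc)
z = log_or / se   # |z| above 1.96 would flag the item at the 5% level
```

Because the statistic is computed item by item from whichever responses are present, it remains usable under matrix sampling designs where each examinee sees only a subset of items, which is the feasibility advantage the abstract notes.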
  • A Hybrid Strategy for Constructing Multistage Adaptive Tests
    • Author: Xinhui Xiong
    • Journal: Applied Psychological Measurement
    • 2018, Issue 8
    Abstract: How to effectively construct multistage adaptive test (MST) panels is a topic that has spurred recent advances. The most commonly used approaches for MST assembly follow one of two strategies: bottom-up and top-down. The bottom-up approach splits the whole test into several modules; each module is built first, and then all modules are compiled to obtain the whole test, while the top-down approach proceeds in the opposite direction. Both methods have their pros and cons, and sometimes neither is convenient for practitioners. This study provides an innovative hybrid strategy that builds optimal MST panels efficiently in most cases. Empirical data and results obtained using this strategy are provided.
  • Assessing Item-Level Fit for Higher Order Item Response Theory Models
    Abstract: Testing item-level fit is important in scale development to guide item revision/deletion. Many item-level fit indices have been proposed in the literature, yet none of them is directly applicable to an important family of models, namely, the higher order item response theory (HO-IRT) models. In this study, chi-square-based fit indices (i.e., Yen’s Q1, McKinley and Mill’s G2, Orlando and Thissen’s S-X2, and S-G2) were extended to HO-IRT models. Their performance is evaluated via simulation studies in terms of false positive rates and correct detection rates. The manipulated factors include test structure (i.e., test length and number of dimensions), sample size, level of correlation among dimensions, and proportion of misfitting items. For misfitting items, the sources of misfit, including misfitting item response functions and misspecified factor structures, were also manipulated. The results from the simulation studies demonstrate that S-G2 is promising for higher order items.
  • Investigating Q-Matrix Validation with Missing Responses
    Abstract: Missing data can be a serious issue for practitioners and researchers who are tasked with Q-matrix validation analysis in implementations of cognitive diagnostic models. The article investigates the impact of missing responses, and of four common approaches for dealing with them (treating them as incorrect, logistic regression, listwise deletion, and expectation–maximization [EM] imputation), on the performance of two major Q-matrix validation methods (the EM-based δ-method and the nonparametric Q-matrix refinement method) across multiple factors. Results of the simulation study show that both validation methods perform better when missing responses are imputed using EM imputation or logistic regression instead of being treated as incorrect or handled by listwise deletion. The nonparametric Q-matrix validation method outperforms the EM-based δ-method in most conditions. Higher missing rates yield poorer performance for both methods. The numbers of attributes and items affect the performance of both methods as well. Results of a real data example are also discussed in the study.
  • Item Selection Methods in Multidimensional Computerized Adaptive Testing with Polytomously Scored Items
    Abstract: Multidimensional computerized adaptive testing (MCAT) has been developed over the past decades, but most existing methods can only deal with dichotomously scored items. However, polytomously scored items have been broadly used in a variety of tests for their advantages of providing more information and testing complicated abilities and skills. The purpose of this study is to discuss the item selection algorithms used in MCAT with polytomously scored items (PMCAT). Several promising item selection algorithms used in MCAT are extended to PMCAT, and two new item selection methods are proposed to improve the existing selection strategies. Two simulation studies are conducted to demonstrate the feasibility of the extended and proposed methods. The simulation results show that most of the extended item selection methods for PMCAT are feasible and that the newly proposed item selection methods perform well. When pool security is also considered and two dimensions are involved (Study 1), the proposed modified continuous entropy method (MCEM) is the best overall in that it attains the lowest item exposure rate while maintaining relatively high accuracy. For higher dimensions (Study 2), results show that mutual information (MUI) and MCEM maintain relatively high estimation accuracy, and item exposure rates decrease as the correlation increases.
  • A Continuous a-Stratification Index for Item Exposure Control in Computerized Adaptive Testing
    Abstract: The method of a-stratification aims to reduce item overexposure in computerized adaptive testing, as items that are administered at very high rates may threaten the validity of test scores. In existing methods of a-stratification, the item bank is partitioned into a fixed number of nonoverlapping strata according to the items’ a, or discrimination, parameters. This article introduces a continuous a-stratification index which incorporates exposure control into the item selection index itself and thus eliminates the need for fixed discrete strata. The new continuous a-stratification index is compared with existing stratification methods via simulation studies in terms of ability estimation bias, mean squared error, and control of item exposure rates.
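The abstract does not give the index's formula, so the following is only a hypothetical illustration of the general idea behind folding exposure control into the selection index itself: discount high-discrimination items early in the test and let the discount fade as the test progresses (the function name, the power-discount form, and the `weight` parameter are all inventions for illustration, not the article's index):

```python
import numpy as np

def a_strat_index(fisher_info, a, items_given, test_length, weight=1.0):
    """Hypothetical continuous a-stratification index: Fisher information
    divided by a^(weight * remaining fraction of the test), so high-a items
    are penalized early and compete on raw information late in the test."""
    remaining = 1.0 - items_given / test_length
    return fisher_info / np.power(a, weight * remaining)

# Early on, a modest-a item can beat a high-a item despite less information;
# by the end of the test, the raw information ordering is restored.
early_high = a_strat_index(0.6, a=2.0, items_given=0, test_length=30)   # 0.30
early_low  = a_strat_index(0.5, a=1.0, items_given=0, test_length=30)   # 0.50
late_high  = a_strat_index(0.6, a=2.0, items_given=30, test_length=30)  # 0.60
late_low   = a_strat_index(0.5, a=1.0, items_given=30, test_length=30)  # 0.50
```

The point of a continuous index of this kind is exactly what the abstract states: no fixed strata boundaries are needed, because the penalty varies smoothly with test progress.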
  • Constructing Shadow Tests in Variable-Length Adaptive Testing
    • Authors: Qi Diao, Hao Ren
    • Journal: Applied Psychological Measurement
    • 2018, Issue 7
    Abstract: Imposing content constraints is very important in most operational computerized adaptive testing (CAT) programs in educational measurement. The shadow test approach to CAT (Shadow CAT) offers an elegant solution for imposing statistical and nonstatistical constraints by projecting the future consequences of item selection. The original form of Shadow CAT presumes fixed test lengths. The goal of the current study was to extend Shadow CAT to tests with variable-length termination conditions and to evaluate its performance relative to other content balancing approaches. The study demonstrated the feasibility of constructing Shadow CAT with variable test lengths in operational CAT programs. The results indicated the superiority of the approach compared with other content balancing methods.
  • Methods for Estimating Item-Score Reliability
    Abstract: Reliability is usually estimated for a test score, but it can also be estimated for item scores. Item-score reliability can be useful for assessing an item’s contribution to the test score’s reliability, for identifying unreliable scores in aberrant item-score patterns in person-fit analysis, and for selecting the most reliable item from a test to use as a single-item measure. Four methods were discussed for estimating item-score reliability: the Molenaar–Sijtsma method (method MS), Guttman’s method λ6, the latent class reliability coefficient (method LCRC), and the correction for attenuation (method CA). A simulation study was used to compare the methods with respect to median bias, variability (interquartile range [IQR]), and percentage of outliers. The simulation study consisted of six conditions: standard, polytomous items, unequal α parameters, two-dimensional data, long test, and small sample size. Methods MS and CA were the most accurate. Method LCRC showed almost unbiased results, but large variability. Method λ6 consistently underestimated item-score reliability, but showed a smaller IQR than the other methods.
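Of the four methods, the λ6 variant is the easiest to sketch: under my reading, at the item level it amounts to one minus the relative residual variance when the item is regressed on all the other items, i.e., the item's squared multiple correlation. The helper below is an illustrative assumption, not the authors' code:

```python
import numpy as np

def item_lambda6(scores, j):
    """Item-score reliability in the spirit of Guttman's lambda-6: regress
    item j on the remaining items and return one minus the residual variance
    ratio, which equals item j's squared multiple correlation."""
    y = scores[:, j]
    X = np.delete(scores, j, axis=1)
    X = np.column_stack([np.ones(len(X)), X])       # add an intercept
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    residual = y - X @ beta
    return 1.0 - residual.var() / y.var()

# Five parallel items driven by one latent trait: each item is moderately
# predictable from the other four, so the estimate lands well inside (0, 1).
rng = np.random.default_rng(2)
theta = rng.standard_normal(2000)
scores = theta[:, None] + rng.standard_normal((2000, 5))
rel = item_lambda6(scores, 0)
```

Because the prediction uses only the other items, this estimate is a lower bound on the item's true reliability, which is consistent with the underestimation the simulation study reports for method λ6.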
  • A Zero-Inflated Box-Cox Normal Unipolar Item Response Model for Measuring Psychopathology Constructs
    Abstract: This research introduces a latent class item response theory (IRT) approach for modeling item response data from zero-inflated, positively skewed, and arguably unipolar constructs of psychopathology. As motivating data, the authors use 4,925 responses to the Patient Health Questionnaire (PHQ-9), a nine-item Likert-type depression screener that inquires about a variety of depressive symptoms. First, Lucke’s log-logistic unipolar item response model is extended to accommodate polytomous responses. Then, a nontrivial proportion of individuals who do not endorse any of the symptoms are accounted for by including a nonpathological class that represents those who may be absent on, or at some floor level of, the latent variable that is being measured by the PHQ-9. To enhance flexibility, a Box-Cox normal distribution is used to empirically determine a transformation parameter that can help characterize the degree of skewness in the latent variable density. A model comparison approach is used to test the necessity of the features of the proposed model. Results suggest that (a) the Box-Cox normal transformation provides empirical support for using a log-normal population density, and (b) model fit substantially improves when a nonpathological latent class is included. The parameter estimates from the latent class IRT model are used to interpret the psychometric properties of the PHQ-9, and a method of computing IRT scale scores that reflect unipolar constructs is described, focusing on how these scores may be used in clinical contexts.
  • Book Review: Applying Test Equating Methods: Using R
    • Author: Michela Battauz
    • Journal: Applied Psychological Measurement
    • 2018, Issue 7
  • Response Styles in the Partial Credit Model
    Abstract: In the modeling of ordinal responses in psychological measurement and survey-based research, response styles that represent specific answering patterns of respondents are typically ignored. One consequence is that estimates of item parameters can be poor and considerably biased. The focus here is on modeling a tendency toward extreme or middle categories. An extension of the partial credit model is proposed that explicitly accounts for this specific response style. In contrast to existing approaches, which are based on finite mixtures, explicit person-specific response style parameters are introduced. The resulting model can be estimated within the framework of generalized linear mixed models. It is shown that estimates can be seriously biased if the response style is ignored. In applications, it is demonstrated that a tendency toward extreme or middle categories is not uncommon. A software tool is developed that makes the model easy to apply.
  • Scale Separation Reliability: What Does It Mean for Comparative Judgment?
    Abstract: Comparative judgment (CJ) is an alternative method for assessing competences based on Thurstone’s law of comparative judgment. Assessors are asked to compare pairs of students’ work (representations) and judge which one is better on a certain competence. These judgments are analyzed using the Bradley–Terry–Luce model, resulting in logit estimates for the representations. In this context, the Scale Separation Reliability (SSR), which comes from Rasch modeling, is typically used as a reliability measure. But, to the knowledge of the authors, it has never been systematically investigated whether the meaning of the SSR can be transferred from Rasch modeling to CJ. As the meaning of the reliability is an important question for both assessment theory and practice, the current study looks into this. A meta-analysis is performed on 26 CJ assessments. For every assessment, split halves are formed based on assessor. The rank orders of the whole assessment and the halves are correlated and compared with SSR values using Bland–Altman plots. The correlation between the halves of an assessment was compared with the SSR of the whole assessment, showing that the SSR is a good measure of split-half reliability. Comparing the SSR of one of the halves with the correlation between the two respective halves showed that the SSR can also be interpreted as an interrater correlation. Regarding the SSR as expressing a correlation with the truth, the results are mixed.
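For readers unfamiliar with the SSR, it is computed from the logit estimates and their standard errors as the proportion of observed variance that is not measurement error. A minimal sketch (variable names are mine; the data are made up):

```python
import numpy as np

def scale_separation_reliability(estimates, standard_errors):
    """SSR: the share of observed variance in the logit estimates that
    survives after subtracting the mean squared standard error."""
    observed_var = np.var(estimates, ddof=1)
    error_var = np.mean(np.square(standard_errors))
    return (observed_var - error_var) / observed_var

# Representations spread over roughly two logits, each measured with
# a standard error of about 0.3: SSR should land around .9.
rng = np.random.default_rng(3)
logits = rng.normal(0.0, 1.0, size=200)
ssr = scale_separation_reliability(logits, np.full(200, 0.3))
```

Note the formula is purely internal to one calibration, which is exactly why the study's split-half comparison is needed to check whether it behaves like a classical reliability in the CJ setting.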
  • EM-Based Q-Matrix Validation Methods
    Abstract: To assist subject matter experts in specifying their Q-matrices, the authors used an expectation–maximization (EM)–based algorithm to investigate three alternative Q-matrix validation methods, namely, maximum likelihood estimation (MLE), marginal maximum likelihood estimation (MMLE), and the intersection and difference (ID) method. Their efficiency was compared, respectively, with that of the sequential EM-based δ method and its extension (ς2), the γ method, and the nonparametric method in terms of correct recovery rate, true negative rate, and true positive rate under the deterministic-inputs, noisy “and” gate (DINA) model and the reduced reparameterized unified model (rRUM). Simulation results showed that for the rRUM, the MLE performed better for low-quality tests, whereas the MMLE worked better for high-quality tests. For the DINA model, the ID method tended to produce better quality Q-matrix estimates than the other methods for large sample sizes (i.e., 500 or 1,000). In addition, the Q-matrix was more precisely estimated under the discrete uniform distribution than under the multivariate normal threshold model for all the above methods. On average, the ς2 and ID methods, with higher true negative rates, are better for correcting misspecified Q-entries, whereas the MLE, with higher true positive rates, is better for retaining correct Q-entries. Experimental results on a real data set confirmed the effectiveness of the MLE.
  • Mutual Information Reliability for Latent Class Analysis
    Abstract: Latent class models are powerful tools in psychological and educational measurement. These models classify individuals into subgroups based on a set of manifest variables, assisting decision making in a diagnostic system. In this article, based on information theory, the authors propose a mutual information reliability (MIR) coefficient that summarizes the measurement quality of latent class models, where the latent variables being measured are categorical. The proposed coefficient is analogous to a version of the reliability coefficient for item response theory models and meets the general concept of measurement reliability in the Standards for Educational and Psychological Testing. The proposed coefficient can also be viewed as an extension of McFadden’s pseudo-R-square coefficient, which evaluates the goodness of fit of the logistic regression model, to latent class models. Thanks to several information-theoretic inequalities, the MIR coefficient is unitless, lies between 0 and 1, and has a good interpretation from a measurement point of view. The coefficient can be applied to both fixed and computerized adaptive testing designs. The performance of the MIR coefficient is demonstrated through simulated examples.
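The abstract does not state the coefficient's formula. One natural information-theoretic reading, and it is only an assumption on my part, is the mutual information between the latent class and the response pattern, normalized by the class entropy so the result lies in [0, 1]:

```python
import numpy as np
from itertools import product

def mir(pi, p_correct):
    """Sketch of a mutual-information reliability for a latent class model
    with locally independent dichotomous items: I(class; pattern) / H(class).
    pi: class proportions; p_correct: per-class success probabilities."""
    n_class, n_items = p_correct.shape
    h_class = -np.sum(pi * np.log(pi))
    info = 0.0
    for pattern in product([0, 1], repeat=n_items):
        x = np.array(pattern)
        # P(pattern | class) under local independence
        p_x_given_c = np.prod(np.where(x == 1, p_correct, 1 - p_correct), axis=1)
        p_x = float(np.sum(pi * p_x_given_c))
        joint = pi * p_x_given_c
        mask = joint > 0
        info += np.sum(joint[mask] * np.log(joint[mask] / (pi[mask] * p_x)))
    return info / h_class

pi = np.array([0.5, 0.5])
sharp = mir(pi, np.array([[0.9] * 5, [0.1] * 5]))  # items separate classes well
fuzzy = mir(pi, np.array([[0.6] * 5, [0.4] * 5]))  # items barely separate them
```

Under this reading the coefficient inherits the properties the abstract lists: it is unitless, bounded by 0 and 1 via standard information inequalities, and equals 1 only when the responses determine the class exactly.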
  • Latent Class Analysis of Recurrent Events in Problem-Solving Items
    Abstract: Computer-based assessment of complex problem-solving abilities is becoming more and more popular. In such an assessment, the entire problem-solving process of an examinee is recorded, providing detailed information about the individual, such as behavioral patterns, speed, and learning trajectory. The problem-solving processes are recorded in a computer log file, which is a time-stamped documentation of events related to task completion. As opposed to cross-sectional response data from traditional tests, process data in log files are massive and irregularly structured, calling for effective exploratory data analysis methods. Motivated by a specific complex problem-solving item, “Climate Control,” in the 2012 Programme for International Student Assessment, the authors propose a latent class analysis approach to analyzing the events that occur in the problem-solving processes. The exploratory latent class analysis yields meaningful latent classes. Simulation studies are conducted to evaluate the proposed approach.
  • Which Information Works Best?: A Comparison of Routing Methods
    Abstract: Many item selection methods have been proposed for computerized adaptive testing (CAT) applications. However, not all of them have been used in computerized multistage testing (ca-MST). This study uses several item selection methods as routing methods within the ca-MST framework: maximum Fisher information (MFI), maximum likelihood weighted information (MLWI), maximum posterior weighted information (MPWI), Kullback–Leibler (KL), and posterior Kullback–Leibler (KLP). The main purpose of this study is to examine the performance of these methods when they are used as routing methods in ca-MST applications. These five information methods were tested under four ca-MST panel designs and two test lengths (30 items and 60 items) using the parameters of a real item bank. Results were evaluated with overall findings (mean bias, root mean square error, correlation between true and estimated thetas, and module exposure rates) and conditional findings (conditional absolute bias, standard error of measurement, and root mean square error). It was found that test length affected the outcomes much more than the other study conditions. Under 30-item conditions, 1-3 designs outperformed the other panel designs. Under 60-item conditions, 1-3-3 designs were better than the other panel designs. Each routing method performed well under particular conditions; there was no clear best method across the studied conditions. Recommendations for routing methods in particular conditions are provided for researchers and practitioners, along with the limitations of these results.
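As a concrete example of the simplest of these routing rules, maximum Fisher information under a 2PL model sends the examinee to the next-stage module whose items are jointly most informative at the interim ability estimate. The panel below is a made-up 1-3 design for illustration, not the study's item bank:

```python
import numpy as np

def fisher_info_2pl(theta, a, b):
    """Fisher information of a 2PL item at ability theta."""
    p = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    return a ** 2 * p * (1.0 - p)

def route_by_mfi(theta_hat, modules):
    """MFI routing: pick the module with the largest summed item
    information at the interim theta estimate."""
    totals = [sum(fisher_info_2pl(theta_hat, a, b) for a, b in mod)
              for mod in modules]
    return int(np.argmax(totals))

# Hypothetical easy / medium / hard second-stage modules, as (a, b) pairs:
modules = [
    [(1.0, -1.5), (1.2, -1.0)],  # easy
    [(1.0,  0.0), (1.2,  0.2)],  # medium
    [(1.0,  1.0), (1.2,  1.5)],  # hard
]
# route_by_mfi(-1.2, modules) → 0 (easy); route_by_mfi(1.3, modules) → 2 (hard)
```

The other routing rules in the study replace the point evaluation at the interim estimate with information weighted or integrated over a likelihood or posterior for theta, which is what distinguishes MLWI, MPWI, KL, and KLP from plain MFI.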
