Applied Psychological Measurement (NIH literature index)

Journal information

  • Frequency: Eight no. a year, 2008-

209 results (page 2 of 11)
  • A Law of Comparative Preference: Distinctions Between Models of Personal Preference and Impersonal Judgment in Pair Comparison Designs
    Abstract: The pair comparison design for distinguishing between stimuli located on the same natural or hypothesized linear continuum is used both when the response is a personal preference and when it is an impersonal judgment. Appropriate models which complement the different responses have been proposed. However, the models most appropriate for impersonal judgments have also been described as modeling choice, which may imply personal preference. This leads to potential confusion in interpretation of scale estimates of the stimuli, in particular whether they reflect a substantive order on the variable or reflect a characteristic of the sample which is different from the substantive order on the variable. Using Thurstone’s concept of a discriminal response when a person engages with each stimulus, this article explains the overlapping and distinctive relationships between models for pair comparison designs when used for preference and judgment. In doing so, it exploits the properties of the relatively new hyperbolic cosine model which is not only appropriate for modeling personal preferences but has an explicit mathematical relationship with models for impersonal judgments. The hyperbolic cosine model is shown to be a special case of a more general form, referred to, in parallel with Thurstone’s Law of Comparative Judgment, as a specific law of comparative preference. Analyses of two real data sets illustrate the differences between the models most appropriate for personal preferences and impersonal judgments in a pair comparison design.
  • A General Unfolding IRT Model for Multiple Response Styles
    Abstract: It is commonly known that respondents exhibit different response styles when responding to Likert-type items. For example, some respondents tend to select the extreme categories (e.g., strongly disagree and strongly agree), whereas some tend to select the middle categories (e.g., disagree, neutral, and agree). Furthermore, some respondents tend to disagree with every item (e.g., strongly disagree and disagree), whereas others tend to agree with every item (e.g., agree and strongly agree). In such cases, fitting standard unfolding item response theory (IRT) models that assume no response style will yield a poor fit and biased parameter estimates. Although there have been attempts to develop dominance IRT models to accommodate the various response styles, such models are usually restricted to a specific response style and cannot be used for unfolding data. In this study, a general unfolding IRT model is proposed that can be combined with a softmax function to accommodate various response styles via scoring functions. The parameters of the new model can be estimated using Bayesian Markov chain Monte Carlo algorithms. An empirical data set is used for demonstration purposes, followed by simulation studies to assess the parameter recovery of the new model, as well as the consequences of ignoring the impact of response styles on parameter estimators by fitting standard unfolding IRT models. The results suggest that the new model exhibits good parameter recovery and that estimates are seriously biased when response styles are ignored.
  • The Bipolarity of Attitudes: Unfolding the Implications of Ambivalence
    • Author: Joshua A. McGrane
    • Journal: Applied Psychological Measurement
    • 2019, Issue 3
    Abstract: Recently, some attitude researchers have argued that the traditional bipolar model of attitudes should be replaced, claiming that a bivariate model is superior in several ways, foremost of which is its ability to account for ambivalent attitudes. This study argues that ambivalence is not at odds with bipolarity per se, but rather the conventional view of bipolarity, and that the psychometric evidence supporting a bivariate interpretation has been flawed. To demonstrate this, a scale developed out of the bivariate approach was examined using a unidimensional unfolding item response theory model: the general hyperbolic cosine model for polytomous responses. The results were consistent with a bipolar interpretation, providing support for the argument that ambivalent evaluations are the correct middle-point of a bipolar evaluative dimension. Thus, it is argued that attitudinal ambivalence does not necessitate moving beyond bipolarity, but rather, moving beyond the conventional conceptualization and assessment of attitudes.
  • GGUM-RANK Statement and Person Parameter Estimation With Multidimensional Forced Choice Triplets
    Abstract: Historically, multidimensional forced choice (MFC) measures have been criticized because conventional scoring methods can lead to ipsativity problems that render scores unsuitable for interindividual comparisons. However, with the recent advent of item response theory (IRT) scoring methods that yield normative information, MFC measures are surging in popularity and becoming important components in high-stakes evaluation settings. This article aims to add to burgeoning methodological advances in MFC measurement by focusing on statement and person parameter recovery for the GGUM-RANK (generalized graded unfolding-RANK) IRT model. A Markov chain Monte Carlo (MCMC) algorithm was developed for estimating GGUM-RANK statement and person parameters directly from MFC rank responses. Simulation studies examined how the psychometric properties of statements composing MFC items, test length, and sample size influenced statement and person parameter estimation, and explored the benefits of measurement using MFC triplets relative to pairs. To demonstrate this methodology, an empirical validity study was then conducted using an MFC triplet personality measure. The results and implications of these studies for future research and practice are discussed.
  • Application of Multivariate Generalizability Theory to Evaluating the Quality of Subscores
    Abstract: Conventional methods for evaluating the utility of subscores rely on reliability and correlation coefficients. However, correlations can overlook a notable source of variability: variation in subtest means/difficulties. Brennan introduced a reliability index for score profiles based on multivariate generalizability theory, designated as G, which is sensitive to variation in subtest difficulty. However, there has been little, if any, research evaluating the properties of this index. A series of simulation experiments, as well as analyses of real data, were conducted to investigate G under various conditions of subtest reliability, subtest correlations, and variability in subtest means. Three pilot studies evaluated G in the context of a single group of examinees. Results of the pilots indicated that G indices were typically low; across the 108 experimental conditions, G ranged from .23 to .86, with an overall mean of .63. The findings were consistent with previous research, indicating that subscores often do not have interpretive value. Importantly, there were many conditions for which the correlation-based method known as proportion reduction in mean-square error (PRMSE; Haberman, 2006) indicated that subscores were worth reporting, but for which values of G fell into the .50s, .60s, and .70s. The main study investigated G within the context of score profiles for examinee subgroups. Again, not only were G indices generally low, but G was also found to be sensitive to subgroup differences when PRMSE is not. Analyses of real data and subsequent discussion address how G can supplement PRMSE for characterizing the quality of subscores.
  • Using Odds Ratios to Detect Differential Item Functioning
    Abstract: Differential item functioning (DIF) makes test scores incomparable and substantially threatens test validity. Although conventional approaches, such as the logistic regression (LR) and the Mantel–Haenszel (MH) methods, have worked well, they are vulnerable to high percentages of DIF items in a test and missing data. This study developed a simple but effective method to detect DIF using the odds ratio (OR) of two groups’ responses to a studied item. The OR method uses all available information from examinees’ responses, and it can eliminate the potential influence of bias in the total scores. Through a series of simulation studies in which the DIF pattern, impact, sample size (equal/unequal), purification procedure (with/without), percentages of DIF items, and proportions of missing data were manipulated, the performance of the OR method was evaluated and compared with the LR and MH methods. The results showed that the OR method without a purification procedure outperformed the LR and MH methods in controlling false positive rates and yielding high true positive rates when tests had a high percentage of DIF items favoring the same group. In addition, only the OR method was feasible when tests adopted the item matrix sampling design. The effectiveness of the OR method was illustrated with an empirical example.
  • A Hybrid Strategy to Construct Multistage Adaptive Tests
    • Author: Xinhui Xiong
    • Journal: Applied Psychological Measurement
    • 2018, Issue 8
    Abstract: How to effectively construct multistage adaptive test (MST) panels is a topic that has spurred recent advances. The most commonly used approaches for MST assembly use one of two strategies: bottom-up and top-down. The bottom-up approach splits the whole test into several modules, and each module is built first, then all modules are compiled to obtain the whole test, while the top-down approach follows the opposite direction. Both methods have their pros and cons, and sometimes neither is convenient for practitioners. This study provides an innovative hybrid strategy to build optimal MST panels efficiently most of the time. Empirical data and results obtained using this strategy are provided.
  • Assessing Item-Level Fit for Higher Order Item Response Theory Models
    Abstract: Testing item-level fit is important in scale development to guide item revision/deletion. Many item-level fit indices have been proposed in the literature, yet none of them were directly applicable to an important family of models, namely, the higher order item response theory (HO-IRT) models. In this study, chi-square-based fit indices (i.e., Yen’s Q₁, McKinley and Mills’s G², Orlando and Thissen’s S-X², and S-G²) were extended to HO-IRT models. Their performances are evaluated via simulation studies in terms of false positive rates and correct detection rates. The manipulated factors include test structure (i.e., test length and number of dimensions), sample size, level of correlations among dimensions, and the proportion of misfitting items. For misfitting items, the sources of misfit, including misfitting item response functions and misspecified factor structures, were also manipulated. The results from simulation studies demonstrate that the S-G² is promising for higher order items.
  • Investigation of Missing Responses in Q-Matrix Validation
    Abstract: Missing data can be a serious issue for practitioners and researchers who are tasked with Q-matrix validation analysis in implementation of cognitive diagnostic models. The article investigates the impact of missing responses, and four common approaches (treat as incorrect, logistic regression, listwise deletion, and expectation–maximization [EM] imputation) for dealing with them, on the performance of two major Q-matrix validation methods (the EM-based δ-method and the nonparametric Q-matrix refinement method) across multiple factors. Results of the simulation study show that both validation methods perform better when missing responses are imputed using EM imputation or logistic regression rather than treated as incorrect or handled by listwise deletion. The nonparametric Q-matrix validation method outperforms the EM-based δ-method in most conditions. Higher missing rates yield poorer performance of both methods. The numbers of attributes and items also affect the performance of both methods. Results of a real data example are also discussed in the study.
  • Item Selection Methods in Multidimensional Computerized Adaptive Testing With Polytomously Scored Items
    Abstract: Multidimensional computerized adaptive testing (MCAT) has been developed over the past decades, but most MCAT methods can only deal with dichotomously scored items. However, polytomously scored items have been broadly used in a variety of tests for their advantages of providing more information and testing complicated abilities and skills. The purpose of this study is to discuss the item selection algorithms used in MCAT with polytomously scored items (PMCAT). Several promising item selection algorithms used in MCAT are extended to PMCAT, and two new item selection methods are proposed to improve the existing selection strategies. Two simulation studies are conducted to demonstrate the feasibility of the extended and proposed methods. The simulation results show that most of the extended item selection methods for PMCAT are feasible and the new proposed item selection methods perform well. Taking pool security into account, when two dimensions are considered (Study 1), the proposed modified continuous entropy method (MCEM) performs best overall in that it attains the lowest item exposure rate with relatively high accuracy. As for high dimensions (Study 2), results show that mutual information (MUI) and MCEM keep relatively high estimation accuracy, and the item exposure rates decrease as the correlation increases.
  • A Continuous a-Stratification Index for Item Exposure Control in Computerized Adaptive Testing
    Abstract: The method of a-stratification aims to reduce item overexposure in computerized adaptive testing, as items that are administered at very high rates may threaten the validity of test scores. In existing methods of a-stratification, the item bank is partitioned into a fixed number of nonoverlapping strata according to the items’ a, or discrimination, parameters. This article introduces a continuous a-stratification index which incorporates exposure control into the item selection index itself and thus eliminates the need for fixed discrete strata. The new continuous a-stratification index is compared with existing stratification methods via simulation studies in terms of ability estimation bias, mean squared error, and control of item exposure rates.
  • Constructing Shadow Tests in Variable-Length Adaptive Testing
    • Authors: Qi Diao, Hao Ren
    • Journal: Applied Psychological Measurement
    • 2018, Issue 7
    Abstract: Imposing content constraints is very important in most operational computerized adaptive testing (CAT) programs in educational measurement. The shadow test approach to CAT (Shadow CAT) offers an elegant solution to imposing statistical and nonstatistical constraints by projecting future consequences of item selection. The original form of Shadow CAT presumes fixed test lengths. The goal of the current study was to extend Shadow CAT to tests under variable-length termination conditions and evaluate its performance relative to other content balancing approaches. The study demonstrated the feasibility of constructing Shadow CAT with variable test lengths and in operational CAT programs. The results indicated the superiority of the approach compared with other content balancing methods.
  • Methods for Estimating Item-Score Reliability
    Abstract: Reliability is usually estimated for a test score, but it can also be estimated for item scores. Item-score reliability can be useful to assess the item’s contribution to the test score’s reliability, for identifying unreliable scores in aberrant item-score patterns in person-fit analysis, and for selecting the most reliable item from a test to use as a single-item measure. Four methods were discussed for estimating item-score reliability: the Molenaar–Sijtsma method (method MS), Guttman’s method λ6, the latent class reliability coefficient (method LCRC), and the correction for attenuation (method CA). A simulation study was used to compare the methods with respect to median bias, variability (interquartile range [IQR]), and percentage of outliers. The simulation study consisted of six conditions: standard, polytomous items, unequal α parameters, two-dimensional data, long test, and small sample size. Methods MS and CA were the most accurate. Method LCRC showed almost unbiased results, but large variability. Method λ6 consistently underestimated item-score reliability, but showed a smaller IQR than the other methods.
  • A Zero-Inflated Box-Cox Normal Unipolar Item Response Model for Measuring Constructs of Psychopathology
    Abstract: This research introduces a latent class item response theory (IRT) approach for modeling item response data from zero-inflated, positively skewed, and arguably unipolar constructs of psychopathology. As motivating data, the authors use 4,925 responses to the Patient Health Questionnaire (PHQ-9), a depression screener with nine Likert-type items that inquires about a variety of depressive symptoms. First, Lucke’s log-logistic unipolar item response model is extended to accommodate polytomous responses. Then, a nontrivial proportion of individuals who do not endorse any of the symptoms are accounted for by including a nonpathological class that represents those who may be absent on or at some floor level of the latent variable that is being measured by the PHQ-9. To enhance flexibility, a Box-Cox normal distribution is used to empirically determine a transformation parameter that can help characterize the degree of skewness in the latent variable density. A model comparison approach is used to test the necessity of the features of the proposed model. Results suggest that (a) the Box-Cox normal transformation provides empirical support for using a log-normal population density, and (b) model fit substantially improves when a nonpathological latent class is included. The parameter estimates from the latent class IRT model are used to interpret the psychometric properties of the PHQ-9, and a method of computing IRT scale scores that reflect unipolar constructs is described, focusing on how these scores may be used in clinical contexts.
  • Book Review: Applying Test Equating Methods: Using R
    • Author: Michela Battauz
    • Journal: Applied Psychological Measurement
    • 2018, Issue 7
  • Response Styles in the Partial Credit Model
    Abstract: In the modeling of ordinal responses in psychological measurement and survey-based research, response styles that represent specific answering patterns of respondents are typically ignored. One consequence is that estimates of item parameters can be poor and considerably biased. The focus here is on the modeling of a tendency to extreme or middle categories. An extension of the partial credit model is proposed that explicitly accounts for this specific response style. In contrast to existing approaches, which are based on finite mixtures, explicit person-specific response style parameters are introduced. The resulting model can be estimated within the framework of generalized mixed linear models. It is shown that estimates can be seriously biased if the response style is ignored. In applications, it is demonstrated that a tendency to extreme or middle categories is not uncommon. A software tool is developed that makes the model easy to apply.
  • Scale Separation Reliability: What Does It Mean in the Context of Comparative Judgment?
    Abstract: Comparative judgment (CJ) is an alternative method for assessing competences based on Thurstone’s law of comparative judgment. Assessors are asked to compare pairs of students’ work (representations) and judge which one is better on a certain competence. These judgments are analyzed using the Bradley–Terry–Luce model, resulting in logit estimates for the representations. In this context, the Scale Separation Reliability (SSR), coming from Rasch modeling, is typically used as a reliability measure. But, to the knowledge of the authors, it has never been systematically investigated whether the meaning of the SSR can be transferred from Rasch to CJ. As the meaning of the reliability is an important question for both assessment theory and practice, the current study looks into this. A meta-analysis is performed on 26 CJ assessments. For every assessment, split-halves are performed based on assessor. The rank orders of the whole assessment and the halves are correlated and compared with SSR values using Bland–Altman plots. The correlation between the halves of an assessment was compared with the SSR of the whole assessment, showing that the SSR is a good measure for split-half reliability. Comparing the SSR of one of the halves with the correlation between the two respective halves showed that the SSR can also be interpreted as an interrater correlation. Regarding SSR as expressing a correlation with the truth, the results are mixed.
  • EM-Based Q-Matrix Validation Methods
    Abstract: To assist subject matter experts in specifying their Q-matrices, the authors used an expectation–maximization (EM)–based algorithm to investigate three alternative Q-matrix validation methods, namely, the maximum likelihood estimation (MLE), the marginal maximum likelihood estimation (MMLE), and the intersection and difference (ID) method. Their efficiency was compared, respectively, with that of the sequential EM-based δ method and its extension (ς²), the γ method, and the nonparametric method in terms of correct recovery rate, true negative rate, and true positive rate under the deterministic-inputs, noisy “and” gate (DINA) model and the reduced reparameterized unified model (rRUM). Simulation results showed that for the rRUM, the MLE performed better for low-quality tests, whereas the MMLE worked better for high-quality tests. For the DINA model, the ID method tended to produce better quality Q-matrix estimates than other methods for large sample sizes (i.e., 500 or 1,000). In addition, the Q-matrix was more precisely estimated under the discrete uniform distribution than under the multivariate normal threshold model for all the above methods. On average, the ς² and ID methods, with higher true negative rates, are better for correcting misspecified Q-entries, whereas the MLE, with higher true positive rates, is better for retaining the correct Q-entries. Experimental results on a real data set confirmed the effectiveness of the MLE.
  • Mutual Information Reliability for Latent Class Analysis
    Abstract: Latent class models are powerful tools in psychological and educational measurement. These models classify individuals into subgroups based on a set of manifest variables, assisting decision making in a diagnostic system. In this article, based on information theory, the authors propose a mutual information reliability (MIR) coefficient that summarizes the measurement quality of latent class models, where the latent variables being measured are categorical. The proposed coefficient is analogous to a version of reliability coefficient for item response theory models and meets the general concept of measurement reliability in the Standards for Educational and Psychological Testing. The proposed coefficient can also be viewed as an extension of McFadden’s pseudo R-square coefficient, which evaluates the goodness-of-fit of a logistic regression model, to latent class models. Thanks to several information-theoretic inequalities, the MIR coefficient is unitless, lies between 0 and 1, and has a clear interpretation from a measurement point of view. The coefficient can be applied to both fixed and computerized adaptive testing designs. The performance of the MIR coefficient is demonstrated by simulated examples.
  • Latent Class Analysis of Recurrent Events in Problem-Solving Items
    Abstract: Computer-based assessment of complex problem-solving abilities is becoming more and more popular. In such an assessment, the entire problem-solving process of an examinee is recorded, providing detailed information about the individual, such as behavioral patterns, speed, and learning trajectory. The problem-solving processes are recorded in a computer log file which is a time-stamped documentation of events related to task completion. As opposed to cross-sectional response data from traditional tests, process data in log files are massive and irregularly structured, calling for effective exploratory data analysis methods. Motivated by a specific complex problem-solving item “Climate Control” in the 2012 Programme for International Student Assessment, the authors propose a latent class analysis approach to analyzing the events that occurred in the problem-solving processes. The exploratory latent class analysis yields meaningful latent classes. Simulation studies are conducted to evaluate the proposed approach.
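
Note on the differential item functioning (DIF) entry above: that abstract describes detecting DIF from the odds ratio (OR) of two groups' responses to a studied item. As a rough illustration only, the sketch below computes a log odds ratio and its usual large-sample standard error from a 2x2 table of group by correct/incorrect counts. The function name, the argument names, and the 0.5 continuity correction are assumptions made for this sketch; the published OR method involves more than this single step and is not reproduced here.

```python
import math


def item_log_odds_ratio(ref_correct, ref_incorrect, foc_correct, foc_incorrect):
    """Return (log odds ratio, standard error) for one item.

    Hypothetical helper for illustration: compares the odds of a correct
    response in a reference group with the odds in a focal group.
    """
    # Add 0.5 to every cell (Haldane-Anscombe correction, an assumption of
    # this sketch) so that an empty cell does not make the log undefined.
    a, b, c, d = (
        count + 0.5
        for count in (ref_correct, ref_incorrect, foc_correct, foc_incorrect)
    )
    # Odds ratio of the 2x2 table, on the log scale.
    log_or = math.log((a * d) / (b * c))
    # Large-sample standard error of the log odds ratio.
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return log_or, se


if __name__ == "__main__":
    # Made-up counts: 40 of 50 reference examinees and 20 of 50 focal
    # examinees answer the studied item correctly.
    log_or, se = item_log_odds_ratio(40, 10, 20, 30)
    print(f"log OR = {log_or:.3f}, SE = {se:.3f}")
```

A log odds ratio near zero means the two groups have similar odds of answering the item correctly; a large value relative to its standard error would flag the item for closer inspection. How such values are combined across items and turned into a decision rule is specific to the cited study.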
