
Automatic acquisition of lexical semantic knowledge from large corpora: The identification of semantically related words, markedness, polarity, and antonymy.



Abstract

Lexical semantic knowledge is useful, even indispensable, for many natural language processing applications. Yet, traditional approaches for acquiring this knowledge manually are expensive and cannot easily handle the requisite domain dependence. In this dissertation, I address four closely related problems from lexical semantics, describing a fully automatic system that extracts information about semantic groups and scales from large free-text corpora. The system forms groups of semantically related terms such as {cold, warm, hot}, {final, preliminary}, and {court, jury, law, regulation}. Using gradability indicators, it identifies those of the groups that are actually linguistic scales, i.e., contain terms that can be linearly ordered on the basis of semantic strength. Scalar groups are further partitioned into two subgroups according to evaluative orientation, distinguishing between positively loaded terms (e.g., beautiful, ingenious, unbiased) and their negative counterparts (e.g., ugly, plain, lazy). Finally, the semantic orientation of each subgroup is identified. Combining the above four stages results in an automatic method for the retrieval of possibly domain-dependent pairs of antonyms. All this information is actively learned from the corpus; the system does not access any type of stored information about words such as dictionaries, thesauri, or similar databases. I have adopted a statistical approach that combines both supervised and unsupervised learning methods and is informed by linguistic models of the data and the tasks at hand. I rely on robust, non-parametric statistical methods; multiple knowledge sources justified by linguistic analyses; and shallow syntactic and morphological processing during information extraction.
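The four stages described above can be sketched as a pipeline. This is only an illustrative skeleton: the group data, the `GRADABLE` and `POSITIVE` lexicons, and all function names are invented stand-ins for the corpus-derived statistical evidence the actual system computes.

```python
# Hypothetical sketch of the four-stage pipeline, NOT the dissertation's
# actual implementation. Toy lexicons stand in for corpus statistics.

# Stage 1 output: groups of semantically related terms.
groups = [
    {"cold", "warm", "hot"},
    {"final", "preliminary"},
    {"court", "jury", "law", "regulation"},
]

# Toy evidence: in the real system these are derived from the corpus.
GRADABLE = {"cold", "warm", "hot", "final", "preliminary"}  # gradability indicators
POSITIVE = {"warm", "hot", "final"}                         # positive orientation

def is_scale(group):
    """Stage 2: treat a group as a linguistic scale if all members are gradable."""
    return all(term in GRADABLE for term in group)

def partition_by_orientation(group):
    """Stage 3: split a scalar group into positive and negative subgroups."""
    pos = {t for t in group if t in POSITIVE}
    return pos, group - pos

def antonym_pairs(groups):
    """Stage 4: pair terms across the two orientations of each scale."""
    pairs = []
    for g in groups:
        if is_scale(g):
            pos, neg = partition_by_orientation(g)
            pairs.extend((p, n) for p in pos for n in neg)
    return pairs
```

Under these toy lexicons, `antonym_pairs(groups)` pairs `warm`/`cold`, `hot`/`cold`, and `final`/`preliminary`, while the non-gradable group {court, jury, law, regulation} is correctly skipped at the scale-detection stage.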
I describe and justify the linguistic sources, and present the results (sometimes quite unexpected) of experimental studies designed to validate related hypotheses made in the linguistics literature. I also present a novel evaluation method that simultaneously employs multiple reference models without inducing a single "best" model, along with results produced for several collections of adjectives and nouns. Finally, I present evidence of the strengths of the hybrid linguistic-statistical approach, and discuss applications of the system's output to language problems.
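One way to evaluate against multiple reference models without collapsing them into a single "best" one is to score the system's grouping against each reference separately and report the full set of scores. The pairwise-F1 metric and the toy data below are illustrative assumptions, not the dissertation's actual evaluation method.

```python
# Illustrative multi-reference evaluation sketch (hypothetical metric,
# not the dissertation's): score a system grouping against each human
# reference model separately instead of inducing one "best" reference.
from itertools import combinations

def pair_set(partition):
    """All unordered word pairs placed in the same group."""
    return {frozenset(p) for group in partition
            for p in combinations(sorted(group), 2)}

def f1_against(system, reference):
    """Pairwise precision/recall F1 of a system grouping vs one reference."""
    sys_pairs, ref_pairs = pair_set(system), pair_set(reference)
    tp = len(sys_pairs & ref_pairs)
    if not tp:
        return 0.0
    precision = tp / len(sys_pairs)
    recall = tp / len(ref_pairs)
    return 2 * precision * recall / (precision + recall)

system = [{"cold", "cool"}, {"hot", "warm"}]
references = [
    [{"cold", "cool"}, {"hot", "warm"}],  # annotator 1
    [{"cold", "cool", "warm"}, {"hot"}],  # annotator 2
]

# One score per reference model; no single model is privileged.
scores = [f1_against(system, ref) for ref in references]
```

Here the system agrees perfectly with the first annotator (F1 = 1.0) but only partially with the second (F1 = 0.4), and both numbers are reported rather than averaged away or resolved in favor of one annotator.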
