Noun phrases in documents: Preprocessing, automatic extraction, and statistical analysis in different categories of text.

Abstract

The primary objective of this study is to analyze noun phrase patterns in full-text documents. Knowledge of noun phrase patterns could facilitate the automatic indexing of electronic documents stored in an information retrieval system.

This dissertation examines several questions concerning the identification of noun phrase patterns with the aid of natural language processing techniques. First, how does the preprocessing stage (preparing raw text for automated linguistic analysis) affect noun phrase extraction? Second, are some natural language processing techniques more effective than others in extracting noun phrases? And finally, do properties of text such as subject domain (e.g., humanities, social sciences, engineering), genre (academic research articles vs. newspaper articles), and research method (quantitative vs. qualitative) have effects on noun phrase patterns?

To investigate these questions, a data set consisting of 1,099 full-text documents (450 academic research articles and 649 newspaper articles) was developed for this study. An examination of the raw data set revealed significant inconsistencies in document format and content across subject domains. Besides presenting practical problems to overcome, these inconsistencies provide additional evidence for the existence of domain-specific textual characteristics and the need for domain-specific methods of term identification and extraction.

A comparative evaluation of three automatic language analysis tools (a probabilistic parser, a rule-based tagger, and a statistical tagger) showed that all had comparable effectiveness rates (97–99%). The statistical tagger was chosen for noun phrase identification in the remainder of this study because of its combination of effectiveness and efficiency.

A statistical analysis of the document data set found that both subject domain and research method influence noun phrase patterns in academic documents. The statistical frequency of noun phrases and the proportion of different types of noun phrases differ from one domain to the next.

The findings confirmed the significance of domain-specific characteristics in text. They suggest that different document types, with different textual characteristics, may require different methods of content representation in information retrieval system design to optimize performance.
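As a rough illustration of the kind of pipeline the abstract describes (tagging, then noun phrase identification, then counting), the following is a minimal sketch only. It is not the dissertation's method: the tag pattern (optional determiner, adjectives, one or more nouns), the Penn Treebank-style tag names, the function names, and the sample sentence are all assumptions made for this example, and the input is assumed to be already tagged by some part-of-speech tagger.

```python
from collections import Counter

# Assumed Penn Treebank-style tag groups (illustrative, not from the source).
DET = {"DT"}
ADJ = {"JJ", "JJR", "JJS"}
NOUN = {"NN", "NNS", "NNP", "NNPS"}


def extract_noun_phrases(tagged):
    """Return noun phrases from a list of (word, tag) pairs.

    Hypothetical pattern: optional determiner, any adjectives,
    then one or more nouns.
    """
    phrases = []
    i, n = 0, len(tagged)
    while i < n:
        start = i
        if tagged[i][1] in DET:
            i += 1
        while i < n and tagged[i][1] in ADJ:
            i += 1
        noun_start = i
        while i < n and tagged[i][1] in NOUN:
            i += 1
        if i > noun_start:
            # At least one noun was found, so the span is a noun phrase.
            phrases.append(" ".join(word for word, _ in tagged[start:i]))
        else:
            # No noun followed; restart one token after the span start.
            i = start + 1
    return phrases


def phrases_per_100_tokens(tagged):
    """Noun phrase frequency normalized by document length."""
    if not tagged:
        return 0.0
    return 100.0 * len(extract_noun_phrases(tagged)) / len(tagged)


def phrase_length_profile(phrases):
    """Count noun phrases by number of words, a crude 'type' breakdown."""
    return Counter(len(p.split()) for p in phrases)


# Hand-tagged sample sentence (invented for this sketch).
sample = [
    ("The", "DT"), ("statistical", "JJ"), ("tagger", "NN"),
    ("extracts", "VBZ"), ("noun", "NN"), ("phrases", "NNS"),
    ("from", "IN"), ("full-text", "JJ"), ("documents", "NNS"),
    (".", "."),
]

phrases = extract_noun_phrases(sample)
print(phrases)
print(phrases_per_100_tokens(sample))
print(phrase_length_profile(phrases))
```

Running the same two summary functions over documents grouped by subject domain or genre would give per-group frequencies and length profiles that could then be compared statistically, which is the general shape of the analysis the abstract reports.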

Bibliographic record

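  • Author

    Kim, Youngin.

  • Author affiliation

    University of California, Berkeley.

  • Degree-granting institution: University of California, Berkeley.
  • Subject: Information Science; Library Science.
  • Degree: Ph.D.
  • Year: 2002
  • Pagination: 222 p.
  • Total pages: 222
  • Original format: PDF
  • Language of text: English (eng)
  • Chinese Library Classification: Information and knowledge dissemination; Library science and librarianship
  • Keywords

  • Date added to database: 2022-08-17 11:46:10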
