
Pre-processing for identifying nonsense passages in documents being ingested into a corpus of a natural language processing system


Abstract

A mechanism is provided in a data processing system for identifying nonsense passages in documents being ingested into a corpus. A natural language processing pipeline configured to execute in the data processing system receives an input document to be ingested into a corpus. The natural language processing pipeline divides the input document into a plurality of input passages. A filter component of the natural language processing pipeline identifies whether each input passage is a nonsense passage based on a value of a metric determined according to a set of feature counts. The natural language processing pipeline filters each input passage in the plurality of input passages based on whether the input passage is identified as a nonsense passage or not identified as a nonsense passage to form a filtered plurality of input passages. The natural language processing pipeline adds the filtered plurality of input passages into the corpus.
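
The flow described in the abstract (divide the document into passages, compute feature counts, derive a metric, filter, then add the survivors to the corpus) can be illustrated with a minimal sketch. The abstract does not specify the features, the metric, or the threshold, so everything concrete here is an assumption made for illustration: passages are split on blank lines, the two feature counts are total tokens and purely alphabetic "word-like" tokens, the metric is the fraction of tokens that are not word-like, and `NONSENSE_THRESHOLD` is a hypothetical cutoff. All function names are hypothetical as well.

```python
import re
from typing import Dict, List

# Hypothetical cutoff; the abstract does not state a threshold.
NONSENSE_THRESHOLD = 0.5


def split_into_passages(document: str) -> List[str]:
    """Divide a document into passages on blank lines (an assumed rule)."""
    return [p.strip() for p in re.split(r"\n\s*\n", document) if p.strip()]


def feature_counts(passage: str) -> Dict[str, int]:
    """Count illustrative features; the actual feature set is not given."""
    tokens = passage.split()
    # A token is "word-like" if it is all letters, optionally followed
    # by one trailing punctuation mark.
    word_like = [t for t in tokens if re.fullmatch(r"[A-Za-z]+[.,;:!?]?", t)]
    return {"tokens": len(tokens), "word_like": len(word_like)}


def nonsense_metric(counts: Dict[str, int]) -> float:
    """Assumed metric: fraction of tokens that are not word-like."""
    if counts["tokens"] == 0:
        return 1.0
    return 1.0 - counts["word_like"] / counts["tokens"]


def is_nonsense(passage: str) -> bool:
    """Flag a passage when its metric exceeds the assumed threshold."""
    return nonsense_metric(feature_counts(passage)) > NONSENSE_THRESHOLD


def ingest(document: str, corpus: List[str]) -> List[str]:
    """Split, filter, and append the surviving passages to the corpus."""
    kept = [p for p in split_into_passages(document) if not is_nonsense(p)]
    corpus.extend(kept)
    return kept


if __name__ == "__main__":
    corpus: List[str] = []
    doc = (
        "The pipeline filters passages before ingestion.\n\n"
        "0x3f 9a## $$ 77-12 @@ q1w2\n\n"
        "Only readable text is kept."
    )
    for passage in ingest(doc, corpus):
        print(passage)
```

A real filter component would likely use richer feature counts and a tuned or learned threshold; the sketch only shows how the stages named in the abstract fit together.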
