Pre-processing for identifying nonsense passages in documents being ingested into a corpus of a natural language processing system
Abstract
A mechanism is provided in a data processing system for identifying nonsense passages in documents being ingested into a corpus. A natural language processing pipeline, configured to execute in the data processing system, receives an input document to be ingested into the corpus and divides it into a plurality of input passages. A filter component of the pipeline determines whether each input passage is a nonsense passage based on the value of a metric computed from a set of feature counts. The pipeline then filters the input passages according to whether each one is identified as a nonsense passage, forming a filtered plurality of input passages, and adds the filtered passages to the corpus.
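The flow described in the abstract (divide, count features, compute a metric, filter, add to the corpus) can be sketched roughly as follows. This is only an illustration under stated assumptions: the abstract does not say which features are counted, how the metric is defined, or what threshold is used, so the blank-line passage split, the two feature counts, the ratio-based metric, the 0.5 threshold, and every function name below are hypothetical choices rather than the patented method.

```python
import re
from typing import Dict, List


def split_into_passages(document: str) -> List[str]:
    """Divide a document into passages on blank lines (an assumed rule)."""
    return [p.strip() for p in re.split(r"\n\s*\n", document) if p.strip()]


def feature_counts(passage: str) -> Dict[str, int]:
    """Count features of a passage. Both features are hypothetical examples."""
    tokens = passage.split()
    # A token counts as "alphabetic" if it is letters only, optionally
    # followed by one trailing punctuation mark.
    alphabetic = sum(
        1 for t in tokens if re.fullmatch(r"[A-Za-z]+[.,;:!?]?", t)
    )
    return {"tokens": len(tokens), "alphabetic_tokens": alphabetic}


def nonsense_metric(counts: Dict[str, int]) -> float:
    """Assumed metric: the fraction of tokens that are not alphabetic."""
    if counts["tokens"] == 0:
        return 1.0  # treat an empty passage as nonsense
    return 1.0 - counts["alphabetic_tokens"] / counts["tokens"]


def is_nonsense(passage: str, threshold: float = 0.5) -> bool:
    """Flag a passage whose metric exceeds the (assumed) threshold."""
    return nonsense_metric(feature_counts(passage)) > threshold


def ingest(document: str, corpus: List[str], threshold: float = 0.5) -> List[str]:
    """Split, filter out nonsense passages, and add the rest to the corpus."""
    passages = split_into_passages(document)
    kept = [p for p in passages if not is_nonsense(p, threshold)]
    corpus.extend(kept)
    return kept


if __name__ == "__main__":
    corpus: List[str] = []
    doc = (
        "The pipeline divides the document into passages.\n\n"
        "0x3f 9a## 77|| @@ 1.2.3 ~~\n\n"
        "Each passage is filtered before ingestion."
    )
    for passage in ingest(doc, corpus):
        print(passage)
```

In this sketch the middle passage contains no alphabetic tokens, so its metric is 1.0 and it is dropped, while the two prose passages are added to the corpus. A real filter would likely combine more feature counts (for example dictionary hits or punctuation density) and tune the threshold on sample documents.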