Journal: Scientific Programming
Using Natural Language Preprocessing Architecture (NLPA) for Big Data Text Sources

Abstract

In recent years, big data analysis has become a popular means of exploiting multiple (initially valueless) sources to find relevant knowledge about real domains. However, many big data sources provide unstructured textual data, and a proper analysis requires tools that adequately combine big data and text analysis techniques. With this in mind, we combined a pipelining framework, BDP4J (Big Data Pipelining for Java), with implementations of a set of text preprocessing techniques to create NLPA (Natural Language Preprocessing Architecture), an extendable open-source plugin whose preprocessing steps can be easily combined into a pipeline. Additionally, NLPA can generate datasets using either a classical token-based representation of the data or a newer synset-based representation that can be further processed using semantic information (i.e., using ontologies). This work presents a case study of NLPA in operation, covering the transformation of raw heterogeneous big data into different dataset representations (synsets and tokens) and the use of the Weka application programming interface (API) to run two well-known classifiers.
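The idea of chaining small preprocessing steps into a pipeline and then building a token-based representation can be illustrated with a minimal Java sketch. It does not use the BDP4J or NLPA APIs, which the abstract does not describe; the class name `PipelineSketch`, the three steps, and the sample text are all hypothetical and chosen only for illustration.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.UnaryOperator;

// Hypothetical sketch: NOT the BDP4J/NLPA API, only the general idea of
// composable preprocessing steps followed by a token-based representation.
class PipelineSketch {

    // Each step maps a text to a cleaned text, so steps can be freely combined.
    static final UnaryOperator<String> STRIP_URLS =
            s -> s.replaceAll("https?://\\S+", " ");
    static final UnaryOperator<String> TO_LOWER =
            s -> s.toLowerCase(Locale.ROOT);
    static final UnaryOperator<String> STRIP_PUNCTUATION =
            s -> s.replaceAll("[^\\p{L}\\p{Nd}\\s]", " ");

    // The pipeline is an ordered list of steps (assumed order, for illustration).
    static final List<UnaryOperator<String>> PIPELINE =
            List.of(STRIP_URLS, TO_LOWER, STRIP_PUNCTUATION);

    // Apply every step in order to one raw text.
    public static String preprocess(String text) {
        String current = text;
        for (UnaryOperator<String> step : PIPELINE) {
            current = step.apply(current);
        }
        return current.trim();
    }

    // Token-based representation: token -> number of occurrences.
    public static Map<String, Integer> tokenCounts(String cleaned) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        if (cleaned.isEmpty()) {
            return counts;
        }
        for (String token : cleaned.split("\\s+")) {
            counts.merge(token, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        String raw = "Big data, BIG text! See http://example.org/post";
        // prints {big=2, data=1, text=1, see=1}
        System.out.println(tokenCounts(preprocess(raw)));
    }
}
```

A synset-based representation would replace the token keys with identifiers of word senses taken from a semantic resource; that mapping is left out here because it depends on resources the abstract does not specify.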
