Conference: Grid Computing

Parallel and distributed approach for processing large-scale XML datasets



Abstract

An emerging trend is the use of XML as the data format for many distributed scientific applications, with the size of these documents ranging from tens to hundreds of megabytes. Our earlier benchmarking results revealed that most of the widely available XML processing toolkits do not scale well for large-sized XML data. A significant transformation is necessary in the design of XML processing for scientific applications so that the overall application turn-around time is not negatively affected. We present both a parallel and a distributed approach to analyze how the scalability and performance requirements of large-scale XML-based data processing can be achieved. We have adapted the Hadoop implementation to determine the threshold data sizes and computation work required per node for a distributed solution to be effective. We also present an analysis of parallelism using our Piximal toolkit for processing large-scale XML datasets, which exploits the parallelism available in emerging multi-core architectures. Multi-core processors are expected to be widely available in research clusters and scientific desktops, and it is critical to harness the opportunities for parallelism in the middleware, instead of passing on the task to application programmers. Our parallelization approach for a multi-core node is to employ a DFA-based parser that recognizes a useful subset of the XML specification, and to convert the DFA into an NFA that can be applied to an arbitrary subset of the input. Speculative NFAs are scheduled on available cores in a node to effectively utilize the processing capabilities and achieve overall performance gains. We evaluate the efficacy of this approach in terms of the potential speedup that can be achieved for representative XML data sets.

