Conference: Grid Computing

Parallel and distributed approach for processing large-scale XML datasets



Abstract

An emerging trend is the use of XML as the data format for many distributed scientific applications, with the size of these documents ranging from tens to hundreds of megabytes. Our earlier benchmarking results revealed that most of the widely available XML processing toolkits do not scale well for large-sized XML data. A significant transformation is necessary in the design of XML processing for scientific applications so that the overall application turn-around time is not negatively affected. We present both a parallel and a distributed approach to analyze how the scalability and performance requirements of large-scale XML-based data processing can be achieved. We have adapted the Hadoop implementation to determine the threshold data sizes and computation work required per node for a distributed solution to be effective. We also present an analysis of parallelism using our Piximal toolkit for processing large-scale XML datasets, which exploits the parallelism available in emerging multi-core architectures. Multi-core processors are expected to be widely available in research clusters and scientific desktops, and it is critical to harness the opportunities for parallelism in the middleware, instead of passing on the task to application programmers. Our parallelization approach for a multi-core node is to employ a DFA-based parser that recognizes a useful subset of the XML specification, and to convert the DFA into an NFA that can be applied to an arbitrary subset of the input. Speculative NFAs are scheduled on available cores in a node to effectively utilize the processing capabilities and achieve overall performance gains. We evaluate the efficacy of this approach in terms of the potential speedup that can be achieved for representative XML data sets.

