
Journal First: Experiences and Challenges in Building a Data Intensive System for Data Migration



Abstract

Recent analyses [2, 4, 5] report that many sectors of our economy and society are increasingly guided by data-driven decision processes (e.g., health care, public administration). Data Intensive (DI) applications are therefore becoming more and more important and critical. They must be fault-tolerant, they should scale with the amount of data, and they should be able to elastically exploit additional resources as and when these are provided [3]. Moreover, they should avoid data drops caused by sudden overloads and should offer some Quality of Service (QoS) guarantees. Ensuring all these properties is, per se, a challenge, but it becomes even more difficult for DI applications, given the large amount of data to be managed and the significant level of parallelism required of their components. Even though some technological frameworks are available today for the development of such applications (for instance, Spark, Storm, Flink), we still lack solid software engineering approaches to support their development and, in particular, to ensure that they offer the required properties in terms of availability, throughput, data loss, etc. In fact, at the time of writing, identifying the right solution can require several rounds of experiments and the adoption of many different technologies. This implies the need for highly skilled people and for experiments with large data sets and a large number of resources, and, consequently, a significant amount of time and budget.

To experiment with currently available approaches, we performed an action research experiment focused on developing, testing, and reengineering a specific DI application, Hegira4Cloud, which migrates data between widely used NoSQL databases, including so-called Database as a Service (DaaS) offerings as well as on-premise databases. This is a representative DI system because it has to handle large volumes of data with different structures and has to guarantee that some important characteristics, in terms of data types and transactional properties, are preserved. It also poses stringent requirements in terms of correctness, high performance, fault tolerance, and fast and effective recovery.

In our action research we found that the literature offered some high-level design guidelines for DI applications, as well as some tools to support modelling and QoS analysis/simulation of complex architectures; however, the available tools were not yet suitable to support DI systems. Moreover, we realized that the big data frameworks we could have used were not flexible enough to cope with all the application-specific aspects of our system. Hence, to achieve the desired level of performance, fault tolerance, and recovery, we had to adopt a time-consuming, experiment-based approach [1, 6], which, in our case, consisted of three iterations: (1) the design and implementation of a Mediation Data Model capable of managing data extracted from different databases (sketched below), together with a first monolithic prototype of Hegira4Cloud; (2) the improvement of the prototype's performance when managing and transferring huge amounts of data; (3) the introduction of fault-tolerant data extraction and management mechanisms that are independent of the targeted databases.
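To make iteration (1) more concrete, the following is a minimal sketch of what a mediation data model for NoSQL-to-NoSQL migration can look like, written in Java, the language Hegira4Cloud is implemented in. It assumes a column-oriented neutral representation; all class and field names (MetaEntity, MetaColumn, partitionGroup, etc.) are our own illustration, not the project's actual API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * Minimal sketch of a mediation (intermediate) data model for NoSQL-to-NoSQL
 * migration. Every record read from a source database is translated into this
 * neutral representation; every target adapter only knows how to persist it.
 * All names here are illustrative, not Hegira4Cloud's actual API.
 */
public final class MetaEntity {

    /** One property of a source record, with enough metadata to rebuild it. */
    public static final class MetaColumn {
        public final String name;
        public final byte[] value;        // value serialized in a db-agnostic form
        public final String declaredType; // original type name, e.g. "Int64"
        public final boolean indexable;   // whether the source store indexed it

        public MetaColumn(String name, byte[] value, String declaredType, boolean indexable) {
            this.name = name;
            this.value = value;
            this.declaredType = declaredType;
            this.indexable = indexable;
        }
    }

    public final String key;            // primary key of the record in the source
    public final String partitionGroup; // hint: entities that must stay together
    private final Map<String, List<MetaColumn>> columnFamilies = new HashMap<>();

    public MetaEntity(String key, String partitionGroup) {
        this.key = key;
        this.partitionGroup = partitionGroup;
    }

    /** Adds a column under the given column family (created on first use). */
    public void addColumn(String family, MetaColumn column) {
        columnFamilies.computeIfAbsent(family, f -> new ArrayList<>()).add(column);
    }

    public Map<String, List<MetaColumn>> getColumnFamilies() {
        return columnFamilies;
    }
}
```

The payoff of a mediation layer of this kind is that supporting a new database requires one translator into and one out of the neutral model, rather than pairwise converters between every source/target combination.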
Among others, an important issue that forced us to iterate during the development of Hegira4Cloud concerned the DaaS we interfaced with. In particular, these DaaS, which are well-known services with a large number of users: (1) lacked detailed information regarding the behaviour of their APIs; (2) did not offer a predictable service; (3) suffered random downtimes that were not correlated with the datasets we were experimenting with. Conditions like these force defensive client code, as the sketch below illustrates.

In this journal-first presentation, we describe our experience and the issues we encountered that led to some important decisions during the software design and engineering process. We also analyse the state of the art in software design and verification.
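As referenced above, a migration system facing undocumented API behaviour, unpredictable performance, and random downtimes typically ends up wrapping every DaaS call in retry logic. The following is a minimal, hypothetical sketch of such a retry-with-exponential-backoff wrapper; it is illustrative only, not code from Hegira4Cloud, and fetchRowFromDaas is a stand-in for a real provider SDK call.

```java
import java.util.concurrent.Callable;

/**
 * Minimal retry-with-exponential-backoff wrapper: a hypothetical sketch of
 * the defensive style that unpredictable DaaS endpoints force on clients.
 * Not code from Hegira4Cloud.
 */
public final class DaasRetry {

    /** Runs {@code call}, retrying up to {@code maxAttempts} times, doubling the delay. */
    public static <T> T withBackoff(Callable<T> call, int maxAttempts, long initialDelayMs)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        long delay = initialDelayMs;
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return call.call();
            } catch (Exception e) {    // real code would retry transient errors only
                last = e;
                if (attempt == maxAttempts) {
                    break;             // out of attempts: rethrow below
                }
                Thread.sleep(delay);   // back off before trying again
                delay *= 2;            // exponential backoff
            }
        }
        throw last;
    }

    public static void main(String[] args) throws Exception {
        // Usage example with a stand-in for a flaky DaaS read.
        String row = withBackoff(() -> fetchRowFromDaas("user-42"), 5, 200L);
        System.out.println(row);
    }

    // Hypothetical stand-in; a real client would call the provider's SDK here.
    private static String fetchRowFromDaas(String key) {
        return "value-for-" + key;
    }
}
```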
