To address the very large data volume and the complex, time-consuming computation involved in current crawl scheduling, the data are partitioned into layers by agglomerative hierarchical clustering according to the dimensional feature attributes of the data set, and the approach is applied to a small Hadoop distributed system. The Master server maintains the MySQL database that holds the data to be layered and groups the data by feature, then dispatches the grouped data to different Slave servers for processing. The agglomerative hierarchical clustering rules are applied as a preprocessing step before the Map phase to produce the data template files. In simulation experiments built on the MVC pattern, the small Hadoop distributed system, with its Master and Slave nodes, ran nearly 65% more efficiently than a single-machine crawler.
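The preprocessing idea in the abstract (cluster records by their feature dimensions, then hand each group to a Slave) can be illustrated with a minimal sketch. The abstract does not state the linkage criterion, the distance metric, or how clusters are assigned to Slave servers, so the average linkage, Euclidean distance, round-robin assignment, and the names `agglomerative_clusters`, `assign_to_slaves`, `slave1`, and `slave2` below are all assumptions made for illustration, not the authors' implementation.

```python
from itertools import combinations
from math import dist


def agglomerative_clusters(points, k):
    """Bottom-up clustering: merge the two closest clusters until k remain.

    Assumption: average linkage over Euclidean distance (not given in the abstract).
    Returns a list of clusters, each a list of indices into `points`.
    """
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    # Start with every point in its own cluster.
    clusters = [[i] for i in range(len(points))]

    def linkage(a, b):
        # Average pairwise distance between the members of two clusters.
        total = sum(dist(points[i], points[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > k:
        # Pick the pair of clusters with the smallest linkage distance.
        x, y = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: linkage(clusters[p[0]], clusters[p[1]]),
        )
        # x < y always holds, so deleting y leaves index x valid.
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters


def assign_to_slaves(clusters, slaves):
    """Hypothetical dispatch step: hand clusters to slave names round-robin."""
    plan = {name: [] for name in slaves}
    for n, cluster in enumerate(clusters):
        plan[slaves[n % len(slaves)]].extend(cluster)
    return plan


if __name__ == "__main__":
    # Toy feature vectors standing in for crawl-task records.
    features = [(0, 0), (0, 1), (10, 10), (10, 11), (50, 50)]
    groups = agglomerative_clusters(features, k=3)
    print(assign_to_slaves(groups, ["slave1", "slave2"]))
```

This naive version recomputes every linkage on each merge, which is fine for a small illustration but would be too slow for the data volumes the abstract has in mind; in a real Hadoop job the grouping would presumably be done once, before the Map phase, as the abstract describes.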