METHOD AND SYSTEM FOR PARALLELIZATION OF INGESTION OF LARGE DATA SETS
Abstract
The present invention relates, in an embodiment, to a method for ingesting input data containing a plurality of records into a data lake. In an embodiment, the method comprises splitting the input data into a plurality of input splits consisting of a balanced number of records; reading the records from the plurality of input splits in parallel, regardless of the format and encoding of the input source; converting the input data within the records into at least one key/value pair; transforming the values input data into a serializable format; sorting the key/value pairs of the transformed values such that the records are sorted in the same order as they were read; writing the transformed values to an output file; and storing the output file to the data lake. The present invention also relates, in another embodiment, to a system for ingesting input data containing a plurality of records into a data lake. In an embodiment, the system comprises one or more processors adapted to execute one or more modules, the modules comprising: an input module for splitting the input data into input splits; a mapping module for transforming the input data in the input splits into a format for processing; a partition module for sorting the transformed data; and an output module for writing the sorted transformed data to an output file and determining a location on the data lake for the output file; and a driver for communicating with the one or more modules of the one or more processors via a first communication layer, the driver configuring the one or more modules and calculating the input splits.
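The method steps recited above can be illustrated with a minimal Python sketch. It is not the patented implementation. It assumes the records are already held in an in-memory list, that the key of each key/value pair is the record's original position, that JSON is the serializable format, and that a local file path stands in for the data lake. The function names `balanced_splits`, `map_split`, and `ingest` are hypothetical.

```python
import json
from concurrent.futures import ThreadPoolExecutor, as_completed


def balanced_splits(num_records, num_splits):
    """Return (start, end) index ranges whose sizes differ by at most one record."""
    num_splits = max(1, min(num_splits, num_records or 1))
    base, extra = divmod(num_records, num_splits)
    splits, start = [], 0
    for i in range(num_splits):
        size = base + (1 if i < extra else 0)
        splits.append((start, start + size))
        start += size
    return splits


def map_split(records, split):
    """Convert each record in one split into a (key, serialized value) pair.

    Assumption: the key is the record's original position, and the value
    is the record serialized as JSON.
    """
    start, end = split
    return [(i, json.dumps(records[i])) for i in range(start, end)]


def ingest(records, num_splits, output_path):
    """Split, map in parallel, sort back into read order, and write one output file."""
    splits = balanced_splits(len(records), num_splits)
    pairs = []
    with ThreadPoolExecutor(max_workers=len(splits)) as pool:
        futures = [pool.submit(map_split, records, s) for s in splits]
        # Splits may finish in any order, so the pairs arrive unordered.
        for future in as_completed(futures):
            pairs.extend(future.result())
    # Sorting on the key restores the order in which the records were read.
    pairs.sort(key=lambda kv: kv[0])
    # Assumption: a local file stands in for a location on the data lake.
    with open(output_path, "w", encoding="utf-8") as out:
        for _, value in pairs:
            out.write(value + "\n")
    return output_path
```

In this sketch, `ingest` loosely plays the part of the driver (it calculates the splits and coordinates the workers), `map_split` that of the mapping module, and the sort and write steps those of the partition and output modules.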