Scaling the Technology Opportunity Analysis text data mining methodology: Data extraction, cleaning, online analytical processing analysis, and reporting of large multi-source datasets.

Abstract

Because the existing applications of the Technology Opportunity Analysis (TOA) text data mining framework developed by Alan Porter and other researchers used small datasets, previous research never pushed the limits of the methodology and failed to identify areas for future research associated with using larger datasets. This research developed extensions to the TOA framework to improve its performance and scalability and proved that the framework could be successfully scaled to analyze large datasets. The work included the development of a comprehensive set of new or significantly improved data extraction filters and data cleaning thesauruses; a data model and architecture based on relational database and online analytical processing (OLAP) technologies, which provides an open platform with easy, standards-compliant access for browsing, reporting, and data mining software that supports either SQL or MDX queries; and a report distribution framework that does not require the end users of TOA output to use any specialized or prohibitively expensive client applications beyond the standard Microsoft Office applications and Adobe Acrobat Reader. In addition, it demonstrated that the time necessary to complete the data acquisition, cleaning, and transformation tasks can be reduced by at least 75% by creating libraries of import filters for commonly used data sources, eliminating unnecessary steps, using 64-bit native databases and extraction filters, improving the data model and architecture, and using significantly better data cleaning thesauruses. This work is significant because it enables a variety of research paths applying alternative statistical or data mining algorithms that previously would have been impossible to undertake.
Thesauruses and fuzzy logic routines to clean and group the data are presented, and their accuracy is tested on gene expression, energy storage, photovoltaics, smart materials, bioinformatics, quantum computing, wind turbine, nanotube, global warming, and data fusion data sets and benchmarked against existing thesauruses and fuzzy logic routines. A database on photovoltaic solar cell research that integrates 116,240 records from thirteen bibliographic, patent, and funding abstract databases was used to illustrate the concepts developed and tested in this dissertation. This dissertation is a compound document (it contains both a paper copy and a CD). The CD has the following system requirement: Microsoft SQL Server 2005.
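
As a rough illustration of what cleaning and grouping terms with a thesaurus plus a fuzzy-matching fallback can look like, here is a minimal Python sketch. It is not the dissertation's implementation: the abstract does not describe its routines, so the thesaurus entries, the function names (`normalize`, `clean_term`, `group_terms`), the 0.85 similarity threshold, and the use of the standard library's `difflib.SequenceMatcher` as the fuzzy matcher are all assumptions made for illustration.

```python
from difflib import SequenceMatcher

# Hypothetical thesaurus: known variant -> canonical term.
# These entries are illustrative only, not taken from the dissertation.
THESAURUS = {
    "photo-voltaic": "photovoltaic",
    "pv cell": "photovoltaic cell",
    "solar cells": "solar cell",
}


def normalize(term):
    """Lowercase the term and collapse runs of whitespace."""
    return " ".join(term.lower().split())


def clean_term(term, canonical_terms, threshold=0.85):
    """Map a raw term to a canonical form.

    First try an exact thesaurus lookup; otherwise fall back to the
    most similar canonical term, accepted only if its similarity
    ratio reaches the (assumed) threshold.
    """
    key = normalize(term)
    if key in THESAURUS:
        return THESAURUS[key]
    best, best_score = key, 0.0
    for candidate in canonical_terms:
        score = SequenceMatcher(None, key, candidate).ratio()
        if score > best_score:
            best, best_score = candidate, score
    # Leave the term unchanged when nothing is similar enough.
    return best if best_score >= threshold else key


def group_terms(terms, canonical_terms, threshold=0.85):
    """Group raw terms under the canonical form each one cleans to."""
    groups = {}
    for raw in terms:
        canonical = clean_term(raw, canonical_terms, threshold)
        groups.setdefault(canonical, []).append(raw)
    return groups


if __name__ == "__main__":
    raw_terms = ["Nanotubes", "nanotube", "PV cell", "quantum computing"]
    print(group_terms(raw_terms, canonical_terms=["nanotube"]))
```

The exact lookup runs first so that curated thesaurus entries always win over the fuzzy fallback, and the threshold controls how aggressively near-duplicates are merged. A lower value groups more variants at the cost of more false merges.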
