Journal of Chemical Information and Modeling

Harvesting Chemical Information from the Internet Using a Distributed Approach: ChemXtreme



Abstract

The Internet is a comprehensive but largely unstructured resource of chemical information. It provides a wealth of scientific information, such as experimental data, and requires a suitable automated data mining and analysis tool for its meaningful exploration. The Java-based software presented here, ChemXtreme, was developed for harvesting chemical information from the Internet, employing the Google API in combination with a distributed client/server text analysis architecture based on Java RMI. It represents the first, and until now the only, toolkit for automated structured data retrieval from the Internet that is itself open source. ChemXtreme employs the "search the search engine" strategy, in which the URLs returned from the search engine are analyzed further via textual pattern analysis. This process resembles the manual analysis of a hit list, where relevant data are captured and, by means of human intervention, mined into a format suitable for further analysis. ChemXtreme, on the other hand, transforms chemical information automatically into a structured format suitable for storage in databases and further analysis, and also provides links to the original information source. The query data retrieved from the search engine by the server are encoded, encrypted, and compressed, and then sent to all participating active clients in the network for parsing. Relevant information identified by the clients on the retrieved Web sites is sent back to the server, verified, and added to the database for data mining and further analysis. Distributing the further analysis of URLs across a client/server architecture scales very favorably, producing only minimal overhead.
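To make the workflow in the abstract more concrete, the following is a minimal Java sketch of two of its steps: packaging a list of URLs for transfer to a client (here only GZIP compression plus Base64 encoding; the encryption step is omitted) and applying textual pattern analysis to page text. It is not ChemXtreme's actual code. The class name `HarvestSketch`, its method names, and the melting-point pattern are all hypothetical and chosen only for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Base64;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical illustration only; not taken from the ChemXtreme sources.
class HarvestSketch {

    // Assumed example pattern: "melting point", up to 20 non-digit characters,
    // a number, and a Celsius unit. Real patterns would be property-specific.
    static final Pattern MELTING_POINT = Pattern.compile(
        "melting point\\D{0,20}?(-?\\d+(?:\\.\\d+)?)\\s*(?:\\u00B0\\s*C|deg\\s*C)",
        Pattern.CASE_INSENSITIVE);

    // Server side: join the URLs, GZIP-compress them, and Base64-encode the result.
    // Encryption, which the abstract also mentions, is left out of this sketch.
    static String pack(List<String> urls) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(buffer)) {
            gzip.write(String.join("\n", urls).getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return Base64.getEncoder().encodeToString(buffer.toByteArray());
    }

    // Client side: reverse the encoding and compression to recover the URL list.
    static List<String> unpack(String payload) {
        byte[] compressed = Base64.getDecoder().decode(payload);
        try (GZIPInputStream gzip =
                 new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            String text = new String(gzip.readAllBytes(), StandardCharsets.UTF_8);
            return new ArrayList<>(Arrays.asList(text.split("\n")));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Client side: scan already-downloaded page text for melting-point values.
    static List<Double> extractMeltingPoints(String pageText) {
        List<Double> values = new ArrayList<>();
        Matcher matcher = MELTING_POINT.matcher(pageText);
        while (matcher.find()) {
            values.add(Double.parseDouble(matcher.group(1)));
        }
        return values;
    }
}
```

In this sketch a server would call `pack` on the search-engine hit list and hand the resulting string to each client, for example as the argument of a remote method call. A client would call `unpack`, fetch each page, and return the values found by `extractMeltingPoints` together with the source URL, so that the link to the original information source is preserved.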
