Conference: World multiconference on systemics, cybernetics and informatics

Web mining and similarity for documents and databases


Abstract

This paper addresses the problems of searching, organizing and indexing documents and databases by content. These problems are not new; what is new are the methodologies we propose, developed in response to the conditions imposed by the Internet. The Internet requires a flexible approach that can take into account the complexity and heterogeneity of its information elements. The sheer number of Web documents, together with their different types and sources, produces information noise. To find a document, users rely on a search engine, which returns an ordered list of document fragments. Users either work through this list or follow other approaches, such as the methodology proposed by Zamir, which classifies documents to organize the results. Clustering is performed on the fragments produced by the search engine. This paper introduces some basic concepts, such as the concept of similarity between documents/databases. Similarity is the closeness between the indexed documents/databases belonging to the information base and a target document. Closeness is determined according to the characteristics used for indexing a document. Starting from the organization and syntactic structure of Web documents described in HTML, we can extract implicit information connected to the layout of documents, according to a set of heuristics that usually emphasize the content information units (i.e. the autonomous information parts of the document). The similarity concept has to take into account: ⅰ) the need to manage indexes produced by different interpretations, which are not necessarily independent (a user, i.e. a reader, can improve the indexes proposed by authors); ⅱ) the need to manage the document presentation according to the different cultures of users. A great number of similarity definitions exist; for textual information, the terminological approach is the most widely used. Image similarity, on the other hand, involves spatial relationships.
This paper extends the definition of similarity by introducing the concept of content similarity of document indexes (semantic similarity). Moreover, we extend the similarity concept to databases available on the Web. This becomes a web-mining problem involving Web documents and the metadata of databases accessible via the Web.
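
The abstract does not give a formula for closeness, so the following is only an illustrative sketch of the terminological baseline it mentions, not the authors' method. It assumes that a document index is a bag of terms, that terms found inside layout-emphasized HTML tags count more, and that closeness is measured by cosine similarity. The weights in `EMPHASIS_WEIGHTS` and the names `WeightedTermExtractor`, `index_document` and `cosine_similarity` are hypothetical choices made for this example; the semantic similarity proposed in the paper is not modeled here.

```python
import math
import re
from collections import Counter
from html.parser import HTMLParser

# Assumed heuristic: terms inside these tags are emphasized by the layout.
# The tag list and the weights are illustrative, not taken from the paper.
EMPHASIS_WEIGHTS = {"title": 3.0, "h1": 3.0, "h2": 2.0, "b": 1.5, "strong": 1.5}


class WeightedTermExtractor(HTMLParser):
    """Collects lowercase terms, weighted by the innermost open emphasis tag."""

    def __init__(self):
        super().__init__()
        self.open_tags = []       # stack of currently open emphasis tags
        self.weights = Counter()  # term -> accumulated weight

    def handle_starttag(self, tag, attrs):
        if tag in EMPHASIS_WEIGHTS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in self.open_tags:
            # Drop the most recently opened occurrence of this tag.
            last = len(self.open_tags) - 1 - self.open_tags[::-1].index(tag)
            del self.open_tags[last]

    def handle_data(self, data):
        weight = EMPHASIS_WEIGHTS[self.open_tags[-1]] if self.open_tags else 1.0
        for term in re.findall(r"[a-z0-9]+", data.lower()):
            self.weights[term] += weight


def index_document(html):
    """Builds a term -> weight index from an HTML string."""
    parser = WeightedTermExtractor()
    parser.feed(html)
    parser.close()
    return dict(parser.weights)


def cosine_similarity(a, b):
    """Cosine of the angle between two term-weight indexes (0.0 if either is empty)."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    target = index_document("<h1>Web mining</h1><p>similarity of documents</p>")
    base = {
        "doc_a": index_document("<h2>Mining the Web</h2><p>document similarity</p>"),
        "doc_b": index_document("<p>spatial relationships in images</p>"),
    }
    # Rank the information base by closeness to the target document.
    ranked = sorted(base, key=lambda k: cosine_similarity(target, base[k]), reverse=True)
    print(ranked)
```

Under these assumptions, ranking the information base reduces to sorting the indexed documents by their cosine score against the target. A term-overlap measure like this cannot relate documents that use different vocabulary for the same content, which is the kind of gap a content-based (semantic) similarity is meant to address.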
