Web mining and similarity for documents and databases

Abstract

This paper addresses the problems of searching, organizing, and indexing documents and databases by content. The problems themselves are not new; what is new are the methodologies we propose, which were developed for the boundaries set by the Internet. The Internet requires a flexible approach that can account for the complexity and heterogeneity of its information elements. The sheer number of Web documents, with their different types and sources, produces information noise. To find a document, users query a search engine, which returns an ordered list of document fragments. Users either work through this list or turn to other approaches, such as the methodology proposed by Zamir, which clusters documents to organize the results. The clustering is computed on the fragments returned by the search engine. This paper introduces some basic concepts, such as the similarity between documents/databases. Similarity is the closeness between the indexed documents/databases belonging to the information base and a target document. Closeness is determined according to the characteristics used to index a document. Starting from the organization and syntactic structure of Web documents described in HTML, we can extract implicit information connected to the layout of documents by means of a set of heuristics, which usually emphasize the content information units (i.e., the autonomous information parts of the document). The similarity concept has to take into account: i) the need to manage indexes produced by different interpretations, which are not necessarily independent (a user, as reader, can improve the indexes proposed by authors); ii) the need to manage document presentation according to the cultures of different users. A great number of similarity definitions exist; for textual information, the terminological approach is the most widely used. Image similarity, on the other hand, involves spatial relationships.
This paper extends the definition of similarity by introducing the concept of content similarity of document indexes (semantic similarity). Moreover, we extend the similarity concept to databases available on the Web. The result is a web-mining problem involving Web documents and the metadata of databases accessible via the Web.
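
The abstract does not give the paper's heuristics or its similarity formula, so the following is only a minimal sketch of the general idea it describes: terms are extracted from an HTML document, terms inside layout-emphasized elements are weighted more heavily, and a purely terminological similarity is computed between the resulting indexes. The tag weights, the names `TAG_WEIGHTS`, `WeightedTermExtractor`, `build_index`, and `cosine_similarity`, and the choice of cosine similarity are all assumptions made for illustration, not the authors' method. The semantic similarity proposed in the paper is not modeled here.

```python
import math
import re
from collections import Counter
from html.parser import HTMLParser

# Hypothetical heuristic: terms inside these tags are treated as emphasized.
TAG_WEIGHTS = {"title": 3.0, "h1": 3.0, "h2": 2.0, "h3": 2.0,
               "b": 1.5, "strong": 1.5, "em": 1.5}
SKIPPED_TAGS = {"script", "style"}  # non-content elements


class WeightedTermExtractor(HTMLParser):
    """Builds a term index, weighting each term by the strongest
    emphasized tag that encloses it (1.0 when there is none)."""

    def __init__(self):
        super().__init__()
        self.open_tags = []     # currently open emphasized tags
        self.skip_depth = 0     # > 0 while inside script/style
        self.index = Counter()  # term -> accumulated weight

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self.skip_depth += 1
        elif tag in TAG_WEIGHTS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        # Close the most recently opened matching tag, if any.
        for i in range(len(self.open_tags) - 1, -1, -1):
            if self.open_tags[i] == tag:
                del self.open_tags[i]
                break

    def handle_data(self, data):
        if self.skip_depth:
            return
        weight = max((TAG_WEIGHTS[t] for t in self.open_tags), default=1.0)
        for term in re.findall(r"[a-z0-9]+", data.lower()):
            self.index[term] += weight


def build_index(html):
    """Return the weighted term index of one HTML document."""
    parser = WeightedTermExtractor()
    parser.feed(html)
    parser.close()
    return parser.index


def cosine_similarity(a, b):
    """Cosine of the angle between two weighted term indexes."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm = (math.sqrt(sum(w * w for w in a.values()))
            * math.sqrt(sum(w * w for w in b.values())))
    return dot / norm if norm else 0.0


if __name__ == "__main__":
    target = build_index("<h1>Web mining</h1><p>similarity of documents</p>")
    base = {
        "doc_a": build_index("<title>Mining the Web</title><p>document similarity</p>"),
        "doc_b": build_index("<h1>Cooking</h1><p>pasta recipes</p>"),
    }
    # Rank the indexed documents by closeness to the target document.
    for name in sorted(base, key=lambda n: cosine_similarity(target, base[n]),
                       reverse=True):
        print(name, round(cosine_similarity(target, base[name]), 3))
```

Cosine similarity is used here only because it is a common terminological measure; any other measure over the same weighted indexes could be substituted.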

Bibliographic record

Source: World Multiconference on Systemics, Cybernetics and Informatics, 2001, 6 pages
Authors: Grifoni Patrizia; Padula Marco
Original format: PDF
Chinese Library Classification: Computing technology, computer technology
Keywords: documents similarity; databases similarity; semantic similarity; web mining

Similar literature

1. Mining a web citation database for document clustering [J]. Y. He, S. C. Hui, A. C. M. Fong. Applied Artificial Intelligence, 2002, No. 4.
2. Towards Comparative Mining of Web Document Objects with NFA: WebOMiner System [J]. C. I. Ezeife. International Journal of Data Warehousing and Mining, 2012, No. 4.
3. Extracting OLAP Cubes From Document-Oriented NoSQL Database Based on Parallel Similarity Algorithms [J]. Farnaz Davardoost, Amin Babazadeh Sangar, Kambiz Majidzadeh. Canadian Journal of Electrical and Computer Engineering, 2020, No. 2.
4. Web mining and similarity for documents and databases [C]. Grifoni Patrizia, Padula Marco. World Multiconference on Systemics, Cybernetics and Informatics (SCI 2001), v.14: Computer Science and Engineering, pt.2; July 22-25, 2001; Orlando, FL, US. 2001.
5. Design and Development of Intelligent Web Mining System for Extraction of Information from Web Databases [D]. Sharma, Sanjeev Kumar. 2010.
6. Large expert-curated database for benchmarking document similarity detection in biomedical literature search [O]. Peter Brown, RELISH Consortium, Yaoqi Zhou.
7. Experience mining: Building a large-scale database of personal experiences and opinions from web documents [O]. Kentaro Inui, Shuya Abe, Kazuo Hara. 2008.
