IEEE Transactions on Fuzzy Systems

Using Fuzzy Logic to Leverage HTML Markup for Web Page Representation



Abstract

The choice of document representation plays a crucial role in the performance of a document clustering task. Being able to pick out the representative words within a document can lead to substantial improvements in clustering. In the case of web documents, the HTML markup that defines the layout of the content provides additional structural information that can be exploited to identify representative words. In this paper, we introduce a fuzzy term weighting approach that makes the most of the HTML structure for document clustering. We set forth and build on the hypothesis that a good representation can take advantage of how humans skim through documents to extract the most representative words. Authors of web pages use HTML tags to convey the most important message of a page through elements that attract the reader's attention, such as page titles or emphasized elements. We define a set of criteria to exploit the information provided by these page elements, and introduce a fuzzy combination of these criteria that we evaluate in the context of a web page clustering task. Our proposed approach, called abstract fuzzy combination of criteria (AFCC), can adapt to datasets whose features are distributed differently, achieving good results compared with other similar fuzzy-logic-based approaches and with TF-IDF across different datasets.
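To make the idea of a fuzzy combination of markup-based criteria concrete, the following is a minimal illustrative sketch and not the authors' AFCC. The abstract names page titles and emphasized elements as sources of criteria; everything else here is an assumption made for illustration: the use of overall term frequency as a third criterion, the set of tags treated as emphasis, the linear membership functions, the three-rule base, and the names `CriteriaParser`, `fuzzy_weight`, and `term_weights`.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Assumed set of tags treated as "emphasis"; the abstract does not list them.
EMPHASIS_TAGS = {"b", "strong", "em", "i", "u", "h1", "h2", "h3"}
SKIPPED_TAGS = {"script", "style"}


class CriteriaParser(HTMLParser):
    """Counts each term overall, inside <title>, and inside emphasis tags."""

    def __init__(self):
        super().__init__()
        self.stack = []          # currently open tags
        self.freq = Counter()    # occurrences anywhere in the page
        self.title = Counter()   # occurrences inside <title>
        self.emph = Counter()    # occurrences inside emphasis tags

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop back to the matching open tag; tolerates unclosed tags.
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        if SKIPPED_TAGS.intersection(self.stack):
            return
        terms = re.findall(r"[a-z]+", data.lower())
        self.freq.update(terms)
        if "title" in self.stack:
            self.title.update(terms)
        if EMPHASIS_TAGS.intersection(self.stack):
            self.emph.update(terms)


def fuzzy_weight(f, t, e):
    """Combine three criteria in [0, 1] with a hypothetical rule base.

    Memberships are linear: high(x) = x and low(x) = 1 - x.
    AND is min, OR is max; outputs are singletons (1.0, 0.5, 0.0)
    combined by a weighted average of rule activations.
    """
    r_high = max(t, e)                 # title OR emphasis is high
    r_med = min(f, 1 - t, 1 - e)       # frequent, but not marked up
    r_low = min(1 - f, 1 - t, 1 - e)   # infrequent and not marked up
    total = r_high + r_med + r_low
    return (1.0 * r_high + 0.5 * r_med + 0.0 * r_low) / total if total else 0.0


def term_weights(html):
    """Return {term: fuzzy weight} for one HTML page."""
    parser = CriteriaParser()
    parser.feed(html)
    parser.close()

    def normalized(counter):
        top = max(counter.values(), default=0)
        return lambda term: counter[term] / top if top else 0.0

    f, t, e = (normalized(c) for c in (parser.freq, parser.title, parser.emph))
    return {term: fuzzy_weight(f(term), t(term), e(term)) for term in parser.freq}
```

Calling `term_weights(page_html)` on a page yields one weight per term, which could then replace raw term frequency in a vector space representation before clustering. In this sketch a term that appears in the title or in emphasized text outranks a term that is merely frequent, which in turn outranks a rare, unmarked term.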
