Conference: International Workshop on Database and Expert Systems Applications

A Fast and Accurate Approach for Main Content Extraction based on Character Encoding


Abstract

This paper presents a novel approach for extracting the main content from Web documents written in languages that are not based on the Latin alphabet. In practice, HTML tags are based on the English language, and the English character set is encoded in the interval [0,127] of the Unicode character set. Many other languages, such as Arabic, use a different interval for their characters. In the first phase of our approach, we use this distinction to quickly separate the non-ASCII characters from the English characters. We then determine the areas of the HTML file that have a high density of non-ASCII characters and a low density of ASCII characters. At the end of this phase, we use this density to identify the areas that contain the main content. Finally, we feed those areas to our parser in order to extract the main content of the Web page. The proposed algorithm, called DANA, exceeds alternative approaches in terms of both efficiency and effectiveness, and has the potential to be extended to languages based on ASCII characters.
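The phases described in the abstract can be illustrated with a small sketch. It is not the DANA implementation, whose details the abstract does not give. It is a simplified approximation that assumes the HTML is processed line by line, that a line belongs to a content area when its non-ASCII count exceeds its ASCII count by more than a hypothetical `min_diff` threshold, and that a regular expression is an adequate stand-in for the parser. The function names `char_counts`, `dense_regions`, and `extract_text` are illustrative and do not come from the paper.

```python
import re


def char_counts(line):
    """Return (ascii_count, non_ascii_count) for one line.

    ASCII whitespace is ignored; code points >= 128 count as non-ASCII.
    """
    ascii_n = sum(1 for ch in line if ord(ch) < 128 and not ch.isspace())
    non_ascii_n = sum(1 for ch in line if ord(ch) >= 128)
    return ascii_n, non_ascii_n


def dense_regions(html, min_diff=0):
    """Group consecutive lines where non-ASCII characters dominate.

    min_diff is an assumed tuning parameter, not a value from the paper.
    Returns a list of (first_line, last_line) index pairs, zero-based.
    """
    regions, current = [], []
    for number, line in enumerate(html.splitlines()):
        ascii_n, non_ascii_n = char_counts(line)
        if non_ascii_n - ascii_n > min_diff:
            current.append(number)
        elif current:
            regions.append((current[0], current[-1]))
            current = []
    if current:
        regions.append((current[0], current[-1]))
    return regions


def extract_text(html, regions):
    """Strip tags from the selected regions (a crude stand-in for a parser)."""
    lines = html.splitlines()
    texts = []
    for start, end in regions:
        for line in lines[start:end + 1]:
            text = re.sub(r"<[^>]*>", "", line).strip()
            if text:
                texts.append(text)
    return texts


# Hypothetical sample page: ASCII-heavy navigation and footer around Arabic text.
sample = "\n".join([
    '<div class="nav"><a href="/home">Home</a></div>',
    '<p>هذا هو المحتوى الرئيسي للصفحة ويحتوي على نص عربي طويل</p>',
    '<p>فقرة ثانية من المحتوى الرئيسي مع مزيد من الكلمات العربية</p>',
    '<div class="footer">Copyright 2010</div>',
])

regions = dense_regions(sample)
main_content = extract_text(sample, regions)
```

Classifying a character only requires comparing its code point with 128, so the separation step is a single linear pass over the file. This is consistent with the speed the abstract claims, although the sketch says nothing about the accuracy of the real algorithm.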