A Fast and Accurate Approach for Main Content Extraction based on Character Encoding

Abstract

This paper presents a novel approach for extracting the main content from Web documents written in languages that are not based on the Latin alphabet. HTML tags are written in English, and the English character set is encoded in the interval [0, 127] of the Unicode character set. Many other languages, such as Arabic, use a different interval for their characters. In the first phase of our approach, we use this distinction to quickly separate the non-ASCII characters from the ASCII characters. We then determine the areas of the HTML file that have a high density of non-ASCII characters and a low density of ASCII characters, and at the end of this phase we use this density to identify the areas that contain the main content. Finally, we feed those areas to our parser in order to extract the main content of the Web page. The proposed algorithm, called DANA, outperforms alternative approaches in both efficiency and effectiveness, and has the potential to be extended to languages based on ASCII characters.
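
The density idea described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the DANA algorithm itself: the line-by-line segmentation, the `min_ratio` threshold, the regex-based tag stripping, and all function names are hypothetical choices made for this example, since the abstract does not specify them.

```python
import re


def char_counts(line):
    """Return (ascii_count, non_ascii_count) for one line of HTML."""
    non_ascii = sum(1 for ch in line if ord(ch) > 127)
    return len(line) - non_ascii, non_ascii


def main_content_lines(html, min_ratio=0.5):
    """Return (start, end) line indices, inclusive, of the contiguous run of
    lines with the most non-ASCII characters, or None if no line qualifies.

    A line counts as "dense" when at least `min_ratio` of its characters are
    non-ASCII. The threshold is an assumption made for this sketch.
    """
    lines = html.splitlines()
    dense = []
    for line in lines:
        ascii_count, non_ascii = char_counts(line)
        total = ascii_count + non_ascii
        dense.append(total > 0 and non_ascii / total >= min_ratio)

    best, best_score = None, 0
    i = 0
    while i < len(lines):
        if not dense[i]:
            i += 1
            continue
        # Walk to the end of this run of dense lines and score it.
        j, score = i, 0
        while j < len(lines) and dense[j]:
            score += char_counts(lines[j])[1]
            j += 1
        if score > best_score:
            best, best_score = (i, j - 1), score
        i = j
    return best


def extract_main_content(html):
    """Strip tags from the densest non-ASCII region and return its text."""
    span = main_content_lines(html)
    if span is None:
        return ""
    lines = html.splitlines()
    block = "\n".join(lines[span[0]:span[1] + 1])
    # Crude tag removal; a real parser would be more robust.
    text = re.sub(r"<[^>]+>", " ", block)
    return " ".join(text.split())


# Hypothetical sample page: ASCII navigation and footer around Arabic text.
sample_html = "\n".join([
    '<html><head><title>x</title></head><body>',
    '<div class="nav"><a href="/home">Home</a></div>',
    '<p>هذا هو النص الرئيسي للصفحة</p>',
    '<p>ويستمر المحتوى الرئيسي هنا</p>',
    '<div class="footer">Copyright 2011</div>',
    '</body></html>',
])

if __name__ == "__main__":
    print(extract_main_content(sample_html))
```

Counting code points above 127 needs only a single pass over the text and no HTML parsing, which is consistent with the abstract's emphasis on a fast first phase. Only the selected region is then handed to the (here, very simple) tag-stripping step.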

Bibliographic details

Source
International Workshop on Database and Expert Systems Applications | 2011 | 5 pages
Authors
Hadi Mohammadzadeh; Thomas Gottron; Franz Schweiggert; Gholamreza Nakhaeizadeh
Original format
PDF
Chinese Library Classification
TP311.13-53
Keywords
Main Content Extraction; Information Retrieval; UTF-8; HTML Documents; ASCII and Non-ASCII character set

Similar literature

1. A fast and accurate approach to the extraction of leaf midribs from point clouds [J]. Remote Sensing Letters, 2020, Issue 1a3.
2. Accurate and fast convergence method for parameter estimation of PV generators based on three main points of the I-V curve [J]. C. Carrero, D. Ramirez, J. Rodriguez. Renewable Energy, 2011, Issue 11.
3. A fast approach for accurate content-adaptive mesh generation [J]. Yongyi Yang, Wernick M.N., Brankov J.G. IEEE Transactions on Image Processing, 2003, Issue 8.
4. A Fast and Accurate Approach for Main Content Extraction based on Character Encoding [C]. Hadi Mohammadzadeh, Thomas Gottron, Franz Schweiggert. International Workshop on Database and Expert Systems Applications, 2011.
5. Accurate sampling-based algorithms for surface extraction and motion planning [D]. Varadhan, Gokul. 2005.
6. All Your Base: a fast and accurate probabilistic approach to base calling [O]. Tim Massingham, Nick Goldman. 2012.
7. This article presents a new numerical model describing the behaviour of a thermally thick wood sample exposed to high solar heat flux (above 1 MW/m2). A preliminary study based on dimensionless numbers is used to classify the problem and support model building assumptions. Then, a model based on mass, momentum and energy balance equations is proposed. These equations are coupled with a liquid-vapour drying model and a pseudo-species biomass degradation model. Comparison with a former experimental study shows that these equations alone are not enough to accurately predict biomass behaviour under high solar heat flux. Indeed, a char layer acting as a radiative shield forms on the exposed surface of the sample. In addition to this classical set of equations, it is mandatory to take into account radiation penetration into the medium. Furthermore, as biomass contains water, medium deformation following char steam gasification must also be implemented. With the addition of these two strategies, the model is able to properly capture the degradation of biomass exposed to high radiative heat flux over a range of initial sample moisture contents. Additional insights into biomass behaviour under high solar heat flux were also derived. Drying, pyrolysis and gasification fronts are present at the same time inside the sample. The coexistence of these three thermochemical fronts leads to char gasification by the steam produced from drying of the sample, which is the main phenomenon behind medium ablation. [O]. Pozzobon, Victor, Salvador, Sylvain, Bézian, Jean Jacques. 2018.
