Journal of Web Semantics

Automatic metadata mining from multilingual enterprise content


Abstract

Personalization is increasingly vital, especially for enterprises that need to reach their customers. The key challenge in supporting personalization is the need for rich metadata, such as metadata about structural relationships, subject/concept relations between documents, and cognitive metadata about documents (e.g. the difficulty of a document). Manual annotation of large knowledge bases with such rich metadata is not scalable. In addition, automatic mining of cognitive metadata is challenging, since it is very difficult to automatically understand the underlying intellectual knowledge about a document. At the same time, Web content is increasingly becoming multilingual, since a growing amount of the data generated on the Web is non-English. Current metadata extraction systems are generally based on English content, and this needs to be revolutionized in order to adapt to the changing dynamics of the Web. To alleviate these problems, we introduce a novel automatic metadata extraction framework, which is based on a novel fuzzy-based method for automatic cognitive metadata generation and uses different document parsing algorithms to extract rich metadata from multilingual enterprise content using the newly developed DocBook, Resource Type and Topic ontologies. Since the metadata generation process is based upon DocBook-structured enterprise content, our framework is focused on enterprise documents and content that is loosely based on the DocBook type of formatting. DocBook is a common documentation format for formally producing corporate data, and it is adopted by many enterprises. The proposed framework is illustrated and evaluated on the English, German and French versions of the Symantec Norton 360 knowledge bases. The user study showed that the proposed fuzzy-based method generates reasonably accurate values, with an average precision of 89.39% on the metadata values of document difficulty, document interactivity level and document interactivity type. The proposed fuzzy inference system achieves improved results compared to a rule-based reasoner for difficulty metadata extraction (~11% enhancement). In addition, user-perceived metadata quality scores (mean of 5.57 out of 6) were found to be high, and automated metadata analysis showed that the extracted metadata is of high quality and can be suitable for personalized information retrieval.
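
To illustrate what a fuzzy inference step for document difficulty might look like, here is a minimal sketch. The abstract does not specify the framework's input features, membership functions or rules, so everything here is an assumption made for illustration: the two inputs (`term_density` and `step_ratio`, both scaled to 0..1), the triangular membership functions, the three rules and the label thresholds are all hypothetical and are not the authors' method.

```python
def tri(x, a, b, c):
    """Triangular membership function: 0 outside (a, c), 1 at the peak b."""
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)


def fuzzify(x):
    """Map a value in [0, 1] to degrees of membership in three fuzzy sets."""
    return {
        "low": tri(x, -0.5, 0.0, 0.5),
        "medium": tri(x, 0.0, 0.5, 1.0),
        "high": tri(x, 0.5, 1.0, 1.5),
    }


def difficulty_score(term_density, step_ratio):
    """Zero-order Sugeno-style inference over two hypothetical features.

    term_density: assumed share of technical terms in the document (0..1).
    step_ratio:   assumed share of procedural steps in the document (0..1).
    Returns a crisp score in [0, 1]; higher means more difficult.
    """
    t, s = fuzzify(term_density), fuzzify(step_ratio)
    # Each rule is (firing strength, crisp output). AND = min, OR = max.
    rules = [
        (min(t["low"], s["low"]), 0.0),        # both low   -> easy
        (max(t["medium"], s["medium"]), 0.5),  # any medium -> medium
        (min(t["high"], s["high"]), 1.0),      # both high  -> difficult
    ]
    total = sum(weight for weight, _ in rules)
    if total == 0:
        # No rule fires (e.g. one input fully low, the other fully high).
        return 0.5
    return sum(weight * value for weight, value in rules) / total


def difficulty_label(score):
    """Turn the crisp score into one of three illustrative labels."""
    if score < 1 / 3:
        return "easy"
    if score < 2 / 3:
        return "medium"
    return "difficult"


if __name__ == "__main__":
    for features in [(0.1, 0.2), (0.5, 0.4), (0.9, 0.8)]:
        score = difficulty_score(*features)
        print(features, round(score, 2), difficulty_label(score))
```

Unlike a crisp rule-based reasoner, which would switch labels abruptly at a threshold, the weighted average lets partially satisfied rules contribute in proportion to their firing strength.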
