International Conference on Web Information Systems and Technologies

AISLES THROUGH THE CATEGORY FOREST: Utilising the Wikipedia Category System for Corpus Building in Machine Learning

Abstract

The World Wide Web is a continuous challenge to machine learning. Established approaches have to be enhanced and new methods developed in order to tackle the problem of finding and organising relevant information. It has often been argued that semantic classification of input documents helps to solve this task. But while approaches to supervised text categorisation perform quite well on genres found in written text, newly evolved genres on the web are much more demanding. In order to develop approaches to web mining successfully, corresponding corpora are needed. However, the composition of genre- or domain-specific web corpora is still an unsolved problem. Building large corpora of good quality is time-consuming because web pages typically lack reliable meta information. Wikipedia, along with similar approaches to collaborative text production, offers a way out of this dilemma. We examine how social tagging, as supported by the MediaWiki software, can be utilised as a source for corpus building. Further, we describe a representation format for social ontologies and present the Wikipedia Category Explorer, a tool which supports categorical views for browsing Wikipedia and for constructing domain-specific corpora for machine learning.
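
As a rough illustration of the corpus-building idea described in the abstract, the following minimal sketch walks a category graph breadth-first and gathers the articles filed under a root category and its subcategories. It is not the Wikipedia Category Explorer and does not use the authors' representation format: the dictionaries `SUBCATEGORIES` and `ARTICLES`, the function `collect_corpus`, and the `max_depth` parameter are all hypothetical names with toy data, assumed here only for illustration. The sketch also assumes the category graph may contain cycles, so visited categories are tracked.

```python
from collections import deque

# Hypothetical toy data (not taken from Wikipedia or from the paper):
# category -> direct subcategories, and category -> articles tagged with it.
SUBCATEGORIES = {
    "Machine learning": ["Neural networks", "Classification algorithms"],
    "Neural networks": ["Machine learning"],  # deliberate cycle
    "Classification algorithms": [],
}
ARTICLES = {
    "Machine learning": ["Supervised learning"],
    "Neural networks": ["Perceptron", "Backpropagation"],
    "Classification algorithms": ["Naive Bayes classifier", "Perceptron"],
}


def collect_corpus(root, max_depth=2):
    """Return the sorted, de-duplicated article titles reachable from
    `root` by following at most `max_depth` subcategory links."""
    seen = {root}                # categories already queued (guards against cycles)
    queue = deque([(root, 0)])   # (category, distance from root)
    titles = set()
    while queue:
        category, depth = queue.popleft()
        titles.update(ARTICLES.get(category, []))
        if depth == max_depth:
            continue             # do not descend below the depth limit
        for child in SUBCATEGORIES.get(category, []):
            if child not in seen:
                seen.add(child)
                queue.append((child, depth + 1))
    return sorted(titles)


if __name__ == "__main__":
    for title in collect_corpus("Machine learning"):
        print(title)
```

The depth limit is one possible way to keep a domain-specific corpus focused, since categories far from the root tend to drift off topic; whether and how the authors bound the traversal is not stated in the abstract.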