Venue: IEEE International Conference on Information Reuse and Integration

An Automatic Approach for Discovering and Geocoding Locations in Domain-Specific Web Data (Application Paper)

Abstract

We present an automatic approach for discovering location names in web data culled from diverse domains. Our approach builds on the Apache Tika, Apache OpenNLP, and Apache Lucene frameworks. Tika extracts text and metadata from any file. The text and metadata are passed to Apache OpenNLP and its location entity extraction model. The discovered location entities are then looked up in a gazetteer derived from the Geonames.org dataset and indexed in Apache Lucene. This paper describes the overall approach, then explains in detail the challenges we faced and the methodology we employed to overcome them. We describe the evolution of our geo gazetteer process and algorithm, and demonstrate the approach's accuracy on data collected in the DARPA MEMEX and NSF Polar Cyber Infrastructure efforts.
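
The three stages named in the abstract (text extraction, location entity extraction, gazetteer lookup) can be illustrated with a minimal Python sketch. This is not the authors' implementation and it does not call Tika, OpenNLP, or Lucene: each function is a hypothetical stand-in for one stage, the gazetteer is a tiny hand-written table with approximate values rather than Geonames.org data, and the "prefer the most populous match" rule is an assumed disambiguation heuristic that the abstract does not state.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Place:
    name: str
    country: str
    lat: float
    lon: float
    population: int


# Toy gazetteer: hand-written, approximate values for illustration only.
# The paper's gazetteer is derived from Geonames.org and indexed in Lucene.
GAZETTEER = {
    "paris": [
        Place("Paris", "FR", 48.86, 2.35, 2_100_000),
        Place("Paris", "US", 33.66, -95.56, 25_000),
    ],
    "oslo": [Place("Oslo", "NO", 59.91, 10.75, 700_000)],
}

# Runs of capitalized words, e.g. "Oslo" or "New York".
CANDIDATE = re.compile(r"\b[A-Z][a-z]+(?: [A-Z][a-z]+)*")


def extract_text(raw: bytes) -> str:
    """Stand-in for the text-extraction stage (Tika in the paper)."""
    return raw.decode("utf-8", errors="replace")


def find_candidates(text: str) -> List[str]:
    """Stand-in for the entity-extraction stage (OpenNLP in the paper)."""
    return CANDIDATE.findall(text)


def geocode(candidate: str) -> Optional[Place]:
    """Stand-in for the gazetteer lookup (a Lucene index in the paper).

    Assumed heuristic: among same-named places, pick the most populous.
    """
    matches = GAZETTEER.get(candidate.lower(), [])
    return max(matches, key=lambda p: p.population, default=None)


def geocode_document(raw: bytes) -> List[Place]:
    """Run all three stages and keep only candidates found in the gazetteer."""
    text = extract_text(raw)
    places = (geocode(c) for c in find_candidates(text))
    return [p for p in places if p is not None]


if __name__ == "__main__":
    for place in geocode_document(b"The team flew from Oslo to Paris last week."):
        print(place.name, place.country, place.lat, place.lon)
```

In this sketch the capitalized-word pattern over-generates candidates (it also picks up "The"), and the gazetteer lookup filters out anything that is not a known place name. A statistical entity extractor, as described in the abstract, would reduce such false candidates before the lookup stage.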
