Conference on Next-Generation Analyst IV

Transforming a research-oriented dataset for evaluation of tactical information extraction technologies

Abstract

The most representative and accurate data for testing and evaluating information extraction technologies is real-world data. Real-world operational data can provide important insights into human and sensor characteristics, interactions, and behavior. However, several challenges limit the feasibility of experimentation with real-world operational data. Real-world data lacks precise knowledge of a "ground truth," a critical factor for benchmarking progress in developing automated information processing technologies. Additionally, the use of real-world data is often limited by classification restrictions due to the methods of collection, procedures for processing, and tactical sensitivities related to the sources, events, or objects of interest. These challenges, along with an increase in the development of automated information extraction technologies, are fueling an emerging demand for operationally realistic datasets for benchmarking. One approach to meeting this demand is to create synthetic datasets that are operationally realistic yet unclassified in content. The unclassified nature of these synthetic datasets facilitates the sharing of data between military and academic researchers, thus increasing coordinated testing efforts. This paper describes the expansion and augmentation of two synthetic text datasets, one initially developed through academic research collaborations with the Army. Both datasets feature simulated tactical intelligence reports regarding fictitious terrorist activity occurring within a counter-insurgency (COIN) operation. The datasets were expanded and augmented to create two militarily relevant datasets. The first resulting dataset was created by augmenting and merging the two to create a single larger dataset containing ground truth. The second resulting dataset was restructured to more realistically represent the format and content of intelligence reports. The dataset transformation effort, the final datasets, and their applicability for research are presented.
