
Filtering the Open-source Information


Abstract

The abundance of information on most domains makes the Internet the best resource available. Despite its usefulness, however, it is difficult to automate information extraction because online information lacks structure. The most commonly used information-sharing protocol, the Hypertext Transfer Protocol (HTTP), makes it possible to embed a great deal of noise (advertisements, images, headers, menus, etc.) in a document that contains the useful information. Filtering this noise before information extraction is therefore necessary. Such filtering has many applications, including cell phone and Personal Digital Assistant (PDA) browsing, speech rendering for visually impaired or blind people, and open source intelligence, among many others. In this paper, we describe a statistical model that filters such noise from a document containing useful information. The model analyses the text distribution and link density of every node in the Document Object Model (DOM) tree of an HTML page in order to detect the useful nodes. We validate the model with an experiment conducted while implementing an Early Warning System to facilitate open source intelligence. We also present the general workflow for converting unstructured online text about terrorists into a data structure that can be investigated through social network analysis, and discuss how our model fits into it.
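
The abstract does not give the model's formulas, so the following is only a minimal sketch of the general idea of scoring DOM nodes by text length and link density, not the authors' model. It uses Python's standard `html.parser`. The class `DensityParser`, the function `extract_content`, the set `BLOCK_TAGS`, the thresholds `min_chars` and `max_link_density`, and the sample page are all assumptions introduced for illustration.

```python
from html.parser import HTMLParser

# Assumption: only these tags are treated as candidate content nodes.
BLOCK_TAGS = {"p", "div", "td", "li", "article", "section"}
SKIP_TAGS = {"script", "style"}


class DensityParser(HTMLParser):
    """Attributes text to the nearest open block element and counts link text."""

    def __init__(self):
        super().__init__()
        self.stack = []      # open block elements
        self.link_depth = 0  # > 0 while inside an <a> element
        self.skip_depth = 0  # > 0 while inside <script> or <style>
        self.blocks = []     # finished blocks: (text, text_chars, link_chars)

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag == "a":
            self.link_depth += 1
        elif tag in BLOCK_TAGS:
            self.stack.append({"tag": tag, "parts": [], "link_chars": 0})

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag == "a":
            self.link_depth = max(0, self.link_depth - 1)
        elif tag in BLOCK_TAGS and any(b["tag"] == tag for b in self.stack):
            # Pop until the matching block, tolerating unclosed children.
            while True:
                block = self.stack.pop()
                self._finish(block)
                if block["tag"] == tag:
                    break

    def handle_data(self, data):
        if self.skip_depth or not self.stack:
            return
        top = self.stack[-1]
        top["parts"].append(data)
        if self.link_depth:
            top["link_chars"] += len(" ".join(data.split()))

    def _finish(self, block):
        text = " ".join("".join(block["parts"]).split())
        if text:
            self.blocks.append((text, len(text), block["link_chars"]))


def extract_content(html, min_chars=40, max_link_density=0.3):
    """Keep blocks that are long enough and not dominated by link text."""
    parser = DensityParser()
    parser.feed(html)
    parser.close()
    kept = []
    for text, text_chars, link_chars in parser.blocks:
        density = link_chars / text_chars
        if text_chars >= min_chars and density <= max_link_density:
            kept.append(text)
    return kept


# Hypothetical page: a menu, one content paragraph, and an advertisement.
SAMPLE_PAGE = """
<html><body>
<div id="menu"><ul><li><a href="/">Home</a></li><li><a href="/news">News</a></li></ul></div>
<div id="main"><p>The suspect was reported to have travelled between two cities during the week, according to <a href="/src">one source</a>.</p></div>
<div id="ad"><a href="/buy">Buy cheap tickets now and save on your next trip abroad today</a></div>
<script>var x = "script text that should never be counted as content";</script>
</body></html>
"""

if __name__ == "__main__":
    for block in extract_content(SAMPLE_PAGE):
        print(block)
```

In this sketch, menu items and the advertisement are dropped because their text is short or consists almost entirely of link text, while the paragraph survives. The thresholds are arbitrary starting points; a statistical model such as the one the abstract describes would presumably derive them from the distribution of values across the page rather than fix them by hand.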
