IEEE/WIC/ACM International Conference on Web Intelligence

LSIF: A System for Large-Scale Information Flow Detection Based on Topic-Related Semantic Similarity Measurement



Abstract

Information flow detection aims to track the dynamics and evolution of information spreading across the Web over time. The main challenges are choosing a suitable information granularity for detection and tracking how information evolves from one form to another. In addition, the technical problem of doing this efficiently at large scale remains unsolved. In this paper, we propose a system (LSIF) for large-scale, topic-related semantic information flow detection. We treat the sentence as the basic information unit and represent each word or sentence as a continuous high-dimensional vector, built with word embeddings and the Fisher kernel, for semantic similarity measurement. To handle large-scale information efficiently, we propose a dimension-reduction framework called Random Reference Reduction (3R). Furthermore, we adopt a novel clustering algorithm to extract memes -- a piece of information and its variants -- and analyze how memes evolve. We demonstrate the effectiveness of our approach on two terabyte-level datasets. The first was used by previous researchers; on it we conducted a series of experiments to evaluate performance, and the results show that our approach is more effective and more efficient than state-of-the-art methods. The second is a 5-terabyte dataset crawled from 20 Chinese news sites, from which we extract 9 million memes and visualize the detected information flows; processing took about two days.
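The abstract names word embeddings, a Fisher kernel, and the 3R dimension-reduction framework but does not spell out the algorithms. As an illustrative sketch only, the pipeline's core idea -- comparing sentences as embedding-based vectors, with a cheap dimension reduction that roughly preserves similarity -- can be approximated with mean word vectors and classic Gaussian random projection. Both choices are assumptions, not the authors' method; the toy vocabulary and all function names below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy word-embedding table (in practice: pretrained vectors such as word2vec).
EMB_DIM = 300
vocab = {w: rng.standard_normal(EMB_DIM)
         for w in "information flow detection meme evolution web".split()}

def sentence_vector(sentence):
    """Represent a sentence as the mean of its word embeddings
    (a simple stand-in for the paper's Fisher-kernel representation)."""
    vecs = [vocab[w] for w in sentence.split() if w in vocab]
    return np.mean(vecs, axis=0)

def random_projection(x, out_dim=32, seed=42):
    """Reduce dimensionality with a fixed Gaussian random matrix
    (a stand-in for 3R; similarity is approximately preserved)."""
    proj = np.random.default_rng(seed).standard_normal((out_dim, x.shape[0]))
    return proj @ x / np.sqrt(out_dim)

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

s1 = sentence_vector("information flow detection")
s2 = sentence_vector("meme evolution detection")
sim_full = cosine(s1, s2)
sim_reduced = cosine(random_projection(s1), random_projection(s2))
```

By the Johnson-Lindenstrauss lemma, `sim_reduced` stays close to `sim_full` even though the projected vectors are roughly ten times smaller, which is what makes similarity search over terabyte-scale corpora tractable.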

