International Conference on Computational Linguistics

Fast Tweet Retrieval with Compact Binary Codes

Abstract

Cosine similarity is arguably the most widely used similarity measure in natural language processing. On Twitter, however, the sheer volume of tweet data makes computing cosine similarity across large numbers of samples expensive. In this paper, we use binary coding to address this scalability issue: each data sample is compressed into a compact binary code, which enables highly efficient similarity computation via Hamming distances between the generated codes. To yield semantics-sensitive binary codes for tweet data, we design a binarized matrix factorization model and improve it in two respects. First, we force the projection directions employed by the model to be nearly orthogonal, reducing redundant information in the resulting binary bits. Second, we leverage tweets' neighborhood information to encourage similar tweets to have adjacent binary codes. Evaluated in an information retrieval scenario on a tweet dataset with gold labels created from hashtags, our proposed model shows significant performance gains over competing methods.
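
As a rough illustration of the retrieval step the abstract describes (ranking items by Hamming distance between binary codes), the following Python sketch may help. It is not the authors' binarized matrix factorization model: the sign-thresholding in `binarize`, the function names, and the toy 4-dimensional projection values are all assumptions made purely for illustration.

```python
from typing import List, Sequence, Tuple


def binarize(vector: Sequence[float]) -> int:
    """Pack the signs of a real-valued vector into an integer, one bit per dimension.

    Assumption: a positive value maps to bit 1, anything else to bit 0.
    """
    code = 0
    for value in vector:
        code = (code << 1) | (1 if value > 0 else 0)
    return code


def hamming_distance(code_a: int, code_b: int) -> int:
    """Count the bit positions in which two codes differ (XOR, then popcount)."""
    return bin(code_a ^ code_b).count("1")


def rank_by_hamming(
    query_code: int, codes: Sequence[int], top_k: int
) -> List[Tuple[int, int]]:
    """Return (index, distance) pairs for the top_k codes closest to the query."""
    scored = sorted((hamming_distance(query_code, c), i) for i, c in enumerate(codes))
    return [(i, d) for d, i in scored[:top_k]]


# Hypothetical projected tweet vectors (toy values, not from the paper).
projections = [
    [0.9, -0.2, 0.4, -0.7],  # signs give bits 1010
    [0.8, -0.1, 0.3, 0.5],   # signs give bits 1011
    [-0.6, 0.7, -0.2, 0.1],  # signs give bits 0101
]
codes = [binarize(p) for p in projections]
query = binarize([0.5, -0.3, 0.2, -0.1])  # signs give bits 1010
results = rank_by_hamming(query, codes, top_k=2)
print(results)
```

Because each comparison reduces to an XOR and a bit count on a small integer, it is far cheaper than a floating-point dot product over a high-dimensional vector, which is the efficiency argument the abstract makes for Hamming distance.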