IEEE/WIC/ACM International Conference on Web Intelligence

Deep Text Mining of Instagram Data without Strong Supervision


Abstract

With the advent of social media, our online feeds increasingly consist of short, informal, and unstructured text. This textual data can be analyzed to improve user recommendations and detect trends. Instagram is one of the largest social media platforms, containing both text and images. However, most prior research on text processing in social media focuses on Twitter data, and little attention has been paid to text mining of Instagram data. Moreover, many text mining methods rely on annotated training data, which in practice is both difficult and expensive to obtain. In this paper, we present methods for unsupervised mining of fashion attributes from Instagram text, which can enable a new kind of user recommendation in the fashion domain. In this context, we analyze a corpus of Instagram posts from the fashion domain, introduce a system for extracting fashion attributes from Instagram, and train a deep clothing classifier with weak supervision to classify Instagram posts based on the associated text. Our experiments confirm that word embeddings are a useful asset for information extraction: information extraction using word embeddings outperforms a baseline that uses Levenshtein distance. The results also show the benefit of combining weak supervision signals using generative models instead of majority voting. Using weak supervision and generative modeling, an F1 score of 0.61 is achieved on the task of classifying the image contents of Instagram posts based solely on the associated text, which is on par with human performance. Finally, our empirical study provides one of the few available studies on Instagram text and shows that the text is noisy, that the text distribution exhibits the long-tail phenomenon, and that comment sections on Instagram are multilingual.
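The two baselines the abstract compares against can be sketched as follows. This is a minimal illustration, not the authors' implementation: the vocabulary, the `max_distance` threshold, the tie-breaking rule, and the function names `match_term` and `majority_vote` are all assumptions made for the example. It does not show the word-embedding extractor or the generative label model, which the abstract reports as outperforming these baselines.

```python
from collections import Counter

def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute ca -> cb
        prev = curr
    return prev[-1]

def match_term(token, vocabulary, max_distance=2):
    """Map a noisy token to the closest vocabulary term, or None if too far.

    Hypothetical helper: the threshold of 2 is an arbitrary choice.
    """
    token = token.lower()
    best = min(vocabulary, key=lambda term: levenshtein(token, term))
    return best if levenshtein(token, best) <= max_distance else None

ABSTAIN = None  # a weak labeler may decline to vote

def majority_vote(votes):
    """Combine weak labels by majority; abstain on ties or when nobody votes."""
    counted = Counter(v for v in votes if v is not ABSTAIN)
    if not counted:
        return ABSTAIN
    (top, top_n), *rest = counted.most_common()
    if rest and rest[0][1] == top_n:
        return ABSTAIN  # tie between the two most frequent labels
    return top

# Hypothetical fashion vocabulary and weak votes for one post
vocabulary = ["jacket", "dress", "jeans"]
matched = match_term("jakcet", vocabulary)
label = majority_vote(["dress", "dress", ABSTAIN, "jacket"])
```

Majority voting treats every weak signal as equally reliable, which is the limitation a generative label model is meant to address by estimating per-signal accuracy.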
