International Conference on Big Data Analytics

Generic Key Value Extractions from Emails

Abstract

Web information extraction systems are widely used to find and understand relevant parts of a text, combine multiple such parts, and produce a structured representation of the information. In many scenarios, HTML-formatted emails are generated by filling a template with user- and transaction-specific values from databases. These emails are sent, among other purposes, for human consumption. Examples of such emails include flight confirmations, restaurant reservations, bills, and hospital records. In the majority of these B2C emails, information is presented as key-value pairs, i.e., user- or transaction-specific values appear in the HTML alongside their associated keys. In this paper, we describe a generic method for extracting these key-value pairs, which can be used in various applications. We analyze the pairs for a number of applications, including identifying semantically similar keywords and creating clusters of keywords, which can then be used to build information extraction wrappers. We show that word embeddings alone are a poor means of finding similar keys. We use a number of features (types of values, the co-occurrence graph of keys, etc.) and combine them into a keyword similarity algorithm that, on various real-world data, improves the homogeneity of the clusters by more than 50% compared with using word embeddings alone.
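To make the notion of key-value pairs in templated HTML emails concrete, the following is a minimal sketch, not the method of the paper. It assumes, purely for illustration, that each pair is laid out as a table row with exactly two cells (key, then value); the class `KeyValueTableParser` and the function `extract_key_values` are hypothetical names, and only Python's standard `html.parser` module is used.

```python
from html.parser import HTMLParser


class KeyValueTableParser(HTMLParser):
    """Collect (key, value) pairs from table rows with exactly two cells.

    Illustrative assumption: the email template renders each pair as
    <tr><td>key</td><td>value</td></tr>. Real emails use many other layouts.
    """

    def __init__(self):
        super().__init__()
        self.pairs = []    # extracted (key, value) tuples
        self._row = None   # cell texts of the current row, or None outside a row
        self._cell = None  # text fragments of the current cell, or None outside a cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None and self._row is not None:
            # Join fragments and collapse runs of whitespace.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            # Keep only rows that look like "key | value" with non-empty cells.
            if len(self._row) == 2 and all(self._row):
                key, value = self._row
                self.pairs.append((key.rstrip(":").strip(), value))
            self._row = None
            self._cell = None


def extract_key_values(html):
    """Return the list of (key, value) pairs found in two-cell table rows."""
    parser = KeyValueTableParser()
    parser.feed(html)
    parser.close()
    return parser.pairs


if __name__ == "__main__":
    sample = (
        "<table>"
        "<tr><td>Booking reference:</td><td>ABC123</td></tr>"
        "<tr><td>Passenger</td><td> Jane  Doe </td></tr>"
        "<tr><td colspan='2'>Thank you for your booking</td></tr>"
        "</table>"
    )
    for key, value in extract_key_values(sample):
        print(f"{key} -> {value}")
```

Rows that do not match the assumed two-cell layout, such as the single-cell closing message in the sample, are skipped. A generic extractor of the kind the abstract describes would need to handle far more layouts than this.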
