Journal: Procedia Computer Science

Entity Extraction for Malayalam Social Media Text Using Structured Skip-gram Based Embedding Features from Unlabeled Data

Abstract

Social media text is generally informal and noisy, but it sometimes carries informative content. Extracting such content, for example entities, is a challenging task. The main aim of this paper is to extract entities from Malayalam social media text efficiently. The social media corpus used in our system is from the FIRE2015 entity extraction task. The data is first subjected to pre-processing and feature extraction, and entity extraction follows. Apart from conventional stylometric features such as prefixes, suffixes, hash tags, etc., together with POS tags, unsupervised word embedding features obtained from a Structured Skip-gram model are used to train the system. The extracted features are given to a support vector machine classifier to build and train the model. In testing, the system achieved better accuracy than the existing systems evaluated in the FIRE2015 task. The unsupervised features obtained with the Structured Skip-gram model contribute to this improved performance.
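
The feature set described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the affix length of 3, the zero-vector fallback for unseen words, and the toy embedding table are all assumptions made for illustration. It only shows how surface features, a POS tag, and a pre-trained embedding vector might be merged into one feature dictionary per token.

```python
def stylometric_features(token, affix_len=3):
    """Surface features for one token (affix_len=3 is an assumed value)."""
    return {
        "prefix": token[:affix_len],
        "suffix": token[-affix_len:],
        "is_hashtag": token.startswith("#"),
    }


def token_features(token, pos_tag, embeddings, dim):
    """Merge surface features, the POS tag, and an embedding vector.

    `embeddings` is a hypothetical dict mapping a word to a list of floats,
    standing in for vectors learned from unlabeled text. Words missing from
    the table get a zero vector (an assumption, not taken from the paper).
    """
    feats = stylometric_features(token)
    feats["pos"] = pos_tag
    vector = embeddings.get(token, [0.0] * dim)
    for i, value in enumerate(vector):
        feats[f"emb_{i}"] = value
    return feats


# Toy example with a made-up two-dimensional embedding table.
toy_embeddings = {"#kerala": [0.1, -0.2]}
example = token_features("#kerala", "NN", toy_embeddings, dim=2)
unseen = token_features("xyz", "NN", toy_embeddings, dim=2)
```

Dictionaries of this shape could then be vectorized and passed to an SVM, for instance with scikit-learn's `DictVectorizer` and `LinearSVC`; the source does not say which toolkit the authors used.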
