Automatic open domain information extraction from Indonesian text

Abstract

The volume of digital documents has surpassed human processing capability, which calls for a method that automatically extracts information from any text document regardless of its domain. Unfortunately, open domain information extraction (open IE) systems are language-specific, and no system has been published for the Indonesian language. This paper introduces a system that extracts entity relations from Indonesian text in triple format using an NLP pipeline, a rule-based candidate generator, a rule-based token expander, and a machine-learning-based triple selector. We cross-validate four candidate classifiers (logistic regression, SVM, MLP, and Random Forest) on our dataset and find that Random Forest is the best classifier for the triple selector, achieving an F1 score of 0.60 (0.62 precision and 0.58 recall). The low score is largely due to the simplistic candidate generation rules and the limited coverage of the dataset.
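The reported F1 score can be checked against the reported precision and recall, since F1 is their harmonic mean. The following is a minimal sketch of that arithmetic; the function name `f1_score` and the variable names are illustrative choices, not taken from the paper, and only the two input figures come from the abstract.

```python
def f1_score(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Figures reported in the abstract for the Random Forest triple selector.
reported_precision = 0.62
reported_recall = 0.58

f1 = f1_score(reported_precision, reported_recall)
print(f"{f1:.2f}")  # prints 0.60
```

The result, about 0.599, rounds to the 0.60 stated in the abstract, so the three reported figures are mutually consistent.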

Bibliographic details

Source: International Workshop on Big Data and Information Security | 2017 | 164p | 8 pages
Authors: Yohanes Gultom; Wahyu Catur Wibowo
Original format: PDF
Chinese Library Classification: TP393-53
Keywords: Pipelines; Information retrieval; Task analysis; Generators; Data mining; Java; Compounds

