Impact of Text Specificity and Size on Word Embeddings Performance: An Empirical Evaluation in Brazilian Legal Domain

Abstract

Word embeddings are a text representation technique that captures syntactic and semantic linguistic patterns by representing each word as an n-dimensional dense vector. In the legal domain, trained word embeddings exist for languages such as English, Polish, and Chinese. However, to the best of our knowledge, there are no embeddings based on Portuguese (Brazilian or European) legal texts. Our research question is therefore: do the specificity and size of the text corpus used to train word embeddings contribute to more successful classification? To answer it, we train word embedding models in the legal domain on corpora with different levels of specificity and size, and then evaluate their impact on text classification. To cover the different levels of specificity, we collect text documents from different courts of the Brazilian Judiciary, in hierarchical order. We use these corpora to train word embedding models (GloVe) and then evaluate them by classifying legal processes with a deep learning model (CNN). From the context perspective, the results show that text specificity has a higher impact for embeddings trained on smaller corpora than for those trained on larger ones. From the corpus size perspective, the results show that the larger the training corpus, the better the results. However, this gain diminishes as the corpus grows, up to a point where adding more words to the corpus has little impact on the results.
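The step that connects the trained embeddings to the classifier can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the vectors were saved in GloVe's plain-text format (one word per line followed by its space-separated components), and the function names, the toy vectors, and the zero-vector fallback for unknown words are all hypothetical choices made for the example.

```python
import io
from typing import Dict, List, TextIO


def load_glove_text(handle: TextIO) -> Dict[str, List[float]]:
    """Parse GloVe plain-text vectors: a word, then its components, per line."""
    vectors: Dict[str, List[float]] = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def build_embedding_matrix(vocab: List[str],
                           vectors: Dict[str, List[float]],
                           dim: int) -> List[List[float]]:
    """Row i holds the vector for vocab[i]; unknown words get a zero vector
    (an assumption of this sketch, other fallbacks are possible)."""
    return [vectors.get(word, [0.0] * dim) for word in vocab]


def coverage(vocab: List[str], vectors: Dict[str, List[float]]) -> float:
    """Fraction of the classifier vocabulary that has a trained embedding."""
    if not vocab:
        return 0.0
    return sum(1 for word in vocab if word in vectors) / len(vocab)


if __name__ == "__main__":
    # Toy 3-dimensional vectors standing in for a real GloVe output file.
    toy_file = io.StringIO("juiz 0.1 0.2 0.3\nrecurso 0.4 0.5 0.6\n")
    vectors = load_glove_text(toy_file)
    vocab = ["juiz", "recurso", "acórdão"]
    matrix = build_embedding_matrix(vocab, vectors, dim=3)
    print(len(matrix), "rows; coverage =", coverage(vocab, vectors))
```

The resulting matrix is the kind of object that would typically initialize the embedding layer of a CNN classifier. Vocabulary coverage is one simple way to compare embeddings trained on corpora of different size or specificity, though the abstract itself reports only classification results.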

Bibliographic details

Source
Brazilian Conference on Intelligent Systems | 2020 | pp. 521-535 | 15 pages
Authors
Thiago Raulino Dal Pont; Isabela Cristina Sabo; Jomi Fred Huebner; Aires Jose Rover
Original format
PDF
Keywords
Word embeddings; Legal corpora; GloVe; Text classification; Convolutional Neural Network

