Conference: Brazilian Conference on Intelligent Systems

Impact of Text Specificity and Size on Word Embeddings Performance: An Empirical Evaluation in Brazilian Legal Domain

Abstract

Word embedding is a text representation technique capable of capturing syntactic and semantic linguistic patterns and of representing each word as an n-dimensional dense vector. In the domain of legal texts, trained word embeddings exist for languages such as English, Polish, and Chinese. However, to the best of our knowledge, there are no embeddings based on Portuguese (Brazilian and European) legal texts. Given that, our research question is: do the specificity and size of the text corpus used to train word embeddings contribute to a more successful classification? To answer the question, we train word embedding models in the legal domain with different levels of specificity and size, and then evaluate their impact on text classification. To deal with the different levels of specificity, we collect text documents from different courts of the Brazilian Judiciary, in hierarchical order. We used these corpora to train word embedding models (GloVe) and then evaluated them by classifying processes with a deep learning model (CNN). From a context perspective, the results show that text specificity has a higher impact on word embeddings trained on smaller corpora than on those trained on larger ones. From a corpus size perspective, the results show that the larger the corpus used for embedding training, the better the results. However, this gain diminishes as the corpus grows, up to a point where adding more words to the corpus has little impact on the results.
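
The abstract does not give the training settings, so the following is only a rough sketch of the kind of statistics GloVe is fitted to, not the authors' pipeline. It counts distance-weighted word co-occurrences within a context window and applies the weighting function from the original GloVe formulation. The toy corpus, the function names, and the window size are assumptions made for illustration; `x_max=100` and `alpha=0.75` are the defaults suggested in the GloVe paper, not values reported here.

```python
from collections import defaultdict


def cooccurrence_counts(sentences, window=2):
    """Accumulate symmetric, distance-weighted co-occurrence counts X[(w, c)]."""
    counts = defaultdict(float)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            # Visit only the left context and add both directions for symmetry.
            for j in range(max(0, i - window), i):
                weight = 1.0 / (i - j)  # closer words contribute more
                counts[(word, tokens[j])] += weight
                counts[(tokens[j], word)] += weight
    return dict(counts)


def glove_weight(x, x_max=100.0, alpha=0.75):
    """GloVe weighting f(x): damps rare pairs and caps frequent ones at 1."""
    return (x / x_max) ** alpha if x < x_max else 1.0


# Hypothetical toy corpus standing in for tokenized court documents.
toy_corpus = [
    ["o", "tribunal", "julgou", "o", "recurso"],
    ["o", "recurso", "foi", "negado"],
]
X = cooccurrence_counts(toy_corpus, window=2)
```

A larger or more specific corpus changes the counts in `X`, which is the only channel through which corpus size and specificity reach the trained vectors in this kind of model.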