Topic Modelling Brazilian Supreme Court Lawsuits


Abstract

The present work proposes the use of Latent Dirichlet Allocation (LDA) to model Extraordinary Appeals received by Brazil's Supreme Court. The data consist of a corpus of 45,532 lawsuits manually annotated by the Court's experts with theme labels, which defines a multi-class, multi-label classification task. We initially train models with 10 and 30 topics and analyze their semantics by examining each topic's most relevant words and most representative texts, aiming to evaluate model interpretability and quality. We also train models with 30, 100, 300 and 1,000 topics and quantitatively evaluate their potential by using the topics to generate feature vectors for each appeal. These vectors are then used to train a lawsuit theme classifier. We compare traditional bag-of-words approaches (word counts and tf-idf values) with the topic-based text representation to assess topic relevance. Our semantic analysis of the topics demonstrates that the models with 10 and 30 topics were capable of capturing some of the legal matters discussed by the Court. In addition, our experiments show that the model with 300 topics was the best text vectoriser and that the interpretable, low-dimensional representations it generates achieve good classification results.
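To make the idea of a topic-based feature vector concrete, here is a minimal sketch of how LDA can turn each document into a vector of topic proportions. It is an illustration only, not the authors' implementation: the abstract does not say which LDA software or inference method was used, so the collapsed Gibbs sampler, the function name `lda_doc_topic_vectors`, the hyperparameters `alpha` and `beta`, and the toy documents are all assumptions introduced here.

```python
import random

def lda_doc_topic_vectors(docs, num_topics, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA (hypothetical helper).

    Returns one topic-proportion vector per tokenised document.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    vocab_size = len(vocab)

    # Count tables: document-topic, topic-word, and per-topic token totals.
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [[0] * vocab_size for _ in range(num_topics)]
    topic_total = [0] * num_topics

    # Give every token a random initial topic assignment.
    assignments = []
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            z = rng.randrange(num_topics)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][word_id[w]] += 1
            topic_total[z] += 1
        assignments.append(z_doc)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v = word_id[w]
                z = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][z] -= 1
                topic_word[z][v] -= 1
                topic_total[z] -= 1
                # Resample its topic from the collapsed conditional distribution.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][v] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][v] += 1
                topic_total[z] += 1

    # Smoothed, normalised topic counts serve as the document's feature vector.
    vectors = []
    for d, doc in enumerate(docs):
        denom = len(doc) + num_topics * alpha
        vectors.append([(doc_topic[d][k] + alpha) / denom for k in range(num_topics)])
    return vectors

# Made-up toy documents (not taken from the Court's corpus).
docs = [
    "tax revenue federal tax contribution".split(),
    "pension retirement benefit pension".split(),
    "tax contribution revenue".split(),
    "retirement benefit pension worker".split(),
]
vectors = lda_doc_topic_vectors(docs, num_topics=2)
```

Each entry of `vectors` has as many components as there are topics and sums to 1, so it can be passed to a classifier in place of a much wider word-count or tf-idf vector. At the scale described above, an established library implementation would be the practical choice over this toy sampler.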
