COMPARATIVE ANALYSIS OF SUPERVISED AND UNSUPERVISED LEARNING ALGORITHMS FOR ONLINE USER CONTENT SUICIDAL IDEATION DETECTION

SERGAZY NARYNOV; DANIYAR MUKHTARKHANULY; ILMURAT KERIMOV; BATYRKHAN OMAROV

Abstract

Suicide is one of the leading causes of death in most countries around the world and one of the three most common causes of death among young people (15-24 years old), yet no methods have so far been developed for diagnosing suicidal tendencies. Developing methods for identifying people prone to suicidal behavior is therefore an especially pressing problem. One direction of such research is the search for typological features of suicide-related speech using mathematical linguistics, automatic text processing, and machine learning. In international research, texts written by suicidal people (mainly suicide notes) are studied with natural language processing and machine learning methods, and models are built to classify whether a text is related to suicide. To develop methods for identifying people who are prone to suicide, it is necessary to analyze not only suicide notes, which are usually short, but also other texts written by people who died by suicide. The purpose of this work is to build machine learning models using supervised and unsupervised learning methods, and then to select, by comparative analysis, the most efficient algorithm for classifying whether a text is connected to suicide. Our research contributes to the detection of depressive content that can lead to suicide and aims to help its authors receive professional help from psychologists at the national suicide prevention center in Kazakhstan. The best result, an F1-score of 95%, was obtained by Random Forest (supervised) with tf-idf vectorization; K-means (unsupervised) with tf-idf also shows impressive results, only 4% lower in F1-score and precision.
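To make the two quantities named in the abstract concrete, the following is a minimal sketch of tf-idf vectorization and of precision, recall, and F1-score in plain Python. It is not the authors' implementation: the function names, the whitespace tokenizer, the smoothed idf formula ln((1 + n) / (1 + df)) + 1, the L2 normalization, and the toy sentences are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocabulary, L2-normalized tf-idf vectors) for a list of strings.

    Assumptions: whitespace tokenization and a smoothed idf,
    idf(t) = ln((1 + n) / (1 + df(t))) + 1, which is one common variant.
    """
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for doc in tokenized for tok in doc})
    n = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(tok for doc in tokenized for tok in set(doc))
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1.0 for t in vocab}

    vectors = []
    for doc in tokenized:
        tf = Counter(doc)
        vec = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in vec)) or 1.0
        vectors.append([v / norm for v in vec])
    return vocab, vectors


def precision_recall_f1(y_true, y_pred, positive=1):
    """Return (precision, recall, F1) for the class given by `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Hypothetical toy posts, not taken from the paper's dataset.
    posts = ["i feel hopeless and alone", "great football match today"]
    vocab, vectors = tfidf_vectors(posts)
    print(len(vocab), len(vectors[0]))
    print(precision_recall_f1([1, 1, 0, 0], [1, 0, 1, 0]))
```

In a comparison like the one described, the same tf-idf vectors would feed both a supervised classifier and a clustering algorithm, and the predicted labels from each would be scored with the same metric function.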
