Determining the topicality of text is often an expensive problem that requires significant resources for text labeling. Although many packages already provide dictionaries of labeled text, synonyms, and part-of-speech tagging, the problem is ongoing because language develops and new meanings of words and phrases emerge. This paper proposes a solution for topic labeling that requires little human labor and applies to text in the majority of languages. The methodology uses links to a naturally emerging corpus of labeled text: Wikipedia. Wikipedia categories are processed to extract a weighted set of topic labels for the analyzed text. The approach is evaluated by processing categorized texts and measuring the similarity between the top-ranked topic labels and each text's category. The topic labels extracted using this methodology can be used for comparing the similarity of texts, for assessing the completeness of topic coverage in automated essay marking, and for coding in qualitative text analysis. The paper contributes to the field of NLP by offering a cheap and organically developing method of topical text labeling, and to the work of qualitative analysts by offering a methodology for analyzing interview transcripts and other unstructured text.
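
To make the idea of a weighted set of topic labels concrete, the following is a minimal sketch and not the paper's implementation. It assumes the analyzed text has already been linked to Wikipedia articles and that each article's categories are available in memory. The function name `weighted_topic_labels`, the frequency-based weighting, and the toy article-to-category mapping are all illustrative assumptions; the abstract does not specify how the weights are computed.

```python
from collections import Counter


def weighted_topic_labels(linked_articles, article_categories, top_n=3):
    """Return up to top_n (category, weight) pairs for a text.

    Assumption: weight = share of all category occurrences across the
    linked articles. This is one simple choice, not the paper's formula.
    """
    counts = Counter()
    for article in linked_articles:
        # Articles with no known categories contribute nothing.
        for category in article_categories.get(article, []):
            counts[category] += 1

    total = sum(counts.values())
    if total == 0:
        return []
    return [(category, n / total) for category, n in counts.most_common(top_n)]


# Hypothetical toy data, not real Wikipedia category assignments.
toy_categories = {
    "Neural network": ["Machine learning", "Artificial intelligence"],
    "Gradient descent": ["Machine learning", "Optimization"],
    "Backpropagation": ["Machine learning", "Artificial intelligence"],
}
toy_links = ["Neural network", "Gradient descent", "Backpropagation"]

labels = weighted_topic_labels(toy_links, toy_categories)
for category, weight in labels:
    print(f"{category}: {weight:.2f}")
```

In this toy case the top-ranked label is the category shared by all three linked articles. A real pipeline would also need an entity-linking step and a source of category data, both of which are outside this sketch.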