Using word frequency lists to measure corpus homogeneity and similarity between corpora

Abstract

How similar are two corpora? A measure of corpus similarity would be very useful for lexicography and language engineering. Word frequency lists are cheap and easy to generate, so a measure based on them would be of use as a quick guide in many circumstances; for example, to judge how a newly available corpus relates to existing resources, or how easy it might be to port an NLP system designed to work with one text type to work with another. We show that corpus similarity can only be interpreted in the light of corpus homogeneity. The paper presents a measure, based on the χ² statistic, for measuring both corpus similarity and corpus homogeneity. The measure is compared with a rank-based measure and shown to outperform it. A method for evaluating the accuracy of the measure is introduced, and some results of using the measure are presented.
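As an illustration only, a χ²-style comparison of two word frequency lists might be sketched in Python as follows. The abstract does not give the formula, so every detail here is an assumption: the function names `chi_square_similarity` and `homogeneity`, the `top_n` cut-off, the way expected counts are derived from the pooled corpus, and the random-halves procedure for homogeneity are hypothetical choices, not the paper's stated method.

```python
import random
from collections import Counter


def chi_square_similarity(tokens_a, tokens_b, top_n=500):
    """Chi-square statistic over the top_n most frequent words of the pooled corpus.

    Lower values mean the two frequency profiles are closer.
    Assumption: expected counts come from the pooled frequency,
    scaled by each corpus's share of the total token count.
    """
    if not tokens_a or not tokens_b:
        raise ValueError("both corpora must contain at least one token")
    freq_a, freq_b = Counter(tokens_a), Counter(tokens_b)
    size_a, size_b = len(tokens_a), len(tokens_b)
    total = size_a + size_b
    pooled = freq_a + freq_b

    chi2 = 0.0
    for word, pooled_count in pooled.most_common(top_n):
        for observed, size in ((freq_a[word], size_a), (freq_b[word], size_b)):
            expected = pooled_count * size / total
            chi2 += (observed - expected) ** 2 / expected
    return chi2


def homogeneity(documents, top_n=500, trials=10, seed=0):
    """Average chi-square score between random halves of one corpus.

    `documents` is a list of token lists. Assumption: homogeneity is
    estimated by repeatedly splitting the documents into two random
    halves and comparing them with the same statistic.
    """
    if len(documents) < 2:
        raise ValueError("need at least two documents to split the corpus")
    rng = random.Random(seed)
    scores = []
    for _ in range(trials):
        docs = list(documents)
        rng.shuffle(docs)
        half = len(docs) // 2
        first = [tok for doc in docs[:half] for tok in doc]
        second = [tok for doc in docs[half:] for tok in doc]
        scores.append(chi_square_similarity(first, second, top_n))
    return sum(scores) / len(scores)
```

Under these assumptions, a between-corpus score only becomes meaningful when set against each corpus's own within-corpus score, which is the point the abstract makes about interpreting similarity in the light of homogeneity.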

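Bibliographic details

Source: Proceedings of the Fifth Workshop on Very Large Corpora, 1997, pp. 231-245 (15 pages)
Conference location: Beijing and Hong Kong (CN)
Author: Adam Kilgarriff
Original format: PDF
Language of text: English
Chinese Library Classification: Automation technology, computer technology
Date added to database: 2022-08-26 14:23:40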

Similar literature

1. Low-frequency words in bilingual corpora a step towards automatic extraction of bilingual word pairs [J]. Keita Tsuji, Fuyuki Yoshikane, Kyo Kageura. 電子情報通信学会技術研究報告. 言語理解とコミュニケーション. Natural Language Understanding and Models of Communication. 2000, No. 200
2. Using word frequency lists to measure corpus homogeneity and similarity between corpora [C]. Adam Kilgarriff. Workshop on very large corpora. 1997
3. Building High-frequency Word Lists for the Semantic Domain of ʻĀINA ('land') Using a Raw Corpus of Spoken ʻŌlelo Hawaiʻi [D]. Brockway, Catherine Elizabeth Lee. 2021
4. Measuring open-set word recognition in school-aged children: Corpus of monosyllabic target words and speech maskers [O]. Angela Yarnell Bonino, Ashley R. Malley
5. Corpus-Based Frequency Profiling: Migration To A Word List Based On The British National Corpus [O]. Leah Gilner, Frank Morales. 2010
