How similar are two corpora? A measure of corpus similarity would be very useful for lexicography and language engineering. Word frequency lists are cheap and easy to generate, so a measure based on them would be of use as a quick guide in many circumstances; for example, to judge how a newly available corpus relates to existing resources, or how easy it might be to port an NLP system designed to work with one text type to work with another. We show that corpus similarity can only be interpreted in the light of corpus homogeneity. The paper presents a measure, based on the χ² statistic, for measuring both corpus similarity and corpus homogeneity. The measure is compared with a rank-based measure and shown to outperform it. Some results are presented. A method for evaluating the accuracy of the measure is introduced and some results of using the measure are presented.
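As an illustration only, a χ²-based comparison of two word frequency lists might be sketched as follows. The abstract does not give the details of the measure, so everything here is an assumption: the function names, the `top_n` cutoff on the most frequent words of the combined corpora, the use of corpus size to derive expected counts, and the token-level random split used for homogeneity are all hypothetical choices, not the paper's stated procedure.

```python
import random
from collections import Counter


def chi_square_similarity(tokens_a, tokens_b, top_n=500):
    """Sum chi-square contributions over the top_n most frequent joint words.

    Lower scores mean the two frequency lists are more alike.
    Assumption: expected counts are proportional to corpus size.
    """
    if not tokens_a or not tokens_b:
        raise ValueError("both corpora must contain at least one token")
    freq_a, freq_b = Counter(tokens_a), Counter(tokens_b)
    size_a, size_b = len(tokens_a), len(tokens_b)
    total = size_a + size_b
    joint = freq_a + freq_b

    score = 0.0
    for word, joint_count in joint.most_common(top_n):
        for observed, size in ((freq_a[word], size_a), (freq_b[word], size_b)):
            # Expected count if the word were spread evenly across both corpora.
            expected = joint_count * size / total
            score += (observed - expected) ** 2 / expected
    return score


def homogeneity(tokens, top_n=500, seed=0):
    """Apply the same measure to two random halves of a single corpus.

    Assumption: the split is made at token level; splitting by document
    may be more appropriate in practice.
    """
    shuffled = list(tokens)
    random.Random(seed).shuffle(shuffled)
    half = len(shuffled) // 2
    return chi_square_similarity(shuffled[:half], shuffled[half:], top_n)
```

Under this sketch, a similarity score between two corpora would be read against each corpus's own homogeneity score, since a high between-corpus score means little if each corpus is itself internally heterogeneous.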