Natural language processing researchers are increasingly finding that statistical methods can accomplish tasks once thought to require intellectual understanding, yet few fruitful experiments have been reported on genre recognition algorithms. In this context, we investigate whether different genres of text can be distinguished by purely statistical means. To illustrate our approach, we report on experiments that distinguish news journal articles from government documents using only the relative frequencies of punctuation marks. In our previous study, we applied discriminant analysis and achieved a correct classification rate of about 80%. In the experiments reported in this paper, we used other statistical techniques to refine our methods and raised the correct classification rate to about 90%. The coefficients of the classifying equations may serve as genre signatures. Once stable genre signatures have been identified, the methods developed here can be used to classify web pages automatically into different genres.
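
As a rough illustration of the kind of feature described above, the following sketch computes the relative frequency of each punctuation mark in a text. The inventory of marks in PUNCTUATION, the function name punctuation_profile, and the sample sentence are assumptions made for this example; the abstract does not state which marks were counted or how the frequencies were normalized. A profile of this kind could then be passed to a classifier such as discriminant analysis.

```python
from collections import Counter

# Assumed inventory of marks; the abstract does not list the ones actually used.
PUNCTUATION = ".,;:!?\"'()-"


def punctuation_profile(text, marks=PUNCTUATION):
    """Return each mark's share of all counted punctuation marks in the text."""
    counts = Counter(ch for ch in text if ch in marks)
    total = sum(counts.values())
    if total == 0:
        # No punctuation at all: return an all-zero profile.
        return {mark: 0.0 for mark in marks}
    return {mark: counts[mark] / total for mark in marks}


# Hypothetical sample sentence, used only to show the shape of the output.
sample = "Officials said, however, that talks would resume; no date was set."
profile = punctuation_profile(sample)
```

Normalizing by the total number of marks, rather than by text length, is one possible reading of "relative frequencies"; dividing by the number of words or characters would be an equally plausible choice.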