Recently, a new class of data-intensive applications has become widely recognized, in which data is best modeled as transient, open-ended streams rather than as persistent tables on disk. This has led to a surge of research interest in data streams. However, most reported work concentrates on structured data, such as bit sequences, and seldom addresses unstructured data, such as textual documents. In this paper, we propose an efficient approach for classifying high-speed text streams. The approach is based on sketches, so it can classify the streams efficiently by scanning them only once while consuming a small, bounded amount of memory for both model maintenance and operation. Extensive experiments are conducted on benchmarks and a real-life collection of news articles. The encouraging results indicate that the proposed approach is highly feasible.