Text segmentation is the process of converting information in unstructured text into structured records. This is an important problem because structured data is amenable to efficient query processing. Conditional random fields (CRFs) are a class of discriminative probabilistic models that are gaining acceptance as an effective mechanism for text segmentation. An important aspect of using CRFs is learning the model parameters from labeled training data, and labeling can be a labor-intensive process. The labeling step can be avoided by using structured reference tables whose data domains coincide with those of the input text to be segmented. In other words, the labels in training data drawn from reference tables "come for free". Inspired by recent work on using reference tables to train hidden Markov models (HMMs), we developed an unsupervised technique for text segmentation with CRFs using reference tables. We assume that the text sequences to be segmented arrive in batches and that all sequences in a batch conform to the same attribute order. We then build a CRF model for each attribute in the reference table, use these models to decide the attribute order of a batch of input sequences, derive labeled training data from the reference table according to that order, and train a global CRF model to segment the input sequences in the batch. Preliminary experimental results indicate that our technique works well in practice.
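The pipeline described above can be illustrated with a minimal sketch. It is not the method evaluated in this work: the per-attribute CRF models are replaced by simple token-count vocabularies, the attribute order is decided by a mean-position heuristic, and the final step of training a global CRF is omitted. The reference table, the batch, and all function names are hypothetical and chosen only for illustration.

```python
from collections import Counter

# Hypothetical reference table: each row is one structured record.
REFERENCE_TABLE = [
    {"name": "Alice Smith", "street": "12 Oak Avenue", "city": "Springfield"},
    {"name": "Bob Jones", "street": "99 Elm Road", "city": "Riverton"},
    {"name": "Carol White", "street": "7 Pine Avenue", "city": "Lakewood"},
]


def attribute_vocabularies(table):
    """Assumed stand-in for the per-attribute CRF models: token counts per attribute."""
    vocab = {}
    for row in table:
        for attr, value in row.items():
            vocab.setdefault(attr, Counter()).update(value.lower().split())
    return vocab


def decide_order(batch, vocab):
    """Guess the attribute order shared by all sequences in a batch.

    Each known token votes for the attribute whose vocabulary contains it most
    often; attributes are then sorted by the mean relative position of their
    votes. Attributes with no votes are placed last.
    """
    positions = {attr: [] for attr in vocab}
    for sequence in batch:
        tokens = sequence.lower().split()
        for i, token in enumerate(tokens):
            best = max(vocab, key=lambda attr: vocab[attr][token])
            if vocab[best][token] > 0:  # ignore tokens unseen in the table
                positions[best].append(i / max(len(tokens) - 1, 1))
    mean = {
        attr: sum(p) / len(p) if p else float("inf")
        for attr, p in positions.items()
    }
    return sorted(vocab, key=lambda attr: mean[attr])


def derive_training_data(table, order):
    """Concatenate each row's fields in the decided order; labels come for free."""
    data = []
    for row in table:
        labeled = []
        for attr in order:
            labeled.extend((token, attr) for token in row[attr].split())
        data.append(labeled)
    return data


# Hypothetical batch of unsegmented sequences sharing one attribute order.
batch = [
    "Springfield 12 Oak Avenue Alice Smith",
    "Riverton 5 Elm Road Dan Jones",
]
vocab = attribute_vocabularies(REFERENCE_TABLE)
order = decide_order(batch, vocab)
training_data = derive_training_data(REFERENCE_TABLE, order)
print(order)  # ['city', 'street', 'name']
print(training_data[0])
```

The token-label pairs in `training_data` are the kind of labeled sequences that would then be passed to a sequence-model trainer, in this work a global CRF, to segment the batch.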