Recognizing special linguistic patterns in a language is helpful for many NLP applications, such as information extraction, machine translation, and parsing. State-of-the-art syntactic parsers are based on a given grammar. That grammar is context-free and cannot capture complex patterns that span multiple linguistic units. We propose an unsupervised method to automatically discover complex linguistic patterns from a conventionally parsed corpus. A specialized, efficient algorithm mines the frequent subtrees in the resulting forest of parse trees, and the subtrees found are formalized as linguistic patterns. The approach is validated on the Penn Chinese Treebank using the linguistic patterns it finds.
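To make the idea of mining frequent subtrees in a parsed forest concrete, the following is a minimal sketch and not the specialized algorithm the abstract refers to. It assumes Penn-style bracketed trees, drops the words so that only syntactic labels remain, and counts depth-bounded fragments rooted at each internal node, which is a deliberate simplification of general frequent-subtree mining. All names (`parse`, `truncate`, `frequent_fragments`, `show`), the `depth` and `min_support` parameters, and the toy trees are hypothetical.

```python
from collections import Counter

def parse(s):
    """Parse a bracketed tree into nested (label, children) tuples, dropping words."""
    tokens = s.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def node():
        nonlocal pos
        pos += 1                      # consume "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                pos += 1              # skip a word; only labels are kept
        pos += 1                      # consume ")"
        return (label, tuple(children))

    return node()

def nodes(tree):
    """Yield every node of the tree, root first."""
    yield tree
    for child in tree[1]:
        yield from nodes(child)

def truncate(tree, depth):
    """Cut the subtree rooted at `tree` off below `depth` levels."""
    label, children = tree
    if depth == 0 or not children:
        return (label, ())
    return (label, tuple(truncate(c, depth - 1) for c in children))

def frequent_fragments(forest, depth=2, min_support=2):
    """Count in how many trees each depth-bounded fragment occurs (assumed support measure)."""
    support = Counter()
    for tree in forest:
        # A set, so a fragment counts once per tree however often it repeats.
        support.update({truncate(n, depth) for n in nodes(tree) if n[1]})
    return {frag: n for frag, n in support.items() if n >= min_support}

def show(tree):
    """Render a fragment back into bracketed form."""
    label, children = tree
    if not children:
        return label
    return "(" + label + " " + " ".join(show(c) for c in children) + ")"

# Toy trees for illustration only; they are not taken from the Penn Chinese Treebank.
TREES = [
    "(IP (NP (NN a)) (VP (VV b) (NP (NN c))))",
    "(IP (NP (NN d)) (VP (VV e) (NP (NN f))))",
    "(IP (NP (PN g)) (VP (VV h)))",
]

patterns = frequent_fragments([parse(t) for t in TREES])
for frag, count in sorted(patterns.items(), key=lambda kv: show(kv[0])):
    print(count, show(frag))
```

A fragment such as `(VP VV (NP NN))` spans two grammar rules at once, which is the kind of multi-unit pattern a single context-free production cannot express. A real miner would enumerate a wider class of subtrees and prune the search space rather than fixing the depth in advance.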