We describe basic concepts and software architectures for the integration of shallow and deep (linguistics-based, semantics-oriented) natural language processing (NLP) components. The main goal of this novel, hybrid integration paradigm is to improve the robustness of deep processing. After an introduction to constraint-based natural language parsing, we give an overview of typical shallow processing tasks. We introduce XML standoff markup as an additional abstraction layer that eases the integration of NLP components, and propose the use of XSLT as a standardized and efficient transformation language for online NLP integration. In the main part of the thesis, we describe our contributions to three hybrid architecture frameworks that build on these fundamentals. SProUT is a shallow system that uses elements of deep constraint-based processing, namely a type hierarchy and typed feature structures. WHITEBOARD is the first hybrid architecture to integrate not only part-of-speech tagging, but also named entity recognition and topological parsing, with deep parsing. Finally, we present Heart of Gold, a middleware architecture that generalizes WHITEBOARD along several dimensions, such as configurability, multilinguality and flexible processing strategies. We describe various applications that have been implemented using the hybrid frameworks, such as structured named entity recognition, information extraction, creative document authoring support and deep question analysis, and we report on evaluations. In WHITEBOARD, for example, shallow pre-processing was shown to increase both the coverage and the efficiency of deep parsing by a factor of more than two. Heart of Gold not only forms the basis for applications that utilize semantics-oriented natural language analysis, but also constitutes a complex research instrument for experimenting with novel processing strategies that combine deep and shallow methods, and it eases the replication and comparability of results.
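To make the idea of standoff markup concrete, the following is a minimal sketch in Python, not the schema used in the thesis. It assumes that annotations are stored apart from the text and refer to it only by character offsets, so that several components can annotate the same input independently. The element name `span`, the attributes `layer`, `label`, `start` and `end`, and the example annotations are all hypothetical.

```python
import xml.etree.ElementTree as ET

# The raw text is never modified; annotations only point into it.
TEXT = "Heart of Gold integrates shallow and deep parsers."

# Hypothetical annotations from two components:
# (layer, label, start offset, end offset), end is exclusive.
ANNOTATIONS = [
    ("ne", "SYSTEM", 0, 13),
    ("pos", "VBZ", 14, 24),
]


def build_standoff(text, annotations):
    """Build an XML standoff document (assumed schema) for the annotations."""
    root = ET.Element("standoff")
    for layer, label, start, end in annotations:
        if not (0 <= start < end <= len(text)):
            raise ValueError(f"offsets out of range: {start}-{end}")
        ET.SubElement(
            root, "span",
            layer=layer, label=label, start=str(start), end=str(end),
        )
    return root


def resolve(text, root, layer):
    """Return (label, covered text) pairs for one annotation layer."""
    return [
        (span.get("label"), text[int(span.get("start")):int(span.get("end"))])
        for span in root.iter("span")
        if span.get("layer") == layer
    ]


if __name__ == "__main__":
    doc = build_standoff(TEXT, ANNOTATIONS)
    print(ET.tostring(doc, encoding="unicode"))
    print(resolve(TEXT, doc, "ne"))
```

Because each layer lives in its own set of `span` elements, one component's output can be added, replaced or transformed (for instance with a stylesheet) without touching the text or the other layers.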