Published in: Construction Research Congress

Evaluation of Seven Part-of-Speech Taggers in Tagging Building Codes: Identifying the Best Performing Tagger and Common Sources of Errors

Abstract

As the number, size, and complexity of building construction projects increase, code compliance checking becomes more challenging because manual checking is time-consuming, costly, and error-prone. Fully automated code compliance checking would be desirable because it would make code checking more efficient, more cost-effective, and less prone to human error. Such automation requires automated information extraction from building designs and building codes, and automated transformation of that information into a format that allows automated reasoning. Natural language processing (NLP) is an important technology for supporting such automated processing of building codes, because building codes are written as natural language text. Part-of-speech (POS) tagging, as an important basis of NLP tasks, must perform well to ensure the quality of the automated processing of building codes in such a compliance checking system. However, no systematic testing of existing POS taggers on domain-specific building code data has been performed. To address this gap, the authors analyzed the performance of seven state-of-the-art POS taggers on tagging building codes and compared their results to a manually labeled gold standard. The authors aim to: (1) find the best-performing tagger in terms of accuracy, and (2) identify common sources of errors. For the POS tags, the authors used the Penn Treebank tagset, a widely used tagset that balances conciseness and information richness. An average accuracy of 88.80% was found on the testing data. The Stanford CoreNLP tagger outperformed the other taggers in the experiment. The common sources of errors were identified as: (1) word ambiguity, (2) rare words, and (3) unique meanings of common English words in the construction context. These results indicate that machine taggers need better performance on building codes, for example through error-fixing transformational rules or machine taggers trained on building codes.
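The comparison against a gold standard described in the abstract can be illustrated with a minimal sketch. It is not the authors' evaluation code: the function name `tagging_accuracy`, the example sentence, and the tags assigned to it are all hypothetical, and it assumes the tagger output and the gold standard share the same tokenization. It computes token-level accuracy and counts each (gold tag, predicted tag) mismatch, which is one simple way to surface recurring error patterns.

```python
from collections import Counter


def tagging_accuracy(gold, predicted):
    """Return (accuracy, confusions) for two aligned lists of (word, tag) pairs.

    accuracy   -- fraction of tokens whose predicted tag equals the gold tag
    confusions -- Counter keyed by (gold_tag, predicted_tag) for mismatches only
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must contain the same number of tokens")

    correct = 0
    confusions = Counter()
    for (gold_word, gold_tag), (pred_word, pred_tag) in zip(gold, predicted):
        if gold_word != pred_word:
            raise ValueError(f"token mismatch: {gold_word!r} vs {pred_word!r}")
        if gold_tag == pred_tag:
            correct += 1
        else:
            confusions[(gold_tag, pred_tag)] += 1

    accuracy = correct / len(gold) if gold else 0.0
    return accuracy, confusions


# Hypothetical sentence with Penn Treebank tags; not taken from the paper's data.
gold = [("Exit", "NN"), ("doors", "NNS"), ("shall", "MD"),
        ("swing", "VB"), ("outward", "RB")]
# Hypothetical tagger output with two errors caused by word ambiguity.
predicted = [("Exit", "VB"), ("doors", "NNS"), ("shall", "MD"),
             ("swing", "NN"), ("outward", "RB")]

accuracy, confusions = tagging_accuracy(gold, predicted)
print(f"accuracy = {accuracy:.2%}")
for (gold_tag, pred_tag), count in confusions.most_common():
    print(f"gold {gold_tag} tagged as {pred_tag}: {count}")
```

Running the same function over the output of several taggers on one gold-standard set would give directly comparable accuracy figures, and the confusion counts point to the tag pairs worth inspecting by hand.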
