Journal: ACM Journal of Data and Information Quality

Document and Corpus Quality Challenges for Knowledge Management in Engineering Enterprises


Abstract

Enterprise data is an amalgam of mostly semistructured and unstructured data and documents stored in heterogeneous systems. Whatever structure exists is often not readily apparent, or not modeled in a way that makes it useful. Formats such as PDF, DWG, Excel, or Word offer a high degree of flexibility; the issue is rather that their freeform content does not divulge its structure and meaning. In binary formats such as those used by CAD or simulation tools, even basic textual content may be missing. Structured metadata is only sometimes available. Taking a step back, we see a challenging quality issue that concerns not individual documents but the whole corpus of enterprise documents. Our specific background is an engineering setting with large sets of documents from entire development and manufacturing lifecycles. An individual document is of limited use; the organizational knowledge is distributed across many separate documents and entities. In some cases that knowledge is easy to find, but in others, heterogeneous documents spread all over the organization together make up the knowledge of, for example, how to build a complex processing plant. For such complex retrieval tasks, different types of relations between documents and their entities have to be identified, such as same author, same or similar parts, part of the same project, subpart or subproject, predecessor, precondition, clarifications, updates, vendor lists, financial or structural relations, similar tasks in previous projects, and many more [Ahlers and Mehrpoor 2014].
