The anatomy of a search and mining system for digital humanities: Search And Mining Tools for Language Archives (SAMTLA)

Abstract

Humanities researchers are faced with an overwhelming volume of digitised primary source material, and "born digital" information, of relevance to their research as a result of large-scale digitisation projects. The current digital tools do not provide consistent support for analysing the content of digital archives that are potentially large in scale, multilingual, and come in a range of data formats. The current language-dependent, or project-specific, approach to tool development often puts the tools out of reach for many research disciplines in the humanities. In addition, the tools can be incompatible with the way researchers locate and compare the relevant sources. For instance, researchers are interested in shared structural text patterns, known as "parallel passages", that describe a specific cultural, social, or historical context relevant to their research topic. Identifying these shared structural text patterns is challenging due to their repeated yet highly variable nature, as a result of differences in the domain, author, language, time period, and orthography.

The contribution of the thesis is a novel infrastructure that directly addresses the need for generic, flexible, extendable, and sustainable digital tools that are applicable to a wide range of digital archives and research in the humanities. The infrastructure adopts a character-level n-gram Statistical Language Model (SLM), stored in a space-optimised k-truncated suffix tree data structure, as its underlying data model.
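As a rough illustration of the data model named above, the sketch below stores every character sequence of length at most k in a plain trie and keeps a count on each node. It is a minimal sketch under stated assumptions, not the thesis implementation: the class and method names (`TruncatedSuffixTrie`, `add_text`, `count`, `conditional_probability`) are hypothetical, the trie is uncompressed rather than space-optimised, and the probability estimate is unsmoothed maximum likelihood.

```python
class TrieNode:
    """One trie node; `count` is how often the path leading to it occurred."""

    def __init__(self):
        self.children = {}
        self.count = 0


class TruncatedSuffixTrie:
    """Hypothetical k-truncated suffix trie holding n-gram counts for n <= k."""

    def __init__(self, k):
        self.k = k
        self.root = TrieNode()

    def add_text(self, text):
        # Insert only the first k characters of every suffix of the text.
        for start in range(len(text)):
            node = self.root
            for ch in text[start:start + self.k]:
                node = node.children.setdefault(ch, TrieNode())
                node.count += 1

    def _find(self, ngram):
        # Walk the trie; a missing edge means the n-gram was never seen.
        node = self.root
        for ch in ngram:
            node = node.children.get(ch)
            if node is None:
                return None
        return node

    def count(self, ngram):
        node = self._find(ngram)
        return node.count if node is not None else 0

    def conditional_probability(self, history, ch):
        """Unsmoothed estimate of P(ch | history) from the stored counts."""
        node = self._find(history)
        if node is None:
            return 0.0
        # Denominator: occurrences of the history followed by any character.
        total = sum(child.count for child in node.children.values())
        if total == 0 or ch not in node.children:
            return 0.0
        return node.children[ch].count / total


trie = TruncatedSuffixTrie(k=3)
trie.add_text("banana")
print(trie.count("an"), trie.count("ana"), trie.count("bana"))
print(trie.conditional_probability("", "a"))
```

Truncating at depth k bounds the size of the structure by the number of distinct n-grams up to length k, which is why sequences longer than k (such as `"bana"` here) are not found.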
A character-level n-gram model is a relatively new approach that is competitive with word-level n-gram models, but has the added advantage that it is domain- and language-independent, requiring little or no preprocessing of the document text, unlike word-level models that require some form of language-dependent tokenisation and stemming. Character-level n-grams capture word-internal features that are ignored by word-level n-gram models, which provides greater flexibility in addressing the information need of the user through tolerant search, and compensation for erroneous query specification or spelling errors in the document text. Furthermore, the SLM provides a unified approach to information retrieval and text mining, where traditional approaches have tended to adopt separate data models that are often ad hoc or based on heuristic assumptions. In addition, the performance of the character-level n-gram SLM was formally evaluated through crowdsourcing, which demonstrates that the retrieval performance of the SLM is close to human-level performance.

The proposed infrastructure supports the development of Samtla (Search And Mining Tools for Language Archives), which provides humanities researchers with digital tools for search, browsing, and text mining of digital archives in any domain or language, within a single system. Samtla supersedes many of the existing tools for humanities researchers by supporting the same or similar functionality, but with a domain-independent and language-independent approach. The functionality includes a browsing tool constructed from the metadata and named entities extracted from the document text, and a hybrid recommendation system for recommending related queries and documents. However, some tools are novel and were developed in response to the specific needs of the researchers, such as the document comparison tool for visualising shared sequences between groups of related documents.
Furthermore, Samtla is the first practical example of a system with an SLM as its primary data model that supports the real research needs of several case studies covering different areas of research in the humanities.
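The tolerant search mentioned in the abstract, where character n-grams compensate for spelling errors in the query or the text, can be illustrated with a small query-likelihood ranker. This is a sketch under stated assumptions rather than Samtla's method: it treats each character trigram as an independent term (simpler than a conditional n-gram model), mixes document and collection estimates with Jelinek-Mercer smoothing, and uses a hypothetical function name `rank`, an arbitrary mixing weight `lam=0.5`, and made-up example documents.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """All overlapping character n-grams of a lower-cased string."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def rank(query, documents, n=3, lam=0.5):
    """Return document indices ordered by smoothed query log-likelihood."""
    doc_counts = [Counter(char_ngrams(d, n)) for d in documents]
    collection = Counter()
    for counts in doc_counts:
        collection.update(counts)
    collection_total = sum(collection.values())
    vocabulary = len(collection) + 1  # one extra slot for unseen n-grams

    scores = []
    for index, counts in enumerate(doc_counts):
        doc_total = sum(counts.values())
        log_prob = 0.0
        for gram in char_ngrams(query, n):
            p_doc = counts[gram] / doc_total if doc_total else 0.0
            # Add-one smoothing keeps the collection estimate above zero.
            p_coll = (collection[gram] + 1) / (collection_total + vocabulary)
            log_prob += math.log((1 - lam) * p_doc + lam * p_coll)
        scores.append((log_prob, index))
    return [index for _, index in sorted(scores, reverse=True)]


documents = [
    "the archive of medieval manuscripts",
    "a cookbook of seasonal recipes",
    "statistics for sports analysts",
]
# The query is deliberately misspelled ("manuscrpits").
print(rank("manuscrpits", documents))
```

The misspelled query still shares the trigrams "man", "anu", "nus", "usc", and "scr" with the first document, so that document ranks first even though an exact word match would fail.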

Bibliographic record

  • Author

    Harris Martyn;

  • Author affiliation
  • Year: 2017
  • Total pages
  • Original format: PDF
  • Language: en
  • CLC classification
