
Transforming the arχiv to XML


Abstract

We describe an experiment in transforming large collections of LaTeX documents into more machine-understandable representations. Concretely, we are translating the collection of scientific publications of the Cornell e-Print Archive (arXiv) using LaTeXML, a LaTeX to XML converter that is currently under development. The main technical task of our arXMLiv project is to supply LaTeXML bindings for the (thousands of) LaTeX classes and packages used in the arXiv collection. For this we have developed a distributed build system that repeatedly runs LaTeXML over the arXiv collection, collects statistics about, e.g., the most sorely missing LaTeXML bindings, and clusters common error events. This gives valuable feedback both to the developers of the LaTeXML package and to binding implementers. We have now processed the complete arXiv collection of more than 400,000 documents from 1993 until 2006 (one run is a processor-year-size undertaking) and have continuously improved our success rate to more than 56% (i.e., over 56% of the documents that are LaTeX have been converted without LaTeXML noticing an error and are available as XHTML+MathML documents).
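The kind of statistics such a build system gathers can be illustrated with a minimal Python sketch. It is not the arXMLiv implementation: the record layout `(doc_id, status, missing_bindings)`, the function name `summarize_runs`, and the sample data are all hypothetical, chosen only to show how a success rate and a ranking of the most frequently missing bindings might be computed from per-document conversion results.

```python
from collections import Counter


def summarize_runs(results):
    """Summarize per-document conversion results.

    `results` is an iterable of (doc_id, status, missing_bindings) tuples.
    This record layout is an assumption made for this sketch; it is not
    the log format of LaTeXML or of the arXMLiv build system.
    """
    missing = Counter()
    converted = 0
    total = 0
    for _doc_id, status, missing_bindings in results:
        total += 1
        if status == "ok":
            converted += 1
        # Count each missing class/package at most once per document.
        missing.update(sorted(set(missing_bindings)))
    rate = converted / total if total else 0.0
    # most_common() ranks bindings by how many documents lack them.
    return rate, missing.most_common()


# Hypothetical sample data, for illustration only.
sample = [
    ("doc-1", "ok", []),
    ("doc-2", "error", ["foo.sty", "bar.cls"]),
    ("doc-3", "error", ["foo.sty", "foo.sty"]),
    ("doc-4", "ok", []),
]

rate, ranked = summarize_runs(sample)
print(f"success rate: {rate:.0%}")   # success rate: 50%
print("most needed:", ranked[0])     # most needed: ('foo.sty', 2)
```

Counting a binding once per document, rather than once per occurrence, ranks packages by how many documents they block, which is the quantity a binding implementer would most plausibly want to prioritize.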
