METHOD OF CONSTRUCTING CORPUS BASED ON INTERNET FORUMS

Abstract

FIELD: physics, computer engineering.

SUBSTANCE: the invention relates to systems and methods for creating corpora for research and other purposes. The method of constructing a corpus from Internet forums on a computer system comprises: constructing a document object model (DOM) as a tree data structure; selecting a group of same-type vertices in the DOM tree; removing optional design elements from the pages; merging non-leaf vertices with the same names in the DOM tree and combining leaf vertices with the same properties; evaluating the vertices and filtering the groups; constructing XPath expressions; and applying the resulting XPath expressions to a set of files containing all documents from a selected forum.

EFFECT: high accuracy in separating user text from other content on web pages, with automatic construction of the corpus.

10 cl, 3 dwg
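
The steps named in the abstract can be illustrated with a minimal Python sketch. It is not the patented procedure: the abstract does not say how vertices are scored or how groups are filtered, so the rule used here (prefer the repeated group of leaf vertices carrying the most text) is an assumption, as are the sample page `PAGE` and the helper names `step`, `group_vertices`, `best_xpath` and `extract`. The sketch also assumes well-formed markup, so that the standard-library `xml.etree.ElementTree` parser and its limited XPath support can be used, and it skips the removal of design elements and the merging of vertices.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical forum page, kept well-formed so ElementTree can parse it.
PAGE = """<html><body>
<div class="nav"><a href="/">Home</a><a href="/faq">FAQ</a></div>
<div class="post"><p>First user message, fairly long text here.</p></div>
<div class="post"><p>Second user message with a reply to the first.</p></div>
<div class="footer"><span>Copyright</span></div>
</body></html>"""


def step(el):
    """One XPath step: the tag name, plus a class predicate when present."""
    cls = el.get("class")
    return f"{el.tag}[@class='{cls}']" if cls else el.tag


def group_vertices(root):
    """Group leaf vertices that share the same root-to-leaf path."""
    groups = defaultdict(list)

    def walk(el, path):
        children = list(el)
        if not children:
            groups[path].append(el)
        for child in children:
            walk(child, f"{path}/{step(child)}")

    walk(root, ".")
    return groups


def best_xpath(root):
    """Assumed scoring rule: the repeated group with the most text wins."""

    def score(item):
        _, elements = item
        if len(elements) < 2:  # a group must repeat to look like posts
            return 0
        return sum(len("".join(e.itertext()).strip()) for e in elements)

    path, _ = max(group_vertices(root).items(), key=score)
    return path


def extract(xpath, pages):
    """Apply the learned XPath expression to every page of the forum."""
    texts = []
    for page in pages:
        for el in ET.fromstring(page).findall(xpath):
            texts.append("".join(el.itertext()).strip())
    return texts


xpath = best_xpath(ET.fromstring(PAGE))
corpus = extract(xpath, [PAGE])
```

Using the root-to-leaf path as the group key means the key doubles as the XPath expression, so the same string that identifies the user-text group on one page can be applied unchanged to the remaining pages of the forum. Real forum HTML is rarely well-formed, so a tolerant HTML parser would be needed in place of `ET.fromstring`.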
