
Automatic information extraction method from a web text using mDTD rule


Abstract

PURPOSE: A method is provided for automatically extracting information from web documents using mDTD (modified Document Type Definition) grammar rules. Because the mDTD rules are obtained through iterative machine learning, the method can conveniently and efficiently extract a large amount of information from the vast body of documents in a domain. CONSTITUTION: The machine learning method comprises the steps of collecting web documents from the domain (S1), transforming each web document into a text object (S2), extracting sample data from the text object according to a previously written seed mDTD rule (S3), attaching format element tags to the sample data (S4), and generating appropriate mDTD rules from the tagged sample data (S5). The automatic extraction method comprises the steps of collecting web documents from the domain (S11), transforming each web document into a text object (S12), attaching format element tags to the text object (S13), extracting a target by determining which of the mDTD rules generated by the machine learning process fits the tagged text object (S14), and storing the extracted target in a domain database (S15).
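
The two phases described in the abstract can be illustrated with a minimal Python sketch. The abstract does not define the syntax of an mDTD rule or the set of format element tags, so everything concrete here is an assumption made for illustration: a rule is modeled as a plain regular expression (the hypothetical `Rule` class), the only format element tag is a hypothetical `NUM` tag around digit runs, and document collection (S1, S11) is replaced by passing HTML strings directly. It is not the patented method itself.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """Hypothetical stand-in for an mDTD rule: a regex with capture groups."""
    name: str
    pattern: str


def to_text_object(html: str) -> str:
    """S2 / S12: drop markup and collapse whitespace (assumed transformation)."""
    text = re.sub(r"<[^>]+>", " ", html)
    return re.sub(r"\s+", " ", text).strip()


def tag_format_elements(text: str) -> str:
    """S4 / S13: wrap each digit run in an illustrative NUM format tag."""
    return re.sub(r"\d+", lambda m: f"<NUM>{m.group(0)}</NUM>", text)


def generalize(tagged_sample: str) -> str:
    """S5: keep literal text, turn each tagged number into a capture group."""
    literals = re.split(r"<NUM>\d+</NUM>", tagged_sample)
    return r"<NUM>(\d+)</NUM>".join(re.escape(part) for part in literals)


def learn_rules(documents: list[str], seed: Rule) -> list[Rule]:
    """Learning phase (S2 to S5); documents are assumed already collected (S1)."""
    patterns: dict[str, None] = {}  # dict keeps insertion order and drops duplicates
    for html in documents:
        text = to_text_object(html)                        # S2
        for match in re.finditer(seed.pattern, text):      # S3: sample data
            tagged = tag_format_elements(match.group(0))   # S4
            patterns.setdefault(generalize(tagged), None)  # S5
    return [Rule(f"learned-{i}", p) for i, p in enumerate(patterns)]


def extract_targets(documents: list[str], rules: list[Rule]) -> list[tuple]:
    """Extraction phase (S12 to S15); the returned list stands in for the database."""
    database = []
    for html in documents:
        tagged = tag_format_elements(to_text_object(html))  # S12, S13
        for rule in rules:                                   # S14: first fitting rule
            match = re.search(rule.pattern, tagged)
            if match:
                database.append((rule.name, match.groups()))  # S15
                break
    return database
```

Under these assumptions, a seed such as `Rule("seed", r"Price: \d+ USD")` applied to a training page yields one learned rule over tagged text, which `extract_targets` can then apply to new pages whose markup differs from the training page. Choosing the first fitting rule in S14 is a simplification; the abstract only says that a suitable rule is selected.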
