
Email document parsing method and apparatus


Abstract

A preferred example of the process flow of the inventive method (1) is depicted in FIG. 1. The first step (2) of the method (1) is to import an email document (3) to be parsed. In the preprocessing step (10), the email (3) is processed to determine the presence of any header text (5) (excluding any header text that may be within the embedded reply chain) or attachments (4), including attached email documents, if any. Once the header text (5), attachments (4) or other forwarded materials have been identified in the preprocessing step (10), these components of the email (3) are categorised by the parsing computer (51) as non-author composed text. Next, the process flow of the parsing computer (51) moves to the normalization step (11). This entails processing the email document (3) to ascertain whether it is in a preferred format and, if it is not, converting at least some of the information within the email document to the preferred format. The parsing computer (51) then progresses through three analysis steps: the segmentation step (12), the linguistic analysis step (13) and the punctuation analysis step (14). The results of these analysis steps (12) to (14) are recorded in suitable memory or storage means accessible to the CPU of the parsing computer (51). In the segmentation step (12), the text of the email (3) is split into paragraphs, and the paragraphs are split into sentences. The linguistic analysis step (13) includes identification of predefined words and phrases of various types. In the punctuation analysis step (14), the parsing computer (51) analyses the text at the character level to check for the use of sentence punctuation marks and other predefined characters.
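The segmentation step (12) and the punctuation analysis step (14) described above can be illustrated with a minimal Python sketch. The function names, the blank-line paragraph rule, the naive sentence splitter and the PREDEFINED_CHARS set are all assumptions made for illustration; the abstract does not specify how paragraphs or sentences are delimited, nor which characters are predefined.

```python
import re
from collections import Counter

# Hypothetical set of "predefined characters"; the abstract does not list them.
PREDEFINED_CHARS = set(".!?,;:>|-")


def segment(text):
    """Split text into paragraphs, then each paragraph into sentences.

    Assumption: paragraphs are separated by blank lines, and a sentence
    ends at '.', '!' or '?' followed by whitespace.
    """
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    return [
        [s for s in re.split(r"(?<=[.!?])\s+", p) if s]
        for p in paragraphs
    ]


def punctuation_counts(text):
    """Count, character by character, how often each predefined character occurs."""
    return Counter(ch for ch in text if ch in PREDEFINED_CHARS)


if __name__ == "__main__":
    email_body = (
        "Hi Bob. Thanks for the note!\n\n"
        "> On Monday, Alice wrote:\n> See you soon."
    )
    for index, sentences in enumerate(segment(email_body)):
        print(index, sentences)
    print(punctuation_counts(email_body))
```

A production splitter would need to handle abbreviations, URLs and quoted-reply markers; the regular expression here is only meant to show where the paragraph and sentence boundaries of step (12) would be recorded.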
At the completion of the analysis steps (12) to (14), the process flow proceeds to step (15), in which the analysed email document, including any annotations that have been inserted, is saved into the memory of the computing apparatus, along with any extraneous results of the analysis. Next, a number of features are defined at step (18). Typically, a feature is a descriptive statistic calculated from either or both of the raw text and the annotations. At step (19), the features extracted at step (18) are converted into data structures associated with segments of the text. At step (20), the machine learning system receives the data structures and associated lines of text as input and, in response, categorises each line of text as broadly falling into one of two categories: author composed text or non-author composed text.
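Steps (18) to (20) might be sketched as follows, under stated assumptions. The LineFeatures record, the particular features chosen, the HEADER_PREFIXES tuple and the classify function are hypothetical. In particular, classify uses hand-set rules as a stand-in for the machine learning system, which the abstract does not describe in any detail.

```python
from dataclasses import dataclass

# Hypothetical header markers; the abstract does not enumerate them.
HEADER_PREFIXES = ("from:", "sent:", "to:", "cc:", "subject:", "date:")


@dataclass
class LineFeatures:
    """Data structure tying a set of features to one line of text (step 19)."""
    line: str
    length: int              # characters in the stripped line
    starts_with_quote: bool  # begins with '>' as in many quoted replies
    header_like: bool        # begins with a header-style prefix
    punct_ratio: float       # share of non-alphanumeric, non-space characters


def extract_features(line):
    """Compute simple descriptive statistics for one line (step 18)."""
    stripped = line.strip()
    punct = sum(1 for ch in stripped if not ch.isalnum() and not ch.isspace())
    return LineFeatures(
        line=line,
        length=len(stripped),
        starts_with_quote=stripped.startswith(">"),
        header_like=stripped.lower().startswith(HEADER_PREFIXES),
        punct_ratio=punct / len(stripped) if stripped else 0.0,
    )


def classify(features):
    """Stand-in for the trained model of step (20): fixed rules, not learned."""
    if features.starts_with_quote or features.header_like:
        return "non-author"
    return "author"


if __name__ == "__main__":
    lines = [
        "Thanks, see you Friday.",
        "> Can we meet Friday?",
        "From: alice@example.com",
    ]
    for text_line in lines:
        print(classify(extract_features(text_line)), "|", text_line)
```

In a real system the rule in classify would be replaced by a model trained on labelled lines, with the LineFeatures fields converted to a numeric vector as its input.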
