METHOD OF CONSTRUCTING CORPUS BASED ON INTERNET FORUMS
Abstract
FIELD: physics, computer engineering.
SUBSTANCE: The invention relates to systems and methods for creating corpora for research and other purposes. The method of constructing a corpus based on Internet forums, for a computer system, comprises: constructing a document object model (DOM) as a tree data structure; selecting a group of single-type vertices in the DOM tree; removing optional design elements from the pages; merging non-leaf vertices with the same names in the DOM tree and combining leaf vertices with the same properties; evaluating the vertices and filtering the groups; constructing XPath expressions; and applying the resulting XPath expressions to the set of files containing all documents from the selected forum.
EFFECT: high accuracy in separating user text from other content on web pages, with automatic construction of the corpus.
10 claims, 3 drawings.
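The general idea of grouping same-type DOM vertices, scoring the groups, and deriving an XPath expression can be illustrated with a minimal Python sketch. It is not the patented procedure: it skips the removal of design elements and the merging of vertices, it assumes well-formed markup, and the text-length score, the `min_size` threshold, and all function names (`signature`, `group_leaves`, `best_xpath`, `extract_posts`) are assumptions made for illustration only.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def signature(elem):
    """Describe a vertex by tag and class attribute (an assumed notion of 'type')."""
    cls = elem.get("class")
    return f"{elem.tag}[@class='{cls}']" if cls else elem.tag


def group_leaves(root):
    """Group leaf vertices that share the same root-relative path."""
    groups = defaultdict(list)

    def walk(elem, path):
        children = list(elem)
        if not children:
            groups[path].append(elem)
        for child in children:
            walk(child, f"{path}/{signature(child)}")

    walk(root, ".")
    return groups


def best_xpath(root, min_size=2):
    """Keep groups with at least min_size members; pick the one with most text."""
    scores = {
        path: sum(len((e.text or "").strip()) for e in elems)
        for path, elems in group_leaves(root).items()
        if len(elems) >= min_size
    }
    return max(scores, key=scores.get) if scores else None


def extract_posts(pages, xpath):
    """Apply one path expression to every page and collect the matched text."""
    return [
        (e.text or "").strip()
        for page in pages
        for e in ET.fromstring(page).findall(xpath)
    ]


# Hypothetical forum page used only to exercise the sketch.
PAGE = """<html><body>
<div class="nav"><a>Home</a><a>Forum</a></div>
<div class="post"><span class="user">alice</span><p class="msg">How do I build a corpus from forum threads?</p></div>
<div class="post"><span class="user">bob</span><p class="msg">Start by separating user text from page layout.</p></div>
</body></html>"""

xpath = best_xpath(ET.fromstring(PAGE))
posts = extract_posts([PAGE], xpath)
```

In this sketch the expression is learned from one page and then reused across all pages of the same forum, which mirrors the last step of the abstract. `ElementTree` supports only a subset of XPath, so a real forum crawl would likely need a tolerant HTML parser and a fuller XPath engine.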