Learning element similarity matrix for semi-structured document analysis

Jianwu Yang; William K. Cheung; Xiaoou Chen

Abstract

Capturing latent structural and semantic properties in semi-structured documents (e.g., XML documents) is crucial for improving the performance of related document analysis tasks. The Structured Link Vector Model (SLVM) is a representation recently proposed for modeling semi-structured documents. It uses an element similarity matrix to capture the latent relationships between XML elements, the building blocks of an XML document. In this paper, instead of applying heuristics to define the element similarity matrix, we propose to compute the matrix using a machine learning approach. In addition, we incorporate term semantics into SLVM using latent semantic indexing to enhance the model's accuracy while preserving the learnability of the element similarity matrix. For performance evaluation, we applied similarity learning to k-nearest neighbors search and similarity-based clustering, and tested the performance on two different XML document collections. The SLVM obtained via learning was found to significantly outperform the conventional Vector Space Model and edit-distance-based methods. In addition, the similarity matrix, obtained as a by-product, can provide higher-level knowledge about the semantic relationships between the XML elements.
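The abstract does not state how the element similarity matrix enters the document similarity, so the following is only a minimal sketch under one assumed formulation: each document is an elements-by-terms weight matrix, and the similarity between two documents is the sum, over all element pairs (i, j), of the matrix entry for that pair times the dot product of the corresponding term vectors. The function name `slvm_similarity`, the toy documents, and the two example matrices are hypothetical and are not taken from the paper.

```python
from typing import List

Matrix = List[List[float]]


def dot(u: List[float], v: List[float]) -> float:
    """Plain dot product of two equal-length term vectors."""
    return sum(a * b for a, b in zip(u, v))


def slvm_similarity(doc_x: Matrix, doc_y: Matrix, elem_sim: Matrix) -> float:
    """Assumed form: sum over (i, j) of elem_sim[i][j] * <doc_x[i], doc_y[j]>.

    doc_x and doc_y hold one row of term weights per XML element type;
    elem_sim[i][j] says how strongly element i relates to element j.
    """
    total = 0.0
    for i, row_x in enumerate(doc_x):
        for j, row_y in enumerate(doc_y):
            total += elem_sim[i][j] * dot(row_x, row_y)
    return total


# Toy data (hypothetical): 2 element types x 3 terms.
doc_a = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
doc_b = [[0.0, 1.0, 0.0], [1.0, 0.0, 1.0]]

# Identity matrix: only the same element type is compared across documents.
identity = [[1.0, 0.0], [0.0, 1.0]]
# Off-diagonal weights: terms in different element types also contribute.
cross = [[1.0, 0.5], [0.5, 1.0]]

sim_identity = slvm_similarity(doc_a, doc_b, identity)
sim_cross = slvm_similarity(doc_a, doc_b, cross)
```

With the identity matrix, only terms shared within the same element type count. The off-diagonal entries let a term that appears in one element of the first document match the same term in a different element of the second, which is the kind of relationship a learned matrix could capture.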

Bibliographic details

Source
Knowledge and Information Systems | 2009, Issue 1 | 26 pages
Authors
Jianwu Yang; William K. Cheung; Xiaoou Chen
Original format: PDF
Language: English
Chinese Library Classification: automatic information theory
Keywords
Semi-structured document analysis; Learning similarity matrix; Similarity-based clustering; Extended Vector Space Model

Similar documents

1. Learning element similarity matrix for semi-structured document analysis [J]. Jianwu Yang, William K. Cheung, Xiaoou Chen. Knowledge and Information Systems, 2009, Issue 1.
2. Similarity Based Clustering with Indexing for Semi-Structured Document [J]. Palanisamy S., K. Baskaran. Journal of Computer Sciences, 2012, Issue 4.
3. Similarity Based Clustering with Indexing for Semi-Structured Document | Science Publications [J]. K. Baskaran, S. Palanisamy. Journal of Computer Sciences, 2012, Issue 4.
4. Semi-structured document extraction based on document element block model [C]. Tao Lv, Jiang Liu, Fan Lu. IEEE International Conference on Cloud Computing and Intelligent Systems, 2016.
5. A comparative analysis framework for semi-structured documents, with applications to government regulations [D]. Lau, Gloria T. 2004.
6. JSONize: A Scalable Machine Learning Pipeline to Model Medical Notes as Semi-structured Documents [O]. Everett N. Rush, Ioana Danciu, George Ostrouchov. 2020.
7. Similarity Based Clustering with Indexing for Semi-Structured Document [O]. S. Palanisamy, K. Baskaran. 2012.
