Conference: International Conference on Computer Science and Information Processing (CSIP 2012)

Sentence alignment for web page text based on vector space model

Abstract

Parallel web pages contain noisy, non-parallel sentences. Web page structure is subject to some limitations when used for sentence alignment of web page text. The most straightforward way of aligning sentences is to use a translation lexicon. However, a major obstacle to this approach is the lack of a dictionary for training. This paper presents a method for automatically aligning Mongolian-Chinese parallel text on the Web via a vector space model. The vector space model is an algebraic model for representing any object as a vector of identifiers, such as index terms. In the statistically based vector space model, a sentence is conceptually represented by a vector of keywords extracted from the text. The extracted keywords are content words, known as terms, and the weight of a term in a sentence vector can be determined by the tf-idf method. CHI is used to compute the association between bilingual words. Once the term weights are determined, the similarity between sentence vectors is computed via the cosine measure. The experimental results indicate that the method is accurate and efficient enough to apply without human intervention.
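
The three building blocks named in the abstract (tf-idf term weights, a CHI association score, and cosine similarity) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names `tf_idf_vectors`, `cosine`, and `chi_square` are hypothetical, the tf-idf variant (relative term frequency times log inverse document frequency) is one common choice, and CHI is assumed here to mean the chi-square statistic on a 2x2 co-occurrence table. How the paper combines the bilingual association scores with the sentence vectors is not stated in the abstract and is not modeled.

```python
import math
from collections import Counter


def tf_idf_vectors(sentences):
    """Build sparse tf-idf vectors for a list of tokenized sentences.

    Each sentence is a list of terms; each result is a {term: weight} dict.
    Assumed weighting: (count / sentence length) * log(N / document frequency).
    """
    n = len(sentences)
    # Document frequency: number of sentences in which each term occurs.
    df = Counter(term for s in sentences for term in set(s))
    vectors = []
    for s in sentences:
        tf = Counter(s)
        vectors.append(
            {t: (c / len(s)) * math.log(n / df[t]) for t, c in tf.items()}
        )
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def chi_square(a, b, c, d):
    """Chi-square association for a word pair from a 2x2 contingency table.

    a: aligned units containing both words
    b: units containing only the source word
    c: units containing only the target word
    d: units containing neither word
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


if __name__ == "__main__":
    # Toy monolingual example; the tokens are placeholders, not real data.
    sents = [["web", "page", "text"], ["web", "sentence", "alignment"]]
    vecs = tf_idf_vectors(sents)
    print(cosine(vecs[0], vecs[1]))
    print(chi_square(10, 0, 0, 10))
```

In this toy example the only shared term, `web`, occurs in every sentence, so its idf is zero and it contributes nothing to the similarity. That is the intended effect of tf-idf weighting: terms that appear everywhere do not help tell candidate sentence pairs apart.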
