
Method and apparatus for assessing similarity between online job listings


Abstract

Job listings retrieved from external sources are pre-processed, and duplicate records are identified, before the listings are stored in the production database of a search engine. An inter-source hash value and an intra-source hash value are calculated for each job listing, and the values are compared across listings. Job listings having the same intra-source hash are judged to be duplicates of each other. Listings whose intra-source hash values do not match but whose inter-source hash values match are judged to be duplicate candidates and are subject to further processing. The suffixes of each candidate record are stored in a data structure such as a suffix array, and the records are searched and compared based on the suffix arrays. Records having a predetermined number of contiguous words in common are judged to be duplicates.
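
The two-stage check described in the abstract can be illustrated with a minimal Python sketch. The abstract does not say which fields feed each hash, which hash function is used, or what the word threshold is, so the following are assumptions made only for illustration: the `Listing` fields, the use of SHA-1, an intra-source hash over all fields, an inter-source hash over title, company and location only, and a default `min_run` of 5 words. A sorted list of word-level suffixes stands in for the suffix array; it is not the patent's implementation.

```python
from __future__ import annotations

import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Listing:
    # Hypothetical record layout; the abstract does not define the fields.
    source: str
    title: str
    company: str
    location: str
    description: str


def _digest(*fields: str) -> str:
    # Lower-case and collapse whitespace so formatting noise does not change the hash.
    normalised = "\x1f".join(" ".join(f.lower().split()) for f in fields)
    return hashlib.sha1(normalised.encode("utf-8")).hexdigest()


def intra_source_hash(job: Listing) -> str:
    # Assumption: the strict hash covers every content field, including the description.
    return _digest(job.title, job.company, job.location, job.description)


def inter_source_hash(job: Listing) -> str:
    # Assumption: the loose hash covers only fields likely to agree across sources.
    return _digest(job.title, job.company, job.location)


def _lcp(x: list[str], y: list[str]) -> int:
    # Length of the common prefix of two word lists.
    n = 0
    for u, v in zip(x, y):
        if u != v:
            break
        n += 1
    return n


def longest_common_run(a: list[str], b: list[str]) -> int:
    """Longest run of contiguous words shared by a and b.

    Builds a naive sorted list of all word-level suffixes of both inputs.
    The longest shared run is the largest common prefix between neighbouring
    suffixes that come from different inputs.
    """
    suffixes = [(a[i:], 0) for i in range(len(a))]
    suffixes += [(b[i:], 1) for i in range(len(b))]
    suffixes.sort()
    best = 0
    for (s1, o1), (s2, o2) in zip(suffixes, suffixes[1:]):
        if o1 != o2:
            best = max(best, _lcp(s1, s2))
    return best


def classify(x: Listing, y: Listing, min_run: int = 5) -> str:
    # Stage 1: identical strict hashes mean duplicate.
    if intra_source_hash(x) == intra_source_hash(y):
        return "duplicate"
    # Stage 2: only listings with matching loose hashes are candidates.
    if inter_source_hash(x) != inter_source_hash(y):
        return "distinct"
    # Stage 3: candidates are duplicates if they share enough contiguous words.
    run = longest_common_run(x.description.lower().split(),
                             y.description.lower().split())
    return "duplicate" if run >= min_run else "distinct"
```

Slicing every suffix into its own list keeps the sketch short but costs quadratic memory; a production system would more likely store suffix start offsets and compare in place.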
