Creating a criterion-based information agent through data mining for automated identification of scholarly research on the World Wide Web.

Abstract

This dissertation creates an information agent that correctly identifies Web pages containing scholarly research approximately 96% of the time. It does this by analyzing the Web page against a set of criteria and then using a classification tree to arrive at a decision.

The criteria were gathered from the literature on selecting print and electronic materials for academic libraries. A Delphi study was conducted with an international panel of librarians to expand and refine the criteria until a list of 41 operationalizable criteria was agreed upon. A Perl program was then designed to analyze a Web page and determine a numerical value for each criterion.

A large collection of Web pages was gathered, comprising 5,000 pages that contain the full text of scholarly research and 5,000 random pages, representative of user searches, that do not contain scholarly research. Datasets were built by running the Perl program on these Web pages. The datasets were split into model-building and testing sets.

Data mining was then used to create different classification models. Four techniques were used: logistic regression, non-parametric discriminant analysis, classification trees, and neural networks. The models were created with the model-building datasets and then tested against the test dataset. Precision and recall were used to judge the effectiveness of each model. In addition, a set of pages that were difficult to classify because of their similarity to scholarly research was gathered and classified with the models.

The classification tree produced the most effective classification model, with a precision of 96% and a recall of 95.6%. However, logistic regression produced a model that correctly classified more of the problematic pages.

This agent can be used to create a database of scholarly research published on the Web. In addition, the technique can be used to create a database of any type of structured electronic information.
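The pipeline described in the abstract (numeric criterion values per page, a tree-style decision, then evaluation by precision and recall) can be illustrated with a minimal Python sketch. Everything specific here is an assumption for illustration only: the three criteria, the hand-written decision rule, and the four sample pages are hypothetical and are not the dissertation's 41 criteria, its Perl program, or its learned classification tree.

```python
import re


def extract_criteria(html: str) -> dict[str, float]:
    """Turn a page into numeric criterion values (hypothetical criteria)."""
    text = re.sub(r"<[^>]+>", " ", html)  # crude tag stripping, enough for a sketch
    lower = text.lower()
    return {
        "word_count": float(len(text.split())),
        "has_abstract": float("abstract" in lower),
        "has_references": float("references" in lower or "bibliography" in lower),
    }


def classify(criteria: dict[str, float]) -> bool:
    """Hand-written stand-in for a learned classification tree."""
    if criteria["has_references"] >= 0.5:
        return criteria["has_abstract"] >= 0.5 or criteria["word_count"] >= 50
    return False


def precision_recall(predicted: list[bool], actual: list[bool]) -> tuple[float, float]:
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    fn = sum(a and not p for p, a in zip(predicted, actual))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up labelled pages: (html, is_scholarly).
pages = [
    ("<h1>Study</h1><p>Abstract: we test X.</p><h2>References</h2>", True),
    ("<p>Buy shoes online today</p>", False),
    ("<p>Abstract art gallery.</p><h2>References</h2>", False),  # look-alike page
    ("<p>A short research note with no citations.</p>", True),
]

predicted = [classify(extract_criteria(html)) for html, _ in pages]
actual = [label for _, label in pages]
precision, recall = precision_recall(predicted, actual)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

The third sample page stands in for the "difficult to classify" pages the abstract mentions: it shares surface features with scholarly work and so lowers precision, while the fourth is scholarly but lacks the expected features and so lowers recall.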

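Bibliographic record

  • Author: Nicholson, Scott Richard
  • Author affiliation: University of North Texas
  • Degree-granting institution: University of North Texas
  • Subjects: Mathematics; Information Science; Computer Science
  • Degree: Ph.D.
  • Year: 2000
  • Pagination: 100 p.
  • Total pages: 100
  • Original format: PDF
  • Language of text: English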
