A THEMATIC ANALYSIS OF THE AIDS LITERATURE

Abstract

When a large collection of objects must be made comprehensible to humans, a time-honored approach is to cluster the objects into groups of closely related items. Each group is then summarized in some convenient manner to provide a more manageable view of the data. Such methods have been applied to document collections with mixed results. If a hard clustering into mutually exclusive clusters is performed, documents are frequently forced into one cluster even though they contain important information that would also make them appropriate candidates for other clusters. If a soft clustering is used, there remains the problem of how to provide a useful summary of the data in a cluster. Here we introduce a new algorithm that produces a soft clustering of document collections based on the concept of a theme. A theme is conceptually a subject area that is discussed by multiple documents in the database. A theme has two potential representations that may be viewed as dual to each other: first, the set of documents that discuss the theme; second, the set of key terms that are typically used to discuss it. Our algorithm is an EM algorithm in which the term representation and the document representation are explicit components, and each is used to refine the other in an alternating fashion. Upon convergence, the term representation provides a natural summary of the document representation (the cluster). We describe how to optimize the themes produced by this process and give the results of applying the method to a database of over fifty thousand PubMed documents dealing with the subject of AIDS. We also discuss how themes may improve access to a document collection.
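
The alternating refinement described in the abstract can be illustrated with a small sketch. The abstract does not give the update equations, so everything below is an assumption made for illustration: documents are treated as sets of term ids, a theme is modeled as one component of a two-component Bernoulli mixture, rates are smoothed with a constant `eps`, and the term representation is truncated to the `n_terms` most theme-specific terms. That truncation means the loop is EM-like rather than a strict EM procedure. The function name `refine_theme` and all of its parameters are hypothetical and are not taken from the paper.

```python
import math


def refine_theme(docs, vocab_size, seed_probs, n_terms=2, n_iter=20, eps=0.5):
    """Alternately refine a theme's term view and document view.

    docs       : list of sets of integer term ids (one set per document)
    seed_probs : initial guess of each document's membership in the theme
    Returns (doc_probs, key_terms).
    """
    probs = list(seed_probs)
    key_terms = []
    for _ in range(n_iter):
        in_mass = sum(probs)            # soft count of documents in the theme
        out_mass = len(docs) - in_mass  # soft count of documents outside it

        # Term step: smoothed rate of each term inside and outside the theme.
        theta, bg, weight = {}, {}, {}
        for t in range(vocab_size):
            in_hits = sum(p for p, d in zip(probs, docs) if t in d)
            out_hits = sum(1.0 - p for p, d in zip(probs, docs) if t in d)
            theta[t] = (in_hits + eps) / (in_mass + 2 * eps)
            bg[t] = (out_hits + eps) / (out_mass + 2 * eps)
            # Log odds ratio: large when the term favors the theme.
            weight[t] = math.log(
                theta[t] * (1 - bg[t]) / (bg[t] * (1 - theta[t]))
            )

        # Keep the most theme-specific terms as the term representation
        # (an assumed simplification; ties are broken by term id).
        key_terms = sorted(range(vocab_size),
                           key=lambda t: (-weight[t], t))[:n_terms]

        # Document step: rescore every document using only the key terms.
        prior = math.log((in_mass + eps) / (out_mass + eps))
        new_probs = []
        for d in docs:
            score = prior
            for t in key_terms:
                if t in d:
                    score += math.log(theta[t] / bg[t])
                else:
                    score += math.log((1 - theta[t]) / (1 - bg[t]))
            new_probs.append(1.0 / (1.0 + math.exp(-score)))
        probs = new_probs
    return probs, key_terms


# Toy corpus (made up): terms 0 and 1 co-occur, terms 2 and 3 co-occur,
# and term 4 appears on both sides.
toy_docs = [{0, 1}, {0, 1}, {0, 1, 4}, {2, 3}, {2, 3}, {2, 3, 4}]
memberships, terms = refine_theme(toy_docs, 5, [0.9, 0.1, 0.1, 0.1, 0.1, 0.1])
print(terms, [round(p, 2) for p in memberships])
```

Because memberships are probabilities rather than hard assignments, running the same loop from different seeds lets one document score highly for several themes, which is the soft-clustering property the abstract argues for. The returned `key_terms` play the role of the cluster summary.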
