The growing number of scientific publications has made searching the digital scientific literature a difficult task, one that depends heavily on the researcher's ability to search, filter, and classify content. The most widely used scientific literature search engines and portals, such as Google Scholar, Citeseer and ACM, rank query results using only simple text-based and citation-based scores, and the resulting ranking is of little use. The number of references that a scientific publication receives (its citations) is taken as a measure of the impact the contribution has had on the community. Many methods for measuring or ranking researchers (known as indices) are citation based. A fair index is important because it is used to evaluate and compare researchers for different purposes, such as university recruitment, faculty advancement, and the award of grants. Science has many fields (Human Sciences, Social Sciences, Computer Science, etc.), and each field has its own structure and publication dynamics. For example, the number of citations in the top-20 most cited journals in Computer Science is four times the number in the top-20 most cited journals in Social Science. It is therefore unfair to compare researchers using citation-based metrics without a context, that is, the community they belong to. Because communities differ in size, the metrics most commonly used today to measure the productivity or impact of researchers are unfair when comparing researchers from different communities: researchers in communities with higher productivity are likely to receive more citations than those in communities with lower productivity. This thesis presents a model and a tool for the detection and evaluation of scientific communities. Detecting these communities will improve two important activities in scientific research. First, the search for scientific contributions.
Knowing which communities scientific entities belong to, and hence the relations between them, will enable more efficient search mechanisms, since the domain of a query can be narrowed to particular communities or spread across different communities to obtain diverse content. Moreover, a framework that supports the discovery of scientific communities will provide the means for a better understanding of social behavior in scientific research, making it possible to identify patterns in project development, research trends, successful research profiles, and so on. Second, the assessment of people (researchers). InfEur2008 suggests that numerical indicators must not be used to compare research or researchers across different disciplines. Since the borders between disciplines are now blurring, it is hard to define a priori the disciplines to which someone belongs. Ad hoc and evolving communities can provide a better basis for this. The approach presented in this thesis combines different clustering algorithms to detect overlapping scientific communities from conference publication data. The Community Engine Tool (CET) implements the algorithm and has been evaluated using the DBLP dataset, which contains information on more than 12 thousand conferences. The results show that our approach makes it possible to automatically produce a community structure close to the human-defined classification of conferences. The approach is part of a larger research effort aimed at studying how scientific communities are born, evolve, remain healthy or become unhealthy (e.g., self-referential), and eventually vanish.
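The argument that citation counts need a community context can be illustrated with a minimal sketch. It assumes one simple form of normalization, dividing each researcher's citation count by the mean count of their community; this is not stated to be the method used in the thesis, and the function name, researcher names, community labels, and numbers are all hypothetical.

```python
from statistics import mean


def community_relative_citations(citations, communities):
    """Divide each researcher's citation count by their community's mean count.

    Illustrative assumption only: the thesis does not specify this formula.
    """
    # Group raw citation counts by community label.
    counts_by_community = {}
    for name, count in citations.items():
        counts_by_community.setdefault(communities[name], []).append(count)

    # Mean citation count per community.
    averages = {label: mean(counts) for label, counts in counts_by_community.items()}

    # A score of 1.0 means "average for this community".
    scores = {}
    for name, count in citations.items():
        average = averages[communities[name]]
        scores[name] = count / average if average else 0.0
    return scores


# Hypothetical data: the "cs" community is cited four times as much as "social".
citations = {"alice": 400, "bob": 200, "carol": 100, "dave": 50}
communities = {"alice": "cs", "bob": "cs", "carol": "social", "dave": "social"}

scores = community_relative_citations(citations, communities)
for name in sorted(scores):
    print(name, round(scores[name], 2))
```

In this toy data, alice has four times carol's raw citations, yet both sit the same distance above their own community's average, which is the kind of context a raw count hides. A researcher who belongs to several overlapping communities would need one score per community, which this sketch does not handle.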