Clickstream analysis provides valuable insight into user behavior, which can be translated into better business opportunities and increased user satisfaction. A fundamental problem in clickstream analysis is computing the distance (or similarity) between two clickstreams. A considerable body of literature proposes methods for computing path similarity, but these methods rely on the edit distance, or the related longest common subsequence, to align the two clickstreams. The edit distance yields a least-cost sequence of transformations that makes the two clickstreams identical, and measures of path similarity are often defined on these "aligned" clickstreams. However, the replacement cost used in this alignment process is assumed to be fixed, and so ignores the degree of similarity between the two page views being exchanged. This paper proposes a method for computing the replacement cost, based on the assumption that the degree of similarity between two page views is proportional to their relative frequency of co-occurrence. We define a method for obtaining the replacement cost of two arbitrary web pages that takes into account the order of the sequence as well as the time spent on each page. Although the method is less accurate than content-based analysis, our experiments with data generated from a simulator and with data from an actual web site show that the assumption is well founded and that the proposed method is a fast and accurate way of computing the similarity between two page views.
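To make the idea of a similarity-dependent replacement cost concrete, the following is a minimal sketch and not the method defined in the paper. It assumes that page similarity is the Jaccard ratio of session co-occurrence and that the replacement cost is one minus that similarity; it ignores both the order of page views and the time spent on each page. The function names `cooccurrence_similarity` and `weighted_edit_distance`, the `indel_cost` parameter, and the sample sessions are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_similarity(sessions):
    """Build sim(a, b) in [0, 1] from how often pages share a session.

    Assumption (not from the paper): similarity is the number of sessions
    containing both pages divided by the number containing either page.
    """
    page_count = Counter()
    pair_count = Counter()
    for session in sessions:
        pages = sorted(set(session))  # count each page once per session
        page_count.update(pages)
        pair_count.update(combinations(pages, 2))

    def sim(a, b):
        if a == b:
            return 1.0
        both = pair_count[tuple(sorted((a, b)))]
        either = page_count[a] + page_count[b] - both
        return both / either if either else 0.0

    return sim


def weighted_edit_distance(s, t, sim, indel_cost=1.0):
    """Edit distance in which replacing a with b costs 1 - sim(a, b)."""
    prev = [j * indel_cost for j in range(len(t) + 1)]
    for i, a in enumerate(s, start=1):
        curr = [i * indel_cost]
        for j, b in enumerate(t, start=1):
            curr.append(min(
                prev[j] + indel_cost,             # delete a
                curr[j - 1] + indel_cost,         # insert b
                prev[j - 1] + (1.0 - sim(a, b)),  # replace a with b
            ))
        prev = curr
    return prev[-1]


# Hypothetical sessions used only to illustrate the calculation.
sessions = [
    ["home", "shirts", "cart"],
    ["home", "pants", "cart"],
    ["home", "shirts", "pants"],
    ["about"],
]
sim = cooccurrence_similarity(sessions)
distance = weighted_edit_distance(sessions[0], sessions[1], sim)
```

In this toy data, "shirts" and "pants" share one of the three sessions in which either appears, so replacing one with the other costs 2/3 rather than the fixed cost of 1 that a standard edit distance would charge.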