The entropy of patterns of sequences generated by independent and identically distributed (i.i.d.) sources with unknown large, possibly infinite, alphabets is investigated. A pattern is a sequence of indices that contains all consecutive integer indices in increasing order of first occurrence. If the alphabet of a source that generated a sequence is unknown, the inevitable cost of coding the unknown alphabet symbols can be exploited to create the pattern of the sequence. This pattern can in turn be compressed by itself. We extend our previous upper bounds on the entropy of patterns generated by a bounded alphabet to unbounded, possibly infinite, alphabets. Unlike the bounded case, we now allow alphabets with symbols that occur with both high and very low probabilities. We study the effect of all very low probability letters on the pattern entropy. All the low probability letters are collapsed into one symbol. Beyond the contribution of that symbol to the entropy, and unlike i.i.d. sequences, the additional contribution to the entropy of patterns of length n of all letters with probability 1/n^{1+ε} or smaller, for some arbitrarily small ε, is shown to be o(n) over the whole sequence (and o(1) per symbol). The same contribution of all letters with probability 1/n^{2+ε} or smaller is shown to be o(1) for the whole sequence. If an i.i.d. source with an infinite alphabet has only letters with probability 1/n^{2+ε} or smaller, the entropy of its patterns approaches zero, i.e., the only likely pattern is the pattern 123...n. This is in contrast to the i.i.d. entropy, which is super-linear in n. The results are derived through the design of a low-complexity sequential coding method for patterns that achieves the upper bound.
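The pattern of a sequence, as defined above, can be computed with a minimal sketch; the function name and the 1-based indexing convention are illustrative assumptions, not taken from the paper:

```python
def pattern(seq):
    """Replace each symbol by the order of its first occurrence (1-based)."""
    first_index = {}  # symbol -> index of first occurrence
    out = []
    for s in seq:
        if s not in first_index:
            # a new symbol gets the next consecutive integer index
            first_index[s] = len(first_index) + 1
        out.append(first_index[s])
    return out

# For example, the pattern of "abracadabra":
print(pattern("abracadabra"))  # [1, 2, 3, 1, 4, 1, 5, 1, 2, 3, 1]
```

Note that the pattern is independent of the actual alphabet: "abracadabra" and any relabeling of its letters produce the same index sequence, which is why the alphabet-coding cost can be separated from the pattern itself.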