One thing I hadn't noted is that if the code is executed serially, it should probably still all be decoded at fetch time, so pre-decoded instructions are what should be cached. That is what K9 did... Athlon used 3 bits per byte to aid decode, marking which bytes were the opcode, which were the ModR/M and SIB bytes, and which byte ended the instruction. Opteron got rid of 2 of those bits and kept just the end marker, but added a stage to the decode pipeline.
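To make the end-marker idea concrete, here is a rough sketch of one predecode bit per byte. It is not how the AMD hardware does it. It assumes a made-up toy encoding where the low 2 bits of an instruction's first byte give its length minus 1 (real x86 length decoding is far messier), and the names `toy_length`, `predecode` and `split_by_end_bits` are invented for illustration.

```python
def toy_length(first_byte: int) -> int:
    """Hypothetical encoding: low 2 bits of the first byte = length - 1 (1..4 bytes)."""
    return (first_byte & 0b11) + 1


def predecode(code: bytes) -> list[int]:
    """Done once at fill time: set a 1 on the last byte of every instruction."""
    end_bits = [0] * len(code)
    pos = 0
    while pos < len(code):
        last = pos + toy_length(code[pos]) - 1
        if last >= len(code):
            break  # instruction runs past the end of this block: leave it unmarked
        end_bits[last] = 1
        pos = last + 1
    return end_bits


def split_by_end_bits(code: bytes, end_bits: list[int]) -> list[bytes]:
    """Done at fetch time: find boundaries from the cached bits alone,
    without looking at the opcode bytes again."""
    instructions = []
    start = 0
    for i, bit in enumerate(end_bits):
        if bit:
            instructions.append(code[start:i + 1])
            start = i + 1
    return instructions


if __name__ == "__main__":
    line = bytes([0x01, 0xAA, 0x00, 0x02, 0xBB, 0xCC])
    bits = predecode(line)
    print(bits)
    print(split_by_end_bits(line, bits))
```

The point of the split is that the serial length walk happens once in `predecode`, and every later fetch only reads the stored bits. With just the end marker, the decoder still has to work out which bytes are opcode and which are ModR/M or SIB, which fits with Opteron needing an extra decode stage.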