A programmable multimedia accelerator which maximizes data bandwidth utilization with minimal hardware (and consequently minimal power consumption) is provided herein. In one embodiment, the accelerator includes four functional units, a routing unit, and a control module. The functional units each operate on four input bytes and a carry-in bit, and produce two output bytes and a carry-out bit. The carry-out bit of each functional unit is provided as a carry-in bit to another functional unit, allowing the functional units to operate cooperatively to carry out extended-precision operations when needed. The functional units can also operate individually to perform low-precision operations in parallel. The routing unit is coupled to the functional units to receive the output bytes and to provide a permutation of the output bytes as additional pairs of input bytes to the functional units. The control module stores and executes a set of instructions to provide control signals to the functional units and the routing units. The functional units are preferably configured to perform 8×8 bit multiplications and 16 bit operations such as addition, subtraction, logical AND, and logical XOR.
展开▼