Detailed Explanation
AltiVec is an extension to the PowerPC instruction set. It is designed to enhance PowerPC processor performance on dynamic, media-rich applications such as video and animation. AltiVec achieves this goal by providing a mechanism for programs to encode the low-level data parallelism that is common to multimedia tasks in such a way that microprocessor hardware can exploit it more efficiently. While AltiVec was designed primarily for multimedia acceleration, it is general purpose in nature and capable of accelerating almost any application that handles non-trivial amounts of data.
The AltiVec architectural specification describes a Single Instruction Multiple Data (SIMD) processing unit that is integrated with the PowerPC architecture in a manner analogous to the existing integer and floating-point units. AltiVec introduces a new register file, separate from the existing general-purpose and floating-point registers. It contains 32 registers, each 128 bits wide.
This new extension adds 160 new "vector" instructions to the PowerPC instruction set. These instructions operate on operands in the new vector register file in much the same way that scalar integer and scalar floating-point instructions operate on operands in their respective registers. A processor executes vector instructions out of the same instruction stream as other PowerPC instructions. There are no restrictions on how vector instructions can be intermixed with branch, integer or floating-point instructions, and there is no context-switching overhead or penalty for doing so. The AltiVec instruction set is general in nature, but optimized for digital signal processing algorithms. Vector instructions are RISC-style and relatively simple, and every one is designed to be fully pipelineable. All vector instructions are architecturally consistent with the existing PowerPC scalar instructions but operate on fixed-length vector operands rather than scalar operands. AltiVec instructions have one, two or three source operands and are non-destructive: the result goes to a separate destination register, so no source operand is overwritten.
The 128-bit-wide vector operands consist of multiple packed scalar data elements. The packed vector data types supported in AltiVec include:
1. 128 bits
2. 16 8-bit integers
3. 8 16-bit integers
4. 8 16-bit pixels
5. 4 32-bit integers
6. 4 32-bit pixels
7. 4 IEEE-754 floats
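One way to picture these data types is as different views of the same 16 bytes. The following is a minimal sketch in portable C, not AltiVec code: the `vreg128` union and the `elements_per_vector` helper are hypothetical names introduced here for illustration, and the sketch assumes a 32-bit `float`.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical portable model of one 128-bit vector register.
   This is an illustration only, not the AltiVec programming interface. */
typedef union {
    uint8_t  u8[16];  /* 16 x 8-bit integers */
    uint16_t u16[8];  /* 8 x 16-bit integers or 16-bit pixels */
    uint32_t u32[4];  /* 4 x 32-bit integers or 32-bit pixels */
    float    f32[4];  /* 4 x single-precision floats (assumes 32-bit float) */
} vreg128;

/* Number of packed elements of a given bit width that fit in one register. */
static size_t elements_per_vector(size_t element_bits)
{
    return 128u / element_bits;
}
```

Under these assumptions every view occupies the same 16 bytes, and the element count falls out of the element width alone.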
There are five general classes of AltiVec instructions:
1. load / store
2. prefetch
3. data manipulation
4. arithmetic & logical
5. control flow
1. Loads and Stores.
AltiVec loads and stores use indexed addressing with address operands coming from the general register file. Data moved between memory and vector registers can be full vectors (16 bytes) or individual scalar elements (bytes, halfwords or words) within a vector. All data operands are aligned to their natural size boundary and scalar data elements are transferred on their natural byte lanes. Misaligned data is handled explicitly with the data manipulation instructions.
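The alignment rule for full-vector loads can be sketched in portable C. This is a model rather than AltiVec code: `aligned_offset` and `load_vector` are hypothetical names, memory is modeled as a plain byte array, and the sketch assumes that a full-vector access simply ignores the low four bits of the base-plus-index address.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model: base + index, rounded down to a 16-byte boundary. */
static size_t aligned_offset(size_t base, size_t index)
{
    return (base + index) & ~(size_t)15;
}

/* Hypothetical model of a full-vector load from a byte-array "memory".
   The 16 bytes always come from the aligned block containing base + index. */
static void load_vector(const uint8_t *memory, size_t base, size_t index,
                        uint8_t out[16])
{
    memcpy(out, memory + aligned_offset(base, index), 16);
}
```

In this model, a load at offset 21 returns the block starting at offset 16, which is why misaligned data needs an explicit realignment step afterwards.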
2. Prefetch.
These instructions allow software to prefetch data blocks into the cache(s) before the program needs them, so that demand misses can be avoided and long memory latency overlapped with useful execution of other instructions. Four independent prefetch streams are provided, each described by a block size, stride and count. Data prefetching allows loading data into the processor and manipulating data to proceed in parallel.
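A stream description of this kind can be modeled in a few lines of portable C. The `prefetch_stream` struct and its helpers are hypothetical names introduced here, and the units (bytes for block size and stride) are an assumption of this sketch, not the encoding used by the actual instructions.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one prefetch stream. Units are assumed to be
   bytes here; the real instruction encoding may differ. */
typedef struct {
    uintptr_t start;        /* address of the first block */
    size_t    block_bytes;  /* size of each block */
    ptrdiff_t stride_bytes; /* distance between consecutive block starts */
    size_t    block_count;  /* number of blocks in the stream */
} prefetch_stream;

/* Address of block k (0-based) in the stream. */
static uintptr_t stream_block_address(const prefetch_stream *s, size_t k)
{
    return s->start + (uintptr_t)((ptrdiff_t)k * s->stride_bytes);
}

/* Total number of bytes the stream asks to have prefetched. */
static size_t stream_total_bytes(const prefetch_stream *s)
{
    return s->block_bytes * s->block_count;
}
```

For example, a stream of four 64-byte blocks with a 256-byte stride would cover one 64-byte slice out of each of four consecutive 256-byte rows.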
3. Data Manipulation.
A deficiency of classical SIMD architectures is that the parallelism is lost when the organization of the data does not exactly match that of the machine. AltiVec solves this with a set of powerful data manipulation instructions. These instructions are essentially variations of the very general "permute" instruction. Permute creates a result vector consisting of any 16 bytes selected out of any two source vector registers and arranged in any arbitrary order as specified by a control vector located in a third source vector register. Besides its data reorganization capabilities, permute performs other useful functions such as table-lookups which are prevalent in DSP algorithms. Special cases of permute are also provided for common operations such as packing, unpacking, merging, shifting and element replication (splat).
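The permute operation described above can be modeled in portable C as follows. This is a sketch, not AltiVec code: the function name `permute` is hypothetical, and it assumes that each control byte uses its low five bits to index into the 32 bytes formed by the first source followed by the second.

```c
#include <stdint.h>

/* Hypothetical model of permute: each result byte is chosen from the
   32-byte concatenation of a and b by the low 5 bits of the matching
   control byte (0-15 select from a, 16-31 select from b). */
static void permute(const uint8_t a[16], const uint8_t b[16],
                    const uint8_t control[16], uint8_t result[16])
{
    uint8_t tmp[16]; /* temporary so result may alias a source */
    for (int i = 0; i < 16; i++) {
        unsigned sel = control[i] & 0x1Fu;
        tmp[i] = (sel < 16) ? a[sel] : b[sel - 16];
    }
    for (int i = 0; i < 16; i++) {
        result[i] = tmp[i];
    }
}
```

With a control vector of 0, 16, 1, 17, and so on, this model interleaves the leading bytes of the two sources, which is the kind of merge the special-case instructions provide directly. Treating `a` and `b` as a 32-entry byte table and the control vector as indices gives the table-lookup use.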
4. Arithmetic and Logical.
In general, arithmetic and logical operations are performed on all corresponding elements of the source operand vectors, with the results placed in the corresponding elements of the destination vector. A wide variety of arithmetic and logical operations are provided. A few instructions perform operations on elements within a single vector, such as sum-across and multiply-sum-across. Integer operations are provided in either saturating (results clamped to the representable range) or modulo (wrap-around) forms to handle overflow. Single-precision floating-point operations, though not uniformly orthogonal, exist for most of the important arithmetic operations, including add, subtract, multiply-add, min, max, round to integral value, and convert to and from integer. Floating-point operations are carried out in a Java-compliant subset of the IEEE-754 specification (lacking only IEEE directed rounding modes and exceptions). No divide instructions are provided. However, a reciprocal estimate instruction is provided as the starting point for a fully pipelined Newton-Raphson refinement.
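Two of the ideas above lend themselves to a short sketch in portable C: the difference between modulo and saturating integer arithmetic, and one Newton-Raphson step that sharpens a rough reciprocal estimate. All function names here are hypothetical, the integer example uses unsigned 8-bit elements only, and the scalar `refine_reciprocal` stands in for what would be done on four floats at once.

```c
#include <stdint.h>

/* Modulo form: each unsigned 8-bit sum wraps around on overflow. */
static void add_u8_modulo(const uint8_t a[16], const uint8_t b[16],
                          uint8_t r[16])
{
    for (int i = 0; i < 16; i++) {
        r[i] = (uint8_t)(a[i] + b[i]);
    }
}

/* Saturating form: each unsigned 8-bit sum is clamped to 255. */
static void add_u8_saturate(const uint8_t a[16], const uint8_t b[16],
                            uint8_t r[16])
{
    for (int i = 0; i < 16; i++) {
        unsigned sum = (unsigned)a[i] + (unsigned)b[i];
        r[i] = (uint8_t)(sum > 255u ? 255u : sum);
    }
}

/* One Newton-Raphson step for 1/d: x1 = x0 + x0 * (1 - d * x0).
   The estimate x0 is assumed to come from a rough reciprocal estimate. */
static float refine_reciprocal(float d, float estimate)
{
    return estimate + estimate * (1.0f - d * estimate);
}
```

Each refinement step roughly squares the relative error, so a coarse estimate typically needs only one or two steps to approach single-precision accuracy. The step uses only multiplies and adds, which is why it pipelines well.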
5. Control Flow.
There are no new instructions for controlling program flow in the normal sense. Program flow is entirely controlled by the existing PowerPC branch instructions. There are vector compare instructions that can set the PowerPC condition code register for testing by PowerPC conditional branch instructions, but the emphasis in AltiVec is on avoiding branches altogether. To this end, vector compare instructions generate a true/false result vector. This can in turn be used by the "select" instruction to select between elements from two source operand registers. This mechanism can be used to implement branch-free control flow mechanisms such as conditional execution. Compares provide equal-to and greater-than predicates with others being synthesized from logical operations on the true/false result vector. Special floating-point compares are also provided for bounds checking (e.g., for 3D trivial accept/reject) and for processing NaNs.
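The compare-then-select pattern can be sketched in portable C on four 32-bit elements. The names `compare_greater`, `select_bits` and `vector_max` are hypothetical, and the sketch assumes that a true comparison yields an all-ones element and that select takes bits from its second source wherever the mask bit is set.

```c
#include <stdint.h>

/* Hypothetical model of a signed greater-than compare: each element of the
   mask is all ones where a > b and all zeros elsewhere. */
static void compare_greater(const int32_t a[4], const int32_t b[4],
                            uint32_t mask[4])
{
    for (int i = 0; i < 4; i++) {
        mask[i] = (a[i] > b[i]) ? 0xFFFFFFFFu : 0u;
    }
}

/* Hypothetical model of select: bit by bit, take b where the mask bit is 1
   and a where it is 0. */
static void select_bits(const uint32_t a[4], const uint32_t b[4],
                        const uint32_t mask[4], uint32_t r[4])
{
    for (int i = 0; i < 4; i++) {
        r[i] = (a[i] & ~mask[i]) | (b[i] & mask[i]);
    }
}

/* Element-wise maximum with no data-dependent branch on the result path:
   compare once, then select. */
static void vector_max(const int32_t a[4], const int32_t b[4], int32_t r[4])
{
    uint32_t mask[4], ua[4], ub[4], ur[4];
    compare_greater(a, b, mask);
    for (int i = 0; i < 4; i++) {
        ua[i] = (uint32_t)a[i];
        ub[i] = (uint32_t)b[i];
    }
    select_bits(ub, ua, mask, ur); /* take a where a > b, otherwise b */
    for (int i = 0; i < 4; i++) {
        r[i] = (int32_t)ur[i];
    }
}
```

The same shape covers any "if condition then x else y" applied per element: both alternatives are computed, and the mask picks between them.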