11
SIMD Instructions

This chapter discusses the vector instructions on the x86-64. This special class of instructions provides parallel processing, traditionally known as single-instruction, multiple-data (SIMD) instructions because, quite literally, a single instruction operates on several pieces of data concurrently. As a result of this concurrency, SIMD instructions can often execute several times faster (in theory, as much as 32 to 64 times faster) than the comparable single-instruction, single-data (SISD), or scalar, instructions that compose the standard x86-64 instruction set.

The x86-64 actually provides three sets of vector instructions: the Multimedia Extensions (MMX) instruction set, the Streaming SIMD Extensions (SSE) instruction set, and the Advanced Vector Extensions (AVX) instruction set. This book does not consider the MMX instructions as they are obsolete (SSE equivalents exist for the MMX instructions).

The x86-64 vector instruction set (SSE/AVX) is almost as large as the scalar instruction set. A whole book could be written about SSE/AVX programming and algorithms. However, this is not that book; SIMD and parallel algorithms are an advanced subject beyond the scope of this book, so this chapter settles for introducing a fair number of SSE/AVX instructions and leaves it at that.

This chapter begins with some prerequisite information. First, it begins with a discussion of the x86-64 vector architecture and streaming data types. Then, it discusses how to detect the presence of various vector instructions (which are not present on all x86-64 CPUs) by using the cpuid instruction. Because most vector instructions require special memory alignment for data operands, this chapter also discusses MASM segments.

11.1 The SSE/AVX Architectures

Let’s begin by taking a quick look at the SSE and AVX features in the x86-64 CPUs. The SSE and AVX instructions have several variants: the original SSE, plus SSE2, SSE3, SSSE3, SSE4 (SSE4.1 and SSE4.2), AVX, AVX2 (AVX and AVX2 are sometimes called AVX-256), and AVX-512. SSE3 was introduced along with the Pentium 4F (Prescott) CPU, Intel’s first 64-bit CPU. Therefore, you can assume that all Intel 64-bit CPUs support the SSE3 and earlier SIMD instructions.

The SSE/AVX architectures have three main generations:

  • The SSE architecture, with its 128-bit XMM registers
  • The AVX/AVX2 architecture, which extends the registers to 256 bits (the YMM registers)
  • The AVX-512 architecture, which extends the registers to 512 bits (the ZMM registers)

As a general rule, this chapter sticks to AVX2 and earlier instructions in its examples. Please see the Intel and AMD CPU manuals for a discussion of the additional instruction set extensions such as AVX-512. This chapter does not attempt to describe every SSE or AVX instruction. Most streaming instructions have very specialized purposes and aren’t particularly useful in generic applications.

11.2 Streaming Data Types

The SSE and AVX programming models support two basic data types: scalars and vectors. Scalars hold one single- or double-precision floating-point value. Vectors hold multiple floating-point or integer values (between 2 and 32 values, depending on the scalar data type of byte, word, dword, qword, single precision, or double precision, and the register and memory size of 128 or 256 bits).

The XMM registers (XMM0 to XMM15) can hold a single 32-bit floating-point value (a scalar) or four single-precision floating-point values (a vector). The YMM registers (YMM0 to YMM15) can hold eight single-precision (32-bit) floating-point values (a vector); see Figure 11-1.

Figure 11-1: Packed and scalar single-precision floating-point data type

The XMM registers can hold a single double-precision scalar value or a vector containing a pair of double-precision values. The YMM registers can hold a vector containing four double-precision floating-point values, as shown in Figure 11-2.

Figure 11-2: Packed and scalar double-precision floating-point type

The XMM registers can hold sixteen byte values (YMM registers can hold thirty-two byte values), allowing the CPU to perform 16 (32) byte-sized computations with one instruction (Figure 11-3).

Figure 11-3: Packed byte data type

The XMM registers can hold eight word values (YMM registers can hold sixteen word values), allowing the CPU to perform eight (sixteen) 16-bit word-sized integer computations with one instruction (Figure 11-4).

Figure 11-4: Packed word data type

The XMM registers can hold four dword values (YMM registers can hold eight dword values), allowing the CPU to perform four (eight) 32-bit dword-sized integer computations with one instruction (Figure 11-5).

Figure 11-5: Packed double-word data type

The XMM registers can hold two qword values (YMM registers can hold four qword values), allowing the CPU to perform two (four) 64-bit qword computations with one instruction (Figure 11-6).

Figure 11-6: Packed quad-word data type

Intel’s documentation calls the vector elements in an XMM and a YMM register lanes. For example, a 128-bit XMM register has 16 bytes. Bits 0 to 7 are lane 0, bits 8 to 15 are lane 1, bits 16 to 23 are lane 2, . . . , and bits 120 to 127 are lane 15. A 256-bit YMM register has 32 byte-sized lanes, and a 512-bit ZMM register has 64 byte-sized lanes.

Similarly, a 128-bit XMM register has eight word-sized lanes (lanes 0 to 7). A 256-bit YMM register has sixteen word-sized lanes (lanes 0 to 15). On AVX-512-capable CPUs, a ZMM register (512 bits) has thirty-two word-sized lanes, numbered 0 to 31.

An XMM register has four dword-sized lanes (lanes 0 to 3); it also has four single-precision (32-bit) floating-point lanes (also numbered 0 to 3). A YMM register has eight dword or single-precision lanes (lanes 0 to 7). An AVX-512 ZMM register has sixteen dword or single-precision lanes (numbered 0 to 15).

XMM registers support two qword-sized lanes (or two double-precision lanes), numbered 0 to 1. As expected, a YMM register has twice as many (four lanes, numbered 0 to 3), and an AVX-512 ZMM register has four times as many (eight lanes, numbered 0 to 7).

Several SSE/AVX instructions refer to various lanes within these registers. In particular, the shuffle and unpack instructions allow you to move data between lanes in SSE and AVX operands. See “The Shuffle and Unpack Instructions” on page 625 for examples of lane usage.

11.3 Using cpuid to Differentiate Instruction Sets

Intel introduced the 8086 (and shortly thereafter, the 8088) microprocessor in 1978. With almost every succeeding CPU generation, Intel added new instructions to the instruction set. Until this chapter, this book has used instructions that are generally available on all x86-64 CPUs (Intel and AMD). This chapter presents instructions that are available only on later-model x86-64 CPUs. To allow programmers to determine which CPU their applications were using so they could dynamically avoid using newer instructions on older processors, Intel introduced the cpuid instruction.

The cpuid instruction expects a single parameter (called a leaf function) passed in the EAX register. It returns various pieces of information about the CPU in different 32-bit registers based on the value passed in EAX. An application can test the return information to see if certain CPU features are available.

As Intel introduced new instructions, it changed the behavior of cpuid to reflect those changes. Specifically, Intel changed the range of values a program could legally pass in EAX to cpuid; this is known as the highest function supported. As a result, some 64-bit CPUs accept only values in the range 0h to 05h. The instructions this chapter discusses may require passing values in the range 0h to 07h. Therefore, the first thing you have to do when using cpuid is to verify that it accepts EAX = 07h as a valid parameter.

To determine the highest function supported, you load EAX with 0 or 8000_0000h and execute the cpuid instruction (all 64-bit CPUs support these two function values). The return value in EAX is the maximum you can pass to cpuid. The Intel and AMD documentation (also see https://en.wikipedia.org/wiki/CPUID) lists the values cpuid returns for various CPUs; for the purposes of this chapter, we need only verify that the highest function supported is at least 01h (true for all 64-bit CPUs) or, for certain instructions, at least 07h.

In addition to providing the highest function supported, the cpuid instruction with EAX = 0h also returns a 12-character vendor ID in the EBX, EDX, and ECX registers (in that order). For x86-64 chips, this will be either of the following:

  • GenuineIntel (EBX is 756e_6547h, EDX is 4965_6e69h, and ECX is 6c65_746eh)
  • AuthenticAMD (EBX is 6874_7541h, EDX is 6974_6E65h, and ECX is 444D_4163h)

To determine whether the CPU can execute most SSE and AVX instructions, you must execute cpuid with EAX = 01h and test various bits returned in the ECX register. For a few of the more advanced features (advanced bit-manipulation functions and AVX2 instructions), you’ll need to execute cpuid with EAX = 07h and check the results in the EBX register. With EAX = 1, cpuid returns the SSE/AVX feature flags in ECX shown in Table 11-1; with EAX = 07h, it returns the bit-manipulation and AVX2 flags in EBX shown in Table 11-2. If a bit is set, the CPU supports the corresponding instruction(s).

Table 11-1: Intel cpuid Feature Flags (EAX = 1)

Bit   Feature (ECX)
0     SSE3 support
1     PCLMULQDQ support
9     SSSE3 support
19    SSE4.1 support
20    SSE4.2 support
28    Advanced Vector Extensions (AVX) support

Table 11-2: Intel cpuid Extended Feature Flags (EAX = 7, ECX = 0)

Bit   Feature (EBX)
3     Bit Manipulation Instruction Set 1 (BMI1)
5     Advanced Vector Extensions 2 (AVX2)
8     Bit Manipulation Instruction Set 2 (BMI2)
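
For example, here’s a minimal sketch that tests the AVX flag from Table 11-1 (noAVX is a hypothetical label):

            mov     eax, 1
            cpuid
            bt      ecx, 28         ; Copy the AVX flag (bit 28) into carry
            jnc     noAVX           ; Carry clear: no AVX support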

Listing 11-1 queries the vendor ID and basic feature flags on a CPU.

; Listing 11-1
 
; CPUID Demonstration.

        option  casemap:none

nl          =       10

            .const
ttlStr      byte    "Listing 11-1", 0

            .data
maxFeature  dword   ?
VendorID    byte    14 dup (0)

            .code
            externdef printf:proc

; Return program title to C++ program:

            public  getTitle
getTitle    proc
            lea     rax, ttlStr
            ret
getTitle    endp

; Used for debugging:

print       proc
            push    rax
            push    rbx
            push    rcx
            push    rdx
            push    r8
            push    r9
            push    r10
            push    r11

            push    rbp
            mov     rbp, rsp
            sub     rsp, 40
            and     rsp, -16

            mov     rcx, [rbp + 72]   ; Return address = format string
            call    printf

; Skip past the zero-terminating byte of the format
; string and use that address as the new return address:

            mov     rcx, [rbp + 72]
            dec     rcx
skipTo0:    inc     rcx
            cmp     byte ptr [rcx], 0
            jne     skipTo0
            inc     rcx
            mov     [rbp + 72], rcx

            leave
            pop     r11
            pop     r10
            pop     r9
            pop     r8
            pop     rdx
            pop     rcx
            pop     rbx
            pop     rax
            ret
print       endp

; Here is the "asmMain" function.

            public  asmMain
asmMain     proc
            push    rbx
            push    rbp
            mov     rbp, rsp
            sub     rsp, 56         ; Shadow storage

            xor     eax, eax
            cpuid
            mov     maxFeature, eax
            mov     dword ptr VendorID, ebx 
            mov     dword ptr VendorID[4], edx 
            mov     dword ptr VendorID[8], ecx

            lea     rdx, VendorID
            mov     r8d, eax
            call    print
            byte    "CPUID(0): Vendor ID='%s',  "
            byte    "max feature=0%xh", nl, 0

; Leaf function 1 is available on all CPUs that support
; CPUID, no need to test for it. 

            mov     eax, 1
            cpuid
            mov     r8d, edx
            mov     edx, ecx
            call    print
            byte    "cpuid(1), ECX=%08x, EDX=%08x", nl, 0

; Most likely, leaf function 7 is supported on all modern CPUs
; (for example, x86-64), but we'll test its availability nonetheless.

            cmp     maxFeature, 7
            jb      allDone

            mov     eax, 7
            xor     ecx, ecx
            cpuid
            mov     edx, ebx
            mov     r8d, ecx
            call    print
            byte    "cpuid(7), EBX=%08x, ECX=%08x", nl, 0

allDone:    leave
            pop     rbx
            ret     ; Returns to caller
asmMain     endp
            end

Listing 11-1: cpuid demonstration program

On an old MacBook Pro Retina with an Intel i7-3720QM CPU, running under Parallels, you get the following output:

C:\>build listing11-1

C:\>echo off
 Assembling: listing11-1.asm
c.cpp

C:\>listing11-1
Calling Listing 11-1:
CPUID(0): Vendor ID='GenuineIntel', max feature=0dh
cpuid(1), ECX=ffba2203, EDX=1f8bfbff
cpuid(7), EBX=00000281, ECX=00000000
Listing 11-1 terminated

This CPU supports SSE3 instructions (bit 0 of ECX is 1), SSE4.1 and SSE4.2 instructions (bits 19 and 20 of ECX are 1), and the AVX instructions (bit 28 is 1). Those, largely, are the instructions this chapter describes. Most modern CPUs will support these instructions (the i7-3720QM was released by Intel in 2012). The processor doesn’t support some of the more interesting extended features of the Intel instruction set (the extended bit-manipulation instructions and the AVX2 instruction set). Programs using those instructions will not execute on this (ancient) MacBook Pro.

Running this on a more recent CPU (an iMac Pro 10-core Intel Xeon W-2150B) produces the following output:

C:\>listing11-1
Calling Listing 11-1:
CPUID(0): Vendor ID='GenuineIntel', max feature=016h
cpuid(1), ECX=fffa3203, EDX=1f8bfbff
cpuid(7), EBX=d09f47bb, ECX=00000000
Listing 11-1 terminated

As you can see, looking at the extended feature bits, the newer Xeon CPU does support these additional instructions. The code fragment in Listing 11-2 provides a quick modification to Listing 11-1 that tests for the availability of the BMI1 and BMI2 bit-manipulation instruction sets (insert the following code right before the allDone label in Listing 11-1).

; Test for extended bit manipulation instructions 
; (BMI1 and BMI2):

            and     ebx, 108h       ; Test bits 3 and 8
            cmp     ebx, 108h       ; Both must be set
            jne     Unsupported
            call    print
            byte    "CPU supports BMI1 & BMI2", nl, 0
            jmp     allDone 

Unsupported:
            call    print
            byte    "CPU does not support BMI1 & BMI2 "
            byte    "instructions", nl, 0

allDone:    leave
            pop     rbx
            ret     ; Returns to caller
asmMain     endp

Listing 11-2: Test for BMI1 and BMI2 instruction sets

Here’s the build command and program output on the Intel i7-3720QM CPU:

C:\>build listing11-2

C:\>echo off
 Assembling: listing11-2.asm
c.cpp

C:\>listing11-2
Calling Listing 11-2:
CPUID(0): Vendor ID='GenuineIntel', max feature=0dh
cpuid(1), ECX=ffba2203, EDX=1f8bfbff
cpuid(7), EBX=00000281, ECX=00000000
CPU does not support BMI1 & BMI2 instructions
Listing 11-2 terminated

Here’s the same program running on the iMac Pro (Intel Xeon W-2150B):

C:\>listing11-2
Calling Listing 11-2:
CPUID(0): Vendor ID='GenuineIntel', max feature=016h
cpuid(1), ECX=fffa3203, EDX=1f8bfbff
cpuid(7), EBX=d09f47bb, ECX=00000000
CPU supports BMI1 & BMI2
Listing 11-2 terminated

11.4 Full-Segment Syntax and Segment Alignment

As you will soon see, SSE and AVX memory data require alignment on 16-, 32-, and even 64-byte boundaries. Although you can use the align directive to align data (see “MASM Support for Data Alignment” in Chapter 3), it doesn’t work beyond 16-byte alignment when using the simplified segment directives presented thus far in this book. If you need alignment beyond 16 bytes, you have to use MASM full-segment declarations.

If you want to create a segment with complete control over segment attributes, you need to use the segment and ends directives.1 The generic syntax for a segment declaration is as follows:

segname  segment readonly alignment 'class'
         statements
segname  ends

segname is an identifier. This is the name of the segment (which must also appear before the closing ends directive). It need not be unique; you can have several segment declarations that share the same name. MASM will combine segments with the same name when emitting code to the object file. Avoid the segment names _TEXT, _DATA, _BSS, and _CONST, as MASM uses these names for the .code, .data, .data?, and .const directives, respectively.

The readonly option is either blank or the MASM-reserved word readonly. This is a hint to MASM that the segment will contain read-only (constant) data. If you attempt to (directly) store a value into a variable that you declare in a read-only segment, MASM will complain that you cannot modify a read-only segment.

The alignment option is also optional and allows you to specify one of the following options:

  • byte
  • word
  • dword
  • para
  • page
  • align(n) (n is a constant that must be a power of 2)

The alignment options tell MASM that the first byte emitted for this particular segment must appear at an address that is a multiple of the alignment option. The byte, word, and dword reserved words specify 1-, 2-, or 4-byte alignments. The para alignment option specifies paragraph alignment (16 bytes). The page alignment option specifies an address alignment of 256 bytes. Finally, the align(n) alignment option lets you specify any address alignment that is a power of 2 (1, 2, 4, 8, 16, 32, and so on).

The default segment alignment, if you don’t explicitly specify one, is paragraph alignment (16 bytes). This is also the default alignment for the simplified segment directives (.code, .data, .data?, and .const).

If you have some (SSE/AVX) data objects that must start at an address that is a multiple of 32 or 64 bytes, then creating a new data segment with 64-byte alignment is what you want. Here’s an example of such a segment:

dseg64  segment align(64)
obj64   oword   0, 1, 2, 3   ; Starts on 64-byte boundary
b       byte    0            ; Messes with alignment
        align   32           ; Sets alignment to 32 bytes
obj32   oword   0, 1         ; Starts on 32-byte boundary
dseg64  ends

The optional class field is a string (delimited by apostrophes or quotes) that is typically one of the following names: CODE, DATA, or CONST. Note that MASM and the Microsoft linker will combine segments that have the same class name even if their segment names are different.

This chapter presents examples of these segment declarations as they are needed.

11.5 SSE, AVX, and AVX2 Memory Operand Alignment

SSE and AVX instructions typically allow access to a variety of memory operand sizes. The so-called scalar instructions, which operate on single data elements, can access byte-, word-, dword-, and qword-sized memory operands. In many respects, these types of memory accesses are similar to memory accesses by the non-SIMD instructions. The SSE, AVX, and AVX2 instruction set extensions also access packed or vector operands in memory. Unlike with the scalar memory operands, stringent rules limit the access of packed memory operands. This section discusses those rules.

The SSE instructions can access up to 128 bits of memory (16 bytes) with a single instruction. Most multi-operand SSE instructions can specify an XMM register or a 128-bit memory operand as their source (second) operand. As a general rule, these memory operands must appear on a 16-byte-aligned address in memory (that is, the LO 4 bits of the memory address must contain 0s).

Because segments have a default alignment of para (16 bytes), you can easily ensure that any 16-byte packed data objects are 16-byte-aligned by using the align directive:

align 16

MASM will report an error if you attempt to use align 16 in a segment you’ve defined with the byte, word, or dword alignment type. It will work properly with para, page, or any align(n) option where n is greater than or equal to 16.
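
For example, the following sketch (variable name invented) aligns a packed single-precision vector inside the default .data segment:

            .data
            align   16
vec4        real4   1.0, 2.0, 3.0, 4.0   ; 16-byte-aligned; usable by movaps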

If you are using AVX instructions to access 256-bit (32-byte) memory operands, you must ensure that those memory operands begin on a 32-byte address boundary. Unfortunately, align 32 won’t work, because the default segment alignment is para (16-byte) alignment, and the segment’s alignment must be greater than or equal to the operand field of any align directives appearing within that segment. Therefore, to be able to define 256-bit variables usable by AVX instructions, you must explicitly define a (data) segment that is aligned on a (minimum) 32-byte boundary, such as the following:

avxData    segment  align(32)
           align    32    ; This is actually redundant here
someData   oword    0, 1  ; 256 bits of data
             .
             .
             .
avxData    ends

Though it’s somewhat redundant to say this, it’s so important it’s worth repeating:

Almost all AVX/AVX2 instructions will generate an alignment fault if you attempt to access a 256-bit object at an address that is not 32-byte-aligned. Always ensure that your AVX packed operands are properly aligned.

If you are using the AVX-512 extended instructions with 512-bit memory operands, you must ensure that those operands appear at an address in memory that is a multiple of 64 bytes. As with the AVX instructions, you will have to define a segment that has an alignment greater than or equal to 64 bytes, such as this:

avx512Data segment  align(64)
someData   oword    0, 1, 2, 3  ; 512 bits of data
             .
             .
             .
avx512Data ends

Forgive the redundancy, but it’s important to remember:

Almost all AVX-512 instructions will generate an alignment fault if you attempt to access a 512-bit object at an address that is not 64-byte-aligned. Always ensure that your AVX-512 packed operands are properly aligned.

If you’re using SSE, AVX, and AVX-512 data types in the same application, you can create a single data segment to hold all these data values by using a 64-byte alignment option for the single section, instead of a segment for each data type size. Remember, the segment’s alignment has to be greater than or equal to the alignment required by the specific data type. Therefore, a 64-byte alignment will work fine for SSE and AVX/AVX2 variables, as well as AVX-512 variables:

SIMDData   segment  align(64)
sseData    oword    0    ; 64-byte-aligned is also 16-byte-aligned
           align    32   ; Alignment for AVX data
avxData    oword    0, 1 ; 32 bytes of data aligned on 32 bytes
           align    64
avx2Data   oword    0, 1, 2, 3  ; 64 bytes of data
             .
             .
             .
SIMDData   ends

If you specify an alignment option that is much larger than you need (such as 256-byte page alignment), you might unnecessarily waste memory.

The align directive works well when your SSE, AVX, and AVX2 data values are static or global variables. What happens when you want to create local variables on the stack or dynamic variables on the heap? Even if your program adheres to the Microsoft ABI, you’re guaranteed only 16-byte alignment on the stack upon entry to your program (or to a procedure). Similarly, depending on your heap management functions, there is no guarantee that a malloc (or similar) function returns an address that is properly aligned for SSE, AVX, or AVX2 data objects.

Inside a procedure, you can allocate storage for a 16-, 32-, or 64-byte-aligned variable by over-allocating the storage, adding the alignment minus 1 (15, 31, or 63) to the allocated address, and then using the and instruction to zero out the LO bits of the address (4 bits for 16-byte-aligned objects, 5 bits for 32-byte-aligned objects, and 6 bits for 64-byte-aligned objects). Then you reference the object by using this pointer. The following sample code demonstrates how to do this:

sseproc     proc
sseptr      equ     <[rbp - 8]>
avxptr      equ     <[rbp - 16]>
avx2ptr     equ     <[rbp - 24]>
            push    rbp
            mov     rbp, rsp
            sub     rsp, 208        ; 112 data bytes + up to 63
                                    ; alignment bytes + 24 pointer
                                    ; bytes, rounded up

; Load RAX with the address 63 bytes
; above the current stack pointer. The
; first 64-byte-aligned address will be
; somewhere between RSP and RSP + 63.

            lea     rax, [rsp + 63]

; Mask out the LO 6 bits of RAX. This
; generates an address in RAX that is
; aligned on a 64-byte boundary and is
; between RSP and RSP + 63:

            and     rax, -64 ; 0FFFF...FC0h

; Save this 64-byte-aligned address as
; the pointer to the AVX2 data:

            mov     avx2ptr, rax

; Add 64 to AVX2's address. This skips
; over AVX2's data. The address is also
; 64-byte-aligned (which means it is
; also 32-byte-aligned). Use this as
; the address of AVX's data:

            add     rax, 64
            mov     avxptr, rax

; Add 32 to AVX's address. This skips
; over AVX's data. The address is also
; 32-byte-aligned (which means it is
; also 16-byte-aligned). Use this as
; the address of SSE's data:

            add     rax, 32
            mov     sseptr, rax
             .
             . Code that accesses the
             . AVX2, AVX, and SSE data
             . areas using avx2ptr,
             . avxptr, and sseptr

            leave
            ret
sseproc     endp

For data you allocate on the heap, you do the same thing: allocate extra storage (the object’s size plus the alignment minus 1), add the alignment minus 1 (63, 31, or 15) to the address the allocator returns, and then mask the newly formed address with -64, -32, or -16 to produce a 64-, 32-, or 16-byte-aligned object, respectively.
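
Here’s a minimal sketch of the heap case, assuming the C runtime’s malloc (called per the Microsoft ABI) and a 64-byte object that must be 64-byte-aligned; keep malloc’s original return value around if you intend to free the storage later:

            externdef malloc:proc
              .
              .
              .
            sub     rsp, 40         ; Shadow space (+8 to realign stack)
            mov     rcx, 64 + 63    ; Object size plus worst-case slop
            call    malloc          ; Raw heap pointer returned in RAX
            add     rsp, 40
            add     rax, 63         ; Skip past any misalignment
            and     rax, -64        ; Round down to a 64-byte boundary

; RAX now holds the address of 64 bytes of
; 64-byte-aligned heap storage.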

11.6 SIMD Data Movement Instructions

The x86-64 CPUs provide a variety of data move instructions that copy data between (SSE/AVX) registers, load registers from memory, and store register values to memory. The following subsections describe each of these instructions.

11.6.1 The (v)movd and (v)movq Instructions

For the SSE instruction set, the movd (move dword) and movq (move qword) instructions copy the value from a 32- or 64-bit general-purpose register or memory location into the LO dword or qword of an XMM register:2

movd xmmn, reg32/mem32
movq xmmn, reg64/mem64

These instructions zero-extend the value through the remaining HO bits of the XMM register, as shown in Figures 11-7 and 11-8.

Figure 11-7: Moving a 32-bit value from memory to an XMM register (with zero extension)

Figure 11-8: Moving a 64-bit value from memory to an XMM register (with zero extension)

The following instructions store the LO 32 or 64 bits of an XMM register into a dword or qword memory location or general-purpose register:

movd reg32/mem32, xmmn
movq reg64/mem64, xmmn

The movq instruction also allows you to copy data from the LO qword of one XMM register to another, but for whatever reason, the movd instruction does not allow two XMM register operands:

movq xmmn, xmmn

For the AVX instructions, you use the following instructions:3

vmovd xmmn, reg32/mem32
vmovd reg32/mem32, xmmn
vmovq xmmn, reg64/mem64
vmovq reg64/mem64, xmmn

The instructions with the XMM destination operands also zero-extend their values into the HO bits (up to bit 255, unlike the standard SSE instructions that do not modify the upper bits of the YMM registers).

Because the movd and movq instructions access 32- and 64-bit values in memory (rather than 128-, 256-, or 512-bit values), these instructions do not require their memory operands to be 16-, 32-, or 64-byte-aligned. Of course, the instructions may execute faster if their operands are dword (movd) or qword (movq) aligned in memory.
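
For example (hypothetical variables declared in a data segment):

i32         dword   12345678h
i64         qword   1122334455667788h
dest        qword   ?
              .
              .
              .
            movd    xmm0, i32       ; XMM0[31:0] = i32, HO bits cleared
            movq    xmm1, i64       ; XMM1[63:0] = i64, HO bits cleared
            movq    xmm2, xmm1      ; Register-to-register form (movq only)
            movq    dest, xmm1      ; Store the LO qword of XMM1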

11.6.2 The (v)movaps, (v)movapd, and (v)movdqa Instructions

The movaps (move aligned, packed single), movapd (move aligned, packed double), and movdqa (move double quad-word aligned) instructions move 16 bytes of data between memory and an XMM register or between two XMM registers. The AVX versions (with the v prefix) move 16 or 32 bytes between memory and an XMM or a YMM register or between two XMM or YMM registers (moves involving XMM registers zero out the HO bits of the corresponding YMM register). The memory locations must be aligned on a 16-byte or 32-byte boundary (respectively), or the CPU will generate an unaligned access fault.

All three mov* instructions load 16 bytes into an XMM register and are, in theory, interchangeable. In practice, Intel may optimize the operations for the type of data they move (single-precision floating-point values, double-precision floating-point values, or integer values), so it’s always a good idea to choose the appropriate instruction for the data type you are using (see “Performance Issues and the SIMD Move Instructions” on page 622 for an explanation). Likewise, all three vmov* instructions load 16 or 32 bytes into an XMM or a YMM register and are interchangeable.

These instructions take the following forms:

movaps xmmn, mem128     vmovaps xmmn, mem128     vmovaps ymmn, mem256
movaps mem128, xmmn     vmovaps mem128, xmmn     vmovaps mem256, ymmn
movaps xmmn, xmmn       vmovaps xmmn, xmmn       vmovaps ymmn, ymmn
movapd xmmn, mem128     vmovapd xmmn, mem128     vmovapd ymmn, mem256
movapd mem128, xmmn     vmovapd mem128, xmmn     vmovapd mem256, ymmn
movapd xmmn, xmmn       vmovapd xmmn, xmmn       vmovapd ymmn, ymmn
movdqa xmmn, mem128     vmovdqa xmmn, mem128     vmovdqa ymmn, mem256
movdqa mem128, xmmn     vmovdqa mem128, xmmn     vmovdqa mem256, ymmn
movdqa xmmn, xmmn       vmovdqa xmmn, xmmn       vmovdqa ymmn, ymmn

The mem128 operand should be a vector (array) of four single-precision floating-point values for the (v)movaps instruction; it should be a vector of two double-precision floating-point values for the (v)movapd instruction; it should be a 16-byte value (16 bytes, 8 words, 4 dwords, or 2 qwords) when using the (v)movdqa instruction. If you cannot guarantee that the operands are aligned on a 16-byte boundary, use the movups, movupd, or movdqu instructions, instead (see the next section).

The mem256 operand should be a vector (array) of eight single-precision floating-point values for the vmovaps instruction; it should be a vector of four double-precision floating-point values for the vmovapd instruction; it should be a 32-byte value (32 bytes, 16 words, 8 dwords, or 4 qwords) when using the vmovdqa instruction. If you cannot guarantee that the operands are 32-byte-aligned, use the vmovups, vmovupd, or vmovdqu instructions instead.

Although the physical machine instructions themselves don’t particularly care about the data type of the memory operands, MASM’s assembly syntax certainly does care. You will need to use operand type coercion if the instruction doesn’t match one of the following types:

  • The movaps instruction allows real4, dword, and oword operands.
  • The movapd instruction allows real8, qword, and oword operands.
  • The movdqa instruction allows only oword operands.
  • The vmovaps instruction allows real4, dword, and ymmword ptr operands (when using a YMM register).
  • The vmovapd instruction allows real8, qword, and ymmword ptr operands (when using a YMM register).
  • The vmovdqa instruction allows only ymmword ptr operands (when using a YMM register).

Often you will see memcpy (memory copy) functions use the (v)movapd instructions for very high-performance operations. See Agner Fog’s website at https://www.agner.org/optimize/ for more details.
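
For example, here’s a short, hypothetical sketch (segment and variable names are invented) that copies eight aligned single-precision values through YMM0:

vecData     segment align(32)
srcVec      real4   8 dup (1.0)     ; 32 bytes, starts 32-byte-aligned
dstVec      real4   8 dup (0.0)     ; Offset 32: also 32-byte-aligned
vecData     ends
              .
              .
              .
            vmovaps ymm0, ymmword ptr srcVec
            vmovaps ymmword ptr dstVec, ymm0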

11.6.3 The (v)movups, (v)movupd, and (v)movdqu Instructions

When you cannot guarantee that packed data memory operands lie on a 16- or 32-byte address boundary, you can use the (v)movups (move unaligned packed single-precision), (v)movupd (move unaligned packed double-precision), and (v)movdqu (move double quad-word unaligned) instructions to move data between XMM or YMM registers and memory.

As with the aligned moves, all the unaligned moves do the same thing: copy 16 (32) bytes of data to and from memory. The convention for the various data types is the same as for the aligned data movement instructions.
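
For example, this hypothetical fragment reads a packed value that is deliberately misaligned, so only the unaligned forms are safe:

uSeg        segment align(16)
pad         byte    4 dup (0)            ; Forces the misalignment below
uVec        real4   1.0, 2.0, 3.0, 4.0   ; Starts at offset 4: unaligned
uSeg        ends
              .
              .
              .
            movups  xmm0, xmmword ptr uVec   ; Legal: no alignment required
          ; movaps  xmm0, xmmword ptr uVec   ; Would fault at run time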

11.6.4 Performance of Aligned and Unaligned Moves

Listings 11-3 and 11-4 provide sample programs that demonstrate the performance of the mova* and movu* instructions using aligned and unaligned memory accesses.

; Listing 11-3
 
; Performance test for packed versus unpacked
; instructions. This program times aligned accesses.

        option  casemap:none

nl          =       10

            .const
ttlStr      byte    "Listing 11-3", 0

dseg        segment align(64) 'DATA'

; Aligned data types:

            align   64
alignedData byte    64 dup (0)
dseg        ends
            
            .code
            externdef printf:proc
            
; Return program title to C++ program:

            public  getTitle
getTitle    proc
            lea     rax, ttlStr
            ret
getTitle    endp

; Used for debugging:

print       proc

; Print code removed for brevity.
; See Listing 11-1 for actual code.

print       endp

; Here is the "asmMain" function.

            public  asmMain
asmMain     proc
            push    rbx
            push    rbp
            mov     rbp, rsp
            sub     rsp, 56         ; Shadow storage
            
            call    print
            byte    "Starting", nl, 0
            
            mov     rcx, 4000000000 ; 4,000,000,000
            lea     rdx, alignedData
            mov     rbx, 0
rptLp:      mov     rax, 15
rptLp2:     movaps  xmm0, xmmword ptr [rdx + rbx * 1]
            movapd  xmm0, real8 ptr   [rdx + rbx * 1]
            movdqa  xmm0, xmmword ptr [rdx + rbx * 1]
            vmovaps ymm0, ymmword ptr [rdx + rbx * 1]
            vmovapd ymm0, ymmword ptr [rdx + rbx * 1]
            vmovdqa ymm0, ymmword ptr [rdx + rbx * 1]
            vmovaps zmm0, zmmword ptr [rdx + rbx * 1]
            vmovapd zmm0, zmmword ptr [rdx + rbx * 1]
            
            dec     rax
            jns     rptLp2

            dec     rcx
            jnz     rptLp
            
            call    print
            byte    "Done", nl, 0
             
allDone:    leave
            pop     rbx
            ret     ; Returns to caller
asmMain     endp
            end

Listing 11-3: Aligned memory-access timing code

; Listing 11-4
 
; Performance test for packed versus unpacked
; instructions. This program times unaligned accesses. 

        option  casemap:none

nl          =       10

            .const
ttlStr      byte    "Listing 11-4", 0

dseg        segment align(64) 'DATA'

; Aligned data types:

            align   64
alignedData byte    80 dup (0)   ; 64 bytes + 15 so the unaligned
                                 ; 64-byte loads below stay in bounds
dseg        ends

            .code
            externdef printf:proc
            
; Return program title to C++ program:

            public  getTitle
getTitle    proc
            lea     rax, ttlStr
            ret
getTitle    endp

; Used for debugging:

print       proc

; Print code removed for brevity.
; See Listing 11-1 for actual code.

print       endp

; Here is the "asmMain" function.

            public  asmMain
asmMain     proc
            push    rbx
            push    rbp
            mov     rbp, rsp
            sub     rsp, 56         ; Shadow storage
            
            call    print
            byte    "Starting", nl, 0
            
            mov     rcx, 4000000000 ; 4,000,000,000
            lea     rdx, alignedData
rptLp:      mov     rbx, 15
rptLp2:
            movups  xmm0, xmmword ptr [rdx + rbx * 1]
            movupd  xmm0, real8 ptr   [rdx + rbx * 1]
            movdqu  xmm0, xmmword ptr [rdx + rbx * 1]
            vmovups ymm0, ymmword ptr [rdx + rbx * 1]
            vmovupd ymm0, ymmword ptr [rdx + rbx * 1]
            vmovdqu ymm0, ymmword ptr [rdx + rbx * 1]
            vmovups zmm0, zmmword ptr [rdx + rbx * 1]
            vmovupd zmm0, zmmword ptr [rdx + rbx * 1]
            dec     rbx
            jns     rptLp2

            dec     rcx
            jnz     rptLp
            
            call    print
            byte    "Done", nl, 0
             
allDone:    leave
            pop     rbx
            ret     ; Returns to caller
asmMain     endp
            end

Listing 11-4: Unaligned memory-access timing code

The code in Listing 11-3 took about 1 minute and 7 seconds to execute on a 3GHz Xeon W CPU. The code in Listing 11-4 took 1 minute and 55 seconds to execute on the same processor. As you can see, there is sometimes an advantage to accessing SIMD data on an aligned address boundary.

11.6.5 The (v)movlps and (v)movlpd Instructions

The (v)movl* instructions and (v)movh* instructions (from the next section) might look like normal move instructions. Their behavior is similar to many other SSE/AVX move instructions. However, they were designed to support packing and unpacking floating-point vectors. Specifically, these instructions allow you to merge two pairs of single-precision or a pair of double-precision floating-point operands from two different sources into a single XMM register.

The (v)movlps instructions use the following syntax:

movlps  xmmdest, mem64
movlps  mem64,  xmmsrc
vmovlps xmmdest, xmmsrc, mem64
vmovlps mem64,  xmmsrc

The movlps xmmdest, mem64 form copies a pair of single-precision floating-point values into the two LO 32-bit lanes of a destination XMM register, as shown in Figure 11-9. This instruction leaves the HO 64 bits unchanged.

Figure 11-9: movlps instruction

The movlps mem64, xmmsrc form copies the LO 64 bits (the two LO single-precision lanes) from the XMM source register to the specified memory location. Functionally, this is equivalent to the movq or movsd instructions (as it copies 64 bits to memory), though this instruction might be slightly faster if the LO 64 bits of the XMM register actually contain two single-precision values (see “Performance Issues and the SIMD Move Instructions” on page 622 for an explanation).

The vmovlps instruction has three operands: a destination XMM register, a source XMM register, and a source (64-bit) memory location. This instruction copies the two single-precision values from the memory location into the LO 64 bits of the destination XMM register. It copies the HO 64 bits of the source register (which also hold two single-precision values) into the HO 64 bits of the destination register. Figure 11-10 shows the operation. Note that this instruction merges the pair of operands with a single instruction.

Figure 11-10: vmovlps instruction

Like movsd, the movlpd (move low packed double) instruction copies the LO 64 bits (a double-precision floating-point value) of the source operand to the LO 64 bits of the destination operand. The difference is that the movlpd instruction doesn’t zero-extend the value when moving data from memory into an XMM register, whereas the movsd instruction will zero-extend the value into the upper 64 bits of the destination XMM register. (Neither the movsd nor movlpd will zero-extend when copying data between XMM registers; of course, zero extension doesn’t apply when storing data to memory.)4

11.6.6 The movhps and movhpd Instructions

The movhps and movhpd instructions move a 64-bit value (either two single-precision floats in the case of movhps, or a single double-precision value in the case of movhpd) into the HO quad word of a destination XMM register. Figure 11-11 shows the operation of the movhps instruction; Figure 11-12 shows the movhpd instruction.

Figure 11-11: movhps instruction

Figure 11-12: movhpd instruction

The movhps and movhpd instructions can also store the HO quad word of an XMM register into memory. The allowable syntax is shown here:

movhps xmmn, mem64
movhps mem64, xmmn
movhpd xmmn, mem64
movhpd mem64, xmmn

These instructions do not affect bits 128 to 255 of the YMM registers (if present on the CPU).

You would normally use a movlps instruction followed by a movhps instruction to load four single-precision floating-point values into an XMM register, taking the floating-point values from two different data sources (similarly, you could use the movlpd and movhpd instructions to load a pair of double-precision values into a single XMM register from different sources). Conversely, you could also use this instruction to split a vector result in half and store the two halves in different data streams. This is probably the intended purpose of this instruction. Of course, if you can use it for other purposes, have at it.

MASM (version 14.15.26730.0, at least) seems to require movhps operands to be a 64-bit data type and does not allow real4 operands.5 Therefore, you may have to explicitly coerce an array of two real4 values with qword ptr when using this instruction:

r4m         real4   1.0, 2.0, 3.0, 4.0
r8m         real8   1.0, 2.0
              .
              .
              .
            movhps  xmm0, qword ptr r4m
            movhpd  xmm0, r8m
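
Putting the low and high moves together, the following hypothetical fragment merges two pairs of single-precision values from separate sources into a single XMM register:

loPair      real4   1.0, 2.0
hiPair      real4   3.0, 4.0
              .
              .
              .
            movlps  xmm0, qword ptr loPair  ; Lanes 0 and 1
            movhps  xmm0, qword ptr hiPair  ; Lanes 2 and 3

; XMM0 now holds {1.0, 2.0, 3.0, 4.0}.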

11.6.7 The vmovhps and vmovhpd Instructions

Although the AVX instruction extensions provide vmovhps and vmovhpd instructions, they are not a simple extension of the SSE movhps and movhpd instructions. The syntax for these instructions is as follows:

vmovhps xmmdest, xmmsrc, mem64
vmovhps mem64,  xmmsrc
vmovhpd xmmdest, xmmsrc, mem64
vmovhpd mem64,  xmmsrc

The instructions that store data into a 64-bit memory location behave similarly to the movhps and movhpd instructions. The instructions that load data into an XMM register have two source operands. They load a full 128 bits (four single-precision values or two double-precision values) into the destination XMM register. The HO 64 bits come from the memory operand; the LO 64 bits come from the LO quad word of the source XMM register, as Figure 11-13 shows. These instructions also zero-extend the value into the upper 128 bits of the (overlaid) YMM register.

Figure 11-13: vmovhpd and vmovhps instructions

Unlike for the movhps instruction, MASM properly accepts real4 source operands for the vmovhps instruction:

r4m         real4   1.0, 2.0, 3.0, 4.0
r8m         real8   1.0, 2.0
              .
              .
              .
            vmovhps xmm0, xmm1, r4m
            vmovhpd xmm0, xmm1, r8m

11.6.8 The movlhps and vmovlhps Instructions

The movlhps instruction moves a pair of 32-bit single-precision floating-point values from the LO qword of the source XMM register into the HO 64 bits of a destination XMM register. It leaves the LO 64 bits of the destination register unchanged. If the destination register is on a CPU that supports 256-bit AVX registers, this instruction also leaves the HO 128 bits of the overlaid YMM register unchanged.

The syntax for these instructions is as follows:

movlhps  xmmdest, xmmsrc
vmovlhps xmmdest, xmmsrc1, xmmsrc2

You cannot use this instruction to move data between memory and an XMM register; it transfers data only between XMM registers. No double-precision version of this instruction exists.

The vmovlhps instruction is similar to movlhps, with the following differences:

  • vmovlhps requires three operands: two source XMM registers and a destination XMM register.
  • vmovlhps copies the LO quad word of the first source register into the LO quad word of the destination register.
  • vmovlhps copies the LO quad word of the second source register into bits 64 to 127 of the destination register.
  • vmovlhps zero-extends the result into the upper 128 bits of the overlaid YMM register.

There is no vmovlhpd instruction.
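
For example, assuming the single-precision lanes of XMM2 hold {a0, a1, x, x} and XMM3 holds {b0, b1, x, x}, either form below merges the two low pairs (both instructions produce the same value here):

            movlhps  xmm2, xmm3          ; XMM2 = {a0, a1, b0, b1}
            vmovlhps xmm0, xmm2, xmm3    ; XMM0 = {a0, a1, b0, b1}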

11.6.9 The movhlps and vmovhlps Instructions

The movhlps instruction has the following syntax:

movhlps xmmdest, xmmsrc

The movhlps instruction copies the pair of 32-bit single-precision floating-point values from the HO qword of the source operand to the LO qword of the destination register, leaving the HO 64 bits of the destination register unchanged (this is the converse of movlhps). This instruction copies data only between XMM registers; it does not allow a memory operand.

The vmovhlps instruction requires three XMM register operands; here is its syntax:

vmovhlps xmmdest, xmmsrc1, xmmsrc2

This instruction copies the HO 64 bits of the first source register into the HO 64 bits of the destination register, copies the HO 64 bits of the second source register into bits 0 to 63 of the destination register, and finally, zero-extends the result into the upper bits of the overlaid YMM register.

There are no movhlpd or vmovhlpd instructions.

11.6.10 The (v)movshdup and (v)movsldup Instructions

The movshdup instruction moves the two odd-index single-precision floating-point values from the source operand (memory or XMM register) and duplicates each element into the destination XMM register, as shown in Figure 11-14.

Figure 11-14: movshdup and vmovshdup instructions

This instruction ignores the single-precision floating-point values at even-lane indexes into the XMM register. The vmovshdup instruction works the same way but on YMM registers, copying four single-precision values rather than two (and, of course, zeroing the HO bits). The syntax for these instructions is shown here:

movshdup  xmmdest, mem128/xmmsrc
vmovshdup xmmdest, mem128/xmmsrc
vmovshdup ymmdest, mem256/ymmsrc

The movsldup instruction works just like the movshdup instruction, except it copies and duplicates the two single-precision values at even indexes in the source XMM register to the destination XMM register. Likewise, the vmovsldup instruction copies and duplicates the four single-precision values at even indexes in the source YMM register, as shown in Figure 11-15.

Figure 11-15: movsldup and vmovsldup instructions

The syntax is as follows:

movsldup  xmmdest, mem128/xmmsrc
vmovsldup xmmdest, mem128/xmmsrc
vmovsldup ymmdest, mem256/ymmsrc
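
For example, given a 16-byte-aligned array of four single-precision values (a hypothetical sketch):

            .data
            align    16
v4          real4    1.0, 2.0, 3.0, 4.0
              .
              .
              .
            movshdup xmm0, xmmword ptr v4   ; XMM0 = {2.0, 2.0, 4.0, 4.0}
            movsldup xmm1, xmmword ptr v4   ; XMM1 = {1.0, 1.0, 3.0, 3.0}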

11.6.11 The (v)movddup Instruction

The movddup instruction copies and duplicates a double-precision value from the LO 64 bits of an XMM register or a 64-bit memory location into the LO 64 bits of a destination XMM register; then it also duplicates this value into bits 64 to 127 of that same destination register, as shown in Figure 11-16.

Figure 11-16: movddup instruction behavior

This instruction does not disturb the HO 128 bits of a YMM register (if applicable). The syntax for this instruction is as follows:

movddup xmmdest, mem64/xmmsrc

The vmovddup instruction operates on an XMM or a YMM destination register, with an XMM source register or 64-bit memory location (128-bit form) or a YMM source register or 256-bit memory location (256-bit form). The 128-bit version works just like the movddup instruction except that it zeroes the HO bits of the destination YMM register. The 256-bit version copies the pair of double-precision values at even indexes (0 and 2) in the source value to their corresponding indexes in the destination YMM register and duplicates those values at the odd indexes in the destination, as Figure 11-17 shows.

Figure 11-17: vmovddup instruction behavior

Here is the syntax for this instruction:

movddup  xmmdest, mem64/xmmsrc
vmovddup xmmdest, mem64/xmmsrc
vmovddup ymmdest, mem256/ymmsrc
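
For example (a hypothetical sketch; depending on your MASM version, you may need a qword ptr coercion on the operand):

r8val       real8   2.5
              .
              .
              .
            movddup xmm0, r8val     ; XMM0 = {2.5, 2.5}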

11.6.12 The (v)lddqu Instruction

The (v)lddqu instruction is operationally identical to (v)movdqu. You can sometimes use this instruction to improve performance if the (memory) source operand is not aligned properly and crosses a cache line boundary in memory. For more details on this instruction and its performance limitations, refer to the Intel or AMD documentation (specifically, the optimization manuals).

These instructions always take the following form:

lddqu  xmmdest, mem128
vlddqu xmmdest, mem128
vlddqu ymmdest, mem256

11.6.13 Performance Issues and the SIMD Move Instructions

When you look at the SSE/AVX instructions’ semantics at the programming model level, you might question why certain instructions appear in the instruction set. For example, the movq, movsd, and movlps instructions can all load 64 bits from a memory location into the LO 64 bits of an XMM register. Why bother doing this? Why not have a single instruction that copies the 64 bits from a quad word in memory to the LO 64 bits of an XMM register (be it a 64-bit integer, a pair of 32-bit integers, a 64-bit double-precision floating-point value, or a pair of 32-bit single-precision floating-point values)? The answer lies in the term microarchitecture.

The x86-64 macroarchitecture is the programming model that a software engineer sees. In the macroarchitecture, an XMM register is a 128-bit resource that, at any given time, could hold a 128-bit array of bits (or an integer), a pair of 64-bit integer values, a pair of 64-bit double-precision floating-point values, a set of four single-precision floating-point values, a set of four double-word integers, eight words, or 16 bytes. All these data types overlay one another, just like the 8-, 16-, 32-, and 64-bit general-purpose registers overlay one another (this is known as aliasing). If you load two double-precision floating-point values into an XMM register and then modify the (integer) word at bit positions 0 to 15, you’re also changing those same bits (0 to 15) in the double-precision value in the LO qword of the XMM register. The semantics of the x86-64 programming model require this.

At the microarchitectural level, however, there is no requirement that the CPU use the same physical bits in the CPU for integer, single-precision, and double-precision values (even when they are aliased to the same register). The microarchitecture could set aside a separate set of bits to hold integers, single-precision, and double-precision values for a single register. So, for example, when you use the movq instruction to load 64 bits into an XMM register, that instruction might actually copy the bits into the underlying integer register (without affecting the single-precision or double-precision subregisters). Likewise, movlps would copy a pair of single-precision values into the single-precision register, and movsd would copy a double-precision value into the double-precision register (Figure 11-18). These separate subregisters (integer, single-precision, and double-precision) could be connected directly to the arithmetic or logical unit that handles their specific data types, making arithmetic and logical operations on those subregisters more efficient. As long as the data is sitting in the appropriate subregister, everything works smoothly.

Figure 11-18: Register aliasing at the microarchitectural level

However, what happens if you use movq to load a pair of single-precision floating-point values into an XMM register and then try to perform a single-precision vector operation on those two values? At the macroarchitectural level, the two single-precision values are sitting in the appropriate bit positions of the XMM register, so this has to be a legal operation. At the microarchitectural level, however, those two single-precision floating-point values are sitting in the integer subregister, not the single-precision subregister. The underlying microarchitecture has to note that the values are in the wrong subregister and move them to the appropriate (single-precision) subregister before performing the single-precision arithmetic or logical operation. This may introduce a slight delay (while the microarchitecture moves the data around), which is why you should always pick the appropriate move instructions for your data types.

11.6.14 Some Final Comments on the SIMD Move Instructions

The SIMD data movement instructions are a confusing bunch. Their syntax is inconsistent, many instructions duplicate the actions of other instructions, and they have some perplexing irregularity issues. Someone new to the x86-64 instruction set might ask, “Why was the instruction set designed this way?” Why, indeed?

The answer to that question is historical. The SIMD instructions did not exist on the earliest x86 CPUs. Intel added the MMX instruction set to the Pentium-series CPUs. At that time (the early 1990s), current technology allowed Intel to add only a few additional instructions, and the MMX registers were limited to 64 bits in size. Furthermore, software engineers and computer systems designers were only beginning to explore the multimedia capabilities of modern computers, so it wasn’t entirely clear which instructions (and data types) were necessary to support the type of software we see several decades later. As a result, the earliest SIMD instructions and data types were limited in scope.

As time passed, CPUs gained additional silicon resources, and software/systems engineers discovered new uses for computers (and new algorithms to run on those computers), so Intel (and AMD) responded by adding new SIMD instructions to support these more modern multimedia applications. The original MMX instructions, for example, supported only integer data types, so Intel added floating-point support in the SSE instruction set, because multimedia applications needed real data types. Then Intel extended the integer types from 64 bits to 128, 256, and even 512 bits. With each extension, Intel (and AMD) had to retain the older instruction set extensions in order to allow preexisting software to run on the new CPUs.

As a result, the newer instruction sets kept piling on new instructions that did the same work as the older ones (with some additional capabilities). This is why instructions like movaps and vmovaps have considerable overlap in their functionality. If the CPU resources had been available earlier (for example, to put 256-bit YMM registers on the CPU), there would have been almost no need for the movaps instruction—the vmovaps could have done all the work.6

In theory, we could create an architecturally elegant variant of the x86-64 by starting over from scratch and designing a minimal instruction set that handles all the activities of the current x86-64 without all the cruft and kludges present in the existing instruction set. However, such a CPU would lose the primary advantage of the x86-64: the ability to run decades of software written for the Intel architecture. The cost of being able to run all this old software is that assembly language programmers (and compiler writers) have to deal with all these irregularities in the instruction set.

11.7 The Shuffle and Unpack Instructions

The SSE/AVX shuffle and unpack instructions are variants of the move instructions. In addition to moving data around, these instructions can also rearrange the data appearing in different lanes of the XMM and YMM registers.

11.7.1 The (v)pshufb Instructions

The pshufb instruction was the first packed byte shuffle SIMD instruction (its original form operated on the 64-bit MMX registers). Because of its origin, its syntax and behavior are a bit different from the other shuffle instructions in the instruction set. The syntax is the following:

pshufb xmmdest, xmm/mem128

The first (destination) operand is an XMM register whose byte lanes pshufb will shuffle (rearrange). The second operand (either an XMM register or a 128-bit oword memory location) is an array of 16 byte values holding indexes that control the shuffle operation. If the second operand is a memory location, that oword value must be aligned on a 16-byte boundary.

Each byte (lane) in the second operand selects a value for the corresponding byte lane in the first operand, as shown in Figure 11-19.

Figure 11-19: Lane index correspondence for pshufb instruction

The 16-byte indexes in the second operand each take the form shown in Figure 11-20.

Figure 11-20: pshufb byte index

The pshufb instruction ignores bits 4 to 6 in an index byte. Bit 7 is the clear bit; if this bit contains a 1, the pshufb instruction ignores the lane index bits and stores a 0 into the corresponding byte in XMMdest. If the clear bit contains a 0, the pshufb instruction does a shuffle operation.

The pshufb shuffle operation takes place on a lane-by-lane basis. The instruction first makes a temporary copy of XMMdest. Then, for each index byte (whose HO bit is 0), pshufb copies the byte from the temporary-copy lane specified by the LO 4 bits of the index into the XMMdest lane corresponding to the index’s own lane, as shown in Figure 11-21. In this example, the index appearing in lane 6 contains the value 00000011b. This selects the value in lane 3 of the temporary (original XMMdest) value and copies it to lane 6 of XMMdest. The pshufb instruction repeats this operation for all 16 lanes.

Figure 11-21: Shuffle operation

The AVX instruction set extensions introduced the vpshufb instruction. Its syntax is the following:

vpshufb xmmdest, xmmsrc, xmmindex/mem128
vpshufb ymmdest, ymmsrc, ymmindex/mem256

The AVX variant adds a source register (rather than using XMMdest as both the source and destination registers), and, rather than creating a temporary copy of XMMdest prior to the operation and picking the values from that copy, the vpshufb instructions select the source bytes from the XMMsrc register. Other than that, and the fact that these instructions zero the HO bits of YMMdest, the 128-bit variant operates identically to the SSE pshufb instruction.

The AVX instruction allows you to specify 256-bit YMM registers in addition to 128-bit XMM registers.7
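
As a concrete, hypothetical example, the following fragment reverses the 16 bytes in XMM0 by using an index vector whose lane n selects source lane 15 - n:

            .data
            align   16
revIdx      byte    15, 14, 13, 12, 11, 10, 9, 8
            byte    7, 6, 5, 4, 3, 2, 1, 0
              .
              .
              .
            pshufb  xmm0, xmmword ptr revIdx  ; Reverse XMM0's bytes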

11.7.2 The (v)pshufd Instructions

The SSE extensions first introduced the pshufd instruction. The AVX extensions added the vpshufd instruction. These instructions shuffle dwords in XMM and YMM registers (not double-precision values) similarly to the (v)pshufb instructions. However, the shuffle index is specified differently from (v)pshufb. The syntax for the (v)pshufd instructions is as follows:

pshufd  xmmdest, xmmsrc/mem128, imm8
vpshufd xmmdest, xmmsrc/mem128, imm8
vpshufd ymmdest, ymmsrc/mem256, imm8

The first operand (XMMdest or YMMdest) is the destination operand where the shuffled values will be stored. The second operand is the source from which the instruction will select the double words to place in the destination register; as usual, if this is a memory operand, you must align it on the appropriate (16- or 32-byte) boundary. The third operand is an 8-bit immediate value that specifies the indexes for the double words to select from the source operand.

For the (v)pshufd instructions with an XMMdest operand, the imm8 operand has the encoding shown in Table 11-3. The value in bits 0 to 1 selects a particular dword from the source operand to place in dword 0 of the XMMdest operand. The value in bits 2 to 3 selects a dword from the source operand to place in dword 1 of the XMMdest operand. The value in bits 4 to 5 selects a dword from the source operand to place in dword 2 of the XMMdest operand. Finally, the value in bits 6 to 7 selects a dword from the source operand to place in dword 3 of the XMMdest operand.

Table 11-3: (v)pshufd imm8 Operand Values

Bit positions Destination lane
0 to 1 0
2 to 3 1
4 to 5 2
6 to 7 3
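
For example (a small sketch), the same instruction can broadcast or reverse dwords, depending on the index byte:

pshufd xmm0, xmm1, 0    ; 00 00 00 00b: broadcast dword 0 of XMM1 to all four lanes of XMM0
pshufd xmm0, xmm1, 1Bh  ; 1Bh = 00 01 10 11b: reverse the four dwords of XMM1 into XMM0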

The difference between the 128-bit pshufd and vpshufd instructions is that pshufd leaves the HO 128 bits of the underlying YMM register unchanged and vpshufd zeroes the HO 128 bits of the underlying YMM register.

The 256-bit variant of vpshufd (when using YMM registers as the source and destination operands) still uses an 8-bit immediate operand as the index value. Each 2-bit index value manipulates two dword values in the YMM registers. Bits 0 to 1 control dwords 0 and 4, bits 2 to 3 control dwords 1 and 5, bits 4 to 5 control dwords 2 and 6, and bits 6 to 7 control dwords 3 and 7, as shown in Table 11-4.

Table 11-4: Double-Word Transfers for vpshufd YMMdest, YMMsrc/memsrc, imm8

Index                YMMsrc/memsrc[index] copied into  YMMsrc/memsrc[index + 4] copied into
Bits 0 to 1 of imm8  YMMdest[0]                        YMMdest[4]
Bits 2 to 3 of imm8  YMMdest[1]                        YMMdest[5]
Bits 4 to 5 of imm8  YMMdest[2]                        YMMdest[6]
Bits 6 to 7 of imm8  YMMdest[3]                        YMMdest[7]

The 256-bit version is slightly less flexible, as it copies two dwords at a time rather than one. It processes the LO 128 bits exactly the same way as the 128-bit version of the instruction; it also copies the corresponding lanes in the upper 128 bits of the source to the YMM destination register by using the same shuffle pattern. Unfortunately, you can't independently control the HO and LO halves of the YMM register by using the vpshufd instruction. If you really need to shuffle dwords independently, you can use vpshufb with appropriate indexes that copy 4 bytes at a time (in place of a single dword).

11.7.3 The (v)pshuflw and (v)pshufhw Instructions

The pshuflw and vpshuflw and the pshufhw and vpshufhw instructions provide support for 16-bit word shuffles within an XMM or a YMM register. The syntax for these instructions is the following:

pshuflw  xmmdest, xmmsrc/mem128, imm8
pshufhw  xmmdest, xmmsrc/mem128, imm8

vpshuflw xmmdest, xmmsrc/mem128, imm8
vpshufhw xmmdest, xmmsrc/mem128, imm8

vpshuflw ymmdest, ymmsrc/mem256, imm8
vpshufhw ymmdest, ymmsrc/mem256, imm8

The 128-bit lw variants copy the HO 64 bits of the source operand to the same positions in the XMMdest operand. Then they use the index (imm8) operand to select word lanes 0 to 3 in the LO qword of the XMMsrc/mem128 operand to move to the LO 4 lanes of the destination operand. For example, if the LO 2 bits of imm8 are 10b, then the pshuflw instruction copies lane 2 from the source into lane 0 of the destination operand (Figure 11-22). Note that pshuflw does not modify the HO 128 bits of the overlaid YMM register, whereas vpshuflw zeroes those HO bits.

f11022

Figure 11-22: (v)pshuflw xmm, xmm/mem, imm8 operation

The 256-bit vpshuflw instruction (with a YMM destination register) performs the same low-word shuffle twice: once in the LO 128 bits and once in the HO 128 bits of the destination register and 256-bit source location, as shown in Figure 11-23. The index (imm8) selection is the same for the LO and HO 128 bits.

f11023

Figure 11-23: vpshuflw ymm, ymm/mem, imm8 operation

The 128-bit hw variants copy the LO 64 bits of the source operand to the same positions in the destination operand. Then they use the index operand to select words 4 to 7 (indexed as 0 to 3) in the 128-bit source operand to move to the HO four word lanes of the destination operand (Figure 11-24).

f11024

Figure 11-24: (v)pshufhw operation

The 256-bit vpshufhw instruction (with a YMM destination register) performs the same high-word shuffle twice: once in the HO 128 bits and once in the LO 128 bits of the destination register and 256-bit source location, as shown in Figure 11-25.

f11025

Figure 11-25: vpshufhw operation
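
For example (a minimal sketch):

pshuflw xmm0, xmm1, 1Bh ; 1Bh = 00 01 10 11b: reverse word lanes 0 to 3 of XMM1 into
                        ; XMM0's LO qword; XMM1's HO qword is copied through unchanged
pshufhw xmm0, xmm1, 1Bh ; same idea for word lanes 4 to 7; the LO qword copies through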

11.7.4 The shufps and shufpd Instructions

The shuffle instructions (shufps and shufpd) extract single- or double-precision values from the source operands and place them in specified positions in the destination operand. The third operand, an 8-bit immediate value, selects which values to extract from the source to move into the destination register. The syntax for these two instructions is as follows:

shufps xmmsrc1/dest, xmmsrc2/mem128, imm8
shufpd xmmsrc1/dest, xmmsrc2/mem128, imm8

For the shufps instruction, the third operand is an 8-bit immediate value that is actually a four-element array of 2-bit values.

imm8 bits 0 and 1 select a single-precision value from one of the four lanes in the XMMsrc1/dest operand to store into lane 0 of the destination operand. Bits 2 and 3 select a single-precision value from one of the four lanes in the XMMsrc1/dest operand to store into lane 1 of the destination operand (the destination operand is also XMMsrc1/dest).

imm8 bits 4 and 5 select a single-precision value from one of the four lanes in the XMMsrc2/memsrc2 operand to store into lane 2 of the destination operand. Bits 6 and 7 select a single-precision value from one of the four lanes in the XMMsrc2/memsrc2 operand to store into lane 3 of the destination operand.

Figure 11-26 shows the operation of the shufps instruction.

f11026

Figure 11-26: shufps operation

For example, the instruction

shufps xmm0, xmm1, 0E4h  ; 0E4h = 11 10 01 00

loads XMM0 with the following single-precision values:

  • XMM0[0 to 31] from XMM0[0 to 31]
  • XMM0[32 to 63] from XMM0[32 to 63]
  • XMM0[64 to 95] from XMM1[64 to 95]
  • XMM0[96 to 127] from XMM1[96 to 127]

If the second operand (XMMsrc2/memsrc2) is the same as the first operand (XMMsrc1/dest), it’s possible to rearrange the four single-precision values in the XMMdest register (which is probably the source of the instruction name shuffle).

The shufpd instruction works similarly, shuffling double-precision values. As there are only two double-precision values in an XMM register, it takes only a single bit to choose between the values. Likewise, as there are only two double-precision values in the destination register, the instruction requires only two (single-bit) array elements to choose the destination. As a result, the third operand, the imm8 value, is actually just a 2-bit value; the instruction ignores bits 2 to 7 in the imm8 operand. Bit 0 of the imm8 operand selects either lane 0 and bits 0 to 63 (if it is 0) or lane 1 and bits 64 to 127 (if it is 1) from the XMMsrc1/dest operand to place into lane 0 and bits 0 to 63 of XMMdest. Bit 1 of the imm8 operand selects either lane 0 and bits 0 to 63 (if it is 0) or lane 1 and bits 64 to 127 (if it is 1) from the XMMsrc/mem128 operand to place into lane 1 and bits 64 to 127 of XMMdest. Figure 11-27 shows this operation.

f11027

Figure 11-27: shufpd operation
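
Because shufpd needs only two index bits, swapping the two doubles in a register is a single instruction (a small sketch):

shufpd xmm0, xmm0, 1  ; imm8 = 01b: new lane 0 = old lane 1, new lane 1 = old lane 0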

11.7.5 The vshufps and vshufpd Instructions

The vshufps and vshufpd instructions are similar to shufps and shufpd. They allow you to shuffle the values in 128-bit XMM registers or 256-bit YMM registers.8 The vshufps and vshufpd instructions have four operands: a destination XMM or YMM register, two source operands (src1 must be an XMM or a YMM register, and src2 can be an XMM or a YMM register or a 128- or 256-bit memory location), and an imm8 operand. Their syntax is the following:

vshufps xmmdest, xmmsrc1, xmmsrc2/mem128, imm8
vshufpd xmmdest, xmmsrc1, xmmsrc2/mem128, imm8

vshufps ymmdest, ymmsrc1, ymmsrc2/mem256, imm8
vshufpd ymmdest, ymmsrc1, ymmsrc2/mem256, imm8

Whereas the SSE shuffle instructions use the destination register as an implicit source operand, the AVX shuffle instructions allow you to specify explicit destination and source operands (they can all be different, or all the same, or any combination thereof).

For the 256-bit vshufps instructions, the imm8 operand is an array of four 2-bit values (bits 0:1, 2:3, 4:5, and 6:7). These 2-bit values select one of four single-precision values from the source locations, as described in Table 11-5.

Table 11-5: vshufps Destination Selection

imm8 bits  Destination       imm8 value = 00    imm8 value = 01    imm8 value = 10     imm8 value = 11
0 to 1     Dest[0 to 31]     Src1[0 to 31]      Src1[32 to 63]     Src1[64 to 95]      Src1[96 to 127]
           Dest[128 to 159]  Src1[128 to 159]   Src1[160 to 191]   Src1[192 to 223]    Src1[224 to 255]
2 to 3     Dest[32 to 63]    Src1[0 to 31]      Src1[32 to 63]     Src1[64 to 95]      Src1[96 to 127]
           Dest[160 to 191]  Src1[128 to 159]   Src1[160 to 191]   Src1[192 to 223]    Src1[224 to 255]
4 to 5     Dest[64 to 95]    Src2[0 to 31]      Src2[32 to 63]     Src2[64 to 95]      Src2[96 to 127]
           Dest[192 to 223]  Src2[128 to 159]   Src2[160 to 191]   Src2[192 to 223]    Src2[224 to 255]
6 to 7     Dest[96 to 127]   Src2[0 to 31]      Src2[32 to 63]     Src2[64 to 95]      Src2[96 to 127]
           Dest[224 to 255]  Src2[128 to 159]   Src2[160 to 191]   Src2[192 to 223]    Src2[224 to 255]

If both source operands are the same, you can shuffle around the single-precision values in any order you choose (and if the destination and both source operands are the same, you can arbitrarily shuffle the dwords within that register).
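
For example (a hedged sketch), with all three register operands the same, vshufps performs an arbitrary in-place shuffle of the four single-precision values:

vshufps xmm0, xmm0, xmm0, 0B1h  ; 0B1h = 10 11 00 01b: swap lanes 0 and 1, and lanes 2 and 3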

The vshufps instruction also allows you to specify XMM and 128-bit memory operands. In this form, it behaves quite similarly to the shufps instruction, except that you get to specify two different 128-bit source operands (rather than only one), and it zeroes the HO 128 bits of the corresponding YMM register. This form is useful when the destination operand differs from the first source operand; if vshufps's first source operand is the same XMM register as the destination operand, use the shufps instruction instead, as its machine encoding is shorter.

The vshufpd instruction is an extension of shufpd to 256 bits (plus the addition of a second source operand). As there are four double-precision values present in a 256-bit YMM register, vshufpd needs 4 bits to select the source indexes (rather than the 2 bits that shufpd requires). Table 11-6 describes how vshufpd copies the data from the source operands to the destination operand.

Table 11-6: vshufpd Destination Selection

imm8 bit  Destination       imm8 bit = 0       imm8 bit = 1
0         Dest[0 to 63]     Src1[0 to 63]      Src1[64 to 127]
1         Dest[64 to 127]   Src2[0 to 63]      Src2[64 to 127]
2         Dest[128 to 191]  Src1[128 to 191]   Src1[192 to 255]
3         Dest[192 to 255]  Src2[128 to 191]   Src2[192 to 255]

Like the vshufps instruction, vshufpd also allows you to specify XMM registers if you want a three-operand version of shufpd.

11.7.6 The (v)unpcklps, (v)unpckhps, (v)unpcklpd, and (v)unpckhpd Instructions

The unpack (and merge) instructions are a simplified variant of the shuffle instructions. These instructions copy single- and double-precision values from fixed locations in their source operands and insert those values into fixed locations in the destination operand. They are, essentially, shuffle instructions without the imm8 operand and with fixed shuffle patterns.

The unpcklps and unpckhps instructions choose half their single-precision operands from one of two sources, merge these values (interleaving them), and then store the merged result into the destination operand (which is the same as the first source operand). The syntax for these two instructions is as follows:

unpcklps xmmdest, xmmsrc/mem128
unpckhps xmmdest, xmmsrc/mem128

The XMMdest operand serves as both the first source operand and the destination operand. The XMMsrc/mem128 operand is the second source operand.

The difference between the two is the way they select their source operands. The unpcklps instruction copies the two LO single-precision values from the source operand to bit positions 32 to 63 (dword 1) and 96 to 127 (dword 3). It leaves dword 0 in the destination operand alone and copies the value originally in dword 1 to dword 2 in the destination. Figure 11-28 diagrams this operation.

f11028

Figure 11-28: unpcklps instruction operation

The unpckhps instruction copies the two HO single-precision values from the two sources to the destination register, as shown in Figure 11-29.

f11029

Figure 11-29: unpckhps instruction operation
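
A small sketch shows the interleaving (lanes listed HO to LO):

; before: XMM0 = [a3, a2, a1, a0], XMM1 = [b3, b2, b1, b0]
unpcklps xmm0, xmm1  ; after: XMM0 = [b1, a1, b0, a0]
; starting from the same values, unpckhps xmm0, xmm1 yields [b3, a3, b2, a2]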

The unpcklpd and unpckhpd instructions do the same thing as unpcklps and unpckhps except, of course, they operate on double-precision values rather than single-precision values. Figures 11-30 and 11-31 show the operation of these two instructions.

f11030

Figure 11-30: unpcklpd instruction operation

f11031

Figure 11-31: unpckhpd instruction operation

The vunpcklps, vunpckhps, vunpcklpd, and vunpckhpd instructions have the following syntax:

vunpcklps xmmdest, xmmsrc1, xmmsrc2/mem128
vunpckhps xmmdest, xmmsrc1, xmmsrc2/mem128

vunpcklps ymmdest, ymmsrc1, ymmsrc2/mem256
vunpckhps ymmdest, ymmsrc1, ymmsrc2/mem256

They work similarly to the non-v variants, with a couple of differences:

  • The AVX variants support using the YMM registers as well as the XMM registers.
  • The AVX variants require three operands. The first (destination) and second (source1) operands must be XMM or YMM registers. The third (source2) operand can be an XMM or a YMM register or a 128- or 256-bit memory location. The two-operand form is just a special case of the three-operand form, where the first and second operands specify the same register name.
  • The 128-bit variants zero out the HO bits of the YMM register rather than leaving those bits unchanged.

Of course, the AVX instructions with the YMM registers interleave twice as many single- or double-precision values. The interleaving extension happens in the intuitive way, with vunpcklps (Figure 11-32):

  • The single-precision value in source1, bits 0 to 31, is written to bits 0 to 31 of the destination.
  • The single-precision value in source2, bits 0 to 31, is written to bits 32 to 63 of the destination.
  • The single-precision value in source1, bits 32 to 63, is written to bits 64 to 95 of the destination.
  • The single-precision value in source2, bits 32 to 63, is written to bits 96 to 127 of the destination.
  • The single-precision value in source1, bits 128 to 159, is written to bits 128 to 159 of the destination.
  • The single-precision value in source2, bits 128 to 159, is written to bits 160 to 191 of the destination.
  • The single-precision value in source1, bits 160 to 191, is written to bits 192 to 223 of the destination.
  • The single-precision value in source2, bits 160 to 191, is written to bits 224 to 255 of the destination.

f11032

Figure 11-32: vunpcklps instruction operation

The vunpckhps instruction (Figure 11-33) does the following:

  • The single-precision value in source1, bits 64 to 95, is written to bits 0 to 31 of the destination.
  • The single-precision value in source2, bits 64 to 95, is written to bits 32 to 63 of the destination.
  • The single-precision value in source1, bits 96 to 127, is written to bits 64 to 95 of the destination.
  • The single-precision value in source2, bits 96 to 127, is written to bits 96 to 127 of the destination.
  • The single-precision value in source1, bits 192 to 223, is written to bits 128 to 159 of the destination.
  • The single-precision value in source2, bits 192 to 223, is written to bits 160 to 191 of the destination.
  • The single-precision value in source1, bits 224 to 255, is written to bits 192 to 223 of the destination.
  • The single-precision value in source2, bits 224 to 255, is written to bits 224 to 255 of the destination.

f11033

Figure 11-33: vunpckhps instruction operation

Likewise, vunpcklpd and vunpckhpd move double-precision values.

11.7.7 The Integer Unpack Instructions

The punpck* instructions provide a set of integer unpack instructions to complement the floating-point variants. These instructions appear in Table 11-7.

Table 11-7: Integer Unpack Instructions

Instruction Description
punpcklbw Unpacks low bytes to words
punpckhbw Unpacks high bytes to words
punpcklwd Unpacks low words to dwords
punpckhwd Unpacks high words to dwords
punpckldq Unpacks low dwords to qwords
punpckhdq Unpacks high dwords to qwords
punpcklqdq Unpacks low qwords to owords (double qwords)
punpckhqdq Unpacks high qwords to owords (double qwords)

11.7.7.1 The punpck* Instructions

The punpck* instructions extract half the bytes, words, dwords, or qwords from two different sources and merge these values into a destination SSE register. The syntax for these instructions is shown here:

punpcklbw  xmmdest, xmmsrc
punpcklbw  xmmdest, memsrc
punpckhbw  xmmdest, xmmsrc
punpckhbw  xmmdest, memsrc
punpcklwd  xmmdest, xmmsrc
punpcklwd  xmmdest, memsrc
punpckhwd  xmmdest, xmmsrc
punpckhwd  xmmdest, memsrc
punpckldq  xmmdest, xmmsrc
punpckldq  xmmdest, memsrc
punpckhdq  xmmdest, xmmsrc
punpckhdq  xmmdest, memsrc
punpcklqdq xmmdest, xmmsrc
punpcklqdq xmmdest, memsrc
punpckhqdq xmmdest, xmmsrc
punpckhqdq xmmdest, memsrc

Figures 11-34 through 11-41 show the data transfers for each of these instructions.

f11034

Figure 11-34: punpcklbw instruction operation

f11035

Figure 11-35: punpckhbw operation

f11036

Figure 11-36: punpcklwd operation

f11037

Figure 11-37: punpckhwd operation

f11038

Figure 11-38: punpckldq operation

f11039

Figure 11-39: punpckhdq operation

f11040

Figure 11-40: punpcklqdq operation

f11041

Figure 11-41: punpckhqdq operation
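
One classic use of the integer unpacks (a minimal sketch) is widening unsigned values by interleaving them with zeros:

pxor      xmm1, xmm1  ; XMM1 = all 0s
punpcklbw xmm0, xmm1  ; zero-extend the eight LO bytes of XMM0 into eight words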

11.7.7.2 The vpunpck* SSE Instructions

The AVX vpunpck* instructions provide a set of AVX integer unpack instructions to complement the SSE variants. These instructions appear in Table 11-8.

Table 11-8: AVX Integer Unpack Instructions

Instruction Description
vpunpcklbw Unpacks low bytes to words
vpunpckhbw Unpacks high bytes to words
vpunpcklwd Unpacks low words to dwords
vpunpckhwd Unpacks high words to dwords
vpunpckldq Unpacks low dwords to qwords
vpunpckhdq Unpacks high dwords to qwords
vpunpcklqdq Unpacks low qwords to owords (double qwords)
vpunpckhqdq Unpacks high qwords to owords (double qwords)

The vpunpck* instructions extract half the bytes, words, dwords, or qwords from two different sources and merge these values into a destination AVX or SSE register. Here is the syntax for the SSE forms of these instructions:

vpunpcklbw  xmmdest, xmmsrc1, xmmsrc2/mem128
vpunpckhbw  xmmdest, xmmsrc1, xmmsrc2/mem128
vpunpcklwd  xmmdest, xmmsrc1, xmmsrc2/mem128
vpunpckhwd  xmmdest, xmmsrc1, xmmsrc2/mem128
vpunpckldq  xmmdest, xmmsrc1, xmmsrc2/mem128
vpunpckhdq  xmmdest, xmmsrc1, xmmsrc2/mem128
vpunpcklqdq xmmdest, xmmsrc1, xmmsrc2/mem128
vpunpckhqdq xmmdest, xmmsrc1, xmmsrc2/mem128

Functionally, the only difference between these AVX instructions (vpunpck*) and the SSE (punpck*) instructions is that the SSE variants leave the upper bits of the YMM AVX registers (bits 128 to 255) unchanged, whereas the AVX variants zero-extend the result to 256 bits. See Figures 11-34 through 11-41 for a description of the operation of these instructions.

11.7.7.3 The vpunpck* AVX Instructions

The AVX vpunpck* instructions also support the use of the AVX YMM registers, in which case the unpack and merge operation extends from 128 bits to 256 bits. The syntax for these instructions is as follows:

vpunpcklbw  ymmdest, ymmsrc1, ymmsrc2/mem256
vpunpckhbw  ymmdest, ymmsrc1, ymmsrc2/mem256
vpunpcklwd  ymmdest, ymmsrc1, ymmsrc2/mem256
vpunpckhwd  ymmdest, ymmsrc1, ymmsrc2/mem256
vpunpckldq  ymmdest, ymmsrc1, ymmsrc2/mem256
vpunpckhdq  ymmdest, ymmsrc1, ymmsrc2/mem256
vpunpcklqdq ymmdest, ymmsrc1, ymmsrc2/mem256
vpunpckhqdq ymmdest, ymmsrc1, ymmsrc2/mem256

11.7.8 The (v)pextrb, (v)pextrw, (v)pextrd, and (v)pextrq Instructions

The (v)pextrb, (v)pextrw, (v)pextrd, and (v)pextrq instructions extract a byte, word, dword, or qword from a 128-bit XMM register and copy this data to a general-purpose register or memory location. The syntax for these instructions is the following:

pextrb  reg32, xmmsrc, imm8   ; imm8 = 0 to 15
pextrb  reg64, xmmsrc, imm8   ; imm8 = 0 to 15
pextrb  mem8, xmmsrc, imm8    ; imm8 = 0 to 15
vpextrb reg32, xmmsrc, imm8  ; imm8 = 0 to 15
vpextrb reg64, xmmsrc, imm8  ; imm8 = 0 to 15
vpextrb mem8, xmmsrc, imm8   ; imm8 = 0 to 15

pextrw  reg32, xmmsrc, imm8  ; imm8 = 0 to 7
pextrw  reg64, xmmsrc, imm8  ; imm8 = 0 to 7
pextrw  mem16, xmmsrc, imm8  ; imm8 = 0 to 7
vpextrw reg32, xmmsrc, imm8  ; imm8 = 0 to 7
vpextrw reg64, xmmsrc, imm8  ; imm8 = 0 to 7
vpextrw mem16, xmmsrc, imm8  ; imm8 = 0 to 7

pextrd  reg32, xmmsrc, imm8  ; imm8 = 0 to 3
pextrd  mem32, xmmsrc, imm8  ; imm8 = 0 to 3
vpextrd reg32, xmmsrc, imm8  ; imm8 = 0 to 3
vpextrd mem32, xmmsrc, imm8  ; imm8 = 0 to 3

pextrq  reg64, xmmsrc, imm8  ; imm8 = 0 to 1
pextrq  mem64, xmmsrc, imm8  ; imm8 = 0 to 1
vpextrq reg64, xmmsrc, imm8  ; imm8 = 0 to 1
vpextrq mem64, xmmsrc, imm8  ; imm8 = 0 to 1

The byte and word instructions expect a 32- or 64-bit general-purpose register as their destination (first operand) or a memory location that is the same size as the instruction (that is, pextrb expects a byte-sized memory operand, pextrw expects a word-sized operand, and so on). The source (second) operand is a 128-bit XMM register. The index (third) operand is an 8-bit immediate value that specifies an index (lane number). These instructions fetch the byte, word, dword, or qword in the lane specified by the 8-bit immediate value and copy that value into the destination operand. The double-word and quad-word variants require a 32-bit or 64-bit general-purpose register, respectively. If the destination operand is a 32- or 64-bit general-purpose register, the instruction zero-extends the value to 32 or 64 bits, if necessary.
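
For example (a small sketch):

pextrb eax, xmm0, 15  ; EAX = byte lane 15 of XMM0, zero-extended to 32 bits
pextrq rax, xmm0, 1   ; RAX = qword lane 1 (the HO 64 bits) of XMM0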

11.7.9 The (v)pinsrb, (v)pinsrw, (v)pinsrd, and (v)pinsrq Instructions

The (v)pinsr{b,w,d,q} instructions take a byte, word, dword, or qword from a general-purpose register or memory location and store that data to a lane of an XMM register. The syntax for these instructions is the following:9

pinsrb  xmmdest, reg32, imm8          ; imm8 = 0 to 15
pinsrb  xmmdest, mem8, imm8           ; imm8 = 0 to 15
vpinsrb xmmdest, xmmsrc2, reg32, imm8   ; imm8 = 0 to 15
vpinsrb xmmdest, xmmsrc2, mem8, imm8    ; imm8 = 0 to 15

pinsrw  xmmdest, reg32, imm8          ; imm8 = 0 to 7
pinsrw  xmmdest, mem16, imm8          ; imm8 = 0 to 7
vpinsrw xmmdest, xmmsrc2, reg32, imm8  ; imm8 = 0 to 7
vpinsrw xmmdest, xmmsrc2, mem16, imm8  ; imm8 = 0 to 7

pinsrd  xmmdest, reg32, imm8          ; imm8 = 0 to 3
pinsrd  xmmdest, mem32, imm8          ; imm8 = 0 to 3
vpinsrd xmmdest, xmmsrc2, reg32, imm8  ; imm8 = 0 to 3
vpinsrd xmmdest, xmmsrc2, mem32, imm8  ; imm8 = 0 to 3

pinsrq  xmmdest, reg64, imm8          ; imm8 = 0 to 1
pinsrq  xmmdest, mem64, imm8          ; imm8 = 0 to 1
vpinsrq xmmdest, xmmsrc2, reg64, imm8  ; imm8 = 0 to 1
vpinsrq xmmdest, xmmsrc2, mem64, imm8  ; imm8 = 0 to 1

The destination (first) operand is a 128-bit XMM register. The pinsr* instructions expect a memory location or a 32-bit general-purpose register as their source (second) operand (except the pinsrq instructions, which require a 64-bit register). The index (third) operand is an 8-bit immediate value that specifies an index (lane number).

These instructions fetch a byte, word, dword, or qword from the general-purpose register or memory location and copy that to the lane in the XMM register specified by the 8-bit immediate value. The pinsr{b,w,d,q} instructions leave any HO bits in the underlying YMM register unchanged (if applicable).

The vpinsr{b,w,d,q} instructions copy the data from the XMM source register into the destination register and then copy the byte, word, dword, or qword into the lane specified by the imm8 operand. These instructions zero the HO bits of the underlying YMM register.
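
For example (a small sketch):

pinsrw  xmm0, eax, 5        ; copy AX into word lane 5 of XMM0 (other lanes unchanged)
vpinsrd xmm2, xmm1, edx, 3  ; XMM2 = XMM1 with EDX written into dword lane 3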

11.7.10 The (v)extractps and (v)insertps Instructions

The extractps and vextractps instructions are functionally equivalent to pextrd and vpextrd. They extract a 32-bit (single-precision floating-point) value from an XMM register and move it into a 32-bit general-purpose register or a 32-bit memory location. The syntax for the (v)extractps instructions is shown here:

extractps  reg32, xmmsrc, imm8
extractps  mem32, xmmsrc, imm8
vextractps reg32, xmmsrc, imm8
vextractps mem32, xmmsrc, imm8

The insertps and vinsertps instructions insert a 32-bit floating-point value into an XMM register and, optionally, zero out other lanes in the XMM register. The syntax for these instructions is as follows:

insertps  xmmdest, xmmsrc, imm8
insertps  xmmdest, mem32, imm8
vinsertps xmmdest, xmmsrc1, xmmsrc2, imm8
vinsertps xmmdest, xmmsrc1, mem32, imm8

For the insertps and vinsertps instructions, the imm8 operand has the fields listed in Table 11-9.

Table 11-9: imm8 Bit Fields for insertps and vinsertps Instructions

Bit(s) Meaning
6 to 7 (Only if the source operand is an XMM register): Selects the 32-bit lane from the source XMM register (0, 1, 2, or 3). If the source operand is a 32-bit memory location, the instruction ignores this field and uses the full 32 bits from memory.
4 to 5 Specifies the lane in the destination XMM register in which to store the single-precision value.
3 If set, zeroes lane 3 of XMMdest.
2 If set, zeroes lane 2 of XMMdest.
1 If set, zeroes lane 1 of XMMdest.
0 If set, zeroes lane 0 of XMMdest.

On CPUs with the AVX extensions, insertps does not modify the upper bits of the YMM registers; vinsertps zeroes the upper bits.

The vinsertps instruction first copies the XMMsrc1 register to XMMdest before performing the insertion operation. The HO bits of the corresponding YMM register are set to 0.

The x86-64 does not provide (v)extractpd or (v)insertpd instructions.

11.8 SIMD Arithmetic and Logical Operations

The SSE and AVX instruction set extensions provide a variety of scalar and vector arithmetic and logical operations.

“SSE Floating-Point Arithmetic” in Chapter 6 has already covered floating-point arithmetic using the scalar SSE instruction set, so this section does not repeat that discussion. Instead, this section covers the vector (or packed) arithmetic and logical instructions.

The vector instructions perform multiple operations in parallel on the different data lanes in an SSE or AVX register. Given two source operands, a typical SSE instruction will calculate two double-precision floating-point results, two quad-word integer calculations, four single-precision floating-point operations, four double-word integer calculations, eight word integer calculations, or sixteen byte calculations, simultaneously. The AVX registers (YMM) double the number of lanes and therefore double the number of concurrent calculations.

Figure 11-42 shows how the SSE and AVX instructions perform concurrent calculations; a value is taken from the same lane in two source locations, the calculation is performed, and the instruction stores the result to the same lane in the destination location. This process happens simultaneously for each lane in the source and destination operands. For example, if a pair of XMM registers contains four single-precision floating-point values, a SIMD packed floating-point addition instruction would add the single-precision values in the corresponding lanes of the source operands and store the single-precision sums into the corresponding lanes of the destination XMM register.

f11042

Figure 11-42: SIMD concurrent arithmetic and logical operations

Certain operations (for example, logical AND, ANDN, OR, and XOR) don't have to be broken into lanes, because those operations produce the same result regardless of the lane size; in effect, the lane size is a single bit. Therefore, the corresponding SSE/AVX instructions operate on their entire operands without regard for lane size.

11.9 The SIMD Logical (Bitwise) Instructions

The SSE and AVX instruction set extensions provide the logical operations shown in Table 11-10 (using C/C++ bitwise operator syntax).

Table 11-10: SSE/AVX Logical Instructions

Operation Description
andpd dest = dest & source (128-bit operands)
vandpd dest = source1 & source2 (128-bit or 256-bit operands)
andnpd dest = ~dest & source (128-bit operands)
vandnpd dest = ~source1 & source2 (128-bit or 256-bit operands)
orpd dest = dest | source (128-bit operands)
vorpd dest = source1 | source2 (128-bit or 256-bit operands)
xorpd dest = dest ^ source (128-bit operands)
vxorpd dest = source1 ^ source2 (128-bit or 256-bit operands)

The syntax for these instructions is the following:

andpd   xmmdest, xmmsrc/mem128
vandpd  xmmdest, xmmsrc1, xmmsrc2/mem128
vandpd  ymmdest, ymmsrc1, ymmsrc2/mem256

andnpd  xmmdest, xmmsrc/mem128
vandnpd xmmdest, xmmsrc1, xmmsrc2/mem128
vandnpd ymmdest, ymmsrc1, ymmsrc2/mem256

orpd    xmmdest, xmmsrc/mem128
vorpd   xmmdest, xmmsrc1, xmmsrc2/mem128
vorpd   ymmdest, ymmsrc1, ymmsrc2/mem256

xorpd   xmmdest, xmmsrc/mem128
vxorpd  xmmdest, xmmsrc1, xmmsrc2/mem128
vxorpd  ymmdest, ymmsrc1, ymmsrc2/mem256

The SSE instructions (without the v prefix) leave the HO bits of the underlying YMM register unchanged (if applicable). The AVX instructions (with the v prefix) that have 128-bit operands will zero-extend their result into the HO bits of the YMM register.

If the (second) source operand is a memory location, it must be aligned on an appropriate boundary (for example, 16 bytes for mem128 values and 32 bytes for mem256 values). Failure to do so will result in a runtime memory alignment fault.
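
These bitwise instructions are especially useful for manipulating floating-point sign bits. The following minimal sketch (the signMask constant is this example's assumption) negates both double-precision values in XMM0:

         .const
         align    16                ; xorpd memory operands must be 16-byte aligned
signMask qword    8000000000000000h, 8000000000000000h

         .code
         ; ...
         xorpd    xmm0, xmmword ptr signMask  ; flip the sign bit of each double in XMM0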

11.9.1 The (v)ptest Instructions

The ptest instruction (packed test) is similar to the standard integer test instruction. The ptest instruction performs a logical AND between the two operands and sets the zero flag if the result is 0. The ptest instruction sets the carry flag if the logical AND of the second operand with the inverted bits of the first operand produces 0. The ptest instruction supports the following syntax:

ptest  xmmsrc1, xmmsrc2/mem128
vptest xmmsrc1, xmmsrc2/mem128
vptest ymmsrc1, ymmsrc2/mem256

11.9.2 The Byte Shift Instructions

The SSE and AVX instruction set extensions also support a set of logical and arithmetic shift instructions. The first two to consider are pslldq and psrldq. Although they begin with a p, suggesting they are packed (vector) instructions, these instructions really are just 128-bit logical shift-left and shift-right instructions. Their syntax is as follows:

pslldq  xmmdest, imm8
vpslldq xmmdest, xmmsrc, imm8
vpslldq ymmdest, ymmsrc, imm8
psrldq  xmmdest, imm8
vpsrldq xmmdest, xmmsrc, imm8
vpsrldq ymmdest, ymmsrc, imm8

The pslldq instruction shifts its destination XMM register to the left by the number of bytes specified by the imm8 operand. This instruction shifts 0s into the vacated LO bytes.

The vpslldq instruction takes the value in the source register (XMM or YMM), shifts that value to the left by imm8 bytes, and then stores the result into the destination register. For the 128-bit variant, this instruction zero-extends the result into bits 128 to 255 of the underlying YMM register (on AVX-capable CPUs).

The psrldq and vpsrldq instructions operate similarly to (v)pslldq except, of course, they shift their operands to the right rather than to the left. These are logical shift-right operations, so they shift 0s into the HO bytes of their operand, and bits shifted out of bit 0 are lost.

The pslldq and psrldq instructions shift bytes rather than bits. This is often exactly what you want: many SSE instructions produce byte masks of 0 or 0FFh to represent Boolean results, and shifting whole bytes at a time moves the equivalent of one bit in such a mask.
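
For example (a small sketch), if XMM0 holds sixteen 0/0FFh byte masks, moving every mask up one Boolean position is a 1-byte shift:

pslldq xmm0, 1  ; shift the byte masks up one lane; a 0 byte shifts into lane 0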

11.9.3 The Bit Shift Instructions

The SSE/AVX instruction set extensions also provide vector bit shift operations that work concurrently on two or more integer lanes. These instructions provide word, dword, and qword variants of the logical shift-left, logical shift-right, and arithmetic shift-right operations, using the syntax

shift  xmmdest, imm8
shift  xmmdest, xmmsrc/mem128
vshift xmmdest, xmmsrc, imm8
vshift xmmdest, xmmsrc, mem128
vshift ymmdest, ymmsrc, imm8
vshift ymmdest, ymmsrc, xmm/mem128

where shift = psllw, pslld, psllq, psrlw, psrld, psrlq, psraw, or psrad, and vshift = vpsllw, vpslld, vpsllq, vpsrlw, vpsrld, vpsrlq, vpsraw, vpsrad, or vpsraq.

The (v)psl* instructions shift their operands to the left; the (v)psr* instructions shift their operands to the right. The (v)psll* and (v)psrl* instructions are logical shift instructions and shift 0s into the bits vacated by the shift. Any bits shifted out of the operand are lost. The (v)psra* instructions are arithmetic shift-right instructions. They replicate the HO bit in each lane when shifting that lane’s bits to the right; all bits shifted out of the LO bit are lost.

The SSE two-operand instructions treat their first operand as both the source and destination operand. The second operand specifies the number of bits to shift (either an 8-bit immediate constant or a value held in an XMM register or a 128-bit memory location). If the count exceeds the number of bits in a lane (15, 31, or 63, depending on the lane size), the logical shifts set the destination lanes to all 0s, and the arithmetic right shifts fill each lane with copies of its sign bit.

The AVX three-operand instructions specify a separate source and destination register for the shift operation. These instructions take the value from the source register, shift it the specified number of bits, and store the shifted result into the destination register. The source register remains unmodified (unless, of course, the instruction specifies the same register for the source and destination operands). For the AVX instructions, the source and destination registers can be XMM (128-bit) or YMM (256-bit) registers. The third operand is either an 8-bit immediate constant, an XMM register, or a 128-bit memory location. The third operand specifies the bit shift count (the same as the SSE instructions). You specify an XMM register for the count even when the source and destination registers are 256-bit YMM registers.

The w suffix instructions shift 16-bit operands (eight lanes for 128-bit destination operands, sixteen lanes for 256-bit destinations). The d suffix instructions shift 32-bit dword operands (four lanes for 128-bit destination operands, eight lanes for 256-bit destination operands). The q suffix instructions shift 64-bit operands (two lanes for 128-bit operands, four lanes for 256-bit operands).
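
As a quick sketch, an arithmetic right shift divides every signed lane by a power of 2 at once (rounding toward negative infinity rather than toward zero):

psrad xmm0, 2  ; divide each of the four signed dword lanes of XMM0 by 4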

11.10 The SIMD Integer Arithmetic Instructions

The SSE and AVX instruction set extensions deal mainly with floating-point calculations. They do, however, include a set of signed and unsigned integer arithmetic operations. This section describes the SSE/AVX integer arithmetic instructions.

11.10.1 SIMD Integer Addition

The SIMD integer addition instructions appear in Table 11-11. These instructions do not affect any flags and thus do not indicate when an overflow (signed or unsigned) occurs during the execution of these instructions. The program itself must ensure that the source operands are all within the appropriate range before performing an addition. If carry occurs during an addition, the carry is lost.

Table 11-11: SIMD Integer Addition Instructions

Instruction Operands Description
paddb xmmdest, xmm/mem128 16-lane byte addition
vpaddb xmmdest, xmmsrc1, xmmsrc2/mem128 16-lane byte addition
vpaddb ymmdest, ymmsrc1, ymmsrc2/mem256 32-lane byte addition
paddw xmmdest, xmm/mem128 8-lane word addition
vpaddw xmmdest, xmmsrc1, xmmsrc2/mem128 8-lane word addition
vpaddw ymmdest, ymmsrc1, ymmsrc2/mem256 16-lane word addition
paddd xmmdest, xmm/mem128 4-lane dword addition
vpaddd xmmdest, xmmsrc1, xmmsrc2/mem128 4-lane dword addition
vpaddd ymmdest, ymmsrc1, ymmsrc2/mem256 8-lane dword addition
paddq xmmdest, xmm/mem128 2-lane qword addition
vpaddq xmmdest, xmmsrc1, xmmsrc2/mem128 2-lane qword addition
vpaddq ymmdest, ymmsrc1, ymmsrc2/mem256 4-lane qword addition

These addition instructions are known as vertical additions because if we stack the two source operands on top of each other (on a printed page), the lane additions occur vertically (one source lane is directly above the second source lane for the corresponding addition operation).

The packed additions ignore any overflow from the addition operation, keeping only the LO byte, word, dword, or qword of each addition. As long as overflow is never possible, this is not an issue. However, for certain algorithms (especially audio and video, which commonly use packed addition), truncating away the overflow can produce bizarre results.

A cleaner solution is to use saturation arithmetic. For unsigned addition, saturation arithmetic clips (or saturates) an overflow to the largest possible value that the instruction’s size can handle. For example, if the addition of two byte values exceeds 0FFh, saturation arithmetic produces 0FFh—the largest possible unsigned 8-bit value (likewise, saturation subtraction would produce 0 if underflow occurs). For signed saturation arithmetic, clipping occurs at the largest positive and smallest negative values (for example, 7Fh/+127 for positive values and 80h/–128 for negative values).

The x86 SIMD instructions provide both signed and unsigned saturation arithmetic, though the operations are limited to 8- and 16-bit quantities.10 The instructions appear in Table 11-12.

Table 11-12: SIMD Integer Saturation Addition Instructions

Instruction Operands Description
paddsb xmmdest, xmm/mem128 16-lane byte signed saturation addition
vpaddsb xmmdest, xmmsrc1, xmmsrc2/mem128 16-lane byte signed saturation addition
vpaddsb ymmdest, ymmsrc1, ymmsrc2/mem256 32-lane byte signed saturation addition
paddsw xmmdest, xmm/mem128 8-lane word signed saturation addition
vpaddsw xmmdest, xmmsrc1, xmmsrc2/mem128 8-lane word signed saturation addition
vpaddsw ymmdest, ymmsrc1, ymmsrc2/mem256 16-lane word signed saturation addition
paddusb xmmdest, xmm/mem128 16-lane byte unsigned saturation addition
vpaddusb xmmdest, xmmsrc1, xmmsrc2/mem128 16-lane byte unsigned saturation addition
vpaddusb ymmdest, ymmsrc1, ymmsrc2/mem256 32-lane byte unsigned saturation addition
paddusw xmmdest, xmm/mem128 8-lane word unsigned saturation addition
vpaddusw xmmdest, xmmsrc1, xmmsrc2/mem128 8-lane word unsigned saturation addition
vpaddusw ymmdest, ymmsrc1, ymmsrc2/mem256 16-lane word unsigned saturation addition

As usual, both padd* and vpadd* instructions accept 128-bit XMM registers (sixteen 8-bit additions or eight 16-bit additions). The padd* instructions leave the HO bits of any corresponding YMM destination undisturbed; the vpadd* variants clear the HO bits. Also note that the padd* instructions have only two operands (the destination register is also a source), whereas the vpadd* instructions have two source operands and a single destination operand. The vpadd* instructions with the YMM register provide double the number of parallel additions.
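
A typical use (a minimal sketch) is brightening 8-bit pixel data without wraparound:

paddusb xmm0, xmm1  ; add 16 brightness deltas in XMM1 to 16 pixels in XMM0;
                    ; any sum beyond 0FFh clips to 0FFh instead of wrapping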

11.10.2 Horizontal Additions

The SSE/AVX instruction sets also support three horizontal addition instructions, listed in Table 11-13.

Table 11-13: Horizontal Addition Instructions

Instruction Description
(v)phaddw 16-bit (word) horizontal add
(v)phaddd 32-bit (dword) horizontal add
(v)phaddsw 16-bit (word) horizontal add and saturate

The horizontal addition instructions add adjacent words or dwords in their two source operands and store the sum of the result into a destination lane, as shown in Figure 11-43.

f11043

Figure 11-43: Horizontal addition operation

The phaddw instruction has the following syntax:

phaddw xmmdest, xmmsrc/mem128

It computes the following:

temp[0 to 15]    = xmmdest[0 to 15]        + xmmdest[16 to 31]
temp[16 to 31]   = xmmdest[32 to 47]       + xmmdest[48 to 63]
temp[32 to 47]   = xmmdest[64 to 79]       + xmmdest[80 to 95]
temp[48 to 63]   = xmmdest[96 to 111]      + xmmdest[112 to 127]
temp[64 to 79]   = xmmsrc/mem128[0 to 15]   + xmmsrc/mem128[16 to 31]
temp[80 to 95]   = xmmsrc/mem128[32 to 47]  + xmmsrc/mem128[48 to 63]
temp[96 to 111]  = xmmsrc/mem128[64 to 79]  + xmmsrc/mem128[80 to 95]
temp[112 to 127] = xmmsrc/mem128[96 to 111] + xmmsrc/mem128[112 to 127]
xmmdest = temp

As is the case with most SSE instructions, phaddw does not affect the HO bits of the corresponding YMM destination register, only the LO 128 bits.
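
One handy consequence (a hedged sketch that ignores possible overflow) is that repeated horizontal additions reduce all eight words to a single sum:

phaddw xmm0, xmm0  ; eight words become four pairwise sums (duplicated in both qwords)
phaddw xmm0, xmm0  ; four sums become two
phaddw xmm0, xmm0  ; two sums become one; the total now sits in word lane 0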

The 128-bit vphaddw instruction has the following syntax:

vphaddw xmmdest, xmmsrc1, xmmsrc2/mem128

It computes the following:

xmmdest[0 to 15]    = xmmsrc1[0 to 15]         + xmmsrc1[16 to 31]
xmmdest[16 to 31]   = xmmsrc1[32 to 47]        + xmmsrc1[48 to 63]
xmmdest[32 to 47]   = xmmsrc1[64 to 79]        + xmmsrc1[80 to 95]
xmmdest[48 to 63]   = xmmsrc1[96 to 111]       + xmmsrc1[112 to 127]
xmmdest[64 to 79]   = xmmsrc2/mem128[0 to 15]   + xmmsrc2/mem128[16 to 31]
xmmdest[80 to 95]   = xmmsrc2/mem128[32 to 47]  + xmmsrc2/mem128[48 to 63]
xmmdest[96 to 111]  = xmmsrc2/mem128[64 to 79]  + xmmsrc2/mem128[80 to 95]
xmmdest[112 to 127] = xmmsrc2/mem128[96 to 111] + xmmsrc2/mem128[112 to 127]

The vphaddw instruction zeroes out the HO 128 bits of the corresponding YMM destination register.

The 256-bit vphaddw instruction has the following syntax:

vphaddw ymmdest, ymmsrc1, ymmsrc2/mem256

vphaddw does not simply extend the 128-bit version in the intuitive way. Instead, it mixes up computations as follows (where SRC1 is YMMsrc1 and SRC2 is YMMsrc2/mem256):

ymmdest[0 to 15]    = SRC1[16 to 31]   + SRC1[0 to 15]
ymmdest[16 to 31]   = SRC1[48 to 63]   + SRC1[32 to 47]
ymmdest[32 to 47]   = SRC1[80 to 95]   + SRC1[64 to 79]
ymmdest[48 to 63]   = SRC1[112 to 127] + SRC1[96 to 111]
ymmdest[64 to 79]   = SRC2[16 to 31]   + SRC2[0 to 15]
ymmdest[80 to 95]   = SRC2[48 to 63]   + SRC2[32 to 47]
ymmdest[96 to 111]  = SRC2[80 to 95]   + SRC2[64 to 79]
ymmdest[112 to 127] = SRC2[112 to 127] + SRC2[96 to 111]
ymmdest[128 to 143] = SRC1[144 to 159] + SRC1[128 to 143]
ymmdest[144 to 159] = SRC1[176 to 191] + SRC1[160 to 175]
ymmdest[160 to 175] = SRC1[208 to 223] + SRC1[192 to 207]
ymmdest[176 to 191] = SRC1[240 to 255] + SRC1[224 to 239]
ymmdest[192 to 207] = SRC2[144 to 159] + SRC2[128 to 143]
ymmdest[208 to 223] = SRC2[176 to 191] + SRC2[160 to 175]
ymmdest[224 to 239] = SRC2[208 to 223] + SRC2[192 to 207]
ymmdest[240 to 255] = SRC2[240 to 255] + SRC2[224 to 239]

11.10.3 Double-Word–Sized Horizontal Additions

The phaddd instruction has the following syntax:

phaddd xmmdest, xmmsrc/mem128

It computes the following:

temp[0 to 31]   = xmmdest[0 to 31]       + xmmdest[32 to 63]
temp[32 to 63]  = xmmdest[64 to 95]      + xmmdest[96 to 127]
temp[64 to 95]  = xmmsrc/mem128[0 to 31]  + xmmsrc/mem128[32 to 63]
temp[96 to 127] = xmmsrc/mem128[64 to 95] + xmmsrc/mem128[96 to 127]
xmmdest = temp

The 128-bit vphaddd instruction has this syntax:

vphaddd xmmdest, xmmsrc1, xmmsrc2/mem128

It computes the following:

xmmdest[0 to 31]     = xmmsrc1[0 to 31]        + xmmsrc1[32 to 63]
xmmdest[32 to 63]    = xmmsrc1[64 to 95]       + xmmsrc1[96 to 127]
xmmdest[64 to 95]    = xmmsrc2/mem128[0 to 31]  + xmmsrc2/mem128[32 to 63]
xmmdest[96 to 127]   = xmmsrc2/mem128[64 to 95] + xmmsrc2/mem128[96 to 127]
(ymmdest[128 to 255] = 0)

Like vphaddw, the 256-bit vphaddd instruction has the following syntax:

vphaddd ymmdest, ymmsrc1, ymmsrc2/mem256

It calculates the following:

ymmdest[0 to 31]    = ymmsrc1[32 to 63]         + ymmsrc1[0 to 31]
ymmdest[32 to 63]   = ymmsrc1[96 to 127]        + ymmsrc1[64 to 95]
ymmdest[64 to 95]   = ymmsrc2/mem256[32 to 63]   + ymmsrc2/mem256[0 to 31]
ymmdest[96 to 127]  = ymmsrc2/mem256[96 to 127]  + ymmsrc2/mem256[64 to 95]
ymmdest[128 to 159] = ymmsrc1[160 to 191]       + ymmsrc1[128 to 159]
ymmdest[160 to 191] = ymmsrc1[224 to 255]       + ymmsrc1[192 to 223]
ymmdest[192 to 223] = ymmsrc2/mem256[160 to 191] + ymmsrc2/mem256[128 to 159]
ymmdest[224 to 255] = ymmsrc2/mem256[224 to 255] + ymmsrc2/mem256[192 to 223]

If an overflow occurs during the horizontal addition, (v)phaddw and (v)phaddd simply ignore the overflow and store the LO 16 or 32 bits of the result into the destination location.

The (v)phaddsw instructions take the following forms:

phaddsw  xmmdest, xmmsrc/mem128
vphaddsw xmmdest, xmmsrc1, xmmsrc2/mem128
vphaddsw ymmdest, ymmsrc1, ymmsrc2/mem256

The (v)phaddsw instruction (horizontal signed integer add with saturate, word) is a slightly different form of (v)phaddw: rather than storing only the LO bits into the result in the destination lane, this instruction saturates the result. Saturation means that any (positive) overflow results in the value 7FFFh, regardless of the actual result. Likewise, any negative underflow results in the value 8000h.

Saturation arithmetic works well for audio and video processing. If you were using standard (wraparound/modulo) addition when adding two sound samples together, the result would be horrible clicking sounds. Saturation, on the other hand, simply produces a clipped audio signal. While this is not ideal, it sounds considerably better than the results from modulo arithmetic. Similarly, for video processing, saturation produces a washed-out (white) color versus the bizarre colors that result from modulo arithmetic.

Sadly, there is no horizontal add with saturation for double-word operands (for example, to handle 24-bit audio).

11.10.4 SIMD Integer Subtraction

The SIMD integer subtraction instructions appear in Table 11-14. As for the SIMD addition instructions, they do not affect any flags; any carry, borrow, overflow, or underflow information is lost. These instructions subtract the second source operand from the first source operand (which is also the destination operand for the SSE-only instructions) and store the result into the destination operand.

Table 11-14: SIMD Integer Subtraction Instructions

Instruction Operands Description
psubb xmmdest, xmm/mem128 16-lane byte subtraction
vpsubb xmmdest, xmmsrc, xmm/mem128 16-lane byte subtraction
vpsubb ymmdest, ymmsrc, ymm/mem256 32-lane byte subtraction
psubw xmmdest, xmm/mem128 8-lane word subtraction
vpsubw xmmdest, xmmsrc, xmm/mem128 8-lane word subtraction
vpsubw ymmdest, ymmsrc, ymm/mem256 16-lane word subtraction
psubd xmmdest, xmm/mem128 4-lane dword subtraction
vpsubd xmmdest, xmmsrc, xmm/mem128 4-lane dword subtraction
vpsubd ymmdest, ymmsrc, ymm/mem256 8-lane dword subtraction
psubq xmmdest, xmm/mem128 2-lane qword subtraction
vpsubq xmmdest, xmmsrc, xmm/mem128 2-lane qword subtraction
vpsubq ymmdest, ymmsrc, ymm/mem256 4-lane qword subtraction

The (v)phsubw, (v)phsubd, and (v)phsubsw horizontal subtraction instructions work just like the horizontal addition instructions, except (of course) they compute the difference of the two source operands rather than the sum. See the previous sections for details on the horizontal addition instructions.

Likewise, there is a set of signed and unsigned byte and word saturating subtraction instructions (see Table 11-15). For the signed instructions, the byte-sized instructions saturate positive overflow to 7Fh (+127) and negative underflow to 80h (–128). The word-sized instructions saturate to 7FFFh (+32,767) and 8000h (–32,768). The unsigned saturating instructions clip byte results to 0FFh (+255) and 0, and word results to 0FFFFh (+65,535) and 0.

Table 11-15: SIMD Integer Saturating Subtraction Instructions

Instruction Operands Description
psubsb xmmdest, xmm/mem128 16-lane byte signed saturation subtraction
vpsubsb xmmdest, xmmsrc, xmm/mem128 16-lane byte signed saturation subtraction
vpsubsb ymmdest, ymmsrc, ymm/mem256 32-lane byte signed saturation subtraction
psubsw xmmdest, xmm/mem128 8-lane word signed saturation subtraction
vpsubsw xmmdest, xmmsrc, xmm/mem128 8-lane word signed saturation subtraction
vpsubsw ymmdest, ymmsrc, ymm/mem256 16-lane word signed saturation subtraction
psubusb xmmdest, xmm/mem128 16-lane byte unsigned saturation subtraction
vpsubusb xmmdest, xmmsrc, xmm/mem128 16-lane byte unsigned saturation subtraction
vpsubusb ymmdest, ymmsrc, ymm/mem256 32-lane byte unsigned saturation subtraction
psubusw xmmdest, xmm/mem128 8-lane word unsigned saturation subtraction
vpsubusw xmmdest, xmmsrc, xmm/mem128 8-lane word unsigned saturation subtraction
vpsubusw ymmdest, ymmsrc, ymm/mem256 16-lane word unsigned saturation subtraction

11.10.5 SIMD Integer Multiplication

The SSE/AVX instruction set extensions provide only partial support for multiplication. Lane-by-lane multiplication requires that the result of an operation on two n-bit values fit in n bits, but multiplying two n-bit values can produce a 2n-bit result, so a lane-by-lane multiplication risks losing the overflow. The basic packed integer multiplication instructions multiply a pair of lanes and store only the LO bits of the result in the destination lane. For extended-precision arithmetic, companion packed integer multiplication instructions produce the HO bits of the result.

The instructions in Table 11-16 handle 16-bit multiplication operations. The (v)pmullw instruction multiplies the 16-bit values appearing in the lanes of the source operand and stores the LO word of the result into the corresponding destination lane. This instruction is applicable to both signed and unsigned values. The (v)pmulhw instruction computes the product of two signed word values and stores the HO word of the result into the destination lanes. For unsigned operands, (v)pmulhuw performs the same task. By executing both (v)pmullw and (v)pmulh(u)w with the same operands, you can compute the full 32-bit result of a 16×16-bit multiplication. (You can use the punpck* instructions to merge the results into 32-bit integers.)

Table 11-16: SIMD 16-Bit Packed Integer Multiplication Instructions

Instruction Operands Description
pmullw xmmdest, xmm/mem128 8-lane word multiplication, producing the LO word of the product
vpmullw xmmdest, xmmsrc, xmm/mem128 8-lane word multiplication, producing the LO word of the product
vpmullw ymmdest, ymmsrc, ymm/mem256 16-lane word multiplication, producing the LO word of the product
pmulhuw xmmdest, xmm/mem128 8-lane word unsigned multiplication, producing the HO word of the product
vpmulhuw xmmdest, xmmsrc, xmm/mem128 8-lane word unsigned multiplication, producing the HO word of the product
vpmulhuw ymmdest, ymmsrc, ymm/mem256 16-lane word unsigned multiplication, producing the HO word of the product
pmulhw xmmdest, xmm/mem128 8-lane word signed multiplication, producing the HO word of the product
vpmulhw xmmdest, xmmsrc, xmm/mem128 8-lane word signed multiplication, producing the HO word of the product
vpmulhw ymmdest, ymmsrc, ymm/mem256 16-lane word signed multiplication, producing the HO word of the product
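
The following sketch (assuming eight signed word operands in each of XMM0 and XMM1) combines these instructions with the punpck* instructions, as just described, to recover all eight full 32-bit products:

movdqa    xmm2, xmm0  ; save a copy of the multiplicand
pmullw    xmm0, xmm1  ; LO words of the eight products
pmulhw    xmm2, xmm1  ; HO words of the eight products
movdqa    xmm3, xmm0  ; copy of the LO words
punpcklwd xmm0, xmm2  ; full 32-bit products for word lanes 0 to 3
punpckhwd xmm3, xmm2  ; full 32-bit products for word lanes 4 to 7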

Table 11-17 lists the 32- and 64-bit versions of the packed multiplication instructions. There are no (v)pmulhd or (v)pmulhq instructions; see (v)pmuludq and (v)pmuldq to handle 32- and 64-bit packed multiplication.

Table 11-17: SIMD 32- and 64-Bit Packed Integer Multiplication Instructions

Instruction Operands Description
pmulld xmmdest, xmm/mem128 4-lane dword multiplication, producing the LO dword of the product
vpmulld xmmdest, xmmsrc, xmm/mem128 4-lane dword multiplication, producing the LO dword of the product
vpmulld ymmdest, ymmsrc, ymm/mem256 8-lane dword multiplication, producing the LO dword of the product
vpmullq xmmdest, xmmsrc, xmm/mem128 2-lane qword multiplication, producing the LO qword of the product (requires AVX-512)
vpmullq ymmdest, ymmsrc, ymm/mem256 4-lane qword multiplication, producing the LO qword of the product (requires AVX-512)

At some point along the way, Intel introduced (v)pmuldq and (v)pmuludq to perform signed and unsigned 32×32-bit multiplications, producing a 64-bit result. The syntax for these instructions is as follows:

pmuldq   xmmdest, xmm/mem128
vpmuldq  xmmdest, xmmsrc1, xmm/mem128
vpmuldq  ymmdest, ymmsrc1, ymm/mem256

pmuludq  xmmdest, xmm/mem128
vpmuludq xmmdest, xmmsrc1, xmm/mem128
vpmuludq ymmdest, ymmsrc1, ymm/mem256

The 128-bit variants multiply the dwords appearing in dword lanes 0 and 2 and store the 64-bit results into qword lanes 0 and 1 (that is, dword lanes 0 to 1 and 2 to 3). On CPUs with AVX registers,11 pmuldq and pmuludq do not affect the HO 128 bits of the YMM register. The vpmuldq and vpmuludq instructions zero-extend the result to 256 bits. The 256-bit variants multiply the dwords appearing in lanes 0, 2, 4, and 6, producing 64-bit results that they store into qword lanes 0, 1, 2, and 3 (dword lanes 0 to 1, 2 to 3, 4 to 5, and 6 to 7).

The pclmulqdq instruction multiplies two qword values, producing a 128-bit result. Strictly speaking, it performs a carry-less multiplication (the partial products are combined with XOR rather than addition), an operation used in CRC and cryptographic computations. Here is the syntax for this instruction:

pclmulqdq  xmmdest, xmm/mem128, imm8
vpclmulqdq xmmdest, xmmsrc1, xmmsrc2/mem128, imm8

These instructions multiply a pair of qword values found in XMMdest and XMMsrc and leave the 128-bit result in XMMdest. The imm8 operand specifies which qwords to use as the source operands. Table 11-18 lists the possible combinations for pclmulqdq. Table 11-19 lists the combinations for vpclmulqdq.

Table 11-18: imm8 Operand Values for pclmulqdq Instruction

imm8 Result
00h XMMdest = XMMdest[0 to 63] * XMM/mem128[0 to 63]
01h XMMdest = XMMdest[64 to 127] * XMM/mem128[0 to 63]
10h XMMdest = XMMdest[0 to 63] * XMM/mem128[64 to 127]
11h XMMdest = XMMdest[64 to 127] * XMM/mem128[64 to 127]

Table 11-19: imm8 Operand Values for vpclmulqdq Instruction

imm8 Result
00h XMMdest = XMMsrc1[0 to 63] * XMMsrc2/mem128[0 to 63]
01h XMMdest = XMMsrc1[64 to 127] * XMMsrc2/mem128[0 to 63]
10h XMMdest = XMMsrc1[0 to 63] * XMMsrc2/mem128[64 to 127]
11h XMMdest = XMMsrc1[64 to 127] * XMMsrc2/mem128[64 to 127]

As usual, pclmulqdq leaves the HO 128 bits of the corresponding YMM destination register unchanged, while vpclmulqdq zeroes those bits.

11.10.6 SIMD Integer Averages

The (v)pavgb and (v)pavgw instructions compute the average of two sets of bytes or words. These instructions sum the value in the byte or word lanes of their source and destination operands, divide the result by 2, round the results, and leave the averaged results sitting in the destination operand lanes. The syntax for these instructions is shown here:

pavgb  xmmdest, xmm/mem128
vpavgb xmmdest, xmmsrc1, xmmsrc2/mem128
vpavgb ymmdest, ymmsrc1, ymmsrc2/mem256
pavgw  xmmdest, xmm/mem128
vpavgw xmmdest, xmmsrc1, xmmsrc2/mem128
vpavgw ymmdest, ymmsrc1, ymmsrc2/mem256

The 128-bit pavgb and vpavgb instructions compute 16 byte-sized averages (for the 16 lanes in the source and destination operands). The 256-bit variant of the vpavgb instruction computes 32 byte-sized averages.

The 128-bit pavgw and vpavgw instructions compute eight word-sized averages (for the eight lanes in the source and destination operands). The 256-bit variant of the vpavgw instruction computes 16 word-sized averages.

The vpavgb and vpavgw instructions compute the average of the first XMM or YMM source operand and the second XMM, YMM, or mem source operand, storing the average in the destination XMM or YMM register.
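
For example (a minimal sketch), cross-fading two 16-pixel blocks of 8-bit data takes a single instruction:

pavgb xmm0, xmm1  ; each byte of XMM0 becomes (XMM0 + XMM1 + 1) / 2, rounded up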

Unfortunately, there are no (v)pavgd or (v)pavgq instructions. No doubt, these instructions were originally intended for mixing 8- and 16-bit audio or video streams (or photo manipulation), and the x86-64 CPU designers never felt the need to extend this beyond 16 bits (even though 24-bit audio is common among professional audio engineers).

11.10.7 SIMD Integer Minimum and Maximum

The SSE and AVX instruction set extensions provide the packed integer minimum and maximum instructions shown in Table 11-20 (eight of them arrived with SSE4.1; the qword forms require AVX-512). These instructions scan the lanes of a pair of 128- or 256-bit operands and copy the maximum or minimum value from each lane to the same lane in the destination operand.

Table 11-20: SIMD Minimum and Maximum Instructions

Instruction Description
(v)pmaxsb Destination byte lanes set to the maximum value of the two signed byte values found in the corresponding source lanes.
(v)pmaxsw Destination word lanes set to the maximum value of the two signed word values found in the corresponding source lanes.
(v)pmaxsd Destination dword lanes set to the maximum value of the two signed dword values found in the corresponding source lanes.
vpmaxsq Destination qword lanes set to the maximum value of the two signed qword values found in the corresponding source lanes. (AVX-512 required for this instruction.)
(v)pmaxub Destination byte lanes set to the maximum value of the two unsigned byte values found in the corresponding source lanes.
(v)pmaxuw Destination word lanes set to the maximum value of the two unsigned word values found in the corresponding source lanes.
(v)pmaxud Destination dword lanes set to the maximum value of the two unsigned dword values found in the corresponding source lanes.
vpmaxuq Destination qword lanes set to the maximum value of the two unsigned qword values found in the corresponding source lanes. (AVX-512 required for this instruction.)
(v)pminsb Destination byte lanes set to the minimum value of the two signed byte values found in the corresponding source lanes.
(v)pminsw Destination word lanes set to the minimum value of the two signed word values found in the corresponding source lanes.
(v)pminsd Destination dword lanes set to the minimum value of the two signed dword values found in the corresponding source lanes.
vpminsq Destination qword lanes set to the minimum value of the two signed qword values found in the corresponding source lanes. (AVX-512 required for this instruction.)
(v)pminub Destination byte lanes set to the minimum value of the two unsigned byte values found in the corresponding source lanes.
(v)pminuw Destination word lanes set to the minimum value of the two unsigned word values found in the corresponding source lanes.
(v)pminud Destination dword lanes set to the minimum value of the two unsigned dword values found in the corresponding source lanes.
vpminuq Destination qword lanes set to the minimum value of the two unsigned qword values found in the corresponding source lanes. (AVX-512 required for this instruction.)

The generic syntax for these instructions is as follows:12

pmxxyz  xmmdest, xmmsrc/mem128
vpmxxyz xmmdest, xmmsrc1, xmmsrc2/mem128
vpmxxyz ymmdest, ymmsrc1, ymmsrc2/mem256

The SSE instructions compute the minimum or maximum of the corresponding lanes in the source and destination operands and store the minimum or maximum result into the corresponding lanes in the destination register. The AVX instructions compute the minimum or maximum of the values in the same lanes of the two source operands and store the minimum or maximum result into the corresponding lanes of the destination register.
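
For example, a common use is clamping: the following minimal sketch forces the four signed dwords in XMM0 into a range, assuming XMM1 holds four copies of the lower bound and XMM2 holds four copies of the upper bound:

pmaxsd xmm0, xmm1   ; Lane = max(value, lower bound)
pminsd xmm0, xmm2   ; Lane = min(result, upper bound)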

11.10.8 SIMD Integer Absolute Value

The SSE/AVX instruction set extensions provide three sets of instructions for computing the absolute values of signed byte, word, and double-word integers: (v)pabsb, (v)pabsw, and (v)pabsd.13 The syntax for these instructions is the following:

pabsb  xmmdest, xmmsrc/mem128
vpabsb xmmdest, xmmsrc/mem128
vpabsb ymmdest, ymmsrc/mem256

pabsw  xmmdest, xmmsrc/mem128
vpabsw xmmdest, xmmsrc/mem128
vpabsw ymmdest, ymmsrc/mem256

pabsd  xmmdest, xmmsrc/mem128
vpabsd xmmdest, xmmsrc/mem128
vpabsd ymmdest, ymmsrc/mem256

When operating on a system that supports AVX registers, the SSE pabsb, pabsw, and pabsd instructions leave the upper bits of the YMM registers unmodified. The 128-bit versions of the AVX instructions (vpabsb, vpabsw, and vpabsd) zero the upper bits of the corresponding YMM register.

11.10.9 SIMD Integer Sign Adjustment Instructions

The (v)psignb, (v)psignw, and (v)psignd instructions apply the sign found in a source lane to the corresponding destination lane. The algorithm works as follows:

if source lane value is less than zero then
    negate the corresponding destination lane
else if source lane value is equal to zero then
    set the corresponding destination lane to zero
else 
    leave the corresponding destination lane unchanged

The syntax for these instructions is the following:

psignb  xmmdest, xmmsrc/mem128
vpsignb xmmdest, xmmsrc1, xmmsrc2/mem128
vpsignb ymmdest, ymmsrc1, ymmsrc2/mem256

psignw  xmmdest, xmmsrc/mem128
vpsignw xmmdest, xmmsrc1, xmmsrc2/mem128
vpsignw ymmdest, ymmsrc1, ymmsrc2/mem256

psignd  xmmdest, xmmsrc/mem128
vpsignd xmmdest, xmmsrc1, xmmsrc2/mem128
vpsignd ymmdest, ymmsrc1, ymmsrc2/mem256

As usual, the 128-bit SSE instructions leave the upper bits of the YMM register unchanged (if applicable), and the 128-bit AVX instructions zero-extend the result into the upper bits of the YMM register.
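
For example, the following minimal sketch conditionally negates the four dwords in XMM0, assuming XMM1 holds four signed control values:

psignd xmm0, xmm1   ; Negate, zero, or pass each dword, per XMM1's signs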

11.10.10 SIMD Integer Comparison Instructions

The (v)pcmpeqb, (v)pcmpeqw, (v)pcmpeqd, (v)pcmpeqq, (v)pcmpgtb, (v)pcmpgtw, (v)pcmpgtd, and (v)pcmpgtq instructions provide packed signed integer comparisons. These instructions compare corresponding bytes, words, dwords, or qwords (depending on the instruction suffix) in the various lanes of their operands.14 They store the result of each comparison in the corresponding destination lane.

11.10.10.1 SSE Compare-for-Equality Instructions

The syntax for the SSE compare-for-equality instructions (pcmpeq*) is shown here:

pcmpeqb xmmdest, xmmsrc/mem128  ; Compares 16 bytes
pcmpeqw xmmdest, xmmsrc/mem128  ; Compares 8 words
pcmpeqd xmmdest, xmmsrc/mem128  ; Compares 4 dwords
pcmpeqq xmmdest, xmmsrc/mem128  ; Compares 2 qwords

These instructions compute

xmmdest[lane] = xmmdest[lane] == xmmsrc/mem128[lane]

where lane varies from 0 to 15 for pcmpeqb, 0 to 7 for pcmpeqw, 0 to 3 for pcmpeqd, and 0 to 1 for pcmpeqq. The == operator produces a value of all 1 bits if the two values in the same lane are equal; it produces all 0 bits if the values are not equal.

11.10.10.2 SSE Compare-for-Greater-Than Instructions

The following is the syntax for the SSE compare-for-greater-than instructions (pcmpgt*):

pcmpgtb xmmdest, xmmsrc/mem128  ; Compares 16 bytes
pcmpgtw xmmdest, xmmsrc/mem128  ; Compares 8 words
pcmpgtd xmmdest, xmmsrc/mem128  ; Compares 4 dwords
pcmpgtq xmmdest, xmmsrc/mem128  ; Compares 2 qwords

These instructions compute

xmmdest[lane] = xmmdest[lane] > xmmsrc/mem128[lane]

where lane is the same as for the compare-for-equality instructions, and the > operator produces a value of all 1 bits if the signed integer in the XMMdest lane is greater than the signed value in the corresponding XMMsrc/MEM128 lane.

On AVX-capable CPUs, the SSE packed integer comparisons preserve the value in the upper bits of the underlying YMM register.

11.10.10.3 AVX Comparison Instructions

The 128-bit variants of these instructions have the following syntax:

vpcmpeqb xmmdest, xmmsrc1, xmmsrc2/mem128  ; Compares 16 bytes
vpcmpeqw xmmdest, xmmsrc1, xmmsrc2/mem128  ; Compares 8 words
vpcmpeqd xmmdest, xmmsrc1, xmmsrc2/mem128  ; Compares 4 dwords
vpcmpeqq xmmdest, xmmsrc1, xmmsrc2/mem128  ; Compares 2 qwords

vpcmpgtb xmmdest, xmmsrc1, xmmsrc2/mem128  ; Compares 16 bytes
vpcmpgtw xmmdest, xmmsrc1, xmmsrc2/mem128  ; Compares 8 words
vpcmpgtd xmmdest, xmmsrc1, xmmsrc2/mem128  ; Compares 4 dwords
vpcmpgtq xmmdest, xmmsrc1, xmmsrc2/mem128  ; Compares 2 qwords

These instructions compute as follows:

xmmdest[lane] = xmmsrc1[lane] == xmmsrc2/mem128[lane]
xmmdest[lane] = xmmsrc1[lane] >  xmmsrc2/mem128[lane]

These AVX instructions write 0s to the upper bits of the underlying YMM register.

The 256-bit variants of these instructions have the following syntax:

vpcmpeqb ymmdest, ymmsrc1, ymmsrc2/mem256  ; Compares 32 bytes
vpcmpeqw ymmdest, ymmsrc1, ymmsrc2/mem256  ; Compares 16 words
vpcmpeqd ymmdest, ymmsrc1, ymmsrc2/mem256  ; Compares 8 dwords
vpcmpeqq ymmdest, ymmsrc1, ymmsrc2/mem256  ; Compares 4 qwords

vpcmpgtb ymmdest, ymmsrc1, ymmsrc2/mem256  ; Compares 32 bytes
vpcmpgtw ymmdest, ymmsrc1, ymmsrc2/mem256  ; Compares 16 words
vpcmpgtd ymmdest, ymmsrc1, ymmsrc2/mem256  ; Compares 8 dwords
vpcmpgtq ymmdest, ymmsrc1, ymmsrc2/mem256  ; Compares 4 qwords

These instructions compute as follows:

ymmdest[lane] = ymmsrc1[lane] == ymmsrc2/mem256[lane]
ymmdest[lane] = ymmsrc1[lane] >  ymmsrc2/mem256[lane]

Of course, the principal difference between the 256- and the 128-bit instructions is that the 256-bit variants support twice as many byte (32), word (16), dword (8), and qword (4) signed-integer lanes.

11.10.10.4 Compare-for-Less-Than Instructions

There are no packed compare-for-less-than instructions. You can synthesize a less-than comparison by reversing the operands and using a greater-than comparison. That is, if x < y, then it is also true that y > x. If both packed operands are sitting in XMM or YMM registers, swapping the registers is relatively easy (especially when using the three-operand AVX instructions). If the second operand is a memory operand, you must first load that operand into a register so you can reverse the operands (a memory operand must always be the second operand).
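
For example, here is a minimal sketch of a packed signed dword less-than comparison built from the AVX three-operand greater-than:

; XMM0[lane] = all 1 bits where XMM1[lane] < XMM2[lane], all 0 bits otherwise
vpcmpgtd xmm0, xmm2, xmm1   ; Evaluates xmm2 > xmm1, which is xmm1 < xmm2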

11.10.10.5 Using Packed Comparison Results

The question remains of what to do with the result you obtain from a packed comparison. SSE/AVX packed signed integer comparisons do not affect condition code flags (because they compare multiple values and only one of those comparisons could be moved into the flags). Instead, the packed comparisons simply produce Boolean results. You can use these results with the packed AND instructions (pand, vpand, pandn, and vpandn), the packed OR instructions (por and vpor), or the packed XOR instructions (pxor and vpxor) to mask or otherwise modify other packed data values. Of course, you could also extract the individual lane values and test them (via a conditional jump). The following section describes a straightforward way to achieve this.
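
For example, the following hedged sketch uses a comparison mask and the packed logical instructions to compute a branchless lane-by-lane signed maximum, assuming XMM0 holds a and XMM1 holds b (four signed dwords each):

movdqa  xmm2, xmm0   ; Copy of a
pcmpgtd xmm2, xmm1   ; Mask = all 1 bits where a > b
pand    xmm0, xmm2   ; Keep a's lanes where a > b
pandn   xmm2, xmm1   ; Keep b's lanes where a <= b
por     xmm0, xmm2   ; XMM0 = lane-by-lane max(a, b)

(On SSE4.1 CPUs, pmaxsd computes this lane-by-lane maximum directly; the mask-and-merge pattern generalizes to any lane selection.)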

11.10.10.6 The (v)pmovmskb Instructions

The (v)pmovmskb instruction extracts the HO bit from all the bytes in an XMM or YMM register and stores the 16 or 32 bits (respectively) into a general-purpose register. These instructions set all HO bits of the general-purpose register to 0 (beyond those needed to hold the mask bits). The syntax is

pmovmskb  reg, xmmsrc
vpmovmskb reg, xmmsrc
vpmovmskb reg, ymmsrc

where reg is any 32-bit or 64-bit general-purpose integer register. The semantics for the pmovmskb and vpmovmskb instructions with an XMM source register are the same, but the encoding of pmovmskb is more efficient.

The (v)pmovmskb instruction copies the sign bits from each of the byte lanes into the corresponding bit position of the general-purpose register. It copies bit 7 from the XMM register (the sign bit for lane 0) into bit 0 of the destination register; it copies bit 15 from the XMM register (the sign bit for lane 1) into bit 1 of the destination register; it copies bit 23 from the XMM register (the sign bit for lane 2) into bit 2 of the destination register; and so on.

The 128-bit instructions fill only bits 0 through 15 of the destination register (zeroing out all other bits). The 256-bit form of the vpmovmskb instruction fills bits 0 through 31 of the destination register (zeroing out HO bits if you specify a 64-bit register).

You can use the pmovmskb instruction to extract a single bit from each byte lane in an XMM or a YMM register after a (v)pcmpeqb or (v)pcmpgtb instruction. Consider the following code sequence:

pcmpeqb  xmm0, xmm1
pmovmskb eax,  xmm0

After the execution of these two instructions, EAX bit 0 will be 1 or 0 if byte 0 of XMM0 was equal, or not equal, to byte 0 of XMM1, respectively. Likewise, EAX bit 1 will contain the result of comparing byte 1 of XMM0 to byte 1 of XMM1, and so on for each of the following bytes (up to bit 15, which holds the result of comparing byte 15 of XMM0 and XMM1).

Unfortunately, there are no pmovmskw, pmovmskd, or pmovmskq instructions. You can achieve the same result as a hypothetical pmovmskw by using the following code sequence:

pcmpeqw  xmm0, xmm1
pmovmskb eax, xmm0
mov      cl, 0     ; Put result here
shl      ax, 1     ; Shift out lane 7 result
rcl      cl, 1     ; Shift bit into CL
shl      ax, 1     ; Ignore this bit
shl      ax, 1     ; Shift out lane 6 result
rcl      cl, 1     ; Shift lane 6 result into CL
shl      ax, 1     ; Ignore this bit
shl      ax, 1     ; Shift out lane 5 result
rcl      cl, 1     ; Shift lane 5 result into CL
shl      ax, 1     ; Ignore this bit
shl      ax, 1     ; Shift out lane 4 result
rcl      cl, 1     ; Shift lane 4 result into CL
shl      ax, 1     ; Ignore this bit
shl      ax, 1     ; Shift out lane 3 result
rcl      cl, 1     ; Shift lane 3 result into CL
shl      ax, 1     ; Ignore this bit
shl      ax, 1     ; Shift out lane 2 result
rcl      cl, 1     ; Shift lane 2 result into CL
shl      ax, 1     ; Ignore this bit
shl      ax, 1     ; Shift out lane 1 result
rcl      cl, 1     ; Shift lane 1 result into CL
shl      ax, 1     ; Ignore this bit
shl      ax, 1     ; Shift out lane 0 result
rcl      cl, 1     ; Shift lane 0 result into CL

Because pcmpeqw produces a sequence of words (which contain either 0000h or 0FFFFh) and pmovmskb expects byte values, pmovmskb produces twice as many results as we expect, and every odd-numbered bit that pmovmskb produces is a duplicate of the preceding even-numbered bit (because the inputs are either 0000h or 0FFFFh). This code grabs every odd-numbered bit (starting with bit 15 and working down) and skips over the even-numbered bits. While this code is easy enough to follow, it is rather long and slow. If you’re willing to live with an 8-bit result for which the lane numbers don’t match the bit numbers, you can use more efficient code:

pcmpeqw  xmm0, xmm1
pmovmskb eax, xmm0
shr      al, 1     ; Move odd bits to even positions
and      al, 55h   ; Zero out the odd bits, keep even bits
and      ah, 0aah  ; Zero out the even bits, keep odd bits
or       al, ah    ; Merge the two sets of bits

This interleaves the lanes in the bit positions as shown in Figure 11-44. Usually, it’s easy enough to work around this rearrangement in the software. You can also use a 256-entry lookup table (see Chapter 10) to rearrange the bits however you desire. Of course, if you’re just going to test the individual bits rather than use them as some sort of mask, you can directly test the bits that pmovmskb leaves in EAX; you don’t have to coalesce them into a single byte.

f11044

Figure 11-44: Merging bits from pcmpeqw

When using the double-word or quad-word packed comparisons, you could also use a scheme such as the one provided here for pcmpeqw. However, the floating-point mask move instructions (see “The (v)movmskps, (v)movmskpd Instructions” on page 676) do the job more efficiently by breaking the rule about using SIMD instructions that are appropriate for the data type.
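
For example, a minimal sketch of that approach for dword lanes:

pcmpeqd  xmm0, xmm1   ; Four dword lane comparisons
movmskps eax, xmm0    ; EAX bits 0 to 3 = the four lane results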

11.10.11 Integer Conversions

The SSE and AVX instruction set extensions provide various instructions that convert integer values from one form to another. There are zero- and sign-extension instructions that convert from a smaller value to a larger one. Other instructions convert larger values to smaller ones. This section covers these instructions.

11.10.11.1 Packed Zero-Extension Instructions

The move with zero-extension instructions perform the conversions appearing in Table 11-21.

Table 11-21: SSE4.1 and AVX Packed Zero-Extension Instructions

Syntax Description
pmovzxbw xmmdest, xmmsrc/mem64 Zero-extends a set of eight byte values in the LO 8 bytes of XMMsrc/mem64 to word values in XMMdest.
pmovzxbd xmmdest, xmmsrc/mem32 Zero-extends a set of four byte values in the LO 4 bytes of XMMsrc/mem32 to dword values in XMMdest.
pmovzxbq xmmdest, xmmsrc/mem16 Zero-extends a set of two byte values in the LO 2 bytes of XMMsrc/mem16 to qword values in XMMdest.
pmovzxwd xmmdest, xmmsrc/mem64 Zero-extends a set of four word values in the LO 8 bytes of XMMsrc/mem64 to dword values in XMMdest.
pmovzxwq xmmdest, xmmsrc/mem32 Zero-extends a set of two word values in the LO 4 bytes of XMMsrc/mem32 to qword values in XMMdest.
pmovzxdq xmmdest, xmmsrc/mem64 Zero-extends a set of two dword values in the LO 8 bytes of XMMsrc/mem64 to qword values in XMMdest.

A set of comparable AVX instructions also exists (same syntax, but with a v prefix on the instruction mnemonics). The difference, as usual, is that the SSE instructions leave the upper bits of the YMM register unchanged, whereas the AVX instructions store 0s into the upper bits of the YMM registers.

The AVX2 instruction set extensions double the number of lanes by allowing the use of the YMM registers. They take similar operands to the SSE/AVX instructions (substituting YMM for the destination register and doubling the size of the memory locations) and process twice the number of lanes to produce sixteen words, eight dwords, or four qwords in a YMM destination register. See Table 11-22 for details.

Table 11-22: AVX2 Packed Zero-Extension Instructions

Syntax Description
vpmovzxbw ymmdest, xmmsrc/mem128 Zero-extends a set of sixteen byte values in the LO 16 bytes of XMMsrc/mem128 to word values in YMMdest.
vpmovzxbd ymmdest, xmmsrc/mem64 Zero-extends a set of eight byte values in the LO 8 bytes of XMMsrc/mem64 to dword values in YMMdest.
vpmovzxbq ymmdest, xmmsrc/mem32 Zero-extends a set of four byte values in the LO 4 bytes of XMMsrc/mem32 to qword values in YMMdest.
vpmovzxwd ymmdest, xmmsrc/mem128 Zero-extends a set of eight word values in the LO 16 bytes of XMMsrc/mem128 to dword values in YMMdest.
vpmovzxwq ymmdest, xmmsrc/mem64 Zero-extends a set of four word values in the LO 8 bytes of XMMsrc/mem64 to qword values in YMMdest.
vpmovzxdq ymmdest, xmmsrc/mem128 Zero-extends a set of four dword values in the LO 16 bytes of XMMsrc/mem128 to qword values in YMMdest.
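
For example, the following minimal sketch widens eight unsigned bytes to eight words, assuming RSI points at the source bytes:

pmovzxbw xmm0, qword ptr [rsi]   ; 8 bytes -> 8 zero-extended words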

11.10.11.2 Packed Sign-Extension Instructions

The SSE/AVX/AVX2 instruction set extensions provide a comparable set of instructions that sign-extend byte, word, and dword values. Table 11-23 lists the SSE packed sign-extension instructions.

Table 11-23: SSE Packed Sign-Extension Instructions

Syntax Description
pmovsxbw xmmdest, xmmsrc/mem64 Sign-extends a set of eight byte values in the LO 8 bytes of XMMsrc/mem64 to word values in XMMdest.
pmovsxbd xmmdest, xmmsrc/mem32 Sign-extends a set of four byte values in the LO 4 bytes of XMMsrc/mem32 to dword values in XMMdest.
pmovsxbq xmmdest, xmmsrc/mem16 Sign-extends a set of two byte values in the LO 2 bytes of XMMsrc/mem16 to qword values in XMMdest.
pmovsxwd xmmdest, xmmsrc/mem64 Sign-extends a set of four word values in the LO 8 bytes of XMMsrc/mem64 to dword values in XMMdest.
pmovsxwq xmmdest, xmmsrc/mem32 Sign-extends a set of two word values in the LO 4 bytes of XMMsrc/mem32 to qword values in XMMdest.
pmovsxdq xmmdest, xmmsrc/mem64 Sign-extends a set of two dword values in the LO 8 bytes of XMMsrc/mem64 to qword values in XMMdest.

A set of corresponding AVX instructions also exists (whose mnemonics have the v prefix). As usual, the difference between the SSE and AVX instructions is that the SSE instructions leave the upper bits of the YMM register unchanged (if applicable), and the AVX instructions store 0s into those upper bits.

AVX2-capable processors also allow a YMMdest destination register, which doubles the number of (output) values the instruction can handle; see Table 11-24.

Table 11-24: AVX2 Packed Sign-Extension Instructions

Syntax Description
vpmovsxbw ymmdest, xmmsrc/mem128 Sign-extends a set of sixteen byte values in the LO 16 bytes of XMMsrc/mem128 to word values in YMMdest.
vpmovsxbd ymmdest, xmmsrc/mem64 Sign-extends a set of eight byte values in the LO 8 bytes of XMMsrc/mem64 to dword values in YMMdest.
vpmovsxbq ymmdest, xmmsrc/mem32 Sign-extends a set of four byte values in the LO 4 bytes of XMMsrc/mem32 to qword values in YMMdest.
vpmovsxwd ymmdest, xmmsrc/mem128 Sign-extends a set of eight word values in the LO 16 bytes of XMMsrc/mem128 to dword values in YMMdest.
vpmovsxwq ymmdest, xmmsrc/mem64 Sign-extends a set of four word values in the LO 8 bytes of XMMsrc/mem64 to qword values in YMMdest.
vpmovsxdq ymmdest, xmmsrc/mem128 Sign-extends a set of four dword values in the LO 16 bytes of XMMsrc/mem128 to qword values in YMMdest.
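
For example, the following minimal sketch widens four signed words to four dwords, assuming RSI points at the source words:

pmovsxwd xmm0, qword ptr [rsi]   ; 4 words -> 4 sign-extended dwords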

11.10.11.3 Packed Sign Extension with Saturation

In addition to converting smaller signed or unsigned values to a larger format, the SSE/AVX/AVX2-capable CPUs have the ability to convert large values to smaller values via saturation; see Table 11-25.

Table 11-25: SSE Packed Sign-Extension with Saturation Instructions

Syntax Description
packsswb xmmdest, xmmsrc/mem128 Packs sixteen signed word values (from two 128-bit sources) into sixteen byte lanes in a 128-bit destination register using signed saturation.
packuswb xmmdest, xmmsrc/mem128 Packs sixteen unsigned word values (from two 128-bit sources) into sixteen byte lanes in a 128-bit destination register using unsigned saturation.
packssdw xmmdest, xmmsrc/mem128 Packs eight signed dword values (from two 128-bit sources) into eight word values in a 128-bit destination register using signed saturation.
packusdw xmmdest, xmmsrc/mem128 Packs eight unsigned dword values (from two 128-bit sources) into eight word values in a 128-bit destination register using unsigned saturation.

The saturate operation checks its operand to see if the value exceeds the range of the result (–128 to +127 for signed bytes, 0 to 255 for unsigned bytes, –32,768 to +32,767 for signed words, and 0 to 65,535 for unsigned words). When saturating to a byte, if the signed source value is less than –128, byte saturation sets the value to –128. When saturating to a word, if the signed source value is less than –32,768, signed saturation sets the value to –32,768. Similarly, if a signed byte or word value exceeds +127 or +32,767, then saturation replaces the value with +127 or +32,767, respectively. For unsigned operations, saturation limits the value to +255 (for bytes) or +65,535 (for words). Unsigned values are never less than 0, so unsigned saturation clips values to only +255 or +65,535.
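
For example, the following minimal sketch narrows 16 signed words, eight in XMM0 and eight in XMM1, to 16 saturated signed bytes:

packsswb xmm0, xmm1   ; LO 8 bytes from XMM0's words, HO 8 bytes from
                      ; XMM1's words, each clamped to -128 to +127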

AVX-capable CPUs provide 128-bit variants of these instructions that support three operands: two source operands and an independent destination operand. These instructions (mnemonics the same as the SSE instructions, with a v prefix) have the following syntax:

vpacksswb  xmmdest, xmmsrc1, xmmsrc2/mem128
vpackuswb  xmmdest, xmmsrc1, xmmsrc2/mem128
vpackssdw  xmmdest, xmmsrc1, xmmsrc2/mem128
vpackusdw  xmmdest, xmmsrc1, xmmsrc2/mem128

These instructions are roughly equivalent to the SSE variants, except that these instructions use XMMsrc1 as the first source operand rather than XMMdest (which the SSE instructions use). Also, the SSE instructions do not modify the upper bits of the YMM register (if present on the CPU), whereas the AVX instructions store 0s into the upper YMM register bits.

AVX2-capable CPUs also allow the use of the YMM registers (and 256-bit memory locations) to double the number of values the instruction can saturate (see Table 11-26). Of course, don’t forget to check for AVX2 (and AVX) compatibility before using these instructions.

Table 11-26: AVX2 Packed Sign-Extension with Saturation Instructions

Syntax Description
vpacksswb ymmdest, ymmsrc1, ymmsrc2/mem256 Packs 32 signed word values (from two 256-bit sources) into 32 byte lanes in a 256-bit destination register using signed saturation.
vpackuswb ymmdest, ymmsrc1, ymmsrc2/mem256 Packs 32 unsigned word values (from two 256-bit sources) into 32 byte lanes in a 256-bit destination register using unsigned saturation.
vpackssdw ymmdest, ymmsrc1, ymmsrc2/mem256 Packs 16 signed dword values (from two 256-bit sources) into 16 word values in a 256-bit destination register using signed saturation.
vpackusdw ymmdest, ymmsrc1, ymmsrc2/mem256 Packs 16 unsigned dword values (from two 256-bit sources) into 16 word values in a 256-bit destination register using unsigned saturation.

11.11 SIMD Floating-Point Arithmetic Operations

The SSE and AVX instruction set extensions provide packed arithmetic equivalents for all the scalar floating-point instructions in “SSE Floating-Point Arithmetic” in Chapter 6. This section does not repeat the discussion of the scalar floating-point operations; see Chapter 6 for more details.

The 128-bit SSE packed floating-point instructions have the following generic syntax (where instr is one of the floating-point instructions in Table 11-27):

instrps xmmdest, xmmsrc/mem128
instrpd xmmdest, xmmsrc/mem128

The packed single (*ps) instructions perform four single-precision floating-point operations simultaneously. The packed double (*pd) instructions perform two double-precision floating-point operations simultaneously. As is typical for SSE instructions, these packed arithmetic instructions compute

xmmdest[lane] = xmmdest[lane] op xmmsrc/mem128[lane]

where lane varies from 0 to 3 for packed single-precision instructions and from 0 to 1 for packed double-precision instructions. op represents the operation (such as addition or subtraction). When the SSE instructions are executed on a CPU that supports the AVX extensions, the SSE instructions leave the upper bits of the AVX register unmodified.

The 128-bit AVX packed floating-point instructions have this syntax:15

vinstrps xmmdest, xmmsrc1, xmmsrc2/mem128 ; For dyadic operations
vinstrpd xmmdest, xmmsrc1, xmmsrc2/mem128 ; For dyadic operations
vinstrps xmmdest, xmmsrc/mem128          ; For monadic operations
vinstrpd xmmdest, xmmsrc/mem128          ; For monadic operations

These instructions compute

xmmdest[lane] = xmmsrc1[lane] op xmmsrc2/mem128[lane]

where op corresponds to the operation associated with the specific instruction (for example, vaddps does a packed single-precision addition). These 128-bit AVX instructions clear the HO bits of the underlying YMMdest register.

The 256-bit AVX packed floating-point instructions have this syntax:

vinstrps ymmdest, ymmsrc1, ymmsrc2/mem256 ; For dyadic operations
vinstrpd ymmdest, ymmsrc1, ymmsrc2/mem256 ; For dyadic operations
vinstrps ymmdest, ymmsrc/mem256          ; For monadic operations
vinstrpd ymmdest, ymmsrc/mem256          ; For monadic operations

These instructions compute

ymmdest[lane] = ymmsrc1[lane] op ymmsrc2/mem256[lane]

where op corresponds to the operation associated with the specific instruction (for example, vaddps is a packed single-precision addition). Because these instructions operate on 256-bit operands, they compute twice as many lanes of data as the 128-bit instructions. Specifically, they simultaneously compute eight single-precision (the v*ps instructions) or four double-precision results (the v*pd instructions).

Table 11-27 provides the list of SSE/AVX packed instructions.

Table 11-27: Floating-Point Arithmetic Instructions

Instruction Lanes Description
addps 4 Adds four single-precision floating-point values
addpd 2 Adds two double-precision floating-point values
vaddps 4/8 Adds four (128-bit/XMM operands) or eight (256-bit/YMM operands) single-precision values
vaddpd 2/4 Adds two (128-bit/XMM operands) or four (256-bit/YMM operands) double-precision values
subps 4 Subtracts four single-precision floating-point values
subpd 2 Subtracts two double-precision floating-point values
vsubps 4/8 Subtracts four (128-bit/XMM operands) or eight (256-bit/YMM operands) single-precision values
vsubpd 2/4 Subtracts two (128-bit/XMM operands) or four (256-bit/YMM operands) double-precision values
mulps 4 Multiplies four single-precision floating-point values
mulpd 2 Multiplies two double-precision floating-point values
vmulps 4/8 Multiplies four (128-bit/XMM operands) or eight (256-bit/YMM operands) single-precision values
vmulpd 2/4 Multiplies two (128-bit/XMM operands) or four (256-bit/YMM operands) double-precision values
divps 4 Divides four single-precision floating-point values
divpd 2 Divides two double-precision floating-point values
vdivps 4/8 Divides four (128-bit/XMM operands) or eight (256-bit/YMM operands) single-precision values
vdivpd 2/4 Divides two (128-bit/XMM operands) or four (256-bit/YMM operands) double-precision values
maxps 4 Computes the maximum of four pairs of single-precision floating-point values
maxpd 2 Computes the maximum of two pairs of double-precision floating-point values
vmaxps 4/8 Computes the maximum of four (128-bit/XMM operands) or eight (256-bit/YMM operands) pairs of single-precision values
vmaxpd 2/4 Computes the maximum of two (128-bit/XMM operands) or four (256-bit/YMM operands) pairs of double-precision values
minps 4 Computes the minimum of four pairs of single-precision floating-point values
minpd 2 Computes the minimum of two pairs of double-precision floating-point values
vminps 4/8 Computes the minimum of four (128-bit/XMM operands) or eight (256-bit/YMM operands) pairs of single-precision values
vminpd 2/4 Computes the minimum of two (128-bit/XMM operands) or four (256-bit/YMM operands) pairs of double-precision values
sqrtps 4 Computes the square root of four single-precision floating-point values
sqrtpd 2 Computes the square root of two double-precision floating-point values
vsqrtps 4/8 Computes the square root of four (128-bit/XMM operands) or eight (256-bit/YMM operands) single-precision values
vsqrtpd 2/4 Computes the square root of two (128-bit/XMM operands) or four (256-bit/YMM operands) double-precision values
rsqrtps 4 Computes the approximate reciprocal square root of four single-precision floating-point values*
vrsqrtps 4/8 Computes the approximate reciprocal square root of four (128-bit/XMM operands) or eight (256-bit/YMM operands) single-precision values

* The relative error is ≤ 1.5 × 2-12.
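
For example, the following minimal sketch sums two arrays of eight single-precision values; a, b, and c are hypothetical 32-byte-aligned arrays of eight real4 values:

vmovaps ymm0, ymmword ptr a
vaddps  ymm0, ymm0, ymmword ptr b   ; Eight single-precision additions
vmovaps ymmword ptr c, ymm0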

The SSE/AVX instruction set extensions also include floating-point horizontal addition and subtraction instructions. The syntax for these instructions is as follows:

haddps  xmmdest, xmmsrc/mem128
vhaddps xmmdest, xmmsrc1, xmmsrc2/mem128
vhaddps ymmdest, ymmsrc1, ymmsrc2/mem256
haddpd  xmmdest, xmmsrc/mem128
vhaddpd xmmdest, xmmsrc1, xmmsrc2/mem128
vhaddpd ymmdest, ymmsrc1, ymmsrc2/mem256

hsubps  xmmdest, xmmsrc/mem128
vhsubps xmmdest, xmmsrc1, xmmsrc2/mem128
vhsubps ymmdest, ymmsrc1, ymmsrc2/mem256
hsubpd  xmmdest, xmmsrc/mem128
vhsubpd xmmdest, xmmsrc1, xmmsrc2/mem128
vhsubpd ymmdest, ymmsrc1, ymmsrc2/mem256

As for the integer horizontal addition and subtraction instructions, these instructions add or subtract the values in adjacent lanes of their source operands and store the results in the destination register, as shown in Figure 11-43.

11.12 SIMD Floating-Point Comparison Instructions

Like the integer packed comparisons, the SSE/AVX floating-point comparisons compare two sets of floating-point values (either single- or double-precision, depending on the instruction’s syntax) and store a resulting Boolean value (all 1 bits for true, all 0 bits for false) into the destination lane. However, the floating-point comparisons are far more comprehensive than those of their integer counterparts. Part of the reason is that floating-point arithmetic is more complex; however, an ever-increasing silicon budget for the CPU designers is also responsible for this.

11.12.1 SSE and AVX Comparisons

There are two sets of basic floating-point comparisons: (v)cmpps, which compares a set of packed single-precision values, and (v)cmppd, which compares a set of packed double-precision values. Instead of encoding the comparison type into the mnemonic, these instructions use an imm8 operand whose value specifies the type of comparison. The generic syntax for these instructions is as follows:

cmpps  xmmdest, xmmsrc/mem128, imm8
vcmpps xmmdest, xmmsrc1, xmmsrc2/mem128, imm8
vcmpps ymmdest, ymmsrc1, ymmsrc2/mem256, imm8

cmppd  xmmdest, xmmsrc/mem128, imm8
vcmppd xmmdest, xmmsrc1, xmmsrc2/mem128, imm8
vcmppd ymmdest, ymmsrc1, ymmsrc2/mem256, imm8

The imm8 operand specifies the type of the comparison. There are 32 possible comparisons, as listed in Table 11-28.

Table 11-28: imm8 Values for cmpps and cmppd Instructions

imm8 Description A<B A=B A>B Unord Signal
00h EQ, ordered, quiet 0 1 0 0 No
01h LT, ordered, signaling 1 0 0 0 Yes
02h LE, ordered, signaling 1 1 0 0 Yes
03h Unordered, quiet 0 0 0 1 No
04h NE, unordered, quiet 1 0 1 1 No
05h NLT, unordered, signaling 0 1 1 1 Yes
06h NLE, unordered, signaling 0 0 1 1 Yes
07h Ordered, quiet 1 1 1 0 No
08h EQ, unordered, quiet 0 1 0 1 No
09h NGE, unordered, signaling 1 0 0 1 Yes
0Ah NGT, unordered, signaling 1 1 0 1 Yes
0Bh False, ordered, quiet 0 0 0 0 No
0Ch NE, ordered, quiet 1 0 1 0 No
0Dh GE, ordered, signaling 0 1 1 0 Yes
0Eh GT, ordered, signaling 0 0 1 0 Yes
0Fh True, unordered, quiet 1 1 1 1 No
10h EQ, ordered, signaling 0 1 0 0 Yes
11h LT, ordered, quiet 1 0 0 0 No
12h LE, ordered, quiet 1 1 0 0 No
13h Unordered, signaling 0 0 0 1 Yes
14h NE, unordered, signaling 1 0 1 1 Yes
15h NLT, unordered, quiet 0 1 1 1 No
16h NLE, unordered, quiet 0 0 1 1 No
17h Ordered, signaling 1 1 1 0 Yes
18h EQ, unordered, signaling 0 1 0 1 Yes
19h NGE, unordered, quiet 1 0 0 1 No
1Ah NGT, unordered, quiet 1 1 0 1 No
1Bh False, ordered, signaling 0 0 0 0 Yes
1Ch NE, ordered, signaling 1 0 1 0 Yes
1Dh GE, ordered, quiet 0 1 1 0 No
1Eh GT, ordered, quiet 0 0 1 0 No
1Fh True, unordered, signaling 1 1 1 1 Yes

The comparisons with imm8 values of 08h and above are available only on CPUs that support the AVX extensions.

The “true” and “false” comparisons always store true or false into the destination lanes. For the most part, these comparisons aren’t particularly useful. The pxor, xorps, xorpd, vxorps, and vxorpd instructions are probably better for setting an XMM or a YMM register to 0. Prior to AVX2, a true comparison was the shortest instruction that would set all bits in an XMM or a YMM register to 1, though pcmpeqb is commonly used as well (be aware of microarchitectural inefficiencies with the latter instruction).

Note that non-AVX CPUs do not implement the GT, GE, NGT, and NGE comparisons. On these CPUs, use the inverse comparison (for example, NLT for GE) or swap the operands and use the opposite condition (as was done for the packed integer comparisons).

11.12.2 Unordered vs. Ordered Comparisons

The unordered relationship is true when at least one of the two source operands being compared is a NaN; the ordered relationship is true when neither source operand is a NaN. Having ordered and unordered comparisons allows you to pass error conditions through comparisons as false or true, depending on how you interpret the final Boolean results appearing in the lanes. Unordered results, as their name implies, are incomparable. When you compare two values, one of which is not a number, you must always treat the result as a failed comparison.

To handle this situation, you use an ordered or unordered comparison to force the result to be false or true, the opposite of what you ultimately expect when using the comparison result. For example, suppose you are comparing a sequence of values and want the resulting masks to be true if all the comparisons are valid (for example, you’re testing to see if all the src1 values are greater than the corresponding src2 values). You would use an ordered comparison in this situation that would force a particular lane to false if one of the values being compared is NaN. On the other hand, if you’re checking to see if all the conditions are false after the comparison, you’d use an unordered comparison to force the result to true if any of the values are NaN.
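
For example, the following minimal sketch performs an ordered greater-than comparison, so any lane containing a NaN produces false:

vcmpps xmm0, xmm1, xmm2, 1Eh   ; GT, ordered, quiet (synonym: vcmpgt_oqps)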

11.12.3 Signaling and Quiet Comparisons

The signaling comparisons raise an invalid arithmetic operation exception (IA) if either source operand is a NaN, even a quiet NaN. The quiet comparisons raise the exception only for signaling NaN operands; otherwise, they simply record the status in the MXCSR (see “SSE MXCSR Register” in Chapter 6). Note that you can also mask these exceptions in the MXCSR register; you must explicitly set the IM (invalid operation mask, bit 7) in the MXCSR to 0 if you want to allow exceptions.

11.12.4 Instruction Synonyms

MASM supports the use of certain synonyms so you don’t have to memorize the 32 encodings. Table 11-29 lists these synonyms. In this table, x1 denotes the destination operand (XMMn or YMMn), and x2 denotes the source operand (XMMn/mem128 or YMMn/mem256, as appropriate).

Table 11-29: Synonyms for Common Packed Floating-Point Comparisons

Synonym Instruction Synonym Instruction
cmpeqps x1, x2 cmpps x1, x2, 0 cmpeqpd x1, x2 cmppd x1, x2, 0
cmpltps x1, x2 cmpps x1, x2, 1 cmpltpd x1, x2 cmppd x1, x2, 1
cmpleps x1, x2 cmpps x1, x2, 2 cmplepd x1, x2 cmppd x1, x2, 2
cmpunordps x1, x2 cmpps x1, x2, 3 cmpunordpd x1, x2 cmppd x1, x2, 3
cmpneqps x1, x2 cmpps x1, x2, 4 cmpneqpd x1, x2 cmppd x1, x2, 4
cmpnltps x1, x2 cmpps x1, x2, 5 cmpnltpd x1, x2 cmppd x1, x2, 5
cmpnleps x1, x2 cmpps x1, x2, 6 cmpnlepd x1, x2 cmppd x1, x2, 6
cmpordps x1, x2 cmpps x1, x2, 7 cmpordpd x1, x2 cmppd x1, x2, 7

The synonyms allow you to write instructions such as

cmpeqps  xmm0, xmm1

rather than

cmpps  xmm0, xmm1, 0       ; Compare xmm0 to xmm1 for equality

Obviously, using the synonym makes the code much easier to read and understand. There aren’t synonyms for all the possible comparisons. To create readable synonyms for the instructions MASM doesn’t support, you can use a macro (or a more readable symbolic constant). For more information on macros, see Chapter 13.
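
For example, here is a hedged sketch of both approaches; the names GE_OQ and vcmpgeps_oq are hypothetical:

GE_OQ       equ     1Dh      ; GE, ordered, quiet (see Table 11-28)

vcmpgeps_oq macro   dest, src1, src2
            vcmpps  dest, src1, src2, GE_OQ
            endm

With these definitions, vcmpgeps_oq xmm0, xmm1, xmm2 assembles to vcmpps xmm0, xmm1, xmm2, 1Dh.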

11.12.5 AVX Extended Comparisons

The AVX versions of these instructions allow three register operands: a destination XMM or YMM register, a source XMM or YMM register, and a source XMM or YMM register or 128-bit or 256-bit memory location (followed by the imm8 operand specifying the type of the comparison). The basic syntax is the following:

vcmpps xmmdest, xmmsrc1, xmmsrc2/mem128, imm8
vcmpps ymmdest, ymmsrc1, ymmsrc2/mem256, imm8

vcmppd xmmdest, xmmsrc1, xmmsrc2/mem128, imm8
vcmppd ymmdest, ymmsrc1, ymmsrc2/mem256, imm8

The 128-bit vcmpps instruction compares the four single-precision floating-point values in each lane of the XMMsrc1 register against the values in the corresponding XMMsrc2/mem128 lanes and stores the true (all 1 bits) or false (all 0 bits) result into the corresponding lane of the XMMdest register. The 256-bit vcmpps instruction compares the eight single-precision floating-point values in each lane of the YMMsrc1 register against the values in the corresponding YMMsrc2/mem256 lanes and stores the true or false result into the corresponding lane of the YMMdest register.

The vcmppd instructions compare the double-precision values in the two lanes (128-bit version) or four lanes (256-bit version) and store the result into the corresponding lane of the destination register.

As for the SSE compare instructions, the AVX instructions provide synonyms that eliminate the need to memorize 32 imm8 values. Table 11-30 lists the 32 instruction synonyms.

Table 11-30: AVX Packed Compare Instructions

imm8 Instruction
00h vcmpeqps or vcmpeqpd
01h vcmpltps or vcmpltpd
02h vcmpleps or vcmplepd
03h vcmpunordps or vcmpunordpd
04h vcmpneqps or vcmpneqpd
05h vcmpnltps or vcmpnltpd
06h vcmpnleps or vcmpnlepd
07h vcmpordps or vcmpordpd
08h vcmpeq_uqps or vcmpeq_uqpd
09h vcmpngeps or vcmpngepd
0Ah vcmpngtps or vcmpngtpd
0Bh vcmpfalseps or vcmpfalsepd
0Ch vcmpneq_oqps or vcmpneq_oqpd
0Dh vcmpgeps or vcmpgepd
0Eh vcmpgtps or vcmpgtpd
0Fh vcmptrueps or vcmptruepd
10h vcmpeq_osps or vcmpeq_ospd
11h vcmplt_oqps or vcmplt_oqpd
12h vcmple_oqps or vcmple_oqpd
13h vcmpunord_sps or vcmpunord_spd
14h vcmpneq_usps or vcmpneq_uspd
15h vcmpnlt_uqps or vcmpnlt_uqpd
16h vcmpnle_uqps or vcmpnle_uqpd
17h vcmpord_sps or vcmpord_spd
18h vcmpeq_usps or vcmpeq_uspd
19h vcmpnge_uqps or vcmpnge_uqpd
1Ah vcmpngt_uqps or vcmpngt_uqpd
1Bh vcmpfalse_osps or vcmpfalse_ospd
1Ch vcmpneq_osps or vcmpneq_ospd
1Dh vcmpge_oqps or vcmpge_oqpd
1Eh vcmpgt_oqps or vcmpgt_oqpd
1Fh vcmptrue_usps or vcmptrue_uspd

11.12.6 Using SIMD Comparison Instructions

As for the integer comparisons (see “Using Packed Comparison Results” on page 662), the floating-point comparison instructions produce a vector of Boolean results that you use to mask further operations on data lanes. You can use the packed logical instructions (pand and vpand, pandn and vpandn, por and vpor, and pxor and vpxor) to manipulate these results. You could extract the individual lane values and test them with a conditional jump, though this is definitely not the SIMD way of doing things; the following section describes one way to extract these masks.

11.12.7 The (v)movmskps, (v)movmskpd Instructions

The movmskps and movmskpd instructions extract the sign bits from their packed single- and double-precision floating-point source operands and store these bits into the LO 2 to 8 bits of a general-purpose register (the exact count depends on the instruction and the size of the source operand). The syntax is

movmskps  reg, xmmsrc
movmskpd  reg, xmmsrc
vmovmskps reg, xmmsrc
vmovmskps reg, ymmsrc
vmovmskpd reg, xmmsrc
vmovmskpd reg, ymmsrc

where reg is any 32-bit or 64-bit general-purpose integer register.

The movmskps instruction extracts the sign bits from the four single-precision floating-point values in the XMM source register and copies these bits to the LO 4 bits of the destination register, as shown in Figure 11-45.

The movmskpd instruction copies the sign bits from the two double-precision floating-point values in the source XMM register to bits 0 and 1 of the destination register, as Figure 11-46 shows.

The vmovmskps instruction extracts the sign bits from the four or eight single-precision floating-point values in an XMM or a YMM source register and copies these bits to the LO 4 or 8 bits of the destination register. Figure 11-47 shows this operation with a YMM source register.

f11045

Figure 11-45: movmskps operation

f11046

Figure 11-46: movmskpd operation

f11047

Figure 11-47: vmovmskps operation

The vmovmskpd instruction copies the sign bits from the four double-precision floating-point values in the source YMM register to bits 0 to 3 of the destination register, as shown in Figure 11-48.

f11048

Figure 11-48: vmovmskpd operation

This instruction, with an XMM source register, will copy the sign bits from the two double-precision floating-point values into bits 0 and 1 of the destination register. In all cases, these instructions zero-extend the results into the upper bits of the general-purpose destination register. Note that these instructions do not allow memory operands.

Although the stated data type for these instructions is packed single-precision and packed double-precision, you will also use these instructions on 32-bit integers (movmskps and vmovmskps) and 64-bit integers (movmskpd and vmovmskpd). Specifically, these instructions are perfect for extracting 1-bit Boolean values from the various lanes after one of the (dword or qword) packed integer comparisons as well as after the single- or double-precision floating-point comparisons (remember that although the packed floating-point comparisons compare floating-point values, their results are actually integer values).

Consider the following instruction sequence:

         cmpeqpd  xmm0, xmm1
         movmskpd rax,  xmm0      ; Moves 2 bits into RAX
         lea      rcx,  jmpTable
         jmp      qword ptr [rcx][rax*8]

jmpTable qword    nene
         qword    neeq
         qword    eqne
         qword    eqeq

Because movmskpd extracts 2 bits from XMM0 and stores them into RAX, this code can use RAX as an index into a jump table to select four different branch labels. The code at label nene executes if both comparisons produce not equal; label neeq is the target when the lane 0 values are equal but the lane 1 values are not equal. Label eqne is the target when the lane 0 values are not equal but the lane 1 values are equal. Finally, label eqeq is where this code branches when both sets of lanes contain equal values.

11.13 Floating-Point Conversion Instructions

Previously, I described several instructions to convert data between various scalar floating-point and integer formats (see “SSE Floating-Point Conversions” in Chapter 6). Variants of these instructions also exist for packed data conversions. Table 11-31 lists many of these instructions you will commonly use.

Table 11-31: SSE Conversion Instructions

Instruction syntax Description
cvtdq2pd xmmdest, xmmsrc/mem64 Converts two packed signed double-word integers from XMMsrc/mem64 to two packed double-precision floating-point values in XMMdest. If YMM register is present, this instruction leaves the HO bits unchanged.
vcvtdq2pd xmmdest, xmmsrc/mem64 (AVX) Converts two packed signed double-word integers from XMMsrc/mem64 to two packed double-precision floating-point values in XMMdest. This instruction stores 0s into the HO bits of the underlying YMM register.
vcvtdq2pd ymmdest, xmmsrc/mem128 (AVX) Converts four packed signed double-word integers from XMMsrc/mem128 to four packed double-precision floating-point values in YMMdest.
cvtdq2ps xmmdest, xmmsrc/mem128 Converts four packed signed double-word integers from XMMsrc/mem128 to four packed single-precision floating-point values in XMMdest. If YMM register is present, this instruction leaves the HO bits unchanged.
vcvtdq2ps xmmdest, xmmsrc/mem128 (AVX) Converts four packed signed double-word integers from XMMsrc/mem128 to four packed single-precision floating-point values in XMMdest. If YMM register is present, this instruction writes 0s to the HO bits.
vcvtdq2ps ymmdest, ymmsrc/mem256 (AVX) Converts eight packed signed double-word integers from YMMsrc/mem256 to eight packed single-precision floating-point values in YMMdest. If YMM register is present, this instruction writes 0s to the HO bits.
cvtpd2dq xmmdest, xmmsrc/mem128 Converts two packed double-precision floating-point values from XMMsrc/mem128 to two packed signed double-word integers in XMMdest. If YMM register is present, this instruction leaves the HO bits unchanged. The conversion from floating-point to integer uses the current SSE rounding mode.
vcvtpd2dq xmmdest, xmmsrc/mem128 (AVX) Converts two packed double-precision floating-point values from XMMsrc/mem128 to two packed signed double-word integers in XMMdest. This instruction stores 0s into the HO bits of the underlying YMM register. The conversion from floating-point to integer uses the current AVX rounding mode.
vcvtpd2dq xmmdest, ymmsrc/mem256 (AVX) Converts four packed double-precision floating-point values from YMMsrc/mem256 to four packed signed double-word integers in XMMdest. The conversion of floating-point to integer uses the current AVX rounding mode.
cvtpd2ps xmmdest, xmmsrc/mem128 Converts two packed double-precision floating-point values from XMMsrc/mem128 to two packed single-precision floating-point values in XMMdest. If YMM register is present, this instruction leaves the HO bits unchanged.
vcvtpd2ps xmmdest, xmmsrc/mem128 (AVX) Converts two packed double-precision floating-point values from XMMsrc/mem128 to two packed single-precision floating-point values in XMMdest. This instruction stores 0s into the HO bits of the underlying YMM register.
vcvtpd2ps xmmdest, ymmsrc/mem256 (AVX) Converts four packed double-precision floating-point values from YMMsrc/mem256 to four packed single-precision floating-point values in YMMdest.
cvtps2dq xmmdest, xmmsrc/mem128 Converts four packed single-precision floating-point values from XMMsrc/mem128 to four packed signed double-word integers in XMMdest. If YMM register is present, this instruction leaves the HO bits unchanged. The conversion of floating-point to integer uses the current SSE rounding mode.
vcvtps2dq xmmdest, xmmsrc/mem128 (AVX) Converts four packed single-precision floating-point values from XMMsrc/mem128 to four packed signed double-word integers in XMMdest. This instruction stores 0s into the HO bits of the underlying YMM register. The conversion of floating-point to integer uses the current AVX rounding mode.
vcvtps2dq ymmdest, ymmsrc/mem256 (AVX) Converts eight packed single-precision floating-point values from YMMsrc/mem256 to eight packed signed double-word integers in YMMdest. The conversion of floating-point to integer uses the current AVX rounding mode.
cvtps2pd xmmdest, xmmsrc/mem64 Converts two packed single-precision floating-point values from XMMsrc/mem64 to two packed double-precision values in XMMdest. If YMM register is present, this instruction leaves the HO bits unchanged.
vcvtps2pd xmmdest, xmmsrc/mem64 (AVX) Converts two packed single-precision floating-point values from XMMsrc/mem64 to two packed double-precision values in XMMdest. This instruction stores 0s into the HO bits of the underlying YMM register.
vcvtps2pd ymmdest, xmmsrc/mem128 (AVX) Converts four packed single-precision floating-point values from XMMsrc/mem128 to four packed double-precision values in YMMdest.
cvttpd2dq xmmdest, xmmsrc/mem128 Converts two packed double-precision floating-point values from XMMsrc/mem128 to two packed signed double-word integers in XMMdest using truncation. If YMM register is present, this instruction leaves the HO bits unchanged.
vcvttpd2dq xmmdest, xmmsrc/mem128 (AVX) Converts two packed double-precision floating-point values from XMMsrc/mem128 to two packed signed double-word integers in XMMdest using truncation. This instruction stores 0s into the HO bits of the underlying YMM register.
vcvttpd2dq xmmdest, ymmsrc/mem256 (AVX) Converts four packed double-precision floating-point values from YMMsrc/mem256 to four packed signed double-word integers in XMMdest using truncation.
cvttps2dq xmmdest, xmmsrc/mem128 Converts four packed single-precision floating-point values from XMMsrc/mem128 to four packed signed double-word integers in XMMdest using truncation. If YMM register is present, this instruction leaves the HO bits unchanged.
vcvttps2dq xmmdest, xmmsrc/mem128 (AVX) Converts four packed single-precision floating-point values from XMMsrc/mem128 to four packed signed double-word integers in XMMdest using truncation. This instruction stores 0s into the HO bits of the underlying YMM register.
vcvttps2dq ymmdest, ymmsrc/mem256 (AVX) Converts eight packed single-precision floating-point values from YMMsrc/mem256 to eight packed signed double-word integers in YMMdest using truncation.

11.14 Aligning SIMD Memory Accesses

Most SSE and AVX instructions require their memory operands to be on a 16-byte (SSE) or 32-byte (AVX) boundary, but this is not always possible. The easiest way to handle unaligned memory addresses is to use instructions that don’t require aligned memory operands, like movdqu, movups, and movupd. However, the performance hit of using unaligned data movement instructions often defeats the purpose of using SSE/AVX instructions in the first place.

Instead, the trick to aligning data for use by SIMD instructions is to process the first few data items by using standard general-purpose registers until you reach an address that is aligned properly. For example, suppose you want to use the pcmpeqb instruction to compare blocks of 16 bytes in a large array of bytes. pcmpeqb requires its memory operands to be at 16-byte-aligned addresses, so if the memory operand is not already 16-byte-aligned, you can process the first 1 to 15 bytes in the array by using standard (non-SSE) instructions until you reach an appropriate address for pcmpeqb; for example:

cmpLp:  mov  al, [rsi]
        cmp  al, someByteValue
        je   foundByte
        inc  rsi
        test rsi, 0Fh
        jnz  cmpLp
        ; Use SSE instructions here, as RSI is now 16-byte-aligned

ANDing RSI with 0Fh produces a 0 result (and sets the zero flag) if the LO 4 bits of RSI contain 0. If the LO 4 bits of RSI contain 0, the address it contains is aligned on a 16-byte boundary.16

The only drawback to this approach is that you must process as many as 15 bytes individually until you get an appropriate address. That’s 6 × 15, or 90, machine instructions. However, for large blocks of data (say, more than about 48 or 64 bytes), you amortize the cost of the single-byte comparisons, and this approach isn’t so bad.

To improve the performance of this code, you can modify the initial address so that it begins at a 16-byte boundary. ANDing the value in RSI (in this particular example) with 0FFFFFFFFFFFFFFF0h (–16) modifies RSI so that it holds the address of the start of the 16-byte block containing the original address:17

           and  rsi, -16

To avoid matching unintended bytes before the start of the data structure, we can create a mask to cover the extra bytes. For example, suppose that we’re using the following instruction sequence to rapidly compare 16 bytes at a time:

           sub      rsi, 16
cmpLp:     add      rsi, 16
           movdqa   xmm0, xmm2   ; XMM2 contains bytes to test
           pcmpeqb  xmm0, [rsi]
           pmovmskb eax, xmm0
           test     eax, eax
           jz       cmpLp

If we use the AND instruction to align the RSI register prior to the execution of this code, we might get false results when we compare the first 16 bytes. To solve this, we can create a mask that will eliminate any bits from unintended comparisons. To create this mask, we start with all 1 bits and zero out any bits corresponding to addresses from the beginning of the 16-byte block to the first actual data item we’re comparing. This mask can be calculated using the following expression:

-1 << (startAdrs & 0xF)  ; Note: -1 is all 1 bits

This creates 0 bits in the locations before the data to compare and 1 bits thereafter (for the first 16 bytes). We can use this mask to zero out the undesired bit results from the pmovmskb instruction. The following code snippet demonstrates this technique:

           mov    rcx, rsi
           and    rsi, -16   ; Align to a 16-byte boundary
           and    ecx, 0Fh   ; Strip out offset of start of data
           mov    ebx, -1    ; 0FFFFFFFFh, all 1 bits
           shl    ebx, cl    ; Create mask

; Special case for the first 1 to 16 bytes:

           movdqa   xmm0, xmm2
           pcmpeqb  xmm0, [rsi]
           pmovmskb eax, xmm0
           and      eax, ebx
           jnz      foundByte
cmpLp:     add      rsi, 16
           movdqa   xmm0, xmm2   ; XMM2 contains bytes to test
           pcmpeqb  xmm0, [rsi]
           pmovmskb eax, xmm0
           test     eax, eax
           jz       cmpLp
foundByte:
        ; Do whatever needs to be done when the block of 16 bytes
        ; contains at least one match between the bytes in XMM2
        ; and the data at RSI

Suppose, for example, that the address is already aligned on a 16-byte boundary. ANDing that value with 0Fh produces 0. Shifting –1 to the left zero positions produces –1 (all 1 bits). Later, when the code logically ANDs this with the mask obtained after the pcmpeqb and pmovmskb instructions, the result does not change. Therefore, the code tests all 16 bytes (as we would want if the original address is 16-byte-aligned).

When the address in RSI has the value 0001b in the LO 4 bits, the actual data starts at offset 1 into the 16-byte block. So, we want to ignore the first byte when comparing the values in XMM2 against the 16 bytes at [RSI]. In this case, the mask is 0FFFFFFFEh, which is all 1s except for a 0 in bit 0. After the comparison, if bit 0 of EAX contains a 1 (meaning the bytes at offset 0 match), the AND operation eliminates this bit (replacing it with 0) so it doesn’t affect the comparison. Likewise, if the starting offset into the block is 2, 3, . . . , 15, the shl instruction modifies the bit mask in EBX to eliminate bytes at those offsets from consideration in the first compare operation. The result is that it takes only 11 instructions to do the same work as (up to) 90+ instructions in the original (byte-by-byte comparison) example.

11.15 Aligning Word, Dword, and Qword Object Addresses

When aligning non-byte-sized objects, you increment the pointer by the size of the object (in bytes) until you obtain an address that is 16- (or 32-) byte-aligned. However, this works only if the object size is 2, 4, or 8 (because any other value will likely miss addresses that are multiples of 16).

For example, you can process the first several elements of an array of word objects (where the first element of the array appears at an even address in memory) on a word-by-word basis, incrementing the pointer by 2, until you obtain an address that is divisible by 16 (or 32). Note, though, that this scheme works only if the array of objects begins at an address that is a multiple of the element size. For example, if an array of word values begins at an odd address in memory, you will not be able to get an address that is divisible by 16 or 32 with a series of additions by 2, and you would not be able to use SSE/AVX instructions to process this data without first moving it to another location in memory that is properly aligned.
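
For example, the following hedged sketch advances a word pointer in RSI to a 16-byte boundary, assuming RSI starts on an even address and enough array elements remain:

alignLp:   test rsi, 0Fh      ; LO 4 bits are 0 at a 16-byte boundary
           jz   wordsDone
           mov  ax, [rsi]     ; Process one word element with scalar code
           add  rsi, 2        ; Step by the element size
           jmp  alignLp
wordsDone: ; RSI is now 16-byte-aligned; switch to SSE/AVX code here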

11.16 Filling an XMM Register with Several Copies of the Same Value

For many SIMD algorithms, you will want multiple copies of the same value in an XMM or a YMM register. You can use the (v)movddup, (v)movshdup, (v)pinsrd, (v)pinsrq, and (v)pshufd instructions for single-precision and double-precision floating-point values. For example, if you have a single-precision floating-point value, r4var, in memory and you want to replicate it throughout XMM0, you could use the following code:

movss  xmm0, r4var
pshufd xmm0, xmm0, 0    ; Lanes 3, 2, 1, and 0 from lane 0

To copy a pair of double-precision floating-point values from r8var into XMM0, you could use:

movsd  xmm0, r8var
pshufd xmm0, xmm0, 44h  ; Copy lanes 1:0 into lanes 3:2 (44h = 01_00_01_00b)

Of course, pshufd is really intended for double-word integer operations, so additional latency (time) may be involved in using pshufd immediately after movsd or movss. Although pshufd allows a memory operand, that operand must be a 16-byte-aligned 128-bit memory operand, so it’s not useful for directly copying a floating-point value through an XMM register.

For double-precision floating-point values, you can use movddup to duplicate a single 64-bit float in the LO bits of an XMM register into the HO bits:

movddup xmm0, r8var

The movddup instruction allows unaligned 64-bit memory operands, so it’s probably the best choice for duplicating double-precision values.

To copy byte, word, dword, or qword integer values throughout an XMM register, the pshufb, pshuflw, pshufhw, and pshufd instructions are a good choice. For example, to replicate a single byte throughout XMM0, you could use the following sequence:

movzx  eax, byteToCopy
movd   xmm0, eax
pxor   xmm1, xmm1   ; Mask to copy byte 0 throughout
pshufb xmm0, xmm1

The XMM1 operand is an array of bytes containing masks used to copy data from locations in XMM0 onto itself. The value 0 copies byte 0 of XMM0 throughout all the other bytes in XMM0. This same code can be used to copy words, dwords, and qwords by simply changing the mask value in XMM1 (see the word-lane sketch after the next example). Or you could use the pshuflw or pshufd instructions to do the job. Here’s another variant that replicates a byte throughout XMM0:

movzx     eax, byteToCopy
mov       ah, al
movd      xmm0, eax
punpcklbw xmm0, xmm0    ; Copy bytes 0 and 1 to 2 and 3
pshufd    xmm0, xmm0, 0 ; Copy LO dword throughout
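
To illustrate changing the pshufb mask for another lane size, here’s a sketch that replicates word 0 of XMM0 throughout the register (wordToCopy is an assumed word variable, and wordMask is an assumed 16-byte-aligned constant whose byte pairs all select bytes 1:0):

          .const
          align   16
wordMask  byte    0, 1, 0, 1, 0, 1, 0, 1
          byte    0, 1, 0, 1, 0, 1, 0, 1

          .code
          movzx   eax, wordToCopy
          movd    xmm0, eax
          movdqa  xmm1, xmmword ptr wordMask  ; Byte pairs select bytes 1:0
          pshufb  xmm0, xmm1                  ; Word 0 -> all eight word lanes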

11.17 Loading Some Common Constants Into XMM and YMM Registers

No SSE/AVX instructions let you load an immediate constant into a register. However, you can use a couple of idioms (tricks) to load certain common constant values into an XMM or a YMM register. This section discusses some of these idioms.

Loading 0 into an SSE/AVX register uses the same idiom that general-purpose integer registers employ: exclusive-OR the register with itself. For example, to set all the bits in XMM0 to 0s, you would use the following instruction:

pxor xmm0, xmm0

To set all the bits in an XMM or a YMM register to 1, you can use the pcmpeqb instruction, as follows:

pcmpeqb xmm0, xmm0

Because any given XMM or YMM register is equal to itself, this instruction stores 0FFh in all the bytes of XMM0 (or whatever XMM or YMM register you specify).

If you want to load the 8-bit value 01h into all 16 bytes of an XMM register, you can use the following code (this comes from Intel):

pxor    xmm0, xmm0
pcmpeqb xmm1, xmm1
psubb   xmm0, xmm1   ; 0 - (-1) is (1)

You can substitute psubw or psubd for psubb in this example if you want to create 16- or 32-bit results (for example, four 32-bit dwords in XMM0, each containing the value 00000001h).

If you would like the 1 bit in a different bit position (rather than bit 0 of each byte), you can use the pslld instruction after the preceding sequence to reposition the bits. For example, if you want to load every byte of XMM0 with 80h (that is, 8080808080808080h in each qword), you could use the following instruction sequence:

pxor    xmm0, xmm0
pcmpeqb xmm1, xmm1
psubb   xmm0, xmm1
pslld   xmm0, 7         ; 01h -> 80h in each byte

Of course, you can supply a different immediate constant to pslld to load each byte in the register with 02h, 04h, 08h, 10h, 20h, or 40h.
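
For instance, assuming the preceding three-instruction sequence has already left 01h in every byte of XMM0, a shift count of 4 yields 10h in each byte:

pslld   xmm0, 4         ; 01h -> 10h in each byte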

Here’s a neat trick you can use to load 2^n – 1 (all 1 bits up to the nth bit in a number) into all the lanes of an SSE/AVX register:18

; For 16-bit lanes:

pcmpeqd  xmm0, xmm0     ; Set all bits to 1
psrlw    xmm0, 16 - n   ; Clear top 16 - n bits of xmm0

; For 32-bit lanes:

pcmpeqd  xmm0, xmm0     ; Set all bits to 1
psrld    xmm0, 32 - n   ; Clear top 32 - n bits of xmm0

; For 64-bit lanes:

pcmpeqd  xmm0, xmm0     ; Set all bits to 1
psrlq    xmm0, 64 - n   ; Clear top 64 - n bits of xmm0

You can also load the inverse (NOT(2^n – 1), all 1 bits from bit position n through the end of the register) by shifting to the left rather than the right:

; For 16-bit lanes:

pcmpeqd  xmm0, xmm0     ; Set all bits to 1
psllw    xmm0, n        ; Clear bottom n bits of xmm0

; For 32-bit lanes:

pcmpeqd  xmm0, xmm0     ; Set all bits to 1
pslld    xmm0, n        ; Clear bottom n bits of xmm0

; For 64-bit lanes:

pcmpeqd  xmm0, xmm0     ; Set all bits to 1
psllq    xmm0, n        ; Clear bottom n bits of xmm0

Of course, you can also load a “constant” into an XMM or a YMM register by putting that constant into a memory location (preferably 16- or 32-byte-aligned) and then using a movdqu or movdqa instruction to load that value into a register. Do keep in mind, however, that such an operation can be relatively slow if the data in memory does not appear in cache. Another possibility, if the constant is small enough, is to load the constant into a 32- or 64-bit integer register and use movd or movq to copy that value into an XMM register.
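
Here’s a brief sketch of both approaches (dqConst is an assumed constant defined in the .const segment purely for this illustration):

            .const
            align   16
dqConst     dword   12345678h, 12345678h, 12345678h, 12345678h

            .code
            movdqa  xmm0, xmmword ptr dqConst  ; Aligned load from memory

; Or, for a constant that fits in 64 bits:

            mov     rax, 0fedcba9876543210h
            movq    xmm0, rax                  ; Copy via an integer register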

11.18 Setting, Clearing, Inverting, and Testing a Single Bit in an SSE Register

Here’s another set of tricks suggested by Raymond Chen (https://blogs.msdn.microsoft.com/oldnewthing/20141222-00/?p=43333/) to set, clear, or test an individual bit in an XMM register.

To set an individual bit (bit n, assuming that n is a constant) with all other bits cleared, you can use the following macro:

; setXBit - Sets bit n in SSE register xReg.

setXBit  macro   xReg, n
         pcmpeqb xReg, xReg   ; Set all bits in xReg
         psrlq   xReg, 63     ; Set both 64-bit lanes to 01h
         if      n lt 64
         psrldq  xReg, 8      ; Clear the upper lane
         else
         pslldq  xReg, 8      ; Clear the lower lane
         endif
         if      (n and 3fh) ne 0
         psllq   xReg, (n and 3fh)
         endif
         endm

Once you can fill an XMM register with a single set bit, you can use that register’s value to set, clear, invert, or test that bit in another XMM register. For example, to set bit n in XMM1, without affecting any of the other bits in XMM1, you could use the following code sequence:

setXBit xmm0, n      ; Set bit n in XMM1 to 1 without
por     xmm1, xmm0   ; affecting any other bits

To clear bit n in an XMM register, you use the same sequence but substitute the vpandn (AND NOT) instruction for the por instruction:

setXBit xmm0, n            ; Clear bit n in XMM1 without
vpandn  xmm1, xmm0, xmm1   ; affecting any other bits

To invert a bit, simply substitute pxor for por or vpandn:

setXBit xmm0, n      ; Invert bit n in XMM1 without
pxor    xmm1, xmm0   ; affecting any other bits

To test a bit to see if it is set, you have a couple of options. If your CPU supports the SSE4.1 instruction set extensions, you can use the ptest instruction:

setXBit xmm0, n      ; Test bit n in XMM1
ptest   xmm1, xmm0
jnz     bitNisSet    ; Fall through if bit n is clear

If you have an older CPU that doesn’t support the ptest instruction, you can use pmovmskb as follows:

; Remember, psllq shifts bits, not bytes.
; If bit n is not in bit position 7 of a given
; byte, then move it there. For example, if n = 0, then
; (7 - (0 and 7)) is 7, so psllq moves bit 0 to bit 7.

movdqa   xmm0, xmm1
if       7 - (n and 7)
psllq    xmm0, 7 - (n and 7)
endif

; Now that the desired bit to test is sitting in bit position
; 7 of *some* byte, use pmovmskb to extract all bit 7s into AX:

pmovmskb eax, xmm0

; Now use the (integer) test instruction to test that bit:

test    ax, 1 shl (n / 8)
jnz     bitNisSet

11.19 Processing Two Vectors by Using a Single Incremented Index

Sometimes your code will need to process two blocks of data simultaneously, incrementing pointers into both blocks during the execution of the loop.

One easy way to do this is to use the scaled-indexed addressing mode. If R8 and R9 contain pointers to the data you want to process, you can walk along both blocks of data by using code such as the following:

          dec rcx    ; Assumes RCX holds the starting index (typically 0)
blkLoop:  inc rcx
          mov eax, [r8][rcx * 4]
          cmp eax, [r9][rcx * 4]
          je  theyreEqual
          cmp eax, sentinelValue
          jne blkLoop

This code marches along through the two dword arrays comparing values (to search for an equal value in the arrays at the same index). This loop uses four registers: EAX to compare the two values from the arrays, the two pointers to the arrays (R8 and R9), and then the RCX index register to step through the two arrays.

It is possible to eliminate RCX from this loop by incrementing the R8 and R9 registers in this loop (assuming it’s okay to modify the values in R8 and R9):

          sub r8, 4
          sub r9, 4
blkLoop:  add r8, 4
          add r9, 4
          mov eax, [r8]
          cmp eax, [r9]
          je  theyreEqual
          cmp eax, sentinelValue
          jne blkLoop

This scheme requires an extra add instruction in the loop. If the execution speed of this loop is critical, inserting this extra addition could be a deal breaker.

There is, however, a sneaky trick you can use so that you have to increment only a single register on each iteration of the loop:

          sub r9, r8            ; R9 = R9 - R8
          sub r8, 4
blkLoop:  add r8, 4
          mov eax, [r8]
          cmp eax, [r9][r8 * 1] ; Address = R9 + R8
          je  theyreEqual
          cmp eax, sentinelValue
          jne blkLoop

The comments explain the trick being used. At the beginning of the code, you subtract the value of R8 from R9 and leave the result in R9. In the body of the loop, you compensate for this subtraction by using the [r9][r8 * 1] scaled-indexed addressing mode (whose effective address is the sum of R8 and R9, thus restoring R9 to its original value, at least on the first iteration of the loop). Now, because the cmp instruction’s memory address is the sum of R8 and R9, adding 4 to R8 also adds 4 to the effective address used by the cmp instruction. Therefore, on each iteration of the loop, the mov and cmp instructions look at successive elements of their respective arrays, yet the code has to increment only a single pointer.

This scheme works especially well when processing SIMD arrays with SSE and AVX instructions because the XMM and YMM registers are 16 and 32 bytes each, so you can’t use normal scaling factors (1, 2, 4, or 8) to index into an array of packed data values. You wind up having to add 16 (or 32) to your pointers when stepping through the arrays, thus losing one of the benefits of the scaled-indexed addressing mode. For example:

; Assume R9 and R8 point at (32-byte-aligned) arrays of 20 double values.
; Assume R10 points at a (32-byte-aligned) destination array of 20 doubles.

          sub     r9, r8     ; R9 = R9 - R8
          sub     r10, r8    ; R10 = R10 - R8
          sub     r8, 32
          mov     ecx, 5     ; Vector with 20 (5 * 4) double values
addLoop:  add     r8, 32
          vmovapd ymm0, [r8]
          vaddpd  ymm0, ymm0, [r9][r8 * 1] ; Address = R9 + R8
          vmovapd [r10][r8 * 1], ymm0      ; Address = R10 + R8
          dec     ecx
          jnz     addLoop

11.20 Aligning Two Addresses to a Boundary

The vmovapd and vaddpd instructions from the preceding example require their memory operands to be 32-byte-aligned or you will get a general protection fault (memory access violation). If you have control over the placement of the arrays in memory, you can specify an alignment for the arrays. If you have no control over the data’s placement in memory, you have two options: working with the unaligned data regardless of the performance loss, or moving the data to a location where it is properly aligned.

If you must work with unaligned data, you can substitute an unaligned move for an aligned move (for example, vmovupd for vmovapd) or load the data into a YMM register by using an unaligned move and then operate on the data in that register by using your desired instruction. For example:

addLoop:  add     r8, 32
          vmovupd ymm0, [r8]
          vmovupd ymm1, [r9][r8 * 1]  ; Address = R9 + R8
          vaddpd  ymm0, ymm0, ymm1
          vmovupd [r10][r8 * 1], ymm0 ; Address = R10 + R8
          dec     ecx
          jnz     addLoop

Sadly, the vaddpd instruction does not support unaligned access to memory, so you must load the value from the second array (pointed at by R9) into another register (YMM1) before the packed addition operation. This is the drawback to unaligned access: not only are unaligned moves slower, but you also may need to use additional registers and instructions to deal with unaligned data.

Moving the data to a memory location whose alignment you can control is an option when you have a data operand you will be using over and over again in the future. Moving data is an expensive operation; however, if you have a standard block of data you’re going to compare against many other blocks, you can amortize the cost of moving that block to a new location over all the operations you need to do.

Moving the data is especially useful when one (or both) of the data arrays appears at an address that is not an integral multiple of the size of its subelements. For example, if you have an array of dwords that begins at an odd address, you will never be able to align a pointer to that array’s data to a 16-byte boundary without moving the data.
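
Here is a minimal sketch of creating such an aligned copy (buffer, unalignedPtr, and numBytes are assumptions for this illustration: buffer must reserve at least numBytes + 31 bytes, and numBytes must be nonzero):

            lea     rdi, buffer
            add     rdi, 31
            and     rdi, -32           ; RDI = first 32-byte boundary in buffer
            mov     rdx, rdi           ; Save the aligned base address
            mov     rsi, unalignedPtr  ; Source (arbitrary alignment)
            mov     rcx, numBytes
copyLp:     mov     al, [rsi]          ; Simple byte-at-a-time copy
            mov     [rdi], al
            inc     rsi
            inc     rdi
            dec     rcx
            jnz     copyLp

; RDX now points at a 32-byte-aligned copy of the data.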

11.21 Working with Blocks of Data Whose Length Is Not a Multiple of the SSE/AVX Register Size

Using SIMD instructions to march through a large data set processing 2, 4, 8, 16, or 32 values at a time often allows a SIMD algorithm (a vectorized algorithm) to run an order of magnitude faster than the SISD (scalar) algorithm. However, two boundary conditions create problems: the start of the data set (when the starting address might not be properly aligned) and the end of the data set (when there might not be a sufficient number of array elements to completely fill an XMM or a YMM register). I’ve addressed the issues with the start of the data set (misaligned data) already. This section takes a look at the latter problem.

For the most part, when you run out of data at the end of the array (and the XMM and YMM registers need more for a packed operation), you can use the same technique given earlier for aligning a pointer: load more data than is necessary into the register and mask out the unneeded results. For example, if only 8 bytes are left to process in a byte array, you can load 16 bytes, do the operation, and ignore the results from the last 8 bytes. In the comparison loop examples I’ve been using through these past sections, you could do the following:

movdqa   xmm0, [r8]
pcmpeqd  xmm0, [r9]
pmovmskb eax, xmm0
and      eax, 0ffh     ; Mask out the last 8 compares
cmp      eax, 0ffh
je       matchedData

In most cases, accessing data beyond the end of the data structures (either the data pointed at by R8, R9, or both in this example) is harmless. However, as you saw in “Memory Access and 4K Memory Management Unit Pages” in Chapter 3, if that extra data happens to cross a memory management unit page, and that new page doesn’t allow read access, the CPU will generate a general protection fault (memory access or segmentation fault). Therefore, unless you know that valid data follows the array in memory (at least to the extent the instruction references), you shouldn’t access that memory area; doing so could crash your software.

This problem has two solutions. First, you can align memory accesses on an address boundary that is the same size as the register (for example, 16-byte alignment for XMM registers). Accessing data beyond the end of the data structure with an SSE/AVX instruction will not cross a page boundary (because 16-byte accesses aligned on 16-byte boundaries will always fall within the same MMU page, and ditto for 32-byte accesses on 32-byte boundaries).

The second solution is to examine the memory address prior to accessing memory. While you cannot access the new page without possibly triggering an access fault,19 you can check the address itself and see if accessing 16 (or 32) bytes at that address will access data in a new page. If it would, you can take some precautions before accessing the data on the next page. For example, rather than continuing to process the data in SIMD mode, you could drop down to SISD mode and finish processing the data to the end of the array by using standard scalar instructions.

To test if a SIMD access will cross an MMU page boundary, supposing that R9 contains the address at which you’re about to access 16 bytes in memory using an SSE instruction, use code like the following:

mov  eax, r9d
and  eax, 0fffh
cmp  eax, 0ff0h
ja   willCrossPage

Each MMU page is 4KB long and is situated on a 4KB address boundary in memory. Therefore, the LO 12 bits of an address provide an index into the MMU page associated with that address. The preceding code checks whether the address has a page offset greater than 0FF0h (4080). If so, then accessing 16 bytes starting at that address will cross a page boundary. Check for a value of 0FE0h if you need to check for a 32-byte access.
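
Here’s a sketch of how this test can steer the code into a scalar tail (it assumes R9 points at the remaining bytes, RCX holds the count of bytes left, and scalarTail and foundIt are placeholder labels for a zero-byte scan):

chkPage:    mov     eax, r9d
            and     eax, 0fffh
            cmp     eax, 0ff0h
            ja      scalarTail    ; A 16-byte read would cross the page

; ...process the next 16 bytes with SSE instructions here...

scalarTail: mov     al, [r9]      ; Finish the job byte by byte
            cmp     al, 0
            je      foundIt
            inc     r9
            dec     rcx
            jnz     scalarTail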

11.22 Dynamically Testing for a CPU Feature

At the beginning of this chapter, I mentioned that when testing the CPU feature set to determine which extensions it supports, the best solution is to dynamically select a set of functions based on the presence or absence of certain capabilities. To demonstrate dynamically testing for, and using (or avoiding), certain CPU features—specifically, testing for the presence of AVX extensions—I’ll modify (and expand) the print procedure that I’ve been using in examples up to this point.

The print procedure I’ve been using is very convenient, but it doesn’t preserve any SSE or AVX registers that a call to printf() could (legally) modify. A generic version of print should preserve the volatile XMM and YMM registers as well as general-purpose registers.

The problem is that you cannot write a generic version of print that will run on all CPUs. If you preserve the XMM registers only, the code will run on any x86-64 CPU. However, if the CPU supports the AVX extensions and the program uses YMM0 to YMM5, the print routine will preserve only the LO 128 bits of those registers, as they are aliased to the corresponding XMM registers. If you save the volatile YMM registers, that code will crash on a CPU that doesn’t support the AVX extensions. So, the trick is to write code that will dynamically determine whether the CPU has the AVX registers and preserve them if they are present, and otherwise preserve only the SSE registers.

The easy way to do this, and probably the most appropriate solution for the print function, is to simply stick the cpuid instruction inside print and test the results immediately before preserving (and restoring) the registers. Here’s a code fragment that demonstrates how this could be done:

AVXSupport  =     10000000h              ; Bit 28

print       proc

; Preserve all the volatile registers
; (be nice to the assembly code that
; calls this procedure):

            push    rax
            push    rbx                  ; CPUID messes with EBX
            push    rcx
            push    rdx
            push    r8
            push    r9
            push    r10
            push    r11

; Reserve space on the stack for the AVX/SSE registers.
; Note: SSE registers need only 96 bytes, but the code
; is easier to deal with if we reserve the full 192 bytes
; that the AVX registers need and ignore the extra 96
; bytes when running SSE code.

            sub     rsp, 192

; Determine if we have to preserve the YMM registers:

            mov     eax, 1
            cpuid
            test    ecx, AVXSupport      ; Test bit 28 for AVX
            jnz     preserveAVX

; No AVX support, so just preserve the XMM0 to XMM5 registers:

            movdqu  xmmword ptr [rsp + 00], xmm0
            movdqu  xmmword ptr [rsp + 16], xmm1
            movdqu  xmmword ptr [rsp + 32], xmm2
            movdqu  xmmword ptr [rsp + 48], xmm3
            movdqu  xmmword ptr [rsp + 64], xmm4
            movdqu  xmmword ptr [rsp + 80], xmm5
            jmp     restOfPrint

; YMM0 to YMM5 are considered volatile, so preserve them:

preserveAVX: 
            vmovdqu ymmword ptr [rsp + 000], ymm0
            vmovdqu ymmword ptr [rsp + 032], ymm1
            vmovdqu ymmword ptr [rsp + 064], ymm2
            vmovdqu ymmword ptr [rsp + 096], ymm3
            vmovdqu ymmword ptr [rsp + 128], ymm4
            vmovdqu ymmword ptr [rsp + 160], ymm5

restOfPrint:
        The rest of the print function goes here

At the end of the print function, when it’s time to restore everything, you could do another test to determine whether to restore XMM or YMM registers.20

For other functions, when you might not want the expense of cpuid (and preserving all the registers it stomps on) incurred on every function call, the trick is to write three functions: one for SSE CPUs, one for AVX CPUs, and a special function (that you call only once) that selects which of these two you will call in the future. The bit of magic that makes this efficient is indirection. You won’t directly call any of these functions. Instead, you’ll initialize a pointer with the address of the function to call and indirectly call one of these three functions by using the pointer. For the current example, we’ll name this pointer print and initialize it with the address of the third function, choosePrint:

          .data
print     qword   choosePrint

Here’s the code for choosePrint:

; On first call, determine if we support AVX instructions
; and set the "print" pointer to point at print_AVX or
; print_SSE:
 
choosePrint proc
            push    rax             ; Preserve registers that get
            push    rbx             ; tweaked by CPUID
            push    rcx
            push    rdx
            
            mov     eax, 1
            cpuid
            test    ecx, AVXSupport ; Test bit 28 for AVX
            jnz     doAVXPrint
            
            lea     rax, print_SSE  ; From now on, call
            mov     print, rax      ; print_SSE directly

; Return address must point at the format string
; following the call to this function! So we have
; to clean up the stack and JMP to print_SSE.

            pop     rdx
            pop     rcx
            pop     rbx
            pop     rax
            jmp     print_SSE
            
doAVXPrint: lea     rax, print_AVX  ; From now on, call
            mov     print, rax      ; print_AVX directly
            
; Return address must point at the format string
; following the call to this function! So we have
; to clean up the stack and JMP to print_AVX.

            pop     rdx
            pop     rcx
            pop     rbx
            pop     rax
            jmp     print_AVX

choosePrint endp

The print_SSE procedure runs on CPUs without AVX support, and the print_AVX procedure runs on CPUs with AVX support. The choosePrint procedure executes the cpuid instruction to determine whether the CPU supports the AVX extensions; if so, it initializes the print pointer with the address of the print_AVX procedure, and if not, it stores the address of print_SSE into the print variable.

choosePrint is not an explicit initialization procedure you must call prior to calling print. The choosePrint procedure executes only once (assuming you call it via the print pointer rather than calling it directly). After the first execution, the print pointer contains the address of the CPU-appropriate print function, and choosePrint no longer executes.

You call the print pointer just as you would make any other call to print; for example:

call print
byte "Hello, world!", nl, 0

After setting up the print pointer, choosePrint must transfer control to the appropriate print procedure (print_SSE or print_AVX) to do the work the user is expecting. Because preserved register values are sitting on the stack, and the actual print routines expect only a return address, choosePrint will first restore all the (general-purpose) registers it saved and then jump to (not call) the appropriate print procedure. It does a jump, rather than a call, because the return address pointing to the format string is already sitting on the top of the stack. On return from the print_SSE or print_AVX procedure, control will return to whoever called choosePrint (via the print pointer).

Listing 11-5 shows the complete print function, with print_SSE and print_AVX, and a simple main program that calls print. I’ve extended print to accept arguments in R10 and R11 as well as in RDX, R8, and R9 (this function reserves RCX to hold the address of the format string following the call to print).

; Listing 11-5
 
; Generic print procedure and dynamically
; selecting CPU features.

        option  casemap:none

nl          =       10

; SSE4.2 feature flags (in ECX):

SSE42       =       00180000h       ; Bits 19 and 20
AVXSupport  =       10000000h       ; Bit 28

; CPUID bits (EAX = 7, EBX register)

AVX2Support  =      20h             ; Bit 5 = AVX2

            .const
ttlStr      byte    "Listing 11-5", 0

            .data
            align   qword
print       qword   choosePrint     ; Pointer to print function

; Floating-point values for testing purposes:

fp1         real8   1.0
fp2         real8   2.0
fp3         real8   3.0
fp4         real8   4.0
fp5         real8   5.0
            
            .code
            externdef printf:proc
            
; Return program title to C++ program:

            public  getTitle
getTitle    proc
            lea     rax, ttlStr
            ret
getTitle    endp

; ***************************************************************

; print - "Quick" form of printf that allows the format string to
;         follow the call in the code stream. Supports up to five
;         additional parameters in RDX, R8, R9, R10, and R11.

; This function saves all the Microsoft ABI–volatile,
; parameter, and return result registers so that code
; can call it without worrying about any registers being
; modified (this code assumes that Windows ABI treats
; YMM8 to YMM15 as nonvolatile).

; Of course, this code assumes that AVX instructions are
; available on the CPU.

; Allows up to 5 arguments in:

;  RDX - Arg #1
;  R8  - Arg #2
;  R9  - Arg #3
;  R10 - Arg #4
;  R11 - Arg #5

; Note that you must pass floating-point values in
; these registers, as well. The printf function
; expects real values in the integer registers. 

; There are two versions of this function, one that
; will run on CPUs without AVX capabilities (no YMM
; registers) and one that will run on CPUs that
; have AVX capabilities (YMM registers). The difference
; between the two is which registers they preserve
; (print_SSE preserves only XMM registers and will
; run properly on CPUs that don't have YMM register
; support; print_AVX will preserve the volatile YMM
; registers on CPUs with AVX support).

; On first call, determine if we support AVX instructions
; and set the "print" pointer to point at print_AVX or
; print_SSE:

choosePrint proc
            push    rax             ; Preserve registers that get
            push    rbx             ; tweaked by CPUID
            push    rcx
            push    rdx
            
            mov     eax, 1
            cpuid
            test    ecx, AVXSupport ; Test bit 28 for AVX
            jnz     doAVXPrint
            
            lea     rax, print_SSE  ; From now on, call
            mov     print, rax      ; print_SSE directly

; Return address must point at the format string
; following the call to this function! So we have
; to clean up the stack and JMP to print_SSE.

            pop     rdx
            pop     rcx
            pop     rbx
            pop     rax
            jmp     print_SSE
            
doAVXPrint: lea     rax, print_AVX  ; From now on, call
            mov     print, rax      ; print_AVX directly
            
; Return address must point at the format string
; following the call to this function! So we have
; to clean up the stack and JMP to print_AVX.

            pop     rdx
            pop     rcx
            pop     rbx
            pop     rax
            jmp     print_AVX

choosePrint endp

; Version of print that will preserve volatile
; AVX registers (YMM0 to YMM3):

print_AVX   proc

; Preserve all the volatile registers
; (be nice to the assembly code that
; calls this procedure):

            push    rax
            push    rbx
            push    rcx
            push    rdx
            push    r8
            push    r9
            push    r10
            push    r11
            
; YMM0 to YMM7 are considered volatile, so preserve them:

            sub     rsp, 256
            vmovdqu ymmword ptr [rsp + 000], ymm0
            vmovdqu ymmword ptr [rsp + 032], ymm1
            vmovdqu ymmword ptr [rsp + 064], ymm2
            vmovdqu ymmword ptr [rsp + 096], ymm3
            vmovdqu ymmword ptr [rsp + 128], ymm4
            vmovdqu ymmword ptr [rsp + 160], ymm5
            vmovdqu ymmword ptr [rsp + 192], ymm6
            vmovdqu ymmword ptr [rsp + 224], ymm7
            
            push    rbp

returnAdrs  textequ <[rbp + 328]>

            mov     rbp, rsp
            sub     rsp, 128
            and     rsp, -16
            
; Format string (passed in RCX) is sitting at
; the location pointed at by the return address,
; load that into RCX:

            mov     rcx, returnAdrs

; To handle more than 3 arguments (4 counting
; RCX), you must pass data on stack. However, to the
; print caller, the stack is unavailable, so use
; R10 and R11 as extra parameters (could be just
; junk in these registers, but pass them just
; in case):

            mov     [rsp + 32], r10
            mov     [rsp + 40], r11
            call    printf
            
; Need to modify the return address so
; that it points beyond the zero-terminating byte.
; Could use a fast strlen function for this, but
; printf is so slow it won't really save us anything.
            
            mov     rcx, returnAdrs
            dec     rcx
skipTo0:    inc     rcx
            cmp     byte ptr [rcx], 0
            jne     skipTo0
            inc     rcx
            mov     returnAdrs, rcx
            
            leave
            vmovdqu ymm0, ymmword ptr [rsp + 000]
            vmovdqu ymm1, ymmword ptr [rsp + 032]
            vmovdqu ymm2, ymmword ptr [rsp + 064]
            vmovdqu ymm3, ymmword ptr [rsp + 096]
            vmovdqu ymm4, ymmword ptr [rsp + 128]
            vmovdqu ymm5, ymmword ptr [rsp + 160]
            vmovdqu ymm6, ymmword ptr [rsp + 192]
            vmovdqu ymm7, ymmword ptr [rsp + 224]
            add     rsp, 256
            pop     r11
            pop     r10
            pop     r9
            pop     r8
            pop     rdx
            pop     rcx
            pop     rbx
            pop     rax
            ret
print_AVX   endp

; Version that will run on CPUs without
; AVX support and will preserve the
; volatile SSE registers (XMM0 to XMM3):

print_SSE   proc

; Preserve all the volatile registers
; (be nice to the assembly code that
; calls this procedure):

            push    rax
            push    rbx
            push    rcx
            push    rdx
            push    r8
            push    r9
            push    r10
            push    r11
            
; XMM0 to XMM3 are considered volatile, so preserve them:

            sub     rsp, 128
            movdqu  xmmword ptr [rsp + 00],  xmm0
            movdqu  xmmword ptr [rsp + 16],  xmm1
            movdqu  xmmword ptr [rsp + 32],  xmm2
            movdqu  xmmword ptr [rsp + 48],  xmm3
            movdqu  xmmword ptr [rsp + 64],  xmm4
            movdqu  xmmword ptr [rsp + 80],  xmm5
            movdqu  xmmword ptr [rsp + 96],  xmm6
            movdqu  xmmword ptr [rsp + 112], xmm7
            
            push    rbp

returnAdrs  textequ <[rbp + 200]>

            mov     rbp, rsp
            sub     rsp, 128
            and     rsp, -16
            
; Format string (passed in RCX) is sitting at
; the location pointed at by the return address,
; load that into RCX:

            mov     rcx, returnAdrs
            
; To handle more than 3 arguments (4 counting
; RCX), you must pass data on stack. However, to the
; print caller, the stack is unavailable, so use
; R10 and R11 as extra parameters (could be just
; junk in these registers, but pass them just
; in case):

            mov     [rsp + 32], r10
            mov     [rsp + 40], r11
            call    printf
            
; Need to modify the return address so
; that it points beyond the zero-terminating byte.
; Could use a fast strlen function for this, but
; printf is so slow it won't really save us anything.
            
            mov     rcx, returnAdrs
            dec     rcx
skipTo0:    inc     rcx
            cmp     byte ptr [rcx], 0
            jne     skipTo0
            inc     rcx
            mov     returnAdrs, rcx
            
            leave
            movdqu  xmm0, xmmword ptr [rsp + 00] 
            movdqu  xmm1, xmmword ptr [rsp + 16] 
            movdqu  xmm2, xmmword ptr [rsp + 32] 
            movdqu  xmm3, xmmword ptr [rsp + 48] 
            movdqu  xmm4, xmmword ptr [rsp + 64] 
            movdqu  xmm5, xmmword ptr [rsp + 80] 
            movdqu  xmm6, xmmword ptr [rsp + 96] 
            movdqu  xmm7, xmmword ptr [rsp + 112] 
            add     rsp, 128
            pop     r11
            pop     r10
            pop     r9
            pop     r8
            pop     rdx
            pop     rcx
            pop     rbx
            pop     rax
            ret
print_SSE   endp 
            
; ***************************************************************
            
; Here is the "asmMain" function.
        
            public  asmMain
asmMain     proc
            push    rbx
            push    rsi
            push    rdi
            push    rbp
            mov     rbp, rsp
            sub     rsp, 56         ; Shadow storage

; Trivial example, no arguments:

            call    print
            byte    "Hello, world!", nl, 0
            
; Simple example with integer arguments:

            mov     rdx, 1          ; Argument #1 for printf
            mov     r8, 2           ; Argument #2 for printf
            mov     r9, 3           ; Argument #3 for printf
            mov     r10, 4          ; Argument #4 for printf
            mov     r11, 5          ; Argument #5 for printf
            call    print
            byte    "Arg 1=%d, Arg2=%d, Arg3=%d "
            byte    "Arg 4=%d, Arg5=%d", nl, 0
            
; Demonstration of floating-point operands. Note that
; args 1, 2, and 3 must be passed in RDX, R8, and R9.
; You'll have to load parameters 4 and 5 into R10 and R11.

            mov     rdx, qword ptr fp1
            mov     r8,  qword ptr fp2
            mov     r9,  qword ptr fp3
            mov     r10, qword ptr fp4
            mov     r11, qword ptr fp5
            call    print
            byte    "Arg1=%6.1f, Arg2=%6.1f, Arg3=%6.1f "
            byte    "Arg4=%6.1f, Arg5=%6.1f ", nl, 0
                         
allDone:    leave
            pop     rdi
            pop     rsi
            pop     rbx
            ret     ; Returns to caller
asmMain     endp
            end

Listing 11-5: Dynamically selected print procedure

Here’s the build command and output for the program in Listing 11-5:

C:\>build listing11-5

C:\>echo off
 Assembling: listing11-5.asm
c.cpp

C:\>listing11-5
Calling Listing 11-5:
Hello, world!
Arg 1=1, Arg2=2, Arg3=3 Arg 4=4, Arg5=5
Arg1=   1.0, Arg2=   2.0, Arg3=   3.0 Arg4=   4.0, Arg5=   5.0
Listing 11-5 terminated

11.23 The MASM Include Directive

As you’ve seen already, including the source code for the print procedure in every sample listing in this book wastes a lot of space. Including the new version from the previous section in every listing would be impractical. In Chapter 15, I discuss include files, libraries, and other functionality you can use to break large projects into manageable pieces. In the meantime, however, it’s worthwhile to discuss the MASM include directive so this book can eliminate a lot of unnecessary code duplication in sample programs.

The MASM include directive uses the following syntax:

include  source_filename

where source_filename is the name of a text file (generally in the same directory as the source file containing the include directive). MASM will insert the contents of that file into the assembly at the point of the include directive, exactly as though the text in that file had appeared in the source file being assembled.

For example, I have extracted all the source code associated with the new print procedure (the choosePrint, print_AVX, and print_SSE procedures, and the print qword variable) and placed it in the print.inc source file.21 In listings that follow in this book, I’ll simply place the following directive in the code in place of the print function:

include print.inc

I’ve also put the getTitle procedure into its own header file (getTitle.inc) to be able to remove that common code from sample listings.
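
As a purely hypothetical sketch, a minimal listing built from these two include files might look like the following (the comments state what each file is assumed to supply):

        option  casemap:none
nl      =       10

        .const
ttlStr  byte    "Sample Listing", 0   ; Title string getTitle returns

        include getTitle.inc          ; Assumed to supply getTitle
        include print.inc             ; Assumed to supply print and friends

        .code
        public  asmMain
asmMain proc
        call    print
        byte    "Hello, include!", nl, 0
        ret
asmMain endp
        end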

11.24 And a Whole Lot More

This chapter doesn’t even begin to describe all the various SSE, AVX, AVX2, and AVX-512 instructions. As already mentioned, most of the SIMD instructions serve specific purposes (such as interleaving or deinterleaving bytes associated with video or audio information) that aren’t very useful outside their particular problem domains. Other instructions (at least, as this book was being written) are sufficiently new that they won’t execute on many CPUs in use today. If you’re interested in learning about more of the SIMD instructions, check out the information in the next section.

11.25 For More Information

For more information about the cpuid instruction on AMD CPUs, see the 2010 AMD document “CPUID Specification” (https://www.amd.com/system/files/TechDocs/25481.pdf). For Intel CPUs, check out “Intel Architecture and Processor Identification with CPUID Model and Family Numbers” (https://software.intel.com/en-us/articles/intel-architecture-and-processor-identification-with-cpuid-model-and-family-numbers/).

Microsoft’s website (particularly the Visual Studio documentation) has additional information on the MASM segment directive and x86-64 segments. A search for MASM Segment Directive on the internet, for example, brought up the page https://docs.microsoft.com/en-us/cpp/assembler/masm/segment?view=msvc-160/.

The complete discussion of all the SIMD instructions can be found in Intel’s documentation: Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 2: Instruction Set Reference.

You can easily find this documentation online at Intel’s website. AMD’s variant can be found at https://www.amd.com/system/files/TechDocs/40332.pdf.

Although this chapter has presented many of the SSE/AVX/AVX2 instructions and what they do, it has not spent much time describing how you would use these instructions in a typical program. You can easily find lots of useful high-performance algorithms that use SSE and AVX instructions on the internet; searching for topics such as the following will turn up plenty of examples:

Tutorials on SIMD programming

Sorting algorithms

Search algorithms

11.26 Test Yourself

  1. How can you determine whether a particular SSE or AVX feature is available on the CPU?
  2. Why is it important to check the manufacturer of the CPU?
  3. What EAX setting do you use with cpuid to obtain the feature flags?
  4. What feature flag bit tells you that the CPU supports SSE4.2 instructions?
  5. What is the name of the default segment used by the following directives?
    1. .code
    2. .data
    3. .data?
    4. .const
  6. What is the default segment alignment?
  7. How would you create a data segment aligned on a 64-byte boundary?
  8. Which instruction set extensions support the YMMx registers?
  9. What is a lane?
  10. What is the difference between a scalar instruction and a vector instruction?
  11. SSE memory operands (XMM) must usually be aligned on what memory boundary?
  12. AVX memory operands (YMM) must usually be aligned on what memory boundary?
  13. AVX-512 memory operands (ZMM) must usually be aligned on what memory boundary?
  14. What instruction would you use to move the data from a 32-bit general-purpose integer register into the LO 32 bits of an XMM and a YMM register?
  15. What instruction would you use to move the data from a 64-bit general-purpose integer register into the LO 64 bits of an XMM and a YMM register?
  16. What three instructions would you use to load 16 bytes from an aligned memory location into an XMM register?
  17. What three instructions would you use to load 16 bytes from an arbitrary memory address into an XMM register?
  18. If you want to move the HO 64 bits of an XMM register into the HO 64 bits of another XMM register without affecting the LO 64 bits of the destination, what instruction would you use?
  19. If you want to duplicate a double-precision value in the LO 64 bits of an XMM register in the two qwords (LO and HO) of another XMM register, what instruction would you use?
  20. Which instruction would you use to rearrange the bytes in an XMM register?
  21. Which instruction would you use to rearrange the dword lanes in an XMM register?
  22. Which instructions would you use to extract bytes, words, dwords, or qwords from an XMM register and move them into a general-purpose register?
  23. Which instructions would you use to take a byte, word, dword, or qword in a general-purpose register and insert it somewhere in an XMM register?
  24. What does the andnpd instruction do?
  25. Which instruction would you use to shift the bytes in an XMM register one byte position to the left (8 bits)?
  26. Which instruction would you use to shift the bytes in an XMM register one byte position to the right (8 bits)?
  27. If you want to shift the two qwords in an XMM register n bit positions to the left, what instruction would you use?
  28. If you want to shift the two qwords in an XMM register n bit positions to the right, what instruction would you use?
  29. What happens in a paddb instruction when a sum will not fit into 8 bits?
  30. What is the difference between a vertical addition and a horizontal addition?
  31. Where does the pcmpeqb instruction put the result of the comparison? How does it indicate the result is true?
  32. There is no pcmpltq instruction. Explain how to compare lanes in a pair of XMM registers for the less-than condition.
  33. What does the pmovmskb instruction do?
  34. How many simultaneous additions are performed by the following?
    1. addps
    2. addpd
  35. If you have a pointer to data in RAX and want to force that address to be aligned on a 16-byte boundary, what instruction would you use?
  36. How can you set all the bits in the XMM0 register to 0?
  37. How can you set all the bits in the XMM1 register to 1?
  38. What directive do you use to insert the content of a source file into the current source file during assembly?