Today I will cover a few techniques on how SIMD (Single instruction - multiple data) processing is possible with just plain integer instructions with the C language. I primarily show techniques related to image manipulation as this is the field where I mostly applied these. With these you may hopefully be able to write faster cross-platform image manipulation routines.
So why would one use plain integer arithmetic when just about any reasonable target for image manipulation today has native SIMD instruction set? There are a few probable goals.
Number one is being cross platform. Plain C integer instructions are available everywhere while every probable target has a different SIMD instruction set requiring specialized (assembly) code.
Number two is (probably) clarity. If you write the routines in plain C instead of some arcane assembler dialect, it might (not necessarily!) be more easy to understand by other people (Or you if you pick up your code two years later).
To be portable, it might be a good practice to provide fast assembly solutions for a set of known targets while providing the integer solution to targets which weren't considered at the time of development. This way the code will compile, and although with limited performance, the program will run.
The basic idea
Using wide (like 32 bits) integers to process narrower (like 8 bits) quantities is possible if you know the basic arithmetic operations work. But hey, first of all, why would one do this?
The goal of this is performance, and sometimes also smaller code, which in some cases might even turn out to be cleaner than not doing it. Good candidates for manipulating more pieces of data in one operation are situations where the data natively comes and goes in packs, the most common example being an (alpha)RGB color with each component taking 8 bits.
If you examine the basic arithmetic, you will find out that with most of the operations in certain conditions sufficiently separated parts of a number behave independently. With addition and subtraction it is clear that unless there is a carry over, the higher order positions will be unaffected by the lower order ones. So you can for example add together two sets of four 8 bit numbers as two 32 bit numbers if you know that neither 8 bit addition will overflow.
The more important part is realizing how multiplication works as on some architectures it is even a costly operation, so it pays off packing them where possible.
For this let's get back to the traditional pen and paper multiplication, with an example following:
Pen and paper multiplication example 230067 * 94 ------ 2070603 920268 -------- 21626298 Note that it also calculated 23*94=2162 and 67*94=6298 independently!
If you look at it carefully, you will see why it has this interesting property (That it also nicely calculates those two other multiplications): It just boils down to addition. The lower order multiplication can not affect the higher order one until the multiplier is less or equal to 101, as even then it will only make at most 9999, so there will be no carry over that two zeros of "padding". If you translate this knowledge to hexadecimal (Which is more useful at programming), you will get you can multiply a hexadecimal number like 0xXX00YY with one up to 0x101, and you will get 0xXX x multiplier in the high order 16 bits, and 0xYY x multiplier in the low order bits.
You can of course exploit it with different bit depths as well if you don't want to multiply 8 bit numbers, but that's the most common. For example with 4 bit ones you would get: 0xX0Y0Z0Q multiplied with up to 0x11, and you will get four distinct 8 bit results.
That's basically it. Note that I omitted division: for division, there is no such technique. You should avoid dividing anyways in performance coding as it is a very costly operation on all common architectures.
Real life usage
So how these goodies could be used then? Let's get some examples!
The most common would be calculating an alpha blending. You get an Alpha-RGB colored pixel, which you want to plot on some surface. So you need to get the RGB color of the appropriate pixel on the surface, compute the alpha blending with the Alpha-RGB pixel, and write it back. Preferably a lot of times for blitting an image.
The code below does it with only 4 multiplications, nice and clean:
Alpha blending example srcp: The source pixel; 32bit ARGB order (A=255 is translucent). dstp: The destination pixel; 32bit -RGB order. mp=srcp>>24; /* Retrieve source alpha for multiplying the destination */ mn=256-mp; /* Complement it for the source pixel */ mp++; /* Make it in 1-256 range, so blending will be clean */ dstp=( ((((srcp & 0x00FF00FF)*mn)+((dstp & 0x00FF00FF)*mp)) & 0xFF00FF00)+ ((((srcp & 0x0000FF00)*mn)+((dstp & 0x0000FF00)*mp)) & 0x00FF0000) )>>8;
It basically uses all the techniques presented above. For the multiplication it packs the Red and Blue components together to do them in one operation, and for merging the Red-Blue with Green it also does a packed 8 bit addition where it is known that no carry over will occur (Since mn+mp=0x101, the added result of the two multiplications can not exceed 0xFFFF).
To speed things up, you may also pre-multiply image data (Alpha-RGB indexed images are particularly good for this as they can be pre-multiplied real fast!). This case half of the multiplications will go to the pre-multiplying, so the actual blitting done many times will only take two multiplications.
If you want to work fast, you may also rather do a dirty 16 level alpha blend the following way:
Fast & dirty alpha blending example srcp: The source pixel; 32bit ARGB order (A=255 is translucent). dstp: The destination pixel; 32bit -RGB order. mp=srcp>>28; /* Retrieve source alpha for multiplying the destination */ mn=16-mp; /* Complement it for the source pixel */ mp++; /* Make it in 1-16 range, so blending will be clean */ dstp=((((srcp & 0x0F0F0F0)*mn)+((dstp & 0x0F0F0F0)*mp)) & 0x0F0F0F00)>>8; dstp+=dstp<<4;
Just two multiplications, and this might still be suitable to blit images which are just anti-aliased by alpha (Note that though it reduces the depth of each component to 4 bits unless you use an if branch for processing non-translucent pixels). If you pre-multiply accordingly, you can reduce the multiplication cost of plotting one pixel to a single multiplication.
Nice, huh? :)