A brief discussion about texturemapper innerloops using ADDX
HTML version by Azure.

Note: This is not a basic introduction to texture mapping; it is just an article on optimizing inner loops.

If you have U and V (the two texture coordinates) as 8.8 fixed point, then you can have:

    ; d0 U      ----UUuu  (UU = integer bits, uu = fractional bits)
    ; d1 V      ----VVvv
    ; d2 dU/dx  ----UUuu  "horizontal slope of U"
    ; d3 dV/dx  ----VVvv

.pixel
    move.w  d1,d4           ; d4 (offset reg) = VVxx
    move.w  d0,d5           ; Temporarily...
    lsr.w   #8,d5           ; ...to get UU in the lower byte of a reg
    move.b  d5,d4           ; d4 = VVUU = correct offset
    move.b  (a0,d4.w),(a1)+
    add.w   d2,d0
    add.w   d3,d1
    dbf     d7,.pixel

There is, however, a very nifty instruction which we can make use of here: addx.
Addx is intended for adding arbitrarily large integers. It adds the two operands together, and if the X flag is set it also adds 1 to the result. Imagine that you want to add the 96-bit number in d2:d1:d0 to another 96-bit number in d5:d4:d3.
Then you would proceed like this:

    add.l   d0,d3           ; First add the lowermost part
    addx.l  d1,d4           ; ...then the next -- with X flag
    addx.l  d2,d5           ; ...and finally the highest (with X)

[There is a special form of addx: "addx -(am),-(an)". It is intended for use when the two numbers are lying around in memory. Quite useful for handling arbitrarily large numbers, or just numbers of different sizes. (There's no use for this in realtime applications, though. :))]
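As an aside, a minimal sketch of that memory form (the pointer registers and the 96-bit size are my assumptions, not from the article); note that the pointers must start just past the least significant longwords, since the addressing mode predecrements:

    ; a0/a1 point just past two big-endian 96-bit numbers in memory
        move    #0,ccr          ; clear the X flag before the chain
        addx.l  -(a0),-(a1)     ; lowest longwords first
        addx.l  -(a0),-(a1)     ; ...then the middle ones
        addx.l  -(a0),-(a1)     ; ...then the highest; the sum is at (a1)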
This can be very valuable when dealing with fixed point too, since you can split

    add.w   d1,d0

into

    add.b   d1,d0
    addx.b  d3,d2

...(with the low bytes of the value and its slope in d0/d1 and the high bytes in d2/d3) and thereby get at the upper 8 bits of an 8.8 fixed-point number without shifting! There is, however, one more thing to realize before we try implementing this. One can perform two word additions using only one add.l by packing the values into two registers; there will sometimes be an overflow from the lower part into the upper part of the result, but that overflow is always 1, so in some applications it is negligible.
Consider this:

    ; d0 00aa00bb
    ; d1 00cc00dd

    add.l   d1,d0

...then d0 will be (aa + cc) shifted up 16 bits, plus (bb + dd); since bb and dd are bytes with a zero guard byte above them, it takes at least 256 repetitions before any overflow leaks into the (aa + cc) part.
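To see the leak concretely, here is a small worked example (the specific values are mine):

    ; suppose repeated adds have grown the low field of d0 to $FFFE:
        move.l  #$0002FFFE,d0
        move.l  #$00010003,d1
        add.l   d1,d0           ; d0 = $00040001: bb+dd wrapped around
                                ;  and leaked a single 1 into aa+cc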
Now we look at our tmapper's stepping:

    add.w   d1,d0           ; Interpolate V
    add.b   d3,d2           ; Interpolate U fraction
    addx.b  d5,d4           ; Interpolate U integer

...and here we [after some thinking :)] realize that we can do the "add.b d3,d2" in the uppermost bytes of d0 and d1 instead:

    ; d0 V      **--VVvv  (this is how I denote different contents
    ; d0 U      uu--****   in the same register)
    ; d1 dV/dx  **--VVvv
    ; d1 dU/dx  uu--****
    ; d2 U      ------UU
    ; d3 dU/dx  ------UU

.pixel
    move.w  d0,d4           ; d4 = VVvv
    move.b  d2,d4           ; d4 = VVUU
    move.b  (a0,d4.w),(a5)+
    add.l   d1,d0
    addx.b  d3,d2
    dbf     d7,.pixel

Now that's pretty short. :) In fact, this is what is commonly referred to as "a 5inst tmapper" (5 instructions, not counting the dbf). There is no obvious way of speeding this up; it has looked like this since 1994 at least.
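Setting up that layout is straightforward; here is a minimal sketch (the assumption that U and V arrive as plain 8.8 words in d0/d1, and the choice of d4/d5 as outputs, are mine). The same shuffle is repeated for the slopes:

    ; assumed inputs: d0.w = UUuu (U as 8.8), d1.w = VVvv (V as 8.8)
        moveq   #0,d4
        move.b  d0,d4           ; d4 = 000000uu (U fraction)
        ror.l   #8,d4           ; d4 = uu000000 (fraction to the top
                                ;  byte, '--' guard byte left as 00)
        move.w  d1,d4           ; d4 = uu00VVvv (packed V + U fraction)
        moveq   #0,d5
        move.w  d0,d5
        lsr.w   #8,d5           ; d5 = 000000UU (U integer)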
Notice what the add/addx pair can be thought of as looking like:

    d2:d0 + d3:d1 = UU:uu--VVvv + UU:uu--VVvv

This shows that the add/addx is nothing but two adds packed into one (the "one" add just happens to be executed as two instructions) -- we are therefore not doing anything excessively weird yet.
The error that gets carried from the VVvv part into the uu part is, if you initialize the '--' parts to 00, at most 1 u per 256 iterations (the guard byte absorbs 256 carries before one leaks through) = 1 U per 65536 iterations. That error is highly negligible! But -- perhaps you want more accuracy? 16 fractional bits? It is usually not necessary on the Amiga in 320x256 resolution, but it is good to know that it is possible.
Let us first begin with the problem of interpolating only one value:

    ; d0 C      --CCcccc  (C for colour; might be for a gouraud rout)
    ; d1 dC/dx  --CCcccc

Usually one would do:

    add.l   d1,d0
    move.l  d0,d2
    swap    d2

...and then use d2.
That can be changed into:

    add.w   d1,d0
    addx.b  d3,d2

...however, look at this code if we unroll (repeat) it several times:

    add.w   d1,d0
    addx.b  d3,d2
    add.w   d1,d0
    addx.b  d3,d2
    add.w   d1,d0
    addx.b  d3,d2

Notice that after the first "addx.b d3,d2", the subsequent "add.w d1,d0" could be done at the top of d2/d3 in the addx! (Remember that an addx is just like an add -- plus the X flag.)
Therefore, our code could also look like this:

    ; d0 C      ----cccc
    ; d1 dC/dx  ----cccc
    ; d2 C      cccc--CC
    ; d3 dC/dx  cccc--CC

    add.w   d1,d0
    addx.l  d3,d2
    addx.l  d3,d2
    addx.l  d3,d2

...and just to get rid of the need for d0/d1 at the beginning:

    move.l  d3,d4
    clr.w   d4
    add.l   d4,d2           ; Step only the fractional part -- "init"
    addx.l  d3,d2
    addx.l  d3,d2
    addx.l  d3,d2

(Actually one should end this chain with an "addx.w", but that only matters if one needs the result value at the end of the operation -- normally one doesn't, at least not in tmappers.) This is a very interesting approach, because it creates something which I like to call a "cyclic add", which so-to-say never ends.
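As a usage example, here is a minimal sketch of a gouraud-style fill built on the cyclic add (the chunky buffer in a0 and the counter in d7 are my assumptions). The key point is that neither move.b nor dbf touches the X flag, so the chain survives across the whole loop:

    ; d2/d3 set up as above (cccc--CC), two pixels per iteration
        move.l  d3,d4
        clr.w   d4
        add.l   d4,d2           ; start the chain (sets the first X)
    .pixel
        move.b  d2,(a0)+        ; CC is simply the low byte of d2
        addx.l  d3,d2
        move.b  d2,(a0)+
        addx.l  d3,d2
        dbf     d7,.pixel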
Building this with two values (16.16) could look like this:

    ; d0 bbbbAAAA
    ; d1 aaaaBBBB
    ; d2 & d3 the same, but d?/dx

    move.l  d3,d4
    clr.w   d4
    add.l   d4,d1           ; start the chain (adding the last
                            ;  fractional part)
    addx.l  d2,d0
    addx.l  d3,d1           ; iteration 1 done
    addx.l  d2,d0
    addx.l  d3,d1           ; iteration 2 done
    ...

What kind of setup gives us the look of d1:d0?
Check them out as "normal" and as "finished" in 64-bit format:

    d1:d0 "normal" = BBBBbbbb:AAAAaaaa
    d1:d0 "2addx"  = aaaaBBBB:bbbbAAAA

...which means that the setup operation was -- in theory, of course -- "ror.q #16,d1:d0". This shows us why we should add the uppermost word of the 64-bit value first (the move.l/clr.w/add.l init): because that is where the lowest bits of the original value are.
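There is of course no ror.q on the 680x0, but the rotation is easy to fake with word moves; a minimal sketch, assuming the "normal" values sit in d0/d1 and d4 is free as a temporary:

    ; entry: d0 = AAAAaaaa, d1 = BBBBbbbb ("normal" layout)
        move.w  d0,d4           ; exchange the low words of d0 and d1...
        move.w  d1,d0
        move.w  d4,d1
        swap    d0              ; d0 = bbbbAAAA
        swap    d1              ; d1 = aaaaBBBB -- the "2addx" layout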
Now let us finally implement this in a texture mapper:

    ; d0 U      **----UU
    ; d0 V      vv----**
    ; d1 dU/dx  **----UU
    ; d1 dV/dx  vv----**
    ; d2 U      uuuu****
    ; d2 V      ****VVvv
    ; d3 dU/dx  uuuu****
    ; d3 dV/dx  ****VVvv

...and the code:

    move.l  d3,d4
    clr.w   d4
    add.l   d4,d2           ; Start the X flag in the "cyclic add"

.pixel
    move.w  d2,d4           ; d4 = VVvv
    move.b  d0,d4           ; d4 = VVUU
    move.b  (a0,d4.w),(a1)+
    addx.l  d1,d0
    addx.l  d3,d2
    dbf     d7,.pixel

These loops are nice and fast on the 020/030, sure, but how about the 040/060? There the instructions should be reordered a bit to remove stalls. (The move.w/move.b pair causes a 0.5 cycle stall on the 060, and the closeness between that move.b and the pixel-copying move.b causes a 1 cycle stall on the 040 and 060.)
Reorder the 8.8 loop to this:

.pixel
    move.w  d0,d4           ; d4 = VVvv
    add.l   d1,d0
    move.b  d2,d4           ; d4 = VVUU
    addx.b  d3,d2
    move.b  (a0,d4.w),(a5)+
    dbf     d7,.pixel

And the 16.16 loop to this:

.pixel
    move.w  d2,d4           ; d4 = VVvv
    addx.l  d1,d0
    move.b  d0,d4           ; d4 = VVUU
    addx.l  d3,d2
    move.b  (a0,d4.w),(a1)+
    dbf     d7,.pixel

The above loops can be sped up a tiny bit more, but any further optimization is left as an exercise for the reader. ;) Oh, and you might want to keep the upper word of d4 cleared (through a "moveq #0,d4" before the pixel loop), since you can then use d4.l as the offset into the texture, and a0 thus doesn't need to point to the middle of the texture (if you have 256x256 textures). It all depends on the circumstances, though...
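Combining that idea with the 16.16 loop might look like this (just a sketch of the same loop; note that moveq does not touch the X flag, so the "cyclic add" carry set up by the init survives it):

    move.l  d3,d4
    clr.w   d4
    add.l   d4,d2           ; start the cyclic add
    moveq   #0,d4           ; clear d4 once; only move.w/move.b write
                            ;  to it below, so its top word stays 0
.pixel
    move.w  d2,d4           ; d4 = 0000VVvv
    addx.l  d1,d0
    move.b  d0,d4           ; d4 = 0000VVUU
    addx.l  d3,d2
    move.b  (a0,d4.l),(a1)+ ; .l index: a0 points at the texture start
    dbf     d7,.pixel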