Agner wrote:
I don't think I understand your problem. You have four
RGBA points in each their vector register. All of
these should be multiplied by a factor and then it
should all be added together. It's just multiplication
and addition of 8-bit integers. You may zero-extend
all to 16 bits to avoid loss of precision; then shift
right and compress back to 8 bits.
In pseudo-ASM it looks somewhat like this:loop: ; generate memory addresses
shr r0, u_coord, 24
shr r1, v_coord, 24
shl r1, 8
add r4, r0, r1
add r2, r0, 1
and r2, 255
add r5, r2, r1
add r3, r1, 256
and r3, 65535
add r6, r0, r3
add r7, r2, r3 ; load the 4 texels
mov v0, int32 [texture_adr + r4*4]
mov v1, int32 [texture_adr + r4*5]
mov v2, int32 [texture_adr + r4*6]
mov v3, int32 [texture_adr + r4*7] ; expand components from 8bits to 16bits to avoid wrap-around problems
expand.u8.u16 v0
expand.u8.u16 v1
expand.u8.u16 v2
expand.u8.u16 v3 ; generate interpolation factors
shr r8, u_coord, 16
and r8, 255
mov broadcast.u16x4 v4, r8
shr r9, v_coord, 16
and r9, 255
mov broadcast.u16x4 v5, r9 ; linear interpolate
sub.s16 v1, v0
sub.s16 v3, v2
mul.s16 v1, v4
mul.s16 v3, v4
shr.s16 v1, 8
shr.s16 v3, 8
add.s16 v1, v0
add.s16 v3, v2
sub.s16 v3, v1
mul.s16 v3, v5
shr.s16 v3, 8
add.s16 v3, v1 ; alpha blend
mov v6, u32 [screen_ptr]
expand.u8.u16 v6
mov broadcast.s16x4 v7, v3[3]
sub.s16 v3, v6
mul.s16 v3, v7
shr.s16 v3, 8
add.s16 v3, v6
shrink.u16.u8 v6
mov v6, u32 [screen_ptr] ; increment and loop
add screen_ptr, 4
add u_coord, u_delta
add v_coord, v_delta
cmp r10, screen_ptr, span_end_ptr
jz r10 loop
So the problem I'm seeing is not as much the linear interpolation as having to do ~50 instructions per pixel to do your addressing, interpolation, alpha blending and so forth. I guess you could compute multiple pixels together to soften the blow a bit - which I guess would be good for the interpolation and alpha blending parts, although it doesn't help too much for the addressing part.
|