Hacking into GGC 6 for the Motorola 680x0 / Amiga

The gcc compiler is a huge project. There is still support for the M68k in the gcc-6-branch but support for the AmigaOS was abandoned.
After re-adding/-inventing the AmigaOS support I am starting to add further optimizations for the M68k.
The generated code for M68K is not bad - it's often the best available compiler. But there are code snippets where the code could be better.
My attempts may add only little or no improvements to your code, but I have fun doing this!

optimizing strcpy

The generated code veries depending on the -O setting. This is the code with -O3:

_strcpy:
	move.l a0,d0
.L2:
	move.b (a1)+,d1
	move.b d1,(a0)+
;	tst.b d1 -- not visible in output
	jeq .L7
	move.b (a1)+,d1
	move.b d1,(a0)+
;	tst.b d1 -- not visible in output
	jne .L2
.L7:
	rts

The code is ok but not the best. Can't we spare the d1 register?

hack into existing optimization passes

My first attempt was modifying the "combine" pass of the gcc. And I really got:

opt_strcpy

_strcpy:
	move.l a0,d0
.L2:
	move.b (a1)+,(a0)+
	jeq .L7
	move.b (a1)+,(a0)+
	jne .L2
.L7:
	rts

To achieve this the register and also the invisible tst statement were removed.
But in complex programs - with inlined strcpy - there may be illegal registers at the combine stage and you might have code like:

	...
	move.l a0,d0
.L2:
	move.b (d2)+,(a0)+
	jne .L2
.L7:
	...

The later running "reload" pass will kill-fix this:

	...
	move.l a0,d0
.L2:
	move.l d2,a1
	move.b (a1)+,(a0)+
	move.l a1,d2
	jne .L2
.L7:
	...

And the the condition is killed. Since the d0 register and the test were omitted there is no way to handle it during "reload".

write an own optimization pass

With an own optimization pass a correct handling is possible. That pass is invoked after reload is done and the first step is handling the moves added by reload

	...
	move.l a0,d0
.L2:
	move.l d2,a1
	move.b (a1)+,d1
	move.b d1,(a0)+
	move.l a1,d2
;	tst.b d1 -- not visible in output
	jne .L2
.L7:
	...

propagate_moves

I guess a previous pass might be able to handle this, but you can't invoke every pass anywhere. So what's the idea here? I'll transform the loop into:

	...
	move.l a0,d0
	move.l d2,a1
.L2:
	move.b (a1)+,d1
	move.b d1,(a0)+
;	tst.b d1 -- not visible in output
	jne .L2
.L7:
	sub.l #1,a1
	move.l a1,d2
	...

Now the strcpy_opt can by applied again:

	...
	move.l a0,d0
	move.l d2,a1
.L2:
	move.b (a1)+,(a0)+
	jne .L2
.L7:
	sub.l #1,a1
	move.l a1,d2
	...

And if d2 is not used discard the superfluous statements.
Now strcpy is ok and also inlined strcpy seems to be ok - my test cases (from reported bugs) are all ok.

optimizing strncpy

Let's look into strncpy. The code is a bit more complex but still easy enough to judge:

_strncpy:
	move.l a2,-(sp)
	move.l d2,-(sp)
	move.l 12(sp),d0
	move.l 16(sp),a2
	move.l 20(sp),d1
	jeq .L11
	move.l d0,a1
.L6:
	lea (1,a1),a0
	move.b (a2)+,d2
	move.b d2,(a1)
	jeq .L15
	subq.l #1,d1
	move.l a0,a1
	jne .L6
.L11:
	move.l (sp)+,d2
	move.l (sp)+,a2
	rts
.L15:
	add.l d1,a1
	moveq #1,d2
	cmp.l d1,d2
	jeq .L11
.L8:
	clr.b (a0)+
	cmp.l a1,a0
	jeq .L11
	clr.b (a0)+
	cmp.l a1,a0
	jne .L8
	jra .L11

So let's apply opt_strcpy and propagate_moves:

_strncpy:
	move.l a2,-(sp)
	move.l d2,-(sp)
	move.l 12(sp),d0
	move.l 16(sp),a2
	move.l 20(sp),d1
	jeq .L11
	move.l d0,a1
.L6:
	lea (1,a1),a0
	move.b (a2)+,(a1)
	jeq .L15
	subq.l #1,d1
	move.l a0,a1
	jne .L6
.L11:
	move.l (sp)+,d2
	move.l (sp)+,a2
	rts
.L15:
	add.l d1,a1
	moveq #1,d2
	cmp.l d1,d2
	jeq .L11
.L8:
	clr.b (a0)+
	cmp.l a1,a0
	jeq .L11
	clr.b (a0)+
	cmp.l a1,a0
	jne .L8
	jra .L11

The propagate_moves does not work since there is a lea statement which does an add. Fortunately this can be converted into a move with using an auto increment in the next statement:

commute_add_move

This alorithm will convert

	lea (1,a1),a0
	move.b (a2)+,(a1)

	move a1,a0
	move.b (a2)+,(a0+)

Altogether the optimizations are yielding now:

_strncpy:
	move.l a2,-(sp)
	move.l d2,-(sp)
	move.l 12(sp),d0
	move.l 16(sp),a2
	move.l 20(sp),d1
	jeq .L11
	move.l d0,a1
	move.l d0,a0
.L6:
	move.b (a2)+,(a0)+
	jeq .L15
	subq.l #1,d1
	jne .L6
	move.l a0,a1
.L11:
	move.l (sp)+,d2
	move.l (sp)+,a2
	rts
.L15:
	lea (-1,a0),a1
	add.l d1,a1
	moveq #1,d2
	cmp.l d1,d2
	jeq .L11
.L8:
	clr.b (a0)+
	cmp.l a1,a0
	jeq .L11
	clr.b (a0)+
	cmp.l a1,a0
	jne .L8
	jra .L11

Now the code after label .L15 could be better.

const_cmp_to_sub

Since d1 is no longer used the compare with the constant can be replaced with a sub statement:

_strncpy:
	move.l a2,-(sp)
	move.l d2,-(sp)
	move.l 12(sp),d0
	move.l 16(sp),a2
	move.l 20(sp),d1
	jeq .L11
	move.l d0,a1
	move.l d0,a0
.L6:
	move.b (a2)+,(a0)+
	jeq .L15
	subq.l #1,d1
	jne .L6
	move.l a0,a1
.L11:
	move.l (sp)+,d2
	move.l (sp)+,a2
	rts
.L15:
	lea (-1,a0),a1
	add.l d1,a1
	subq.l #1,d1
	jeq .L11
.L8:
	clr.b (a0)+
	cmp.l a1,a0
	jeq .L11
	clr.b (a0)+
	cmp.l a1,a0
	jne .L8
	jra .L11

merge_add

And now there are two identical subtracts, whcih can be combined:

_strncpy:
	move.l a2,-(sp)
	move.l d2,-(sp)
	move.l 12(sp),d0
	move.l 16(sp),a2
	move.l 20(sp),d1
	jeq .L11
	move.l d0,a1
	move.l d0,a0
.L6:
	move.b (a2)+,(a0)+
	jeq .L15
	subq.l #1,d1
	jne .L6
	move.l a0,a1
.L11:
	move.l (sp)+,d2
	move.l (sp)+,a2
	rts
.L15:
	move.l a0,a1
	subq.l #1,d1
	add.l d1,a1
	jeq .L11
.L8:
	clr.b (a0)+
	cmp.l a1,a0
	jeq .L11
	clr.b (a0)+
	cmp.l a1,a0
	jne .L8
	jra .L11

elim_dead_assign

There is a dead assignment left before .L11. Let's remove all dead assignments:

_strncpy:
	move.l a2,-(sp)
	move.l d2,-(sp)
	move.l 12(sp),d0
	move.l 16(sp),a2
	move.l 20(sp),d1
	jeq .L11

	move.l d0,a0
.L6:
	move.b (a2)+,(a0)+
	jeq .L15
	subq.l #1,d1
	jne .L6

.L11:
	move.l (sp)+,d2
	move.l (sp)+,a2
	rts
.L15:
	move.l a0,a1
	subq.l #1,d1
	add.l d1,a1
	jeq .L11
.L8:
	clr.b (a0)+
	cmp.l a1,a0
	jeq .L11
	clr.b (a0)+
	cmp.l a1,a0
	jne .L8
	jra .L11

regrename

Now regrename can be run again.

shrink_stack_frame

Since the stack frame already exists, some stack cleanup must be done. Push/pop for unused regs is removed:

_strncpy:
	move.l 4(sp),d0
	move.l 8(sp),a1
	move.l 12(sp),d1
	jeq .L11
	move.l d0,a0
.L6:
	move.b (a1)+,(a0)+
	jeq .L15
	subq.l #1,d1
	jne .L6
.L11:
	rts
.L15:
	move.l a0,a1
	subq.l #1,d1
	add.l d1,a1
	jeq .L11
.L8:
	clr.b (a0)+
	cmp.l a1,a0
	jeq .L11
	clr.b (a0)+
	cmp.l a1,a0
	jne .L8
	jra .L11

Result

Writing the same function in assemble will yield a still better version. But for a compiler output, this result is quite ok.

Thoughts

The way how gcc-6 is working:

work with a abstract cpu
use an infinite set of pseudo registers
combine instructions to create instructions for the real cpu and use hard (real) registers
fix registers to conform the cpu constraints might be improvable. As far as I understand it now, it requires some fundamental changes, e.g. inventing a register mode which determines a pointer. Then the combine stage could use that information to chose correct registers. But this is a bigger modification.

My additional pass can be considered as non-intrusive and can be easily disabled via command line switch.

rev: 1.11