Optimizing Assembly (Fast 68K Decompression)

April 28th, 2016 | Category: Bit Bucket | Author: Steve

Kid-optimized-15-percent

Are you a 68K assembly language guru, or just good at optimizing code? I’m working on a project that’s a replacement ROM for old 68K-based Macintosh computers, part of which involves decompressing a ~5MB disk image from ROM into RAM to boot the machine. This needs to happen fast, to avoid a lengthy wait whenever the computer boots up. I selected liblzg specifically for its simplicity and speed of decompression, even though it doesn’t compress as well as some alternatives. And the whole thing works! But I want it to be faster.

The compressor is a regular Windows/Mac/Linux program, and the decompressor is a hand-written 680000 assembly routine from the lzg authors. It works well, but decompression on a Mac IIsi or IIci only runs about 600K/sec of decompressed data, so it takes around 5-20 seconds to decompress the whole rom disk depending on its size and the Mac’s CPU speed.

The meat of the 68000 decompression routine isn’t too long. It’s a fairly simple Lempel-Ziv algorithm that encodes repeated data as (distance,length) references to the first appearance of the data. There’s a brief summary of lzg’s specific algorithm here. Anyone see any obvious ways to substantially optimize this code? It was written for a vanilla 68000, but for this Mac ROM it’ll always be running on a 68020 or ‘030. Maybe there are some ‘030-specific instructions that could be used to help speed it up? Or some kind of cache prefetch? There’s also some bounds-checking code that could be removed, though the liblzg web site says this provides only a ~12% improvement.

The meat-of-the-meat where data gets copied from source to dest is a two-instruction dbf loop:

_loop1:	move.b	(a4)+,(a1)+
	dbf	d6,_loop1

If any ‘030-specific tricks could improve that, it would help the most. One improvement would be to copy 4 bytes at a time with move.l instead of move.b. But the additional instructions needed to handle 4-byte alignment and 1-3 extra bytes might outweigh the savings for smaller blocks being copied. I think the average block size is around 10 bytes, though some are up to 127 bytes.

The loop might also be unrolled, for certain pre-defined block sizes.

Here’s the entirety of the decompressor’s main loop:

_LZG_LENGTH_DECODE_LUT:
	dc.b	1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16
	dc.b	17,18,19,20,21,22,23,24,25,26,27,28,34,47,71,127
	
_lzg_decompress:
	// a0 = src
	// a1 = dst
	// a2 = inEnd = in + inSize
	// a3 = outEnd = out + decodedSize
	// a6 = out
	move.b	(a0)+,d1			// d1 = marker1
	move.b	(a0)+,d2			// d2 = marker2
	move.b	(a0)+,d3			// d3 = marker3
	move.b	(a0)+,d4			// d4 = marker4
 
	lea		_LZG_LENGTH_DECODE_LUT(pc),a5	// a5 = _LZG_LENGTH_DECODE_LUT
	
	// Main decompression loop
	move.l	#2056,d0			// Keep the constant 2056 in d0 (for marker1)
	cmp.l	a2,a0
_mainloop:
	bcc.s	_fail				// Note: cmp.l a2,a0 must be performed prior to this!
	move.b	(a0)+,d7			// d7 = symbol
 
	cmp.b	d1,d7				// marker1?
	beq.s	_marker1
	cmp.b	d2,d7				// marker2?
	beq.s	_marker2
	cmp.b	d3,d7				// marker3?
	beq.s	_marker3
	cmp.b	d4,d7				// marker4?
	beq.s	_marker4
 
_literal:
	cmp.l	a3,a1
	bcc.s	_fail
	move.b	d7,(a1)+
	cmp.l	a2,a0
	bcs.s	_mainloop
 
	// We're done
_done:
	// irrelevant code removed
	bra _exit
 
	// marker4 - "Near copy (incl. RLE)"
_marker4:
	cmp.l	a2,a0
	bcc.s	_fail
	moveq	#0,d5
	move.b	(a0)+,d5
	beq.s	_literal			// Single occurance of the marker symbol (rare)
	move.l	d5,d6
	and.b	#0x1f,d6
	move.b	(a5,d6.w),d6		// length-1 = _LZG_LENGTH_DECODE_LUT[b & 0x1f]
	lsr.b	#5,d5
	addq.w	#1,d5				// offset = (b >> 5) + 1
	bra.s	_copy
 
	// marker3 - "Short copy"
_marker3:
	cmp.l	a2,a0
	bcc.s	_fail
	moveq	#0,d5
	move.b	(a0)+,d5
	beq.s	_literal			// Single occurance of the marker symbol (rare)
	move.l	d5,d6
	lsr.b	#6,d6
	addq.w	#2,d6				// length-1 = (b >> 6) + 2
	and.b	#0x3f,d5
	addq.w	#8,d5				// offset = (b & 0x3f) + 8
	bra.s	_copy
 
	// marker2 - "Medium copy"
_marker2:
	cmp.l	a2,a0
	bcc.s	_fail
	moveq	#0,d5
	move.b	(a0)+,d5
	beq.s	_literal			// Single occurance of the marker symbol (rare)
	cmp.l	a2,a0
	bcc.s	_fail
	move.l	d5,d6
	and.b	#0x1f,d6
	move.b	(a5,d6.w),d6		// length-1 = _LZG_LENGTH_DECODE_LUT[b & 0x1f]
	lsl.w	#3,d5
	move.b	(a0)+,d5
	addq.w	#8,d5				// offset = (((b & 0xe0) << 3) | b2) + 8
	bra.s	_copy
 
	// marker1 - "Distant copy"
_marker1:
	cmp.l	a2,a0
	bcc.s	_fail
	moveq	#0,d5
	move.b	(a0)+,d5
	beq.s	_literal			// Single occurance of the marker symbol (rare)
	lea		1(a0),a4
	cmp.l	a2,a4
	bcc.s	_fail
	move.l	d5,d6
	and.b	#0x1f,d6
	move.b	(a5,d6.w),d6		// length-1 = _LZG_LENGTH_DECODE_LUT[b & 0x1f]
	lsr.w	#5,d5
	swap	d5
	move.b	(a0)+,d5
	lsl.w	#8,d5
	move.b	(a0)+,d5
	add.l	d0,d5				// offset = (((b & 0xe0) << 11) | (b2 << 8) | (*src++)) + 2056
 
	// Copy corresponding data from history window
	// d5 = offset
	// d6 = length-1
_copy:
	lea		(a1,d6.l),a4
	cmp.l	a3,a4
	bcc	_fail
 
	move.l	a1,a4
	sub.l	d5,a4
 
	cmp.l	a6,a4
	bcs	_fail
 
_loop1:	move.b	(a4)+,(a1)+
	dbf	d6,_loop1
 
	cmp.l	a2,a0
	bcs	_mainloop
	bra	_done

Another thing to note is that about half of all the data is literals rather than (distance,length) markers, and goes to the _literal branch above. That involves an awful lot of instructions to copy a single byte. A faster method of determining whether a byte is a marker or a literal would help - I plan to try a 256-entry lookup table instead of the four compare and branch instructions.

My final idea would involve changing the lzg algorithm itself, and making the compression slightly worse. For longish sequences of literals, the decompressor just copies bytes from input to output, but it goes through the whole _literal loop for each byte. I'm thinking of introducing a 5th marker byte that means "copy the next N bytes directly to output", for some hand-tuned value of N. Then those N bytes could be copied using a much higher performance loop.

Read 35 comments and join the conversation

35 Comments so far

Steve - April 28th, 2016 8:56 pm
I collected some stats on the decompressor’s behavior with my disk image, so I could see the typical block sizes and other things. That may help inform where and how to optimize.

Each token in the compressed data is one of three things:
- A marker followed by a (distance,length) pair of previous data to copy. There are 4 marker subtypes.
- An escaped marker, for when a marker byte itself should appear in the output
- A literal byte that’s not a marker, and is just copied to the output
In my data, they appear with these frequencies:
```
marker1 "Distant copy" = 44686 
marker2 "Medium copy" = 53308 
marker3 "Short copy" = 35166 
marker4 "Near copy (incl. RLE)" = 12714 
escaped marker = 4333
literal byte = 1056435
```
There are a LOT of literal bytes, so making that case fast is important. For the 4 types of markers, each will copy a block of bytes from a previous section of the output data. Those blocks can be anywhere from 3 to 128 bytes. I recorded this distribution of block sizes:
```
lengths[3] = 26476            
lengths[4] = 26262            
lengths[5] = 28504          
lengths[6] = 17860          
lengths[7] = 10462           
lengths[8] = 7217              
lengths[9] = 4720              
lengths[10] = 3858             
lengths[11] = 2872             
lengths[12] = 2215             
lengths[13] = 1519             
lengths[14] = 1351             
lengths[15] = 967              
lengths[16] = 1026             
lengths[17] = 712              
lengths[18] = 706              
lengths[19] = 575              
lengths[20] = 490              
lengths[21] = 372              
lengths[22] = 376              
lengths[23] = 349              
lengths[24] = 333              
lengths[25] = 247              
lengths[26] = 253              
lengths[27] = 205              
lengths[28] = 231              
lengths[29] = 995                         
lengths[35] = 1043                    
lengths[48] = 898                       
lengths[72] = 752                      
lengths[128] = 2028
```
80% of blocks are 8 bytes or less! That makes me question whether unrolling the copy loop or using copy.l would help, or hurt. If those changes required more than a couple of extra instructions for checking the block size and dealing with alignment and tailing bytes, then the extra overhead would probably outweigh the savings for such short blocks.

On the other hand, even though the number of large blocks isn’t great, there are many bytes in each one. Maybe I should be looking at the total number of bytes in all instances of a given block sizes, rather than just the number of instances of that size.
George Phillips - April 28th, 2016 11:54 pm

LZG’s clever scheme of picking marker bytes depending on the data to reduce expansion certainly makes it harder to have a fast case. When I’ve done custom decompression for 8 and 16 bit CPUs I like use use control bytes that look like “NNNNN FF” which the “F” bits are the function and the N bits are the parameter. I load a byte, then do a right shift and branch on carry which might go to the most common operation which leaves the remaining bits ready to use as a parameter. Or it might go to another right shift which will pick between two more possibilities.

A 256 entry lookup table to dispatch to the M1, M2, M3, M4 and LITERAL handlers is likely the best way to go. The 68020 has scaled indexing modes so I can see having a 2 instruction decode. My 68000 is very rusty so I’ll write in C pseudo-code:

d0 = *a0++; // get byte
goto a1[d0 * 4]; // table lookup with indirect

If your decompression code can live in the low 64 K of memory the table only needs 16 bit entries.

And remember to shave off a jump by putting the dispatch at the end of each handler.

There is the possibility of a faster dispatch, but the cost of instructions and the particular features of the 68020 (especially the small instruction cache) might not make it worthwhile. It goes something like this:

d0 = *a0++;
goto a1 + d0 * 16;

The idea is that every 16 bytes starting at “a1” you have the handler for that byte. Most of them will be literals where they’ll simply copy d0 to the output and then do their own dispatch. Scaled indexing could be a help here as it will do a multiply by up to 8 for you, but 8 bytes may not be enough room to handle a literal. For the others you’d just to a jump to the routine rather than having to expand the (sparse) table of handlers even more.

But maybe scaled indexing is the way to go. Altering d0 is a pain because the all the handlers will have to undo the shift to use the value. Even if 8 bytes is too tight to put in the dispatch code it could still be better to have a “jmp a2” to return back to the main loop.
George Phillips - April 28th, 2016 11:56 pm

[ A second comment because my first attempt was too long. ]

Wikipedia claims the 68020 does not have data alignment constraints so you could likely take advantage of that. Unaligned access is more expensive but generally the penalty is easily worth it. However, beware the “RLE” copies where the back pointer is shorter than the length. Those are overlapping and usually must be done a byte at a time to work properly. I guess that only matters for M4 “near copy” operations.

I also see the 68020 has a 256 byte instruction cache. That might work against schemes which unroll or otherwise use too much code.

Concerning the inner copy loops, the scaled indexing modes with PC-relative addressing can pretty quickly go into a series of “*a3++ = *a0++;” instructions so you can unroll them without much overhead. I imagine some optimized “memcpy()” code for 68000 would be a good reference.
MagerValp - April 29th, 2016 3:45 am

I’m not a 68K assembly language guru, but I know where to find them: http://ada.untergrund.net/?p=boardforums&forum=4

Forgive me for saying so, but lzg doesn’t seem to be particularly well designed. It’s using whole bytes for markers when two bits would suffice, and as you note it only has single byte literals (rather than a byte count). I’d say ditch it and design something yourself. There’s no reason a simple byte based LZ scheme shouldn’t approach memcpy in speed – I’ll second George’s suggestion to see if you can find an optimized implementation to work from. I’d also investigate keeping the compressed stream word aligned, the compression hit shouldn’t be too bad and it could help a lot with performance. I’d start playing with a stream format along the lines of:

# 00000000’00000000 EOF
# 00cccccc’cccccccc 14-bit literal string
# 01cccccc’bbbbbbbb 6-bit RLE count, 8-bit byte
# 11ccccoo’oooooooo 4-bit count, 10-bit offset
# 11cccccc’cccccccc’oooooooo’oooooooo 14-bit count, 16-bit offset
Steve - April 29th, 2016 6:33 am

Yeah, maybe lzg isn’t the best choice, and really I have no experience at all with compression algorithms. I did implement a lookup table to see if a byte is a marker, and it gave a roughly 15% speedup. A jump/dispatch table that branches to the right handler for the marker or literal should be faster still. But I’m looking for a 2-4x speedup if possible, so something more drastic is probably needed. A block copy of the entire uncompressed disk image into ram is plenty fast (under a second, I haven’t timed it exactly) so I know the memory bandwidth is enough at least.

My instinct is that having literal strings that aren’t a whole number of bytes would be problematic and too slow, though I may be wrong. In MagerValp’s suggestion above, maybe replace the second line with:

# 00cccccc Next count bytes are a literal string

I think that would be faster to decode, but might kill the compression ratio.

George’s faster dispatch idea looks cool, but unless I’ve misunderstood, it would require the marker bytes to be known at compile time, or else require an additional level of indirection.

For the inner copy loops, the source code for Apple’s 68K BlockMove is here: http://toddp.com/classic/Software%20Install/Apple%20Source%20Code/System%207.1%20Source/OS/MemoryMgr/BlockMove.a But I’m not sure if there are really enough long block copies to make something like that worthwhile.

68020 has a 256-byte instruction cache. 68030 (most of the machines this will be used with) has a 256-byte instruction cache and 256-byte data cache.

The problem with all my projects is that there are so many interesting sidelines and diversions, I begin to lose track of whatever I meant to accomplish in the first place!
Steve - April 29th, 2016 6:36 am

Also, making the compressor deal with alignment is a good idea. In general, the compressor can do a lot of hard work to optimize the decompression step, given intimate knowledge of the decompression routine’s implementation.
MagerValp - April 29th, 2016 7:10 am

LZ is so much fun though 🙂

If you want a simple compression scheme to start playing with, you can look at the one I did for 6502:

https://github.com/MagerValp/u4remastered/blob/master/tools/backpack.c
https://github.com/MagerValp/u4remastered/blob/master/src/u4loader/lzmv.s

It uses the following markers:

# 00000000 EOF
# 00cccccc 6-bit literal
# 01cccccc 6-bit RLE
# 1cccoooo’oooooooo 3-bit count, 12-bit offset

It should be relatively easy to extend it to work on 16-bit words.
MagerValp - April 29th, 2016 7:13 am

Another thought: is it necessary to decompress the entire disk image in one go, or could you decompress tracks on the fly as needed? Total decompression time might be higher, but it would eliminate the long wait.
Steve - April 29th, 2016 8:36 am

Thanks, I’ll definitely check that out! Decompression on the fly is definitely a possibility too, and maybe my time would be better spent there than on inventing a new compression scheme. But it would replace a single delay at bootup with overall slower disk performance, at least for the first time any disk sector is read.

I added a proper timer instead of measuring performance with my wristwatch, and here’s where things stand after removing the safety checks and adding the isMarker lookup table:

Mac IIci BlockMove (for 2MB block) = 6761K/sec
Mac IIsi lzg decompress = 583K/sec
Mac IIci lzg decompress = 898K/sec
MESS Mac IIci emulator lzg decompress = 858K/sec

I’d like to reach about 2000K/sec on the IIci, but not sure if that’s realistic.
Eric - April 29th, 2016 9:53 am

Haven’t looked too closely at it, but what about this LZ4 format? There’s a C reference version, and assembly decompressors for 6502 and 65c816

http://cyan4973.github.io/lz4/

More on the 65c816 version
http://www.brutaldeluxe.fr/products/crossdevtools/lz4/index.html
George Phillips - April 29th, 2016 11:11 am

Steve, my faster dispatch idea was predicated around generating the handler code into RAM. Then the markers don\’t have to be known at compile time nor do you need an extra level of indirection.

I agree with Eric that LZ4 permits a faster implementation. Seems worthwhile to try compressing the hard disk image with LZ4 just to see if the compressed size is acceptable. I\’d tweak the LZ4 compressor to store offsets in big-endian format to save time and a big of code.

Can you reach 2000K/second on the IIci? At 25 MHz that\’s 12.2 cycles/byte on average. A straightforward implementation will need one \”move.b (a0)+,(a1)+\” instruction for each byte. Seems to be no easy way to estimate instruction timing. The 68020 manual said maybe best case 6, worse case 9 cycles, best I could tell. That leaves very little time to decode though unrolled copy loops help here as decoding overhead isn\’t per-byte but per code. You could measure an unrolled byte copy loop to see how fast that is. If it runs slower than 2000K/second then it looks dubious without resorting to more complex code using word and long moves. If significantly faster then it might well be attainable.

A slightly more complex model of the program would give a better feel for feasibility. Say estimate the cost of dispatch, multiply by the number of commands and add in the cost per byte and see what that requires.
Eric - April 29th, 2016 11:55 am

And there’s some Atari ST 68k assembly for LZ4 here, and I’m sure elsewhere too.

http://www.atari-forum.com/viewtopic.php?t=26825
rasz_pl - April 29th, 2016 12:26 pm

try different algorithm?
https://github.com/Cyan4973/lz4 seems to achieve ~0.3 speed of ram when decompressing
rasz_pl - April 29th, 2016 12:28 pm

note to self: refresh page before commenting 🙂
Steve - April 29th, 2016 12:52 pm

As I look more carefully at lzg, even within the marker format it defines, I don’t think the compressor makes the best choices as to how to compress. Consider this sequence:

AA AA AA AA AA 00 AA AA AA AA AA AA AA AA AA AA

should compress as

AA M4_backref(d=1,l=4) 00 AA M4_backref(d=1,l=9)

where each M4_backref is 2 bytes. So that’s 7 bytes. But instead it does:

AA M4_backref(d=1,l=4) 00 M4_backref(d=6,l=5) M4_backref(d=1,l=5)

which is 8 bytes. It seems when it encounters a long string that could be RLE encoded, it prefers to encode it as multiple backrefs to shorter substrings rather than a single longer RLE. I see why that’s so, but not sure how to fix it. At the time when it’s evaluating the first AA byte in the second group of AA’s, its only choices are to emit an AA literal or a backref(d=6,l=5). The backref looks like the better choice at that point. Only by applying some look-ahead logic could it see that a better “delayed gratification” choice exists.
Eric - April 29th, 2016 2:38 pm

I think the fix for that is the “optimal parsing” approach to parsing. Requires more time/space in compression, right?
Steve - April 29th, 2016 2:55 pm

Probably, though I’m not exactly sure how that would be implemented in the general case. I tried adding some code that checks if the current position is the start of a run, and if so whether the length of that run is longer than the length of a previous match that could also be backref’d from that position. It seems to work and it fixes the simple example I gave above, but on real data it only improved compression by 0.01%. 🙁
uridium - April 29th, 2016 4:42 pm

Are you enabling the cache on the 020 / 030? You need to turn bit 0 on in the CACR or it’s off by default. Same goes for the 030. The 030’s burst-fill mode will probably be most useful when experimenting with memcopy() type routines on the IIsi as you can move four contigious per op.

I’m coming from an amiga background, so please excuse the n00b’ness.
uridium - April 29th, 2016 4:43 pm

oh.. this is nice. Yay google.

https://winuae-mod.googlecode.com/svn/tags/winuae_2_0_2/!docs/Cache%20programming.txt
David Kuder - April 29th, 2016 9:01 pm

I would investigate other compression schemes, maybe one which separates the markers and literals completely except for small blocks of less than 32 bits, and optimizes for blocks of 32bits that are aligned for optimum use of move.l?
Steve - April 29th, 2016 10:20 pm

Oy, this is getting too deep in the weeds. The point was never to design the perfect compression algorithm, but to load this rom disk faster, and I think I’m losing focus. I decided I didn’t want to start over again with a new algorithm, so I’ve just concentrated on making optimizations to lzg.

I’ve made two big changes:

1. The inner copy loop does some loop unrolling and (unaligned) long and word-sized moves as well as byte-sized moves, which gives a 10% speedup. This could probably be further improved.

2. I prefix each string of literals with a byte for the string length, so they can be copied in a fast loop without testing each byte to see if it’s a marker. This gave a further 35% speedup, but it made the compression ratio worse. Now the compressed file is 68% of original size instead of 63% for the original lzg implementation. (For comparison zip achieves 58% on the same data.)

Decompression speed is now 1470K/sec on the IIci.

Instruction and data caches are already on, thanks to Mac’s init code.

I’m still not doing anything in the compressor to try to align the input or output to word or long boundaries. And I’m still reading the input and copying literal strings one byte at a time instead in words or longs. There are probably some decent speedups still to be had there.

I don’t know how much of the lost compression ratio can be won back. Adding an extra byte to each string of literals is expensive. Maybe I can make the markers smaller somehow – I think there’s a possibility. But I’m really tired and need to put this away for a while before my head bursts.
Mikko Ohtamaa - April 30th, 2016 3:53 pm

Do these CPUs allow self-modifying code like x86? This was a trick many CPU intensive loops in games used by the time. Unlike modern machines, there was little or no penalty to modify code in fly because there was no heavy RAM cache. This could replace some jump table parts in a clever design.
Steve - April 30th, 2016 4:51 pm

I’m not certain. I think self-modifying code is OK, but you have to explicitly flush the instruction cache.
Paul C. Pratt - April 30th, 2016 8:07 pm

I like MagerValp\’s idea of decompressing a sector at a time, if what you are doing is implementing your own disk driver that reads from compressed ROM. If you decompress the whole thing at once, I would assume you would need to allocate RAM for the whole uncompressed disk. Just decompressing the sectors as they are asked for would not need any extra RAM. The operating system is designed to be usable when booted off floppy, so it will still feel very fast booting from ROM, even with almost any sort of decompression code.
Anon - April 30th, 2016 10:58 pm

Could any of this compression be applied to ROM disk in your ROM-Inator product? It would be nice to have more than a megabyte of ROM disk space to work with!
Decompression on the fly? That way it need not be held in The tiny amount of RAM contained in a 128k/512k?
- May 1st, 2016 3:39 am

You’re probably already doing that, but if you prefix each literal string by its length, don’t forget that you then don’t need to escape bytes that could be considered markers.
Jeff - May 2nd, 2016 3:06 pm

The Mac Classic ROM contained a bootable copy of MacOS 6.0.4 – just hold down Command-Option-X-O at startup.
Perhaps crib some compression tips from there?
Anon - May 3rd, 2016 9:08 pm

There was compression in the Mac Classic ROM disk? That’s news to me!
Is there evidence of this anywhere?
Steve - May 3rd, 2016 9:14 pm

I don’t believe the Mac Classic rom disk is compressed. But it is stored in a super-efficient way, with toolbox code that’s already in the main rom not duplicated in the System file of the disk image.
Paul C. Pratt - May 3rd, 2016 10:38 pm

I notice that there is a _UnpackBits call at offset 3E454 in the Mac Classic ROM, inside the EDisk driver. And that looking at a random location somewhere in the middle of the 256k data for the ROM disk, and the corresponding section of the 400k mounted ROM disk, it looks like it could be in PackBits format. Which is a kind of compression, a rather simple one. I don\’t remember noticing that before.
Anon - May 4th, 2016 3:04 am

So there may already be a compression algorithm resident in the ROM disk driver that has been reused in the ROM-inator AND in the IIsi ROM?
Anon - May 4th, 2016 5:50 am

If the Macintosh Classic was using PackBits on its disk image: was it decompressing on the fly? Or was it unPacking into RAM and running from there?
If it was on-the-fly: no-one ever complained about access speed on the Mac Classic ROM disk before!
Stephane.D - May 10th, 2016 12:42 am

I’m doing development on a 68000 based system (Megadrive) and I recently designed a derived version of LZ4 compression to take benefit of data word alignement on 68000 CPU as i really wanted to have “live GFX data decompression”.

I originally used the classic LZ4 implementation with the best optimized assembly code i could do for decompression and i obtained about 250-360 KB/second of decompressed data on a 7.7 Mhz 68000.
With my new LZ4W compression scheme i was able to obtain 600 up to 950 KB/second on a simple 7.7 Mhz 68000, i think there is no faster decompression algorithm on the 68000, at least i was not able to find one.

The compression level is almost as good then LZ4 for small packet of data but get worst for larger one.

Still you can give a try in your are interested in. You can find here the compressor (wrote in java) and the 68000 assembly decompressor :
https://www.dropbox.com/sh/smgwbi8g6y50n60/AAC47jpBZKxPqeY_NXo703vQa?dl=0

To obtain that decompression speed i had to heavily rely on lookup table so the decompression code is not really small (probably ~4 KB in binary format).
denis - February 24th, 2019 2:28 pm

Hi

You can change the end of marker1 to this to safe 6 cycles:

move.b (a5,d6.w),d6
lsl.w #3,d5
move.b (a0)+,d5
lsl.l #8,d5
move.b (a0)+,d5
add.l d0,d5

Denis
denis - February 26th, 2019 5:22 am

Hi

found another one:

all the jumps back to _mainloop with bcs jumps then to bcc _fail. this it not necessary. simply jump to the next line after _mainloop.

Denis

Optimizing Assembly (Fast 68K Decompression)

35 Comments so far

Leave a reply. For customer support issues, please use the Customer Support link instead of writing comments.