Acmlm's Board - I3 Archive -- Programming: Recursive compression; finding the best match

~~(Link to AcmlmWiki)~~ Offline: thank ||bass

Register \| Login
Views: 13,040,846	Main \| Memberlist \| Active users \| Calendar \| Chat \| Online users Ranks \| FAQ \| ACS \| Stats \| Color Chart \| Search \| Photo album	06-26-24 09:02 AM

0 users currently in Programming.

Acmlm's Board - I3 Archive - Programming - Recursive compression; finding the best match

New poll |

Add to favorites | Next newer thread | Next older thread

User Post

HyperHacker

Star Mario
Finally being paid to code in VB! If only I still enjoyed that. <_<
Wii #7182 6487 4198 1828

Since: 11-18-05
From: Canada, w00t!
My computer's specs, if anyone gives a damn.
STOP TRUNCATING THIS >8^(

Last post: 6338 days
Last view: 6338 days

Posted on 05-02-06 04:56 AM

Link | Quote

If you used my MIO0 compressor before, you probably noticed (or read in its readme

) that it doesn't compress as well as Nintendo's. I was hoping to address that issue, but I've hit a wall. Between it being 3:00 AM and me not being good at recursive things, I'm quite out of ideas.

For those who don't know MIO0: It's an LZ77-style compression. You have two sections, one which contains raw data and one which contains 16-bit length/distance pairs (4 bytes length, 12 bytes distance), and a bitmap indicating which to read from.

When you compress MIO0 data you look for repeated strings of bytes. Given a string of 3 to 18 bytes, if you find this same string within the previous 4095 bytes, instead of storing the string again you store the location and length of the match relative to the repeated string. That is, if the same 5 bytes appear at 0x10 and 0x50, you just store a length/distance pair; length=5, distance=0x40 (0x50 - 0x40 = 0x10).

Now I don't really feel like typing all this again, so I'll just post part of a document I already wrote.

It's long, but you can pretty much just quickly skim through up to the

; it's just describing how to go about compressing the data. The most important part to note is that any match takes up two bytes.

Suppose you find a 10-byte match. Make a note of its position, but check the next 10 bytes for other matches. If you find a match of even one byte, keep going even past the 10th to see how big a match it is. Let's say the very last byte of the 10-byte match is the first byte of an 18-byte match.
Now, it looks like if you skip the 10-byte match in favour of the 18-byte one, you save 8 bytes. However, first you need to take into account the compression overhead. You add two bytes to the filesize for each match, as this is how many bytes it takes to store the length and position, so subtract 2 from the size of both matches. This means by taking an 18-byte match you really only save 16 bytes. This is why the minimum match size is 3 bytes; it's the smallest that does any good.
OK, but this still means you save 8 bytes right? No. You've added more overhead! By skipping the 10-byte match in favour of the 18-byte one, you've added 8 bytes to the filesize - the 10 matching bytes that you decided to ignore, minus the two that they'd consume if you matched them. So you aren't saving any space at all!
But wait, there's more. If you matched those 10 bytes, you haven't removed all 18 of the other matching bytes. There is only one byte of overlap - the last byte of the 10-match is the first byte of the 18-match. By matching the 10, you eliminate that first byte - BUT you still have a 17-byte match, saving 15 bytes, plus the 8 you save by matching the 10, for a total of 23.

Another example: You have an 18-byte match, but the last 10 bytes of it are a 9-byte match and the first of a 10-byte match. 10+9 = 19, so you should skip it, right? Well, consider the overhead. By taking the 18, you save 16 bytes. By taking the 10 and the 9, you save 17, right? No! Only 15. You subtract 2 from the size of each match, not the total - it's not (10+9)-2=17, it's (10-2)+(9-2)=15, versus 18-2=16.
So clearly you're best taking the 18-byte match here. And once again, since this only consumes the first byte of the 10-byte match, you still have a 9-byte match following. Do the math: (18-2)+(9-2) = 16+7=23 bytes saved (again).
Another way to think of this is to subtract 2*(# of matches - 1) from the total. You subtract 1 from the # of matches because you're still going to have one match if you take the 18 bytes.

What really complicates the issue is, as usual, recursiveness. Suppose you find an X-byte match and a Y-byte match overlap. Once you've determined whether to take X or Y (we'll assume Y for simplicity's sake), you then need to check within the match you chose to see whether Y overlaps an even better match.
Even more confusing: Suppose you have match X, and you're checking each byte within it for better matches. You find a match at position A (in the last 4K) which is, including overhead, a better match than X. OK, discard X and take this one. But you must still look through all 4K, so see if you can find one even better (at position B), and do this same procedure to see if A or B is better (having already determined that both are better than X) - check all the bytes in both A and B to see if they're overlapping even better matches, to determine whether A or B is better. Then if you find overlapping matches, check all their bytes, and so on...

Probably the best way to do this is a function. Give it the address and size of a match, and it checks each byte that matches to see if you should take it or not. It returns the address of the match it determines you should take; you would then just add in all bytes up to that address raw, and take the match at the address it returns. However, if the address it returns is not the address it was passed (meaning it found a better match within these bytes) then it would call itself passing the info of the match it did find, to determine whether it should stick with the new match it found, or if there's an even better one.
Very simplified pseudo-code:

function GetBestMatch(MatchAddr, MatchSize)

{

	//Some code here would check the bytes at MatchAddr for a better match.

	//Set BestMatchAddr = MatchAddr if none is found, else set it to the

	//address it was found at. Then:

	If BestMatchAddr == MatchAddr

		return MatchAddr

	else

		return GetBestMatch(BestMatchAddr, BestMatchSize)

}

One important note is, however, that this could skip a good match. Suppose that by skipping Match A you can take Match B and save more bytes. In this scenario it will call itself (instance 2) to check Match B. Now suppose it finds that by skipping Match B, you can save more bytes by taking Match C. Again it calls itself (instance 3), and finds that Match C is best. So instance 3 returns Match C; instance 2 returns Match C; instance 1 returns Match C.
See the problem? Instance 2 is only checking Match B - it doesn't know about Match A at all. In this case Match C could very well be 20 bytes from the address passed to instance 1.
What's happened here is it's determined that skipping A to take B is better and skipping B to take C is even better. So now you have to determine if you can take both A and C, and if so, should you? Essentially, ignore B, and compare A and C.

So the question is how to deal with the problem described with this recursive comparison function. I'm way too tired to deal with it now; I figured maybe someone here might have an idea. If anyone's even bothered to read all that.

Guy Perfect

Since: 11-18-05

Last post: 6339 days
Last view: 6337 days

Posted on 05-03-06 02:32 PM

Link | Quote

I've been considering various possibilities as to how to find the best matches, and I came across an idea when I was making the F-Zero X editor.

To save space, I re-use any course names and descriptions that are used more than once. For example, "Mute City - Figure Eight" and "Mute City - Starlight Theater" both have "Mute City" as their name, so I can just write that data to the ROM once and point both track names to it.

However, what if you have "Mute City" and "Big Mute City"? "Mute City" gets written to the ROM, then I come across "Big Mute City" and say "Well, no other course prior to this has this name, so I'll write it to the ROM as well."

Thing is, the data already written to the output file could have been pointed to the middle of "Big Mute City"... Just because the first part of the second string doesn't match any part of the first string doesn't mean the first string doesn't match part of the second.

Try to apply this to MIO0. It may lead to the answer you're looking for.

HyperHacker

Star Mario
Finally being paid to code in VB! If only I still enjoyed that. <_<
Wii #7182 6487 4198 1828

Since: 11-18-05
From: Canada, w00t!
My computer's specs, if anyone gives a damn.
STOP TRUNCATING THIS >8^(

Last post: 6338 days
Last view: 6338 days

Posted on 05-03-06 10:34 PM

Link | Quote

OK, I think I understand, but it sounds like you're just describing the problem. In your example there wouldn't be any match for "Big Mute City", so I add the 'B' raw and look for a match for "ig Mute City", and so on. Eventually I'd have added "Big " raw and would find a match for "Mute City". The problem is when the strings are longer. Let's say I don't find a match for "Mute City", only for "Mute". I can take that and save 2 bytes; the 4 bytes that match, minus the 2 bytes it takes to store a match. But what if there's also a match for "ute City" somewhere? I would need to check each byte of the match I find (the word "Mute") to see if such a better match exists. However, if I do find a better match ("ute City"), I need to check all the bytes of that to see if an even better one exists, ad infinium. Eventually I will find a match that doesn't overlap a better match (or hit the end of the file), but by then I could be, say, 400 bytes from the first match. My result then would be "include these 400 bytes raw and take this next match", which just doesn't make sense.

(edited by HyperMackerel on 05-03-06 09:35 PM)

Add to favorites | Next newer thread | Next older thread

Acmlm's Board - I3 Archive - Programming - Recursive compression; finding the best match

ABII

Acmlmboard 1.92.999, 9/17/2006
©2000-2006 Acmlm, Emuz, Blades, Xkeeper

Page rendered in 0.011 seconds; used 375.63 kB (max 463.63 kB)