comp.compression
[Top] [All Lists]

Re: Magic Compression? (Why won't this work) (Code provided)

Subject: Re: Magic Compression? Why won't this work) (Code provided
From: Thomas Richter
Date: Mon, 18 Feb 2008 12:31:09 +0100
Newsgroups: comp.compression

partner50145703@xxxxxxxxxxxxxxx schrieb:
> I entirely accept the premise that there is no way to compress all
> random data, but I'd love some advice on why you can't avoid this
> using a time-check and verify system.
> 
> I have an "algorithm" (read: simple shell script)
> that I think would successfully substantially compress Mark Nelson
> AMillionRandomDigits
> [1] and successfully decode it.
> I'd love to get some feedback. I've included my proof-of-concept code
> at the bottom of this post.
> 
> 
> It would seem that it is possible to compress the file, and return it
> to it's original form, if we can successfully leverage the time-used
> to create it as a variable.
> 
> On any given machine, we can re-create any file by starting from a
> blank file, and then appending..  00000, then modifying to 00001,
> 00010, expanding until we have tried every iteration until we have, by
> brute-force, found our file.
> 
> While this certainly can create our file, the problem is that the
> solution solves too much.. While it DOES create our file, it also
> creates all other files. That's certainly not going to be very useful.
> 
> If we're clever, and willing to accept a very small chance of loss
> [2], I contend that we can re-create this file successfully.
> 
> Begin by iterating through all possible numbers, out to the number of
> digits in the original. Every time we iterate, compare the MD5 (or
> SHA, or whatever hash you prefer) to the hash of the original file.
> Continue to iterate until you have created every possible file given
> the requisite number of digits.
> 
> "But wait", you say, "You will get multiple files back, since you are
> brute-force cracking the hash"
> 
> You're right, but that isn't a problem. When we encode the file in the
> first place, we should go through the entire process of creating each
> file, and time how long it takes until we create the correct one. We
> can save this as a *very rough* time format. (Unix EPOCH is probably
> fine). This should be a rough time format to avoid wasting bits..
> Seconds is ALMOST CERTAINLY close enough.
> 
> When decoding, we again generate every possible file given the number
> of bits.. We compare each against the hash of the original, and get a
> list of matches.. Then, we can compare the elapsed time of each of the
> matches, to the time required to create the original.
> 
> This is among the slowest ways to compress a file, but it should
> result in quite a savings. The chance of loss is present, but very
> small.. On most PCs, it's extraordinarily unlikely that they will have
> enough speed to generate multiple collisions in one second. In the
> Very-Rare event that it does, and you generate the wrong file, it is
> still possible to re-create the original by re-running the decoder,
> and telling it to ignore the false result.
> 
> 
> 
> Granted, this method is insanely slow, and there is a SLIGHT chance of
> corruption, but I'd love to hear more about why it wouldn't
> successfully pass Mark's test.

Very simple. Instead of marking the time, you can also think of your
script as counting the number of tries it takes until it creates the
original file back. If the time measurement is precise enough, this
should do it, as it does in your method, and both can be considered
equivalent.

Now, consider, how many tries do we need to encode, i.e. how large the
number has to be? Of course we have to encode as many digits as there
are files generating the same MD5-sum (that's approximately the same
precision required to encode the time).

The solution to your riddle is that even though your MD5-sum is small,
the number of files with the same MD5-sum is *huge*, and the precision
of your time stamp, or the size of the number required to encode the
number of runs is immense. If an MD5-sum contains 128 bits (I don't
know, anyhow), then there are 2^128 different MD5 sums. On the other
hand, if you want to encode a million byte file, then there are (8 bits
per digit) 2^(1000000 * 8) of them, and thus this requires your count
(or your time stamp) to be of the precision of 2^(1000000 * 8) / 2^128 =
2^(1000000 * 8 - 128), which is a *huge* number.(Consider, there are
approximately 10^80 ~= 2^300 particles in this universe!). In fact, it
is only 128 bits shorter than the original file, compensating exactly
for the 128 bits in the MD5-sum, so nothing's gained.

This also explains why your time stamp method is unpractical: Every
algorithm working this way would require *a lot* more time than the age
of the universe.

So long,
        Thomas

<Prev in Thread] Current Thread [Next in Thread>