comp.compression
[Top] [All Lists]

Magic Compression? (Why won't this work) (Code provided)

Subject: Magic Compression? Why won't this work) (Code provided
From:
Date: Mon, 18 Feb 2008 03:02:14 -0800 PST
Newsgroups: comp.compression

I entirely accept the premise that there is no way to compress all
random data, but I'd love some advice on why you can't avoid this
using a time-check and verify system.

I have an "algorithm" (read: simple shell script)
that I think would successfully substantially compress Mark Nelson
AMillionRandomDigits
[1] and successfully decode it.
I'd love to get some feedback. I've included my proof-of-concept code
at the bottom of this post.


It would seem that it is possible to compress the file, and return it
to it's original form, if we can successfully leverage the time-used
to create it as a variable.

On any given machine, we can re-create any file by starting from a
blank file, and then appending..  00000, then modifying to 00001,
00010, expanding until we have tried every iteration until we have, by
brute-force, found our file.

While this certainly can create our file, the problem is that the
solution solves too much.. While it DOES create our file, it also
creates all other files. That's certainly not going to be very useful.

If we're clever, and willing to accept a very small chance of loss
[2], I contend that we can re-create this file successfully.

Begin by iterating through all possible numbers, out to the number of
digits in the original. Every time we iterate, compare the MD5 (or
SHA, or whatever hash you prefer) to the hash of the original file.
Continue to iterate until you have created every possible file given
the requisite number of digits.

"But wait", you say, "You will get multiple files back, since you are
brute-force cracking the hash"

You're right, but that isn't a problem. When we encode the file in the
first place, we should go through the entire process of creating each
file, and time how long it takes until we create the correct one. We
can save this as a *very rough* time format. (Unix EPOCH is probably
fine). This should be a rough time format to avoid wasting bits..
Seconds is ALMOST CERTAINLY close enough.

When decoding, we again generate every possible file given the number
of bits.. We compare each against the hash of the original, and get a
list of matches.. Then, we can compare the elapsed time of each of the
matches, to the time required to create the original.

This is among the slowest ways to compress a file, but it should
result in quite a savings. The chance of loss is present, but very
small.. On most PCs, it's extraordinarily unlikely that they will have
enough speed to generate multiple collisions in one second. In the
Very-Rare event that it does, and you generate the wrong file, it is
still possible to re-create the original by re-running the decoder,
and telling it to ignore the false result.



Granted, this method is insanely slow, and there is a SLIGHT chance of
corruption, but I'd love to hear more about why it wouldn't
successfully pass Mark's test.


My proof of concept shell script follows. The Encoder and Decoder
should take nearly identical amounts of time to encode/decode the
file.

The Decoder will generate all possible files given the file length,
and then compare the time it took to build them against the time it
took to build the original in the encoder.
It returns the closest match as the requested file. This is almost
always going to be correct, assuming that the PC is in a pristine
state when running each set of code.


kari:pi$ cat encode.sh
#!/bin/bash
if [ ${#} -eq 0 ]; then echo "usage: $0 filename"; exit; fi
echo "" > file
sleep 3;
digits=`cat $1 | wc -c`
digits=`expr $digits + 1`

count=`echo 0`
hash=`cat $1 | md5`

STARTTIME=`date +%s`

while [ `echo $count | wc -c ` -lt $digits ];
        do
        count=`expr $count + 1`
        ELAPSED=`expr $STARTTIME - $(date +%s)`
        ELAPSED=`echo $ELAPSED |  awk ' { if($1>=0) { print $1} else {print
$1*-1 }}'`

        if [ `echo $count | md5` = $hash ]; then echo $count:`echo
$ELAPSED`>> file; fi
        if [ "$count" == `cat $1` ]; then echo $digits:$hash $ELAPSED; fi
        done
------

kari:pi$ cat decode.sh
#!/bin/bash
if [ ${#} -ne 2 ]; then echo "usage: $0 hash expected_elapsed"; exit;
fi
EXPECTEDELAPSED=`echo $2`
echo "" > file

digits=`echo $1| awk -F: {'print $1'}`
digits=`expr $digits`
hash=`echo $1| awk -F: {'print $2'}`

count=`echo -1`
echo $digits

STARTTIME=`date +%s`
#find all the hash matches
while [ `echo $count | wc -c ` -lt $digits ];
        do
        count=`expr $count + 1`
#       echo -ne $count | wc -c;
        ELAPSED=`expr $STARTTIME - $(date +%s)`
        ELAPSED=`echo $ELAPSED  |  awk ' { if($1>=0) { print $1} else {print
$1*-1 }}'`
        if [ `echo $count | md5` = $hash ]; then echo $count:`echo
$ELAPSED`>> file; fi
        done


#find the closest to the original
LOWESTTIME=`echo 999999999999999`

#find the closest to the original
for i in `cat file`;
        do
        CURRENT=`echo $i| awk -F: {'print $2'}`
        DIFFERENCE=`expr $CURRENT - $EXPECTEDELAPSED`
        DIFFERENCEA=`echo $DIFFERENCE | awk ' { if($1>=0) { print $1}
else {print $1*-1 }}'`
        if  [ "$DIFFERENCEA" -lt "$LOWESTTIME" ]
                then
                   LOWESTTIME=`echo $DIFFERENCEA`
                   CLOSEST=`echo $i|awk -F: {'print $1'}`
        fi
        done;

echo  $CLOSEST

--------




[1] http://marknelson.us/2006/06/20/million-digit-challenge/
[2] This is a very small chance. You're unlikely to have two hash
collisions in the same second, on most speed PCs.





f436af1486f68768f40319f00e0c1681

<Prev in Thread] Current Thread [Next in Thread>