The Calgary Corpus Compression & SHA-1 Crack Challenge

Latest News

May 21, 2016: Twenty years after its inception, the Challenge has concluded. If you're interested in submitting your result, please refer to Matt Mahoney's page.

On July 2, 2010, the challenge was accepted by Alexander Rhatushnyak, who sent me an entry of size 580170 (payout of $121.17 received). It continues the series of PAQAR-based entries, using the relaxed memory usage rules.

The entry consists of two files - a PPMd archive file and a separate data file. The size of the entry is computed as the sum of the length of the PPMd file (7700), the length of the data file (572465), 3 bytes for the length of the data file, and 2 bytes for the data file name (an arbitrary 1-character name) including the terminator, which gives 7700 + 572465 + 3 + 2 = 580170.

For convenience, the contents of the PPMd archive are also available in a ZIP file.

August 18, 2006: Without compromising the compression part of the challenge, it is now also a SHA-1 hash crack challenge!

August 18, 2006: There is now a more lucrative contest - the Hutter Prize. I am relaxing the memory limitations of my challenge to match those of the Hutter Prize.

May 21, 2006: It has been ten years since the inception of the Challenge.


  • Challenge Statement (for archival purposes only)
  • History
  • Glossary: archive
  • Glossary: executable


Challenge Statement (for archival purposes only)

I, Leonid A. Broukhis, will NOT pay the amount of

(580,170.00 - X) / 111 US dollars (but no less than $33.33) to the first person who sends me an archive of length X bytes, containing an executable and possibly other files, where the said executable, run repeatedly with arguments being the names of the other files contained in the original archive one at a time (or run once without arguments if no other files are present), on a computer with no permanent storage or communication devices accessible to the running process(es), produces 14 new files, so that a 1-to-1 relationship of bytewise identity can be established between the SHA-1 hash values of those new files and the SHA-1 hash values of the files in the original Calgary corpus, specifically:
3f86203a59b9d823c784f0414dd1920bcb62d067  bib
673c583d45544003eb0edd57f32a683b3c414a18  book1
b855cfafe7374942a0ae54c3bd90f0bce7b73fab  book2
5cf652cfcc8e556ffb5e118fc29bcffef0aa71ab  geo
afd9f190c621f45216a321485a543c00786bc76b  news
d155a7f8c68d24e9914b3274d5b5a6aa720e8d58  obj1
e02c588e271f242fd00ecc68a931d9c5485323a0  obj2
aef6dac8838b1e9b35a46a6c1ccf1876a63486b4  paper1
93d9bf0d3b4eae5198cf589336b30af3d6607feb  paper2
96f7ab3d975ea4d823cf30be7dad5827f45858e9  pic
66fa53f757f6474ad92b2ae52ef07981839dd14d  progc
7f9723167476639998ece850b9fbe1e5587aed1c  progl
22c59a4046cc52510a736582eae4bcdca4713411  progp
34322336c2c5b210fefc5b85517c65de1d184da5  trans

(In other words, "solid" mode, shared dictionaries/models, and other tune-ups specific to the Calgary Corpus are allowed.)
As an option, the executable may read its data from the standard input (do not forget to specify which convention you use in your challenge entry), although the amount of code it saves is negligible.

The amount of RAM required to run the executable is only limited by the computer architecture, but strive for no more than 800 Mb of memory. The total time to run the decompression program to produce all the 14 files must not exceed 24 hours on a 2000 MIPS computer.
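
For illustration only, and not part of the rules: below is a minimal C sketch of how an entry's 14 output files could be checked against the hash values listed above. It assumes a POSIX system with the sha1sum utility on the PATH and the produced files in the current directory; the verifier itself is hypothetical, not the official acceptance procedure.

    /* Hypothetical verifier sketch (POSIX): hash each produced file with the
     * external sha1sum utility and compare the result against the values
     * listed in the challenge statement above. */
    #include <stdio.h>
    #include <string.h>

    static const char *expected[][2] = {
        { "3f86203a59b9d823c784f0414dd1920bcb62d067", "bib"    },
        { "673c583d45544003eb0edd57f32a683b3c414a18", "book1"  },
        { "b855cfafe7374942a0ae54c3bd90f0bce7b73fab", "book2"  },
        { "5cf652cfcc8e556ffb5e118fc29bcffef0aa71ab", "geo"    },
        { "afd9f190c621f45216a321485a543c00786bc76b", "news"   },
        { "d155a7f8c68d24e9914b3274d5b5a6aa720e8d58", "obj1"   },
        { "e02c588e271f242fd00ecc68a931d9c5485323a0", "obj2"   },
        { "aef6dac8838b1e9b35a46a6c1ccf1876a63486b4", "paper1" },
        { "93d9bf0d3b4eae5198cf589336b30af3d6607feb", "paper2" },
        { "96f7ab3d975ea4d823cf30be7dad5827f45858e9", "pic"    },
        { "66fa53f757f6474ad92b2ae52ef07981839dd14d", "progc"  },
        { "7f9723167476639998ece850b9fbe1e5587aed1c", "progl"  },
        { "22c59a4046cc52510a736582eae4bcdca4713411", "progp"  },
        { "34322336c2c5b210fefc5b85517c65de1d184da5", "trans"  },
    };

    int main(void)
    {
        int failures = 0;
        for (size_t i = 0; i < sizeof expected / sizeof expected[0]; i++) {
            char cmd[64], line[128] = "";
            /* sha1sum prints "<40-char hash>  <name>" on its first line */
            snprintf(cmd, sizeof cmd, "sha1sum %s", expected[i][1]);
            FILE *p = popen(cmd, "r");
            if (p == NULL || fgets(line, sizeof line, p) == NULL) {
                printf("%-7s could not be hashed\n", expected[i][1]);
                failures++;
            } else if (strncmp(line, expected[i][0], 40) != 0) {
                printf("%-7s MISMATCH\n", expected[i][1]);
                failures++;
            } else {
                printf("%-7s OK\n", expected[i][1]);
            }
            if (p != NULL)
                pclose(p);
        }
        return failures != 0;
    }

Compile it with any C compiler and run it in the directory where the decompressor wrote its output; a non-zero exit status indicates at least one mismatch.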

After I receive an eligible entry, the corresponding numbers will be changed according to the actual length of the winning entry, and the challenge will continue. I reserve the right to decrease the payable sum by the amount of postal or transfer expenses, in case they are substantial (greater than 15% of the payable sum). So far, I have never used that right, despite having had to send money by Western Union to Russia and to Ukraine.

Incremental improvements from the current winner are accepted; incremental improvements from third parties are not accepted.

The information on eligible entries will be posted on this page unless I'm advised otherwise (the code may be posted as well at the author's request). If an NDA is required, I'm ready to sign it. Keep in mind that the whole thing is for my own entertainment.

All correspondence related to the challenge itself or to this Web page can be sent to this spam-protected address (remove q's and x's)


Originally there were two challenges: one similar to the current one, and another for a "general purpose" compressor that had to act as a filter, decompressing the corpus files one by one as well as being able to decompress other files. Since it is possible (although this was never exploited) to embed the solidly compressed corpus within the decompressor executable and use some single-byte encoding for the 14 files of the Calgary corpus, that other challenge was of little interest, and I decided to combine them.

The original formula was (777,777.00 - X) / 333 for each of the two challenges.

Since then, the divisor went down from 333 to 111.

Having marked its tenth anniversary, this challenge is most likely the only data compression challenge with a cash prize. Since its inception, I have paid out $1181.91 in prize money.


History

The Calgary Corpus Compression Challenge was conceived on May 21, 1996, with a posting to the Usenet group comp.compression. At the time there was a flurry of postings claiming incredible compression rates achieved by "secret" algorithms, but without any substantiation. Everybody got tired of it, and I decided to challenge these claims.

Here is the text of the original posting, courtesy of Google:

   From: Leonid A. Broukhis (leob@shellx.best.com)
   Subject: Compression challenge
   Newsgroups: comp.compression
   Date: 1996/05/21

What I'm thinking of may look foolish or unnecessary, or whatever,
but this is my preliminary proposal of my compression challenge:

If I decide to make the challenge, I'll pay the amount of
(777,777 - X) / 1000 US dollars, the said amount
never exceeding $777.78 (duh!), to the first person
who sends me the .tar.gz archive
of length X bytes, containing a Linux Intel executable a.out and
14 other files,
where the said a.out file, run repeatedly with arguments being
the names of other files contained in the original .tar.gz
file one at a time on a computer with no hard drives, CD drives
network devices or modems connected, and with no floppy in the floppy bay
(effectively, a stripped down computer, booted to Linux using a
boot/root disk combination with the only filesystem being the RAM
disk with the .tar.gz. file copied to it)
produces 14 new files, so that a 1-to-1
relationship of bytewise identity may be established between
those new files and the files in the original Calgary corpus.

 Leo

PS. If no suggestions about the wording of the challenge
follow in the next 10 days, I'll post the challenge on my web
page and publish the URL.

In the fall of 1997, the first (solid) challenge was accepted by Malcolm Taylor, who sent me an entry of size 759881. The payout was $53.74, which he donated to Jeff Gilchrist, the maintainer of the Archive Comparison Page.

The formula was then changed (the divisor remained at 333), and a minimum payout of $10.01 was established: with the threshold crossed, I was not interested in incremental changes. The second challenge remained untouched.

Then, after a long delay of almost 4 years, in August 2001 both challenges were accepted by Maxim Smirnov, who sent me entries of sizes 692154 and 696818 (the payout was $203.38 + $243.12 = $446.50), which he later improved to 680558 and 685341 respectively. Instead of receiving an additional payout for the improvements, he donated it toward the modified formulae he proposed. They became $33.33 + (666,666.00 - X) / 222 (with a minimum payout of $33.33) for each of the two challenges.

In November 2002, both challenges were accepted by Serge Voskoboynikov, who sent me an entry of size 653720 bytes for the first one, and an entry of size 663281 bytes for the second one (the payout was $140.22). At that time I decided to merge the two challenges: the formula became (653,720.00 - X) / 111, with a payout of no less than $66.66.

On January 10, 2004, the challenge was accepted by Matt Mahoney, who sent me an entry of size 645667. He decided to forgo the payout of $72.55 in favor of the next challenger. Remarkably, after I allowed source code to be sent instead of a binary executable, the very next entry contained source code. To prevent incremental changes to the entry from qualifying (it might be fairly easy to squeeze out a kilobyte or two by tweaking the source code), I decided to set the minimum payout to $100.01.

On April 2, 2004, the challenge was accepted by Alexander Rhatushnyak, who sent me an entry of size 637116, gradually improved down to 619922 on April 25, at which point he received a payout of $305 (which included $0.51 in advance).

Alexander's entry is an improvement of Matt Mahoney's entry, but not a trivial one; it overshot the previous minimum required payout of $100.01 by a wide margin.

This is the first entry that uses two different archive programs for packaging: the source code of the decompressor within the HA archive (the file named "s") is packed with RAR.

On May 19, 2004, upon receiving an entry of size 614738 from Alexander Rhatushnyak, Maxim Smirnov's donation toward a modified formula ran out (at 618879 bytes, to be precise), but I kept the divisor at 111.

On December 31, 2004, Alexander Rhatushnyak improved his entry to 608980. The remaining payout of $98.07 was received.

On April 4, 2005, the challenge was accepted by Przemysław Skibiński, who sent me an entry of size 603416 (payout of $50.13 received). It continues the series of PAQAR-based entries and requires about 370 Mb of RAM.

On December 3, 2005, the challenge was accepted by Alexander Rhatushnyak, who sent me an entry of size 593620 (payout of $88.25 received); it was later improved to 589862. It continues the series of PAQAR-based entries, using the relaxed memory usage rules.

Here is the history in table form:

Size    Date     Name
759881  09/1997  Malcolm Taylor
692154  08/2001  Maxim Smirnov
680558  09/2001  Maxim Smirnov
653720  11/2002  Serge Voskoboynikov
645667  01/2004  Matt Mahoney
637116  04/2004  Alexander Rhatushnyak
608980  12/2004  Alexander Rhatushnyak
603416  04/2005  Przemysław Skibiński
596314  10/2005  Alexander Rhatushnyak
593620  12/2005  Alexander Rhatushnyak
589863  05/2006  Alexander Rhatushnyak
580170  07/2010  Alexander Rhatushnyak


Glossary

An archive is a file or a set of files that can be processed by any one of, or a combination of, the programs from the following list (struck-out programs are unlikely to be useful):

  • bunzip2 1.0.2,
  • unrar 3.30,
  • PPMd variant I,
  • ha 0.999,
  • tar,
  • uncompress,
  • gunzip,
  • unzip,
  • unarj,
  • zoo,
  • (where the program names refer to commonly used decompression/archiving applications whose source code is publicly available)

or that does not need any processing before use for the purposes of this challenge, in which case the archive size is computed as the sum of file sizes, file name lengths including the terminator, and 3 bytes per file to encode their lengths (no file will be longer than 16 Mb, right?).
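
As an illustration of this size rule, here is a small C sketch (a hypothetical helper, not official tooling) that sums, for each unprocessed file named on the command line, its contents, its name including the terminating NUL, and the 3-byte length field:

    /* Hypothetical helper: compute the size charged for unprocessed files,
     * i.e. file contents + file name including the terminating NUL + a
     * 3-byte length field per file. */
    #include <stdio.h>
    #include <string.h>
    #include <sys/stat.h>

    int main(int argc, char **argv)
    {
        long long total = 0;
        for (int i = 1; i < argc; i++) {
            struct stat st;
            if (stat(argv[i], &st) != 0) {
                perror(argv[i]);
                return 1;
            }
            total += (long long)st.st_size + (long long)strlen(argv[i]) + 1 + 3;
        }
        printf("entry size: %lld bytes\n", total);
        return 0;
    }

For example, the single unprocessed data file of the 2010 winning entry contributes 572465 + 2 + 3 = 572470 bytes under this rule; the remaining 7700 bytes of that entry are the PPMd archive itself.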


Executable

An executable may be Linux or Win32 (Intel); the use of standard dynamic libraries (e.g. MSVCRT.dll, libc, libm, libg++) is permitted. Linux executables, when stripped, seem to be smaller, but Windows executables compress to smaller sizes. To reduce the size of an executable, you can try to avoid the standard I/O libraries and use direct system calls instead (but remember: unless you redefine exit(), the stdio library is still there), and you can use a binary editor to replace unnecessary strings in the executable with zero bytes to improve its compression rate.
A program written in Forth will be the smallest, though.
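
As a small illustration of the advice above (a sketch, not a challenge entry), a program that emits its output through the write() system call avoids the buffered stdio path entirely:

    /* Sketch: emit output via the write() system call instead of printf(),
     * one of the size-reduction tricks mentioned above. */
    #include <unistd.h>

    int main(void)
    {
        static const char msg[] = "decompressed output would go here\n";
        /* write(2) goes straight to the kernel; no stdio buffering is used */
        write(STDOUT_FILENO, msg, sizeof msg - 1);
        return 0;
    }

As noted above, unless exit() is redefined the stdio library is still linked in, so the saving from this trick alone is modest.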

As an experiment, I decided to allow portable (compilable and runnable in UNIX/Linux) C or C++ sources and Perl scripts -- they may compress tighter than the binaries.


You may want to visit the Archive Comparison Page to see the state of the art.




This page was last modified on May 9, 2014.