The Calgary corpus Compression & SHA-1 crack Challenge

Latest News

May 21, 2016: 20 years since its inception, the Challenge will conclude.

On July 2, 2010, the challenge was accepted by Alexander Rhatushnyak who sent me an entry of size 580170 (payout of $121.17 received). It continues the series of PAQAR-based entries, using the relaxed memory usage rules.

The entry consists of two files - a PPMd archive file and a separate data file. The size of the entry is computed as a sum of: the length of the PPMd file (7700), the length of the data file (572465), 3 bytes for the length of the data file, and 2 bytes for the data file name (arbitrary 1-character name) including the terminator.

For convenience, the contents of the PPMd archive are also available in a ZIP file.

August 18, 2006: Without compromizing the compression part of the challenge, it is now also a SHA-1 hash crack challenge!

August 18, 2006: There is now a more lucrative contest - the Hutter Prize. I am relaxing the memory limitations of my challenge to match those of the Hutter Prize.

May 21, 2006: It has been ten years since the inception of the Challenge.


Challenge Statement
History
Glossary: archive
Glossary: executable


Challenge Statement

I, Leonid A. Broukhis, will pay the amount of

(580,170.00 - X) / 111 US dollars (but no less than $33.33) to the first person who sends me an archive of length X bytes, containing an executable and possibly other files, where the said executable file, run repeatedly with arguments being the names of other files contained in the original archive file one at a time (or run once without arguments if no other files are present) on a computer with no permanent storage or communication devices accessible to the running process(es) produces 14 new files, so that a 1-to-1 relationship of bytewise identity may be established between the SHA-1 hash values of those new files and the SHA-1 hash values of the files in the original Calgary corpus, specifically,
3f86203a59b9d823c784f0414dd1920bcb62d067bib
673c583d45544003eb0edd57f32a683b3c414a18book1
b855cfafe7374942a0ae54c3bd90f0bce7b73fabbook2
5cf652cfcc8e556ffb5e118fc29bcffef0aa71abgeo
afd9f190c621f45216a321485a543c00786bc76bnews
d155a7f8c68d24e9914b3274d5b5a6aa720e8d58obj1
e02c588e271f242fd00ecc68a931d9c5485323a0obj2
aef6dac8838b1e9b35a46a6c1ccf1876a63486b4paper1
93d9bf0d3b4eae5198cf589336b30af3d6607febpaper2
96f7ab3d975ea4d823cf30be7dad5827f45858e9pic
66fa53f757f6474ad92b2ae52ef07981839dd14dprogc
7f9723167476639998ece850b9fbe1e5587aed1cprogl
22c59a4046cc52510a736582eae4bcdca4713411progp
34322336c2c5b210fefc5b85517c65de1d184da5trans

(In other words, "solid" mode, as well as shared dictionaries/models and other tune-ups specific for the Calgary Corpus are allowed.)
As an option, the executable may read its data from the standard input (do not forget to specify which convention you use in your challenge entry), although the amount of code it saves is negligible.

The amount of RAM required to run the executable is only limited by the computer architecture, but strive for no more than 800 Mb of memory. The total time to run the decompression program to produce all the 14 files must not exceed 24 hours on a 2000 MIPS computer.

After I receive an eligible entry, the corresponding numbers will be changed according to the actual length of the winning entry, and the challenge will continue. I reserve the right to decrease the payable sum by the amount of postal or transfer expenses, in case they are substantial (greater than 15% of the payable sum). So far, I never used that right despite having to send money by Western Union to Russia and to Ukraine.

Incremental improvements from the current winner are accepted; incremental improvements from third parties are not accepted.

The information on eligible entries will be posted on the page, unless I'm advised otherwise (the code may be posted as well by author's request). In case an NDA is required, I'm ready to sign it. Keep in mind that the whole thing is for my own entertainment.

All correspondence related to the challenge itself or to this Web page can be sent to this spam-protected address (remove q's and x's)


Originally there were two challenges: one similar to the current one, and another for a "general purpose" compressor that had to act as a filter and decompress files one by one, as well as to be able to decompress other files. As it is possible to embed the solidly compressed corpus within the decompressor executable and use some single byte encoding for the 14 files in the Calgary corpus, albeit never exploited, that other challenge was of not much interest, and I decided to combine them.

The original formula was (777,777.00 - X) / 333 for each of the two challenges.

Since then, the divisor went down from 333 to 111.

Marking its tenth anniversary, this challenge is most likely the only data compression challenge with a cash prize. Since inception, I have paid $1181.91 in prize money.


History

The Calgary Corpus Compression challenge was conceived on May 21, 1996 with a posting to the Usenet group comp.compression. At the time there was a flurry of postings claiming incredible compression rates using "secret" algorithms but without any substantiation. Everybody got tired of it and I've decided to challenge these claims.

Here is the text of the original posting, courtesy of Google:

   From: Leonid A. Broukhis (leob@shellx.best.com)
   Subject: Compression challenge
   Newsgroups: comp.compression
   Date: 1996/05/21

What I'm thinking of may look foolish or unnecessary, or whatever,
but this is my preliminary proposal of my compression challenge:

If I decide to make the challenge, I'll pay the amount of
(777,777 - X) / 1000 US dollars, the said amount
never exceeding $777.78 (duh!), to the first person
who sends me the .tar.gz archive
of length X bytes, containing a Linux Intel executable a.out and
14 other files,
where the said a.out file, run repeatedly with arguments being
the names of other files contained in the original .tar.gz
file one at a time on a computer with no hard drives, CD drives
network devices or modems connected, and with no floppy in the floppy bay
(effectively, a stripped down computer, booted to Linux using a
boot/root disk combination with the only filesystem being the RAM
disk with the .tar.gz. file copied to it)
produces 14 new files, so that a 1-to-1
relationship of bytewise identity may be established between
those new files and the files in the original Calgary corpus.

 Leo

PS. If no suggestions about the wording of the challenge
follow in the next 10 days, I'll post the challenge on my web
page and publish the URL.

In Fall of 1997, the first (solid) challenge was accepted by Malcolm Taylor, who sent me an entry of size 759881. The payout was $53.74 which he donated to Jeff Gilchrist, the maintainer of the Archive Comparison Page.

The formula has been changed (the divisor remained at 333), and a minimum payout of $10.01 has been established: the threshold having been crossed, I was not interested in incremental changes. The second challenge remained untouched.

Then, after a long delay of almost 4 years, in August 2001 both challenges were accepted by Maxim Smirnov, who sent me entries of sizes 692154 and 696818 (the payout was $203.38 + $243.12 = $446.50), which he later improved to 680558 and 685341 respectively. Instead of receiving an addition to the payout, he donated it to be applied to the modified formulae he proposed. They became $33.33 + (666,666.00 - X) / 222 (with minimum payout of $33.33) for each of the two challenges.

In November 2002, both challenges were accepted by Serge Voskoboynikov, who sent me an entry of size 653720 bytes for the first one, and an entry of size 663281 bytes for the second one (the payout was $140.22). At that time I've decided to merge the two challenges: the formula became (653,720.00 - X) / 111, payout no less than $66.66.

On January 10, 2004 the challenge was accepted by Matt Mahoney, who sent me an entry of size 645667. He decided to forgo the payout of $72.55 in favor of the next challenger. Remarkably, after I've allowed to send source code instead of a binary executable, the very next entry contained the source code. To prevent any incremental changes to the entry from qualifying (it might be fairly easy to squeeze a kilobyte or two by tweaking the source code), I've decided to set the minimum payout to $100.01.

On April 2, 2004, the challenge was accepted by Alexander Rhatushnyak who sent me an entry of size 637116, gradually improved down to 619922 on April 25, at which point he received a payout of $305 (that included $0.51 in advance).

Alexander's entry is an improvement of Matt Mahoney's entry, but not a trivial one; it overshot the previous minimum required payout of $100.01 by a wide margin.

This is the first entry that uses two different archive programs for packaging: the source code of the decompressor within the HA archive (the file named "s") is packed with RAR.

On May 19, 2004, upon receiving an entry of size 614738 from Alexander Rhatushnyak, Maxim Smirnov's donation toward a modified formula ran out (at 618879 bytes, to be precise), but I kept the divisor at 111.

On December 31, 2004 Alexander Rhatushnyak improved his entry to 608980. The remaining payout of $98.07 received.

On April 4, 2005, the challenge was accepted by Przemysław Skibiński who sent me an entry of size 603416 (payout of $50.13 received). It continues the series of PAQAR-based entries and requires about 370 Mb of RAM.

On December 3, 2005, the challenge was accepted by Alexander Rhatushnyak who sent me an entry of size 593620 (payout of $88.25 received). Later improved to 589862. It continues the series of PAQAR-based entries, using the relaxed memory usage rules.

Here is the history in table form:

SizeDateName
75988109/1997Malcolm Taylor
69215408/2001Maxim Smirnov
68055809/2001Maxim Smirnov
65372011/2002Serge Voskoboynikov
64566701/2004Matt Mahoney
63711604/2004Alexander Rhatushnyak
60898012/2004Alexander Rhatushnyak
60341604/2005Przemysław Skibiński
59631410/2005Alexander Rhatushnyak
59362012/2005Alexander Rhatushnyak
58986305/2006Alexander Rhatushnyak
58017007/2010Alexander Rhatushnyak


Glossary

An archive is a file or a set of files that can be processed by any or a combination of the programs from the following list (striked out programs are unlikely to be useful):

  • bunzip2 1.0.2,
  • unrar 3.30,
  • PPMd variant I,
  • ha 0.999;
  • tar,
  • uncompress,
  • gunzip,
  • unzip,
  • unarj,
  • zoo,
  • (where the program names refer to commonly used decompression/archiving applications which source code is publicly available)

    or that does not need any processing before usage for purposes of this challenge, in which case the archive size is computed as the sum of file sizes, file name lengths including the terminator, and 3 bytes per file to encode their lengths (no file will be longer than 16 Mb, right?).


    Executable

    An executable may be Linux or Win32 (Intel), the usage of the standard dynamic libraries (e.g. MSVCRT.dll, libc, libm, libg++) is permitted. Linux executables, when stripped, seem to be smaller, but the Windows executables compress to smaller sizes. To reduce the size of an executable, you can try to avoid using standard I/O libraries - use direct system calls instead (but remember: unless you redefine exit(), the stdio library is still there), and you can use a binary editor to replace unnecessary strings in the executable with zero bytes to improve its compression rate.
    A program written in Forth will be the smallest, though.

    As an experiment, I decided to allow portable (compilable and runnable in UNIX/Linux) C or C++ sources and Perl scripts -- they may compress tighter than the binaries.


    You may want to visit the Archive Comparison Page to see the state of the art.


    You are the 572nd person accessing this page.


    This page has been modified on May 9, 2014.