Engineyard SHA Contest Leaderboard I threw together

mrduncan · on July 21, 2009

It looks like "hashbreaker" and "CodingCrypto" are both running through nearly the exact same word sets. Either they are from to different people using the exact same algorithm or (what I suspect is more likely) they have created multiple twitter accounts to get around the submission limit.

  hashbreaker
  bUGs BUgs bUGS bUgS bLAnk BlAnK BlANK BLank buGS bugS bUGS bUGs dCao2

  CodingCrypto
  BugS BUgs buGs BUgs bLANK Blank bLanK BlAnk BuGs BuGs BUGS bUGs FcBkw

  CodingCrypto
  cOWs cOwS cOwS cowS DIRtY Dirty DiRTY DIrtY cOWS Cows CoWs Cows owjo0

  hashbreaker
  CoWS coWS CowS cOWS DiRTY DIrtY DirTy DIRtY COWS COWS cOwS cOWS bvpDq

Edit: Also, accounts had their first tweets on the 19th.

DocSavage · on July 22, 2009

I didn't scrutinize the sha-1 algorithm, but is it possible to optimize its computation (e.g., precompute intermediates) when using a small number of repeated words as input? This might explain why a group of people all arrived at similar solutions, since it only matters how quickly you can guess if all guesses are pretty much equal.

I lucked out and got a solution at 35. Coded something up in python and only got about 85k hashes/sec on one core. The CUDA program, though, was about 100x faster on my low-end Nvidia 8400 GS.

gojomo · on July 22, 2009

Yes: very roughly, you could precompute the internal state after 'abcdefg' -- essentially checkpointing the algorithm -- and then compute SHA1('abcdefgX'), SHA1('abcdefgY'), etc. more easily than other strings. (In fact, you'd want your checkpoint aligned at some multiple -- maybe 512 bits? -- but that's the general idea.)

I figured the 'final 5 arbitrary characters' was a nod to this approach. So the best strategy I could think of -- and I didn't try coding anything up for the contest -- was to (1) guess the word dictionary; (2) precompute a whole bunch of checkpointed-except-for-last-five-characters function states; (3) at the moment the real dictionary and target was announced, try many different last-five-character extensions against any valid precomputed states.

mrduncan · on July 22, 2009

phubbard has a very similar submission as well:

  BUgs bUGS bUGs bugS BlaNK BLanK blANK bLaNK BugS BUgs BuGs bUgs liwcC

Based on his Twitter account I'm a little more inclined to believe that they are indeed separate people all using the same algorithm.

Edit: This appears to be their info based on the submission by nkurz - http://news.ycombinator.com/item?id=717174

gojomo · on July 22, 2009

Disregarding the last 5 characters, the 'bugs blank' and 'cows dirty' strings are exactly 64 characters (512 bits) long, as are many of the other leading entries. Savvy teams are checkpointing the hash at the 64 character point, then rerunning it from there with alternate endings.

jodrellblank · on July 23, 2009

What do you mean by 'checkpointing the hash'?

How far can you go into an SHA1 calculation and then change the input and have your calculation be useful?

gojomo · on July 23, 2009

The function as applied to a stream of bytes has a certain amount of internal state. You could at any moment take a snapshot of the entire state, and from that point, feed it different suffixes of the same previously-fed data. In this way, you calculate many final hashes without always repeating the redundant part (the shared prefix).

For a concrete example, consider let us say you wanted to take the SHA1 of the input that is 1MB of 'a' followed by an 'a', and then 1MB of 'a' followed by a 'b', etc.

You don't have to run the SHA1 function over 'a...aa', then reset from scratch and run it over 'a...ab'. You can run it over 'a...a' without then feeding it the final byte or performing its final steps, snapshot its state (which is only some small multiple of its final output size), and then quickly calculate 'a...aa', 'a...ab', 'a...ac', etc. without redundantly feeding the identical 1MB prefix in each time.

DocSavage · on July 23, 2009

Is snapshotting the equivalent of using copy() in python's hashlib? I did something like below but only got a minor speed increase --

  stem_sha = hashlib.new('sha1')
  stem_sha.update('ruby ruby ruby... rails rails ')
  for suffix in random_char_set:  # Gets random 5 char phrase
      test_sha = stem_sha.copy()
      test_sha.update(suffix)
      ...

gojomo · on July 23, 2009

From the hashlib documentation, yes, it appears so.

Note, though, SHA1 works in 512-bit (64-byte) blocks, so 'snapshotting' after feeding it (for example) 63 bytes hasn't saved any hard work -- the routine was only buffering those 63 waiting for byte #64 (or the digest-finish operation) to do the real crunching.

And, the tiny EngineYard strings don't offer a giant opportunity for precalculation savings. The usual pattern for digesting is:

  1 x new()
  N x update()
  1 x digest()

The copy() lets you snapshot state after M update()s, M<N, to speed future calculations. Obviously if M is a lot -- there is a large shared prefix -- there could be big savings. But with the EngineYard strings, M should never be more than 1 block/update(). (If it is, your 10-word preamble was wastefully long.) So the potential savings are capped relative to the cost of the copy() and the (unavoidable) second update() and digest(). If the second update() and final digest() are roughly as costly as the first update() and original new(), and if the copy() is nearly costless, I suppose the speedup could approach 2x.

But also: I suspect an optimized C SHA1 operation on tiny amounts of data is going to scream compared to your Python loop and calling overheads -- so any improvement from your copy() reuse may be noise compared to what's taking up most of your loop time.

dchest · on July 22, 2009

Description: http://www.win.tue.nl/cccc/sha-1-challenge.html

electronslave · on July 22, 2009

As of right now (20090722 0700UTC) CodingCrypto has won. The programmer on both CodingCrypto and hashbreaker?

Why, none other than our good friend -- and my personal hero -- djb! I have no doubt that the race was actually between seibert and djb.

To the uninitiated: http://cr.yp.to

brown9-2 · on July 21, 2009

Kind of interesting or funny that these entries are pretty close to the top of the standings:

  31  	d241c4b4dd38aacb197c8e48e9f8488264e39a83  	
  bUGs BUgs bUGS bUgS bLAnk BlAnK BlANK BLank buGS bugS bUGS bUGs dCao2
 
  32  	334945bffdd1000f1d198e02497188a2b17f1991
  cOWs cOwS cOwS cowS DIRtY Dirty DiRTY DIrtY cOWS Cows CoWs Cows owjo0

(assuming the score are correct)

theblackbox · on July 22, 2009

Would the contest have been any easier had one made the (dubious) assumption that the "passphrase" encoded would likely be a valid sentence?

I've not tried any crypto work before but this has really grabbed my interest. It just occurred to me however, that discarding lines of inquiry leading down the "BuGS bugs bUgs BUGs BLAnK BlAnK....." route in favour of something that made more sense grammatically could be a good tactic to save time?.... or it could just increase the overhead in generating the next hash?...

eru · on July 22, 2009

Unless you found the exact sequence, the difference between hashes won't tell you how close you are.

theblackbox · on July 22, 2009

I don't see what you mean by "the difference between hashes won't tell you how close you are" .... isn't that the whole point of this competition?

Say for instance you coded four stages:

1/ Generating a list of strings for analysis 2/ Parsing that list for sentences that made sense 3/ Hashing these sentences 4/ Comparing the hashes generated with the one being sought

to me it just looked as though a lot of gibberish was analysed because there is a lot more entropy than order to sift through and nobody tried step 2... does my logic make sense or am I getting mixed up and not really seeing the bigger picture?

eru · on July 22, 2009

Suppose the hash for a specific phrase was completely random, but fixed. (Which is not too far off for a cryptographic hash-function.)

So unless you want to find the exact sequence, it does not matter whether you candidate strings make sense or not. And since the search space is so large, people just gave up on finding the exact sequence. They just use a lot of cycles to find any sequence, that is close in hash.

jodrellblank · on July 23, 2009

There's no benefit in trying step 2 because the gibberish input is as likely to be a close match as a sentence that makes sense.

A sentence that starts 'Be' (001000010...) and the same sentence changed to 'He' (001001000...) would produce very different hash results by the Avalanche Effect ( http://en.wikipedia.org/wiki/Avalanche_effect ). So having similar letters in a similar order or in similar positions, or using similar length words, wont help.

Isn't that the whole point of this competition?

The point was to get a hash with a similar quantity of 0 and 1 bits in the output as their hash, not to guess their sentence (they said what it was).

theblackbox · on July 23, 2009

The point was ... not to guess their sentence (they said what it was).

Ahh, I missed that! That makes sense in terms of the Avalanche Effect - cheers for the knowledge

0wned · on July 21, 2009

Twitter accounts with "private" updates do not show up on the public timeline... my three submissions are nowhere to be found and they do validate.

on July 21, 2009

[deleted]

0wned · on July 21, 2009

I'm a twitter virgin. Never tweetted before today's contest. My best score was 36. No big deal.

bcl · on July 21, 2009

Good job! Could you add a timestamp and option to sort by it? (I know, feature creep!)

ETA: I also find it interesting that the lowest score from those not using the optional 5 random chars is currently 38. All the lower ones use random chars.

JeremyChase · on July 21, 2009

There is a practical reason for this depending on your algorithm. In mine it was much faster to change the last 5 characters instead of changing the entire string. I saw about a 50% performance improvement by creating a fixed set of 12 words, and altering the last 5 characters. Then updating the 12 words when necessary.

jazzychad · on July 21, 2009

Thought about it, but the two systems I used to compile the data are showing different timestamps (almost like a timezone mismatch) so now it's too late to fix them. I do have the "latest 10 entries" as a compromise.

philfreo · on July 21, 2009

Right now you're showing duplicate entries (the same entry of value 31 is showing from two people)

jazzychad · on July 21, 2009

That's because somebody just copied and pasted somebody else's entry. The system is working as intended. I'm sure Engineyard will catch it in their judging (or so I hope).

twism · on July 21, 2009

the rules says you can "append up to 5 characters" but you consider entries with less than 5 optional characters invalid.

jazzychad · on July 21, 2009

No, there are several listed with less than 5 extra characters. It's just that most people are using all 5.

rneufeld · on July 21, 2009

My entry (rkneufeld) is cut off at the end. Actual suffix is: "3:OID". This moved me from 34 to 68 :( . Just thought you might like to know

jazzychad · on July 21, 2009

Apologies. I have corrected it.

Oompa · on July 21, 2009

I'm not using any, and it didn't pick my entry up.

jazzychad · on July 21, 2009

What's the link to your tweet? The Twitter Search API is dropping some tweets. Does your entry display in the search results?

Oompa · on July 21, 2009

Just popped up! Guess it just took awhile.

twism · on July 21, 2009

my apologies.