
The function as applied to a stream of bytes has a certain amount of internal state. You could at any moment take a snapshot of the entire state, and from that point, feed it different suffixes of the same previously-fed data. In this way, you calculate many final hashes without always repeating the redundant part (the shared prefix).

For a concrete example, say you wanted the SHA1 of 1MB of 'a' followed by an 'a', then of 1MB of 'a' followed by a 'b', and so on.

You don't have to run the SHA1 function over 'a...aa', then reset from scratch and run it over 'a...ab'. You can run it over 'a...a' without then feeding it the final byte or performing its final steps, snapshot its state (which is only some small multiple of its final output size), and then quickly calculate 'a...aa', 'a...ab', 'a...ac', etc. without redundantly feeding the identical 1MB prefix in each time.
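The idea above can be sketched with Python's hashlib, whose copy() method takes exactly this kind of state snapshot (the 1MB-of-'a' prefix and the single-byte suffixes are just the example values from this thread):

```python
import hashlib

# Hash the expensive 1MB shared prefix exactly once.
prefix = b'a' * (1024 * 1024)
stem = hashlib.sha1()
stem.update(prefix)

# Snapshot the internal state and finish with different final bytes,
# without re-feeding the 1MB prefix each time.
digests = {}
for last in (b'a', b'b', b'c'):
    h = stem.copy()       # cheap copy of the hash's internal state
    h.update(last)        # only the differing suffix gets hashed
    digests[last] = h.hexdigest()

# Sanity check: matches hashing the full input from scratch.
assert digests[b'b'] == hashlib.sha1(prefix + b'b').hexdigest()
```

Each iteration now costs roughly one block of SHA1 work instead of ~16,000.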



Is snapshotting the equivalent of using copy() in python's hashlib? I did something like below but only got a minor speed increase --

  import hashlib

  stem_sha = hashlib.new('sha1')
  stem_sha.update(b'ruby ruby ruby... rails rails ')
  for suffix in random_char_set:  # each a random 5-char bytes phrase
      test_sha = stem_sha.copy()
      test_sha.update(suffix)
      ...


From the hashlib documentation, yes, it appears so.

Note, though, that SHA1 works in 512-bit (64-byte) blocks, so 'snapshotting' after feeding it (for example) 63 bytes hasn't saved any hard work -- the routine was only buffering those 63 bytes, waiting for byte #64 (or the digest-finish operation) to do the real crunching.

And, the tiny EngineYard strings don't offer a giant opportunity for precalculation savings. The usual pattern for digesting is:

  1 x new()
  N x update()
  1 x digest()
The copy() lets you snapshot state after M update()s (M < N) to speed future calculations. Obviously if M is large -- there is a long shared prefix -- there can be big savings. But with the EngineYard strings, M should never be more than 1 block/update(). (If it is, your 10-word preamble was wastefully long.)

So the potential savings are capped relative to the cost of the copy() and the (unavoidable) second update() and digest(). If the second update() and final digest() are roughly as costly as the first update() and original new(), and if the copy() is nearly costless, I suppose the speedup could at best approach 2x.
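A small sketch of why the saving is capped when the shared prefix fits in one block (the 30-byte preamble and the 5-byte suffixes here are made-up stand-ins for the EngineYard-style strings, not values from the contest):

```python
import hashlib

PREFIX = b'x' * 30                                # under one 64-byte block
SUFFIXES = [bytes([i]) * 5 for i in range(256)]   # stand-in 5-byte suffixes

def from_scratch():
    # N full hashes: prefix is re-fed (re-buffered) every time.
    return [hashlib.sha1(PREFIX + s).digest() for s in SUFFIXES]

def with_copy():
    # Prefix buffered once; each iteration copies state, then hashes
    # only the suffix. With a sub-block prefix, the final compression
    # still runs per suffix, so at most ~half the work is avoided.
    stem = hashlib.sha1(PREFIX)
    out = []
    for s in SUFFIXES:
        h = stem.copy()
        h.update(s)
        out.append(h.digest())
    return out

assert from_scratch() == with_copy()
```

Both paths produce identical digests; the copy() route only skips re-feeding the short prefix, which is why the best case hovers near 2x rather than anything dramatic.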

But also: I suspect an optimized C SHA1 operation on tiny amounts of data is going to scream compared to your Python loop and calling overheads -- so any improvement from your copy() reuse may be noise compared to what's taking up most of your loop time.



