- "And something like zobrist hashing for looking up a given position"
Which gets particularly good if you combine it with some hack like a Bloom filter. There's a trick, I don't remember if it has a name, where you can effect a time/space tradeoff by partitioning a search set and building a separate Bloom filter for each partition. A lookup is a constant-time query of each Bloom filter, plus a linear scan of each partition that signaled a match.
(This is a reasonable thing to do, because (0) the size of chess databases can get extremely large and (1) the queries are human UX interactions that need to be responsive).
I mean, chess engines use transposition tables to store encountered positions. In this case you can use the Zobrist hash as the index into a hash table, giving an O(1) lookup for a given position.
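To make this concrete, here's a minimal sketch of Zobrist hashing feeding a dict-based transposition table. The piece/square encoding and the stored entry fields are illustrative assumptions, and the side-to-move and castling keys a real engine would XOR in are omitted for brevity.

```python
import random

random.seed(0)

# Zobrist table: one random 64-bit key per (piece, square) pair.
# Assumed encoding: 12 piece types (6 per side) x 64 squares.
ZOBRIST = [[random.getrandbits(64) for _ in range(64)] for _ in range(12)]

def zobrist_hash(position):
    """position: iterable of (piece_index, square_index) pairs."""
    h = 0
    for piece, square in position:
        h ^= ZOBRIST[piece][square]
    return h

# Transposition table: a plain dict keyed by the hash -> O(1) average lookup.
table = {}
pos = [(0, 8), (6, 48)]  # hypothetical: white pawn on a2, black pawn on a7
table[zobrist_hash(pos)] = {"eval": 0.0, "depth": 4}
```

Because the hash is a XOR of per-square keys, it's order-independent (transposed move orders reaching the same position hash identically), and it can be updated incrementally per move by XORing the affected keys in and out.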
Typically that's going to be the best way to index into full games, since you can reach an entry via a game-tree search (e.g., each node of the tree is a position whose hash indexes into a big transposition table).
You can probably use a Bloom filter to check whether a given position has ever been encountered (e.g., as a layer on top of the transposition table) though. For something like storing 10 billion positions with a 10^-6 false-positive probability (with a reasonable number of hash functions), you're looking at a 30+ GiB Bloom filter.
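That size estimate follows from the standard optimal-sizing formula for a Bloom filter, m = -n * ln(p) / (ln 2)^2 bits, with the optimal hash count k = (m/n) * ln 2:

```python
import math

n = 10_000_000_000  # positions stored
p = 1e-6            # target false-positive probability

# Optimal number of bits and hash functions for a Bloom filter.
m_bits = -n * math.log(p) / math.log(2) ** 2
k = round(m_bits / n * math.log(2))

gib = m_bits / 8 / 2**30
print(f"{gib:.1f} GiB, k = {k} hash functions")
```

This works out to roughly 33.5 GiB at about 20 hash functions, which matches the 30+ GiB figure above.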
Right! The hash table you're describing is O(1) lookup, but its memory size is a constant multiple larger than the Bloom filter. A Bloom filter doesn't tell you anything about where the needle is, only whether it exists; so, if used for lookup, it reduces to O(n) brute-force search. In between the two, you can create an array of M partitions of the haystack, each with its own Bloom filter: this structure has something like O(M + n/M) lookup, but with the smaller memory size of the Bloom filter.
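A minimal sketch of that in-between structure, with an assumed toy Bloom filter (SHA-256 slices as the k hash functions) and arbitrary string keys standing in for positions:

```python
import hashlib

class TinyBloom:
    """Minimal Bloom filter: k hash functions over an m-bit array."""
    def __init__(self, m_bits=1 << 16, k=4):
        self.m, self.k, self.bits = m_bits, k, bytearray(m_bits // 8)

    def _hashes(self, item):
        digest = hashlib.sha256(item.encode()).digest()
        for i in range(self.k):
            yield int.from_bytes(digest[4 * i:4 * i + 4], "big") % self.m

    def add(self, item):
        for h in self._hashes(item):
            self.bits[h // 8] |= 1 << (h % 8)

    def __contains__(self, item):
        return all(self.bits[h // 8] & (1 << (h % 8)) for h in self._hashes(item))

class PartitionedIndex:
    """M partitions, each with its own Bloom filter. A lookup is M
    constant-time filter queries, plus a linear scan of only the
    partitions whose filter signaled a possible match."""
    def __init__(self, m_partitions=16):
        self.partitions = [[] for _ in range(m_partitions)]
        self.filters = [TinyBloom() for _ in range(m_partitions)]

    def add(self, item):
        i = hash(item) % len(self.partitions)  # any stable assignment works
        self.partitions[i].append(item)
        self.filters[i].add(item)

    def lookup(self, item):
        for part, filt in zip(self.partitions, self.filters):
            if item in filt and item in part:  # scan only on a filter hit
                return item
        return None

idx = PartitionedIndex()
for key in ("pos-A", "pos-B", "pos-C"):  # hypothetical position keys
    idx.add(key)
```

A filter false positive only costs a wasted partition scan, never a wrong answer, since the scan itself is exact; M tunes the time/space tradeoff between the two extremes.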
When I implemented this, I deliberately avoided constructing the hash table you're describing, because (to my recollection) it was larger than my remaining disk space! :)