> What data structure has a more efficient access pattern?
That depends on what kind of access you will need.
For example, if you are going to be doing a lot of searching to determine whether a value is an element of a set, an array will take O(n) in the number of elements to do a linear search, while various self-balancing tree structures will be O(log n), and a hash table with a good choice of hash function will be even better. Unless you are working with very small data sets, arrays will rapidly lose out here.
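To make that concrete, here is a minimal Python sketch of the membership test; the data and sizes are invented, and since the standard library has no self-balancing tree it only contrasts the plain array (list, linear scan) with the hash table (set):

    # Membership test: unsorted array vs. hash table (toy data for illustration).
    import random

    data = random.sample(range(1_000_000), 10_000)
    as_list = list(data)   # "in" does an O(n) linear scan
    as_set = set(data)     # "in" is a hash lookup, roughly O(1)

    target = data[1234]
    print(target in as_list, target in as_set)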
If we're talking about a data structure that gets modified over time, then suppose you need to store a sequence of values in order, and you will often insert or remove elements in the middle of the sequence. A linked list or similar structure can perform these operations for O(1) cost, but an array is O(n) in the number of elements.
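A toy sketch of that difference (Python; the Node class and insert_after helper are just illustrative names, not anything from the thread):

    # Inserting in the middle: linked list splice vs. array insert.
    class Node:
        def __init__(self, value, nxt=None):
            self.value = value
            self.next = nxt

    def insert_after(node, value):
        # O(1): rewire one pointer, no shifting of later elements
        node.next = Node(value, node.next)

    arr = list(range(10))
    arr.insert(5, 99)        # O(n): everything after index 5 shifts right

    head = Node(0)
    insert_after(head, 99)   # O(1) once you already hold the node before the gap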
Of course, all data structures are ultimately kept in one big array of memory/drive, so you could argue that the array can be just as good as the other data structures in the examples above. However, I think that is a circular argument: you're just reinventing the alternative data structures.
> People default to arrays because 95% of the time they are either the best option or good enough that it's not worth worrying about.
Perhaps. Then again, perhaps this attitude of declaring the default, easy tool "good enough" instead of learning to use the right tool for the job is why in 2011 it still takes nearly a minute for my PC to boot in the morning and sometimes I still type faster than my state-of-the-art DTP software can manage to draw the letters on the screen.
You're making some major assumptions. Big O is great when dealing with large data sets but beware the K.
Deleting random elements from a linked list is an O(n) operation because you need to find the element(s). They also waste a fair amount of memory.
Determining whether a value is an element of a set is an O(log n) operation on an ordered array. And faster than a self-balancing tree because of cache locality issues.
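As a minimal sketch of that (Python, standard bisect module, toy data):

    # Membership test on an ordered array via binary search, O(log n).
    import bisect

    ordered = sorted([7, 3, 11, 42, 5, 19])

    def contains(sorted_arr, value):
        i = bisect.bisect_left(sorted_arr, value)
        return i < len(sorted_arr) and sorted_arr[i] == value

    print(contains(ordered, 11), contains(ordered, 4))   # True False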
Small Hash tables are slower than arrays because of that hash function. (I love Hash Tables but they make a poor default choice.) Edit: Set operations were not what I would call an "access pattern," but it's a valid point.
PS: Buffers and Bitmaps are both well suited to an array. You can't use a self-balancing tree in a hard real time system. And so it goes, when arrays don't suit your needs there are great options, but they are less common than you might think.
> Big O is great when dealing with large data sets but beware the K.
Sure, you always have to be careful of the constant factor in reality. That's why I noted that arrays will typically lose out to alternatives like trees and hash tables unless you are working with very small numbers of elements.
Still, logarithmic beats linear very quickly as the amount of data increases. Consider that 2^10=1024, so by the time you have 1,000 or so elements to consider, you need a k that is about two orders of magnitude better for a linear scan to beat a binary tree.
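The arithmetic, spelled out as a throwaway Python snippet (sizes are illustrative):

    # Rough comparison counts behind the "two orders of magnitude" remark.
    import math

    n = 1000
    binary_steps = math.ceil(math.log2(n))   # ~10 comparisons for a binary search
    linear_steps = n // 2                    # ~500 for an average linear scan

    print(binary_steps, linear_steps, linear_steps / binary_steps)
    # -> 10 500 50.0, so the scan's per-step cost must be roughly 50-100x
    #    smaller (average vs. worst case) just to break even at this size.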
For hash tables, it mostly comes down to how long it takes to compute the hash function, which in general should itself be a constant cost.
> Deleting random elements from a linked list is an O(n) operation because you need to find the element(s).
Why would you want to delete a random element? You're moving the goalposts.
If you really did need that functionality, then a simple linked list wouldn't be the data structure you wanted. Use the right tool for the job instead.
> They also waste a fair amount of memory.
They "waste" a small number of pointer-sized chunks for each element. If you have so much data that this might matter, you're probably running on hardware with silly specifications so it doesn't and/or using a different approach to scaling anyway.
> Determining whether a value is an element of a set is an O(log n) operation on an ordered array.
Sure it is, after you've done your O(n log n) sort. You're moving the goalposts again.
> And faster than a self-balancing tree because of cache locality issues.
Cache locality is certainly a valid concern in some applications, particularly with recent generations of PC processors/memory architectures. Once again, you should pick the right tool for the job.
In this case, the basic problem is that fetch time dominates comparison time. This is something database engineers have been learning to deal with since forever, and block-aware data structures like B-trees have been known for 40 years.
> Small Hash tables are slower than arrays because of that hash function.
Are you sure? Consider this: if you wanted to sort an array of 1,000,000 numbers between 1 and 100 as fast as possible, how would you do it?
But yes, if you really are within the constant factor required for the hash function, then a hash table is probably not a good choice.
> And so it goes, when arrays don't suit your needs there are great options, but they are less common than you might think.
Don't tell that to anyone who works with list-based programming languages!
As I said, there are plenty of cases where arrays suck.
However, 1,000,000-element arrays, while not small, are generally sorted so fast in memory that O(n log n) is fine. I could write a custom hash function that stores the counts in an array ;), but it's not something I want to maintain, so numbers outside the expected range would fall back to being sorted.
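Storing counts indexed by value is essentially a counting sort; a minimal sketch of that idea, assuming the values really are in the range 1 to 100 (the variable names and sizes are made up):

    # Counting sort sketch: 1,000,000 values known to lie in 1..100.
    import random

    values = [random.randint(1, 100) for _ in range(1_000_000)]

    counts = [0] * 101                 # the value itself is the "hash": an index
    for v in values:
        counts[v] += 1

    sorted_values = []
    for v, c in enumerate(counts):
        sorted_values.extend([v] * c)  # O(n + k) total, no comparisons at all

    assert sorted_values == sorted(values)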
"Sure it is, after you've done your O(nlogn) sort." Creating a binary search tree is also an O(n log n) operation...
Binary search trees win when you want a constantly updated list that you also search, but that's a narrower case than just looking for elements in sets. So for example: if you start with an unordered list and want to look up 50 elements to see which ones are in it, you are better off with a linear search through an array, unless you can sort the list while waiting for those 50 elements.
PS: I am bending over backward to support that 95%, but once you account for all the constants, it's surprising how often an array works just as well as or better than the "optimal" data structure. Generally speaking you want large data sets in a database somewhere; locally you rarely deal with large data sets, and when you do it's often a temporary thing or a lookup table.