> What data structure has a more efficient access pattern?
That depends on what kind of access you will need.
For example, if you are going to be doing a lot of searching to determine whether a value is an element of a set, an array will take O(n) in the number of elements to do a linear search, while various self-balancing tree structures will be O(log n), and a hash table with a good choice of hash function will be even better. Unless you are working with very small data sets, arrays will rapidly lose out here.
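To make that concrete, here is a minimal Python sketch of the membership test; the data and sizes are invented, and since the standard library has no self-balancing tree it only contrasts the plain array (list, linear scan) with the hash table (set):

    # Membership test: unsorted array vs. hash table (toy data for illustration).
    import random

    data = random.sample(range(1_000_000), 10_000)
    as_list = list(data)   # "in" does an O(n) linear scan
    as_set = set(data)     # "in" is a hash lookup, roughly O(1)

    target = data[1234]
    print(target in as_list, target in as_set)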
If we're talking about a data structure that gets modified over time, then suppose you need to store a sequence of values in order, and you will often insert or remove elements in the middle of the sequence. A linked list or similar structure can perform these operations for O(1) cost, but an array is O(n) in the number of elements.
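A toy sketch of that difference (Python; the Node class and insert_after helper are just illustrative names, not anything from the thread):

    # Inserting in the middle: linked list splice vs. array insert.
    class Node:
        def __init__(self, value, nxt=None):
            self.value = value
            self.next = nxt

    def insert_after(node, value):
        # O(1): rewire one pointer, no shifting of later elements
        node.next = Node(value, node.next)

    arr = list(range(10))
    arr.insert(5, 99)        # O(n): everything after index 5 shifts right

    head = Node(0)
    insert_after(head, 99)   # O(1) once you already hold the node before the gap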
Of course, all data structures are ultimately kept in one big array of memory/drive, so you could argue that the array can be just as good as the other data structures in the examples above. However, I think that is a circular argument: you're just reinventing the alternative data structures.
> People default to arrays because 95% of the time they are either the best option or good enough that it's not worth worrying about.
Perhaps. Then again, perhaps this attitude of declaring the default, easy tool "good enough" instead of learning to use the right tool for the job is why in 2011 it still takes nearly a minute for my PC to boot in the morning and sometimes I still type faster than my state-of-the-art DTP software can manage to draw the letters on the screen.
You're making some major assumptions. Big O is great when dealing with large data sets but beware the K.
Deleting random elements from a linked list is an O(n) operation because you need to find the element(s). They also waste a fair amount of memory.
Determining whether a value is an element of a set is an O(log n) operation on an ordered array. And faster than a self-balancing tree because of cache locality issues.
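As a minimal sketch of that (Python, standard bisect module, toy data):

    # Membership test on an ordered array via binary search, O(log n).
    import bisect

    ordered = sorted([7, 3, 11, 42, 5, 19])

    def contains(sorted_arr, value):
        i = bisect.bisect_left(sorted_arr, value)
        return i < len(sorted_arr) and sorted_arr[i] == value

    print(contains(ordered, 11), contains(ordered, 4))   # True False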
Small Hash tables are slower than arrays because of that hash function. (I love Hash Tables but they make a poor default choice.) Edit: Set operations were not what I would call an "access pattern," but it's a valid point.
PS: Buffers and Bitmaps are both well suited to an array. You can't use a self-balancing tree in a hard real time system. And so it goes, when arrays don't suit your needs there are great options, but they are less common than you might think.
> Big O is great when dealing with large data sets but beware the K.
Sure, you always have to be careful of the constant factor in reality. That's why I noted that arrays will typically lose out to alternatives like trees and hash tables unless you are working with very small numbers of elements.
Still, logarithmic beats linear very quickly as the amount of data increases. Consider that 2^10=1024, so by the time you have 1,000 or so elements to consider, you need a k that is about two orders of magnitude better for a linear scan to beat a binary tree.
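The arithmetic, spelled out as a throwaway Python snippet (sizes are illustrative):

    # Rough comparison counts behind the "two orders of magnitude" remark.
    import math

    n = 1000
    binary_steps = math.ceil(math.log2(n))   # ~10 comparisons for a binary search
    linear_steps = n // 2                    # ~500 for an average linear scan

    print(binary_steps, linear_steps, linear_steps / binary_steps)
    # -> 10 500 50.0, so the scan's per-step cost must be roughly 50-100x
    #    smaller (average vs. worst case) just to break even at this size.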
For hash tables, it mostly comes down to how long it takes to compute the hash function, which in general should itself be a constant cost.
> Deleting random elements from a linked list is an O(n) operation because you need to find the element(s).
Why would you want to delete a random element? You're moving the goalposts.
If you really did need that functionality, then a simple linked list wouldn't be the data structure you wanted. Use the right tool for the job instead.
> They also waste a fair amount of memory.
They "waste" a small number of pointer-sized chunks for each element. If you have so much data that this might matter, you're probably running on hardware with silly specifications so it doesn't and/or using a different approach to scaling anyway.
> Determining whether a value is an element of a set is an O(log n) operation on an ordered array.
Sure it is, after you've done your O(n log n) sort. You're moving the goalposts again.
> And faster than a self-balancing tree because of cache locality issues.
Cache locality is certainly a valid concern in some applications, particularly with recent generations of PC processors/memory architectures. Once again, you should pick the right tool for the job.
In this case, the basic problem is that fetch time dominates comparison time. This is something database engineers have been learning to deal with since forever, and block-aware data structures like B-trees have been known for 40 years.
> Small Hash tables are slower than arrays because of that hash function.
Are you sure? Consider this: if you wanted to sort an array of 1,000,000 numbers between 1 and 100 as fast as possible, how would you do it?
But yes, if you really are within the constant factor required for the hash function, then a hash table is probably not a good choice.
> And so it goes, when arrays don't suit your needs there are great options, but they are less common than you might think.
Don't tell that to anyone who works with list-based programming languages!
As I said, there are plenty of cases where arrays suck.
However, 1,000,000-element arrays, while not small, are generally sorted so fast in memory that O(n log n) is fine. I could write a custom hash function that stores the counts in an array ;), but it's not something I want to maintain, so numbers outside the expected range would fall back to being sorted.
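Storing counts indexed by value is essentially a counting sort; a minimal sketch of that idea, assuming the values really are in the range 1 to 100 (the variable names and sizes are made up):

    # Counting sort sketch: 1,000,000 values known to lie in 1..100.
    import random

    values = [random.randint(1, 100) for _ in range(1_000_000)]

    counts = [0] * 101                 # the value itself is the "hash": an index
    for v in values:
        counts[v] += 1

    sorted_values = []
    for v, c in enumerate(counts):
        sorted_values.extend([v] * c)  # O(n + k) total, no comparisons at all

    assert sorted_values == sorted(values)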
"Sure it is, after you've done your O(nlogn) sort." Creating a binary search tree is also an O(n log n) operation...
Binary search trees win when you want a constantly updated list that you also search, but that's a narrower case than just looking for elements in sets. So for example: if you start with an unordered list and want to look up 50 elements to see which ones are in it, you are better off with a linear search through an array, unless you can sort the list while waiting for those 50 elements.
PS: I am bending over backward to support that 95%, but once you account for all the constants, it's surprising how often an array works just as well as or better than the "optimal" data structure. Generally speaking you want large data sets in a database somewhere; locally you rarely deal with large data sets, and when you do it's often a temporary thing or a lookup table.