Direct Access Arrays
- Most operations in a computer only allow for constant logical branching, like if statements
- But one operation allows for non-constant branching factor, specifically the ability to randomly access any memory address in constant time
- This operation allows an algorithm's decision tree to branch with large branching factor, as large as there is space in your computer
- To exploit this operation, we define a data structure called a direct access array, which is a normal static array that associates a semantic meaning with each array index location: any item $x$ with key $k$ will be stored at array index $k$
--
- Suppose we want to store a set of $n$ items, each with a unique integer key in the bounded range 0 to $u-1$ (some large number)
- We can store the items in a direct access array of length $u$
- Order operations will be slow: there is no guarantee on where the first, last, or next element is, so we may have to spend $O(u)$ time scanning the array
- To find an item having integer key $i$, a search algorithm can simply look in array slot $i$, worst-case constant time
- The cost is storage space: a direct access array must have a slot available for every possible key in range, and when $u$ is very large compared to the number of items stored, this can be wasteful or impossible
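The trade-offs above can be made concrete with a minimal Python sketch (the `Item` record and method names are illustrative, not prescribed by the notes): find, insert, and delete are $O(1)$ array accesses, while order operations must scan up to $u$ slots.

```python
from collections import namedtuple

# Hypothetical item record: anything with an integer .key in [0, u) works.
Item = namedtuple("Item", ["key", "val"])

class DirectAccessArray:
    """Static array of length u; the item with key k lives at slot k."""

    def __init__(self, u):
        self.A = [None] * u          # O(u) space, even if few items are stored

    def find(self, k):               # worst-case O(1): look in slot k
        return self.A[k]

    def insert(self, x):             # worst-case O(1); assumes 0 <= x.key < u
        self.A[x.key] = x

    def delete(self, k):             # worst-case O(1)
        self.A[k] = None

    def find_min(self):              # O(u): must scan for the first occupied slot
        for x in self.A:
            if x is not None:
                return x
        return None
```

Usage: `DirectAccessArray(100).insert(Item(42, "x"))` stores the item at index 42, so a later `find(42)` is a single array read.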
Hashing
- Can we get the performance benefits of a direct access array while using $O(n)$ space when $n \ll u$?
- Smaller dynamic direct access array with $m = O(n)$ slots (instead of $u$), called a hash table
- Hash function/hash map: maps item keys to different slots of the direct access array, $h(k): \{0, ..., u-1\} \rightarrow \{0,...,m-1\}$, where $h(k)$ is the hash of integer key $k$
- If the hash function happens to be injective over the $n$ keys we're storing (no two keys map to the same direct access array index), then we can support worst-case constant time search, as the hash table simply acts as a direct access array over the smaller domain $m$
- But by pigeonhole principle, if the space of possible keys is larger than the number of array indices, i.e. $m < u$, then any hash function mapping $u$ possible keys to $m$ indices must map multiple keys to the same array index
- If $h(k_1) = h(k_2)$, then the hashes of $k_1$ and $k_2$ collide
- When collisions occur, we can either store the colliding items elsewhere in the same direct access array (open addressing), or store them in a separate data structure (chaining)
Chaining
- Colliding keys are stored separately from the original hash table
- Each hash table index holds a pointer to a chain, a separate data structure that supports the dynamic set interface, often a linked list or dynamic array (as long as find, insert, and delete take no more than linear time)
- Insert $x$ into the chain at index $h(x.key)$
- Find or delete $k$ from the chain at index $h(k)$
- If our chains only hold a constant number of items, then dynamic set operations will run in (amortized) constant time
- Simple Uniform Hashing Assumption (SUHA): A random key will be equally likely to hash to any of the $m$ buckets $\{0,...,m-1\}$, independently of any other keys
- Under SUHA, all of the set operations take $O(1+\alpha)$ expected time, where $\alpha = n/m$ is the load factor of the hash table (number of elements divided by number of buckets)
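A minimal sketch of chaining in Python, using plain lists as chains and the division method for $h$ (both choices are assumptions for illustration); each operation walks one chain, giving the $O(1+\alpha)$ expected bound under SUHA:

```python
class ChainedHashTable:
    """Hash table resolving collisions by chaining: one list per slot."""

    def __init__(self, m=8):
        self.m = m
        self.chains = [[] for _ in range(m)]   # chain i holds all keys with h(k) = i

    def _h(self, k):
        return k % self.m                      # division-method hash (illustrative)

    def insert(self, k, v):                    # O(1 + alpha) expected under SUHA
        chain = self.chains[self._h(k)]
        for i, (key, _) in enumerate(chain):
            if key == k:
                chain[i] = (k, v)              # key already present: overwrite
                return
        chain.append((k, v))

    def find(self, k):                         # scan only the chain at h(k)
        for key, v in self.chains[self._h(k)]:
            if key == k:
                return v
        return None

    def delete(self, k):
        chain = self.chains[self._h(k)]
        for i, (key, _) in enumerate(chain):
            if key == k:
                chain.pop(i)
                return
```

With $m = 4$, keys 1 and 5 collide (both hash to slot 1) and simply share a chain, so both remain findable.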
Open Addressing
- Different open addressing schemes are defined by different probing strategies, which dictate which slot to try next if the particular slot being looked at is occupied with a different key
- For a given key $k$, we first try to store $k$ in $h(k,0)$. If that's occupied, try $h(k,1)$, then $h(k,2)$, and so on
- When deleting keys, use lazy deletion: replace the deleted item with a special tombstone value that is treated as "occupied" when searching, but as "empty" when inserting. Otherwise, removing a key could break probe sequences and prevent existing keys from being found.
- Assuming $h$ can be evaluated in $O(1)$ time, open addressing has similar performance to chaining, with some efficiencies for small load factors
Probing Schemes
Linear Probing
- $h(k,i) = (h'(k) + i) \mod m$, where $h'$ is a regular hash function
- Doesn't satisfy the uniform hashing assumption: if two keys are mapped to the same initial slot by $h'$, they'll follow the same probing sequence, so there are $m$ possible permutations instead of the required $m!$
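A sketch of open addressing with linear probing and tombstones in Python (the slot layout and $h'(k) = k \bmod m$ are illustrative assumptions); note how `find` may probe past a tombstone, while `insert` may reuse one:

```python
_DELETED = object()  # tombstone: "occupied" for search, "empty" for insert

class LinearProbingTable:
    """Open addressing with the linear probe h(k, i) = (h'(k) + i) mod m."""

    def __init__(self, m=8):
        self.m = m
        self.slots = [None] * m        # each slot: None, _DELETED, or (k, v)

    def _probe(self, k, i):
        return (k + i) % self.m        # h'(k) = k mod m for simplicity

    def find(self, k):
        for i in range(self.m):
            s = self.slots[self._probe(k, i)]
            if s is None:              # truly empty: k cannot be further along
                return None
            if s is not _DELETED and s[0] == k:
                return s[1]
        return None

    def insert(self, k, v):
        slot = None
        for i in range(self.m):
            j = self._probe(k, i)
            s = self.slots[j]
            if s is None:              # end of probe sequence: k is absent
                if slot is None:
                    slot = j
                break
            if s is _DELETED:
                if slot is None:
                    slot = j           # remember first reusable tombstone
            elif s[0] == k:
                slot = j               # key already present: overwrite in place
                break
        if slot is None:
            raise RuntimeError("table full")
        self.slots[slot] = (k, v)

    def delete(self, k):
        for i in range(self.m):
            j = self._probe(k, i)
            s = self.slots[j]
            if s is None:
                return
            if s is not _DELETED and s[0] == k:
                self.slots[j] = _DELETED   # lazy deletion, not None
                return
```

If `delete` wrote `None` instead of the tombstone, a key stored further along the same probe sequence would become unreachable.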
Double Hashing
- $h(k,i) = (h_1(k) + i \cdot h_2(k)) \mod m$, where $h_1$ and $h_2$ are two different hash functions
- Comes closer to the uniform hashing assumption than linear probing: the step size of the probe sequence depends on the key, so keys with the same initial slot generally follow different sequences
Hash Functions
- Division Method: $h(k) = (k \mod m)$
- Ideally, the performance of our data structure is independent of the keys we store
- One heuristic is to choose $m$ to be a large prime
- Assume SUHA holds: if the size of the hash table is $\Omega(n)$, then the average size of any chain will be $1 + (n-1)/\Omega(n) = O(1)$, thus enabling dynamic set operations in average-case constant time
- Note that to maintain $m = O(n)$, insertion and deletion may require us to rebuild the direct access array to a different size, choose a new hash function, and reinsert all the items. This can be done the same way as in dynamic arrays, leading to amortized bounds for dynamic operations
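The rebuild strategy can be sketched as follows (the doubling policy, triggered when $\alpha$ would exceed 1, is an assumption modeled on dynamic array resizing; the sketch also assumes distinct keys). Rebuilding costs $O(n + m)$ but happens rarely enough that insertion is amortized $O(1)$ expected:

```python
class ResizingChainedTable:
    """Chaining table that rebuilds itself to keep alpha = n/m = O(1)."""

    def __init__(self):
        self.m = 2
        self.n = 0
        self.chains = [[] for _ in range(self.m)]

    def _h(self, k):
        return k % self.m                    # division method; changes when m does

    def _rebuild(self, new_m):               # O(n + m): rehash and reinsert everything
        old = self.chains
        self.m = new_m
        self.chains = [[] for _ in range(new_m)]
        for chain in old:
            for k, v in chain:
                self.chains[self._h(k)].append((k, v))

    def insert(self, k, v):                  # assumes k not already present
        if self.n + 1 > self.m:              # load factor would exceed 1: double m
            self._rebuild(2 * self.m)
        self.chains[self._h(k)].append((k, v))
        self.n += 1

    def find(self, k):
        for key, v in self.chains[self._h(k)]:
            if key == k:
                return v
        return None
```

After a sequence of inserts the invariant $n \le m < 4n$ holds, so chains stay short on average and the $O(1+\alpha)$ bound is preserved.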
Exercises