Hashing
₪ Can't use hashing if order is important like "find key just smaller than given key"
₪ Requirements for a hash function => H(key) = value
→ Deterministic - no matter how many times you pass the same key => the value must always be the same
→ If M is the size of the hash table => the value from the hash function must lie between 0 and M-1
→ Values should be uniformly distributed for fewer collisions
COMPRESSION MAP => H(key) = key % M (M is prime number close to table size)
POLYNOMIAL ACCUMULATION => H(string) = [(s[0]*x^0) + (s[1]*x^1) + ....] % M
UNIVERSAL HASHING - choose a hash function uniformly at random from a designed family of functions → no single input is bad for every choice of function
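The compression map and polynomial accumulation above can be sketched as follows (M = 101 and x = 31 are illustrative choices, not fixed by the notes):

```python
M = 101  # a prime close to the desired table size

def compression_map(key: int) -> int:
    # H(key) = key % M -> always lies in [0, M-1]
    return key % M

def polynomial_hash(s: str, x: int = 31) -> int:
    # H(string) = (s[0]*x^0 + s[1]*x^1 + ...) % M
    h, power = 0, 1
    for ch in s:
        h = (h + ord(ch) * power) % M
        power = (power * x) % M
    return h
```

Both satisfy the requirements listed above: deterministic, and the result is always in [0, M-1].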
- Direct Addressing --> index array, exactly like the count array in counting sort
- Create array of size r (size of universe of keys) -> in O(1) directly insert, delete or search => bcz it's array -> access that point directly
- If keys are decimal or strings - convert them to integers first using some function
- DEMERITS - Space wastage + cannot create very large array
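A minimal direct-address table, assuming integer keys already lie in [0, r):

```python
class DirectAddressTable:
    def __init__(self, r: int):
        # one slot per key in the universe -> O(r) space, wasted if few keys used
        self.slots = [None] * r

    def insert(self, key: int, value):
        self.slots[key] = value      # O(1): index directly

    def delete(self, key: int):
        self.slots[key] = None       # O(1)

    def search(self, key: int):
        return self.slots[key]       # O(1), None if absent
```

The O(1) bounds come purely from array indexing; the demerit is the `[None] * r` allocation for the whole universe.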
- Chaining --> like bucket sort
- Create array of linked list for respective value
- Best case - O(1) but Worst case - O(n) (all keys land in one chain)
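A chaining sketch, using Python lists as the linked lists (`key % m` stands in for the hash function):

```python
class ChainedHashTable:
    def __init__(self, m: int):
        self.m = m
        self.buckets = [[] for _ in range(m)]  # array of chains

    def _h(self, key: int) -> int:
        return key % self.m

    def insert(self, key: int):
        self.buckets[self._h(key)].append(key)   # O(1) append

    def search(self, key: int) -> bool:
        return key in self.buckets[self._h(key)]  # O(chain length)

    def delete(self, key: int):
        bucket = self.buckets[self._h(key)]
        if key in bucket:
            bucket.remove(key)
```

Search and delete cost O(length of the chain), which is where the O(1) best case and O(n) worst case come from.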
- Open Addressing - number of elements <= size of table
- Linear Probing - Iterate linearly
- Initialize array of size M with some value
- Insert(k) - Start from h(k); on seeing an empty / deleted slot => insert k there
- Search(k) - Start from h(k); stop iterating at an empty slot / on finding k / on returning to the starting point
[NOTE - a deleted slot does NOT stop the search]
- Best case - O(1) but Worst case - O(n) with heavy clustering
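A linear probing sketch with distinct EMPTY / DELETED sentinels, matching the rules above (search skips deleted slots but stops at empty ones):

```python
EMPTY, DELETED = object(), object()  # sentinel markers

class LinearProbingTable:
    def __init__(self, m: int):
        self.m = m
        self.slots = [EMPTY] * m

    def _h(self, key: int) -> int:
        return key % self.m

    def insert(self, key: int):
        for i in range(self.m):                  # probe h(k), h(k)+1, ...
            j = (self._h(key) + i) % self.m
            if self.slots[j] is EMPTY or self.slots[j] is DELETED:
                self.slots[j] = key
                return
        raise RuntimeError("table full")         # n must stay <= m

    def search(self, key: int) -> bool:
        for i in range(self.m):
            j = (self._h(key) + i) % self.m
            if self.slots[j] is EMPTY:           # empty stops the search
                return False                     # (DELETED does NOT)
            if self.slots[j] == key:
                return True
        return False                             # wrapped back to the start

    def delete(self, key: int):
        for i in range(self.m):
            j = (self._h(key) + i) % self.m
            if self.slots[j] is EMPTY:
                return
            if self.slots[j] == key:
                self.slots[j] = DELETED          # tombstone, not EMPTY
                return
```

Marking deletions with a tombstone instead of EMPTY is what keeps later searches from stopping too early.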
- Quadratic Probing - Iterate in quadratic fashion
- Everything same as above
- Insert(k) - Start from h(k)+0 → h(k)+1 → h(k)+4 → h(k)+9 ......
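The quadratic probe sequence can be written as a small generator (the table logic is the same as linear probing, only the step changes):

```python
def quadratic_probe_sequence(h_k: int, m: int):
    # yields h(k)+0, h(k)+1, h(k)+4, h(k)+9, ... (mod m)
    for i in range(m):
        yield (h_k + i * i) % m
```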
- Double Hashing - 2 functions => h1 & h2
- Everything same as above
- Insert(k) - Start from h1(k) => go to [ h1(k) + h2(k) ]
=> go to [ h1(k) + 2*h2(k) ] => go to [ h1(k) + 3*h2(k) ] ....
- Can also view it as the pair { h1(k), h2(k) } defining the probe sequence
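A double-hashing probe sequence as a generator; the choices h1(k) = k % m and h2(k) = 1 + (k % (m-1)) are illustrative, picked so that h2 is never 0 (a zero step would probe the same slot forever):

```python
def double_hash_sequence(k: int, m: int):
    # probes h1(k), h1(k)+h2(k), h1(k)+2*h2(k), ... (mod m)
    h1 = k % m
    h2 = 1 + (k % (m - 1))   # in [1, m-1], never 0
    for i in range(m):
        yield (h1 + i * h2) % m
```

When m is prime, gcd(h2, m) = 1, so the sequence visits every slot exactly once.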
***********************************************************************************
Best options are chaining or double hashing
- Chaining - No need to resize hash tables
- less wastage of space as very few empty array values compared to linear probing
- Tradeoff → search linked list V/S {resize array and choice of 2nd hash function}
- more filled slots ≈ more collision → load factor = how much is already filled in table
- m = table size & n= number of stored elements → load factor α = n / m
- more load factor → high chance of collisions
- Modern hash tables resize automatically once the load factor crosses a certain threshold
- Chaining → assuming uniform distribution → if key is not present in table → equally likely to hash to any of m slots (size of array = size of table = m)
- unsuccessful search → no. of elements searched = length of list = n/m = α
- total time required = Θ(1+α) → 1+α bcz time for computing h(k) included
- length = α bcz n elements uniformly distributed over array
- for successful search → no. of elements probed for key "x" = 1 + no. of elements before "x" in that list → the 1 is for the comparison with "x" itself
Xij indicates whether the iᵗʰ and jᵗʰ elements collide or not [Bernoulli distribution]
- Double Hashing → (1−α) is the fraction of empty slots [probability of a slot being empty]
- NOTE - each probe/iteration looks at random slot of table generated from h2(k)
- expected no. of probes until the first empty slot is found {geometric distribution} = 1/(1−α)
- same as insertion cost bcz for insertion we find an empty slot
- called unsuccessful search (means that key is not in table) bcz in open addressing we probe slots until hit empty or return to starting point
- When inserting the (i+1)ᵗʰ element, load = i/m → if we search for that element right after insertion
→ no. of probes to find it = 1/(1−load) = m/(m−i) {same as its insertion cost}
Averaging over all n keys in the table → gives the expected no. of probes for a successful search
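The cost formulas above, as small functions with worked numbers (the α values passed in are illustrative):

```python
def chaining_unsuccessful(alpha: float) -> float:
    # Θ(1 + α): 1 for computing h(k), α expected chain length
    return 1 + alpha

def open_addr_unsuccessful(alpha: float) -> float:
    # expected probes to hit an empty slot = 1 / (1 - α)
    return 1 / (1 - alpha)

def open_addr_successful(n: int, m: int) -> float:
    # average of the insertion costs m/(m-i) over all n inserted keys
    return sum(m / (m - i) for i in range(n)) / n
```

For example, at α = 0.5 chaining expects 1.5 comparisons and open addressing expects 2 probes for an unsuccessful search.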
String Hashing detailed
- Compares strings in O(1) time with high probability => treat it as an alternative comparison method
POLYNOMIAL ROLLING HASH TECHNIQUE
- H(string) = [(s[0]*x^0) + (s[1]*x^1) + ....] % M
- M must be a prime of order 1e9 (greater M -> larger range of values -> fewer collisions)
- But don't take M of order 1e18, bcz products of two values below 1e9 still fit in a 64-bit integer, while products of 1e18-sized values overflow
- x must be greater than maximum value of s[i]
- for 'a' to 'z' => transform it into [1 to 26] → x > 26
- If we add character 'a' at the end => H(S+a) = (H(S) + a*x^n) % M, where n = length of S
- That's why called rolling bcz once started from a character, it keeps adding character in O(1) → just like anything keeps rolling once started motion
- If add character 'a' at front => H(a+S) = (a + x*H(S)) % M
- For a range, H(S[i.....j]) = ( [H(S till jᵗʰ index) - H(S till (i-1)ᵗʰ index)] * x^(-i) ) % M, where x^(-i) is the modular inverse of xⁱ
- M must be prime else cannot find inverse modulo
- a false positive is when H(s1) = H(s2) but s1 != s2 → probability = 1 / M
- H(s1) can be anything (no restriction) but H(s2) has probability 1/M of matching H(s1) → that's why larger M is always better
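The whole scheme can be sketched with prefix hashes, so that any H(S[i..j]) comes out in O(1) after O(n) preprocessing (M = 1e9+7, x = 31, and the 'a'..'z' → 1..26 mapping follow the notes above):

```python
M = 10**9 + 7
X = 31  # > 26, the maximum character value

def val(ch: str) -> int:
    return ord(ch) - ord('a') + 1   # 'a' -> 1 ... 'z' -> 26

def prefix_hashes(s: str) -> list:
    # h[i] = H(s[0..i-1]) = sum of val(s[k]) * X^k, mod M
    h = [0] * (len(s) + 1)
    p = 1
    for i, ch in enumerate(s):
        h[i + 1] = (h[i] + val(ch) * p) % M
        p = p * X % M
    return h

def substring_hash(h: list, i: int, j: int) -> int:
    # H(s[i..j]) = (h[j+1] - h[i]) * X^(-i) mod M;
    # pow(X, -i, M) is the modular inverse of X^i (M prime, so it exists)
    return (h[j + 1] - h[i]) * pow(X, -i, M) % M
```

Multiplying by the inverse normalizes every substring to start at power x^0, so equal substrings anywhere in the string get equal hashes. `pow` with a negative exponent and a modulus needs Python 3.8+.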