Hashing
When we cannot use hashing - when order is important, e.g. finding the largest key just smaller than a given key.
COMPRESSION MAP => H(key) = key % M (M is a prime number close to the table size)
POLYNOMIAL ACCUMULATION => H(string) = [(s[0]*x^0) + (s[1]*x^1) + ....] % M
Requirements for hash function => H(key) = value
- No matter how many times you pass the same key => the value should always be the same
- If M is the size of the hash table => the value must lie between 0 and M-1
- Value should be uniformly distributed --> less collision
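The two hash functions named above can be sketched as follows; M = 10007 and x = 31 are illustrative choices (the notes only say M should be a prime near the table size and x should exceed the alphabet size):

```python
M = 10007  # illustrative: a prime close to the intended table size

def compression_map(key: int) -> int:
    """H(key) = key % M -- deterministic, and always in the range 0..M-1."""
    return key % M

def polynomial_accumulation(s: str, x: int = 31) -> int:
    """H(string) = (s[0]*x^0 + s[1]*x^1 + ...) % M."""
    h = 0
    power = 1  # x^i mod M, updated as we walk the string
    for ch in s:
        h = (h + ord(ch) * power) % M
        power = (power * x) % M
    return h
```

Note that both functions satisfy the requirements listed above: the same input always produces the same output, and the result is reduced mod M so it fits a table of size M.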
UNIVERSAL HASHING - choose the hash function uniformly at random from a universal family of functions
- Direct Addressing --> exactly counting sort
- Create an array of size r (size of the universe of keys) -> insert, delete, or search directly in O(1) => bcz it's an array -> we jump to that index directly
- If keys are decimal or strings - convert them to integers first using some function
- DEMERITS - Space wastage + cannot create very large array
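A minimal direct-addressing sketch, assuming integer keys in a small universe 0..r-1 (r = 100 is purely illustrative; the space-wastage demerit is visible in the `[False] * r` allocation):

```python
r = 100                 # size of the universe of keys (illustrative)
table = [False] * r     # one slot per possible key -> wasted space if few keys used

def insert(k: int) -> None:
    table[k] = True     # O(1): jump straight to index k

def delete(k: int) -> None:
    table[k] = False    # O(1)

def search(k: int) -> bool:
    return table[k]     # O(1)
```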
- Chaining --> like bucket sort
- Create array of linked list for respective value
- Best case - O(1) but Worst case - O(n)
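A chaining sketch, using Python lists to stand in for the linked lists (m = 11 is an illustrative table size):

```python
class ChainedHashTable:
    def __init__(self, m: int = 11):
        self.m = m
        self.table = [[] for _ in range(m)]  # one chain per slot

    def _h(self, k: int) -> int:
        return k % self.m

    def insert(self, k: int) -> None:
        chain = self.table[self._h(k)]
        if k not in chain:       # scan of the chain: O(1) best, O(n) worst
            chain.append(k)

    def search(self, k: int) -> bool:
        return k in self.table[self._h(k)]

    def delete(self, k: int) -> None:
        chain = self.table[self._h(k)]
        if k in chain:
            chain.remove(k)
```

The worst case O(n) happens when every key hashes to the same slot and one chain holds all n elements.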
- Open Addressing - number of elements <= size of table
- Linear Probing - Iterate linearly
- Initialize array of size M with some value
- Insert(k) - Start from h(k), if see empty / deleted slot => insert k there
- Search(k) - Start from h(k); stop iterating on an empty slot / on finding k / on wrapping back to the starting point
- Best case - O(1) but Worst case - O(n) if there is heavy clustering
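The linear-probing rules above (insert into empty or deleted slots; search stops at a truly empty slot) can be sketched like this, with sentinel objects marking EMPTY vs DELETED (the distinction matters: a DELETED slot must not stop a search, or keys inserted past it become unreachable):

```python
EMPTY, DELETED = object(), object()  # sentinels: never equal to any key

class LinearProbingTable:
    def __init__(self, m: int = 11):
        self.m = m
        self.slots = [EMPTY] * m  # initialize array of size M

    def _h(self, k: int) -> int:
        return k % self.m

    def insert(self, k: int) -> bool:
        i = self._h(k)
        for step in range(self.m):          # iterate linearly from h(k)
            j = (i + step) % self.m
            if self.slots[j] is EMPTY or self.slots[j] is DELETED:
                self.slots[j] = k
                return True
        return False                        # table full: needs n <= M

    def search(self, k: int) -> bool:
        i = self._h(k)
        for step in range(self.m):
            j = (i + step) % self.m
            if self.slots[j] is EMPTY:      # truly empty -> k is absent
                return False
            if self.slots[j] == k:
                return True
        return False                        # wrapped back to starting point

    def delete(self, k: int) -> None:
        i = self._h(k)
        for step in range(self.m):
            j = (i + step) % self.m
            if self.slots[j] is EMPTY:
                return
            if self.slots[j] == k:
                self.slots[j] = DELETED     # mark, don't empty
                return
```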
- Quadratic Probing - Iterate in quadratic fashion
- Everything same as above
- Insert(k) - Start from h(k)+0 → h(k)+1 → h(k)+4 → h(k)+9 ......
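The probe sequence h(k)+0 → h(k)+1 → h(k)+4 → h(k)+9 ... is just the squares as offsets, taken mod the table size; a small helper makes that concrete:

```python
def quadratic_probes(h: int, m: int, tries: int) -> list[int]:
    """Slots visited by quadratic probing: (h + i^2) % m for i = 0, 1, 2, ..."""
    return [(h + i * i) % m for i in range(tries)]
```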
- Double Hashing - 2 functions => h1 & h2
- Everything same as above
- Insert(k) - Start from h1(k) => go to [ h1(k) + h2(k) ] => go to [ h1(k) + 2*h2(k) ] => go to [ h1(k) + 3*h2(k) ] .... (all taken mod M)
- Can also store the pair { h1(k), h2(k) }
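The double-hashing probe sequence h1(k) + i*h2(k) mod M can be sketched as below; h1 and h2 are illustrative choices (the notes don't fix them), with h2 built so it never evaluates to 0, since a zero step would keep probing the same slot forever:

```python
def h1(k: int, m: int) -> int:
    return k % m

def h2(k: int, m: int) -> int:
    return 1 + (k % (m - 1))  # common choice guaranteeing h2(k) != 0

def double_hash_probes(k: int, m: int, tries: int) -> list[int]:
    """Slots visited: (h1(k) + i*h2(k)) % m for i = 0, 1, 2, ..."""
    return [(h1(k, m) + i * h2(k, m)) % m for i in range(tries)]
```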
The two best options are chaining and double hashing
- Chaining - No need to resize the hash table
- Little wastage of space - no runs of reserved-but-empty array slots as in linear probing
- Tradeoff: walking linked lists V/S {resizing the array + choosing a good 2nd hash function}
- When the hash table starts to get full in double hashing => high load factor and more collisions => chaining becomes more efficient, because open addressing must probe through many occupied cells before it finds the key or an empty slot.
String Hashing detailed
Compares two strings in O(1) time with high probability => treat it as an alternative to character-by-character comparison
POLYNOMIAL ROLLING HASH TECHNIQUE
- H(string) = [(s[0]*x^0) + (s[1]*x^1) + ....] % M
- M must be prime of order 1e9 (bcz greater M -> more range of values)
- But don't take M of order 1e18 --> bcz when multiplying, values of order 1e9 still fit in a long long int, but products of values of order 1e18 overflow.
- x must be greater than max. value of s[i]
eg. - for 'a' to 'z' => x > 26 ---- (33, 37, 39, 41 work better experimentally)
for 'a' to 'z' + 0 to 9 => x > max. value (depends on implementation)
Better not to use raw ASCII values => otherwise x > 255 is needed, and then powers like x^4 start causing trouble => try to keep x below 100
- If we append a character 'a' at the end => H(S+a) = (H(S) + a*x^n) % M, where n = |S| --> easy to prove
- That's why it's called "rolling": once started from a character, it keeps absorbing characters in O(1) --> just like anything keeps rolling once set in motion
- If we add a character 'a' at the front => H(a+S) = (a + x*H(S)) % M --> easy to prove
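Both O(1) update rules can be checked against a from-scratch hash; this sketch uses M = 1e9+7 and x = 31 as illustrative values, and maps 'a'..'z' to 1..26 (avoiding 0, so "a" and "aa" hash differently):

```python
M = 10**9 + 7   # a prime of order 1e9
X = 31          # base, > 26 for lowercase letters

def val(ch: str) -> int:
    return ord(ch) - ord('a') + 1   # 'a'..'z' -> 1..26

def full_hash(s: str) -> int:
    """H(s) = (val(s[0])*X^0 + val(s[1])*X^1 + ...) % M, computed from scratch."""
    h, p = 0, 1
    for ch in s:
        h = (h + val(ch) * p) % M
        p = (p * X) % M
    return h

def append_char(h: int, n: int, ch: str) -> int:
    """H(S + ch) in O(1), given h = H(S) and n = len(S)."""
    return (h + val(ch) * pow(X, n, M)) % M

def prepend_char(h: int, ch: str) -> int:
    """H(ch + S) in O(1), given h = H(S)."""
    return (val(ch) + X * h) % M
```

The prepend rule works because shifting every existing character up one index multiplies each term by one more factor of X.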
- For a range: H(S[i.....j]) = ( [H(S till jth index) - H(S till (i-1)th index)] * x^(-i) ) % M
- If M is not prime => we cannot compute the modular inverse x^(-i)
- How many false positives? (means H(s1) = H(s2) but s1 != s2)
Since values 0 to M-1 are possible => probability = 1 / M
----> the 1 is because s1 fixes one specific value (for any s there is a unique H(s)), and s2 must land on exactly that value - That's why larger M -> lower probability of false positives
- If we compare k times => expected number of false positives = k*(1/M)