Hashing

₪ Can't use hashing if order is important, e.g. "find the largest key smaller than a given key"
₪ Requirements for hash function => H(key) = value
    → Deterministic: no matter how many times you pass a key => the value should always be the same
    → If M is size of hash table => value from hash function must lie between 0 and M-1
    → Values should be uniformly distributed for fewer collisions

COMPRESSION MAP => H(key) = key % M (M is a prime number close to the table size)
POLYNOMIAL ACCUMULATION =>
H(string) = [(s[0]*x^0) + (s[1]*x^1) + ....] % M
UNIVERSAL HASHING - choose a hash function uniformly at random from a family of hash functions
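A minimal sketch of the first two hash functions above (the values M = 1009 and x = 31 are illustrative choices, not prescribed by the notes):

```python
M = 1009  # a prime close to the intended table size (illustrative)

def compression_map(key: int) -> int:
    """Compression map: H(key) = key % M."""
    return key % M

def polynomial_accumulation(s: str, x: int = 31) -> int:
    """Polynomial accumulation: H(s) = (s[0]*x^0 + s[1]*x^1 + ...) % M,
    mapping 'a'..'z' to 1..26 so no character encodes to 0."""
    h, power = 0, 1
    for ch in s:
        h = (h + (ord(ch) - ord('a') + 1) * power) % M
        power = (power * x) % M
    return h

print(compression_map(123456))         # some value in 0..M-1
print(polynomial_accumulation("abc"))  # 1*31^0 + 2*31^1 + 3*31^2, mod M
```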

***********************************************************************************
  1. Direct Addressing --> exactly like counting sort
    • Create array of size r (size of universe of keys) -> insert, delete or search in O(1) => bcz it's an array -> access that slot directly
    • If keys are decimals or strings - convert them to integers first using some function
    • DEMERITS - space wastage + cannot create a very large array
  2. Chaining --> like bucket sort
    • Create an array of linked lists, one list per hash value
    • Best case - O(1) but worst case - O(n) (all keys land in one chain)

  3. Open Addressing - number of elements <= size of table
    • Linear Probing - iterate linearly
      • Initialize array of size M with some sentinel value
      • Insert(k) - Start from h(k); on seeing an empty / deleted slot => insert k there
      • Search(k) - Start from h(k); stop iterating on an empty slot / finding k / reaching the starting point
        [NOTE - If deleted slot → do not stop]
      • Best case - O(1) but worst case - O(n) with heavy clustering

    • Quadratic Probing - Iterate in quadratic fashion
      • Everything same as above
      • Insert(k) - Start from h(k)+0 → h(k)+1 → h(k)+4 → h(k)+9 ......
    • Double Hashing - 2 functions => h1 & h2
      • Everything same as above
      • Insert(k) - Start from h1(k) => go to [h1(k) + h2(k)] => go to [h1(k) + 2*h2(k)] => go to [h1(k) + 3*h2(k)] ....
      • Can also use them as a pair --> { h1(k), h2(k) }
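The open-addressing rules above — including the deleted-slot behaviour for insert vs. search — can be sketched with linear probing; the table size and the EMPTY/DELETED sentinel markers are illustrative choices:

```python
M = 11  # table size (illustrative)
EMPTY, DELETED = object(), object()  # sentinel markers for slots
table = [EMPTY] * M

def h(k: int) -> int:
    return k % M

def insert(k: int) -> None:
    """Probe linearly from h(k); empty OR deleted slots accept the key."""
    i = h(k)
    for _ in range(M):
        if table[i] is EMPTY or table[i] is DELETED:
            table[i] = k
            return
        i = (i + 1) % M
    raise RuntimeError("table full")

def search(k: int) -> int:
    """Return the slot index of k, or -1 if absent."""
    i = start = h(k)
    while True:
        if table[i] is EMPTY:   # empty slot => stop
            return -1
        if table[i] == k:       # found k => stop
            return i
        i = (i + 1) % M         # deleted slot => do NOT stop, keep probing
        if i == start:          # back at the starting point => stop
            return -1

def delete(k: int) -> None:
    i = search(k)
    if i != -1:
        table[i] = DELETED      # tombstone, not EMPTY, so searches probe past it

insert(5); insert(16); insert(27)  # all hash to slot 5 -> a small cluster
delete(16)
print(search(27))                  # still found, probing past the tombstone
```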
***********************************************************************************
Best options are chaining or double hashing
  • Chaining - no need to resize hash tables
    • Less wastage of space: very few empty array slots compared to linear probing
    • Tradeoff → traversing linked lists V/S {resizing the array and choosing a 2nd hash function}

  • More filled slots ≈ more collisions → load factor = how much of the table is already filled
    • m = table size & n = number of stored elements → load factor α = n / m
      • Higher load factor → higher chance of collisions
    • Modern hash tables resize automatically once the load factor crosses a certain threshold
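A chaining table that resizes once α = n/m crosses a threshold, as described above (the 0.75 threshold and doubling policy are illustrative choices, similar to what several standard libraries use):

```python
class ChainedHashTable:
    def __init__(self, m: int = 8, max_load: float = 0.75):
        self.buckets = [[] for _ in range(m)]  # array of chains
        self.n = 0                             # number of stored elements
        self.max_load = max_load

    def _resize(self) -> None:
        """Double the table and rehash every key (n/m is halved)."""
        old = self.buckets
        self.buckets = [[] for _ in range(2 * len(old))]
        self.n = 0
        for bucket in old:
            for k in bucket:
                self.insert(k)

    def insert(self, k: int) -> None:
        if (self.n + 1) / len(self.buckets) > self.max_load:
            self._resize()
        self.buckets[k % len(self.buckets)].append(k)
        self.n += 1

    def search(self, k: int) -> bool:
        return k in self.buckets[k % len(self.buckets)]

t = ChainedHashTable()
for k in range(20):
    t.insert(k)
print(t.search(13), len(t.buckets))  # table grew beyond the initial 8 buckets
```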

  • Chaining → assuming uniform distribution → a key not present in the table is equally likely to hash to any of the m slots (size of array = size of table = m)
    • Unsuccessful search → expected no. of elements searched = average length of a list = n/m = α
      • Total time required = Θ(1+α) → 1+α bcz the time for computing h(k) is included
      • Length = α bcz n elements are uniformly distributed over the array
    • For successful search → no. of elements probed for key "x" = 1 + no. of elements before "x" in its list → the 1 is for comparing the iᵗʰ element with x itself
      Xij tells whether the iᵗʰ and jᵗʰ inserted keys hash to the same slot or not [Bernoulli distribution]
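Spelling out the step those indicator variables are used for (the standard derivation, under head-insertion chaining): X_{ij} = 1 iff the iᵗʰ and jᵗʰ inserted keys hash to the same slot, so E[X_{ij}] = 1/m, and the expected cost of a successful search averaged over the n keys is

```latex
\mathbb{E}\!\left[\frac{1}{n}\sum_{i=1}^{n}\Bigl(1+\sum_{j=i+1}^{n}X_{ij}\Bigr)\right]
= 1+\frac{1}{nm}\sum_{i=1}^{n}(n-i)
= 1+\frac{n-1}{2m}
= \Theta(1+\alpha)
```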

  • Double Hashing → (1−α) is the fraction of empty slots [probability of a slot being empty]
    • NOTE - each probe/iteration looks at a random slot of the table, generated from h2(k)
    • Expected no. of probes until the first empty slot is found {geometric distribution} = 1/(1−α)
      • Same as the insertion cost bcz for insertion we probe until we find an empty slot
      • Called an unsuccessful search (means the key is not in the table) bcz in open addressing we probe slots until we hit an empty one or return to the starting point
    • When inserting the (i+1)ᵗʰ element → load = i/m → if we search immediately after insertion
      → no. of probes to find that element = 1/(1−load) = m/(m−i) {insertion cost}
      Averaging over the n keys in the table gives the expected no. of probes for a successful search
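Carrying out that average over the n insertion costs (standard result; H_k denotes the k-th harmonic number, α = n/m):

```latex
\frac{1}{n}\sum_{i=0}^{n-1}\frac{m}{m-i}
= \frac{m}{n}\,(H_m - H_{m-n})
\approx \frac{1}{\alpha}\ln\frac{1}{1-\alpha}
```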

String Hashing detailed
Compares strings in O(1) time with high probability => treat it as an alternative to direct comparison

POLYNOMIAL  ROLLING  HASH  TECHNIQUE
  • H(string) = [(s[0]*x^0) + (s[1]*x^1) + ....] % M 
  • M must be a prime of order 1e9 (bcz a greater M -> a larger range of values -> fewer collisions)
    • But don't take M of order 1e18: products of two values below 1e9 still fit in a 64-bit integer, while products of 1e18-sized values would overflow

  • x must be greater than maximum value of s[i]
    • for 'a' to 'z' => transform it into [1 to 26] → x > 26

  • If add character 'a' at last => H(S+a) = (H(S) + a*x^n) % M
    • That's why it's called "rolling": once started from a character, each new character is absorbed in O(1) → just like anything that keeps rolling once set in motion
  • If add character 'a' at front => H(a+S) = (a + x*H(S)) % M
  • For a range: H(S[i.....j]) = ( [H(S till jᵗʰ index) - H(S till (i-1)ᵗʰ index)] * x^(-i) ) % M
    • M must be prime, else the modular inverse of x cannot always be found
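A sketch of the prefix-hash + range-hash recipe above (M = 1e9+7 and x = 31 are the usual illustrative choices; Python's three-argument pow with a negative exponent computes the modular inverse, which exists because M is prime):

```python
M = 10**9 + 7   # large prime of order 1e9
X = 31          # > 26, since 'a'..'z' map to 1..26

def prefix_hashes(s: str) -> list:
    """pre[t] = H(s[0..t-1]) = (s[0]*x^0 + ... + s[t-1]*x^(t-1)) % M."""
    pre, power = [0], 1
    for ch in s:
        pre.append((pre[-1] + (ord(ch) - ord('a') + 1) * power) % M)
        power = (power * X) % M
    return pre

def range_hash(pre: list, i: int, j: int) -> int:
    """H(s[i..j]) normalised to start at x^0:
    (pre[j+1] - pre[i]) * x^(-i) % M."""
    inv = pow(X, -i, M)  # modular inverse of x^i (Python 3.8+), valid since M is prime
    return (pre[j + 1] - pre[i]) * inv % M

s = "abcabc"
pre = prefix_hashes(s)
print(range_hash(pre, 0, 2) == range_hash(pre, 3, 5))  # "abc" vs "abc" -> True
```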

  • A false positive is when H(s1) = H(s2) but s1 != s2 → probability = 1/M
    • H(s1) can be anything (no restriction) but H(s2) matches H(s1) with probability 1/M → that's why a larger M is always better
