Hashing

When we cannot use hashing - when order is important, e.g. to find the largest key just smaller than a given key (a predecessor query) => hash tables do not preserve any order.

COMPRESSION  MAP => H(key) = key % M (M is a prime number close to the table size) 
POLYNOMIAL  ACCUMULATION => 
H(string) = [(s[0]*x^0) + (s[1]*x^1) + ....] % M 

Requirements for hash function => H(key) = value

  1. Deterministic => no matter how many times you pass a key, the value must always be the same
  2. If M is the size of the hash table => the value must lie between 0 and M-1
  3. Values should be uniformly distributed --> fewer collisions
UNIVERSAL  HASHING - pick the hash function uniformly at random from a family of hash functions, so no fixed set of keys can force many collisions

  1. Direct Addressing --> exactly like counting sort
    • Create an array of size r (size of the universe of keys) -> insert, delete or search in O(1) => because it's an array, we access that index directly
    • If keys are decimals or strings - first convert them to integers using some function
    • DEMERITS - space wastage + cannot create a very large array 
  2. Chaining --> like bucket sort (see pic)
    • Create an array of linked lists; each key goes into the list at its hash value
    • Best case - O(1) but Worst case - O(n)

  3. Open  Addressing - number of elements <= size of table
    • Linear Probing - Iterate linearly 
      • Initialize array of size M with some value 
      • Insert(k) - Start from h(k); on seeing an empty / deleted slot => insert k there
      • Search(k) - Start from h(k); stop iterating on hitting an empty slot / finding k / reaching the starting point again
      • Best case - O(1) but worst case - O(n) under heavy clustering

    • Quadratic Probing - Iterate in quadratic fashion
      • Everything same as above
      • Insert(k) - Start from h(k)+0 → h(k)+1 → h(k)+4 → h(k)+9 ......
    • Double Hashing - 2 functions => h1 & h2
      • Everything same as above
      • Insert(k) - Start from h1(k) => go to [ h1(k) + h2(k) ]
        => go to [ h1(k) + 2*h2(k) ] => go to
         [ h1(k) + 3*h2(k) ] ....
      • Can also use the pair { h1(k), h2(k) } as a combined hash value

The two best options are chaining and double hashing
  • Chaining - no need to resize the hash table 
    • Little wastage of space, since there are not many empty array slots as in linear probing
    • Tradeoff: searching linked lists V/S {resizing the array + choosing a 2nd hash function}

  • When the hash table starts to get full in double hashing => high load factor and more collisions => every operation probes through many occupied cells before finding the actual value, so chaining becomes more efficient.

String Hashing detailed
Compares strings in O(1) time with high probability => treat it as an alternative to character-by-character comparison

POLYNOMIAL  ROLLING  HASH  TECHNIQUE
  • H(string) = [(s[0]*x^0) + (s[1]*x^1) + ....] % M 
  • M must be a prime of order 1e9 (because a greater M -> a larger range of values)
    • But don't take M of order 1e18 --> because during multiplication, products of values up to 1e9 still fit inside a long long int, but products of values up to 1e18 overflow.
  • x must be greater than the max. value of s[i]
    e.g. - for 'a' to 'z' => x > 26 ---- (better 33, 37, 39, 41 -> experimentally)
    for 'a' to 'z' + 0 to 9 => x > max. mapped value (depends on implementation)
    better not to use ASCII values => otherwise we would need x > 255, and then powers like x^4 can cause trouble => try to keep x below 100

  • If we add a character 'a' at the end => H(S+a) = (H(S) + a*x^n) % M, where n = |S| --> can prove (it's simple)
    That's why it is called rolling: once started from a character, it keeps absorbing characters in O(1) each --> just like anything keeps rolling once set in motion

  • If we add a character 'a' at the front => H(a+S) = (a + x*H(S)) % M --> can prove (it's simple)

  • For a range: H(S[i.....j]) = ( [H(S till jth index) - H(S till (i-1)th index)] * x^(-i) ) % M
    • If M is not prime => we cannot compute the inverse modulo

  • How many false positives (meaning H(s1) = H(s2) but s1 != s2)?
    Since values 0 to M-1 are possible => probability = 1 / M
    ----> the 1 is because each s has a unique H(s), so s2 must land on exactly that one value out of M
    • That's why a greater M -> a lower probability of false positives
    • If we compare k times => expected number of false positives = k*(1/M)
