Hashing

₪ Can't use hashing if order is important, e.g. "find the largest key smaller than a given key"
₪ Requirements for hash function => H(key) = value
    → Deterministic: no matter how many times you pass a key => the value should always be the same
    → If M is size of hash table => value from hash function must lie between 0 and M-1
    → Values should be uniformly distributed for fewer collisions

COMPRESSION MAP => H(key) = key % M (M is a prime number close to the table size)
POLYNOMIAL ACCUMULATION =>
H(string) = [(s[0]*x^0) + (s[1]*x^1) + ....] % M
UNIVERSAL HASHING - choose a hash function uniformly at random from a family of hash functions
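A minimal sketch of the first two hash functions above (the values M = 1009 and x = 31 are illustrative choices, not prescribed by the notes):

```python
M = 1009  # a prime close to the intended table size (illustrative)

def compression_map(key: int) -> int:
    """Compression map: H(key) = key % M."""
    return key % M

def polynomial_accumulation(s: str, x: int = 31) -> int:
    """Polynomial accumulation: H(s) = (s[0]*x^0 + s[1]*x^1 + ...) % M,
    mapping 'a'..'z' to 1..26 so no character encodes to 0."""
    h, power = 0, 1
    for ch in s:
        h = (h + (ord(ch) - ord('a') + 1) * power) % M
        power = (power * x) % M
    return h

print(compression_map(123456))         # some value in 0..M-1
print(polynomial_accumulation("abc"))  # 1*31^0 + 2*31^1 + 3*31^2, mod M
```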

***********************************************************************************
  1. Direct Addressing --> exactly like counting sort
    • Create array of size r (size of universe of keys) -> insert, delete or search in O(1) => bcz it's an array -> access that slot directly
    • If keys are decimals or strings - convert them to integers first using some function
    • DEMERITS - space wastage + cannot create a very large array
  2. Chaining --> like bucket sort
    • Create an array of linked lists, one list per hash value
    • Best case - O(1) but worst case - O(n) (all keys land in one chain)

  3. Open Addressing - number of elements <= size of table
    • Linear Probing - iterate linearly
      • Initialize array of size M with some sentinel value
      • Insert(k) - Start from h(k); on seeing an empty / deleted slot => insert k there
      • Search(k) - Start from h(k); stop iterating on an empty slot / finding k / reaching the starting point
        [NOTE - If deleted slot → do not stop]
      • Best case - O(1) but worst case - O(n) with heavy clustering

    • Quadratic Probing - Iterate in quadratic fashion
      • Everything same as above
      • Insert(k) - Start from h(k)+0 → h(k)+1 → h(k)+4 → h(k)+9 ......
    • Double Hashing - 2 functions => h1 & h2
      • Everything same as above
      • Insert(k) - Start from h1(k) => go to [h1(k) + h2(k)] => go to [h1(k) + 2*h2(k)] => go to [h1(k) + 3*h2(k)] ....
      • Can also use them as a pair --> { h1(k), h2(k) }
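The open-addressing rules above — including the deleted-slot behaviour for insert vs. search — can be sketched with linear probing; the table size and the EMPTY/DELETED sentinel markers are illustrative choices:

```python
M = 11  # table size (illustrative)
EMPTY, DELETED = object(), object()  # sentinel markers for slots
table = [EMPTY] * M

def h(k: int) -> int:
    return k % M

def insert(k: int) -> None:
    """Probe linearly from h(k); empty OR deleted slots accept the key."""
    i = h(k)
    for _ in range(M):
        if table[i] is EMPTY or table[i] is DELETED:
            table[i] = k
            return
        i = (i + 1) % M
    raise RuntimeError("table full")

def search(k: int) -> int:
    """Return the slot index of k, or -1 if absent."""
    i = start = h(k)
    while True:
        if table[i] is EMPTY:   # empty slot => stop
            return -1
        if table[i] == k:       # found k => stop
            return i
        i = (i + 1) % M         # deleted slot => do NOT stop, keep probing
        if i == start:          # back at the starting point => stop
            return -1

def delete(k: int) -> None:
    i = search(k)
    if i != -1:
        table[i] = DELETED      # tombstone, not EMPTY, so searches probe past it

insert(5); insert(16); insert(27)  # all hash to slot 5 -> a small cluster
delete(16)
print(search(27))                  # still found, probing past the tombstone
```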
***********************************************************************************
Best options are chaining or double hashing
  • Chaining - no need to resize hash tables
    • Less wastage of space: very few empty array slots compared to linear probing
    • Tradeoff → traversing linked lists V/S {resizing the array and choosing a 2nd hash function}

  • More filled slots ≈ more collisions → load factor = how much of the table is already filled
    • m = table size & n = number of stored elements → load factor α = n / m
      • Higher load factor → higher chance of collisions
    • Modern hash tables resize automatically once the load factor crosses a certain threshold
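A chaining table that resizes once α = n/m crosses a threshold, as described above (the 0.75 threshold and doubling policy are illustrative choices, similar to what several standard libraries use):

```python
class ChainedHashTable:
    def __init__(self, m: int = 8, max_load: float = 0.75):
        self.buckets = [[] for _ in range(m)]  # array of chains
        self.n = 0                             # number of stored elements
        self.max_load = max_load

    def _resize(self) -> None:
        """Double the table and rehash every key (n/m is halved)."""
        old = self.buckets
        self.buckets = [[] for _ in range(2 * len(old))]
        self.n = 0
        for bucket in old:
            for k in bucket:
                self.insert(k)

    def insert(self, k: int) -> None:
        if (self.n + 1) / len(self.buckets) > self.max_load:
            self._resize()
        self.buckets[k % len(self.buckets)].append(k)
        self.n += 1

    def search(self, k: int) -> bool:
        return k in self.buckets[k % len(self.buckets)]

t = ChainedHashTable()
for k in range(20):
    t.insert(k)
print(t.search(13), len(t.buckets))  # table grew beyond the initial 8 buckets
```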

  • Chaining → assuming uniform distribution → a key not present in the table is equally likely to hash to any of the m slots (size of array = size of table = m)
    • Unsuccessful search → expected no. of elements searched = average length of a list = n/m = α
      • Total time required = Θ(1+α) → 1+α bcz the time for computing h(k) is included
      • Length = α bcz n elements are uniformly distributed over the array
    • For successful search → no. of elements probed for key "x" = 1 + no. of elements before "x" in its list → the 1 is for comparing the iᵗʰ element with x itself
      Xij tells whether the iᵗʰ and jᵗʰ inserted keys hash to the same slot or not [Bernoulli distribution]
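Spelling out the step those indicator variables are used for (the standard derivation, under head-insertion chaining): X_{ij} = 1 iff the iᵗʰ and jᵗʰ inserted keys hash to the same slot, so E[X_{ij}] = 1/m, and the expected cost of a successful search averaged over the n keys is

```latex
\mathbb{E}\!\left[\frac{1}{n}\sum_{i=1}^{n}\Bigl(1+\sum_{j=i+1}^{n}X_{ij}\Bigr)\right]
= 1+\frac{1}{nm}\sum_{i=1}^{n}(n-i)
= 1+\frac{n-1}{2m}
= \Theta(1+\alpha)
```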

  • Double Hashing → (1−α) is the fraction of empty slots [probability of a slot being empty]
    • NOTE - each probe/iteration looks at a random slot of the table, generated from h2(k)
    • Expected no. of probes until the first empty slot is found {geometric distribution} = 1/(1−α)
      • Same as the insertion cost bcz for insertion we probe until we find an empty slot
      • Called an unsuccessful search (means the key is not in the table) bcz in open addressing we probe slots until we hit an empty one or return to the starting point
    • When inserting the (i+1)ᵗʰ element → load = i/m → if we search immediately after insertion
      → no. of probes to find that element = 1/(1−load) = m/(m−i) {insertion cost}
      Averaging over the n keys in the table gives the expected no. of probes for a successful search
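Carrying out that average over the n insertion costs (standard result; H_k denotes the k-th harmonic number, α = n/m):

```latex
\frac{1}{n}\sum_{i=0}^{n-1}\frac{m}{m-i}
= \frac{m}{n}\,(H_m - H_{m-n})
\approx \frac{1}{\alpha}\ln\frac{1}{1-\alpha}
```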

String Hashing detailed
Compares strings in O(1) time with high probability => treat it as an alternative to direct comparison

POLYNOMIAL  ROLLING  HASH  TECHNIQUE
  • H(string) = [(s[0]*x^0) + (s[1]*x^1) + ....] % M 
  • M must be a prime of order 1e9 (bcz a greater M -> a larger range of values -> fewer collisions)
    • But don't take M of order 1e18: products of two values below 1e9 still fit in a 64-bit integer, while products of 1e18-sized values would overflow

  • x must be greater than maximum value of s[i]
    • for 'a' to 'z' => transform it into [1 to 26] → x > 26

  • If add character 'a' at last => H(S+a) = (H(S) + a*x^n) % M
    • That's why it's called "rolling": once started from a character, each new character is absorbed in O(1) → just like anything that keeps rolling once set in motion
  • If add character 'a' at front => H(a+S) = (a + x*H(S)) % M
  • For a range: H(S[i.....j]) = ( [H(S till jᵗʰ index) - H(S till (i-1)ᵗʰ index)] * x^(-i) ) % M
    • M must be prime, else the modular inverse of x cannot always be found
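A sketch of the prefix-hash + range-hash recipe above (M = 1e9+7 and x = 31 are the usual illustrative choices; Python's three-argument pow with a negative exponent computes the modular inverse, which exists because M is prime):

```python
M = 10**9 + 7   # large prime of order 1e9
X = 31          # > 26, since 'a'..'z' map to 1..26

def prefix_hashes(s: str) -> list:
    """pre[t] = H(s[0..t-1]) = (s[0]*x^0 + ... + s[t-1]*x^(t-1)) % M."""
    pre, power = [0], 1
    for ch in s:
        pre.append((pre[-1] + (ord(ch) - ord('a') + 1) * power) % M)
        power = (power * X) % M
    return pre

def range_hash(pre: list, i: int, j: int) -> int:
    """H(s[i..j]) normalised to start at x^0:
    (pre[j+1] - pre[i]) * x^(-i) % M."""
    inv = pow(X, -i, M)  # modular inverse of x^i (Python 3.8+), valid since M is prime
    return (pre[j + 1] - pre[i]) * inv % M

s = "abcabc"
pre = prefix_hashes(s)
print(range_hash(pre, 0, 2) == range_hash(pre, 3, 5))  # "abc" vs "abc" -> True
```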

  • A false positive is when H(s1) = H(s2) but s1 != s2 → probability = 1/M
    • H(s1) can be anything (no restriction) but H(s2) matches H(s1) with probability 1/M → that's why a larger M is always better
