opensubscriber
   Find in this group all groups
 
The Haskell Cafe more information…

h : haskell-cafe@haskell.org 18 November 2008 • 7:37PM -0500

[Haskell-cafe] implementing python-style dictionary in Haskell
by kenny lu

REPLY TO AUTHOR
 
REPLY TO GROUP



Dear all,

I am trying to implement the python-style dictionary in Haskell.

Python dictionary is a data structure that maps one key to one value.
For instance, a python dictionary
d = {'a':1, 'b':2 }
maps key 'a' to 1, 'b' to 2.
Python dictionary allows for update. e.g. the statement
d['a'] = 3
changes the value pointed by 'a' from 1 to 3.

Internally, python dictionary is implemented using hash table.

My first attempt is to use Data.HashTable. However it was immediately
abandoned, as I realize the memory usage is unreasonably huge.


== SECOND ATTEMPT ==
My second attempt is to use Data.Map


{-# OPTIONS_GHC -fglasgow-exts #-}
module Main where

import qualified Data.HashTable as HT
import qualified Data.IntMap as IM
import qualified Data.Map as DM
import qualified Data.ByteString.Char8 as S
import Data.Char


-- the Dict type class
class Dict d k v | d -> k v where
    empty :: d
    insert :: k -> v -> d -> d
    lookup :: k -> d -> Maybe v
    update :: k -> v -> d -> d

-- Let's use string as key
type Key = String

-- insert key-value pairs into a dictionary
fromList :: Dict d k a => [(k,a)] -> d
fromList l =
    foldl (\d (key,val) -> insert key val d) empty l


instance Dict (DM.Map S.ByteString a) Key a where
    empty = DM.empty
    insert key val dm =
    let packed_key = S.pack key
    in DM.insert packed_key val dm
    lookup key dm =
    let packed_key = S.pack key
        in DM.lookup packed_key dm
    update key val dm =
    let packed_key = S.pack key
        in DM.update (\x -> Just val) packed_key dm


Which kinda works, however since Map is implemented using a balanced tree,
therefore,
when as the dictionary grows, it takes a long time to insert new key-value
pair.


== THIRD ATTEMPT ==
My third attempt is to use Data.IntMap

-- an implementation of Dict using IntMap
instance Dict (IM.IntMap a) Key a where
    empty = IM.empty
    insert key val im =
    let int_key = fromIntegral (HT.hashString key)
    in IM.insert int_key val im
    lookup key im =
    let int_key = fromIntegral (HT.hashString key)
    in IM.lookup int_key im
    update key val im =
    let int_key = fromIntegral (HT.hashString key)
        in IM.update (\x -> Just val) int_key im


This implementation is faster than the Map approach, however this
implementation
can't handle collision well, two keys which are hashed into the same integer
will overwrite each other.


== FOURTH ATTEMPT ==

My fourth implementation is to use Trie. The idea is to split a string (a
key) into
a list of 4-character chunks. Each chunk can be mapped into a 32-bit integer
without collision.
We then insert the value with this list of chunks into the Trie.


-- an implementation of Dict using Trie
instance Dict (Trie a) Key a where
    empty = emptyTrie
    insert key val trie =
    let key_chain = chain key
    in insertTrie key_chain val trie
    lookup key trie =
    let key_chain = chain key
        in lookupTrie key_chain trie
    update key val trie =
    let key_chain = chain key
        in updateTrie key_chain val trie


-- an auxillary function that "splits" string into small pieces,
-- 4 characters per piece, 4 chars = 32 bit
chain :: Key -> [Key]
chain k | length k > 4 = let (k',ks) = splitAt 4 k
             in (k':chain ks)
    | otherwise    = [k]

-- a collision-free hash function which turns four chars into Int32
safehash :: [Char] -> Int
safehash cs | length cs > 4 = error "safehash failed."
        | otherwise =
        sum [ (ord c)*(256^i)   | (c,i) <- zip cs [0..3] ]


-- a trie datatype
data Trie a = Trie [a] (IM.IntMap (Trie a))

-- the empty trie
emptyTrie = Trie [] (IM.empty)

-- insert value into the trie
insertTrie :: [String] -> a -> Trie a -> Trie a
insertTrie [] i (Trie is maps) = Trie (i:is) maps
insertTrie (word:words) i (Trie is maps) =
    let key = safehash word
    in case IM.lookup key maps of
     { Just trie -> let trie' = insertTrie words i trie
                maps' = IM.update (\x -> Just trie') key maps
            in Trie is maps'
     ; Nothing -> let trie = emptyTrie
              trie' = insertTrie words i trie
              maps' = IM.insert key trie' maps
              in Trie is maps'
     }

-- lookup value from the trie
lookupTrie :: [String] -> Trie a -> Maybe a
lookupTrie [] (Trie vs _) =
    case vs of
      [] -> Nothing
      (x:_) -> Just x
lookupTrie (word:words) (Trie is maps) =
    let key = safehash word
    in case IM.lookup key maps of
       Just trie -> lookupTrie words trie
       Nothing   -> Nothing

-- update the trie with the given value.
updateTrie :: [String] -> a -> Trie a -> Trie a
-- we only update the first value and leave the rest unchanged.
updateTrie [] y (Trie (x:xs) maps) = Trie (y:xs) maps
updateTrie (word:words) v  (Trie is maps) =
    let key = safehash word
    in case IM.lookup key maps of
       Just trie -> let trie' = updateTrie words v trie
                maps'  = IM.update (\x -> Just trie') key maps
            in Trie is maps'
       Nothing   -> Trie is maps



== BENCH MARK ==

I have a main function which builds a dictionary from a text file.
Each line of the file is a key-value pair separated by a space.

e.g.

key1 1
key2 2
...


main :: IO ()
main = do { content <- readFile "in.txt"
      ; let -- change this following type annotation
            -- to change different type of the dictionary
            -- dict :: DM.Map S.ByteString Int
            -- dict :: IM.IntMap Int
            dict :: Trie Int
            dict = fromList (map parse_a_line  (lines content))
      ; case Main.lookup "key256" dict of
        { Just v -> putStrLn (show v)
        ; Nothing -> putStrLn "Not found"
        }
        -- read a line here so that we can pause the program
            -- and look at the memory usage.
      ; v <- readLn
      ; putStrLn v
      }
    where  parse_a_line :: String -> (Key,Int)
           parse_a_line line = case words line of
                  [key,val] -> (key,read val)
                  _ -> error " parse error.  "


I tested all three implementations by building a dictionary of size 1000000.
The result shows that the Map and the Trie approaches handle collision well,
but
the IntMap approach does not.


Here is a comparison of memory usage

Map     : 345 MB
IntMap : 146 MB
Trie     : 282 MB
Python : 94 MB

Here is a comparison of execution time (on an intel dual core 2.0G)

Map: 26 sec
IntMap: 9 sec
Trie: 12 sec
Python: 2.24 sec


The above number shows that my implementations of python style dictionary
are space/time in-efficient as compared to python.

Can some one point out what's wrong with my implementations?

I've attached my code in the tgz file.

Cheers,
Kenny

Bookmark with:

Delicious   Digg   reddit   Facebook   StumbleUpon

Related Messages

opensubscriber is not affiliated with the authors of this message nor responsible for its content.