whio  Update of "whio_ht_HOWTO"

Many hyperlinks are disabled.
Use anonymous login to enable hyperlinks.


Artifact ID: d7802577677c682ebe335fc2b8fe1e45187639fd
Page Name:whio_ht_HOWTO
Date: 2011-05-23 18:50:57
Original User: stephan
Parent: fdc7080d00fd097b03e140211a37cd0dfa8f3020

ACHTUNG: AS OF 20110523, THIS PAGE IS NOW MAINTAINED IN THE NEW WIKI: http://whiki.wanderinghorse.net/wikis/whio/?page=whio_ht_HOWTO

See also: whio_ht, whio_ht_tool

whio_ht HOWTO

This page demonstrates how to use the whio_ht API.

Creating a hashtable:

The process of preparing a whio_dev object to hold a hashtable is called "formatting" it, because it is conceptually like formatting a hard drive partition for use with a specific filesystem.

// assume that 'dev' is a whio_dev which we previously opened.
whio_ht * ht = NULL /*DO NOT use an uninitialized pointer!*/;
whio_ht_opt opt = whio_ht_opt_empty;
opt.hashSize = 503 /* should be a prime number, ideally larger than
                      the number of entries we expect to hold.
                      On-storage size of the table is about
                      (hashSize*whio_ht_sizeof_encoded_hashent) bytes.
int rc = whio_ht_format( &ht, dev, &opt, "string" );
if( 0 != rc ) { // error...
   // we still own dev, so we must clean it up.
   return ....;
// else ht takes ownership of dev

The "string" argument to whio_ht_format() is the name of a "function set" - a pair of key hashing and comparison functions. The library has sets built in for hashing strings (case-sensitively or case-insensitively) and the core integer types. Clients can register their own sets as well. The reason we use a name is so that when we open a hashtable from storage, we can find the functions that it should use. If we did not do that, then the client would need to tell the hashtable what functions to use each time the hashtable is opened. To use case-insensitive routines, use the function set name string:nocase.

Opening an existing hashtable:

whio_ht * ht = NULL;
int rc = whio_ht_open( &ht, dev );
if( 0 != rc ) {
    ... error ...
    // we still own dev, so we must clean it up:
    return ...

Remember: we can use the stack-allocation trick here, too.

Closing a hashtable:

When we're done with a hashtable instance we must close it to free its resources:

whio_ht_close( ht );

That frees all resources, including the whio_dev instance storing the hashtable.

Inserting items:

int rc = whio_ht_insert( ht, key, keyLength, data, dataLength );
if( 0 != rc ) { ... error ... }

Because the hashtable accepts arbitrary inputs (not just strings), the client has to tell it how long the key and value parts are. The library allows length-0 (NULL) values, but not length-0 keys.

Insertion will fail with whio_rc.AccessError if the given key is already in the hashtable. (TODO: provide whio_ht_replace() which works like insert but overwrites existing entries.)


whio_ht_record rec = whio_ht_record_empty;
int rc = whio_ht_search( ht, key, keyLength, &rec );
if( whio_rc.NotFoundError == rc ) { ... no such entry ... }
else if( 0 != rc ) { ... a "real" error ... }
else {
  ... we got an entry ...

On success, the rec object holds information about the result, but does not actually hold the key or value. Since the key/value data can be arbitrarily large, the library does not allocate memory them, and instead provides several different approaches for fetching them from the storage. The simplest approach, if one doesn't mind doing a lot of de/allocation:

void * k = NULL;
void * v = NULL;
int rc = whio_ht_kv_get_alloc( ht, &rec, &k, NULL, &v, NULL );
if( 0 != rc ) { ... error ... }
... use k/v, then free them ...
free(k); // also frees v, as they are allocated together via 1 allocation

The NULL arguments to whio_ht_kv_get_alloc() are optional whio_size_t pointers where the function stores the length of the allocated key and value. For STRING keys/values it is not normally necessary to remember these, but variable-length binary data it is normally important to know the size.

There are also options which stream the key/value data directly from storage to client-provided memory, but using these requires that the client first fetch the sizes from the whio_ht_record handle and initialize their buffers appropriately.

(TODO: add callback-based streaming, analog to whio_vlbm's features. This allows us to, e.g., stream the contents of a hashtable entry directly to a whio_stream or whio_dev object.)

Removing items:

int rc = whio_ht_remove( ht, key, keyLength );

Remove will fail with whio_rc.NotFoundError if no such item exists, or with another error code if something else was wrong (invalid arguments or an I/O error).

Iterating over entries

ACHTUNG #1: this is not currently 100% thread-safe. Concurrent modification to the hashtable might invalidate any given iterator, potentially leading to undefined results. The iterator API checks for this condition and returns whio_rc.ConcurrentModificationError if it sees this. There is, however, still a timeframe for concurrent changes between the time the iterator is fetched/checked and the time its record is read by the iteration loop. It is not clear how to work around this problem without requiring that client-provided mutexes be capable of recursive locks.

ACHTUNG #2: Iteration requires reading every single hashtable entry, so it performs at least a number of reads equal to the hashtable's size plus one read for each record in the hashtable. This is not a problem on fast/cached storage, but this may suck on slow devices with high seek times.

Iteration is straighforward, and may look familiar to users of the C++ STL iterators:

whio_ht_iterator it = whio_ht_iterator_empty;
int rc = whio_ht_iter_begin( ht, &it );
if( rc ) ... error ...;
while( ! whio_ht_iterator_is_end(&it) ) {
   ... use it.record to fetch key/value parts ...
   ... then advance the iterator: ...
   rc = whio_ht_iterator_next( &it );
   if( rc ) ... error ...;
   else continue;

Tip: stack- or custom-allocating a whio_ht

We can optionally allocate the whio_ht object on the stack (or as a member object in a larger object) by passing a pointer to a non-NULL object to the format or opening routines:

whio_ht htObj = whio_ht_empty;
whio_ht * ht = &htObj;
... as above ...
int rc = whio_ht_format( &ht, ... );
... as above ...

That will use the stack-allocated ht object, but does not otherwise change the way we use the object. The same trick works for whio_ht_open(). In fact, it need not be stack allocated: the client can allocate it however he likes. The requirement is, however, that after calling whio_ht_close(), the caller must deallocate the object using a method appropriate for its allocation. Abstractly, that looks like:

whio_ht * ht = (whio_ht*) mySpecialAllocator(sizeof(whio_ht));
*ht = whio_ht_empty; // ensure it has a clean state
// Now we can pass it to format() or open() and use it
// as normal. When we're done:
whio_ht_close(ht); // frees ht's resources, but NOT ht!

One might ask how the library knows whether or not whio_ht_close() must free the ht object. When formatting or opening the hashtable, if it allocates the object then it marks that object so that the close routine knows that it must deallocate the object.