ACHTUNG: AS OF 20110523, THIS PAGE IS NOW MAINTAINED IN THE NEW WIKI: http://whiki.wanderinghorse.net/wikis/whio/?page=whio_udb

See also: whio_udb_HOWTO, whio_udb_tool, whio_ht

whio_udb: "u" as in "micro", "DB" as in "database"

Header: <wh/whio/whio_udb.h>

Code Status: as of 14 March 2010, whio_udb is only about 4 nights old but appears to work as designed. We'll call it beta.

The whio_udb ("u" as in "micro" and "db" as in "database") API is a small database layer encapsulating a storage-based hashtable for holding records with fixed maximum key and value sizes. When working with string records, the strings need not all be the same size, but all will have a db-instance-specified _maximum_ size. When working with binary data, the user must tell the DB how to measure the length of the objects.
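
For illustration, such a length-measuring hook might look like the following. The typedef and function names here are hypothetical, invented for this sketch, and are not whio_udb's actual API:

    #include <string.h>

    /* Hypothetical callback: given a raw record object, report how
       many bytes of it are "used". Illustrative only; the real
       whio_udb signature may differ. */
    typedef size_t (*udb_length_f)( void const * obj );

    /* For NUL-terminated string records the used length is
       strlen()+1, counting the terminating NUL byte. */
    static size_t udb_length_cstring( void const * obj ) {
        return obj ? (strlen( (char const *)obj ) + 1) : 0;
    }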

UDB was initially inspired by the example database application in the book Advanced Programming in the Unix Environment (an excellent book overall, by the way), but is not based on that code. That example did, however, show me how to implement a free-list which has O(1) de/allocation, and that detail was important for the overall performance of UDB. Speaking of...
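
As a minimal in-memory sketch of that technique (not UDB's actual on-storage code): free blocks form a singly-linked stack, so both taking and returning a block touch only the list head, giving O(1) de/allocation.

    #include <stddef.h>

    /* Each free block holds only a link to the next free block.
       On storage, the 'next' link would be a record offset. */
    typedef struct FreeBlock {
        struct FreeBlock * next;
    } FreeBlock;

    typedef struct {
        FreeBlock * head;
    } FreeList;

    /* O(1) deallocation: push a recycled block onto the stack. */
    static void freelist_put( FreeList * fl, FreeBlock * b ) {
        b->next = fl->head;
        fl->head = b;
    }

    /* O(1) allocation: pop a block, or return NULL if none free. */
    static FreeBlock * freelist_take( FreeList * fl ) {
        FreeBlock * b = fl->head;
        if( b ) fl->head = b->next;
        return b;
    }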

Primary Features

  • Excellent performance characteristics. All of the major algorithms are amortized O(1): insertion, searching, and removal. On my box i'm seeing average insertion rates of up to about 75000 records per second with records totaling 72 bytes of key/value data.
  • Uses whio_dev for storage, so a db can be stored on any storage for which a whio_dev implementation exists. (That includes using them inside a whio_epfs container.)
  • A db can operate in r/o or r/w modes.
  • Quite a lot of documentation. At only three days old, it had more API docs than most other Open Source projects ten times its size. No public API member is undocumented, and most are documented in pedantic levels of detail.
  • All of the parameters of the DB are client-configurable: hashtable length, key size, data size, and the various functions used for hashing and comparison. e.g. it is possible to make the db do case-insensitive searches by simply providing a case-insensitive comparison function (see the comparator sketch after this list).
  • Requires only a small, fixed amount of memory plus an amount proportional to the combined sizes of the key and data fields.
  • Data format is endian- and bitness-independent. A db created on one platform will work on another, provided the user hasn't stored endian-dependent raw data in the db.
  • Allows iteration over all elements in the hashtable via a for-each interface. Also supports iterating over records whose keys match a wildcard-style string (e.g. "[a-zA-Z]*.txt"; see the pattern-matching note after this list).
  • Grows on demand, with an essentially arbitrary upper limit on the number of records. (That said, large numbers of records in conjunction with a small hashtable size will worsen overall performance.)
  • It maps names to hashing function sets, so that when opening a db it can load the proper hash/length/comparison functions. This means the client does not need to know which functions have to be used when opening a given DB (just when formatting it). Clients can register their own function sets to work within this framework.
  • It has optional support for a client-defined mutex.
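
Regarding the configurable comparison function mentioned above: a case-insensitive comparator might look like the following. This sketch assumes a memcmp()-style contract purely for illustration; the exact whio_udb comparator signature is not reproduced here.

    #include <stddef.h>
    #include <ctype.h>

    /* Hypothetical memcmp()-style comparator which ignores case and
       stops at the first NUL, for use with string keys. Returns <0,
       0, or >0, like memcmp(). */
    static int my_cmp_nocase( void const * lhs, void const * rhs, size_t n ) {
        unsigned char const * l = (unsigned char const *)lhs;
        unsigned char const * r = (unsigned char const *)rhs;
        size_t i;
        for( i = 0; i < n; ++i ) {
            int const d = tolower(l[i]) - tolower(r[i]);
            if( d ) return d;
            if( !l[i] ) break; /* both strings ended */
        }
        return 0;
    }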
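
On the wildcard matching: the example pattern above is glob-like. As a point of reference only (this is not whio_udb code), POSIX fnmatch(3) accepts the same kind of pattern:

    #include <fnmatch.h>
    #include <stdio.h>

    int main(void) {
        char const * pat = "[a-zA-Z]*.txt";
        /* Prints 1: 'N' matches [a-zA-Z] and "otes" matches '*'. */
        printf( "%d\n", 0 == fnmatch( pat, "Notes.txt", 0 ) );
        /* Prints 0: '1' does not match [a-zA-Z]. */
        printf( "%d\n", 0 == fnmatch( pat, "1.txt", 0 ) );
        return 0;
    }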

Primary Misfeatures

  • The data format will change as i continue to hack. Until it stabilizes, store your data somewhere else for safekeeping, so a new version doesn't lock your old data away in a file it cannot open. (Maybe i can encapsulate such changes in another API abstraction, allowing multiple versions to be supported?)
  • No storage locking yet. On the TODO list. The framework is in place, it just needs to be applied.
  • A given DB has a fixed maximum size for keys and values, and a fixed hashtable size. The only way to change the sizes is to create a new DB and copy the data into it. (The number of records in the hashtable is essentially unbounded, despite its fixed size.)
  • It is not a general-purpose hashtable, and cannot store heterogeneous types in one database. Each instance must be tailored/sized to fit specific needs.
  • Must be able to allocate memory in increments of the record size (key+data length plus a fixed amount for the encoding/decoding buffer), so it cannot use really huge records on really low-memory devices. It uses an internal memory allocator to recycle its record buffers, so it doesn't have to allocate very often (normally only one buffer is needed at a given time, and that buffer gets recycled).
  • It cannot yet physically remove unused blocks. That means, in effect, that it can grow but not shrink. It does not, however, need to allocate new record blocks unless it is 100% full.
  • Out of the box it does not support 64-bit storage. To get that support, the whio library must be compiled with WHIO_SIZE_T_BITS set to 64. Doing so, however, will change the sizes of most classes and waste a good deal of memory on 32-bit environments. It is completely untested on storage requiring more than 32 bits of addressing. Also, DBs created by builds with differing WHIO_SIZE_T_BITS values are not compatible!
  • Theoretically UDB can be used to ruin SD cards (and other storage with relatively short write-lifetimes) in a matter of seconds. Every insertion and removal requires writing to the db storage. If you insert or remove, say, 1000000 records in a DB stored on an SD card, the db will have to update its header one million times, and you could (theoretically) update the same block often enough to wear it out. Read-only mode should pose no problems on SD cards (it should perform quite well, actually).

HOWTO

See whio_udb_HOWTO.