
ACHTUNG: AS OF 20110523, THIS PAGE IS NOW MAINTAINED IN THE NEW WIKI: http://whiki.wanderinghorse.net/wikis/whio/?page=whio_epfs_iomodel

See also: whio_epfs

The whio_epfs I/O Model

whio_epfs is concerned with i/o on two distinct levels:

  • Back-end storage (normally a file, but any full-featured whio_dev object will do).
  • Access to the data within the EFS (embedded filesystem).

The first part is handled by the core whio_dev interface and the common implementations (e.g. file- and memory-based devices). The second part is performed by the EPFS engine, but access to the contents of an EFS is exposed to the client as yet another whio_dev implementation. Each "pseudofile" within the EFS container can be opened via the whio_epfs API and then be accessed/manipulated via the whio_dev API.
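
To illustrate, below is a minimal usage sketch. The whio_dev members it uses (write(), finalize()) are part of the public interface, but the name, signature, and mode flag of the open routine (whio_epfs_dev_open(), WHIO_MODE_RW) are assumptions made here for illustration; consult the whio_epfs API docs for the real declarations.

    #include <string.h> /* strlen() */

    /* Hypothetical sketch: open inode #inodeID as a whio_dev and
       write a string to it. whio_epfs_dev_open() and WHIO_MODE_RW
       are assumed names, not confirmed API. */
    int write_greeting( whio_epfs * fs, whio_epfs_id_t inodeID )
    {
        whio_dev * dev = NULL;
        int rc = whio_epfs_dev_open( fs, &dev, inodeID, WHIO_MODE_RW );
        if( 0 != rc ) return rc;
        char const * msg = "hi, pseudofile!";
        whio_size_t const len = (whio_size_t)strlen( msg );
        /* From here on only the generic whio_dev interface is needed: */
        rc = ( len == dev->api->write( dev, msg, len ) )
            ? 0 : whio_rc.IOError;
        dev->api->finalize( dev ); /* closes the handle */
        return rc;
    }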

The layers of abstraction involved are:

  • The underlying device storage, accessed via the whio_dev interface. This is normally a file.
  • An EFS instance primarily manages the layout of the underlying data format. It owns the underlying whio_dev storage and provides the internal framework off of which the pseudofile devices operate.
  • The "pseudofiles", exposed to the client as whio_dev instances, proxy i/o operations to the underlying storage.

In effect, using the EFS adds one layer of i/o indirection (but no additional copying/buffering) between the client and the underlying storage. All of the major algorithms for implementing the "redirected operations" require amortized O(1) time, so the real i/o penalty implied by the abstraction layer is relatively small and highly predictable. The obvious exceptions are (A) that reading/writing N bytes is O(N) and (B) that the library must, due to its internal object-oriented nature, "seek" the storage cursor very often.

EPFS has the following i/o-related components:

  • Each record in an EPFS is stored as an "inode", also called a "pseudofile", conceptually similar (but not quite identical) to a conventional filesystem inode. Each inode stores its logical size and the ID of its first data block (or 0 if it has no blocks). Inodes are numbered sequentially, starting at 1 (0 is reserved for the "not an inode" value). The number of inodes is fixed when an EFS is "formatted", and the maximum number of inodes depends on the size of the type whio_epfs_id_t. Notably, inodes do not have names associated with them, as far as the core library is concerned. See whio_epfs_namer for details.
  • Data blocks. Each data block has a fixed-sized metadata record and a client-determined number of data bytes. The metadata stores internal flags and the ID of the next block in an inode's block chain. Blocks are numbered sequentially, starting at 1 (0 is reserved for the "not a block" value). An EFS can be configured to have a fixed number of blocks or to add blocks as necessary. The absolute maximum number of blocks depends on the size of the type whio_epfs_id_t.
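
The metadata described above can be pictured roughly like the following sketch. The struct and field names are illustrative assumptions, not whio_epfs' actual declarations (e.g. the size of whio_epfs_id_t is a compile-time option):

    #include <stdint.h> /* uint8_t */

    /* Illustrative only: approximate shape of the per-inode
       metadata. Inode IDs start at 1; 0 means "not an inode". */
    typedef struct {
        whio_size_t    size;       /* logical size of the pseudofile */
        whio_epfs_id_t firstBlock; /* first data block, 0 = no blocks */
    } inode_sketch;

    /* Illustrative only: the per-block metadata record. Block IDs
       start at 1; 0 means "not a block". */
    typedef struct {
        uint8_t        flags;     /* internal flags */
        whio_epfs_id_t nextBlock; /* next block in the chain, 0 = end */
        /* ...followed by a client-determined number of data bytes... */
    } block_sketch;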

To perform i/o on an inode, the library requires that the inode be "opened," which is conceptually identical to opening a file. Each opened inode is managed by an internal handle class (whio_epfs_handle), which is indirectly exposed to the client via a whio_dev object. All i/o on the inode/handle goes through the i/o device object. The whio_dev object takes care of translating i/o coordinates into virtual data blocks within the EFS and performing the actual i/o on the underlying storage. Each handle has its own read/write mode but shares the inode-specific data with all other handles opened for that inode. The library allows an inode to be opened in read-only or read/write mode by an arbitrary number of handles (up to 255, anyway), but only one copy of that inode and its block chain is kept in memory. The block chain is kept in memory because it allows us to translate virtual-to-real i/o coordinates in amortized O(1) time as opposed to O(N).

When i/o is requested on a handle, several steps must be managed:

  • Check the access mode of the handle against the access mode of the request, i.e. don't allow a read-only handle to write.
  • Translate the requested i/o position into a block ID. Given a block's ID, figuring out its on-storage position is an O(N) operation (N = block count) for the first calculation and O(1) for subsequent operations, as long as they stay within the current virtual bounds of the inode. Write operations which expand the block chain are O(N) (N = number of new blocks). See the sketch after this list.
  • Find the given block and perform the i/o on it, wrapping around to the next data block if we hit a block boundary during a read or write (and expanding as needed for write requests).
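
As a concrete sketch of the translation step (the names and layout are assumptions, not whio_epfs internals): with a fixed per-block payload of blockSize bytes, a virtual position maps to a chain index and an in-block offset by simple division, and the cached chain array turns that index into a real block ID in O(1):

    /* Illustrative translation of a virtual i/o position into a
       (real block ID, in-block offset) pair. chain[] is the cached
       block chain for the inode; blockSize is the number of data
       bytes per block. */
    typedef struct {
        whio_epfs_id_t blockID; /* real block ID from the cached chain */
        whio_size_t    offset;  /* byte offset within that block */
    } epfs_pos_sketch;

    static epfs_pos_sketch translate_pos( whio_epfs_id_t const * chain,
                                          whio_size_t blockSize,
                                          whio_size_t virtualPos )
    {
        epfs_pos_sketch p;
        p.blockID = chain[ virtualPos / blockSize ]; /* O(1) lookup */
        p.offset  = virtualPos % blockSize;
        return p;
    }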

Some of the details which complicate the above:

  • Each handle has its own access mode, independent of the inode itself, but shares other inode data including the block chain. (A read-only EFS cannot open handles for read/write.)
  • The opened inode must be stored in memory and all handles which use it must refer to the same inode object, so that all handles see all changes to it. (Internal ownership becomes a bit problematic here, and that code needs to be refactored.)
  • The block chain for an opened inode must also be in memory, for cross-handle synchronization and performance reasons. sizeof(whio_epfs_block) is quite small (6 bytes on my machine), so the block chain (allocated as an array) does not take much memory. e.g. 100 blocks (that's a lot!) = approximately 600 bytes for managing the block chain.
  • Write requests may need to expand the block chain. This can lead to a whio_rc.DeviceFullError (if the EFS has no free blocks), a whio_rc.AllocError (if the in-memory block chain could not be expanded, i.e. an out-of-memory condition), or a whio_rc.IOError (if the underlying storage fails), as sketched below.
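
A hedged sketch of distinguishing those cases when growing a pseudofile (the error codes are the whio_rc members named above; the helper itself is hypothetical):

    /* Hypothetical helper: grow an opened pseudofile to newSize and
       map the documented failure modes. */
    int grow_pseudofile( whio_dev * dev, whio_off_t newSize )
    {
        int const rc = dev->api->truncate( dev, newSize );
        if( whio_rc.DeviceFullError == rc ) { /* EFS has no free blocks */ }
        else if( whio_rc.AllocError == rc ) { /* in-memory chain could not grow */ }
        else if( whio_rc.IOError == rc ) { /* underlying storage failed */ }
        return rc;
    }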

Two of the i/o operations, in particular, are critical to the engine:

  • Translating i/o offsets to logical block IDs.
  • Expanding the block chain on demand for write requests.

Because each block is a fixed size, the block table has a known/fixed position in the EFS storage; combined with the caching of the block chain, this makes the calculation of an i/o position O(1) in the average case. Loading the chain (done the first time an i/o translation is required) is O(N), and the coordinate translation algorithm gains a one-time N factor when adding N blocks via a write. Once a translated position is known and EFS blocks exist to hold the data, requests can be sent on to the EFS storage.
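
For example, because block sizes are fixed, the on-storage position of any block is plain arithmetic (the names below are illustrative; the real layout also accounts for the EFS header and inode table):

    /* Illustrative only: the on-storage offset of block #id (1-based),
       assuming the block table starts at tableOffset and each block
       record occupies metaSize + blockSize bytes. */
    static whio_size_t block_storage_offset( whio_size_t tableOffset,
                                             whio_size_t metaSize,
                                             whio_size_t blockSize,
                                             whio_epfs_id_t id )
    {
        return tableOffset + (whio_size_t)(id - 1) * (metaSize + blockSize);
    }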

Because the block chain for an inode is stored in memory (just the block IDs and linking info, not their actual contents), and because writes may cause us to add new blocks, we are forced to forgo the conventional wisdom which says never to allocate memory during i/o operations. While an allocation failure is, in practice, very unlikely, it is in fact quite possible if the memory pool has been configured too small and is configured to not fall back to malloc() when it fills up.

The smaller the EFS block size, the more blocks are needed to store the same amount of data. The block chains also become longer, which means more in-memory chain entries must be allocated for opened handles. If an EFS is configured with a very small block size, it is easy to cause an out-of-memory condition when the memory pool is configured as described above; in practice, that is about the only configuration where an OOM is at all likely. (i tend to use it because i have an overblown fascination with avoiding malloc().)
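
To put numbers on that, using the ~6-bytes-per-entry figure from above: storing 1MB in 256-byte blocks requires 4096 blocks, i.e. roughly 24k of chain memory per opened inode, whereas 8k blocks need only 128 chain entries (under 1k).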

A side-effect of the (potential) allocation-during-write is that the majority of virtual output errors an EFS will encounter will happen because a block chain could not be expanded. (At least in my experience.) These tend to show up as truncation errors because we internally grow the device by using the whio_dev::truncate() operation of the pseudofile's proxy i/o device.

When we want to import external data into an inode and we know the data's size in advance, it is a good idea to truncate the destination device to the desired length first. That gets all memory allocation out of the way up front and guarantees that the data-block space is available for the import. If the truncation fails, the likely causes are that there are not enough free blocks in the EFS, that new blocks could not be added, or that allocation of the in-memory block chain failed.
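
A sketch of that approach (the helper name is hypothetical; the whio_dev members it uses are part of the public interface):

    #include <stdio.h> /* SEEK_SET */

    /* Hypothetical helper: import a buffer of known size into an
       opened pseudofile, truncating first so that all block and
       memory allocation happens up front. Error checking of the
       seek is elided for brevity. */
    int import_buffer( whio_dev * dev, void const * src, whio_size_t n )
    {
        int const rc = dev->api->truncate( dev, (whio_off_t)n );
        if( 0 != rc ) return rc; /* out of blocks or memory? */
        dev->api->seek( dev, 0, SEEK_SET );
        return ( n == dev->api->write( dev, src, n ) )
            ? 0 : whio_rc.IOError;
    }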