Internals of nffs

Disk structure

On disk, each area is prefixed with the following header:

/** On-disk representation of an area header. */
struct nffs_disk_area {
    uint32_t nda_magic[4];  /* NFFS_AREA_MAGIC{0,1,2,3} */
    uint32_t nda_length;    /* Total size of area, in bytes. */
    uint8_t nda_ver;        /* Current nffs version: 0 */
    uint8_t nda_gc_seq;     /* Garbage collection count. */
    uint8_t reserved8;
    uint8_t nda_id;         /* 0xff if scratch area. */
};

Beyond its header, an area contains a sequence of disk objects, representing the contents of the file system. There are two types of objects: inodes and data blocks. An inode represents a file or directory; a data block represents part of a file's contents.

/** On-disk representation of an inode (file or directory). */
struct nffs_disk_inode {
    uint32_t ndi_magic;         /* NFFS_INODE_MAGIC */
    uint32_t ndi_id;            /* Unique object ID. */
    uint32_t ndi_seq;           /* Sequence number; greater supersedes
                                   lesser. */
    uint32_t ndi_parent_id;     /* Object ID of parent directory inode. */
    uint8_t reserved8;
    uint8_t ndi_filename_len;   /* Length of filename, in bytes. */
    uint16_t ndi_crc16;         /* Covers rest of header and filename. */
    /* Followed by filename. */
};

An inode filename's length cannot exceed 256 bytes. The filename is not null-terminated. The following ASCII characters are not allowed in a filename:

/ (slash character)
\0 (NUL character)

/** On-disk representation of a data block. */
struct nffs_disk_block {
    uint32_t ndb_magic;     /* NFFS_BLOCK_MAGIC */
    uint32_t ndb_id;        /* Unique object ID. */
    uint32_t ndb_seq;       /* Sequence number; greater supersedes lesser. */
    uint32_t ndb_inode_id;  /* Object ID of owning inode. */
    uint32_t ndb_prev_id;   /* Object ID of previous block in file;
                               NFFS_ID_NONE if this is the first block. */
    uint16_t ndb_data_len;  /* Length of data contents, in bytes. */
    uint16_t ndb_crc16;     /* Covers rest of header and data. */
    /* Followed by 'ndb_data_len' bytes of data. */
};

Each data block contains the ID of the previous data block in the file. Together, the set of blocks in a file form a reverse singly-linked list.

The maximum number of data bytes that a block can contain is determined at initialization-time. The result is the greatest number which satisfies all of the following restrictions:

No more than 2048.
At least two maximum-sized blocks can fit in the smallest area.

The 2048 number was chosen somewhat arbitrarily, and may change in the future.

ID space

All disk objects have a unique 32-bit ID. The ID space is partitioned as follows:

ID range	Node type
0x00000000 - 0x0fffffff	Directory inodes
0x10000000 - 0x7fffffff	File inodes
0x80000000 - 0xfffffffe	Data blocks
0xffffffff	Reserved (NFFS_ID_NONE)

Scratch area

A valid nffs file system must contain a single “scratch area.” The scratch area does not contain any objects of its own, and is only used during garbage collection. The scratch area must have a size greater than or equal to each of the other areas in flash.

RAM representation

Every object in the file system is stored in a 256-entry hash table. An object's hash key is derived from its 32-bit ID. Each list in the hash table is sorted by time of use; most-recently-used is at the front of the list. All objects are represented by the following structure:

/**
 * What gets stored in the hash table.  Each entry represents a data block or
 * an inode.
 */
struct nffs_hash_entry {
    SLIST_ENTRY(nffs_hash_entry) nhe_next;
    uint32_t nhe_id;        /* 0 - 0x7fffffff if inode; else if block. */
    uint32_t nhe_flash_loc; /* Upper-byte = area idx; rest = area offset. */
};

For each data block, the above structure is all that is stored in RAM. To acquire more information about a data block, the block header must be read from flash.

Inodes require a fuller RAM representation to capture the structure of the file system. There are two types of inodes: files and directories. Each inode hash entry is actually an instance of the following structure:

/** Each inode hash entry is actually one of these. */
struct nffs_inode_entry {
    struct nffs_hash_entry nie_hash_entry;
    SLIST_ENTRY(nffs_inode_entry) nie_sibling_next;
    union {
        struct nffs_inode_list nie_child_list;           /* If directory */
        struct nffs_hash_entry *nie_last_block_entry;    /* If file */
    };
    uint8_t nie_refcnt;
};

A directory inode contains a list of its child files and directories (fie_child_list). These entries are sorted alphabetically using the ASCII character set.

A file inode contains a pointer to the last data block in the file (nie_last_block_entry). For most file operations, the reversed block list must be walked backwards. This introduces a number of speed inefficiencies:

All data blocks must be read to determine the length of the file.
Data blocks often need to be processed sequentially. The reversed nature of the block list transforms this from linear time to an O(n^2) operation.

Furthermore, obtaining information about any constituent data block requires a separate flash read.

Inode cache and Data Block cache

The speed issues are addressed by a pair of caches. Cached inodes entries contain the file length and a much more convenient doubly-linked list of cached data blocks. The benefit of using caches is that the size of the caches need not be proportional to the size of the file system. In other words, caches can address speed efficiency concerns without negatively impacting the file system's scalability.

nffs requires both caches during normal operation, so it is not possible to disable them. However, the cache sizes are configurable, and both caches can be configured with a size of one if RAM usage must be minimized.

The following data structures are used in the inode and data block caches.

/** Full data block representation; not stored permanently in RAM. */
struct nffs_block {
    struct nffs_hash_entry *nb_hash_entry;   /* Points to real block entry. */
    uint32_t nb_seq;                         /* Sequence number; greater
                                                supersedes lesser. */
    struct nffs_inode_entry *nb_inode_entry; /* Owning inode. */
    struct nffs_hash_entry *nb_prev;         /* Previous block in file. */
    uint16_t nb_data_len;                    /* # of data bytes in block. */
    uint16_t reserved16;
};

/** Represents a single cached data block. */
struct nffs_cache_block {
    TAILQ_ENTRY(nffs_cache_block) ncb_link; /* Next / prev cached block. */
    struct nffs_block ncb_block;            /* Full data block. */
    uint32_t ncb_file_offset;               /* File offset of this block. */
};

/** Full inode representation; not stored permanently in RAM. */
struct nffs_inode {
    struct nffs_inode_entry *ni_inode_entry; /* Points to real inode entry. */
    uint32_t ni_seq;                         /* Sequence number; greater
                                                supersedes lesser. */
    struct nffs_inode_entry *ni_parent;      /* Points to parent directory. */
    uint8_t ni_filename_len;                 /* # chars in filename. */
    uint8_t ni_filename[NFFS_SHORT_FILENAME_LEN]; /* First 3 bytes. */
};

/** Doubly-linked tail queue of cached blocks; contained in cached inodes. */
TAILQ_HEAD(nffs_block_cache_list, nffs_block_cache_entry);

/** Represents a single cached file inode. */
struct nffs_cache_inode {
    TAILQ_ENTRY(nffs_cache_inode) nci_link;        /* Sorted; LRU at tail. */
    struct nffs_inode nci_inode;                   /* Full inode. */
    struct nffs_cache_block_list nci_block_list;   /* List of cached blocks. */
    uint32_t nci_file_size;                        /* Total file size. */
};

Only file inodes are cached; directory inodes are never cached.

Within a cached inode, all cached data blocks are contiguous. E.g., if the start and end of a file are cached, then the middle must also be cached. A data block is only cached if its owning file is also cached.

Internally, cached inodes are stored in a singly-linked list, ordered by time of use. The most-recently-used entry is the first element in the list. If a new inode needs to be cached, but the inode cache is full, the least-recently-used entry is freed to make room for the new one. The following operations cause an inode to be cached:

Querying a file's length.
Seeking within a file.
Reading from a file.
Writing to a file.

The following operations cause a data block to be cached:

Reading from the block.
Writing to the block.

If one of the above operations is applied to a data block that is not currently cached, nffs uses the following procedure to cache the necessary block:

If none of the owning inode‘s blocks are currently cached, allocate a cached block entry corresponding to the requested block and insert it into the inode’s list.
Else if the requested file offset is less than that of the first cached block, bridge the gap between the inode's sequence of cached blocks and the block that now needs to be cached. This is accomplished by caching each block in the gap, finishing with the requested block.
Else (the requested offset is beyond the end of the cache),
1. If the requested offset belongs to the block that immediately follows the end of the cache, cache the block and append it to the list.
2. Else, clear the cache, and populate it with the single entry corresponding to the requested block.

If the system is unable to allocate a cached block entry at any point during the above procedure, the system frees up other blocks currently in the cache. This is accomplished as follows:

Iterate the inode cache in reverse (i.e., start with the least-recently-used entry). For each entry:
1. If the entry's cached block list is empty, advance to the next entry.
2. Else, free all the cached blocks in the entry's list.

Because the system imposes a minimum block cache size of one, the above procedure will always reclaim at least one cache block entry. The above procedure may result in the freeing of the block list that belongs to the very inode being operated on. This is OK, as the final block to get cached is always the block being requested.

Detection

The file system detection process consists of scanning a specified set of flash regions for valid nffs areas, and then populating the RAM representation of the file system with the detected objects. Detection is initiated with the nffs_detect() function.

Not every area descriptor passed to nffs_detect() needs to reference a valid nffs area. Detection is successful as long as a complete file system is detected somewhere in the specified regions of flash. If an application is unsure where a file system might be located, it can initiate detection across the entire flash region.

A detected file system is valid if:

At least one non-scratch area is present.
At least one scratch area is present (only the first gets used if there is more than one).
The root directory inode is present.

During detection, each indicated region of flash is checked for a valid area header. The contents of each valid non-scratch area are then restored into the nffs RAM representation. The following procedure is applied to each object in the area:

Verify the object's integrity via a crc16 check. If invalid, the object is discarded and the procedure restarts on the next object in the area.
Convert the disk object into its corresponding RAM representation and insert it into the hash table. If the object is an inode, its reference count is initialized to 1, indicating ownership by its parent directory.
If an object with the same ID is already present, then one supersedes the other. Accept the object with the greater sequence number and discard the other.
If the object references a nonexistent inode (parent directory in the case of an inode; owning file in the case of a data block), insert a temporary “dummy” inode into the hash table so that inter-object links can be maintained until the absent inode is eventually restored. Dummy inodes are identified by a reference count of 0.
If a delete record for an inode is encountered, the inode's parent pointer is set to null to indicate that it should be removed from RAM.

If nffs encounters an object that cannot be identified (i.e., its magic number is not valid), it scans the remainder of the flash area for the next valid magic number. Upon encountering a valid object, nffs resumes the procedure described above.

After all areas have been restored, a sweep is performed across the entire RAM representation so that invalid inodes can be deleted from memory.

For each directory inode:

If its reference count is 0 (i.e., it is a dummy), migrate its children to the /lost+found directory, and delete it from the RAM representation. This should only happen in the case of file system corruption.
If its parent reference is null (i.e., it was deleted), delete it and all its children from the RAM representation.

For each file inode:

If its reference count is 0 (i.e., it is a dummy), delete it from the RAM representation. This should only happen in the case of file system corruption. (We should try to migrate the file to the lost+found directory in this case, as mentioned in the todo section).

When an object is deleted during this sweep, it is only deleted from the RAM representation; nothing is written to disk.

When objects are migrated to the lost+found directory, their parent inode reference is permanently updated on the disk.

In addition, a single scratch area is identified during the detection process. The first area whose fda_id value is set to 0xff is designated as the file system scratch area. If no valid scratch area is found, the cause could be that the system was restarted while a garbage collection cycle was in progress. Such a condition is identified by the presence of two areas with the same ID. In such a case, the shorter of the two areas is erased and designated as the scratch area.

Formatting

A new nffs file system is created via formatting. Formatting is achieved via the nffs_format() function.

During a successful format, an area header is written to each of the specified locations. One of the areas in the set is designated as the initial scratch area.

Flash writes

The nffs implementation always writes in a strictly sequential fashion within an area. For each area, the system keeps track of the current offset. Whenever an object gets written to an area, it gets written to that area‘s current offset, and the offset is increased by the object’s disk size.

When a write needs to be performed, the nffs implementation selects the appropriate destination area by iterating though each area until one with sufficient free space is encountered.

There is no write buffering. Each call to a write function results in a write operation being sent to the flash hardware.

New objects

Whenever a new object is written to disk, it is assigned the following properties:

ID: A unique value is selected from the 32-bit ID space, as appropriate for the object's type.
Sequence number: 0

When a new file or directory is created, a corresponding inode is written to flash. Likewise, a new data block also results in the writing of a corresponding disk object.

Moving/Renaming files and directories

When a file or directory is moved or renamed, its corresponding inode is rewritten to flash with the following properties:

ID: Unchanged
Sequence number: Previous value plus one.
Parent inode: As specified by the move / rename operation.
Filename: As specified by the move / rename operation.

Because the inode's ID is unchanged, all dependent objects remain valid.

Unlinking files and directories

When a file or directory is unlinked from its parent directory, a deletion record for the unlinked inode gets written to flash. The deletion record is an inode with the following properties:

ID: Unchanged
Sequence number: Previous value plus one.
Parent inode ID: NFFS_ID_NONE

When an inode is unlinked, no deletion records need to be written for the inode's dependent objects (constituent data blocks or child inodes). During the next file system detection, it is recognized that the objects belong to a deleted inode, so they are not restored into the RAM representation.

If a file has an open handle at the time it gets unlinked, application code can continued to use the file handle to read and write data. All files retain a reference count, and a file isn't deleted from the RAM representation until its reference code drops to 0. Any attempt to open an unlinked file fails, even if the file is referenced by other file handles.

Writing to a file

The following procedure is used whenever the application code writes to a file. First, if the write operation specifies too much data to fit into a single block, the operation is split into several separate write operations. Then, for each write operation:

Determine which existing blocks the write operation overlaps (n = number of overwritten blocks).
If n = 0, this is an append operation. Write a data block with the following properties:
- ID: New unique value.
- Sequence number: 0.
Else (n > 1), this write overlaps existing data.
1. For each block in [1, 2, ... n-1], write a new block containing the updated contents. Each new block supersedes the block it overwrites. That is, each block has the following properties:
  - ID: Unchanged
  - Sequence number: Previous value plus one.
2. Write the nth block. The nth block includes all appended data, if any. As with the other blocks, its ID is unchanged and its sequence number is incremented.

Appended data can only be written to the end of the file. That is, “holes” are not supported.

Garbage collection

When the file system is too full to accommodate a write operation, the system must perform garbage collection to make room. The garbage collection procedure is described below:

The non-scratch area with the lowest garbage collection sequence number is selected as the “source area.” If there are other areas with the same sequence number, the one with the smallest flash offset is selected.
The source area‘s ID is written to the scratch area’s header, transforming it into a non-scratch ID. This former scratch area is now known as the “destination area.”
The RAM representation is exhaustively searched for collectible objects. The following procedure is applied to each inode in the system:
- If the inode is resident in the source area, copy the inode record to the destination area.
- If the inode is a file inode, walk the inode's list of data blocks, starting with the last block in the file. Each block that is resident in the source area is copied to the destination area. If there is a run of two or more blocks that are resident in the source area, they are consolidated and copied to the destination area as a single new block (subject to the maximum block size restriction).
The source area is reformatted as a scratch sector (i.e., is is fully erased, and its header is rewritten with an ID of 0xff). The area's garbage collection sequence number is incremented prior to rewriting the header. This area is now the new scratch sector.