Documentation/components/filesystem/mnemofs.rst - nuttx - Git at Google

 =======
 MNEMOFS
 =======

 Mnemofs is a NAND Flash File System built for NuttX.

 Usage
 =====

 If there's a NAND flash available at a location, for example, ``/dev/nand``,
 you can mount it with ``mnemofs`` to a location like ``/mydir`` using::

     mount -t mnemofs /dev/nand /mydir

 The above command will only work if the device was already formatted using
 mnemofs. For a brand new device, or if you want to switch from an existing
 file system, this won't work, and would need a format::

     mount -t mnemofs -o forceformat /dev/nand /mydir

 Unsure of whether you need to do a format? This will help::

     mount -t mnemofs -o autoformat /dev/nand /mydir

 This will format the device only if it can not detect mnemofs being already
 formatted onto it. Do note this includes cases where mnemofs is formatted to
 the device, but it's been mutilated to the point of being unrecognizable.

 After this, use it like a regular file system. That's the job of a file
 system after all...to hide the storage device's peculiarities behind an
 abstraction. A file system is considered good if you don't have to think
 about its existence during regular usage.

 NAND Flashes
 ============

 Programmatically, the NAND flash has some problems. The whole device can be
 condensed into three layers: blocks, pages and cells.

 Cells represent the smallest unit of storage in NAND flashes, but are often
 ignored, as direct access is not allowed. If a cell stores one bit, it's a
 Single Level Cell. There are MLC, TLC, etc. for more bits per cell. Often,
 the more bits per cell, the lesser is the wear resilience. Thus, higher
 bits per cell are easier to wear out and become unreliable.

 Pages are the smallest readable or writable unit of the NAND flash. It's
 made up of several cells, and can be expected to have a size of the similar
 order of 512 B.

 Blocks are the smallest erasable unit of NAND flash. They are made up of
 several pages. If a page is already written, it needs to be erased before it
 can be written again. And since blocks are the smallest erasable unit, the
 entire block needs to be erased if the user wants to update the contents of
 one page.

 The erase operation is what causes a block to wear out. If a block is worn
 out too much, it will lose its ability to reliably store data. An unreliable
 block can not guarantee that the data read from the pages in it is the same
 as what was written to it. This state is called as a bad block.

 A manufacturer can also deem a block to be unreliable from their testing,
 and can mark them as bad blocks right from manufacture.

 A good file system will aim to level out the wear between blocks as much as
 it can.

 Design
 ======

 There are various layers and components in mnemofs, and they interact with
 various layers on abstraction over each other.

 Mnemofs works on a Copy-On-Write (CoW) basis, which means, if a page needs to
 be updated, it is copied over in memory, and then the change is applied, and
 the new data is written to a new location.

 R/W Layer
 ---------

 This works with the NAND flash device driver directly. It can write an
 entire page, read an entire page, erase an entire block, check if a block is
 bad (from it's bad block marker), or set a block as bad. It's the simplest
 layer.

 Block Allocator
 ---------------

 The block allocator contains two arrays. One is a bit mask is for tracking
 the free pages, while the other is an array of numbers, one number for each
 block, denoting the number of pages in that block that are ready to be
 erased.

 The block allocator allocates pages or blocks in a sequential manner to keep
 it fair for all pages, thus, ensuring wear levelling. It also starts from a
 random offset to prevent bias to the front of the device in case of multiple
 power losses and reinitialization in such casses. If a block is required it
 skips pages to the start of the next block. Since block allocations happen
 only in the journal, they happen in bulk and the number of skipped pages is
 very minimal.

 Once reaching the end of the device, it cycles back to the front. Thus any
 skipped pages get the chance to be allocated in the next cycle.

 CTZ Layer
 ---------

 This works with the R/W Layer, and acts as an abstraction layer for other
 components in mnemofs. Mnemofs uses
 `CTZ lists <https://github.com/littlefs-project/littlefs/blob/master/DESIGN.md#ctz-skip-lists>`_
 to represent both files and directories in flash. CTZ lists of files contain
 only the data of the file, while CTZ lists of directories contain directory
 entries (direntries) for each FS Object (file or directory) grouped into it.

 This layer abstracts away the complex division of flash space that's present
 in CTZ skip lists, and allows users of this layer to not worry about the
 complexities of a CTZ skip list, and in fact, to feel that the data is like a
 contiguous space.

 This layer allows the user to specify a data offset, which refers to the
 offset into the actual data stored in the CTZ skip list (ie. excluding the
 pointers), and the number of bytes, and perform operations on the CTZ list
 almost like if it were a single array.

 In mnemofs, each CTZ block takes up the space of exactly 1 page, and each
 pointer takes up 4 bytes.

 Littlefs design document shows how a CTZ list can be identified using the
 index of the last CTZ block, and the page number of that CTZ block.

 Journal
 -------

 The journal in mnemofs is made out of ``n + 2`` blocks. The last two block
 concern the master node. These blocks are arranged in a singly linked list.

 Due to CoW policy, when a CTZ list is updated, it now has a new location. The
 first ``n`` blocks of the journal is responsible for storing logs containing
 information about this very update. It will contain the old location of the
 CTZ skip list, and the new location.

 Thus, when the user requires an updated location of a CTZ list, they will
 first find the old location by traversing the FS tree in the flash, and then
 will traverse the journal to find the latest location.

 So, the FS tree on the flash acts like a "base" state with updates stored in
 the journal. Each log in journal is followed by a checksum to verify if all
 of it was written properly. This helps in making it power loss resilient.

 The journal, when combined with CoW, plays another important role. In pure
 CoW, any update to a CTZ file will result in it having a new location. This
 new location will need to be updated in the parent, which itself will have a
 new location after the update, and so on till it reaches the root. The
 journal stops this propagation immediately. When the journal is full above
 a certain limit, it will flush, and apply all of these changes to the FS
 tree in one go. This helps in wear reduction.

 The journal mainly works with the CTZ layer, and any updates to a CTZ list
 using this layer automatically adds a log for it in the journal.

 The journal starts with a magic sequence, then the number of blocks in the
 journal (excluding master blocks), and then follows an array with the block
 numbers of the blocks in the journal (including the master blocks). Following
 this, logs are stored in the blocks.

 Master Node and Root
 --------------------

 The root of the file system is treated like any directory as far as its
 storage on the flash is concerned. This is because the master node acts as a
 parent to the root, and contains information of the root in a way identical
 to direntries.

 The master node is stored in the master blocks. There are two master blocks,
 and both are duplicated of each other for backup. Each master block is a
 block, and thus have multiple pages in them. Each page contains one revision
 of the master node. The master nodes are written sequentially, and have a
 timestamp on them as well.

 When a CTZ list is moved to a new location, the obsolete pages of the old
 CTZ list are marked for deletion.

 LRU Cache
 ---------

 Mnemofs has a Least Recently Used (LRU) Cache component. The main use of this
 component is to reduce wear on the flash at the expense of memory.

 The LRU is a kernel list of nodes. Each node represents an FS object. Each
 node also contains a kernel list of deltas. Each delta contains information
 about an update or deletion from the user (which is what all of the VFS write
 operations can be condensed to).

 There's a pre-configured limit for both deltas per node and nodes in the LRU.

 If the delta list in a node is full, and another is to be added, all the
 existing deltas in the list are clubbed together and written to the flash
 using the CTZ layer. The layer also automatically adds a log for this update.
 When a node receives a delta, it is bumped from its current location in the
 LRU to be at the front. This way, the last node in the LRU is always the
 least used node.

 If the node limit is reached in the LRU, and a new node is to be added to the
 LRU, then the final node (which is also the least recently used node), is
 popped from the LRU to make space for the new node. This popped node is then
 written to the flash using the CTZ layer as well.

 The LRU helps in clubbing updates to a single FS object and thus helps in
 reducing the wear of the flash.

 Journal Flush
 -------------

 The latest master node revision is the most useful out of the revisions. As
 in CoW it's prudent to update the FS tree from bottom up, the root is the
 last one to get updated in the case of a journal flush.

 The logs are just location updates. So, when a journal flush occurs, it will
 update the locations of all the children in the parent. This updates the
 parent, and then this update goes through the same procedure as any other
 update.

 This is why it's best to start the flush operation when the journal is filled
 up over a certain limit, instead of waiting it to be full. Why? Any log of
 a parent makes any log of its children written **before** it useless, as the
 updated location of the parent can be read to get the updated location of the
 child till that point in the logs.

 So, it will be best to move up the FS tree from bottom during update and
 update the root last, since the root is the parent of every FS object.

 Once the root is updated, all other journal logs become useless, and can be
 erased. The root's log is not written in the first ``n`` blocks of the
 journal, but written as a new master node entry in the master blocks.

 Once the new root is written, the first ``n`` blocks can be erased, and
 reallocated (due to the rules of wear levelling). The master blocks however
 have some conditions for reallocation. This is called moving of the journal.

 Every time the first ``n`` blocks are cleared, a new master node is added.
 The only time a master block needs to be erased is when it becomes full.
 Thus, if there are ``p`` pages in a block, the master blocks will be
 moved along with the rest of the journal for every ``p`` journal flushes.

 Before the new master node is written, none of the old pages should be erased
 to allow rollback to the previous FS tree state. The moment the new master
 node is updated, any block which has all of the pages in it ready for
 deletion will be erased to make space.

 FS Object Layer
 ---------------

 This layer provides an abstraction for iterating, adding, deleting or
 reading direntries.

 This works with the LRU and the journal to get the latest data and thus the
 user of this layer does not have to worry about these underlying mnemofs
 components.

 VFS Method Layer
 ----------------

 VFS method layer contains methods exposed to the VFS. This layer works with
 the FS Object layer for direntry related tasks, or the LRU for file level
 read/write tasks.
	=======
	MNEMOFS
	=======

	Mnemofs is a NAND Flash File System built for NuttX.

	Usage
	=====

	If there's a NAND flash available at a location, for example, ``/dev/nand``,
	you can mount it with ``mnemofs`` to a location like ``/mydir`` using::

	mount -t mnemofs /dev/nand /mydir

	The above command will only work if the device was already formatted using
	mnemofs. For a brand new device, or if you want to switch from an existing
	file system, this won't work, and would need a format::

	mount -t mnemofs -o forceformat /dev/nand /mydir

	Unsure of whether you need to do a format? This will help::

	mount -t mnemofs -o autoformat /dev/nand /mydir

	This will format the device only if it can not detect mnemofs being already
	formatted onto it. Do note this includes cases where mnemofs is formatted to
	the device, but it's been mutilated to the point of being unrecognizable.

	After this, use it like a regular file system. That's the job of a file
	system after all...to hide the storage device's peculiarities behind an
	abstraction. A file system is considered good if you don't have to think
	about its existence during regular usage.

	NAND Flashes
	============

	Programmatically, the NAND flash has some problems. The whole device can be
	condensed into three layers: blocks, pages and cells.

	Cells represent the smallest unit of storage in NAND flashes, but are often
	ignored, as direct access is not allowed. If a cell stores one bit, it's a
	Single Level Cell. There are MLC, TLC, etc. for more bits per cell. Often,
	the more bits per cell, the lesser is the wear resilience. Thus, higher
	bits per cell are easier to wear out and become unreliable.

	Pages are the smallest readable or writable unit of the NAND flash. It's
	made up of several cells, and can be expected to have a size of the similar
	order of 512 B.

	Blocks are the smallest erasable unit of NAND flash. They are made up of
	several pages. If a page is already written, it needs to be erased before it
	can be written again. And since blocks are the smallest erasable unit, the
	entire block needs to be erased if the user wants to update the contents of
	one page.

	The erase operation is what causes a block to wear out. If a block is worn
	out too much, it will lose its ability to reliably store data. An unreliable
	block can not guarantee that the data read from the pages in it is the same
	as what was written to it. This state is called as a bad block.

	A manufacturer can also deem a block to be unreliable from their testing,
	and can mark them as bad blocks right from manufacture.

	A good file system will aim to level out the wear between blocks as much as
	it can.

	Design
	======

	There are various layers and components in mnemofs, and they interact with
	various layers on abstraction over each other.

	Mnemofs works on a Copy-On-Write (CoW) basis, which means, if a page needs to
	be updated, it is copied over in memory, and then the change is applied, and
	the new data is written to a new location.

	R/W Layer
	---------

	This works with the NAND flash device driver directly. It can write an
	entire page, read an entire page, erase an entire block, check if a block is
	bad (from it's bad block marker), or set a block as bad. It's the simplest
	layer.

	Block Allocator
	---------------

	The block allocator contains two arrays. One is a bit mask is for tracking
	the free pages, while the other is an array of numbers, one number for each
	block, denoting the number of pages in that block that are ready to be
	erased.

	The block allocator allocates pages or blocks in a sequential manner to keep
	it fair for all pages, thus, ensuring wear levelling. It also starts from a
	random offset to prevent bias to the front of the device in case of multiple
	power losses and reinitialization in such casses. If a block is required it
	skips pages to the start of the next block. Since block allocations happen
	only in the journal, they happen in bulk and the number of skipped pages is
	very minimal.

	Once reaching the end of the device, it cycles back to the front. Thus any
	skipped pages get the chance to be allocated in the next cycle.

	CTZ Layer
	---------

	This works with the R/W Layer, and acts as an abstraction layer for other
	components in mnemofs. Mnemofs uses
	`CTZ lists <https://github.com/littlefs-project/littlefs/blob/master/DESIGN.md#ctz-skip-lists>`_
	to represent both files and directories in flash. CTZ lists of files contain
	only the data of the file, while CTZ lists of directories contain directory
	entries (direntries) for each FS Object (file or directory) grouped into it.

	This layer abstracts away the complex division of flash space that's present
	in CTZ skip lists, and allows users of this layer to not worry about the
	complexities of a CTZ skip list, and in fact, to feel that the data is like a
	contiguous space.

	This layer allows the user to specify a data offset, which refers to the
	offset into the actual data stored in the CTZ skip list (ie. excluding the
	pointers), and the number of bytes, and perform operations on the CTZ list
	almost like if it were a single array.

	In mnemofs, each CTZ block takes up the space of exactly 1 page, and each
	pointer takes up 4 bytes.

	Littlefs design document shows how a CTZ list can be identified using the
	index of the last CTZ block, and the page number of that CTZ block.

	Journal
	-------

	The journal in mnemofs is made out of ``n + 2`` blocks. The last two block
	concern the master node. These blocks are arranged in a singly linked list.

	Due to CoW policy, when a CTZ list is updated, it now has a new location. The
	first ``n`` blocks of the journal is responsible for storing logs containing
	information about this very update. It will contain the old location of the
	CTZ skip list, and the new location.

	Thus, when the user requires an updated location of a CTZ list, they will
	first find the old location by traversing the FS tree in the flash, and then
	will traverse the journal to find the latest location.

	So, the FS tree on the flash acts like a "base" state with updates stored in
	the journal. Each log in journal is followed by a checksum to verify if all
	of it was written properly. This helps in making it power loss resilient.

	The journal, when combined with CoW, plays another important role. In pure
	CoW, any update to a CTZ file will result in it having a new location. This
	new location will need to be updated in the parent, which itself will have a
	new location after the update, and so on till it reaches the root. The
	journal stops this propagation immediately. When the journal is full above
	a certain limit, it will flush, and apply all of these changes to the FS
	tree in one go. This helps in wear reduction.

	The journal mainly works with the CTZ layer, and any updates to a CTZ list
	using this layer automatically adds a log for it in the journal.

	The journal starts with a magic sequence, then the number of blocks in the
	journal (excluding master blocks), and then follows an array with the block
	numbers of the blocks in the journal (including the master blocks). Following
	this, logs are stored in the blocks.

	Master Node and Root
	--------------------

	The root of the file system is treated like any directory as far as its
	storage on the flash is concerned. This is because the master node acts as a
	parent to the root, and contains information of the root in a way identical
	to direntries.

	The master node is stored in the master blocks. There are two master blocks,
	and both are duplicated of each other for backup. Each master block is a
	block, and thus have multiple pages in them. Each page contains one revision
	of the master node. The master nodes are written sequentially, and have a
	timestamp on them as well.

	When a CTZ list is moved to a new location, the obsolete pages of the old
	CTZ list are marked for deletion.

	LRU Cache
	---------

	Mnemofs has a Least Recently Used (LRU) Cache component. The main use of this
	component is to reduce wear on the flash at the expense of memory.

	The LRU is a kernel list of nodes. Each node represents an FS object. Each
	node also contains a kernel list of deltas. Each delta contains information
	about an update or deletion from the user (which is what all of the VFS write
	operations can be condensed to).

	There's a pre-configured limit for both deltas per node and nodes in the LRU.

	If the delta list in a node is full, and another is to be added, all the
	existing deltas in the list are clubbed together and written to the flash
	using the CTZ layer. The layer also automatically adds a log for this update.
	When a node receives a delta, it is bumped from its current location in the
	LRU to be at the front. This way, the last node in the LRU is always the
	least used node.

	If the node limit is reached in the LRU, and a new node is to be added to the
	LRU, then the final node (which is also the least recently used node), is
	popped from the LRU to make space for the new node. This popped node is then
	written to the flash using the CTZ layer as well.

	The LRU helps in clubbing updates to a single FS object and thus helps in
	reducing the wear of the flash.

	Journal Flush
	-------------

	The latest master node revision is the most useful out of the revisions. As
	in CoW it's prudent to update the FS tree from bottom up, the root is the
	last one to get updated in the case of a journal flush.

	The logs are just location updates. So, when a journal flush occurs, it will
	update the locations of all the children in the parent. This updates the
	parent, and then this update goes through the same procedure as any other
	update.

	This is why it's best to start the flush operation when the journal is filled
	up over a certain limit, instead of waiting it to be full. Why? Any log of
	a parent makes any log of its children written before it useless, as the
	updated location of the parent can be read to get the updated location of the
	child till that point in the logs.

	So, it will be best to move up the FS tree from bottom during update and
	update the root last, since the root is the parent of every FS object.

	Once the root is updated, all other journal logs become useless, and can be
	erased. The root's log is not written in the first ``n`` blocks of the
	journal, but written as a new master node entry in the master blocks.

	Once the new root is written, the first ``n`` blocks can be erased, and
	reallocated (due to the rules of wear levelling). The master blocks however
	have some conditions for reallocation. This is called moving of the journal.

	Every time the first ``n`` blocks are cleared, a new master node is added.
	The only time a master block needs to be erased is when it becomes full.
	Thus, if there are ``p`` pages in a block, the master blocks will be
	moved along with the rest of the journal for every ``p`` journal flushes.

	Before the new master node is written, none of the old pages should be erased
	to allow rollback to the previous FS tree state. The moment the new master
	node is updated, any block which has all of the pages in it ready for
	deletion will be erased to make space.

	FS Object Layer
	---------------

	This layer provides an abstraction for iterating, adding, deleting or
	reading direntries.

	This works with the LRU and the journal to get the latest data and thus the
	user of this layer does not have to worry about these underlying mnemofs
	components.

	VFS Method Layer
	----------------

	VFS method layer contains methods exposed to the VFS. This layer works with
	the FS Object layer for direntry related tasks, or the LRU for file level
	read/write tasks.