Hi, it's a bit long, so I haven't read it all yet, but I would like to know
whether it collides with my ongoing feature "btrfs hot relocation/migration"?

On Thu, Feb 21, 2013 at 12:46 AM, Matias Bjorling <mabj@xxxxxx> wrote:
> Here is a short proposal for the hybrid storage cache idea, with an
> introduction/motivation and a bird's eye view of an approach to implementing
> a hybrid storage cache for btrfs. Please note that there are currently no
> patches available. We would like to get as much input as possible before we
> start designing and implementing a solution.
>
> 1. Introduction
>
> The emergence of Solid State Drives (SSDs) changes how data is stored for
> fast access. Their high-throughput and low-latency characteristics make them
> a good choice for applications that traditionally require many hard drives.
>
> SSDs are still more expensive per GB, making traditional hard drives a good
> and affordable solution for storing bulk amounts of data. Often, the working
> set of a file system is smaller than the full capacity of a drive. We can
> exploit this by keeping the frequently used part of the bulk data on SSDs.
> We prioritize data that is often accessed randomly, while larger bulk
> operations are kept on bulk storage.
>
> Recent developments in Linux SSD caches use a block IO approach to caching.
> That approach assumes data is stable on disk and evicts data based on LRU,
> temperature, etc. This is great for read-only IO patterns and in-place
> writes. However, btrfs uses a copy-on-write approach, which reduces the
> benefits of block IO caching. Block caches are unable to track updates (this
> would require extensive hints back and forth between the file system and the
> cache layer). Additionally, data and metadata look the same to the block
> layer.
>
> The internal file-system information available within btrfs allows these
> types of updates to be separated and enables fine-grained control of a
> to-be-implemented cache.
>
> 2 Overview
>
> The design space for a cache is divided into reads and writes. Both read and
> write caches are further divided into caching metadata (tree) accesses or
> user data. Writes are additionally divided into write-back or write-through.
>
> 2.1 Overview
>
> Any device attached to the storage pool should be allowed to act as a cache.
> It is natural to store the cache in the already implemented chunk
> architecture (as cache chunks). Each allocated cache chunk may (?) be
> available to one or more subvolumes.
>
> 2.2 Caching hierarchy
>
> By adding an extra layer, we have the following access pattern: host memory
> -> SSD or disk -> disk. A rough sketch of the resulting read path follows
> the list below.
>
> - Host memory caches lookup paths, transactions, free space information,
>   etc.
> - The SSD/disk cache holds frequently used data, or writes for data that
>   cannot be kept in host memory.
> - Traditional hard drives store the largest amount of data and hold a
>   complete copy of all data.
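>
> To make this concrete, here is a minimal, purely illustrative C sketch of
> such a lookup: try the fastest level first and warm the upper levels on a
> miss. This is not btrfs code; all function names are placeholders invented
> for this example, with the three levels reduced to stubs.
>
> /*
>  * Illustrative sketch only - not btrfs code.  The lookup/insert helpers
>  * are stubs standing in for real page-cache, cache-chunk and bulk IO.
>  */
> #include <stdbool.h>
> #include <stdio.h>
>
> static bool mem_lookup(unsigned long long b) { (void)b; return false; }
> static bool ssd_lookup(unsigned long long b) { (void)b; return false; }
> static void mem_insert(unsigned long long b) { printf("warm memory: %llu\n", b); }
> static void ssd_insert(unsigned long long b) { printf("warm SSD: %llu\n", b); }
> static void hdd_read(unsigned long long b)   { printf("HDD read: %llu\n", b); }
>
> static void hybrid_read(unsigned long long logical, bool is_hot)
> {
>         if (mem_lookup(logical))
>                 return;                 /* hit in host memory          */
>         if (ssd_lookup(logical)) {
>                 mem_insert(logical);    /* promote to host memory      */
>                 return;
>         }
>         hdd_read(logical);              /* miss: go to the hard drives */
>         if (is_hot)                     /* hotness decision, see 2.3   */
>                 ssd_insert(logical);
>         mem_insert(logical);
> }
>
> int main(void)
> {
>         hybrid_read(4096, true);        /* a hot block lands in both caches */
>         return 0;
> }
>
> The real lookup would of course operate on extents and cache chunks rather
> than single block numbers, and the hotness decision is the subject of the
> next section.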
> 2.3 Hotness tracking
>
> The data to cache is defined by some hotness algorithm, which can be applied
> at different layers of btrfs:
>
> - Inode level
>   The recently implemented VFS hot-track patches enable us to detect hotness
>   for files within any given VFS file-system implementation. For reads that
>   are within a tunable cache size, we either cache the tree leaf or its
>   extent. For writes, we track the tree updates and write them to the SSD.
>   If the file is larger and it receives a considerable amount of reads or
>   writes, its hot subparts should be cached.
>
> - Tree level
>   Within the fs, we can track the hotness of each tree. If a tree is read or
>   updated frequently, it should be served by the SSD cache.
>
> - Extent level
>   Whole or parts of extents should be tracked and cached if needed.
>
> 2.4 Cache opportunities
>
> - Hotness tracking for random reads
>
>   Define a threshold for when to cache reads. A back-of-the-envelope
>   calculation (worked through in the sketch after this list) tells us to
>   cache when the IO size is below 1.5 MB. This assumes 100 IO/s and a read
>   speed of 150 MB/s from the traditional drives. This should be tunable.
>
>   If data is updated, we should "follow" the newly written data and evict
>   the "old" data from the cache. As such, if the old data was cached, we
>   make sure to also cache the new data.
>
>   Implementation details:
>   - Use the VFS hot-track patches to detect when an inode is hot and then
>     cache the reads.
>   - Track CoW actions and pre-warm the cache.
>
> - Write-back cache
>
>   * Tree updates
>
>     Updates to trees are batched and flushed every 30 seconds. Flush the
>     updates to the cache layer first, and then flush them to bulk storage
>     later.
>
>     When updates are flushed to bulk storage, we reorder IOs to be as
>     sequential as possible. This optimization gives us higher throughput at
>     the cost of sorting writes at flush time.
>
>     The technique requires that we track tree updates between the disk cache
>     and the disk. As our trees are append-only, we can track the current
>     generation and apply the difference at timed intervals or at
>     mount/unmount time.
>
>     Implementation details:
>     - This can be implemented on a per-tree basis, e.g. specific extent
>       trees, the checksum tree, or other frequently updated trees.
>
>   * Data updates
>
>     Data is placed in one of two places: if small, directly inside the tree
>     leaves; if larger, within extents. If an inode is known to be hot, we
>     cache the updates.
>
> - Write-through cache for user data
>
>   If the cache isn't "safe", i.e. there are no duplicate copies, the cache
>   can still be used in write-through mode and cache subsequent reads.
>
>   This is similar to a pure disk block-based cache approach.
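>
> As a worked version of the back-of-the-envelope calculation above, here is a
> small self-contained C sketch. The IOPS and bandwidth figures are the
> assumed tunables quoted in this proposal, not measurements:
>
> #include <stdio.h>
>
> int main(void)
> {
>         const double hdd_iops   = 100.0;    /* random IOs per second       */
>         const double hdd_seq_bw = 150.0e6;  /* sequential bytes per second */
>
>         /*
>          * A random read smaller than bandwidth/IOPS spends more time
>          * seeking than transferring, so it is worth serving from the SSD
>          * cache; larger requests stream well enough from the hard drives.
>          */
>         double threshold = hdd_seq_bw / hdd_iops;   /* 1.5e6 bytes */
>
>         printf("cache random reads smaller than %.1f MB\n", threshold / 1e6);
>         return 0;
> }
>
> With the numbers above this prints 1.5 MB; both figures would be per-device
> tunables in practice.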
> 2.5 Other
>
> - Warm up the cache at mount time
>
>   Reread the SSD cache on mount to enjoy a preheated cache of the bulk
>   storage.
>
>   This can be achieved by storing information about the cache and
>   reconstructing the cache tree.
>
> - (By David Sterba) Implement appropriate debugfs/sysfs hooks for monitoring
>   the cache and getting information about the size of trees. This is useful
>   for deciding whether a tree should be cached on an SSD or not. E.g. the
>   checksum tree might always be in memory, but if it isn't, it should be
>   cached on SSD storage to reduce checksum tree seeks on the bulk storage.
>
> 2.6 Summary
>
> The following list of items has to be addressed for the first full patchset:
>
> - Cache lookup
> - Cache type (write-through, write-back, hot tracking, etc.)
> - Data structure for the lookup cache
> - Allow for prioritized storage (e.g. PCM > SSD > HDD)
> - Eviction strategy: LRU, LFU, FIFO, temperature-based (VFS hot track)
> - Disk layout for cache storage
>
> Here we presented our design space for a hybrid drive solution, as well as
> what it would take to carry it out. We are very much open to any kind of
> input, feedback or new ideas you might have.
>
> Sincerely,
> Matias & Jesper

-- 
Regards,

Zhi Yong Wu
