On Sat, May 09, 2020 at 10:46:27PM +0100, Steven Fosdick wrote:
> On Sat, 9 May 2020 at 22:02, Phil Karn <karn@xxxxxxxx> wrote:
> > My understanding is that large sequential writes can go directly to
> > the SMR areas, which is an argument for a more conventional RAID
> > array. How hard does btrfs try to do large sequential writes?
>
> Ok, so I had not heard of SMR before it was mentioned here and
> immediately read the links. It did occur to me that large sequential
> writes could, in theory, go straight to SMR zones, but it also
> occurred to me that it isn't completely straightforward.

This is a nice overview:

https://www.snia.org/sites/default/files/Dunn-Feldman_SNIA_Tutorial_Shingled_Magnetic_Recording-r7_Final.pdf

> 1. If the drive firmware is not declaring that the drive uses SMR,
> and therefore the host doesn't send a specific command to begin a
> sequential write, how many sectors in a row does the drive wait to
> receive before concluding that this is a large sequential operation?
>
> 2. What happens if the sequential operation does not begin at the
> start of an SMR zone?

In the event of a non-append write, an RMW operation is performed on
the entire zone. The exceptions are data extents that are explicitly
deleted (TRIM command), and it looks like a sequential overwrite at
the _end_ of a zone (i.e. starting in the middle on a sector boundary
and writing sequentially to the end of the zone, without writing
elsewhere in between) can be executed without rewriting the entire
zone (zones can be appended to at any time; the head erases data
forward of the write location). I don't know if any drives implement
that.

In order to get conventional flush semantics to work, the drive has
to write everything twice: once to a log zone (which is either CMR or
SMR), then copy from there back to the SMR zone it belongs to
("cleaning"). There is necessarily a seek in between, as the log zone
and the SMR data zones cannot coexist within a track.

DM-SMR drives usually have smaller zones than HA-SMR drives, but we
can only guess (or run a timing attack to find out). Smaller zones
would allow the drive to track a few of them in the 256MB RAM cache
typical of the submarined SMR drives. This source reports zone sizes
of 15-40MB for DM-SMR and 256MB for HA-SMR, with CMR cache sizes not
exceeding 0.2% of capacity:

https://www.usenix.org/system/files/conference/hotstorage16/hotstorage16_wu.pdf

btrfs should do OK as long as you use space_cache=v2. space_cache=v1
would force the drive into slow RMW operations every 30 seconds, as
it would force the drive to complete cleaning operations in multiple
zones. Nobody should be using space_cache=v1 any more, and this is
just yet another reason (example mount invocations at the end of this
mail).

Superblock updates keep 2 zones busy all the time, effectively and
permanently reducing the number of usable open zones in the drive by
2. Longer commit intervals may help.

> The only thing that would make it easy is if the drive had a
> battery-backed RAM cache at least as big as an SMR zone, ideally
> about twice as big, so it could accumulate the data for one zone and
> then start writing that while accepting data for the next. As I have
> no idea how big these zones are, I have no idea how feasible that is.

Batteries and flash are expensive, so you can assume the drive has
neither unless they are prominently featured in the marketing docs to
explain the costs that are passed on to the customer. All of the
metadata and caches are stored on the spinning platters.
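
To make the space_cache/commit suggestion above concrete, this is
roughly what I mean (the commit interval is a number I picked for
illustration, not something I have benchmarked on an actual DM-SMR
drive, and /dev/sdX and /mnt/data are just placeholders):

    # v2 free space tree plus a longer commit interval, so the
    # drive gets larger, less frequent batches of work to clean
    mount -o space_cache=v2,commit=120 /dev/sdX /mnt/data

    # converting an existing filesystem: mount once with clear_cache
    # to drop the old v1 cache, after which v2 is used persistently
    mount -o clear_cache,space_cache=v2 /dev/sdX /mnt/data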

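And if anyone here ends up with a host-aware drive rather than a
drive-managed one, there is no need to guess at the zone layout as I
described above: a reasonably recent util-linux can dump it directly
(again, /dev/sdX is just a placeholder):

    # list the start, length, and type (conventional vs. sequential
    # write preferred) of every zone the drive reports
    blkzone report /dev/sdX | head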