Re: BTRFS RAID-1 leaf size change scenario

Swâmi Petaramesh posted on Thu, 12 Feb 2015 14:26:09 +0100 as excerpted:

> I have a BTRFS RAID-1 FS made from 2x 2TB SATA mechanical drives.
> 
> It was created a while ago, with the defaults of the time, i.e. 4K leaf sizes.
> 
> It also contains *lots* of subvols and snapshots.
> 
> It has become very slow over time, and I know that BTRFS performs better
> with the new 16K leaf sizes.

I agree with everything Chris Murphy said, and were it me, I would do 
the migration as he suggested.  Here I'll focus on another aspect, and 
reemphasize one point he mentioned as well.

1) Focus: Snapshots

Btrfs makes it deceptively easy to make snapshots, since due to COW they 
can be created at very close to zero cost.  Unfortunately, that ease of 
creation belies the much more complicated snapshot maintenance and 
deletion costs, and often people create and keep around far more 
snapshots than is healthy for an optimally functioning btrfs.

Basically, anything over say 500 snapshots on a btrfs is going to start 
bogging it down, and more than around 250-ish snapshots of any single 
subvolume should be entirely unnecessary.  Unfortunately, due to the 
deceptive ease of creation, some people even take per-minute snapshots 
and fail to thin them down well over time, thus ending up with thousands 
to hundreds of thousands of snapshots, particularly if they're 
snapshotting multiple subvolumes at that extreme per-minute frequency.  A 
filesystem in this condition is going to be a nightmare to do any 
reasonable maintenance (like a rebalance to add/remove/replace devices, 
or a defrag of more than a few files) on at all, and even regular 
operations will likely slow down due to fragmentation, etc.

Given your starred-emphasis "*lots*" of snapshots, I strongly suspect 
this to be one of the big reasons for your slowdowns, far more so than 
the 4k nodesize, tho that doesn't help either.  OTOH, if your characterization of 
*lots* was actually less than this, snapshotting probably isn't such a 
big problem after all and you can skip to the reemphasis point, below.
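
(As an aside, if you want an actual number rather than a gut feeling, 
counting them is easy, since "btrfs subvolume list -s" lists only 
snapshots.  A trivial sketch, assuming the filesystem is mounted at 
/mnt/data, which is of course just a placeholder, and run as root:

  import subprocess

  # "btrfs subvolume list -s" prints one line per snapshot; needs root.
  out = subprocess.check_output(
      ["btrfs", "subvolume", "list", "-s", "/mnt/data"]).decode()
  count = sum(1 for line in out.splitlines() if line.strip())
  print(count, "snapshots under /mnt/data")

A plain "btrfs subvolume list -s /mnt/data | wc -l" from a root shell 
gets you the same number, of course.)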

Unfortunately, at this point it may not be reasonable to recover from the 
situation on the existing filesystem, as doing the necessary thinning 
down of those snapshots could take nigh eternity (well, days per 
snapshot, "not reasonable" if you're dealing with anything near the 
thousands of snapshots I suspect) due to all that overhead.

But regardless of whether you can fix the existing btrfs, at least once 
you start over with a new one, try to manage your snapshotting 
practices better, and I suspect the filesystem won't slow down as fast 
as this one did.  If you don't, I strongly suspect the newer 16k 
nodesize isn't going to make that much difference, and you'll see the 
same sort of slowdowns over time as you're dealing with now.

Here's the base argument concerning snapshot thinning management.  
Suppose you're doing hourly snapshots, and not doing any thinning.  
Suppose that a year later, you find you need a version of a file from a 
year ago, and go to retrieve it from one of those snapshots.  So you go 
to mount a year-old snapshot and you have to pick one.  Is it *REALLY* 
going to matter, a year on, with no reason to access it since, what exact 
hour it was?  How are you even going to /know/ what exact hour to pick?

A year on, practicality suggests you'll simply pick one out of the 24 for 
the day and call it good.  But is even that level of precision 
necessary?  A year on, might a single snapshot for the week, or for the 
month, or even the quarter, be sufficient?  Chances are it will be, and 
if the one you pick is too new or too old, you can simply pick one newer 
or one older and be done with it.

Similarly, per-minute snapshots?  In the extreme case, maybe keep those 
for a half hour or an hour.  Then thin them down to say 10-minute 
snapshots, and to half-hour snapshots after four or six hours 
(depending on whether you're basing that on an 8-hour workday or a 
24-hour day), then to hourly after a day, four- or six-hourly after 
three days, and daily after a week.

But in practice, per-minute snapshots are seldom necessary at all, and 
could be problems for maintenance if they end up taking more than a 
minute to delete.  Ten-minute snapshots, possibly; more likely 
half-hourly or hourly is fine.

So say we start with half hour snapshots, 24-hours/day, but thinning down 
to hourly after four hours and to four-hourly after a day, for a week.  
That's:

8 half-hourly, 24-4 = 20 hourly, (7-1)*6 = 36 four-hourly; 8+20+36 = 
64 snapshots in a week.

Now, keep daily snapshots for three additional weeks = 64+21 =
85 snapshots in four weeks.

And keep the dailies going to fill out the half-year (26 weeks):
26-4 = 22 more weeks, 22*7 = 154, 154+85 = 239 snapshots in half a year.

Now after half a year, if the data has any value at all, it will have 
been backed up elsewhere.  If you like, to avoid having to dig up those 
backups, you can keep quarterly snapshots for... pretty much the life 
of the filesystem or hardware; it'll only add four snapshots a year 
beyond the 239 for the most recent half-year.  Or delete snapshots 
beyond a quarter or a 
half year and rely on the off-filesystem backups, allowing btrfs to 
finally free the space tied up in the oldest and thus presumably most 
changed copies of the files in question.

As the above demonstrates, even at originally half-hourly snapshots, a 
reasonable thinning program keeps snapshots per subvolume to 200-300.

And if you can get by with say 4X per day (6-hourly on a 24-hour day) 
snapshots and keep only two days of that, thinning to 2X per day for a 
week, then daily for another week and weekly out to six months, that's:

2 days of 4X daily = 8, 5 days of 2X = 10, 7 dailies, 24 weeklies; 
8+10+7+24 = 49 snapshots to six months, starting with 6-hourly.
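
(If you want to play with the numbers yourself, the whole thing is 
trivial arithmetic; a little python along these lines, using the same 
schedule boundaries I used above, reproduces both totals:

  # Back-of-the-envelope counts for the two schedules above.
  # Schedule 1: half-hourly for 4 hours, hourly to a day, 4-hourly for
  # the rest of the week, then daily out to 26 weeks.
  week1     = 4*2 + (24-4) + (7-1)*6    # 8 + 20 + 36 = 64
  month1    = week1 + 3*7               # 64 + 21     = 85
  halfyear1 = month1 + (26-4)*7         # 85 + 154    = 239

  # Schedule 2: 6-hourly for 2 days, 2X/day for the rest of the week,
  # daily for another week, then weekly out to 26 weeks.
  halfyear2 = 2*4 + 5*2 + 7 + (26-2)    # 8 + 10 + 7 + 24 = 49

  print(halfyear1, halfyear2)           # 239 49

Adjust the boundaries to taste and see where your own policy lands.)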

Now say you have two subvolumes on the half-hourly schedule and five 
on the 6-hourly schedule.  Rounding up for easier math and to allow 
for a few quarterly snapshots, we're now at 2*250 = 500 at the higher 
frequency and 5*60 = 300 at the lower frequency, 800 total for the 
filesystem.

800 snapshots for the filesystem is a bit high, but it's manageable, and 
**WELL** better than the tens or hundreds of thousands of snapshots that 
some are trying to handle.  Obviously, if you can manage with only one 
subvolume at the higher snapshot frequency, or if for some of those 
subvolumes you can drop to daily or weekly snapshots, or skip 
snapshotting them entirely, total snapshots will go down accordingly.

Bottom line, if you're dealing with much over a thousand snapshots per 
filesystem, seriously reconsider your snapshotting strategy as you're 
probably doing it wrong.  If possible, keep it to a few hundred.  
Filesystem management and even general usage should be far better as a 
result.
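
(For what it's worth, the thinning itself is easy to script if your 
snapshots follow some naming convention.  Here's a minimal, untested 
sketch of the idea, assuming -- and these are purely my assumptions, 
adjust to your own layout -- that snapshots live directly under 
/mnt/data/.snapshots and are named like "home-20150212-1430", and that 
the policy is hourlies for a week, dailies for a month, weeklies for 
half a year, nothing older:

  #!/usr/bin/env python3
  # Thinning sketch: keep the newest snapshot in each retention bucket,
  # delete the rest.  Run as root; leave DRY_RUN on until you trust it.
  import datetime, os, subprocess

  SNAPDIR = "/mnt/data/.snapshots"     # placeholder path
  DRY_RUN = True
  now = datetime.datetime.now()
  kept = set()                         # one snapshot per (subvol, bucket)

  for name in sorted(os.listdir(SNAPDIR), reverse=True):  # newest first
      subvol, date, hhmm = name.rsplit("-", 2)
      ts = datetime.datetime.strptime(date + hhmm, "%Y%m%d%H%M")
      age = now - ts
      if age < datetime.timedelta(days=7):
          bucket = ts.strftime("%Y%m%d%H")   # hourly buckets
      elif age < datetime.timedelta(days=28):
          bucket = ts.strftime("%Y%m%d")     # daily buckets
      elif age < datetime.timedelta(days=182):
          bucket = ts.strftime("%Y-%W")      # weekly buckets
      else:
          bucket = None                      # older than ~half a year
      if bucket is not None and (subvol, bucket) not in kept:
          kept.add((subvol, bucket))         # newest in bucket survives
          continue
      path = os.path.join(SNAPDIR, name)
      print("would delete" if DRY_RUN else "deleting", path)
      if not DRY_RUN:
          subprocess.check_call(["btrfs", "subvolume", "delete", path])

Dedicated tools (snapper, for instance) handle this properly, of 
course; the point is just that a sane retention policy is a few lines 
of logic, not a burden.)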

Again, while it might be too late to reasonably recover from a bad 
snapshotting strategy on the existing filesystem, as deleting all those 
old snapshots now may take far longer than is reasonable (tho you could 
try it and /see/ how long deleting a snapshot takes), at least try to 
manage things better when you set up the new filesystem.
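
(If you do decide to test that, something along these lines gives at 
least a rough idea -- rough, because btrfs queues the real extent 
cleanup to a background thread after the delete returns, so treat the 
number as a lower bound, not the full cost.  Paths are placeholders 
again:

  import subprocess, time

  snap = "/mnt/data/.snapshots/home-20140201-0000"   # placeholder
  t0 = time.time()
  subprocess.check_call(["btrfs", "subvolume", "delete", snap])
  # Force a sync so at least the transaction commits before we stop
  # the clock; background cleanup may well continue after this.
  subprocess.check_call(["btrfs", "filesystem", "sync", "/mnt/data"])
  print("delete + sync took %.1f seconds" % (time.time() - t0))

Multiply by the number of snapshots you'd need to delete and you'll 
have a feel for whether cleaning up the existing filesystem is even 
worth attempting.)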

2) Reemphasis: Versions.

Restating what CM said, old btrfs versions are buggy btrfs versions.  
Btrfs is still new enough and not yet stable and mature enough that 
running current versions really does lower the risk to your data, as old 
versions are known buggy versions, and running them is effectively 
playing Russian Roulette with your data -- you might get away with it 
for a while, but play the odds long enough and eventually you'll get 
shot.

More specifically, at operational runtime, it's the kernel that's most 
vital, as userspace basically only tells the kernel what to do at a high 
level, and the kernel actually executes the lower level code to do it.  
So older kernels risk runtime damage on existing filesystems due to bugs 
that have long since been found and fixed in newer kernel versions.

However, in offline mode, such as when trying to repair an unmounted 
filesystem using btrfs check, or when using btrfs-restore to try to 
recover un-backed-up data from an unmountable filesystem before trying to 
repair it, btrfs-progs userspace becomes vital, as it's actually touching 
the filesystem then, without the kernel's direct involvement.

So a current kernel is most vital for btrfs at runtime, while a current 
btrfs-progs userspace is most vital if something screwed up and you're 
trying to fix it or recover what you can before blowing the existing 
filesystem away to start over.
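
(Checking what you're actually running is trivial, for reference; this 
just wraps "uname -r" and "btrfs --version":

  import subprocess

  kernel = subprocess.check_output(["uname", "-r"]).decode().strip()
  progs  = subprocess.check_output(["btrfs", "--version"]).decode().strip()
  print("kernel:     ", kernel)
  print("btrfs-progs:", progs)

Or just run the two commands directly, obviously; the point is simply 
to compare both against what's current before trusting either with 
repair work.)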

Meanwhile, beyond data corruption bugs, one of the big focuses recently 
has been operation scaling.  If you have thousands of snapshots as I 
suspect, it's very possible that the latest 3.18 or 3.19 kernel will 
actually let you work with them in a reasonable timeframe, while a 3.16 
vintage kernel will take so long it's impractical to do anything with 
them at all.  I'd certainly try it, at least, before giving up on doing 
anything practical with that old filesystem, unless of course you decide 
to simply bite the bullet and start over with a new filesystem and new 
devices, and only access the old one long enough to get current data and 
possibly a few selected snapshots off of it.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman




