Re: illegal snapshot, cannot be deleted

On 2015-11-13 14:55, Hugo Mills wrote:
On Fri, Nov 13, 2015 at 02:40:44PM -0500, Austin S Hemmelgarn wrote:
On 2015-11-13 13:42, Hugo Mills wrote:
On Fri, Nov 13, 2015 at 01:10:12PM -0500, Austin S Hemmelgarn wrote:
On 2015-11-13 12:30, Vedran Vucic wrote:
Hello,

Here are outputs of commands as you requested:
  btrfs fi df /
Data, single: total=8.00GiB, used=7.71GiB
System, DUP: total=32.00MiB, used=16.00KiB
Metadata, DUP: total=1.12GiB, used=377.25MiB
GlobalReserve, single: total=128.00MiB, used=0.00B

btrfs fi show
Label: none  uuid: d6934db3-3ac9-49d0-83db-287be7b995a5
         Total devices 1 FS bytes used 8.08GiB
         devid    1 size 18.71GiB used 10.31GiB path /dev/sda6

btrfs-progs v4.0+20150429

Hmm, that's odd; based on these numbers, you should have no issue at
all running a balance. You might be hitting some other bug in the
kernel, but I don't remember whether there are any known bugs related
to ENOSPC or balance in the version you're running.

    There's one specific bug that shows up with ENOSPC exactly like
this. It's in all versions of the kernel, and I'm afraid there's no
known solution and no guaranteed mitigation strategy. Various things
have been tried, such as balancing, or adding a device, balancing, and
then removing the device again. Sometimes they seem to help; sometimes
they just make the problem worse.
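
    (For concreteness, the add/balance/remove dance is usually
something along these lines; /dev/sdX stands for whatever spare device
is to hand, and / for the affected mountpoint:

  btrfs device add /dev/sdX /
  btrfs balance start /
  btrfs device delete /dev/sdX /

As above, no guarantees; it's just what people have tried.)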

    We average maybe one report a week or so with this particular
set of symptoms.
We should get this listed on the wiki's Gotchas page ASAP,
especially considering that it's a pretty significant bug (not quite
as bad as data corruption, but pretty darn close).

    It's certainly mentioned in the FAQ, in the main entry on
unexpected ENOSPC. The text takes you through identifying when there's
the "usual" problem, then goes on to say that if you've hit ENOSPC
with free space still to be unallocated, you've got this issue.
It should probably be on the Gotchas page as well, as it definitely fits the general description of the material there.
Vedran, could you try running the balance with just '-dusage=40' and
then again with just '-musage=40'?  If just one of those fails, it
could help narrow things down significantly.
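
    Something like this, assuming the filesystem is mounted at / as in
your output above:

  btrfs balance start -dusage=40 /
  btrfs balance start -musage=40 /

The usage filter restricts the balance to chunks that are at most 40%
full, so the two runs should touch only data and only metadata chunks
respectively.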

Hugo, is there anything else known about this issue (I don't recall
seeing it mentioned before, and a quick web search didn't turn up
much)?

    I grumble about it regularly on IRC, where we get many more reports
of it than on the mailing list. There have been a couple on here that
I can recall, but not many.
Ah, that would explain it, I'm almost never on IRC.

  In particular:
1. Is there any known way to reliably reproduce it? (I would assume
not, as that would likely lead to a mitigation strategy. If someone
does find a reliable reproducer, please let me know; I've got
significant spare processor time and storage space I could dedicate
to getting traces and filesystem images for debugging, and I already
have most of the required infrastructure set up for something like
this.)

    None that I know of. I can start asking people for btrfs-image
dumps again, if you want to investigate. I did that for a while,
passing them to josef, but eventually he said he didn't need any more
of them. (He was always planning on investigating it, but kept getting
diverted by data corruption bugs, which have higher priority.)
I don't have the experience to properly debug it myself from images
(my expertise has always been finding bugs, not necessarily fixing
them); I was more offering to try to generate images. If we could find
some series of commands that reproduces this at least some of the
time, I have the resources to run a couple of VMs doing that over and
over again until it hits the bug (roughly the shape sketched below).
If I could reproduce it, I might be able to put some assertions into
the kernel so that it panics when there's an ENOSPC in the balance
code, and get a stack trace, but the more I think about it, the less
likely it seems that that would be much help.
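
Purely as a sketch of the sort of loop I mean (this is not a known
reproducer, and /mnt/test is just a hypothetical scratch btrfs
filesystem):

  # fill, snapshot, and balance repeatedly until a balance fails
  while true; do
      dd if=/dev/urandom of=/mnt/test/file bs=1M count=512
      btrfs subvolume snapshot /mnt/test /mnt/test/snap-$(date +%s)
      btrfs balance start -dusage=40 /mnt/test || break  # e.g. ENOSPC
  done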

2. Is it contagious? That is, if I send a snapshot from a filesystem
that is affected by it, does the filesystem that receives the
snapshot become affected? (If we could find a way to reproduce it, I
could easily answer this question within a couple of minutes of
reproducing it.)

    No, as far as I know, it doesn't transfer via send/receive.
send/receive is largely equivalent to copying the data by other means
-- receive is implemented almost exclusively in userspace, with only a
couple of ioctls for mucking around with the UUIDs at the end.
I thought that might be the case, but wanted to ask just to be safe.
I do local backups on some systems using send/receive (roughly as
sketched below), largely because it means that if my regular root
filesystem gets corrupted, I can directly boot the backups, run a
couple of commands, and have a working system again in about 5 or 10
minutes. If this could spread through send/receive, backups done this
way would be less useful, because it's something I would treat much
like regular FS corruption.
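
The sketch, for reference; the snapshot and backup paths here are just
examples from my setup, not anything standard:

  SNAP=/.snapshots/root-$(date +%F)
  btrfs subvolume snapshot -r / "$SNAP"
  btrfs send "$SNAP" | btrfs receive /mnt/backup

(send requires the source snapshot to be read-only, hence the -r.)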

3. Do we have any kind of statistics beyond the rate of reports (for
example, does it happen more often on bigger filesystems, or
possibly more frequently with certain chunk profiles)?

    Not that I've noticed, no. We've had it on small and large,
single-device and many devices, HDD and SSD, converted and not
converted. At one point, a couple of years ago, I did think it was
down to converted filesystems, because we had a run of them, but that
seems not to be the case.
That would seem to me to indicate it's somewhere in the common path for balance, which narrows things down at least, although not by much. Have we had anyone try balancing just data chunks or just metadata chunks? That might narrow things down even further. If it's corruption in the FS itself, I would assume it's somewhere either in the system chunks, the metadata chunks, or the space cache (if it's there, mounting with clear_cache should fix it).
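
For reference, clearing the space cache is a one-shot mount option,
e.g. (using the device from the fi show output earlier in the thread):

  mount -o clear_cache /dev/sda6 /mnt

The cache gets rebuilt automatically afterwards, so the option doesn't
need to stay in fstab.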
