On Feb 17, 2015, at 10:26 PM, Omar Sandoval <osandov@xxxxxxxxxxx> wrote:

> On Thu, Feb 12, 2015 at 11:12:25AM +0000, Steven Schlansker wrote:
>> [ Please CC me on replies, I'm not on the list ]
>> [ This is a followup to http://www.spinics.net/lists/linux-btrfs/msg41496.html ]
>>
>> Hello linux-btrfs,
>> I've been having trouble keeping my Apache Mesos / Docker slave nodes stable. After some period of load, tasks begin to hang. Once this happens, task after task ends up waiting at the same point, never to return. The system quickly becomes unusable and must be terminated.
>
> Are you seeing any ENOMEM Btrfs-related errors in your dmesg? In your
> previous thread you triggered an ENOMEM BUG_ON and you mentioned that your
> containers often get OOM'ed.

Correct, the containers OOM relatively frequently. We're working on fixing this, but accidents happen and we're doing heavy development with memory-hungry services.

I've not observed any BTRFS complaints about memory; I'd assumed that it would be allocating outside of any container limits. But I don't know how that interacts with, e.g., page cache accounting.

>
> I experimented with Btrfs in a memory-constrained cgroup and saw all
> sorts of buggy behavior (https://lkml.org/lkml/2015/2/17/131), but I
> haven't been able to reproduce this particular issue. This is a wild
> guess, but there could be a buggy error handling path somewhere that
> forgets to unlock a page.

That seems like a very reasonable explanation. Is there any particular debugging information I could provide to diagnose this? I can't reproduce it on demand, but it's only a matter of time until it happens again...

And I'd like to help contribute to fixing it, but I'm not much of a filesystem developer :)
