Re: OSD deadlock with cephfs client and OSD on same machine
On Tue, 29 May 2012, Amon Ott wrote:
> Hello again!
>
> On Linux, if you run OSD on ext4 filesystem, have a cephfs kernel client mount
> on the same system and no syncfs system call (as to be expected with libc6 <
> 2.14 or kernel < 2.6.39), OSD deadlocks in sys_sync(). Only reboot recovers
> the system.
>
> After some investigation in the code, this is what I found:
> In src/common/sync_filesystem.h, the function sync_filesystem() first tries a
> syncfs() (not available), then a btrfs ioctl sync (not available with
> non-btrfs), then finally a sync(). sys_sync tries to sync all filesystems,
> including the journal device, the osd storage area and the cephfs mount.
> Under some load, when OSD calls sync(), cephfs sync waits for the local osd,
> which already waits for its storage to sync, which the kernel wants to do
> after the cephfs sync. Deadlock.
>
> The function sync_filesystem() is called by FileStore::sync_entry() in
> src/os/FileStore.cc, but only on non-btrfs storage and if
> filestore_fsync_flushes_journal_data is false. After forcing this to true in
> OSD config, our test cluster survived three days of heavy load (and still
> running fine) instead of deadlocking all nodes within an hour. Reproduced
> with 0.47.2 and kernel 3.2.18, but the related code seems unchanged in
> current master.
>
> Conclusion: If you want to run OSD and cephfs kernel client on the same Linux
> server and have a libc6 before 2.14 (e.g. Debian's newest in experimental is
> 2.13) or a kernel before 2.6.39, either do not use ext4 (but btrfs is still
> unstable) or risk data loss by missing syncs through the workaround of
> forcing filestore_fsync_flushes_journal_data to true.
Note that filestore_fsync_flushes_journal_data should only be set to true
with ext3 and the 'data=ordered' or 'data=journal' mount option. That
fsync() flushes all previous writes there is only an implementation
artifact.
> Please consider putting out a fat warning at least at build time, if syncfs()
> is not available, e.g. "No syncfs() syscall, please expect a deadlock when
> running osd on non-btrfs together with a local cephfs mount." Even better
> would be a quick runtime test for missing syncfs() and storage on non-btrfs
> that spits out a warning, if deadlock is possible.
I think a runtime warning makes more sense; nobody will see the build time
warning (e.g., those who installed debs).
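
A runtime check along these lines might look like the sketch below. It is
only an illustration, not existing Ceph code: the function names are
hypothetical, the btrfs magic number is repeated from <linux/magic.h>, and
the probe assumes a kernel that has syncfs rejects fd -1 with EBADF while
one that lacks it returns ENOSYS.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <sys/vfs.h>
#include <unistd.h>

/* Probe for the syncfs syscall without flushing anything: an invalid fd
 * makes a kernel that implements it fail with EBADF, not ENOSYS. */
static bool have_syncfs(void)
{
#ifdef SYS_syncfs
    if (syscall(SYS_syncfs, -1) == 0)
        return true;
    return errno != ENOSYS;
#else
    return false; /* headers too old to even name the syscall */
#endif
}

/* True if path lives on btrfs (0x9123683E is BTRFS_SUPER_MAGIC). */
static bool on_btrfs(const char *path)
{
    struct statfs st;
    if (statfs(path, &st) != 0)
        return false; /* cannot tell; treat as non-btrfs */
    return (unsigned int)st.f_type == 0x9123683EU;
}

/* The global sync() fallback is reached only when both narrower
 * mechanisms are missing. */
static bool global_sync_fallback_possible(bool syncfs_ok, bool btrfs)
{
    return !syncfs_ok && !btrfs;
}

/* Hypothetical startup hook: warn once if the OSD data path is at risk. */
static bool warn_if_deadlock_possible(const char *osd_data_path)
{
    bool risky = global_sync_fallback_possible(have_syncfs(),
                                               on_btrfs(osd_data_path));
    if (risky)
        fprintf(stderr,
                "WARNING: no syncfs() syscall and %s is not on btrfs; "
                "expect a deadlock when running osd together with a "
                "local cephfs mount\n", osd_data_path);
    return risky;
}
```

The check cannot know whether a cephfs mount will appear later, so it
warns whenever the fallback to sync() is reachable at all.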
> As a side effect, the experienced lockup seems to be a good way to reproduce
> the long standing bug 1047 - when our cluster tried to recover, all MDS
> instances died with those symptoms. It seems that a partial sync of journal
> or data partition causes that broken state.
Interesting! If you could also note on that bug what the metadata
workload was (what was making hard links?), that would be great!
Thanks-
sage