Re: [PATCH 1/2] Btrfs: fix race between ranged fsync and writeback of adjacent ranges

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Tue, May 07, 2019 at 08:09:02PM +0100, Filipe Manana wrote:
> On Tue, May 7, 2019 at 6:44 PM Josef Bacik <josef@xxxxxxxxxxxxxx> wrote:
> >
> > On Mon, May 06, 2019 at 04:44:02PM +0100, fdmanana@xxxxxxxxxx wrote:
> > > From: Filipe Manana <fdmanana@xxxxxxxx>
> > >
> > > When we do a full fsync (the bit BTRFS_INODE_NEEDS_FULL_SYNC is set in the
> > > inode) that happens to be ranged, which happens during a msync() or writes
> > > for files opened with O_SYNC for example, we can end up with a corrupt log,
> > > due to different file extent items representing ranges that overlap with
> > > each other, or hit some assertion failures.
> > >
> > > When doing a ranged fsync we only flush delalloc and wait for ordered
> > > exents within that range. If while we are logging items from our inode
> > > ordered extents for adjacent ranges complete, we end up in a race that can
> > > make us insert the file extent items that overlap with others we logged
> > > previously and the assertion failures.
> > >
> > > For example, if tree-log.c:copy_items() receives a leaf that has the
> > > following file extents items, all with a length of 4K and therefore there
> > > is an implicit hole in the range 68K to 72K - 1:
> > >
> > >   (257 EXTENT_ITEM 64K), (257 EXTENT_ITEM 72K), (257 EXTENT_ITEM 76K), ...
> > >
> > > It copies them to the log tree. However due to the need to detect implicit
> > > holes, it may release the path, in order to look at the previous leaf to
> > > detect an implicit hole, and then later it will search again in the tree
> > > for the first file extent item key, with the goal of locking again the
> > > leaf (which might have changed due to concurrent changes to other inodes).
> > >
> > > However when it locks again the leaf containing the first key, the key
> > > corresponding to the extent at offset 72K may not be there anymore since
> > > there is an ordered extent for that range that is finishing (that is,
> > > somewhere in the middle of btrfs_finish_ordered_io()), and it just
> > > removed the file extent item but has not yet replaced it with a new file
> > > extent item, so the part of copy_items() that does hole detection will
> > > decide that there is a hole in the range starting from 68K to 76K - 1,
> > > and therefore insert a file extent item to represent that hole, having
> > > a key offset of 68K. After that we now have a log tree with 2 different
> > > extent items that have overlapping ranges:
> > >
> >
> > Well this kind of sucks.  I wonder if we should be locking the extent range
> > while we're doing this in order to keep this problem from happening.  A fix for
> > another day though
> 
> The only reason I have not adopted that, is because it would prevent
> fsync and readpages() / readpage() from running concurrently (more of
> a problem when fsync is full ranged).
> 

Hrm good point.  Time to make the extent lock read/write lock!

> Locking the range would avoid any such eventual performance drop on
> fsync alone, but it would also allow to drop the inode's dio_sem?
> Remember it was giving lockdep warnings before and you moved it to
> btrfs_sync_file() from btrfs_log_changed_extents() sometime ago.
> However still not enough, as I still get often (random xfstests,
> brfs/072 and generic/299 for example) similar lockdep warnings due to
> dio_sem:
> 
> https://pastebin.com/ar6cLg2Q

Hmm that's annoying, but not a real problem because we're just destroying a
workqueue that has nothing on it because we raced with somebody else allocating
a new workqueue.  I think the best thing to do is to export sb_init_dio_done_wq
and call it within our direct io before we take all of our locks, that should
shut lockdep up.  That's kind of shitty though, I'll think about it some more.
Thanks,

Josef



[Index of Archives]     [Linux Filesystem Development]     [Linux NFS]     [Linux NILFS]     [Linux USB Devel]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]

  Powered by Linux