On Wed, Feb 21, 2018 at 1:10 PM, Nikolay Borisov <nborisov@xxxxxxxx> wrote:
>
>
> On 21.02.2018 15:06, Filipe Manana wrote:
>> On Wed, Feb 21, 2018 at 11:41 AM, Nikolay Borisov <nborisov@xxxxxxxx> wrote:
>>> Currently the DIO read cases uses a botched idea from ext4 to ensure
>>> that DIO reads don't race with truncate. The idea is that if we have a
>>> pending truncate we set BTRFS_INODE_READDIO_NEED_LOCK which in turn
>>> forces the dio read case to fallback to inode_locking to prevent
>>> read/truncate races. Unfortunately this is subtly broken for at least
>>> 2 reasons:
>>>
>>> 1. inode_dio_begin in btrfs_direct_IO is called outside of inode_lock
>>> (for the read case). This means that there is no ordering guarantee
>>> between the invocation of inode_dio_wait and the increment of
>>> i_dio_count in btrfs_direct_IO in the tread case.
>>>
>>> 2. The memory barriers used in btrfs_inode_(block|resume)_unlocked_dio
>>> are not really paired with the reader side - the test_bit in
>>> btrfs_direct_IO, since the latter is missing a memory barrier. Furthermore,
>>> the actual sleeping condition that needs ordering to prevent live-locks/
>>> missed wakeups is the modification/read of i_dio_count. So in this case
>>> the waker(T2) needs to make the condition false _BEFORE_ doing a test.
>>>
>>> The interraction between the two threads roughly looks like:
>>>
>>> T1(truncate): T2(btrfs_direct_IO):
>>> set_bit(BTRFS_INODE_READDIO_NEED_LOCK) if (test_bit(BTRFS_INODE_READDIO_NEED_LOCK))
>>> if (atomic_read()) if (atomic_dec_and_test(&inode->i_dio_count)
>>> schedule() wake_up_bit
>>> clear_bit(BTRFS_INODE_READDIO_NEED_LOCK)
>>>
>>> Without the ordering between the test_bit in T2 and setting the bit in
>>> T1 (due to a missing pairing barrier in T2) it's possible that T1 goes
>>> to sleep in schedule and T2 misses the bit set, resulting in missing the
>>> wake up.
>>>
>>> In any case all of this is VERY subtle. So fix it by simply making
>>> the DIO READ case take inode_lock_shared. This ensure that we can have
>>> DIO reads in parallel at the same time we are protected against
>>> concurrent modification of the target file.
>>
>> And that prevents writes and reads against different (i.e. not
>> overlapping) ranges from happening in parallel.
>> That has a big impact on applications (databases for e.g.) that
>> operate on large files serving multiple requests.
>> Now all reads are serialized against all writes and vice versa.
>>
>
> Correct, but I'd prefer correctness over performance! And I assume other
> people as well, since as is the code atm it's not providing full
> protection between racing reads and truncate.
And even more people prefer correctness without significant impact on
performance.
So fix it differently please.
>
>> Unless I missed something, a big NAK to this change as it is.
>
>
>
>>
>>
>>> This way we closely mimic
>>> what ext4 codes does and simplify this mess.
>>>
>>> Multiple xfstest runs didn't show any regressions.
>>>
>>> Signed-off-by: Nikolay Borisov <nborisov@xxxxxxxx>
>>> ---
>>> fs/btrfs/btrfs_inode.h | 17 -----------------
>>> fs/btrfs/inode.c | 34 ++++++++++++++++++++--------------
>>> 2 files changed, 20 insertions(+), 31 deletions(-)
>>>
>>> diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
>>> index f527e99c9f8d..3519e49d4ef0 100644
>>> --- a/fs/btrfs/btrfs_inode.h
>>> +++ b/fs/btrfs/btrfs_inode.h
>>> @@ -329,23 +329,6 @@ struct btrfs_dio_private {
>>> blk_status_t);
>>> };
>>>
>>> -/*
>>> - * Disable DIO read nolock optimization, so new dio readers will be forced
>>> - * to grab i_mutex. It is used to avoid the endless truncate due to
>>> - * nonlocked dio read.
>>> - */
>>> -static inline void btrfs_inode_block_unlocked_dio(struct btrfs_inode *inode)
>>> -{
>>> - set_bit(BTRFS_INODE_READDIO_NEED_LOCK, &inode->runtime_flags);
>>> - smp_mb();
>>> -}
>>> -
>>> -static inline void btrfs_inode_resume_unlocked_dio(struct btrfs_inode *inode)
>>> -{
>>> - smp_mb__before_atomic();
>>> - clear_bit(BTRFS_INODE_READDIO_NEED_LOCK, &inode->runtime_flags);
>>> -}
>>> -
>>> static inline void btrfs_print_data_csum_error(struct btrfs_inode *inode,
>>> u64 logical_start, u32 csum, u32 csum_expected, int mirror_num)
>>> {
>>> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
>>> index 491a7397f6fa..9c43257e6e11 100644
>>> --- a/fs/btrfs/inode.c
>>> +++ b/fs/btrfs/inode.c
>>> @@ -5149,10 +5149,13 @@ static int btrfs_setsize(struct inode *inode, struct iattr *attr)
>>> /* we don't support swapfiles, so vmtruncate shouldn't fail */
>>> truncate_setsize(inode, newsize);
>>>
>>> - /* Disable nonlocked read DIO to avoid the end less truncate */
>>> - btrfs_inode_block_unlocked_dio(BTRFS_I(inode));
>>> + /*
>>> + * Truncate after all in-flight dios are finished, new ones
>>> + * will block on inode_lock. This only matters for AIO requests
>>> + * since DIO READ is performed under inode_shared_lock and
>>> + * write under exclusive lock.
>>> + */
>>> inode_dio_wait(inode);
>>> - btrfs_inode_resume_unlocked_dio(BTRFS_I(inode));
>>>
>>> ret = btrfs_truncate(inode);
>>> if (ret && inode->i_nlink) {
>>> @@ -8669,15 +8672,12 @@ static ssize_t btrfs_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
>>> loff_t offset = iocb->ki_pos;
>>> size_t count = 0;
>>> int flags = 0;
>>> - bool wakeup = true;
>>> bool relock = false;
>>> ssize_t ret;
>>>
>>> if (check_direct_IO(fs_info, iter, offset))
>>> return 0;
>>>
>>> - inode_dio_begin(inode);
>>> -
>>> /*
>>> * The generic stuff only does filemap_write_and_wait_range, which
>>> * isn't enough if we've written compressed pages to this area, so
>>> @@ -8691,6 +8691,9 @@ static ssize_t btrfs_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
>>> offset + count - 1);
>>>
>>> if (iov_iter_rw(iter) == WRITE) {
>>> +
>>> + inode_dio_begin(inode);
>>> +
>>> /*
>>> * If the write DIO is beyond the EOF, we need update
>>> * the isize, but it is protected by i_mutex. So we can
>>> @@ -8720,11 +8723,13 @@ static ssize_t btrfs_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
>>> dio_data.unsubmitted_oe_range_end = (u64)offset;
>>> current->journal_info = &dio_data;
>>> down_read(&BTRFS_I(inode)->dio_sem);
>>> - } else if (test_bit(BTRFS_INODE_READDIO_NEED_LOCK,
>>> - &BTRFS_I(inode)->runtime_flags)) {
>>> - inode_dio_end(inode);
>>> - flags = DIO_LOCKING | DIO_SKIP_HOLES;
>>> - wakeup = false;
>>> + } else {
>>> + /*
>>> + * In DIO READ case locking the inode in shared mode ensures
>>> + * we are protected against parallel writes/truncates
>>> + */
>>> + inode_lock_shared(inode);
>>> + inode_dio_begin(inode);
>>> }
>>>
>>> ret = __blockdev_direct_IO(iocb, inode,
>>> @@ -8755,10 +8760,11 @@ static ssize_t btrfs_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
>>> btrfs_delalloc_release_space(inode, data_reserved,
>>> offset, count - (size_t)ret);
>>> btrfs_delalloc_release_extents(BTRFS_I(inode), count);
>>> - }
>>> + } else
>>> + inode_unlock_shared(inode);
>>> out:
>>> - if (wakeup)
>>> - inode_dio_end(inode);
>>> + inode_dio_end(inode);
>>> +
>>> if (relock)
>>> inode_lock(inode);
>>>
>>> --
>>> 2.7.4
>>>
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>>> the body of a message to majordomo@xxxxxxxxxxxxxxx
>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>
>>
>>
--
Filipe David Manana,
“Whether you think you can, or you think you can't — you're right.”
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html