Re: [PATCH 3/3] btrfs: add missing discards when unpinning extents with -o discard

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 6/8/15 10:12 AM, Filipe David Manana wrote:
> On Wed, Jun 3, 2015 at 3:47 PM,  <jeffm@xxxxxxxx> wrote:
>> From: Jeff Mahoney <jeffm@xxxxxxxx>
>> 
>> When we clear the dirty bits in btrfs_delete_unused_bgs for
>> extents in the empty block group, it results in
>> btrfs_finish_extent_commit being unable to discard the freed
>> extents.
>> 
>> The block group removal patch added an alternate path to forget
>> extents other than btrfs_finish_extent_commit.  As a result, any
>> extents that would be freed when the block group is removed
>> aren't discarded.  In my test run, with a large copy of mixed
>> sized files followed by removal, it left nearly 2/3 of extents
>> undiscarded.
>> 
>> To clean up the block groups, we add the removed block group onto
>> a list that will be discarded after transaction commit.
>> 
>> v1->v2: - Fix ordering to ensure that we don't discard extents
>> freed in an uncommitted transaction.
>> 
>> v2->v3: - Factor out get/put block_group->trimming to ensure that
>> cleanup always happens at the last reference drop. - Cleanup the
>> free space cache on the last reference drop. - Use list_move
>> instead of list_add in case of multiple adds.  We still issue a
>> warning, but we shouldn't fall over.
>> 
>> Signed-off-by: Jeff Mahoney <jeffm@xxxxxxxx>
> 
> Hi Jeff,
> 
> The changes look good from eyeballing, and I gave it a try
> together with the previous 2 patches in the corresponding series. 
> However, this patch in the series fails tests btrfs/065 to
> btrfs/072.
> 
> Running the following in a loop until it returns failure ($? != 0) 
> triggers an apparent corruption issue most of the time:

Ok, I didn't have my test environment set up to do the multidevice
tests by default, and definitely didn't have it set up to do 065,
since that requires 5+ devices of identical size. Now that I do have
that set up, I see this:

  8,37   2     1442    22.727796582 15400  D   D 419434496 + 8192 [umoun
t]

419434496 is the start of /dev/sdc5 and we're blowing away 4 MB.  That
happens to coincide with the system (single) block group.  In my
testing, I found (IIRC) that system (single) and metadata (single)
were always in the unused list but were getting skipped by the cleaner
thread.  It looks like the cleanup block groups on close_ctree is
misbehaving.  Whatever condition that was causing those to be skipped
must be unset for 065.  It wasn't a problem during normal runtime, but
now that it's called in close_ctree, it's introducing a problem.  I'll
start digging.

Thanks for the review!

- -Jeff

> root 17:16:48 /home/fdmanana/git/hub/xfstests (up_master_2)> 
> MKFS_OPTIONS="-O skinny-metadata" MOUNT_OPTIONS="-o discard"
> ./check btrfs/065 FSTYP         -- btrfs PLATFORM      --
> Linux/x86_64 debian3 4.1.0-rc5-btrfs-next-10+ MKFS_OPTIONS  -- -O
> skinny-metadata /dev/sdc MOUNT_OPTIONS -- -o discard /dev/sdc
> /home/fdmanana/btrfs-tests/scratch_1
> 
> btrfs/065 130s ... [failed, exit status 1] - output mismatch (see 
> /home/fdmanana/git/hub/xfstests/results//btrfs/065.out.bad) ---
> tests/btrfs/065.out 2014-11-17 20:59:51.282203001 +0000 +++
> /home/fdmanana/git/hub/xfstests/results//btrfs/065.out.bad 
> 2015-06-07 17:37:40.841293317 +0100 @@ -1,2 +1,3 @@ QA output
> created by 065 Silence is golden +_check_btrfs_filesystem:
> filesystem on /dev/sdc is inconsistent (see
> /home/fdmanana/git/hub/xfstests/results//btrfs/065.full) ... (Run
> 'diff -u tests/btrfs/065.out 
> /home/fdmanana/git/hub/xfstests/results//btrfs/065.out.bad'  to
> see the entire diff) Ran: btrfs/065 Failures: btrfs/065 Failed 1 of
> 1 tests
> 
> root 17:37:41 /home/fdmanana/git/hub/xfstests (up_master_2)>
> btrfsck /dev/sdc No valid Btrfs found on /dev/sdc Couldn't open
> file system root 17:37:43 /home/fdmanana/git/hub/xfstests
> (up_master_2)>
> 
> 
> With only patches 1 and 2 in the series (or without them) this
> never happens (well, at least not in ~50 iterations when running
> btrfs/065). And I've never seen this before testing this patch.
> 
> It also makes generic/251 "never" finish apparently - it didn't
> finish here after 9h running, and without any of these patches in
> the series it used to finish in less than half an hour.
> 
> Maybe the apparent corruption is due to some interaction with
> device replace or scrub, I haven't had the time to
> analyze/determine what it could be.
> 
> thanks
> 
> 
> 
> 
>> --- fs/btrfs/ctree.h            |  3 ++ fs/btrfs/extent-tree.c
>> | 69 ++++++++++++++++++++++++++++++++++++++++++--- 
>> fs/btrfs/free-space-cache.c | 57
>> +++++++++++++++++++++---------------- fs/btrfs/super.c
>> |  2 +- fs/btrfs/transaction.c      |  2 ++ 
>> fs/btrfs/transaction.h      |  2 ++ 6 files changed, 106
>> insertions(+), 29 deletions(-)
>> 
>> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index
>> 6f364e1..780acf1 100644 --- a/fs/btrfs/ctree.h +++
>> b/fs/btrfs/ctree.h @@ -3438,6 +3438,8 @@ int
>> btrfs_remove_block_group(struct btrfs_trans_handle *trans, struct
>> btrfs_root *root, u64 group_start, struct extent_map *em); void
>> btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info); +void
>> btrfs_get_block_group_trimming(struct btrfs_block_group_cache
>> *cache); +void btrfs_put_block_group_trimming(struct
>> btrfs_block_group_cache *cache); void
>> btrfs_create_pending_block_groups(struct btrfs_trans_handle
>> *trans, struct btrfs_root *root); u64
>> btrfs_get_alloc_profile(struct btrfs_root *root, int data); @@
>> -4068,6 +4070,7 @@ __printf(5, 6) void __btrfs_std_error(struct
>> btrfs_fs_info *fs_info, const char *function, unsigned int line,
>> int errno, const char *fmt, ...);
>> 
>> +const char *btrfs_decode_error(int errno);
>> 
>> void __btrfs_abort_transaction(struct btrfs_trans_handle *trans, 
>> struct btrfs_root *root, const char *function, diff --git
>> a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index
>> fb4dbfc..e53dbe9 100644 --- a/fs/btrfs/extent-tree.c +++
>> b/fs/btrfs/extent-tree.c @@ -6034,20 +6034,19 @@ int
>> btrfs_finish_extent_commit(struct btrfs_trans_handle *trans, 
>> struct btrfs_root *root) { struct btrfs_fs_info *fs_info =
>> root->fs_info; +       struct btrfs_block_group_cache
>> *block_group, *tmp; +       struct list_head *deleted_bgs; struct
>> extent_io_tree *unpin; u64 start; u64 end; int ret;
>> 
>> -       if (trans->aborted) -               return 0; - if
>> (fs_info->pinned_extents == &fs_info->freed_extents[0]) unpin =
>> &fs_info->freed_extents[1]; else unpin =
>> &fs_info->freed_extents[0];
>> 
>> -       while (1) { +       while (!trans->aborted) { 
>> mutex_lock(&fs_info->unused_bg_unpin_mutex); ret =
>> find_first_extent_bit(unpin, 0, &start, &end, EXTENT_DIRTY,
>> NULL); @@ -6066,6 +6065,34 @@ int
>> btrfs_finish_extent_commit(struct btrfs_trans_handle *trans, 
>> cond_resched(); }
>> 
>> +       /* +        * Transaction is finished.  We don't need the
>> lock anymore.  We +        * do need to clean up the block groups
>> in case of a transaction +        * abort. +        */ +
>> deleted_bgs = &trans->transaction->deleted_bgs; +
>> list_for_each_entry_safe(block_group, tmp, deleted_bgs, bg_list)
>> { +               u64 trimmed = 0; + +               ret =
>> -EROFS; +               if (!trans->aborted) +
>> ret = btrfs_discard_extent(root, +
>> block_group->key.objectid, +
>> block_group->key.offset, +
>> &trimmed); + +
>> list_del_init(&block_group->bg_list); +
>> btrfs_put_block_group_trimming(block_group); +
>> btrfs_put_block_group(block_group); + +               if (ret) { 
>> +                       const char *errstr =
>> btrfs_decode_error(ret); +
>> btrfs_warn(fs_info, +                                  "Discard
>> failed while removing blockgroup: errno=%d %s\n", +
>> ret, errstr); +               } +       } + return 0; }
>> 
>> @@ -9845,6 +9872,11 @@ int btrfs_remove_block_group(struct
>> btrfs_trans_handle *trans, * currently running transaction might
>> finish and a new one start, * allowing for new block groups to be
>> created that can reuse the same * physical device locations
>> unless we take this special care. +        * +        * There may
>> also be an implicit trim operation if the file system +        *
>> is mounted with -odiscard. The same protections must remain +
>> * in place until the extents have been discarded completely when 
>> +        * the transaction commit has completed. */ remove_em =
>> (atomic_read(&block_group->trimming) == 0); /* @@ -9917,8
>> +9949,10 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info
>> *fs_info) return;
>> 
>> spin_lock(&fs_info->unused_bgs_lock); + while
>> (!list_empty(&fs_info->unused_bgs)) { u64 start, end; +
>> int trimming;
>> 
>> block_group = list_first_entry(&fs_info->unused_bgs, struct
>> btrfs_block_group_cache, @@ -10016,12 +10050,39 @@ void
>> btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info) 
>> spin_unlock(&block_group->lock); spin_unlock(&space_info->lock);
>> 
>> +               /* DISCARD can flip during remount */ +
>> trimming = btrfs_test_opt(root, DISCARD); + +               /*
>> Implicit trim during transaction commit. */ +               if
>> (trimming) +
>> btrfs_get_block_group_trimming(block_group); + /* *
>> Btrfs_remove_chunk will abort the transaction if things go *
>> horribly wrong. */ ret = btrfs_remove_chunk(trans, root, 
>> block_group->key.objectid); + +               if (ret) { +
>> if (trimming) +
>> btrfs_put_block_group_trimming(block_group); +
>> goto end_trans; +               } + +               /* +
>> * If we're not mounted with -odiscard, we can just forget +
>> * about this block group. Otherwise we'll need to wait +
>> * until transaction commit to do the actual discard. +
>> */ +               if (trimming) { +
>> WARN_ON(!list_empty(&block_group->bg_list)); +
>> spin_lock(&trans->transaction->deleted_bgs_lock); +
>> list_move(&block_group->bg_list, +
>> &trans->transaction->deleted_bgs); +
>> spin_unlock(&trans->transaction->deleted_bgs_lock); +
>> btrfs_get_block_group(block_group); +               } end_trans: 
>> btrfs_end_transaction(trans, root); next: diff --git
>> a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c index
>> 9dbe5b5..c79253e 100644 --- a/fs/btrfs/free-space-cache.c +++
>> b/fs/btrfs/free-space-cache.c @@ -3274,35 +3274,23 @@ next: 
>> return ret; }
>> 
>> -int btrfs_trim_block_group(struct btrfs_block_group_cache
>> *block_group, -                          u64 *trimmed, u64 start,
>> u64 end, u64 minlen) +void btrfs_get_block_group_trimming(struct
>> btrfs_block_group_cache *cache) { -       int ret; +
>> atomic_inc(&cache->trimming); +}
>> 
>> -       *trimmed = 0; +void btrfs_put_block_group_trimming(struct
>> btrfs_block_group_cache *block_group) +{ +       struct
>> extent_map_tree *em_tree; +       struct extent_map *em; +
>> bool cleanup;
>> 
>> spin_lock(&block_group->lock); -       if (block_group->removed)
>> { -               spin_unlock(&block_group->lock); -
>> return 0; -       } -       atomic_inc(&block_group->trimming); +
>> cleanup = (atomic_dec_and_test(&block_group->trimming) && +
>> block_group->removed); spin_unlock(&block_group->lock);
>> 
>> -       ret = trim_no_bitmap(block_group, trimmed, start, end,
>> minlen); -       if (ret) -               goto out; - -       ret
>> = trim_bitmaps(block_group, trimmed, start, end, minlen); -out: -
>> spin_lock(&block_group->lock); -       if
>> (atomic_dec_and_test(&block_group->trimming) && -
>> block_group->removed) { -               struct extent_map_tree
>> *em_tree; -               struct extent_map *em; - -
>> spin_unlock(&block_group->lock); - +       if (cleanup) { 
>> lock_chunks(block_group->fs_info->chunk_root); em_tree =
>> &block_group->fs_info->mapping_tree.map_tree; 
>> write_lock(&em_tree->lock); @@ -3326,10 +3314,31 @@ out: * this
>> block group have left 1 entry each one. Free them. */ 
>> __btrfs_remove_free_space_cache(block_group->free_space_ctl); -
>> } else { +       } +} + +int btrfs_trim_block_group(struct
>> btrfs_block_group_cache *block_group, +
>> u64 *trimmed, u64 start, u64 end, u64 minlen) +{ +       int
>> ret; + +       *trimmed = 0; + +
>> spin_lock(&block_group->lock); +       if (block_group->removed)
>> { spin_unlock(&block_group->lock); +               return 0; } +
>> btrfs_get_block_group_trimming(block_group); +
>> spin_unlock(&block_group->lock); + +       ret =
>> trim_no_bitmap(block_group, trimmed, start, end, minlen); +
>> if (ret) +               goto out;
>> 
>> +       ret = trim_bitmaps(block_group, trimmed, start, end,
>> minlen); +out: +
>> btrfs_put_block_group_trimming(block_group); return ret; }
>> 
>> diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c index
>> 2ccd8d4..a80da03 100644 --- a/fs/btrfs/super.c +++
>> b/fs/btrfs/super.c @@ -69,7 +69,7 @@ static struct
>> file_system_type btrfs_fs_type;
>> 
>> static int btrfs_remount(struct super_block *sb, int *flags, char
>> *data);
>> 
>> -static const char *btrfs_decode_error(int errno) +const char
>> *btrfs_decode_error(int errno) { char *errstr = "unknown";
>> 
>> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c 
>> index 5628e25..2005262 100644 --- a/fs/btrfs/transaction.c +++
>> b/fs/btrfs/transaction.c @@ -256,6 +256,8 @@ loop: 
>> mutex_init(&cur_trans->cache_write_mutex); 
>> cur_trans->num_dirty_bgs = 0; 
>> spin_lock_init(&cur_trans->dirty_bgs_lock); +
>> INIT_LIST_HEAD(&cur_trans->deleted_bgs); +
>> spin_lock_init(&cur_trans->deleted_bgs_lock); 
>> list_add_tail(&cur_trans->list, &fs_info->trans_list); 
>> extent_io_tree_init(&cur_trans->dirty_pages, 
>> fs_info->btree_inode->i_mapping); diff --git
>> a/fs/btrfs/transaction.h b/fs/btrfs/transaction.h index
>> 0b24755..14325f2 100644 --- a/fs/btrfs/transaction.h +++
>> b/fs/btrfs/transaction.h @@ -74,6 +74,8 @@ struct
>> btrfs_transaction { */ struct mutex cache_write_mutex; spinlock_t
>> dirty_bgs_lock; +       struct list_head deleted_bgs; +
>> spinlock_t deleted_bgs_lock; struct btrfs_delayed_ref_root
>> delayed_refs; int aborted; int dirty_bg_run; -- 1.8.4.5
>> 
>> -- To unsubscribe from this list: send the line "unsubscribe
>> linux-btrfs" in the body of a message to
>> majordomo@xxxxxxxxxxxxxxx More majordomo info at
>> http://vger.kernel.org/majordomo-info.html
> 
> 
> 


- -- 
Jeff Mahoney
SUSE Labs
-----BEGIN PGP SIGNATURE-----
Version: GnuPG/MacGPG2 v2.0.19 (Darwin)

iQIcBAEBAgAGBQJVd1EWAAoJEB57S2MheeWyqbUP/jtZ7rAsqGPmZAOH3O5CYfp4
m9kacFUoPCVlMNK3Tpv6n/RQrDbwZ9BLRs6J0+a37JZAf+BwN/cTlc8ANOHQeBxH
O9VnTKUa+NHGviS7vgOpY9fV9jMVsgN6s89z2cPI7huXoovKum1zwuyMbrSvb92G
/Q8Rc6/eIW6kKnOqtBTi3l/gXeTUX3j9l3HlFY9SxkE2gRF+FRqWzBZBhDHS8jVs
S9ACTm7IaeOzs84NAGaQCSs5KGXZYjbYYhaUh9hacKWnOuvZ2MJLenyQuKb3JThL
cgaNu6+0Tl58SXh0hh2FrNTHKfCNrhs2jSAZQxGdEd9keViCmGMVgYbS8fyqKMw5
9LvKQB1NYWoKiCc7x+zK0GGU5p0MIWNWRqm02b/yp25LJKPwBHpX8hrseb7tHYdj
MSAqlHBUyfZR/nh6vywSeIsV4inPRIQYQKXoG5hechXQt7ivifXDtFMEa21PiLdR
bHxt2mgn6uqFBh/2atdYCzqOjH4QG3kQ9kVOdHxDbHk1sTFY26Q79xUImTA3qp7E
pI3Ckn9uG5b1m3PDdUFWBUkN6DffpUnsa1DAQBnI69/W6KY/VMKPaPrppYlQ5S8w
cBMhnfDxfyjfisUAS+mUmq4GcbpSt95KsVKizz7TlK2VxtPCWOywK0shRmoWIgME
62rJ9pbtLvoHY9GLNXnS
=E81+
-----END PGP SIGNATURE-----
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html




[Index of Archives]     [Linux Filesystem Development]     [Linux NFS]     [Linux NILFS]     [Linux USB Devel]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]

  Powered by Linux