On Thu, Jul 13, 2017 at 11:38 PM, Liu Bo <bo.li.liu@xxxxxxxxxx> wrote:
> On Thu, Jul 13, 2017 at 09:36:19AM -0700, Liu Bo wrote:
>> On Thu, Jul 13, 2017 at 03:09:54PM +0100, fdmanana@xxxxxxxxxx wrote:
>> > From: Filipe Manana <fdmanana@xxxxxxxx>
>> >
>> > The recent changes to make bio cloning faster (added in the 4.13 merge
>> > window) by using the bio_clone_fast() API introduced a regression on
>> > raid5/6 modes, because cloned bios have an invalid bi_vcnt field
>> > (therefore it can not be used) and the raid5/6 code uses the
>> > bio_for_each_segment_all() API to iterate the segments of a bio, and this
>> > API uses a bio's bi_vcnt field.
>> >
>> > The issue is very simple to trigger by doing for example a direct IO write
>> > against a raid5 or raid6 filesystem and then attempting to read what we
>> > wrote before:
>> >
>> > $ mkfs.btrfs -m raid5 -d raid5 -f /dev/sdc /dev/sdd /dev/sde /dev/sdf
>> > $ mount /dev/sdc /mnt
>> > $ xfs_io -f -d -c "pwrite -S 0xab 0 1M" /mnt/foobar
>> > $ od -t x1 /mnt/foobar
>> > od: /mnt/foobar: read error: Input/output error
>> >
>> > For that example, the following is also reported in dmesg/syslog:
>> >
>> > [18274.985557] btrfs_print_data_csum_error: 18 callbacks suppressed
>> > [18274.995277] BTRFS warning (device sdf): csum failed root 5 ino 257 off 0 csum 0x98f94189 expected csum 0x94374193 mirror 1
>> > [18274.997205] BTRFS warning (device sdf): csum failed root 5 ino 257 off 4096 csum 0x98f94189 expected csum 0x94374193 mirror 1
>> > [18275.025221] BTRFS warning (device sdf): csum failed root 5 ino 257 off 8192 csum 0x98f94189 expected csum 0x94374193 mirror 1
>> > [18275.047422] BTRFS warning (device sdf): csum failed root 5 ino 257 off 12288 csum 0x98f94189 expected csum 0x94374193 mirror 1
>> > [18275.054818] BTRFS warning (device sdf): csum failed root 5 ino 257 off 4096 csum 0x98f94189 expected csum 0x94374193 mirror 1
>> > [18275.054834] BTRFS warning (device sdf): csum failed root 5 ino 257 off 8192 csum 0x98f94189 expected csum 0x94374193 mirror 1
>> > [18275.054943] BTRFS warning (device sdf): csum failed root 5 ino 257 off 8192 csum 0x98f94189 expected csum 0x94374193 mirror 2
>> > [18275.055207] BTRFS warning (device sdf): csum failed root 5 ino 257 off 8192 csum 0x98f94189 expected csum 0x94374193 mirror 3
>> > [18275.055571] BTRFS warning (device sdf): csum failed root 5 ino 257 off 0 csum 0x98f94189 expected csum 0x94374193 mirror 1
>> > [18275.062171] BTRFS warning (device sdf): csum failed root 5 ino 257 off 12288 csum 0x98f94189 expected csum 0x94374193 mirror 1
>> >
>> > A scrub will also fail correcting bad copies, mentioning the following in
>> > dmesg/syslog:
>> >
>> > [18276.128696] scrub_handle_errored_block: 498 callbacks suppressed
>> > [18276.129617] BTRFS warning (device sdf): checksum error at logical 2186346496 on dev /dev/sde, sector 2116608, root 5, inode 257, offset 65536, length 4096, links $
>> > [18276.149235] btrfs_dev_stat_print_on_error: 498 callbacks suppressed
>> > [18276.157897] BTRFS error (device sdf): bdev /dev/sde errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
>> > [18276.206059] BTRFS warning (device sdf): checksum error at logical 2186477568 on dev /dev/sdd, sector 2116736, root 5, inode 257, offset 196608, length 4096, links$
>> > [18276.206059] BTRFS error (device sdf): bdev /dev/sdd errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
>> > [18276.306552] BTRFS warning (device sdf): checksum error at logical 2186543104 on dev /dev/sdd, sector 2116864, root 5, inode 257, offset 262144, length 4096, links$
>> > [18276.319152] BTRFS error (device sdf): bdev /dev/sdd errs: wr 0, rd 0, flush 0, corrupt 2, gen 0
>> > [18276.394316] BTRFS warning (device sdf): checksum error at logical 2186739712 on dev /dev/sdf, sector 2116992, root 5, inode 257, offset 458752, length 4096, links$
>> > [18276.396348] BTRFS error (device sdf): bdev /dev/sdf errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
>> > [18276.434127] BTRFS warning (device sdf): checksum error at logical 2186870784 on dev /dev/sde, sector 2117120, root 5, inode 257, offset 589824, length 4096, links$
>> > [18276.434127] BTRFS error (device sdf): bdev /dev/sde errs: wr 0, rd 0, flush 0, corrupt 2, gen 0
>> > [18276.500504] BTRFS error (device sdf): unable to fixup (regular) error at logical 2186477568 on dev /dev/sdd
>> > [18276.538400] BTRFS warning (device sdf): checksum error at logical 2186481664 on dev /dev/sdd, sector 2116744, root 5, inode 257, offset 200704, length 4096, links$
>> > [18276.540452] BTRFS error (device sdf): bdev /dev/sdd errs: wr 0, rd 0, flush 0, corrupt 3, gen 0
>> > [18276.542012] BTRFS error (device sdf): unable to fixup (regular) error at logical 2186481664 on dev /dev/sdd
>> > [18276.585030] BTRFS error (device sdf): unable to fixup (regular) error at logical 2186346496 on dev /dev/sde
>> > [18276.598306] BTRFS warning (device sdf): checksum error at logical 2186412032 on dev /dev/sde, sector 2116736, root 5, inode 257, offset 131072, length 4096, links$
>> > [18276.598310] BTRFS error (device sdf): bdev /dev/sde errs: wr 0, rd 0, flush 0, corrupt 3, gen 0
>> > [18276.598582] BTRFS error (device sdf): unable to fixup (regular) error at logical 2186350592 on dev /dev/sde
>> > [18276.603455] BTRFS error (device sdf): bdev /dev/sde errs: wr 0, rd 0, flush 0, corrupt 4, gen 0
>> > [18276.638362] BTRFS warning (device sdf): checksum error at logical 2186354688 on dev /dev/sde, sector 2116624, root 5, inode 257, offset 73728, length 4096, links $
>> > [18276.640445] BTRFS error (device sdf): bdev /dev/sde errs: wr 0, rd 0, flush 0, corrupt 5, gen 0
>> > [18276.645942] BTRFS error (device sdf): unable to fixup (regular) error at logical 2186354688 on dev /dev/sde
>> > [18276.657204] BTRFS error (device sdf): unable to fixup (regular) error at logical 2186412032 on dev /dev/sde
>> > [18276.660563] BTRFS warning (device sdf): checksum error at logical 2186416128 on dev /dev/sde, sector 2116744, root 5, inode 257, offset 135168, length 4096, links$
>> > [18276.664609] BTRFS error (device sdf): bdev /dev/sde errs: wr 0, rd 0, flush 0, corrupt 6, gen 0
>> > [18276.664609] BTRFS error (device sdf): unable to fixup (regular) error at logical 2186358784 on dev /dev/sde
>> >
>> > So fix this by using the bio_for_each_segment() API and setting before
>> > the bio's bi_iter field to the value of the corresponding btrfs bio
>> > container's saved iterator if we are processing a cloned bio in the
>> > raid5/6 code (the same code processes both cloned and non-cloned bios).
>> >
>> > This incorrect iteration of cloned bios was also causing some occasional
>> > BUG_ONs when running fstest btrfs/064, which have a trace like the
>> > following:
>> >
>> > [ 6674.416156] ------------[ cut here ]------------
>> > [ 6674.416157] kernel BUG at fs/btrfs/raid56.c:1897!
>> > [ 6674.416159] invalid opcode: 0000 [#1] PREEMPT SMP
>> > [ 6674.416160] Modules linked in: dm_flakey dm_mod dax ppdev tpm_tis parport_pc tpm_tis_core evdev tpm psmouse sg i2c_piix4 pcspkr parport i2c_core serio_raw button s
>> > [ 6674.416184] CPU: 3 PID: 19236 Comm: kworker/u32:10 Not tainted 4.12.0-rc6-btrfs-next-44+ #1
>> > [ 6674.416185] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014
>> > [ 6674.416210] Workqueue: btrfs-endio btrfs_endio_helper [btrfs]
>> > [ 6674.416211] task: ffff880147f6c740 task.stack: ffffc90001fb8000
>> > [ 6674.416229] RIP: 0010:__raid_recover_end_io+0x1ac/0x370 [btrfs]
>> > [ 6674.416230] RSP: 0018:ffffc90001fbbb90 EFLAGS: 00010217
>> > [ 6674.416231] RAX: ffff8801ff4b4f00 RBX: 0000000000000002 RCX: 0000000000000001
>> > [ 6674.416232] RDX: ffff880099b045d8 RSI: ffffffff81a5f6e0 RDI: 0000000000000004
>> > [ 6674.416232] RBP: ffffc90001fbbbc8 R08: 0000000000000001 R09: 0000000000000001
>> > [ 6674.416233] R10: ffffc90001fbbac8 R11: 0000000000001000 R12: 0000000000000002
>> > [ 6674.416234] R13: ffff880099b045c0 R14: 0000000000000004 R15: ffff88012bff2000
>> > [ 6674.416235] FS: 0000000000000000(0000) GS:ffff88023f2c0000(0000) knlGS:0000000000000000
>> > [ 6674.416235] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> > [ 6674.416236] CR2: 00007f28cf282000 CR3: 00000001000c6000 CR4: 00000000000006e0
>> > [ 6674.416239] Call Trace:
>> > [ 6674.416259] __raid56_parity_recover+0xfc/0x16e [btrfs]
>> > [ 6674.416276] raid56_parity_recover+0x157/0x16b [btrfs]
>> > [ 6674.416293] btrfs_map_bio+0xe0/0x259 [btrfs]
>> > [ 6674.416310] btrfs_submit_bio_hook+0xbf/0x147 [btrfs]
>> > [ 6674.416327] end_bio_extent_readpage+0x27b/0x4a0 [btrfs]
>> > [ 6674.416331] bio_endio+0x17d/0x1b3
>> > [ 6674.416346] end_workqueue_fn+0x3c/0x3f [btrfs]
>> > [ 6674.416362] btrfs_scrubparity_helper+0x1aa/0x3b8 [btrfs]
>> > [ 6674.416379] btrfs_endio_helper+0xe/0x10 [btrfs]
>> > [ 6674.416381] process_one_work+0x276/0x4b6
>> > [ 6674.416384] worker_thread+0x1ac/0x266
>> > [ 6674.416386] ? rescuer_thread+0x278/0x278
>> > [ 6674.416387] kthread+0x106/0x10e
>> > [ 6674.416389] ? __list_del_entry+0x22/0x22
>> > [ 6674.416391] ret_from_fork+0x27/0x40
>> > [ 6674.416395] Code: 44 89 e2 be 00 10 00 00 ff 15 b0 ab ef ff eb 72 4d 89 e8 89 d9 44 89 e2 be 00 10 00 00 ff 15 a3 ab ef ff eb 5d 41 83 fc ff 74 02 <0f> 0b 49 63 97
>> > [ 6674.416432] RIP: __raid_recover_end_io+0x1ac/0x370 [btrfs] RSP: ffffc90001fbbb90
>> > [ 6674.416434] ---[ end trace 74d56ebe7489dd6a ]---
>> >
>>
>> Thank you so much for the fix!
>>
>>
>> > Signed-off-by: Filipe Manana <fdmanana@xxxxxxxx>
>> > ---
>> > fs/btrfs/raid56.c | 26 ++++++++++++++++++--------
>> > 1 file changed, 18 insertions(+), 8 deletions(-)
>> >
>> > diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
>> > index b9abb0b01021..b89d07003697 100644
>> > --- a/fs/btrfs/raid56.c
>> > +++ b/fs/btrfs/raid56.c
>> > @@ -1136,20 +1136,27 @@ static void validate_rbio_for_rmw(struct btrfs_raid_bio *rbio)
>> > static void index_rbio_pages(struct btrfs_raid_bio *rbio)
>> > {
>> > struct bio *bio;
>> > - struct bio_vec *bvec;
>> > u64 start;
>> > unsigned long stripe_offset;
>> > unsigned long page_index;
>> > - int i;
>> >
>> > spin_lock_irq(&rbio->bio_list_lock);
>> > bio_list_for_each(bio, &rbio->bio_list) {
>> > + struct bio_vec bvec;
>> > + struct bvec_iter iter;
>> > + int i = 0;
>> > +
>> > start = (u64)bio->bi_iter.bi_sector << 9;
>> > stripe_offset = start - rbio->bbio->raid_map[0];
>> > page_index = stripe_offset >> PAGE_SHIFT;
>> >
>> > - bio_for_each_segment_all(bvec, bio, i)
>> > - rbio->bio_pages[page_index + i] = bvec->bv_page;
>> > + if (bio_flagged(bio, BIO_CLONED))
>> > + bio->bi_iter = btrfs_io_bio(bio)->iter;
>> > +
>>
>> I think we can use use bio->bi_iter directly as the bio is not
>> submitted yet, i.e. bi_iter is not advanced yet.
>>
>> > + bio_for_each_segment(bvec, bio, iter) {
>> > + rbio->bio_pages[page_index + i] = bvec.bv_page;
>> > + i++;
>> > + }
>> > }
>> > spin_unlock_irq(&rbio->bio_list_lock);
>> > }
>> > @@ -1423,11 +1430,14 @@ static int fail_bio_stripe(struct btrfs_raid_bio *rbio,
>> > */
>> > static void set_bio_pages_uptodate(struct bio *bio)
>> > {
>> > - struct bio_vec *bvec;
>> > - int i;
>> > + struct bio_vec bvec;
>> > + struct bvec_iter iter;
>> > +
>> > + if (bio_flagged(bio, BIO_CLONED))
>> > + bio->bi_iter = btrfs_io_bio(bio)->iter;
>> >
>
> It is not necessary to update set_bio_pages_uptodate() because the bio
> here is for reading pages in RMW and no way it could be a cloned one.
>From my testing I've never seen this function getting cloned bios as
well. But I'm afraid we don't have full test coverage and it scares me
to have code around that can cause silent corruptions or crashes due
to such subtle details (bi_vcnt can't be used for cloned bios and it's
not obvious which APIs from the block layer make use of it without
reading their code). Plus a quick look I had at btrfs_map_bio()
suggested it can clone bios used for read operations through
btrfs_submit_bio_hook() for example. So that's why I changed that
function too to be able to deal both with cloned and non-cloned bios.
>
> thanks,
>
> -liubo
>
>>
>> Ditto.
>>
>> Others look good.
>>
>> Reviewed-by: Liu Bo <bo.li.liu@xxxxxxxxxx>
>>
>> -liubo
>>
>> > - bio_for_each_segment_all(bvec, bio, i)
>> > - SetPageUptodate(bvec->bv_page);
>> > + bio_for_each_segment(bvec, bio, iter)
>> > + SetPageUptodate(bvec.bv_page);
>> > }
>> >
>> > /*
>> > --
>> > 2.11.0
>> >
>> > --
>> > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>> > the body of a message to majordomo@xxxxxxxxxxxxxxx
>> > More majordomo info at http://vger.kernel.org/majordomo-info.html
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>> the body of a message to majordomo@xxxxxxxxxxxxxxx
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html