Re: Reason for md raid 01 blksize limited to 4 KiB?
On Wed, 30 May 2012 15:03:16 +0200 Sebastian Riemer
<sebastian.riemer@xxxxxxxxxxxxxxxx> wrote:
> On 29/05/12 12:25, NeilBrown wrote:
> > On Tue, 29 May 2012 11:30:27 +0200 Sebastian Riemer
> > <sebastian.riemer@xxxxxxxxxxxxxxxx> wrote:
> >> Now, I've updated mdadm to version 3.2.5 and it works like you've
> >> described it. Thanks for your help! But the buffered IO is what matters.
> >> 4k isn't enough there. Please inform me about changes which increase the
> >> size in buffered IO. I'll have a look at this, too.
> >
> > I don't know. I'd have to dive into the code and look around and put a few
> > printks in to see what is happening.
>
> Now, I've configured a storage server with real HDDs for testing the
> cached IO with kernel 3.4. Here direct IO never works
> (Input/Output error with dd/fio), and cached IO is very slow. My
> RAID0 devices are md100 and md200. The RAID1 on top is the md300.
>
> The md100 is reported as "faulty spare" and this has hit the following
> kernel bug.
>
> This is the debug output:
>
> md/raid0:md100: make_request bug: can't convert block across chunks or
> bigger than 512k 541312 320
> md/raid0:md200: make_request bug: can't convert block across chunks or
> bigger than 512k 541312 320
> md/raid1:md300: Disk failure on md100, disabling device.
> md/raid1:md300: Operation continuing on 1 devices.
> RAID1 conf printout:
> --- wd:1 rd:2
> disk 0, wo:1, o:0, dev:md100
> disk 1, wo:0, o:1, dev:md200
> RAID1 conf printout:
> --- wd:1 rd:2
> disk 1, wo:0, o:1, dev:md200
> md/raid0:md200: make_request bug: can't convert block across chunks or
> bigger than 512k 2704000 320
>
> The chunk size of 320 KiB comes from max_sectors_kb of the LSI HW RAID
> controller where the drives are passed through as single drive RAID0
> logical devices. I guess this is a problem for MD RAID0 underneath the
> RAID1, because this doesn't fit as a multiple of the 512 KiB stripe size.
Hmmm... that's bad. Looks like I have a bug .... yes I do. Patch below
fixes it. If you could test and confirm I would appreciate it.
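To make the failing condition in the log concrete, here is a minimal userspace sketch of the boundary check. It is not the kernel's code, and `crosses_chunk` is a hypothetical helper. It assumes the two trailing numbers in each "make_request bug" line are the starting sector (in 512-byte sectors) and the request size in KiB, and that the RAID0 chunk size is the 512 KiB named in the message.

```c
#include <stdint.h>

/*
 * Hypothetical helper, not kernel code: returns 1 if a request that starts
 * at `sector` (512-byte sectors) and is `size_kib` KiB long extends past the
 * end of the chunk it starts in, for a chunk size of `chunk_kib` KiB.
 * The modulo form works for any chunk size, not only powers of two.
 */
int crosses_chunk(uint64_t sector, uint32_t size_kib, uint32_t chunk_kib)
{
	uint64_t chunk_sects = (uint64_t)chunk_kib * 2;  /* 1 KiB = 2 sectors */
	uint64_t len_sects = (uint64_t)size_kib * 2;
	uint64_t offset = sector % chunk_sects;          /* offset inside the chunk */

	return offset + len_sects > chunk_sects;
}
```

Under those assumptions, both logged requests (sector 541312 and sector 2704000, 320 KiB each) start 640 sectors into a 1024-sector chunk and are 640 sectors long, so each one runs 256 sectors past the chunk boundary and `crosses_chunk()` returns 1 for both.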
As for the cached writes being always 4K - are you writing through a
filesystem or directly to /dev/md300?
If the former, it is a bug in that filesystem.
If the latter, it is a bug in fs/block_dev.c
In particular, fs/block_dev.c uses "generic_writepages" for the
"writepages" method rather than "mpage_writepages" (or a wrapper which
calls it with appropriate args).
'generic_writepages' simply calls ->writepage on each dirty page.
mpage_writepages (used e.g. by ext2) collects multiple pages into
a single bio.
The elevator at the device level should still collect these 1-page bios into
larger requests, but I guess that has higher CPU overhead.
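The difference between the two writeback strategies described above can be illustrated with a toy userspace model. This is only a sketch and not the kernel implementation: `bios_per_page`, `bios_merged` and the `max_pages` limit are hypothetical names chosen for illustration, and the model only counts how many bios would be issued for a sorted list of dirty page indexes.

```c
#include <stddef.h>

/*
 * Toy model of writing each dirty page on its own: every page becomes
 * its own single-page bio, so the bio count equals the page count.
 */
size_t bios_per_page(const unsigned long *pages, size_t n)
{
	(void)pages;
	return n;
}

/*
 * Toy model of collecting pages into larger bios: pages with consecutive
 * indexes share a bio, up to `max_pages` pages per bio (must be >= 1).
 * `pages` must be sorted in ascending order.
 */
size_t bios_merged(const unsigned long *pages, size_t n, size_t max_pages)
{
	size_t bios = 0;
	size_t in_bio = 0;	/* pages already placed in the current bio */

	for (size_t i = 0; i < n; i++) {
		int contiguous = i > 0 && pages[i] == pages[i - 1] + 1;

		/* Start a new bio at the first page, at a gap, or when full. */
		if (in_bio == 0 || !contiguous || in_bio == max_pages) {
			bios++;
			in_bio = 0;
		}
		in_bio++;
	}
	return bios;
}
```

For dirty pages 0, 1, 2, 3, 10 and 11, the per-page model issues six bios, while the merging model with a generous `max_pages` issues two. The first case is the one that leaves the elevator to rebuild large requests from 4 KiB pieces.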
thanks for the report.
NeilBrown
From dd47a247ae226896205f753ad246cd40141aadf1 Mon Sep 17 00:00:00 2001
From: NeilBrown <neilb@xxxxxxx>
Date: Thu, 31 May 2012 15:39:11 +1000
Subject: [PATCH] md: raid1/raid10: fix problem with merge_bvec_fn
The new merge_bvec_fn which calls the corresponding function
in subsidiary devices requires that mddev->merge_check_needed
be set if any child has a merge_bvec_fn.
However we were only setting that when a device was hot-added,
not when a device was present from the start.
This bug was introduced in 3.4 so patch is suitable for 3.4.y
kernels.
Cc: stable@xxxxxxxxxxxxxxx
Reported-by: Sebastian Riemer <sebastian.riemer@xxxxxxxxxxxxxxxx>
Signed-off-by: NeilBrown <neilb@xxxxxxx>
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 15dd59b..d7e9577 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -2548,6 +2548,7 @@ static struct r1conf *setup_conf(struct mddev *mddev)
err = -EINVAL;
spin_lock_init(&conf->device_lock);
rdev_for_each(rdev, mddev) {
+ struct request_queue *q;
int disk_idx = rdev->raid_disk;
if (disk_idx >= mddev->raid_disks
|| disk_idx < 0)
@@ -2560,6 +2561,9 @@ static struct r1conf *setup_conf(struct mddev *mddev)
if (disk->rdev)
goto abort;
disk->rdev = rdev;
+ q = bdev_get_queue(rdev->bdev);
+ if (q->merge_bvec_fn)
+ mddev->merge_check_needed = 1;
disk->head_position = 0;
}
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 3f91c2e..d037adb 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -3311,7 +3311,7 @@ static int run(struct mddev *mddev)
(conf->raid_disks / conf->near_copies));
rdev_for_each(rdev, mddev) {
-
+ struct request_queue *q;
disk_idx = rdev->raid_disk;
if (disk_idx >= conf->raid_disks
|| disk_idx < 0)
@@ -3327,6 +3327,9 @@ static int run(struct mddev *mddev)
goto out_free_conf;
disk->rdev = rdev;
}
+ q = bdev_get_queue(rdev->bdev);
+ if (q->merge_bvec_fn)
+ mddev->merge_check_needed = 1;
disk_stack_limits(mddev->gendisk, rdev->bdev,
rdev->data_offset << 9);