Re: [PATCH V2] Btrfs: Full direct I/O and AIO read implementation.

http://www.osnews.com/story/22872/Linux_Not_Fully_Prepared_for_4096-Byte_Sector_Hard_Drives



jim owens wrote:
> Josef Bacik wrote:
>> On Wed, Feb 10, 2010 at 01:53:50PM -0500, jim owens wrote:
>>> +    /* traditional 512-byte device sector alignment is the
>>> +     * minimum required. if they have a larger sector disk
>>> +     * (possibly multiple sizes in the filesystem) and need
>>> +     * a larger alignment for this I/O, we just fail later.
>>> +     */
>>> +    if (offset & 511)
>>> +        return -EINVAL;
>>> +
>>> +    /* check memory alignment, blocks cannot straddle pages.
>>> +     * allow 0-length vectors which are questionable but seem legal.
>>> +     */
>>> +    for (seg = 0; seg < nr_segs; seg++) {
>>> +        if (iov[seg].iov_len && ((unsigned long)iov[seg].iov_base & 511))
>>> +            return -EINVAL;
>>> +        if (iov[seg].iov_len & 511)
>>> +            return -EINVAL;
>>> +        done += iov[seg].iov_len;
>>> +    }
>>
>> Let's try to keep to what everybody else does and just limit it based on
>> blocksize.  That way we stay consistent with the rest of the filesystems,
>> and we don't have an ugly magic number.
>
> The problem that I did not explain in my comment is that other fs's
> get the blocksize from one device's bdi, which usually will be 512.  But
> the btrfs bdi covers all volumes and gives the btrfs logical 4096 size.
> If we do what is done in direct-io.c then we only allow access on 4k
> (or in the future larger logical) boundaries.
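>
> A minimal sketch of that direct-io.c-style alternative, assuming we used
> root->sectorsize as the mask (names are illustrative, not the actual patch):
>
>     unsigned blocksize_mask = BTRFS_I(inode)->root->sectorsize - 1;
>
>     if (offset & blocksize_mask)
>             return -EINVAL;
>     for (seg = 0; seg < nr_segs; seg++) {
>             if ((unsigned long)iov[seg].iov_base & blocksize_mask)
>                     return -EINVAL;
>             if (iov[seg].iov_len & blocksize_mask)
>                     return -EINVAL;
>     }
>
> With a 4096-byte sectorsize that rejects anything not 4k aligned, which
> is exactly the behavior I am trying to avoid.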
>
> While the hard-coded number is ugly, at least you see it without chasing
> some silly #define.  As I said in my [00], I think we need to go
> to whatever the device supports because people expect that directio
> works in 512-byte multiples.
>
> This 512-byte stuff dictates a bunch of coding throughout, but I
> felt it was important to do it.  If people really think logical
> fs block size is the way to go, we can delete code.
>
> Note that my "smaller than a logical block" decision also gates
> code which I did to allow something like a RAID 1 where one device
> has 512-byte blocks and the other has 4k blocks.  This should work
> fine for cache I/O so I wanted to make it work here too.
>
> On the other hand, if they make a mixed raid5, any 512-byte access
> to an extent on a 4k drive will be caught as an error (your later
> comment where you think I don't need the check) but any extent
> on the 512-byte disk can be read at the 512-byte boundary.  This
> might be confusing to users but strictly follows directio rules.
>
>>> +    diocb->rw = rw;
>>> +    diocb->kiocb = kiocb;
>>> +     diocb->start = offset;
>>> +    diocb->begin = offset;
>>> +    diocb->terminate = offset + done;
>>
>> This is more of a preference and less of a technical issue, but it
>> would be nice
>> to be able to clean this stuff up so we don't have a bunch of fields
>> that all
>> carry the same information.
>
> I wanted to do a cleanup... but I was under intense time pressure to
> get (what I thought was) a complete working version up to the list.
> Now I have some more time so I'll take a stab at it.
>
>>> +    diocb->inode = inode;
>>> +    diocb->blocksize = BTRFS_I(diocb->inode)->root->sectorsize;
>>> +
>>
>> We can just take this out of the inode, no sense in carrying
>> blocksize around by
>> itself.
>
> This is one of those performance trade-offs. I did that because it cuts
> 2 derefs for each time blocksize is used... and there are 31 of them by
> my count.  I am trading 4 bytes of extra memory for fewer instructions.
> It really would not be 62 instructions because the compiler probably
> will optimize some of them away. We don't go nuts about number of
> instructions like some kernel subsystems, so if people want it the other
> way, I'm fine changing it.
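>
> Concretely, the trade is between the cached field and re-walking the
> pointer chain at every use; a rough sketch of the two forms (the mask
> variable is only illustrative):
>
>     /* cached once in the diocb at setup time */
>     diocb->blocksize = BTRFS_I(diocb->inode)->root->sectorsize;
>     mask = diocb->blocksize - 1;
>
>     /* versus two extra dereferences at each of the ~31 uses */
>     mask = BTRFS_I(diocb->inode)->root->sectorsize - 1;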
>
>>> +    diocb->umc.user_iov = iov;
>>> +    diocb->umc.work_iov = *iov;
>>
>> Really?  There has to be a better way to get this done.
>
> Well, if you have an idea, let me know.  I know this stuff is ugly but
> I cannot modify the input kiocb, and I need to deal with the problem
> that the #%$^ multiple user memory vectors are only 512-byte aligned
> and are likely not on device extent boundaries anyway even if they
> are page-aligned.
>
> While I hope no application writer is insane enough to do it, the
> way the interface is designed, they could say read 16K and hand me
> 32 vectors with each being 512 bytes.
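>
> Just to illustrate the worst case the interface allows (a purely
> hypothetical caller, not anything in the patch):
>
>     /* a 16K read handed to us as 32 separate 512-byte vectors,
>      * each only 512-byte aligned, none starting on a page boundary
>      */
>     struct iovec iov[32];
>     for (i = 0; i < 32; i++) {
>             iov[i].iov_base = buf + 512 + i * 512;
>             iov[i].iov_len  = 512;
>     }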
>
> The only way to simplify memory handling is to be like direct-io.c
> and drive the top level by walking the user vectors.  But that would
> greatly suck for btrfs extent handling, particularly for compression.
> I chose to write ugly vector handling code to make the performance
> the best possible, which is to drive the top level by file extent.
>
>>> +    if (is_sync_kiocb(diocb->kiocb)) {
>>> +        if (diocb->rw == READ)
>>> +            btrfs_dio_read(diocb);
>>> +        else
>>> +            btrfs_dio_write(diocb);
>>> +        done = btrfs_dio_wait(diocb);
>>> +
>>> +        btrfs_dio_free_diocb(diocb);
>>> +        return done;
>>> +    } else {
>>> +        diocb->submit.func = btrfs_dio_aio_submit;
>>> +        btrfs_queue_worker(&BTRFS_I(diocb->inode)->root->fs_info->
>>> +                submit_workers, &diocb->submit);
>>> +        return -EIOCBQUEUED;
>>> +    }
>>> +}
>>> +
>>
>> Again just a nit, but can we do a ret = done / ret = -EIOCBQUEUED and
>> then have
>>
>> return ret;
>>
>> at the bottom of the function?  It just looks nicer.
>
> I don't have a problem doing it in this instance, just don't expect
> consistency everywhere... sometimes it is easier to understand or
> more efficient or cleaner to have multiple returns.
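>
> Something like this is what I understand you to mean here (a sketch of
> the same function tail with a single exit, not tested):
>
>     if (is_sync_kiocb(diocb->kiocb)) {
>         if (diocb->rw == READ)
>             btrfs_dio_read(diocb);
>         else
>             btrfs_dio_write(diocb);
>         ret = btrfs_dio_wait(diocb);
>         btrfs_dio_free_diocb(diocb);
>     } else {
>         diocb->submit.func = btrfs_dio_aio_submit;
>         btrfs_queue_worker(&BTRFS_I(diocb->inode)->root->fs_info->
>                 submit_workers, &diocb->submit);
>         ret = -EIOCBQUEUED;
>     }
>     return ret;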
>>> +static void btrfs_dio_read(struct btrfs_diocb *diocb)
>>> +{
>>> +    struct extent_io_tree *io_tree = &BTRFS_I(diocb->inode)->io_tree;
>>> +    u64 end = diocb->terminate; /* copy because reaper changes it */
>>> +    u64 data_len;
>>> +    int err = 0;
>>> +    int loop = 0;
>>> +
>>> +    /* expand lock region to include what we read to validate checksum */
>>> +    diocb->lockstart = diocb->start & ~(diocb->blocksize-1);
>>> +    diocb->lockend = ALIGN(diocb->terminate, diocb->blocksize) - 1;
>>> +
>>
>> Ok so we keep track of how much len we write in btrfs_dio_inline_read
>> and such, and those functions only modify lockstart by += len, so we
>> ought to be able to take lockstart and lockend out of the diocb, keep
>> them local, do lockstart += len with what we get back from the
>> individual read functions, and then unlock whatever remains at the end.
>
> Hmm, lockend is easy to make local.  I think lockstart has a problem
> because it does not match extent start or data start, but I'll check
> into it.
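>
> The shape you are describing would be roughly this (a sketch only,
> assuming each read helper hands back the length it handled):
>
>     u64 lockstart = diocb->start & ~((u64)diocb->blocksize - 1);
>     u64 lockend = ALIGN(diocb->terminate, diocb->blocksize) - 1;
>     ...
>     /* advance past what each read helper consumed */
>     lockstart += len;
>     ...
>     if (lockstart <= lockend)
>         unlock_extent(io_tree, lockstart, lockend, GFP_NOFS);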
>>> +getlock:
>>> +    mutex_lock(&diocb->inode->i_mutex);
>>> +       
>>
>> We don't need the i_mutex here.
>
> I disagree, see the following code.  If we don't hold the mutex then
> a new writer can come in between the ordered wait and i_size_read(),
> so I would blow up if the file size was less than the user-specified
> data_len at the ordered wait but grew before I fetch the file size.
> I would not have flushed that new data to disk but would read it.
>
>>> +    /* ensure writeout and btree update on everything
>>> +     * we might read for checksum or compressed extents
>>> +     */
>>> +    data_len = diocb->lockend + 1 - diocb->lockstart;
>>> +    err = btrfs_wait_ordered_range(diocb->inode, diocb->lockstart,
>>> +                        data_len);
>>> +    if (err) {
>>> +        diocb->error = err;
>>> +        mutex_unlock(&diocb->inode->i_mutex);
>>> +        return;
>>> +    }
>>> +    data_len = i_size_read(diocb->inode);
>>> +    if (data_len < end)
>>> +        end = data_len;
>>> +    if (end <= diocb->start) {
>>> +        mutex_unlock(&diocb->inode->i_mutex);
>>> +        goto fail; /* 0 is returned past EOF */
>>> +    }
>>> +    if (!loop) {
>>> +        loop++;
>>> +        diocb->terminate = end;
>>> +        diocb->lockend = ALIGN(diocb->terminate, diocb->blocksize) - 1;
>>> +    }
>>> +
>>> +    lock_extent(io_tree, diocb->lockstart, diocb->lockend, GFP_NOFS);
>>
>> Ok so between the btrfs_wait_ordered_range and the lock_extent we could
>> have had another ordered extent come in, so here we should be checking
>> to see if there is an ordered extent, and if there is, put it and go
>> back to getlock.  Look at btrfs_page_mkwrite() for an example of what
>> I'm talking about.  It would
>> probably be good to move all the size read stuff under the
>> lock_extent(), but I
>> couldn't quite figure out how to do it nicely, so either way.
>>
>>> +    mutex_unlock(&diocb->inode->i_mutex);
>
> Holding the mutex elegantly fixes that whole problem
> because I am protecting everything from new writes.
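>
> For reference, the btrfs_page_mkwrite() recheck you describe would look
> roughly like this if we went that route instead of holding i_mutex
> (a sketch adapted to the names in this patch, not tested):
>
>     lock_extent(io_tree, diocb->lockstart, diocb->lockend, GFP_NOFS);
>     ordered = btrfs_lookup_ordered_extent(diocb->inode, diocb->lockstart);
>     if (ordered) {
>         /* a new ordered extent raced in; drop the lock and retry */
>         unlock_extent(io_tree, diocb->lockstart, diocb->lockend, GFP_NOFS);
>         btrfs_start_ordered_extent(diocb->inode, ordered, 1);
>         btrfs_put_ordered_extent(ordered);
>         goto getlock;
>     }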
>
>>> +/* for consistent eof processing between inline/compressed/normal
>>> + * extents, an unaligned eof gets special treatment, read into temp
>>> + * and memcpy to user on completion the part that does not match
>>> + * the users I/O alignment (for now always 511)
>>> + */
>>> +static void btrfs_dio_eof_tail(u32 *filetail, int eof, struct btrfs_diocb *diocb)
>>> +{
>>> +    if (eof)
>>> +        *filetail &= 511;
>>> +    else
>>> +        *filetail = 0; /* aligned direct to user memory */
>>> +}
>>
>> Again, blocksize alignment, not 512.
>
> Again, feature decision required about blocksize.
>
>>> +        /* device start and length may not be sector aligned or
>>> +         * user memory address/length vectors may not be aligned
>>> +         * on a device sector because device sector size is > 512.
>>> +         * we might have different size devices in the filesystem,
>>> +         * so retry all copies to see if any meet the alignment.
>>> +         */
>>> +        iomask = bdev_logical_block_size(
>>> +                btrfs_map_stripe_bdev(extcb->em, dvn)) - 1;
>>> +        if ((extcb->diodev[dvn].physical & iomask) ||
>>> +                (dev_left & iomask) ||
>>> +                (!temp_pages &&
>>> +                btrfs_dio_not_aligned(iomask, (u32)dev_left,
>>> +                            &extcb->diocb->umc))) {
>>
>> The btrfs_dio_inline_read check doesn't seem necessary since we did the
>> alignment check in btrfs_direct_IO, you can just kill it.
>
> Nope, I can only kill it if we say all devices are forced to be
> the same blocksize and we know what that is in btrfs_direct_IO.
>
>>> +
>>> +static void btrfs_dio_put_next_in(struct bio_vec *vec,
>>> +                struct btrfs_dio_extcb *extcb)
>>> +{
>>> +    while (vec->bv_len) {
>>> +        unsigned int bv_len;
>>> +        if (extcb->bo_frag) {
>>> +            /* current bi_io_vec is part of this put-back */
>>> +            vec->bv_len += extcb->bo_frag;
>>> +            extcb->bo_frag = 0;
>>> +            /* else put-back begins at previous bi_io_vec or bio */
>>> +        } else if (extcb->bo_bvn) {
>>> +            extcb->bo_bvn--;
>>> +        } else {
>>> +            extcb->bo_now--;
>>> +            extcb->bo_bvn = extcb->order[extcb->bo_now]->bi_vcnt - 1;
>>> +        }
>>> +
>>> +        bv_len = extcb->order[extcb->bo_now]->bi_io_vec[extcb->bo_bvn].bv_len;
>>> +        if (vec->bv_len < bv_len) {
>>> +            extcb->bo_frag = vec->bv_len;
>>> +            vec->bv_len = 0;
>>> +            return;
>>> +        }
>>> +        vec->bv_len -= bv_len;
>>> +    }
>>> +}
>>
>> Again, this is all quite scary and very fragile; I would hate to have
>> to try and figure out what was going wrong in here.  Is there a way we
>> can make this less complicated?
>
> I'm afraid to tell you, but this is the simplified version.
>
> I would hope not fragile, but it is scary.
>
> Again, this is all about those painful user memory addresses not being
> aligned with our logical blocks, so they can cross multiple devices,
> and within one device bio they can cross two bio vecs, because the user
> memory is only 512-byte aligned, not page-aligned; part of a block ends
> up in one page and part in another page.  And if they do that, every
> logical block in the read is in two pages instead of one... arrgh!
> It would all be so easy if we did not transfer into user memory or
> did not do checksums.
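>
> A concrete example of the split (just arithmetic, not code from the
> patch): if the user buffer starts 512 bytes into a page, a 4096-byte
> logical block covers bytes 512..4095 of that page (3584 bytes) plus
> bytes 0..511 of the next page, so every such block needs two bio_vec
> entries instead of one.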
>
> Other than only allowing logical block bounded transfers, I did not
> think of a solution to make it simple.  But ideas are always welcome.
>
>>> +    /* ugly again - compressed extents can end with an implied hole */
>>> +    if (!extcb->error && extcb->icb.out_len != len) {
>>> +        while (extcb->umc.todo) {
>>> +            struct bio_vec uv;
>>> +            char *out;
>>> +
>>> +            extcb->error = btrfs_dio_get_user_bvec(&uv, &extcb->umc);
>>> +            if (extcb->error)
>>> +                goto fail;
>>> +            out = kmap_atomic(uv.bv_page, KM_USER0);
>>> +            memset(out + uv.bv_offset, 0, uv.bv_len);
>>> +            kunmap_atomic(out, KM_USER0);
>>
>> Umm, I'm just going to close my eyes and pretend this is necessary.
>
> This one is another one I blame on Chris ;)
>
> I found this wonderful surprise via fsx after getting past that other
> wonderful surprise with the inline in a large file.
>
> But on the other hand, this tiny drop of ugly in my ocean of ugly
> is better than changing other parts of btrfs.
>
>>> +static int btrfs_dio_drop_workbuf(struct btrfs_dio_extcb *extcb)
>>> +{
>>> +    extcb->icb.workspace = NULL;
>>> +    extcb->tmpbuf = NULL;
>>> +    extcb->tmpbuf_size = 0;
>>> +    return 0;
>>> +}
>>> +
>>
>> Can we change this to just be a void?
>
> Yes!!!  Finally, an easy one.
>
>>> +/* submit kernel temporary pages for compressed read */
>>> +static int btrfs_dio_add_temp_pages(u64 *dev_left, struct btrfs_dio_extcb *extcb,
>>> +                int dvn)
>>> +{
>>> +    while (extcb->diodev[dvn].vecs && *dev_left) {
>>> +        unsigned int pglen = min_t(long, *dev_left, PAGE_SIZE);
>>> +        struct page *page = alloc_page(GFP_NOFS | __GFP_HIGHMEM);
>>> +
>>> +        if (!page)
>>> +            return -ENOMEM;
>>> +        if (!bio_add_page(extcb->diodev[dvn].bio, page, pglen, 0)) {
>>> +            extcb->diodev[dvn].vecs = 0;
>>> +            page_cache_release(page);
>>> +            return 0;
>>> +        }
>>> +        extcb->csum_pg1 = page;
>>> +        extcb->iolen += pglen;
>>> +        extcb->diodev[dvn].physical += pglen;
>>> +        *dev_left -= pglen;
>>> +        extcb->diodev[dvn].vecs--;
>>> +    }
>>> +
>>> +    return 0;
>>> +}
>>
>> Ok so I assume this is why you have all of the kmap horrors I keep
>> seeing?  Is it because we read the compressed data into these temporary
>> pages, and then uncompress the data into user pages?  I'd just like to
>> know for my own edification.
>
> Yes, compression is the main use of the temp pages, to hold the
> compressed stuff which can be smaller or bigger than the actual user
> read.  I have to do the read from disk into the temp pages, which
> becomes the source to the zlib_inflate, and the user memory is the
> output.  The extra complication here is that we can be spanning
> devices, the "dvn", even for a small extent because it is packed into
> the chunk, so you might need 28k of compressed data but 12k is at the
> end of the dev0 stripe and 16k is at the beginning of the dev1 stripe.
> The "extcb->csum_pg1 = page;" thing is for the other use of temp pages,
> to read the part of the logical block at the beginning and/or end of
> uncompressed reads that the user did not want, so we can checksum it.
>
>>> +        out = kmap_atomic(uv.bv_page, KM_USER0);
>>> +        memset(out + uv.bv_offset, 0, uv.bv_len);
>>> +        kunmap_atomic(out, KM_USER0);
>>> +
>>
>> /me hands jim a zero_user_page()
>
> There was actually a reason I chose not to use it, but I don't have
> access
> to the kernel source right now so I'll figure that out later and either
> explain why or change it if there was no good reason.
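>
> For the record, the helper form would just be the one-liner below
> (assuming the current zero_user() spelling of that helper, which wraps
> the same kmap_atomic/memset/kunmap_atomic sequence plus a dcache flush):
>
>     zero_user(uv.bv_page, uv.bv_offset, uv.bv_len);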
>
>> Check to make sure path isn't NULL.
>>
>> Need proper tabbing.
>>
>> Another nit: tab and then spaces to line up with the inside of the (,
>> so it looks like this
>>
>> if (blah &&
>>     blah) {
>>     do foo
>> }
>>
>> Just one of these min_t's is needed.
>>
>> Check to make sure alloc_page works.
>
> OK, thanks, will do.
>
>> This is a lot of code so I'm sure I've missed other things, but I think
>> that covers all the major points.  One other thing is there are _a lot_
>> of random whitespace problems.  Before you re-submit I'd recommend
>> editing your ~/.vimrc and adding
>>
>> let c_space_errors = 1
>>
>> and then open up this file in vim and see all of the highlighted red
>> areas where there is trailing whitespace, and get rid of it.
>> checkpatch.pl will also catch that if you don't want to use vim; you
>> should probably use checkpatch.pl anyway since it checks other stuff.
>> I tested this patch and it worked fine with fsx btw, so it mostly just
>> needs to be cleaned up.  Thanks,
>
> Thanks for the review, and even more for both sets of testing you did.
>
> jim

