Josef Bacik wrote:
On Wed, Feb 10, 2010 at 01:53:50PM -0500, jim owens wrote:
+ /* traditional 512-byte device sector alignment is the
+ * minimum required. if they have a larger sector disk
+ * (possibly multiple sizes in the filesystem) and need
+ * a larger alignment for this I/O, we just fail later.
+ */
+ if (offset & 511)
+ return -EINVAL;
+
+ /* check memory alignment, blocks cannot straddle pages.
+ * allow 0-length vectors which are questionable but seem legal.
+ */
+ for (seg = 0; seg < nr_segs; seg++) {
+ if (iov[seg].iov_len && ((unsigned long)iov[seg].iov_base & 511))
+ return -EINVAL;
+ if (iov[seg].iov_len & 511)
+ return -EINVAL;
+ done += iov[seg].iov_len;
+ }
Let's try to keep to what everybody else does and just limit it based on
blocksize. That way we stay consistent with the rest of the fs's, and we
don't have an ugly magic number.
The problem that I did not explain in my comment is that other fs's
get the blocksize from 1 device bdi, which usually will be 512. But
the btrfs bdi covers all volumes and reports the btrfs logical block
size of 4096. If we do what is done in direct-io.c then we only allow
access on 4k (or, in the future, larger logical) boundaries.
While the hard-coded value is ugly, at least you see it without chasing
some silly #define. As I said in my [00], I think we need to go with
whatever the device supports, because people expect directio to work
in 512-byte multiples.
This 512-byte handling dictates a bunch of code throughout, but I
felt it was important to do it. If people really think the logical
fs block size is the way to go, we can delete code.
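For comparison, here is roughly what the blocksize-only check you are
suggesting would look like (untested sketch, reusing the variables that
are already in btrfs_direct_IO):

	unsigned blockmask = BTRFS_I(inode)->root->sectorsize - 1;

	/* reject anything not aligned to the btrfs logical block size */
	if (offset & blockmask)
		return -EINVAL;
	for (seg = 0; seg < nr_segs; seg++) {
		if (iov[seg].iov_len &&
		    ((unsigned long)iov[seg].iov_base & blockmask))
			return -EINVAL;
		if (iov[seg].iov_len & blockmask)
			return -EINVAL;
		done += iov[seg].iov_len;
	}

That loses the 512-byte granularity users expect, which is the whole
argument above.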
Note that my "smaller than a logical block" decision also gates code
I wrote to allow something like a raid 1 where one device has
512-byte blocks and the other has 4k blocks. This works fine for
cached I/O, so I wanted to make it work here too.
On the other hand, if they make a mixed raid5, any 512-byte access
to an extent on the 4k drive will be caught as an error (your later
comment where you think I don't need the check), but any extent
on the 512-byte disk can be read on a 512-byte boundary. This
might be confusing to users, but it strictly follows the directio rules.
+ diocb->rw = rw;
+ diocb->kiocb = kiocb;
+ diocb->start = offset;
+ diocb->begin = offset;
+ diocb->terminate = offset + done;
This is more of a preference and less of a technical issue, but it would be nice
to be able to clean this stuff up so we don't have a bunch of fields that all
carry the same information.
I wanted to do a cleanup... but I was under intense time pressure to
get (what I thought was) a complete working version up to the list.
Now I have some more time so I'll take a stab at it.
+ diocb->inode = inode;
+ diocb->blocksize = BTRFS_I(diocb->inode)->root->sectorsize;
+
We can just take this out of the inode, no sense in carrying blocksize around by
itself.
This is one of those performance trade-offs. I did that because it cuts
two derefs each time blocksize is used... and there are 31 uses by
my count. I am trading 4 bytes of extra memory for fewer instructions.
It really would not be 62 instructions because the compiler will
probably optimize some of them away. We don't go nuts about instruction
counts like some kernel subsystems, so if people want it the other
way, I'm fine changing it.
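For concreteness, the trade-off is between these two forms every place
blocksize is needed ('mask' here is just illustrative):

	/* cached copy in the diocb (what the patch does now) */
	mask = diocb->blocksize - 1;

	/* vs. chasing the inode each time, as you suggest */
	mask = BTRFS_I(diocb->inode)->root->sectorsize - 1;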
+ diocb->umc.user_iov = iov;
+ diocb->umc.work_iov = *iov;
Really? There has to be a better way to get this done.
Well, if you have an idea, let me know. I know this stuff is ugly, but
I cannot modify the input kiocb, and I need to deal with the problem
that the #%$^ multiple user memory vectors are only 512-byte aligned
and are likely not on device extent boundaries anyway, even if they
are page aligned.
While I hope no application writer is insane enough to do it, the
way the interface is designed, they could say "read 16K" and hand me
32 vectors of 512 bytes each.
The only way to simplify the memory handling is to be like direct-io.c
and drive the top level by walking the user vectors. But that would
greatly suck for btrfs extent handling, particularly for compression.
I chose to write ugly vector-handling code to make the performance
the best possible, which means driving the top level by file extent.
+ if (is_sync_kiocb(diocb->kiocb)) {
+ if (diocb->rw == READ)
+ btrfs_dio_read(diocb);
+ else
+ btrfs_dio_write(diocb);
+ done = btrfs_dio_wait(diocb);
+
+ btrfs_dio_free_diocb(diocb);
+ return done;
+ } else {
+ diocb->submit.func = btrfs_dio_aio_submit;
+ btrfs_queue_worker(&BTRFS_I(diocb->inode)->root->fs_info->
+ submit_workers, &diocb->submit);
+ return -EIOCBQUEUED;
+ }
+}
+
Again just a nit, but can we do ret = done / ret = -EIOCBQUEUED and then have
a single
return ret;
at the bottom of the function? It just looks nicer.
I don't have a problem doing it in this instance, just don't expect
consistency everywhere... sometimes it is easier to understand or
more efficient or cleaner to have multiple returns.
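Something like this, I assume (same logic as above, just restructured;
untested):

	ssize_t ret;

	if (is_sync_kiocb(diocb->kiocb)) {
		if (diocb->rw == READ)
			btrfs_dio_read(diocb);
		else
			btrfs_dio_write(diocb);
		ret = btrfs_dio_wait(diocb);
		btrfs_dio_free_diocb(diocb);
	} else {
		diocb->submit.func = btrfs_dio_aio_submit;
		btrfs_queue_worker(&BTRFS_I(diocb->inode)->root->fs_info->
				   submit_workers, &diocb->submit);
		ret = -EIOCBQUEUED;
	}
	return ret;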
+static void btrfs_dio_read(struct btrfs_diocb *diocb)
+{
+ struct extent_io_tree *io_tree = &BTRFS_I(diocb->inode)->io_tree;
+ u64 end = diocb->terminate; /* copy because reaper changes it */
+ u64 data_len;
+ int err = 0;
+ int loop = 0;
+
+ /* expand lock region to include what we read to validate checksum */
+ diocb->lockstart = diocb->start & ~(diocb->blocksize-1);
+ diocb->lockend = ALIGN(diocb->terminate, diocb->blocksize) - 1;
+
Ok, so we keep track of how much we read in btrfs_dio_inline_read and such,
and those functions only modify lockstart by += len. So we ought to be able
to take lockstart and lockend out of the diocb and keep them local, do
lockstart += the len we get back from the individual read functions, and then
unlock whatever remains at the end.
Hmm, lockend is easy to make local. I think lockstart has a problem because
it does not match extent start or data start, but I'll check into it.
+getlock:
+ mutex_lock(&diocb->inode->i_mutex);
+
We don't need the i_mutex here.
I disagree, see the following code. If we don't hold the mutex, we can
have a new writer come in between the ordered wait and i_size_read(),
so I would blow up if the file size was smaller than the user-specified
data_len at the ordered wait but grows before I fetch the file size.
I will not have flushed that new data to disk, but I will read it.
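Spelled out, the interleaving I'm worried about looks roughly like this:

	direct reader                       racing writer
	-------------                       -------------
	btrfs_wait_ordered_range() returns
	                                    new write extends i_size past
	                                    the end of our read range
	i_size_read() sees the new larger
	size, so 'end' is not clamped, but
	the new data was never flushed, so
	we read stale blocks from disk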
+ /* ensure writeout and btree update on everything
+ * we might read for checksum or compressed extents
+ */
+ data_len = diocb->lockend + 1 - diocb->lockstart;
+ err = btrfs_wait_ordered_range(diocb->inode, diocb->lockstart, data_len);
+ if (err) {
+ diocb->error = err;
+ mutex_unlock(&diocb->inode->i_mutex);
+ return;
+ }
+ data_len = i_size_read(diocb->inode);
+ if (data_len < end)
+ end = data_len;
+ if (end <= diocb->start) {
+ mutex_unlock(&diocb->inode->i_mutex);
+ goto fail; /* 0 is returned past EOF */
+ }
+ if (!loop) {
+ loop++;
+ diocb->terminate = end;
+ diocb->lockend = ALIGN(diocb->terminate, diocb->blocksize) - 1;
+ }
+
+ lock_extent(io_tree, diocb->lockstart, diocb->lockend, GFP_NOFS);
Ok, so between the btrfs_wait_ordered_range and the lock_extent we could have
had another ordered extent come in, so here we should be checking to see if
there is an ordered extent, and if there is, put it and go back to getlock.
Look at btrfs_page_mkwrite() for an example of what I'm talking about. It
would probably be good to move all the size-read stuff under the
lock_extent(), but I couldn't quite figure out how to do it nicely, so either way.
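Something like this (untested, just to show the shape of what I mean):

	struct btrfs_ordered_extent *ordered;

	lock_extent(io_tree, diocb->lockstart, diocb->lockend, GFP_NOFS);
	ordered = btrfs_lookup_ordered_extent(diocb->inode, diocb->lockstart);
	if (ordered) {
		unlock_extent(io_tree, diocb->lockstart, diocb->lockend,
			      GFP_NOFS);
		btrfs_put_ordered_extent(ordered);
		goto getlock;
	}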
+ mutex_unlock(&diocb->inode->i_mutex);
Holding the mutex elegantly fixes that whole problem
because I am protecting everything from new writes.
+/* for consistent eof processing between inline/compressed/normal
+ * extents, an unaligned eof gets special treatment: read into temp
+ * and memcpy to the user on completion the part that does not match
+ * the user's I/O alignment (for now always 511)
+ */
+static void btrfs_dio_eof_tail(u32 *filetail, int eof, struct btrfs_diocb *diocb)
+{
+ if (eof)
+ *filetail &= 511;
+ else
+ *filetail = 0; /* aligned direct to user memory */
+}
Again, blocksize alignment, not 512.
Again, feature decision required about blocksize.
+ /* device start and length may not be sector aligned or
+ * user memory address/length vectors may not be aligned
+ * on a device sector because device sector size is > 512.
+ * we might have different size devices in the filesystem,
+ * so retry all copies to see if any meet the alignment.
+ */
+ iomask = bdev_logical_block_size(btrfs_map_stripe_bdev(extcb->em, dvn)) - 1;
+ if ((extcb->diodev[dvn].physical & iomask) || (dev_left & iomask) ||
+ (!temp_pages &&
+ btrfs_dio_not_aligned(iomask, (u32)dev_left,
+ &extcb->diocb->umc))) {
The btrfs_dio_inline_read check doesn't seem necessary since we did the
alignment check in btrfs_direct_IO, you can just kill it.
Nope, I can only kill it if we say all devices are forced to be
the same blocksize and we know what that is in btrfs_direct_IO.
+
+static void btrfs_dio_put_next_in(struct bio_vec *vec,
+ struct btrfs_dio_extcb *extcb)
+{
+ while (vec->bv_len) {
+ unsigned int bv_len;
+ if (extcb->bo_frag) {
+ /* current bi_io_vec is part of this put-back */
+ vec->bv_len += extcb->bo_frag;
+ extcb->bo_frag = 0;
+ /* else put-back begins at previous bi_io_vec or bio */
+ } else if (extcb->bo_bvn) {
+ extcb->bo_bvn--;
+ } else {
+ extcb->bo_now--;
+ extcb->bo_bvn = extcb->order[extcb->bo_now]->bi_vcnt - 1;
+ }
+
+ bv_len = extcb->order[extcb->bo_now]->bi_io_vec[extcb->bo_bvn].bv_len;
+ if (vec->bv_len < bv_len) {
+ extcb->bo_frag = vec->bv_len;
+ vec->bv_len = 0;
+ return;
+ }
+ vec->bv_len -= bv_len;
+ }
+}
Again, this is all quite scary and very fragile; I would hate to have to try
and figure out what was going wrong in here. Is there a way we can make this
less complicated?
I'm afraid to tell you, but this is the simplified version.
I would hope it is not fragile, but it is scary.
Again, this is all about those painful user memory addresses not being
aligned with our logical blocks: they can cross multiple devices, and
within one device bio they can cross two bio_vecs, because the user
memory is only 512-byte aligned, not page-aligned, so part of a block
is in one page and part in another. And if they do that, every logical
block in the read is in two pages instead of one... arrgh!
It would all be so easy if we did not transfer into user memory or
did not do checksums.
Other than only allowing logical-block-bounded transfers, I have not
thought of a solution to make it simple. But ideas are always welcome.
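A made-up example of what I mean, with 4k logical blocks and a user buffer
that is 512-byte but not page aligned (say it starts at an address ending
in 0xe00):

	page A holds buffer bytes 0x000-0x1ff  ->  first 512 bytes of block 0
	page B holds buffer bytes 0x200-0xfff  ->  remaining 3584 bytes

so a single logical block needs two bio_vecs, a put-back can stop in the
middle of a bio_vec (which is what bo_frag tracks), and every following
block in the read is split across two pages the same way.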
+ /* ugly again - compressed extents can end with an implied hole */
+ if (!extcb->error && extcb->icb.out_len != len) {
+ while (extcb->umc.todo) {
+ struct bio_vec uv;
+ char *out;
+
+ extcb->error = btrfs_dio_get_user_bvec(&uv, &extcb->umc);
+ if (extcb->error)
+ goto fail;
+ out = kmap_atomic(uv.bv_page, KM_USER0);
+ memset(out + uv.bv_offset, 0, uv.bv_len);
+ kunmap_atomic(out, KM_USER0);
Umm, I'm just going to close my eyes and pretend this is necessary.
This one is another one I blame on Chris ;)
I found this wonderful surprise via fsx after getting past that other
wonderful surprise with the inline in a large file.
But on the other hand, this tiny drop of ugly in my ocean of ugly
is better than changing other parts of btrfs.
+static int btrfs_dio_drop_workbuf(struct btrfs_dio_extcb *extcb)
+{
+ extcb->icb.workspace = NULL;
+ extcb->tmpbuf = NULL;
+ extcb->tmpbuf_size = 0;
+ return 0;
+}
+
Can change this to just be a void
yes!!!, finally an easy one.
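i.e. presumably just:

	static void btrfs_dio_drop_workbuf(struct btrfs_dio_extcb *extcb)
	{
		extcb->icb.workspace = NULL;
		extcb->tmpbuf = NULL;
		extcb->tmpbuf_size = 0;
	}

with the callers no longer checking a return value.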
+/* submit kernel temporary pages for compressed read */
+static int btrfs_dio_add_temp_pages(u64 *dev_left, struct btrfs_dio_extcb *extcb,
+ int dvn)
+{
+ while (extcb->diodev[dvn].vecs && *dev_left) {
+ unsigned int pglen = min_t(long, *dev_left, PAGE_SIZE);
+ struct page *page = alloc_page(GFP_NOFS | __GFP_HIGHMEM);
+
+ if (!page)
+ return -ENOMEM;
+ if (!bio_add_page(extcb->diodev[dvn].bio, page, pglen, 0)) {
+ extcb->diodev[dvn].vecs = 0;
+ page_cache_release(page);
+ return 0;
+ }
+ extcb->csum_pg1 = page;
+ extcb->iolen += pglen;
+ extcb->diodev[dvn].physical += pglen;
+ *dev_left -= pglen;
+ extcb->diodev[dvn].vecs--;
+ }
+
+ return 0;
+}
Ok, so I assume this is why you have all of the kmap horrors I keep seeing?
Is it because we read the compressed data into these temporary pages, and then
uncompress the data into user pages? I'd just like to know for my own
edification.
Yes, compression is the main use of the temp pages: to hold the compressed
data, which can be smaller or bigger than the actual user read. I have to
read from disk into the temp pages, which become the source for zlib_inflate,
with the user memory as the output. The extra complication here is that we
can be spanning devices (the "dvn") even for a small extent, because it is
packed into the chunk, so you might need 28k of compressed data but 12k is
at the end of the dev0 stripe and 16k is at the beginning of the dev1 stripe.
The "extcb->csum_pg1 = page;" thing is for the other use of temp pages:
reading the part of the logical block at the beginning and/or end of
uncompressed reads that the user did not want, so we can checksum it.
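To picture that hypothetical 28k example split across a chunk stripe boundary:

	dev0 stripe:  [ ............ | 12k compressed ]   <- end of stripe
	dev1 stripe:  [ 16k compressed | ............ ]   <- start of stripe

so two diodev[] bios of temp pages get built and read, and the temp pages
from both are then the source handed to zlib_inflate, with the user vectors
as the output.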
+ out = kmap_atomic(uv.bv_page, KM_USER0);
+ memset(out + uv.bv_offset, 0, uv.bv_len);
+ kunmap_atomic(out, KM_USER0);
+
/me hands jim a zero_user_page()
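i.e. (assuming zero_user(), or zero_user_page() on older trees, is what we
want here) those three lines collapse to:

	zero_user(uv.bv_page, uv.bv_offset, uv.bv_len);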
There was actually a reason I chose not to use it, but I don't have access
to the kernel source right now so I'll figure that out later and either
explain why or change it if there was no good reason.
Check to make sure path isn't NULL.
Need proper tabbing.
Another nit: use a tab and then spaces out to just inside the "(", so it
looks like this
	if (blah &&
	    blah) {
		do foo
	}
Just one of these min_t's is needed.
Check to make sure alloc_page works.
OK, thanks, will do.
This is a lot of code so I'm sure I've missed other things, but I think that
covers all the major points. One other thing: there are _a lot_ of random
whitespace problems. Before you re-submit I'd recommend editing your ~/.vimrc
to add
let c_space_errors = 1
and then open up this file in vim and fix all of the highlighted red areas
where there is trailing whitespace. checkpatch.pl will also catch that if you
don't want to use vim; you should probably run checkpatch.pl anyway since it
checks other stuff too. I tested this patch and it worked fine with fsx
btw, so it mostly just needs to be cleaned up. Thanks,
Thanks for the review, and even more for both sets of testing you did.
jim