Reading -less- than the entire file is a required attribute of the S3 API: the Range HTTP header may be supplied with the GET method, specifying the byte range for the request. This avoids an otherwise obvious limitation in the protocol: if you desire only a 4k chunk of a 2GB file, you should not be forced to download the entire 2GB file.
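As a concrete illustration (a sketch, not tabled's code; the helper name is hypothetical), the Range header value for such a request is trivial to construct, remembering that HTTP byte ranges are inclusive:

```python
def range_header(offset, length):
    """Build an HTTP/1.1 Range header value for a byte span.

    HTTP (and S3) byte ranges are inclusive: bytes=first-last.
    """
    return "bytes=%d-%d" % (offset, offset + length - 1)

# A 4k chunk starting 1 GB into a 2 GB object:
print(range_header(1 << 30, 4096))   # bytes=1073741824-1073745919
```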
Partial-GET is also a must-have feature for my other two hacking projects, itd and nfs4d. When executing a SCSI READ, itd will not want to download a huge amount of data, just to handle a 4-LBA request. Similarly with nfs4d, executing a READ of an NFS file should not require nfs4d to download more data than required from chunkd.
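For the itd case, mapping a small SCSI READ onto a byte range for a partial-GET is straightforward; a minimal sketch, assuming 512-byte LBAs (the helper is hypothetical, not itd code):

```python
SECTOR = 512  # assumed LBA size

def lba_to_byte_range(lba, nblocks):
    """Translate a SCSI READ (starting LBA, block count) into the
    (offset, length) pair for a partial-GET against chunkd."""
    return lba * SECTOR, nblocks * SECTOR

# A 4-LBA READ at LBA 100 needs only 2048 bytes, not the whole object:
print(lba_to_byte_range(100, 4))   # (51200, 2048)
```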
For tabled, the implementation requires a bit of modification to the event-driven GET code path, but nothing overly burdensome. It largely relies on chunkd, though, to provide the ability to retrieve only a portion of the specified object.
For chunkd, the implementation of partial-GET is also relatively straightforward, but it introduces a few minor protocol issues.
Presently, we checksum the entire object at PUT time, and return that checksum at GET time, so that the client may verify the [strong] checksum to ensure no data corruption occurred.
A partial-GET implies that the whole-object checksum is useless; a checksum must be recomputed for just the object subset being requested. Unfortunately, this also implies that a key optimization, checksum offload (which sends data straight from kernel pages to NIC TCP output via DMA, entirely in hardware), becomes impossible.
On an unencrypted GET, chunkd executes sendfile(2), thereby eliminating several memory copies that would otherwise be made by the app and by the kernel. sendfile(2) automatically reads data from an fd, and writes that data to another fd, all without ever exposing that data directly to the app. As such, partial-GET with checksumming would require replacing
	sendfile(out_fd, in_fd, &offset, bytes);

with

	while (bytes remain to be written to out_fd) {
		read(in_fd, buf, count);
		SHA1_Update(&ctx, buf, count);
		write(out_fd, buf, count);
	}

The protocol issue is related. If we are to deliver the checksum in the -header-, the entire partial-GET object data must be read and checksummed before the message header can even be created; only then may the message header and object data be sent. Incredibly inefficient. The time-honored solution is to put the checksum at the end of the data stream, thereby allowing the checksum to be generated during data transmission.
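In rough terms (a Python sketch of the idea, not chunkd's C implementation), the read/hash/write loop with the checksum trailing the data would look like this:

```python
import hashlib
import io

def send_partial(in_f, out_f, offset, length, bufsz=65536):
    """Copy `length` bytes starting at `offset` from in_f to out_f,
    appending the SHA-1 of exactly that byte range as a hex trailer.

    Because the checksum trails the data, it is computed during
    transmission; nothing must be pre-read just to fill a header field."""
    in_f.seek(offset)
    h = hashlib.sha1()
    remaining = length
    while remaining > 0:
        buf = in_f.read(min(bufsz, remaining))
        if not buf:
            break                     # short object; trailer still valid
        h.update(buf)
        out_f.write(buf)
        remaining -= len(buf)
    out_f.write(h.hexdigest().encode())   # trailing checksum

obj = io.BytesIO(b"0123456789" * 1000)
wire = io.BytesIO()
send_partial(obj, wire, 10, 100)
```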
Another issue this raises is checksum verification. Ideally we want a pre-stored checksum, so that the local node can verify at data transmission time that what it reads off disk matches what it wrote $N days ago. Simply creating a checksum of what you write(2) to a TCP connection does not protect against disk corruption.
One solution is to update the chunkd disk format (again) and introduce checksums for each fixed-size block, i.e. one checksum for each 64k of a file. This would enable chunkd to verify, prior to sending data on a partial-GET, that the data pulled off disk is not corrupted.
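A sketch of that per-block scheme (hypothetical helpers, not the actual chunkd disk format): a checksum table is built per fixed 64k block at PUT time, and a partial-GET need only re-verify the blocks its range touches:

```python
import hashlib

BLOCK = 65536  # one checksum per 64k of file data

def block_checksums(data):
    """Compute the per-block SHA-1 table stored alongside the object."""
    return [hashlib.sha1(data[i:i + BLOCK]).digest()
            for i in range(0, len(data), BLOCK)]

def verify_range(data, sums, offset, length):
    """Before serving a partial-GET, re-check only those 64k blocks the
    requested [offset, offset+length) range overlaps."""
    first = offset // BLOCK
    last = (offset + length - 1) // BLOCK
    for i in range(first, last + 1):
        blk = data[i * BLOCK:(i + 1) * BLOCK]
        if hashlib.sha1(blk).digest() != sums[i]:
            return False              # disk corruption detected
    return True

data = bytes(200000)                  # spans four 64k blocks
sums = block_checksums(data)
print(verify_range(data, sums, 70000, 1000))   # True
```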
Just some food for thought :)

	Jeff

--
To unsubscribe from this list: send the line "unsubscribe hail-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html