Some clarifications:
> Patchset based on 'tmp' branch e6bd18d8938986c997c45f0ea95b221d4edec095.
All patches are against btrfs-progs.
====
The rest of rambling is about kernel code, which handles supers.
I have read what I've wrote last night (braindump of insane!)
and will try to elaborate a bit:
> Poking at valgrind warnings I have noticed very worrying problem.
> When we (over)write superblock we take 4096 continuous bytes in memory.
The common pattern in disk-io.c is the following:
struct btrfs_super_block *sb = root->fs_info->sb // or ->sb_for_commit
memcpy(dest, sb, BTRFS_SUPER_INFO_SIZE /* 4096 */);
With a memcpy we go out of sizeof(*sb) (~2.5K) and pick more fields from fs_info struct.
Let's look at them:
struct btrfs_fs_info {
...
struct btrfs_super_block super_copy;
struct btrfs_super_block super_for_commit;
struct block_device *__bdev;
struct super_block *sb;
struct inode *btree_inode;
struct backing_dev_info bdi;
struct mutex trans_mutex;
struct mutex tree_log_mutex;
struct mutex transaction_kthread_mutex;
struct mutex cleaner_mutex;
struct mutex chunk_mutex;
struct mutex volume_mutex;
struct mutex ordered_operations_mutex;
struct rw_semaphore extent_commit_sem;
struct rw_semaphore cleanup_work_sem;
struct rw_semaphore subvol_sem;
struct srcu_struct subvol_srcu;
struct list_head trans_list;
struct list_head hashers;
struct list_head dead_roots;
struct list_head caching_block_groups;
spinlock_t delayed_iput_lock;
struct list_head delayed_iputs;
atomic_t nr_async_submits;
...
So we copyout (and even checksum!) atomic counters and other volatile stuff
(I haven't looked at mutexes and semaphores, but i'm sure there is yet some
volatile stuff)
> In kernel the structures reside in btrfs_fs_info structure, so we compute
> CRC for:
> struct btrfs_super_block super_copy;
> struct btrfs_super_block super_for_commit;
> and then write it to disk. [H]ere we have 2 issues:
> 1. kernel pointers and other random stuff leaks out to kernel.
> It's nondeterministic and leaks out data (not too bad,
> as it should be accessible only for root, but still)
> 2. more serious: is there guarantee, that noone will kick-in
> between CRC computation and superblock outwrite?
>
> What if some of mutexes, semaphores or lists will change
> it's internal state? Some async thread will kick it
> an we will end-up writing superblock with invalid CRC!
> This might well be the cause of recend superblock
> corruptions under heavy load + hangup retorted to the list.
>
> Consider the following call chain:
> [somewhere in write_dev_supers ...]
>
> bh->b_end_io = btrfs_end_buffer_write_sync;
> crc = ~(u32)0;
> crc = btrfs_csum_data(NULL, (char *)sb +
> BTRFS_CSUM_SIZE, crc,
> BTRFS_SUPER_INFO_SIZE -
> BTRFS_CSUM_SIZE);
> btrfs_csum_final(crc, sb->csum);
Now the problem should be a bit more clear: sb is a thing pointing to the middle
of fs_info. We checksum it with data after it in fs_info.
> bh = __getblk(device->bdev, bytenr / 4096,
> BTRFS_SUPER_INFO_SIZE);
>
> memcpy(bh->b_data, sb, BTRFS_SUPER_INFO_SIZE);
and here we write all the checksummed contents. Is there guard, which
prevents things from updating fs_info?
>
> /* one reference for submit_bh */
> get_bh(bh);
>
> set_buffer_uptodate(bh);
> lock_buffer(bh);
> bh->b_end_io = btrfs_end_buffer_write_sync;
Am I too paranoid about the issue?
Thanks!
--
Sergei
Attachment:
signature.asc
Description: PGP signature
