On 2018/08/02 18:35, Andrei Borzenkov wrote:
>
> Sent from iPhone
>
>> On 2 Aug 2018, at 12:16, Martin Steigerwald <martin@xxxxxxxxxxxx> wrote:
>>
>> Hugo Mills - 01.08.18, 10:56:
>>>> On Wed, Aug 01, 2018 at 05:45:15AM +0200, MegaBrutal wrote:
>>>> I know it's a decade-old question, but I'd like to hear your
>>>> thoughts of today. By now, I became a heavy BTRFS user. Almost
>>>> everywhere I use BTRFS, except in situations when it is obvious
>>>> there is no benefit (e.g. /var/log, /boot). At home, all my
>>>> desktop, laptop and server computers are mainly running on BTRFS
>>>> with only a few file systems on ext4. I even installed BTRFS in
>>>> corporate productive systems (in those cases, the systems were
>>>> mainly on ext4; but there were some specific file systems those
>>>> exploited BTRFS features).
>>>>
>>>> But there is still one question that I can't get over: if you store
>>>> a database (e.g. MySQL), would you prefer having a BTRFS volume
>>>> mounted with nodatacow, or would you just simply use ext4?
>>>
>>> Personally, I'd start with btrfs with autodefrag. It has some
>>> degree of I/O overhead, but if the database isn't performance-critical
>>> and already near the limits of the hardware, it's unlikely to make
>>> much difference. Autodefrag should keep the fragmentation down to a
>>> minimum.
>>
>> I read that autodefrag would only help with small databases.
>>
>
> I wonder if anyone actually
>
> a) quantified performance impact
> b) analyzed the cause

It's caused by btrfs' poor fsync() performance and lock-hot metadata
operations.

The root cause is how we designed btrfs' btree.

For snapshots, and only for snapshots, we use one btree per
*subvolume*, unlike other fses which normally use one btree per
*inode* (both dir and file).

This means that each time we need to modify anything, including
updating an EXTENT_DATA pointer or adding a new child inode pointer,
we need to take the write lock all the way from the subvolume tree
root down to the leaf.
Which makes the tree root pretty lock-hot.

In short, in btrfs we need to lock and race on one big tree, while
other fses only need to lock and race on many small trees. (That's
also why they can't support fast fs-level snapshots.)

That's the root cause of btrfs' slow metadata performance. We have a
lot of optimizations to speed up the process, from delayed refs to the
tree log, but it's still slow compared to other fses.

For fsync() we have the log tree optimization, which only logs the
related data pointers and inode updates, and skips some full
transaction commit work. It indeed makes fsync() much faster, but
still slower than other fses, due to the metadata design.

BTW, nodatacow indeed improves performance, but that's mostly due to
the following factors:

1) No csum calculation
   Although csum calculation can be spread across multiple CPU
   cores/threads, and CRC32 is pretty fast, it still introduces
   overhead.

2) Some overwrites no longer need to modify the subvolume tree
   If we're doing an overwrite and there is no need to CoW (for a
   snapshot), we can skip updating EXTENT_DATA, which avoids a lot of
   tree write locking and improves performance.

> I work with NetApp for a long time and I can say from first hand
> experience that fragmentation had zero impact on OLTP workload. It
> did affect backup performance as was expected, but this could be
> fixed by periodic reallocation (defragmentation).
>
> And even that needed quite some time to observe (years) on pretty
> high load database with regular backup and replication snapshots.
>
> If btrfs is so susceptible to fragmentation, what is the reason for it?

I heard some reports of fragmentation, but mostly related to extent
bookkeeping and ENOSPC, not really related to performance.

And IIRC I did some old performance tests on HDD using btrfs and
xfs/ext4. Using the autodefrag mount option in fact reduced
performance on btrfs.
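As a side note to the nodatacow discussion above: it doesn't have to be
a fs-wide mount option. btrfs also supports it per file or per directory
through the No_COW attribute (chattr +C). A minimal sketch, with a
temporary directory standing in for a real database directory (the
paths and file name here are illustrative only):

```shell
# Per-directory nodatacow via the No_COW attribute (chattr +C).
# The +C flag only takes effect on empty files, so the common pattern
# is to flag the database directory and let new files inherit it.
# Caveat: nodatacow also disables data checksumming for those files.
DB_DIR="$(mktemp -d)"                  # stand-in for e.g. a MySQL data dir
if chattr +C "$DB_DIR" 2>/dev/null; then
    touch "$DB_DIR/datafile"           # new file inherits No_COW
    lsattr "$DB_DIR/datafile"          # attribute column shows 'C'
else
    echo "chattr +C not supported on this filesystem (needs btrfs)"
fi
rm -rf "$DB_DIR"
```

This gets the same write-path savings described in 1) and 2) above
without giving up CoW for the rest of the filesystem; whether that is
enough to prefer btrfs over ext4 for a busy database still comes down
to benchmarking the actual workload.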
Thanks,
Qu