Re: [PATCH v2] Btrfs: heuristic add shannon entropy calculation

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Sun, Oct 08, 2017 at 04:11:59PM +0300, Timofey Titovets wrote:
> Byte distribution check in heuristic will filter edge data
> cases and some time fail to classificate input data
> 
> Let's fix that by adding Shannon entropy calculation,
> that will cover classification of most other data types.
> 
> As Shannon entropy need log2 with some precision to work,
> let's use ilog2(N) and for increase precision,
> by do ilog2(pow(N, 4))
> 
> Shannon entropy slightly changed for avoiding signed numbers
> and divisions.
> 
> Changes:
>   v1 -> v2:
>     - Replace log2_lshift4 wiht ilog2
>     - To compensate accuracy errors of ilog2
>       @ENTROPY_LVL_ACEPTABLE 70 -> 65
>       @ENTROPY_LVL_HIGH      85 -> 80
>     - Drop usage of u64 from shannon calculation
> 
> Signed-off-by: Timofey Titovets <nefelim4ag@xxxxxxxxx>
> ---
>  fs/btrfs/compression.c | 83 +++++++++++++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 82 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
> index 0060bc4ae5f2..8efbce5633b5 100644
> --- a/fs/btrfs/compression.c
> +++ b/fs/btrfs/compression.c
> @@ -34,6 +34,7 @@
>  #include <linux/slab.h>
>  #include <linux/sched/mm.h>
>  #include <linux/sort.h>
> +#include <linux/log2.h>
>  #include "ctree.h"
>  #include "disk-io.h"
>  #include "transaction.h"
> @@ -1224,6 +1225,60 @@ int btrfs_decompress_buf2page(const char *buf, unsigned long buf_start,
>  	return 1;
>  }
> 
> +/*
> + * Shannon Entropy calculation
> + *
> + * Pure byte distribution analyze fail to determine
> + * compressiability of data. Try calculate entropy to
> + * estimate the average minimum number of bits needed
> + * to encode a sample data.
> + *
> + * For comfort, use return of percentage of needed bit's,
> + * instead of bit's amaount directly.
> + *
> + * @ENTROPY_LVL_ACEPTABLE - below that threshold sample has low byte
> + * entropy and can be compressible with high probability
> + *
> + * @ENTROPY_LVL_HIGH - data are not compressible with high probability,
> + *
> + * Use of ilog2() decrease precision, so to see ~ correct answer
> + * LVL decreased by 5.
> + */
> +
> +#define ENTROPY_LVL_ACEPTABLE 65
> +#define ENTROPY_LVL_HIGH 80
> +
> +/*
> + * For increase precision in shannon_entropy calculation
> + * Let's do pow(n, M) to save more digits after comma
> + *
> + * Max int bit lengh 64
> + * ilog2(MAX_SAMPLE_SIZE) -> 13
> + * 13*4 = 52 < 64 -> M = 4
> + * So use pow(n, 4)
> + */
> +static inline u32 ilog2_w(u64 n)
> +{
> +	return ilog2(n * n * n * n);
> +}
> +
> +static u32 shannon_entropy(struct heuristic_ws *ws)
> +{
> +	const u32 entropy_max = 8*ilog2_w(2);
> +	u32 entropy_sum = 0;
> +	u32 p, p_base, sz_base;
> +	u32 i;
> +
> +	sz_base = ilog2_w(ws->sample_size);
> +	for (i = 0; i < BUCKET_SIZE && ws->bucket[i].count > 0; i++) {
> +		p = ws->bucket[i].count;
> +		p_base = ilog2_w(p);
> +		entropy_sum += p * (sz_base - p_base);
> +	}
> +
> +	entropy_sum /= ws->sample_size;
> +	return entropy_sum * 100 / entropy_max;
> +}

That's much better :) I'll add the patch to the rest of the heuristics.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html




[Index of Archives]     [Linux Filesystem Development]     [Linux NFS]     [Linux NILFS]     [Linux USB Devel]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]

  Powered by Linux