Here is a script to locate duplicate data WITHIN files:
On some test file sets of binary data with no duplicated files, about 3%
of the data blocks were duplicated, and about 0.1% of the data blocks
were nulls. The data was mainly ELF and Win32 binaries plus some random
game data, office documents and a few images.
This code is hideously slow, so don't give it more than a couple of MB
of files to chew through at once. In retrospect I should've just
written it in plain fast C instead of fighting with bash pipes!
Note: to get "verbose" output (one line per block), just remove
everything after the word "sort" in the code, i.e. the second while
loop and the tail.
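For what it's worth, the verbose output is one sorted line per block:
the hash, md5sum's "*-" marker, the block number and the path. Here is
a rough sketch of pulling only the repeated blocks out of that list.
It assumes GNU uniq (for -w and -D); the sample file and the
hash_blocks helper are made up for illustration and just mirror the
inner loop of the script below.

```shell
#!/bin/bash
# Sketch only: hash_blocks is a hypothetical helper, not part of the script.
BS=512
TMP=$(mktemp -d)

# Build a 3-block sample file: A-block, B-block, A-block again,
# so block 0 and block 2 are identical.
{ head -c $BS /dev/zero | tr '\0' 'A'
  head -c $BS /dev/zero | tr '\0' 'B'
  head -c $BS /dev/zero | tr '\0' 'A'; } > "$TMP/sample.bin"

# Print "HASH *- BLOCKNUM PATH" for every block of the file given as $1.
hash_blocks() {
  local f=$1 len bc=0
  len=$(stat -c%s "$f")
  while [ "$len" -gt $(( bc * BS )) ]; do
    echo "$(dd if="$f" bs=$BS count=1 skip=$bc 2>/dev/null | md5sum -b) $bc $f"
    bc=$(( bc + 1 ))
  done
}

VERBOSE=$(hash_blocks "$TMP/sample.bin" | sort)

# -w32 compares only the 32 hex digits of the hash,
# -D prints every line that belongs to a group of duplicates.
DUPS=$(echo "$VERBOSE" | uniq -w32 -D)
echo "$DUPS"

rm -rf "$TMP"
```

This should list blocks 0 and 2 of the sample file and leave out
block 1.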
___________________ V CODE V ____________________________
#!/bin/bash
# **********************************************************#
# Redundant data detector #
# #
# Simple Script to take an MD5 hash of every block in every #
# file in a folder and detect identical blocks #
# #
# Copyright 2008 Oliver Mattos, Released under the GPL. #
# **********************************************************#
# WARNING - This script is very inefficient, so don't run it
# with more than 50,000 blocks at once.
: ${1?"Usage: $0 PATH [BlockSize]"}
BS=${2-512} #Block Size in bytes, can be specified on command line
NULLCOUNT="0"
DUPCOUNT="0"
TOTCOUNT="0"
NULLHASH=`dd if=/dev/zero bs=$BS count=1 2>/dev/null | md5sum -b`
NULLHASH="${NULLHASH:0:32}"
find "$1" | \
while read i; do
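# Any regular file, not a symlink. (Comparing the raw mode against
# "81a4" only matched regular files with 0644 permissions.)
if [ -f "$i" ] && [ ! -L "$i" ]; then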
LEN=` stat "$i" -c%s `
BC=0
while [ $LEN -gt $[$BC * $BS] ]; do
# One output line per block: hash, "*-", block number, path.
echo "$(dd if="$i" bs=$BS count=1 skip=$BC 2>/dev/null | md5sum -b)" \
"$BC" "$i"
BC=$[ BC + 1 ]
done
fi;
done | sort | while read j; do
OLDHASH=$HASH
HASH=${j:0:32}
TOTCOUNT=$[ $TOTCOUNT + 1 ]
if [ "$HASH" == "$OLDHASH" ]; then
DUPCOUNT=$[ DUPCOUNT + 1 ]
if [ "$HASH" == "$NULLHASH" ]; then
NULLCOUNT=$[ NULLCOUNT + 1 ]
fi
fi
echo Hashed $TOTCOUNT $BS byte blocks, found $DUPCOUNT redundant \
blocks of data, of which $NULLCOUNT blocks were null.
done | tail -n 1
# Echoing on every pass and keeping only the last line with tail is a
# bodge: the while loop sits in a pipeline, so bash runs it in a
# subshell and the counters are lost as soon as the loop ends.
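On the bodge with the counters: each stage of a pipeline runs in its
own subshell in bash, so anything the loop sets is gone once the loop
finishes. One possible way round it, sketched below, is to group the
loop and the final echo in braces so both run in the same subshell.
The count_dups function and the sample lines are made up for
illustration (the lines stand in for the sorted hash list); this is
not what the script above does.

```shell
#!/bin/bash
# Sketch only: count_dups is a hypothetical stand-in for the counting loop.
count_dups() {
  sort | {
    TOT=0
    DUP=0
    PREV=""
    while read -r line; do
      TOT=$(( TOT + 1 ))
      # After sorting, identical lines are adjacent, so comparing with
      # the previous line is enough to spot a repeat.
      if [ "$line" = "$PREV" ]; then
        DUP=$(( DUP + 1 ))
      fi
      PREV=$line
    done
    # Still inside the same subshell as the loop, so the counters are
    # intact here and a single echo is enough.
    echo "Hashed $TOT blocks, found $DUP redundant"
  }
}

SUMMARY=$(printf 'b\na\nb\nb\n' | count_dups)
echo "$SUMMARY"   # prints: Hashed 4 blocks, found 2 redundant
```

With that, the echo runs once at the end rather than once per block,
and the tail is no longer needed.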