Software Deduplication: Quick comparison of savings rates
Posted 2015-07-31 17:59 by Traesk. Edited 2016-03-30 04:03 by Traesk.
Purpose: To see if any of the alternatives is significantly better or worse, and ultimately decide what OS/filesystem to potentially use on a fileserver.
Scope/limitations: This test does not consider factors other than the actual savings rates, such as performance, security, and cost. The method is very limited and might not reflect the results in a real-world scenario. There might be minor differences in the results due to rounding and possibly inconsistent variables. This is the first time I have worked with most of these systems, so the setups are probably not optimized. The results were pretty clear though, so a few percent either way shouldn't matter.
Method: I used 22.3GiB worth of Windows XP installation ISOs, 52 ISOs in total. No two files were exactly the same, but some shared a lot of data, like the Swedish XP Home Edition and the Swedish N version of XP Home Edition. I deduplicated these files and noted how much space I saved compared to the original 22.3GiB.
Please see the updated results!
Introduction
Deduplication is the method of removing duplicate data in files. That way, if you have two similar files, theoretically only the "new" or "different" data in the second file takes up space.
There are two ways to do deduplication: "post-process" and "in-line" (sometimes called by other names). Post-process scans the files after they are copied and then removes duplicate data. In-line compares the data before it is written, so duplicate data is never written at all. In-line needs a lot of RAM to keep all the necessary information available, but should copy identical files much faster, as the duplicate data of the second file doesn't have to be written to disk.
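To make the idea concrete, here is a minimal sketch of in-line, fixed-size block deduplication. It is only an illustration, not how any of the tested systems is implemented: the function name dedup_stats, the 128 KiB block size, and the in-memory dictionary standing in for the disk are all assumptions made for the example.

```python
import hashlib

BLOCK_SIZE = 128 * 1024  # illustrative block size, not tied to any tested tool


def dedup_stats(blobs, block_size=BLOCK_SIZE):
    """Return (logical_bytes, stored_bytes) for a toy in-line block store."""
    store = {}   # digest -> block length; stands in for blocks written to disk
    logical = 0
    for data in blobs:
        for offset in range(0, len(data), block_size):
            block = data[offset:offset + block_size]
            logical += len(block)
            digest = hashlib.sha256(block).digest()
            # In-line: look the block up before "writing" it, skip known blocks.
            if digest not in store:
                store[digest] = len(block)
    return logical, sum(store.values())


if __name__ == "__main__":
    # Two "files" that share their first two blocks and differ in the last one.
    shared = b"S" * BLOCK_SIZE + b"T" * BLOCK_SIZE
    file_a = shared + b"A" * BLOCK_SIZE
    file_b = shared + b"B" * BLOCK_SIZE
    logical, stored = dedup_stats([file_a, file_b])
    print(f"logical={logical} stored={stored}")
```

The dictionary of digests is the part that has to stay in RAM for every block ever written, which is why the in-line approach is memory-hungry. A post-process tool would do the same hashing, but over files that are already on disk.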
Btrfs
The filesystem Btrfs supports post-process deduplication, but you need a tool that tells it what to dedup. There seem to be two alternatives: bedup and duperemove. Bedup only does whole-file deduplication (I also verified this by running the tool, with no space savings as a result), rather than chunks, so it's not really what we are looking for.
General info
Duperemove v0.09 was used in this test.
Ubuntu 14.10
uname -a
Linux hostname 3.16.0-30-generic #40-Ubuntu SMP Mon Jan 12 22:06:37 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
btrfs --version
Btrfs v3.14.1
blockdev --getbsz /dev/sda3
4096
Testing
Used space before running duperemove:
df /dev/sda3 --block-size=M
Filesystem 1M-blocks Used Available Use% Mounted on
/dev/sda3 82969M 22935M 58008M 29% /media/traesk/020d9e3e-0a61-4d06-8b78-c5b4dd975bd9
Running with default block size of 128K
duperemove -rdh /media/traesk/020d9e3e-0a61-4d06-8b78-c5b4dd975bd9/
df /dev/sda3 --block-size=M
Filesystem 1M-blocks Used Available Use% Mounted on
/dev/sda3 82969M 19611M 61325M 25% /media/traesk/020d9e3e-0a61-4d06-8b78-c5b4dd975bd9
Testing with half the block size, 64K
duperemove -rdh -b 64k /media/traesk/020d9e3e-0a61-4d06-8b78-c5b4dd975bd9/
df /dev/sda3 --block-size=M
Filesystem 1M-blocks Used Available Use% Mounted on
/dev/sda3 82969M 19468M 61469M 25% /media/traesk/020d9e3e-0a61-4d06-8b78-c5b4dd975bd9
Finally, I attempted to set the block size to the same block size as Btrfs itself, 4K
duperemove -rdh -b 4k /media/traesk/020d9e3e-0a61-4d06-8b78-c5b4dd975bd9/
...however, it froze my virtual machine after an hour or so. After increasing the RAM of the VM, I let it run for a day before I gave up and shut it down. Considering that the tests with 64K and 128K took half an hour or so, I did not care to try this any further.
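For a rough sense of scale, the sketch below counts how many blocks 22.3GiB splits into at each block size. It is a back-of-envelope estimate only: it treats the data as one contiguous stream, ignores per-file tail blocks, and does not establish why the 4K run stalled. The helper name block_count is made up for this example.

```python
GIB = 1024 ** 3
KIB = 1024


def block_count(total_bytes, block_size):
    """Number of blocks needed to cover total_bytes (ceiling division)."""
    return -(-total_bytes // block_size)


if __name__ == "__main__":
    data = int(22.3 * GIB)  # size of the test set in bytes
    for size_kib in (128, 64, 4):
        print(f"{size_kib:>3}K blocks: {block_count(data, size_kib * KIB):,}")
```

Going from 128K to 4K multiplies the number of blocks to hash and compare by 32, so both memory use and run time can be expected to grow considerably.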
Saved space with 128K block size: (22835.2-19611)/22835.2= 14.1%
Saved space with 64K block size: (22835.2-19468)/22835.2= 14.7%
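The same calculation is used for every system in this post, so here it is as a small sketch. The baseline of 22835.2 is simply 22.3GiB expressed in MiB, and the used-space figures are the ones df reported above. The function name savings_percent is just for illustration.

```python
def savings_percent(original, used):
    """Space saved as a percentage of the original size (same unit for both)."""
    return (original - used) / original * 100


BASELINE_MIB = 22.3 * 1024  # 22.3 GiB expressed in MiB

if __name__ == "__main__":
    for label, used_mib in (("128K", 19611), ("64K", 19468)):
        print(f"{label}: {savings_percent(BASELINE_MIB, used_mib):.1f}%")
```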
I find it interesting that the log says:
Using 128K blocks
Using hash: SHA256
Using 2 threads for file hashing phase
[…]
Kernel processed data (excludes target files): 4.0G
Comparison of extent info shows a net change in shared extents of: 7.4G
...even though the actual savings are 3-3.5GiB. I'm not sure how to interpret this.
Opendedup
Opendedup is a Java-based file system on top of your regular file system that does in-line deduplication.
General info
sdfs-2.0.10_amd64.deb was used in this test. Underlying filesystem was Btrfs.
Testing
Default settings
mkfs.sdfs --volume-name=tank --volume-capacity=100gb --base-path=/media/traesk/ee291934-34db-448c-98cb-c96447c29259/
mount.sdfs tank /media/tank
sdfscli --volume-info
Volume Capacity : 100 GB
Volume Current Logical Size : 22 GB
Volume Max Percentage Full : 95.0%
Volume Duplicate Data Written : 11 GB
Unique Blocks Stored: 10 GB
Unique Blocks Stored after Compression : 10 GB
Cluster Block Copies : 2
Volume Virtual Dedup Rate (Unique Blocks Stored/Current Size) : -9.96%
Volume Actual Storage Savings (Compressed Unique Blocks Stored/Current Size) : 50.92%
Compression Rate: 0.0%
df -h
Filesystem Size Used Avail Use% Mounted on
/dev/sda3 75G 12G 59G 17% /media/traesk/ee291934-34db-448c-98cb-c96447c292591
sdfs:/etc/sdfs/tank-volume-cfg.xml:6442 101G 11G 90G 11% /media/tank
"Variable block deduplication"
mkfs.sdfs --volume-name=tank --volume-capacity=100gb --hash-type=VARIABLE_MURMUR3 --base-path=/media/traesk/ee291934-34db-448c-98cb-c96447c29259/
mount.sdfs tank /media/tank
sdfscli --volume-info
Volume Capacity : 100 GB
Volume Current Logical Size : 22 GB
Volume Max Percentage Full : 95.0%
Volume Duplicate Data Written : 12 GB
Unique Blocks Stored: 9 GB
Unique Blocks Stored after Compression : 9 GB
Cluster Block Copies : 2
Volume Virtual Dedup Rate (Unique Blocks Stored/Current Size) : -8.94%
Volume Actual Storage Savings (Compressed Unique Blocks Stored/Current Size) : 56.97%
Compression Rate: 3.31%
df -h
Filesystem Size Used Avail Use% Mounted on
/dev/sda3 75G 12G 59G 17% /media/traesk/28b8182f-185b-41b7-a117-b667e6d78f09
sdfs:/etc/sdfs/tank-volume-cfg.xml:6442 101G 9,7G 91G 10% /media/tank
Savings with default settings: (22.3-12)/22.3 = 46.2%
Savings with Variable block deduplication: (22.3-12)/22.3 = 46.2%
I should have had df report megabytes for this one, since both runs round to 12G. Also, it seems Opendedup uses some storage on the underlying filesystem as well, which is probably why my calculations show a lower savings rate than sdfscli does.
Windows
In Windows Server 2012, Microsoft introduced post-process deduplication for NTFS. Note that the new filesystem ReFS is not supported.
General info
Windows Server 2012 R2 with Update 1, otherwise untouched.
Testing
Enable-DedupVolume e:
Set-DedupVolume e: -MinimumFileAgeDays 0
Start-DedupJob e: -Type Optimization
Start-DedupJob e: -Type GarbageCollection
Start-DedupJob e: -Type Scrubbing
Get used space
Get-WmiObject win32_logicaldisk -filter "DeviceID='E:'" | ForEach-Object {($_.size - $_.freespace) / 1GB}
9,1517219543457
Get-DedupVolume | Format-List
Volume : E:
VolumeId : \\?\Volume{bb855b83-aeed-11e4-80b5-000c29fa07e3}\
Enabled : True
UsageType : Default
DataAccessEnabled : True
Capacity : 50.94 GB
FreeSpace : 41.78 GB
UsedSpace : 9.15 GB
UnoptimizedSize : 22.67 GB
SavedSpace : 13.52 GB
SavingsRate : 59 %
MinimumFileAgeDays : 0
MinimumFileSize : 32768
NoCompress : False
ExcludeFolder :
ExcludeFileType :
ExcludeFileTypeDefault : {edb, jrs}
NoCompressionFileType : {asf, mov, wma, wmv...}
ChunkRedundancyThreshold : 100
Verify : False
OptimizeInUseFiles : False
OptimizePartialFiles : False
Savings: (22.3-9.15)/22.3 = 59.0%
ZFS
ZFS is another filesystem capable of deduplication, but this one does it in-line and no additional software is required. Originally part of Sun's OpenSolaris, it was made closed-source when Oracle took over. This led to the creation of OpenZFS, which is now developed independently of Oracle's version.
General info
Ubuntu 14.10
uname -a
Linux hostname 3.16.0-30-generic #40-Ubuntu SMP Mon Jan 12 22:06:37 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
zpool get version tank
NAME PROPERTY VALUE SOURCE
tank version - default
zfs get version tank
NAME PROPERTY VALUE SOURCE
tank version 5 -
dmesg | grep -E 'SPL:|ZFS'
[ 1383.325360] SPL: Loaded module v0.6.3-14~utopic
[ 1383.612907] ZFS: Loaded module v0.6.3-15~utopic, ZFS pool version 5000, ZFS filesystem version 5
[ 2530.532362] SPL: using hostid 0x007f0101
Oracle Solaris
uname -a
SunOS hostname 5.11 11.2 i86pc i386 i86pc
zpool get version tank
NAME PROPERTY VALUE SOURCE
tank version 35 default
zfs get version tank
NAME PROPERTY VALUE SOURCE
tank version 6 -
Testing
Ubuntu: Deduplication on, compression off
zpool create -O dedup=on tank /dev/sda3
zdb -DD tank
DDT-sha256-zap-duplicate: 31239 entries, size 274 on disk, 142 in core
DDT-sha256-zap-unique: 116596 entries, size 282 on disk, 152 in core
DDT histogram (aggregated over all DDTs):
bucket allocated referenced
______ ______________________________ ______________________________
refcnt blocks LSIZE PSIZE DSIZE blocks LSIZE PSIZE DSIZE
------ ------ ----- ----- ----- ------ ----- ----- -----
1 114K 14.2G 14.2G 14.2G 114K 14.2G 14.2G 14.2G
2 30.2K 3.78G 3.78G 3.78G 63.8K 7.97G 7.97G 7.97G
4 306 38.2M 38.2M 38.2M 1.20K 153M 153M 153M
8 1 128K 128K 128K 12 1.50M 1.50M 1.50M
16 2 256K 256K 256K 47 5.88M 5.88M 5.88M
Total 144K 18.0G 18.0G 18.0G 179K 22.3G 22.3G 22.3G
dedup = 1.24, compress = 1.00, copies = 1.00, dedup * compress / copies = 1.24
Ubuntu: Deduplication on, compression on
zpool create -O dedup=on -O compression=on tank /dev/sda3
zdb -DD tank
DDT-sha256-zap-duplicate: 31238 entries, size 279 on disk, 143 in core
DDT-sha256-zap-unique: 116596 entries, size 283 on disk, 151 in core
DDT histogram (aggregated over all DDTs):
bucket allocated referenced
______ ______________________________ ______________________________
refcnt blocks LSIZE PSIZE DSIZE blocks LSIZE PSIZE DSIZE
------ ------ ----- ----- ----- ------ ----- ----- -----
1 114K 14.2G 13.8G 13.8G 114K 14.2G 13.8G 13.8G
2 30.2K 3.78G 3.63G 3.63G 63.8K 7.97G 7.67G 7.67G
4 306 38.2M 36.3M 36.3M 1.20K 153M 145M 145M
8 1 128K 4.50K 4.50K 12 1.50M 54K 54K
16 1 128K 4.50K 4.50K 24 3M 108K 108K
Total 144K 18.0G 17.5G 17.5G 179K 22.3G 21.6G 21.6G
dedup = 1.24, compress = 1.03, copies = 1.00, dedup * compress / copies = 1.28
Solaris: Deduplication on, compression on
zpool create -O compression=on -O dedup=on tank c2t1d0
df -k tank
Filesystem 1024-blocks Used Available Capacity Mounted on
tank 30859235 22681047 8134226 74% /tank
Note that df on ZFS shows the savings as a larger total disk size rather than as less used space.
zdb -DD tank
DDT-sha256-zap-duplicate: 31238 entries, size 278 on disk, 143 in core
DDT-sha256-zap-unique: 116596 entries, size 283 on disk, 148 in core
DDT histogram (aggregated over all DDTs):
bucket allocated referenced
______ ______________________________ ______________________________
refcnt blocks LSIZE PSIZE DSIZE blocks LSIZE PSIZE DSIZE
------ ------ ----- ----- ----- ------ ----- ----- -----
1 114K 14.2G 13.8G 13.8G 114K 14.2G 13.8G 13.8G
2 30.2K 3.78G 3.63G 3.63G 63.8K 7.97G 7.67G 7.67G
4 306 38.2M 36.3M 36.3M 1.20K 153M 145M 145M
8 1 128K 4.50K 4.50K 12 1.50M 54K 54K
16 1 128K 4.50K 4.50K 24 3M 108K 108K
Total 144K 18.0G 17.5G 17.5G 179K 22.3G 21.6G 21.6G
dedup = 1.24, compress = 1.03, copies = 1.00, dedup * compress / copies = 1.28
zpool list tank
NAME SIZE ALLOC FREE CAP DEDUP HEALTH ALTROOT
tank 25.8G 17.5G 8.22G 68% 1.23x ONLINE -
Ubuntu: Deduplication on, Compression on with gzip-9
zpool create -O dedup=on -O compression=gzip-9 tank /dev/sda3
zdb -DD tank
DDT-sha256-zap-duplicate: 31238 entries, size 279 on disk, 142 in core
DDT-sha256-zap-unique: 116596 entries, size 285 on disk, 151 in core
DDT histogram (aggregated over all DDTs):
bucket allocated referenced
______ ______________________________ ______________________________
refcnt blocks LSIZE PSIZE DSIZE blocks LSIZE PSIZE DSIZE
------ ------ ----- ----- ----- ------ ----- ----- -----
1 114K 14.2G 13.5G 13.5G 114K 14.2G 13.5G 13.5G
2 30.2K 3.78G 3.55G 3.55G 63.8K 7.97G 7.51G 7.51G
4 306 38.2M 35.3M 35.3M 1.20K 153M 141M 141M
8 1 128K 512 512 12 1.50M 6K 6K
16 1 128K 512 512 24 3M 12K 12K
Total 144K 18.0G 17.1G 17.1G 179K 22.3G 21.2G 21.2G
dedup = 1.24, compress = 1.05, copies = 1.00, dedup * compress / copies = 1.30
Ubuntu: Savings with Deduplication on, compression off: (22.3-18)/22.3 = 19.3%
Ubuntu: Savings with Deduplication on, compression on: (22.3-17.5)/22.3 = 21.5%
Solaris: Savings with Deduplication on, compression on: (22.3-17.5)/22.3 = 21.5%
Ubuntu: Savings with Deduplication on, compression on with gzip-9: (22.3-17.1)/22.3 = 23.3%
OpenZFS on Ubuntu and Oracle ZFS on Solaris seem to get essentially the same results.
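As a cross-check, the combined ratio on the last line of each zdb output can be turned into a savings percentage. This sketch assumes the ratio means logical size divided by allocated size, which is consistent with the Total rows above (for example 22.3G referenced against 18.0G allocated gives about 1.24). The function name ratio_to_savings_percent is only for illustration.

```python
def ratio_to_savings_percent(ratio):
    """Convert a logical/physical ratio (e.g. 1.24) to percent of space saved."""
    # savings = (logical - physical) / logical = 1 - physical/logical = 1 - 1/ratio
    return (1 - 1 / ratio) * 100


if __name__ == "__main__":
    # Combined "dedup * compress / copies" values reported by zdb above.
    for label, ratio in (
        ("dedup only", 1.24),
        ("dedup + compression", 1.28),
        ("dedup + gzip-9", 1.30),
    ):
        print(f"{label}: {ratio_to_savings_percent(ratio):.1f}%")
```

The results land within about half a percentage point of the figures calculated from the Total rows. The small differences are expected, because both the ratios and the sizes printed by zdb are rounded.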
Summary and conclusion
Please see the updated results!
The results clearly show that deduplication efficiency alone gives me no reason to leave my comfort zone and abandon Windows for Linux or UNIX.