A key feature of FastCDC is chunk size normalization. Normalization improves the distribution of chunk sizes, increasing the number of chunks close to the target average size and reducing the number of chunks clipped by the maximum chunk size, compared to the Rabin-based chunking algorithm used in restic/chunker.
The histograms below show the chunk size distribution for fastcdc-go and restic/chunker on 1GiB of random data, each with average chunk size 1MiB, minimum chunk size 256KiB, and maximum chunk size 4MiB. The normalization level for fastcdc-go is set to 2.
Compared to restic/chunker, the distribution of fastcdc-go is less skewed (standard deviation 345KiB vs. 964KiB).
FastCDC-Go is licensed under the Apache 2.0 License. See LICENSE for details.
Xia, Wen, et al. “FastCDC: A Fast and Efficient Content-Defined Chunking Approach for Data Deduplication.” 2016 USENIX Annual Technical Conference (USENIX ATC ’16).