One of the big things to remember with dedup is that it is
block-oriented (as is compression) - it deals with data in discrete
chunks, (usually) not with the entire file as a stream. So, let's walk
through an example.
File A is 100MB in size. From ZFS's standpoint, let's say it's made up
of 100 1MB blocks (or chunks, or slabs). Let's also say that none of the
blocks are identical (which is highly likely) - that is, no block
within the file is a duplicate of another.
Thus, with dedup on, this file takes up 100MB of space. If I do a "cp
fileA fileB", no additional space will be taken up, since every block
of file B is a duplicate of a block in file A.
However, let's say I then add 1 bit of data to the very front of file A.
Now, block alignments have changed for the entire file, so all the 1MB
blocks checksum differently. Thus, in this case, adding 1 bit of data to
file A actually causes 100MB+1bit of new data to be written, as now none
of file A's blocks are the same as file B's blocks. Therefore, after 1
additional bit has been written, total disk usage is 200MB+1 bit.
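The alignment effect above can be sketched with fixed-size blocks and
per-block hashes (a toy model for illustration, not ZFS's actual
on-disk logic; the block size and sample data are made up, and a whole
byte is prepended rather than a single bit):

```python
import hashlib

BLOCK_SIZE = 4  # tiny blocks for illustration; ZFS records are typically far larger

def block_hashes(data):
    """Split data into fixed-size blocks and hash each one."""
    return [hashlib.sha256(data[i:i + BLOCK_SIZE]).hexdigest()
            for i in range(0, len(data), BLOCK_SIZE)]

original = bytes(range(40))    # "file A": 10 distinct blocks
copy = original                # "cp fileA fileB": identical blocks, fully deduped
shifted = b"\x00" + original   # prepend one byte to file A

a, b, c = block_hashes(original), block_hashes(copy), block_hashes(shifted)

print(sum(x == y for x, y in zip(a, b)))  # 10 - every block of the copy dedups
print(sum(x == y for x, y in zip(a, c)))  # 0 - the shift breaks every block match
```

The copy shares every block hash with the original, while the one-byte
prepend shifts the contents of every block, so none of them match any
more.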
If compression were being used, file A originally would likely take up <
100MB, and file B would take up the same amount; thus, the two together
could take up, say, 150MB (assuming a conservative 25% compression
ratio). After writing 1 new bit to file A, file A almost certainly
compresses the same as before, so the two files will continue to occupy
about 150MB of space.
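A rough way to see why compression is insensitive to the 1-bit change
(a sketch using zlib on made-up sample data, not a model of ZFS's
actual compressor, and prepending a byte rather than a bit):

```python
import zlib

file_a = b"hello world " * 10_000   # highly compressible sample data
modified_a = b"\x00" + file_a       # prepend one byte to file A

size_a = len(zlib.compress(file_a))
size_mod = len(zlib.compress(modified_a))

# The compressed sizes stay almost identical after the prepend, so the
# on-disk usage of the modified file barely changes - unlike dedup,
# where the shift invalidated every block match.
print(size_a, size_mod)
```

Compression works within each chunk it is handed, so a small edit only
perturbs the output slightly; it doesn't depend on two files sharing
identically aligned blocks.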
Compression is not obsoleted by dedup. They both have their places,
depending on the data being stored, and the usage pattern of that data.
On Wed, 2010-05-05 at 19:11 -0700, Richard L. Hamilton wrote:
> Another thought is this: _unless_ the CPU is the bottleneck on
> a particular system, compression (_when_ it actually helps) can
> speed up overall operation, by reducing the amount of I/O needed.
> But storing already-compressed files in a filesystem with compression
> is likely to result in wasted effort, with little or no gain to show for it.
> Even deduplication requires some extra effort. Looking at the documentation,
> it implies a particular checksum algorithm _plus_ verification (if the
> checksum or digest matches, then make sure by doing a byte-for-byte compare of the
> blocks, since nothing shorter than the data itself can _guarantee_ that
> they're the same, just like no lossless compression can possibly work for
> all possible bitstreams).
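The checksum-plus-verify step described in the quote could be sketched
like this (a hypothetical helper, not ZFS code):

```python
import hashlib

def blocks_are_duplicates(block_a, block_b):
    """Dedup-with-verify sketch: cheap digest comparison first, then a
    byte-for-byte compare to rule out hash collisions."""
    if hashlib.sha256(block_a).digest() != hashlib.sha256(block_b).digest():
        return False           # digests differ: definitely not duplicates
    return block_a == block_b  # digests match: confirm byte-for-byte

print(blocks_are_duplicates(b"abc", b"abc"))  # True
print(blocks_are_duplicates(b"abc", b"abd"))  # False
```

The digest comparison filters out almost all non-matches cheaply; the
full compare only runs on candidate duplicates, which is why the
verification cost is modest in practice.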
> So doing either of these where the success rate is likely to be too low
> is probably not helpful.
> There are stats that show the savings for a filesystem due to compression
> or deduplication. What I think would be interesting is some advice as to
> how much (percentage) savings one should be getting in order to come
> out ahead not just on storage, but on overall system performance. Of
> course, no such guidance would exactly fit any particular workload, but
> I think one might be able to come up with some approximate numbers,
> or at least a range, below which those features probably represented
> a waste of effort unless space was at an absolute premium.
Java System Support
Santa Clara, CA
Timezone: US/Pacific (GMT-0800)
zfs-discuss mailing list