[darcs-devel] repository "weak" hash

Sat Apr 10 06:43:48 UTC 2021

TL;DR: The repo hash is weak, but not because recursively hashing hashes
is cryptographically weak in general. Rather, it is weak because of an
inadequate implementation that on the one hand trades security for
efficiency (use of xor) while on the other hand doing more work than
necessary (combining all hashes in the repo).

The haddocks for Darcs.Repository.Hashed.repoXor say:

-- | XOR of all hashes of the patches' metadata.
-- It enables to quickly see whether two repositories
-- have the same patches, independently of their order.
-- It relies on the assumption that the same patch cannot
-- be present twice in a repository.
-- This checksum is not cryptographically secure,
-- see http://robotics.stanford.edu/~xb/crypto06b/ .

The last sentence above is based on a misunderstanding and is thus
misleading. The theorem proved in that paper is about combining the
results of applying two different hash functions (with only partly known
properties) to the same input. We, on the other hand, are concerned with
combining the results of the same hash function (with fully known
properties) applied to different inputs. The latter is quite a common
operation: Merkle trees (and thus blockchains) are based on the idea
that it is sound to hash hashes recursively (using the same,
cryptographically secure, hash function).
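
To illustrate what "hashing hashes recursively" means, here is a minimal
Python sketch of a Merkle root computation. It is not darcs code; the
name merkle_root and the rule for an odd node (carry it up unchanged)
are my own choices for the example, and sha1 is used only because it is
the hash function discussed here.

```python
import hashlib


def sha1(data: bytes) -> bytes:
    """Return the raw 20-byte SHA-1 digest of data."""
    return hashlib.sha1(data).digest()


def merkle_root(leaves):
    """Hash each leaf, then repeatedly hash pairs of digests until one remains.

    Hypothetical helper for illustration: an unpaired node at the end of a
    level is carried up unchanged.
    """
    if not leaves:
        raise ValueError("need at least one leaf")
    level = [sha1(leaf) for leaf in leaves]
    while len(level) > 1:
        # Combine neighbours by hashing the concatenation of their digests.
        paired = [sha1(level[i] + level[i + 1])
                  for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            paired.append(level[-1])
        level = paired
    return level[0]
```

The same hash function is applied at every level, to different inputs
each time, which is the situation we are in and not the one the cited
paper analyses.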

There are two senses in which the "weak hash" is indeed weak:

One is that it uses xor. Instead we should sort the hashes (so that the
result stays independent of patch order), concatenate them, and then use
the same hash function (sha1) to hash that.
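
To make the difference concrete, here is a small Python sketch comparing
the two ways of combining hashes. It is an illustration only, not the
darcs implementation: the names xor_combine and sorted_combine and the
sample "patch" byte strings are made up, and real meta data hashes would
come from the repository.

```python
import hashlib
from functools import reduce


def sha1(data: bytes) -> bytes:
    """Return the raw 20-byte SHA-1 digest of data."""
    return hashlib.sha1(data).digest()


def xor_combine(hashes):
    """Combine 20-byte digests by bytewise XOR (the scheme criticised here)."""
    zero = bytes(20)
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), hashes, zero)


def sorted_combine(hashes):
    """Sort the digests, concatenate them, and hash the result with SHA-1."""
    return sha1(b"".join(sorted(hashes)))


# Stand-ins for meta data hashes of three patches (hypothetical inputs).
patches = [sha1(name) for name in (b"patch A", b"patch B", b"patch C")]
shuffled = [patches[2], patches[0], patches[1]]

# Both schemes are independent of patch order.
same_xor = xor_combine(patches) == xor_combine(shuffled)
same_sorted = sorted_combine(patches) == sorted_combine(shuffled)

# XOR is linear: a hash that occurs twice cancels itself out,
# so the combined value cannot tell the two sets apart.
with_dupes = patches + [patches[0], patches[0]]
xor_blind = xor_combine(with_dupes) == xor_combine(patches)
sorted_sees = sorted_combine(with_dupes) != sorted_combine(patches)
```

The cancellation is why the haddock has to assume that the same patch
cannot be present twice. With sort-then-hash, the combined value is as
hard to manipulate as the underlying hash function itself.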

A second and more fundamental one is that this is all about mere meta
data hashes, so the "weak" hash is only secure under the assumption of
global uniqueness of meta data (more precisely: the assumption is that
meta data uniquely maps to patch identity in the strong sense of
"representation in minimal context").

Furthermore: if we work with (i.e. trust) meta data hashes anyway, then
there is no reason to include all patches in the repository hash! It is
enough to hash the meta data hashes from the latest clean tag up to the
head, i.e. the patches referenced by the head inventory. If we trust
meta data hashes, then we must also trust that a clean tag uniquely
identifies the set of all patches preceding it. Insisting on combining
the (meta data) hashes of all patches in a repo amounts to trusting in
the global uniqueness property for regular patches but distrusting it
for tags, which makes no sense at all if you ask me.
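
A rough Python sketch of that idea, under stated assumptions: the
inventory is modelled as a list of (is_clean_tag, meta_data_hash) pairs
in repository order, which is a simplification of my own and not how
darcs stores inventories, and head_hash is a hypothetical name.

```python
import hashlib


def sha1(data: bytes) -> bytes:
    """Return the raw 20-byte SHA-1 digest of data."""
    return hashlib.sha1(data).digest()


def head_hash(inventory):
    """Hash only the entries from the latest clean tag up to the head.

    inventory: list of (is_clean_tag, meta_data_hash) pairs in repository
    order (an assumed, simplified model). Without any clean tag, all
    entries are hashed.
    """
    start = 0
    for i, (is_clean_tag, _) in enumerate(inventory):
        if is_clean_tag:
            start = i  # remember the position of the latest clean tag
    tail = [h for _, h in inventory[start:]]
    # Sort so that reordering patches after the tag does not change the result.
    return sha1(b"".join(sorted(tail)))


# Hypothetical meta data hashes for five patches and one tag.
p = [sha1(bytes([i])) for i in range(5)]
tag = sha1(b"tag")

full = [(False, p[0]), (False, p[1]), (True, tag), (False, p[2]), (False, p[3])]
reordered = [(False, p[1]), (False, p[0]), (True, tag), (False, p[3]), (False, p[2])]
```

Here head_hash(full) and head_hash(reordered) agree, and both equal
head_hash(full[2:]): everything before the tag is covered by trusting
the tag's own hash, so the work is proportional to the number of
patches since the last clean tag rather than to the size of the repo.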

Cheers
Ben
-- 
I would rather have questions that cannot be answered, than answers that
cannot be questioned.  -- Richard Feynman