[darcs-devel] Re: darcs patch: add hash function to FastPackedString (and 1 more)

Benedikt Schmidt ry102 at rz.uni-karlsruhe.de
Tue Apr 5 12:43:07 PDT 2005


Ian Lynagh <igloo at earth.li> writes:

> On Tue, Apr 05, 2005 at 08:25:10AM -0400, Benedikt Schmidt wrote:
>> 
>> Tue Apr  5 14:31:55 CEST 2005  Benedikt Schmidt <beschmi at cloaked.de>
>>   * replace Hunt/Szymanski diff algorithm with one by Myers
>
> Do you have some comparison times, both for large common cases and also
> for cases where H-S struggles?

I did some test with the linux kernel tree (going from 2.2.0 to 2.2.9) and
record/whatsnew takes about half the time with the new code. The H-S code
had problems diffing big files with many changes, e.g. RT#222 or a search and
replace in a big textfile. A diff of the files in #222 takes 0.5s here with
GNU diff -d, about a minute with the H-S code and 2.4s with the new code.
In some cases, creating the edit-script out of the LCS takes as much time
as finding the lcs with the H-S code. I can post some more numbers after
running these tests again.

There were two things that made a real difference for the Myers algo, not
considering lines that are only present in one of the files and using
hash values instead of comparing the strings (often lines have the same
prefix). I think both of these can be applied to the H-S algo too and
there are probably some other things that could be optimized in the H-S
implemetation (e.g. openbsd uses the H-S algo in diff(1)).

>> +foreign import ccall unsafe "fpstring.h hash" hash
>> +    :: Ptr Word8 -> Int -> IO Int32
>> +
>> +int hash(const char *s, int len) {
>> +}
>
> int in C is CInt in Haskell.
> Int in Haskell is HsInt in C.
> You shouldn't use Int with int.
> (darcs could probably do with an audit for this)

Ok, just copied from first_white without thinking, the return
type could lead to real problems on 64 Bit archs.

> ((I haven't followed the code to see if signedness of chars can also be a
> problem))
> (((of course, IMO it should all be in Haskell anyway  :-)  )))

I'll try a haskell version and do some tests, i have used
Data.HashTable.hashString . unpackPS first, but there were much
more collisions IIRC. 

Thanks for the comments,

Benedikt  





More information about the darcs-devel mailing list