[darcs-users] Regular Expression libraries and linker errors

Jason Dagit dagit at codersbase.com
Mon Oct 5 15:30:34 UTC 2009

On Mon, Oct 5, 2009 at 7:28 AM, Petr Rockai <me at mornfall.net> wrote:

> Hi,
> Jason Dagit <dagit at codersbase.com> writes:
> > In my travels profiling the performance of record I noticed that we do
> > spend about 1/3 of the time just matching regular expressions on filenames.
> Just one thing... do we match those on String or on ByteString? Because not
> using String would probably lead to another substantial speedup on this.

The strings we use as matchers and the strings we match against are both
small, and therefore I think ByteStrings would be a poor fit here.  I forget
now if the whole thread has been on darcs-users, but I've been exchanging
emails with Simon Marlow recently and I now believe that ByteStrings can be a
pessimization when you have many small ones.  Additionally, the heap they
consume is not reported correctly in the heap profiling data, so it can be
misleading to benchmark with ByteStrings.

Is there some other reason to believe ByteString will be better here, for
example that the BS pointer can be passed directly to the regex lib without
copying?  I could see that helping a lot, but then the fragmentation issue
would still be annoying.
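
To make the comparison concrete, here is a minimal sketch of matching the same
compiled pattern against a String and against a ByteString.  It assumes the
regex-posix and bytestring packages; the pattern and the names boringPattern,
isBoringString and isBoringBytes are made up for illustration and are not
taken from darcs.  It says nothing about which variant is faster; that would
still need to be benchmarked.

```haskell
import qualified Data.ByteString.Char8 as B
import Text.Regex.Posix (Regex, makeRegex, matchTest)

-- Hypothetical pattern: file names ending in ".o" or ".hi".
-- It is compiled once and shared by both matchers below.
boringPattern :: Regex
boringPattern = makeRegex ("\\.(o|hi)$" :: String)

-- Match against an ordinary String (FilePath is a synonym for String).
isBoringString :: FilePath -> Bool
isBoringString = matchTest boringPattern

-- Match against a strict ByteString using the same compiled regex.
isBoringBytes :: B.ByteString -> Bool
isBoringBytes = matchTest boringPattern

main :: IO ()
main = do
  let names = ["src/Main.hs", "src/Main.o", "src/Main.hi"]
  -- Only a yes/no answer is requested, not the matched text.
  mapM_ (print . isBoringString) names
  mapM_ (print . isBoringBytes . B.pack) names
```

Since only a Bool is needed, matchTest avoids building any match results, so
the remaining difference between the two variants is how the input reaches
the regex library.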

> We may also want to switch to regex-dfa, since I believe we only care whether
> we have a match and not much else. But you are right that regex-pcre or
> pcre-light might be faster (before deciding, it may make a lot of sense to
> benchmark both in darcs, though).

I tried regex-dfa (it's the only other regex lib I've been able to try so
far) and it is much, much slower than regex-compat.  I don't know where
I put the numbers, but it was a very clear pessimization.  Eric gave me some
advice on solving the iconv issues.  When I get back to this I'll try
regex-pcre and see how it compares to regex-posix.

> Also, I'd say that working over ByteStrings should take priority, although it
> may be quite tricky to do. I guess we obtain those filepaths from
> System.Directory or such, which tends to work with String?

I was working on a System.FilePath.ByteString (or whatever the module is
named), so that we could benchmark with our filepaths as ByteStrings, but
after I found out about fragmentation and the way the GC struggles with some
aspects of ByteStrings, I put the project away for a while.  It seems that
ByteStrings aren't going to be very useful unless the strings are more than
4K, for example.  Otherwise each ByteString wastes the difference between
its length and that size.
