[darcs-users] Regular Expression libraries and linker errors

Jason Dagit dagit at codersbase.com
Sat Oct 3 08:32:38 UTC 2009


Hello,

While profiling the performance of record, I noticed that we spend about
1/3 of the time just matching regular expressions against filenames.

Initially I thought the reason might be that we didn't compile and reuse
the regexes that we create.  But we do compile them and we do reuse them.
I checked and double-checked this.

I did learn some things:
1) We use regex-compat.
2) regex-compat is a very thin wrapper around regex-posix (even on Windows).
3) regex-compat isn't the way to get compatibility between the various regex
libs, it's a way to be compatible with the *old* Haskell regex API.
4) regex-posix uses the current Haskell regex API.
5) The default regexes that we provide (e.g., on darcs init) are not fully
optimized and may not do what people expect in all cases.

Here is what I propose:
a) We switch to regex-posix.
b) We invest a small bit of time writing a function to optimize a list of
simple regexes into one big but efficient regex.

I think (a) is mostly trivial and I could probably do this in an afternoon,
although probably not this weekend.  We would do this for maintenance
reasons.

I propose (b) because I did some performance testing and found that if we
just naively OR the current default list of binary file matches together, it
slows darcs down, but if we carefully rewrite them, it speeds darcs up.
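
For illustration, the naive combination mentioned above could look something
like the following.  This is only a sketch: the name naiveOr and the choice to
wrap each pattern in its own group are assumptions, not code taken from darcs.

```haskell
import Data.List (intercalate)

-- | Naively OR a list of regexes together by wrapping each one in a
-- group and joining the groups with '|'.  Hypothetical helper, not
-- taken from darcs.  Note that an empty list yields the empty
-- pattern, which matches every string.
naiveOr :: [String] -> String
naiveOr pats = intercalate "|" ["(" ++ p ++ ")" | p <- pats]
```

With this definition, naiveOr ["\\.foo$", "\\.FOO$"] produces the pattern
(\.foo$)|(\.FOO$), which repeats the leading dot and the end anchor in every
alternative.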

Specifically, our current list looks like:
\.(foo|FOO)$

I think we should transform that to:
\.[fF][oO][oO]$

I think that better captures the case-insensitive intent that we had.

Originally it looked like:
\.foo$
\.FOO$

But, I think .Foo and .fOo should match as well.
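
A minimal sketch of how such a bracket pattern could be generated from a plain
extension, using only base.  The names caseFoldExt and extPattern are
hypothetical and not part of darcs; the sketch assumes the extension is a
literal string, and escapes any regex metacharacters it contains.

```haskell
import Data.Char (isAlpha, toLower, toUpper)

-- | Turn a literal file extension into a case-insensitive bracket
-- pattern, e.g. "foo" becomes "[fF][oO][oO]".  Hypothetical helper.
caseFoldExt :: String -> String
caseFoldExt = concatMap fold
  where
    fold c
      -- Letters with distinct lower and upper forms get a two-element class.
      | isAlpha c && toLower c /= toUpper c = ['[', toLower c, toUpper c, ']']
      | otherwise = escape c
    -- Everything else is kept literally, escaped if it is a metacharacter.
    escape c
      | c `elem` "\\.[]()*+?{}|^$" = ['\\', c]
      | otherwise = [c]

-- | Build the full pattern for an extension: literal dot, folded
-- extension, end-of-string anchor.
extPattern :: String -> String
extPattern ext = "\\." ++ caseFoldExt ext ++ "$"
```

Here extPattern "foo" yields \.[fF][oO][oO]$, and characters without case,
such as the digit in "mp3", are left as they are.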

Even easier than (b) is to change the defaults to the format I propose
(using [] instead of (|)).  But that doesn't help people who have the older
formats.  I can also imagine other stopgap proposals, like making a
standalone command-line tool that optimizes the regexes and writes them back
out so people have a chance to review them.  But having darcs optimize them
on the fly (or adding that to the regex-base library) is nice because then
users can throw any old regex at darcs and it tries to clean it up before
using it.
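
As a rough sketch of what such an on-the-fly cleanup might look like, the
following recognises only the exact shape \.(ext|EXT)$ and rewrites it to the
bracket form, leaving every other pattern untouched.  The names parseOldFormat
and optimise are hypothetical, and restricting the match to ASCII letters and
digits is an assumption made to keep the example small.  Note that the rewrite
deliberately widens what matches (mixed-case extensions such as .Foo), which is
the intent described above.

```haskell
import Data.Char (isAlpha, isAlphaNum, isAscii, toLower, toUpper)
import Data.List (stripPrefix)

-- | Recognise a pattern of the exact shape \.(ext|EXT)$ whose two
-- alternatives differ only in case, returning the lower-cased
-- extension.  Hypothetical helper, not part of darcs.
parseOldFormat :: String -> Maybe String
parseOldFormat pat = do
  rest <- stripPrefix "\\.(" pat
  let (first, rest1) = span isExtChar rest
  rest2 <- stripPrefix "|" rest1
  let (second, rest3) = span isExtChar rest2
  if rest3 == ")$" && not (null first)
       && map toLower first == map toLower second
    then Just (map toLower first)
    else Nothing
  where
    isExtChar c = isAscii c && isAlphaNum c

-- | Rewrite old-format patterns to bracket form; anything that is not
-- recognised is returned unchanged.
optimise :: String -> String
optimise pat = case parseOldFormat pat of
  Just ext -> "\\." ++ concatMap bracket ext ++ "$"
  Nothing  -> pat
  where
    bracket c
      | isAlpha c = ['[', toLower c, toUpper c, ']']
      | otherwise = [c]
```

Applied to \.(foo|FOO)$ this returns \.[fF][oO][oO]$, while a pattern such as
^_darcs$ passes through unchanged.  Returning unrecognised input as-is keeps
the rewrite conservative: a pattern is only ever replaced when its shape is
fully understood.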

It's possible that regex-pcre gives better performance than regex-posix, but
when I switched over to it I got some weird linker errors building darcs:
[138 of 138] Compiling Main             ( src/darcs.hs, dist/build/darcs/darcs-tmp/Main.o )
Linking dist/build/darcs/darcs ...
Undefined symbols:
  "_iconv_close", referenced from:
      _h_iconv_close in libHShaskeline-0.6.1.6.a(h_iconv.o)
  "_iconv", referenced from:
      _h_iconv in libHShaskeline-0.6.1.6.a(h_iconv.o)
  "_iconv_open", referenced from:
      _h_iconv_open in libHShaskeline-0.6.1.6.a(h_iconv.o)
ld: symbol(s) not found
collect2: ld returned 1 exit status

Seems like that should be related to haskeline, but it wasn't happening
before I told darcs to use regex-pcre.  Does anyone else know anything about
this?

Thanks,
Jason