[darcs-users] Regular Expression libraries and linker errors

Jason Dagit dagit at codersbase.com
Tue Oct 6 03:05:35 UTC 2009


On Mon, Oct 5, 2009 at 6:18 PM, Trent W. Buck <twb at cybersource.com.au> wrote:

> Petr Rockai <me at mornfall.net> writes:
>
> > Jason Dagit <dagit at codersbase.com> writes:
> >> In my travels profiling the performance of record I noticed that we
> >> do spend about 1/3 of the time just matching regular expressions on
> >> filenames.
> > Just one thing... do we match those on String or on ByteString?
>
> In principle, I would like to match Unicode codepoints, not bytes.
>

On OS X, man regex gives these two prototypes:
     int
     regcomp(regex_t *restrict preg, const char *restrict pattern,
         int cflags);
     int
     regexec(const regex_t *restrict preg, const char *restrict string,
         size_t nmatch, regmatch_t pmatch[restrict], int eflags);

So both regcomp and regexec take plain byte strings (const char *).  If a
wchar version exists, I don't think the Haskell bindings are using it.

I think that as long as you're lucky enough that the regex and the string
are in the same encoding, ByteString and String will be equivalent in their
matching ability here.  Unfortunately, I don't think darcs makes any such
guarantee.
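
To make the encoding point concrete, here is a minimal sketch in C against
the POSIX interface quoted above.  The helper name ere_matches and the sample
file name are made up for illustration (they are not from darcs), and the
sketch assumes the default "C" locale, where each byte is treated as one
character.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper (not part of darcs): returns 1 if the POSIX ERE
 * `pattern` matches anywhere in `text`, 0 if it does not, and -1 if the
 * pattern fails to compile.  Both arguments are plain byte strings. */
static int ere_matches(const char *pattern, const char *text)
{
    regex_t re;
    int rc;

    assert(pattern != NULL && text != NULL);

    /* REG_NOSUB: we only ask "match or no match"; no offsets are reported. */
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;

    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}

/* The same visible file name with the e-acute encoded two ways:
 * UTF-8 uses the bytes 0xC3 0xA9, Latin-1 uses the single byte 0xE9. */
static const char name_utf8[]   = "caf\xc3\xa9.txt";
static const char name_latin1[] = "caf\xe9.txt";

/* A pattern whose literal part is written in UTF-8. */
static const char pat_utf8[]    = "^caf\xc3\xa9\\.txt$";
```

With these definitions, ere_matches(pat_utf8, name_utf8) reports a match and
ere_matches(pat_utf8, name_latin1) does not: the names look identical to a
user, but the bytes handed to regexec differ.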


>
> In practice, I avoid non-ASCII and non-printable characters in file
> names, because there are so many such issues on Unix :-(
>
> > Because not using String would probably lead to another substantial
> > speedup on this. We may also want to switch to regex-dfa, since I
> > believe we only care whether we have a match and not much else.
>
> I see no downside there.
>

I was going to agree that we don't need the extra capabilities, like
extracting matches and doing replacements, but it just occurred to me that
we could probably re-implement some things, like decode_white/encode_white,
using regexps and potentially get better performance.  It's worth running a
performance test to see.  I'll try to get some data on this.


>
> > But you are right that regex-pcre or pcre-light might be faster
> > (before deciding, it may make a lot of sense to benchmark both in
> > darcs, though).
>
> I have no problem switching from EREs to PCREs, but if we do so, please
> let's do it for all of Darcs at once!
>

Agreed.


>
> As well as benchmarking, someone will need to check that the default
> regexps that Darcs HAS shipped will have the same semantics after
> switching to PCRE.
>

If it comes to this, do you think you would know how to determine this?  I'd
have to do a bit of research to figure it out myself.
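
One known difference that could serve as a starting point: POSIX EREs
report the longest of the leftmost matches, while Perl-style alternation
takes the first alternative that succeeds.  That changes which span is
matched, not necessarily whether a match exists.  Below is a small sketch of
the POSIX side only; the helper name ere_match_length is hypothetical, and
the PCRE behaviour is described in a comment rather than exercised.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: length in bytes of the first match of the POSIX ERE
 * `pattern` in `text`, or -1 if there is no match or the pattern does not
 * compile. */
static long ere_match_length(const char *pattern, const char *text)
{
    regex_t re;
    regmatch_t m[1];
    long len = -1;

    assert(pattern != NULL && text != NULL);

    /* No REG_NOSUB here: we want regexec to fill in the match offsets. */
    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;

    if (regexec(&re, text, 1, m, 0) == 0)
        len = (long)(m[0].rm_eo - m[0].rm_so);

    regfree(&re);
    return len;
}

/* Under POSIX rules, "a|ab" against "ab" matches the whole string "ab".
 * A Perl-style engine would stop at the first alternative and match "a". */
```

For patterns used only as a yes/no filter on file names this particular
difference may not matter, but any shipped default that is used to extract
text would need a check along these lines.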

Jason