[darcs-users] Regular Expression libraries and linker errors
Jason Dagit
dagit at codersbase.com
Tue Oct 6 03:05:35 UTC 2009
On Mon, Oct 5, 2009 at 6:18 PM, Trent W. Buck <twb at cybersource.com.au> wrote:
> Petr Rockai <me at mornfall.net> writes:
>
> > Jason Dagit <dagit at codersbase.com> writes:
> >> In my travels profiling the performance of record I noticed that we
> >> do spend about 1/3 of the time just matching regular expressions on
> >> filenames.
> > Just one thing... do we match those on String or on ByteString?
>
> In principle, I would like to match Unicode codepoints, not bytes.
>
On OS X, man regex gives these two definitions:
    int
    regcomp(regex_t *restrict preg, const char *restrict pattern,
        int cflags);

    int
    regexec(const regex_t *restrict preg, const char *restrict string,
        size_t nmatch, regmatch_t pmatch[restrict], int eflags);
So, both regcomp and regexec take vectors of bytes. If a wchar version
exists then I don't think the Haskell bindings use it.
I think as long as you're lucky enough that the regex and string are in the
same encoding then ByteString and String will be equivalent in their
matching ability here. Unfortunately I don't think darcs makes any such
guarantees.
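
To make the byte-versus-codepoint point concrete, here is a minimal sketch in C against the POSIX interface quoted above. The helper name ere_matches is made up for illustration, and the sketch assumes the program never calls setlocale, so it runs in the default "C" locale where the matcher works on single bytes.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper (not part of darcs or any library): returns 1 if
 * `text` matches the POSIX extended regex `pattern`, 0 if it does not,
 * and -1 if the pattern fails to compile. Both arguments are plain byte
 * strings; nothing here knows which encoding they are in. */
int ere_matches(const char *pattern, const char *text)
{
    regex_t re;
    int rc;

    assert(pattern != NULL && text != NULL);

    /* REG_NOSUB: only a yes/no answer is wanted, no match offsets. */
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;

    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```

Under that assumption, I would expect ere_matches("^.$", "\xc3\xa9") to report no match, because the UTF-8 encoding of "é" is two bytes and "." consumes one, while "^..$" matches it. Likewise a Latin-1 pattern "^\xe9$" should fail against the UTF-8 string "\xc3\xa9" even though both spell the same character, which is the mismatched-encoding case described above.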
>
> In practice, I avoid non-ASCII and non-printable characters in file
> names, because there are so many such issues on Unix :-(
>
> > Because not using String would probably lead to another substantial
> > speedup on this. We may also want to switch to regex-dfa, since I
> > believe we only care whether we have a match and not much else.
>
> I see no downside there.
>
I was going to agree that we don't need the extra capabilities like
extracting matches and doing replaces, but it just occurred to me that we
could probably re-implement some things, like decode_white/encode_white,
using regexps and potentially get better performance. It's worth doing a
performance test to see. I'll try to get some data on this.
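
For reference, "extracting matches" at the POSIX level means asking regexec to fill in a regmatch_t array, which is the capability a match-only engine would give up. The sketch below is only an illustration of that API: first_match is a hypothetical helper, and it assumes nothing about how decode_white/encode_white are actually implemented.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: finds the first match of the POSIX extended regex
 * `pattern` in `text`. On success returns 1 and stores the byte offsets
 * of the match in *start and *end (*end is one past the last matched
 * byte). Returns 0 if nothing matches, -1 if the pattern does not
 * compile. */
int first_match(const char *pattern, const char *text,
                size_t *start, size_t *end)
{
    regex_t re;
    regmatch_t m[1];
    int rc;

    assert(start != NULL && end != NULL);

    /* No REG_NOSUB here: offsets are wanted, so regexec must record them. */
    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;

    rc = regexec(&re, text, 1, m, 0);
    regfree(&re);
    if (rc != 0)
        return 0;

    *start = (size_t)m[0].rm_so;
    *end = (size_t)m[0].rm_eo;
    return 1;
}
```

A rewrite built on this would loop over the input, copy the unmatched bytes, and emit a replacement for each matched span. Whether that beats a hand-written loop is exactly what the performance test would have to show.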
>
> > But you are right that regex-pcre or pcre-light might be faster
> > (before deciding, it may make a lot of sense to benchmark both in
> > darcs, though).
>
> I have no problem switching from EREs to PCREs, but if we do so, please
> let's do it for all of Darcs at once!
>
Agreed.
>
> As well as benchmarking, someone will need to check that the default
> regexps that Darcs HAS shipped will have the same semantics after
> switching to PCRE.
>
If it comes to this, do you think you would know how to determine this? I'd
have to do a bit of research to figure it out myself.
Jason