[darcs-users] Applying formal descriptions to files

Max Battcher me at worldmaker.net
Sun Feb 8 06:35:10 UTC 2009


Trent W. Buck wrote:
> 
> "Max Battcher" <me at worldmaker.net> writes:
> 
> > * Lexers are fast.
> > * Lexers preserve the formatting of documents.
> > * Lexers are dumb, but recover quickly from errors.  (Ever watched
> > your editor's lexers twist through error states as you type?)
> 
Emacs, at least, does not perform "proper" lexical analysis (except
perhaps nxml).  That is, Emacs does not turn a stream of codepoints
into a stream of lexical tokens.  Instead, it invariably just uses a
bunch of *separate* regular expressions to overlay faces on matching
regions of text.

IIRC, some (many?) of Vim's syntax files do the same or something
similar, and Vim's syntax files exhibit a wide variety of techniques,
with sometimes huge differences between any two authors' methods.  I did
note in my email that lexers vary quite a bit in implementation details,
and that to some extent "proper" lexical analysis doesn't really exist
anywhere.  But also, as I said, most techniques can be converted to one
another with a small bit of work.  Of my suggestions, I do think Pygments
is perhaps the best example of a lexer library to work with, particularly
because its lexers are mostly consistent in technique.  Pygments uses a
nice declarative style for its lexers, and its stack-based FSM technique
is simple, easy to read, and works for most languages.  Also, Pygments
generates a very nice stream of tokens...  Here are the first few tokens
of a random file for a quick example:

>>> from pygments.lexers import PythonLexer
>>> import pygments
>>> f = open(r'C:\Users\Max\Repos\darcsforge\darcsforge\patches\models.py')
>>> code = f.read()
>>> f.close()
>>> lex = pygments.lex(code, PythonLexer())
>>> list(lex)[:25]
[(Token.Comment, u'###'), (Token.Text, u'\n'),
 (Token.Comment, u'# Darcsforge '), (Token.Text, u'\n'),
 (Token.Comment, u'#'), (Token.Text, u'\n'),
 (Token.Comment, u'# Copyright (C) 2006-2008 Max Battcher <me at worldmaker.net>.'),
 (Token.Text, u'\n'),
 (Token.Comment, u'# Licensed for distribution and usage under the Microsoft Reciprocal License.'),
 (Token.Text, u'\n'), (Token.Comment, u'###'), (Token.Text, u'\n'),
 (Token.Keyword.Namespace, u'from'), (Token.Text, u' '),
 (Token.Name.Namespace, u'django.db'), (Token.Text, u' '),
 (Token.Keyword.Namespace, u'import'), (Token.Text, u' '),
 (Token.Name, u'models'), (Token.Text, u'\n'),
 (Token.Keyword.Namespace, u'from'), (Token.Text, u' '),
 (Token.Name.Namespace, u'django.contrib.sites.models'), (Token.Text, u' '),
 (Token.Keyword.Namespace, u'import')]

As you can see, the Pygments token types (the first item in each tuple)
are pretty descriptive, and they are pretty well standardized across all
Pygments lexers because they correspond directly to syntax highlighting
classes.  You can also see that the entire contents of the file are
preserved in the stream: every character, including whitespace, lands in
some token, which is why syntax-highlighted output from Pygments
reproduces the source exactly as expected.
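
For anyone who hasn't looked inside Pygments, here is a minimal sketch
of the declarative, stack-based style I mentioned above (this toy lexer
and its rules are invented for illustration; they are not from any
shipping Pygments lexer):

from pygments.lexer import RegexLexer
from pygments.token import Comment, Keyword, Name, String, Text

class ToyLexer(RegexLexer):
    # Each state maps to a list of (regex, token-type[, action]) rules;
    # RegexLexer runs a stack-based FSM over these states.
    name = 'Toy'
    tokens = {
        'root': [
            (r'#.*$', Comment),                  # line comment
            (r'\b(?:from|import)\b', Keyword.Namespace),
            (r'"', String, 'string'),            # push the 'string' state
            (r'[A-Za-z_][A-Za-z0-9_]*', Name),
            (r'\s+', Text),                      # whitespace is a token too
        ],
        'string': [
            (r'[^"]+', String),
            (r'"', String, '#pop'),              # pop back to 'root'
        ],
    }

Note that whitespace is emitted as ordinary tokens rather than thrown
away, so concatenating the token texts reproduces the input exactly;
that is the property that makes these streams attractive for
formatting-preserving tools.  The token types also nest, so a consumer
can match as coarsely or as finely as it likes (for example,
Token.Keyword.Namespace in Token.Keyword is True).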

I really would be interested to see what could be done with even a
rudimentary diff tool that compares two streams of tokens like the one
above...
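
For instance, a first cut could just feed two token streams to difflib's
SequenceMatcher from the standard library (the token_diff helper below
is hypothetical, and treating each (type, text) tuple as an atomic
element is only one of several plausible choices):

import difflib
import pygments
from pygments.lexers import PythonLexer

def token_diff(old_code, new_code):
    # Lex both versions, then diff the (token-type, text) tuples so
    # that changes are reported in token units rather than whole lines.
    old = list(pygments.lex(old_code, PythonLexer()))
    new = list(pygments.lex(new_code, PythonLexer()))
    matcher = difflib.SequenceMatcher(a=old, b=new)
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op != 'equal':
            print op, old[i1:i2], '->', new[j1:j2]

token_diff("from django.db import models\n",
           "from django.db import models, connection\n")

Even something this naive already reports changes below the granularity
of a line, which is exactly what a whole-line tool like plain diff
cannot do.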

--
--Max Battcher--
http://worldmaker.net


