[darcs-users] Applying formal descriptions to files

Max Battcher me at worldmaker.net
Sun Feb 8 03:52:23 UTC 2009

Maurício wrote:
> I understand your points. But I had in mind that something a lot
> more simple could work. When doing Haskell code, for instance, I
> wouldn't need a faithfull description of the language. To me,
> something like "a set of blocks of code, where consecutive spaces
> do not matter except in the beggining of a block, where blocks are
> separete by a blank line" would be enough. Of course, my Haskell
> layout is unusual -- I type it like Latex, using 'par' to layout
> the result -- but that would be fine for me.

Not everyone is happy with code auto-layout and auto-formatting.  It seems
that there will always be some situations that beg for manual formatting.
In my experience languages like Haskell and Python that are
whitespace-dependent surprisingly seem to be sometimes more prone to manual
whitespace and layout tweaking (beyond just what the syntax requires) than
languages that don't have syntactic whitespace.  But then, I've also seen
people carefully reformat function calls in C-based languages based on
obscure aesthetic choices that they couldn't even quite summarize, much less
automate...  ("It just looks better this way in this particular case...")

Certainly with a small, private code watcher tool you can auto-layout all of
your code to your heart's content, but consensus seems to agree that should
darcs offer smarter patch types for code, whitespace and other formatting
niceties need to be taken into account.
> In some way, darcs already has a description for text files: a
> set of lines. Of course, general pourpose description of Haskell,
> C++ or Perl would probably be more difficult that writing a
> compiler :) But just beeing able to write something very simple
> that we could attach to a file at its creation would be helpfull.
> As long as that simple description could transform the file to
> a 'canonical' version, even the work of back convertion would be
> unnecessary.

I think at this point I would be wary of anything that requires
canonicalization or conversion.  I don't think that a lot of developers want
that sort of thing as there seems to be more failed experiments in that area
of work than working examples...  If smarter syntactic versioning in darcs
is to happen, it should probably do as much as it can to do so with
documents as they are.

I promised you some of my own ideas on how such a thing might work, and I've
got a more recent sort of hunch on how to accomplish it.  I feel like I may
have posted something or another on the topic previously, but a brief search
didn't turn up anything...

Let's start by going back to basic parsing theory.  You asked about BNF/WSN
parsers and I explained some of the issues with targeting them towards
version control.  Such parsers work great for compilers that can refuse to
compile a document, but not so great for tracking the changes of a "living"
document.  Such parsers are generally "slow" (at least, in terms of
computing ten years ago) and "fragile", but "smart" (they show the
relationships between pieces of code that a compiler uses to produce working
translations to machine code).

What I skipped over in my description in the previous email is that a parser
is actually the second step in most compilers (but not all).  A typical
parser takes as an input a stream of tokens and outputs a syntactic tree
that describes the relationship between tokens and language constructs.  The
first step generally called a lexer (also sometimes called a tokenizer, for
obvious reasons).  There are several reasons for the classic lexer/parser
split, but the remaining important is that a lexer is "fast" and "dumb" and
often far easier to write/debug than the parser.

The role of a lexer is to take a source document and split it into a stream
of tokens.  If a parser focuses on the sentence and paragraph structure of a
language, tokens are the words of the language.  A lexer walks through the
document finding all the basic words of the document, often coloring them
with a small amount of type information, and sometimes (depending on the
parser/compiler type) noting where exactly that word came from in the source
document (often so that the parser can report the location of errors).
Splitting a file by lines is exactly one example of a trivial lexer (as you
point out on your comment above), and I think 

I use "coloring" as a specifically chosen metaphor, because lexers are the
backbone of syntax highlighting.  When your editor colors your code as you
type, its more than likely is using a simplified lexer of some sort, rather
than a full parser.  An editor can't use a full parser for many of the
reasons that it appears a full parser isn't very useful to version control,
plus the additional problem that as fast as most parsers are on today's
hardware they are still rarely suitable for real-time use as a coder types.
(This is becoming less and less of an issue, certainly: many modern IDEs
such as Visual Studio run several mini-parsers in the background at
near-real-time...)  I think the reasons that lexers are extremely suitable
for syntax highlighting have some merit to usage in version control and I
have a hunch that if someone were to build a smarter patch system (for

The disadvantages of using a lexer rather than a parser is that you lose a
lot of the "smarts" that you might use to your advantage.  One further
disadvantage of using a generic lexer engine is that if there are a
multitude of "standards" for parsing, there are even more weird, wild
variations on defining a lexer.  However, unlike parsing there are well
known repositories of lexers (thanks to syntax highlighting) such as
Vim/Emacs script repositories.  Also, unlike much of modern parsing, it is
somewhat easy to convert from one lexer representation to another.  Most
lexers are simply collections of regular expressions with some sort of
simple glue logic or finite state machine.

Among other things, some of the benefits: 

* Lexers are fast.
* Lexers preserve the formatting of documents.
* Lexers are dumb, but recover quickly from errors.  (Ever watched your
editor's lexers twist through error states as you type?)  Errors are
generally "just another token type", and rarely affect more than a few
lines; and most of them can be seen in the syntax highlighting of a person's
editor of choice.
* Today's line-based diffs can be seen as a trivial sub-case of a general
lexer-based engine.  (That is, theoretically, existing hunk patches could be
seen as specialized versions of a more general set of patch classes.)

If someone wanted to try playing with building a prototype of such a system
I would suggest with starting with Pygments (pygments.org), which is an
existing lexer engine designed for syntax highlighting.  It's written in
Python, has a strong API and selection of lexers for many common languages.
I don't know of any existing Diff algorithms for token streams, so one would
need to be written.  Beyond that, then there are the questions of what
commutation would look like (and ideally lexer patches would commute with
hunk patches)...

Well, that was a rather big info dump.  Hopefully there is some useful data
there, and I am curious to see someone experiment with the idea I've

--Max Battcher--

More information about the darcs-users mailing list