As an alternative to the computed gotos, you can use regular functions with the `[[musttail]]` attribute in Clang or GCC to achieve basically the same thing - the call in the tail position is replaced with a `jmp` instruction to the next function rather than to the label, and stack usage remains constant because the current frame is reused for the called function. `musttail` requires that the calling function and callee have the same signature, and a prototype.
You'd replace the JUMP_TARGET macro (currently a computed `goto`) with one that does a `musttail` return into the next rule function, then move the jump table out to the top level and replace each `&&label` with a plain `&function`. See diff (untested): https://www.diffchecker.com/V4yH3EyF/
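A minimal sketch of that shape (names are illustrative, not taken from the linked diff; assumes Clang 13+ or a GCC recent enough to accept `musttail` on return statements):

```c
#include <stddef.h>

typedef struct Lexer Lexer;
typedef void (*Rule)(Lexer *lx);

struct Lexer {
    const char *ip;   /* current input position */
    /* ... token output, line/column tracking ... */
};

static void rule_eof(Lexer *lx);
static void rule_ident(Lexer *lx);
static void rule_skip(Lexer *lx);

/* The jump table moves to file scope and holds ordinary function
 * pointers instead of &&label addresses. */
static const Rule RULES[256] = {
    [0]   = rule_eof,
    ['_'] = rule_ident, ['a'] = rule_ident, /* ... one entry per lead byte ... */
};

static void dispatch(Lexer *lx) {
    Rule r = RULES[(unsigned char)*lx->ip];
    if (!r) r = rule_skip;            /* unset entries fall back to a default */
    /* musttail forces this call to lower to a jmp; the frame is reused */
    __attribute__((musttail)) return r(lx);
}

static void rule_ident(Lexer *lx) {
    while ((*lx->ip >= 'a' && *lx->ip <= 'z') || *lx->ip == '_')
        lx->ip++;                     /* consume the lexeme, emit a token */
    __attribute__((musttail)) return dispatch(lx);
}

static void rule_skip(Lexer *lx) {
    lx->ip++;                         /* unknown byte: skip it */
    __attribute__((musttail)) return dispatch(lx);
}

static void rule_eof(Lexer *lx) {
    (void)lx;                         /* emit EOF token; no tail call, we stop */
}
```

A driver would just set up the `Lexer` and call `dispatch()` once; every `return` thereafter is a `jmp`, so the stack never grows no matter how long the input is.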
This approach has the advantage that it will work everywhere, not only on compilers that support computed gotos - it just won't be optimized into tail calls on compilers that don't support `musttail`. (Though it has been proposed to standardize it in a future version of C.)
It might also work better with code navigation tools that show functions, but not labels, and enables modularity as we can split rules over multiple translation units.
Performance-wise it should be basically the same - though it's been argued that it may do better in some cases, because the compiler's register allocator doesn't do a great job in large functions with computed gotos, whereas with the musttail approach each function is a smaller unit that gets optimized separately.
Can't wait for mandatory TCO coming to Rust. But it's not there yet. https://github.com/phi-go/rfcs/blob/guaranteed-tco/text/0000...
I recommend taking a look at the ClickHouse SQL Lexer:
https://github.com/ClickHouse/ClickHouse/blob/master/src/Par...
https://github.com/ClickHouse/ClickHouse/blob/master/src/Par...
It supports SIMD for accelerated character matching, it does not do any allocations, and it is very small (compiles to a few KB of WASM code).
I like to have my lexers operate on `FILE*`, rather than string-views. This has some real-world performance implications (not good ones); but, it does mean I can operate on streams. If the user has a c-string, the string can be easily wrapped by `funopen()` or `fopencookie()` to provide a `FILE*` adapter layer. (Most of my lexers include one of these, out-of-the-box.)
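For reference, a stripped-down sketch of such an adapter on glibc (the helper names are made up; POSIX `fmemopen()` does the same job more portably, and the resulting stream here is read-only and non-seekable):

```c
#define _GNU_SOURCE           /* for fopencookie() */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>

/* The string is borrowed, not copied: the caller must keep it alive. */
typedef struct { const char *s; size_t pos, len; } StrCookie;

static ssize_t str_read(void *cookie, char *buf, size_t size) {
    StrCookie *c = cookie;
    size_t left = c->len - c->pos;
    size_t n = size < left ? size : left;
    memcpy(buf, c->s + c->pos, n);
    c->pos += n;
    return (ssize_t)n;        /* 0 signals EOF to stdio */
}

static int str_close(void *cookie) { free(cookie); return 0; }

static FILE *str_to_file(const char *s) {
    StrCookie *c = malloc(sizeof *c);
    if (!c) return NULL;
    *c = (StrCookie){ .s = s, .pos = 0, .len = strlen(s) };
    cookie_io_functions_t io = { .read = str_read, .close = str_close };
    return fopencookie(c, "r", io);
}
```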
Everything else, I stole from Bob Nystrom: I keep a local copy of the token's string in the token, aka, `char word[64]`. I try to minimize "decision making" during lexing. Really, at the consumption point we're only interested in an extremely small number of things: (1) does the lexeme start with a letter or a number?; (2) is it whitespace, and is that whitespace a new line?; or, (3) does it look like an operator?
The only place where I've ever considered goto-threading was in keyword identification. However, if your language keeps keywords to ≤ 8 bytes, you can just bake the keywords into `uint64_t`'s and compare against those values. You can do a crapload of 64b compares/ns.
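A sketch of that trick (helper names are mine, not from the parent's lexers): zero-pad the lexeme into a `uint64_t` and compare against the keyword packed the same way.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Zero-pad a lexeme of up to 8 bytes into a uint64_t, so "if" and "ifx"
 * pack to different values. */
static uint64_t pack8(const char *s, size_t len) {
    uint64_t v = 0;
    memcpy(&v, s, len < 8 ? len : 8);
    return v;
}

static bool is_keyword_return(const char *lexeme, size_t len) {
    /* one 64-bit compare instead of a strcmp */
    return len == 6 && pack8(lexeme, len) == pack8("return", 6);
}
```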
The next level up (parsing) is slow enough to eat & memoize the decision making of the lexer; and, materially, it doesn't complicate the parser. (In fact: there's a lot of decision making that happens in the parser that'd have to be replicated in the lexer, otherwise.)
The result, overall, is you can have a pretty general-purpose lexer that you can reuse for any old C-ish language, and tune to your heart's content, without needing a custom rewrite each time.
Have you considered making your lexer operate in push mode instead?
This does mean you have to worry about partial tokens ... but if you limit yourself to feeding full lines that mostly goes away.
Besides, for reasonable-size workloads, "read the whole file ahead of time" is usually a win. The only time it's tempting not to do so is for REPLs.
I agree. But, I also like the discipline of lexing from `FILE*`. I've ended up with cleaner separation of concerns throughout the front-end stack, because I can't dip back into the well, unless I'm thinking very clearly about that operation. For instance, I keep around coordinates of things, rather than pointers, etc.
The tragic thing is that you can't do `fgetwc()` on a `FILE *` produced by `fopencookie()` on Linux. glibc will crash your program deliberately as soon as there is a non-ASCII char in that stream (because, reasons?). But it does work with `funopen()` on a BSD, like macOS. I'm using that to read wide characters from UTF-8 streams.
Wide characters are best avoided even on platforms where they don't mean UTF-16. It's better to stay in UTF-8 mode and only verify that it's well-formed.
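A minimal sketch of such a well-formedness check (structural only; a real validator would also reject overlong encodings, surrogates, and code points above U+10FFFF):

```c
#include <stdbool.h>
#include <stddef.h>

static bool utf8_structurally_valid(const unsigned char *s, size_t n) {
    for (size_t i = 0; i < n; ) {
        unsigned char b = s[i];
        size_t len;
        if      (b < 0x80)           len = 1;   /* ASCII */
        else if ((b & 0xE0) == 0xC0) len = 2;
        else if ((b & 0xF0) == 0xE0) len = 3;
        else if ((b & 0xF8) == 0xF0) len = 4;
        else return false;                      /* invalid lead byte */
        if (i + len > n) return false;          /* truncated sequence */
        for (size_t k = 1; k < len; k++)
            if ((s[i + k] & 0xC0) != 0x80)      /* must be a continuation byte */
                return false;
        i += len;
    }
    return true;
}
```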
But at some point you'll want to know whether that code point you read is `iswalpha()` or whatever, so you'll have to decode UTF-8 anyway.
At the parser-level, though; not down in the lexer. I intern unique user-defined strings (just with a hashcons or whatever the cool kids call it, these days). That defers the determination of correctness of UTF-kness to "someone else".
Unfortunately, operating a byte at a time means there's a hard limit on performance.
A truly performant lexer needs to jump ahead as far as possible. This likely involves SIMD (or SWAR) since unfortunately the C library fails to provide most of the important interfaces.
As an example that the C library can handle tolerably, while lexing a string, you should repeatedly call `strcspn(input, "\"\\\n")` to skip over chunks of ordinary characters, then only special-case the quote, backslash, newline and (implicit!) NUL after each jump. Be sure to correctly distinguish between an embedded NUL and the one you probably append to represent EOF (or, if streaming [which requires quite a bit more logic], end of current chunk).
Unfortunately, there's a decent chance your implementation of `strcspn` doesn't optimize for the possibility of small sets, and instead constructs a full 256-bit bitset. And even if it does, this strategy won't work for larger sets such as "all characters in an identifier" (you'd actually use `strspn` since this is positive), for which you'll want to take advantage of the characters being adjacent.
Edit: yikes, is this using a hash without checking for collisions?!?
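A sketch of the `strcspn` loop described above (function name is illustrative; assumes the buffer is NUL-terminated as the EOF sentinel):

```c
#include <string.h>

/* Returns a pointer just past the closing quote, or to the offending
 * byte ('\n' or the NUL sentinel) if the string literal is unterminated. */
static const char *lex_string_body(const char *p) {
    for (;;) {
        p += strcspn(p, "\"\\\n");       /* skip a chunk of ordinary chars */
        switch (*p) {
        case '"':  return p + 1;         /* closing quote */
        case '\\':                       /* escape: consume the next byte too */
            p += (p[1] != '\0') ? 2 : 1;
            break;
        default:                         /* '\n' or NUL */
            return p;                    /* unterminated string / end of input */
        }
    }
}
```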
You can go pretty far processing one byte at a time in hardware. You just keep making the pipeline deeper and pushing the frequency. And then to combat dependent parsing you add speculative execution to avoid bubbles.
Eventually you land on recreating the modern cpu.
You can get pretty far with a branch per byte, as long as the bulk of the work is done w/ SIMD (like character classification). But yeah, LUT lookup per byte is not recommended.
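For example, a 16-bytes-at-a-time classifier in plain SSE2 intrinsics (a sketch; the helper name and the "ASCII letter" class are illustrative, and it assumes the input buffer is padded so a 16-byte load past the end is safe):

```c
#include <emmintrin.h>   /* SSE2 */
#include <stdint.h>

/* Returns a 16-bit mask: bit i is set if p[i] is an ASCII letter.
 * The caller branches on the mask (e.g. via count-trailing-zeros)
 * instead of branching once per byte. */
static uint32_t ascii_letter_mask16(const char *p) {
    __m128i v     = _mm_loadu_si128((const __m128i *)p);
    __m128i lower = _mm_or_si128(v, _mm_set1_epi8(0x20));   /* fold 'A'-'Z' onto 'a'-'z' */
    __m128i ge_a  = _mm_cmpgt_epi8(lower, _mm_set1_epi8('a' - 1));
    __m128i le_z  = _mm_cmplt_epi8(lower, _mm_set1_epi8('z' + 1));
    return (uint32_t)_mm_movemask_epi8(_mm_and_si128(ge_a, le_z));
}
```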
You are somewhat right, I used tagging masks to differentiate between different types of atoms [1]. But yes, interning will be backed by a correct implementation of a hashmap with some collision handling in the future.
[1]: https://github.com/xNaCly/purple-garden/blob/master/cc.c#L76...
Lexing is almost never a bottleneck. I'd much rather see a "Strategies for Readable Lexers".
Lexing being the major performance bottleneck in a compiler is a great problem to have.
Is lexing ever a bottleneck though? Even if you push for lexing and parsing 10M lines/second [1], I'd argue that semantic analysis and codegen (for AOT-compiled languages) will dominate the timings.
That said, there's no reason not to squeeze every bit of performance out of it!
[1]: In this talk about the Carbon language, Chandler Carruth shows and explains some goals/challenges regarding performance: https://youtu.be/ZI198eFghJk?t=1462
A simple hello world in C++ can pull in dozens of megabytes of header files.
Years back I worked at a C++ shop with a big codebase (hundreds of millions of LOC when you included vendored dependencies). Compile times there were sometimes dominated by parsing speed! Now, I don't remember the exact breakdown of lexing vs parsing, but I did look at it under a profiler.
It's very easy in C++ projects to structure your code such that you inadvertently cause hundreds of megabytes of sources to be parsed by each single #include. In such a case, lexing and parsing costs can dominate build times. Precompiled headers help, but not enough...
It depends a lot on the language.
For a statically typed language, it's very unlikely that the lexer shows up as a bottleneck. Compilation time will likely be dominated by semantic analysis, type checking, and code generation.
For a dynamically typed language where there isn't as much for the compiler to do, then the lexer might be a more noticeable chunk of compile times. As one of the V8 folks pointed out to me years ago, the lexer is the only part of the compiler that has to operate on every single individual byte of input. Everything else gets the luxury of greater granularity, so the lexer can be worth optimizing.
Cool exercise and thank you for the blog post.
I did a similar thing (for fun) for the tokenizer of a Swift-derivative language, written in C++.
My approach was however very different from yours:
- No macro, no ASM, just explicit vectorization using std.simd
- No hand rolled allocator. Just std::vector and SOA.
- No hashing for keyword. They are short. A single SIMD load / compare is often enough for a comparison
- All the lookup tables are compile time generated from the token list using constexpr to keep the code small and maintainable.
I was able to reach around 8 Mloc/s on server grade hardware, single core.
Byte at a time means not-fast but I suppose it's all relative. The benchmarks would benefit from a re2c version, I'd expect that to beat the computed goto one. Easier for the compiler to deal with, mostly.
Is lexing really ever the bottleneck? Why focus effort here?
This is fun and all, but I wonder what’s the largest program that’s ever been written in this new language (purple garden)? Seems like it will be a while before the optimizations pay off.
I haven't written a lot, but it needs to be fast so I can stay motivated to write more by iterating quickly.
simdjson is another project to look at for ideas.
I found it quite tricky to apply its ideas to the more general syntax of a programming language, but with a bunch of hacking and a few subtle changes to the language itself, the performance difference over one-character-at-a-time was quite substantial (about 6-10x).
Do you have benchmarks that show the hand rolled jump table has a significant impact?
The only reason this raises an eyebrow is that I've seen conflicting anec/data on this, depending pretty hard on target microarchitecture and the program itself.
Sadly that was one of the things I did benchmark but didn't write down the results for. I've read a lot that a naive switch is faster because the compiler knows how to optimise it better, but for my architecture and benchmarks the computed gotos were faster.
I really appreciate the commitment to bench-marking in this one. The memoization speedup for number processing was particularly surprising.
Are you referring to the part where he said "crazy 15ms/64% faster" ?
Wait do modern compilers not use jump tables for large switch statements?
Some definitely do.
The jump table is interesting, although I'd guess the performance of a switch would be similar if it's properly optimized by the compiler - I couldn't tell without trying. Also, different compilers might take different approaches.
A few months ago I built a toy boolean expression parser as a weekend project. The main goal was simple: evaluate an expression and return true or false. It supported basic types like int, float, string, arrays, variables, and even custom operators.
The syntax and grammar were intentionally kept simple. I wanted the whole implementation to be self-contained and compact, something that could live in just a .h and .cc file. Single pass for lexing, parsing, and evaluation.
Once the first version was functional, I challenged myself to optimize it for speed. Here are some of the performance-related tricks I remember using:
- No string allocations: used the input *str directly, relying on pointer manipulation instead of allocating memory for substrings.
- Stateful parsing: maintained a parsing state structure passed by reference to avoid unnecessary copies or allocations.
- Minimized allocations: tried to avoid heap allocations wherever possible. Some were unavoidable during evaluation, but I kept them to a minimum.
- Branch prediction-friendly design: used lookup tables to assist with token identification (mapping the first character to token type and validating identifier characters); see the sketch after this comment.
- Inline literal parsing: converted integer and float literals to their native values directly during lexing instead of deferring conversion to a later phase.

I think all the tricks are mentioned in the article already.

For what it's worth, here is the project: https://github.com/pausan/tinyrulechecker
I used this expression to assess the performance on an Intel(R) Core(TM) i7-8565U CPU @ 1.80GHz (launched Q3 2018):

`myfloat.eq(1.9999999) || myint.eq(32)`

I know it is a simple expression, and a larger expression would likely perform worse due to variable lookups and so on. I could get a speed of 287MB/s, or 142ns per evaluation (7M evaluations per second). I was pleasantly surprised to reach those speeds given that one evaluation is a full cycle of lexing, parsing, and evaluating the expression.

The next step I considered was using SIMD for tokenizing, but I'm not sure it would have helped much with the overall evaluation times; I seem to recall most of the time was spent in the parser or evaluation phases anyway, not the lexer.
It was a fun project.
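For the lookup-table point in the list above, here's roughly what such a first-character table can look like (the enum and table contents are illustrative, not taken from tinyrulechecker):

```c
#include <stdint.h>

enum TokenClass { T_OTHER, T_IDENT, T_NUMBER, T_STRING, T_SPACE };

/* Maps the first byte of a lexeme to a coarse token class, so the lexer
 * does one table load and one switch instead of a chain of comparisons. */
static uint8_t FIRST_CHAR_CLASS[256];

static void init_first_char_class(void) {
    for (int c = 'a'; c <= 'z'; c++) FIRST_CHAR_CLASS[c] = T_IDENT;
    for (int c = 'A'; c <= 'Z'; c++) FIRST_CHAR_CLASS[c] = T_IDENT;
    for (int c = '0'; c <= '9'; c++) FIRST_CHAR_CLASS[c] = T_NUMBER;
    FIRST_CHAR_CLASS['_']  = T_IDENT;
    FIRST_CHAR_CLASS['"']  = T_STRING;
    FIRST_CHAR_CLASS[' ']  = T_SPACE;
    FIRST_CHAR_CLASS['\t'] = T_SPACE;
    FIRST_CHAR_CLASS['\n'] = T_SPACE;
    /* everything else stays T_OTHER and is handled as operator/unknown */
}
```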
... written in C.
Not sure how many of these translate to other languages.
Most of them. Jump tables work in Rust, and mmapping too. Deferred numeric parsing, keeping allocations to a minimum, string slices, interning, and inline hashing all work in Rust, Go, C, C++; you name it.