Iterator Matching into Variable Sized Slices in Rust
Nov 5, 2022, by Ryan Heywood
For the rewrite of my blog engine, I had a thought: a lot of work has gone into making programs print nice and pretty things into my terminal, but there's no good way for me to get that output represented on a website in a convenient format. I could take a screenshot, but that doesn't really feel clean to me. I could just take the raw output of the command and put that as text, but that isn't pretty at all. I want a solution where I can take a command, run it in my terminal, and have the output Just Show Up on my site.
How Terminals Display Colored Text
Note: If you only care about some funky Rust slice traversal, you can feel free to skip this section.
Terminals use a system called ANSI escape codes; in particular, they're using the "Select Graphic Rendition" code. This is going to be the primary point of this article. I will do my best to summarize it in a very basic fashion, but it's worth checking out the Wikipedia article and the console_codes(4) man page if you want to learn more. It's a very interesting topic.
To start working on this, let's find a program that we can tell to output color codes at any time. Given this post is going to be about Rust, I think cargo is the most appropriate example. We can also use the program xxd to inspect the data of the file and look at the hex representation of the color codes.
# Redirect stderr to stdout so we can capture the text to output.txt
|
Created binary (application) `umbrella` package
00000000: 1b5b 306d 1b5b 306d 1b5b 316d 1b5b 3332  .[0m.[0m.[1m.[32
00000010: 6d20 2020 2020 4372 6561 7465 641b 5b30  m     Created.[0
00000020: 6d20 6269 6e61 7279 2028 6170 706c 6963  m binary (applic
00000030: 6174 696f 6e29 2060 756d 6272 656c 6c61  ation) `umbrella
00000040: 6020 7061 636b 6167 650a                 ` package.
We can see that the first character to be included is a '\u{1b}' character, followed by a '['. The first character, hereafter referred to as the "escape" character, is present for almost every ANSI escape code. However, the second character is only present for what are called "Control Sequences"; together, the two characters form the "Control Sequence Introducer" (CSI). Most of the operations done on your terminal will be through CSI codes. This includes things like moving around your cursor, clearing the screen, and changing properties for text such as boldness, underline, and color.
The item after the CSI component, for the first element a "0", is a parameter for the CSI sequence. Parameters are usually a u8 but there is no reason a parameter couldn't be a u16 or something larger. For the purpose of formatting the output of a command, we only care about the 8-bit values. In this case, the value 0 actually represents a "reset". From there, we can see that it moves to a 1 (bold), a 32 (green text), and then the text "Created". It then resets the terminal to its standard settings and continues writing the rest of the output.
The last item in this sequence is the character 'm'. This defines the type of the CSI code to be a "Select Graphic Rendition" (SGR) code. Because the type of the CSI sequence is defined at the end of the parameter list, it is therefore required to parse an entire parameter list before determining what CSI sequence we're in. This has led to poor optimizations, where CSI sequences that only require a couple of parameters could be fed up to 16 (according to the console_codes(4) manual page).
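To make the layout concrete, here is a minimal sketch of recognising a single SGR sequence at the start of a byte slice. The function name `parse_sgr` is hypothetical and is not part of the ansi-parser crate; the sketch assumes every parameter fits in a u8 and treats an omitted parameter as 0.

```rust
/// Parse one SGR sequence (`ESC [ params m`) from the start of `input`.
/// Returns the parameters and the number of bytes consumed, or `None` if the
/// input does not start with a well-formed SGR sequence.
/// (Hypothetical helper for illustration only.)
fn parse_sgr(input: &[u8]) -> Option<(Vec<u8>, usize)> {
    // Every CSI sequence starts with ESC (0x1b) followed by '['.
    let rest = input.strip_prefix(b"\x1b[")?;
    // The final byte decides the sequence type, so we have to scan past the
    // whole parameter list before we know what we are looking at.
    let end = rest
        .iter()
        .position(|b| !(b.is_ascii_digit() || *b == b';'))?;
    if rest[end] != b'm' {
        return None; // some other CSI sequence, e.g. clearing the screen
    }
    let text = std::str::from_utf8(&rest[..end]).ok()?;
    let params = if text.is_empty() {
        vec![0] // an empty parameter list behaves like a reset
    } else {
        text.split(';')
            // an omitted parameter defaults to 0
            .map(|p| if p.is_empty() { Ok(0) } else { p.parse::<u8>() })
            .collect::<Result<Vec<u8>, _>>()
            .ok()?
    };
    // 2 bytes for the introducer, `end` bytes of parameters, 1 final byte.
    Some((params, 2 + end + 1))
}
```

Calling it on the start of the cargo output above would yield the single parameter 0 and a length of four bytes.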
This is not a good design, but it's a design that has existed for quite a long time and will probably exist for a lot longer. Parsing this sequence into a valid type is worth an article itself, but for now I will mention that I've forked the ansi-parser crate and will be using that later in the article. I can pass it a string and it will give me a list of either Text or AnsiSequence.
Moving to HTML
Console codes are interesting because they can be arbitrarily turned on or off without managing state between them. However, while some browsers could probably support this kind of shenanigan, for this use case we will try to write valid HTML. This means that we should collect all graphics settings at once, then write the block of text formatted using those graphics settings, then - while continuing to use those graphics settings - add or remove some additional settings and get ready to write the next chunk.
This can be easily done in practice by maintaining a GraphicsModeState and updating it when reaching an AnsiSequence, or outputting HTML tags for that state when reaching a Text.
use ;
pub
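As a rough illustration of that loop, the sketch below walks a list of chunks, updating the state on each sequence and wrapping each piece of text in tags for the current state. The `Chunk` enum is a hypothetical stand-in for the parser's output, the state only tracks boldness, and the two helper methods are placeholder versions; none of this is the ansi-parser API.

```rust
/// Hypothetical stand-in for the parser's output: either plain text or the
/// parameter list of one SGR sequence.
enum Chunk<'a> {
    Text(&'a str),
    Sgr(&'a [u8]),
}

/// Deliberately tiny state for this sketch: only boldness is tracked.
#[derive(Clone, Default)]
struct GraphicsModeState {
    bold: bool,
}

impl GraphicsModeState {
    /// Placeholder: returns an updated copy of the state.
    fn clone_from_scan(&self, params: &[u8]) -> GraphicsModeState {
        let mut next = self.clone();
        match params {
            [0] => next = GraphicsModeState::default(),
            [1] => next.bold = true,
            _ => {}
        }
        next
    }

    /// Placeholder: opening and closing tags for the current state.
    fn build_tags(&self) -> (String, String) {
        if self.bold {
            ("<b>".to_string(), "</b>".to_string())
        } else {
            (String::new(), String::new())
        }
    }
}

/// Update the state on every sequence; emit tagged text on every text chunk.
fn render(chunks: &[Chunk]) -> String {
    let mut state = GraphicsModeState::default();
    let mut html = String::new();
    for chunk in chunks {
        match chunk {
            Chunk::Sgr(params) => state = state.clone_from_scan(params),
            Chunk::Text(text) => {
                let (open, close) = state.build_tags();
                html.push_str(&open);
                html.push_str(text); // real output should be HTML-escaped first
                html.push_str(&close);
            }
        }
    }
    html
}
```

Because the tags are opened and closed around every text chunk, the output stays valid HTML no matter how the sequences are interleaved.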
We have the fundamental requirements of our software laid out, but we are missing a few functions:
GraphicsModeState::clone_from_scan(&self, &[u8]) -> GraphicsModeState
GraphicsModeState::build_tags(&self) -> (String, String)
We're going to tackle the less difficult of these first.
Generating HTML Tags
static COLORS: = ;
This assumes that you have a CSS stylesheet that defines the relevant variables for the terminal colors. This is left as an exercise for the reader.
The bulk of the function is relatively simple: we're generating HTML opening and closing tags for each possible option in the GraphicsModeState, then returning, first, the concatenated tags; second, the concatenated reversed list of tags. This way, they're closed in a "last tag created, first tag closed" fashion like HTML expects.
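A minimal sketch of that approach follows. The fields of the state, the contents of `COLORS`, and the `--term-*` CSS variable names are all assumptions made for illustration.

```rust
/// Assumed color names; each is expected to exist as a `--term-<name>`
/// CSS variable in the stylesheet.
static COLORS: [&str; 8] = [
    "black", "red", "green", "yellow", "blue", "magenta", "cyan", "white",
];

/// Hypothetical state: the real struct may track more options.
#[derive(Default)]
struct GraphicsModeState {
    bold: bool,
    italic: bool,
    foreground: Option<u8>, // index into COLORS
}

impl GraphicsModeState {
    fn build_tags(&self) -> (String, String) {
        // Collect one (opening, closing) pair per active option.
        let mut tags: Vec<(String, String)> = Vec::new();
        if self.bold {
            tags.push(("<b>".into(), "</b>".into()));
        }
        if self.italic {
            tags.push(("<i>".into(), "</i>".into()));
        }
        if let Some(name) = self.foreground.and_then(|i| COLORS.get(usize::from(i))) {
            tags.push((
                format!("<span style=\"color: var(--term-{name})\">"),
                "</span>".into(),
            ));
        }
        // Opening tags in creation order, closing tags in reverse order.
        let open: String = tags.iter().map(|(o, _)| o.as_str()).collect();
        let close: String = tags.iter().rev().map(|(_, c)| c.as_str()).collect();
        (open, close)
    }
}
```

For a bold, green state this produces the opening tags for bold and then the span, and closes the span before closing bold.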
The AnsiSequence::SetGraphicsMode Variant
Back when things were simpler, terminals only had access to a total of 32 colors, if you counted 8 normal colors for the foreground, 8 bright colors for the foreground, and 16 colors of a similar fashion for the background. In practice, this requires 17 codes: 30-37 were reserved for the foreground, 40-47 were reserved for the background, and 1 would sometimes be interpreted as "increase intensity". This means that for every possible operation that could be done to change the state, you would only need one parameter.
We could start implementing this now, but it would be futile, as eventually graphics cards would implement a 256-color lookup table. This is a lot more colors than could each be given their own code in a u8 parameter, therefore requiring the creation of a parameter (38 for the foreground, 48 for the background) that would designate the next parameter as the 8-bit color value. This now means that, if we wanted to simply match over any valid sequence, we would now need to match two values. However, in the interest of future-proofing the system, they also made the value require an additional parameter (5) to designate that it was from the 256-color set. At this point, we now have 3 values.
However, it gets worse: as 24-bit color computing rose, the demand for 24-bit colors in the terminal eventually grew as well. With 24 bits, and parameters only accepting 8 bits, we are now stuck with five parameters: one to designate a color sequence, one to designate that it's a 24-bit color, and three more to represent red, green, and blue.
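To show how the parameter count grows, here is a small sketch that builds the three foreground forms as escape sequences. The function names are hypothetical; the parameter values (30-37, 38;5;n, and 38;2;r;g;b) are the standard SGR ones.

```rust
/// One parameter: a basic foreground color, index 0..=7.
fn fg_basic(index: u8) -> String {
    format!("\x1b[{}m", 30 + u16::from(index.min(7)))
}

/// Three parameters: 38 (extended foreground), 5 (256-color set), the index.
fn fg_256(n: u8) -> String {
    format!("\x1b[38;5;{n}m")
}

/// Five parameters: 38 (extended foreground), 2 (24-bit color), then r, g, b.
fn fg_rgb(r: u8, g: u8, b: u8) -> String {
    format!("\x1b[38;2;{r};{g};{b}m")
}
```

Printing `fg_rgb(255, 128, 0)` followed by some text and `\x1b[0m` should show that text in orange on a terminal that supports 24-bit color.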
Pattern Matching Slices
Since cargo gives us one operation per CSI sequence, we can actually match over this pretty easily:
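A minimal sketch of what such a match could look like, using slice patterns on the parameter list. The `Color` enum and the fields of the state are assumptions for illustration; only the first operation in the slice is examined, and anything after it is ignored.

```rust
/// Hypothetical color representation.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Color {
    Indexed(u8),
    Rgb(u8, u8, u8),
}

/// Hypothetical state tracked between chunks of text.
#[derive(Clone, Debug, Default, PartialEq)]
struct GraphicsModeState {
    bold: bool,
    foreground: Option<Color>,
    background: Option<Color>,
}

impl GraphicsModeState {
    fn clone_from_scan(&self, input: &[u8]) -> GraphicsModeState {
        let mut next = self.clone();
        // Each arm looks at the leading parameters only; the trailing `..`
        // discards whatever follows the first operation.
        match input {
            [0, ..] => next = GraphicsModeState::default(),
            [1, ..] => next.bold = true,
            [n @ 30..=37, ..] => next.foreground = Some(Color::Indexed(*n - 30)),
            [n @ 40..=47, ..] => next.background = Some(Color::Indexed(*n - 40)),
            [38, 5, n, ..] => next.foreground = Some(Color::Indexed(*n)),
            [48, 5, n, ..] => next.background = Some(Color::Indexed(*n)),
            [38, 2, r, g, b, ..] => next.foreground = Some(Color::Rgb(*r, *g, *b)),
            [48, 2, r, g, b, ..] => next.background = Some(Color::Rgb(*r, *g, *b)),
            _ => {} // unknown or truncated: leave the state unchanged
        }
        next
    }
}
```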
We can actually run this code now and see that it does give a valid output:
% cargo add --git https://github.com/RyanSquared/ansi-parser-rs
% cargo add html_escape
% cargo run
Compiling umbrella v0.1.0 (/home/ryan/builds/enigma/projects/umbrella)
Finished dev [unoptimized + debuginfo] target(s) in 0.18s
Running `target/debug/umbrella`
Created binary (application) `umbrella` package
However, as previously mentioned, we must be able to take multiple sets of parameters. What happens if someone wants to reset the terminal, but also apply a color at the same time? Currently, we'll get a color reset, but that's it. This example command will run properly in our terminal:
|
HelloWorld
But if we run this through our program, at its current stage, we get this:
HelloWorld
Hello World
This is due to the fact we're only matching over one value in input. In this case, we are only detecting the reset parameter, not the parameter after it. Luckily, we can iterate through the slice, deciding to take extra parameters if needed by using the Iterator::next_chunk() method:
let mut iter = input.iter();
while let Some(n) = iter.next() { /* ... */ }
... Wait, what's that red squiggly line in my editor?
Checking umbrella v0.1.0 (/home/ryan/builds/enigma/projects/umbrella)
error[E0658]: use of unstable library feature 'iter_next_chunk': recently added
--> src/main.rs:80:46
|
80 | if let Ok([_, n]) = iter.next_chunk() {
| ^^^^^^^^^^
|
= note: see issue #98326 <https://github.com/rust-lang/rust/issues/98326> for more information
For more information about this error, try `rustc --explain E0658`.
error: could not compile `umbrella` due to previous error
Ah. I guess not.
That's fine. I'll just... Loop over the slice and increment it automagically using a macro over the match block, until next_chunk is stable. I hope that it comes sooner than the next CSI code comes out and I have to look at this code again.
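One way such a macro could be written on stable Rust is sketched below. This is a hypothetical reconstruction: the `iter_over!` name matches the macro discussed here, but the body, the `Color` enum, and the state fields are assumptions for illustration.

```rust
/// Match slice patterns against the front of a slice, advancing past
/// whatever each arm consumed. Unknown leading values are skipped one at a
/// time, and the loop ends when the slice is empty.
macro_rules! iter_over {
    ($input:expr; $( [$($pat:pat),+] => $body:expr ),+ $(,)?) => {{
        let mut remaining: &[u8] = $input;
        loop {
            match remaining {
                // Every arm gains a `rest @ ..` tail, so the slice shrinks
                // by exactly the number of parameters the arm matched.
                $( [$($pat),+, rest @ ..] => { $body; remaining = rest; } )+
                // Code we don't know about: skip a single parameter.
                [_, rest @ ..] => remaining = rest,
                [] => break,
            }
        }
    }};
}

#[derive(Clone, Copy, Debug, PartialEq)]
enum Color {
    Indexed(u8),
    Rgb(u8, u8, u8),
}

#[derive(Clone, Debug, Default, PartialEq)]
struct GraphicsModeState {
    bold: bool,
    foreground: Option<Color>,
}

fn clone_from_scan(state: &GraphicsModeState, input: &[u8]) -> GraphicsModeState {
    let mut next = state.clone();
    iter_over!(input;
        [0] => next = GraphicsModeState::default(),
        [1] => next.bold = true,
        [n @ 30..=37] => next.foreground = Some(Color::Indexed(*n - 30)),
        [38, 5, n] => next.foreground = Some(Color::Indexed(*n)),
        [38, 2, r, g, b] => next.foreground = Some(Color::Rgb(*r, *g, *b)),
    );
    next
}
```

Every arm removes at least one element from the slice, so the loop always terminates; a truncated sequence such as a lone 38 simply falls through to the skip arm.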
This code looks miraculously like the code I wrote earlier. By design :)
When expanded, the code looks like this:
loop
The significance of this code is that the input is automatically advanced. There is no possible case where an infinite loop happens, and there's always an exit condition. For future proofing, I've even added an option to skip over codes that I don't know about.
We can run this code again and see that it now correctly formats the output:
HelloWorld
Hello World
I hope that iter_over! isn't useful for that long, and that next_chunk() gets stabilized soon, but until then, I think it's a pretty nifty macro.
I have included the source of this example in the blog repository. If you'd like to run the example yourself, you can do so. It should be as simple as cargo run.