Technical writings of Shkrt
The StringScanner library is part of Ruby’s standard library, you probably heard about it if you ever had to write a text parser. Regular expressions are ok when we have to extract just a small parts of text according to patterns, but if we are in a need for a fully charged lexical parser, the StringScanner comes in handy.
The StringScanner operates with pointers and regular expressions, which makes easier to extract information from loosely structured or completely non-structured texts.
The real-world example can be found at the source code of Shrine library, or its data_uri
plugin, to be more precise.
This plugin deals with data:URL encoded files, i.e. the image file that is encoded as a base64 string. As you may know,
the data:URL encoded files have a certain structure:
data:[<MIME-type>][;charset=<encoding>][;base64],<data>
This means that the data:URL string begins with data:
, followed by content’s MIME-type, followed by encoding,
followed by ;base64
, and in the end, comes the base64-encoded data itself.
The data:URL encoded file’s string representation may look like this:
data:image/jpeg;base64,R0lGODdhMAAwAPAAAAAAAP///ywAAAAAMAAw
AAAC8IyPqcvt3wCcDkiLc7C0qwyGHhSWpjQu5yqmCYsapyuvUUlvONmOZtfzgFz
ByTB10QgxOR0TqBQejhRNzOfkVJ+5YiUqrXF5Y5lKh/DeuNcP5yLWGsEbtLiOSp
a/TPg7JpJHxyendzWTBfX0cxOnKPjgBzi4diinWGdkF8kjdfnycQZXZeYGejmJl
ZeGl9i2icVqaNVailT6F5iJ90m6mvuTS4OK05M0vDk0Q4XUtwvKOzrcd3iq9uis
F81M1OIcR7lEewwcLp7tuNNkM3uNna3F2JQFo97Vriy/Xl4/f1cf5VWzXyym7PH
hhx4dbgYKAAA7
This is the perfect ground for StringScanner. Let’s look more thoroughly at the Shrine code.
First, StringScanner is require
d:
Then the custom error is defined - this would be used when parser happens to find any invalid data:
Then, a number of constants are defined. This is a common practice for parsers.
The constant indicating that we have reached a data:
part of the string:
The constant that serves as a mime-type extractor:
Base64 marker:
Indicating the only comma that separates ;base64
part from the data string
These constants are defined accordingly to the data:URL format specification and each constant corresponds to the lexical part of the encoded string.
Next, the parsing happens in the private parse_data_uri
method.
First, initialize a new instance of StringScanner:
Then, parsing goes step by step over data:URL encoded string according to data:URL specification:
Finding data:
part, and raise error if the string does not contain it:
Find a mime-type marker and store its value in a variable:
Finding ;base64
part to make a step over it and store its value in a variable:
Finally, scanner reaches a data string (or raises a custom error, if there is not CONTENT_SEPARATOR
present, which is the comma):
And then the method returns a hash, which contains scanner.post_match
value, which in our case holds the encoded string.
This example is very simple and educative, as for me. Let’s go ahead to the next thing we can build using StringScanner. This will be a simple calculator that will evaluate arithmetic operations from a string. So, the following test should pass:
StringScanner should help us reveal the following entities from given string:
The first operand of the arithmetic operation. It can be an arbitrary number of digits.
The operation itself, its symbol should be among */-+
Second operand of the arithmetic operation. It can be also an arbitrary number of digits.
Spacings between operands - these are optional
First, we define a class with all constants.
And then we scan for each operand, spaces, and operation sequentially:
The operation definitions are pretty straightforward then:
I did not mention error management here because it’s as simple as in the first Shrine library example. I hope these two examples would be helpful to build your picture about the usefulness of StringScanner library.
Suggested reading:
[ruby
]