Technical writings of Shkrt
The StringScanner library is part of Ruby’s standard library; you have probably heard of it if you have ever had to write a text parser. Regular expressions are fine when we only need to extract small parts of a text according to patterns, but when we need a full-fledged lexical parser, StringScanner comes in handy.
StringScanner works with a scan pointer and regular expressions, which makes it easier to extract information from loosely structured or completely unstructured text.
A real-world example can be found in the source code of the Shrine library, or, to be more precise, in its data_uri plugin.
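To make the pointer idea concrete, here is a minimal sketch. The input string and variable names are made up for illustration; the methods used (scan, skip, pos, eos?) are part of StringScanner itself.

```ruby
require "strscan"

scanner = StringScanner.new("key = value")

key = scanner.scan(/\w+/)      # "key"; the pointer moves past the match
pos_after_key = scanner.pos    # 3
miss = scanner.scan(/\d+/)     # nil; no digits at the pointer, which stays at 3
scanner.skip(/\s*=\s*/)        # consume " = " without returning the text
value = scanner.scan(/\w+/)    # "value"
done = scanner.eos?            # true; the whole string has been consumed
```

Each scan only matches at the current pointer position, which is what lets a parser walk through a string one token at a time.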
This plugin deals with data:URL encoded files, e.g. an image file encoded as a base64 string. As you may know, data:URL encoded files have a certain structure:
data:[<MIME-type>][;charset=<encoding>][;base64],<data>
This means that the data:URL string begins with data:, followed by the content’s MIME type, then the optional encoding, then the optional ;base64 marker, and finally, after a comma, the encoded data itself.
The data:URL encoded file’s string representation may look like this:
data:image/jpeg;base64,R0lGODdhMAAwAPAAAAAAAP///ywAAAAAMAAw
AAAC8IyPqcvt3wCcDkiLc7C0qwyGHhSWpjQu5yqmCYsapyuvUUlvONmOZtfzgFz
ByTB10QgxOR0TqBQejhRNzOfkVJ+5YiUqrXF5Y5lKh/DeuNcP5yLWGsEbtLiOSp
a/TPg7JpJHxyendzWTBfX0cxOnKPjgBzi4diinWGdkF8kjdfnycQZXZeYGejmJl
ZeGl9i2icVqaNVailT6F5iJ90m6mvuTS4OK05M0vDk0Q4XUtwvKOzrcd3iq9uis
F81M1OIcR7lEewwcLp7tuNNkM3uNna3F2JQFo97Vriy/Xl4/f1cf5VWzXyym7PH
hhx4dbgYKAAA7
This is a perfect fit for StringScanner. Let’s look more closely at the Shrine code.
First, StringScanner is required:
require "strscan"
Then a custom error is defined; it is raised when the parser encounters invalid data:
class ParseError < Error; end
Then, a number of constants are defined. This is a common practice for parsers.
The constant that matches the data: prefix of the string:
DATA_REGEXP = /data:/
The constant that serves as a mime-type extractor:
MEDIA_TYPE_REGEXP = /[-\w.+]+\/[-\w.+]+(;[-\w.+]+=[^;,]+)*/
Base64 marker:
BASE64_REGEXP = /;base64/
The constant for the single comma that separates the ;base64 part from the data string:
CONTENT_SEPARATOR = /,/
These constants are defined according to the data:URL format specification, and each one corresponds to a lexical part of the encoded string.
Next, the parsing happens in the private parse_data_uri method.
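The media type pattern is the least obvious of the four, so here is a small sketch of what it accepts. The sample inputs are invented for illustration; only the regular expression is taken from the listing above.

```ruby
require "strscan"

MEDIA_TYPE_REGEXP = /[-\w.+]+\/[-\w.+]+(;[-\w.+]+=[^;,]+)*/

# A bare MIME type: the match stops before ";base64", which has no "=value" part.
plain = StringScanner.new("image/jpeg;base64,AAAA").scan(MEDIA_TYPE_REGEXP)
# plain is "image/jpeg"

# A MIME type with a parameter: ";charset=utf-8" is included in the match.
with_param = StringScanner.new("text/plain;charset=utf-8;base64,AAAA").scan(MEDIA_TYPE_REGEXP)
# with_param is "text/plain;charset=utf-8"

# scan only matches at the pointer, so a string that starts elsewhere yields nil.
anchored = StringScanner.new(";base64,AAAA").scan(MEDIA_TYPE_REGEXP)
# anchored is nil
```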
First, initialize a new instance of StringScanner:
scanner = StringScanner.new(uri)
Then parsing proceeds step by step over the data:URL encoded string, following the data:URL specification:
Find the data: part, and raise an error if the string does not start with it:
scanner.scan(DATA_REGEXP) or raise ParseError, "data URI has invalid format"
Find the MIME type and store its value in a variable:
media_type = scanner.scan(MEDIA_TYPE_REGEXP)
Find the ;base64 part, step over it, and store its value in a variable:
base64 = scanner.scan(BASE64_REGEXP)
Finally, the scanner reaches the data string (or raises the custom error if CONTENT_SEPARATOR, the comma, is not present):
scanner.scan(CONTENT_SEPARATOR) or raise ParseError, "data URI has invalid format"
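The method then returns a hash whose data key holds scanner.post_match, which in our case is the encoded string. Note that the hash refers to content_type, while the walkthrough above only assigns media_type; the step that derives one from the other is not shown here.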
{
  content_type: content_type,
  base64: !!base64,
  data: scanner.post_match,
}
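Put together, the steps above can be tried out as a standalone sketch. This is not Shrine’s actual code: the module name DataURISketch is hypothetical, ParseError inherits from StandardError here because Shrine’s own Error class is not available outside the gem, and media_type is passed through as content_type unchanged, which is an assumption made to keep the example self-contained.

```ruby
require "strscan"

# Hypothetical wrapper module for illustration; not part of Shrine.
module DataURISketch
  # Assumption: StandardError stands in for Shrine's own Error class.
  class ParseError < StandardError; end

  DATA_REGEXP       = /data:/
  MEDIA_TYPE_REGEXP = /[-\w.+]+\/[-\w.+]+(;[-\w.+]+=[^;,]+)*/
  BASE64_REGEXP     = /;base64/
  CONTENT_SEPARATOR = /,/

  def self.parse_data_uri(uri)
    scanner = StringScanner.new(uri)
    scanner.scan(DATA_REGEXP) or raise ParseError, "data URI has invalid format"
    media_type = scanner.scan(MEDIA_TYPE_REGEXP) # nil when the MIME type is omitted
    base64     = scanner.scan(BASE64_REGEXP)     # nil when the marker is absent
    scanner.scan(CONTENT_SEPARATOR) or raise ParseError, "data URI has invalid format"

    {
      content_type: media_type, # assumption: passed through unchanged in this sketch
      base64: !!base64,
      data: scanner.post_match, # everything after the comma
    }
  end
end

result = DataURISketch.parse_data_uri("data:image/png;base64,iVBORw0KGgo=")
# result[:content_type] is "image/png", result[:base64] is true,
# and result[:data] is "iVBORw0KGgo="
```

A string that does not start with data:, or that has no comma after the header, raises DataURISketch::ParseError.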
I find this example simple and instructive. Let’s move on to the next thing we can build with StringScanner: a simple calculator that evaluates an arithmetic operation given as a string. The following tests should pass:
require 'rspec'

describe Calc do
  it 'performs addition' do
    string = "5+4"
    expect(Calc.call(string)).to eq(9)
  end

  it 'performs subtraction' do
    string = "5-4"
    expect(Calc.call(string)).to eq(1)
  end

  it 'performs multiplication' do
    string = "5*4"
    expect(Calc.call(string)).to eq(20)
  end

  it 'performs division' do
    string = "5/4"
    expect(Calc.call(string)).to eq(1)
  end

  it 'maintains spaces' do
    string = "5 + 50"
    expect(Calc.call(string)).to eq(55)
  end
end
StringScanner should help us extract the following entities from the given string:
The first operand of the arithmetic operation. It can be any number of digits.
The operation itself. Its symbol must be one of * / - +
The second operand of the arithmetic operation. It can also be any number of digits.
The spaces between the operands and the operation. These are optional.
First, we define a class with all constants.
class Calc
  OPERATIONS = { "+" => :add, "-" => :sub, "*" => :mul, "/" => :div }
  SPACE = /\s+/
  DIGITS = /\d+/
  OPERATION_SYMS = /[-+\/*]/
end
And then we scan for each operand, spaces, and operation sequentially:
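require "strscan"

class Calc
  class << self
    def call(str)
      scanner = StringScanner.new(str)
      # Base 10 is given explicitly so that "010" is not read as octal.
      first_operand = Integer(scanner.scan(DIGITS), 10)
      scanner.scan(SPACE) # optional spaces; a nil result is ignored
      operation = OPERATIONS[scanner.scan(OPERATION_SYMS)]
      scanner.scan(SPACE)
      second_operand = Integer(scanner.scan(DIGITS), 10)
      # Dispatch to the private add/sub/mul/div method.
      send(operation, first_operand, second_operand)
    end
  end
end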
The operation definitions, placed inside the same class << self block, are then pretty straightforward:
private

def add(first, second)
  first + second
end

def sub(first, second)
  first - second
end

def mul(first, second)
  first * second
end

def div(first, second)
  first / second
end
I did not cover error handling here because it is as simple as in the first example from the Shrine library. I hope these two examples help you form a picture of how useful the StringScanner library is.
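For completeness, here is one way the error handling could look, following the same "scan or raise" pattern as the Shrine example. It is only a sketch: CalcSketch, its ParseError class and the expect helper are hypothetical names, and the operations are mapped directly to Integer operators instead of the add/sub/mul/div methods, purely to keep the listing short.

```ruby
require "strscan"

# Hypothetical variant of Calc with error handling; names are illustrative.
class CalcSketch
  class ParseError < StandardError; end

  OPERATIONS     = { "+" => :+, "-" => :-, "*" => :*, "/" => :/ }.freeze
  SPACE          = /\s+/
  DIGITS         = /\d+/
  OPERATION_SYMS = /[-+\/*]/

  def self.call(str)
    scanner = StringScanner.new(str)
    scanner.skip(SPACE)
    left = expect(scanner, DIGITS, "a number")
    scanner.skip(SPACE)
    operator = expect(scanner, OPERATION_SYMS, "an operator")
    scanner.skip(SPACE)
    right = expect(scanner, DIGITS, "a number")
    scanner.skip(SPACE)
    # Anything left over after the second operand is treated as an error.
    raise ParseError, "unexpected input at position #{scanner.pos}" unless scanner.eos?

    Integer(left, 10).public_send(OPERATIONS.fetch(operator), Integer(right, 10))
  end

  # Returns the matched text, or raises when the pattern is not at the pointer.
  def self.expect(scanner, pattern, description)
    scanner.scan(pattern) or
      raise ParseError, "expected #{description} at position #{scanner.pos}"
  end
  private_class_method :expect
end

sum = CalcSketch.call("5 + 50")
# sum is 55

begin
  CalcSketch.call("5 +")
rescue CalcSketch::ParseError => e
  error_message = e.message # describes what was expected and where
end
```

Division by zero is left to Ruby’s own ZeroDivisionError in this sketch.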