Technical writings of Shkrt



Ruby Standard library: StringScanner

The StringScanner library is part of Ruby’s standard library; you have probably heard of it if you have ever had to write a text parser. Regular expressions are fine when we only need to extract a few small pieces of text that match a pattern, but when we need a full-fledged lexical parser, StringScanner comes in handy.

StringScanner works with a scan pointer and regular expressions, which makes it easier to extract information from loosely structured or completely unstructured text.
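To make the pointer idea concrete, here is a minimal sketch on a made-up input string. It relies only on scan, skip, pos and eos? from the standard library; the variable names are chosen just for this illustration.

```ruby
require "strscan"

scanner = StringScanner.new("3 apples")

# scan tries the pattern at the current pointer only and advances on success.
number      = scanner.scan(/\d+/)   # "3"
pos_after   = scanner.pos           # 1

# The next character is a space, so scanning for digits fails:
# scan returns nil and the pointer does not move.
miss        = scanner.scan(/\d+/)   # nil
pos_on_miss = scanner.pos           # 1

# skip advances like scan but returns the matched length instead of the text.
scanner.skip(/\s+/)
word     = scanner.scan(/\w+/)      # "apples"
finished = scanner.eos?             # true, the whole string was consumed
```

The important detail is that a failed scan leaves the pointer where it was, so a parser can try another pattern at the same position.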

A real-world example can be found in the source code of the Shrine library, or, more precisely, in its data_uri plugin. This plugin deals with files encoded as data URLs, e.g. an image file encoded as a base64 string. As you may know, data URLs have a fixed structure:

data:[<media type>][;charset=<character set>][;base64],<data>

This means that a data URL begins with data:, followed by the content’s MIME type, an optional character encoding, an optional ;base64 marker, a comma, and finally the data itself (base64-encoded when the marker is present). A data URL for a short text payload, for instance, may look like this:

data:text/plain;base64,SGVsbG8sIFdvcmxkIQ==

This is a perfect fit for StringScanner. Let’s take a closer look at the Shrine code.

First, StringScanner is required:

require "strscan"

Then a custom error class is defined; it is raised when the parser encounters invalid data:

class ParseError < Error; end

Then, a number of constants are defined. This is a common practice for parsers.

The constant that matches the data: prefix of the string:

DATA_REGEXP = /data:/

The constant that serves as a mime-type extractor:

MEDIA_TYPE_REGEXP = /[-\w.+]+\/[-\w.+]+(;[-\w.+]+=[^;,]+)*/
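A short sketch of what this pattern consumes may help. The two input strings below are made up for illustration; the regular expression is the one shown above. It takes the type/subtype pair plus any name=value parameters, and stops before ;base64 because that marker has no = sign.

```ruby
require "strscan"

MEDIA_TYPE_REGEXP = /[-\w.+]+\/[-\w.+]+(;[-\w.+]+=[^;,]+)*/

# Media type without parameters.
plain      = StringScanner.new("image/png;base64,AAAA")
plain_type = plain.scan(MEDIA_TYPE_REGEXP)      # "image/png"
plain_rest = plain.rest                         # ";base64,AAAA"

# Media type with a charset parameter.
with_param = StringScanner.new("text/plain;charset=utf-8;base64,AAAA")
param_type = with_param.scan(MEDIA_TYPE_REGEXP) # "text/plain;charset=utf-8"
param_rest = with_param.rest                    # ";base64,AAAA"
```

In both cases the pointer ends up right in front of the ;base64 marker, which is what the next step of the parser expects.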

Base64 marker:

BASE64_REGEXP = /;base64/

The constant that matches the single comma separating the ;base64 part from the data string:

CONTENT_SEPARATOR = /,/

These constants follow the data URL format specification, and each one corresponds to a lexical part of the encoded string.

Next, the parsing happens in the private parse_data_uri method.

First, a new instance of StringScanner is initialized with the string to parse (named data_uri here):

scanner = StringScanner.new(data_uri)

Then the parser walks through the data URL step by step, following the specification.

Find the data: part, and raise an error if the string does not start with it:

scanner.scan(DATA_REGEXP) or raise ParseError, "data URI has invalid format"

Scan the media type and store its value in a variable:

media_type = scanner.scan(MEDIA_TYPE_REGEXP)

Scan the ;base64 marker to step over it, and store the result in a variable (nil if the marker is absent):

base64 = scanner.scan(BASE64_REGEXP)

Finally, the scanner consumes the comma that precedes the data string (or raises the custom error if CONTENT_SEPARATOR is not found at this position):

scanner.scan(CONTENT_SEPARATOR) or raise ParseError, "data URI has invalid format"

And then the method returns a hash. It includes scanner.post_match, which in our case holds the encoded data that follows the comma. The content_type value is derived from media_type in a step not shown here.

{
  content_type: content_type,
  base64:       !!base64,
  data:         scanner.post_match,
}
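Putting the steps together, the whole flow can be sketched as a standalone script. This is an approximation for experimenting, not the plugin's actual source: the DataURISketch module name is hypothetical, ParseError inherits from StandardError here because the library's own Error class is not available, and the way content_type is derived (the media type with its parameters stripped) is an assumption.

```ruby
require "strscan"

# Hypothetical wrapper module, used only to keep the sketch self-contained.
module DataURISketch
  # Assumption: StandardError stands in for the library's own Error base class.
  class ParseError < StandardError; end

  DATA_REGEXP       = /data:/
  MEDIA_TYPE_REGEXP = /[-\w.+]+\/[-\w.+]+(;[-\w.+]+=[^;,]+)*/
  BASE64_REGEXP     = /;base64/
  CONTENT_SEPARATOR = /,/

  def self.parse_data_uri(data_uri)
    scanner = StringScanner.new(data_uri)
    scanner.scan(DATA_REGEXP) or raise ParseError, "data URI has invalid format"
    media_type = scanner.scan(MEDIA_TYPE_REGEXP) # nil when omitted
    base64     = scanner.scan(BASE64_REGEXP)     # nil when omitted
    scanner.scan(CONTENT_SEPARATOR) or raise ParseError, "data URI has invalid format"

    # Assumption: the content type is the media type without its parameters.
    content_type = media_type[/\A[^;]+/] if media_type

    { content_type: content_type, base64: !!base64, data: scanner.post_match }
  end
end

result  = DataURISketch.parse_data_uri("data:text/plain;charset=utf-8;base64,SGVsbG8=")
decoded = result[:data].unpack1("m") # "m" decodes base64, giving "Hello"
```

A string that does not start with data:, or that lacks the comma, raises DataURISketch::ParseError, because scan returns nil when the pattern does not match at the current pointer.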

I find this example simple and instructive. Let’s move on to something we can build ourselves with StringScanner: a simple calculator that evaluates an arithmetic operation given as a string. The following tests should pass:

require 'rspec'

describe Calc do
  it 'performs addition' do
    string = "5+4"
    expect(Calc.call(string)).to eq(9)
  end

  it 'performs subtraction' do
    string = "5-4"
    expect(Calc.call(string)).to eq(1)
  end

  it 'performs multiplication' do
    string = "5*4"
    expect(Calc.call(string)).to eq(20)
  end

  it 'performs division' do
    string = "5/4"
    expect(Calc.call(string)).to eq(1)
  end

  it 'handles spaces' do
    string = "5 + 50"
    expect(Calc.call(string)).to eq(55)
  end
end

StringScanner should help us extract the following elements from the given string:

  1. The first operand of the arithmetic operation: one or more digits.

  2. The operation itself; its symbol must be one of * / - +

  3. The second operand of the arithmetic operation: again, one or more digits.

  4. Whitespace between the operands and the operator, which is optional.

First, we define a class with all the constants.

class Calc
  OPERATIONS = { "+" => :add, "-" => :sub, "*" => :mul, "/" => :div }
  SPACE = /\s+/
  DIGITS = /\d+/
  OPERATION_SYMS = /[-+\/*]/
end

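And then we scan for each operand, the optional spaces, and the operation sequentially:

require "strscan"

class Calc
  class << self
    def call(str)
      scanner = StringScanner.new(str)
      first_operand = Integer(scanner.scan(DIGITS))
      scanner.skip(SPACE) # optional whitespace before the operator
      operation = OPERATIONS[scanner.scan(OPERATION_SYMS)]
      scanner.skip(SPACE) # optional whitespace after the operator
      second_operand = Integer(scanner.scan(DIGITS))
      send(operation, first_operand, second_operand)
    end
  end
end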
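Tracing the pointer on the input "5 + 50" shows why whitespace has to be consumed explicitly between tokens: scan only matches at the current position, so the operator pattern would fail if the pointer were still sitting on a space. The sketch below is standalone; the trace array is just a helper for this illustration.

```ruby
require "strscan"

scanner = StringScanner.new("5 + 50")
trace = []

# Each entry records [return value, pointer position after the call].
trace << [scanner.scan(/\d+/), scanner.pos]     # ["5", 1]
trace << [scanner.skip(/\s+/), scanner.pos]     # [1, 2]   skip returns the match length
trace << [scanner.scan(/[-+\/*]/), scanner.pos] # ["+", 3]
trace << [scanner.skip(/\s+/), scanner.pos]     # [1, 4]
trace << [scanner.scan(/\d+/), scanner.pos]     # ["50", 6]
```

When there is no whitespace at all, as in "5+4", skip simply returns nil and leaves the pointer untouched, so the same sequence of calls handles both forms.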

The operation definitions are then straightforward. They belong inside the same class << self block, so that send can find them:

def add(first, second)
  first + second
end

def sub(first, second)
  first - second
end

def mul(first, second)
  first * second
end

def div(first, second)
  first / second
end

I did not cover error handling here because it is as simple as in the first example from the Shrine library. I hope these two examples help you form a picture of how useful the StringScanner library can be.
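For completeness, error handling for the calculator could follow the same "scan or raise" pattern as the data URI parser. The sketch below is one possible variant, not part of the original calculator: the CheckedCalc class and its ParseError are hypothetical names, and it dispatches to Integer's own operator methods with public_send instead of the add/sub/mul/div helpers, purely to keep the example short.

```ruby
require "strscan"

# Hypothetical variant of the calculator with explicit error handling.
class CheckedCalc
  # Assumption: a dedicated error class, mirroring ParseError in the first example.
  class ParseError < StandardError; end

  OPERATIONS     = { "+" => :+, "-" => :-, "*" => :*, "/" => :/ }.freeze
  SPACE          = /\s+/
  DIGITS         = /\d+/
  OPERATION_SYMS = /[-+\/*]/

  def self.call(str)
    scanner = StringScanner.new(str)
    scanner.skip(SPACE)
    first = scanner.scan(DIGITS) or raise ParseError, "expected a number at position #{scanner.pos}"
    scanner.skip(SPACE)
    symbol = scanner.scan(OPERATION_SYMS) or raise ParseError, "expected an operator at position #{scanner.pos}"
    scanner.skip(SPACE)
    second = scanner.scan(DIGITS) or raise ParseError, "expected a number at position #{scanner.pos}"
    scanner.skip(SPACE)
    # Anything left over after the second operand is treated as invalid input.
    raise ParseError, "unexpected input at position #{scanner.pos}" unless scanner.eos?

    # Base 10 is passed explicitly so that "08" is not rejected as an invalid octal literal.
    Integer(first, 10).public_send(OPERATIONS.fetch(symbol), Integer(second, 10))
  end
end

sum = CheckedCalc.call(" 5 + 50 ")
```

Division by zero is not handled here and still raises Ruby's ZeroDivisionError; whether to wrap it in ParseError is a design choice left open.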
