Technical writings of Shkrt
Although the Ruby ecosystem is not as popular in the data science community as languages like Python, R, and Julia, the SciRuby foundation provides a data science toolkit for it. One of its core libraries is daru. As its documentation states, “daru is a library for storage, analysis, manipulation and visualization of data in Ruby”. The library is actively developed and has drawn attention from the data science community.
To get acquainted with daru, we can download a dataset and manipulate the supplied data.
For the purposes of this article, I settled on the City of New York dataset of demographic statistics broken down by zip code. It is available in several formats, but we’ll go for the CSV version.
require 'daru'
df = Daru::DataFrame.from_csv('Demographic_Statistics_By_Zip_Code.csv')
After typing this into the pry console, you should see output that is basically the representation of a newly created Daru::DataFrame object containing the dataset’s data. It is not of much help in its current state, so we should print out the available summary info for this dataframe:
# Fetching number of dataframe vectors. In this case, we can imagine a dataframe as a table, with vectors being its columns
df.vectors.size
=> 46
# Fetching number of rows
df.nrows
=> 236
# Fetching particular row, 0 to 235 available for this dataframe
# The row is truncated to first 15 entries.
df.row_at(4)
=> #<Daru::Vector(46)>
JURISDICTION NAME 10005
COUNT PARTICIPANTS 2
COUNT FEMALE 2
PERCENT FEMALE 1
COUNT MALE 0
PERCENT MALE 0
COUNT GENDER UNKNOWN 0
PERCENT GENDER UNKNO 0
COUNT GENDER TOTAL 2
PERCENT GENDER TOTAL 100
COUNT PACIFIC ISLAND 0
PERCENT PACIFIC ISLA 0
COUNT HISPANIC LATIN 0
PERCENT HISPANIC LAT 0
COUNT AMERICAN INDIA 0
... ...
df.vectors
=> #<Daru::Index(46): {JURISDICTION NAME, COUNT PARTICIPANTS, COUNT FEMALE, PERCENT FEMALE, COUNT MALE, PERCENT MALE,
COUNT GENDER UNKNOWN, PERCENT GENDER UNKNOWN, COUNT GENDER TOTAL, PERCENT GENDER TOTAL, COUNT PACIFIC ISLANDER,
PERCENT PACIFIC ISLANDER, COUNT HISPANIC LATINO, PERCENT HISPANIC LATINO, COUNT AMERICAN INDIAN,
PERCENT AMERICAN INDIAN, COUNT ASIAN NON HISPANIC, PERCENT ASIAN NON HISPANIC, COUNT WHITE NON HISPANIC,
PERCENT WHITE NON HISPANIC ... PERCENT PUBLIC ASSISTANCE TOTAL}>
# fetching particular column
df['JURISDICTION NAME']
=> #<Daru::Vector(236)>
JURISDICTION NAME
0 10001
1 10002
2 10003
3 10004
4 10005
5 10006
6 10007
7 10009
8 10010
9 10011
10 10012
11 10013
12 10014
13 10016
14 10017
... ...
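To make the "table with vectors as columns" picture concrete, here is a minimal plain-Ruby sketch that does not use daru at all. The constant `TABLE`, the helper `row_at`, and all the figures in it are hypothetical and only mirror the shape of the calls shown above.

```ruby
# A column-oriented table in plain Ruby: each key is a column name (a "vector"),
# each value holds that column's entries. All figures are made up for illustration.
TABLE = {
  'JURISDICTION NAME'  => [10001, 10002, 10003],
  'COUNT PARTICIPANTS' => [5, 8, 1],
  'COUNT FEMALE'       => [3, 4, 1]
}.freeze

# Number of columns, analogous to df.vectors.size.
def column_count(table)
  table.size
end

# Number of rows, analogous to df.nrows (every column has the same length).
def row_count(table)
  table.values.first.size
end

# Rebuild one row by taking the same position from every column,
# analogous to df.row_at(index).
def row_at(table, index)
  table.transform_values { |column| column[index] }
end

puts "columns: #{column_count(TABLE)}, rows: #{row_count(TABLE)}"
puts row_at(TABLE, 1).inspect
puts TABLE['JURISDICTION NAME'].inspect # a whole column, like df['JURISDICTION NAME']
```

Storing the data by column is what makes per-column operations such as summaries cheap, while a row has to be assembled from every column on demand.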
Now we can already make some sense of this dataset. Each row contains data for one jurisdiction, identified by its zip code.
Digging deeper into summarizing the dataset:
puts df.summary
Number of rows: 236
Element:[COUNT FEMALE]
== COUNT FEMALE
n :236
non-missing:236
median: 0.0
mean: 10.2966
std.dev.: 28.1891
std.err.: 1.8350
skew: 4.4564
kurtosis: 22.4802
# ...
# rest of the output is omitted
This command prints summary information for the dataframe, broken down by column. n indicates the total number of values, and non-missing is the number of values that are actually present for the given column; 236 means that every row has COUNT FEMALE filled in. The median, mean, std.dev, std.err, skew and kurtosis are standard measures from descriptive statistics.
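For readers less familiar with these measures, the first few can be sketched in plain Ruby. This is not daru's implementation; the helper names and the `SAMPLE` values are hypothetical, and the sample (n - 1) form of the standard deviation is an assumption.

```ruby
# Plain-Ruby versions of some statistics listed in the summary.
# Assumption: the standard deviation uses the sample (n - 1) form.

def mean(values)
  values.sum.to_f / values.size
end

def median(values)
  sorted = values.sort
  mid = sorted.size / 2
  # Odd length: the middle element; even length: average of the two middle ones.
  sorted.size.odd? ? sorted[mid].to_f : (sorted[mid - 1] + sorted[mid]) / 2.0
end

def sample_std_dev(values)
  m = mean(values)
  sum_of_squares = values.sum { |v| (v - m)**2 }
  Math.sqrt(sum_of_squares / (values.size - 1))
end

# Standard error of the mean: standard deviation divided by sqrt(n).
def std_err(values)
  sample_std_dev(values) / Math.sqrt(values.size)
end

SAMPLE = [2, 4, 4, 4, 5, 5, 7, 9].freeze # hypothetical values

puts format('n: %d', SAMPLE.size)
puts format('median: %.4f', median(SAMPLE))
puts format('mean: %.4f', mean(SAMPLE))
puts format('std.dev.: %.4f', sample_std_dev(SAMPLE))
puts format('std.err.: %.4f', std_err(SAMPLE))
```

The relation between the last two measures can be checked against the summary itself: 28.1891 divided by the square root of 236 gives roughly 1.8350, the reported std.err. for COUNT FEMALE.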
We can also summarize a dataframe partially, using the head and tail methods. They slice the data frame from its beginning (head) or from its end (tail), taking only 10 rows by default, so only those 10 rows are summarized:
puts df.head.summary
Element:[COUNT FEMALE]
==
n :10
non-missing:10
median: 1.5
mean: 4.8000
std.dev.: 8.3506
std.err.: 2.6407
skew: 1.2688
kurtosis: -0.3123
Now let’s proceed to actually plot some data and get a taste of visualization with daru.
The visualization part currently supports three backends: nyaplot, the default, which can optionally be changed to gnuplotrb or gruff.
The most obvious task is to make a bar chart of the top 10 jurisdictions by number of US citizens.
To express our intent more clearly, we can slice the dataset to keep only the needed columns:
wf = df['JURISDICTION NAME', 'COUNT US CITIZEN']
=> #<Daru::DataFrame(236x2)>
# ...
wf = wf.sort(['COUNT US CITIZEN']).tail
=> #<Daru::DataFrame(10x2)>
JURISDICTI COUNT US C
119 11218 102
124 11223 102
210 12428 124
222 12754 133
229 12783 197
120 11219 212
228 12779 241
130 11230 245
218 12734 252
232 12789 271
# A new dataframe is needed here because the 'JURISDICTION NAME' vector is otherwise treated as numeric
data = Daru::DataFrame.new(x: wf['JURISDICTION NAME'].map(&:to_s), y: wf['COUNT US CITIZEN'])
plot = data.plot(type: :bar, x: :x, y: :y) do |plot, _diag|
plot.x_label("Jurisdictions")
plot.y_label("US citizen count")
end
plot.export_html
The exported HTML plot should then appear in the working directory. It is advisable to work with Jupyter notebooks for plotting tasks, but installing those tools is a separate story, so we use export_html here for simplicity.
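The data preparation behind the chart (sort ascending, keep the tail, turn zip codes into strings) can also be sketched without daru. The constant `ROWS` and the helpers `top_by_count` and `to_labelled_series` are hypothetical names, and the pairs are made-up values rather than dataset figures.

```ruby
# Plain-Ruby sketch of the "sort, then take the tail" step behind the chart.
# ROWS holds hypothetical [zip_code, us_citizen_count] pairs.
ROWS = [
  [10001, 12], [10002, 40], [10003, 7], [10004, 0], [10005, 25]
].freeze

# Sort ascending by count and keep the last `limit` rows, so the largest
# counts end up at the bottom, as with wf.sort([...]).tail.
def top_by_count(rows, limit)
  rows.sort_by { |_zip, count| count }.last(limit)
end

# Turn zip codes into strings so they act as category labels, not numbers.
def to_labelled_series(rows)
  rows.map { |zip, count| [zip.to_s, count] }
end

top = to_labelled_series(top_by_count(ROWS, 3))
top.each { |label, count| puts "#{label}: #{count}" }
```

Converting the zip codes to strings serves the same purpose as the extra dataframe above: a plotting backend can then treat them as discrete bar labels instead of positions on a numeric axis.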
As you might have noticed, extracting, summarizing and manipulating data with the daru library is quite simple even if you don’t come from the data science world, and even simpler if you are already used to working with specialized tools like R, Octave, and others.