Introduction to analyzing datasets with the daru library
Though the Ruby ecosystem is less popular among data scientists than languages like Python, R and Julia,
the SciRuby foundation provides a data science toolkit for it. One of its core libraries is daru. As its
documentation states, “daru is a library for storage, analysis, manipulation and visualization of data in Ruby”.
The library is being actively developed and has drawn attention from the data science community.
To get acquainted with daru, we can download a dataset and manipulate the supplied data.
After typing this in a pry console, you should see output that is basically the representation of a newly created
Daru::DataFrame object containing the dataset’s data. That is not much help in its current state, so we should print out
the available summary info for this dataframe:
Now we can already make some sense of this dataset. Each row contains data for one jurisdiction, identified by its zip code.
Digging deeper into summarizing the dataset:
This command prints out summary information for the dataframe, broken down column by column.
n indicates the total number of values.
non-missing is the number of values that are actually present for the given column. 236 means that every row has COUNT FEMALE filled in.
median, mean, std.dev, std.err, skew and kurtosis are the standard descriptive statistics: the median, mean, standard deviation, standard error, skewness and kurtosis of the column.
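To make these fields concrete, here is a small sketch in plain Ruby (no daru needed) that computes n, mean, median,
std.dev and std.err by hand. The sample values are made up, and the sketch assumes the sample convention for the
standard deviation (dividing by n - 1); skew and kurtosis are left out because their definitions vary between tools.

```ruby
# Descriptive statistics computed by hand on made-up sample values.
values = [2, 4, 4, 4, 5, 5, 7, 9]

n      = values.size
mean   = values.sum.to_f / n
sorted = values.sort
mid    = n / 2
# Median: middle element for odd n, average of the two middle elements for even n.
median = n.odd? ? sorted[mid].to_f : (sorted[mid - 1] + sorted[mid]) / 2.0

# Sample standard deviation: squared deviations divided by n - 1 (assumed convention).
variance = values.sum { |v| (v - mean)**2 } / (n - 1)
std_dev  = Math.sqrt(variance)

# Standard error of the mean: standard deviation divided by sqrt(n).
std_err = std_dev / Math.sqrt(n)

puts "n=#{n} mean=#{mean} median=#{median}"
puts "std.dev=#{std_dev.round(3)} std.err=#{std_err.round(3)}"
```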
We can also summarize a dataframe partially, using the head and tail methods. These slice the data frame from
its beginning (head) or from its end (tail), taking only 10 rows by default, so only those 10 rows are summarized.
Now let’s proceed to actually plot some data and get a taste of visualization with daru.
The visualization part currently supports three backends: the default, nyaplot,
which can optionally be changed to gnuplotrb or gruff.
The most obvious task is to make a bar chart of the top 10 jurisdictions by number of US citizens.
To express our intent more clearly, we can slice the dataset to keep only the needed columns:
The plot, exported as HTML, should then appear in the working directory. Jupyter notebooks are advised
for plotting tasks, but installing those tools is a separate story, so we use the export_html approach for simplicity.
As you might have noticed, extracting, summarizing and manipulating data with the daru library is quite simple even if you don’t come from
the data science world, and even simpler if you are already used to working with specialized tools like R, Octave, and others.