Introduction to daru (Data Analysis in RUby)

Sameer Deshmukh

Deccan Ruby Conf 2015, Pune, India.

In [1]:

require 'daru'
require 'distribution'
require 'gnuplotrb'

Out[1]:

true

Creating a Daru::Vector

Pass the data as the first argument, label it with the :index option, and name the Vector with the :name option.

In [2]:

vector = Daru::Vector.new(
  [20,40,25,50,45,12], index: ['cherry', 'apple', 'barley', 'wheat', 'rice', 'sugar'], 
  name: "Prices of stuff.")

Out[2]:

Daru::Vector:15070620 size: 6
	Prices of stuff.
cherry	20
apple	40
barley	25
wheat	50
rice	45
sugar	12

Retrieve a single value

Specify the index you want to retrieve in the #[] operator.

In [3]:

vector['rice']

Out[3]:

45

Retrieve multiple values

Multiple values can be retrieved at the same time, as another Daru::Vector, by separating the indexes with commas.

In [4]:

vector['rice', 'wheat', 'sugar']

Out[4]:

Daru::Vector:14387920 size: 3
	Prices of stuff.
rice	45
wheat	50
sugar	12

Retrieve a slice with a Range

Specifying a range of indexes will retrieve a slice of the Daru::Vector.

In [5]:

vector['barley'..'sugar']

Out[5]:

Daru::Vector:14063700 size: 4
	Prices of stuff.
barley	25
wheat	50
rice	45
sugar	12
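
The slice in Out[5] follows the order in which the labels were stored, not alphabetical order: 'wheat' is included even though it sorts after 'sugar'. A minimal plain-Ruby sketch of that behavior, assuming the range endpoints are resolved to positions first; slice_by_labels is a hypothetical helper written for illustration, not a daru method:

```ruby
# Plain-Ruby sketch (no daru required): resolve a label range to
# positions, then slice labels and values together.
labels = ['cherry', 'apple', 'barley', 'wheat', 'rice', 'sugar']
values = [20, 40, 25, 50, 45, 12]

# Hypothetical helper, not part of daru.
def slice_by_labels(labels, values, first, last)
  from = labels.index(first)
  to   = labels.index(last)
  raise KeyError, 'unknown label' if from.nil? || to.nil?

  # Inclusive slice, keeping each label paired with its value.
  labels[from..to].zip(values[from..to])
end

slice = slice_by_labels(labels, values, 'barley', 'sugar')
slice.each { |label, value| puts "#{label}\t#{value}" }
```

The printed pairs correspond to the four rows shown in Out[5].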

Assign a value

Assign a value by specifying the index directly to the #[]= operator.

In [6]:

vector['barley'] = 1500
vector

Out[6]:

Daru::Vector:15070620 size: 6
	Prices of stuff.
cherry	20
apple	40
barley	1500
wheat	50
rice	45
sugar	12

Creating a Daru::DataFrame

The :index option specifies the row index of the DataFrame, and the :order option determines the order in which the columns are stored.

Note that this is only one way of creating a DataFrame. There are around 8 different ways you can do so, depending on your use case.

In [7]:

df = Daru::DataFrame.new({
  'col0' => [1,2,3,4,5,6],
  'col2' => ['a','b','c','d','e','f'],
  'col1' => [11,22,33,44,55,66]
  }, 
  index: ['one', 'two', 'three', 'four', 'five', 'six'], 
  order: ['col0', 'col1', 'col2']
)

Out[7]:

Daru::DataFrame:13337740 rows: 6 cols: 3
	col0	col1	col2
one	1	11	a
two	2	22	b
three	3	33	c
four	4	44	d
five	5	55	e
six	6	66	f

Accessing a Column

A DataFrame column can be accessed using the DataFrame#[] operator.

Note that it returns a Daru::Vector.

In [8]:

df['col1']

Out[8]:

Daru::Vector:13292960 size: 6
	col1
one	11
two	22
three	33
four	44
five	55
six	66

Accessing multiple Columns

Multiple columns can be accessed by separating them with a comma. The result is another DataFrame.

In [9]:

df['col2', 'col0']

Out[9]:

Daru::DataFrame:12423020 rows: 6 cols: 2
	col2	col0
one	a	1
two	b	2
three	c	3
four	d	4
five	e	5
six	f	6

Accessing a Range of Columns

A slice of the DataFrame by columns can be obtained by specifying a Range in #[].

In [10]:

df['col1'..'col2']

Out[10]:

Daru::DataFrame:12007160 rows: 6 cols: 2
	col1	col2
one	11	a
two	22	b
three	33	c
four	44	d
five	55	e
six	66	f

Assigning a Column

You can assign a Daru::Vector to a column and the indexes of the Vector will be automatically matched to that of the DataFrame.

In [11]:

df['col1'] = Daru::Vector.new(['this', 'is', 'some','new','data','here'], 
  index: ['one', 'three','two','six','four', 'five'])
df

Out[11]:

Daru::DataFrame:13337740 rows: 6 cols: 3
	col0	col1	col2
one	1	this	a
two	2	some	b
three	3	is	c
four	4	data	d
five	5	here	e
six	6	new	f
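
The reordering visible in Out[11] can be reproduced by hand. This is a plain-Ruby sketch of label-based alignment, using a Hash as a stand-in for the Vector's index; it illustrates the matching rule only and makes no claim about how daru implements it:

```ruby
# Plain-Ruby sketch (no daru required): align incoming values to the
# row order of the DataFrame by matching labels.
df_index  = ['one', 'two', 'three', 'four', 'five', 'six']
vec_index = ['one', 'three', 'two', 'six', 'four', 'five']
vec_data  = ['this', 'is', 'some', 'new', 'data', 'here']

# Map each label of the incoming vector to its value.
lookup = vec_index.zip(vec_data).to_h

# Walk the DataFrame's row labels and pick the value with the same label.
aligned = df_index.map { |label| lookup[label] }
p aligned
```

In this sketch a row label that is absent from the incoming vector would simply come out as nil.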

Accessing a Row

A single row can be accessed using the #row[] function.

In [12]:

df.row['four']

Out[12]:

Daru::Vector:11115780 size: 3
	four
col0	4
col1	data
col2	d

Accessing a Range of Rows

Specifying a Range of row indexes in #row[] will select a DataFrame with those rows.

In [13]:

df.row['three'..'five']

Out[13]:

Daru::DataFrame:9135240 rows: 3 cols: 3
	col0	col1	col2
three	3	is	c
four	4	data	d
five	5	here	e

Assigning a Row

You can also assign a row, using a Daru::Vector or, as below, a plain Array. Notice that the values are matched to the columns according to the order of the DataFrame.

In [14]:

df.row['five'] = [666,555,333]

Out[14]:

[666, 555, 333]
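
As a rough illustration of positional matching, the sketch below models the DataFrame as a Hash of column Arrays (an assumption made only for this example) and writes the new row into it, column by column, in the DataFrame's column order:

```ruby
# Plain-Ruby sketch (no daru required): assign a row positionally.
order   = ['col0', 'col1', 'col2']
index   = ['one', 'two', 'three', 'four', 'five', 'six']
columns = {
  'col0' => [1, 2, 3, 4, 5, 6],
  'col1' => ['this', 'some', 'is', 'data', 'here', 'new'],
  'col2' => ['a', 'b', 'c', 'd', 'e', 'f']
}

row_position = index.index('five')
new_row      = [666, 555, 333]

# The first value goes to the first column in the order, and so on.
order.zip(new_row).each do |column, value|
  columns[column][row_position] = value
end

p columns['col0']
```

After the assignment col0 holds 666 in the 'five' row, which is why the statistics computed on the DataFrame later in the talk are dominated by that value.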

Statistics on Vector with missing data

A host of static and rolling statistics methods are provided on Daru::Vector.

Note that missing data (very common in most real-world scenarios) is handled gracefully: nil values are left out of the computation, so the mean below is taken over the five non-nil values.

In [15]:

vector = Daru::Vector.new([1,3,5,nil,2,53,nil])
vector.mean

Out[15]:

12.8
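
The value 12.8 is what you get when the two nil entries are ignored. A plain-Ruby sketch of that arithmetic, written without daru so the steps are visible:

```ruby
# Plain-Ruby sketch (no daru required): mean that skips missing values.
data    = [1, 3, 5, nil, 2, 53, nil]
present = data.compact                      # drop the nil entries
mean    = present.sum.to_f / present.size   # 64 / 5
puts mean
```

Dividing by the full length of 7 instead would give a different result, so the output in Out[15] indicates that only the present values are counted.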

Statistics on DataFrame

DataFrame statistics methods apply the corresponding Vector method to every numerical column of the DataFrame.

In [16]:

df.mean

Out[16]:

Daru::Vector:8060380 size: 1
	mean
col0	113.66666666666667

Useful statistics about the vectors in a DataFrame can be observed with #describe

In [17]:

df.describe

Out[17]:

Daru::DataFrame:7470980 rows: 5 cols: 1
	col0
count	6
mean	113.66666666666667
std	270.5924364550249
min	1
max	666
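
The figures in Out[17] can be checked by hand from the contents of col0 after the row assignment above. The sketch below is plain Ruby, not daru; it uses the sample standard deviation (divisor n - 1), which is the variant that reproduces the std value shown:

```ruby
# Plain-Ruby sketch (no daru required): summary statistics for col0.
col0  = [1, 2, 3, 4, 666, 6]
count = col0.size
mean  = col0.sum.to_f / count

# Sample variance: squared deviations divided by (n - 1).
variance = col0.sum { |x| (x - mean)**2 } / (count - 1)
std      = Math.sqrt(variance)

summary = { count: count, mean: mean, std: std, min: col0.min, max: col0.max }
summary.each { |name, value| puts "#{name}\t#{value}" }
```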

Time Series Support

Daru offers a robust time series manipulation API for indexing data based on timestamps. This makes daru a viable tool for analyzing financial data (or any data that changes with time).

The DateTimeIndex

The DateTimeIndex is a special index for indexing data based on timestamps.

A date index range can be created using the DateTimeIndex.date_range function. The :freq option sets the interval between consecutive timestamps in the date index; '3D' below means three days.

In [18]:

index = Daru::DateTimeIndex.date_range(:start => '2012', :periods => 1000, :freq => '3D')

Out[18]:

#<DateTimeIndex:6151760 offset=3D periods=1000 data=[2012-01-01T00:00:00+00:00...2020-03-16T00:00:00+00:00]>
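
To make the contents of this index concrete, here is a plain-Ruby sketch that builds the same sequence of dates with the standard library's Date class. It is only an illustration of what a three-day frequency over 1000 periods produces, not daru's implementation:

```ruby
# Plain-Ruby sketch (no daru required): 1000 dates, three days apart.
require 'date'

start   = Date.new(2012, 1, 1)
periods = 1000
step    = 3 # days between consecutive timestamps

stamps = Array.new(periods) { |i| start + i * step }
puts stamps.first
puts stamps.last
```

The first and last dates agree with the range reported in Out[18].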

A Daru::Vector can be created by simply passing the newly created index object into the :index argument.

In [19]:

timeseries = Daru::Vector.new(1000.times.map {rand}, index: index)

Out[19]:

Daru::Vector:5628020 size: 1000
	nil
2012-01-01T00:00:00+00:00	0.692831672574459
2012-01-04T00:00:00+00:00	0.6971783281963972
2012-01-07T00:00:00+00:00	0.34687766698487965
2012-01-10T00:00:00+00:00	0.5509404993547384
2012-01-13T00:00:00+00:00	0.10166975999865946
2012-01-16T00:00:00+00:00	0.34183413903843207
2012-01-19T00:00:00+00:00	0.018428168123970967
2012-01-22T00:00:00+00:00	0.7792652522504137
2012-01-25T00:00:00+00:00	0.24793667731961144
2012-01-28T00:00:00+00:00	0.7200752551979407
2012-01-31T00:00:00+00:00	0.770756064084555
2012-02-03T00:00:00+00:00	0.6475396341969668
2012-02-06T00:00:00+00:00	0.00034544180080875453
2012-02-09T00:00:00+00:00	0.9881939271758362
2012-02-12T00:00:00+00:00	0.042428559674003274
2012-02-15T00:00:00+00:00	0.6604582692043693
2012-02-18T00:00:00+00:00	0.6446959879056338
2012-02-21T00:00:00+00:00	0.11606340772777746
2012-02-24T00:00:00+00:00	0.5238981665473298
2012-02-27T00:00:00+00:00	0.25979569124671453
2012-03-01T00:00:00+00:00	0.1808967702663009
2012-03-04T00:00:00+00:00	0.04614156947957693
2012-03-07T00:00:00+00:00	0.8935716437439504
2012-03-10T00:00:00+00:00	0.7197074871013468
2012-03-13T00:00:00+00:00	0.20741375904156445
2012-03-16T00:00:00+00:00	0.501647901862296
2012-03-19T00:00:00+00:00	0.9470421480253584
2012-03-22T00:00:00+00:00	0.2954430257659184
2012-03-25T00:00:00+00:00	0.18422816661946229
2012-03-28T00:00:00+00:00	0.48737285121462925
2012-03-31T00:00:00+00:00	0.7549290269495055
2012-04-03T00:00:00+00:00	0.8216050188191338
...	...
2020-03-16T00:00:00+00:00	0.8324422863437039

Accessing data by partial timestamps

When a Vector or DataFrame is indexed by a DateTimeIndex, you can specify a date partially to retrieve all the data that belongs to it.

For example, to access all the data belonging to the year 2012:

In [20]:

timeseries['2012']

Out[20]:

Daru::Vector:15406520 size: 122
	nil
2012-01-01T00:00:00+00:00	0.692831672574459
2012-01-04T00:00:00+00:00	0.6971783281963972
2012-01-07T00:00:00+00:00	0.34687766698487965
2012-01-10T00:00:00+00:00	0.5509404993547384
2012-01-13T00:00:00+00:00	0.10166975999865946
2012-01-16T00:00:00+00:00	0.34183413903843207
2012-01-19T00:00:00+00:00	0.018428168123970967
2012-01-22T00:00:00+00:00	0.7792652522504137
2012-01-25T00:00:00+00:00	0.24793667731961144
2012-01-28T00:00:00+00:00	0.7200752551979407
2012-01-31T00:00:00+00:00	0.770756064084555
2012-02-03T00:00:00+00:00	0.6475396341969668
2012-02-06T00:00:00+00:00	0.00034544180080875453
2012-02-09T00:00:00+00:00	0.9881939271758362
2012-02-12T00:00:00+00:00	0.042428559674003274
2012-02-15T00:00:00+00:00	0.6604582692043693
2012-02-18T00:00:00+00:00	0.6446959879056338
2012-02-21T00:00:00+00:00	0.11606340772777746
2012-02-24T00:00:00+00:00	0.5238981665473298
2012-02-27T00:00:00+00:00	0.25979569124671453
2012-03-01T00:00:00+00:00	0.1808967702663009
2012-03-04T00:00:00+00:00	0.04614156947957693
2012-03-07T00:00:00+00:00	0.8935716437439504
2012-03-10T00:00:00+00:00	0.7197074871013468
2012-03-13T00:00:00+00:00	0.20741375904156445
2012-03-16T00:00:00+00:00	0.501647901862296
2012-03-19T00:00:00+00:00	0.9470421480253584
2012-03-22T00:00:00+00:00	0.2954430257659184
2012-03-25T00:00:00+00:00	0.18422816661946229
2012-03-28T00:00:00+00:00	0.48737285121462925
2012-03-31T00:00:00+00:00	0.7549290269495055
2012-04-03T00:00:00+00:00	0.8216050188191338
...	...
2012-12-29T00:00:00+00:00	0.26155523165437944

Or, to access the data timestamped in March 2012:

In [21]:

timeseries['2012-3']

Out[21]:

Daru::Vector:14832480 size: 11
	nil
2012-03-01T00:00:00+00:00	0.1808967702663009
2012-03-04T00:00:00+00:00	0.04614156947957693
2012-03-07T00:00:00+00:00	0.8935716437439504
2012-03-10T00:00:00+00:00	0.7197074871013468
2012-03-13T00:00:00+00:00	0.20741375904156445
2012-03-16T00:00:00+00:00	0.501647901862296
2012-03-19T00:00:00+00:00	0.9470421480253584
2012-03-22T00:00:00+00:00	0.2954430257659184
2012-03-25T00:00:00+00:00	0.18422816661946229
2012-03-28T00:00:00+00:00	0.48737285121462925
2012-03-31T00:00:00+00:00	0.7549290269495055
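
Conceptually, a partial timestamp selects every entry whose leading date components match. The plain-Ruby sketch below mimics that with the standard library's Date class on the same three-day sequence; it illustrates the selection rule and says nothing about how daru performs the lookup internally:

```ruby
# Plain-Ruby sketch (no daru required): select dates by year or by
# year and month from a three-day sequence starting on 2012-01-01.
require 'date'

start  = Date.new(2012, 1, 1)
stamps = Array.new(1000) { |i| start + 3 * i }

in_2012       = stamps.select { |d| d.year == 2012 }
in_march_2012 = stamps.select { |d| d.year == 2012 && d.month == 3 }

puts in_2012.size
puts in_march_2012.size
```

The two counts agree with the sizes reported in Out[20] and Out[21].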

Specifying the date precisely returns the exact data point. You can also pass a Ruby DateTime object to obtain the value at a precise timestamp.

In [22]:

timeseries['2012-3-10']

Out[22]:

0.7197074871013468

Say you have per-second data about the price of a commodity and want to access the prices for the minute starting at 12:42 pm on 23 March 2012.

In [23]:

index      = Daru::DateTimeIndex.date_range(
  :start => '2012-3-23 11:00', :periods => 20000, :freq => 'S')

seconds_ts = Daru::Vector.new(20000.times.map { rand(50) }, index: index)
seconds_ts['2012-3-23 12:42']

Out[23]:

Daru::Vector:28416340 size: 60
	nil
2012-03-23T12:42:00+00:00	4
2012-03-23T12:42:01+00:00	32
2012-03-23T12:42:02+00:00	35
2012-03-23T12:42:03+00:00	35
2012-03-23T12:42:04+00:00	14
2012-03-23T12:42:05+00:00	1
2012-03-23T12:42:06+00:00	43
2012-03-23T12:42:07+00:00	39
2012-03-23T12:42:08+00:00	20
2012-03-23T12:42:09+00:00	16
2012-03-23T12:42:10+00:00	43
2012-03-23T12:42:11+00:00	0
2012-03-23T12:42:12+00:00	27
2012-03-23T12:42:13+00:00	43
2012-03-23T12:42:14+00:00	43
2012-03-23T12:42:15+00:00	18
2012-03-23T12:42:16+00:00	35
2012-03-23T12:42:17+00:00	39
2012-03-23T12:42:18+00:00	35
2012-03-23T12:42:19+00:00	23
2012-03-23T12:42:20+00:00	25
2012-03-23T12:42:21+00:00	13
2012-03-23T12:42:22+00:00	5
2012-03-23T12:42:23+00:00	43
2012-03-23T12:42:24+00:00	13
2012-03-23T12:42:25+00:00	28
2012-03-23T12:42:26+00:00	2
2012-03-23T12:42:27+00:00	42
2012-03-23T12:42:28+00:00	29
2012-03-23T12:42:29+00:00	36
2012-03-23T12:42:30+00:00	44
2012-03-23T12:42:31+00:00	36
...	...
2012-03-23T12:42:59+00:00	8

Visualization

Simple Visualization with interactive graphs

Plotting a simple scatter plot from a DataFrame. Nyaplot integration provides interactivity.

The DataFrame below records the ice cream sales of a food chain in a city against the maximum recorded temperature in that city. It also lists the staff strength in each city.

In [24]:

df = Daru::DataFrame.new({
  :temperature => [30.4, 23.5, 44.5, 20.3, 34, 24, 31.45, 28.34, 37, 24],
  :sales       => [350, 150, 500, 200, 480, 250, 330, 400, 420, 560],
  :city        => ['Pune', 'Delhi']*5,
  :staff       => [15,20]*5
})
df

Out[24]:

Daru::DataFrame:4800060 rows: 10 cols: 4
	city	sales	staff	temperature
0	Pune	350	15	30.4
1	Delhi	150	20	23.5
2	Pune	500	15	44.5
3	Delhi	200	20	20.3
4	Pune	480	15	34
5	Delhi	250	20	24
6	Pune	330	15	31.45
7	Delhi	400	20	28.34
8	Pune	420	15	37
9	Delhi	560	20	24

The scatter plot below shows ice cream sales against the temperature in the city.

In [25]:

df.plot(type: :scatter, x: :temperature, y: :sales) do |plot, diagram|
  plot.x_label "Temperature"
  plot.y_label "Sales"
  plot.yrange [100, 600]
  plot.xrange [15, 50]
  diagram.tooltip_contents([:city, :staff])
  # Set the color scheme for this diagram.
  diagram.color(Nyaplot::Colors.qual)
  # Color each point according to the city it belongs to.
  diagram.fill_by(:city)
  # Choose the shape of each point according to the city it belongs to.
  diagram.shape_by(:city)
end

Use with gnuplot

Plotting a time series with its rolling mean

Initialize a random number generator that draws from a normal distribution.

In [26]:

rng = Distribution::Normal.rng

Out[26]:

#<Proc:0x0000000368b250@/home/ubuntu/.rvm/gems/ruby-2.2.1/gems/distribution-0.7.3/lib/distribution/normal/gsl.rb:8 (lambda)>

In [27]:

index  = Daru::DateTimeIndex.date_range(:start => '2012-4-2', :periods => 1000)
vector = Daru::Vector.new(1000.times.map {rng.call}, index: index)
vector = vector.cumsum
rolling_mean = vector.rolling_mean 60

GnuplotRB::Plot.new(
  [vector      , with: 'lines', title: 'Vector'], 
  [rolling_mean, with: 'lines', title: 'Rolling Mean'],
  xlabel: 'Time', ylabel: 'Value'
)

Out[27]:
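
As a sketch of what the two transformations in In[27] compute, here are plain-Ruby versions of a cumulative sum and a rolling mean on a tiny input. The helper names mirror the daru methods but are hypothetical stand-ins written for this example, and the choice to leave nil where a full window is not yet available is an assumption:

```ruby
# Plain-Ruby sketch (no daru required). Both helpers are hypothetical
# stand-ins, not daru's implementation.

# Running total of the values seen so far.
def cumsum(values)
  total = 0
  values.map { |v| total += v }
end

# Mean of each run of `window` consecutive values; positions without a
# full window are left as nil (an assumption of this sketch).
def rolling_mean(values, window)
  padding = Array.new(window - 1)
  means   = values.each_cons(window).map { |w| w.sum.to_f / window }
  padding + means
end

walk   = cumsum([1, 2, 3, 4, 5])
smooth = rolling_mean(walk, 3)
p walk
p smooth
```

With a window of 60, as in the cell above, the same idea smooths the random walk over 60 consecutive observations.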

Arel-like syntax

Web devs will feel right at home!

Fast and intuitive syntax for retrieving data with boolean indexing.

The 'where' clause

In [28]:

df = Daru::DataFrame.new({
  a: [1,2,3,4,5,6]*100,
  b: ['a','b','c','d','e','f']*100,
  c: [11,22,33,44,55,66]*100
}, index: (1..600).to_a.shuffle)
df

Out[28]:

Daru::DataFrame:5195920 rows: 600 cols: 3
	a	b	c
102	1	a	11
177	2	b	22
354	3	c	33
163	4	d	44
230	5	e	55
332	6	f	66
171	1	a	11
123	2	b	22
470	3	c	33
471	4	d	44
309	5	e	55
23	6	f	66
15	1	a	11
26	2	b	22
312	3	c	33
484	4	d	44
386	5	e	55
72	6	f	66
506	1	a	11
96	2	b	22
183	3	c	33
90	4	d	44
451	5	e	55
278	6	f	66
529	1	a	11
87	2	b	22
256	3	c	33
415	4	d	44
421	5	e	55
485	6	f	66
139	1	a	11
482	2	b	22
...	...	...	...
513	6	f	66

The condition passed to #where is built by comparing vectors with scalar quantities; #where returns a DataFrame containing only the rows for which the condition is true.

In [29]:

df.where(df[:a].eq(2).or(df[:c].eq(55)))

Out[29]:

Daru::DataFrame:14856680 rows: 200 cols: 3
	a	b	c
177	2	b	22
230	5	e	55
123	2	b	22
309	5	e	55
26	2	b	22
386	5	e	55
96	2	b	22
451	5	e	55
87	2	b	22
421	5	e	55
482	2	b	22
254	5	e	55
52	2	b	22
282	5	e	55
267	2	b	22
304	5	e	55
36	2	b	22
424	5	e	55
303	2	b	22
353	5	e	55
376	2	b	22
115	5	e	55
55	2	b	22
7	5	e	55
478	2	b	22
239	5	e	55
356	2	b	22
530	5	e	55
99	2	b	22
81	5	e	55
595	2	b	22
436	5	e	55
...	...	...	...
532	5	e	55
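
Boolean indexing boils down to building a true/false mask per row and keeping the rows where it is true. A plain-Ruby sketch of the same selection, with the columns modeled as ordinary Arrays (an assumption of this example, not daru's internals):

```ruby
# Plain-Ruby sketch (no daru required): boolean-mask row selection.
a = [1, 2, 3, 4, 5, 6] * 100
b = ['a', 'b', 'c', 'd', 'e', 'f'] * 100
c = [11, 22, 33, 44, 55, 66] * 100

# One boolean per row for each comparison.
mask_a = a.map { |x| x == 2 }
mask_c = c.map { |x| x == 55 }

# Element-wise OR of the two masks.
mask = mask_a.zip(mask_c).map { |left, right| left || right }

# Keep only the rows whose mask entry is true.
rows     = a.zip(b, c)
selected = rows.zip(mask).select { |_row, keep| keep }.map(&:first)
puts selected.size
```

The number of selected rows matches the 200 rows reported in Out[29].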
