require 'daru'
require 'distribution'
require 'gnuplotrb'
true
A Vector is given an index by passing labels through the :index option, and a name through the :name option.
vector = Daru::Vector.new(
  [20,40,25,50,45,12],
  index: ['cherry', 'apple', 'barley', 'wheat', 'rice', 'sugar'],
  name: "Prices of stuff.")
| Daru::Vector:15070620 size: 6 | |
|---|---|
| Prices of stuff. | |
| cherry | 20 |
| apple | 40 |
| barley | 25 |
| wheat | 50 |
| rice | 45 |
| sugar | 12 |
Pass the index label you want to retrieve to the #[] operator.
vector['rice']
45
Multiple values can be retrieved at the same time, as another Daru::Vector, by separating the index labels with commas.
vector['rice', 'wheat', 'sugar']
| Daru::Vector:14387920 size: 3 | |
|---|---|
| Prices of stuff. | |
| rice | 45 |
| wheat | 50 |
| sugar | 12 |
Specifying a range of indexes will retrieve a slice of the Daru::Vector
vector['barley'..'sugar']
| Daru::Vector:14063700 size: 4 | |
|---|---|
| Prices of stuff. | |
| barley | 25 |
| wheat | 50 |
| rice | 45 |
| sugar | 12 |
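The range lookup above includes both endpoints and follows the order of the index, not alphabetical order. A minimal plain-Ruby sketch of that idea follows; `slice_by_labels` is a hypothetical helper written for illustration and is not how daru implements it.

```ruby
# Hypothetical helper: slice parallel label/value arrays by a label range.
# Both endpoints are included, mirroring the output shown above.
def slice_by_labels(labels, values, first_label, last_label)
  from = labels.index(first_label)
  to   = labels.index(last_label)
  labels[from..to].zip(values[from..to]).to_h
end

labels = %w[cherry apple barley wheat rice sugar]
prices = [20, 40, 25, 50, 45, 12]

p slice_by_labels(labels, prices, 'barley', 'sugar')
```

The sketch assumes both labels exist and that the first one appears before the second.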
Assign a value to an index label with the #[]= operator.
vector['barley'] = 1500
vector
| Daru::Vector:15070620 size: 6 | |
|---|---|
| Prices of stuff. | |
| cherry | 20 |
| apple | 40 |
| barley | 1500 |
| wheat | 50 |
| rice | 45 |
| sugar | 12 |
The :index option is used for specifying the row index of the DataFrame and the :order option determines the order in which they will be stored.
Note that this is only one way of creating a DataFrame. There are around 8 different ways you can do so, depending on your use case.
df = Daru::DataFrame.new({
    'col0' => [1,2,3,4,5,6],
    'col2' => ['a','b','c','d','e','f'],
    'col1' => [11,22,33,44,55,66]
  },
  index: ['one', 'two', 'three', 'four', 'five', 'six'],
  order: ['col0', 'col1', 'col2']
)
| Daru::DataFrame:13337740 rows: 6 cols: 3 | |||
|---|---|---|---|
| col0 | col1 | col2 | |
| one | 1 | 11 | a |
| two | 2 | 22 | b |
| three | 3 | 33 | c |
| four | 4 | 44 | d |
| five | 5 | 55 | e |
| six | 6 | 66 | f |
A DataFrame column can be accessed using the DataFrame#[] operator.
Note that it returns a Daru::Vector
df['col1']
| Daru::Vector:13292960 size: 6 | |
|---|---|
| col1 | |
| one | 11 |
| two | 22 |
| three | 33 |
| four | 44 |
| five | 55 |
| six | 66 |
Multiple columns can be accessed by separating them with a comma. The result is another DataFrame.
df['col2', 'col0']
| Daru::DataFrame:12423020 rows: 6 cols: 2 | ||
|---|---|---|
| col2 | col0 | |
| one | a | 1 |
| two | b | 2 |
| three | c | 3 |
| four | d | 4 |
| five | e | 5 |
| six | f | 6 |
A slice of the DataFrame by columns can be obtained by specifying a Range in #[]
df['col1'..'col2']
| Daru::DataFrame:12007160 rows: 6 cols: 2 | ||
|---|---|---|
| col1 | col2 | |
| one | 11 | a |
| two | 22 | b |
| three | 33 | c |
| four | 44 | d |
| five | 55 | e |
| six | 66 | f |
You can assign a Daru::Vector to a column and the indexes of the Vector will be automatically matched to that of the DataFrame.
df['col1'] = Daru::Vector.new(['this', 'is', 'some','new','data','here'],
  index: ['one', 'three','two','six','four', 'five'])
df
| Daru::DataFrame:13337740 rows: 6 cols: 3 | |||
|---|---|---|---|
| col0 | col1 | col2 | |
| one | 1 | this | a |
| two | 2 | some | b |
| three | 3 | is | c |
| four | 4 | data | d |
| five | 5 | here | e |
| six | 6 | new | f |
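The alignment shown above can be pictured as a lookup by label rather than by position. The following is a plain-Ruby sketch of that idea only; `align_to_index` is a hypothetical helper, not part of daru, and returning nil for a missing label is a choice made by this sketch.

```ruby
# Hypothetical helper: reorder incoming values so they follow the
# DataFrame's own index, matching on labels instead of positions.
def align_to_index(frame_index, labels, values)
  by_label = labels.zip(values).to_h
  # A label missing from the incoming data yields nil in this sketch.
  frame_index.map { |label| by_label[label] }
end

frame_index = %w[one two three four five six]
new_values  = %w[this is some new data here]
new_labels  = %w[one three two six four five]

p align_to_index(frame_index, new_labels, new_values)
```

The result follows the DataFrame order (one, two, three, ...), which is why 'some' lands on row 'two' even though it was third in the input.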
A single row can be accessed using the #row[] function.
df.row['four']
| Daru::Vector:11115780 size: 3 | |
|---|---|
| four | |
| col0 | 4 |
| col1 | data |
| col2 | d |
Specifying a Range of Row indexes in #row[] will select a DataFrame with those rows
df.row['three'..'five']
| Daru::DataFrame:9135240 rows: 3 cols: 3 | |||
|---|---|---|---|
| col0 | col1 | col2 | |
| three | 3 | is | c |
| four | 4 | data | d |
| five | 5 | here | e |
You can also assign a row with an Array or a Daru::Vector. With an Array, as here, the values are matched to the columns according to the order of the DataFrame.
df.row['five'] = [666,555,333]
[666, 555, 333]
A host of static and rolling statistics methods are provided on Daru::Vector.
Missing data (very common in real-world data sets) is handled gracefully: nil values are skipped when the statistic is computed.
vector = Daru::Vector.new([1,3,5,nil,2,53,nil])
vector.mean
12.8
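The value 12.8 is consistent with the two nil entries being left out of both the sum and the count (64 / 5). A plain-Ruby sketch of the same arithmetic follows; `mean_skipping_nil` is a hypothetical name used for illustration, not a daru method.

```ruby
# Hypothetical helper: arithmetic mean that ignores nil entries.
def mean_skipping_nil(values)
  present = values.compact          # drop the missing entries
  return nil if present.empty?      # no data, no mean
  present.sum.to_f / present.size
end

p mean_skipping_nil([1, 3, 5, nil, 2, 53, nil])
```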
DataFrame statistics apply the given method to every numeric column of the DataFrame.
df.mean
| Daru::Vector:8060380 size: 1 | |
|---|---|
| mean | |
| col0 | 113.66666666666667 |
Useful statistics about the vectors in a DataFrame can be observed with #describe
df.describe
| Daru::DataFrame:7470980 rows: 5 cols: 1 | |
|---|---|
| col0 | |
| count | 6 |
| mean | 113.66666666666667 |
| std | 270.5924364550249 |
| min | 1 |
| max | 666 |
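The figures above can be checked by hand. The sketch below recomputes them in plain Ruby for col0 as it stands after the row assignment ([1, 2, 3, 4, 666, 6]), assuming the reported std is the sample standard deviation (dividing by n - 1), which is what reproduces the value shown. `summarize` is a hypothetical helper, not daru's implementation.

```ruby
# Hypothetical helper: the five summary figures shown by #describe.
def summarize(values)
  n    = values.size
  mean = values.sum.to_f / n
  # Sample variance: squared deviations divided by (n - 1).
  variance = values.sum { |v| (v - mean)**2 } / (n - 1)
  {
    count: n,
    mean:  mean,
    std:   Math.sqrt(variance),
    min:   values.min,
    max:   values.max
  }
end

p summarize([1, 2, 3, 4, 666, 6])
```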
Daru offers a robust time series manipulation API for indexing data based on timestamps. This makes daru a viable tool for analyzing financial data (or any data that changes with time)
The DateTimeIndex is a special index for indexing data based on timestamps.
A date index range can be created using the DateTimeIndex.date_range function. The :freq option decides the time frequency between each timestamp in the date index.
index = Daru::DateTimeIndex.date_range(:start => '2012', :periods => 1000, :freq => '3D')
#<DateTimeIndex:6151760 offset=3D periods=1000 data=[2012-01-01T00:00:00+00:00...2020-03-16T00:00:00+00:00]>
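To make the effect of :periods and :freq concrete, here is a plain-Ruby sketch that builds the same sequence of dates with the standard library's Date class. `every_n_days` is a hypothetical helper; it only mimics the '3D' case and says nothing about how daru parses frequency strings.

```ruby
require 'date'

# Hypothetical helper: `periods` dates, each `step_days` after the previous one.
def every_n_days(start, periods, step_days)
  Array.new(periods) { |i| start + i * step_days }
end

dates = every_n_days(Date.new(2012, 1, 1), 1000, 3)

puts dates.first  # first timestamp of the range
puts dates.last   # 999 steps of 3 days later
```

The last date agrees with the end of the range printed above, 2020-03-16.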
A Daru::Vector can be created by simply passing the newly created index object into the :index argument.
timeseries = Daru::Vector.new(1000.times.map {rand}, index: index)
| Daru::Vector:5628020 size: 1000 | |
|---|---|
| nil | |
| 2012-01-01T00:00:00+00:00 | 0.692831672574459 |
| 2012-01-04T00:00:00+00:00 | 0.6971783281963972 |
| 2012-01-07T00:00:00+00:00 | 0.34687766698487965 |
| 2012-01-10T00:00:00+00:00 | 0.5509404993547384 |
| 2012-01-13T00:00:00+00:00 | 0.10166975999865946 |
| 2012-01-16T00:00:00+00:00 | 0.34183413903843207 |
| 2012-01-19T00:00:00+00:00 | 0.018428168123970967 |
| 2012-01-22T00:00:00+00:00 | 0.7792652522504137 |
| 2012-01-25T00:00:00+00:00 | 0.24793667731961144 |
| 2012-01-28T00:00:00+00:00 | 0.7200752551979407 |
| 2012-01-31T00:00:00+00:00 | 0.770756064084555 |
| 2012-02-03T00:00:00+00:00 | 0.6475396341969668 |
| 2012-02-06T00:00:00+00:00 | 0.00034544180080875453 |
| 2012-02-09T00:00:00+00:00 | 0.9881939271758362 |
| 2012-02-12T00:00:00+00:00 | 0.042428559674003274 |
| 2012-02-15T00:00:00+00:00 | 0.6604582692043693 |
| 2012-02-18T00:00:00+00:00 | 0.6446959879056338 |
| 2012-02-21T00:00:00+00:00 | 0.11606340772777746 |
| 2012-02-24T00:00:00+00:00 | 0.5238981665473298 |
| 2012-02-27T00:00:00+00:00 | 0.25979569124671453 |
| 2012-03-01T00:00:00+00:00 | 0.1808967702663009 |
| 2012-03-04T00:00:00+00:00 | 0.04614156947957693 |
| 2012-03-07T00:00:00+00:00 | 0.8935716437439504 |
| 2012-03-10T00:00:00+00:00 | 0.7197074871013468 |
| 2012-03-13T00:00:00+00:00 | 0.20741375904156445 |
| 2012-03-16T00:00:00+00:00 | 0.501647901862296 |
| 2012-03-19T00:00:00+00:00 | 0.9470421480253584 |
| 2012-03-22T00:00:00+00:00 | 0.2954430257659184 |
| 2012-03-25T00:00:00+00:00 | 0.18422816661946229 |
| 2012-03-28T00:00:00+00:00 | 0.48737285121462925 |
| 2012-03-31T00:00:00+00:00 | 0.7549290269495055 |
| 2012-04-03T00:00:00+00:00 | 0.8216050188191338 |
| ... | ... |
| 2020-03-16T00:00:00+00:00 | 0.8324422863437039 |
When a Vector or DataFrame is indexed by a DateTimeIndex, you can specify a date partially to retrieve all the data that falls within it.
For example, to access all the data belonging to the year 2012:
timeseries['2012']
| Daru::Vector:15406520 size: 122 | |
|---|---|
| nil | |
| 2012-01-01T00:00:00+00:00 | 0.692831672574459 |
| 2012-01-04T00:00:00+00:00 | 0.6971783281963972 |
| 2012-01-07T00:00:00+00:00 | 0.34687766698487965 |
| 2012-01-10T00:00:00+00:00 | 0.5509404993547384 |
| 2012-01-13T00:00:00+00:00 | 0.10166975999865946 |
| 2012-01-16T00:00:00+00:00 | 0.34183413903843207 |
| 2012-01-19T00:00:00+00:00 | 0.018428168123970967 |
| 2012-01-22T00:00:00+00:00 | 0.7792652522504137 |
| 2012-01-25T00:00:00+00:00 | 0.24793667731961144 |
| 2012-01-28T00:00:00+00:00 | 0.7200752551979407 |
| 2012-01-31T00:00:00+00:00 | 0.770756064084555 |
| 2012-02-03T00:00:00+00:00 | 0.6475396341969668 |
| 2012-02-06T00:00:00+00:00 | 0.00034544180080875453 |
| 2012-02-09T00:00:00+00:00 | 0.9881939271758362 |
| 2012-02-12T00:00:00+00:00 | 0.042428559674003274 |
| 2012-02-15T00:00:00+00:00 | 0.6604582692043693 |
| 2012-02-18T00:00:00+00:00 | 0.6446959879056338 |
| 2012-02-21T00:00:00+00:00 | 0.11606340772777746 |
| 2012-02-24T00:00:00+00:00 | 0.5238981665473298 |
| 2012-02-27T00:00:00+00:00 | 0.25979569124671453 |
| 2012-03-01T00:00:00+00:00 | 0.1808967702663009 |
| 2012-03-04T00:00:00+00:00 | 0.04614156947957693 |
| 2012-03-07T00:00:00+00:00 | 0.8935716437439504 |
| 2012-03-10T00:00:00+00:00 | 0.7197074871013468 |
| 2012-03-13T00:00:00+00:00 | 0.20741375904156445 |
| 2012-03-16T00:00:00+00:00 | 0.501647901862296 |
| 2012-03-19T00:00:00+00:00 | 0.9470421480253584 |
| 2012-03-22T00:00:00+00:00 | 0.2954430257659184 |
| 2012-03-25T00:00:00+00:00 | 0.18422816661946229 |
| 2012-03-28T00:00:00+00:00 | 0.48737285121462925 |
| 2012-03-31T00:00:00+00:00 | 0.7549290269495055 |
| 2012-04-03T00:00:00+00:00 | 0.8216050188191338 |
| ... | ... |
| 2012-12-29T00:00:00+00:00 | 0.26155523165437944 |
Or to access data whose time stamp is March 2012...
timeseries['2012-3']
| Daru::Vector:14832480 size: 11 | |
|---|---|
| nil | |
| 2012-03-01T00:00:00+00:00 | 0.1808967702663009 |
| 2012-03-04T00:00:00+00:00 | 0.04614156947957693 |
| 2012-03-07T00:00:00+00:00 | 0.8935716437439504 |
| 2012-03-10T00:00:00+00:00 | 0.7197074871013468 |
| 2012-03-13T00:00:00+00:00 | 0.20741375904156445 |
| 2012-03-16T00:00:00+00:00 | 0.501647901862296 |
| 2012-03-19T00:00:00+00:00 | 0.9470421480253584 |
| 2012-03-22T00:00:00+00:00 | 0.2954430257659184 |
| 2012-03-25T00:00:00+00:00 | 0.18422816661946229 |
| 2012-03-28T00:00:00+00:00 | 0.48737285121462925 |
| 2012-03-31T00:00:00+00:00 | 0.7549290269495055 |
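A partial date acts as a filter on the parts of the timestamp that were given. The plain-Ruby sketch below reproduces the sizes reported above (122 entries for 2012, 11 for March 2012) on the same 3-day grid. `select_by_partial_date` is a hypothetical helper and takes numeric year and month arguments instead of parsing a string as daru does.

```ruby
require 'date'

# Hypothetical helper: keep the dates matching the year and, if given, the month.
def select_by_partial_date(dates, year, month = nil)
  dates.select { |d| d.year == year && (month.nil? || d.month == month) }
end

# Same grid as the time series above: 1000 dates, 3 days apart.
dates = Array.new(1000) { |i| Date.new(2012, 1, 1) + 3 * i }

puts select_by_partial_date(dates, 2012).size
puts select_by_partial_date(dates, 2012, 3).size
```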
Specifying the date precisely will return the exact data point (You can also pass a ruby DateTime object for precisely obtaining data).
timeseries['2012-3-10']
0.7197074871013468
Say you have per-second data about the price of a commodity and want to access the prices for the minute starting at 12:42 pm on 23 March 2012.
index = Daru::DateTimeIndex.date_range(
:start => '2012-3-23 11:00', :periods => 20000, :freq => 'S')
seconds_ts = Daru::Vector.new(20000.times.map { rand(50) }, index: index)
seconds_ts['2012-3-23 12:42']
| Daru::Vector:28416340 size: 60 | |
|---|---|
| nil | |
| 2012-03-23T12:42:00+00:00 | 4 |
| 2012-03-23T12:42:01+00:00 | 32 |
| 2012-03-23T12:42:02+00:00 | 35 |
| 2012-03-23T12:42:03+00:00 | 35 |
| 2012-03-23T12:42:04+00:00 | 14 |
| 2012-03-23T12:42:05+00:00 | 1 |
| 2012-03-23T12:42:06+00:00 | 43 |
| 2012-03-23T12:42:07+00:00 | 39 |
| 2012-03-23T12:42:08+00:00 | 20 |
| 2012-03-23T12:42:09+00:00 | 16 |
| 2012-03-23T12:42:10+00:00 | 43 |
| 2012-03-23T12:42:11+00:00 | 0 |
| 2012-03-23T12:42:12+00:00 | 27 |
| 2012-03-23T12:42:13+00:00 | 43 |
| 2012-03-23T12:42:14+00:00 | 43 |
| 2012-03-23T12:42:15+00:00 | 18 |
| 2012-03-23T12:42:16+00:00 | 35 |
| 2012-03-23T12:42:17+00:00 | 39 |
| 2012-03-23T12:42:18+00:00 | 35 |
| 2012-03-23T12:42:19+00:00 | 23 |
| 2012-03-23T12:42:20+00:00 | 25 |
| 2012-03-23T12:42:21+00:00 | 13 |
| 2012-03-23T12:42:22+00:00 | 5 |
| 2012-03-23T12:42:23+00:00 | 43 |
| 2012-03-23T12:42:24+00:00 | 13 |
| 2012-03-23T12:42:25+00:00 | 28 |
| 2012-03-23T12:42:26+00:00 | 2 |
| 2012-03-23T12:42:27+00:00 | 42 |
| 2012-03-23T12:42:28+00:00 | 29 |
| 2012-03-23T12:42:29+00:00 | 36 |
| 2012-03-23T12:42:30+00:00 | 44 |
| 2012-03-23T12:42:31+00:00 | 36 |
| ... | ... |
| 2012-03-23T12:42:59+00:00 | 8 |
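Selecting a minute from per-second data amounts to keeping every timestamp in a 60-second half-open interval. A plain-Ruby sketch of that selection, using the standard Time class, is given below. `seconds_in_minute` is a hypothetical helper, and UTC is assumed because the output above shows a +00:00 offset.

```ruby
# Hypothetical helper: timestamps falling in [minute start, minute start + 60s).
def seconds_in_minute(timestamps, year, month, day, hour, minute)
  from = Time.utc(year, month, day, hour, minute, 0)
  to   = from + 60
  timestamps.select { |t| t >= from && t < to }
end

# 20000 one-second steps starting at 11:00, as in the example above.
start      = Time.utc(2012, 3, 23, 11, 0, 0)
timestamps = Array.new(20_000) { |i| start + i }

selected = seconds_in_minute(timestamps, 2012, 3, 23, 12, 42)
puts selected.size
puts selected.first
puts selected.last
```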
Plotting a simple scatter plot from a DataFrame. Nyaplot integration provides interactivity.
The DataFrame below records the ice cream sales of a food chain in a city against the maximum recorded temperature in that city. It also lists the staff strength present in each city.
df = Daru::DataFrame.new({
  :temperature => [30.4, 23.5, 44.5, 20.3, 34, 24, 31.45, 28.34, 37, 24],
  :sales => [350, 150, 500, 200, 480, 250, 330, 400, 420, 560],
  :city => ['Pune', 'Delhi']*5,
  :staff => [15,20]*5
})
df
| Daru::DataFrame:4800060 rows: 10 cols: 4 | ||||
|---|---|---|---|---|
| city | sales | staff | temperature | |
| 0 | Pune | 350 | 15 | 30.4 |
| 1 | Delhi | 150 | 20 | 23.5 |
| 2 | Pune | 500 | 15 | 44.5 |
| 3 | Delhi | 200 | 20 | 20.3 |
| 4 | Pune | 480 | 15 | 34 |
| 5 | Delhi | 250 | 20 | 24 |
| 6 | Pune | 330 | 15 | 31.45 |
| 7 | Delhi | 400 | 20 | 28.34 |
| 8 | Pune | 420 | 15 | 37 |
| 9 | Delhi | 560 | 20 | 24 |
The plot below is between Temperature in the city and the sales of ice cream.
df.plot(type: :scatter, x: :temperature, y: :sales) do |plot, diagram|
  plot.x_label "Temperature"
  plot.y_label "Sales"
  plot.yrange [100, 600]
  plot.xrange [15, 50]
  diagram.tooltip_contents([:city, :staff])
  # Set the color scheme for this diagram.
  diagram.color(Nyaplot::Colors.qual)
  # Color each point according to the city it belongs to.
  diagram.fill_by(:city)
  # Shape each point according to the city it belongs to.
  diagram.shape_by(:city)
end
rng = Distribution::Normal.rng
#<Proc:0x0000000368b250@/home/ubuntu/.rvm/gems/ruby-2.2.1/gems/distribution-0.7.3/lib/distribution/normal/gsl.rb:8 (lambda)>
index = Daru::DateTimeIndex.date_range(:start => '2012-4-2', :periods => 1000)
vector = Daru::Vector.new(1000.times.map {rng.call}, index: index)
vector = vector.cumsum
rolling_mean = vector.rolling_mean 60
GnuplotRB::Plot.new(
  [vector , with: 'lines', title: 'Vector'],
  [rolling_mean, with: 'lines', title: 'Rolling Mean'],
  xlabel: 'Time', ylabel: 'Value'
)
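The plot above combines a cumulative sum (which turns random steps into a random walk) with a rolling mean over a window of 60. The plain-Ruby sketch below shows both operations on a tiny input. `cumulative_sum` and `windowed_mean` are hypothetical helpers, and leaving the first `window - 1` positions as nil is an assumption of this sketch, not a statement about daru's output.

```ruby
# Hypothetical helper: running total of the values seen so far.
def cumulative_sum(values)
  total = 0
  values.map { |v| total += v }
end

# Hypothetical helper: mean of the last `window` values at each position.
# Positions without a full window are left as nil (an assumption).
def windowed_mean(values, window)
  values.each_index.map do |i|
    next nil if i < window - 1
    values[(i - window + 1)..i].sum.to_f / window
  end
end

steps = [1, -2, 3, 4, -1]
walk  = cumulative_sum(steps)

p walk
p windowed_mean(walk, 3)
```

A larger window smooths more but leaves more leading positions without a value.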
df = Daru::DataFrame.new({
  a: [1,2,3,4,5,6]*100,
  b: ['a','b','c','d','e','f']*100,
  c: [11,22,33,44,55,66]*100
}, index: (1..600).to_a.shuffle)
df
| Daru::DataFrame:5195920 rows: 600 cols: 3 | |||
|---|---|---|---|
| a | b | c | |
| 102 | 1 | a | 11 |
| 177 | 2 | b | 22 |
| 354 | 3 | c | 33 |
| 163 | 4 | d | 44 |
| 230 | 5 | e | 55 |
| 332 | 6 | f | 66 |
| 171 | 1 | a | 11 |
| 123 | 2 | b | 22 |
| 470 | 3 | c | 33 |
| 471 | 4 | d | 44 |
| 309 | 5 | e | 55 |
| 23 | 6 | f | 66 |
| 15 | 1 | a | 11 |
| 26 | 2 | b | 22 |
| 312 | 3 | c | 33 |
| 484 | 4 | d | 44 |
| 386 | 5 | e | 55 |
| 72 | 6 | f | 66 |
| 506 | 1 | a | 11 |
| 96 | 2 | b | 22 |
| 183 | 3 | c | 33 |
| 90 | 4 | d | 44 |
| 451 | 5 | e | 55 |
| 278 | 6 | f | 66 |
| 529 | 1 | a | 11 |
| 87 | 2 | b | 22 |
| 256 | 3 | c | 33 |
| 415 | 4 | d | 44 |
| 421 | 5 | e | 55 |
| 485 | 6 | f | 66 |
| 139 | 1 | a | 11 |
| 482 | 2 | b | 22 |
| ... | ... | ... | ... |
| 513 | 6 | f | 66 |
The #where method takes comparisons of vectors against scalar quantities, here combined with or, and returns a DataFrame containing the rows for which the combined condition is *true*.
df.where(df[:a].eq(2).or(df[:c].eq(55)))
| Daru::DataFrame:14856680 rows: 200 cols: 3 | |||
|---|---|---|---|
| a | b | c | |
| 177 | 2 | b | 22 |
| 230 | 5 | e | 55 |
| 123 | 2 | b | 22 |
| 309 | 5 | e | 55 |
| 26 | 2 | b | 22 |
| 386 | 5 | e | 55 |
| 96 | 2 | b | 22 |
| 451 | 5 | e | 55 |
| 87 | 2 | b | 22 |
| 421 | 5 | e | 55 |
| 482 | 2 | b | 22 |
| 254 | 5 | e | 55 |
| 52 | 2 | b | 22 |
| 282 | 5 | e | 55 |
| 267 | 2 | b | 22 |
| 304 | 5 | e | 55 |
| 36 | 2 | b | 22 |
| 424 | 5 | e | 55 |
| 303 | 2 | b | 22 |
| 353 | 5 | e | 55 |
| 376 | 2 | b | 22 |
| 115 | 5 | e | 55 |
| 55 | 2 | b | 22 |
| 7 | 5 | e | 55 |
| 478 | 2 | b | 22 |
| 239 | 5 | e | 55 |
| 356 | 2 | b | 22 |
| 530 | 5 | e | 55 |
| 99 | 2 | b | 22 |
| 81 | 5 | e | 55 |
| 595 | 2 | b | 22 |
| 436 | 5 | e | 55 |
| ... | ... | ... | ... |
| 532 | 5 | e | 55 |
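The row count of 200 follows from the data: 100 rows have a equal to 2, 100 rows have c equal to 55, and no row satisfies both. A plain-Ruby sketch of the same filter over an array of hashes is shown below. `build_rows` and `where_any` are hypothetical helpers for illustration and do not reflect how daru evaluates the condition.

```ruby
# Hypothetical helper: the 600 rows of the DataFrame above, without the shuffled index.
def build_rows
  a = [1, 2, 3, 4, 5, 6]
  b = %w[a b c d e f]
  c = [11, 22, 33, 44, 55, 66]
  Array.new(600) { |i| { a: a[i % 6], b: b[i % 6], c: c[i % 6] } }
end

# Hypothetical helper: keep the rows for which at least one predicate holds.
def where_any(rows, *predicates)
  rows.select { |row| predicates.any? { |pred| pred.call(row) } }
end

matches = where_any(
  build_rows,
  ->(row) { row[:a] == 2 },
  ->(row) { row[:c] == 55 }
)

puts matches.size
p matches.first(2)
```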