This notebook introduces Daru's support for categorical data: how to create categorical vectors, order and summarize them, query them, and contrast-code them.
require 'daru'
true
Initialize a vector whose data is categorical by specifying type: :category.
dv = Daru::Vector.new [:a, 1, :a, 1, :c], type: :category
| Daru::Vector(5) | |
|---|---|
| 0 | a |
| 1 | 1 |
| 2 | a |
| 3 | 1 |
| 4 | c |
dv.frequencies
| Daru::Vector(3) | |
|---|---|
| a | 2 |
| 1 | 2 |
| c | 1 |
You can initialize the vector with predefined categories, including ones that do not occur in the data, using the categories option.
dv = Daru::Vector.new [:a, 1, :a, 1, :c], type: :category, categories: [:a, :b, :c, 1]
| Daru::Vector(5) | |
|---|---|
| 0 | a |
| 1 | 1 |
| 2 | a |
| 3 | 1 |
| 4 | c |
The categories option registers new categories and also specifies the order in which they occur. The frequency table now follows the order you specified.
dv.frequencies
| Daru::Vector(4) | |
|---|---|
| a | 2 |
| b | 0 |
| c | 1 |
| 1 | 2 |
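The effect of the categories option on the frequency table can be pictured with a minimal plain-Ruby sketch. It is not Daru's implementation, and the helper name ordered_frequencies is hypothetical. It only assumes that every declared category is reported in declaration order, with a count of 0 when it never occurs in the data.

```ruby
# Hypothetical helper, not part of Daru: count each declared category in
# declaration order, reporting 0 for categories absent from the data.
def ordered_frequencies(data, categories)
  categories.map { |cat| [cat, data.count(cat)] }.to_h
end

freq = ordered_frequencies([:a, 1, :a, 1, :c], [:a, :b, :c, 1])
# Print one "category: count" line per declared category.
freq.each { |cat, n| puts "#{cat}: #{n}" }
```

The printed counts follow the declared order a, b, c, 1, which is the same order as the table above.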
Categorical data can be ordered or unordered. Specify which one applies by passing ordered: true or ordered: false during initialization.
dv = Daru::Vector.new [:a, 1, :a, 1, :c], categories: [:a, :b, :c, 1], ordered: false, type: :category
| Daru::Vector(5) | |
|---|---|
| 0 | a |
| 1 | 1 |
| 2 | a |
| 3 | 1 |
| 4 | c |
dv.min
ArgumentError: Can not apply min when vector is unordered. To make the categorical data ordered, use #ordered = true
/home/ubuntu/.rvm/gems/ruby-2.2.3/gems/daru-0.1.3.1/lib/daru/category.rb:383:in `assert_ordered'
/home/ubuntu/.rvm/gems/ruby-2.2.3/gems/daru-0.1.3.1/lib/daru/category.rb:216:in `min'
As you can see, you can't compare values if the vector is not ordered. Let's make it ordered.
dv = Daru::Vector.new [:a, 1, :a, 1, :c], ordered: true, categories: [:a, :b, :c, 1], type: :category
| Daru::Vector(5) | |
|---|---|
| 0 | a |
| 1 | 1 |
| 2 | a |
| 3 | 1 |
| 4 | c |
dv.min
:a
dv.sort!
| Daru::Vector(5) | |
|---|---|
| 0 | a |
| 2 | a |
| 4 | c |
| 1 | 1 |
| 3 | 1 |
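One way to think about an ordered categorical vector is that each value is compared by its position in the category list rather than by the value itself. The plain-Ruby sketch below illustrates that idea. The helper names are hypothetical and this is not how Daru implements #min or #sort!.

```ruby
# Hypothetical helpers, not part of Daru.

# Map each category to its position in the declared order.
def category_rank(categories)
  categories.each_with_index.to_h
end

# Smallest value according to category order.
def ordered_min(data, categories)
  rank = category_rank(categories)
  data.min_by { |value| rank.fetch(value) }
end

# [original_index, value] pairs sorted by category order; ties keep
# their original relative order because the index is the second sort key.
def ordered_sort(data, categories)
  rank = category_rank(categories)
  data.each_with_index
      .sort_by { |value, i| [rank.fetch(value), i] }
      .map { |value, i| [i, value] }
end

data = [:a, 1, :a, 1, :c]
categories = [:a, :b, :c, 1]
puts ordered_min(data, categories)
ordered_sort(data, categories).each { |i, value| puts "#{i} | #{value}" }
```

Because 1 is declared last in the category list, it sorts after :c, which matches the index order 0, 2, 4, 1, 3 in the sorted table above.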
Besides setting categories during initialization, you can also set them after the vector has been created.
dv = Daru::Vector.new [:a, 1, :c, 1, :c], type: :category
dv.categories = [:a, :b, :c, 1]
[:a, :b, :c, 1]
You can also check all the categories associated with the vector.
dv.categories
[:a, :b, :c, 1]
You can also specify after initialization whether the vector should be treated as ordered.
Note: By default the vector is unordered.
dv = Daru::Vector.new [:a, 1, :c, 1, :c], type: :category
dv.ordered?
false
dv.ordered = true
dv.ordered?
true
Here are a few measures that summarize a categorical vector.
dv = Daru::Vector.new [:a, :a, :a, :b, :b, :c], type: :category
dv.summary
| Daru::Vector(6) | |
|---|---|
| size | 6 |
| categories | 3 |
| max_freq | 3 |
| max_category | a |
| min_freq | 1 |
| min_category | c |
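The summary fields can be reproduced by hand from a frequency count. The following is a plain-Ruby sketch with a hypothetical helper name. It counts only categories that occur in the data, and its tie-breaking between equally frequent categories is an assumption that may differ from Daru's.

```ruby
# Hypothetical helper, not part of Daru: derive summary measures from a
# frequency count of the observed values.
def category_summary(data)
  freq = data.uniq.map { |cat| [cat, data.count(cat)] }.to_h
  max_category, max_freq = freq.max_by { |_cat, n| n }
  min_category, min_freq = freq.min_by { |_cat, n| n }
  {
    size: data.size,
    categories: freq.size,
    max_freq: max_freq,
    max_category: max_category,
    min_freq: min_freq,
    min_category: min_category
  }
end

category_summary([:a, :a, :a, :b, :b, :c]).each do |key, value|
  puts "#{key} | #{value}"
end
```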
#frequencies gives the frequency of each category, listed in category order rather than in the order the values appear in the data.
dv = Daru::Vector.new ['third']*3 + ['second']*2 + ['first'], type: :category, categories: ['first', 'second', 'third']
dv.frequencies
| Daru::Vector(3) | |
|---|---|
| first | 1 |
| second | 2 |
| third | 3 |
Note: The following operations (#min, #max and #sort!) only apply if the vector is ordered.
dv
| Daru::Vector(6) | |
|---|---|
| 0 | third |
| 1 | third |
| 2 | third |
| 3 | second |
| 4 | second |
| 5 | first |
dv.min
ArgumentError: Can not apply min when vector is unordered. To make the categorical data ordered, use #ordered = true
/home/ubuntu/.rvm/gems/ruby-2.2.3/gems/daru-0.1.3.1/lib/daru/category.rb:383:in `assert_ordered'
/home/ubuntu/.rvm/gems/ruby-2.2.3/gems/daru-0.1.3.1/lib/daru/category.rb:216:in `min'
dv.ordered = true
true
dv.min
"first"
dv.max
"third"
dv.sort!
| Daru::Vector(6) | |
|---|---|
| 5 | first |
| 3 | second |
| 4 | second |
| 0 | third |
| 1 | third |
| 2 | third |
#add_category associates a new category with the vector.
Note: To insert a new categorical value, you must first register its category in the vector with #add_category. For example:
dv
| Daru::Vector(6) | |
|---|---|
| 5 | first |
| 3 | second |
| 4 | second |
| 0 | third |
| 1 | third |
| 2 | third |
dv[0] = 'fourth'
ArgumentError: Invalid category fourth, to add a new category use #add_category
/home/ubuntu/.rvm/gems/ruby-2.2.3/gems/daru-0.1.3.1/lib/daru/category.rb:505:in `modify_category_at'
/home/ubuntu/.rvm/gems/ruby-2.2.3/gems/daru-0.1.3.1/lib/daru/category.rb:144:in `[]='
dv.add_category 'fourth'
dv[0] = 'fourth'
dv
| Daru::Vector(6) | |
|---|---|
| 5 | first |
| 3 | second |
| 4 | second |
| 0 | fourth |
| 1 | third |
| 2 | third |
dv.categories
["first", "second", "third", "fourth"]
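The guard behind this behavior can be sketched in plain Ruby: assignment is rejected unless the value already belongs to the registered categories. TinyCategorical is a hypothetical toy class, not Daru's implementation, and unlike the notebook example it addresses elements by position rather than by index label.

```ruby
# Hypothetical toy class, not part of Daru: values may only be assigned
# if their category has been registered first.
class TinyCategorical
  attr_reader :categories

  def initialize(data)
    @data = data.dup
    @categories = data.uniq
  end

  # Register a category so that it can be assigned later.
  def add_category(category)
    @categories << category unless @categories.include?(category)
    self
  end

  # Positional assignment that refuses unregistered categories.
  def []=(position, value)
    unless @categories.include?(value)
      raise ArgumentError,
            "Invalid category #{value}, to add a new category use #add_category"
    end
    @data[position] = value
  end

  def to_a
    @data.dup
  end
end

vector = TinyCategorical.new(%w[first second second third third third])
vector.add_category('fourth')
vector[3] = 'fourth'
puts vector.to_a.join(', ')
puts vector.categories.join(', ')
```

Registering categories explicitly means that a typo in an assigned value raises an error instead of silently creating a new category.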
You can rename a subset of the existing categories by passing a hash that maps old names to new ones.
dv = Daru::Vector.new [1, 2, 'third', 2, 1], type: :category
| Daru::Vector(5) | |
|---|---|
| 0 | 1 |
| 1 | 2 |
| 2 | third |
| 3 | 2 |
| 4 | 1 |
dv.rename_categories 1 => 'first', 2 => 'second'
dv
| Daru::Vector(5) | |
|---|---|
| 0 | first |
| 1 | second |
| 2 | third |
| 3 | second |
| 4 | first |
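Conceptually, renaming with a partial hash leaves every category that is not mentioned untouched. A plain-Ruby sketch of that mapping follows. The helper name rename_values is hypothetical and this is not Daru's code.

```ruby
# Hypothetical helper, not part of Daru: replace values found in the
# mapping and keep all other values as they are.
def rename_values(data, mapping)
  data.map { |value| mapping.fetch(value, value) }
end

renamed = rename_values([1, 2, 'third', 2, 1], { 1 => 'first', 2 => 'second' })
puts renamed.join(', ')
```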
Indexing works the same way as for an ordinary vector, so you can expect these methods to behave as they do there. Here are a few examples:
dv = Daru::Vector.new [1, 1, 2, 2, 3, 1], index: :a..:f, type: :category
| Daru::Vector(6) | |
|---|---|
| a | 1 |
| b | 1 |
| c | 2 |
| d | 2 |
| e | 3 |
| f | 1 |
dv[0..2]
| Daru::Vector(3) | |
|---|---|
| a | 1 |
| b | 1 |
| c | 2 |
dv.at -1
1
dv.set_at [0, 1], 3
dv
| Daru::Vector(6) | |
|---|---|
| a | 3 |
| b | 3 |
| c | 2 |
| d | 2 |
| e | 3 |
| f | 1 |
Daru uses an Arel-like syntax for querying data, and it works with categorical vectors as well.
dv = Daru::Vector.new ['I', 'II', 'I', 'III', 'IV', 'I', 'II'], type: :category, categories: ['I', 'II', 'III', 'IV']
dv.ordered = true
dv.frequencies
| Daru::Vector(4) | |
|---|---|
| I | 3 |
| II | 2 |
| III | 1 |
| IV | 1 |
dv.where(dv.eq('I'))
| Daru::Vector(3) | |
|---|---|
| 0 | I |
| 2 | I |
| 5 | I |
dv.where(dv.gt('II'))
| Daru::Vector(2) | |
|---|---|
| 3 | III |
| 4 | IV |
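A query such as the gt example above can be read as two steps: build a boolean mask by comparing category positions, then keep the elements where the mask is true. The plain-Ruby sketch below shows those two steps. The helper names are hypothetical and the code does not reflect Daru's internals.

```ruby
# Hypothetical helpers, not part of Daru.

# True where the value's category comes after the threshold category.
def gt_mask(data, categories, threshold)
  rank = categories.each_with_index.to_h
  data.map { |value| rank.fetch(value) > rank.fetch(threshold) }
end

# Keep [original_index, value] pairs where the mask is true.
def select_where(data, mask)
  data.each_with_index
      .select { |_value, i| mask[i] }
      .map { |value, i| [i, value] }
end

data = %w[I II I III IV I II]
categories = %w[I II III IV]
select_where(data, gt_mask(data, categories, 'II')).each do |i, value|
  puts "#{i} | #{value}"
end
```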
df = Daru::DataFrame.new({
a: (1..7).to_a,
b: ('a'..'g').to_a,
c: ['I', 'II', 'I', 'III', 'IV', 'I', 'II']
})
| Daru::DataFrame(7x3) | |||
|---|---|---|---|
| | a | b | c |
| 0 | 1 | a | I |
| 1 | 2 | b | II |
| 2 | 3 | c | I |
| 3 | 4 | d | III |
| 4 | 5 | e | IV |
| 5 | 6 | f | I |
| 6 | 7 | g | II |
df.c = df.c.to_category
df
| Daru::DataFrame(7x3) | |||
|---|---|---|---|
| | a | b | c |
| 0 | 1 | a | I |
| 1 | 2 | b | II |
| 2 | 3 | c | I |
| 3 | 4 | d | III |
| 4 | 5 | e | IV |
| 5 | 6 | f | I |
| 6 | 7 | g | II |
df.where(df.c.gt('I') & df.c.lt('IV'))
| Daru::DataFrame(3x3) | |||
|---|---|---|---|
| | a | b | c |
| 1 | 2 | b | II |
| 3 | 4 | d | III |
| 6 | 7 | g | II |
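Combining two conditions with & amounts to an element-wise logical AND of two boolean masks, which is then used to filter rows. The following plain-Ruby sketch assumes the category order I, II, III, IV and uses hypothetical helper names. It is an illustration, not Daru's implementation.

```ruby
# Hypothetical helpers, not part of Daru.

# Two masks: strictly above `low` and strictly below `high` in category order.
def rank_masks(values, categories, low, high)
  rank = categories.each_with_index.to_h
  above = values.map { |v| rank.fetch(v) > rank.fetch(low) }
  below = values.map { |v| rank.fetch(v) < rank.fetch(high) }
  [above, below]
end

# Element-wise logical AND of two boolean masks.
def and_masks(left, right)
  left.zip(right).map { |a, b| a && b }
end

# Keep rows where the mask is true, prefixed with their original index.
def filter_rows(rows, mask)
  rows.each_with_index
      .select { |_row, i| mask[i] }
      .map { |row, i| [i, *row] }
end

rows = (1..7).to_a.zip(('a'..'g').to_a, %w[I II I III IV I II])
above, below = rank_masks(rows.map(&:last), %w[I II III IV], 'I', 'IV')
filter_rows(rows, and_masks(above, below)).each { |row| puts row.join(' | ') }
```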
Categorical data supports four contrast coding schemes: dummy, simple, helmert and deviation. The default is dummy coding:
dv = Daru::Vector.new ['I', 'II', 'I', 'III', 'IV', 'I', 'II'], type: :category, categories: ['I', 'II', 'III', 'IV']
dv.name = 'Rank'
dv.contrast_code
| Daru::DataFrame(7x3) | |||
|---|---|---|---|
| | Rank_II | Rank_III | Rank_IV |
| 0 | 0 | 0 | 0 |
| 1 | 1 | 0 | 0 |
| 2 | 0 | 0 | 0 |
| 3 | 0 | 1 | 0 |
| 4 | 0 | 0 | 1 |
| 5 | 0 | 0 | 0 |
| 6 | 1 | 0 | 0 |
You can set the base category using #base_category=
dv.base_category = 'IV'
dv.contrast_code
| Daru::DataFrame(7x3) | |||
|---|---|---|---|
| | Rank_I | Rank_II | Rank_III |
| 0 | 1 | 0 | 0 |
| 1 | 0 | 1 | 0 |
| 2 | 1 | 0 | 0 |
| 3 | 0 | 0 | 1 |
| 4 | 0 | 0 | 0 |
| 5 | 1 | 0 | 0 |
| 6 | 0 | 1 | 0 |
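The two tables above follow the usual dummy-coding rule: one 0/1 column per category except the base category, whose rows are all zeros. The plain-Ruby sketch below illustrates that rule. The helper name dummy_code is hypothetical and this is not Daru's implementation.

```ruby
# Hypothetical helper, not part of Daru: one 0/1 column for every
# category except the base, which is encoded as a row of zeros.
def dummy_code(data, categories, base = categories.first)
  columns = categories - [base]
  data.map { |value| columns.map { |cat| value == cat ? 1 : 0 } }
end

data = %w[I II I III IV I II]
categories = %w[I II III IV]

puts 'base I:'
dummy_code(data, categories).each { |row| puts row.join(' | ') }
puts 'base IV:'
dummy_code(data, categories, 'IV').each { |row| puts row.join(' | ') }
```

Changing the base only changes which category is left out of the columns, so the number of columns stays at one less than the number of categories.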
To use another coding scheme, assign it with #coding_scheme=
dv.coding_scheme = :deviation
dv.contrast_code
| Daru::DataFrame(7x3) | |||
|---|---|---|---|
| | Rank_I | Rank_II | Rank_III |
| 0 | 1 | 0 | 0 |
| 1 | 0 | 1 | 0 |
| 2 | 1 | 0 | 0 |
| 3 | 0 | 0 | 1 |
| 4 | -1 | -1 | -1 |
| 5 | 1 | 0 | 0 |
| 6 | 0 | 1 | 0 |
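Deviation coding differs from dummy coding only in how the base category is encoded: its rows contain -1 in every column instead of 0. A plain-Ruby sketch of that rule, with a hypothetical helper name and no claim to mirror Daru's code, is:

```ruby
# Hypothetical helper, not part of Daru: like dummy coding, except that
# rows belonging to the base category are encoded as all -1.
def deviation_code(data, categories, base)
  columns = categories - [base]
  data.map do |value|
    if value == base
      Array.new(columns.size, -1)
    else
      columns.map { |cat| value == cat ? 1 : 0 }
    end
  end
end

data = %w[I II I III IV I II]
deviation_code(data, %w[I II III IV], 'IV').each { |row| puts row.join(' | ') }
```

With base category IV, only the row at index 4 changes relative to the dummy-coded table, which matches the output above.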