R DT data.table Join
This is part of PML notes. HDD: r_data_table_start
Prepare Data #
library(dplyr)
library(readr)
library(data.table)
hero = "
name,       alignment, gender,   publisher
Magneto,    bad,       male,     MarvelDuplicate
Storm,      good,      female,   MarvelDuplicate
Batman,     good,      male,     DC
Joker,      bad,       male,     DC
Catwoman,   bad,       female,   DC
Hellboy,    good,      male,     Dark Horse Comics
"
hero = read_csv(hero, trim_ws = TRUE, skip = 1)
hero = data.table(hero)
publisher = "
publisher,   yr_founded
DC,              1934
MarvelDuplicate, 1939
MarvelDuplicate, 8888
Image,           1992
"
publisher = read_csv(publisher, trim_ws = TRUE, skip = 1)
publisher = data.table(publisher)
For all the join commands, if a key is set for both dt, then on = 'key_col_name' can be elided.
WARN: row orders are not guaranteed
Left-Join & Inner-Join #
left-join will keep all left-side table rows, including no-matched rows
by default, nomatch cells := NA
hero[publisher, on = "publisher"]
| name | alignment | gender | publisher | yr_founded | 
|---|---|---|---|---|
| Batman | good | male | DC | 1934 | 
| Joker | bad | male | DC | 1934 | 
| Catwoman | bad | female | DC | 1934 | 
| Magneto | bad | male | MarvelDuplicate | 1939 | 
| Storm | good | female | MarvelDuplicate | 1939 | 
| Magneto | bad | male | MarvelDuplicate | 8888 | 
| Storm | good | female | MarvelDuplicate | 8888 | 
| NA | NA | NA | Image | 1992 | 
nomatch = 0 means delete no-matched rows (becomes inner-join)
hero[publisher, nomatch = 0, on = "publisher"]
| name | alignment | gender | publisher | yr_founded | 
|---|---|---|---|---|
| Batman | good | male | DC | 1934 | 
| Joker | bad | male | DC | 1934 | 
| Catwoman | bad | female | DC | 1934 | 
| Magneto | bad | male | MarvelDuplicate | 1939 | 
| Storm | good | female | MarvelDuplicate | 1939 | 
| Magneto | bad | male | MarvelDuplicate | 8888 | 
| Storm | good | female | MarvelDuplicate | 8888 | 
Right-Join & Inner, all = TRUE #
right-join will keep all right-side table rows, unfortunately, we have to swap two data.table’s to make a right-join (using left-join format)
publisher[hero, on = "publisher"]
| publisher | yr_founded | name | alignment | gender | 
|---|---|---|---|---|
| MarvelDuplicate | 1939 | Magneto | bad | male | 
| MarvelDuplicate | 8888 | Magneto | bad | male | 
| MarvelDuplicate | 1939 | Storm | good | female | 
| MarvelDuplicate | 8888 | Storm | good | female | 
| DC | 1934 | Batman | good | male | 
| DC | 1934 | Joker | bad | male | 
| DC | 1934 | Catwoman | bad | female | 
| Dark Horse Comics | NA | Hellboy | good | male | 
nomatch = 0 means delete no-matched rows (becomes inner-join again, with row & col orders may diff from the left-join’s inner-join)
Outer-/Full-Join & Inner-Join #
Outer-(or full-) join will keep both tables content, with extra NA’s
we need function merge; param is by = 'key_name', NOT on.
merge(hero, publisher, by = "publisher", all = TRUE)
| publisher | name | alignment | gender | yr_founded | 
|---|---|---|---|---|
| DC | Batman | good | male | 1934 | 
| DC | Joker | bad | male | 1934 | 
| DC | Catwoman | bad | female | 1934 | 
| Dark Horse Comics | Hellboy | good | male | NA | 
| Image | NA | NA | NA | 1992 | 
| MarvelDuplicate | Magneto | bad | male | 1939 | 
| MarvelDuplicate | Magneto | bad | male | 8888 | 
| MarvelDuplicate | Storm | good | female | 1939 | 
| MarvelDuplicate | Storm | good | female | 8888 | 
if not all content is kept, it becomes inner-join, again.
merge(hero, publisher, by = "publisher")
| publisher | name | alignment | gender | yr_founded | 
|---|---|---|---|---|
| DC | Batman | good | male | 1934 | 
| DC | Joker | bad | male | 1934 | 
| DC | Catwoman | bad | female | 1934 | 
| MarvelDuplicate | Magneto | bad | male | 1939 | 
| MarvelDuplicate | Magneto | bad | male | 8888 | 
| MarvelDuplicate | Storm | good | female | 1939 | 
| MarvelDuplicate | Storm | good | female | 8888 | 
Anti-Join #
Give the rows that only exist on left-table
hero[!publisher, on = "publisher"]
| name | alignment | gender | publisher | 
|---|---|---|---|
| Hellboy | good | male | Dark Horse Comics | 
publisher[!hero, on = "publisher"]
| publisher | yr_founded | 
|---|---|
| Image | 1992 | 
Cross-Join #
Give the multiplied rows like matrix-multiplication.
nrow(new_table) := nrow(left_table) * nrow(right_table)

Semi-Join (Only Keep Half Side Content) #
index = unique(hero[publisher, on = "publisher", which=TRUE, allow.cartesian=TRUE])
hero[index]
| name | alignment | gender | publisher | 
|---|---|---|---|
| Batman | good | male | DC | 
| Joker | bad | male | DC | 
| Catwoman | bad | female | DC | 
| Magneto | bad | male | MarvelDuplicate | 
| Storm | good | female | MarvelDuplicate | 
| NA | NA | NA | NA | 
you can add nomatch = 0
Self-Join #
Self-join has not been implemented for DT yet. A for loop or some other methods are necessary.
See self-join definition here.
See examples & solutions for DT here
ref:
data source, which is modified here.
devtools::session_info()
## Session info --------------------------------------------------------------
##  setting  value
##  version  R version 3.3.1 (2016-06-21)
##  system   x86_64, mingw32
##  ui       RTerm
##  language (EN)
##  collate  English_United Kingdom.1252
##  tz       Europe/Kaliningrad
##  date     2016-10-11
## Packages ------------------------------------------------------------------
##  package    * version  date       source
##  assertthat   0.1      2013-12-06 CRAN (R 3.1.1)
##  chron        2.3-47   2015-06-24 CRAN (R 3.1.3)
##  codetools    0.2-15   2016-10-05 CRAN (R 3.3.1)
##  data.table * 1.9.6    2015-09-19 CRAN (R 3.1.3)
##  DBI          0.5-1    2016-09-10 CRAN (R 3.3.1)
##  devtools     1.12.0   2016-06-24 CRAN (R 3.3.1)
##  digest       0.6.10   2016-08-02 CRAN (R 3.3.1)
##  dplyr      * 0.5.0    2016-06-24 CRAN (R 3.3.1)
##  evaluate     0.10     2016-10-11 CRAN (R 3.3.1)
##  formatR      1.4      2016-05-09 CRAN (R 3.3.1)
##  highr        0.6      2016-05-09 CRAN (R 3.3.1)
##  htmltools    0.3.5    2016-03-21 CRAN (R 3.1.3)
##  knitr        1.14     2016-08-13 CRAN (R 3.3.1)
##  magrittr     1.5      2014-11-22 CRAN (R 3.3.1)
##  memoise      1.0.0    2016-01-29 CRAN (R 3.3.1)
##  R6           2.2.0    2016-10-05 CRAN (R 3.3.1)
##  Rcpp         0.12.7   2016-09-05 CRAN (R 3.3.1)
##  readr      * 1.0.0    2016-08-03 CRAN (R 3.3.1)
##  rmarkdown    1.0.9010 2016-08-26 Github (rstudio/rmarkdown@c2dfc53)
##  stringi      1.1.2    2016-10-01 CRAN (R 3.3.1)
##  stringr      1.1.0    2016-08-19 CRAN (R 3.3.1)
##  tibble       1.2      2016-08-26 CRAN (R 3.3.1)
##  withr        1.0.2    2016-06-20 CRAN (R 3.3.1)
##  yaml         2.1.13   2014-06-12 CRAN (R 3.1.1)
tibble 1.2 2016-08-26 CRAN (R 3.3.1)