R DT data.table Join
This is part of PML notes. HDD: r_data_table_start
Prepare Data #
library(dplyr)
library(readr)
library(data.table)
hero = "
name, alignment, gender, publisher
Magneto, bad, male, MarvelDuplicate
Storm, good, female, MarvelDuplicate
Batman, good, male, DC
Joker, bad, male, DC
Catwoman, bad, female, DC
Hellboy, good, male, Dark Horse Comics
"
hero = read_csv(hero, trim_ws = TRUE, skip = 1)
hero = data.table(hero)
publisher = "
publisher, yr_founded
DC, 1934
MarvelDuplicate, 1939
MarvelDuplicate, 8888
Image, 1992
"
publisher = read_csv(publisher, trim_ws = TRUE, skip = 1)
publisher = data.table(publisher)
For all the join
commands, if a key is set for both dt, then on = 'key_col_name'
can be elided.
WARN: row orders are not guaranteed
Left-Join & Inner-Join #
left-join will keep all left-side table rows, including no-matched rows
by default, nomatch cells := NA
hero[publisher, on = "publisher"]
name | alignment | gender | publisher | yr_founded |
---|---|---|---|---|
Batman | good | male | DC | 1934 |
Joker | bad | male | DC | 1934 |
Catwoman | bad | female | DC | 1934 |
Magneto | bad | male | MarvelDuplicate | 1939 |
Storm | good | female | MarvelDuplicate | 1939 |
Magneto | bad | male | MarvelDuplicate | 8888 |
Storm | good | female | MarvelDuplicate | 8888 |
NA | NA | NA | Image | 1992 |
nomatch = 0 means delete no-matched rows (becomes inner-join)
hero[publisher, nomatch = 0, on = "publisher"]
name | alignment | gender | publisher | yr_founded |
---|---|---|---|---|
Batman | good | male | DC | 1934 |
Joker | bad | male | DC | 1934 |
Catwoman | bad | female | DC | 1934 |
Magneto | bad | male | MarvelDuplicate | 1939 |
Storm | good | female | MarvelDuplicate | 1939 |
Magneto | bad | male | MarvelDuplicate | 8888 |
Storm | good | female | MarvelDuplicate | 8888 |
Right-Join & Inner, all = TRUE #
right-join will keep all right-side table rows, unfortunately, we have to swap two data.table’s to make a right-join (using left-join format)
publisher[hero, on = "publisher"]
publisher | yr_founded | name | alignment | gender |
---|---|---|---|---|
MarvelDuplicate | 1939 | Magneto | bad | male |
MarvelDuplicate | 8888 | Magneto | bad | male |
MarvelDuplicate | 1939 | Storm | good | female |
MarvelDuplicate | 8888 | Storm | good | female |
DC | 1934 | Batman | good | male |
DC | 1934 | Joker | bad | male |
DC | 1934 | Catwoman | bad | female |
Dark Horse Comics | NA | Hellboy | good | male |
nomatch = 0 means delete no-matched rows (becomes inner-join again, with row & col orders may diff from the left-join’s inner-join)
Outer-/Full-Join & Inner-Join #
Outer-(or full-) join will keep both tables content, with extra NA’s
we need function merge
; param is by = 'key_name'
, NOT on
.
merge(hero, publisher, by = "publisher", all = TRUE)
publisher | name | alignment | gender | yr_founded |
---|---|---|---|---|
DC | Batman | good | male | 1934 |
DC | Joker | bad | male | 1934 |
DC | Catwoman | bad | female | 1934 |
Dark Horse Comics | Hellboy | good | male | NA |
Image | NA | NA | NA | 1992 |
MarvelDuplicate | Magneto | bad | male | 1939 |
MarvelDuplicate | Magneto | bad | male | 8888 |
MarvelDuplicate | Storm | good | female | 1939 |
MarvelDuplicate | Storm | good | female | 8888 |
if not all
content is kept, it becomes inner-join, again.
merge(hero, publisher, by = "publisher")
publisher | name | alignment | gender | yr_founded |
---|---|---|---|---|
DC | Batman | good | male | 1934 |
DC | Joker | bad | male | 1934 |
DC | Catwoman | bad | female | 1934 |
MarvelDuplicate | Magneto | bad | male | 1939 |
MarvelDuplicate | Magneto | bad | male | 8888 |
MarvelDuplicate | Storm | good | female | 1939 |
MarvelDuplicate | Storm | good | female | 8888 |
Anti-Join #
Give the rows that only exist on left-table
hero[!publisher, on = "publisher"]
name | alignment | gender | publisher |
---|---|---|---|
Hellboy | good | male | Dark Horse Comics |
publisher[!hero, on = "publisher"]
publisher | yr_founded |
---|---|
Image | 1992 |
Cross-Join #
Give the multiplied rows like matrix-multiplication.
nrow(new_table) := nrow(left_table) * nrow(right_table)
Semi-Join (Only Keep Half Side Content) #
index = unique(hero[publisher, on = "publisher", which=TRUE, allow.cartesian=TRUE])
hero[index]
name | alignment | gender | publisher |
---|---|---|---|
Batman | good | male | DC |
Joker | bad | male | DC |
Catwoman | bad | female | DC |
Magneto | bad | male | MarvelDuplicate |
Storm | good | female | MarvelDuplicate |
NA | NA | NA | NA |
you can add nomatch = 0
Self-Join #
Self-join has not been implemented for DT yet. A for loop
or some other methods are necessary.
See self-join definition here.
See examples & solutions for DT here
ref:
data source, which is modified here.
devtools::session_info()
## Session info --------------------------------------------------------------
## setting value
## version R version 3.3.1 (2016-06-21)
## system x86_64, mingw32
## ui RTerm
## language (EN)
## collate English_United Kingdom.1252
## tz Europe/Kaliningrad
## date 2016-10-11
## Packages ------------------------------------------------------------------
## package * version date source
## assertthat 0.1 2013-12-06 CRAN (R 3.1.1)
## chron 2.3-47 2015-06-24 CRAN (R 3.1.3)
## codetools 0.2-15 2016-10-05 CRAN (R 3.3.1)
## data.table * 1.9.6 2015-09-19 CRAN (R 3.1.3)
## DBI 0.5-1 2016-09-10 CRAN (R 3.3.1)
## devtools 1.12.0 2016-06-24 CRAN (R 3.3.1)
## digest 0.6.10 2016-08-02 CRAN (R 3.3.1)
## dplyr * 0.5.0 2016-06-24 CRAN (R 3.3.1)
## evaluate 0.10 2016-10-11 CRAN (R 3.3.1)
## formatR 1.4 2016-05-09 CRAN (R 3.3.1)
## highr 0.6 2016-05-09 CRAN (R 3.3.1)
## htmltools 0.3.5 2016-03-21 CRAN (R 3.1.3)
## knitr 1.14 2016-08-13 CRAN (R 3.3.1)
## magrittr 1.5 2014-11-22 CRAN (R 3.3.1)
## memoise 1.0.0 2016-01-29 CRAN (R 3.3.1)
## R6 2.2.0 2016-10-05 CRAN (R 3.3.1)
## Rcpp 0.12.7 2016-09-05 CRAN (R 3.3.1)
## readr * 1.0.0 2016-08-03 CRAN (R 3.3.1)
## rmarkdown 1.0.9010 2016-08-26 Github (rstudio/rmarkdown@c2dfc53)
## stringi 1.1.2 2016-10-01 CRAN (R 3.3.1)
## stringr 1.1.0 2016-08-19 CRAN (R 3.3.1)
## tibble 1.2 2016-08-26 CRAN (R 3.3.1)
## withr 1.0.2 2016-06-20 CRAN (R 3.3.1)
## yaml 2.1.13 2014-06-12 CRAN (R 3.1.1)
tibble 1.2 2016-08-26 CRAN (R 3.3.1)