R DT data.table Join

R DT data.table Join

2016-10-11. Category & Tags: DT, R, Data.table, Join

This is part of PML notes. HDD: r_data_table_start

Prepare Data #

library(dplyr)
library(readr)
library(data.table)
hero = "
name,       alignment, gender,   publisher
Magneto,    bad,       male,     MarvelDuplicate
Storm,      good,      female,   MarvelDuplicate
Batman,     good,      male,     DC
Joker,      bad,       male,     DC
Catwoman,   bad,       female,   DC
Hellboy,    good,      male,     Dark Horse Comics
"
hero = read_csv(hero, trim_ws = TRUE, skip = 1)
hero = data.table(hero)

publisher = "
publisher,   yr_founded
DC,              1934
MarvelDuplicate, 1939
MarvelDuplicate, 8888
Image,           1992
"
publisher = read_csv(publisher, trim_ws = TRUE, skip = 1)
publisher = data.table(publisher)

For all the join commands, if a key is set for both dt, then on = 'key_col_name' can be elided.

WARN: row orders are not guaranteed


Left-Join & Inner-Join #

left-join will keep all left-side table rows, including no-matched rows

by default, nomatch cells := NA

hero[publisher, on = "publisher"]
name alignment gender publisher yr_founded
Batman good male DC 1934
Joker bad male DC 1934
Catwoman bad female DC 1934
Magneto bad male MarvelDuplicate 1939
Storm good female MarvelDuplicate 1939
Magneto bad male MarvelDuplicate 8888
Storm good female MarvelDuplicate 8888
NA NA NA Image 1992

nomatch = 0 means delete no-matched rows (becomes inner-join)

hero[publisher, nomatch = 0, on = "publisher"]
name alignment gender publisher yr_founded
Batman good male DC 1934
Joker bad male DC 1934
Catwoman bad female DC 1934
Magneto bad male MarvelDuplicate 1939
Storm good female MarvelDuplicate 1939
Magneto bad male MarvelDuplicate 8888
Storm good female MarvelDuplicate 8888

Right-Join & Inner, all = TRUE #

right-join will keep all right-side table rows, unfortunately, we have to swap two data.table’s to make a right-join (using left-join format)

publisher[hero, on = "publisher"]
publisher yr_founded name alignment gender
MarvelDuplicate 1939 Magneto bad male
MarvelDuplicate 8888 Magneto bad male
MarvelDuplicate 1939 Storm good female
MarvelDuplicate 8888 Storm good female
DC 1934 Batman good male
DC 1934 Joker bad male
DC 1934 Catwoman bad female
Dark Horse Comics NA Hellboy good male

nomatch = 0 means delete no-matched rows (becomes inner-join again, with row & col orders may diff from the left-join’s inner-join)


Outer-/Full-Join & Inner-Join #

Outer-(or full-) join will keep both tables content, with extra NA’s

we need function merge; param is by = 'key_name', NOT on.

merge(hero, publisher, by = "publisher", all = TRUE)
publisher name alignment gender yr_founded
DC Batman good male 1934
DC Joker bad male 1934
DC Catwoman bad female 1934
Dark Horse Comics Hellboy good male NA
Image NA NA NA 1992
MarvelDuplicate Magneto bad male 1939
MarvelDuplicate Magneto bad male 8888
MarvelDuplicate Storm good female 1939
MarvelDuplicate Storm good female 8888

if not all content is kept, it becomes inner-join, again.

merge(hero, publisher, by = "publisher")
publisher name alignment gender yr_founded
DC Batman good male 1934
DC Joker bad male 1934
DC Catwoman bad female 1934
MarvelDuplicate Magneto bad male 1939
MarvelDuplicate Magneto bad male 8888
MarvelDuplicate Storm good female 1939
MarvelDuplicate Storm good female 8888

Anti-Join #

Give the rows that only exist on left-table

hero[!publisher, on = "publisher"]
name alignment gender publisher
Hellboy good male Dark Horse Comics
publisher[!hero, on = "publisher"]
publisher yr_founded
Image 1992

Cross-Join #

Give the multiplied rows like matrix-multiplication.

nrow(new_table) := nrow(left_table) * nrow(right_table)

cross join

ref


Semi-Join (Only Keep Half Side Content) #

index = unique(hero[publisher, on = "publisher", which=TRUE, allow.cartesian=TRUE])
hero[index]
name alignment gender publisher
Batman good male DC
Joker bad male DC
Catwoman bad female DC
Magneto bad male MarvelDuplicate
Storm good female MarvelDuplicate
NA NA NA NA

you can add nomatch = 0


Self-Join #

Self-join has not been implemented for DT yet. A for loop or some other methods are necessary.

See self-join definition here.

See examples & solutions for DT here


ref:

data source, which is modified here.

semi-Join

SQL Join, chart explained

devtools::session_info()
## Session info --------------------------------------------------------------

##  setting  value
##  version  R version 3.3.1 (2016-06-21)
##  system   x86_64, mingw32
##  ui       RTerm
##  language (EN)
##  collate  English_United Kingdom.1252
##  tz       Europe/Kaliningrad
##  date     2016-10-11

## Packages ------------------------------------------------------------------

##  package    * version  date       source
##  assertthat   0.1      2013-12-06 CRAN (R 3.1.1)
##  chron        2.3-47   2015-06-24 CRAN (R 3.1.3)
##  codetools    0.2-15   2016-10-05 CRAN (R 3.3.1)
##  data.table * 1.9.6    2015-09-19 CRAN (R 3.1.3)
##  DBI          0.5-1    2016-09-10 CRAN (R 3.3.1)
##  devtools     1.12.0   2016-06-24 CRAN (R 3.3.1)
##  digest       0.6.10   2016-08-02 CRAN (R 3.3.1)
##  dplyr      * 0.5.0    2016-06-24 CRAN (R 3.3.1)
##  evaluate     0.10     2016-10-11 CRAN (R 3.3.1)
##  formatR      1.4      2016-05-09 CRAN (R 3.3.1)
##  highr        0.6      2016-05-09 CRAN (R 3.3.1)
##  htmltools    0.3.5    2016-03-21 CRAN (R 3.1.3)
##  knitr        1.14     2016-08-13 CRAN (R 3.3.1)
##  magrittr     1.5      2014-11-22 CRAN (R 3.3.1)
##  memoise      1.0.0    2016-01-29 CRAN (R 3.3.1)
##  R6           2.2.0    2016-10-05 CRAN (R 3.3.1)
##  Rcpp         0.12.7   2016-09-05 CRAN (R 3.3.1)
##  readr      * 1.0.0    2016-08-03 CRAN (R 3.3.1)
##  rmarkdown    1.0.9010 2016-08-26 Github (rstudio/rmarkdown@c2dfc53)
##  stringi      1.1.2    2016-10-01 CRAN (R 3.3.1)
##  stringr      1.1.0    2016-08-19 CRAN (R 3.3.1)
##  tibble       1.2      2016-08-26 CRAN (R 3.3.1)
##  withr        1.0.2    2016-06-20 CRAN (R 3.3.1)
##  yaml         2.1.13   2014-06-12 CRAN (R 3.1.1)

tibble 1.2 2016-08-26 CRAN (R 3.3.1)

withr 1.0.2 2016-06-20 CRAN (R 3.3.1) #

yaml 2.1.13 2014-06-12 CRAN (R 3.1.1) #