Main purpose of the package
The main purpose of this packgae is twofold:
- To easily translate coded variables into plain text using
standardised dictionaries.
- To provide an infrastructure to create such dictionaries.
The first purpose is assumed to dominate and the package therefore
includes several dictionaries compatible with data from INCA and
Rockan.
Example for “decode”
Assume we have a dataset with some (more or less) typical
RCC-data:
set.seed(12345)
n <- 6
s <- function(x) sample(x, n, replace = TRUE)
incadata <-
data.frame(
KON_VALUE = s(1:2),
region = 1:6,
a_icdo3 = c("C446", "C749", "C159", "C709", "C475", "C320"),
a_tstad = c("0", "1", "1a", "1b", "1c", "2")
)
knitr::kable(incadata)
2 |
1 |
C446 |
0 |
1 |
2 |
C749 |
1 |
2 |
3 |
C159 |
1a |
2 |
4 |
C709 |
1b |
2 |
5 |
C475 |
1c |
2 |
6 |
C320 |
2 |
Manually specified decodings
decode
is a generic S3-function that accepts R-object of
different kinds. The default method takes an atomic vector with codes
and a specification of how to translate these codes (see
?decode
).
library(decoder)
decode(incadata$KON_VALUE, "kon")
## [1] "Kvinna" "Man" "Kvinna" "Kvinna" "Kvinna" "Kvinna"
decode(incadata$a_tstad, "t_rtr")
## [1] "T0" "T1" "T1a" "T1b" "T1c" "T2"
See ?key_value_data
for a list of all available keyvalue
objects included in the package.
The whole data.frame at once
There is also a method for data.frames, where translations are made
automatically for columns with recognised names.
incadata_decoded <- decode(incadata)
## New decoded columns added:
## * KON_VALUE_kon_beskrivning
## * region_region_beskrivning
knitr::kable(incadata_decoded)
2 |
1 |
C446 |
0 |
Kvinna |
Region Sthlm/Gotland |
1 |
2 |
C749 |
1 |
Man |
Region Uppsala/Örebro |
2 |
3 |
C159 |
1a |
Kvinna |
Region Sydöstra |
2 |
4 |
C709 |
1b |
Kvinna |
Region Syd |
2 |
5 |
C475 |
1c |
Kvinna |
Region Väst |
2 |
6 |
C320 |
2 |
Kvinna |
Region Norr |
All the original columns from incadata are preserved but are now
accompanied by some additional columns with identical names suffixed by
“_beskrivning”. (“_Beskrivning” with capital “B” will be used if there
are already some column names with suffixes “_Beskrivning” or “_Värde”
in the data, otherwise lower case variable names are generally preffered
in all the rcc packages).
A note of caution
To automatically transform all columns of a data.frame can be
“dangerous” if the data.frame includes a column with arbitrary data but
with a name just coincidentally beeing recognised as a standard variable
name. It is therefore recommended to use this function mostly in
interactive mode and to always look at the message specifying which
columns that were automatically detected.
It should however also be noted that the variable names are hard
coded and have to be excact (although case differnecies are ignored).
Hence, the variable “a_icdo3” were not recognised (even though its name
could have been easily matched by a regular expression:
grepl("icdo3", "a_icdo3")
.
Several keyvalues for the same code
Note that the same coded variable might be decoded by more than one
keyvalue-object, for example the “a_icdo3”-variable is automatically
decoded by the “icdo3” keyvalue object (see ?icdo3
) with a
dictionary from Rockan. ICD-O3 however, can also be seen as a “developed
subset” of ICD10, wherefore:
decode(incadata$a_icdo3, "icd10se")
## [1] "Malign tumör i huden på övre extremiteten inklusive skuldran"
## [2] "Ospecificerad lokalisation av malign tumör i binjure"
## [3] "Ospecificerad lokalisation av malign tumör i esofagus"
## [4] "Ospecificerad lokalisation av malign tumör i centrala nervsystemets hinnor"
## [5] "Malign tumör i perifera nerver i bäckenet"
## [6] "Malign tumör i glottis"
“icd10”, can also decode non oncological desease codes:
decode(c("D448A", "T009", "F182", "S134C", "C131"), "icd10se")
## [1] "Tumör med multiglandulär lokalisation, typ I"
## [2] "Multipla ytliga skador, ospecificerade"
## [3] "Psykiska störningar och beteendestörningar orsakade av flyktiga lösningsmedel-Beroendesyndrom"
## [4] "Whiplash-skada, WAD III"
## [5] "Malign tumör i aryepiglottiska vecket, hypofaryngeala delen"
Exact versus non exact decoding
So far, we have looked only on cases where the code can be littaraly
translated by the keyvalue object. There is however a third argument to
the decode function, exact
, which is FALSE
by
default. It is often useful to allow some automatic transformation for
variables that we know are coded values but that are not exactly the
same as in the keyvalue object. The most common application is probably
when we want to extract the municipality names from an LKF-variable.
This can obviosly be done by:
x <- c(178405, 138408, 108202, 128706, 048005)
y <- as.numeric(substring(as.character(x), 1, 4))
decode(y, "kommun", exact = TRUE)
## Warning: Some codes could not be translated (4800)
## [1] "Arvika" "Kungsbacka" "Karlshamn" "Trelleborg" NA
But it is also possible to use the original six digit code as is:
## Warning: transformed to match the keyvalue: Only the first 4 characters are
## used.
## Warning: Some codes could not be translated (48005)
## [1] "Arvika" "Kungsbacka" "Karlshamn" "Trelleborg" NA
Note here that the last value decodes to NA
in both
cases. It would be preffered to have the original codes stored as
character, since that preserves the leading 0 in “048005”:
decode("048005", "kommun")
## Warning: transformed to match the keyvalue: Only the first 4 characters are
## used.
## [1] "Nyköping"
This can also be taken care of by reading data through the
use_incadata
function (see
?incadata::use_incadata
), which always classifies numbers
with leading zeros as characters (and not numeric) (but this will not be
covered in this vignette).
Note that use of exact = FALSE
(which is the default for
convenience) always produce warnings when some transformations are done.
It is always recommended to manually confirm that these transformations
are in fact accurate. Unexpected results might otherwise occur!
More than one decoded variable from the same code
Using non exact decoding, the same code might be simultaneously
decoded by more that one keyvalue object using the data.frame
method:
df <-
data.frame(
LKF = c("149804", "147104", "012704", "143505", "126502", "232602")
)
knitr::kable(suppressWarnings(decode(df)))
## New decoded columns added:
## * LKF_forsamling_beskrivning
## * LKF_hemort_beskrivning
## * LKF_lan_beskrivning
## * LKF_kommun_beskrivning
149804 |
NA |
Velinga |
Västra Götalands län |
Tidaholm |
147104 |
NA |
Holmestad |
Västra Götalands län |
Götene |
012704 |
NA |
Salem |
Stockholms län |
Botkyrka |
143505 |
NA |
Mo |
Västra Götalands län |
Tanum |
126502 |
NA |
Björka |
Skåne län |
Sjöbo |
232602 |
Hackås |
Hackås |
Jämtlands län |
Berg |
Consult ?hemort
and ?forsamling
to invest
the differnce between these two variables (hint: a lot have happened
with Swedish administrative geography since 1958).
Example for “code” (the opposite of decode)
In the simplest situation (with a 1:1 relation between the code and
its value), code
is just the inverse of
decode
:
code(c("Karlskrona", "Göteborg", "Härnösand"), "kommun")
## [1] "1080" "1480" "2280"
decode
does however also allow translation with m:1
dictionaries such as snomed
(which gives a differnt result
than snomed3
):
non_unique <- c(90703, 90723, 96153, 99403, 89643, 90443)
decode(non_unique, "snomed")
## [1] "Embryonalt carcinom" "Embryonalt carcinom" "Hårcellsleukemi"
## [4] "Hårcellsleukemi" "Klarcellssarkom" "Klarcellssarkom"
This decoding can not be inverted:
code(decode(non_unique, "snomed"), "snomed")
## key value
## 393 90703 Embryonalt carcinom
## 402 90723 Embryonalt carcinom
## 592 96153 Hårcellsleukemi
## 787 99403 Hårcellsleukemi
## 354 89643 Klarcellssarkom
## 381 90443 Klarcellssarkom
## Error in code(decode(non_unique, "snomed"), "snomed"): Values above have a non unique relation to their key!
This restriction only applies when the requested values are de facto
non unique. It is not tied to the keyvalue object as such.
unique <- c(
"Mucinöst kystadenom (kystom) borderline-typ ( 2 B)",
"Medullärt carcinom, atypiskt",
"Mb Paget non invasiv och intraduktal",
"Lymfangiosarcom, misst",
"Fibröst histiocytom, malignt",
"Ganglioneurom",
"Carcinoid (exkl appendix)",
"Langerhans cell histiocytos, UNS, Histiocytosis X, UNS",
"Brenner tumör, malign",
"Lymfatisk leukemi, UNS")
code(unique, "snomed")
## [1] "84723" "85133" "85432" "91701" "88303" "94900" "82403" "97511" "90003"
## [10] "98203"
The underlying machinery
In all examples above the so called keyvalue object were specified by
a quoted name (such as “kon
” or “region
”
etcetera). These names refer to internal package objects that can not be
lazy loaded from the package but which can still be accesed by the
tripple colon notation:
## key value
## 1 1 Man
## 2 2 Kvinna
## key value
## 1 1 Region Sthlm/Gotland
## 2 2 Region Uppsala/Örebro
## 3 3 Region Sydöstra
## 4 4 Region Syd
## 5 5 Region Väst
## 6 6 Region Norr
These objects might look like ordinary data.frames but they do have
some extra attributes as described by ?as.keyvalue
.
## $names
## [1] "key" "value"
##
## $row.names
## [1] 1 2
##
## $standard_var_names
## [1] "kon_value" "kön" "kon" "sex"
##
## $keyvalue11
## [1] TRUE
##
## $class
## [1] "keyvalue" "data.frame"
The keyvalue11 attribute indicates that “code” can be used
as an inverse of “decode” for translations using this keyvalue obejct,
hence without the problems as described for snomed
. The
standard_var_names attribute provides a list with known names
for the coded version of this variable when found in a data.frame (used
by decode.data.frame
).
A complete list of all standard variable names can also be found in
ALL_STANDARD_VAR_NAMES
(internal object):
knitr::kable(head(decoder:::ALL_STANDARD_VAR_NAMES))
avgm |
avgm |
ben |
ben |
digr |
digr |
dödca |
dodca |
dodca |
dodca |
figo |
figo |
The key-column gives names of variables that are recognised and the
value column gives the name of the corresponding keyvalue object to use
for its decoding.
Examples how to create keyvalue objects
It might sometimes come handy to create your own keyvalue objects for
a specific project. The S3-generic as.keyvalue
has several
methods for this.
For small dictionaries, a named vector might be sufficient:
(kv <- as.keyvalue(c("car" = 1, "bike" = 2, "bus" = 3)))
## key value
## 1 1 car
## 2 2 bike
## 3 3 bus
x <- s(1:3)
decode(x, kv)
## [1] "bike" "car" "bus" "bike" "bike" "bus"
Longer dictionaries might be more convenient to read from disk and
coerce to keyvalue objects using the data.frame method. Dictionaries
with a lot of values for every key, can also be specified by a list
method:
ex <- list(
fruit = c("banana", "orange", "kiwi"),
car = c("SAAB", "Volvo", "taxi", "truck"),
animal = c("elephant")
)
knitr::kable(as.keyvalue(ex))
4 |
SAAB |
car |
5 |
Volvo |
car |
1 |
banana |
fruit |
8 |
elephant |
animal |
3 |
kiwi |
fruit |
2 |
orange |
fruit |
6 |
taxi |
car |
7 |
truck |
car |