$column == 1, ] data[data
Error in data$column: object of type 'closure' is not subsettable
August 19, 2023
I wrote an R function to help identify variable names that already exist as function names, like in c <- 1
or head <- "x"
.
Naming things is hard, yes, but data
is a short and sensible choice for a dataframe, right?
Oh, silly me, I tried to subset a dataframe called data
without actually, y’know, creating it first.
This is a classic stumbling block in R. In short, there’s already a function in base R called data()
(!) and I ended up trying subset it. But you can’t subset a function, hence the error.
Here’s what happens if you subset a non-existent object that has a name that’s different to any existing functions:
‘Object not found’ is a much more helpful error message.
So it’s not a big deal, but using existing function names as variable names is a code smell. Especially if they’re frequently used functions from base R like head()
, str()
, paste()
, etc1.
But R doesn’t stop you from using these names. In general, R is pretty loose with variable naming, though you can’t use a small set of reserved words like TRUE
, if
or NA
2.
For example, here we can call the c()
function to see its (very short) definition. But using it as a variable name obscures the function definition.
function (...) .Primitive("c")
[1] 1
Can we write a generic function to identify if some code contains ‘bad’ variable names in this way?
Of course. I’ve made a function called find_var_names()
. I’m certain the functionality already exists; consider this a thought experiment.
You provide (a) a string of code to evaluate3 and (b) a vector of names to avoid. The code is parsed with getparsedata(parse())
to identify variable names4. It checks for a SYMBOL
token followed by the assignment operators <-
or =
5, or preceded by an assignment operator in the case of ->
6 (i.e. *_ASSIGN
tokens). These variable names are then compared to the set of names provided.
find_var_names <- function(code_string, names_to_find) {
# Parse the string of code to identify R 'tokens'
parsed <- getParseData(parse(text = code_string, keep.source = TRUE))
parsed <- parsed[parsed$text != "", ]
# Identify subsequent tokens (to help find 'x' in x <- 1 and x = 1)
parsed$next_token <-
c(parsed$token[2:nrow(parsed)], NA_character_)
# Identify prior token (to help find 'x' in 1 -> x)
parsed$last_token <-
c(NA_character_, parsed$token[1:nrow(parsed) - 1])
# Identify variable names with left-assignment
lassign <-
parsed[parsed$token == "SYMBOL" & grepl("ASSIGN", parsed$next_token), ]
# Identify row index for variable names following right-assignment
rassign_i <-
which(parsed$token == "RIGHT_ASSIGN" & parsed$next_token == "SYMBOL") + 1
# Filter for right-assigned variable names
rassign <- parsed[rassign_i, ]
# Combine the results and sort by location
var_names <- rbind(lassign, rassign)
var_names <- var_names[sort(row.names(var_names)), ]
# Filter for variable names that are in the provided names list
var_names[var_names$text %in% names_to_find, !grepl("_token", names(var_names))]
}
So, let’s say we have this snippet of R code7 below. It uses some variable names that are already function names, as well as each flavour of assignment.
And here’s a function that grabs the base packages and the function names within. This is what we’ll use as our ‘no-go’ variable names. You could expand this to include other names, like function names from the tidyverse, for example.
get_base_functions <- function() {
base_names <- sessionInfo()$basePkgs
base_pkgs <- paste0("package:", base_names)
lapply(base_pkgs, ls) |> unlist() |> unique() |> sort()
}
tail(get_base_functions())
[1] "xyTable" "xyz.coords" "xzfile" "yinch" "zapsmall"
[6] "zip"
Aside: this function uses a little hack. It specifically grabs the attached base packages from the sessionInfo()
listing. There are other base and ‘recommended’ packages that are actually not attached from the start of your session; see the Priority
value from the output of installed.packages()
.
Now we can run the function to check the code for the list of function names.
naughty_words <- find_var_names(
code_string = demo_code,
names_to_find = get_base_functions()
)
naughty_words
line1 col1 line2 col2 id parent token terminal text
12 3 1 3 4 12 14 SYMBOL TRUE head
3 2 1 2 4 3 5 SYMBOL TRUE data
33 4 8 4 9 33 35 SYMBOL TRUE df
66 6 11 6 11 66 68 SYMBOL TRUE c
The output is what you normally get from getparsedata(parse())
, filtered for the illegal names. Helpfully it shows you the exact row and column indices for where the string exists in the code you provided.
And of course you can just isolate the offenders.
So the variable names a
and b
in demo_code
were ignored because they’re not function names in base R. And the in-built data sets beaver1
and chickwts
were also ignored, because they’re not being used as variable names. And yes, df
—a commonly-used variable name for dataframes—is also a function!
I probably won’t use this function in real life, but maybe the concepts are interesting to you or you can tell me about a linter that does this already.
At least for now, object of type ‘Matthew’ is not upsettable.
Last rendered: 2023-08-22 20:39:44 BST
R version 4.3.1 (2023-06-16)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Ventura 13.2.1
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.11.0
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
time zone: Europe/London
tzcode source: internal
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] htmlwidgets_1.6.2 compiler_4.3.1 fastmap_1.1.1 cli_3.6.1
[5] tools_4.3.1 htmltools_0.5.5 rstudioapi_0.15.0 yaml_2.3.7
[9] rmarkdown_2.23 knitr_1.43.1 jsonlite_1.8.7 xfun_0.39
[13] digest_0.6.33 rlang_1.1.1 evaluate_0.21
Please note that this post is not a subtweet. I’ve read a bunch of code recently—including my own!—that uses variable names in this way.↩︎
Although the more nefarious among you will know you can put just about anything in backticks and it can be a legit variable name. So `TRUE` <- FALSE
will work, but you’ll have to supply `TRUE`
with the backticks to use it.↩︎
Exercise for the reader: have the function accept script files from a connection, not just as a string. I didn’t bother for this silly demo.↩︎
If you can be parsed, I’ve written about this before.↩︎
If you haven’t already expunged any files containing equals assignment.↩︎
I’ll have to update this in future to work with down-assignment arrows.↩︎
This is an ‘R string’, introduced in R version 4.0.0. It deals with escaping certain characters and quotes within quotes so that you don’t have to. So "x <- "y""
will error but r"(x <- "y")"
will return "x <- \"y\""
. You can use symbols other than parentheses, such as curly braces, if your expression already contains parentheses itself.↩︎