library(gh) # CRAN v1.2.0
library(purrr) # CRAN v0.3.4
tl;dr
Nearly 10 per cent of the commits to this blog’s source involve typo fixes, according to a function I wrote to search commit messages via the {gh} package.
Note
Great news everyone, I improved. I re-rendered this post in July 2023 and the percentage had basically halved to 5%.
Not my typo
I’m sure you’ve seen consecutive Git commits from jaded developers like ‘fix problem’, ‘actually fix problem?’, ‘the fix broke something else’, ‘burn it all down’. Sometimes a few swear words will be thrown in for good measure (look no further than ‘Developers Swearing’ on Twitter).
The more obvious problem from reading the commits for this blog is my incessant keyboard mashing; I think a lot of my commits are there to fix typos.[1]
So I’ve prepared a little R function to grab the commit messages for a specified repo and find the ones that contain a given search term, like ‘typo’.[2]
Search commits
{gh} is a handy R package from Gábor Csárdi, Jenny Bryan and Hadley Wickham that we can use to interact with GitHub’s REST API.[3] We can also use {purrr} for iterating over the returned API object.
So, here’s one way of forming a function to search commit messages:
search_commits <- function(owner, repo, string = "typo") {

  # Fetch every commit for the repo (paginates until all are returned)
  commits <- gh::gh(
    "GET /repos/{owner}/{repo}/commits",
    owner = owner, repo = repo,
    .limit = Inf
  )

  # Extract the message from each commit in the response
  messages <- purrr::map_chr(
    commits,
    ~purrr::pluck(.x, "commit", "message")
  )

  # Keep the messages that contain the search string, ignoring case
  matches <- messages[grepl(string, messages, ignore.case = TRUE)]

  out <- list(
    meta = list(owner, repo),
    counts = list(
      match_count = length(matches),
      commit_count = length(messages),
      match_ratio = length(matches) / length(messages)
    ),
    matches = matches,
    messages = messages
  )

  return(out)

}
First we pass a GET request to the GitHub API via gh::gh(). The API documentation tells us the form needed to get commits for a given owner’s repo.
Beware: the API returns results in batches of some maximum size, but the .limit = Inf argument automatically creates additional requests until everything is returned. That might mean a lot of API calls.
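If you're just testing things out, you could cap the number of records rather than ask for everything. Here's a minimal sketch, assuming a hypothetical helper called get_some_commits() and an arbitrary cap of 200 (neither is part of the original function). It also does some rough arithmetic on how many requests a full fetch implies, going by the batches of 100 shown in the query output.

```r
# Hypothetical helper (not from the original post): fetch at most
# `max_commits` commits so a trial run only makes a few API requests.
get_some_commits <- function(owner, repo, max_commits = 200) {
  gh::gh(
    "GET /repos/{owner}/{repo}/commits",
    owner = owner, repo = repo,
    .limit = max_commits  # a number caps the records; Inf asks for all
  )
}

# Rough count of requests needed for 1870 commits at 100 records per batch
requests_needed <- ceiling(1870 / 100)
requests_needed
```

Nothing is requested until the helper is actually called, which needs {gh} installed and, for anything beyond light use, a GitHub token.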
Next we can use {purrr} to iteratively pluck() out the commit messages from the list returned by gh::gh(). It’s then a case of finding which ones contain a search string of interest (defaulting to the word ‘typo’).
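To see the extract-then-filter step without touching the API, here's a small sketch on a mocked-up response. The mock list is an assumption: it keeps only the nested "commit" and "message" fields that the function relies on, plus a made-up sha, whereas a real response carries many more fields. It uses base R's vapply() as a stand-in for the purrr::map_chr() and purrr::pluck() combination.

```r
# A mocked-up stand-in for what gh::gh() returns: one list element per
# commit, with the message nested under "commit". All values are invented.
mock_commits <- list(
  list(sha = "a1", commit = list(message = "Fix typo in intro")),
  list(sha = "b2", commit = list(message = "Add new post")),
  list(sha = "c3", commit = list(message = "Correct Typos, tidy links"))
)

# Base-R equivalent of the purrr::map_chr() + purrr::pluck() step
mock_messages <- vapply(
  mock_commits,
  function(x) x[["commit"]][["message"]],
  character(1)
)

# Keep only the messages that contain the search string, ignoring case
mock_matches <- mock_messages[grepl("typo", mock_messages, ignore.case = TRUE)]
mock_matches
```

The ignore.case = TRUE argument is what lets ‘Typos’ with a capital T through.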
The object returned by search_commits() is a list with four elements: meta repeats the user and repo names; counts is a list with the commit count, the count of messages containing the search term, and their ratio; and the messages and matches elements contain all messages and the ones containing the search term, respectively.
Fniding my typoes
Here’s an example where I look for commit messages to this blog that contain the word ‘typo’. Since the function contains the .limit = Inf argument in gh::gh(), we’ll get an output message for each separate request that’s been made to the API.
blog_typos <- search_commits("matt-dray", "rostrum-blog")
ℹ Running gh query
ℹ Running gh query, got 100 records of about 1900
ℹ Running gh query, got 200 records of about 1900
ℹ Running gh query, got 300 records of about 1900
ℹ Running gh query, got 400 records of about 1900
ℹ Running gh query, got 500 records of about 1900
ℹ Running gh query, got 600 records of about 1900
ℹ Running gh query, got 700 records of about 1900
ℹ Running gh query, got 800 records of about 1900
ℹ Running gh query, got 900 records of about 1900
ℹ Running gh query, got 1000 records of about 1900
ℹ Running gh query, got 1100 records of about 1900
ℹ Running gh query, got 1200 records of about 1900
ℹ Running gh query, got 1300 records of about 1900
ℹ Running gh query, got 1400 records of about 1900
ℹ Running gh query, got 1500 records of about 1900
ℹ Running gh query, got 1600 records of about 1900
ℹ Running gh query, got 1700 records of about 1900
ℹ Running gh query, got 1800 records of about 1900
Here’s a preview of the structure of the returned object. You can see how it’s a list that contains the values and other list elements that we expected.
str(blog_typos)
List of 4
$ meta :List of 2
..$ : chr "matt-dray"
..$ : chr "rostrum-blog"
$ counts :List of 3
..$ match_count : int 95
..$ commit_count: int 1870
..$ match_ratio : num 0.0508
$ matches : chr [1:95] "Improve text, correct typos, add cheatcode to hiscore post" "Fix typo that also made it into a Mastodon post, lol" "Correct typo in games post" "Improve readability of parse post, add renkun post, fix typos" ...
$ messages: chr [1:1870] "Re-build README.Rmd" "Remove non-existent anchor from hiscore post" "Improve text, correct typos, add cheatcode to hiscore post" "Re-build README.Rmd" ...
You can see there were 1870 commit messages returned, of which 95 contained the string ‘typo’. That’s 5 per cent.
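As a quick check on that figure, here's the arithmetic as a sketch. The two counts are copied by hand from the str() output above, so this runs without calling the API.

```r
# Values copied from the str() output of blog_typos
counts <- list(match_count = 95L, commit_count = 1870L)

# Recompute the ratio and express it as a percentage to one decimal place
match_ratio <- counts$match_count / counts$commit_count
match_percent <- round(100 * match_ratio, 1)
match_percent
```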
Here’s a sample[4] of those commit messages that contained the word ‘typo’:
set.seed(1337)
sample(blog_typos$matches, 5)
[1] "Fix potatypos"
[2] "Merge pull request #72 from maelle/patch-1\n\ntypo fix"
[3] "Correct typos"
[4] "Correct typo"
[5] "add gapminder example, fix typo"
It seems the typos are often corrected with general improvements to a post’s copy. This usually happens when I read the post the next day with fresh eyes and groan at my ineptitude.[5]
Exposing others
I think typos are probably most often referenced in repos that involve a lot of documentation, or a book or something.
To make myself feel better, I had a quick look at the repo for the {bookdown} project R for Data Science by Hadley Wickham and Garrett Grolemund.
typos_r4ds <- search_commits("hadley", "r4ds")
The result:
str(typos_r4ds)
List of 4
$ meta :List of 2
..$ : chr "hadley"
..$ : chr "r4ds"
$ counts :List of 3
..$ match_count : int 450
..$ commit_count: int 2137
..$ match_ratio : num 0.211
$ matches : chr [1:450] "fix: typo (add missing `to`) (#1529)" "Fix typos in subsection \"6.3.2 How does pivoting work?\" (#1534)\n\n* Add missing word\r\n\r\n* Fix typo" "typo fix in communication.qmd (#1523)" "Typo: \"a new\" instead of \"an new\" (#1515)" ...
$ messages: chr [1:2137] "Small format for column (#1522)\n\nspecies column name is missing back ticks in this reference" "fix: typo (add missing `to`) (#1529)" "Use dplyr 1.1 'default' parameter in 'case_when()' (#1525)\n\n* Use dplyr 1.1 'default' parameter in 'case_when"| __truncated__ "Update arrow chapter code to avoid errors (#1517)\n\n* Add in `col_types` to specify schema\r\n\r\n* Just use open_dataset()" ...
Surprise: typos happen to all of us. I’m guessing the percentage is quite high because the book has a lot of readers scouring it, finding small issues and providing quick fixes.
In other words
Of course, you can change the string argument of search_commits() to find terms other than the default ‘typo’. Use your imagination.
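Because the string is handed to grepl(), it's treated as a regular expression, so one pattern can cover several related terms. Here's a small sketch on some made-up commit messages (invented for illustration, not taken from any repo):

```r
# Made-up commit messages, for illustration only
example_messages <- c(
  "Fix typo in README",
  "Correct spelling of 'necessary'",
  "Add plot to penguin post",
  "Fix misspelt function name"
)

# The pattern is a regular expression, so "|" means "or"
pattern <- "typo|spell|spelt"
example_hits <- example_messages[
  grepl(pattern, example_messages, ignore.case = TRUE)
]
example_hits
```

The same pattern could be passed as the string argument of search_commits().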
Here’s a meta example: messages containing emoji in the commits to the {emo} package by Hadley Wickham, Romain François and Lucy D’Agostino McGowan.
Emoji are expressed in commit messages like :dog:, so we can capture them with a relatively simple regular expression like ":.*:" (match wherever there are two colons with anything in between).
emo_emoji <- search_commits("hadley", "emo", ":.*:")
ℹ Running gh query
ℹ Running gh query, got 100 records of about 200
str(emo_emoji)
List of 4
$ meta :List of 2
..$ : chr "hadley"
..$ : chr "emo"
$ counts :List of 3
..$ match_count : int 21
..$ commit_count: int 112
..$ match_ratio : num 0.188
$ matches : chr [1:21] "need emo:: prefix in that case, bc ji_glue might be called without emo being attached. ping @batpigandme" "rm emoji keyboard (saved in separate branch) but eventually might just go in a separate :package:" "emo::ji_rx a meta regex to catch all emojis. closes #14" "bring in some extra modules (for emo::ji_rx)" ...
$ messages: chr [1:112] "Imports CRAN glue (#54)" "no longer importing dplyr. #24" "less dependency on dplyr" "clock no longer depends on dplyr" ...
Only 19 per cent? Son, I am disappoint.
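Even that figure may be generous. The pattern ":.*:" also matches the double colon in things like emo::ji_rx, which you can see among the matches above. Here's a sketch comparing it with a tighter pattern on three of those messages. The tighter pattern is my assumption about what a shortcode looks like (lowercase letters, digits, underscores, plus signs and hyphens between two colons), so treat it as a starting point rather than a definitive emoji detector.

```r
# Three of the matched messages shown in the str() output of emo_emoji
emo_messages <- c(
  "need emo:: prefix in that case, bc ji_glue might be called without emo being attached. ping @batpigandme",
  "rm emoji keyboard (saved in separate branch) but eventually might just go in a separate :package:",
  "emo::ji_rx a meta regex to catch all emojis. closes #14"
)

loose  <- ":.*:"             # the pattern used in the post
strict <- ":[a-z0-9_+-]+:"   # assumed shortcode shape, e.g. :dog: or :+1:

loose_hits  <- grepl(loose, emo_messages)
strict_hits <- grepl(strict, emo_messages)

# Compare which messages each pattern flags
data.frame(loose_hits, strict_hits)
```

Only the message containing :package: survives the stricter pattern here; the two that mention emo:: drop out.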
Environment
Session info
Last rendered: 2023-07-17 22:22:24 BST
R version 4.3.1 (2023-06-16)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Ventura 13.2.1
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.11.0
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
time zone: Europe/London
tzcode source: internal
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] purrr_1.0.1 gh_1.4.0
loaded via a namespace (and not attached):
[1] digest_0.6.31 R6_2.5.1 fastmap_1.1.1 xfun_0.39
[5] fontawesome_0.5.1 magrittr_2.0.3 rappdirs_0.3.3 glue_1.6.2
[9] knitr_1.43.1 gitcreds_0.1.2 htmltools_0.5.5 rmarkdown_2.23
[13] lifecycle_1.0.3 cli_3.6.1 vctrs_0.6.3 compiler_4.3.1
[17] rstudioapi_0.15.0 tools_4.3.1 curl_5.0.1 evaluate_0.21
[21] httr2_0.2.3 yaml_2.3.7 rlang_1.1.1 jsonlite_1.8.7
[25] htmlwidgets_1.6.2
Footnotes
1. Yes, I’m aware of Git hooks and various GitHub Actions that could prevent this.
2. Though obviously you’ll miss messages containing the word ‘typo’ if you have a typo in the word ‘typo’ in one of your commits…
3. I used it most recently in my little {ghdump} package for downloading or cloning a user’s repos en masse.
4. Very rarely do I make myself laugh, but I had forgotten that I used the commit message ‘Fix potatypos’ when correcting typos in the post about the {potato} package, lol. Also thank you to Maëlle, who fixed a typo on my behalf!
5. I wonder how many typos I’ll need to correct in this post after publishing. (Edit: turns out I accidentally missed a couple of words, lol.)