Sunday, 30 July 2017

The other day, a friend and I were wondering how many lines of code we had written.[1] Really, the conversation started with me wondering if I had hit a million lines of R code yet.[2] We guessed that the R code needed for a quick-and-dirty count of the lines of code we had written would be fairly short. We were correct!

The short script

After loading the requisite packages, the following script

  1. Finds all files contained in the directory Users and in all of its subdirectories,
  2. Takes the subset of files that end with .R (R scripts),
  3. Counts the number of lines in each of the .R scripts,
  4. Sums the number of lines from each of the .R scripts.
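
The four steps above can be sketched in base R as follows. This is a minimal sketch, not the original script: the function name count_r_lines is hypothetical, no extra packages are loaded, and counting unreadable files as zero lines is an assumption made here.

```r
# Hypothetical helper: total number of lines across all .R files under root_dir
count_r_lines <- function(root_dir) {
  # 1. Find all files in root_dir and in all of its subdirectories
  all_files <- list.files(root_dir, recursive = TRUE, full.names = TRUE)

  # 2. Keep only the files whose names end with .R
  r_files <- all_files[grepl("\\.R$", all_files)]

  # 3. Count the lines in each .R file (assumption: unreadable files count as 0)
  n_lines <- vapply(r_files, function(f) {
    tryCatch(length(readLines(f, warn = FALSE)), error = function(e) 0L)
  }, integer(1))

  # 4. Sum the per-file counts
  sum(n_lines)
}
```

Called as count_r_lines("/Users"), it would walk the same directory as the original script, whose reported total follows.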
## [1] 95349

Boom! Not bad, but also not a million. Yet.

Note: The list.files() part of the script takes some time: it is essentially finding all of the files on your computer. In my case, the list.files() call on my /Users/ directory returns approximately 3.2 million (full) filenames.

The issues

The main problems with this script result from updates and version control:

  1. If you use “good” file management and version control (e.g. git), then this script will only count the number of lines of code in the most recent version of each of your files. E.g., if you have re-written a file 10,000,000 times, you’ll miss 9,999,999 of the versions (and their lines of code) in your tally. You could probably fix this issue by grabbing the version histories of your R files from GitHub and then finding the unique lines for a given file.
  2. If you copy files to back them up, or if you change one line of a file and then save it with a new name (probably the opposite of “good” file management), then you are going to overcount by a lot.
  3. We miss lines from deleted files.
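
One rough way to soften issue (2) is to borrow the "unique lines" idea from issue (1) and count each distinct line only once across all .R files. The sketch below is an assumption about how that might look, not a tested remedy: count_unique_r_lines is a hypothetical name, and ignoring blank lines and surrounding whitespace is a choice made here. It will undercount lines that legitimately recur in different files, such as a lone closing brace.

```r
# Hypothetical helper: number of distinct non-blank lines across all .R files
count_unique_r_lines <- function(root_dir) {
  # Find the .R files, filtering by name pattern during the listing
  r_files <- list.files(root_dir, pattern = "\\.R$",
                        recursive = TRUE, full.names = TRUE)

  # Read every file and pool the lines into one character vector
  all_lines <- as.character(
    unlist(lapply(r_files, readLines, warn = FALSE), use.names = FALSE)
  )

  # Strip surrounding whitespace, drop blank lines, and count distinct lines
  all_lines <- trimws(all_lines)
  length(unique(all_lines[nzchar(all_lines)]))
}
```

Comparing this number with the plain total gives a crude sense of how much of the tally comes from copied or lightly edited files.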

In my case, I think issue (1) is my biggest problem, but I’ll leave the remedy for future Ed.


  1. Navel gazing, anyone?

  2. I probably have a few years left, depending on the accuracy of this quick script.