Please don’t generate your bibliography

:: academia, research

I read a lot of bibliographies. I also read a lot of bad bibliographies, most of which are clearly autogenerated. I don’t like reading bad bibliographies.

There are two purposes to a bibliography.

The first is to tell the reader which article is being cited. For that, it’s important to make the citation information clear and consistent.

The second is to make it possible for the reader to find the thing being cited. For that, you need enough information to track down the original article. If you have a DOI, that becomes pretty trivial, so most other information can be excluded. If you don’t have a DOI, you may need to call a librarian, and they may need lots of specific information. If you haven’t had to do this, go find an old paper in your field and try to track down a physical copy. It’s a learning experience.

I have a style guide and some rules of thumb I use to try to satisfy these two goals:

For provenance:

  • For all citations, include a DOI or, if no DOI is available, a URL from the most authoritative institution available, the one least likely to fail or change.
  • If no DOI/URL is available, include as much citation information as possible.
  • When DOIs and canonical URLs are available, avoid including things like publisher names and addresses, editor names, etc. This is just clutter.
  • If a URL of dubious longevity is the only one available, try to archive it with https://web.archive.org/.

For clarity:

  • Never use automatically generated bib information as-is; always clean it up and keep it consistent.
    • I always manually replace the conference name with something more legible. The autogenerated one will often also contain the venue location, the date, and the venue’s iteration number. I exclude all that, since most of it is elsewhere in the bib or is irrelevant. You can use BibTeX string constants to make this easy.
    • I exclude the ACM SIGPLAN nonsense for the most well-known venues.
    • I typically exclude editors, who don’t really matter.
    • I typically exclude page numbers, except for journal articles.
    • I typically exclude the publisher, unless there is no DOI or URL. The publisher is useful if you need to track down a copy of the article, but if you have a DOI or authoritative URL, the publisher shouldn’t be necessary.
  • Avoid including both a DOI and a URL for the same entry.
  • Use braces {} to guard any word in a title that needs capitalization, such as {AI} or {CIC}.
  • Never use the SIGPLAN Notices version of a citation. Some versions of ICFP, POPL, etc, papers were also published as “SIGPLAN Notices” journal articles. This can confuse readers. The two versions are the same paper, but have different DOIs and citation information. Make sure to find the conference version, and not the SIGPLAN Notices version. For example, see these two citations:

To implement these rules, I use JabRef. It can pull in the autogenerated information, then I manually clean it up according to my rules. JabRef supports marking fields as optional or required depending on the entry type, so I usually just delete all the optional fields, unless there’s no DOI. I use lots of BibTeX string constants for common venues, to help clean up the BibTeX and keep things consistent.
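
A few of the rules above are mechanical enough that a script could flag violations before the manual cleanup. The following is only a rough sketch, not part of my actual workflow: it assumes each entry has already been parsed into a dictionary of field names to values, and the names lint_entry, unguarded_words, and CLUTTER_FIELDS are hypothetical. The capitalization check is a heuristic that only looks for words with two or more capital letters outside braces.

```python
import re

# Fields treated as clutter once a DOI or URL is present (an assumption
# drawn from the rules above; adjust to taste).
CLUTTER_FIELDS = ("publisher", "address", "editor")


def unguarded_words(title: str) -> list[str]:
    """Return words with two or more capitals that are not inside braces."""
    # Drop innermost {...} groups; deeply nested braces are not handled here.
    unprotected = re.sub(r"\{[^{}]*\}", " ", title)
    return re.findall(r"\b[A-Za-z]*[A-Z][A-Za-z]*[A-Z][A-Za-z]*\b", unprotected)


def lint_entry(fields: dict[str, str]) -> list[str]:
    """Return a list of warnings for one already-parsed bibliography entry."""
    warnings = []
    has_doi = bool(fields.get("doi"))
    has_url = bool(fields.get("url"))
    if not has_doi and not has_url:
        warnings.append("no DOI or URL: keep full citation details")
    if has_doi and has_url:
        warnings.append("both DOI and URL present")
    if has_doi or has_url:
        for name in CLUTTER_FIELDS:
            if fields.get(name):
                warnings.append(f"clutter field: {name}")
    for word in unguarded_words(fields.get("title", "")):
        warnings.append(f"unguarded capitals in title: {word}")
    return warnings


if __name__ == "__main__":
    # Placeholder entry for illustration only; the DOI is not a real one.
    entry = {
        "title": "Verifying AI with {CIC}",
        "doi": "10.1000/placeholder",
        "publisher": "ACM",
    }
    for warning in lint_entry(entry):
        print(warning)
```

A check like this only finds candidates for cleanup. Deciding what a legible venue name looks like, or which URL is most authoritative, still has to be done by hand.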

It’s always disappointing to me when I read a completely autogenerated bibliography. It’s obvious; it looks like slop, and accomplishes neither of the two goals of a bibliography. I was just reading one: no entry had a DOI at all, titles had incorrect capitalization all over the place, and many of the proceedings names included cities and dates that duplicated what was already in the date field.