Exploring ACM
Almost a year ago, a friend and I ran an experiment: a visualization of the citation relationships between papers published in the Association for Computing Machinery (ACM) digital library. Each paper cites and is cited by other papers, so we represented these relationships as a graph where each node is a paper and each edge is a citation from one paper to another.
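To make the graph idea concrete, here is a minimal Python sketch of how such a structure could be stored. It is an illustration rather than the code from the experiment: the function name `build_citation_graph` and the paper ids are made up, and it assumes citations arrive as simple (citing, cited) pairs.

```python
from collections import defaultdict


def build_citation_graph(citations):
    """Build two adjacency maps from (citing_id, cited_id) pairs.

    Returns (cites, cited_by):
      cites[p]    -> set of papers that p cites (outgoing edges)
      cited_by[p] -> set of papers that cite p (incoming edges)
    """
    cites = defaultdict(set)
    cited_by = defaultdict(set)
    for citing, cited in citations:
        cites[citing].add(cited)
        cited_by[cited].add(citing)
    return cites, cited_by


# Hypothetical paper ids, for illustration only.
sample = [("A", "B"), ("A", "C"), ("B", "C")]
cites, cited_by = build_citation_graph(sample)

print(sorted(cites["A"]))     # ['B', 'C']
print(sorted(cited_by["C"]))  # ['A', 'B']
```

Keeping both directions makes it cheap to answer "what does this paper cite?" and "who cites this paper?", which are the two questions a visualization like this needs for every node.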
To make it happen, my friend Javier crawled the ACM library and got banned lots of times. He had to restart his modem so often that we had to “borrow” some computers at our university to run the crawler and get the info. After two days we had the disappointing amount of 10,000 articles. We expected many more, but the ACM anti-crawling rules limited us to fetching only 2 articles per minute. Anyway, they were enough to play with, so I made a REST API and created a server to store the information and serve it to a web interface… the result was pretty cool!
You can search for an article, or an author.
Try “Bayesian” or “Policarpo”.
Things got interesting when we discovered a couple of inconsistencies. At first we thought it was a bug in our crawler, or an error somewhere in our code. But it was not. Our code was good; they were authentic errors in ACM's library. This is what we found:
Articles that cite themselves
As you can see at the bottom of this post, ACM’s explanation involves errors in the Optical Character Recognition (OCR) program they use to extract the references from each article. However, I wonder how this could happen through a mere OCR mistake. A few examples:
Articles that reference each other
How about that! Such a thing should not be possible: two papers cannot each be based on the other. I was quite surprised by how common this is, though. We detected more than 100 occurrences in our small 14k dataset. Examples here:
This is the ACM Digital Library note, mentioned above:
OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
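For the curious, both kinds of inconsistency described in this post are easy to check for once the citations are stored as edges. The following is an illustrative Python sketch, not the code from the experiment: the function names and the sample paper ids are hypothetical, and it assumes every citation is a (citing, cited) pair of ids.

```python
def find_self_citations(edges):
    """Return the sorted ids of papers that list themselves as a reference."""
    return sorted({citing for citing, cited in edges if citing == cited})


def find_mutual_citations(edges):
    """Return sorted pairs (a, b) where a cites b and b also cites a.

    Each pair is reported once, with the smaller id first.
    Self-citations are excluded; they are handled separately.
    """
    edge_set = set(edges)
    pairs = set()
    for citing, cited in edge_set:
        if citing != cited and (cited, citing) in edge_set:
            pairs.add((min(citing, cited), max(citing, cited)))
    return sorted(pairs)


# Hypothetical edges, for illustration only.
sample_edges = [
    ("p1", "p2"),
    ("p2", "p1"),  # p1 and p2 cite each other
    ("p3", "p3"),  # p3 cites itself
    ("p1", "p4"),
]

print(find_self_citations(sample_edges))    # ['p3']
print(find_mutual_citations(sample_edges))  # [('p1', 'p2')]
```

Putting the edges in a set makes each reverse-edge lookup constant time on average, so the whole check is a single pass over the data, which is more than fast enough for a dataset of a few thousand articles.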
So that is all. We made this experiment in 3 days as part of a small but very cool competition organized by a small yet innovative company called Edis.