A study of duplicate citations in Medline

In the latest issue of Bioinformatics, there’s a paper on duplicated publications:

Déjà vu—A study of duplicate citations in Medline

M. Errami et al.

Motivation: Duplicate publication impacts the quality of the scientific corpus, has been difficult to detect, and studies thus far have been limited in scope and size. Using text similarity searches, we were able to identify signatures of duplicate citations among a body of abstracts.

Results: A sample of 62 213 Medline citations was examined and a database of manually verified duplicate citations was created to study author publication behavior. We found that 0.04% of the citations with no shared authors were highly similar and are thus potential cases of plagiarism. 1.35% with shared authors were sufficiently similar to be considered a duplicate. Extrapolating, this would correspond to 3500 and 117 500 duplicate citations in total, respectively.

They have gone text mining looking for significant (textual) overlap between papers, spotting both cases of plagiarism and of duplicated papers from the same authors.

Both situations are unethical. Plagiarism is plain old stealing: a scientist’s ideas are his or her most important contribution, so having someone else steal those ideas is probably the worst thing that can happen. At least there were only a few cases of plagiarism.

Duplicated publications are just annoying. It is frustrating to discover, halfway through a paper, that I have already read it in a different journal. Of course, if it is an exact duplicate I will discover it earlier than halfway through, but on several occasions the paper was somewhat rewritten while the results were exactly the same as in a previous paper. The cited study finds that 1.35% of citations with shared authors are duplicates, but how well their text mining spots duplicated results in slightly rewritten papers, I don’t know.

In any case, they only compare abstracts, and I don’t remember a case where I have spotted a duplication based on the abstract.
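The paper’s pipeline is more elaborate than this, but the core idea of scoring textual overlap between abstracts can be sketched with a simple bag-of-words cosine similarity (a stand-in for illustration, not their actual method):

```python
import math
from collections import Counter

def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity between two texts, using bag-of-words counts."""
    wa, wb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(wa[w] * wb[w] for w in wa)
    norm = lambda counts: math.sqrt(sum(v * v for v in counts.values()))
    na, nb = norm(wa), norm(wb)
    return dot / (na * nb) if na and nb else 0.0

# Identical abstracts score 1.0; unrelated ones score near 0.
print(cosine_similarity("duplicate publication in medline",
                        "duplicate citations in medline"))
```

A measure like this catches verbatim and near-verbatim duplication but, as noted above, it will score a carefully rewritten paper with identical results much lower.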

If you want to browse their discovered duplications, you can find their database here.

My own duplications

I have a few duplications myself, I must admit, but except for one case (which I’ll get back to below), those are journal Special Issue versions of conference contributions. What happens there is that a subset of the conference contributions is selected for journal publication (in most cases in a slightly extended version).

In such cases, where it is blindingly obvious that the paper is a duplicate of a conference paper (the journal makes that very explicit), I don’t see any problem with duplications. The Bioinformatics paper agrees; quoting from the paper:

While some duplications may be justified, arguably to promote wider dissemination or to provide important updates to clinical trials, surreptitious duplications that are covert and do not properly acknowledge the original work are unethical.

The one remaining case of duplicated publication for me is the two papers

Algorithms for Computing the Quartet Distance between Trees of Arbitrary Degree
C. Christiansen, T. Mailund, C.N.S. Pedersen, and M. Randers
Proceedings of Workshop on Algorithms in Bioinformatics (WABI), 2005, LNBI 3692, pp. 77-88 © Springer-Verlag.

Quartet Distance between General Trees (extended abstract)
C. Christiansen, T. Mailund, C.N.S. Pedersen, and M. Randers
Proceedings of International Conference on Numerical Analysis and Applied Mathematics (ICNAAM) 2005, pp. 796-799 © Wiley-VCH Verlag GmbH & Co.

and there is a bit of a story behind this.

We first submitted to WABI, but then discovered an error in the paper that we couldn’t fix: in the time analysis of one of the algorithms we had claimed O(n²) but couldn’t get below O(n²d²). So we retracted the paper from WABI, fixed the analysis, and submitted to ICNAAM, where it was accepted as well. The retraction was ignored, however, despite several emails to the PC chairs, so in the end we had to submit a final version. Since the ICNAAM version is just an extended abstract and the WABI paper is full length, we figured we could justify this, but it is a bit borderline, I think.

Acceptable duplication?

Determining whether a paper is a duplicate based on text similarity alone is a bit unsafe, of course. I tend to describe the problems I am working on, related work, consequences, etc. in similar terms from paper to paper. I try to avoid phrasing things the same way, but it is hard not to, and I know of several cases where the introduction sections of my papers read very similarly.

I personally don’t see a problem with this, as long as the results presented are novel, but I guess it is a bit borderline as well.

Smallest publishable increment

Something that annoys me more than duplicated publications, though, is papers describing tiny increments on existing results. These papers mean that you have to read four or five papers to get information that could easily be contained in a single one.

Now, some of this is unavoidable. If the authors get an idea after the original idea is published (this has happened to me a few times), the choice is either never to publish, or to publish a minor increment. But with some authors (I could name names, but I won’t) more than half the papers are tiny increments on previous ideas. This tells me that they either publish way too early, or that they willfully try to get as many publications out of as little thinking as they can.

Why bother?

Why would you publish the same results twice, or publish tiny increments?

It will boost the number of publications, but who cares about that? Even the silliest bureaucrats have figured out that what matters is impact.

If you want to boil impact down to a single number, so you can reduce the quality of research to something easily measured, you don’t use the number of publications. You pick something like the h-index. There, the number of publications matters, but only if people cite them. You are better off with 10 papers cited 10 times each than with 50 papers cited 5 times each.
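The point is easy to check against the definition of the h-index (the largest h such that h of your papers have at least h citations each); a quick sketch:

```python
def h_index(citations):
    """Largest h such that h papers have at least h citations each."""
    cites = sorted(citations, reverse=True)
    h = 0
    for rank, c in enumerate(cites, start=1):
        if c >= rank:
            h = rank
        else:
            break
    return h

# 10 papers cited 10 times each beat 50 papers cited 5 times each:
print(h_index([10] * 10))  # 10
print(h_index([5] * 50))   # 5
```

Past the first five papers, the extra 45 lightly cited papers contribute nothing to the index at all.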

Duplicating publications doesn’t lead to greater impact. Citing the Bioinformatics paper again:

In the Duplicate/DA category, however, we observed that duplications were predominantly in journals with no impact factor and that these articles were rarely cited. If the primary value of a publication is to disseminate scientific findings and knowledge, it is not accomplished by publications in this category, so one must question the intent of the author of a Duplicate/DA publication.

In short, duplicated publications will not increase the impact, so why bother?

The citation, for Research Blogger:
Errami, M., Hicks, J.M., Fisher, W., Trusty, D., Wren, J.D., Long, T.C., Garner, H.R. (2007). Déjà vu: A study of duplicate citations in Medline. Bioinformatics, 24(2), 243-249. DOI: 10.1093/bioinformatics/btm574


Brian Vinter pointed this press release out to me: Denmark Creates Network for Gene Sequencing.

CLC bio and several prominent Danish research institutions have established SEQNET — a national network for developing a unique software platform for the analysis of data from the next generation sequencing technologies. The platform will integrate groundbreaking bioinformatics algorithms with a user-friendly and graphical user interface.

Apparently, my good old friend Roald Forsberg is involved:

Senior Scientific Officer at CLC bio, Dr. Roald Forsberg, states, “Next generation sequencing technologies, like 454, Solexa, or SOLiD are pushing a revolution in genetic analysis. Their massive throughput has given rise to a plethora of novel applications for DNA sequencing and has dramatically increased the ambitions of existing projects. However, handling the large amounts of fragmented data presents a great bioinformatics challenge to be dealt with before researchers can get the full value of these new technologies. Since DNA sequencing is becoming omnipresent in research we believe that the answer to this challenge is a unified next generation sequencing platform. In this network, we will make such a platform come together by combining our unique capacities for producing graphical user interfaces, algorithms and high performance computing solutions with the expertise of Denmark’s foremost researchers in the field.”

We have talked about the problems involved in dealing with data from the new high-throughput sequencing technologies a couple of times over lunch, but it seems Roald is moving faster than I am here. Good for him!

I look forward to seeing where this leads!

Digital Urban Living

Browsing the Danish Research Council’s homepage (searching for some info on my own grant, but failing at that), I stumbled upon this press release (in Danish, sorry). A large project on digital urban living will run in Aarhus for the next four years.

I hadn’t heard anything about this until today. That really shows how much I’m out of the loop these days. Since I started doing “Real Science” I haven’t been keeping track of what is going on in IT and computer science in town.

Sarcasm on!

Anyway, reading the press release, it looks like a lot of the “usual suspects” among visions in pervasive computing: mobile phones for news browsing (local news, in this case, probably because they want a digital Aarhus and not just any old digital urban living), mobile phones finding the closest restaurants when you go out, etc.

Let’s ignore for a second that I can already do all that with any smart phone. I am sure there are more visions than that…

Well, one thing they mention that smart phones cannot do is houses changing colour according to the weather. My brick house sort of changes colour between sunny days, cloudy days, and rainy days, but all in shades of yellow. I am sure changing between red and green is an improvement.

Sarcasm off!

Ok, the press release is a bit daft on the concrete examples, and the remaining examples are too vague to comment on. That doesn’t mean the project is crap, though. Boiling things down to the length of a press release is bound to dumb it down a bit.

I look forward to hearing more about the project: concrete examples of what they plan to do, what is in it for me, how digital living will change my life. Smart phones, the ubiquity of laptop computers, and wireless networks have changed our lives, so there is certainly a potential for IT to change the way we live.

How will this project add to this? Would the money be better spent just providing free wireless Internet downtown? ;-)

The budget is DKK 43.5 million, so it is well funded, and there are a lot of collaborators in it, so it will be interesting to see where it leads.

More links, but all in Danish, here:

23andMe explains it so much better than me

I tried to explain my main research area in a previous post, but 23andMe even uses animations! See the animations at ScienceRoll.

23andMe is one of the new personalized genomics companies that popped up late last year. Another is deCODEme at deCODE, the company we collaborate with in the PolyGene project.

I haven’t really made up my mind about these personal genomics companies yet. Not that I think there is anything unethical about them; it is the science I cannot make up my mind about. Technologically, it is cool that you can type a million tagSNPs for $1000 (with the current value of the dollar, that is essentially free), but how much can you really use the information for?

Tracking genealogy can be done with some accuracy (I am not sure exactly how much), but I am very sceptical about the disease-risk predictions. The genetic factors of lifestyle diseases that we know about confer so little relative risk that calculating it for individuals as part of a genetic profile seems a bit dodgy to me.

Knowing that a particular genetic variant increases disease risk ever so slightly tells us about the underlying biology of the disease, and that information is important. Making medical decisions based on an individual’s genotype, when the relative risk is a tiny fraction of the environmental risks we already know about, is just plain silly.

If you know that smoking increases your risk of cancer dramatically, but you don’t stop smoking, are you likely to benefit from knowing that your genes increase your risk from 1% to 1.1%?
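The arithmetic behind that last point is trivial (using the illustrative 1% and 1.1 figures from the question above):

```python
def absolute_risk(baseline, relative_risk):
    """Absolute risk implied by a baseline risk and a relative risk."""
    return baseline * relative_risk

# A variant with relative risk 1.1 on a 1% baseline:
risk = absolute_risk(0.01, 1.1)      # 1.1% absolute risk
extra = (risk - 0.01) * 1000         # roughly one extra case per 1000 people
```

One extra case per thousand people is the kind of effect that matters for epidemiology, but it is hard to see it changing any individual’s behaviour.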

No manual entry for fopen

What the f*ck is going on here?

$ man fopen

No manual entry for fopen

How can Ubuntu leave out the most fundamental man pages? The man pages for system calls and for the C library are the most essential pages if you program on UNIX. For pretty much everything else you are better off with Google, but for these?

Can someone please tell me how to get my man pages back?
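A likely fix, assuming the pages are simply not installed rather than misconfigured: on Debian-based systems such as Ubuntu, the section 2 and 3 man pages ship in a separate package, manpages-dev, which is not always installed by default.

```shell
# Install the man pages for system calls (section 2) and the
# C library (section 3) on Debian/Ubuntu:
sudo apt-get install manpages-dev

# After that, this should work again:
man 3 fopen
```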