Posts Tagged ‘paper’

A study of duplicate citations in Medline

Monday, January 21st, 2008

In the latest issue of Bioinformatics, there’s a paper on duplicated publications:

Déjà vu—A study of duplicate citations in Medline

M. Errami et al.

Motivation: Duplicate publication impacts the quality of the scientific corpus, has been difficult to detect, and studies this far have been limited in scope and size. Using text similarity searches, we were able to identify signatures of duplicate citations among a body of abstracts.

Results: A sample of 62 213 Medline citations was examined and a database of manually verified duplicate citations was created to study author publication behavior. We found that 0.04% of the citations with no shared authors were highly similar and are thus potential cases of plagiarism. 1.35% with shared authors were sufficiently similar to be considered a duplicate. Extrapolating, this would correspond to 3500 and 117 500 duplicate citations in total, respectively.

They have gone text mining looking for significant (textual) overlap between papers, spotting both cases of plagiarism and of duplicated papers from the same authors.
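To give a feel for what a text similarity search over abstracts can look like, here is a minimal sketch. It is not the method used in the paper; it swaps in a simple Jaccard similarity over three-word shingles, and the function names, the shingle length `k` and the `threshold` value are all assumptions made for illustration.

```python
import re


def shingles(text, k=3):
    """Return the set of k-word shingles (overlapping word windows) in text."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b, k=3):
    """Jaccard similarity of the shingle sets of two texts, between 0 and 1."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa and not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def flag_pairs(abstracts, threshold=0.5, k=3):
    """Return (i, j, score) for every pair scoring at or above threshold.

    The threshold is an arbitrary illustrative value, not one from the paper.
    """
    hits = []
    for i in range(len(abstracts)):
        for j in range(i + 1, len(abstracts)):
            score = jaccard(abstracts[i], abstracts[j], k)
            if score >= threshold:
                hits.append((i, j, score))
    return hits
```

Comparing all pairs like this is quadratic in the number of abstracts, so it only works for small collections. Anything at Medline scale needs an index or a search engine in front of it, and flagged pairs still have to be verified by hand, as the authors did.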

Both situations are unethical. Plagiarism is plain old stealing: a scientist’s ideas are his or her most important contributions, so having someone else steal them is probably the worst thing that can happen. At least there were only a few cases of plagiarism.

Duplicated publications are just annoying. It is annoying to discover halfway through a paper that I have already read it in a different journal. Of course, if it is an exact duplication I will discover it earlier than halfway through, but on several occasions the paper has been somewhat rewritten while the results are exactly the same as in a previous paper. The cited paper finds that 1.35% of the citations are duplications, but how well their text mining spots duplicated results in slightly rewritten papers, I don’t know.

In any case, they only compare abstracts, and I don’t remember a case where I have spotted a duplication based on the abstract.

If you want to browse their discovered duplications, you can find their database here.

My own duplications

I have a few duplications myself, I must admit, but except for one case (which I’ll get back to below), those are journal Special Issue versions of conference contributions. What happens there is that a subset of the conference contributions are selected for journal publication (in most cases in a slightly extended version).

In such cases, where it is blindingly obvious that it is a duplication of a conference paper (the journal makes that very explicit), I don’t see any problems with duplications. The Bioinformatics paper agrees. Quoting from the paper:

While some duplications may be justified, arguably to promote wider dissemination or to provide important updates to clinical trials, surreptitious duplications that are covert and do not properly acknowledge the original work are unethical.

The last case of duplicated publications for me is the two papers

Algorithms for Computing the Quartet Distance between Trees of Arbitrary Degree
C. Christiansen, T. Mailund, C.N.S. Pedersen, and M. Randers
Proceedings of Workshop on Algorithms in Bioinformatics (WABI), 2005, LNBI 3692, pp. 77-88 © Springer-Verlag.

Quartet Distance between General Trees (extended abstract)
C. Christiansen, T. Mailund, C.N.S. Pedersen, and M. Randers
Proceedings of International Conference on Numerical Analysis and Applied Mathematics (ICNAAM) 2005, pp. 796-799 © Wiley-VCH Verlag GmbH & Co.

and there is a bit of a story behind this.

We first submitted to WABI, but then discovered an error in the paper that we couldn’t fix: it was in the time analysis of one of the algorithms, where we had claimed O(n²) but couldn’t get below O(n²d²). So we retracted the paper from WABI, fixed the analysis, and submitted to ICNAAM where it got accepted as well. The retraction was ignored, however, despite several emails to the PC chairs, so in the end we had to submit a final version. Since the ICNAAM version is just an extended abstract and the WABI paper is full length, we figured we could justify this, but it is a bit borderline, I think.

Acceptable duplication?

Determining if a paper is a duplicate based on text similarity alone is a bit unsafe, of course. I tend to describe the problems I am working on, related work, consequences, etc. in similar terms from paper to paper. I try to avoid phrasing it the same way, but that is hard, and I know of several cases where the introduction sections of my papers read very similarly.

I personally don’t see a problem in this, if the results presented are novel, but I guess it is a bit borderline as well.

Smallest publishable increment

Something that annoys me more than duplicated publications, though, is papers describing tiny increments on existing results. These papers mean that you have to read 4-5 papers to get the information that could easily be contained in a single paper.

Now, some of this is unavoidable. If the authors get an idea after the original idea is published (this has happened to me a few times), the choice is either never to publish, or to publish a minor increment. But with some authors (I could name names but I won’t), more than half the papers are tiny increments to previous ideas. This tells me that they either publish way too early, or that they willfully try to get as many publications out of as little thinking as they can.

Why bother?

Why would you publish the same results twice, or publish tiny increments?

It will boost the number of publications, but who cares about that? Even the silliest bureaucrats have figured out that what matters is impact.

If you want to boil impact down to a single number, so you can reduce the quality of a researcher to something that is easily measured, you don’t use the number of publications. You pick something like the h-index or such. There, the number of publications matters, but only if people cite them. You are better off with 10 papers cited 10 times each than with 50 papers cited 5 times each.
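The arithmetic behind that last claim is easy to check. The h-index is the largest h such that at least h of your papers have at least h citations each; the sketch below computes it from a list of citation counts (the function name and the two example lists are mine, chosen to match the numbers above).

```python
def h_index(citations):
    """Largest h such that at least h papers have at least h citations each."""
    h = 0
    # Walk through the papers from most cited to least cited.
    for rank, cites in enumerate(sorted(citations, reverse=True), start=1):
        if cites >= rank:
            h = rank
        else:
            break
    return h


few_well_cited = [10] * 10      # 10 papers cited 10 times each
many_rarely_cited = [5] * 50    # 50 papers cited 5 times each

print(h_index(few_well_cited))     # 10
print(h_index(many_rarely_cited))  # 5
```

The second list has two and a half times as many citations in total, yet half the h-index, because no paper in it is cited more than 5 times.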

Duplicating publications doesn’t lead to greater impact. Citing the Bioinformatics paper again:

In the Duplicate/DA category, however, we observed that duplications were predominantly in journals with no impact factor and that these articles were rarely cited. If the primary value of a publication is to disseminate scientific findings and knowledge, it is not accomplished by publications in this category, so one must question the intent of the author of a Duplicate/DA publication.

In short, duplicated publications will not increase the impact, so why bother?


The citation, for Research Blogger:
Errami, M., Hicks, J.M., Fisher, W., Trusty, D., Wren, J.D., Long, T.C., Garner, H.R. (2007). Deja vu A study of duplicate citations in Medline. Bioinformatics, 24(2), 243-249. DOI: 10.1093/bioinformatics/btm574

On Recombination Induced Multiple and Simultaneous Coalescent Events

Friday, December 28th, 2007

We just published a new paper. The paper concerns a problem that Jotun’s been working on since he and Carsten Wiuf published some results on the distribution of ancestral material of a present day sample back in time. Jo Davies worked on it as a summer student project years back, and last year we returned to the problem when Frank Simancik did a summer student project.

On Recombination Induced Multiple and Simultaneous Coalescent Events

J. Davies, F. Simancik, R. Lyngsø, T. Mailund, and J. Hein

Genetics 177: 2151–2160 (2007). doi:10.1534/genetics.107.071126

Abstract: Coalescent Theory is almost ubiquitous in contemporary molecular population genetics. Inherent in most applications is a continuous time approximation that assumes sample size is small relative to the actual population size. This assumption in effect precludes simultaneous and multiple coalescent events, which can constitute an arbitrarily large component when sample size is sufficiently large. In most situations this is justifiably ignored as a large sample size will only have few ancestors a couple of generations back and then the assumption is valid. However, in tracing the evolutionary history of large chromosomal segments, a large recombination rate will consistently keep the number of ancestors large such that multiple and simultaneous coalescent events cannot be ignored. This can create a major disparity between discrete time and continuous time models and we here show its importance illustrated with parameters typical of the human genome. The presence of gene conversion only aggravates its importance. This could seriously undermine the application of coalescent theory to complete genomes. However, it can be shown that multiple and simultaneous coalescent events influences global quantities, such as total number of ancestors, but has negligible effect on local quantities, such as linkage disequilibrium or similarities of close local trees. Reassuringly the majority of applications of coalescent models with recombination are based on local quantities for purposes such as association mapping.

What is the problem?

If you sample DNA from present day individuals and then trace its history back in time you will see coalescent events and recombination events. Coalescent events occur when two lines, as we trace them back in time, join (or coalesce). Moving forward in time, this corresponds to a cell dividing and eventually producing two siblings that are both ancestors of individuals in our present day sample. Recombination events occur when a single line, moving back in time, splits into two. Moving forward in time, this corresponds to two lines combining in a chromosomal recombination.

This process can be modelled mathematically, and the theory for this is called coalescent theory. A nice introduction can be found in the book by Jotun, Mikkel and Carsten: Gene Genealogies, Variation and Evolution: A Primer in Coalescent Theory ISBN 978-0198529965.

The mathematical process is, of course, an approximation to the real process. The real process is probably too complex to model mathematically, and even if it were possible to model it, the mathematics would be too complex to give any insight into the process in any case.

However, one approximation made in the mathematical model is potentially problematic. The model assumes that coalescent events and recombination events occur so rarely, on the time scale considered, that two essentially never occur at the same time.

For a small sample in a large population, this assumption is justified. The probability of multiple events at the same time is essentially zero. When the sample size is on the same order as the population size, however, the assumption is no longer valid.
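A small simulation makes the point concrete. The sketch below is not the simulator from the paper; it assumes a plain discrete-generation model in which each line picks its parent uniformly at random among `pop_size` individuals, ignores recombination, and counts how often more than one coalescence happens in the same generation. All names and parameter values are made up for illustration.

```python
import random
from collections import Counter


def events_one_generation(n_lineages, pop_size, rng):
    """Let each line pick a parent uniformly at random and classify the outcome.

    Returns "none" if all parents are distinct, "single" if exactly two lines
    share a parent, and "multiple" otherwise (several pairs merging at once,
    or three or more lines merging into one parent).
    """
    parents = Counter(rng.randrange(pop_size) for _ in range(n_lineages))
    merged = n_lineages - len(parents)  # number of lines lost this generation
    if merged == 0:
        return "none"
    if merged == 1:
        return "single"
    return "multiple"


def multiple_event_fraction(n_lineages, pop_size, generations=1000, seed=1):
    """Fraction of independent generations with multiple or simultaneous events."""
    rng = random.Random(seed)
    counts = Counter(
        events_one_generation(n_lineages, pop_size, rng)
        for _ in range(generations)
    )
    return counts["multiple"] / generations


if __name__ == "__main__":
    for n in (10, 100, 1000, 5000):
        print(n, multiple_event_fraction(n, pop_size=10_000, generations=200))
```

With 10 lines in a population of 10,000 the fraction should be close to zero, while with thousands of lines essentially every generation contains multiple events, which is exactly where the continuous time approximation stops being trustworthy.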

This hasn’t really been a major issue, since even a large sample coalesces into a small one in a time that is very short compared to the time it takes for the entire process to run.

That is, if we ignore recombination events!

Recombination events produce new lines, as we move back in time, just as coalescent events remove lines. If we consider ancestral material, which is the DNA we sampled at present day, the coalescent events will eventually win and reduce the material such that each nucleotide is only found in a single line. If we also consider non-ancestral material, DNA that belonged to an ancestor of our sample but that did not get passed on to the present day sample, then we reach an equilibrium between coalescent events and recombination events that keeps several lines moving back in time.

It is this situation that Carsten and Jotun considered in their paper

The ancestry of a sample of sequences subject to recombination [pdf]

C. Wiuf and J. Hein

Genetics 151: 1217-1228 (1999).

and as it turns out, it is possible for the number of lines to remain large, compared to the population size, if only the recombination rate is sufficiently high. In fact, the number of lines can be larger than the population size!
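A back-of-the-envelope calculation shows how this can happen. The sketch below is my own illustration, not a calculation from either paper. It assumes the usual continuous time rates: with k lines, coalescence removes a line at rate k(k-1)/2 and recombination adds one at rate k·ρ/2, where ρ = 4Nr, N is the diploid population size, r is the recombination rate per generation for the whole segment, and time is measured in units of 2N generations. The two rates balance at k = ρ + 1. The parameter values in the example are illustrative only.

```python
def equilibrium_lineages(rho):
    """Number of lines k at which coalescence and recombination rates balance.

    Assumes coalescence at rate k(k-1)/2 and recombination at rate k*rho/2.
    Setting the two equal and dividing by k/2 gives k - 1 = rho.
    """
    return rho + 1.0


def compare_to_population(n_diploid, r_per_generation):
    """Return (balance point, number of gene copies, whether lines exceed copies)."""
    rho = 4 * n_diploid * r_per_generation
    k = equilibrium_lineages(rho)
    gene_copies = 2 * n_diploid
    return k, gene_copies, k > gene_copies


if __name__ == "__main__":
    # Illustrative values: a short gene-sized segment versus a segment with
    # about one crossover per generation, both with N = 10,000.
    print(compare_to_population(10_000, 1e-4))
    print(compare_to_population(10_000, 1.0))
```

Under these assumptions the balance point exceeds the 2N gene copies in the population as soon as r is above roughly one half, so a chromosome-sized segment is enough to push the continuous time model into predicting more lines than there are individuals to carry them.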

This sounds like a major problem with the theory, but it isn’t really. It is just applying the theory to a part of parameter space where essential assumptions are no longer valid.

If we consider single genes, as the theory intended, the recombination rate is low and there is no problem with the theory. If we start considering entire chromosomes, however, we enter the parameter space where the theory breaks down!

What is the result?

We considered this problem and simulated the process both when allowing multiple events (using a simpler, but computationally slower, method) and when assuming that they do not occur.

[Figure: LD table]

[Figure: Number of lineages back in time]

The model that allows multiple events changes the equilibrium behaviour of the system. The number of lines, as we trace them back in time, changes, and we no longer end up in the strange situation of having more lines than individuals in the population.

Local properties, however, such as the phylogenies at individual nucleotides and the linkage disequilibrium (statistical relatedness of nucleotides), are not affected by allowing multiple events. This is the good news. It means that the models we have used when developing association mapping tools are just as valid as they have always been.