Digital content is for copying: quotation, revision, plagiarism, and file sharing all create copies. Document fingerprinting is concerned with accurately identifying copying, including small partial copies, within large sets of documents.
We introduce the class of local document fingerprinting algorithms, which seems to capture an essential property of any fingerprinting technique guaranteed to detect copies. We prove a novel lower bound on the performance of any local algorithm. We also develop winnowing, an efficient local fingerprinting algorithm, and show that winnowing's performance is within 33% of the lower bound. Finally, we also give experimental results on Web data, and report experience with MOSS, a widely-used plagiarism detection service.
Some applications of the algorithms described in this paper have been patented by the Regents of the University of California. See US patent number 6,757,675. However, as the patent states (in claim one), the winnowing technique is only patented insofar as it relates to `comparing the contents of a query document to the content on the World Wide Web'. So there is nothing preventing you from using winnowing to detect spam, incrementally transfer files (eg rsync), or build a new kind of file system.
It is educational to see that the obvious claims one and two of the patent application, 20030120647, do not appear in the issued patent. Perhaps this is because these claims are patented elsewhere. Finally, please note that patenting algorithms is somewhat controversial.
Please note that the code given in Figure 5 of the paper has (at least) two bugs: line 21 should read
for(int i=(r-1+w)%w; i!=r; i=(i-1+w)%w)
for(int i=(r-1)%w; i!=r; i=(i-1+w)%w)
The other mistake is harder to correct: essentially it is an error on line 4 to load the buffer with INT_MAX's. The correct code should load the buffer with next_hash()'s and set both r and min equal to w-1. Finally, inside the main loop (beginning on line 13) the `advance' (the first two lines of the block) should be moved to the end, after the `if-else' statement.
If you have any questions or corrections please contact me via email.
The copyright on this paper is held by the ACM.