“…(3) Semantic duplication, where pages contain (almost) the same content but express it in different words. Most attention in the past has been given to finding near-duplicate pages [4,6,10,11,12,16,17]. Recently, attention has shifted towards detecting partial replication [7,15,14], but none of the prior work focuses on the origin detection problem.…”