Hypothesis

5 Matching Annotations

Nov 2015
www.cs.umd.edu

Duplicates.pdf

1
1. robertknight 18 Nov 2015
  
  in Public
  
  Presentation summarizing an approach to duplicate web page detection that was developed by a researcher whilst at Google in the early 2000s
  
  content-similarity duplicate-detection hashing

Tags

hashing

duplicate-detection

content-similarity

Annotators

robertknight

URL

cs.umd.edu/~pugh/google/Duplicates.pdf
Sep 2015
arxiv.org


2
1. robertknight 10 Sep 2015
  
  in Public
  
  Given an LSH family H, the LSH scheme amplifies the gap between the high probability P1 and the low probability P2 by concatenating several functions
  
  Useful recap of LSH
  
  duplicate-detection
2. robertknight 10 Sep 2015
  
  in Public
  
  Recent survey paper for hashing-based approaches to similarity search
  
  duplicate-detection
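The amplification step quoted in the first annotation above can be illustrated with a little arithmetic. This is a minimal sketch, not code from the survey: the function names and the example values for P1, P2 and k are assumptions chosen for illustration. If k functions are drawn independently from the family and two items collide only when all k agree, the collision probabilities become P1^k and P2^k, so the ratio between them grows from P1/P2 to (P1/P2)^k.

```python
def concat_collision_prob(p: float, k: int) -> float:
    """Probability that k independently drawn hash functions all collide,
    given that a single function collides with probability p."""
    return p ** k


def gap_ratio(p1: float, p2: float, k: int) -> float:
    """Ratio between the 'similar' and 'dissimilar' collision probabilities
    after concatenating k functions."""
    return concat_collision_prob(p1, k) / concat_collision_prob(p2, k)


if __name__ == "__main__":
    p1, p2 = 0.8, 0.4  # assumed example values, not taken from the paper
    for k in (1, 2, 4, 8):
        print(k, concat_collision_prob(p1, k), concat_collision_prob(p2, k),
              gap_ratio(p1, p2, k))
```

Concatenation also lowers P1 itself, which is why LSH schemes typically query several independently built hash tables to recover recall; that part is standard LSH practice rather than something stated in the quoted passage.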

Tags

duplicate-detection

Annotators

robertknight

URL

arxiv.org/pdf/1408.2927.pdf
www.csd.uoc.gr

TODS3603-15.dvi

2
1. robertknight 10 Sep 2015
  
  in Public
  
  Section 9 of this paper has a very useful overview of previous work that is worth reading.
  
  duplicate-detection
2. robertknight 10 Sep 2015
  
  in Public
  
  We used the following publicly available real datasets in the experiment
  
  The datasets used are DBLP, ENRON, and UNIREF-4GRAM. All are small (<1M records) in web terms and, I would guess, all have small document sizes.
  
  Given a lengthy paper, one could divide it into smaller documents (1 doc === 1 page) and calculate signatures on a per-page basis. This could bound the search time by limiting the number of pages that need to be rendered to text before the lookup can start.
  
  duplicate-detection
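The per-page idea in the second annotation above might look roughly like the following. This is a sketch under stated assumptions, not the method from the paper: it swaps in a simple MinHash over word shingles as the signature, and `NUM_HASHES`, `shingles`, `minhash_signature`, `page_signatures` and `estimated_similarity` are all hypothetical names. Pages are assumed to arrive as an iterable of already-rendered text, so a lazy renderer only does work for the pages that are actually consumed.

```python
import hashlib
from itertools import islice
from typing import Iterable, Iterator, List, Set

NUM_HASHES = 16  # signature length; an arbitrary choice for illustration


def shingles(text: str, width: int = 3) -> Set[str]:
    """Overlapping word n-grams of one page (the whole page if it is short)."""
    words = text.lower().split()
    if len(words) < width:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + width]) for i in range(len(words) - width + 1)}


def _hash(seed: int, shingle: str) -> int:
    """Seeded 64-bit hash built from SHA-1; stands in for a hash family."""
    digest = hashlib.sha1(f"{seed}:{shingle}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(text: str) -> List[int]:
    """MinHash signature of a single page."""
    items = shingles(text)
    if not items:
        return [0] * NUM_HASHES
    return [min(_hash(seed, s) for s in items) for seed in range(NUM_HASHES)]


def page_signatures(pages: Iterable[str], max_pages: int) -> Iterator[List[int]]:
    """Yield one signature per page, pulling at most max_pages pages
    from the (possibly lazy) page source."""
    for page_text in islice(pages, max_pages):
        yield minhash_signature(page_text)


def estimated_similarity(a: List[int], b: List[int]) -> float:
    """Fraction of matching signature positions (estimates Jaccard similarity)."""
    return sum(x == y for x, y in zip(a, b)) / len(a)
```

With a generator as the page source, `list(page_signatures(render_pages(doc), 2))` would trigger rendering of only the first two pages, where `render_pages` is a hypothetical PDF-to-text step that is not shown here.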

Tags

duplicate-detection

Annotators

robertknight

URL

csd.uoc.gr/~hy562/local_copy/BigData/Finding_Similar_Items/a15-xiao.pdf
