5 Matching Annotations
  1. Nov 2015
    1. Presentation summarizing an approach to duplicate web page detection that was developed by a researcher whilst at Google in the early 2000s

  2. Sep 2015
  3. arxiv.org
    1. Given an LSH family H, the LSH scheme amplifies the gap between the high probability P1 and the low probability P2 by concatenating several functions

      Useful recap of LSH

    2. Recent survey paper for hashing-based approaches to similarity search
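
A rough sketch of the amplification step quoted above, assuming the concatenated functions are drawn independently from H, so that two items collide on all k of them with probability p^k. The values of P1 and P2 and the function names are illustrative assumptions, not taken from the paper.

```python
def concatenated_collision_probability(p: float, k: int) -> float:
    """Probability that two items agree on all k independently drawn hash functions."""
    return p ** k


def gap_ratio(p1: float, p2: float, k: int) -> float:
    """Ratio between the 'similar' and 'dissimilar' collision probabilities after concatenation."""
    return concatenated_collision_probability(p1, k) / concatenated_collision_probability(p2, k)


if __name__ == "__main__":
    p1, p2 = 0.8, 0.4  # hypothetical P1 (similar pairs) and P2 (dissimilar pairs)
    for k in (1, 2, 4, 8):
        print(
            k,
            concatenated_collision_probability(p1, k),
            concatenated_collision_probability(p2, k),
            gap_ratio(p1, p2, k),
        )
```

Both probabilities shrink as k grows, but the ratio (P1/P2)^k grows, which is the "gap" being amplified.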

    1. Section 9 of this paper has a very useful overview of previous work that is worth reading.

    2. We used the following publicly available real datasets in the experiment

      The datasets used are DBLP, ENRON, and UNIREF-4GRAM. All are small (<1M records) in web terms and, I would guess, all have small document sizes.

      A lengthy paper could potentially be divided into smaller documents (1 doc === 1 page), with signature calculation done on a per-page basis. This could bound the search time by limiting the number of pages that need to be rendered to text before the lookup process can start.
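
The per-page idea might look something like the sketch below. It assumes page text has already been extracted, and it uses MinHash over word shingles as the signature. That choice is an assumption for illustration (the note above does not name a signature scheme), and all function names are hypothetical.

```python
import hashlib
from typing import Iterable, Iterator, List, Set, Tuple


def shingles(text: str, k: int = 3) -> Set[str]:
    """Set of k-word shingles for one page of text."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _hash(seed: int, token: str) -> int:
    """Deterministic 64-bit hash of a token, salted by the seed."""
    digest = hashlib.blake2b(
        token.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(tokens: Set[str], num_hashes: int = 64) -> List[int]:
    """MinHash signature: the minimum hash value per seed over all tokens."""
    if not tokens:
        return []
    return [min(_hash(seed, t) for t in tokens) for seed in range(num_hashes)]


def page_signatures(
    pages: Iterable[str], num_hashes: int = 64, k: int = 3
) -> Iterator[Tuple[int, List[int]]]:
    """Lazily yield (page_number, signature), so lookup can start after the first page."""
    for page_number, page_text in enumerate(pages, start=1):
        yield page_number, minhash_signature(shingles(page_text, k), num_hashes)


def estimated_jaccard(a: List[int], b: List[int]) -> float:
    """Fraction of matching signature positions, an estimate of Jaccard similarity."""
    if not a or len(a) != len(b):
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)
```

Because `page_signatures` is a generator, a caller can stop consuming pages as soon as a lookup on an early page finds a match, which is the bounding effect described above.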