7 Matching Annotations
  1. Feb 2018
    1. I wonder if it's possible to still process these small files in order, but skip storing them and instead set them aside in memory until the next small file can be appended, doing this until the chunk reaches the Min or Avg chunk size.
    2. The theory behind splitting chunks using a hash function is to consistently find boundaries where the preceding data looks a certain way. If you have similar or identical files being backed up from different sources, the chunk boundaries should fall at the same positions, resulting in identical chunks that can be deduplicated.
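
      A minimal sketch of that idea, assuming a Rabin-Karp-style rolling hash and made-up size constants (this is not Duplicacy's actual code): a cut is declared wherever the hash of the trailing window matches a fixed bit pattern, so whether a boundary falls at a given byte depends only on the bytes just before it, not on where they sit in the stream.

      ```go
      package main

      import "fmt"

      const (
          windowSize = 64                // bytes covered by the rolling hash (assumed)
          avgMask    = uint64(1<<20 - 1) // cut when hash&avgMask == 0, ~1 MiB average (assumed)
          minChunk   = 256 * 1024        // assumed minimum chunk size
          maxChunk   = 4 * 1024 * 1024   // assumed maximum chunk size
          prime      = uint64(1099511628211)
      )

      // chunkBoundaries returns the offsets (exclusive) at which the stream is cut.
      func chunkBoundaries(data []byte) []int {
          // prime^(windowSize-1) is needed to remove the byte that leaves the window.
          pow := uint64(1)
          for i := 0; i < windowSize-1; i++ {
              pow *= prime
          }

          var cuts []int
          var hash uint64
          start := 0
          for i := 0; i < len(data); i++ {
              // Slide the window: bring data[i] in and, once the window is full,
              // remove the byte that just left the window from the hash.
              hash = hash*prime + uint64(data[i])
              if i >= windowSize {
                  hash -= uint64(data[i-windowSize]) * pow * prime
              }

              size := i - start + 1
              // Cut on the content-defined condition once the chunk is big enough,
              // or unconditionally at the maximum chunk size.
              if (size >= minChunk && hash&avgMask == 0) || size >= maxChunk {
                  cuts = append(cuts, i+1)
                  start = i + 1
              }
          }
          if start < len(data) {
              cuts = append(cuts, len(data))
          }
          return cuts
      }

      func main() {
          data := make([]byte, 4<<20)
          for i := range data {
              data[i] = byte(uint32(i) * 2654435761 >> 24) // deterministic filler
          }
          fmt.Println("original:", chunkBoundaries(data))

          // Prepend 100 bytes: once the two streams resynchronise, later cuts fall
          // at the same content positions, just shifted by the prefix length.
          shifted := append(make([]byte, 100), data...)
          fmt.Println("shifted: ", chunkBoundaries(shifted))
      }
      ```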
    3. I think inserting chunk boundaries at file boundaries would be very beneficial for deduplication. Consider a folder where some randomly chosen files are edited or added every day. File boundaries are very natural break points for changed data and should therefore be used.
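
      A sketch of how that suggestion could look (hypothetical behaviour, not what stock Duplicacy does): treat every file-end offset as an additional forced cut, so an edit confined to one file can never shift the chunking of the files that follow it.

      ```go
      package main

      import (
          "fmt"
          "sort"
      )

      // forcedCuts merges the cut offsets chosen by the rolling hash with the
      // offsets at which each file ends, so every file boundary is also a chunk
      // boundary.
      func forcedCuts(hashCuts, fileEnds []int) []int {
          seen := map[int]bool{}
          var cuts []int
          for _, c := range append(append([]int{}, hashCuts...), fileEnds...) {
              if !seen[c] {
                  seen[c] = true
                  cuts = append(cuts, c)
              }
          }
          sort.Ints(cuts)
          return cuts
      }

      func main() {
          hashCuts := []int{1200, 2600, 4100}       // offsets picked by the rolling hash
          fileEnds := []int{1000, 1500, 3000, 4500} // offsets where files end in the stream
          fmt.Println(forcedCuts(hashCuts, fileEnds)) // [1000 1200 1500 2600 3000 4100 4500]
      }
      ```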
    4. So your primary comparison should be official Duplicacy with 1M fixed chunks vs my branch with 1M variable chunks.
    5. Duplicacy does not use file hashes at all to identify previously seen files that may have changed names or locations, but rather concatenates the contents of all files into a long data stream that is cut into chunks according to artificial boundaries based on a hash function.
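
      Roughly how that pack-and-chunk model can be pictured (a sketch, assuming only file contents reach the chunker, never names or paths):

      ```go
      package main

      import (
          "bytes"
          "fmt"
          "io"
          "strings"
      )

      // concatFiles joins per-file readers into one logical stream, the way a
      // pack-and-chunk backup reads a directory: only file contents enter the
      // stream, file names and paths do not.
      func concatFiles(contents []string) io.Reader {
          readers := make([]io.Reader, len(contents))
          for i, c := range contents {
              readers[i] = strings.NewReader(c)
          }
          return io.MultiReader(readers...)
      }

      func main() {
          // The same two file contents backed up in two runs under different
          // names/paths: the streams are byte-identical, so a content-defined
          // chunker cuts them identically and every chunk deduplicates.
          runA, _ := io.ReadAll(concatFiles([]string{"contents of a.txt", "contents of b.jpg"}))
          runB, _ := io.ReadAll(concatFiles([]string{"contents of a.txt", "contents of b.jpg"}))
          fmt.Println("identical streams:", bytes.Equal(runA, runB)) // true
      }
      ```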
    6. You'd expect that if a new/moved file is discovered on a subsequent backup run, and it has the exact same File Hash, you could effectively just relink it to the existing Chunks and boundaries.

      Yes, that's what I've been thinking all along!
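
      A sketch of that relinking idea (hypothetical; as the annotation above notes, Duplicacy itself does not key anything off whole-file hashes): keep an index from file hash to the chunk sequence recorded for that content in an earlier snapshot, and when an unchanged file reappears under a new path, reuse the recorded sequence instead of re-chunking it.

      ```go
      package main

      import (
          "crypto/sha256"
          "encoding/hex"
          "fmt"
      )

      // ChunkRef points at an already-stored chunk and the byte range of the
      // file's content that lives inside it.
      type ChunkRef struct {
          ChunkID string
          Offset  int64
          Length  int64
      }

      // Index maps a whole-file hash to the chunk sequence recorded for it in an
      // earlier snapshot. (Hypothetical structure for the idea in this thread.)
      type Index map[string][]ChunkRef

      func fileHash(content []byte) string {
          sum := sha256.Sum256(content)
          return hex.EncodeToString(sum[:])
      }

      // lookup returns the previously recorded chunk refs for a file with this
      // exact content, if any, so the file can be relinked without re-chunking.
      func (idx Index) lookup(content []byte) ([]ChunkRef, bool) {
          refs, ok := idx[fileHash(content)]
          return refs, ok
      }

      func main() {
          idx := Index{}
          report := []byte("unchanged file contents")

          // First backup: record where this file's bytes ended up.
          idx[fileHash(report)] = []ChunkRef{{ChunkID: "chunk-0042", Offset: 128, Length: int64(len(report))}}

          // Later backup: the same content shows up under a new path; relink it.
          if refs, ok := idx.lookup(report); ok {
              fmt.Println("relink to existing chunks:", refs)
          }
      }
      ```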