38 Matching Annotations
  1. Last 7 days
    1. no_repeat_ngram_size= 35

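      Most people assume that OCR systems do not need special handling for n-gram repetition, since that matters mainly in text generation. The authors explicitly set the no_repeat_ngram_size parameter to 35, which shows that their OCR system has to prevent repeated patterns in long text. This challenges the mainstream view that OCR simply extracts text and does not have to deal with the characteristics of text generation.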
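      To make the effect of a parameter like no_repeat_ngram_size concrete, here is a minimal pure-Python sketch of n-gram blocking as it is commonly understood in text generation libraries: before each decoding step, any token that would complete an n-gram already present in the output is banned. The function name and the toy token IDs are assumptions for illustration; this is not the annotated project's code.

```python
from typing import List, Set


def banned_next_tokens(generated: List[int], no_repeat_ngram_size: int) -> Set[int]:
    """Return the tokens that would repeat an n-gram already in `generated`.

    Hypothetical helper for illustration only.
    """
    n = no_repeat_ngram_size
    # With fewer than n tokens there is no complete n-gram to repeat yet.
    if n <= 0 or len(generated) < n:
        return set()

    # The last n-1 tokens form the prefix of the n-gram being completed.
    start = len(generated) - (n - 1)
    prefix = tuple(generated[start:])

    banned: Set[int] = set()
    # Scan every complete n-gram seen so far; if its first n-1 tokens match
    # the current prefix, its last token must not be generated next.
    for i in range(len(generated) - n + 1):
        if tuple(generated[i:i + n - 1]) == prefix:
            banned.add(generated[i + n - 1])
    return banned
```

      With a window of 35, only an exact 35-token repeat is blocked, so short recurring phrases (which real documents contain) stay allowed while long degenerate loops are cut off.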

    2. max_length= 32768

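      Most people assume that the length of text an OCR model can handle is limited by the model architecture, typically to around a few thousand words. The authors set max_length as high as 32768, far beyond what traditional OCR systems handle. This suggests the model can process very long documents without losing context, and it challenges the assumed length limits of OCR systems.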

    3. Single image supports two configs: gundam or base

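      Most people assume that an OCR model needs dedicated configuration for a specific task or document type, but the authors state that a single image can support two very different configs ('gundam' or 'base'). This challenges the industry consensus that OCR systems usually need scenario-specific configuration.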

    4. Welcome the Era of One-shot Long-horizon Parsing.

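      Most people assume that OCR needs multiple passes or fine-tuning for different document types, but the authors claim that Unlimited-OCR achieves 'one-shot long-horizon parsing'. This challenges the conventional view in the OCR field that multiple passes are required, and it suggests that one model can handle a variety of complex documents without specialized training.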

  2. Oct 2025
    1. OCRmyPDF adds an optical character recognition (OCR) text layer to scanned PDF files, allowing them to be searched.

      PDF and OCR: converts an image-only or scanned PDF into an OCRed PDF. Command line on Windows, when OCRmyPDF was installed via winget:

      py -m ocrmypdf --sidecar R.txt --output-type pdf R.pdf R_01.pdf
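      The same invocation can be scripted. The sketch below only assembles the argument list from the command in the note above and optionally runs it; the function names are hypothetical, and using sys.executable in place of the Windows py launcher is an assumption.

```python
import subprocess
import sys
from typing import List


def build_ocrmypdf_command(input_pdf: str, output_pdf: str, sidecar_txt: str) -> List[str]:
    """Assemble the OCRmyPDF call from the note: sidecar text file plus plain PDF output."""
    return [
        sys.executable, "-m", "ocrmypdf",  # assumption: stands in for `py -m ocrmypdf`
        "--sidecar", sidecar_txt,          # also write the recognized text to this file
        "--output-type", "pdf",            # plain PDF output, as in the note
        input_pdf,
        output_pdf,
    ]


def run_ocr(input_pdf: str, output_pdf: str, sidecar_txt: str) -> None:
    """Run OCRmyPDF; raises CalledProcessError if the tool exits with an error."""
    subprocess.run(build_ocrmypdf_command(input_pdf, output_pdf, sidecar_txt), check=True)
```

      Calling run_ocr("R.pdf", "R_01.pdf", "R.txt") would reproduce the command above, provided OCRmyPDF is installed for the interpreter in use.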

  3. Mar 2025
  4. Jan 2025
  5. Jul 2024
    1. How about some self-help?!

      This passive "we are consumers" crap is exactly the problem...

      I bought the print book for 22 euros, cut off the spine with a circular saw, and ran the 208 pages through my ADF scanner (Brother ADS-3000N, 150 EUR used). Without preparation that is maybe half an hour of work. Then rotate, crop, and level the scans, and run them through Tesseract. Tesseract needs a fast CPU.

      Right now I am proofreading the hOCR files from Tesseract; later I will turn them into a PDF and upload it to annas-archive.org via libgen.rs. One problem less.

      I uploaded the hOCR files to https://github.com/milahu/enteignung - maybe someone wants to help with proofreading, then it will go 1 or 2 days faster.

      Man oh man... as an "IT insider" I am so bored by the normies who got stuck 20 years ago when it comes to IT and have no clue about linux, git, python, torproject, monero, ... but the main thing is talking crap on Telegram >:(
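      The Tesseract step described above (scanned page images in, hOCR files out) can be sketched as a small batch helper. This is a minimal sketch, not the annotator's actual script: the function names, the file names, and the language code "deu" are assumptions; the call shape is the standard tesseract command line with the hocr output config.

```python
import os
import subprocess
from typing import Iterable, List


def build_tesseract_hocr_command(image_path: str, lang: str = "deu") -> List[str]:
    """Build `tesseract <image> <outputbase> -l <lang> hocr` for one page."""
    # Tesseract appends ".hocr" to the output base, so strip the image suffix.
    output_base = os.path.splitext(image_path)[0]
    return ["tesseract", image_path, output_base, "-l", lang, "hocr"]


def build_batch(image_paths: Iterable[str], lang: str = "deu") -> List[List[str]]:
    """Build one command per page, in sorted (page) order."""
    return [build_tesseract_hocr_command(p, lang) for p in sorted(image_paths)]


def run_batch(image_paths: Iterable[str], lang: str = "deu") -> None:
    """Run OCR page by page; stops at the first page that fails."""
    for cmd in build_batch(image_paths, lang):
        subprocess.run(cmd, check=True)
```

      Pages are independent, so on a multi-core machine the commands from build_batch could also be run in parallel, which is where the fast CPU mentioned above pays off.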

  6. Mar 2024
  7. Nov 2023
  8. Sep 2023
  9. Jan 2023
  10. Oct 2022
    1. Worried about paper cards being lost or destroyed. I am loving using paper index cards. I am, however, worried that something could happen to the cards and I could lose years of work. I did not have this worry when my notes were all online. Are there any apps that you are using to make a digital copy of the notes? Ideally, I would love to have a digital mirror, but I am not willing to do 2x the work.

      u/LBHO https://www.reddit.com/r/antinet/comments/y77414/worried_about_paper_cards_being_lost_or_destroyed/

      As a firm believer in the programming principle of DRY (Don't Repeat Yourself), I can appreciate the desire not to do the work twice.

      Note card loss and destruction is definitely a thing folks have worried about. The easiest thing may be to spend a minute or two every day and make quick photo backups of your cards as you make them. Then if things are lost, you'll have a backup, and you can likely find OCR (optical character recognition) software to pull your notes from the photos and recreate them if necessary. I've outlined some details I've used in the past. Incidentally, opening a photo in Google Docs will automatically do a pretty reasonable OCR on it.

      I know some have written about bringing old notes into their (new) zettelkasten practice, and the general advice here has been to pull in old material only as it is needed or as it strongly interests you, to ease the cognitive load of thinking you need to do everything at once. If you did lose everything and had to restore from backup, I suspect this would probably be the best advice for proceeding as well.

      Historically many have worried about loss, but the only actual example of loss I've run across is that of Hans Blumenberg, whose zettelkasten from the early 1940s was lost during the war. He continued apace in another one dating from 1947, accumulating over 30,000 cards at the rate of about 1.5 per day over 50-some-odd years.

  11. Sep 2022
  12. Aug 2022
  13. Jun 2022
  14. Feb 2022

  15. Dec 2021
  16. Nov 2021
  17. Jul 2021

  18. Feb 2021
  19. Jan 2021
    1. Apart from a basic segmenter taken from OCRopus a trainable line extractor is in the process of being implemented. Full trainability of layout analysis is of utmost importance to a truly universal OCR system, as text layout and its semantics varies widely across time and space, e.g. hand-crafted methods for printed Latin text are unlikely to work reliably on Arabic text or manuscripts with extensive interlinear annotation.

      Work-in-progress implementation of line segmentation in kraken.

  20. Oct 2020
  21. Jul 2020
  22. Apr 2020
    1. Adobe AcrobatPro.

      gImageReader is an excellent open source alternative. It runs both on Windows and Linux, and it provides a simple (yet powerful) frontend GUI to Google's robust open source OCR engine, Tesseract.

      I think an open source tool such as this is a better fit for the open annotation ecosystem that Hypothesis promotes, which is based on libre software and standards, than a proprietary (and expensive) tool such as Adobe AcrobatPro.

  23. Apr 2019
  24. Sep 2015
  25. Aug 2015