72 Matching Annotations
  1. Dec 2021
    1. We evaluate the LayoutLM model on three document image understanding tasks: Form Understanding, Receipt Understanding, and Document Image Classification. We follow the typical fine-tuning strategy and update all parameters in an end-to-end way on task-specific datasets

      Task-specific Fine-tuning

    2. We initialize the weight of LayoutLM model with the pre-trained BERT base model. Specifically, our BASE model has the same architecture: a 12-layer Transformer with 768 hidden sizes, and 12 attention heads, which contains about 113M parameters. Therefore, we use the BERT base model to initialize all modules in our model except the 2-D position embedding layer. For the LARGE setting, our model has a 24-layer Transformer with 1,024 hidden sizes and 16 attention heads, which is initialized by the pre-trained BERT LARGE model and contains about 343M parameters

      Model PreTraining
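
      Since the excerpt describes initializing everything from BERT except the new 2-D position embedding layer, here is a minimal PyTorch sketch of that kind of selective weight copy. The function name and the matching-by-name-and-shape strategy are illustrative assumptions, not the authors' released code.

      ```python
      import torch.nn as nn

      def init_from_bert(layout_model: nn.Module, bert_state_dict: dict):
          """Copy every BERT weight whose name and shape match the target model;
          anything BERT does not have (e.g. the 2-D position embedding tables)
          keeps its random initialization."""
          own_state = layout_model.state_dict()
          copied = {k: v for k, v in bert_state_dict.items()
                    if k in own_state and v.shape == own_state[k].shape}
          own_state.update(copied)
          layout_model.load_state_dict(own_state)
          return sorted(copied)  # names that were actually initialized from BERT
      ```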

    3. Document Layout Information. It is evident that the relative positions of words in a document contribute a lot to the semantic representation. Taking form understanding as an example, given a key in a form (e.g., “Passport ID:”), its corresponding value is much more likely on its right or below instead of on the left or above. Therefore, we can embed this relative position information as a 2-D position representation. Based on the self-attention mechanism within the Transformer, embedding 2-D position features into the language representation will better align the layout information with the semantic representation. Visual Information. Compared with the text information, the visual information is another significantly important feature in document representations. Typically, documents contain some visual signals to show the importance and priority of document segments. The visual information can be represented by image features and effectively utilized in document representations. For document-level visual features, the whole image can indicate the document layout, which is an essential feature for document image classification. For word-level visual features, styles such as bold, underline, and italic are also significant hints for the sequence labeling tasks. Therefore, we believe that combining the image features with traditional text representations can bring richer semantic representations to documents.

      Two types of features proposed: document layout information and visual information
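
      A rough sketch of how the 2-D position features described above can be added to a BERT-style input embedding (sum of token, 1-D position, and x/y coordinate embeddings). Table sizes and the integer coordinate normalization are illustrative assumptions.

      ```python
      import torch
      import torch.nn as nn

      class TwoDPositionEmbedding(nn.Module):
          """Sketch of LayoutLM-style input embeddings: token embedding plus 1-D
          sequence position plus 2-D box coordinates (x0, y0, x1, y1), summed."""
          def __init__(self, vocab_size=30522, hidden=768, max_seq=512, max_coord=1024):
              super().__init__()
              self.tok = nn.Embedding(vocab_size, hidden)
              self.pos = nn.Embedding(max_seq, hidden)
              self.x_emb = nn.Embedding(max_coord, hidden)
              self.y_emb = nn.Embedding(max_coord, hidden)

          def forward(self, token_ids, boxes):
              # token_ids: (B, T); boxes: (B, T, 4) long tensor with x0, y0, x1, y1 in [0, max_coord)
              B, T = token_ids.shape
              positions = torch.arange(T, device=token_ids.device).unsqueeze(0)
              e = self.tok(token_ids) + self.pos(positions)
              e = e + self.x_emb(boxes[..., 0]) + self.y_emb(boxes[..., 1])  # top-left corner
              e = e + self.x_emb(boxes[..., 2]) + self.y_emb(boxes[..., 3])  # bottom-right corner
              return e
      ```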

    4. , they usually leverage text information only for any kind of inputs. When it comes to visually rich documents, there is much more information that can be encoded into the pre-trained model. Therefore, we propose to utilize the visually rich information from document layouts and align them with the input texts.

      BERT usually leverages text information only; in visually rich documents (VRDs), much more information can be encoded into the pre-trained model. The visually rich information from document layouts is utilized and aligned with the input texts

    5. During the pre-training, the model uses two objectives to learn the language representation: Masked Language Modeling (MLM) and Next Sentence Prediction (NSP), where MLM randomly masks some input tokens and the objective is to recover these masked tokens, and NSP is a binary classification task taking a pair of sentences as inputs and classifying whether they are two consecutive sentences

      pretraining stage of BERT
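
      The MLM objective described above is easy to see in code. Below is a minimal BERT-style masking sketch; the 80/10/10 split of masked positions follows the original BERT recipe, and the function itself is illustrative.

      ```python
      import torch

      def mask_tokens(input_ids, mask_token_id, vocab_size, mlm_prob=0.15):
          labels = input_ids.clone()
          masked = torch.bernoulli(torch.full(input_ids.shape, mlm_prob)).bool()
          labels[~masked] = -100                      # loss is only computed on masked positions

          input_ids = input_ids.clone()
          replace = torch.bernoulli(torch.full(input_ids.shape, 0.8)).bool() & masked
          input_ids[replace] = mask_token_id          # 80% of targets: [MASK]

          random_tok = torch.bernoulli(torch.full(input_ids.shape, 0.5)).bool() & masked & ~replace
          input_ids[random_tok] = torch.randint(vocab_size, input_ids.shape)[random_tok]  # 10%: random token
          return input_ids, labels                    # remaining 10% keep the original token
      ```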

    6. BERT model is an attention-based bidirectional language modeling approach. It has been verified that the BERT model shows effective knowledge transfer from the self-supervised task with large-scale training data. The architecture of BERT is basically a multi-layer bidirectional Transformer encoder.

      Brief explanation about BERT Model

    7. Though these models have made significant progress in the document AI area with deep neural networks, most of these methods confront two limitations: (1) They rely on a few human-labeled training samples without fully exploring the possibility of using large-scale unlabeled training samples. (2) They usually leverage either pre-trained CV models or NLP models, but do not consider a joint training of textual and layout information. Therefore, it is important to investigate how self-supervised pre-training of text and layout may help in the document AI area

      Limitations of existing progress in the document AI field, which the LayoutLM paper is trying to solve

    8. this is the first time that text and layout are jointly learned in a single framework for document-level pre-training.

      Text and layout are jointly learned in a single framework for document-level pre-training

    9. LayoutLM to jointly model interactions between text and layout information across scanned document images, which is beneficial for a great number of real-world document image understanding tasks such as information extraction from scanned documents.

      LayoutLM models interactions between text and layout information across scanned document images

    Annotators

    1. Instead of constructing a grid on the character level and embedding each character with one-hot encoding as in Katti et al. (2018), we construct a grid on the word-piece level and embed with dense contextualized vectors from a BERT language model.

      Difference from chargrid in Katti et al.: BERTgrid constructs the grid on the word-piece level and embeds it with dense contextualized vectors from a BERT language model
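
      A small sketch of what such a word-piece grid could look like in code, assuming the OCR boxes and the contextualized BERT vectors are already available. The input format and the down-sampling factor are assumptions, not the paper's exact pipeline.

      ```python
      import numpy as np

      def build_bertgrid(word_pieces, page_h, page_w, down=8, dim=768):
          """Sketch of a BERTgrid tensor: an (H/down, W/down, dim) array in which
          every cell covered by a word piece's bounding box holds that piece's
          contextualized BERT embedding; empty cells stay zero. `word_pieces` is
          a list of (x0, y0, x1, y1, embedding) tuples in page pixel coordinates."""
          H, W = page_h // down, page_w // down
          grid = np.zeros((H, W, dim), dtype=np.float32)
          for x0, y0, x1, y1, emb in word_pieces:
              r0, r1 = y0 // down, max(y0 // down + 1, y1 // down)
              c0, c1 = x0 // down, max(x0 // down + 1, x1 // down)
              grid[r0:r1, c0:c1, :] = emb
          return grid
      ```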

    2. Chargrid (Katti et al. (2018)), followed more recently by CUTIE (Zhao et al. (2019)), construct a 2D grid of characters or words from a document and feed it into a neural model, thereby preserving the spatial arrangement of the document. The symbols in the original document are embedded in some vector space, yielding a rank-3 tensor (width, height, embedding). Both papers report significant benefits of using such a grid approach over purely sequential 1D input representations, especially for semantically understanding tabulated or otherwise spatially arranged text like line items

      Hybrid approach combining NLP and CV methods for document intelligence: Chargrid (Katti et al.) and CUTIE (Zhao et al.) construct a 2D grid of characters or words from a document and feed it into a neural model, preserving the spatial arrangement of the document

    3. Instead of working on the textual level, it is possible to directly apply methods from computer vision (CV) (e.g. Ren et al. (2015)) to work on the raw document pixel level which naturally retains the two-dimensional (2D) document structure

      It is possible to apply CV methods that naturally retain the 2-D structure of the document, but this is impractical because the machine learning model would first need to learn textual information from raw pixel data before learning semantics

    4. In classical natural language processing (NLP), however, the layout information is completely discarded as the document text is simply a sequence of words. Without access to the layout, a downstream task such as extraction of tabulated data can become much harder – and in some cases impossible to solve – since the necessary serialization may lead to severe information loss

      Downside of using classical NLP techniques only

    5. based on Chargrid by Katti et al. (2018), represents a document as a grid of contextualized word piece embedding vectors, thereby making its spatial structure and semantics accessible to the processing neural network

      BERTgrid is based on chargrid

    6. The contextualized embedding vectors are retrieved from a BERT language model. We use BERTgrid in combination with a fully convolutional network on a semantic instance segmentation task for extracting fields from invoices.

      Contextualized embedding vectors retrieved from BERT language model

    1. The encoder boils down to a VGG-type network (Simonyan and Zisserman, 2014) with dilated convolutions (Yu and Koltun, 2016), batch normalization (Ioffe and Szegedy, 2015), and spatial dropout (Tompson et al., 2015).

      Encoder similar to VGG-type network with dilated convolutions, batch normalization, spatial dropout
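
      For reference, a sketch of one encoder block in the spirit described above: 3x3 convolutions (optionally dilated), batch normalization, ReLU, and spatial (channel-wise) dropout. Channel sizes and the dropout rate are illustrative, not the paper's exact configuration.

      ```python
      import torch.nn as nn

      def vgg_dilated_block(in_ch, out_ch, dilation=1, p_drop=0.1):
          return nn.Sequential(
              nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=dilation, dilation=dilation),
              nn.BatchNorm2d(out_ch),
              nn.ReLU(inplace=True),
              nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=dilation, dilation=dilation),
              nn.BatchNorm2d(out_ch),
              nn.ReLU(inplace=True),
              nn.Dropout2d(p_drop),   # spatial dropout: drops whole feature maps
          )
      ```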

    2. there can be multiple and an unknown number of instances of the same class, we further perform instance segmentation. This means, in addition to predicting a segmentation mask, we may also predict bounding boxes using the techniques from object detection

      instance segmentation (in addition to predicting segmentation mask, also predict bounding boxes)

    3. 1-hot encoded chargrid representation g̃ as input to a fully convolutional neural network to perform semantic segmentation on the chargrid and predict a class label for each character-pixel on the document

      Semantic segmentation on the chargrid: predict a class label for each character-pixel on the document
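
      A minimal sketch of constructing such a 1-hot chargrid from character boxes; the input format and vocabulary handling are illustrative assumptions.

      ```python
      import numpy as np

      def build_chargrid(char_boxes, page_h, page_w, vocab):
          """Each pixel covered by a character box is set to that character's
          index; background stays 0. `char_boxes` is an assumed list of
          (char, x0, y0, x1, y1) tuples from OCR, and `vocab` maps characters
          to indices starting at 1."""
          index_grid = np.zeros((page_h, page_w), dtype=np.int64)
          for ch, x0, y0, x1, y1 in char_boxes:
              index_grid[y0:y1, x0:x1] = vocab.get(ch, 0)
          # expand to a 1-hot tensor of shape (page_h, page_w, len(vocab) + 1)
          return np.eye(len(vocab) + 1, dtype=np.float32)[index_grid]
      ```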

    4. positional information can come from an optical character recognition (OCR) engine, or can be directly extracted from the layout information in the document as provided by, e.g., PDF or HTML. The coordinate space of a character box is defined by page height H and width W, and is usually measured in units of pixels

      where the positional information come from

    5. chargrid can be constructed from character boxes, i.e., bounding boxes that each surround a single character somewhere on a given document page.

      bounding boxes surrounding single character

    6. Combining approaches from computer vision, NLP, and document analysis, our work is the first to systematically address the task of understanding 2D documents the same way as NLP while still retaining the 2D structure in structured documents

      Combining approaches from CV, NLP, and document analysis (retaining the 2D structure of structured documents)

    7. document understanding task as instance-level semantic segmentation on chargrid. More precisely, the model predicts a segmentation mask with pixel-level labels and object bounding boxes to group multiple instances of the same class.

      The task: instance-level semantic segmentation on the chargrid (segmentation mask plus bounding boxes)

    8. a novel paradigm for processing and understanding structured documents. Instead of serializing a document into a 1D text, the proposed method, named chargrid, preserves the spatial structure of the document by representing it as a sparse 2D grid of characters.

      Preserves the spatial structure of the document by representing it as a sparse 2D grid of characters

  2. Nov 2021
    1. To more closely mimic a real-world dataset, we perform data augmentation. We consider the following steps (the effects marked with a star* are based on the open source ocrodeg package): (1) Background: Natural images, gradient background, multiscale noise*, fibrous noise*, blobs*. (2) Distortions: Large 2D distortions*, Small 1D distortions*. (3) Projective transformations: Including rotation, skew, dilation, 3D perspective, etc. (4) Degradations: Gaussian or box blur; mode or median filters; contour, emboss, edges, smooth, gradient text. (5) dpi and compression: Down-scaling, jpeg compression. (6) Color: Equalize, Invert, Sharpness, Contrast, Brightness. A subset of these steps are randomly chosen and applied on any given document. With this, we generate 66,481 pages of synthetic document data

      https://github.com/NVlabs/ocrodeg

      Data Augmentation
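
      A sketch of the kind of "random subset of steps" pipeline described above, using plain Pillow operations. The starred effects in the quote come from the ocrodeg package and are not reproduced here; parameter ranges are illustrative assumptions.

      ```python
      import io
      import random
      from PIL import Image, ImageEnhance, ImageFilter, ImageOps

      def augment_page(img: Image.Image) -> Image.Image:
          """Apply a random subset of simple degradations to a document page."""
          def jpeg_compress(im):
              buf = io.BytesIO()
              im.convert("RGB").save(buf, format="JPEG", quality=random.randint(30, 70))
              buf.seek(0)
              out = Image.open(buf)
              out.load()
              return out

          steps = [
              lambda im: im.rotate(random.uniform(-3, 3), expand=True, fillcolor="white"),
              lambda im: im.filter(ImageFilter.GaussianBlur(random.uniform(0.3, 1.2))),
              lambda im: ImageEnhance.Contrast(im).enhance(random.uniform(0.7, 1.3)),
              lambda im: ImageEnhance.Brightness(im).enhance(random.uniform(0.8, 1.2)),
              lambda im: ImageOps.equalize(im.convert("RGB")),
              lambda im: im.resize((im.width // 2, im.height // 2)).resize((im.width, im.height)),
              jpeg_compress,
          ]
          for step in random.sample(steps, k=random.randint(1, 3)):
              img = step(img)
          return img
      ```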

  3. Oct 2021
    1. In this paper, we present a novel deep learning based preprocessing method to jointly detect and deskew documents in digital images

      Preprocessing method proposed in this paper: deep learning based, jointly detects & deskews documents that are skewed (slightly rotated or on cluttered backgrounds) to improve OCR performance. The method was tested on a dataset of cash receipt photos
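
      For context only, a classical projection-profile baseline that illustrates the deskewing task itself; this is not the paper's deep-learning method. It rotates a binarized page over candidate angles and picks the angle whose horizontal projection (row sums) has the highest variance, i.e. the sharpest text lines.

      ```python
      import numpy as np
      from PIL import Image

      def estimate_skew(img: Image.Image, max_angle=10.0, step=0.5) -> float:
          gray = np.asarray(img.convert("L"), dtype=np.float32)
          binary = ((gray < 128).astype(np.uint8)) * 255       # dark pixels = text
          best_angle, best_score = 0.0, -1.0
          for angle in np.arange(-max_angle, max_angle + step, step):
              rotated = Image.fromarray(binary).rotate(angle, expand=False, fillcolor=0)
              profile = np.asarray(rotated).sum(axis=1)        # row sums
              score = profile.var()
              if score > best_score:
                  best_angle, best_score = float(angle), score
          return best_angle

      # usage: deskewed = img.rotate(-estimate_skew(img), expand=True, fillcolor="white")
      ```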

    Annotators

    1. A.1.2 CORD, CORD+, CORD++, and CORD-M for receipt IE
      CORD and its variants consist of 30 information categories such as menu name, count, unit price, price, and total price (Table 6). The fields are further grouped and form the information layer at a higher level.
      A.1.3 Receipt-idn for receipt IE
      Receipt-idn is similar to CORD but includes more diverse information categories (50) such as store name, store address, and payment time (Table 6).
      A.1.4 namecard for name card IE
      namecard consists of 12 field types, including name, company name, position, and address (Table 6). The task requires grouping and ordering of tokens for each field. Although there is only a single information layer (field), careful handling of complex spatial relations is required due to the large degree of freedom in the layout.
      A.1.5 Invoice for invoice IE
      Invoice consists of 62 information categories such as item name, count, price with tax, item price without tax, total price, invoice number, invoice date, vendor name, and vendor address (Table 6). Similar to receipts, their hierarchical information is represented via inter-field grouping.
      A.1.6 FUNSD for general form understanding
      The FUNSD form understanding task consists of two sub-tasks: entity labeling (ELB) and entity linking (ELK). In ELB, tokens are classified into one of four fields (header, question, answer, and other) while serializing tokens within each field. Both subtasks assume that the input tokens are perfectly serialized with no OCR error. To emphasize the importance of correct serialization in the real world, we prepare two variants of ELB tasks: ELB-R and ELB-S. In ELB-R, the whole documents are randomly rotated by a degree of -20°–20° and the input tokens are serialized using rotated y-coordinates. In the ELB-S task, the input tokens are randomly shuffled. In both tasks, the relative order of the input tokens within each field remains unchanged. In the ELK task, tokens are linked based on their key-value relations (inter-grouping between fields). For example, each “header” is linked to the corresponding “question”, and “question” is paired with the corresponding “answer”

      Datasets explained

    2. The internal datasets Receipt-idn, namecard and Invoice are annotated by the crowd through an in-house web application following (Park et al., 2019; Hwang et al., 2019). First, each text segment is labeled (bounding box and the characters inside) for the OCR task. The text segments are further grouped according to their field types by the crowds. For Receipt-idn and Invoice, additional group-ids are annotated to each field for inter-grouping of them. The text segments placed on the same line are also annotated through row-ids. For quality assurance, the labeled documents are cross-inspected by the crowds.

      Datasets (1): annotation process for the internal datasets

    3. To extract the visually embedded texts from an image, we use our in-house OCR system that consists of CRAFT text detector (Baek et al., 2019b) and Comb.best text recognizer (Baek et al., 2019a). The OCR models are finetuned on each of the document IE datasets. The output tokens and their spatial information on the image are used as the inputs to SPADE

      Section 5.1 OCR Experimental setup

    4. To perform the spatial dependency parsing task introduced in the previous section in an end-to-end fashion, we propose SPADE that consists of (1) spatial text encoder, (2) graph generator, and (3) graph decoder. Spatial text encoder and graph generator are trained jointly. Graph decoder is a deterministic function (without trainable parameters) that maps the graph to a valid parse of the output structure

      Model SPADE:

      1. Spatial text encoder
      2. Graph generator
      3. Graph decoder

      Spatial text encoder and graph generator trained jointly. Graph decoder is a deterministic function mapping graph to valid parse of output structure.
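
      As a rough illustration of the graph generator idea above: score every ordered pair of token encodings and threshold into a directed adjacency matrix. This is a generic sketch of the concept, not SPADE's exact parameterization.

      ```python
      import torch
      import torch.nn as nn

      class RelationScorer(nn.Module):
          def __init__(self, hidden=768):
              super().__init__()
              self.head = nn.Linear(hidden, hidden)
              self.tail = nn.Linear(hidden, hidden)

          def forward(self, token_enc, threshold=0.5):
              # token_enc: (T, hidden) spatially-aware token encodings
              h = self.head(token_enc)            # (T, hidden)
              t = self.tail(token_enc)            # (T, hidden)
              scores = torch.sigmoid(h @ t.T)     # (T, T) pairwise relation probabilities
              adjacency = scores > threshold      # directed edges token_i -> token_j
              return scores, adjacency
      ```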

    5. In short, our contributions are threefold. (1) We present a novel view that information extraction for semi-structured documents can be formulated as a dependency parsing problem in two-dimensional space. (2) We propose SPADE for spatial dependency parsing, which is capable of efficiently constructing a directed semantic graph of text tokens in semi-structured documents. (3) SPADE achieves a similar or better accuracy than the previous state of the art or strong BERT-based baselines in eight document IE datasets

      Contributions of this paper:

      1. Formulating IE for semi-structured documents as a dependency parsing problem in two-dimensional space.

      2. SPADE for spatial dependency parsing, capable of efficiently constructing directed semantic graph of text tokens in semi-structured documents.

      3. SPADE achieving similar/better accuracy than previous SOTA or strong BERT-based baselines in eight document IE datasets

    6. While effective for relatively simple documents, their broader application in the real world is still challenging because (1) semi-structured documents often exhibit a complex layout where the serialization algorithm is non-trivial, and (2) sequence tagging is inherently not effective for encoding multi-layer hierarchical information such as the menu tree in receipts

      Broader application in real world is still challenging due to complex layout of semi-structured documents, ineffectiveness of sequence tagging for encoding multi-layer hierarchical information like menu tree in receipts

    7. To tackle these issues, we first formulate the IE task as a spatial dependency parsing problem that focuses on the relationship among text tokens in the documents. Under this setup, we then propose SPADE (SPAtial DEpendency parser) that models highly complex spatial relationships and an arbitrary number of information layers in the documents in an end-to-end manner. We evaluate it on various kinds of documents such as receipts, name cards, forms, and invoices, and show that it achieves a similar or better performance compared to strong baselines including a BERT-based IOB tagger

      Solution proposed: Formulating IE task as spatial dependency parsing problem that focuses on relationship among text tokens in documents, and creating Spatial Dependency parser

    1. In this paper, we presented an end-to-end network to bridge the text reading and information extraction for document understanding. These two tasks can mutually reinforce each other through joint training. The visual and textual features of text reading can boost the performances of information extraction while the loss of information extraction can also supervise the optimization of text reading. On a variety of benchmarks, from structured to semi-structured text type and fixed to variable layout, our proposed method significantly outperforms three state-of-the-art methods in both aspects of efficiency and accuracy

      Conclusion: TReading (text reading) and IE reinforce each other through joint training. The visual and textual features of TReading boost the performance of IE, while the loss of IE helps optimize TReading, on a variety of benchmarks from structured to semi-structured text and fixed to variable layout.

      Link to Presentation Video: https://dl.acm.org/doi/10.1145/3394171.3413900

    2. We find that GCN(TR) has difficulty in adapting to such flexible layout. Chargrid(TR) obtains impressive performances on isolated entities such as Name, Phone and Education period. Since University and Major entities are often blended with other texts, Chargrid(TR) may fail to extract these entities. As expected, NER(TR) performs better on this dataset, thanks to the inherent serializable property, while it is inferior to Chargrid(TR) on isolated entities due to the missing layout information. Our model inherits the advantages of both Chargrid(TR) and NER(TR), providing context features for identifying entities and performing entity extraction at the character level. In short, our model gets comprehensive gain.

      Evaluation on Resumes: Chargrid obtains impressive performance on isolated entities, but may fail on entities blended with other text, like University and Major. NER performs better thanks to the inherent serializable property, but is inferior to Chargrid on isolated entities (missing layout information). The proposed model inherits the advantages of both Chargrid and NER

    3. Character-Word LSTM is similar to NER [24], which applies LSTM on character and word level sequentially. LayoutLM [54] makes use of large pre-training data and fine-tunes on SROIE. Similar to LayoutLM, PICK [58] extracts rich semantic representation containing the textual and visual features as well as global layout. Compared with these methods, our model shows competitive performance.

      Evaluation on SROIE dataset (Setting 2): the officially provided ground-truth text bounding boxes and transcripts are used

    4. Evaluation on SROIE: We perform two sets of experiments and the results are as shown in Table 4. Setting 1: We train text reading module all by ourselves and report comparisons. Notice that, we do not employ tricks of data synthesis and model ensemble in the training of text reading. Since entities of ‘Company’ and ‘Total’ often have distinguishing visual features (e.g., bold type or large font), as shown in Fig. 1(b), benefiting from fusion of visual and textual features, our model outperforms three counterparts by a large margin.

      Evaluation on SROIE dataset (Setting 1)

    5. Evaluation on Taxi Invoices: In this dataset, the noise of low quality and taint may lead to failures of detection and recognition of entities. Besides, the contents may be misplaced, e.g. the content of Pick-up time may appear after the ‘Date’. Table 3 shows the results. We see that our model outperforms counterparts by significant margins except for the Pick-up time (illustrated in the tail of the paragraph). Concretely, NER(TR) discards the layout information and serializes all texts into one-dimensional text sequences, reporting inferior performance than other methods. Benefiting from the layout information, Chargrid(TR) and GCN(TR) work much better. However, Chargrid(TR) conducts a pixel segmentation task and is prone to omit characters or include extra characters. For GCN(TR), it only exploits the positions of text segments. Obviously, our TRIE has the ability to boost performances by using more useful visual features in VRDs. In addition, we attribute the only slightly lower score of the Pick-up entity compared with GCN(TR) to the annotations. For example in Fig. 4, when an entity such as the Pick-up time ‘18:47’ is too blurred to read, it is tagged as NULL. However, our model can still correctly read and extract this entity, which leads to lower statistics

      Results on Taxi Invoices

    6. We validate our model on three real-world datasets. One is the public SROIE [20] benchmark, and the other two are self-built datasets, Taxi Invoices and Resumes, respectively. Note that the three benchmarks differ largely in layout and text type, from fixed to variable layout and from structured to semi-structured text.

      Datasets: SROIE, Taxi Invoice & Resume (Self-built, in Chinese)

      • SROIE dataset (ICDAR 2019 Challenge) consisting of 626 receipts (train) and 347 receipts (testing) with 4 entities. Variable layouts and structured text.

      • The Taxi Invoice (consists of 5000 images and has 9 entities) can be grouped into roughly 13 templates, with fixed layout and structured text type.

      • The Resumes dataset is of Chinese (consists of 2475 scanned resumes), with 6 entities to extract, variable layouts and semi-structured text.

    7. Both the context and textual features matter in entity extraction. The context features (including both visual context features C and textual context features C̃) provide necessary information to tell entities apart, while the textual features Z enable entity extraction in the character granularity, as they contain semantic features for each character in the text. So we first perform multimodal fusion of visual context features C and textual context features C̃, which are further combined with textual features Z to extract entities

      Section 3.4: IE Module
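
      A hedged sketch of the fusion step described above: combine visual context features C and textual context features C̃ per text segment, then concatenate the fused context onto the character-level textual features Z before tagging. Dimensions and the concatenation scheme are illustrative assumptions, not the paper's exact design.

      ```python
      import torch
      import torch.nn as nn

      class ContextFusion(nn.Module):
          def __init__(self, d_ctx=256, d_char=256, num_tags=10):
              super().__init__()
              self.fuse = nn.Linear(2 * d_ctx, d_ctx)
              self.tagger = nn.Linear(d_ctx + d_char, num_tags)

          def forward(self, C, C_tilde, Z):
              # C, C_tilde: (num_texts, d_ctx); Z: (num_texts, max_chars, d_char)
              ctx = torch.relu(self.fuse(torch.cat([C, C_tilde], dim=-1)))  # (num_texts, d_ctx)
              ctx = ctx.unsqueeze(1).expand(-1, Z.size(1), -1)              # broadcast to characters
              return self.tagger(torch.cat([ctx, Z], dim=-1))               # per-character tag logits
      ```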

    8. Textual Context. Unlike visual context which focuses on local visual patterns, textual context models the fine-grained long-distance dependencies and relationships between texts, providing complementary context information. Inspired by [10, 27, 34], we apply the self-attention mechanism to extract textual context features, supporting a variable number of texts

      Textual Context (part of Multimodal Context Block)
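
      A minimal sketch of extracting such textual context with self-attention over the per-text embeddings (a variable number of texts per document), using PyTorch's built-in multi-head attention purely as an illustration; dimensions are assumptions.

      ```python
      import torch.nn as nn

      attn = nn.MultiheadAttention(embed_dim=256, num_heads=4, batch_first=True)

      def textual_context(text_feats, padding_mask=None):
          # text_feats: (batch, num_texts, 256); padding_mask: (batch, num_texts), True where padded
          ctx, _ = attn(text_feats, text_feats, text_feats, key_padding_mask=padding_mask)
          return ctx  # (batch, num_texts, 256) context-enriched features C̃
      ```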

    9. Visual Context. As mentioned, visual details such as the obvious color, font, layout and other informative features are equally important as textual details (text content) for document understanding. A natural way of capturing the local visual context of a text is to resort to a convolutional neural network. Different from [54] which extracts these features from scratch, we directly reuse C = (c1, c2, ..., cm) produced by the text reading module. Thanks to the deep backbone and lateral connections introduced by FPN, each ci summarizes the rich local visual patterns of the i-th text

      Visual Context (part of Multimodal Context Block)

    10. we design a multimodal context block to consider position features, visual features and textual features all together. This block provides both visual context and textual context of a text, which are complementary to each other and further fused in the information extraction module

      On the Multimodal Context Block used

    11. Specifically, in text reading, the network takes the original image as input and outputs text region coordinate positions. Once the positions are obtained, we apply RoIAlign [15] on the shared convolutional features to get text region features.

      Text reading module: RoIAlign on shared convolutional features to get text region features
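
      A sketch of that RoIAlign step using torchvision's roi_align: crop a fixed-size feature patch for each detected text region from the shared feature map. The tensor shapes and box coordinates are illustrative.

      ```python
      import torch
      from torchvision.ops import roi_align

      features = torch.randn(1, 256, 128, 128)          # (B, C, H/stride, W/stride), illustrative
      boxes = [torch.tensor([[10., 20., 90., 40.],      # (x0, y0, x1, y1) per text region,
                             [15., 60., 120., 80.]])]   # here given in feature-map coordinates
      text_region_feats = roi_align(features, boxes, output_size=(8, 32), spatial_scale=1.0)
      # -> (num_boxes, 256, 8, 32), one feature tensor per detected text line
      ```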

    12. Text reading module is responsible for localizing and recognizing all texts in document images and information extraction module is to extract entities of interest from them. The multimodal context block is novelly designed to bridge the text reading and information extraction modules.

      Overview of overall architecture

    13. [4] localized, recognized and classified each word in the document. Since it worked in the word granularity, it not only required much more labeling efforts (positions, content and category of each word) but also had difficulties in extracting those entities which were embedded in word texts (e.g. extracting ‘51xxxx@xxx.com’ from ‘153-xxx97|51xxxx@xxx.com’).

      Section: 2.2 Information Extraction

    14. Two related concurrent works were presented in [4, 14]. [14] proposed an entity-aware attention text extraction network to extract entities from VRDs. However, it could only process documents of relatively fixed layout and structured text, like train tickets, passports and business cards

      Section: 2.2 Information Extraction

    15. Though rule-based methods work in some cases, they rely heavily on the predefined rules, whose design and maintenance usually require deep expertise and large time cost. Besides, they cannot generalize across document templates.

      Re: Information Extraction
      Disadvantages of rule-based methods for IE tasks

    16. CRNN framework

      Convolutional Recurrent Neural Network (CRNN): a CNN (convolutional neural network) followed by an RNN (recurrent neural network). Referenced by this paper because it improves sequential recognition of text lines.
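
      A minimal CRNN sketch to make the idea concrete: a small CNN collapses a text-line image into a feature sequence, a bidirectional LSTM models the sequence, and a linear layer emits per-step character logits (e.g. for a CTC loss). Layer sizes are illustrative, not the referenced model's exact configuration.

      ```python
      import torch.nn as nn

      class TinyCRNN(nn.Module):
          def __init__(self, num_classes, img_h=32):
              super().__init__()
              self.cnn = nn.Sequential(
                  nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),
                  nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),
              )
              self.rnn = nn.LSTM(128 * (img_h // 4), 256, bidirectional=True, batch_first=True)
              self.fc = nn.Linear(512, num_classes)

          def forward(self, x):                       # x: (B, 1, img_h, W)
              f = self.cnn(x)                         # (B, 128, img_h/4, W/4)
              f = f.permute(0, 3, 1, 2).flatten(2)    # (B, W/4, 128 * img_h/4)
              seq, _ = self.rnn(f)                    # (B, W/4, 512)
              return self.fc(seq)                     # per-timestep character logits
      ```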

    17. Taxi Invoices consists of 5000 images and has 9 entities to extract (Invoice Code, Invoice Number, Date, Pick-up time, Drop-off time, Price, Distance, Waiting, Amount). The invoices are in Chinese and can be grouped into roughly 13 templates. So it is a kind of document with fixed layout and structured text type.
      • SROIE [20] is a public dataset for receipt information extraction in the ICDAR 2019 Challenge. It contains 626 receipts for training and 347 receipts for testing. Each receipt is labeled with four types of entities, which are Company, Date, Address and Total. It has variable layouts and structured text.
      • Resumes is a dataset of 2475 Chinese scanned resumes, which has 6 entities to extract (Name, Phone Number, Email Address, Education period, Universities and Majors). As an owner can design his own resume template, this dataset has variable layouts and semi-structured text.

      Dataset Summary

    18. all the above works inevitably have the following three limitations. (1) VRD understanding requires both visual and textual features, but the visual features they exploited are limited. (2) Text reading and information extraction are highly correlated, but the relations between them have rarely been explored. (3) The stagewise training strategy of text reading and information extraction brings redundant computation and time cost.

      Limitations of existing works mentioned

    19. text reading includes text detection and recognition in images, which belongs to the optical character recognition (OCR) research field and has already been widely used in many Computer Vision (CV) applications [12, 42, 52]

      Text Reading includes text detection & recognition in images (OCR)

    20. the multimodal visual and textual features of text reading are fused for information extraction and in turn, the semantics in information extraction contribute to the optimization of text reading.

      TRIE: END-to-END Text Reading and Information Extraction for Document Understanding

    21. text reading and information extraction are mutually correlated.
      • TRIE: END-to-END Text Reading and Information Extraction for Document Understanding

      Approach: unified end-to-end text reading and information extraction network, text reading & IE supplement each other.