72 Matching Annotations
  1. Dec 2021
    1. We evaluate the LayoutLM model on three document image understanding tasks: Form Understanding, Receipt Understanding, and Document Image Classification. We follow the typical fine-tuning strategy and update all parameters in an end-to-end way on task-specific datasets

      Task-specific Fine-tuning

    2. We initialize the weight of LayoutLM model with the pre-trained BERT base model. Specifically, our BASE model has the same architecture: a 12-layer Transformer with 768 hidden sizes, and 12 attention heads, which contains about 113M parameters. Therefore, we use the BERT base model to initialize all modules in our model except the 2-D position embedding layer. For the LARGE setting, our model has a 24-layer Transformer with 1,024 hidden sizes and 16 attention heads, which is initialized by the pre-trained BERT LARGE model and contains about 343M parameters

      Model PreTraining
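
      Since the excerpt describes initializing everything from BERT except the new 2-D position embedding layer, here is a minimal PyTorch sketch of that kind of selective weight copy. The function name and the matching-by-name-and-shape strategy are illustrative assumptions, not the authors' released code.

      ```python
      import torch.nn as nn

      def init_from_bert(layout_model: nn.Module, bert_state_dict: dict):
          """Copy every BERT weight whose name and shape match the target model;
          anything BERT does not have (e.g. the 2-D position embedding tables)
          keeps its random initialization."""
          own_state = layout_model.state_dict()
          copied = {k: v for k, v in bert_state_dict.items()
                    if k in own_state and v.shape == own_state[k].shape}
          own_state.update(copied)
          layout_model.load_state_dict(own_state)
          return sorted(copied)  # names that were actually initialized from BERT
      ```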

    3. Document Layout Information. It is evident that the relative positions of words in a document contribute a lot to the semantic representation. Taking form understanding as an example, given a key in a form (e.g., “Passport ID:”), its corresponding value is much more likely on its right or below instead of on the left or above. Therefore, we can embed this relative position information as a 2-D position representation. Based on the self-attention mechanism within the Transformer, embedding 2-D position features into the language representation will better align the layout information with the semantic representation. Visual Information. Compared with the text information, the visual information is another significantly important feature in document representations. Typically, documents contain some visual signals to show the importance and priority of document segments. The visual information can be represented by image features and effectively utilized in document representations. For document-level visual features, the whole image can indicate the document layout, which is an essential feature for document image classification. For word-level visual features, styles such as bold, underline, and italic are also significant hints for the sequence labeling tasks. Therefore, we believe that combining the image features with traditional text representations can bring richer semantic representations to documents.

      Two types of features proposed: document layout information and visual information
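
      A rough sketch of how the 2-D position features described above can be added to a BERT-style input embedding (sum of token, 1-D position, and x/y coordinate embeddings). Table sizes and the integer coordinate normalization are illustrative assumptions.

      ```python
      import torch
      import torch.nn as nn

      class TwoDPositionEmbedding(nn.Module):
          """Sketch of LayoutLM-style input embeddings: token embedding plus 1-D
          sequence position plus 2-D box coordinates (x0, y0, x1, y1), summed."""
          def __init__(self, vocab_size=30522, hidden=768, max_seq=512, max_coord=1024):
              super().__init__()
              self.tok = nn.Embedding(vocab_size, hidden)
              self.pos = nn.Embedding(max_seq, hidden)
              self.x_emb = nn.Embedding(max_coord, hidden)
              self.y_emb = nn.Embedding(max_coord, hidden)

          def forward(self, token_ids, boxes):
              # token_ids: (B, T); boxes: (B, T, 4) long tensor with x0, y0, x1, y1 in [0, max_coord)
              B, T = token_ids.shape
              positions = torch.arange(T, device=token_ids.device).unsqueeze(0)
              e = self.tok(token_ids) + self.pos(positions)
              e = e + self.x_emb(boxes[..., 0]) + self.y_emb(boxes[..., 1])  # top-left corner
              e = e + self.x_emb(boxes[..., 2]) + self.y_emb(boxes[..., 3])  # bottom-right corner
              return e
      ```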

    4. , they usually leverage text information only for any kind of inputs. When it comes to visually rich documents, there is much more information that can be encoded into the pre-trained model. Therefore, we propose to utilize the visually rich information from document layouts and align them with the input texts.

      BERT usually leverages text information only; in visually rich documents (VRDs), much more information can be encoded into the pre-trained model. The visually rich information from document layouts is utilized and aligned with the input texts

    5. During the pre-training, the model uses two objectives to learn the language representation: Masked Language Modeling (MLM) and Next Sentence Prediction (NSP), where MLM randomly masks some input tokens and the objective is to recover these masked tokens, and NSP is a binary classification task taking a pair of sentences as inputs and classifying whether they are two consecutive sentences

      pretraining stage of BERT
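
      The MLM objective described above is easy to see in code. Below is a minimal BERT-style masking sketch; the 80/10/10 split of masked positions follows the original BERT recipe, and the function itself is illustrative.

      ```python
      import torch

      def mask_tokens(input_ids, mask_token_id, vocab_size, mlm_prob=0.15):
          labels = input_ids.clone()
          masked = torch.bernoulli(torch.full(input_ids.shape, mlm_prob)).bool()
          labels[~masked] = -100                      # loss is only computed on masked positions

          input_ids = input_ids.clone()
          replace = torch.bernoulli(torch.full(input_ids.shape, 0.8)).bool() & masked
          input_ids[replace] = mask_token_id          # 80% of targets: [MASK]

          random_tok = torch.bernoulli(torch.full(input_ids.shape, 0.5)).bool() & masked & ~replace
          input_ids[random_tok] = torch.randint(vocab_size, input_ids.shape)[random_tok]  # 10%: random token
          return input_ids, labels                    # remaining 10% keep the original token
      ```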

    6. BERT model is an attention-based bidirectional language modeling approach. It has been verified that the BERT model shows effective knowledge transfer from the self-supervised task with large-scale training data. The architecture of BERT is basically a multi-layer bidirectional Transformer encoder.

      Brief explanation about BERT Model

    7. Though these models have made significant progress in the document AI area with deep neural networks, most of these methods confront two limitations: (1) They rely on a few human-labeled training samples without fully exploring the possibility of using large-scale unlabeled training samples. (2) They usually leverage either pre-trained CV models or NLP models, but do not consider a joint training of textual and layout information. Therefore, it is important to investigate how self-supervised pre-training of text and layout may help in the document AI area

      Limitations of existing progress in the document AI field, which the LayoutLM paper is trying to solve

    8. this is the first time that text and layout are jointly learned in a single framework for document-level pre-training.

      Text and layout are jointly learned in a single framework for document-level pre-training

    9. LayoutLM to jointly model interactions between text and layout information across scanned document images, which is beneficial for a great number of real-world document image understanding tasks such as information extraction from scanned documents.

      LayoutLM models interactions between text and layout information across scanned document images

    Annotators

    1. Instead of constructing a grid on the character level and embedding each character with one-hot encoding as in Katti et al. (2018), we construct a grid on the word-piece level and embed with dense contextualized vectors from a BERT language model.

      Difference from chargrid in Katti et al.: BERTgrid constructs the grid on the word-piece level and embeds it with dense contextualized vectors from a BERT language model
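
      A small sketch of what such a word-piece grid could look like in code, assuming the OCR boxes and the contextualized BERT vectors are already available. The input format and the down-sampling factor are assumptions, not the paper's exact pipeline.

      ```python
      import numpy as np

      def build_bertgrid(word_pieces, page_h, page_w, down=8, dim=768):
          """Sketch of a BERTgrid tensor: an (H/down, W/down, dim) array in which
          every cell covered by a word piece's bounding box holds that piece's
          contextualized BERT embedding; empty cells stay zero. `word_pieces` is
          a list of (x0, y0, x1, y1, embedding) tuples in page pixel coordinates."""
          H, W = page_h // down, page_w // down
          grid = np.zeros((H, W, dim), dtype=np.float32)
          for x0, y0, x1, y1, emb in word_pieces:
              r0, r1 = y0 // down, max(y0 // down + 1, y1 // down)
              c0, c1 = x0 // down, max(x0 // down + 1, x1 // down)
              grid[r0:r1, c0:c1, :] = emb
          return grid
      ```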

    2. Chargrid (Katti et al. (2018)), followed more recently by CUTIE (Zhao et al. (2019)), construct a 2D grid of characters or words from a document and feed it into a neural model, thereby preserving the spatial arrangement of the document. The symbols in the original document are embedded in some vector space, yielding a rank-3 tensor (width, height, embedding). Both papers report significant benefits of using such a grid approach over purely sequential 1D input representations, especially for semantically understanding tabulated or otherwise spatially arranged text like line items

      Hybrid approach combining NLP and CV methods for document intelligence: Chargrid (Katti et al.) and CUTIE (Zhao et al.) construct a 2D grid of characters or words from a document and feed it into a neural model, preserving the spatial arrangement of the document

    3. Instead of working on the textual level, it is possible to directly apply methods from computer vision (CV) (e.g. Ren et al. (2015)) to work on the raw document pixel level which naturally retains the two-dimensional (2D) document structure

      It is possible to apply CV methods that naturally retain the 2-D structure of the document, but this is impractical because the machine learning model would first need to learn textual information from raw pixel data before learning semantics

    4. In classical natural language processing (NLP), however, the layout information is completely discarded as the document text is simply a sequence of words. Without access to the layout, a downstream task such as extraction of tabulated data can become much harder – and in some cases impossible to solve – since the necessary serialization may lead to severe information loss

      Downside of using classical NLP techniques only

    5. based on Chargrid by Katti et al. (2018), represents a document as a grid of contextualized word piece embedding vectors, thereby making its spatial structure and semantics accessible to the processing neural network

      BERTgrid is based on chargrid

    6. The contextualized embedding vectors are retrieved from a BERT language model. We use BERTgrid in combination with a fully convolutional network on a semantic instance segmentation task for extracting fields from invoices.

      Contextualized embedding vectors retrieved from BERT language model

    1. The encoder boils down to a VGG-type network (Simonyan and Zisserman, 2014) with dilated convolutions (Yu and Koltun, 2016), batch normalization (Ioffe and Szegedy, 2015), and spatial dropout (Tompson et al., 2015).

      Encoder similar to VGG-type network with dilated convolutions, batch normalization, spatial dropout
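
      For reference, a sketch of one encoder block in the spirit described above: 3x3 convolutions (optionally dilated), batch normalization, ReLU, and spatial (channel-wise) dropout. Channel sizes and the dropout rate are illustrative, not the paper's exact configuration.

      ```python
      import torch.nn as nn

      def vgg_dilated_block(in_ch, out_ch, dilation=1, p_drop=0.1):
          return nn.Sequential(
              nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=dilation, dilation=dilation),
              nn.BatchNorm2d(out_ch),
              nn.ReLU(inplace=True),
              nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=dilation, dilation=dilation),
              nn.BatchNorm2d(out_ch),
              nn.ReLU(inplace=True),
              nn.Dropout2d(p_drop),   # spatial dropout: drops whole feature maps
          )
      ```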

    2. there can be multiple and an unknown number of instances of the same class, we further perform instance segmentation. This means, in addition to predicting a segmentation mask, we may also predict bounding boxes using the techniques from object detection

      instance segmentation (in addition to predicting segmentation mask, also predict bounding boxes)

    3. 1-hot encoded chargrid representation g̃ as input to a fully convolutional neural network to perform semantic segmentation on the chargrid and predict a class label for each character-pixel on the document

      Semantic segmentation on the chargrid: predict a class label for each character-pixel on the document
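
      A minimal sketch of constructing such a 1-hot chargrid from character boxes; the input format and vocabulary handling are illustrative assumptions.

      ```python
      import numpy as np

      def build_chargrid(char_boxes, page_h, page_w, vocab):
          """Each pixel covered by a character box is set to that character's
          index; background stays 0. `char_boxes` is an assumed list of
          (char, x0, y0, x1, y1) tuples from OCR, and `vocab` maps characters
          to indices starting at 1."""
          index_grid = np.zeros((page_h, page_w), dtype=np.int64)
          for ch, x0, y0, x1, y1 in char_boxes:
              index_grid[y0:y1, x0:x1] = vocab.get(ch, 0)
          # expand to a 1-hot tensor of shape (page_h, page_w, len(vocab) + 1)
          return np.eye(len(vocab) + 1, dtype=np.float32)[index_grid]
      ```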

    4. positional information can come from an optical character recognition (OCR) engine, or can be directly extracted from the layout information in the document as provided by, e.g., PDF or HTML. The coordinate space of a character box is defined by page height H and width W, and is usually measured in units of pixels

      where the positional information come from

    5. chargrid can be constructed from character boxes, i.e., bounding boxes that each surround a single character somewhere on a given document page.

      bounding boxes surrounding single character

    6. Combining approaches from computer vision, NLP, and document analysis, our work is the first to systematically address the task of understanding 2D documents the same way as NLP while still retaining the 2D structure in structured documents

      Combining approaches from CV, NLP, and document analysis (retaining the 2D structure of structured documents)

    7. document understanding task as instance-level semantic segmentation on chargrid. More precisely, the model predicts a segmentation mask with pixel-level labels and object bounding boxes to group multiple instances of the same class.

      The task: instance-level semantic segmentation on the chargrid (segmentation mask plus bounding boxes)

    8. a novel paradigm for processing and understanding structured documents. Instead of serializing a document into a 1D text, the proposed method, named chargrid, preserves the spatial structure of the document by representing it as a sparse 2D grid of characters.

      Preserves the spatial structure of the document by representing it as a sparse 2D grid of characters

  2. Nov 2021
    1. To more closely mimic a real-world dataset, we perform data augmentation. We consider the following steps (the effects marked with a star* are based on the open source ocrodeg package): (1) Background: Natural images, gradient background, multiscale noise*, fibrous noise*, blobs*. (2) Distortions: Large 2D distortions*, Small 1D distortions*. (3) Projective transformations: Including rotation, skew, dilation, 3D perspective, etc. (4) Degradations: Gaussian or box blur; mode or median filters; contour, emboss, edges, smooth, gradient text. (5) dpi and compression: Down-scaling, jpeg compression. (6) Color: Equalize, Invert, Sharpness, Contrast, Brightness. A subset of these steps are randomly chosen and applied on any given document. With this, we generate 66,481 pages of synthetic document data

      https://github.com/NVlabs/ocrodeg

      Data Augmentation
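
      A sketch of the kind of "random subset of steps" pipeline described above, using plain Pillow operations. The starred effects in the quote come from the ocrodeg package and are not reproduced here; parameter ranges are illustrative assumptions.

      ```python
      import io
      import random
      from PIL import Image, ImageEnhance, ImageFilter, ImageOps

      def augment_page(img: Image.Image) -> Image.Image:
          """Apply a random subset of simple degradations to a document page."""
          def jpeg_compress(im):
              buf = io.BytesIO()
              im.convert("RGB").save(buf, format="JPEG", quality=random.randint(30, 70))
              buf.seek(0)
              out = Image.open(buf)
              out.load()
              return out

          steps = [
              lambda im: im.rotate(random.uniform(-3, 3), expand=True, fillcolor="white"),
              lambda im: im.filter(ImageFilter.GaussianBlur(random.uniform(0.3, 1.2))),
              lambda im: ImageEnhance.Contrast(im).enhance(random.uniform(0.7, 1.3)),
              lambda im: ImageEnhance.Brightness(im).enhance(random.uniform(0.8, 1.2)),
              lambda im: ImageOps.equalize(im.convert("RGB")),
              lambda im: im.resize((im.width // 2, im.height // 2)).resize((im.width, im.height)),
              jpeg_compress,
          ]
          for step in random.sample(steps, k=random.randint(1, 3)):
              img = step(img)
          return img
      ```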

  3. Oct 2021
    1. In this paper, we present a novel deep learning based preprocessing method to jointly detect and deskew documents in digital images

      Preprocessing method proposed in this paper: deep learning based, jointly detects & deskews documents that are skewed (slightly rotated or on cluttered backgrounds) to improve OCR performance. The method was tested on a dataset of cash receipt photos
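
      For context only, a classical projection-profile baseline that illustrates the deskewing task itself; this is not the paper's deep-learning method. It rotates a binarized page over candidate angles and picks the angle whose horizontal projection (row sums) has the highest variance, i.e. the sharpest text lines.

      ```python
      import numpy as np
      from PIL import Image

      def estimate_skew(img: Image.Image, max_angle=10.0, step=0.5) -> float:
          gray = np.asarray(img.convert("L"), dtype=np.float32)
          binary = ((gray < 128).astype(np.uint8)) * 255       # dark pixels = text
          best_angle, best_score = 0.0, -1.0
          for angle in np.arange(-max_angle, max_angle + step, step):
              rotated = Image.fromarray(binary).rotate(angle, expand=False, fillcolor=0)
              profile = np.asarray(rotated).sum(axis=1)        # row sums
              score = profile.var()
              if score > best_score:
                  best_angle, best_score = float(angle), score
          return best_angle

      # usage: deskewed = img.rotate(-estimate_skew(img), expand=True, fillcolor="white")
      ```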

    Annotators

    1. A.1.2 CORD, CORD+, CORD++, and CORD-M for receipt IE
      CORD and its variants consist of 30 information categories such as menu name, count, unit price, price, and total price (Table 6). The fields are further grouped and form the information layer at a higher level.
      A.1.3 Receipt-idn for receipt IE
      Receipt-idn is similar to CORD but includes more diverse information categories (50) such as store name, store address, and payment time (Table 6).
      A.1.4 namecard for name card IE
      namecard consists of 12 field types, including name, company name, position, and address (Table 6). The task requires grouping and ordering of tokens for each field. Although there is only a single information layer (field), careful handling of complex spatial relations is required due to the large degree of freedom in the layout.
      A.1.5 Invoice for invoice IE
      Invoice consists of 62 information categories such as item name, count, price with tax, item price without tax, total price, invoice number, invoice date, vendor name, and vendor address (Table 6). Similar to receipts, their hierarchical information is represented via inter-field grouping.
      A.1.6 FUNSD for general form understanding
      The FUNSD form understanding task consists of two sub-tasks: entity labeling (ELB) and entity linking (ELK). In ELB, tokens are classified into one of four fields (header, question, answer, and other) while serializing tokens within each field. Both subtasks assume that the input tokens are perfectly serialized with no OCR error. To emphasize the importance of correct serialization in the real world, we prepare two variants of ELB tasks: ELB-R and ELB-S. In ELB-R, the whole documents are randomly rotated by a degree of -20°–20° and the input tokens are serialized using rotated y-coordinates. In the ELB-S task, the input tokens are randomly shuffled. In both tasks, the relative order of the input tokens within each field remains unchanged. In the ELK task, tokens are linked based on their key-value relations (inter-grouping between fields). For example, each “header” is linked to the corresponding “question”, and “question” is paired with the corresponding “answer”

      Datasets explained

    2. The internal datasets Receipt-idn, namecard and Invoice are annotated by the crowd through an in-house web application following (Park et al., 2019; Hwang et al., 2019). First, each text segment is labeled (bounding box and the characters inside) for the OCR task. The text segments are further grouped according to their field types by the crowds. For Receipt-idn and Invoice, additional group-ids are annotated to each field for inter-grouping of them. The text segments placed on the same line are also annotated through row-ids. For quality assurance, the labeled documents are cross-inspected by the crowds.

      Datasets (1): annotation process for the internal datasets

    3. To extract the visually embedded texts from an image, we use our in-house OCR system that consists of CRAFT text detector (Baek et al., 2019b) and Comb.best text recognizer (Baek et al., 2019a). The OCR models are finetuned on each of the document IE datasets. The output tokens and their spatial information on the image are used as the inputs to SPADE

      Section 5.1 OCR Experimental setup

    4. To perform the spatial dependency parsing task introduced in the previous section in an end-to-end fashion, we propose SPADE that consists of (1) spatial text encoder, (2) graph generator, and (3) graph decoder. Spatial text encoder and graph generator are trained jointly. Graph decoder is a deterministic function (without trainable parameters) that maps the graph to a valid parse of the output structure

      Model SPADE:

      1. Spatial text encoder
      2. Graph generator
      3. Graph decoder

      Spatial text encoder and graph generator trained jointly. Graph decoder is a deterministic function mapping graph to valid parse of output structure.
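
      As a rough illustration of the graph generator idea above: score every ordered pair of token encodings and threshold into a directed adjacency matrix. This is a generic sketch of the concept, not SPADE's exact parameterization.

      ```python
      import torch
      import torch.nn as nn

      class RelationScorer(nn.Module):
          def __init__(self, hidden=768):
              super().__init__()
              self.head = nn.Linear(hidden, hidden)
              self.tail = nn.Linear(hidden, hidden)

          def forward(self, token_enc, threshold=0.5):
              # token_enc: (T, hidden) spatially-aware token encodings
              h = self.head(token_enc)            # (T, hidden)
              t = self.tail(token_enc)            # (T, hidden)
              scores = torch.sigmoid(h @ t.T)     # (T, T) pairwise relation probabilities
              adjacency = scores > threshold      # directed edges token_i -> token_j
              return scores, adjacency
      ```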

    5. In short, our contributions are threefold. (1) We present a novel view that information extraction for semi-structured documents can be formulated as a dependency parsing problem in two-dimensional space. (2) We propose SPADE for spatial dependency parsing, which is capable of efficiently constructing a directed semantic graph of text tokens in semi-structured documents. (3) SPADE achieves a similar or better accuracy than the previous state of the art or strong BERT-based baselines in eight document IE datasets

      Contributions of this paper:

      1. Formulating IE for semi-structured documents as a dependency parsing problem in two-dimensional space.

      2. SPADE for spatial dependency parsing, capable of efficiently constructing directed semantic graph of text tokens in semi-structured documents.

      3. SPADE achieving similar/better accuracy than previous SOTA or strong BERT-based baselines in eight document IE datasets

    6. While effective for relatively simple documents, their broader application in the real world is still challenging because (1) semi-structured documents often exhibit a complex layout where the serialization algorithm is non-trivial, and (2) sequence tagging is inherently not effective for encoding multi-layer hierarchical information such as the menu tree in receipts

      Broader application in real world is still challenging due to complex layout of semi-structured documents, ineffectiveness of sequence tagging for encoding multi-layer hierarchical information like menu tree in receipts

    7. To tackle these issues, we first formulate the IE task as a spatial dependency parsing problem that focuses on the relationship among text tokens in the documents. Under this setup, we then propose SPADE (SPAtial DEpendency parser) that models highly complex spatial relationships and an arbitrary number of information layers in the documents in an end-to-end manner. We evaluate it on various kinds of documents such as receipts, name cards, forms, and invoices, and show that it achieves a similar or better performance compared to strong baselines including a BERT-based IOB tagger

      Solution proposed: Formulating IE task as spatial dependency parsing problem that focuses on relationship among text tokens in documents, and creating Spatial Dependency parser

    1. In this paper, we presented an end-to-end network to bridge the text reading and information extraction for document understanding. These two tasks can mutually reinforce each other through joint training. The visual and textual features of text reading can boost the performances of information extraction while the loss of information extraction can also supervise the optimization of text reading. On a variety of benchmarks, from structured to semi-structured text type and fixed to variable layout, our proposed method significantly outperforms three state-of-the-art methods in both aspects of efficiency and accuracy

      Conclusion: TReading (text reading) and IE reinforce each other through joint training. The visual and textual features of TReading boost the performance of IE, while the loss of IE helps optimize TReading, on a variety of benchmarks from structured to semi-structured text and fixed to variable layout.

      Link to Presentation Video: https://dl.acm.org/doi/10.1145/3394171.3413900

    2. We find that GCN(TR) has difficulty in adapting to such flexible layout. Chargrid(TR) obtains impressive performances on isolated entities such as Name, Phone and Education period. Since University and Major entities are often blended with other texts, Chargrid(TR) may fail to extract these entities. As expected, NER(TR) performs better on this dataset, thanks to the inherent serializable property, while it is inferior to Chargrid(TR) on isolated entities due to the missing layout information. Our model inherits the advantages of both Chargrid(TR) and NER(TR), providing context features for identifying entities and performing entity extraction at the character level. In short, our model gets comprehensive gain.

      Evaluation on Resumes: Chargrid obtains impressive performance on isolated entities, but may fail on entities blended with other text, like University and Major. NER performs better thanks to the inherent serializable property, but is inferior to Chargrid on isolated entities (missing layout information). The proposed model inherits the advantages of both Chargrid and NER

    3. Character-Word LSTM is similar to NER [24], which applies LSTM on character and word level sequentially. LayoutLM [54] makes use of large pre-training data and fine-tunes on SROIE. Similar to LayoutLM, PICK [58] extracts rich semantic representation containing the textual and visual features as well as global layout. Compared with these methods, our model shows competitive performance.

      Evaluation on SROIE dataset (Setting 2): the officially provided ground-truth text bounding boxes and transcripts are used

    4. Evaluation on SROIE: We perform two sets of experiments and the results are as shown in Table 4. Setting 1: We train text reading module all by ourselves and report comparisons. Notice that, we do not employ tricks of data synthesis and model ensemble in the training of text reading. Since entities of ‘Company’ and ‘Total’ often have distinguishing visual features (e.g., bold type or large font), as shown in Fig. 1(b), benefiting from fusion of visual and textual features, our model outperforms three counterparts by a large margin.

      Evaluation on SROIE dataset (Setting 1)

    5. Evaluation on Taxi Invoices: In this dataset, the noise of low quality and taint may lead to failures of detection and recognition of entities. Besides, the contents may be misplaced, e.g. the content of Pick-up time may appear after the ‘Date’. Table 3 shows the results. We see that our model outperforms counterparts by significant margins except for the Pick-up time (illustrated in the tail of the paragraph). Concretely, NER(TR) discards the layout information and serializes all texts into one-dimensional text sequences, reporting inferior performance than other methods. Benefiting from the layout information, Chargrid(TR) and GCN(TR) work much better. However, Chargrid(TR) conducts a pixel segmentation task and is prone to omit characters or include extra characters. For GCN(TR), it only exploits the positions of text segments. Obviously, our TRIE has the ability to boost performances by using more useful visual features in VRDs. In addition, we attribute the only slightly lower score of the Pick-up entity compared with GCN(TR) to the annotations. For example in Fig. 4, when an entity such as the Pick-up time ‘18:47’ is too blurred to read, it is tagged as NULL. However, our model can still correctly read and extract this entity, which leads to lower statistics

      Results on Taxi Invoices

    6. We validate our model on three real-world datasets. One is the public SROIE [20] benchmark, and the other two are self-built datasets, Taxi Invoices and Resumes, respectively. Note that the three benchmarks differ largely in layout and text type, from fixed to variable layout and from structured to semi-structured text.

      Datasets: SROIE, Taxi Invoice & Resume (Self-built, in Chinese)

      • SROIE dataset (ICDAR 2019 Challenge) consisting of 626 receipts (train) and 347 receipts (testing) with 4 entities. Variable layouts and structured text.

      • The Taxi Invoice (consists of 5000 images and has 9 entities) can be grouped into roughly 13 templates, with fixed layout and structured text type.

      • The Resumes dataset is of Chinese (consists of 2475 scanned resumes), with 6 entities to extract, variable layouts and semi-structured text.

    7. Both the context and textual features matter in entity extraction. The context features (including both visual context features C and textual context features C̃) provide necessary information to tell entities apart, while the textual features Z enable entity extraction in the character granularity, as they contain semantic features for each character in the text. So we first perform multimodal fusion of visual context features C and textual context features C̃, which are further combined with textual features Z to extract entities

      Section 3.4: IE Module
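
      A hedged sketch of the fusion step described above: combine visual context features C and textual context features C̃ per text segment, then concatenate the fused context onto the character-level textual features Z before tagging. Dimensions and the concatenation scheme are illustrative assumptions, not the paper's exact design.

      ```python
      import torch
      import torch.nn as nn

      class ContextFusion(nn.Module):
          def __init__(self, d_ctx=256, d_char=256, num_tags=10):
              super().__init__()
              self.fuse = nn.Linear(2 * d_ctx, d_ctx)
              self.tagger = nn.Linear(d_ctx + d_char, num_tags)

          def forward(self, C, C_tilde, Z):
              # C, C_tilde: (num_texts, d_ctx); Z: (num_texts, max_chars, d_char)
              ctx = torch.relu(self.fuse(torch.cat([C, C_tilde], dim=-1)))  # (num_texts, d_ctx)
              ctx = ctx.unsqueeze(1).expand(-1, Z.size(1), -1)              # broadcast to characters
              return self.tagger(torch.cat([ctx, Z], dim=-1))               # per-character tag logits
      ```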

    8. Textual Context. Unlike visual context which focuses on local visual patterns, textual context models the fine-grained long-distance dependencies and relationships between texts, providing complementary context information. Inspired by [10, 27, 34], we apply the self-attention mechanism to extract textual context features, supporting a variable number of texts

      Textual Context (part of Multimodal Context Block)
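
      A minimal sketch of extracting such textual context with self-attention over the per-text embeddings (a variable number of texts per document), using PyTorch's built-in multi-head attention purely as an illustration; dimensions are assumptions.

      ```python
      import torch.nn as nn

      attn = nn.MultiheadAttention(embed_dim=256, num_heads=4, batch_first=True)

      def textual_context(text_feats, padding_mask=None):
          # text_feats: (batch, num_texts, 256); padding_mask: (batch, num_texts), True where padded
          ctx, _ = attn(text_feats, text_feats, text_feats, key_padding_mask=padding_mask)
          return ctx  # (batch, num_texts, 256) context-enriched features C̃
      ```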

    9. Visual Context. As mentioned, visual details such as the obvious color, font, layout and other informative features are equally important as textual details (text content) for document understanding. A natural way of capturing the local visual context of a text is to resort to a convolutional neural network. Different from [54] which extracts these features from scratch, we directly reuse C = (c1, c2, ..., cm) produced by the text reading module. Thanks to the deep backbone and lateral connections introduced by FPN, each ci summarizes the rich local visual patterns of the i-th text

      Visual Context (part of Multimodal Context Block)

    10. we design a multimodal context block to consider position features, visual features and textual features all together. This block provides both visual context and textual context of a text, which are complementary to each other and further fused in the information extraction module

      On the Multimodal Context Block used

    11. Specifically, in text reading, the network takes the original image as input and outputs text region coordinate positions. Once the positions are obtained, we apply RoIAlign [15] on the shared convolutional features to get text region features.

      Text reading module: RoIAlign on shared convolutional features to get text region features
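
      A sketch of that RoIAlign step using torchvision's roi_align: crop a fixed-size feature patch for each detected text region from the shared feature map. The tensor shapes and box coordinates are illustrative.

      ```python
      import torch
      from torchvision.ops import roi_align

      features = torch.randn(1, 256, 128, 128)          # (B, C, H/stride, W/stride), illustrative
      boxes = [torch.tensor([[10., 20., 90., 40.],      # (x0, y0, x1, y1) per text region,
                             [15., 60., 120., 80.]])]   # here given in feature-map coordinates
      text_region_feats = roi_align(features, boxes, output_size=(8, 32), spatial_scale=1.0)
      # -> (num_boxes, 256, 8, 32), one feature tensor per detected text line
      ```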

    12. Text reading module is responsible for localizing and recognizing all texts in document images and information extraction module is to extract entities of interest from them. The multimodal context block is novelly designed to bridge the text reading and information extraction modules.

      Overview of overall architecture

    13. [4] localized, recognized and classified each word in the document. Since it worked in the word granularity, it not only required much more labeling efforts (positions, content and category of each word) but also had difficulties in extracting those entities which were embedded in word texts (e.g. extracting ‘51xxxx@xxx.com’ from ‘153-xxx97|51xxxx@xxx.com’).

      Section: 2.2 Information Extraction

    14. Two related concurrent works were presented in [4, 14]. [14] proposed an entity-aware attention text extraction network to extract entities from VRDs. However, it could only process documents of relatively fixed layout and structured text, like train tickets, passports and business cards

      Section: 2.2 Information Extraction

    15. Though rule-based methods work in some cases, they rely heavily on the predefined rules, whose design and maintenance usually require deep expertise and large time cost. Besides, they cannot generalize across document templates.

      Re: Information Extraction
      Disadvantages of rule-based methods for IE tasks

    16. CRNN framework

      Convolutional Recurrent Neural Network (CRNN): a CNN (convolutional neural network) followed by an RNN (recurrent neural network). Referenced by this paper because it improves sequential recognition of text lines.
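
      A minimal CRNN sketch to make the idea concrete: a small CNN collapses a text-line image into a feature sequence, a bidirectional LSTM models the sequence, and a linear layer emits per-step character logits (e.g. for a CTC loss). Layer sizes are illustrative, not the referenced model's exact configuration.

      ```python
      import torch.nn as nn

      class TinyCRNN(nn.Module):
          def __init__(self, num_classes, img_h=32):
              super().__init__()
              self.cnn = nn.Sequential(
                  nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),
                  nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),
              )
              self.rnn = nn.LSTM(128 * (img_h // 4), 256, bidirectional=True, batch_first=True)
              self.fc = nn.Linear(512, num_classes)

          def forward(self, x):                       # x: (B, 1, img_h, W)
              f = self.cnn(x)                         # (B, 128, img_h/4, W/4)
              f = f.permute(0, 3, 1, 2).flatten(2)    # (B, W/4, 128 * img_h/4)
              seq, _ = self.rnn(f)                    # (B, W/4, 512)
              return self.fc(seq)                     # per-timestep character logits
      ```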

    17. Taxi Invoices consists of 5000 images and has 9 entities to extract (Invoice Code, Invoice Number, Date, Pick-up time, Drop-off time, Price, Distance, Waiting, Amount). The invoices are in Chinese and can be grouped into roughly 13 templates. So it is a kind of document with fixed layout and structured text type.
      • SROIE [20] is a public dataset for receipt information extraction in the ICDAR 2019 Challenge. It contains 626 receipts for training and 347 receipts for testing. Each receipt is labeled with four types of entities, which are Company, Date, Address and Total. It has variable layouts and structured text.
      • Resumes is a dataset of 2475 Chinese scanned resumes, which has 6 entities to extract (Name, Phone Number, Email Address, Education period, Universities and Majors). As an owner can design his own resume template, this dataset has variable layouts and semi-structured text.

      Dataset Summary

    18. all the above works inevitably have the following three limitations. (1) VRD understanding requires both visual and textual features, but the visual features they exploited are limited. (2) Text reading and information extraction are highly correlated, but the relations between them have rarely been explored. (3) The stagewise training strategy of text reading and information extraction brings redundant computation and time cost.

      Limitations of existing works mentioned

    19. text reading includes text detection and recognition in images, which belongs to the optical character recognition (OCR) research field and has already been widely used in many Computer Vision (CV) applications [12, 42, 52]

      Text Reading includes text detection & recognition in images (OCR)

    20. the multimodal visual and textual features of text reading are fused for information extraction and in turn, the semantics in information extraction contribute to the optimization of text reading.

      TRIE: END-to-END Text Reading and Information Extraction for Document Understanding

    21. text reading and information extraction are mutually correlated.
      • TRIE: END-to-END Text Reading and Information Extraction for Document Understanding

      Approach: unified end-to-end text reading and information extraction network, text reading & IE supplement each other.