The dataset consists of 6,700 annotated real business documents and 100,000 synthetically generated documents with labels for practical information extraction tasks. In addition, it comes with a dataset of about 1 million unlabeled documents that can be used for unsupervised learning.
Rossum, an Intelligent Document Processing (IDP) vendor, announced on February 27 that it has published DocILE (Document Information Localization and Extraction), which it describes as the world’s largest research dataset and benchmark. In doing so, Rossum aims to accelerate scientific progress in business document information extraction (IE) and advance research in the field of Intelligent Document Processing as a whole.
According to Rossum, DocILE is a large-scale research benchmark for cross-evaluation of machine learning methods for key information localization and extraction (KILE) and line item recognition (LIR) from semi-structured business documents such as invoices and purchase orders. Large datasets are critical for improving and measuring the performance of AI models, which is why the DocILE benchmark is particularly important.
Datasets and benchmarks for business document IE are rare, because such documents often contain sensitive information and are legally protected. DocILE addresses this problem with a benchmark built from documents in two public data sources (UCSF Industry Documents Library and Public Inspection Files).
Head of Rossum’s AI Labs, Milan Šulc, Ph.D., commented: “This is an important milestone because it advances IDP research as a whole, where everyone can now develop and test more advanced algorithms on a benchmark of challenging and highly practical tasks. The new dataset will increase accuracy levels in document information extraction by accelerating research in areas such as novel machine learning architectures and training objectives. This will ultimately lead to global optimization of business communication and workflows, further increasing the amount of the time saved for our customers.”
London and Prague-based Rossum offers an Intelligent Document Processing (IDP) solution with advanced data extraction capabilities through its cloud-native platform that helps companies in a variety of industries reduce manual effort, eliminate errors and improve turnaround times.