Overview

In an increasingly complex and interconnected digital information landscape, organizations can benefit from acquiring and integrating new data of interest to their business.

The heterogeneity of sources and formats, combined with the sheer volume of material involved, risks making such a process onerous to implement: although often simple at its core, it must keep pace with the continuous evolution of the sources that drive it.

In fact, it is not uncommon to see manual operations used to derive value from unstructured information, as the most immediate workaround in the absence of adequate technological support. This approach, however, demands significant time and cost, and exposes the process to data-entry errors owing to the often tedious, repetitive nature of the collection tasks.

To date, several software solutions exist that can extract information from unstructured sources, automating the process through a predefined series of steps. These solutions, however, often require substantial and continuous tuning effort and follow an essentially static approach. This is the case for solutions that can recognize the occurrence of a specific pattern (e.g. a company's VAT number) in a predefined position (e.g. at the top right of an invoice header) but lose effectiveness if the data appears in a different position or under a different, albeit semantically similar, label (e.g. synonyms or abbreviations: VAT number vs. VAT code, etc.). These situations can of course be handled by extending existing algorithms to incorporate known variations, but covering every possible variation this way is impossible.

Thanks to the support of AI, it is now possible to quickly implement more flexible and effective solutions.

Challenges

Traditional content-harvesting solutions suffer from multiple limitations:

  • Risk of errors and inaccuracies when using manual operations
  • Difficulty in supporting the dynamism of contexts, both in terms of content format and in terms of the language used in the content itself
  • Rigidity of existing software solutions, which lose effectiveness as soon as content deviates from predefined patterns

Solutions

Given the context and complexities described above, as well as the potential offered by modern LLM systems, the document intelligence process can be redesigned around the use of Artificial Intelligence, as in the architecture that follows.

Smart Content Harvesting solution (architecture diagram)

The first step of the solution involves standardizing the input data, which may arrive in various forms as the output of more or less complex processes: text documents, images or, in more modern contexts, the results of web-scraping activities. Nor should the more peculiar case of data contained in paper documents be excluded.

This heterogeneity, although it must be taken into account, should not be seen as an obstacle, nor does it imply burdensome pre-processing. Pre-processing, in fact, reduces to a single conversion of the source data into PDF format, an activity that can be performed programmatically and with extremely low effort for any of the data types described above.
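
As a hedged sketch, the conversion step might look like the following, assuming the img2pdf and weasyprint libraries (one of many possible choices); office formats such as docx can be handled with analogous tools.

```python
# Conversion sketch: normalize heterogeneous inputs to PDF.
# Assumes img2pdf and weasyprint are installed; these are example
# choices, not the only ones.
import img2pdf
from weasyprint import HTML

def image_to_pdf(image_path: str, pdf_path: str) -> None:
    """Wrap a scanned page or photo into a single-page PDF."""
    with open(pdf_path, "wb") as f:
        f.write(img2pdf.convert(image_path))

def webpage_to_pdf(url: str, pdf_path: str) -> None:
    """Render the output of a web-scraping step as a PDF."""
    HTML(url).write_pdf(pdf_path)
```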

Converting the document to PDF becomes an enabling condition for the subsequent OCR step which, transparently, extracts the textual information from paper documents, images or web pages, making the text accessible and interpretable by Artificial Intelligence models and, specifically, by LLMs.

From a technological standpoint, the solution described so far can be implemented in multiple ways, exploiting for example the services provided by the main cloud providers, such as Amazon Textract on AWS or Azure Document Intelligence, and can therefore adapt elastically to the technology stacks of different companies.
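
By way of illustration, on AWS the OCR step might be a thin wrapper around Amazon Textract via boto3; the minimal sketch below assumes configured AWS credentials and a single-page document.

```python
# Minimal OCR sketch using Amazon Textract via boto3; Azure Document
# Intelligence offers equivalent capabilities on the Microsoft stack.
import boto3

def extract_text(document_bytes: bytes) -> str:
    """Return the plain text detected in the document."""
    client = boto3.client("textract")
    response = client.detect_document_text(Document={"Bytes": document_bytes})
    return "\n".join(
        block["Text"]
        for block in response["Blocks"]
        if block["BlockType"] == "LINE"
    )
```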

Once the textual content has been extracted, we can move into the LLM sphere to understand its meaning and extract the portion of interest, potentially reworked into a format easily usable by downstream systems. Controlling the format of the output returned by the LLM is a problem that should not be underestimated. In our architecture, we suggest structuring the output and transferring it into storage solutions that enable subsequent analysis and allow multiple third-party consumers to use it in parallel, while the original document is archived for future reuse.
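
One common way to control the output format is to ask the model explicitly for JSON and validate it on receipt. The sketch below assumes a hypothetical llm_complete client function, and the invoice fields are an example schema chosen purely for illustration.

```python
# Sketch of output-format control: request JSON explicitly, then
# validate. `llm_complete` is a placeholder for the chosen LLM client.
import json

PROMPT_TEMPLATE = """Extract the following fields from the invoice text.
Answer ONLY with a JSON object with keys "vat_number", "issue_date"
and "total_amount" (use null for missing fields).

Invoice text:
{document_text}
"""

def extract_fields(document_text: str) -> dict:
    raw = llm_complete(PROMPT_TEMPLATE.format(document_text=document_text))
    return json.loads(raw)  # fails fast if the model drifts from JSON
```

The parsed record can then be written to the chosen storage alongside the archived original document.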

LLMs can respond to the questions put to them with linguistic precision and accuracy, drawing on the vast knowledge acquired during the training phase. This skill not only allows them to provide linguistically correct answers, but also to present information that is pertinent and relevant to the interlocutor and to the questions asked.

This approach, however, is not free from problems; among them is the knowledge cutoff: the model knows nothing of facts or data that emerged after its training data was collected.

Prompt engineering is the process of organizing the conversation with the model to guide it towards accurate and relevant answers. Since LLMs are probabilistic models, the structuring of the context, and consequently of the conversation, directly influences the quality and correctness of the answers.
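
For illustration, a context might be structured as a role instruction, one worked example (few-shot) and then the actual request; the chat-message format below is generic and not tied to any specific provider's API.

```python
# Illustrative context structure: system role, few-shot example,
# actual request. All content here is invented for the example.
def build_messages(excerpt: str, question: str) -> list[dict]:
    return [
        {"role": "system",
         "content": "Answer strictly from the document excerpt provided."},
        # Worked example showing the expected behaviour.
        {"role": "user",
         "content": "Excerpt: 'VAT code: IT01234567890'.\n"
                    "Question: what is the supplier's VAT number?"},
        {"role": "assistant", "content": "IT01234567890"},
        # The actual request.
        {"role": "user",
         "content": f"Excerpt: {excerpt}\nQuestion: {question}"},
    ]
```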

Modern LLMs, however, can only consider a limited amount of text when generating an answer (the context window), so it is not feasible to pour the entire corpus of company data into the prompt and ask the LLM questions over a potentially wide range of information.

The prompt engineering techniques used to make LLMs reason over, and answer on, data absent from their training set must therefore contend with this limit, finding a way to select only the information relevant to the question being asked and pass it to the prompt.

In this sense, Retrieval Augmented Generation (RAG) is currently one of the most popular approaches. It is based on architectures capable of cataloguing the data of interest and retrieving it according to the input question, and it allows the linguistic capabilities of the LLM to be used to interact with data outside its own information assets, without extending the initial training, providing not only the information relating to the domain of interest but also the rules that govern it. This is where Knowledge Graphs (KG) come into play.

A KG is a way of representing information that maps data and metadata into a single structure. Metadata represents the key concepts of a company, their attributes and the relationships between them (e.g. the definition of “customer” or of a process); the conceptual-level representation model is called an Ontology. Data, by contrast, corresponds to what is actually present in the information systems (e.g. tables and the rows within them).

In a KG, associations are defined both among metadata and between data and metadata, making it possible to move from abstract concepts to physical data and vice versa. This association is called Semantic Linking.
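
As a toy illustration, both layers and their linking can be pictured as triples; every name below is invented for the example.

```python
# Toy illustration: the two layers of a Knowledge Graph as
# (subject, predicate, object) triples.
ontology = [                                  # metadata: concepts and relations
    ("Customer", "is_a", "Concept"),
    ("Invoice", "is_a", "Concept"),
    ("Invoice", "issued_to", "Customer"),
]
data = [                                      # data: rows in information systems
    ("crm.customers.row_42", "instance_of", "Customer"),   # Semantic Linking
    ("erp.invoices.row_7", "instance_of", "Invoice"),
    ("erp.invoices.row_7", "issued_to", "crm.customers.row_42"),
]
```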

Given a KG, it is possible to pass the entire ontology to the LLM prompt together with the question and ask the model to explore the concepts, indicating which of them, and which data associated with them, could be relevant to the question posed. The identified data can then be extracted and fed back into the prompt to obtain a pertinent answer. This approach first leverages the LLM's task-solving abilities to understand which data is relevant and to generate the queries needed to retrieve it from the information systems where it is stored; it then leverages the model's language skills to synthesize an answer based on the data extracted in the previous step, as sketched below.
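
A minimal sketch of this two-step flow follows, where llm_complete and run_query are placeholders for the chosen LLM client and the company's data-access layer.

```python
# Two-step sketch of the KG-guided approach; `llm_complete` and
# `run_query` are placeholders, not real library calls.
def answer_with_kg(question: str, ontology_text: str) -> str:
    # Step 1: the LLM explores the ontology and generates a query
    # for the data linked to the concepts it deems relevant.
    query = llm_complete(
        f"Ontology:\n{ontology_text}\n\nQuestion: {question}\n"
        "Return only a SQL query retrieving the data needed to answer."
    )
    records = run_query(query)  # validate generated queries before execution
    # Step 2: the LLM synthesizes an answer from the extracted data.
    return llm_complete(
        f"Data:\n{records}\n\nQuestion: {question}\n"
        "Answer using only the data above."
    )
```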

Often, however, it is not possible to pass the entire ontology to the LLM, and dynamically determining its most relevant portion requires comparing its concepts with the context of the prompt, a non-trivial operation in itself.

To carry out this activity, it is possible to adopt a VectorDB, a technology specifically designed to store, query and extract vector data, which therefore excels at similarity searches based on embeddings, an operation often referred to as Semantic Search. In practice, semantically similar vectors are mapped close to each other in the embedding space, so a proximity search between vectors translates into a semantic search.
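
The geometric intuition can be captured in a few lines: the cosine similarity between two embedding vectors approximates how close they are in meaning.

```python
# Semantic search in miniature: proximity in embedding space
# stands in for similarity in meaning.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```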

A VectorDB is populated by processing the operational, heterogeneous, often unstructured data of the company context, which is converted into embeddings via dedicated models. The resulting embeddings can then be indexed within the VectorDB to facilitate subsequent searches.
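
A population sketch follows, using FAISS as a stand-in for a full VectorDB and a placeholder embed function for the chosen embedding model; the document strings are invented examples.

```python
# Population sketch: FAISS stands in for a full VectorDB, and `embed`
# is a placeholder for the chosen embedding model.
import faiss
import numpy as np

documents = [
    "Invoice 2024/07 issued to ACME S.p.A., total 1,200 EUR",  # example content
    "Customer ACME S.p.A., VAT number IT01234567890",
]
vectors = np.array([embed(doc) for doc in documents], dtype="float32")
faiss.normalize_L2(vectors)            # cosine similarity via inner product
index = faiss.IndexFlatIP(vectors.shape[1])
index.add(vectors)                     # index the embeddings for later search
```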

To date, the model dedicated to producing embeddings can be chosen from an ever-wider panorama of libraries and services offered by the main cloud providers. It is important to underline, however, that different models produce embeddings in distinct and generally non-comparable spaces: changing model usually means repopulating the entire VectorDB.

With the VectorDB thus prepared, the inference phase can begin: the LLM receives, in addition to the question asked about the document under examination, additional context that mitigates hallucinations and enables an accurate answer. This context is produced by the integrated VectorDB, on which similarity searches are performed against the embeddings obtained from the input question itself.
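
Putting the pieces together, the inference step might look like the sketch below, which reuses the index, documents, embed and llm_complete placeholders introduced in the sketches above.

```python
# Inference sketch: retrieve the nearest chunks and pass them to the
# LLM as grounding context; all names reuse the earlier placeholders.
def rag_answer(question: str, top_k: int = 3) -> str:
    query = np.array([embed(question)], dtype="float32")
    faiss.normalize_L2(query)
    _, ids = index.search(query, top_k)   # similarity search in the VectorDB
    context = "\n---\n".join(documents[i] for i in ids[0])
    return llm_complete(
        f"Context:\n{context}\n\nQuestion: {question}\n"
        "Answer using only the context above."
    )
```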

In the solution described above, the Knowledge Graph and the VectorDB work synergistically to support the LLM in providing information that is accurate and relevant to the business context.

Benefits

  • Applying OCR to documents converted to PDF makes the solution applicable to any type of input (docx, web pages, images, …)
  • Significant reduction in development and fine-tuning efforts
  • Significant reduction or elimination of manual intervention in the process
  • Scalability of the approach to a larger number of sources and formats with little or no impact

Need personalised advice? Contact us to find the best solution!
