Context

The Fourth Industrial Revolution is strongly characterized by the desire to make the most of digital technology, not only to optimize business processes but above all to enable new data-driven services for the company. In this new era of digitization, the IT ecosystem is built around the centrality of data, treated as first-class company assets that can be reused across multiple use cases, both operational and analytical.


Maximizing the value extracted from data offers several sources of competitive advantage: enhancing the services provided to end customers, generating new insights, predicting future business trends with advanced artificial intelligence-based analysis techniques, improving integration between traditional software systems and new digital (web and mobile) applications, and creating new revenue opportunities through data sharing and monetization.

To achieve these objectives, it is necessary to have an integration platform that facilitates access to, sharing of, and use of data by applications other than those that generated it. This scenario represents a turning point compared to the past, when IT architectures were designed with an approach that gave more weight to investment in domain applications (Systems of Record) than to data management. Data integration was considered a secondary aspect, addressed purely functionally to enable individual use cases as they arose, without a real forward-looking data management strategy. Applications were designed to store data internally rather than to share it: this made reusing data as an asset difficult and limited the value that could be extracted from it.

The data-centric movement is strongly contributing to changing the thinking paradigm, and this has led to the emergence of new architectural patterns that are more in line with data sharing and reuse principles than platforms based on point-to-point ETL integrations or traditional SOA architectures. Among these, the Digital Integration Hub pattern is particularly interesting for its ability to make domain data available to different consumers in a scalable and efficient manner by making the best use of modern cloud-based technologies.

Challenges

In the design of data-centric architectures to support operational and analytical use cases, some technical aspects that are crucial for the solution’s effectiveness should be carefully considered. Below, we list some of them:

In order for users to retrieve data for operational, Real Time Analytics, or Operational Analytics purposes, the latency between the moment a record is inserted, modified, or deleted in the source and the propagation of the corresponding event to the shared integration platform must be minimized. This reduces the likelihood that consumers receive an outdated version of the data or find that data they expect is missing. The choice of technological components that make up the integration solution is critical, as it directly impacts latency in all stages: offloading from sources, real-time transformation, and ingestion into the platform. In any case, a solution that offloads data from sources must accept an eventual consistency regime; however small the delay, this can be an obstacle for use cases that require strict Read After Write consistency or hard real-time data.
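As a minimal sketch of this stage, the snippet below assumes a Kafka topic fed by a CDC tool that emits Debezium-style change events carrying a source timestamp; the topic name and field names are illustrative assumptions, not part of the pattern itself. It measures the propagation latency of each change event as it reaches the platform.

```python
import json
import time

from kafka import KafkaConsumer  # pip install kafka-python

# Subscribe to the change stream produced by the offloading component.
consumer = KafkaConsumer(
    "crm.customers.changes",
    bootstrap_servers="localhost:9092",
    group_id="latency-probe",
    auto_offset_reset="latest",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)

for message in consumer:
    change = message.value
    # Propagation latency: time elapsed between the change in the source
    # database (ts_ms, set by the CDC tool) and its availability on the platform.
    latency_ms = int(time.time() * 1000) - change["ts_ms"]
    print(f"op={change.get('op')} key={message.key} latency={latency_ms} ms")
```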

In order for the same data extracted from sources to be reused by different consumers, both operational and analytical, the architecture must include a component that acts as a single source of truth for the consistency and integrity of the data. This is crucial to ensure that all consumers accessing a data entity receive the same version of the records and that the physical replicas of a dataset across the various layers of the platform stay as closely aligned in time as possible. From this perspective, it is not sufficient to consider only the standard operational scenario; special cases must also be covered, such as correcting bugs, fixing anomalies, or reprocessing the entire dataset from the beginning (which is often required for the initial load of a new consumer that subscribes to the dataset). The components chosen as the single source of truth must be able to preserve the data history indefinitely (while respecting regulatory constraints related to personal data processing). The data must be exposed as domain data structures that are self-consistent, self-descriptive, and complete with respect to the informational content of interest to the majority of consumption use cases. Finally, the provision of data to consumers must ensure good read performance, resolving technical complexities when feeding the exposed structures so that user queries remain as simple as possible.
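To illustrate why full history retention matters, the following sketch replays every record of a domain entity from an object-store staging area, for example to perform the initial load of a new consumer. The bucket name, prefix, and JSON-lines layout are assumptions made for the example, not prescriptions of the pattern.

```python
import json

import boto3  # pip install boto3

s3 = boto3.client("s3")

def replay_entity_history(bucket: str, prefix: str):
    """Yield every historical record of a domain entity stored as JSON-lines objects."""
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            body = s3.get_object(Bucket=bucket, Key=obj["Key"])["Body"].read()
            for line in body.decode("utf-8").splitlines():
                yield json.loads(line)

# Initial load of a new consumer: replay the full history into its target store.
for record in replay_entity_history("dih-staging", "customers/"):
    pass  # e.g. upsert the record into the consumer's own database
```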

To make data reusable on a large scale, it is necessary for the integration platform to be hybrid and convergent, meaning it should support different techniques for data storage, processing, and access to optimize retrieval by the widest and most heterogeneous category of consumers possible. The platform must be able to handle structured, semi-structured, and unstructured data, extracted from domain applications or obtained from external sources such as sensors, social networks, open data, and SaaS products. Furthermore, the architecture should provide the capability to access corporate data through API interfaces and request-response-based approaches, continuous event consumption, ad-hoc queries, federated queries, and data product sharing.

Solution

The Digital Integration Hub architecture is a valid solution to harness the benefits of a data-centric approach. The following diagram illustrates the design of a Digital Integration Hub architecture: it involves an event-driven, real-time data offloading component that extracts changes from the sources and centralizes the data in the shared integration platform with the lowest possible latency. To distribute domain events to multiple consumers, an event broker and streaming platform component is introduced, allowing fan-out of the same data to different subscribers and, where needed, real-time data transformations.
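A minimal sketch of the fan-out mechanism, assuming Kafka as the event broker with illustrative topic and group names: each consumer group maintains its own offsets, so every subscriber receives an independent copy of the same domain event stream.

```python
import json

from kafka import KafkaConsumer  # pip install kafka-python

def subscribe(group_id: str) -> KafkaConsumer:
    # Each consumer group tracks its own offsets, so every subscriber
    # receives a full, independent copy of the domain events (fan-out).
    return KafkaConsumer(
        "orders.events",
        bootstrap_servers="localhost:9092",
        group_id=group_id,
        auto_offset_reset="earliest",
        value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
    )

object_store_sink = subscribe("object-store-sink")  # feeds the persistent staging area
nosql_replicator = subscribe("nosql-replicator")    # feeds the operational cache
```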

Among the subscribers are the components used for persistent data storage, which act as the Single Source of Truth. Typically, a cloud object store is used as the long-term storage system and data archive, supporting analytical use cases in a Data Lakehouse or federated query paradigm, as well as Data Science, Self-BI, and data exploration. Using the cloud object store as the Persistent Staging Area layer also makes it possible, through notification events, to trigger in real time the integration of data deposited as objects in buckets toward other data stores or cloud platforms. This allows the creation of continuous, event-driven data integration pipelines.
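A possible sketch of such an event-driven step, assuming an AWS-style setup in which the object store sends object-created notifications to a serverless function; the bucket layout and the downstream loader are hypothetical placeholders.

```python
import json
import urllib.parse

import boto3  # pip install boto3

s3 = boto3.client("s3")

def load_into_downstream_store(records):
    pass  # hypothetical: upsert into the analytical platform or a NoSQL cache

def handler(event, context):
    """Invoked by an object-created notification from the staging bucket."""
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        payload = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        load_into_downstream_store(json.loads(payload))
```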

Among the possible systems powered by the object store, there are analytical data platforms optimized for transforming data into a model suitable for reporting and business intelligence. There are also serverless integration stacks that replicate datasets of interest into low-latency, highly scalable, and high-performance NoSQL databases, supporting data access through point lookups or on-demand queries by operational consumers.
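As a sketch of this replication layer, assuming DynamoDB as the low-latency NoSQL store and an illustrative table keyed by customer_id: the streaming pipeline upserts records, while operational consumers read them back with key-based lookups.

```python
import boto3  # pip install boto3

table = boto3.resource("dynamodb").Table("customer_profile")

def replicate(record: dict) -> None:
    """Upsert one domain record arriving from the streaming pipeline."""
    table.put_item(Item=record)

def point_lookup(customer_id: str):
    """Key-based read serving an operational consumer with low latency."""
    return table.get_item(Key={"customer_id": customer_id}).get("Item")
```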

This achieves polyglot and polymorphic storage: the same domain data can be exposed both as a structured data model, which we call the Logical DWH, consisting of a hybrid ecosystem of cloud data analytics platforms and federated query engines that query the object store directly, and as hierarchical documents, read with a schema-on-read approach from a NoSQL database that supports key-based or range-based access as well as secondary indexes.
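As an example of the schema-on-read side, the sketch below uses DuckDB purely to illustrate a federated query engine reading Parquet files directly from the object store; the bucket path, columns, and filter are illustrative assumptions.

```python
import duckdb  # pip install duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs")  # extension for reading s3:// paths
con.execute("LOAD httpfs")

# Schema-on-read query executed directly against Parquet files in the object store.
rows = con.execute(
    """
    SELECT customer_id, count(*) AS orders_last_30d
    FROM read_parquet('s3://dih-staging/orders/*.parquet')
    WHERE order_date >= current_date - INTERVAL 30 DAY
    GROUP BY customer_id
    """
).fetchall()
```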

On top of the data storage systems, data access capabilities are usually provided. These can be implemented as a combination of serverless functions and workflows that perform on-demand data retrieval, data virtualization tools, and API gateways, typically of the REST or GraphQL type.
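A minimal sketch of such an access capability, drafted here with FastAPI as the REST layer and backed by the hypothetical NoSQL replica from the previous example; the route and key names are illustrative.

```python
import boto3  # pip install boto3
from fastapi import FastAPI, HTTPException  # pip install fastapi

app = FastAPI()
table = boto3.resource("dynamodb").Table("customer_profile")

@app.get("/customers/{customer_id}")
def get_customer(customer_id: str):
    """Request-response access to domain data without touching the source system."""
    item = table.get_item(Key={"customer_id": customer_id}).get("Item")
    if item is None:
        raise HTTPException(status_code=404, detail="customer not found")
    return item
```

Behind an API gateway, an endpoint of this kind lets operational applications consume domain data in request-response mode without ever querying the source systems directly.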

Advantages

Integration Cost Rationalization
With this type of architecture, there is no need to maintain separate integration stacks for the ETL flows that feed analytical data systems and for application integration services. This helps reduce the operating costs of the underlying IT infrastructure and facilitates evolution and maintenance activities.
Reduction of the load on the sources
Application read requests for data are redirected to the integration platform instead of the source applications, reducing the workload on the latter. This is of fundamental importance to ensure the scalability of the integration solution as the volumes of requests and managed data grow.


