
March 12, 2009

ETL Architecture - Core Principles

IT at the Speed of Business

The driving factor of the modern IT shop is to operate at the speed of business.

More than anything else, this means being able to respond rapidly to changes in the business climate. These changes come from the business units within the enterprise, from trading partners outside the enterprise and – for IT – from the continuous advancements of technology itself.

Whenever the business makes a new demand, IT must be prepared to satisfy that demand. It must do so rapidly. It must do so without major disruption. And it must be free to move forward when it needs to move forward.

The approach to technology that underlies a system contributes to the responsiveness of IT. And the core principles underlying the system’s architecture drive that approach.


Core Principles for ETL Architecture

It is important to give appropriate weight to the principles that drive the architectural foundation of the Extract, Transform and Load (ETL) system. These principles – listing the most important first – are:

·   Accuracy

·   Reliability

·   Flexibility

·   Extensibility

·   Autonomy

·   Cost of Ownership

·   Scalability

·   Speed

Three of these principles – Flexibility, Extensibility and Autonomy – have the greatest impact on IT’s ability to respond to changing business demand.

Flexibility is the key principle to guide all design, adoption and implementation choices. Flexibility means being able to adapt to forces of change easily, swiftly and with minimum risk. Technology, products, the marketplace, or – especially – the business may impose necessary and beneficial change. Flexibility is essential to avoid “tear-up” when the inevitable changes occur.

Extensibility ranks right behind Flexibility. It is especially important when Flexibility must be compromised because of a limitation in a product or design choice. Extensibility means being able to take a product beyond its intended capabilities. This is the enabler of discovery and invention, two key elements of a vibrant IT organization. Extensibility allows you to overcome limitations, not with “workarounds”, but with solutions that are well-designed and architecturally sound.

Autonomy is the IT organization’s capacity for moving forward at its own pace. Autonomy is enabled by Flexibility, Extensibility and a skilled workforce. If an architected solution supports Autonomy, the IT organization can take an active role in creating what is needed, when it is needed, to support the specific business demand. The IT organization is not dependent on, or encumbered by, the ability or desire or timetable of vendors or markets.


The ETL Model

A model for this ETL architecture can be as simple, and still as complete, as the one shown in Figure 1.

 


Figure 1 ETL Model

 

EXTRACT is platform-specific. Its role is to optimally collect the data that needs to be shipped to the Transformer, including both the core data and its context. The Extractor may be consuming resources on a highly-active, highly-volatile system. It must be able to take advantage of platform-specific features, and deal with platform-specific limitations, in order to minimize disruption to the platform.

TRANSFORM is platform-agnostic. Its role is to mediate between two sometimes-conflicting players: it resolves the differences between them, preserves (or adds to) the value of the data from the source extractor, normalizes it for general usage and delivers it to the target loader.

LOAD is platform-specific. Its role is to optimally organize and store the data that has been pulled from the Application platform and mediated by the Transformer. Like the Extractor, the Loader must be able to leverage the platform-specific features and avoid the platform-specific limitations of its host system.
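As a rough illustration of the three roles, here is a minimal Python sketch. The record layout, the field names and the "billing" origin tag are assumptions made up for this example; a real extractor and loader would talk to real platforms rather than to in-memory lists.

```python
from typing import Iterable, Iterator, List

Record = dict  # hypothetical neutral record format passed between components


def extract(source_rows: Iterable[tuple]) -> Iterator[Record]:
    """Platform-specific: knows the source layout (here, positional tuples)."""
    for cust_id, name, amount in source_rows:
        # Ship the core data together with its context (where it came from).
        yield {"id": cust_id, "name": name, "amount": amount, "origin": "billing"}


def transform(records: Iterable[Record]) -> Iterator[Record]:
    """Platform-agnostic: works on the data only, normalizing it for general use."""
    for rec in records:
        yield {
            **rec,
            "name": rec["name"].strip().title(),
            "amount": round(float(rec["amount"]), 2),
        }


def load(records: Iterable[Record], target: List[Record]) -> int:
    """Platform-specific: knows how to store data (a list stands in for a database)."""
    loaded = 0
    for rec in records:
        target.append(rec)
        loaded += 1
    return loaded


warehouse: List[Record] = []
count = load(transform(extract([(1, "  ada lovelace ", "10.5"), (2, "alan TURING", "7")])), warehouse)
```

Each function sees only the records handed to it, which is the property the rest of this post builds on.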


Essential Characteristics

The essential characteristics of the system depicted in the ETL model are:

Encapsulation – Each component of the ETL processing is functionally encapsulated. This provides the degree of isolation that is required to allow each component to incorporate whatever optimizations are most appropriate to achieve its objective.

This encapsulation limits the component’s scope of awareness – it “knows” only about its own environment and (literally) knows nothing about its partners. For example, Load knows the details of its physical database design and implementation, but knows nothing about the source system, the transform engine or the business rules that moved the data from one form to another. Likewise, Extract knows the details of the source data and perhaps knows the details of the application that produced the data, but knows nothing about the target system (or systems).

Because its scope of awareness is constrained, each component is also unaffected by change to either of the other components. This constraint offers more opportunities for adaptation and flexibility as new sources, new targets and new business rules emerge.
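To make the "unaffected by change" point concrete, the sketch below swaps one extractor for another while the transform and load code stay untouched. The CSV and JSON layouts, the SKU field names and the zero-quantity rule are all hypothetical, chosen only for illustration.

```python
import csv
import io
import json


def extract_csv(text):
    """Knows only the CSV source layout."""
    for row in csv.DictReader(io.StringIO(text)):
        yield {"sku": row["SKU"], "qty": int(row["QTY"])}


def extract_json(text):
    """Knows only the JSON source layout."""
    for item in json.loads(text)["items"]:
        yield {"sku": item["code"], "qty": item["quantity"]}


def transform(records):
    """Hypothetical business rule: drop zero-quantity lines, upper-case the SKU."""
    for rec in records:
        if rec["qty"] > 0:
            yield {"sku": rec["sku"].upper(), "qty": rec["qty"]}


def load(records):
    """Knows only its own storage (a dict keyed by SKU stands in for a table)."""
    table = {}
    for rec in records:
        table[rec["sku"]] = table.get(rec["sku"], 0) + rec["qty"]
    return table


csv_text = "SKU,QTY\nab-1,2\ncd-2,0\n"
json_text = '{"items": [{"code": "ab-1", "quantity": 2}, {"code": "cd-2", "quantity": 0}]}'

# Same transform and load, two different sources.
from_csv = load(transform(extract_csv(csv_text)))
from_json = load(transform(extract_json(json_text)))
```

Adding the JSON source required a new extractor and nothing else, because neither `transform` nor `load` ever knew what the source looked like.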

Loose Coupling, Standard Interfaces – The components are loosely coupled – that is, they communicate only through standard, open interfaces. The rules of encapsulation require that each component knows its partner only through the coupling interface. Using loose coupling respects that encapsulation.

Loose coupling promotes extensibility. A standard interface permits the insertion of additional components which can add functionality to the standard model. For example, a “fan-out” requirement – in which a single transform feeds multiple load targets simultaneously – can be implemented by inserting a “one-to-many” distribution component between the transform and the loads. Each load remains encapsulated, unaware of its sibling loads. The transform remains encapsulated, unaware of the “fan-out”.
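The fan-out case might be sketched as follows. This is an assumption-laden illustration: the "standard interface" is reduced to a plain callable that accepts one record, the loaders and order fields are invented, and delivery here is sequential rather than truly simultaneous.

```python
from typing import Callable, Iterable, List

Record = dict
Loader = Callable[[Record], None]  # the hypothetical standard interface


def fan_out(loaders: List[Loader]) -> Loader:
    """Build a one-to-many distributor that itself looks like a single loader."""
    def distribute(record: Record) -> None:
        for loader in loaders:
            loader(dict(record))  # each load gets its own copy of the record
    return distribute


def transform_and_send(records: Iterable[Record], send: Loader) -> None:
    """The transform sees one downstream interface, fan-out or not."""
    for rec in records:
        send({**rec, "total": rec["price"] * rec["qty"]})


warehouse_rows: List[Record] = []
audit_rows: List[Record] = []


def load_warehouse(rec: Record) -> None:
    warehouse_rows.append(rec)


def load_audit(rec: Record) -> None:
    audit_rows.append({"id": rec["id"], "total": rec["total"]})


orders = [{"id": 1, "price": 5, "qty": 3}, {"id": 2, "price": 2, "qty": 4}]
transform_and_send(orders, fan_out([load_warehouse, load_audit]))
```

Because `fan_out` returns something with the same shape as a single loader, it can be inserted or removed without changing the transform or either load.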

Likewise, using a standard interface promotes autonomy. Since the interface is non-proprietary, the IT organization can add functionality without waiting for the product vendor to incorporate that functionality into the product. This capability is essential in allowing IT to respond rapidly at its own pace to changing business demands.

Platform Awareness – The ETL model allows for platform-awareness. Because the platform-related components – Extract and Load – are encapsulated, they can freely take advantage of those features specific to their respective platforms. This allows special utilities, known only to the platform, to be used to their best advantage, whether for performance or for other purposes.

The Transform component is not platform-related – its domain is the data itself. The rules of encapsulation and loose coupling dictate that the Transform component is unaware of the specific nature of the (physical) source or target. In this respect, the Transform is platform-agnostic. It must work with the data, regardless of its physical origin, applying the transformation rules imposed by the business requirements. The Transform component is therefore free to receive from any source and deliver to any target, trusting that its Extract and Load partners know what is to be done with the data.

It is important to encapsulate platform-specific characteristics, capabilities and requirements within the platform-specific processes. This allows those processes to flex or expand as they must to leverage the platform.
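As one possible illustration, the loader below keeps its platform knowledge to itself. SQLite is used only because it ships with Python; the `sales` table, its columns and the `SqliteLoader` class are assumptions for this sketch. The platform-specific choice hidden inside the loader is a batched insert inside a single transaction.

```python
import sqlite3


class SqliteLoader:
    """Loader that knows its own platform (SQLite) and nothing about the source."""

    def __init__(self, connection: sqlite3.Connection) -> None:
        self.connection = connection
        self.connection.execute(
            "CREATE TABLE IF NOT EXISTS sales (id INTEGER PRIMARY KEY, amount REAL)"
        )

    def load(self, records) -> int:
        rows = [(rec["id"], rec["amount"]) for rec in records]
        # Platform-specific optimization: one transaction, one batched insert.
        # Using the connection as a context manager commits on success
        # and rolls back if an exception is raised.
        with self.connection:
            self.connection.executemany(
                "INSERT INTO sales (id, amount) VALUES (?, ?)", rows
            )
        return len(rows)


conn = sqlite3.connect(":memory:")
loader = SqliteLoader(conn)
loaded = loader.load([{"id": 1, "amount": 9.5}, {"id": 2, "amount": 0.5}])
total = conn.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
```

A loader for a different target could use that platform's bulk-load utility instead, and nothing upstream of it would need to change.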

 

 


ETL Model

 


This simple model of an ETL application - "3 circles, 2 arrows" - is all that is needed to highlight the key architectural principle of Flexibility. Each component is separated from the other processing components by a "wall". The components communicate with one another through a loose coupling based on an exchange of messages. This is the long-understood requester-server model, which is not adhered to often enough.
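A message-based coupling of this kind could look roughly like the sketch below, where the two "arrows" are in-process queues. The message shape, the `DONE` sentinel and the times-ten rule are assumptions for illustration, and this shows only one-way message passing, not a full request-and-reply exchange.

```python
import queue
import threading

DONE = object()  # sentinel message marking the end of the stream


def extractor(outbox):
    for value in (3, 1, 2):
        outbox.put({"value": value})
    outbox.put(DONE)


def transformer(inbox, outbox):
    # Knows only the messages it receives and the queue it sends to.
    while (msg := inbox.get()) is not DONE:
        outbox.put({"value": msg["value"] * 10})
    outbox.put(DONE)


def loader(inbox, store):
    while (msg := inbox.get()) is not DONE:
        store.append(msg["value"])


extract_to_transform = queue.Queue()  # first arrow
transform_to_load = queue.Queue()     # second arrow
store = []

workers = [
    threading.Thread(target=extractor, args=(extract_to_transform,)),
    threading.Thread(target=transformer, args=(extract_to_transform, transform_to_load)),
    threading.Thread(target=loader, args=(transform_to_load, store)),
]
for worker in workers:
    worker.start()
for worker in workers:
    worker.join()
```

The queues play the part of the "wall": each component holds a reference to a queue, never to another component.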
