Ralph Kimball covers some great topics on ETL System Flows in his book The Data Warehouse ETL Toolkit. This week, I wanted to share some of those details on Horizontal versus Vertical ETL System Flows.
Kimball talks about two types of ETL System Flows. You have ETL that is organized either horizontally or vertically. He goes on to talk about how you would determine whether your ETL system should be either horizontal or vertical.
ETL System Flows
The ETL System Flow is the definition of how your data starts and ends within your data warehouse. Data originates from one or more data sources, they are cleaned, conformed and eventually delivered. The ETL System Flow entails all the processes and sub-processed that make up that flow and in which order they are executed to achieve some end result.
Horizontal ETL System Flow
With Horizontal ETL System Flows, your processes or sub-process are not linked together. They execute side-by-side independent of one another until they are complete.
For example, if you have an ETL system on customer orders, the ETL job could run to completion. However, if you have another ETL job on inventory tracking with those customers, the ETL job could fail to complete. Customer orders and inventory tracking processes are not linked and could be inconsistent with one another.
Vertical ETL System Flow
With Vertical ETL System Flows, your processes or sub-processes are linked together. Each job must run to completion and wait for the other jobs to get to the same point. Using the same example with horizontal, your inventory tracking job must wait until the customer order job is completed. Once that ETL job is completed, then it can proceed to run until it’s completed. If any failures happen, then the entire ETL fails.
Determining ETL System Flows
Ralph Kimball explains how there are two main variables when deciding whether your ETL system should be horizontal or vertical. Those variables are:
- Detailed data dependencies that require several ETL jobs to process through the steps in parallel. (That is, if the inventory tracking job fails, maybe the customer orders job could have undefined product codes.)
- The sensitivity the end user community might have to partially updated data (for example, orders being updated but shipments lagged by a day.)
I’ve designed both of these types of ETL System Flows in my time. When I’m doing horizontal ETL, I’m trying to load data in parallel with one another to decrease the load time. In theory, I’m trying to split the problem into smaller manageable pieces and load them together as opposed to one big chunk. This is very effective in scenarios where you have to load a large amount of data and then clean and conform it.
When I’ve developed vertical ETL, it’s because my decision makers need data to be fully updated before it gets to their hands. If any part of the ETL process is incomplete, they do not want the data. It’s either all or nothing in their minds. Thus, I focus entirely on ways to ensure data is ready to be loaded and design ETL systems around the various types of failures I could see. This makes designing vertical ETL harder than horizontal ETL.
Ultimately, you can do a mixture of both. ETL is not just one process. While the ETL system may be referenced as one big data flow that gets data from A to Z, it’s likely made up of many processes that have various conditions per process. Nothing is stopping you from taking a large chunk of data, breaking it up outside the database and loading it in parallel. Nothing is stopping you from them linking each piece of that data once in the system to ensure it’s being loaded vertically.
The Data Warehouse ETL Toolkit
As always, Ralph Kimball has great insights and I encourage you to read up more from his book, The Data Warehouse ETL Toolkit.