How to build a Customer Data Platform on top of an Enterprise Data Warehouse
You, the marketing and sales team, want to have a Customer Data Platform (CDP). Your IT counterpart suggests you take advantage of the Enterprise Data Warehouse (EDW) they already have or are looking to build. You’re facing three options:
- Use the EDW as your CDP
- Buy a standalone CDP
- Deploy a hybrid CDP solution leveraging your EDW
Let’s analyze the three options and deep dive on the hybrid CDP solution. We’ll start by analyzing the pros and cons of each approach.
Use the EDW as Your CDP
As with many things in life, the devil’s in the details. Simply saying there’s an EDW you can leverage is too simple of a statement. Here’s a quick breakdown of a typical EDW architecture:
- Data ingestion – A layer of tooling to ingest data from different sources into a data lake. These can combine middleware technologies like Integration Platform-as-a-Service (iPaaS) and Extract-Transform-Load (ETL) tools.
- Data lake – The data lake is a system like Amazon S3 that holds raw data in its original or slightly optimized format.
- Data pipeline – Additional ETL tooling like Amazon Glue, and custom programming like Python, used to extract data from the data lake into a data warehouse like Amazon Redshift.
- Data warehouse – Data in a data warehouse is structured and denormalized, more ready for end use. A bit of clarification on terminology: “Data warehouse” is a product/technology. “EDW” is an architecture/solution. A simple EDW can be just a data warehouse without a data lake.
- Visualization and analytics tools – Data visualization tools like Tableau and Power BI can then use the data in the data warehouse. Any additional means of consuming data like APIs or applications are not part of the EDW architecture.
So your EDW can be in one of four increasing states of readiness for your CDP needs:
- Your EDW is not deployed yet.
- Your data is not yet in the data lake or you don’t know if your data is in the data lake.
- You have most of the data you need in the data lake in raw format, but not yet available in a consumable format in the data warehouse.
- You have most of the data you need in the data warehouse and any additional data you need can be pulled into the data warehouse with ease.
The pros of using your EDW as a CDP are:
- Infinite flexibility.
- Leverage work already done (this is only true if you’re in Stage 4 of readiness).
- Tap into a pool of resources that may be “free” to you.
The cons of using your EDW as a CDP are:
- It takes highly technical resources like programmers and data scientists to build and maintain your EDW/CDP solution.
- Most EDWs have a broad charter to serve multiple departments, so prioritizing your needs in a timely manner may be a challenge.
- Since most EDWs serve multiple stakeholders, your data is likely involved in a complex schema. So making changes will likely create a domino effect, which can further slow down your ability to make changes.
- We’ll get into more detail when discussing the hybrid CDP solution, but most of the time data hygiene and a proper data dictionary are inadequate to make the data useful right away.
To deploy an EDW from design to go-live can easily take 18 to 24 months. Adding new data to make that data usable can take six months or longer. Unless your EDW readiness is in Stage 4 outlined above, timing can be an issue even, if you already have an EDW. So, make sure you incorporate these timelines into your decision making.
Buy a Standalone CDP
Whether you have an EDW or not, you can always buy a standalone CDP. There are plenty of choices out there. In fact, there may be too many confusing choices out there. This article’s focus is not to help you choose a particular commercial CDP, so we’ll save that discussion for another day. Instead, we’ll focus on the pros and cons of buying a commercial CDP solution.
The pros of a standalone commercial CDP solution are:
- Quick deployment compared to an EDW or any type of custom-built option.
- Rich utility/end-user features for certain use cases. Depending on what you’re trying to do with a CDP, these utilities that come pre-built in the CDP can save you a lot of time doing custom development on top of the database.
The cons of standalone commercial CDP are:
- Most commercial CDPs are designed for a very specific set of use cases. These out-of-the-box solutions can be limited in their ability to be customized to your specific needs, as well as extending their utility to other use cases.
- Just because it’s supposed to be a “standalone” CDP doesn’t mean it’s complete with everything you need. Data quality and additional integration are features that are often inadequate and will require additional third-party toolings to be added. Make sure your budget comparison includes any required third-party tooling.
- Buying a standalone CDP means additional cost. Even though nothing is really free, even if you are in Stage 4 EDW readiness, it will likely require additional justification.
The main tradeoff of a commercial CDP is basically speed vs. flexibility. When you evaluate CDP solutions, make sure you have a good roadmap of use cases for the next 36 months and set your evaluation criteria based on that. Anything less you may end up being stuck and have to change technology platform after one or two years.
A Hybrid CDP Solution
The option of using an EDW straight up as a CDP, and the option of buying a commercial CDP both have their challenges. When you have an EDW, you may have political pressure to use it instead of “going rogue” and buying a standalone solution. Creating a hybrid CDP solution that leverages your EDW can make a lot of sense both functionally and politically. A hybrid CDP solution can fill the gap and overcome challenges associated with the EDW, add missing features you get from a commercial CDP, and enable you to leverage the EDW assets and resources you have. Let’s discuss what a hybrid solution architecture can look like.
Which EDW Assets Should You Use?
EDWs are a great asset if you can use them effectively. Here are the components you should use to get the most out of your EDW:
- Storage and archive – The data lake is designed to hold a vast amount of raw data securely and cheaply. This is a great place to store raw data, old data, and any data you may just want to keep around for whatever purpose. In fact, storage is cheap these days, so when in doubt, keep it!
- Simple data pipelines – Whatever integration solutions your EDW uses should be very good at moving a large amount of data securely and reliably. Those solutions usually have a large number of out-of-box connectors and a robust integration framework for any system that doesn’t already have a connector. Leverage this integration to get the data you need into your data lake first. Once you have your raw CDP data in one place, your CDP architecture becomes greatly simplified. It’s worth clarifying that these are “simple” pipelines because they largely just move data from Point A to Point B. They aren’t good at handling business logic and computation, which we’ll discuss more below.
- Data visualization – Your EDW probably has at least one data visualization solution sitting on top of it. And your company probably has people already trained on these data visualization platforms, whether it’s Tableau, Domo, Looker, or SAS. Leverage these tools and resources for reporting and ad-hoc analysis.
With these great foundational pieces “borrowed” from your EDW, you can now focus your energy on how to supplement your EDW to create an agile CDP that can respond at the pace the business changes without being anchored by the rigidity of the EDW. The diagram below shows a hybrid CDP architecture built on the EDW. We’ll discuss each component in detail.
Add a Data Quality Engine
Intuitively, you would think that for an EDW, a CDP, or any type of data platform, data quality should be front and center, a core capability. Otherwise, it’s just garbage-in-garbage-out, right? Unfortunately, that’s almost never the case. For whatever reason, data quality is usually grossly lacking in most EDW or CDP deployments. In most EDW deployments, data quality is handled using a combination of ETL tools and custom code. On the other hand, most commercial CDPs not only lack built-in data quality capabilities, but they also offer limited, if any, custom coding options. The lack of quality data has severely limited the return on investment for EDW and CDP projects.
Whatever CDP-related use cases you may have, whether it’s prospecting, activation, or attribution, all require high-quality data. So your first order of business should be adding a data quality engine to your hybrid CDP. Ideally, you should have this engine added as a standard EDW component, but bolt one on for your CDP if you have to, because this is essential to any successful CDP execution. What are the critical capabilities for this data quality engine?
- Bi-directional integration to your data lake and data warehouse, so you can pull any data out of your EDW, clean it the way you want it, and push it back.
- The ability to perform these operations at a minimum:
- Ideally, it can be used by your team without the help of the EDW IT team.
Add a Computation Engine
Once you have good quality data, then you need the ability to manipulate that data so it can be more useful for the analysis you wish to conduct, and automate the processes you wish to scale. Automated decision making often requires computing the data into a decision-ready form, including doing mathematical computations, as well as adding relationships and structure. Simple calculations are often done using the rules engine built into your CRM. For more complex calculations, companies usually resort to writing Python programs. Here are some typical examples of computations we see:
- Matching person-to-person, person-to-account, or account-to-account across data sets
- Total number of active opportunities for an account
- Total lifetime value of an account
- Quarter-over-quarter, year-over-year growth in revenue or cost
- Aggregate revenue across dimensions like channel, product group, and geography
- Demographic and behavior scoring
- Build out account hierarchy relationship
- Aggregate data up the account hierarchy
- Contribution, distribution, and allocation calculations of all sorts
What are the critical capabilities for this computation engine for your hybrid CDP?
- Bi-directional integration to your data lake and data warehouse, same as above.
- The ability, at a minimum, to perform these operations:
- Basic to advanced mathematical operations
- Highly configurable matching logic
- Ability to create relationships between data
- Built-in storage to support multi-stage, complex computational logic, so you don’t have to rely on the EDW to store intermediate calculation results
- Ideally can be used by your team without the help of the IT team. Ideally, no coding or scripting is required.
Add a Process Automation Engine
Why do you want a CDP in the first place? There are only two reasons people need to touch data at all: (1) to gain insight and (2) to automate processes and decisions. EDW primarily serves the purpose of helping people get insight out of data. It doesn’t do much in terms of automating business processes. In today’s world, more and more business processes need to be automated, and more and more decisions need to be made faster than ever. What is a decision, if not a data-driven task? This’s what a CDP does; it’s not just a niche EDW. The true value of a CDP is the automation and utility it offers beyond just data. See this blog post on the Openprise Agile CDP Solution for a detailed discussion on why CDP is not just a niche EDW.
To create a CDP from an EDW, you need to add a process automation engine, and it can’t be just a traditional workflow engine because the type of processes a CDP often has to automate are non-transactional, or data- or computation-intensive. Some key capabilities to look for in a process automation engine:
- Ability to automate processes that involve large volumes of data. The definition of “large volume” is subjective based on your business and capabilities, but we’re talking about data payload on the order of 100,000 to 100 million records, not 1 to 1,000 records.
- Ability to automate processes that involve a large volume of data in batches, and a smaller volume of data in real time.
- Ideally on a single platform, but at least tightly integrated with the data quality engine and the computation engine.
- Once again, ideally, a no-code / low-code platform that requires no IT support.
Add Process Integration Options
As discussed above, the data ingestion technologies used in EDW that you should use for your CDP are very good at moving large volumes of data from Point A to Point B. But they’re limited in handling the complex operations that are often required in automating business processes or data management processes. For example:
- Merging records
- Adding or updating data sets with complex inter-dependencies
- Adding or updating data across objects and systems with complex sequencing requirement
- Conditional operations
Which exact capabilities you need depends on how you intend to use your CDP, which systems you’re integrating with, and the specific processes you need to automate. Be aware that process automation often requires more than just moving the bits from one table to another.
Add an Interaction / Utility Layer
Some processes involve no human interaction, but many do. People may be the ones that trigger a process to kick off by loading or specifying a data set to be processed. People may be the ones that have to handle exceptions or provide approvals during a process, or they’re the end consumer of the data. In all these data-human interactions, the data consumer experience can be greatly improved with the help of additional data-human interfaces, or simply put, “apps.” Here are some examples to illustrate the point.
- Sales and marketing teams use CDP for prospecting
- Basic data interface: provide user access to the marketing database with a querying tool, SQL query anybody?
- Rich data interface: provide each sales and marketing team a prospecting portal that contains only data that’s relevant to that team and enables the user to add a list of prospects to a campaign in the CRM with a single click.
- Marketing teams need to load lists of new leads to the CDP
- Basic data interface: send the list to an analyst to manually clean and load into the database.
- Rich data interface: provide the marketing team a simple app to drag-and-drop a list to be automatically cleaned and loaded, as well as showing the list of files successfully loaded and those that require human input.
- Customer Success team wants account information from the CRM
- Basic data interface: provide a data dump to the Customer Success team daily.
- Rich data interface: provide the Customer Success team an API that can take a person’s data as input and return which account is the person from, other rich account data such as who is the account owner and deal-related data. This data can be displayed in the help desk app in real-time via the API.
How Can Openprise Help?
An EDW is a great asset if you have access to one. A hybrid solution architecture can be a great way to build a best-in-class CDP on your EDW, and leverage its strength while filling in critical gaps by adding:
- data quality engine
- computation engine
- process automation engine
- process integration options
- interaction/utility layer
The Openprise Data Orchestration Platform provides all five of these critical capabilities in a single integrated platform, and more. By adding Openprise to your EDW, you can have an Agile CDP that can deliver results quickly, can be managed by MarketingOps, SalesOps, or Business IT teams without any coding. If this sounds like a good approach, contact one of our data orchestration experts.