Automating Data Reconciliation, Data Observability, and Data Quality Checks After Each Data Load.
Over the last several years, with the rise of cloud data warehouses and lakes such as Snowflake, Redshift, and Databricks, data load processes have become increasingly distributed and complex. Organizations are investing more capital in ingesting data from multiple internal and external data sources. As companies’ dependency on data increases every day and business users rely on that data for critical business decisions, ensuring high data quality is a top requirement in any data analytics platform.
As data gets processed every day through various pipelines, it can break for hundreds of reasons, from code changes to business process changes. With a limited team size and multiple competing priorities, data engineers are often not able to reconcile all data (or any data) every day. As a result, business users often find out about data issues before the data engineering team does. But at that point, it is too late for the team to win that trust back.
How can we proactively learn about data issues before users tell us? What if we automatically reconciled data after each load every day and alerted data engineers whenever there is a data issue? Is there any architecture or solution that can help us?
Yes. Let’s review in detail a solution called 4DAlert, which automates data reconciliation, data quality checks, and data observability, and see how it can identify issues automatically before bad data reaches the downstream reports and dashboards that many users depend on.
Scenario 1 – Reconcile data between source and target.
Almost all data platforms load data from multiple source systems. For one reason or another, data in the source and the target often doesn’t match, and data teams spend manual effort every day reconciling numerous data sources.
The 4DAlert solution connects to diverse data sources and automatically reconciles data between source and target. It uses its own AI engine to identify reconciliation issues and alerts the appropriate stakeholders through multiple channels, including email, text messages, and Slack.
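As a rough illustration of what a source-to-target reconciliation involves, here is a minimal Python sketch. It is not 4DAlert's implementation: the `table_summary` and `reconcile` helpers, the `orders` table, and the two in-memory SQLite databases standing in for the source and target systems are all assumptions made for the example.

```python
import sqlite3

def table_summary(conn, table, amount_col):
    """Return (row_count, total_amount) for one table."""
    # Identifiers are interpolated directly, so pass only trusted names.
    row = conn.execute(
        f"SELECT COUNT(*), COALESCE(SUM({amount_col}), 0) FROM {table}"
    ).fetchone()
    return row[0], row[1]

def reconcile(source_conn, target_conn, table, amount_col):
    """Compare row count and summed amount between source and target."""
    src = table_summary(source_conn, table, amount_col)
    tgt = table_summary(target_conn, table, amount_col)
    return {
        "row_count_diff": tgt[0] - src[0],
        "amount_diff": tgt[1] - src[1],
        "matched": src == tgt,
    }

# Two in-memory databases stand in for the source and target systems.
source = sqlite3.connect(":memory:")
target = sqlite3.connect(":memory:")
for conn in (source, target):
    conn.execute("CREATE TABLE orders (id INTEGER, amount INTEGER)")
source.executemany(
    "INSERT INTO orders VALUES (?, ?)", [(1, 100), (2, 250), (3, 75)]
)
# The target is missing order 3, simulating a row dropped during the load.
target.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 100), (2, 250)])

result = reconcile(source, target, "orders", "amount")
print(result)
```

Comparing a row count and one summed measure is the simplest useful check. A real reconciliation would usually compare several measures, often broken down by a key such as load date.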
Scenario 2 – Data reconciliation within the analytics platform.
Sometimes connecting to source systems is not possible, for example because the source systems are owned by different groups that don’t allow access, or because they are too rigid to accept any external connection. In that scenario, 4DAlert’s AI engine reconciles incoming new data against historical trends to detect data anomalies and reconciliation issues.
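One common way to compare new data with historical trends is a z-score test on a simple load statistic such as the daily row count. The sketch below assumes that technique purely for illustration; the article does not say which methods 4DAlert's engine uses, and the `is_anomalous` function, the threshold of 3, and the sample counts are hypothetical.

```python
from statistics import mean, stdev

def is_anomalous(history, new_value, z_threshold=3.0):
    """Flag new_value if it lies more than z_threshold sample standard
    deviations away from the mean of the historical values."""
    if len(history) < 2:
        return False  # not enough history to judge
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # All historical values are identical: any change is a deviation.
        return new_value != mu
    return abs(new_value - mu) / sigma > z_threshold

# Hypothetical row counts from the last seven daily loads.
daily_row_counts = [10_120, 9_980, 10_050, 10_210, 9_940, 10_080, 10_020]

print(is_anomalous(daily_row_counts, 10_100))  # a typical load
print(is_anomalous(daily_row_counts, 4_200))   # a load that lost most rows
```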
Scenario 3 – Compare data across systems.
In most organizations, multiple systems consume the same data, so keeping data in sync across systems is a continuous challenge. 4DAlert’s flexible architecture allows it to connect to diverse source systems and check key data points across them.
Scenario 4 – Checking numbers across layers in an analytics platform.
Often the same data is stored in different layers and different objects. As multiple pipelines and loads run on a daily basis, it becomes difficult to check whether the numbers are the same across layers. The 4DAlert solution checks the numbers across layers and alerts when the data doesn’t match.
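A cross-layer check can be as simple as computing the same total in every layer and comparing each one with the first. The following sketch assumes the totals have already been queried; the layer names, the figures, and the `compare_layers` helper are hypothetical.

```python
def compare_layers(totals, tolerance=0):
    """Compare every layer's total with the first layer's total.

    Returns {layer_name: difference} for layers that deviate by more
    than the tolerance.
    """
    layers = list(totals.items())
    if not layers:
        return {}
    _, base_value = layers[0]
    mismatches = {}
    for name, value in layers[1:]:
        if abs(value - base_value) > tolerance:
            mismatches[name] = value - base_value
    return mismatches

# Hypothetical revenue totals queried from three layers of one platform.
totals = {"raw": 1_250_000, "staging": 1_250_000, "reporting": 1_248_500}

print(compare_layers(totals))                  # exact match required
print(compare_layers(totals, tolerance=2000))  # small drift tolerated
```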
A solution that connects to diverse data sources.
4DAlert is a web API-based AI solution that connects to most databases (such as Snowflake, Redshift, Synapse, HANA, SQL Server, Oracle, Postgres, and many more) and reconciles data between source and target on a periodic schedule.
The solution is designed to connect source and target databases even when they are built on different database technologies. For example, the source could be an SAP HANA system and the target Snowflake or Redshift, or the source could be a data lake in Azure or AWS S3 and the target a Snowflake or Redshift database. In either case, 4DAlert is designed to reconcile the data.
Write your own SQL to detect anomalies and check data quality.
Users can write custom SQL queries to pinpoint particular anomalies and override the default tolerance limits. For example, sales varying by 10% may be acceptable, but varying by 60% is not. When users don’t define a tolerance, 4DAlert uses statistical variance and anomaly detection methods to detect outliers and alert as appropriate.
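To make the tolerance idea concrete, the sketch below runs a user-written query and compares the latest value with the previous one against a percentage limit. Everything here is assumed for illustration: the `sales` table, the `CUSTOM_SQL` query, the `check_latest_load` helper, and the 20% tolerance, which was picked only because it sits between the 10% and 60% figures above.

```python
import sqlite3

# Hypothetical user-written query: total sales per load date.
CUSTOM_SQL = """
    SELECT load_date, SUM(amount) AS total_sales
    FROM sales
    GROUP BY load_date
    ORDER BY load_date
"""

def check_latest_load(conn, sql, tolerance_pct):
    """Compare the newest metric value with the one before it."""
    rows = conn.execute(sql).fetchall()
    if len(rows) < 2:
        return None  # nothing to compare against yet
    previous, current = rows[-2][1], rows[-1][1]
    if previous == 0:
        return None  # percentage change is undefined
    change_pct = abs(current - previous) * 100 / abs(previous)
    return {"change_pct": change_pct, "ok": change_pct <= tolerance_pct}

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (load_date TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("2024-01-01", 1200), ("2024-01-01", 800), ("2024-01-02", 2200)],
)
# Day 2 is 10% above day 1, which is inside the 20% tolerance.
small_change = check_latest_load(conn, CUSTOM_SQL, tolerance_pct=20)

conn.execute("INSERT INTO sales VALUES ('2024-01-03', 3520)")
# Day 3 is 60% above day 2, which is outside the tolerance.
large_change = check_latest_load(conn, CUSTOM_SQL, tolerance_pct=20)

print(small_change, large_change)
```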
Data Observability.
In a data platform, there could be hundreds or thousands of tables. Every day multiple pipelines run and load objects. Some objects are loaded daily (sometimes multiple times a day), others weekly, monthly, or yearly, and still others on demand on an ad hoc basis. It is very hard to keep track of how fresh the data is, and users frequently ask about the last load date.
4DAlert checks the vital statistics of each object on a regular basis and labels each object according to its freshness. This information can be broadcast to users so that they are aware of the freshness of each dataset.
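A freshness label can be derived by comparing an object's last load time with its expected load cadence. The sketch below is one possible scheme, not 4DAlert's. The label names, the "twice the expected interval" cutoff, and the `freshness_label` function are assumptions.

```python
from datetime import datetime, timedelta

# Assumed expected gap between loads for each cadence.
EXPECTED_INTERVAL = {
    "daily": timedelta(days=1),
    "weekly": timedelta(days=7),
    "monthly": timedelta(days=31),
}

def freshness_label(last_loaded, cadence, now):
    """Label an object 'fresh', 'late', or 'stale' based on its age."""
    age = now - last_loaded
    allowed = EXPECTED_INTERVAL[cadence]
    if age <= allowed:
        return "fresh"
    if age <= 2 * allowed:
        return "late"
    return "stale"

# A fixed "now" keeps the example reproducible.
now = datetime(2024, 3, 15, 8, 0)
print(freshness_label(datetime(2024, 3, 15, 2, 0), "daily", now))
print(freshness_label(datetime(2024, 3, 13, 20, 0), "daily", now))
print(freshness_label(datetime(2024, 2, 20, 6, 0), "weekly", now))
```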
Auto Quality Score.
In an analytics platform, objects need to be loaded on a regular basis (sometimes with a predefined SLA). Whenever data is loaded, users expect it to arrive without any quality or load issues. However, some objects have frequent issues with load timing or data quality. A data observability platform such as 4DAlert tracks the failure points and provides a detailed performance scorecard for each object. Scores for each object are published as a dashboard for data engineers, the enterprise data team, data scientists, and sometimes end users for greater transparency.
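The article does not describe how 4DAlert calculates its scores. As one simple possibility, a per-object score could be the percentage of recent loads that both arrived on time and passed their quality checks. The `LoadRun` record and `quality_score` function below are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class LoadRun:
    on_time: bool        # did the load meet its SLA?
    checks_passed: bool  # did all quality checks pass?

def quality_score(runs):
    """Percentage of runs that were on time and passed all checks."""
    if not runs:
        return None  # no history, so no score
    good = sum(1 for run in runs if run.on_time and run.checks_passed)
    return round(100 * good / len(runs), 1)

# Ten hypothetical runs: eight clean, one late, one with failed checks.
runs = [LoadRun(True, True)] * 8 + [LoadRun(False, True), LoadRun(True, False)]
print(quality_score(runs))
```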
Multiple keys and multiple metrics for any data set.
Often a dataset contains more than one key metric. For example, a dataset could have revenue, sold quantity, discount, and cost of goods sold, and any of these metrics could go wrong. A solution should therefore be able to scan more than one metric simultaneously to look for abnormalities.
Key quality metrics (e.g., row count, null count, distinct count, max value, min value).
4DAlert comes with many predefined metrics that are applied automatically to detect anomalies in the data. For example, the Material Number in inventory data should not be null, the distinct list of countries in a data set can’t number in the millions, and the maximum amount of a PO should not be more than 10,000. These rules come predefined out of the box, and data sets are checked against them.
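The metrics named above can all be collected with a single aggregate query per column. The sketch below uses an in-memory SQLite table to show the idea. The `po` table, its columns, and the `profile_column` helper are made up for the example, and the two rules mirror the Material Number and PO amount examples from the text.

```python
import sqlite3

def profile_column(conn, table, column):
    """Collect row, null, and distinct counts plus min and max values."""
    # Identifiers are interpolated directly, so pass only trusted names.
    row = conn.execute(
        f"SELECT COUNT(*), "
        f"SUM(CASE WHEN {column} IS NULL THEN 1 ELSE 0 END), "
        f"COUNT(DISTINCT {column}), MIN({column}), MAX({column}) "
        f"FROM {table}"
    ).fetchone()
    keys = ("row_count", "null_count", "distinct_count", "min", "max")
    return dict(zip(keys, row))

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE po (material TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO po VALUES (?, ?)",
    [("M-100", 500), ("M-200", 12_000), (None, 300), ("M-100", 750)],
)

material = profile_column(conn, "po", "material")
amount = profile_column(conn, "po", "amount")

# Apply two example rules to the collected metrics.
violations = []
if material["null_count"] > 0:
    violations.append("material has nulls")
if amount["max"] > 10_000:
    violations.append("amount above 10,000")
print(violations)
```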
Enumerated value check.
Often the data team wants to restrict certain fields to predefined value sets. For example, a currency should be a value from a predefined currency list, and the same goes for plants, countries, regions, and so on. 4DAlert can check field values against such predefined lists.
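An enumerated value check boils down to testing each field value for membership in an allowed set. The sketch below counts the offending values so an alert can say how widespread a problem is. The currency list, the sample values, and the `invalid_values` function are illustrative assumptions.

```python
from collections import Counter

def invalid_values(values, allowed):
    """Return {value: count} for every value outside the allowed set."""
    return dict(Counter(v for v in values if v not in allowed))

# Hypothetical allowed list and incoming column values.
ALLOWED_CURRENCIES = {"USD", "EUR", "GBP"}
incoming = ["USD", "EUR", "usd", "GBP", "XXX", "EUR", "usd"]

print(invalid_values(incoming, ALLOWED_CURRENCIES))
```

The comparison is case-sensitive, so `usd` is reported as invalid. Whether to normalize case before checking is a design choice that depends on the data set.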
Seasonality and month-end, quarter-end, or year-end spikes.
Data often spikes at month-end, quarter-end, year-end, or some other particular period of the year. An AI-enabled solution such as 4DAlert takes the seasonality in the data into account when it tries to identify anomalies.
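One simple way to account for a month-end spike is to compare a new value only with history from the same kind of day, so that month-end loads are judged against earlier month-ends. The sketch below illustrates that idea with a z-score test. It is an assumed approach, not a description of 4DAlert's model, and the `seasonal_anomaly` function and sample figures are hypothetical.

```python
import calendar
from datetime import date
from statistics import mean, stdev

def is_month_end(d):
    """True if d is the last calendar day of its month."""
    return d.day == calendar.monthrange(d.year, d.month)[1]

def seasonal_anomaly(history, day, value, z_threshold=3.0):
    """Compare value only with history from the same bucket
    (month-end days versus all other days)."""
    bucket = [v for d, v in history if is_month_end(d) == is_month_end(day)]
    if len(bucket) < 2:
        return False  # not enough comparable history
    mu, sigma = mean(bucket), stdev(bucket)
    if sigma == 0:
        return value != mu
    return abs(value - mu) / sigma > z_threshold

# Hypothetical order counts: around 100 mid-month, around 300 at month-end.
history = [
    (date(2024, 1, 15), 100), (date(2024, 1, 31), 310),
    (date(2024, 2, 15), 105), (date(2024, 2, 29), 295),
    (date(2024, 3, 15), 98), (date(2024, 3, 31), 305),
    (date(2024, 4, 15), 102), (date(2024, 4, 30), 300),
]

# The same value is normal at month-end but anomalous mid-month.
print(seasonal_anomaly(history, date(2024, 5, 31), 308))
print(seasonal_anomaly(history, date(2024, 5, 15), 308))
```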
Custom metrics.
If the predefined metrics are not all you need, you should be able to add your own. 4DAlert allows you to write your own SQL query, check the values, and detect anomalies.
This post was written by Nihar Rout, Managing Partner and Lead Architect at 4DAlert.
Want to try schema compare features that will help you continuously deploy changes with zero errors? Request a demo with one of our experts at https://4dalert.com/