Get a head start on data quality control with the new Block for Talend Studio
Feb 11, 2021
Good data quality is a strategic asset that gives businesses a competitive advantage. Without confidence in your data, it’s difficult to be confident in the decisions you make based on that data.
That’s why Talend and Looker have developed a data quality Block for Talend Studio. It gives you a pre-built data model that makes getting data quality analytics easier for any organization that uses both Talend Studio and Looker.
Talend Studio is a development environment for creating ETL, API, and application integration solutions using a graphical user interface. It bundles hundreds of prebuilt components and connectors, lets you design solutions with a drag-and-drop interface, and then generates code that runs natively on your platform.
Of course, piping the data to its destination is only half the battle. The other half is analyzing and reporting on it, and that’s where Looker comes in. The new Block quickly gets you up and running with analytics in Looker based on the schemas produced by Talend Studio.
Data quality metrics that matter
Specifically, the Block uses data from the Talend data quality data mart, a subject-oriented subset of a data warehouse that contains data quality analysis results from Talend data profiling. The new Block provides a “getting started” template to create a set of key metrics for exploration, and a dashboard that displays measurements for six data quality dimensions. As defined by the Data Management Association of the UK, those dimensions are:
- Completeness: The proportion of data stored against the potential for 100%
- Timeliness: The degree to which data represent reality from the required point in time
- Validity: The degree to which data conforms to the syntax (format, type, or range) of its definition
- Accuracy: The degree to which data correctly describes the real-world object or event being described
- Consistency: The absence of difference, when comparing two or more representations of a thing against a definition
- Uniqueness: Nothing is recorded more than once (based upon how that thing is identified)
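To make a few of these dimensions concrete, here is a minimal Python sketch of how completeness, validity, and uniqueness might be measured on a toy dataset. The records, the date format, and the rules are hypothetical illustrations, not how Talend’s profiling actually computes its scores:

```python
import re

# Hypothetical sample records; None marks a missing value.
records = [
    {"id": 1, "dob": "1984-03-12"},
    {"id": 2, "dob": "12/03/1984"},   # wrong format -> fails validity
    {"id": 3, "dob": None},           # missing -> fails completeness
    {"id": 3, "dob": "1990-07-01"},   # duplicate id -> fails uniqueness
]

def completeness(values):
    """Proportion of values that are present (non-null)."""
    return sum(v is not None for v in values) / len(values)

def validity(values, pattern):
    """Proportion of present values matching the expected syntax."""
    present = [v for v in values if v is not None]
    return sum(bool(re.fullmatch(pattern, v)) for v in present) / len(present)

def uniqueness(values):
    """Proportion of values that appear exactly once."""
    return sum(values.count(v) == 1 for v in values) / len(values)

DATE_RE = r"\d{4}-\d{2}-\d{2}"
dobs = [r["dob"] for r in records]
ids = [r["id"] for r in records]

print(f"completeness: {completeness(dobs):.2f}")  # 3 of 4 present -> 0.75
print(f"validity:     {validity(dobs, DATE_RE):.2f}")  # 2 of 3 valid -> 0.67
print(f"uniqueness:   {uniqueness(ids):.2f}")  # ids 1, 2 unique -> 0.50
```

Each function returns a proportion between 0 and 1, which maps naturally onto the percentage scores the dashboard displays.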
It also provides:
- An overall data quality score
- Number of rows processed, passed, and failed in the pipeline
- Data quality score over time
- Data quality score by department
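As a rough illustration of how metrics like these might be derived from per-rule results, here is a Python sketch. The record shape and the numbers are invented for the example; they are not the actual schema of the Talend data quality mart:

```python
from collections import namedtuple

# Hypothetical per-rule results: rows checked and rows that passed each rule.
RuleResult = namedtuple("RuleResult", ["rule", "department", "rows", "passed"])

results = [
    RuleResult("dob_not_null",   "retail",    1000, 940),
    RuleResult("dob_valid_date", "retail",    1000, 905),
    RuleResult("loan_id_unique", "corporate",  800, 800),
]

total_rows = sum(r.rows for r in results)
total_passed = sum(r.passed for r in results)
total_failed = total_rows - total_passed

# One simple definition of an overall score: passed rows / processed rows.
overall_score = 100 * total_passed / total_rows
print(f"processed={total_rows} passed={total_passed} failed={total_failed}")
print(f"overall data quality score: {overall_score:.1f}%")

# Score by department, as the dashboard's breakdown might compute it.
by_dept = {}
for r in results:
    rows, passed = by_dept.get(r.department, (0, 0))
    by_dept[r.department] = (rows + r.rows, passed + r.passed)
for dept, (rows, passed) in sorted(by_dept.items()):
    print(f"{dept}: {100 * passed / rows:.1f}%")
```

Computing the same score over time is just a matter of adding a date to each result and grouping by it instead of (or in addition to) department.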
This simple data quality dashboard is a starting point for a more robust dashboard that you can tailor to your organization’s needs and to how you choose to define your data quality rules. Some organizations, for example, might find a 95% rating good enough for one attribute (such as age range) but not good enough for another (such as date of birth). The final dashboard might weight the various measurements to provide simplified metrics.
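A weighted rollup with per-attribute thresholds might look something like the Python sketch below. The weights and thresholds are invented for illustration; they reflect choices an organization would make, not anything the Block prescribes:

```python
# Hypothetical per-dimension scores (percent) and weights reflecting how
# much each dimension matters to this organization.
scores  = {"completeness": 95.0, "validity": 88.0, "uniqueness": 99.5}
weights = {"completeness": 0.5,  "validity": 0.3,  "uniqueness": 0.2}

weighted_score = sum(scores[d] * weights[d] for d in scores)
print(f"weighted data quality score: {weighted_score:.1f}%")  # -> 93.8%

# Per-attribute thresholds: 95% may be acceptable for age_range but not
# for date_of_birth (hypothetical values).
thresholds = {"age_range": 95.0, "date_of_birth": 99.0}
attribute_scores = {"age_range": 95.2, "date_of_birth": 96.0}

for attr, score in attribute_scores.items():
    status = "OK" if score >= thresholds[attr] else "NEEDS ATTENTION"
    print(f"{attr}: {score:.1f}% ({status})")
```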
How to use the new Looker Block in Talend Studio
Let’s assume we are going to build a data quality dashboard to support the BCBS 239 regulation, which is meant to strengthen banks’ risk data aggregation capabilities and internal risk reporting practices.
To achieve this, you would:
- Create an inventory of risk reports that are submitted to the local banking regulator. For example, reports that cover credit risk, market risk, operational risk, and so on.
- Document all the individual attributes that constitute each report. For a Credit Risk Report, those attributes would include the value of total loans, OTC derivatives, traded loans, traded bonds, and more.
- Identify the list of applications that are used in the preparation of the risk report.
- Build and execute data quality rules at important points in the data lineage.
- Feed the results of these rules into the Talend data quality mart and use the Looker Block to create a user-friendly dashboard.
Once the dashboard is refreshed at a regular cadence, you can monitor your data quality metrics over time. For instance, you can drill into the numbers to understand why a data quality score is low and which rules are causing it, and then work with the application team to fix the underlying issue.
For example, suppose the “Completeness” data quality score is 51.8%. Drilling down into that score, you might find that 3 of the 10 data quality rules have low scores. Investigating further, you might discover that not all loan products were included — because a file wasn’t delivered on time. Now you have the information you need to rectify the low completeness score and ensure you’re making decisions with accurate data.
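The arithmetic behind that drill-down is easy to sketch in Python. The ten per-rule scores below are hypothetical, chosen so that three low scores drag the dimension’s average down to 51.8%:

```python
# Hypothetical per-rule completeness scores (percent) for ten rules.
rule_scores = [75, 74, 72, 71, 70, 68, 65, 10, 8, 5]

# The dimension's score here is the simple average across its rules.
dimension_score = sum(rule_scores) / len(rule_scores)

# Flag the rules worth investigating first.
low = [(i, s) for i, s in enumerate(rule_scores, 1) if s < 50]

print(f"completeness: {dimension_score:.1f}%")  # -> 51.8%
print(f"rules to investigate: {low}")  # three rules scoring 10, 8, and 5
```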
Start monitoring your data quality today
To start using the data quality Block for yourself, open a project in Talend Studio and create an analysis to profile a dataset. You might be surprised by how quickly you can access useful metrics.