Importance of high-quality information
Without high-quality data we cannot understand today's marketplace, our customers, their decisions and needs. The most important decisions in a company have to rely on reliable, trustworthy data. Wrong information can lead to wrong business decisions, which can have a disastrous impact on your company. Fortunately, more and more business leaders are aware of this fact and invest more resources in keeping their data in good shape.
Taking care of data quality in your company
Our recent project dealt with data quality on a multi-vendor database DWH platform. Being able to control what is processed, to compare sources and targets, and to assure that business users receive correct, up-to-date information on time was the primary aim of the IT management team. On this distributed, multi-vendor database DWH platform, the ETL processes were missing crucial business information. We were brought in to propose a solution based on the tools already in use on the project.
Solution limitations / constraints we faced in the field:
One huge limitation of Talend Open Studio for Data Quality is access to only one source (database/schema/catalog). Having data to compare across various sources makes it almost useless, unless you invent a workaround. Our project team considered several approaches; unfortunately, an upgrade to the Enterprise edition was not an option. One workaround idea was to create a central DQ repository on Oracle and leverage Oracle's heterogeneous connectivity feature, which allows connecting to PostgreSQL and MS SQL via database links. In the next step we could build synonyms for almost all database objects of interest and "cheat" Talend. Unfortunately, this option was classified as a potential security breach in the system and was no longer considered.
Data Quality rules
Users requested flexible functionality allowing them to set up data quality rules parametrized by: the source/target pair to compare, the set of columns to compare, an expression, aggregate functions, etc. Having basic SQL skills, they could easily test their rules and make sure the DWH platform delivers high-quality indicators. The following attributes were assigned to the test rules:
From a business point of view, the rules have to perform many different calculations executed on different data sets/databases, such as: the number of clients coming from different departments, the number of non-active clients, the number of clients acquired in the last month, the value of insurance policies sold in the last month, the value of coverages, total income in the last month, and total income in the last month per department. The returned results are compiled afterwards; disparities are filtered out and highlighted for further investigation. This is the result the client expected us to deliver.
Case study: an exemplary data quality rule. Management wants to know whether information about new clients has been calculated properly. SQL statements are prepared and set up in the DQ engine to execute on the appropriate source systems.
After execution, the test results can be found in the DQ schema. Tests are fully automated; there is no need for database links between systems or for manually migrating data. The DQ officer receives test results in one dedicated place.
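As a minimal sketch of such a rule, consider the following Python snippet. It uses in-memory SQLite databases as stand-ins for the real source and target systems, and all table, column and rule names are invented for illustration; it is not the actual DQ engine, only the idea behind it: the same business question is asked of source and target, and a disparity is flagged.

```python
import sqlite3

# Stand-ins for the real source and target databases (hypothetical schemas).
source = sqlite3.connect(":memory:")
target = sqlite3.connect(":memory:")

source.execute("CREATE TABLE clients (id INTEGER, acquired_month TEXT)")
source.executemany("INSERT INTO clients VALUES (?, ?)",
                   [(1, "2014-05"), (2, "2014-05"), (3, "2014-04")])

target.execute("CREATE TABLE dwh_clients (id INTEGER, acquired_month TEXT)")
target.executemany("INSERT INTO dwh_clients VALUES (?, ?)",
                   [(1, "2014-05")])  # one client is missing in the DWH

# A DQ rule: the same business question asked of both sides.
rule = {
    "name": "new_clients_last_month",
    "source_sql": "SELECT COUNT(*) FROM clients WHERE acquired_month = '2014-05'",
    "target_sql": "SELECT COUNT(*) FROM dwh_clients WHERE acquired_month = '2014-05'",
}

src_value = source.execute(rule["source_sql"]).fetchone()[0]
tgt_value = target.execute(rule["target_sql"]).fetchone()[0]

# Disparities are flagged for further investigation.
status = "OK" if src_value == tgt_value else "MISMATCH"
print(rule["name"], src_value, tgt_value, status)
```

In the real solution the two statements run on different vendors' databases and the comparison result lands in the DQ schema rather than on standard output.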
Making Talend DI flexible
Talend was chosen as the bridge between the databases to compare, being a free, open-source data integration tool. Like every ETL tool, Talend has to have hardcoded port names and port data types, and there is no way to force it to be more flexible. To be flexible and reusable, the design of the generic Talend job has to be based on aliases defined in the DQM schema, not on real data object names. Thus all column names are dynamically converted to aliases after executing the proper query on the database and before passing the results to Talend. Talend processes the data and loads partial results into the DQM schema, into specially prepared temporary tables.
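The alias trick can be sketched as follows. This is a simplified Python illustration, not the actual Talend job; the generic alias names COL1, COL2, … and the sample table are assumptions. Real column names are rewritten to fixed aliases inside the query itself, so the downstream generic job always sees the same schema:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE policies (policy_id INTEGER, premium REAL)")
db.execute("INSERT INTO policies VALUES (1, 120.0), (2, 80.0)")

# Real column names for this rule, taken from the DQM metadata.
real_columns = ["policy_id", "premium"]

# Convert the real column names to generic aliases (COL1, COL2, ...) in the
# SELECT list, so the generic job can be reused for any rule.
select_list = ", ".join(
    f"{col} AS COL{i}" for i, col in enumerate(real_columns, start=1))
cursor = db.execute(f"SELECT {select_list} FROM policies")

# The generic job only ever sees the fixed alias schema.
aliases = [desc[0] for desc in cursor.description]
print(aliases)
```

Because the aliases never change, a single generic job can move any rule's result set into the temporary tables of the DQM schema.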
When modeling the database objects and the mapping grid, we assumed the following requirements:
In the next phase, stored procedures located in the DQ schema compare both precalculated data sets. Results are logged into the DQ schema and can be tracked via an Excel sheet. Designing a proper driving schema to keep the information about the metadata and the data quality rules was crucial. It has to transform/map specific database object names onto the aliases on which the generic Talend job can operate, and move them into our central repository. The DQ schema is filled by the DQ officer with metadata, and this metadata is used afterwards by the DQ engine to generate queries and run them on the different data sets.
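A rough sketch of how a driving schema can turn metadata into executable queries is shown below. The metadata row and the `generate_query` helper are hypothetical, chosen only to illustrate the principle: the DQ officer enters declarative metadata, and the engine assembles the SQL from it.

```python
# Hypothetical metadata row, as the DQ officer might enter it into the
# driving schema: which object, which aggregate, which measure, which filter.
rule_meta = {
    "object_name": "sales",
    "aggregate": "SUM",
    "measure": "income",
    "filter": "month = '2014-05'",
}

def generate_query(meta):
    """Assemble the SQL the DQ engine runs from one metadata row."""
    return (f"SELECT {meta['aggregate']}({meta['measure']}) "
            f"FROM {meta['object_name']} WHERE {meta['filter']}")

print(generate_query(rule_meta))
```

The same metadata row, pointed at a source and at a target object, yields the two queries whose results the stored procedures then compare.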
Persistence of Analysis
Another important limitation of Talend Open Studio for Data Quality is the lack of persistence of performed analyses. They are saved neither in the database nor in the file system. One cannot set them up as a batch process and execute them on a regular basis; they can only be performed ad hoc and viewed via Talend Studio. To overcome this limitation, the DQ engine has been delivered. Results of analyses are gathered and saved in the database. Execution can be scheduled or done ad hoc. Results can also easily be exported to various file formats (PDF, XML, CSV, Excel) or sent via e-mail to the data quality officer after execution.
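The persistence idea can be sketched as follows, again with an in-memory SQLite database standing in for the real DQ schema; the `dq_results` table layout and the sample row are assumptions made for this illustration. Every executed analysis is written to a results table, from which it can later be exported, here to CSV:

```python
import csv
import io
import sqlite3

# Hypothetical results table in the DQ schema, where the engine persists
# every executed analysis instead of keeping it only inside the Studio.
db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE dq_results (
                  rule_name TEXT, run_at TEXT, src_value REAL,
                  tgt_value REAL, status TEXT)""")
db.execute("INSERT INTO dq_results VALUES "
           "('new_clients_last_month', '2014-06-01', 2, 1, 'MISMATCH')")

# Export the persisted results to CSV for the DQ officer.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["rule_name", "run_at", "src_value", "tgt_value", "status"])
for row in db.execute("SELECT * FROM dq_results ORDER BY run_at"):
    writer.writerow(row)

print(buffer.getvalue())
```

Because the results live in a table rather than in the Studio session, a scheduler can rerun the analyses and the export can be mailed out automatically.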
Powerful Data Quality Analysis in a Free, Open Source Solution
With Talend Open Studio for Data Quality you can quickly and easily design and execute data quality analysis jobs, using its Eclipse-based graphical development console, chock-full of drag-and-drop data quality tools. Talend Open Studio for Data Quality delivers robust data quality analysis functionality, including:
Struggling with data quality issues?
If you are struggling with a similar data quality challenge and you think our approach could be useful to you, do not hesitate to contact us and ask for more materials. We design cost-effective solutions that can be easily tailored to your needs, built entirely from free components, with no licence fees. Let us know more about your requirements; then we can run an analysis, arrange case studies and show you how our solution could work for your company.
Gregor Ash, Infastage Ltd
Mobile: 00 48 609 076 200 or 00 41 (44) 586 21 96
E-mail: email@example.com