Exceeding Talend Data Quality Studio limitations

0

Importance of high quality information

Without high-quality data we  cannot understand  today’s market place,  our customers, their decisions and needs. Most important decisions in Company have to rely on reliable and confident data. Wrong information can lead  into potential wrong business decisions which can have disastrous impact on your  Company. Hopefully more and more business leaders is already aware of this fact and invest more resources to keep data in good shape.

Taking care of data quality in your Company

Our recent project cope with quality of data on the multi-vendor database  DWH platform.  Being able to control what is process,  compare sources and targets,  assuring business users receive correct and actual information on  time – was the primary aim of  the IT  management team. On the distrubuted, mulit- vendor database DWH platform ETL pocesses were missing crucial business information. We were involved to propose solution base on the tools that were already in used on the project.

Solution limitation \ constraints we faced on the field side:

  • Multi-vendor database DWH  platform, ( Postgre CE, Oracle 12c,  MS SQL,  Teradata )
  • No access to commercial data quality engine
  • Implementation set of rules defined by business  in  flexible way allowing further “easy-going” modifications
  • Talend as preferred  bridge between databases

One of the huge limitation of  Talend  Open Source DQ  Studio is access to only one source ( database/schema/catalog)… sic !  Having data to compare  on various sources makes Open Talend DQ  Studio almost useless, unless you invent some workaround. Our project team considered several approaches. Unfortunately upgrade onto enterprise edition was not the option.  One of the idea for workaround was creating on Oracle central DQ repository and leveraging Oracle heterogeneous connectivity feature, allowing to connect to PostgreSQL and MS SQL via  database link. In next step we could build synonyms to almost all database object in interest and cheat Talend!   Unfortunately this option was classified as potential security breach in the system and not longer reconsider.

Data Quality rules

Users requested to have flexible functionality allowing them to  setup data quality rules  parametrized by:  pair source/target to compare, set of columns to compare, expression, aggregate functions etc.  Having basic SQL  skills  they could easily test their rules and make sure DWH platform delivers high quality indicators .  To test rules were assigned following attributes:

  • Subject area: definition of business area in which given data quality  rule is used
  • Test type: technical characteristic of  data quality rule. Test type determined technical complexity of data quality rule.
  • Severity : determined by disparity rate between compared data sets. Thresholds of the disparity rates are setup in metadata.

From business point of view rules have to perform many different calculation executed  on different data sets / databases like : number of clients coming from different departments,  number non active clients,  number  of clients acquired  in last month , value of sold insurance policies in last month,  value of coverages, total income in last month, total income in last month per department. Returned results  are compiled afterwards.  disparity are filter out and highlighted for further investigation. This is the result which Client expected from us to deliver.

dq_rules_450

Case study, exemplary data quality  rule.  Management wants to know  if information about new Clients has been  calculated properly.  SQL statements are prepared and setup in DQ engine  to execute on appropriate source system.

  1. Investment Banking database.  Total number new of   new Clients from last end month.
  2. Private Banking database. Total number of new Clients from last end month.
  3. Retail Banking database. Total number of new Clients from last end month.
  4. Bank Data Warehouse. Number of new Clients from last month grouped  by source systems (Retail Banking, Investment Banking, Private Banking).

After execution test result can be found in DQ schema.  Test are fully automated, there is no need to have db links between systems or manually migrate data.  DQ  officer receives test results in one dedicated place.

dq_flow

Making Talend  DI flexible

Talend was chosen as a bridge between databases to compare as the Open Source and free tool to integrate data. Talend as each  ETL  device has to have hardcoded port names, port data types – and there is now way force him to be more flexible.  Design of generic Talend job has to base on the aliases defined on the DQM schema not real data object name to be flexible and  reusable. Thus all  columns name are dynamically converted onto aliases  after executing proper query on database and before passing results onto Talend. Talend processes data and loads  partial results into DQM schema – to the special prepared temporary tables.

 

dq_tool_450

 

 

When modeling database objects and  mapping  grid we assumed following  requirements:

  •  up to ten columns to check ,
  • up to ten measure to calculate ,
  • up to ten expression to  determine.

In the next phase stored procedures located on DQ schema compare both precalculated data sets. Results are logged into DQ schema and can be tracked via Excel sheet. Making proper driving schema to keep information about the metadata and data quality rules was crucial.  It has to transform/map specific database object names onto alias on  which  generic  Talend job could operate and move them in into our central repository. DQ  schema is filled by DQ officer with metadata  and these metadata are used afterwards by DQ engine to generate queries  and to run then on different data sets.

Persistence of Analysis

Another important limitation of  Data Quality Studio from Talend is lack persistence of performed analysis. They are not saved in database nor in the file system.  One can not set up them as  batch process and execute on regular basis –  they can only be performed ad hoc and view via Talend Sudio.  To overcome this limitation  DQ engine has been  delivered. Results  of analysis are gathered and saved  in database.  Execution can be scheduled or can be done ad hoc.  Results be also easily exported to any kind of file: pdf, xml, csv, excel or send via e-mail do data quality officer after execution.

Powerful Data Quality Analysis in a Free, Open Source Solution

With Talend Open Studio for Data Quality you can quickly and easily design and execute data quality analysis jobs. With its Eclipse-based graphical development console chock-full of drag ‘n drop data quality tools. Talend Open Studio for Data Quality delivers robust data quality analysis functionality including:

  • Comprehensive data connectivity. Talend provides more built-in data connectors than any other data quality application, making it easy to connect to and read from any file format, database, or packaged enterprise application.
  • Statistical profiling. Quickly generate statistics on blank fields, null values, duplicates, unique values, the most and least frequent values, and much more.
  • Textual analysis. Develop profiles of text fields including minimum, maximum, and average length.
  • Numeric analysis. Analyze numeric fields to determine means, ranges, quartile distributions, and so on.
  • Pattern-based analysis. Assess data for conformance with standardized patterns like email address syntax, credit card number formats, and postal codes; or customized patterns for your specific needs, like SKU or serial number format

Struggling with data quality issues ?

If you are struggling with  similar challenge with quality of data and you think our approached could be useful for  you – do not hesitate to contact us and ask for more materials.  We  design cost effective  solution which can be easily  tailored  to your needs.  It is build entirely from free components.  No licence fees.   Let us know more about your requirements, than  we might  make analysis,  arrange case studies and show you how our solution could work for your Company.

 

Gregor Ash,   Infastage Ltd

Mobile: 00 48 609 076 200    or   00 41 (44) 586 21 96

E-mail address :   gregor.ash@infastage.pl

Leave us a comment