About Data Quality

In the information age, data are a primary tool for achieving business success and facing the competition. But the formula for success does not always lie in the amount of data. Most organizations are aware that the data stored in their operational systems and data warehouses are not complete, accurate, or consistent. Attempts to benefit from low-quality data are likely to fail and can even cause damage. Often the share of redundant data exceeds 70%. The following article by Michael Baor addresses this issue.

By Shmulik Aloni, Manager of Business Integration Section 

Market pressures are forcing companies to invest millions of dollars in business intelligence (BI) architectures, ERP systems, and decision reporting tools. But companies often ignore the component that is most vital for success: clear, consistent, and reliable data. Data are the building blocks of information. If the data in the operational systems and data warehouses are not complete, accurate, and consistent, attempts to benefit from the organization’s information are likely to fail and can even cause damage.

Consider a bank's data warehouse that imports data from various operational systems and contains duplicate records for the same customer. One record lists a negative balance of -$1,000 for the customer, while another record shows that the customer also has a positive balance of $1.5M. A branch manager who accesses the first record but not the second can inadvertently cause the loss of this customer. This is a case of record duplication and of the inability to recognize that both records belong to the same customer.
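The bank scenario above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the field names (`source`, `name`, `national_id`, `balance`), the sample customer, and the matching rule (normalized name plus the digits of an ID number) are hypothetical, and real matching tools use far more robust techniques.

```python
from collections import defaultdict


def normalize_key(record):
    """Build a crude match key (assumed rule): lowercased name with
    collapsed whitespace, plus only the digits of the ID number."""
    name = " ".join(record["name"].lower().split())
    id_digits = "".join(ch for ch in record["national_id"] if ch.isdigit())
    return (name, id_digits)


def consolidate(records):
    """Group records that share a match key into one view per customer."""
    groups = defaultdict(list)
    for rec in records:
        groups[normalize_key(rec)].append(rec)

    customers = []
    for (name, id_digits), recs in groups.items():
        customers.append({
            "name": name,
            "national_id": id_digits,
            "source_systems": sorted(r["source"] for r in recs),
            "total_balance": sum(r["balance"] for r in recs),
        })
    return customers


# Hypothetical records for the same person, loaded from two systems.
records = [
    {"source": "checking", "name": "Dana  Levi",
     "national_id": "012-345-678", "balance": -1_000},
    {"source": "investments", "name": "dana levi",
     "national_id": "012345678", "balance": 1_500_000},
]

unified = consolidate(records)
for customer in unified:
    print(customer)
```

With the two records merged, the branch manager would see a single customer whose combined position is positive rather than an isolated overdraft.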

Decisions Based on Deficient Data
Without high-quality data, decision makers must guess where they should know. Worse, they are liable to make decisions that appear to be informed but are in fact based on deficient data.

To work with information and decision support systems that provide reliable and uniform data in real time, it is not enough to collect the data and feed them to the BI tools. First, it is important to gain an in-depth understanding of all aspects of the data and to explore possibilities for improvement before making the data available to a wider group of users. To prevent poor performance of the BI system, it is important to test the quality of the data and verify that they are complete, consistent, up to date, and accurate.
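As a rough illustration of such a test, the sketch below scores a small batch of records for completeness, consistency, and freshness before they would be released to users. The field names, the list of valid country codes, and the one-year freshness threshold are all assumptions made for the example. Accuracy is deliberately left out, because checking it requires comparison against a trusted reference source.

```python
from datetime import date

# Assumed for illustration: required fields and a reference code list.
REQUIRED_FIELDS = ("customer_id", "name", "country", "updated")
VALID_COUNTRIES = {"IL", "US", "DE"}


def quality_report(rows, today, max_age_days=365):
    """Return the share of rows that pass each quality check."""
    if not rows:
        raise ValueError("cannot profile an empty batch")
    total = len(rows)

    # Complete: every required field is present and non-empty.
    complete = sum(
        all(r.get(f) not in (None, "") for f in REQUIRED_FIELDS)
        for r in rows
    )
    # Consistent: the country uses a code from the reference list.
    consistent = sum(r.get("country") in VALID_COUNTRIES for r in rows)
    # Up to date: the row was updated within the allowed age.
    fresh = sum(
        r.get("updated") is not None
        and (today - r["updated"]).days <= max_age_days
        for r in rows
    )
    return {
        "completeness": complete / total,
        "consistency": consistent / total,
        "freshness": fresh / total,
    }


# Hypothetical sample batch.
rows = [
    {"customer_id": 1, "name": "Dana Levi", "country": "IL",
     "updated": date(2024, 5, 1)},
    {"customer_id": 2, "name": "", "country": "Israel",
     "updated": date(2021, 1, 15)},
    {"customer_id": 3, "name": "Tom Katz", "country": "US",
     "updated": None},
    {"customer_id": 4, "name": "Ana Ruiz", "country": "DE",
     "updated": date(2024, 3, 10)},
]

report = quality_report(rows, today=date(2024, 6, 1))
print(report)
```

Scores like these give a simple baseline that can be tracked over time and reported to end users alongside the data.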

By strongly emphasizing data quality, organizations can accelerate the development of the architecture of their BI systems; reduce the number of repeat transformations and extractions from the warehouse; report to end users about data quality issues; and eventually increase profits and return on investment in the BI infrastructure.

A data cleansing process of this nature results in a significant reduction in the number of customers and vendors because it eliminates duplicate and outdated records. At times, the reduction in the number of customers, vendors, and items reaches 70%. Automatic data cleansing tools are used when transferring data from operational systems to the data warehouse (ETL), in combination with data integration tools, which are intended to move the data out of the information systems while saving time and costly development resources. An additional benefit of using automated systems is that the monitoring and quality of the data improve significantly, and decisions are made based on a unified view of each customer as a single entity.

Many organizations are aware of problems having to do with transaction data in their systems, but do not know how to address the problems. By their nature, data migration and conversion processes reveal data quality problems. When data from various sources are integrated and subjected to new business requirements in the data warehouse, they become accessible to business users; in these situations, the completeness and reliability of the data gain paramount importance. Data quality problems can be the result of several factors, including:

  • Data distributed among several platforms and legacy systems

  • Extensive data redundancy between various application systems

  • Lack of standards for data within the organization

  • Deficient metadata (description of the source of the data and the manner of their computation) or complete lack of metadata for legacy systems


Maximizing Profits from Data
There are many tools for dealing with data quality by performing various operations such as tracing flaws and improving processes. Tools for handling data quality issues help organizations gain control over their data assets and derive optimal benefits from their BI infrastructure. These tools help maximize the organization’s return on its investment in data infrastructure in several ways:

  • Analyzing organizational data and ensuring that only “clean” and reliable data populate the warehouses and datamarts

  • Revealing hidden business rules and verifying their validity

  • Prioritizing data quality issues so that the organization invests in the areas that have the greatest influence

  • Using business rules and data verification to cleanse the data while they are being transferred

  • Monitoring and managing data over time to ensure that the active data cleansing programs provide consistent and measurable advantages

By understanding the properties, advantages, and deficiencies of the original data, the organization can prevent surprises, set expectations, and reduce the need for corrections.
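To make the idea of cleansing data with business rules during transfer more concrete, here is a minimal sketch. The two rules, the field names, and the sample rows are hypothetical. Rows that pass every rule are lightly cleaned and kept, and the rest are set aside together with the names of the rules they failed.

```python
def name_is_present(row):
    """Assumed rule: the name must be a non-blank string."""
    value = row.get("name")
    return isinstance(value, str) and bool(value.strip())


def quantity_is_positive(row):
    """Assumed rule: the quantity must be a positive integer."""
    value = row.get("quantity")
    return isinstance(value, int) and value > 0


RULES = {
    "name is present": name_is_present,
    "quantity is positive": quantity_is_positive,
}


def transfer(rows):
    """Split rows into clean rows and rejected rows with failure reasons."""
    clean, rejected = [], []
    for row in rows:
        failures = [label for label, check in RULES.items() if not check(row)]
        if failures:
            rejected.append((row, failures))
        else:
            # Light cleansing on the way in: trim stray whitespace.
            clean.append({**row, "name": row["name"].strip()})
    return clean, rejected


# Hypothetical rows on their way to the warehouse.
rows = [
    {"name": " Acme Ltd ", "quantity": 3},
    {"name": "", "quantity": 5},
    {"name": "Beta", "quantity": -2},
]

clean, rejected = transfer(rows)
print("loaded:", clean)
print("rejected:", rejected)
```

Keeping the rejected rows and their failure reasons, rather than silently dropping them, supports the monitoring and prioritization described above.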

One of the most prevalent myths is that a new system or data warehouse will fix data problems that originate in the legacy systems. Although a data transfer process transforms the data to better serve business needs, the transformation in itself does not guarantee cleaner data. With the right data cleansing, conversion, and loading tools, and with data quality assurance tools, it is possible to maintain operational systems and data warehouses that contain reliable, uniform, and useful data that serve the overall success of the organization.