July 2008: Originally published in IDQ Newsletter Vol 4 Issue 3
Data categories are groupings of data with common characteristics or features. They are useful for managing the data because certain data may be treated differently based on their classification. Understanding the relationship and dependency between the different categories can help direct data quality efforts.
For example, a project focused on improving master data quality may find that one of the root causes of quality problems actually comes from faulty reference data that were included in the master data record.
By being aware of the data categories, a project can save time by including key reference data as part of its initial data quality assessments. From a data governance and stewardship viewpoint, those responsible for creating or updating data may be very different from one data category to another.
THE SMITH CORP. EXAMPLE
Your company, Smith Corp., sells widgets to state and federal government agencies, commercial accounts, and educational institutions. ABC Inc. wants to purchase four Blue Widgets from you. ABC Inc. is one of your commercial customers (identified as Customer Type 03) and has been issued a customer identifier number of 9876.
The Blue Widget has a product number of 90-123 and its unit price depends on customer type. ABC Inc. purchases four Blue Widgets at a unit price of $100 each (the price for a commercial customer) for a total price of $400.
Figure 1 below illustrates that transaction.
Figure 1 – An example of data categories.
When the agent from ABC Inc. calls Smith Corp. to place an order, the Smith Corp. customer representative enters ABC Inc.’s customer number in the sales order transaction. ABC Inc.’s company name, customer type, and address are pulled into the sales order screen from its customer master record. The master data mentioned are essential to the transaction.
When the product number is entered, the product description of “Blue Widget” is pulled into the sales order along with a unit price that has been derived based on the customer type. Therefore, the total price for four Blue Widgets is $400.
Let’s look at the data categories included in this example. We have already mentioned that the basic customer information for ABC Inc. is contained in the customer master record. Some of the data in the master record are pulled from controlled lists of reference data.
An example is customer type. Smith Corp. sells to four customer types, and the four types with their associated codes are stored as a separate reference list. Another set of reference data associated with this customer’s master record (but not shown in the figure) is the list of valid U.S. state codes, which is used when creating the address for ABC Inc. An example of reference data needed for the transaction but not pulled in through the master data is the list of available shipping options (also not shown in the figure).
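The Smith Corp. transaction can be sketched in a few lines of Python to make the three categories concrete. This is only an illustrative sketch, not a description of any real system: the class and function names are invented for the example, the customer type codes other than 03 are hypothetical, and storing the unit price in a lookup keyed by product number and customer type is an assumption about how the price might be derived.

```python
from dataclasses import dataclass
from decimal import Decimal

# Reference data: controlled list of customer types.
# Only code "03" (commercial) comes from the article; the rest are hypothetical.
CUSTOMER_TYPES = {
    "01": "State government",    # hypothetical code
    "02": "Federal government",  # hypothetical code
    "03": "Commercial",          # from the article
    "04": "Educational",         # hypothetical code
}

# Price list keyed by (product number, customer type).
# Only the commercial price of $100 is given in the article.
UNIT_PRICES = {("90-123", "03"): Decimal("100.00")}


@dataclass(frozen=True)
class Customer:  # master data
    customer_id: str
    name: str
    customer_type: str


@dataclass(frozen=True)
class Product:  # master data
    product_number: str
    product_name: str


@dataclass(frozen=True)
class SalesOrderLine:  # transactional data
    customer_id: str
    customer_name: str
    customer_type: str
    product_number: str
    description: str
    quantity: int
    unit_price: Decimal

    @property
    def total_price(self) -> Decimal:
        return self.unit_price * self.quantity


CUSTOMERS = {"9876": Customer("9876", "ABC Inc.", "03")}
PRODUCTS = {"90-123": Product("90-123", "Blue Widget")}


def create_order_line(customer_id: str, product_number: str, quantity: int) -> SalesOrderLine:
    """Build a sales order line by pulling in master and reference data."""
    customer = CUSTOMERS[customer_id]    # master data lookup
    product = PRODUCTS[product_number]   # master data lookup
    if customer.customer_type not in CUSTOMER_TYPES:
        raise ValueError(f"Unknown customer type: {customer.customer_type}")
    # The unit price depends on the customer type stored in the master record.
    unit_price = UNIT_PRICES[(product.product_number, customer.customer_type)]
    return SalesOrderLine(
        customer_id=customer.customer_id,
        customer_name=customer.name,
        customer_type=customer.customer_type,
        product_number=product.product_number,
        description=product.product_name,
        quantity=quantity,
        unit_price=unit_price,
    )


order = create_order_line("9876", "90-123", 4)
print(order.description, order.total_price)  # Blue Widget 400.00
```

Note that a wrong customer type on the master record, or a wrong entry in the price list, would flow straight into the order total without any error being raised.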
Reference data are sets of values or classification schemas that are referred to by systems, applications, data stores, processes, and reports, as well as by transactional and master records. Reference data may be unique to your company (such as customer type), but can also be used by many other companies.
Examples are standardized sets of codes, such as the currency codes defined and maintained by ISO (the International Organization for Standardization). In our example, the price calculation further emphasizes the importance of high-quality reference data. If the code list is wrong, or the associated unit price is wrong, then an incorrect price will be used for that customer.
Why have the customer record and product record been classified as master data? Master data describe the people, places, and things that are involved in an organization’s business. Examples include customers, products, employees, suppliers, and locations. Gwen Thomas created a ditty sung to the tune of “Yankee Doodle” that highlights master data:
Master data’s all around
Embedded in transactions.
Master data are the nouns
Upon which we take action. 
In our example, Smith Corp. has a finite list of customers and a finite list of products that are unique to and important to it — no other company will be likely to have the very same lists. While ABC Inc. is a customer of other companies, how its data are formatted and used by Smith Corp. is unique to Smith Corp.
For example, if Smith Corp. only sells to companies within the United States, it may not include address data (such as country) needed by other companies that sell outside of the United States and that also sell to ABC Inc. Addresses would be formatted differently within those companies to take international addresses into account. Likewise, Smith Corp.’s product list is unique to it, and the product master record may be structured differently from other companies’ product masters.
The sales order in the example is considered transactional data. Transactional data describe an internal or external event or transaction that takes place as an organization conducts its business. Examples include sales orders, invoices, purchase orders, shipping documents, and passport applications.
Transactional data are typically grouped into transactional records that include associated master and reference data. In the example, you can see that the sales order pulls data from two different master data records. It is also possible that reference data specific to the transaction are used — so not all reference data have to come through the master record.
Figure 1 also illustrates metadata, which means “data about data.” Metadata label, describe, or characterize other data and make it easier to retrieve, interpret, or use information. The figure shows documentation defining the fields in the product master record along with the field type and field length. Several kinds of metadata are described in Table 1.
Metadata are critical to avoiding misunderstandings that can create data quality problems. In Figure 1, you can see in the master record that the field containing “Blue Widget” is called “Product Name,” but the same data are labeled “Description” in the transactional record screen. In an ideal world, the data would be labeled the same wherever they are used. Unfortunately, inconsistencies such as the one in the figure are common and often lead to misuse and misunderstanding. Having clear documentation of metadata showing the fields (and their names) that are actually using the same data is important to managing those data and to understanding the impact if those fields are changed, or if the data are moved and used by other business functions and applications.
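One way to document which fields actually hold the same data is a small catalog that maps each logical data element to every label under which it appears. The following is a minimal sketch under stated assumptions: the catalog structure, the element name `product_name`, the definition text, and both function names are hypothetical. Only the two labels (“Product Name” and “Description”) and where they appear come from the example.

```python
# Hypothetical metadata catalog: one entry per logical data element,
# listing every (location, field label) pair where that element appears.
FIELD_CATALOG = {
    "product_name": {
        "definition": "Name of the product as sold by Smith Corp.",  # hypothetical wording
        "appears_as": [
            ("product master record", "Product Name"),
            ("sales order screen", "Description"),
        ],
    },
}


def find_element(label: str) -> list[str]:
    """Return the logical elements that appear somewhere under the given label."""
    wanted = label.lower()
    return [
        element
        for element, meta in FIELD_CATALOG.items()
        if any(field_label.lower() == wanted for _, field_label in meta["appears_as"])
    ]


def impacted_fields(element: str) -> list[tuple[str, str]]:
    """Return every (location, label) that would be affected if the element changed."""
    return FIELD_CATALOG[element]["appears_as"]


print(find_element("Description"))      # ['product_name']
print(impacted_fields("product_name"))
```

With a catalog like this, anyone planning a change to the product name field can see that the sales order screen is affected too, even though the label there is different.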
DATA CATEGORIES DEFINED
Table 1 includes definitions and examples for each of the data categories discussed previously. These definitions were jointly created by the author and Gwen Thomas, president of the Data Governance Institute.
Table 1 – Definitions of Data Categories
Data Category | Definition
Master Data
Master data describe the people, places, and things that are involved in an organization’s business.
Examples include people (e.g., customers, employees, vendors, suppliers), places (e.g., locations, sales territories, offices), and things (e.g., accounts, products, assets, document sets).
Because these data tend to be used by multiple business processes and IT systems, standardizing master data formats and synchronizing values are critical for successful system integration.
Master data tend to be grouped into master records, which may include associated reference data. An example of associated reference data is a state field within an address in a customer master record.
Transactional Data
Transactional data describe an internal or external event or transaction that takes place as an organization conducts its business.
Examples include sales orders, invoices, purchase orders, shipping documents, passport applications, credit card payments, and insurance claims.
These data are typically grouped into transactional records, which include associated master and reference data.
Reference Data
Reference data are sets of values or classification schemas that are referred to by systems, applications, data stores, processes, and reports, as well as by transactional and master records.
Examples include lists of valid values, code lists, status codes, state abbreviations, demographic fields, flags, product types, gender, chart of accounts, and product hierarchy.
Standardized reference data are key to data integration and interoperability and facilitate the sharing and reporting of information.
Reference data may be used to differentiate one type of record from another for categorization and analysis, or they may be a significant fact such as country, which appears within a larger information set such as address.
Organizations often create internal reference data to characterize or standardize their own information. Reference data sets are also defined by external groups, such as government or regulatory bodies, to be used by multiple organizations. For example, currency codes are defined and maintained by the International Organization for Standardization (ISO).
Metadata
Metadata literally means “data about data.” Metadata label, describe, or characterize other data and make it easier to retrieve, interpret, or use information.
Technical metadata are metadata used to describe technology and data structures. Examples of technical metadata are field names, length, type, lineage, and database table layouts.
Business metadata describe the non-technical aspects of data and their usage. Examples are field definitions, report names, headings in reports and on web pages, application screen names, data quality statistics, and the parties accountable for data quality for a particular field. Some organizations would classify ETL (Extract-Transform-Load) transformations as business metadata.
Audit trail metadata are a specific type of metadata, typically stored in a record and protected from alteration, that capture how, when, and by whom the data were created, accessed, updated, or deleted. Audit trail data are used for security, compliance, or forensic purposes. Examples include timestamp, creator, create date, and update date. Although audit trail metadata are typically stored in a record, technical metadata and business metadata are usually stored separately from the data they describe.
These are the most common types of metadata, but it could be argued that there are other types of metadata that make it easier to retrieve, interpret, or use information. The label for any metadata may not be as important as the fact that it is being deliberately used to support data goals. Any discipline or activity that uses data is likely to have associated metadata.
Additional data categories that impact how systems and databases are designed and data are used:
Historical data contain significant facts, as of a certain point in time, that should not be altered except to correct an error. They are important to security and compliance. Operational systems can also contain history tables for reporting or analysis purposes. Examples include point-in-time reports, database snapshots, and version information.
Temporary data are kept in memory to speed up processing. They are not viewed by humans and are used for technical purposes. An example is a copy of a table that is created during a processing session to speed up lookups.
Source: Copyright © 2007-2008 Danette McGilvray and Gwen Thomas. Used by permission.
Your data may be categorized differently from what the table describes. For example, some companies combine reference data and master data categories and call them master reference data (MRD). Sometimes it is difficult to decide whether a data set, such as a list of valid values, is only reference data or is also metadata. It has been said that one person’s metadata is another person’s data. No matter how data are categorized, the important point is that you are clear on what you are (and are not) addressing in data quality activities. You may find that such data quality activities should include data categories not considered previously.
RELATIONSHIPS BETWEEN DATA CATEGORIES
Figure 2 below shows the associations between the various data categories. Note that some reference data are required to create a master data record and that master data are required to create a transactional record. Sometimes reference data specific to transactional data (and not pulled in through the master records) are needed to create a transactional record. Metadata are required to better use and understand all other data categories.
From a historical data point of view, corresponding reference data may need to be maintained along with the master and transactional records; if not, important context and the meaning of the data may be lost. Auditors will want to know who updated the data and when, for all categories of data. That is why audit trail data are a part of metadata.
Figure 2 – Relationships between data categories
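The dependencies described above can also be expressed as simple checks: a master record should only use codes found in its reference lists, and a transactional record should only point to master records that exist. The following is a minimal sketch, assuming records are plain dictionaries. The function names, the field names, the customer type codes other than 03, and the deliberately invalid sample values are all hypothetical.

```python
def check_master(record: dict, reference_lists: dict) -> list[str]:
    """Report reference-coded fields whose value is missing from the reference list."""
    problems = []
    for field_name, valid_values in reference_lists.items():
        if field_name in record and record[field_name] not in valid_values:
            problems.append(f"{field_name}={record[field_name]!r} not in reference list")
    return problems


def check_transaction(txn: dict, master_tables: dict) -> list[str]:
    """Report key fields on a transaction that have no matching master record."""
    problems = []
    for key_field, table in master_tables.items():
        if txn.get(key_field) not in table:
            problems.append(f"{key_field}={txn.get(key_field)!r} has no master record")
    return problems


# Reference data (codes other than "03" are hypothetical).
reference_lists = {"customer_type": {"01", "02", "03", "04"}}

# Master data from the Smith Corp. example.
customer = {"customer_id": "9876", "name": "ABC Inc.", "customer_type": "03"}
product = {"product_number": "90-123", "product_name": "Blue Widget"}
master_tables = {
    "customer_id": {"9876": customer},
    "product_number": {"90-123": product},
}

# Transactional data from the Smith Corp. example.
order = {"customer_id": "9876", "product_number": "90-123", "quantity": 4}

print(check_master(customer, reference_lists))  # []
print(check_transaction(order, master_tables))  # []

# Hypothetical bad records, to show what a failed dependency looks like.
bad_customer = {"customer_id": "1234", "customer_type": "07"}
bad_order = {"customer_id": "9876", "product_number": "99-999", "quantity": 1}
print(check_master(bad_customer, reference_lists))
print(check_transaction(bad_order, master_tables))
```

Checks like these only confirm that a code or identifier exists. They do not confirm that it is the correct one for the record, which is why the quality of the reference lists themselves still has to be assessed.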
DATA CATEGORIES—WHY WE CARE
It is easy to see from the examples just given that the care given to your reference data strongly impacts the quality of your master and transactional data.
Reference data are key to interoperability. The more you manage and standardize them, the more you increase your ability to share data across and outside of your company. An error in reference data has a multiplying effect, because the faulty value is passed on to every master and transactional record that uses it.
The quality of master data impacts transactional data, and the quality of metadata impacts all categories. For example, documenting definitions (metadata) improves quality because it transforms undocumented assumptions into documented and agreed-on meanings so the data can be used consistently and correctly.
As mentioned previously, your company’s data are unique. This applies to master data (such as product, vendor, and customer data) as well as to reference data and metadata. No other organization will be likely to have the very same data. If correct and managed conscientiously, your data provide a competitive advantage because they are tuned to your company’s needs.
Imagine the cost savings and revenue potential for the company that has accurate data, can find information when needed, and trusts the information found. Quality must be managed for all data categories in order to gain that competitive advantage. Of course, you will have to prioritize your efforts, but consider all the data categories when selecting your data quality activities.
Note 1 (on the “Yankee Doodle” ditty): For entertainment and education, see www.datagovernance.com for other ideas that Gwen Thomas has set to familiar tunes.
About the Author
Danette McGilvray is president and principal of Granite Falls Consulting, Inc., a firm that helps organizations increase their success by addressing the information quality and data governance aspects of their business efforts. Focusing on bottom-line results, Danette helps organizations enhance the value of their information assets by naturally integrating information quality management into the business. She also emphasizes communication and the human aspect of information quality and governance.
Danette is the author of Executing Data Quality Projects: Ten Steps to Quality Data and Trusted Information™ (Morgan Kaufmann, 2008). She is an internationally respected expert, and her Ten Steps™ approach to information quality has been embraced as a proven method for both understanding and creating information and data quality in the enterprise. The Chinese-language edition will be available in June 2011, and her book is used as a textbook in university graduate programs. She has contributed articles to various industry journals and newsletters and has been profiled in PC Week and HP Measure Magazine. She was an invited delegate to the People's Republic of China to discuss roles and opportunities for women in the computer field.
She can be reached via email at danette [AT] gfalls [dot] com