Information Technology Demystified

A Report from the UniForum Technical Steering Committee

Metadata Tracks a Moving Target


Each month, the TSC examines a key emerging technology or its use. This time, we look at a major aspect of data warehousing.

By Katherine Hammer

Today, not only is business changing rapidly, so are the demands on operational IT systems. Industry experts contend that large IS organizations spend up to 80 percent of every programming dollar just on maintaining operational systems. As IT continues to evolve and its uses to expand, there is little chance that this cost will decrease.

Much of this maintenance expense entails changes to operational applications and databases. Therefore, the "hot" application of the moment--the data warehouse--is built on shifting sands. If a data warehouse team does not build a maintenance strategy into its architectural design, the cost of maintaining the warehouse is likely to rival the cost of maintaining operational systems.

The key to avoiding this maintenance burden lies in the area of metadata management. Metadata is simply data about data: what fields constitute a record definition, the characteristics of each field, where and how the data defined by the record definition is stored and other characteristics. In many organizations the task of uncovering the metadata that defines the operational systems and their interdependencies can constitute a formidable task for the warehouse design team.

Metadata is typically stored in different locations, often in diverse, incompatible formats. In fact, some necessary metadata may not be available in any readily accessible way, but buried in application code that exists in data interface programs which link related operational databases.

Add to this the fact that production databases are rarely rebuilt through a change in the database schema; the systems in question simply cannot be taken down long enough for this to be done. As a result, many versions of the schema are only implicit within a particular database. (For example, if field A contains "1," then the following fields represent X; otherwise Y.)

Employee turnover also has an impact. Individuals who implemented changes in the first place may no longer be available to help anticipate problems or provide insights on past change rationales to help solve the new problems. The overall result may be that no standard format exists across company databases.

Common Types of Metadata

The complexity of this problem becomes even more evident when one considers the variety of types of metadata required to provide IS users with the information they need to make intelligent judgments about modifying existing systems. In the event that modifications are made, this metadata must be accessible in a form that allows the warehouse maintenance team to analyze proactively and react quickly to minimize the impacts of those changes.

A fundamental type of metadata is the definition of the databases being maintained under each database management system or file system. As noted, it is not uncommon for this information to be stored only within the application code. The database schema may have been redefined at some point to change the meaning of some fields, while preserving the previous field boundaries and data types to avoid rebuilding the entire database. This can be one of the more difficult types of metadata to manage, because unless a staff member recalls the schema redefinition, the correct metadata can be discovered only through trial and error.

Equally important is information about the relationships between the data elements stored under different data access systems. Today, a significant part of acquiring metadata may be automated by software tools. But this does not help when the relationships between databases are not recorded electronically. For example, in an employee identification field "EMPLOYEE-ID" in one database may be equivalent to "SOCSEC#" in another.

Data values often are semantically inconsistent, or even if semantically equivalent they can be represented differently so that some type of transformation is required before the data can be correlated. Sometimes these semantic inconsistencies may require considerable conditional logic to transform the source values into the appropriate form for the target system. Moreover, because the designers of operational systems were motivated to save disk space and CPU cycles, much of the data that resides in operational databases is not appropriate for use by end users (for example, symbolic fields like city names are often represented as integers, or some data is in binary form). As a result, building a warehouse entails the creation of numerous, sometimes complex business rules.

Another key type of metadata is data primacy. It addresses which database should be considered the database of record for a replicated data value. This is an example of metadata that is rarely recorded electronically; users of the system often consider data primacy to be a matter of common sense. For example, a customer's address may be stored in multiple databases, but in case of a conflict in the record content, the user is likely to consider the address in the billings database to be the most accurate. In other cases, it may be difficult to determine the database of record.

Iterative Management

As indicated above, one of the most important reasons for keeping metadata in a centralized location is impact analysis when change occurs in the operational environment. As a result, it is important that an organization's metadata management strategy include a mechanism for versioning of metadata, so what one learns about the various versions of schema that underlay most operational databases can be captured for use by later projects, rather than having to be rediscovered each time.

Because different types of metadata are distributed throughout heterogeneous systems, an organization may not be able to document its metadata fully before undertaking a strategic IS project. On the other hand, to escape spiraling maintenance costs an organization must try to gather and maintain as much of this information as possible for all existing projects. It must also fully integrate this activity into its development methodology. By practicing this philosophy, the metadata acquired by one project can be reused by others.

In short, IS organizations should strive for the appropriate mix of methodologies and tools to support iterative and incremental development. Using this approach, each element of the project is subject to managed change. It is important to consider what metadata would be helpful in reducing the time required for analysis and/or implementation for each of the types of changes that are likely to be made. Once the probable change scenarios are identified, an evaluation of software tools can be based on how well they capture and support the use of this information. Likewise throughout the project, the methodology should be regularly reassessed, tuned and distributed across the organization to maximize its benefit.

Tools and Standards

As one might expect, the software industry has risen to the metadata management challenge with a variety of solutions. Many data dictionary products and repositories seek to provide a centralized facility for managing metadata. Additionally, CASE and design tools not only capture metadata but support the effective design of new (usually relational) database schemas, as well as generating some straightforward database applications. Data extraction, integration and warehouse products also use the same type of metadata to automatically generate data interface programs.

To enable enterprise data management, these different tools must be able to easily exchange the metadata created by other tools and stored in a variety of storage facilities. The rapid proliferation of these tools has resulted in almost as many different treatments of metadata as there are tools. The only way to enable the exchange of metadata between tools is to establish at least a minimum common denominator of interchange standards and guidelines.

In response to this problem, a number of initiatives have been launched to develop a simple interchange format. Some of these initiatives are vendor-specific, like those of IBM and Oracle, but other vendor-independent efforts have been organized as well.

While the prospect of locating, compiling and managing enterprise metadata may seem overwhelming, the combination of a sound, iterative methodology, today's tools and tomorrow's standards is expected to make the management of metadata less difficult. Metadata interchange standards should enable IS managers to select what they perceive as a best-of-breed configuration of tools to build a support infrastructure that fits their unique needs.

Katherine Hammer is cofounder, president and CEO of Evolutionary Technologies in Austin, TX. She can be reached at kay@evtech.com.