Selecting Tools for Data Mining

By Rob Mattison

A so-called data warehouse can resemble a dark pit of buried information. It takes the right tools to get to that wealth.

About six months ago, I had a phone conversation with the CIO of a retail firm in Europe. He started out by explaining that his company had just finished putting together a 600GB data warehouse. Information was acquired from dozens of legacy systems, cleansed, transformed and loaded. Benchmark queries showed that the database design was good and that user response time should be excellent. "There's just one problem," the CIO said. "We can't figure out what to do with it."

Many data warehouse developers find themselves in this position. Building the data warehouse is only the first step, and by itself it produces no immediate returns. Businesses begin to cash in only when they harness the potential of the warehouse and put it to real work. The warehouses themselves are useless unless they are paired with an equally new and powerful technological approach known generically as data mining.

Originally, data mining tools incorporated artificial intelligence, neural networks and other "intelligent" software techniques, which make it possible for users to analyze collections of data in ways that are too complicated or too calculation-intensive for people to manage themselves. Commercial usage today has a broader frame of reference, focusing on software tools that make it possible for business users to better investigate, explore or analyze the data within their data warehouses. Among them are products that manage query environments, "traditional" statistical analysis packages, online analytical processing (OLAP) and multidimensional analysis products, and business visualization tools.

Ironically, while the application of data mining technologies in data warehouse environments represents a shift in the way people utilize information systems, all of the components that make up these solutions have been around for years. Most people's data mining toolsets draw on a range of products that fall into two general categories: data discovery tools, and operational monitoring and enhancement tools.

The Data Discovery Toolset

The artificial intelligence research projects of the 1970s have turned into the commercial business toolkit of the 1990s. Among this inventory of products are neural networks, chi-square automatic interaction detectors (CHAID), decision-tree generators and a seemingly endless list of complicated acronyms. While these products are extremely powerful and have their place in the corporate decision-making world, for most people they are not enough. Alongside these high-performance solutions, the data mining family also includes more traditional methods, such as statistical analysis, operations research and other mathematically based approaches.

Data discovery is the process of taking raw, minimally understood data and using software to identify new patterns, insights or relationships within it, with the goal of finding information that helps people make better business decisions. Data discovery tools make up about half of the data mining marketplace today, and despite their differences, they approach problems in the same way: each tries to help you predict something or explain something. By far the most popular business use of the former approach is predicting customer buying behavior.

The standard procedure is this. First, you collect a large amount of data about a particular area of the business (such as customers, production processes or advertising). Next, you feed that data into the data mining tool to be analyzed. Then you look at what the product has to tell you about what it found. If, say, the information you feed in is about customers, you will hope to get new insights you can use to keep those customers happy. With information about your manufacturing process, you may find ways to improve your operations.
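
To make those three steps concrete, here is a minimal sketch written in Python with today's general-purpose open-source libraries (pandas and scikit-learn) rather than any product mentioned in this article; the file name and customer fields are hypothetical stand-ins for a warehouse extract.

    # Minimal data discovery loop using general-purpose Python libraries.
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier, export_text

    # Step 1: collect data about one area of the business (here, customers).
    customers = pd.read_csv("customers.csv")   # assumed extract from the warehouse

    # Step 2: feed that data to the discovery tool -- here, a decision-tree learner.
    features = customers[["age", "visits_per_month", "avg_basket_value"]]
    target = customers["bought_promo_item"]
    X_train, X_test, y_train, y_test = train_test_split(
        features, target, test_size=0.25, random_state=0)
    model = DecisionTreeClassifier(max_depth=4).fit(X_train, y_train)

    # Step 3: look at what the tool found -- readable rules plus a holdout score.
    print(export_text(model, feature_names=list(features.columns)))
    print("holdout accuracy:", model.score(X_test, y_test))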

The most common application of data discovery tools is in the development of intelligent marketing and customer information strategies. For example, Shopko, a discount department store chain with stores from Michigan to the West Coast, has found this kind of data mining to be so profitable that it is applying it to more and more aspects of the business. "We think there is tremendous value in data mining, but we are just scratching the surface," says Jim Tucker, vice president and CIO of Shopko, based in Milwaukee, WI.

Shopko got started in data mining as a beta site for IBM's Intelligent Miner, using the product to analyze the effectiveness of its advertising. Intelligent Miner is a knowledge discovery toolkit for analyzing, extracting and validating data. It works with the IBM Intelligent Decision Server, a LAN-based information analysis server for deploying decision support applications throughout an enterprise. Together, the two products access data from sources as disparate as legacy-system flat files, IBM's DB2 family of proprietary relational databases, Lotus Notes and the World Wide Web.

"One of the things we are always looking for as retailers is what we call the dragger/dragee relationships," Tucker explains. "We want to know which products, when advertised or put on sale (the draggers), will help pull customers into the store and encourage them to buy other items (the dragees)." Shopko and its data mining staff are continuously working on how to maximize profits for each advertising campaign. "For instance, should we put cameras on sale in order to increase film sales, or the other way around?" Tucker asks.

So far, the biggest users of the application at Shopko have been those responsible for the development and tracking of advertising campaigns (the heart of this type of business), but other areas are trying to apply the tools to their problems as well. One group hopes to analyze clerks' key-pressing patterns at the checkout counter to uncover the causes of mistakes and mischarges that get applied to customers.

Uncovering New Potential

Of course, while these products can provide a lot of information about customers or processes, they do not replace the human factor. It takes a good business sense and a lot of savvy to harness the power that these products represent. In some cases, the use of these new approaches can actually spawn the creation of a new function or department within the business, or greatly enhance the capabilities of an existing group. An example of this is in the banking industry, where banks are making use of their existing information about customers to expand into the business of direct marketing.

The United Jersey Bank in Princeton, NJ, uses ModelMax, a neural network product from Advanced Software Applications of Pittsburgh, PA, to change the way it does business. The bank does predictive modeling to help figure out, among other things, to whom to send advertisements for new products. Users create lists of all potential customers and their history of using banking services, and then use ModelMax to determine who the most likely buyers of the advertised services will be, as well as how much they are likely to spend. ModelMax runs on a Windows PC, but separately provided middleware, batch extract jobs or client-customized interfaces allow the extraction and analysis of data from both mainframe and Unix client/server sources.
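
The sketch below shows the same kind of propensity modeling in generic Python terms; it is not ModelMax itself, and the campaign file, field names and decile cutoff are invented for illustration.

    import pandas as pd
    from sklearn.linear_model import LinearRegression, LogisticRegression

    history = pd.read_csv("campaign_history.csv")   # hypothetical customer/campaign extract
    X = history[["has_checking", "has_mortgage", "avg_balance", "tenure_years"]]

    # Model 1: who is likely to respond to the mailing?
    respond = LogisticRegression(max_iter=1000).fit(X, history["responded"])
    history["response_score"] = respond.predict_proba(X)[:, 1]

    # Model 2: among past responders, how much are they likely to spend?
    buyers = history[history["responded"] == 1]
    spend = LinearRegression().fit(buyers[X.columns], buyers["amount_spent"])

    # Rank the prospect list, keep the top decile and estimate its spending.
    top = history.sort_values("response_score", ascending=False).head(len(history) // 10)
    top = top.assign(expected_spend=spend.predict(top[X.columns]))
    top.to_csv("mailing_list.csv", index=False)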

According to Joe Somma, vice president, United Jersey Bank was doing a lot of regression analysis and CHAID, but it was a slow, tedious process to set up the data and run the models. "We looked at a lot of alternatives but found that ModelMax was easier to use," he says.

The bank, like many other financial institutions, found that it can do sophisticated, multidimensional analysis of its customer base, getting in a matter of hours or days the same results that six months' worth of statistical modeling would yield. "We can run multiple models, predicting not only who is likely to buy from us but also figuring out what pricing or what market package we should offer," says Somma.

A New Generation

While data discovery tools represent the original and more exciting types of data mining applications, another group of products is also drawing notice. The latest descendants of a long line of products from the query tool, report writer and spreadsheet families, these tools have received a boost in functionality and a resurgence in popularity. They make it easier for business people to keep track of the way their business is currently running, allowing them to spot trends in operations sooner and to make adjustments that can save money, increase profits and smooth "bumps" in the system.

What distinguishes today's new generation of tools from past generations of report writers, online transaction processing (OLTP) systems and tons of batch reports is the amount of monitoring they can do. These tools make it possible for one person to keep track of what used to take dozens of people in the past. Not only can these applications enhance the ability of a person to monitor thousands of details, but they enable users to change those details to meet current business needs--without having to involve IS to reconfigure them. Four types of capabilities fall into this category, and products may incorporate two or more of them.

Software agents are applications or subroutines that do scheduling, investigation or monitoring work at the user's request. They eliminate the need for hundreds of reports and hours of user activity by offloading that work to the system.

Managed query environments are enhanced versions of the simpler query tools of the previous generation. These products eliminate the need for users to learn Structured Query Language (SQL), programming techniques or the locations of data, and instead provide an intuitively functional interface tailored to their specific needs.

OLAP tools represent a new generation of multidimensional analysis environments that make it possible for people to navigate through huge populations of data and access further levels of detail that are of interest to them.

Finally, visualization products make it possible to display complex data relationships in graphical terms.

These products can enhance the user's ability on several levels. End users, presented with a GUI, no longer have to key in the names of databases or learn to write SQL statements. Instead of poring through dozens or hundreds of reports each day, looking for fluctuations in sales or inventory levels, they can instruct the system to look for such fluctuations and, when it finds them, to notify the appropriate people that corrective actions are needed.
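
A bare-bones illustration of such an agent appears below, written as a Python script that could run on a nightly schedule; the threshold, the sales-extract layout and the notify() stub are all assumptions made for the example.

    import pandas as pd

    THRESHOLD = 0.25  # flag any store whose sales swing more than 25% day over day

    def notify(store, change):
        # in a real deployment this would send e-mail or page the responsible manager
        print(f"ALERT: store {store} sales changed {change:+.0%} versus the prior day")

    sales = pd.read_csv("daily_store_sales.csv")        # columns: store, date, total
    sales = sales.sort_values(["store", "date"])
    sales["pct_change"] = sales.groupby("store")["total"].pct_change()

    # Examine only each store's most recent day and raise alerts for large swings.
    for row in sales.groupby("store").tail(1).itertuples():
        if abs(row.pct_change) > THRESHOLD:
            notify(row.store, row.pct_change)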

A company learning how to capitalize on this kind of capability is Falconbridge, a nickel mining concern in Falconbridge, Ontario, Canada. Bill Sandblom, information systems supervisor, was responsible for the initial data warehousing and data mining project, which yielded significant improvements in the profitability and efficiency of various areas of the business.

Falconbridge uses Forest and Trees, an enhanced query management environment from the Trinzic division of Platinum Technologies in Waltham, MA. According to Sandblom, users have seen a 30 percent cost saving through reductions in overtime and, because of the valuable information their engineers can now access, improvements in the overall running of the business. Engineers and managers perform ad hoc queries against large collections of manufacturing process data. Through their investigative efforts, they are able to identify bottlenecks in processes, anomalies in manpower allocation and other kinds of details that would be missed in manual processes.

OLAP Muscles

Some of the most popular operational monitoring and enhancement tools today are in the category known as OLAP tools. Online analytical processing is a specialized form of querying that grew out of the decision support system (DSS) and executive information system (EIS) disciplines. An OLAP tool is rather like a spreadsheet on steroids. These powerful interfaces organize data in the easily recognizable spreadsheet format but allow users to "drill down" into different dimensions of information and investigate the details that explain higher-level trends and anomalies.
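
The drill-down idea can be shown in a few lines of Python with pandas. A real OLAP server precomputes and indexes these aggregates across many dimensions; the column names and the "Midwest" region below are hypothetical.

    import pandas as pd

    # Fact table assumed to hold: region, store, department, month, revenue
    sales = pd.read_csv("sales_fact.csv")

    # Top level: revenue by region and month -- where a trend or anomaly first shows up.
    by_region = sales.pivot_table(values="revenue", index="region",
                                  columns="month", aggfunc="sum")

    # Drill down: within one suspicious region, break the numbers out further.
    detail = (sales[sales["region"] == "Midwest"]
              .pivot_table(values="revenue", index=["store", "department"],
                           columns="month", aggfunc="sum"))
    print(by_region)
    print(detail)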

Using this kind of data mining, companies can discover unexpected and significant paybacks. Thrift Drugs, a division of J. C. Penney with corporate headquarters in Pittsburgh, PA, installed a Sequent server to hold 150GB of store sales data and then used Lightship from Pilot Software of Cambridge, MA, to look at some trends in people's spending practices. The company discovered within the first few weeks of production a way to realign its merchandising strategy to increase revenue during an upcoming holiday season.

The company discovered holes in certain time-honored assumptions about the marketing of candy during the Easter season. The assumption going in was that the sale of specialty Easter candy would outstrip the sale of more conventional (and more profitable) candy. Analysis of the sales information revealed, however, that the opposite was true. By realigning its merchandising strategy with customers' real buying behavior, and by increasing the inventory and advertising for "normal" candy while reducing the inventory and advertising for Easter candy, Thrift was able to improve profits and recoup a significant portion of the cost of the system.

Not So Fast

Despite the potential, there are several key areas where data mining endeavors can get into trouble in a hurry. The biggest expense and greatest shock to most people come when they see how difficult it is to find the right data on which to use the mining tools in the first place. Finding the data, cleansing it and making it ready to load into the database can take up 80 percent of any project's budget. According to Jim Tucker, Shopko spent a long time "fiddling" with the data and getting it ready to use. The IS group had to pull data from many different Unix systems and proprietary environments to put together the core set of information that was needed.

Various problems can surface in trying to find source data. Because companies often have reduced their legacy systems support staff to a minimum, there is nobody around who can tell you where the files are that might hold the data you seek. And after you find these candidate sources, a large amount of data preparation has to occur.

First you have to figure out if the data source you have identified has a complete set of the information you want. Often, in order to get a complete set of data, you have to piece together different sources. One company I worked with had 17 different "master" files for products. Another had seven different "master" files for customers.

For every field of data that you want to feed to a data mining tool, integrity and synchronization issues must be addressed. When you finally identify disparate sets of data, you have to worry about getting the keys to match. Different files usually have different key structures. Often, you have to create an entirely new master key structure and translate all of the existing data into that structure before you start.
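
One way the key-translation step might look is sketched below in Python, under the assumption that a cross-reference table mapping the old keys to the new master key has already been built; file and column names are illustrative only.

    import pandas as pd

    legacy_a = pd.read_csv("product_master_a.csv")   # keyed by sku_a
    legacy_b = pd.read_csv("product_master_b.csv")   # keyed by item_code_b
    xref = pd.read_csv("product_key_xref.csv")       # columns: master_id, sku_a, item_code_b

    # Translate each source onto the new master key.
    a = legacy_a.merge(xref[["master_id", "sku_a"]], on="sku_a", how="left")
    b = legacy_b.merge(xref[["master_id", "item_code_b"]], on="item_code_b", how="left")

    # Records that picked up no master key are the synchronization problems to resolve.
    orphans_a = a[a["master_id"].isna()]
    orphans_b = b[b["master_id"].isna()]

    # Everything else can be merged into one consolidated product record.
    merged = a.dropna(subset=["master_id"]).merge(
        b.dropna(subset=["master_id"]), on="master_id",
        how="outer", suffixes=("_a", "_b"))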

Then there are problems with timing. Different files are updated at different times: monthly, weekly, daily and in realtime. Mixing all of this information can produce data integrity problems.

After reconciling these problems, you are ready to get into some of the more tedious data integrity issues. How is the data stored? Do you have standard abbreviations? Are there codes for colors and sizes? How are dates stored?
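
A small Python sketch of that field-level cleanup follows, with an invented color-code table and two assumed date formats.

    import pandas as pd

    COLOR_CODES = {"BLK": "black", "WHT": "white", "RD": "red"}  # hypothetical code table

    items = pd.read_csv("item_extract.csv")

    # Translate coded colors; leave already-spelled-out values alone (lowercased).
    items["color"] = (items["color"].str.upper().map(COLOR_CODES)
                      .fillna(items["color"].str.lower()))

    # Dates arrive as "MM/DD/YY" from one feed and "YYYYMMDD" from another:
    # parse both into one canonical type before anything is loaded or mined.
    parsed = pd.to_datetime(items["sale_date"], format="%m/%d/%y", errors="coerce")
    fallback = pd.to_datetime(items["sale_date"], format="%Y%m%d", errors="coerce")
    items["sale_date"] = parsed.fillna(fallback)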

It's also easy to get into trouble if people lose their business focus and develop a technology focus instead. This can take two forms. In some cases, the IS staff gets so excited about what they are doing and so convinced that it is of value that they fail to stay in touch with the people who will be using the solution being delivered. No matter how good a solution is, if it is unattractive, difficult to use or doesn't fit the way that people work, it will fail.

On the other hand, users themselves may fall in love with specific technologies because they like the way they look, perform or promise to deliver. In these cases, the reluctant IS department is left with the unenviable task of trying to force-fit inappropriate or undeliverable technological solutions into the environment because the users insist upon it.

Tying It All Together

Like data warehousing overall, data mining requires preparation and reasonable expectations. Before you try to deploy a data mining solution, figure out exactly what kinds of problems you are trying to solve. In general, you will either be trying to set up a data discovery environment in which to look at historical information or an operational monitoring situation where you track something on an ongoing, almost realtime basis.

After defining the business problem, you can start looking at products. There are many products available today; some are general in their approach, and some are specific. Choose the tools that will do the job you need, and don't make the mistake of becoming enamored with one and trying to figure out where to use it, instead of the other way around.

Finally, be aware from the outset that finding, preparing and loading your data is going to be the most difficult and expensive part of the process, and plan for it accordingly. Often people grossly underestimate this part of the process.

Data mining promises to be an effective means for modern businesses to apply to the assortment of problems and pressures they face today. With intelligent planning and conscientious execution, users can learn what they need to know to compete.

Rob Mattison is the author of Data Warehousing: Strategies, Technologies and Techniques (McGraw-Hill, 1996) and a senior enterprise systems architect for Sequent Computers. He can be reached at mattison@sequent.com.