Data Warehousing: Strategies, Technologies and Techniques
by Rob Mattison
485 pages; $55
The title of this book might lead you to believe that it will walk you through the creation of a data warehouse or serve as the utilitarian reference book that sits close at hand as you build or maintain a data warehouse. Such is not the case. While the book does not contain the implementation details a reference book would normally contain, it is filled with valuable overview information and broad-brush approaches to data warehouse problems.
For those who might be unfamiliar with the term, the author defines a data warehouse as a database that
The majority of the applications and examples in the book show data warehouses to be systems designed for the collection of large amounts of legacy data from disparate systems, often in an attempt to integrate data company-wide. A good amount of space is devoted to looking at the circumstances that make data warehousing necessary. To justify the time and effort of development, substantial new functionality must be added. "Nobody needs a system that does what an earlier generation of system did, using the latest and greatest Windows-based, mouse-driven screen," Mattison writes. "It provides no new value to the company and, therefore, does not make economic sense."
In Data Warehousing, Mattison attempts to present the concepts of data warehousing and data mining in as much detail as possible, keeping the terminology consistent and the content manageable. For a manager or noncomputer professional, the book meets that goal. "Hands-on" computer people may be frustrated by the lack of technical information.
As an example, a manager would be pleased to discover the discussion in Chapter 12 of the skills required of personnel in various parts of a data warehouse development project and would appreciate the discussion throughout the book of political issues. A systems analyst, on the other hand, would be disappointed at the lack of discussion of the relative merits of Unix versus Windows NT and TCP/IP versus NetWare's protocols. Also, there is no discussion at all of physical storage media.
In fact, where attempts are made to discuss such topics, they are often misleading. An example occurs in Chapter 8, "The Physical Infrastructure." Mattison points out the scarcity of data mining tools for Unix and then, in the same sentence, states that "the vast majority of the products run on Windows, OS/2, Macintosh and X-Term environments." The last time I checked, X terminal applications do, indeed, work in Unix.
To be fair, I must point out that the last third of the book, dedicated to the subject of data mining (recovery of information from a data warehouse), delves into a great deal of technical detail, down to the electrochemical reactions of neurons in Chapter 16, "Neural Networks and Business Data Systems." As Mattison points out in his introduction, the structure of this final section differs radically from the rest of the book. It is made up of chapters by guest authors discussing specific products or projects, all on the data mining end of the equation rather than the data warehouse itself.
In addition to the neural networks mentioned above, the data mining section contains chapters on statistical analysis, multidimensional analysis, visualization and data warehousing on an enterprise intranet. Personally, I found the chapter on enterprise intranets pertinent but too short.
Chapter 20, "Prediction from Large Data Warehouses," contains a highly enlightening discussion of what today's tools are capable of doing with the warehoused data. It examines some sample databases--simple enough to grasp immediately--and shows how automated analysis programs go about generating easy-to-understand rules rather than abstruse statistical analyses. In the sample database of irises, for example, the analysis program generated a rule stating that with the given data, if an iris has petals within a specific range of lengths and widths, then there is a 98 percent certainty it is an Iris versicolor (as opposed to some other species). Another example identifies a specific machine operator as generating the majority of instances of a specific failure mode.
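For readers who would like to see what a rule of this kind amounts to, here is a minimal sketch in Python. It is not taken from the book and is not the algorithm used by any product discussed there. The petal measurements, the range bounds and the names `Flower`, `rule_confidence` and `in_range` are all invented for illustration. The sketch only scores a rule that has already been proposed: of the records that fall inside the petal ranges, what fraction belong to the predicted species?

```python
from typing import Callable, NamedTuple, Sequence


class Flower(NamedTuple):
    petal_length: float  # centimeters
    petal_width: float   # centimeters
    species: str


def rule_confidence(
    rows: Sequence[Flower],
    condition: Callable[[Flower], bool],
    species: str,
) -> float:
    """Fraction of rows satisfying `condition` that belong to `species`."""
    matched = [row for row in rows if condition(row)]
    if not matched:
        return 0.0  # the rule never fires, so it has no support
    hits = sum(1 for row in matched if row.species == species)
    return hits / len(matched)


# Hypothetical records, made up for this sketch (not the book's data).
SAMPLE = [
    Flower(1.4, 0.2, "setosa"),
    Flower(1.5, 0.3, "setosa"),
    Flower(4.5, 1.5, "versicolor"),
    Flower(4.0, 1.3, "versicolor"),
    Flower(4.7, 1.4, "versicolor"),
    Flower(4.8, 1.8, "virginica"),
    Flower(6.0, 2.5, "virginica"),
]


def in_range(flower: Flower) -> bool:
    # Assumed range bounds, chosen only to make the example concrete.
    return (3.0 <= flower.petal_length <= 5.0
            and 1.0 <= flower.petal_width <= 1.8)


confidence = rule_confidence(SAMPLE, in_range, "versicolor")
print(f"IF petals in range THEN versicolor: {confidence:.0%} of matching rows")
```

A real rule-induction tool would also have to search for the ranges themselves. This sketch covers only the final scoring step that produces the "percent certainty" attached to a rule.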
This book has no glossary, a serious shortcoming with a subject this complex, especially when one of the author's stated goals is to develop the reader's vocabulary. Acronyms such as SMP, MPP and IT are used without definition. Some of the terms used in data warehousing are quite difficult to define (Mattison spends the entire first chapter defining what a data warehouse is), but that's no excuse for omitting such a basic tool as a glossary.
The index, only seven pages in length, lacks entries for many topics crucial to the subject, including archiving, backup, disk, operating systems, optical storage, RAID, redundancy, tape and even Unix. Most of these topics are addressed within the book, which is fine for someone who will read it and set it aside. In a reference book, however, topics missing from the index might as well not exist. Other missing elements include a bibliography and suggested reading list. Mattison covers an immense amount of material in 485 pages, and readers interested in more detail on a specific topic would benefit from some tips about where to find the information.
The book is sprinkled liberally with drawings, charts and graphs. Unfortunately, many of the illustrations seem to be afterthoughts, even containing basic errors, like one in Chapter 2 that connects purchasing to shipping rather than receiving.
Data Warehousing demonstrates Mattison's prodigious knowledge of his subject matter. A great deal of work and thought obviously has gone into developing the content and tools, among them a wonderful set of checklists in Chapter 11 that guide you through the planning and estimating stage. His writing style is clear and as light as one can expect with a topic this substantial. However, if you are looking for a technical treatise that will point you at specific hardware and software platforms (or even explain their relative merits), Data Warehousing will leave you disappointed. The same will be true if you attempt to use it in random-access mode, looking up specific topics when you need them. There is a lot of information in the book that's tough to find.
If you are not familiar with data warehousing and would like a detailed overview of the concepts, issues, problems and design strategies, this is your book. It will be a valuable aid for evaluating the people and strategies you will need to launch your own data warehousing project.
Gary Robson is founder and chief technology officer of Cheetah Systems, Inc., in Fremont, CA. He can be reached at firstname.lastname@example.org.