By Richard Cole
[Chart]: The Cornell Theory Center
[Chart]: The NCCS Network
Better ways have to be found to store increasingly large amounts of data. It will take the proper technology, planning and cooperation to develop successful storage solutions.
As anyone who works in an office knows, the best way to predict what information you'll need next week is to clean out your desk or your hard disk. As if by magic, the specific files, memos, business cards and phone numbers that you throw away will invariably contain the information that you are looking for.
At the same time, it is necessary to cull dated information on a regular basis and/or retire it to storage. For most organizations, the solution to this familiar dilemma used to be a fairly straightforward matter of manually making tape backups and keeping them in a safe place. But today the task of corporate data storage has been compounded by several factors.
Most obvious is the sheer proliferation of information. Records accumulate, and a history of the organization's activities has to be maintained. New technologies, such as graphics and video, capture more information and consume greater disk space. Furthermore, mergers and acquisitions often force IS departments to manage and store exponentially greater amounts of data as information systems are brought together into single, enterprise-wide environments.
The move from mainframes to distributed client/server environments also has added to the complexity of storage requirements. When almost all data resided on a mainframe supervised in a data center, the focus was on conserving file space and storing data efficiently. The IS manager tried to keep a lid on growth because more storage meant more hardware--an expense that was charged to the IS budget. But in distributed processing, users have more control over creating and saving information. Because these users are not directly responsible for storage expenses, they are not oriented toward saving space. Often, they simply buy more disks and expect IS to handle storage. The result is booming growth in data storage, often unaccompanied by a consistent policy for storage management.
Many corporations now hold onto their data longer and make it work harder. For example, database marketers are now using demographic information and huge databases to develop an increasingly precise understanding of consumers and buying patterns. This information is sold or brokered to retailers who, more and more, direct their marketing efforts in terms of the "lifetime value" of their customers. These database techniques have helped to expand retailing from mass marketing to niche marketing and now to marketing at the individual level; they also have created a new generation of data management and storage challenges.
Finally, today's users are more demanding about the amount and level of information they want. The development of sophisticated report applications or analysis tools like online analytical processing (OLAP) means that more data can be presented in more ways in a matter of seconds. If the data is not readily accessible or has been stored in an older format that is now incompatible with a corporation's current system, users may protest vigorously. Glenda Lyons, formerly vice president of software technologies at PaineWebber in New York City, tells the story of a stockbroker who could not get the information he wanted fast enough from his desktop computer. His reaction was to throw a steel chair through a 28th-floor window.
The Big Picture
In facing these manifold challenges, systems managers need more than ad hoc approaches. A successful data storage strategy involves several activities, starting with general backup and recovery. This first and most frequent storage activity occurs at the user level on a daily basis. A file, for example, might be downloaded from a PC hard drive to a diskette, then transferred later to a tape storage medium down the hall or at a data center.
Later, this data can be transferred to archive facilities for long-term storage. For purposes of security and climate control, some archives are in former salt mines located deep underground in Kansas and Louisiana--which gives a literal meaning to the term data mining. State and federal regulations require businesses to archive tax, financial and other information for set periods of time. Some regulations are industry-specific. The insurance industry, for example, is required to keep policy information on its clients for the life of the client. In long-term archiving, special attention must be paid to the physical state and degradation of storage media such as tapes, some of which have a "shelf life" of five to 10 years.
Disaster recovery is a related aspect of data storage. This involves storing the information that a business requires to start up again after a calamity. Storage facilities are often kept at safe distances from the main computing site. Files are refreshed more frequently than with true archives, and the storage facility has to be accessible by high-speed communication lines so business-critical data can be brought online quickly if disaster strikes.
How to manage these different short-term and long-term storage requirements prompted the idea of hierarchical storage management (HSM). An HSM solution is a set of migration tools and strategies that enable the continuous retirement and deletion of data based on the frequency of its use. Data is regularly deleted or transferred down the media hierarchy from diskettes and disks to tape and then to cheaper, less accessible media such as optical platters. HSM solutions can be a good way to make data easily available to the user while moderating the storage cost compared with a disk-only solution. In some cases, an HSM solution using short-term disk storage may provide faster access to data that would otherwise be kept on traditional tape systems.
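The migration idea behind HSM can be illustrated with a minimal Python sketch. It is not taken from any product: the tier names follow the media hierarchy described above, while the idle-time thresholds (30 and 365 days), the FileRecord type and the function names are assumptions chosen only for illustration. The sketch moves files down the hierarchy as they go unused; recalling a file back to disk when it is requested is left out.

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Media hierarchy, fastest and most expensive first (from the article).
TIERS = ["disk", "tape", "optical"]

# Hypothetical policy: minimum idle days before a file belongs on each tier.
TIER_MIN_IDLE_DAYS = {"disk": 0, "tape": 30, "optical": 365}


@dataclass
class FileRecord:
    name: str
    tier: str          # tier the file currently lives on
    last_access: date  # last time a user touched the file


def target_tier(idle_days: int) -> str:
    """Return the lowest tier whose idle threshold the file has reached."""
    chosen = TIERS[0]
    for tier in TIERS:
        if idle_days >= TIER_MIN_IDLE_DAYS[tier]:
            chosen = tier
    return chosen


def plan_migrations(files, today):
    """List (name, from_tier, to_tier) moves, only ever down the hierarchy."""
    moves = []
    for record in files:
        idle_days = (today - record.last_access).days
        destination = target_tier(idle_days)
        if TIERS.index(destination) > TIERS.index(record.tier):
            moves.append((record.name, record.tier, destination))
    return moves


# Example run with made-up files and an arbitrary "today".
today = date(1996, 6, 1)
sample = [
    FileRecord("memo.txt", "disk", today - timedelta(days=5)),
    FileRecord("q1_report.doc", "disk", today - timedelta(days=60)),
    FileRecord("1994_ledger.dat", "disk", today - timedelta(days=400)),
]
for name, source, destination in plan_migrations(sample, today):
    print(f"{name}: {source} -> {destination}")
```

A real HSM product would also weigh file size, available disk space and retention rules, which is part of why integration with databases is the hard part.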
The drawback with HSM is that an integrated, automatic system for storage management that's suitable for most corporations is still a software generation away. A true HSM solution would have to be integrated with databases, but according to John Camp, research director at the Gartner Group in Stamford, CT, most database management system (DBMS) vendors have not developed the appropriate application programming interfaces (APIs). "There's no incentive," Camp says. "They aren't in the business of data storage management." He points out that DBMS vendors are more concerned about transactions and keeping the business running.
In fact, agreement on APIs and software standards in general is needed throughout the data storage industry. The Posix Committee Working Group P1103.1k and the Storage Systems Standards Working Group (IEEE P1244) are trying to develop interoperable storage standards. In addition, the Association for Information and Image Management (AIIM) recently began discussions on the portability of large data archives between heterogeneous data management platforms. During this year's AIIM convention, users declared that they wanted a way to transfer data from legacy archives without the exorbitant costs of copying gigabytes (GB) and even terabytes (TB) of data to new media. Software vendors, on the other hand, reportedly are still reluctant to agree to standards that might be incompatible with hardware storage technology developed in the future. So far, a broad consensus on interoperability and APIs has not emerged.
Lacking a generally available integrated data storage solution, many organizations are developing their own answers; some are fairly specific, others more general and larger in scope. For the narrower focus, network backup is a case in point. The Liggett Group, a manufacturer of tobacco products in Durham, NC, recently implemented an automated tape backup system for its local-area network, using an autoloading tape library. According to Dana Gantt, director of technical services, the company saw an opportunity to move to a fully automated network backup system in 1993 while migrating from an IBM ES/9121 mainframe to a Unix-based client/server environment. Today, the network runs on nine HP 9000 servers with a smaller number of proprietary HP 3000 servers, plus 300 PCs connected by a Novell NetWare LAN. Other Unix servers run SCO Unix 3.1.2 (about five years old) and Sun Solaris. The main DBMS is Oracle.
This migration was prompted mostly by a desire for the efficiencies of using packaged software, but Liggett also wanted to develop an operatorless, "lights out" data center. With the mainframe, several Novell file servers had to be backed up manually using an 8mm tape drive supporting Archivist software from Palindrome of Naperville, IL. The backup was started each night by the second-shift operator. If a tape filled up after the second shift had ended, the backup would not be completed that night and had to be finished by the morning operator. This time lag introduced the possibility of data integrity problems in the backups.
Looking for an automatic backup solution for its client/server LAN, Liggett chose a TLS-4220 tape library from Qualstar of Canoga Park, CA, supporting Palindrome's Storage Manager version 4.0. The library has two 8mm 8505XL cartridge tape drives from Exabyte of Boulder, CO, and holds 22 tape cartridges for a total capacity of 170GB. Currently, 20 of the tapes are stored in two removable tape magazines, and the other two tapes are stored in fixed slots.
David Channell, systems engineer at Liggett, says that with the new library, he has to load only one set of tapes per week. Once the library door is closed, the subsystem automatically checks its inventory using a bar code scanner. Instead of loading tapes and reading the internal label, the library scans bar codes on the outside of the tape cartridges, which reduces wear and tear on the tapes as well as the tape library and its robotics. With bar-code scanning, the library can also learn more quickly which tapes are currently loaded in the tape library. Channell and his colleagues do not have to maintain external labels, and in the event that staff members need to read a label, they can do so via the Palindrome software.
The tape library uses a Tower of Hanoi tape rotation scheme, based on a sequence of moves from a popular mathematical puzzle. The Tower of Hanoi pattern uses less media than other rotation patterns and allows files to be restored from a variety of file versions. "Essentially, you can store more stuff on fewer tapes," says Channell.
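For readers unfamiliar with the pattern, the following Python sketch shows the textbook form of a Tower of Hanoi rotation. It is not Palindrome's implementation, and the function names are invented for illustration. In the textbook scheme the first tape set is reused every second backup session, the next set every fourth session, the next every eighth, and so on, so each additional set doubles how far back the oldest surviving backup reaches.

```python
def hanoi_tape_set(session: int, num_sets: int) -> int:
    """Return the 0-based tape set to use for a 1-based backup session.

    The set index is the number of trailing zero bits in the session
    number, capped at the last available set.
    """
    if session < 1 or num_sets < 1:
        raise ValueError("session and num_sets must be at least 1")
    # (session & -session) isolates the lowest set bit of the session number.
    trailing_zeros = (session & -session).bit_length() - 1
    return min(trailing_zeros, num_sets - 1)


def rotation_schedule(sessions: int, num_sets: int) -> list:
    """Label the tape set (A, B, C, ...) used for each of the first sessions."""
    if num_sets > 26:
        raise ValueError("this sketch only labels up to 26 sets")
    return [
        chr(ord("A") + hanoi_tape_set(n, num_sets))
        for n in range(1, sessions + 1)
    ]


# Three tape sets over eight sessions.
print(" ".join(rotation_schedule(8, 3)))  # prints "A B A C A B A C"
```

With three sets, set A always holds the most recent backup, while set C holds one that is up to four sessions old.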
Out on the Edge
The Liggett tape library regularly backs up about 20GB of data, a typical amount for many corporate environments. In contrast, massive systems dealing in terabytes of data have been implemented at several academic supercomputing centers and national laboratories across the United States. These large systems not only represent the latest in technology, they offer a glimpse of the future for many corporations as data storage systems continue to grow.
The Cornell Theory Center at Cornell University in Ithaca, NY, is the sixth largest computing center in the world. It provides supercomputing services to academic and corporate researchers across the country. Data comes from a variety of sources ranging from the radio telescope at Arecibo, Puerto Rico, to electron microscopes.
At the heart of the center is an IBM SP2 supercomputer supporting 512 processors in a massively parallel processing (MPP) environment running AIX, IBM's Unix variant. The center also runs several supercomputers and high-end workstations for visualization, including a Power Visualization System (PVS) and three Onyx computers from Silicon Graphics, Inc. (SGI), and supports about 200 Unix workstations, 50 Macintoshes and a few other PCs.
Connectivity is especially important to operations at the center, since 60 percent of users are located at other facilities. Typically, researchers access data and run programs in terminal sessions from their own sites. The Internet is the main method of communication between them and the center. The center is connected with NYnet, its New York asynchronous transfer mode (ATM) network, and NYSERNet, an Internet service provider. In addition, there is an experimental network called vBNS (very high speed Backbone Network Service), which provides a dedicated ATM line running at 155Mbps between Cornell and other supercomputer centers.
Unlike businesses, the center is not legally required to hold data. Files are automatically backed up and stored during research projects, but they are overwritten once a project is completed. Users have ultimate responsibility for their own backups. However, the massive amounts of information handled by projects at any one time require an equally massive storage solution. Using the Andrew File System for parallel file serving, the center transfers data to two IBM 3494 tape robots, each capable of holding 1,500 tapes and 15TB of uncompressed data. These libraries are accompanied by 10 IBM 3590 Magstar tape drives with a capacity of 10GB per tape.
Doug Carlson, associate director for systems and operations, points out that the center has several unique features for handling mass storage. One is the High Performance Storage System (HPSS). This technology, according to Carlson, represents one of the highest-performing mass storage systems available today. Developed by IBM's government systems division and four national laboratories, HPSS is designed to provide a highly scalable parallel storage system for MPP systems. In this context, scalability includes data transfer rate, storage size, number of name objects, size of objects and geographical distribution. "When fully implemented this year, it will allow us to get data rates much higher than before," says Carlson. HPSS has been built to hold billions of directories, billions of files and petabytes of data. (A petabyte is a quadrillion, or 1,000 trillion, bytes.) The HPSS works with a parallel file system for the SP2.
In the future, the center will continue to expand its processing and storage capacity, but to do so it must overcome more technical challenges. In the industry generally, Carlson says, processing capacity is growing 50 percent per year, while I/O capacity is only increasing at 20 percent per year, creating a bottleneck. As a solution, he suggests several possibilities, including parallel file systems and parallel tape support. He also mentions implementing faster networking systems based on ATM and High-Performance Parallel Interface (HiPPI) switching.
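The arithmetic behind that bottleneck is easy to check. The short Python sketch below uses only the two growth rates Carlson cites; the function names are invented for illustration, and steady compound growth is an assumption. At 50 percent versus 20 percent a year, processing capacity pulls ahead of I/O capacity by a factor of 1.5 / 1.2 = 1.25 annually, which compounds to roughly a threefold gap in five years.

```python
import math


def capacity_gap(years: float, cpu_growth: float = 0.50,
                 io_growth: float = 0.20) -> float:
    """Ratio of processing capacity to I/O capacity after some years,
    relative to today, assuming both grow at steady compound rates."""
    return ((1 + cpu_growth) / (1 + io_growth)) ** years


def years_until_gap(factor: float, cpu_growth: float = 0.50,
                    io_growth: float = 0.20) -> float:
    """Years until processing outpaces I/O by the given factor."""
    yearly_ratio = (1 + cpu_growth) / (1 + io_growth)
    if yearly_ratio <= 1:
        raise ValueError("processing must grow faster than I/O")
    return math.log(factor) / math.log(yearly_ratio)


print(f"Gap after 5 years: {capacity_gap(5):.2f}x")
print(f"Years for the gap to double: {years_until_gap(2):.1f}")
```

At these rates the gap doubles in a little over three years, which is why parallel I/O appears on Carlson's list of possible solutions.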
The mass storage and scientific computing branch of the NASA Center for Computational Sciences (NCCS) at Greenbelt, MD, provides another example of massive storage on a grand scale. Headed by Nancy Palm, the NCCS supports research and modeling for the earth and space science community. The NCCS runs two Cray J90 supercomputers and one Convex server. Primary access for users is through Ethernet, with Fiber Distributed Data Interface (FDDI) and local HiPPI support between the supercomputers. Users run a variety of Unix-based IBM, Sun, SGI and other workstations.
As would be expected, the NCCS shares some of the technical solutions found at Cornell, in particular an HSM solution based on UniTree. Developed at Lawrence Livermore National Laboratory in Livermore, CA, and owned by UniTree Software of Dublin, CA, UniTree provides service software and coordination to transfer very large amounts of data. For data-intensive environments, one of its more important features is its "virtual disk" technology. Files are available to clients in the form of a disk with apparently unlimited capacity. The UniTree manager automatically migrates infrequently accessed files away from the high-speed disk cache toward the tape media. When an archived file is requested, the software manager automatically restores the file. The user does not need to know the physical location, and storage space is almost unlimited.
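The "virtual disk" behavior can be pictured with a toy Python model. This is not UniTree code: the VirtualDisk class, its method names, the least-recently-used eviction rule and the practice of counting cache capacity in files rather than bytes are all assumptions made for the sake of a short sketch. What it shows is the general idea that reads and writes look the same to the client whether a file sits in the disk cache or has been migrated to tape.

```python
from collections import OrderedDict


class VirtualDisk:
    """Toy model: a small disk cache in front of unlimited tape storage."""

    def __init__(self, cache_slots: int):
        if cache_slots < 1:
            raise ValueError("cache must hold at least one file")
        self.cache_slots = cache_slots
        self.disk_cache = OrderedDict()  # least recently used file first
        self.tape = {}

    def _migrate_excess(self):
        # Push the least recently used files out to tape until the cache fits.
        while len(self.disk_cache) > self.cache_slots:
            name, data = self.disk_cache.popitem(last=False)
            self.tape[name] = data

    def write(self, name, data):
        self.tape.pop(name, None)  # drop any stale tape copy
        self.disk_cache[name] = data
        self.disk_cache.move_to_end(name)
        self._migrate_excess()

    def read(self, name):
        if name not in self.disk_cache:
            # Transparent recall: the caller never asks where the file is.
            self.disk_cache[name] = self.tape.pop(name)
        self.disk_cache.move_to_end(name)
        data = self.disk_cache[name]
        self._migrate_excess()
        return data

    def location(self, name):
        if name in self.disk_cache:
            return "disk"
        if name in self.tape:
            return "tape"
        raise KeyError(name)


store = VirtualDisk(cache_slots=2)
store.write("model_1994.dat", b"old run")
store.write("model_1995.dat", b"newer run")
store.write("model_1996.dat", b"latest run")
print(store.location("model_1994.dat"))  # prints "tape"
print(store.read("model_1994.dat"))      # prints b'old run'
```

A production system would differ in many ways; for example, it would track sizes in bytes and would typically keep a tape copy even while a file is cached.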
Palm explains that about 194GB of data are held online on the Convex server dedicated to the UniTree disk cache. Another 34.24TB are robotically managed "near-line" on the Convex and Crays. The next level comprises seven 4.8TB "silos" and one 1TB Wolfcreek silo from StorageTek of Louisville, CO. These silos are massive robotic tape storage units; six are managed by Convex UniTree for permanent storage, and two are used for Cray short-term storage. There is also an operator-mounted, offline tape archive unit with 4.3TB of vaulted storage managed by UniTree.
Massive data storage is especially important to her environment, Palm says, because many of her users constantly access the same data and add more data to it to develop, say, a model of weather patterns. Data 50 years old is just as valuable as data gathered last week. In developing an ever-enlarging, long-term storage system for these scientists, she stresses the importance of maintaining the future readability of data through standard storage media formats as well as metadata that provides the names and locations of files. She also points out that database architectures must scale as much as possible. A growing parallel environment may require an equally parallel database architecture. These challenges may seem to be beyond the scope of today's commercial environments, but perhaps they are not for tomorrow's.
For general advice about implementing or expanding a data storage system today, the comments of IS professionals are remarkably consistent. First of all, everyone stresses the importance of long-term planning that includes data storage. "Many traditional capacity plans have focused mainly on CPU capacity, yet data storage capacity is just as important," says Carlson of Cornell.
Don Crouse, vice president of technology at Large Storage Configurations, a vendor based in Minneapolis, says that data storage planning should be as specific as possible. "You have to ask yourself how much money you want to put into storage management and then create a plan that answers this question in terms of storage space, retentivity [how long the data is held] and the resultant cost."
Secondly, everyone agrees that proper storage management is an intensively collaborative, cooperative process. IS has to plan internally, but this plan has to be based on talks with vendors, management and users. "We look for a cradle-to-grave relationship with vendors," says Palm. "Customers and vendors need to be honest in discussing future requirements. And not in a finger-pointing way, but working 100 percent together."
The relationship with management is equally important. Joshua Greenbaum, senior analyst for Sentry Market Research in Westboro, MA, cites as an example a large, international car rental agency that wanted to build a central database of information from all its affiliates in the United States and overseas. The project almost crashed because the affiliates had not fully agreed to back it. "Decentralized data means decentralized data management," he says. This makes it necessary that management be properly informed, educated and sold on data management and storage projects. "Technology is important, but a lot depends on cooperation, and that has to be done by good old-fashioned interpersonal relationships."
In a similar way, talking with users is a critical factor for success. Palm says that every five years, the computing environments and research requirements committee asks scientists working with NASA about their requirements for the near, middle and long terms. "These guys know what they need, so as part of a service organization, we take their requirements and turn them into [procurement] documents to make sure we meet them with the right hardware, software and manpower."
Based on current trends, data storage will only become more demanding. To succeed, IS departments will have to have a good idea of where they need to go, based on a continuing dialog with the people who develop, support and use the systems.
Richard Cole is a contributing editor to UniForum publications. He can be reached at firstname.lastname@example.org.