In May of 2008, I switched roles at IBM. I moved from being a standard information developer (technical writer) to a much more open-ended role that I often have a difficult time describing, internally titled “Information Development Infrastructure Lead” for the IBM Integrated Data Management portfolio of products.
One of the facets of this new, broad role was leading a team of developers, tech sales, information developers, and subject matter experts in the creation or adoption of a common sample database for the entire portfolio. The IBM Integrated Data Management portfolio consists mostly of IBM Optim products and IBM Data Studio, plus a few other tools. Overall, the portfolio has roughly 30 products, each with its own database requirements.
Adopt or create?
IBM has a variety of sample databases around the organization. Some are small, some are random, some were built for very specific situations, and very few are realistic or rich enough for broad use. My first task was to evaluate all of the sample data sets that were available to our teams and to find the one that best fit our requirements.
The outcome of the project needed to meet some key requirements for our products:
- Story or scenario that drives usage and design of the database. Must be similar to a realistic customer environment to help relate our tools and information to solving the customers’ own problems.
- Support the wide variety of databases that our tools support, including DB2 for Linux, UNIX, and Windows, DB2 for z/OS, Oracle, Informix Dynamic Server, and Microsoft SQL Server.
- Support key product scenarios and features, including data warehousing, data mining, modelling, application development, and general database tasks.
- Support XML capabilities in the various databases.
- Support spatial data teams (mapping data).
Very few of the databases my team investigated came close to meeting any of these goals.
Earlier in the year, IBM had acquired the Business Intelligence software company Cognos. Cognos brought to IBM an extremely rich, realistic, relevant, and widely adopted sample database known as the Great Outdoors sample database (GSDB). This sample database is based on the operations of a fictional outdoor distributor known as the Great Outdoors company. The data set included 5 schemas: business-to-business sales data (GOSALES), warehouse data (GOSALESDW), human resource data (GOSALESHR), retailer data (GOSALESRT), and marketing data (GOSALESMR).
This database was by far the closest match to our portfolio’s requirements. Before our work began, the database already had the following features:
- Based on a story about a fictional business.
- Five schemas that covered different types of business data.
- Localized and translated.
- All text within the database is available in 25 languages.
- Data is spread out across the world and can expose different scenarios in different regions and countries.
- Initial support for DB2 for Linux, UNIX, and Windows; Oracle; and Microsoft SQL Server.
Expanding the GSDB database
To meet our key requirements, we needed some additions to the sample database. Because this was a cross-organizational project, I had to carefully navigate internal political issues and take care with the acquisition teams so that everyone involved was happy and the relationships stayed beneficial for all parties.
Working with the Cognos teams, we prioritized and planned for the addition of a new schema and addition to the story that the database follows. To meet the needs of our sales enablement teams and also our data mining teams, we required business-to-consumer data. This required the addition of customers as individuals rather than businesses, transactions, customer interests, fake credit data, and a variety of other information.
The data mining and data warehousing teams had the requirements that were the most challenging to implement. The data that we added needed to be quite deep as well as intelligent. To demonstrate the data mining tools, we needed to add associations, patterns, time series, and trending to the B2C transaction data. Before the transactions could be generated, we first had to add approximately 30k new customer records.
Adding the customers involved more manual work than some of the other aspects of the expansion of the GSDB database. We involved teams and individuals around IBM to help with the data collection. Eventually, over 50 people across many countries helped add the following types of data:
- First and last names that were representative of the population of a country and culture. For example, some cultures have different surname patterns for females, such as Czech female surnames that end in “ova.”
- Translated or localized versions of all data. For example, city names in both their English name and their local name, such as Bangkok and, in Thai, กรุงเทพมหานคร.
- A variety of city names, state/province names, street names, and corresponding postal codes and phone codes.
All of the above data was gathered and then randomly mixed up so that we could generate additional rows from combinations of first names, last names, addresses, and cities. We also did this to ensure that all data was random and that any similarity to a real person was entirely by chance. Additional data, such as age, profession, marital status, and hobbies, was added to these records randomly.
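The mixing step described above can be illustrated with a minimal sketch. This is not the tooling we actually used. The function name, the field names, the sample pools, and the age range are all hypothetical, and the only assumption carried over from the project is that each attribute is drawn independently from its own pool while a city stays paired with its corresponding postal code.

```python
import random

# Hypothetical marital status values used only for this illustration.
MARITAL_STATUSES = ["single", "married", "divorced", "widowed"]


def mix_customers(first_names, last_names, streets, locations, count, seed=None):
    """Build `count` synthetic customers by sampling each pool independently.

    `locations` is a list of (city, postal_code) pairs so that a city
    always keeps its corresponding postal code.
    """
    rng = random.Random(seed)  # a seed makes the generated rows reproducible
    customers = []
    for customer_id in range(1, count + 1):
        city, postal_code = rng.choice(locations)
        customers.append({
            "id": customer_id,
            "first_name": rng.choice(first_names),
            "last_name": rng.choice(last_names),
            "street": rng.choice(streets),
            "city": city,
            "postal_code": postal_code,
            # Randomly assigned extras (hypothetical range and values).
            "age": rng.randint(18, 90),
            "marital_status": rng.choice(MARITAL_STATUSES),
        })
    return customers


if __name__ == "__main__":
    sample = mix_customers(
        first_names=["Jana", "Petr", "Eva"],
        last_names=["Novak", "Svoboda", "Dvorak"],
        streets=["Main Street 1", "Park Lane 22"],
        locations=[("Prague", "110 00"), ("Brno", "602 00")],
        count=5,
        seed=42,
    )
    for row in sample:
        print(row)
```

Because each field is sampled on its own, a few dozen collected names and addresses can yield thousands of distinct rows, and no generated row corresponds to a real individual except by chance.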
Generating complex transactions
One of the members of my extended team was a developer and subject matter expert who worked with the data warehousing and data mining teams. He created an application for us that would read our new customers table and generate orders either randomly or using logic based on age, location, profession, marital status, and hobbies.
The application read the approximately 30k customer records and generated half a million transactions for those customers. Patterns, associations, trends, and other intricacies were built into these transactions so that our sophisticated mining tools could discover and expose them, demonstrating both the value of the tools and how to use them.
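To show the general idea of planting a discoverable pattern in generated orders, here is a small sketch. It does not reproduce the real application, whose internals are not described here. The product names, the "camping" hobby rule, the 0.8 probability, and the function name are all hypothetical and chosen only to illustrate a rule-driven order mixed with random ones.

```python
import random

# Hypothetical product catalog for illustration only.
PRODUCTS = ["tent", "sleeping bag", "golf clubs", "sunglasses", "hiking boots"]


def generate_orders(customers, orders_per_customer, seed=None):
    """Generate orders, planting one association for a mining tool to find.

    Assumed rule: customers whose hobbies include "camping" usually buy
    a tent and a sleeping bag together; every other order is one random
    product.
    """
    rng = random.Random(seed)
    orders = []
    for customer in customers:
        for _ in range(orders_per_customer):
            if "camping" in customer["hobbies"] and rng.random() < 0.8:
                items = ["tent", "sleeping bag"]  # the planted association
            else:
                items = [rng.choice(PRODUCTS)]  # background noise
            orders.append({"customer_id": customer["id"], "items": items})
    return orders


if __name__ == "__main__":
    customers = [
        {"id": 1, "hobbies": ["camping", "fishing"]},
        {"id": 2, "hobbies": ["golf"]},
    ]
    for order in generate_orders(customers, orders_per_customer=3, seed=7):
        print(order)
```

An association-mining tool run over orders like these would be expected to surface the tent and sleeping bag pairing, which is the kind of result a demo needs the data to contain.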
GSDB sample database usage at IBM
Our new version of the sample database is starting to gain wide usage across IBM’s Software Group. Many teams are writing developerWorks articles, tutorials, and demos based on the story behind the data and the data itself. You can see many of those using the following query: http://www.ibm.com/search/csass/search/?sn=dw&lang=en&cc=US&en=utf&hpp=20&dws=dw&q=GSDB&Search=Search
One of the first projects to use the sample database was the IBM InfoSphere Warehouse information development team, which created three tutorials around the sample: SQL warehousing, cubing services, and data mining.
Of course, the Cognos products also continue to make extensive use of the data within their product line, and it is used by a variety of teams, including information development, quality assurance, education, and sales.