CakePHP, DITA, and continuous integration

For my last two years at IBM, I led a team developing a continuous integration build system for DITA-based builds. We chose to base our application on the CakePHP rapid development framework, with an IBM DB2 database storing the build definitions, build results, and other metadata. Build execution is handled by ANT and various internal tools.


The project began with the need to provide a consistent build infrastructure for the different teams within our organization. Before this project, teams built their own ANT build scripts in dramatically different ways and at varying levels of complexity, using what I liked to call our “not-so-common build utilities.” We had custom targets to facilitate processing, but the approach was too open-ended, and every implementation became hard-coded and customized to the point that only its creator could maintain it. We needed simplification, consistency, and standardization.

Our organization had too many people supporting custom code for small groups for no other reason than that it evolved that way. The primary build output at IBM is Eclipse plugins containing XHTML output for use within Eclipse help systems running in “information center” mode. Our secondary output is PDF files. There are some additional outputs, but those two cover at least 95% of our builds. Sounds relatively easy, right? If you are familiar with the DITA Open Toolkit or IBM’s tooling, you will know that there are many variations that can go into building XHTML output from DITA: different headers, footers, XSL overrides, ditavals, and file extensions, not to mention source file locations, naming conventions, and navigation architectures.

The goal of the project was to greatly simplify and speed up the processes associated with DITA builds by providing a web application to handle the entire build life cycle.

Design choices

We chose PHP as the language for our application because of its lower barrier to learning. A good number of our writers have some level of experience with PHP, which meant that when the time came for me to move on, others within the organization could take over development. Our information development teams typically do not get much in the way of programming support, so it was important that the project could be maintained by our own personnel.

The CakePHP rapid development framework significantly reduced the amount of utility code that we needed to write, such as the database communication layer and email handling. The framework also helped reduce the amount of complex SQL code that we needed to write. The vibrant community of CakePHP developers meant that customized plugins and other CakePHP extensions were available to help further speed up development.

We opted for jQuery and jQuery UI instead of CakePHP’s default Prototype library because of jQuery’s ease of use and very active plugin community. Replacing the Prototype-based Ajax helper with a jQuery-based helper was a piece of cake (pun intended).

The application data is stored within a DB2 version 9.7 database. The application uses a PHP port of the DelayedJob library with the CakeDjjob plugin to handle delayed and remote execution of jobs. Each build server runs as many workers as its capacity allows, and workers can be targeted at certain kinds of tasks, such as information center hosting. The builds are executed as Apache ANT build jobs and use an internal library for working with IBM information deliverables. The CakePHP application dynamically assembles the ANT scripts and places the build into the job queue for a build worker to execute.
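To give a sense of what the application assembles, here is a heavily simplified sketch of a generated ANT script. The property names, target names, and the way the DITA Open Toolkit is invoked are invented for illustration; the real scripts rely on internal IBM build utilities.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Illustrative sketch only: all names and paths here are invented. -->
<project name="generated-build" default="all">
  <property name="source.dir" value="${workspace}/source"/>
  <property name="output.dir" value="${workspace}/output"/>
  <property name="ditamap" value="${source.dir}/product.ditamap"/>

  <!-- XHTML output, later packaged as an Eclipse help plugin -->
  <target name="xhtml">
    <ant antfile="${dita.ot.dir}/build.xml">
      <property name="args.input" value="${ditamap}"/>
      <property name="transtype" value="xhtml"/>
      <property name="output.dir" value="${output.dir}/xhtml"/>
    </ant>
  </target>

  <target name="all" depends="xhtml"/>
</project>
```

The web application fills in the per-project values (map location, transform type, overrides) from the build definition stored in DB2 before queuing the job.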

The experience

We created a UI experience that puts as much control into our writers’ hands as possible and customizes the application to each of them individually. They have a homepage that lists all of their current build projects and a news feed that displays recent events and problems. They can dive straight into troubleshooting build problems directly from their homepage or quickly scan the results of a project build. When writers need a build, they can simply launch an on-demand build, which typically finishes in one to two minutes depending on the size of the deliverable.

Current state of the project

When I left IBM at the end of March 2012, we had around 130 build projects on the system, 150 active users, and 80 test information centers running. The project has been a huge success and far surpasses our original goals. Upon my departure, the team was testing our next major release, which will be the first delivery to take full advantage of the build farm architecture.

This was a very ambitious project when we first launched it, and it is a testament to how much a small team can accomplish through persistence and cohesiveness. The project was challenging and enjoyable to work on. It made my decision to leave IBM for Google very difficult because I was enjoying my work so much, but I couldn’t pass up the opportunity that Google presented.

Graphical aid for generating object setup scripts (patent pending)

I joined IBM as an information developer intern (co-op) in June of 2005. Dell Burner, my team lead, was working on a project with our visual designer and user experience engineer to create a quick start visual for setting up WebSphere MQ for use with Q replication. Q replication is a database replication technology that performs extremely well because of its architecture and infrastructure. The integral piece that makes Q replication outperform standard SQL-based replication is that it pushes its instructions over WebSphere MQ message queues.

WebSphere MQ administrators are a completely different audience from DB2 database administrators. The two products are like night and day. DB2 database administrators typically don’t know anything about WebSphere MQ and are usually not even allowed to touch the WebSphere MQ configurations.

Q replication was facing a major customer pain point when it came to configuring WebSphere MQ. The DBAs were not skilled in the technologies needed to implement it, and in most shops there were communication and procedural barriers to getting the environment configured. In most cases, the best-case scenario for getting just the WebSphere MQ portion of the environment configured was a minimum of one to two days. If the DBAs were working in a z/OS shop, that time frame could be even longer.

Of course, a best-case scenario assumes that all parties involved are well versed in their requirements and can communicate them from DBA to WebSphere MQ admin.

In an effort to ease the learning curve of Q replication, Dell Burner, Daina Pupons-Wickham, Kevin Lau, and Michael Mao undertook an effort to create a diagram-based document and supporting instructions to help DBAs create the setup scripts that they would ask the WebSphere MQ admins to run.
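To give a flavor of what such a setup script contains, here is a heavily simplified MQSC sketch. The queue and channel names are invented for illustration; a real Q replication setup involves more objects and site-specific naming conventions.

```
* Illustrative MQSC fragment -- object names are invented
DEFINE QLOCAL('ASN.ADMINQ') DESCR('Q Capture administration queue')
DEFINE QLOCAL('ASN.RESTARTQ') DESCR('Q Capture restart queue')
DEFINE QLOCAL('QM2.XMITQ') USAGE(XMITQ)
DEFINE QREMOTE('ASN.QM1.TO.QM2.DATAQ') RNAME('ASN.DATAQ') RQMNAME('QM2') XMITQ('QM2.XMITQ')
DEFINE CHANNEL('QM1.TO.QM2') CHLTYPE(SDR) TRPTYPE(TCP) CONNAME('target.example.com(1414)') XMITQ('QM2.XMITQ')
```

The DBA would hand a script like this to the WebSphere MQ admin to run with the runmqsc command, which is exactly the handoff the diagram-based document was designed to streamline.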

Greasemonkey script for Lotus Connections dynamic drop-down menus

Install this script from IBM Lotus Connections Drop-down menus

IBM’s Lotus Connections software is an enterprise social software application that consolidates a lot of collaboration and information sharing within the enterprise to a single location. We use Lotus Connections quite extensively within IBM and it has been an amazing improvement over many of our previous tools and methods for project, team, and professional collaboration.

The recently announced Lotus Connections 2.5 features communities, blogs, wikis, file sharing, social bookmarking, activities, profiles, and a customizable dashboard. I can tell you that I use every single one of those features on a daily basis, and it’s been a great step forward in so many ways. Knowledge is being shared better across our enterprise, and it can be found more easily. Projects and teams can stay organized and on track. To-dos can be assigned and tracked.

While my overall satisfaction with the product is quite high, one small design issue bugs me. Thankfully, Lotus Connections is a web application, so I can use the Greasemonkey add-on for Firefox to alter aspects of the design. I often find myself having to switch between sections in Lotus Connections, for example, being on a wiki page and needing to browse to a wiki section within a community. The default design has a top bar menu with only the top-most navigation items:


Getting around Lotus Connections sometimes requires clicking through multiple unnecessary pages, so I decided to add drop-down menus to each menu item to reduce the number of clicks. I eventually discovered the Atom feeds for each Lotus Connections area and determined that I could add links for specific areas, such as a list of all your communities or a list of all files that have been shared with you (I had to blur out the actual files in the screenshot):
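The core of the script boils down to turning feed entries into menu markup. Here is a minimal sketch of that step; the function name and markup structure are invented for illustration, and the real script also fetches the Atom feeds and wires up the hover behavior:

```javascript
// Build drop-down menu markup from a list of feed entries.
// buildMenu and the class name are illustrative, not the actual script.
function buildMenu(entries) {
  var items = entries.map(function (entry) {
    return '<li><a href="' + entry.href + '">' + entry.title + '</a></li>';
  });
  return '<ul class="lotusDropDown">' + items.join('') + '</ul>';
}

// In the real userscript, entries would be parsed from a Lotus Connections
// Atom feed, and the resulting <ul> appended under each top-bar menu item.
```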


Overall, this script improves an already awesome application a bit more. I’m hoping to convince the Lotus Connections team to implement some kind of navigation that matches this experience, or something similar, so that users can more quickly navigate to specific items in their network.

Using this script

You can use this script on, which is an IBM implementation of Lotus Connections 2.5. You can also customize the locations that this Greasemonkey script runs to include your own installations. For example, if you want it to run against* you would include that location:


If you are an IBM employee, you can find an internal version of this script for use with our own internal deployments. Check my internal Connections Blog by searching for Greasemonkey.

Great Outdoors sample database (GSDB)

In May of 2008, I switched roles at IBM. I moved from being a standard information developer (technical writer) to a much more open-ended role that I often have a difficult time describing, internally titled “Information Development Infrastructure Lead” for the IBM Integrated Data Management portfolio of products.

One facet of this new, broad role was leading a team of developers, tech sales, information developers, and subject matter experts in the creation or adoption of a common sample database for the entire portfolio. The IBM Integrated Data Management portfolio is mostly comprised of the IBM Optim products and IBM Data Studio, with some other tools thrown in. Overall, the portfolio has 30 or so products, each with its own database requirements.

Adopt or create?

IBM has a variety of sample databases around the organization. Some are small, some are random, some are for very specific situations, and very few are realistic or rich enough for broad use. My first task was evaluating all of the potential sample data sets available to our teams to try to find the one that best fit our requirements.

Key requirements

The outcome of the project needed to meet some key requirements for our products:

  • Story or scenario that drives usage and design of the database. Must be similar to a realistic customer environment to help relate our tools and information to solving customers’ own problems.
  • Support the wide variety of databases that our tools support, including DB2 for Linux, UNIX, and Windows, DB2 for z/OS, Oracle, Informix Dynamic Server, and Microsoft SQL Server.
  • Support key product scenarios and features, including data warehousing, data mining, modeling, application development, and general database tasks.
  • Support XML capabilities in the various databases.
  • Support spatial data teams (mapping data).

Very few of the databases my team investigated came close to meeting any of these goals.

Cognos acquisition

Earlier in the year, IBM had acquired the business intelligence software company Cognos. Cognos brought to IBM an extremely rich, realistic, relevant, and widely adopted sample database known as the Great Outdoors sample database (GSDB). This sample database is based on the operations of a fictional outdoor distributor known as the Great Outdoors company. The data set included five schemas: business-to-business sales data (GOSALES), warehouse data (GOSALESDW), human resources data (GOSALESHR), retailer data (GOSALESRT), and marketing data (GOSALESMR).

This database was by far the closest match to our portfolio’s requirements. Prior to our work, the database already had the following features:

  • Based on a story about a fictional business.
  • Five schemas that covered different types of business data.
  • Localized and translated.
    • All text within the database is available in 25 languages.
    • Data is spread out across the world and can expose different scenarios in different regions and countries.
  • Initial support for DB2 for Linux, UNIX, and Windows; Oracle; and Microsoft SQL Server.

Expanding the GSDB database

To meet our key requirements, we needed some additions to the sample database. I had to carefully navigate the internal politics of this cross-organizational project and work closely with the acquisition teams to ensure that all interested parties were happy and that the relationships continued to be beneficial for everyone.

Working with the Cognos teams, we prioritized and planned the addition of a new schema and an extension to the story that the database follows. To meet the needs of our sales enablement teams and our data mining teams, we required business-to-consumer data. This meant adding customers as individuals rather than businesses, along with transactions, customer interests, fake credit data, and a variety of other information.

The data mining and data warehousing teams had the most challenging requirements to implement. The data that we added needed to be quite deep as well as intelligent: to exercise the data mining tools, we needed to add associations, patterns, time series, and trends to the B2C transaction data. Before the transactions could be generated, we first had to add approximately 30,000 new customer records.

Adding customers

Adding the customers involved more manual work than other aspects of the GSDB expansion. We involved teams and individuals around IBM to assist with the data collection. Eventually, over 50 people across many countries had helped add the following types of data:

  • First and last names that were representative of the population of a country and culture. For example, some cultures have different surname patterns for women, such as Czech female surnames that end in “-ová.”
  • Translated or localized versions of all data. For example, city names in both their English and local forms, such as Bangkok and its Thai name, กรุงเทพมหานคร.
  • A variety of city names, state/province names, street names, and corresponding postal codes and phone codes.

All of the above data was gathered and then randomly mixed so that we could generate additional rows from combinations of first names, last names, addresses, and cities. We also did this to ensure that all data was random and that any similarity to a real person was entirely by chance. Additional data, such as age, profession, marital status, and hobbies, was added to these records randomly.
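The mixing step can be sketched roughly as follows. The field names and pools here are invented stand-ins for the real collected data, and the actual generation was done against the database rather than in memory:

```javascript
// Randomly combine collected name and address pools into synthetic
// customer records. Pools, fields, and ranges are illustrative stand-ins.
function generateCustomers(pools, count) {
  function pick(list) {
    return list[Math.floor(Math.random() * list.length)];
  }
  var customers = [];
  for (var i = 0; i < count; i++) {
    customers.push({
      firstName: pick(pools.firstNames),
      lastName: pick(pools.lastNames),
      street: pick(pools.streets),
      city: pick(pools.cities),
      age: 18 + Math.floor(Math.random() * 60), // random adult age
      hobby: pick(pools.hobbies)
    });
  }
  return customers;
}
```

Because every field is drawn independently from a shuffled pool, no generated record corresponds to a real person except by chance.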

Generating complex transactions

One of the members of my extended team was a developer and subject matter expert who worked with the data warehousing and data mining teams. He created an application for us that would read our new customers table and generate orders, either randomly or using logic based on age, location, profession, marital status, and hobbies.

The application read the approximately 30,000 customer records and generated half a million transactions for them. Patterns, associations, trends, and other intricacies were embedded in these transactions so that our mining tools could discover and expose them, demonstrating the value of the tools and how to use them.
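In spirit, the generator worked something like the sketch below. The products, weights, and seasonal logic are invented to illustrate how patterns can be baked into otherwise random transactions; the real application was considerably more sophisticated:

```javascript
// Generate transactions per customer, biasing product choice by hobby
// and clustering some orders by season, so that mining tools have real
// associations and trends to discover. All names and weights are illustrative.
function generateTransactions(customers, ordersPerCustomer) {
  var transactions = [];
  customers.forEach(function (customer, id) {
    for (var n = 0; n < ordersPerCustomer; n++) {
      // Bake in an association: hikers mostly buy hiking gear.
      var product = (customer.hobby === 'hiking' && Math.random() < 0.7)
        ? 'hiking boots'
        : 'camping stove';
      // Bake in a seasonal trend: half the orders cluster in summer.
      var month = Math.random() < 0.5
        ? 6 + Math.floor(Math.random() * 3)   // June-August
        : 1 + Math.floor(Math.random() * 12); // any month
      transactions.push({ customerId: id, product: product, month: month });
    }
  });
  return transactions;
}
```

A mining tool run against output like this should surface the hiking-to-boots association and the summer spike, which is exactly the kind of discoverable structure the team needed in the B2C data.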

GSDB sample database usage at IBM

Our new version of the sample database is starting to gain wide usage across IBM’s Software Group. Many teams are writing developerWorks articles, tutorials, and demos based on the story behind the data and the data itself.

One of the first projects to use the sample database was the IBM InfoSphere Warehouse information development team, which created three tutorials around the sample: SQL warehousing, cubing services, and data mining.

Of course, the Cognos products also continue to make extensive use of the data within their product line and by a variety of teams including information development, quality assurance, education, and sales.