Link checking your topics in an Eclipse Help System/Infocenter

Links can be painful to deal with in help systems, especially in multi-author environments or with large numbers of topics. Maintaining links over time can also be difficult. If you are authoring your help in DITA and using relationship tables, you might find they aren’t as easy to maintain as many claim. The Eclipse help system makes link validation harder still: it is built on a frameset, and most link checkers have trouble finding all of your pages.

My goal in writing this is to find a link checker that meets the following requirements:

  • Recursive link checking.
    I need to be able to specify how deep the program follows links so that it discovers and validates all of them. Ideally, I could limit this recursion to the same site.
  • Reports that can be saved and distributed
    Share results with writing teams for resolving and fixing links.
  • Customization of settings, especially links and link patterns to ignore.
    Link checkers often have false positives for various reasons and it is helpful to be able to filter and hide these “errors.”
  • Bonus: Works with dynamic pages and JavaScript.
    The newest versions of the Eclipse Help System rely solely on JavaScript to generate the navigation tree. Ideally, a link checker could work with the same view that we see. Often, link checkers just verify the response codes in the headers rather than actually examining the page for dynamically generated content and links.
  • Bonus: Command line support.
    Integrating link checking into a post-build process, or otherwise automating it, will likely require a command line interface.
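To make the first and third requirements concrete, here is a minimal sketch of a depth-limited, same-site crawl with ignore patterns, using only Python's standard library. It is not how any of the tools mentioned in this post work internally. The names `check_links`, `LinkExtractor`, and the injected `fetch` callable are my own hypothetical choices, and the sketch only sees links present in the static HTML, so it does not cover the JavaScript bonus.

```python
import re
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collects every href and src attribute value, which covers
    anchors, frames, images, scripts, and stylesheets."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)


def check_links(start_url, fetch, max_depth=3, ignore_patterns=()):
    """Breadth-first crawl starting at start_url.

    fetch(url) is a caller-supplied function returning
    (status_code, html_text). Returns a list of
    (referring_page, broken_url, status_code) tuples.
    """
    ignore = [re.compile(p) for p in ignore_patterns]
    site = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([(start_url, None, 0)])
    broken = []

    while queue:
        url, referrer, depth = queue.popleft()
        status, body = fetch(url)
        if status >= 400:
            broken.append((referrer, url, status))
            continue
        # Off-site pages and pages at the depth limit are validated
        # (fetched once) but their own links are not followed.
        if urlparse(url).netloc != site or depth >= max_depth:
            continue
        parser = LinkExtractor()
        parser.feed(body)
        for raw in parser.links:
            target = urldefrag(urljoin(url, raw)).url
            if urlparse(target).scheme not in ("http", "https"):
                continue  # skip mailto:, javascript:, and similar
            if target in seen or any(p.search(target) for p in ignore):
                continue
            seen.add(target)
            queue.append((target, url, depth + 1))
    return broken
```

In real use, `fetch` would wrap something like `urllib.request.urlopen` and translate HTTP errors into status codes. Keeping it as a parameter means the crawl logic can be tried against a small in-memory set of pages first, and the returned tuples are easy to write out as a report for the writing team.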

I’ve long used ParaSoft WebKing for accessibility and link checking on projects within IBM, but it can be quite difficult to configure to meet the goals above. I’ve also tried a few other applications, including LinkChecker, W3C Link Checker, and Xenu. Xenu works pretty well for discovering broken links to any kind of resource, including images, stylesheets, and JavaScript files.

Robots.txt

Many link checkers try to follow the rules that websites define in their robots.txt file. These files instruct search engines and other spiders how to behave when snooping around a website. This became a problem when I tried to test my IBM information center with the LinkChecker from SourceForge and the W3C Link Checker, because both tools obey those rules.

My solution was to run the link checking program against my test infocenter rather than our production server. This eliminated the W3C Link Checker from my possible tools because it cannot access my intranet. Xenu, by contrast, does not follow robots.txt instructions.
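If you want to see in advance whether a robots.txt-respecting checker would be turned away, Python's standard `urllib.robotparser` module can evaluate the rules for a given URL. This is only a sketch: the robots.txt content, the user agent string, and the example.com URLs are made up for illustration and are not taken from any IBM server.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used only for illustration.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /infocenter/
"""


def allowed_by_robots(robots_txt, user_agent, url):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in (
        "http://example.com/infocenter/idm/v2r2/index.jsp",
        "http://example.com/other/page.html",
    ):
        print(url, allowed_by_robots(SAMPLE_ROBOTS_TXT, "MyLinkChecker", url))
```

To check a live server instead of a pasted string, `RobotFileParser` also offers `set_url()` and `read()` to download the file before calling `can_fetch()`.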

What URL to use?

You might assume, as I did, that pointing your link checker at the top-level URL should work just fine. I’ve had mixed results trying to use http://publib.boulder.ibm.com/infocenter/idm/v2r2/index.jsp as my starting page. Your link checker might also have a difficult time with the frames.

I suggest experimenting by also running it against the TOC that is used for browsers without JavaScript enabled, which for my IC is http://publib.boulder.ibm.com/infocenter/idm/v2r2/basic/tocView.jsp. This page does not use the standard JavaScript-based navigation tree, so a link checker should have an easier time with it. If you set your link checker to recurse enough levels, the navigation starting point should work pretty well, and even better if you provide plenty of cross-linking within your files.
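One quick way to run that experiment before committing to a long crawl is to count how many links a plain HTML parse can see on each candidate start page. The sketch below uses only Python's standard library. `summarize_start_page` is a hypothetical helper name, and the sample markup is invented rather than copied from a real infocenter.

```python
from html.parser import HTMLParser


class StaticLinkCounter(HTMLParser):
    """Counts anchors and frames that exist in the raw HTML,
    that is, without running any JavaScript."""

    def __init__(self):
        super().__init__()
        self.anchors = 0
        self.frames = 0

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "a" and attributes.get("href"):
            self.anchors += 1
        elif tag in ("frame", "iframe") and attributes.get("src"):
            self.frames += 1


def summarize_start_page(html):
    """Return counts of statically visible anchors and frames."""
    counter = StaticLinkCounter()
    counter.feed(html)
    return {"anchors": counter.anchors, "frames": counter.frames}


if __name__ == "__main__":
    # Invented markup standing in for a framed page and a plain TOC page.
    framed = '<frameset><frame src="nav.jsp"><frame src="content.jsp"></frameset>'
    plain_toc = '<ul><li><a href="t1.html">One</a></li><li><a href="t2.html">Two</a></li></ul>'
    print(summarize_start_page(framed))
    print(summarize_start_page(plain_toc))
```

A start page that reports frames but no anchors only works if your link checker follows frame sources. A page with plenty of anchors, like a non-JavaScript TOC, gives any checker something to recurse into.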

No broken links?

There is an Eclipse bug in which certain versions do not return a proper 404 response status for missing pages, so a link checker that relies on status codes reports no broken links. If you are experiencing this problem, you might need to test with a different version of Eclipse. I’ll try to track down this bug # and put it in here.
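A simple way to find out whether your server is affected is to request a page that cannot possibly exist and look at the status code that comes back. Here is a minimal sketch of that probe. `returns_real_404` is a hypothetical name, and `fetch` is a function you supply that returns a `(status_code, body)` pair for a URL.

```python
import uuid
from urllib.parse import urljoin


def returns_real_404(base_url, fetch):
    """Request a randomly named, nonexistent page next to base_url.

    Returns True if the server answers with a 404 status, and False
    if it answers with anything else (for example 200 plus an error
    page), which would hide broken links from most link checkers.
    """
    probe_url = urljoin(base_url, "no-such-topic-" + uuid.uuid4().hex + ".html")
    status, _body = fetch(probe_url)
    return status == 404
```

If this returns False for your infocenter, a clean report from a status-code-based link checker should not be trusted.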

Checking the links of my infocenter

I installed the latest version of Xenu, which had recently been updated. I pointed Xenu to http://publib.boulder.ibm.com/infocenter/idm/v2r2/index.jsp and left all of the default options. This infocenter contains approximately 17,000 topics across all languages, owned by some 12 or more contributing authors.

Xenu is still running but it looks as if it is properly crawling the site and finding problems:

report_window

I’m starting to see plenty of red showing up, quite a bit more than I was expecting, so I expect I’ll be sending this report out to our team. The total number of pages and files discovered for testing thus far is over 10k, so the link checker seems to be working correctly.

When Xenu finally finished, I generated a report (hit Cancel on the FTP screen), which is formatted for viewing from a couple of different perspectives. The HTML markup behind the scenes is very basic. I’ll have to investigate how to parse it to show broken links by plugin so that I can send the reports to the correct people, but that might be a later post.
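I have not worked out the report parsing yet, but the grouping step itself is straightforward once you have a list of (referring page, broken link) pairs. Here is a rough sketch of that step only. It assumes topic URLs follow the Eclipse help pattern of `/topic/<plugin-id>/...`, and the plugin IDs and URLs in the usage example are invented.

```python
import re
from collections import defaultdict
from urllib.parse import urlparse

# Assumption: topic URLs look like .../topic/<plugin-id>/path/file.html
PLUGIN_PATTERN = re.compile(r"/topic/([^/]+)/")


def plugin_of(url):
    """Return the plugin ID embedded in a topic URL, or None."""
    match = PLUGIN_PATTERN.search(urlparse(url).path)
    return match.group(1) if match else None


def group_by_plugin(broken_links):
    """Group (referring_page, broken_url) pairs by the plugin that
    owns the referring page, since that is where the fix is needed."""
    groups = defaultdict(list)
    for referrer, target in broken_links:
        owner = plugin_of(referrer) or "(unknown)"
        groups[owner].append((referrer, target))
    return dict(groups)


if __name__ == "__main__":
    # Invented data standing in for pairs extracted from a report.
    sample = [
        ("http://example.com/ic/topic/com.example.admin/a.html",
         "http://example.com/ic/topic/com.example.admin/gone.html"),
        ("http://example.com/ic/topic/com.example.install/b.html",
         "http://example.com/ic/topic/com.example.admin/gone.html"),
        ("http://example.com/ic/index.jsp",
         "http://example.com/ic/missing.css"),
    ]
    for plugin, links in group_by_plugin(sample).items():
        print(plugin, len(links))
```

Each group could then be written to its own file and sent to whoever owns that plugin.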

Here’s a quick screen cap showing one easily identifiable broken link:

report_output

Other tools?

I’ve listed a few here from my own experiences but would love to hear from others.
