Links can be painful to deal with in help systems, especially in multi-author environments or with large numbers of topics. Maintaining links over time can also be difficult. If you are authoring your help in DITA and using relationship tables, you might find they aren’t as easy to maintain as many claim. The Eclipse help system can make link validation especially hard, because its frameset makes it difficult for most link checkers to find all of your pages.
My goal in writing this is to find a link checker that meets the following requirements:
- Recursive link checking.
I need to be able to specify how deep the program should follow links to discover and validate everything. Ideally, the checker could also limit this recursion to the same site.
- Reports that can be saved and distributed.
I need to share results with writing teams so they can resolve and fix links.
- Customization of settings, especially links and link patterns to ignore.
Link checkers often report false positives for various reasons, and it is helpful to be able to filter and hide these “errors.”
- Bonus: Command line support.
Integrating the checker into a post-build process, or otherwise automating it, will likely require a command-line interface.
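Taken together, these requirements describe a small crawler. As a rough sketch of what I am asking a tool to do (not any particular product’s implementation), here is a minimal recursive checker in Python with a depth limit, a same-site restriction, and ignore patterns:

```python
import re
import urllib.request
import urllib.error
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkExtractor(HTMLParser):
    """Collect href targets from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def check_links(start_url, max_depth=2, ignore_patterns=()):
    """Crawl from start_url and return {url: status} for broken links.

    External links are checked but never crawled into; recursion stops
    at max_depth; URLs matching any regex in ignore_patterns are
    skipped entirely (the "hide false positives" requirement).
    """
    host = urlparse(start_url).netloc
    seen, broken = set(), {}
    frontier = [(start_url, 0)]
    while frontier:
        url, depth = frontier.pop()
        if url in seen or any(re.search(p, url) for p in ignore_patterns):
            continue
        seen.add(url)
        try:
            with urllib.request.urlopen(url) as resp:
                charset = resp.headers.get_content_charset() or "utf-8"
                body = resp.read().decode(charset, "replace")
        except urllib.error.HTTPError as err:
            broken[url] = err.code          # e.g. 404
            continue
        except urllib.error.URLError as err:
            broken[url] = str(err.reason)   # DNS failure, connection refused, ...
            continue
        # Only recurse into pages on the starting site, up to max_depth.
        if depth >= max_depth or urlparse(url).netloc != host:
            continue
        extractor = LinkExtractor()
        extractor.feed(body)
        for href in extractor.links:
            target = urljoin(url, href)
            if urlparse(target).scheme in ("http", "https"):
                frontier.append((target, depth + 1))
    return broken
```

Real tools add parallel requests, redirect handling, robots.txt support, and report formatting on top of this core loop, which is exactly why I would rather find one than write one.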
Many link checkers try to follow the rules that websites define in their robots.txt file. These files instruct search engines and other spiders how to behave when crawling a website. I ran into this problem when trying to test my IBM information center with LinkChecker from SourceForge and the W3C Link Checker.
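For reference, robots.txt is a plain-text file served from the site root. A site that tells every compliant crawler to stay away entirely would serve something like this (a generic illustration, not IBM’s actual file):

```
User-agent: *
Disallow: /
```

A checker that honors these rules will refuse to crawl such a site at all.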
My solution was to run the link-checking program against my test infocenter rather than our production server. This eliminated the W3C Link Checker from my list of possible tools because it cannot access my intranet. Xenu, for its part, does not follow robots.txt instructions.
What URL to use?
You might assume, as I did, that pointing your link checker at the top-level URL will just work. I’ve had mixed results using http://publib.boulder.ibm.com/infocenter/idm/v2r2/index.jsp as my starting page. Your link checker might also have a difficult time with the frames.
No broken links?
There is an Eclipse bug where certain versions do not return a proper 404 response status. If you are experiencing this problem, you might need to test with a different version of Eclipse. I’ll try to track down the bug number and add it here.
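One way to confirm whether your server returns real 404s is to request a path that cannot exist and inspect the status code. This is a generic probe, not part of Eclipse or any link checker; the made-up path below is simply meant to miss:

```python
import urllib.request
import urllib.error

def probe_404(base_url):
    """Request a page that should not exist and return the HTTP status.

    A correctly configured server answers 404. A server with the
    soft-404 problem answers 200, so link checkers see no error even
    for genuinely broken links.
    """
    # Hypothetical path chosen only because it should not exist.
    url = base_url.rstrip("/") + "/this-page-should-not-exist-xyz"
    try:
        with urllib.request.urlopen(url) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code
```

If this probe returns 200, no link checker can distinguish a broken link from a working one on that server, and your “no broken links” report is meaningless.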
Checking the links of my infocenter
I installed the latest version of Xenu, which had recently been updated. I pointed it at http://publib.boulder.ibm.com/infocenter/idm/v2r2/index.jsp and left all of the default options. This infocenter contains approximately 17,000 topics across all languages, owned by a dozen or more contributing authors.
Xenu is still running but it looks as if it is properly crawling the site and finding problems:
I’m starting to see plenty of red, quite a bit more than I was expecting, so I expect I’ll be sending this report out to our team. The total number of pages and files discovered for testing is over 10,000 so far, so the link checker seems to be working correctly.
When Xenu finally finished, I generated a report (hit Cancel on the FTP screen), which presents the results from a couple of different perspectives. The HTML markup behind it is very basic. I’ll have to investigate how to parse it so I can show broken links by plugin and send the reports to the correct people, but that might be a later post.
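As a head start on that parsing, here is a rough sketch. It rests on two assumptions you would need to verify against an actual report: that the report HTML contains absolute topic URLs, and that infocenter URLs carry the plugin ID in the path segment after /topic/, as Eclipse infocenter topic URLs typically do:

```python
import re
from collections import defaultdict

def broken_links_by_plugin(report_html):
    """Bucket topic URLs found in a link-checker report by plugin ID.

    Assumes infocenter-style URLs of the form .../topic/<plugin.id>/...;
    URLs without a /topic/ segment are grouped under "(no plugin)".
    """
    by_plugin = defaultdict(list)
    for url in re.findall(r'https?://[^\s"<>]+', report_html):
        match = re.search(r"/topic/([^/]+)/", url)
        key = match.group(1) if match else "(no plugin)"
        by_plugin[key].append(url)
    return dict(by_plugin)
```

With the links bucketed per plugin, mapping each plugin ID to its owning author and mailing out the relevant slice of the report becomes straightforward.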
Here’s a quick screen cap showing one easily identifiable broken link:
I’ve listed a few issues here from my own experience, but I would love to hear from others.