Feature

posted 14 Nov 2005 in Volume 2 Issue 5

The perfect portal implementation: Eight steps to heaven

Part III

In the third part of this series, Mike Ferguson focuses on step four of the portal implementation process, taxonomy design and categorisation.

Step four: Develop the portal taxonomy and categorisation scheme
One of the key tasks in portal implementation is making information easy to find. This is especially important if you are deploying portal technology for knowledge management, where the emphasis is on unstructured documents, web content, digital media and collaboration. The most common approach to this is to create a taxonomy – a hierarchical structure where each folder represents a category (or term) that describes a group of information (see Figure 1). Categories can include other lower-level categories and references to specific content items that are associated with them. Here, the term content is used in the broadest sense to mean documents, e-mail, images, web pages, digital media or services.

The folders hold references to related items of content that could be scattered across servers inside and outside the enterprise. This means that the user can find all related content under a particular category, irrespective of where that content is actually located. One item could be a document on a file system next to your desk, while another item appearing next to that document could be on a website on the other side of the world. In addition, it is quite normal to have the same content item referenced in multiple categories throughout the taxonomy. This is valuable because different people may choose to navigate the system via different paths (see Figure 2).
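As a rough illustration of the structure just described, the sketch below models a category as a folder that holds sub-categories plus references to content stored elsewhere. The class name, the helper function and the file and web locations are all hypothetical, chosen only for this example; they do not come from any portal product.

```python
from dataclasses import dataclass, field


@dataclass
class Category:
    """A folder in the taxonomy: holds sub-categories and content references."""
    name: str
    children: dict = field(default_factory=dict)    # sub-category name -> Category
    references: list = field(default_factory=list)  # locations of content items

    def add_child(self, name):
        # Create the sub-category on first use, then return it.
        return self.children.setdefault(name, Category(name))


def find_paths(category, location, prefix=()):
    """Return every category path under which a content location is referenced."""
    path = prefix + (category.name,)
    found = [path] if location in category.references else []
    for child in category.children.values():
        found.extend(find_paths(child, location, path))
    return found


# Hypothetical taxonomy and content locations, for illustration only.
root = Category("Root")
products = root.add_child("Products")
pricing = products.add_child("Pricing")
sales = root.add_child("Sales")

price_list = "file://fileserver/docs/price-list.doc"   # a document on a local file system
pricing.references.append(price_list)
sales.references.append(price_list)                     # same item, reachable by a second path
sales.references.append("http://example.com/market-report.html")  # external web page

paths = find_paths(root, price_list)
print(paths)
```

Because the folders store only references, the price list exists once on the file server yet can be reached through both the Products and the Sales branches.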

Taxonomy design is an area of raging debate in portal circles. There are all kinds of questions being raised, for example:

  • Whether there should be one taxonomy covering all content or one for each community of users;
  • Whether the taxonomy should be maintained by a central administration team or delegated to authorised content managers throughout the organisation;
  • If a taxonomy exists for each community of users, how to handle the same term being used with different meanings in different taxonomies, and how to understand the relationships between the terms used across taxonomies to categorise content;
  • Should a taxonomy be designed – and information categorised – before making it available to users? Or, should we observe what terms staff use to categorise or ‘tag’ content and then calculate the most popular ones? In this way, we select categories for the taxonomy using the most popular terms chosen by the user base – the terms that have a ‘democratically elected majority’.

In response to this last question, one approach is to create a ‘bookshelf’ of categories in advance, like a library system. The alternative is not to create one at all: let everyone choose their own tags for content and, when the most popular terms ‘float to the top’, use these to design the taxonomy and categorise the content.
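The ‘float to the top’ idea can be sketched as a simple tally. The example below is a minimal sketch that assumes tags arrive as (user, item, tag) records and that a term qualifies once a minimum number of distinct users have chosen it. The record layout, the function name and the threshold are assumptions made for illustration.

```python
from collections import Counter


def popular_terms(tagged_items, min_users=2):
    """Return (term, user count) pairs for terms chosen by at least
    min_users distinct users, most popular first."""
    users_per_term = {}
    for user, _item, tag in tagged_items:
        # Normalise case and spacing so 'Pricing' and 'pricing' count as one term.
        users_per_term.setdefault(tag.strip().lower(), set()).add(user)
    counts = Counter({term: len(users) for term, users in users_per_term.items()})
    return [(term, n) for term, n in counts.most_common() if n >= min_users]


# Hypothetical tagging records: (user, content item, tag).
tags = [
    ("ann", "doc1", "Pricing"),
    ("bob", "doc1", "pricing"),
    ("cat", "doc2", "pricing"),
    ("ann", "doc2", "Contracts"),
    ("bob", "doc3", "contracts"),
    ("cat", "doc3", "misc"),
]

candidates = popular_terms(tags)
print(candidates)
```

Counting distinct users rather than raw tag occurrences stops one enthusiastic tagger from electing a category single-handedly.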

So, what do you do? In theory, good taxonomy development takes into account the importance of separating the elements of a group (taxon) into subgroups (taxa) that are mutually exclusive, unambiguous, and – taken together – include all possibilities. In practice, the system should be simple, easy to remember and easy to use, so that the user is quickly led to the information they are looking for.
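One way to test a set of sibling categories against the theory is to look for items that sit in more than one sibling (a possible overlap) and items that sit in none (a gap). The following is a minimal sketch with hypothetical category names and item identifiers. Since deliberate cross-references are normal in a portal taxonomy, an overlap is better read as a prompt for review than as an error.

```python
def check_siblings(assignments, all_items):
    """assignments maps a category name to the set of item ids placed in it.

    Returns (overlaps, uncategorised): items shared by each pair of sibling
    categories, and items that no category covers.
    """
    overlaps = {}
    names = sorted(assignments)
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            shared = assignments[first] & assignments[second]
            if shared:
                overlaps[(first, second)] = shared
    covered = set()
    for items in assignments.values():
        covered |= items
    uncategorised = set(all_items) - covered
    return overlaps, uncategorised


# Hypothetical sibling categories and content item ids.
assignments = {
    "Contracts": {"d1", "d2"},
    "Pricing": {"d2", "d3"},
}
overlaps, uncategorised = check_siblings(assignments, {"d1", "d2", "d3", "d4"})
print(overlaps, uncategorised)
```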

In most cases it is unlikely that there will be a single taxonomy categorising all internal and external content accessed by users. Portal deployment is typically iterative and involves designing community-based taxonomies. Community content managers are then nominated to take on ownership and maintenance of their local community-based taxonomy. Content managers also make use of portal usage reports to tune the taxonomy over time.

Over the next year or so, the taxonomy design and maintenance process is likely to be enriched even further as portal vendors start to introduce taxonomy advisors to help content managers optimise system design and maintenance. In addition, more flexibility is likely to appear as portals start to allow users to adopt their own terms in viewing categories, while maintaining relationships with terms across the organisation and different communities.

Generally speaking, there are two main approaches to taxonomy design: top down and bottom up. Top-down taxonomy design typically starts with a content usage study. Content management and publishing procedures are also established. The information from the study is then used in the taxonomy design process, which begins with the top-level categories and continues with the definition of subfolders in hierarchies. Portal vendors’ services organisations often try to speed this up by offering pre-built taxonomies for various vertical industries. Once the taxonomy has been designed and built, you populate it with references to content by using automated categorisation tools.

The bottom-up approach uses various tools, including crawlers and search engines, to discover content and automatically generate a taxonomy by analysing this content. Some of the tools available for this are quite powerful but this varies across portal products.

In reality, most people will use elements of both approaches. For example, you might start with a top-down approach to define high-level categories and then switch to the bottom-up approach to derive the lower-level categories. The trick with taxonomy design is to end up with something that is stable and not subject to huge change. A taxonomy should be intuitive, consistent and logical. It should also contain categories that, when taken together, are mutually exclusive and collectively exhaustive. While regular maintenance is the norm, there are clearly best practices that can pay dividends. Figure 3 shows some popular – but often the most difficult to achieve – approaches to taxonomy design.

These are:

  • Multi-faceted categorisation;
  • Subject or topic-oriented categorisation;
  • Subject or process-oriented categorisation.

Subject-oriented categorisation is probably the most widely practised of these methods and is often based on the principle of each category being ‘an instance of’ or ‘part of’ its parent.

One common mistake in taxonomy design is categorising by organisational structure – for example, finance, marketing, human resources etc. On the surface this seems fairly logical, with high-level categories representing major departments. The problem is that when you start getting into detailed sub-categories, constant changes in organisational structure tend to force frequent change on the taxonomy structure. For this reason many organisations aim for subject-oriented or multi-faceted approaches.

There can be no doubt that taxonomy design is iterative. You are not going to get it right first time and once the taxonomy is deployed, it is important to make use of regular usage reports to help to improve it. The good news is that portal products are forgiving and if you need to make changes they are relatively straightforward to perform.

Once the taxonomy is designed, categorisation of content can be manual or automated. Manual categorisation puts the user in control. This typically happens when a content or document author has finished authoring a piece of information and wishes to publish it. The information – for example, a document – is published once an authorised user manually selects its position within the taxonomy. Users can publish content by placing a reference to it in one or more folders (see Figure 5). The content itself may be on a shared file system or, more likely these days, within an enterprise content-management system. The point here is that the user is in control of the categorisation process and so the placement of the content under the ‘right’ categories is typically very accurate.

When performing manual categorisation, the user must also provide additional metadata about the content (for example, the author’s name, title, a description of what it is about and when it expires). This metadata is stored alongside the reference to the content in the taxonomy, so navigating users can see a brief abstract of what the item is about. In some cases the authored content may need to go through an approval process during manual categorisation, prior to being published. This type of workflow processing is likely to occur when content is to be made visible to the public. In this case, approvers could include marketing and legal departments, for example.
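The manual publishing step can be sketched as follows. This is an illustration under stated assumptions, not any portal’s actual API: the taxonomy is reduced to a dictionary of folder paths, the metadata fields mirror those named above, and the rule that public content needs sign-off from both marketing and legal is a hypothetical policy. All names, paths and dates are invented for the example.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ContentReference:
    """A reference to content held elsewhere, plus the metadata shown to users."""
    location: str       # where the content actually lives
    title: str
    author: str
    description: str    # brief abstract of what the item is about
    expires: date
    public: bool = False


def publish(taxonomy, ref, categories, user, authorised_users, approvals=()):
    """Place ref under each chosen category; return True if it was published."""
    if user not in authorised_users:
        raise PermissionError(f"{user} may not publish content")
    # Hypothetical rule: publicly visible content needs both sign-offs first.
    if ref.public and not {"marketing", "legal"} <= set(approvals):
        return False
    for category in categories:
        taxonomy.setdefault(category, []).append(ref)
    return True


# Hypothetical folders, users and documents.
taxonomy = {}
authorised = {"ann"}

memo = ContentReference(
    location="file://fileserver/policies/travel.doc",
    title="Travel policy",
    author="Ann Example",
    description="Rules for booking business travel.",
    expires=date(2006, 12, 31),
)
published = publish(taxonomy, memo, ["HR/Policies", "Finance/Expenses"], "ann", authorised)

press = ContentReference(
    location="http://example.com/press/launch.html",
    title="Product launch",
    author="Ann Example",
    description="Press release for the product launch.",
    expires=date(2006, 6, 30),
    public=True,
)
# Only marketing has approved so far, so the release is held back.
held = publish(taxonomy, press, ["News/Press releases"], "ann", authorised,
               approvals=["marketing"])
print(published, held)
```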

Automatic categorisation allows the portal to discover and relate the content scattered across many internal and external content sources. This is especially important if you have large amounts of unstructured information that need to be accessed via the portal. Figure 6 shows the automated categorisation process, which starts with the set-up of crawlers or spiders. These are small programs that can automatically discover content. Portal administrators can configure as many of these as they like and schedule them to run at regular intervals and varying frequencies. Crawlers can be pointed at websites, document management systems, file systems, databases and many other content sources. They can also be configured with various restrictions – for example, to look only in certain places or for documents with specific file types. Crawlers discover content, generate metadata about that content, take a copy of it and bring it ‘back to base’. At this point, they pass it to the portal categoriser, which works out what the content is about and groups together related data, before assigning content references to the appropriate categories in the taxonomy. Automatic categorisation may in fact be a process consisting of a number of tools that are invoked in sequence. These tools include:

  • Language identification – automatically identifies the language the document was written in;
  • Feature extraction – automatically generates metadata about significant vocabulary items along with an item-frequency count;
  • Clustering – automatically groups collections of documents that are similar in some way;
  • Summarisation – automatically analyses the sentences in a document to generate a summary;
  • Topic categorisation – automatically assigns documents to pre-defined categories, topics and themes. This is achieved by creating a list of category names and confidence levels for each document.
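Two of the steps listed above, feature extraction and topic categorisation, can be sketched in a few lines. The example assumes a deliberately naive confidence measure (the share of a document’s counted words that match a category’s term list). The stop-word list, category names and term lists are hypothetical, and commercial categorisers use far more sophisticated techniques.

```python
import re
from collections import Counter

# Small illustrative stop-word list; real tools use much larger ones.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "for", "is", "on"}


def extract_features(text):
    """Feature extraction: vocabulary items with an item-frequency count."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if w not in STOP_WORDS)


def categorise(features, category_terms, threshold=0.1):
    """Topic categorisation: (category, confidence) pairs at or above the
    threshold, highest confidence first."""
    total = sum(features.values())
    if total == 0:
        return []
    scored = []
    for category, terms in category_terms.items():
        hits = sum(features[t] for t in terms)   # a Counter returns 0 for unseen terms
        confidence = hits / total
        if confidence >= threshold:
            scored.append((category, round(confidence, 2)))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical pre-defined categories and their indicative terms.
category_terms = {
    "Pricing": {"pricing", "price", "discount"},
    "Legal": {"contracts", "contract", "liability"},
    "HR": {"holiday", "payroll"},
}

features = extract_features("Pricing update: the new pricing for enterprise contracts")
result = categorise(features, category_terms)
print(result)
```

The output is the list of category names and confidence levels for one document; a reference to the document would then be filed under each category that clears the threshold.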

Various automatic categorisation techniques may be used and these vary across portal products. You can also buy third-party products that can be used for this – for example, from Autonomy, Entrieva, Teragram and Verity. Broadly speaking, there is a trade-off when it comes to automated categorisation – speed versus accuracy (see Figure 7).

Several products use text mining, which is a very popular mechanism. This is a very fast technique that involves training algorithms to categorise accurately by first giving them a sample set of a few thousand documents. Once you are happy with the accuracy of the categorisation, you then let the categoriser loose on all your content. Rule-driven categorisation is more time-consuming but tends to be more accurate. The most accurate – and slowest – technique is of course manual categorisation.
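Whichever technique is chosen, the ‘are you happy with the accuracy?’ check can be expressed as a comparison against a hand-labelled sample. The sketch below pairs a tiny rule-driven categoriser with such a check. The rules, category names and sample documents are hypothetical, and a real sample would run to thousands of documents rather than four.

```python
def rule_categoriser(text):
    """Rule-driven categorisation: the first matching hand-written rule wins."""
    lowered = text.lower()
    rules = [
        ("Clinical trials", lambda t: "trial" in t and "patient" in t),
        ("Pricing", lambda t: "price" in t or "discount" in t),
    ]
    for category, matches in rules:
        if matches(lowered):
            return category
    return "Uncategorised"


def accuracy(categoriser, labelled_sample):
    """Fraction of sample documents placed in the category a person chose."""
    correct = sum(1 for text, expected in labelled_sample
                  if categoriser(text) == expected)
    return correct / len(labelled_sample)


# Hypothetical hand-labelled sample: (document text, expected category).
sample = [
    ("Patient enrolment for the phase II trial", "Clinical trials"),
    ("New discount schedule for resellers", "Pricing"),
    ("Trial period pricing and price list", "Pricing"),
    ("Office relocation notice", "Facilities"),
]

score = accuracy(rule_categoriser, sample)
print(score)
```

The same `accuracy` function would accept a trained text-mining categoriser in place of `rule_categoriser`, which makes it easy to compare techniques on the same sample before committing to one.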

You may require different techniques throughout the organisation – for example, a pharmaceutical company may need rule-based categorisation in its research and development departments (for accuracy), while the rest of the company may be happy with text mining. In this case, you may have to buy a third-party product in addition to what your portal vendor offers out-of-the-box if you don’t feel that the product’s functionality is accurate enough.

In summary, taxonomy design and categorisation are an important part of portal implementation, especially in facilitating collaboration, information sharing and knowledge management. The process is summed up in Figure 8, where content is crawled, categorised and made available to users in community-based taxonomies. Users can then navigate their view of the taxonomy as well as conduct content searches and personalise what they see by filtering out data that does not meet their requirements. Relevant content is then presented on the device being used to access the portal.

In next month’s article, we will look at step five of the portal implementation process: customising the portal’s user interface.

Mike Ferguson is managing director of Intelligent Business Strategies Limited. He is also a partner in iBonD. As an analyst and consultant, he specialises in enterprise business intelligence, business integration and portals. He can be contacted at +44 1625 520700 or by e-mail at: mferguson@intelligentbusiness.biz

Copyright ©1994-2005 Ark Group Ltd All rights reserved. No part of this site or the publications described herein
may be reproduced in any form without the permission of Ark Conferences Ltd, Registered in England, No. 2931372.