By Penny Crosman
If you want to find out what Brad and Angelina are up to, Google is a great search tool. Type in the celebrity names and poof, you get a list of the latest stories about the Brangelina baby-to-be. But if you need a technical or business-oriented search, Internet-style search technology doesn’t cut it. Accurate enterprise search depends on intelligent use of state-of-the-art taxonomies, metatags, semantics, clustering and analytics that find concepts and meaning in your data and documents.
The idea that the enterprise can’t be searched like the Web sounds foreign to many business executives. “Why can’t we use Google?” says the CEO. IT obediently buys Google’s search appliance, turns it on and the problem is solved. Or is it? “For some companies, Google is fine,” says Laura Ramos, Gartner analyst. But where many repositories of non-Web content and documents need to be searched or critical information must be found quickly, companies need to design searches that approximate human reasoning.
No one product can do this. But by mixing and matching the latest taxonomy, clustering, and entity, concept and sentiment extraction tools, you can get close. What’s helping is the rise of XML: As more companies realize the benefits of reading and sharing information in standard XML formats, such as RDF, ebXML and XBRL, more products roll out to convert documents, databases and other content into XML. The information provided in XML tags and formatting brings a level of intelligence about documents and content hitherto unavailable. Next-generation search technologies are taking advantage of XML formatting and metadata to provide searches informed by insider information and structure.
The main trend adding power to enterprise search is the increase in semistructured information: content that has some kind of structure to it, generally through the use of metatags that describe content. E-mail, which is structured with “To,” “From” and “Subject” fields, is one example of semistructured data. XML is also expanding the universe of semistructured content as industries adopt XML schemas, such as ACORD (Association for Cooperative Operations Research and Development) for XML in the insurance industry and XBRL (Extensible Business Reporting Language) in the financial services arena. Such schemas help businesses exchange and analyze data in a standardized way.
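To make the e-mail example concrete, here is a minimal sketch in Python that separates a message's structured fields from its free-text body using the standard library's email parser. The message, addresses and the `extract_fields` helper are made up for illustration.

```python
from email import message_from_string

# Hypothetical message; the addresses and text are invented for illustration.
RAW_MESSAGE = """\
From: alice@example.com
To: bob@example.com
Subject: Q3 warranty claims

Claims rose in the third quarter, mostly for the new model.
"""


def extract_fields(raw):
    """Split a message into its structured headers and its unstructured body."""
    msg = message_from_string(raw)
    return {
        "from": msg["From"],
        "to": msg["To"],
        "subject": msg["Subject"],
        "body": msg.get_payload().strip(),
    }


fields = extract_fields(RAW_MESSAGE)
print(fields["subject"])
```

A search engine that indexes these fields separately can answer a query such as "messages from Alice about warranty claims" without guessing at where the sender's name appears in the text.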
Basic structure is provided by metatagging. An author or software program identifies the elements of a document, such as headline, abstract, byline, first paragraph, second paragraph and so on to modestly improve search results.
Using content structure in the display of search results is useful. If a search engine can present the headline, abstract, graphics and the first and last paragraphs of an article, the user gets a good idea of what it’s about — much better than the typical document “snippet” that’s often no use at all.
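As a rough sketch of that idea, the Python below builds a structured summary from an XML-tagged article instead of an arbitrary snippet. The tag names (`headline`, `abstract`, `para`) and the sample article are assumptions; a real schema would define its own.

```python
import xml.etree.ElementTree as ET

# Hypothetical article markup; element names are assumed, not from any standard.
ARTICLE_XML = """
<article>
  <headline>Search Gets Smarter</headline>
  <byline>Jane Doe</byline>
  <abstract>How taxonomies and metatags improve enterprise search.</abstract>
  <body>
    <para>First paragraph.</para>
    <para>Middle paragraph.</para>
    <para>Last paragraph.</para>
  </body>
</article>
"""


def structured_summary(xml_text):
    """Pick out the parts of a tagged article worth showing in a result list."""
    root = ET.fromstring(xml_text)
    paragraphs = [p.text for p in root.findall("./body/para")]
    return {
        "headline": root.findtext("headline"),
        "abstract": root.findtext("abstract"),
        "first_para": paragraphs[0] if paragraphs else None,
        "last_para": paragraphs[-1] if paragraphs else None,
    }


summary = structured_summary(ARTICLE_XML)
print(summary["headline"])
```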
A few vendors are using XQuery, a command-oriented, SQL-like standard for creating search statements, to exploit the structure of XML-tagged content. Mark Logic, for example, converts documents and databases to XML, provides structural metatagging, and indexes the content and tags in a database where they can be mined by a variety of text analytics tools. Similarly, Siderean Software’s Seamark Metadata Assembly Process Platform converts unstructured and structured data to RDF (Resource Description Framework), generates metadata such as page title and date, and organizes the content and tags into relational tables. Entity and concept extraction can be applied to create tags, and metatags can be suggested to an editorial team, which can approve and refine them. Content and metadata are then pulled into a central repository where they can be organized according to corporate vocabularies or ontologies and mined using the tagging results.
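The flavor of a structure-aware query can be suggested in a few lines of Python. This sketch does not use XQuery or any vendor's product; it relies on the limited XPath subset in Python's standard ElementTree module as a stand-in, and the corpus, element names and `titles_by_type` helper are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical tagged corpus; ids, types and titles are invented.
CORPUS_XML = """
<corpus>
  <doc id="1" type="claim"><title>Water damage</title><date>2006-01-15</date></doc>
  <doc id="2" type="memo"><title>Holiday schedule</title><date>2006-02-01</date></doc>
  <doc id="3" type="claim"><title>Hail damage</title><date>2006-03-09</date></doc>
</corpus>
"""


def titles_by_type(xml_text, doc_type):
    """Return the titles of every document whose 'type' tag matches."""
    root = ET.fromstring(xml_text)
    matches = root.findall(f".//doc[@type='{doc_type}']")
    return [doc.findtext("title") for doc in matches]


claim_titles = titles_by_type(CORPUS_XML, "claim")
print(claim_titles)
```

The query asks for documents by what they are (claims) rather than by which words they contain, which is the kind of question keyword search alone cannot answer.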
With metatags and some structure in place, the next logical step to improving an enterprise search is to build a taxonomy. For as long as search technology has existed, it’s been obvious that the first step toward getting something more accurate than 500,000 useless hits is to create context or navigation for the search, such as a taxonomy — a classification according to a predetermined system. A taxonomy can be as basic as organizing documents by month or client, or it can be a sophisticated scheme of concepts within topics.
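A taxonomy of this kind can be pictured as a tree of categories with documents at the leaves. The following Python sketch assumes a small, invented insurance taxonomy and hypothetical document IDs; it shows how browsing to a category yields everything filed beneath it.

```python
# Hypothetical taxonomy: nested categories, with lists of document IDs as leaves.
TAXONOMY = {
    "Insurance": {
        "Life": {
            "Term Life": ["doc-101", "doc-102"],
            "Whole Life": ["doc-103"],
        },
        "Auto": {"Claims": ["doc-201"]},
    }
}


def browse(taxonomy, path):
    """Walk down the taxonomy along a list of category labels."""
    node = taxonomy
    for label in path:
        node = node[label]
    return node


def documents_under(node):
    """Collect every document ID at or below a taxonomy node."""
    if isinstance(node, list):
        return list(node)
    docs = []
    for child in node.values():
        docs.extend(documents_under(child))
    return docs


life_docs = documents_under(browse(TAXONOMY, ["Insurance", "Life"]))
print(life_docs)
```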
“Categorization lets you sharpen the search and do concept-based retrieval as well as browsing,” says Sue Feldman at IDC. “It lets a user answer questions that can’t be answered by search alone, such as, ‘What’s in this collection?’ or ‘I’m interested in going on a vacation, but I don’t know where; what are some interesting places?’”
With a taxonomy in place, users can browse through categories and discover information they need but didn’t know how to look for (indeed, few people understand how to write effective search queries or ask the right questions of a search engine). The tricky part is deciding who will build the taxonomy. Who is willing, able and blessed with sufficient free time to decide what the structure should be and where each new piece of content fits in?
The most straightforward answer is to have authors categorize and apply the proper metatags and keywords to their content. Publishers of magazines and technical publications, for instance, take a structured-authoring approach using marked-up templates. But this laborious practice is not for everyone, and in a typical company, most users lack the time and inclination to fill out forms describing each document.
A more lightweight method of categorizing, called “folksonomy,” is becoming popular on the Web, where sites like Flickr and Del.icio.us provide those submitting photos or lists with easy-to-use tools to annotate their content. “By combining annotation across many different distributors, you gain insight into useful information and get around some [of the] problems with more traditional approaches to metadata management,” says Brad Allen of Siderean.
With an active community of users assigning categories and metatags, valid new terms, initiatives and projects are easily added to the existing taxonomy, making it more dynamic than a rigid taxonomy created by a librarian. “It’s sloppy and it’s chaotic, but the degree to which it improves precision in the retrieval process can be quite significant,” Allen says.
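A folksonomy amounts to pooling the tags many users apply and letting agreement stand in for authority. The Python sketch below assumes invented users, items and tags; the `min_votes` threshold is one hypothetical way to trade the chaos Allen describes for some precision.

```python
from collections import Counter, defaultdict

# Hypothetical (user, item, tag) annotations.
ANNOTATIONS = [
    ("alice", "photo-1", "paris"),
    ("bob", "photo-1", "paris"),
    ("bob", "photo-1", "eiffel"),
    ("carol", "photo-2", "paris"),
    ("carol", "photo-1", "vacation"),
]


def tag_counts(annotations):
    """Count how many times each tag was applied to each item."""
    counts = defaultdict(Counter)
    for _user, item, tag in annotations:
        counts[item][tag.lower()] += 1
    return counts


def items_for_tag(counts, tag, min_votes=1):
    """Return items that at least min_votes annotations labeled with the tag."""
    return sorted(item for item, tags in counts.items() if tags[tag] >= min_votes)


counts = tag_counts(ANNOTATIONS)
print(items_for_tag(counts, "paris"))
print(items_for_tag(counts, "paris", min_votes=2))
```

Requiring two votes drops the photo that only one person tagged, which is one simple way a community's tags can be filtered without a librarian's involvement.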
Formal taxonomies are usually created by a librarian or cataloger trained in library science. This can be effective, but it’s expensive, time-consuming and hard to keep up-to-date.
Sometimes Web masters help determine relevancy. Google’s search engine creates page ranks based on how frequently people link to a given piece of content. The downside to this is that most companies’ documents and data sources have little or no record of content linking. “That this is lost on most people is a triumph of branding and makes page-rank-free Google somewhat akin to caffeine-free Jolt as a product,” says Dave Kellogg, CEO of Mark Logic.
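The problem Kellogg describes shows up clearly in a simplified sketch. The Python below ranks pages by a plain count of inbound links, which is a deliberate simplification and not Google's actual PageRank algorithm; the page names and link lists are invented.

```python
from collections import Counter


def inbound_link_counts(links, pages):
    """Score each page by how many other pages link to it (self-links ignored)."""
    counts = Counter({page: 0 for page in pages})
    for source, target in links:
        if source != target:
            counts[target] += 1
    return counts


# Hypothetical Web pages that link to one another.
web_pages = ["a.html", "b.html", "c.html"]
web_links = [("a.html", "c.html"), ("b.html", "c.html"), ("c.html", "a.html")]
web_scores = inbound_link_counts(web_links, web_pages)

# Hypothetical enterprise documents: nobody links to a Word file or a PDF.
enterprise_docs = ["contract.doc", "claims.pdf", "memo.txt"]
enterprise_scores = inbound_link_counts([], enterprise_docs)

print(web_scores.most_common(1))
print(set(enterprise_scores.values()))
```

On the Web sample, one page clearly outranks the others. On the enterprise sample, every document scores zero, so link-based ranking has nothing to work with.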
Clustering tools, such as those from Engenium or Vivisimo, create an ad hoc taxonomy by grouping search results into categories on the fly (search engines from Inxight Software and Siderean also cluster results). With clustering, a search for the term “life insurance” on an insurance company’s site would display results grouped under headings such as Whole Life, Term Life and Employee Benefits. It’s a fast and efficient way to categorize content, but it’s not always accurate; there’s no consistent set of categories, and the results can be strange because there’s no human involvement.
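To give a feel for on-the-fly grouping, here is a minimal Python sketch that clusters result titles by word overlap (Jaccard similarity) after discarding the query terms. It is not how any of the vendors named above work; the titles, the greedy single-pass method and the 0.2 threshold are all assumptions chosen for illustration.

```python
def tokens(text, ignore):
    """Lowercased words of a title, minus the terms every result shares."""
    return {word for word in text.lower().split() if word not in ignore}


def jaccard(a, b):
    """Overlap between two word sets, from 0.0 (disjoint) to 1.0 (identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster(titles, query_terms, threshold=0.2):
    """Greedy one-pass clustering: join the first similar group, else start one."""
    clusters = []
    for title in titles:
        words = tokens(title, query_terms)
        for group in clusters:
            if jaccard(words, group["words"]) >= threshold:
                group["titles"].append(title)
                group["words"] |= words
                break
        else:
            clusters.append({"titles": [title], "words": set(words)})
    return [group["titles"] for group in clusters]


# Hypothetical results for the query "life insurance".
RESULTS = [
    "Term life insurance rates",
    "Whole life insurance explained",
    "Term life insurance quotes",
    "Employee benefits life insurance enrollment",
    "Whole life insurance cash value",
]

groups = cluster(RESULTS, query_terms={"life", "insurance"})
for group in groups:
    print(group)
```

With this sample the titles fall into term life, whole life and employee benefits groups. Reorder the input or nudge the threshold and the groups can shift, which illustrates why results from clustering can be inconsistent without human involvement.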
Combining Search Tools
The next step to intelligent search is to apply text analytics tools. Several small companies are providing analytics software for entity, concept and sentiment extraction.
Sentiment extraction, or sentiment monitoring, the newest of these tools, tries to identify the emotions behind a set of results. If, for example, a search uncovers 5,000 news articles about the Segway, sentiment extraction could narrow the set down to only those articles that are favorable. Products from Business 360, Fast, Lexalytics, NStein and Symphony all provide sentiment extraction. IBM has layered NStein technology on its OmniFind enterprise search platform to support “reputation monitoring,” so companies can know when their public image is becoming tarnished.
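The Segway example can be sketched with a crude word-list approach in Python. Commercial products use far more sophisticated linguistics; the word lists, sample articles and scoring rule here are assumptions meant only to show the filtering step.

```python
# Hypothetical cue-word lists; real products use much richer language models.
POSITIVE = {"great", "innovative", "love", "impressive"}
NEGATIVE = {"recall", "dangerous", "flop", "disappointing"}


def sentiment_score(text):
    """Positive cue words minus negative cue words found in the text."""
    words = [word.strip(".,!?;:\"'").lower() for word in text.split()]
    positives = sum(word in POSITIVE for word in words)
    negatives = sum(word in NEGATIVE for word in words)
    return positives - negatives


def favorable(articles):
    """Keep only the articles whose score is above zero."""
    return [article for article in articles if sentiment_score(article) > 0]


# Invented headlines standing in for a set of search results.
ARTICLES = [
    "The Segway is an innovative and impressive machine.",
    "Segway recall follows dangerous falls.",
    "Segway sales figures released today.",
]

favorable_articles = favorable(ARTICLES)
print(favorable_articles)
```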
Entity extraction uses various techniques to identify proper names and tag and index them. Inxight and ClearForest are the two leading providers of entity-extraction software, and many search tools embed or work with their technology.
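One of the simplest such techniques is a lookup list of known names, sometimes called a gazetteer. The Python sketch below assumes a tiny hand-built list and two invented documents; it tags the names it finds and builds an index from each name to the documents that mention it. It says nothing about how Inxight or ClearForest products work.

```python
import re
from collections import defaultdict

# Hypothetical gazetteer mapping known names to entity types.
GAZETTEER = {
    "IBM": "ORGANIZATION",
    "Mayo Clinic": "ORGANIZATION",
    "Paris": "LOCATION",
    "Marc Andrews": "PERSON",
}


def extract_entities(text):
    """Return (name, type) pairs for every gazetteer name found as a whole word."""
    found = []
    for name, label in GAZETTEER.items():
        if re.search(r"\b" + re.escape(name) + r"\b", text):
            found.append((name, label))
    return found


def build_entity_index(docs):
    """Map each extracted name to the set of document IDs that mention it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for name, _label in extract_entities(text):
            index[name].add(doc_id)
    return index


# Invented documents for illustration.
DOCS = {
    "d1": "Marc Andrews of IBM spoke in Paris.",
    "d2": "The Mayo Clinic adopted a framework from IBM.",
}

entity_index = build_entity_index(DOCS)
print(sorted(entity_index["IBM"]))
```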
Concept search tools put results in context, as in Paris the city versus Paris the person or Apple the company versus Apple the fruit. These tools use natural-language understanding techniques to make such distinctions. Autonomy and Engenium are two vendors of concept search software.
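The Apple example can be approximated, very roughly, by checking which sense's cue words appear near the ambiguous term. This Python sketch is a stand-in for real natural-language understanding, not a description of Autonomy's or Engenium's methods; the cue-word sets are invented.

```python
# Hypothetical cue words for each sense of an ambiguous term.
SENSES = {
    "apple": {
        "company": {"iphone", "mac", "computer", "shares", "software"},
        "fruit": {"pie", "orchard", "juice", "tree", "eat"},
    },
}


def disambiguate(term, text):
    """Pick the sense whose cue words overlap most with the text, if any do."""
    words = {word.strip(".,!?").lower() for word in text.split()}
    senses = SENSES[term]
    best = max(senses, key=lambda sense: len(senses[sense] & words))
    return best if senses[best] & words else None


print(disambiguate("apple", "Apple shares rose after the new Mac launch."))
print(disambiguate("apple", "She baked an apple pie from the orchard."))
```

When no cue word appears at all, the function returns `None` rather than guessing, which is the point where a real concept search tool would need deeper analysis.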
Adding a Backbone
Assuming you need more than one search technology, how do you knit disparate solutions together? IBM’s answer is the Unstructured Information Management Architecture (UIMA). Recently published on SourceForge.net, UIMA is an XML-based framework whose source code is available to vendors of third-party search technologies. It acts as a backbone into which text analytics and taxonomy tools can be plugged.
UIMA may sound like a gimmick to promote IBM’s OmniFind enterprise search product, but because its business is driven by services more than software, IBM is willing to pull in other, sometimes competing applications. “No single vendor can address all analytics needs or all requirements to understand unstructured information,” says Marc Andrews, IBM’s director of search and discovery strategy. “Companies need different analytics for different sets of content; [what's] relevant to the life sciences community will not be relevant to the financial services industry. And even within an organization, the analytics relevant to warranty claims and customer service data will be different from the analytics relevant to marketing, HR and generic interest.”
UIMA provides a common language so search results can be interpreted by different applications or analytics engines. The framework defines a common analysis structure whereby any content — whether it be an HTML page, a PDF, a free-form text field, a blob out of a database or a Word document — can be pulled into a common format and sent to a search tool. Results are fed back into the analysis structure and passed along to the next search tool. The final results are output in a common format that any UIMA-compliant application can use.
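The pattern described here, one shared structure handed from tool to tool, can be sketched in Python. This is not the UIMA API, which is a Java framework with its own interfaces; the `AnalysisStructure` class, the two annotators and the sample text are hypothetical and only illustrate the flow.

```python
from dataclasses import dataclass, field


@dataclass
class AnalysisStructure:
    """Hypothetical common format: the text plus annotations added by each tool."""
    text: str
    annotations: list = field(default_factory=list)


def entity_annotator(cas):
    """Toy tool: mark the character span of each known name."""
    for name in ("Segway", "IBM"):
        start = cas.text.find(name)
        if start != -1:
            cas.annotations.append(
                {"type": "entity", "begin": start, "end": start + len(name), "value": name}
            )
    return cas


def sentiment_annotator(cas):
    """Toy tool: label the whole text using a single cue word."""
    label = "negative" if "recall" in cas.text.lower() else "neutral"
    cas.annotations.append(
        {"type": "sentiment", "begin": 0, "end": len(cas.text), "value": label}
    )
    return cas


def run_pipeline(text, annotators):
    """Wrap the text once, then pass the same structure through every tool."""
    cas = AnalysisStructure(text)
    for annotate in annotators:
        cas = annotate(cas)
    return cas


result = run_pipeline("Segway announces a recall.", [entity_annotator, sentiment_annotator])
for annotation in result.annotations:
    print(annotation)
```

Because each tool reads and writes the same structure, tools can be added, removed or reordered without any of them knowing about the others, which is the appeal of a backbone.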
Can UIMA become a universally accepted backbone that holds search tools together? Some think UIMA is on its way to becoming a de facto standard. So far, the Mayo Clinic, Sloan Kettering and the Defense Advanced Research Projects Agency are adopting the framework, and 15 vendors, including Attensity, ClearForest, Cognos, Inxight, NStein and Siderean, have agreed to make their search tools UIMA-compliant.
In a case of co-opetition, Endeca will support UIMA in an upcoming release of its enterprise search software even though the company competes with iPhrase, which was acquired last year by IBM. “UIMA will uncomplicate the world,” says Phil Braden, Endeca’s director of product management. “As more and more people adopt UIMA as the standard for how structured and unstructured data is supposed to look and how these components are supposed to integrate, it becomes that much easier to pull data from these different systems into Endeca.”
There’s little to challenge UIMA other than a couple of XML initiatives that also address the standardization of data formats for search engines. One such initiative is Exchangeable Faceted Metadata Language, an open XML format for publishing and connecting faceted metadata between Web sites, but that standard doesn’t have the momentum of something being pushed by IBM.
Not every company, of course, will go to all the lengths described here to architect accurate search. For some, keyword search and placement of documents in well-labeled electronic folders will suffice. The sophisticated search pioneers are e-commerce sites, pharmaceutical companies and government agencies, which have the most to gain: greater sales, faster drug development, detection of terrorist activity. Call centers are getting search makeovers so that multiple search tools can mine unstructured content and databases together and give reps all the information they need to close calls. What could broader and more accurate searches achieve in your company?