Not your father's search engine
Tomorrow's search technologies will go beyond just finding documents.
by Alan Joch
The world's awash in information, and as data volumes continue to grow each year, so does the number of content providers and sources of information. Web sites, document management systems, intranet portals, e-mail programs, storage arrays and commercial databases all make ferreting out what's relevant tough, even with today's efficient search engine technologies.
But finding essential data and turning it into meaningful information is an even bigger challenge. "Search alone is not enough," says Claire Gronemeyer, product manager for Inxight Software Inc., a Sunnyvale, Calif., supplier of information analysis software. "You either get too many results or you get nothing that is relevant."
Now experts within the research departments of commercial companies, university labs and R&D institutions are hammering away at this information challenge. And while any breakthroughs from these research efforts promise to help almost everyone in our information-driven culture, two groups in particular stand to benefit: the intelligence community, for anti-terrorism activities, and large enterprises, as they fine-tune business processes to compete in a global economy.
Here are four projects that preview tomorrow's more sophisticated data search and analysis products.
Real-time queries
One reason we're deluged with data is that our information systems are becoming interconnected as never before—data isn't always just in one central location. It flows through networks and arrives at our PCs in high-volume streams. What's the answer for getting it all under control?
Turn some traditional data management thinking on its head, responds Michael Franklin, a computer science professor at the University of California, Berkeley. He and fellow Berkeley professor Joe Hellerstein head the Telegraph Project, which is researching infrastructures for querying streaming data from sensors, logs and peer-to-peer systems. The project's system, known as TelegraphCQ (Continuous Query), is meant to continuously monitor real-time updates to information. "You can put a stream processing engine in the middle of any data flow and then use database queries to monitor and analyze what's going on in those flows," Franklin says.
The TelegraphCQ approach is an alternative to traditional database queries that probe storehouses of information, find the relevant data, massage it and return an answer. With TelegraphCQ's approach, analysts store the queries as well as the data in order to glean results from information that arrives on the fly. For example, a sales manager might store a query for order closing rates. Whenever the sales system updates its lists of new contracts, the query captures the real-time closing data and reports the results to the manager.
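The stored-query idea can be illustrated with a small sketch. This is not TelegraphCQ's interface, which the article does not describe; the class, method and field names below are hypothetical, and a plain Python function stands in for a database query. The point is only the inversion: the query is registered once, and each arriving record is pushed through it.

```python
from typing import Callable, Dict, List


class ContinuousQueryEngine:
    """Toy engine (hypothetical): queries are stored, data flows past them."""

    def __init__(self) -> None:
        self._queries: Dict[str, Callable[[List[dict]], float]] = {}
        self._seen: List[dict] = []          # records that have arrived so far
        self.results: Dict[str, float] = {}  # latest answer per stored query

    def register(self, name: str, query: Callable[[List[dict]], float]) -> None:
        """Store a query so it is re-evaluated on every future arrival."""
        self._queries[name] = query

    def on_arrival(self, record: dict) -> Dict[str, float]:
        """Handle one new record and refresh every stored query's result."""
        self._seen.append(record)
        for name, query in self._queries.items():
            self.results[name] = query(self._seen)
        return dict(self.results)


def closing_rate(orders: List[dict]) -> float:
    """Share of orders seen so far whose status is 'closed'."""
    closed = sum(1 for order in orders if order["status"] == "closed")
    return closed / len(orders)


engine = ContinuousQueryEngine()
engine.register("closing_rate", closing_rate)

# Simulated updates from a sales system (made-up data).
updates = [
    {"id": 1, "status": "closed"},
    {"id": 2, "status": "open"},
    {"id": 3, "status": "closed"},
    {"id": 4, "status": "closed"},
]
for update in updates:
    latest = engine.on_arrival(update)

print(latest)
```

A production stream engine would bound the history with a time or count window rather than keep every record, as this sketch does for simplicity.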
Franklin believes one future application for streaming-data queries may be for live information flowing in from RFID monitors. "Everyone is worrying about the fire hose of data that's going to come from all those devices," Franklin says. His answer is the HiFi (High Fan-In) system. HiFi creates an information pipeline with a number of aggregation points that gradually condense information into smaller and more meaningful chunks to whittle the volume down to a manageable size by the time it arrives at a manager's desk.
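A rough sketch of the fan-in idea follows, assuming three aggregation levels (reader, site, region). The levels, function names and sample tag IDs are illustrative assumptions, not details of HiFi itself. Each stage passes along less data than it received.

```python
from typing import Dict, List, Set


def reader_level(raw_reads: List[str]) -> Set[str]:
    """Collapse repeated reads of the same tag at a single reader."""
    return set(raw_reads)


def site_level(reader_outputs: List[Set[str]]) -> int:
    """Merge all readers at one site; a tag seen by several readers counts once."""
    tags: Set[str] = set()
    for output in reader_outputs:
        tags |= output
    return len(tags)


def region_level(site_counts: Dict[str, int]) -> Dict[str, int]:
    """Reduce per-site counts to one summary for the manager's desk."""
    return {"sites": len(site_counts), "items": sum(site_counts.values())}


# Made-up raw reads: site -> list of readers -> list of tag IDs.
raw = {
    "warehouse_a": [["t1", "t1", "t2"], ["t2", "t3"]],
    "warehouse_b": [["t9", "t9", "t9"]],
}

site_counts = {
    site: site_level([reader_level(reads) for reads in readers])
    for site, readers in raw.items()
}
summary = region_level(site_counts)
print(summary)
```

Here eight raw reads shrink to a two-field summary by the time they reach the top of the pipeline.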
Franklin says interest in the streaming-data queries also comes from the intelligence community, which would use it for processing the reams of terrorist-suspect communications, and from the financial industry, which would use it in fraud detection applications. "It could work for any applications where you have diverse data flows that need to be integrated and monitored at the same time," he adds. "We're applying 30 years of query optimization and processing technology in a real-time fashion."
Context is everything
The World Wide Web Consortium (W3C), the creator of seminal standards for the Web, is turning its attention to a related area, the Semantic Web. "The Semantic Web provides a common framework that allows data to be shared and reused across application, enterprise and community boundaries," says Eric Miller, W3C Semantic Web activity lead, in Cambridge, Mass.
The Semantic Web highlights a growing realization that people don't want to just search for information anymore. They want to simultaneously acquire information from disparate sources and effectively manage the information once they receive it, Miller believes. "Once you start thinking of objects and their contextual relationships to other objects, you begin to see how you can reduce costs and make it more effective to manage, organize and search for data," he says.
The W3C's resource description framework (RDF) is the basic building block for creating contextual relationships. "It's a model for uniquely identifying objects and uniquely identifying the relationships that exist between objects," according to Miller.
For example, if an enterprise creates two separate databases, one listing customer names and the other listing the names of customers' employers, joining the two may be difficult if the databases don't use the same schemas and data definitions. With standard descriptors defined within the Semantic Web framework, analysts can create new bits of data that indicate "the ID for 'customer' in this database is equivalent to 'company' in this other database." That equivalence is then recorded in a way that others can easily reuse. Taking advantage of "equivalent to" relationships is the beginning of unlocking and connecting data held in traditional databases and making it more accessible to other applications, Miller contends. "You don't necessarily change the relational database model but rather provide a means for exposing the data in a common way that allows this to be integrated with various structured and semi-structured data sources. It looks like just a bunch of network data that can be stitched together with other data," he says.
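To make the "equivalent to" idea concrete, here is a minimal sketch that models statements as subject-predicate-object triples, the general shape RDF uses. The identifiers (crm:customer, hr:company, ex:equivalentTo) and the sample records are invented for illustration and are not W3C vocabulary; a real deployment would use an RDF library and standard terms.

```python
from typing import List, Set, Tuple

Triple = Tuple[str, str, str]

EQUIVALENT_TO = "ex:equivalentTo"  # hypothetical predicate name

# Made-up data from two databases, plus one bridging statement.
triples: Set[Triple] = {
    ("crm:cust42", "crm:customer", "Acme Corp"),
    ("hr:emp7", "hr:company", "Acme Corp"),
    ("hr:emp7", "hr:name", "J. Smith"),
    ("crm:customer", EQUIVALENT_TO, "hr:company"),
}


def equivalent_predicates(predicate: str, data: Set[Triple]) -> Set[str]:
    """Return the predicate plus everything declared equivalent to it."""
    found = {predicate}
    frontier = [predicate]
    while frontier:
        current = frontier.pop()
        for subject, pred, obj in data:
            if pred != EQUIVALENT_TO:
                continue
            # Treat the equivalence as symmetric.
            for left, right in ((subject, obj), (obj, subject)):
                if left == current and right not in found:
                    found.add(right)
                    frontier.append(right)
    return found


def subjects_with(predicate: str, value: str, data: Set[Triple]) -> List[str]:
    """Find subjects whose predicate, or any equivalent one, has the value."""
    predicates = equivalent_predicates(predicate, data)
    return sorted(s for s, p, o in data if p in predicates and o == value)


matches = subjects_with("crm:customer", "Acme Corp", triples)
print(matches)
```

Neither database schema changes; the single bridging triple is what lets one query span both sources.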
Easier data integration is one potential benefit for commercial enterprises. "It can happen as needed, and that idea of as-needed data integration is a very powerful one," Miller believes.
Seeing is understanding
"Sense making" is one of the prime research goals at the Palo Alto Research Center (PARC), the storied California R&D center. "Sense making is interpretation under the gun," says Stephen Smoliar, a member of research staff. "Under the gun," in his view, means that someone needs to find and extract meaning from an extremely large and heterogeneous data set under a demanding deadline.
The answer is tools that "tease out meaning" from piles of raw data, he adds, and one tool type at PARC relies on data visualization, the graphical representations of information that help people organize and navigate through the information.
PARC is developing interactive visualization interfaces using a technique it calls Focus+Context. "You see the whole ball of wax" of data in a relatively low-resolution form, Smoliar explains. "You interact with that display to decide where you want your focal point to be." The visualization interface renders the focus area into a higher-resolution image that helps analysts pull out the information of greatest interest from the rest of the display, he adds.
PARC is using this approach to visualize entities and relations extracted from text sources. The technology looks for and highlights relationships among the various subjects in a data set, such as pages of reports or e-mail transcriptions. One case study PARC performed for an intelligence agency centered on bio-warfare threat analysis, using the entire text of a book bought from Amazon.com.
Broad reach
Rather than forcing people to log into every relevant news site, internal knowledgebase and commercial content source to find nuggets of information, the SmartDiscovery Awareness Server, from Inxight Software, provides a federated search engine. The technology probes relevant information sources to return results in a consolidated form. "(The SmartDiscovery server) then automatically categorizes, 'de-dupes' (removes duplications) and analyzes the information," says Gronemeyer.
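The federated pattern can be sketched in a few lines. This is an illustration only: the source functions, result fields and the title-based duplicate check are assumptions, not Inxight's implementation, and the categorization and analysis steps are left out.

```python
from typing import Callable, Dict, List

Hit = Dict[str, str]


# Hypothetical sources; each would normally call a separate system.
def search_news(query: str) -> List[Hit]:
    return [{"title": "Port Security Update", "source": "news"}]


def search_intranet(query: str) -> List[Hit]:
    return [
        {"title": "port  security update", "source": "intranet"},
        {"title": "Q3 Shipping Report", "source": "intranet"},
    ]


def normalize(title: str) -> str:
    """Lowercase and collapse whitespace so near-identical titles match."""
    return " ".join(title.lower().split())


def federated_search(query: str, sources: List[Callable[[str], List[Hit]]]) -> List[Hit]:
    """Query every source, then merge results and drop duplicates."""
    seen = set()
    merged: List[Hit] = []
    for source in sources:
        for hit in source(query):
            key = normalize(hit["title"])
            if key in seen:
                continue  # the "de-dupe" step: keep the first copy only
            seen.add(key)
            merged.append(hit)
    return merged


results = federated_search("port security", [search_news, search_intranet])
print([hit["title"] for hit in results])
```

The user issues one query and gets one consolidated list, instead of logging into each source in turn.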
A newly formed Inxight research group recently received a $1.93 million contract for a research laboratory in Rome, N.Y., to apply Inxight's technology to high-speed processing of intelligence data. Government applications for Inxight's products include counterterrorism solutions as well as intelligence and law enforcement uses.
The next step for the SmartDiscovery server will be to make use of entity-extraction filters that use linguistics analysis to break up text into nouns, verbs and subjects of sentences. The extraction software looks at the larger linguistic constructs and determines that in a particular context one proper noun refers to a person's name, while another is the name of a city. The SmartDiscovery server then establishes relationships among the names and activities being described to extract meaning from large masses of text.
"The intelligence community is very interested in this kind of extraction for field reports about terrorists," says Gronemeyer. "Instead of manually going through all of the information, analysts can simply see who the names are, where the (terrorist) networks are" and act on the intelligence, she adds.
The same techniques may be applied to business intelligence and text analytics for commercial companies for use in customer relationship management (CRM), enterprise content management (ECM) and risk management applications, she adds. For example, a pharmaceutical company might mine reports of adverse drug reactions to look for a link between a particular symptom and a particular drug brand. "The company might go back through its customer logs, its feedback forms and other sources to see where problems may lie," Gronemeyer says. "You drill down to the point where you don't have thousands and thousands of hits but rather a few hits that are focused on the specific content you are interested in."
Alan Joch is a New England-based business and technology writer whose work has appeared in The New York Times, Fortune Small Business and Byte.
© Teradata Magazine-March 2006