March 2007


The goals of Hypothia are to integrate social feeds, capture embedded ad delivery, reinvent authentic weblogs, capture viral blogospheres, and harness A-list podcasts.

If you think that my last sentence was utterly full of nonsense, you’re right.

Specifically, that sentence came straight out of the Web 2.0 Bullsh*t Generator. It’s worth a laugh. And naturally, it’s in Beta.


1. Introduction
Pharmacogenomics experts have recognized that genomics-based approaches to drug discovery appear to suffer from some sort of information overload problem (A. D. Roses, Burns, Chissoe, Middleton, & Jean, 2005, p. 179). More specifically, the explosion of human genomics information may have been outpaced by a concurrent explosion of noise within that data, leading to a significant attrition rate in the pharmaceutical pipeline (A. D. Roses et al., 2005, p. 179). However, it is not entirely clear how the concepts of information overload and signal-to-noise apply to information-based struggles in pharmacogenomics. In order to improve our understanding of the barriers to optimal use of pharmacogenomics information for drug discovery purposes, we must first briefly unpack competing ideas about information overload and signal-to-noise, and then contextualize the appropriate ideas within PGx-based drug discovery (henceforth PGx-DD).

2. Explaining Too Much Information in PGx-based Drug Discovery: Information Theory or Information Overload?

Genomics research pioneer and GSK Senior VP for Genomics Research Allen Roses has recently shed light on why pharmacogenomics-based approaches may not be optimal. According to Roses, who is arguably in a unique position to understand the issue, the central problem arises from struggles with information. Roses writes,

What factors have limited target selection and drug discovery productivity? Although HTS technologies were successfully implemented and spectacular advances in mining chemical space have been made, the universe for selecting targets expanded, and in turn almost exploded with an inundation of information. Perhaps the best explanation for the initial modest success observed was the dramatic increase in the ‘noise-to-signal’ ratio, which led to a rise in the rate of attrition at considerable expense. The difficulty in making the translation from the identification of all genes to selecting specific disease-relevant targets for drug discovery was not realistically appreciated (A. D. Roses et al., 2005, p. 179).

What Roses calls the “noise-to-signal” ratio sounds like the problem of information overload, yet it also sounds as if it borrows from the language of Information Theory as put forth by Claude Shannon. Roses’ insight seems to corroborate Sean Ekins’ observation that already-extant data is not optimally utilized (Ekins et al., 2005). Pharmacogenomics is failing to deliver because PGx researchers, and the organizations utilizing PGx research, have been unable to meet the information challenges concomitant with the explosion of data.

The language Allen Roses uses to describe struggles with information in the field of PGx-based drug discovery refers both to a signal-to-noise ratio and to information overload. The terminology appears, however, to be rather ambiguously utilized in the context of PGx-DD. “Noise-to-signal” seems to refer to Claude Shannon’s mathematical theory of communication (Shannon & Weaver, 1949), while the problems described by PGx professionals sound more like cognitive issues related to more formal notions of information overload.

2.1. Shannon’s Mathematical Theory of Communication
In 1948, Claude Shannon of Bell Labs completed work on his mathematical theory of communication. For this work, Shannon is credited as the father of the field of Information Theory. It is from Shannon’s theory that the notion of signal-to-noise arises, among many other concepts crucial to any understanding of information. In his introduction to the book publication that followed, comprising Shannon’s work on the theory, Warren Weaver explains that the theory was intended to deal with three distinct levels of communications problems, as follows:

Level A. How accurately can the symbols of communication be transmitted? (The technical problem.)

Level B. How precisely do the transmitted symbols convey the desired meaning? (The semantic problem.)

Level C. How effectively does the received meaning affect conduct in the desired way? (The effectiveness problem.) (Shannon & Weaver, 1949, p. 4)

Shannon does not use ‘information’ in its ordinary sense. While by ‘information’ we ordinarily mean something akin to that which has already been said or written, Shannon means information in the sense of what may possibly be said (Shannon & Weaver, 1949, p. 8). For Shannon, information is a matter of the probable messages sent over a channel (e.g., a telephone wire), and his concern is with describing general properties of the transmission and interpretation of such electronic signals.
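
To make that probabilistic sense slightly more concrete: Shannon measures the information produced by a source in terms of the probabilities of its possible messages. The familiar form of that measure (standard in Shannon & Weaver, though not quoted above) is:

```latex
% Information (entropy) of a source whose possible messages occur with
% probabilities p_1, ..., p_n, measured in bits per symbol:
\[ H = -\sum_{i=1}^{n} p_i \log_2 p_i \]
```

The more evenly probable the possible messages, the higher H, and hence the more “information” any single message carries in Shannon’s sense.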

Concerns about the ratio of signal to noise in information transmission do originate in Shannon’s own communication theory work. The ratio itself appears in Shannon’s theoretical examination of channel capacity with power limitation (Shannon & Weaver, 1949, p. 100). Shannon uses the ratio of the power of the signal source (denoted as P) to the power of the noise (denoted as N) to provide a general way of calculating how many bits per second any communication pathway can actually transmit. Shannon then replaces P with S, the peak allowed transmitter power, to adjust channel capacity for cases where peak power limits the rate at which the channel can transmit bits. According to Shannon, when the signal-to-noise ratio is low, the upper bound on the rate of a channel is the channel band times the logarithm of the ratio of signal plus noise to noise (Shannon & Weaver, 1949, p. 107). Loosely speaking, the rate at which telephone wires, coaxial cables, wireless networks, and the like can transmit messages varies logarithmically with the ratio of peak power (signal) to background noise on the channel (noise).
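
For reference, the standard formulas behind that paraphrase (with W the channel band, P the average signal power, N the noise power, and S the peak allowed transmitter power, following Shannon’s notation) are:

```latex
% Capacity of a band-limited channel with average power limitation,
% in bits per second:
\[ C = W \log_2 \frac{P + N}{N} \]

% Upper bound for the peak-power-limited case when S/N is small:
\[ C \le W \log_2 \frac{S + N}{N} \]
```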

Shannon & Weaver’s specified problem set does not accurately match the sort of problem a drug discovery researcher is facing, not at least without a considerable stretch. Shannon’s sense of information in his definitive work on communication theory does not seem quite the same as the sort of information we are dealing with when we speak of genomics research data. Finally, Shannon’s notion of signal-to-noise can at best only loosely apply to notions of researchers struggling with too much information in their hands. Shannon is writing about communication channels, not people.

Any efforts Shannon may have made to model specifically human communication appear at best peripheral to the central thrust of his work, which was to generalize the properties of electronic communication systems. In short, Information Theory as proffered by Shannon does not apply in any straightforward way to the sort of “noise-to-signal” problem Allen Roses describes, or to any other human communication problems that occur independently of electronic signals. The signal-to-noise problem Roses reports is certainly an information problem, but it is one unlikely to be either explained or resolved through the lens of Shannon’s communication theory.

2.2. Information Overload
The concept of the possibility of too much information dates back to ancient times (Bawden, Holtham, & Courtney, 1999, p. 249). The recurring concern of information overload stems from the general notion that a person’s work becomes inefficient as it grows more difficult to locate the best pieces of information. With the advent of computer-based information retrieval systems in the 1950s (Bawden et al., 1999, p. 249), as well as the beginnings of the mass proliferation of scientific research literature (Ziman, 1980), the concern became more frequently and more directly articulated and investigated. While any exact definition of information overload is elusive, issues of relevance and efficiency are commonly noted, as are issues of both data management and psychic strain (Bawden et al., 1999, p. 250). The constant, however, is that information overload stands for a struggle, one that increases as a collection of information grows beyond human tractability. The recurring solution inevitably takes the form of methods or techniques that allow a person to locate, in a reasonable amount of time, a tractable set of sufficiently high-quality pieces of information for completing the task at hand.

3. Impact of information overload on PGx-based drug discovery
Information overload, rather than Shannon’s theory, better describes the general “noise-to-signal” problem Allen Roses refers to. Roses characterizes the information problem facing PGx-DD as having increased the rate of attrition of drug candidates in the pharmaceutical pipeline. Further, he states that the solution to the problem is an increase in “specific, disease-relevant targets” relative to all genomic data (A. D. Roses et al., 2005, p. 179). In other words, the proliferation of genomic data has drowned out highly specific, disease-relevant genomic information to the point that drug discovery failures increase. The way to resolve the issue is to reduce information overload in PGx-DD by restricting the flow of information to PGx researchers to highly specific, disease-relevant genomic information. As Roses says, providing researchers with validating evidence is crucial.

4. Validating evidence, novelty, and a PGx-info quality model
What, however, frames, delimits, or describes validating evidence for candidate targets? Roses states that disease-specific targets chosen on the basis of well-trod beliefs “have a significant probability of being the totally wrong target” (A. D. Roses et al., 2005, p. 180). It is therefore not enough to identify highly specific, disease-relevant data efficiently; the data must support infrequent or entirely novel theories. The data must, in essence, have the characteristic of supporting novelty: supporting ideas not commonly held and bolstering theories that at first appear unreasonable.

The quality of PGx information should be evaluated using the following three criteria:

(a) the disease-relevance of the information,

(b) the specificity of the information, and

(c) the novelty of the information or the novelty of the theory supported by the information.
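
As a purely illustrative sketch (not part of the thesis itself), these three criteria could be combined into a simple scoring function for ranking candidate findings. The 0-to-1 scales, the weights, and the idea of collapsing the criteria into a single score are my own assumptions:

```python
# Illustrative only: rank PGx findings by disease-relevance, specificity,
# and novelty. The numeric scales and weights are invented for the sketch.
from dataclasses import dataclass

@dataclass
class PGxFinding:
    disease_relevance: float  # 0.0-1.0: how directly tied to the disease
    specificity: float        # 0.0-1.0: how narrowly targeted the data is
    novelty: float            # 0.0-1.0: how uncommon the supported theory is

def quality_score(f: PGxFinding, weights=(0.4, 0.3, 0.3)) -> float:
    """Combine the three criteria into one rough quality score."""
    w_rel, w_spec, w_nov = weights
    return w_rel * f.disease_relevance + w_spec * f.specificity + w_nov * f.novelty

candidates = [
    PGxFinding(disease_relevance=0.9, specificity=0.8, novelty=0.2),
    PGxFinding(disease_relevance=0.7, specificity=0.6, novelty=0.9),
]
for f in sorted(candidates, key=quality_score, reverse=True):
    print(round(quality_score(f), 2), f)
```

In practice, the weighting, and whether a single scalar score is even appropriate, would need to be validated against real attrition data.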

Sources

Bawden, D., Holtham, C., & Courtney, N. (1999). Perspectives on information overload. Aslib Proceedings, 51(8), 249-255.

Ekins, S., Bugrim, A., Nikolsky, Y., & Nikolskaya, T. (2005). Systems biology: Applications in drug discovery. In S. C. Gad (Ed.), Drug discovery handbook (pp. 123-183). Hoboken, New Jersey: Wiley Interscience.

Roses, A. D., Burns, D. K., Chissoe, S., Middleton, L., & Jean, P. S. (2005). Disease-specific target selection: A critical first step down the right road. Drug Discovery Today, 10(3), 177-189.

Shannon, C. E., & Weaver, W. (1949). The mathematical theory of communication. Urbana and Chicago: University of Illinois Press.

Ziman, J. M. (1980). The proliferation of scientific literature – a natural process. Science, 208(4442), 369-371.

(NOTE: the preceding document is a revised excerpt from my master’s thesis.)

“The best way to predict the future is to invent it”
– Alan Kay

“Don’t listen to the physics majors until you also check with the Vaudevillians.”
– Tom Notti, The Bubble Guy

In my introduction to Hypothia I briefly referred to a paradigm shift in the web that I wish to participate in. That paradigm shift, as I imagine it, is the change from information retrieval (IR) to information generation as the core technology for utilizing the web.

Web 1.0 was basically the infancy of the web, the Mosaic-Netscape-AltaVista web. Web 2.0 has ushered in a user-centric era, one in which the winners will be those who effectively repurpose the user data assets generated by agile services.

We can safely say that the web is no longer in its infancy but rather in its adolescence. The somewhat adolescent appeal of paradigmatic Web 2.0 applications (e.g., YouTube, MySpace) is no mere accident but rather a reflection of the youth-oriented culture of innovation and capital that is Silicon Valley, mistaking youth appeal for broad appeal. Mistaking it, or perhaps pushing it and making it so.

The good news about the mistaken horizontal appeal is that there’s still quite a lot of room for the web to grow in terms of utility and users. The world is far from wired. Heck, Sequoia’s grandmom probably isn’t too keen on “the computer nonsense” yet. And why on earth would she be, other than for photos of the grandkids? OK, Google may be usable by most grandmothers, something which has been every bit as important to Google’s success as its PageRank algorithm.

The trend towards Web 3.0, the Semantic Web, has already begun to sprout. Instead of social web applications with so-called “horizontal” appeal (“so-called”, because if you are over 40 and on Friendster, you might get some strange looks from other users), we are already beginning to see niche social tools. In other words, Web 2.0 tools are slowly becoming vertically oriented. Services such as LinkedIn, geared towards a semi-broad niche of white-collar professionals looking to “network”, are succeeding even as more specialized and vertically-oriented tools appear.

Tools such as LinkedIn, Gwagle, and Trip usher in the Semantic Web simply because narrowing the content scope lets the content being searched map better to meaning. In other words, the terms users choose to search these tools become less ambiguous as the content scope shrinks. And as terms map better to their intended senses, such sites make ontology building and use more and more feasible. We can see this, for example, on Trip, an evidence-based medical search tool that provides faceted search features and leverages the admittedly rudimentary MeSH ontology. (I would add that disambiguation is best performed by the contours of context rather than by any set of rules applied to document collections. This emergent nature of disambiguation, and the concomitant necessity for ambiguity in understanding, is best saved for a later discussion.)

But it’s not just the narrowing semantic spaces that help usher in the Semantic Web. It’s also the richer sets of user data, things like tags and search terms, applied to specific domains, that help automate user-responsive architectures, expanding the possibilities for advanced analytics and responsive content.

Another indication that the Semantic Web, in the sense of vertically-oriented semantic retrieval, is on its way is the work of George Miller’s research group at Princeton. Miller is renowned for a number of things, among them the creation and development of WordNet. Christiane Fellbaum, a colleague of Miller’s and long-time participant in the WordNet project, has apparently initiated work on a project called Medical WordNet. Unlike WordNet, Medical WordNet will benefit from the fact that it will be applied to a much narrower semantic space. It will add specialized terms not in WordNet while limiting the senses of, and relations between, terms shared with WordNet.
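
As a rough illustration of why a narrower semantic space helps (my own example, using the general-purpose WordNet via NLTK rather than Medical WordNet itself, which I have not seen):

```python
# Minimal sketch: compare how many senses a term has in general-purpose
# WordNet versus a crudely restricted "medical-ish" subset. Requires NLTK
# and its WordNet data (pip install nltk; python -m nltk.downloader wordnet).
from nltk.corpus import wordnet as wn

term = "cold"
senses = wn.synsets(term)
print(f"{term!r} has {len(senses)} senses in general WordNet:")
for s in senses:
    print(f"  {s.name():25s} {s.definition()}")

# A domain-restricted resource would keep only the clinically relevant
# senses. This keyword filter is a crude stand-in for that editorial work,
# not an actual Medical WordNet mechanism.
medical_senses = [s for s in senses
                  if "illness" in s.definition() or "infection" in s.definition()]
print(f"\nWithin a medical scope, {term!r} keeps {len(medical_senses)} sense(s).")
```

The point is not the filter itself but that fewer candidate senses per term makes both search and ontology alignment considerably easier.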

Yet another indication of the rise of the Semantic Web is the finalization of the XQuery standard along with the development of XML content servers. Simply put, why invest months learning, say, data warehousing and OLAP cubes when you can implement advanced linguistic representations in XML and query them with an amazingly simple scripting language? Further, with XML content servers such as Mark Logic or eXist, you can query document collections and synthesize new documents, taking pieces of multiple documents and assembling them, bound only by the limits of XPath and whatever heuristics you can add.
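
Here is a toy sketch of that document-synthesis idea. It uses Python and lxml rather than XQuery or an XML content server, and the file names, element names, and confidence attribute are all invented for the illustration:

```python
# Toy sketch of XPath-driven document synthesis (pip install lxml).
from lxml import etree

doc_paths = ["report_a.xml", "report_b.xml"]   # hypothetical document collection
synthesis = etree.Element("synthesis")

for path in doc_paths:
    tree = etree.parse(path)
    # Pull every <finding> whose confidence attribute exceeds 0.8
    # and graft it into the newly synthesized document.
    for finding in tree.xpath("//finding[@confidence > 0.8]"):
        synthesis.append(finding)

print(etree.tostring(synthesis, pretty_print=True).decode())
```

An XML content server does essentially this, but server-side and over an entire indexed collection rather than a handful of local files.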

But OK, so with Web 3.0 we will have basic semantics incorporated into our content and the ability to leverage meaning in order to find what we want to find. In the article in which he coined the phrase “Semantic Web,” Tim Berners-Lee speculated extensively about the possibility of using meaning-annotated content to make basic deductions. But while the infusion of meaning into information retrieval is well under way, the infusion of domain rules for drawing conclusions from that which is retrieved is not nearly so imminent.
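
To make “basic deductions” concrete, here is a toy example of my own (not drawn from Berners-Lee’s article): statements annotated as subject-predicate-object triples, plus one hand-written domain rule that derives a new statement from two retrieved ones.

```python
# Toy forward-chaining deduction over subject-predicate-object triples.
# The facts and the rule are invented purely for illustration.
facts = {
    ("aspirin", "inhibits", "COX-1"),
    ("COX-1", "participates_in", "prostaglandin synthesis"),
}

def infer(facts):
    """Rule: if X inhibits Y and Y participates_in Z, then X may_affect Z."""
    derived = set()
    for (x, p1, y) in facts:
        if p1 != "inhibits":
            continue
        for (y2, p2, z) in facts:
            if y2 == y and p2 == "participates_in":
                derived.add((x, "may_affect", z))
    return derived

print(infer(facts))
# -> {('aspirin', 'may_affect', 'prostaglandin synthesis')}
```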

So Web 3.0 will be yet another era of information retrieval in the literal sense. Finding information will be refined into domain-specific, context-limited, user-friendly, meaning-aware, document-synthesizing retrieval. But without the maturation of reasoning, retrieval will remain nothing more than that: retrieval. Information processing tools will remain stuck on regurgitation, however elaborate the regurgitation may be.

With all of the power of retrieval extended and leveraged, and with the introduction of vertically-oriented ontology tools such as Medical WordNet, the next crush will be to develop systems that think. Well, thinking? No, not really thinking. Rather, with the application of context, domain, and user-tuned semantics to search, the need to develop domain-specific heuristics will become readily apparent. Instead of general question-answering systems that field questions of all kinds, I imagine we’ll see services dedicated to solving single problems quite well.

What’s crucial about solving problems with information retrieval tools is that the output of rule-using systems is novel content. In other words, such tools are no longer merely finding existing content but rather creating new content. And the creation of useful information places a heavy burden on evaluating statements for their quality.

The long-term goal of Hypothia is to pursue the development of problem-specific information generation services, with a particular eye on scientific discovery. Hypothia, in short, aims to become the first innovation service and to help usher in Web 4.0. It’s far-off, far-fetched, far-out, and maybe a bit ridiculous as a vision, but someone’s got to create the future.

Hypothia is the name of my new venture: a new organization dedicated to innovation, leveraging the power of text mining and advanced analytical strategies for vertical domains. Hypothia aims to release a set of next-generation information tools with the ultimate goal of replacing search with generation.

The shift from information retrieval (IR) to information generation is a subtle yet revolutionary shift in the way we interact with information.

On this blog I hope to discuss various strategies, technologies, and general ideas that might contribute to this paradigm shift. As Google moves from a business strategy of innovation in IR to a strategy of product and service diversification, it creates a tremendous opportunity for everyone else to invent the next best solutions.

I have a number of core interests with respect to text mining. I believe in the concept of “know thy data.” Hence I believe that myriad complexities in text mining can be reduced, and application usability maximized, by concentrating on specific problem areas. Most of my own work has concentrated on health, from drug discovery to consumer health to clinical diagnosis. I am also fascinated by applying mining strategies to other areas, such as content management, commodities forecasting, real estate pricing analysis, and even sports analysis.

If you wish to learn a little more about me, please see my personal page (http://patrickherron.com). You can find additional information about my previous academic research on my text mining co-op search page (http://proximate.org/tm) and you can learn more about my creative writing and publications on my writing bio page (http://proximate.org/bio).