Verticals


Virtual Peace (http://virtualpeace.org) is alive as of last evening.

For the last gosh-don’t-recall-how-many-months I’ve been working as a Project Collaborator for a project envisioned by the other half (more than half) of the Jenkins Chair here at Duke, Tim Lenoir.  For those of you who don’t know Tim, he’s been a leading historian of science for decades now, helping found the History and Philosophy of Science program at Stanford.  Tim is notable in part for changing areas of expertise multiple times over his career, and most recently he’s shifted into new media studies.  This is the shift that brought him here to Duke and I can’t say enough how incredible an opportunity it is to work for him.  We seem to serve a pivotal function for Duke as people who bring together innovation with interdisciplinarianism.

What does that mean? Well, like the things we study, there are no easy simple narratives to cover it.  But I can speak through examples.  And the Virtual Peace Project is one such example.

Tim, in his latest intellectual foray, has developed an uncanny and unparalleled understanding of the role of simulation in society.  He has studied the path, no, wide swath of simulation in the history of personal computing, and he developed a course teaching contemporary video game criticism in relation to the historical context of simulation development.

It’s not enough to just attempt to study these things in some antiquated objective sense, however.  You’ve got to get your hands on these things, do these things, make these things, get some context. And the Virtual Peace project is exactly that. A way for us to understand and a way for us to actually do something, something really fantastic.

The Virtual Peace project is an initiative funded by the MacArthur Foundation and HASTAC through their DML grant program. Tim’s vision was to appropriate the first-person shooter (FPS) interface for immersive collaborative learning.  In particular, Virtual Peace simulates an environment in which multiple agencies coordinate and negotiate relief efforts for the aftermath of Hurricane Mitch in Honduras and Nicaragua.  The simulation, built on the Unreal game engine in collaboration with Virtual Heroes, allows for 16 people to play different roles as representatives of various agencies all trying to maximize the collective outcome of the relief effort.  It’s sort of like Second Life crossed with America’s Army, everyone armed not with guns but with private agendas and a common goal of humanitarian relief. The simulation is designed to take about an hour, perfect for classroom use. And with review components instructors have detailed means for evaluating the efforts and performance of each player.

I can’t say enough how cool this thing is.  Each player has a set of gestures he or she may deploy in front of another player.  The simulation has some new gaming innovations including proximity-based sound attenuation and full-screen full-session multi-POV video capture.  And the instructor can choose from a palette of “curveballs” to make the simulation run interesting.  Those changes to the scenario are communicated to each player through a PDA his or her avatar has. I was pushing for heads-up display but that’s not quite realistic yet I guess. 😉

The project pairs the simulation with a course-oriented website.  While a significant amount of web content is visible to the public, most of the web site is intended as a sort of simulation preparation and role-assignment course site.  We custom-built an authentication and authorization package that is simple and lightweight and user-friendly, a system that allows instructors to assign each student a role in the simulation, track the assignments, distribute hidden documents to people with specific roles, and allow everyone to see everything, including an after-action review, after the simulation run.
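To make the role-based access idea concrete, here is a minimal sketch in Python of the kind of rule described above. It is not the code we actually built; the class names, the role names, and the "completed" flag are all hypothetical and chosen only for illustration. The rule it encodes: a hidden document is visible only to its assigned roles until the simulation run is over, after which every enrolled student can see everything.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    title: str
    # Hypothetical convention: an empty set means "visible to every role".
    roles: frozenset = frozenset()


@dataclass
class SimulationRun:
    # Maps student name -> assigned role (all names here are made up).
    assignments: dict = field(default_factory=dict)
    completed: bool = False

    def assign(self, student, role):
        """Record (or overwrite) the role an instructor gives a student."""
        self.assignments[student] = role

    def can_view(self, student, doc):
        """Return True if the student may read the document right now."""
        if student not in self.assignments:
            return False  # not enrolled in this run
        if self.completed or not doc.roles:
            return True   # after the run, or for shared documents
        return self.assignments[student] in doc.roles


run = SimulationRun()
run.assign("alice", "ngo_logistics")
run.assign("bob", "government_liaison")
briefing = Document("Private agenda", frozenset({"ngo_logistics"}))

before_alice = run.can_view("alice", briefing)
before_bob = run.can_view("bob", briefing)
run.completed = True
after_bob = run.can_view("bob", briefing)
```

Keeping the check in one small method makes the "everyone sees everything afterwards" behavior a single flag flip rather than a bulk permission change.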

Last evening, Wednesday October 08, 2008, the Virtual Peace game simulation enjoyed its first live classroom run at the new Link facility in Perkins Library at Duke University.  A class of Rotary Fellows affiliated with the Duke-UNC Rotary Center were the first players in the simulation and there was much excitement in the air.

Next up:

I never miss a beat here it seems, for now I am already onto my next project, something that has been my main project since starting here: reading research and patent corpora mediated through text mining methods.  Yes that’s right, in an age where we struggle to get people to read at all (imagine what it’s like to be a poet in 2008) we’re moving forward with a new form of reading: reading everything at once, reading across the dimensions of text. I bet you’re wondering what I mean.  Well, I just can’t tell you what I mean, at least, not yet.

At the end of October I’ll be presenting with Tim in Berlin for the “Writing Genomics: Historiographical Challenges for New Historical Developments” workshop at the Max Planck Institute for the History of Science. We’ll be presenting on some results related to our work with the Center for Nanotechnology in Society at UCSB.  Basically we’ll be showing some of our methods for analyzing large document collections (scientific research literature, patents) as applied to the areas of bio/geno/nano/pharma both in China and the US. We’ll demonstrate two main areas of interest: our semiotic maps of idea flows over time I’ve developed in working with Tim and Vincent Dorie, and the spike in the Chinese nano scientific literature at the intersection of bio/geno/nano/pharma.  This will be perfect for a historiography workshop. The stated purpose of the workshop:

Although a growing corpus of case-studies focusing on different aspects of genomics is now available, the historical narratives continue to be dominated by the “actors” perspective or, in studies of science policy and socio-economical analysis, by stories lacking the fine-grained empirical content demanded by contemporary standards in the history of science.[…] Today, we are at the point in which having comprehensive narratives of the origin and development of this field would be not only possible, but very useful. For scholars in the humanities, this situation is simultaneously a source of difficulties and an opportunity to find new ways of approaching, in an empirically sound manner, the complexities and subtleties of this field.

I can’t express enough how excited I am about this. The end of easy narratives and the opportunity for intradisciplinary work (nod to Oury and Guattari) is just fantastic.  So, to be working on two innovations, platforms of innovation really, in just one week.  I told you my job here was pretty cool. Busy, hectic, breakneck, but also creative and multimodal.

After much anticipation an amazing new patent retrieval tool launched yesterday. SparkIP is an amazing new patent search tool of which my colleague (he is my boss-man really) Tim Lenoir is a founder. SparkIP combines the robust on-the-fly clustering of search results similar to Vivisimo’s Clusty but with a pretty incredible twist. The search engine results are navigated by the user in a visual way. Results are clustered, and first the user is presented not with patent results per se but rather with patent cluster results. The company refers to each cluster as a “SparkCluster Map.” Each of these cluster “maps” has numerous clusters within it. This set of cluster maps (shown here)

SparkIPLandscape

referred to as a landscape, is an excellent and robust way of reducing often-overwhelmingly-sized relevant document results while providing complex visual information about each cluster. This is truly a forward-looking tool in many respects but particularly in terms of generating intelligent and useful information about technologies, people, and institutions related to a keyword search. SparkIP has raised the bar on information retrieval right here. But your search is not done yet.

Given the landscape you can then select any of the specific cluster maps (seven in all were returned on “text mining”) by clicking directly on the map graphic. I selected the second cluster map, “information retrieval.” This then brings an enlarged view of the cluster map revealing the clusters within the map, shown here:

SparkClusterMap

Then clicking onto one of the map nodes/clusters (I selected the “document information retrieval” node at the very center of the cluster map) you see a view called “Technology Detail” (shown below):

TechDetail

More information-overload-reducing brilliance on display here in SparkIP. First, note that while 61 patents were retrieved, only 10 were returned. Further, there are likely hundreds more patents relevant to “text mining.” What appears to be happening here is that SparkIP has developed patent-filtering heuristics “under the hood” that get rid of the high volume of junk patents cluttering any patent database. After all, many if not most patents are created by their originators for purposes other than to stake a claim on a highly specific technology. Many a business game is played with patents as the pieces. An organization might want to try and occupy an intellectual property space to see if it can land licensing suckers. Other patents are premature. Some others overreach or are incredibly vague and therefore unenforceable. And so on.

There are a number of small problems with the interface as with many a beta product. The back button removes you entirely from your search results rather than helping you navigate backwards from, say, technology detail view to cluster map view. The meaning of visual iconography such as cluster map node size or color, while intuitive, is not altogether clear just from naively using the tool.

But wait folks, that’s not all. In addition to keyword-to-landscape patent search SparkIP will also open up an eBay-esque marketplace for intellectual property. I don’t know if that part is already live or not. I hope to have more time to play around with the site in the coming days.

SparkIP was founded at Duke University through collaboration between Dr. Lenoir, current Pratt School of Engineering Dean Rob Clark, and Johns Hopkins Provost and Senior Vice President for Academic Affairs Kristina Johnson. Since joining Lenoir at Duke I’ve had a couple of small windows of opportunity to provide some technical advice on cluster metrics with SparkIP engineer (and allpatents.org founder) Kevin Webb. But I never even got to see a demo of this thing. And let me tell you, man, this thing is amazing. I put this tool right up there with Clusty and the TRIP evidence-based medicine site as a retrieval tool among the best since the arrival of Google beta.

Congratulations to you Tim, and to you Kevin, and to the rest of the SparkIP team.

For my master’s thesis I performed a case study of a very large multinational drug company to evaluate how it innovates in text mining to drive its central mission of drug innovation. Drug discovery is hard and therefore expensive, but with high performance computing now a commodity, drug companies should be at the bleeding edge of text mining innovation, particularly in the area of virtual hypothesis formation and testing (deriving novel insights from mining multiple inputs, from clinical data stores to genetics databases to research literature collections and even the so-called grey literature). But guess what? At least with respect to the case I studied, they aren’t. They are highly focused on circa-1997 extraction tasks with little to no interest in statistical learning and a confused interest in taxonomies and automated inductive reasoning. They invest in formal logics and in information extraction but the meat in the middle, the statistical learning, is kept strictly to data mining of data sets severely limited in scope. Simply put, the company has little to no coherent and well-articulated vision of how it can tackle its most daunting problem for drug discovery: information overload.

How can this problem arise? Isn’t the central mission of a drug company, its core competency, to create new drugs? Well, historically it has been. But competing with the core competency of the drug company is another, oft conflicting, central mission, to make money. What this means for drug discovery is that it is only kind of important. The company I studied was laying off key drug innovators globally as it was focusing its investments further down the drug pipeline, placing more and more emphasis on Phase 2 & 3 projects, more on lower risk short term gains. What this means is that the central mission has become, to get drugs to market, particularly ones with a recurring revenue model.

Historically the drug companies could hang their hats on introducing drug treatments that have contributed to huge improvements in human health over the last century. Drug companies have been in the business of saving lives. Drugs are largely responsible for the 50%+ increase in life expectancy in the US over the last century.

Sometimes, however, human health improvements are not profitable. Sometimes drug companies will select strategies far less beneficial to human health that are far more financially beneficial to the organization. Consider the focus on marketing deregulation in the US, or FDA deregulation. Why invest in developing drugs when you can invest in removing barriers to sales? Now that deregulation has all but run its course, drug companies will soon face the fact that they will need to depend more and more on releasing new drugs. When the two largest drug companies in the world can’t combine for more than a dozen new drugs in any given calendar year, you can tell that something’s clearly broken. You can’t hang the shortage on regulation or on a shortage of actionable research.

So why the institutional emphasis away from innovation? One can only speculate; I will use Portfolio Theory to speculate. The dominant forces controlling large multinational drug companies are people of a certain kind, namely, aging investors. They have invested their dollars and expect something in return. Portfolio Theory tells us that our optimal investment trend as we age is to go from high risk to lower risk, income-generating investment. For example, I’m 35, and if I expect to live to, say, 80, I’m probably at least three decades from retirement, from a time where I need my investments to generate income. Because I have decades to invest, I can handle the risk of higher risk investments, namely because I don’t need the reliable income, and because I have time to recover if I lose. The game of investing depends entirely on how much time you perceive yourself as having, namely because on your death bed all the money in the world is worth nothing, but having a lot of cash on hand that last week before your death bed is pretty damned important. A promise for a check next week won’t do you good if you’re dead. And so I think, unlike me, the investors in large multinationals are old men, frankly. They need to allocate their assets on the income-generating end of the spectrum. They need that cash and they need it now.

This asset allocation model is confirmed by the reduced interest in technological innovation and the increase of interest in being merely early adopters. Adopting established technologies carries a lower risk, as it has a higher probability of some payoff.

And so why invest in high risk, in innovation? The argument for it would be three-fold: to attract younger investors, to focus on the longevity and long-term stability of the company, and to be true to the core mission, which should be to treat health ailments. Maybe my experience is anecdotal, but at least to me it appears that investors in my parents’ generation (they’re 65) are far more likely to invest in, say, Pfizer or GSK, than investors my age. They’re safe, they’ll do OK next quarter, but that picture is very murky a decade from now. Not to mention that younger investors no longer see drug companies as beneficial to human health. There’s nothing attractive for the average younger investor.

One of the saddest consequences of this reluctance to innovate, this focus on profit, is the impact on human health. Drug companies are far more willing to repackage old drugs and market the heck out of them, renewing their proprietary charges, than to find new drugs. And when the drug companies choose new drugs to invest in, they are going to look for “comeback” drugs, drugs that cure nothing but treat indefinitely. No new antibiotics are reaching market because there’s no incentive

The following comprises a collection of my intuitions and “big picture” insights resulting from graduate study focused on text mining at SILS. These are insights related to feature representation, knowledge engineering, model building, the application of statistics to real-life phenomena, and the greater whole of information science.

Many of these apparently go without saying, yet so many discussions of supposed problems would go away if some of these observations were made explicit. This is my attempt to make them explicit. Maybe it goes without saying that expressing the obvious is sometimes quite necessary.

1. Statistical models often fail because they’re missing key attributes necessary to describe the phenomena they represent

Attributes that are altogether unrecognized, difficult to quantify, difficult to analyze, truncated out, or simply forgotten arguably dominate and confound the predictive/explanatory power of statistical models. These missing variables abound. Their absence dominates to the point where theory itself must give way to empiricism and its sister, skepticism. It also means that we simply don’t see everything and that it never hurts to try and see more things.

2. Feature reduction of highly dimensional linguistic data sets is a misguided, outdated and counterproductive approach

There. I said it.

Claude Shannon’s model of information as that which is located among noise is a metaphor that appears to have been misleading a number of people in information science, particularly those involved with anything even remotely tangential to text mining (or, if you must, “knowledge discovery”). Information in an atomic form (e.g., bits) allows for the differentiation of signal and noise. A bit either is a signal or it isn’t. Attributes of real-life phenomena (e.g., average first down yardage in football for a team) are not like bits, at least not in the way we experience them and interpret those phenomena, whether in written explanations or in databases. “Real-life” phenomena comprise different sorts of real-world features that can never be honestly reduced to their atomic constituents. And, pragmatically speaking, they won’t be reduced to quantum atomic states any time soon.

Given that every attribute of real-world phenomena we identify partakes of both signal and noise, the removal of any attribute (save for the case of redundancy) always corresponds to the loss of information. Ultimately the statistical modeling of phenomena such as competitive sports and stock markets and clinical emergency room chief complaints is wholly unlike modeling communication channels. There’s something immediately discontinuous about binary electronic signals while these other phenomena need dramatic interpretive steps before they can be represented with discontinuous electronic signals. Finally, signal and noise are terms that don’t apply very well because that which we are modeling can only be realistically described by features that are both informative and misleading at the same time.

There’s something rather continuous about language (something that latent semantic indexing attempts to capture), and even the simplest of approaches, such as applying stop word lists to bag-of-words representations, lose critical information that dictates the semantics of the document. “Dog,” “a dog” and “the dog” quite clearly mean different things, as do “of the dog”, “out of a dog” and so forth. Representing all of those quotations as “dog,” or going a step further and representing all of these quotes with the very same word-sense identifier, dumbs down human language beyond recognition. Garbage in, garbage out is a phrase I learned more than a quarter century ago when learning to program games for the Commodore VIC-20.
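A quick way to see the collapse is to run the phrases above through a bag-of-words representation with and without a stop word list. The sketch below uses a tiny stop list I made up for the purpose (real lists are much longer), and the function name is just illustrative.

```python
import re

# A deliberately tiny, made-up stop word list for illustration only.
STOP_WORDS = frozenset({"a", "the", "of", "out"})


def bag_of_words(text, stop_words=frozenset()):
    """Lowercase, tokenize on letters, drop stop words, ignore word order."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return tuple(sorted(t for t in tokens if t not in stop_words))


phrases = ["dog", "a dog", "the dog", "of the dog", "out of a dog"]

# Distinct representations when every word is kept.
kept = {bag_of_words(p) for p in phrases}

# Distinct representations after stop word removal.
stripped = {bag_of_words(p, STOP_WORDS) for p in phrases}

print(len(kept), "distinct bags with all words kept")
print(len(stripped), "distinct bag(s) after stop word removal")
```

With the stop list applied, all five phrases become the same single-token bag, which is exactly the loss of distinction complained about above.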

Reading a text book from 1993 on the C4.5 algorithm, I came across reflections that some crucial elements of C4.5 appeared to be motivated by economizing on computer resource issues. Not enough memory, too slow of processing, etc. In 2007 high performance computing is a commodity. The pressures for feature reduction in machine learning needed to be heeded 14 years ago, but they’re considerably less of an issue today.

Finally, at the very end of my stretch of graduate school studies I accidentally came across a new strategy for feature representation that is so painfully obvious in retrospect it leaves me wondering why no one else has been doing this. Fortunately for Hypothia it spells one very big competitive advantage. But I digress.

3. There’s always something missing from your set of attributes (cf. 1 & 2)

4. There’s no substitute for knowing your data set (cf. 1)

I credit this oft-neglected, oft-devalued approach to my first and truly excellent data mining instructor, Miles Efron, who may be to blame for turning me on to text mining in the first place. What have you wrought? He made sure to repeat this lesson of knowing thy data a few times, and the lesson was surely not lost on me. In fact it seems as if it frames and justifies my confidence in my approach.

5. [DELETED] and let your algorithms optimize your attributes for maximal classification margin (cf. 2 & 3)

Can’t say the deleted part yet. But I will, eventually. It probably should be obvious by now. But still I’m not prepared to say.

6. SVM+SMO is very good for binary classification of highly dimensional data (cf. 5)

Improvements to SVM+SMO are always welcome of course, and it appears there are now numerous implementations of SVM that improve on it. I should note that, according to Eibe Frank, SMO in Weka (written in Java) is just as fast as Joachims’ SVM-light written in C. SMO’s pretty good.

SMO solves the QP problem created by SVM efficiently.
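For readers who haven’t seen it, here is a rough sketch of the idea behind SMO: rather than handing the whole quadratic program to a general solver, it repeatedly picks two Lagrange multipliers and optimizes just that pair analytically. The Python below is a simplified teaching variant with a linear kernel and a naive partner search; it is not Platt’s full algorithm with its selection heuristics, and it is not Weka’s or SVM-light’s code. The function names, the toy data, and the parameter defaults are my own assumptions.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def train_svm_smo(X, y, C=1.0, tol=1e-3, max_sweeps=1000):
    """Simplified SMO for a linear soft-margin SVM; labels must be +1 or -1."""
    n = len(X)
    alpha = [0.0] * n  # one Lagrange multiplier per training example
    b = 0.0
    K = [[dot(X[i], X[j]) for j in range(n)] for i in range(n)]

    def error(i):
        # Current decision value on example i minus its true label.
        return sum(alpha[k] * y[k] * K[k][i] for k in range(n)) + b - y[i]

    for _ in range(max_sweeps):
        changed = False
        for i in range(n):
            e_i = error(i)
            # Skip examples that already satisfy the KKT conditions.
            if not ((y[i] * e_i < -tol and alpha[i] < C)
                    or (y[i] * e_i > tol and alpha[i] > 0)):
                continue
            for j in range(n):  # naive search for a partner that helps
                if j == i:
                    continue
                e_j = error(j)
                a_i, a_j = alpha[i], alpha[j]
                # Box bounds keeping both multipliers inside [0, C].
                if y[i] != y[j]:
                    low, high = max(0.0, a_j - a_i), min(C, C + a_j - a_i)
                else:
                    low, high = max(0.0, a_i + a_j - C), min(C, a_i + a_j)
                eta = 2 * K[i][j] - K[i][i] - K[j][j]
                if low >= high or eta >= 0:
                    continue
                # Analytic optimum for the pair, clipped to the box.
                new_a_j = min(high, max(low, a_j - y[j] * (e_i - e_j) / eta))
                if abs(new_a_j - a_j) < 1e-5:
                    continue
                new_a_i = a_i + y[i] * y[j] * (a_j - new_a_j)
                d_i, d_j = new_a_i - a_i, new_a_j - a_j
                b1 = b - e_i - y[i] * d_i * K[i][i] - y[j] * d_j * K[i][j]
                b2 = b - e_j - y[i] * d_i * K[i][j] - y[j] * d_j * K[j][j]
                if 0 < new_a_i < C:
                    b = b1
                elif 0 < new_a_j < C:
                    b = b2
                else:
                    b = (b1 + b2) / 2
                alpha[i], alpha[j] = new_a_i, new_a_j
                changed = True
                break
        if not changed:  # a full sweep with no update: stop
            break
    dims = len(X[0])
    w = [sum(alpha[k] * y[k] * X[k][d] for k in range(n)) for d in range(dims)]
    return w, b


def predict(w, b, x):
    return 1 if dot(w, x) + b >= 0 else -1


# Made-up, linearly separable toy data.
X = [(2.0, 2.0), (3.0, 3.0), (2.0, 3.0), (0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
y = [1, 1, 1, -1, -1, -1]
w, b = train_svm_smo(X, y)
train_predictions = [predict(w, b, x) for x in X]
```

Each pair update has a closed-form solution, which is why no numerical QP library is needed; production implementations add smarter pair selection and error caching on top of this.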

7. You always need more computing power (cf. 2, 5 & 6)

The curse is not dimensionality, the curse is not intellectual. The curse is economic, a problem of resources.
Likely it will be difficult to produce a dataset that is intractable for a good HPC setup running SVM+SMO but it doesn’t exactly hurt to try as long as you’re trying to harness more and more power.

8. You don’t know everything (cf. 3 & 4)

9. Models forecast well in forecast-influenced environments only when the model has an information advantage over other models (information asymmetry, competitive advantage)

10. You’ll never get it quite right ( cf. 8 )

11. There’s always more left to do (cf. 5, 7 & 10)

12. Disambiguation can be better pursued not in any pure sense by machinic strategies but rather by the messier approach of utilizing the greater context surrounding term, document, and corpus, which in turn permits some degree of ambiguity, which is necessary for understanding

13. Word sense disambiguation is quite possibly the wrong way to go to conjure semantics in one’s text representation (cf. 2 & 12)

As I’ve written before, there are other approaches available to leverage semantic information that are better than word-sense disambiguation (WSD).

14. More formally, the incorporation of ambiguity into linguistic representations (i.e., representing all possible word senses/meanings and POSs for any given word) allows for better representations of intelligence than ones produced at least in part through WSD strategies
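As a toy illustration of point 14 only (and not of the undisclosed part of point 5), here is one way a representation might keep every candidate sense and part of speech for a word instead of committing to one. The sense inventory below is invented for the example; a real one would come from a lexical resource, and the equal weighting of senses is simply my assumption.

```python
from collections import Counter

# Invented mini sense inventory: word -> all candidate POS:sense labels.
SENSES = {
    "bank": [
        "bank/NOUN:financial_institution",
        "bank/NOUN:river_edge",
        "bank/VERB:tilt",
    ],
    "bark": [
        "bark/NOUN:tree_covering",
        "bark/NOUN:dog_sound",
        "bark/VERB:make_dog_sound",
    ],
}


def ambiguous_features(tokens, senses=SENSES):
    """Count every candidate sense of every token; nothing is disambiguated."""
    features = Counter()
    for token in tokens:
        # Words missing from the inventory keep a single placeholder feature.
        for label in senses.get(token, [token + "/UNKNOWN"]):
            features[label] += 1
    return features


features = ambiguous_features(["bank", "river", "bank"])
```

A WSD pipeline would emit one label per occurrence of “bank”; here all three stay in the feature vector and it is left to the learning algorithm to weight them.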

15. For artificial intelligence to become smarter than humans, it must at least be as smart as humans first.  A person’s ability to understand multiple senses of a given word at once (of which poetry is perhaps the most striking example) is strikingly intelligent and far more intelligent than most WSD approaches I’ve seen (cf. 14).  And when you consider that the basic unit of meaning is truly not the word but the sentence, WSD seems all the more foolish, and yet makes me feel there’s a huge opportunity to understand language from its wholes and holes.  Discourse analysis anyone?

16. Not knowing everything, not always getting it right, and always having more left to do makes the hard work a great deal of fun. Discoveries are everywhere waiting to be written into existence. (cf. 8, 10, 11)

17. Don’t panic, be good, and have fun. (cf. 16)

18. The essence of human language is nothing less than the totality of the human language in all of its past present and future configurations and possibilities.

1. Introduction
Pharmacogenomics experts have recognized that genomics-based approaches to drug discovery appear to suffer from some sort of information overload problem (A. D. Roses, Burns, Chissoe, Middleton, & Jean, 2005, p. 179). More specifically, the explosion of human genomics information may have been outpaced by a concurrent explosion of noise within that data, leading to a significant attrition rate in the pharmaceutical pipeline (A. D. Roses et al., 2005, p. 179). However, it is not entirely clear how the concepts of information overload and signal-to-noise apply to information-based struggles in pharmacogenomics. In order to improve our understanding of the barriers to optimal use of pharmacogenomics information for drug discovery purposes we must first briefly unpack competing ideas about information overload and signal-to-noise and then contextualize the appropriate ideas within PGx-based drug discovery (henceforth PGx-DD).

2. Explaining Too Much Information in PGx-based Drug Discovery: Information Theory or Information Overload?

Genomics research pioneer and GSK Senior VP for Genomics Research Allen Roses has recently shed light on why pharmacogenomics-based approaches may not be optimal. According to Roses, who arguably is in a unique position to understand the problem, the central problem is one arising from information struggles. Roses writes,

What factors have limited target selection and drug discovery productivity? Although HTS technologies were successfully implemented and spectacular advances in mining chemical space have been made, the universe for selecting targets expanded, and in turn almost exploded with an inundation of information. Perhaps the best explanation for the initial modest success observed was the dramatic increase in the ‘noise-to-signal’ ratio, which led to a rise in the rate of attrition at considerable expense. The difficulty in making the translation from the identification of all genes to selecting specific disease-relevant targets for drug discovery was not realistically appreciated (A. D. Roses et al., 2005, p. 179).

What Roses calls the “noise-to-signal” ratio sounds like the problem of information overload, yet it also sounds as if it borrows from the language of Information Theory as put forth by Claude Shannon. Roses’ insight seems to corroborate Sean Ekins’ observation that already-extant data is not optimally utilized (2005). Pharmacogenomics is failing to deliver because PGx researchers and organizations utilizing PGx research have been unable to meet the information challenges concomitant with the explosion of data.

The language Allen Roses uses to describe struggles with information in the field of PGx-based drug discovery refers both to a signal-to-noise ratio and to information overload. The terminology appears, however, to be rather ambiguously utilized in the context of PGx-DD. “Noise-to-signal” seems to refer to Claude Shannon’s mathematical theory of communication (Shannon & Weaver, 1949) while the problems described by PGx professionals sound more like cognitive issues related to more formal notions of information overload.

2.1. Shannon’s Mathematical Theory of Communication
In 1948, Claude Shannon of Bell Labs completed work on his mathematical theory of communication. For so doing, Shannon is credited as fathering the field of Information Theory. It is from Shannon’s theory that the notion of signal-to-noise arises, among many other concepts crucial to any understanding of information. In his introduction to the ensuing book publication comprising Shannon’s work on the theory, Warren Weaver explains that the theory was supposed to deal with three distinct levels of communications problems, as follows:

Level A. How accurately can the symbols of communication be transmitted? (The technical problem.)

Level B. How precisely do the transmitted symbols convey the desired meaning? (The semantic problem.)

Level C. How effectively does the received meaning affect conduct in the desired way? (The effectiveness problem.) (Shannon & Weaver, 1949, p. 4)

Information in Shannon’s sense is not used in the ordinary sense of information. While by ‘information’ we ordinarily mean something akin to that which has already been said/written, Shannon means information in the sense of what may possibly be said (Shannon & Weaver, 1949, p. 8). For Shannon, information is a probable message sent over a channel (e.g., a telephone wire) and his concern is with describing general properties of the transmission and interpretation of such electronic signals.

Concerns about the ratio of signal-to-noise with respect to information transmission do originate from Shannon’s own communication theory work. The very ratio of signal-to-noise appears in Shannon’s theoretical examination of channel capacity with power limitation (Shannon & Weaver, 1949, p. 100). Shannon uses the ratio of the average power of the signal (denoted as P) to the power of the noise (denoted as N) in order to provide a general way of calculating how many bits per second any communication pathway can actually transmit. Shannon replaces P with S, the peak allowed transmitter power, in order to adjust channel capacity where peak power limits the rate of the channel to transmit bits. According to Shannon the upper bound rate of a channel is the channel bandwidth times the log of the ratio of signal plus noise to noise where the signal-to-noise ratio is low (Shannon & Weaver, 1949, p. 107). Loosely speaking, the rate at which telephone wires, coaxial cables, wireless networks, and the like can transmit messages varies logarithmically with the ratio of peak power (signal) to background noise on the channel (noise).
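For the average-power case, the relationship just described is Shannon’s capacity formula, C = W log2((P + N) / N), with W the bandwidth in hertz and C in bits per second. The short Python sketch below simply evaluates that formula; the function name and the sample numbers are illustrative assumptions, and the peak-power case discussed above has different bounds that are not modeled here.

```python
import math


def channel_capacity(bandwidth_hz, signal_power, noise_power):
    """Shannon capacity in bits per second for an average-power-limited
    channel with white noise: C = W * log2((P + N) / N)."""
    if bandwidth_hz <= 0 or signal_power < 0 or noise_power <= 0:
        raise ValueError("bandwidth and noise must be positive, signal non-negative")
    return bandwidth_hz * math.log2((signal_power + noise_power) / noise_power)


# Illustrative numbers only: a 3 kHz channel at two signal-to-noise ratios.
capacity_snr_3 = channel_capacity(3000, 3.0, 1.0)    # (3 + 1) / 1 = 4, log2 = 2
capacity_snr_15 = channel_capacity(3000, 15.0, 1.0)  # (15 + 1) / 1 = 16, log2 = 4
```

Raising the signal-to-noise ratio from 3 to 15 only doubles the capacity, which is the logarithmic behavior the paragraph above refers to.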

Shannon & Weaver’s specified problem set does not accurately match the sort of problem a drug discovery researcher is facing, not at least without a considerable stretch. Shannon’s sense of information in his definitive work on communication theory does not seem quite the same as the sort of information we are dealing with when we speak of genomics research data. Finally, Shannon’s notion of signal-to-noise can at best only loosely apply to notions of researchers struggling with too much information in their hands. Shannon is writing about communication channels, not people.

Efforts Shannon may have made to model specifically human communication in his theoretical work appear to be at best tertiary to the central thrust of his work, which was to generalize the properties of electronic communications systems. In short, Information Theory as proffered by Shannon does not appear to apply in any straightforward way to the sort of “noise-to-signal” problem Allen Roses describes or any other human communication problems that can occur independently of electronic signals. The signal-to-noise problem Roses reports is an information problem to be sure but it appears to be an information problem unlikely to be either explained or resolved through the lens of Shannon’s communication theory.

2.2. Information Overload
The concept of the possibility of too much information dates back to ancient times (Bawden, Holtham, & Courtney, 1999, p. 249). The recurring concern of information overload stems from the general notion that a person’s work becomes inefficient from increasing difficulty experienced in locating the best pieces of information. With the advent of computer-based information retrieval systems in the 1950s (Bawden et al., 1999, p. 249) as well as the beginnings of the mass proliferation of scientific research literature (Ziman, 1980), the concern became more frequently and more directly articulated and investigated. While any exact definition of information overload is elusive, issues of relevance and efficiency are commonly noted, as are issues of both data management and psychic strain (Bawden et al., 1999, p. 250). The constant problem however is that information overload stands for a struggle, a struggle that increases as a collection of information grows beyond human tractability. The recurring solution inevitably takes the form of methods or techniques that allow a person to locate some tractable set of pieces of information of sufficient quality in a reasonable amount of time in order to aid the person in completing the task.

3. Impact of information overload on PGx-based drug discovery
Information overload describes the general problem of “noise-to-signal” referred to by Allen Roses. Roses characterizes the information problem facing PGx-DD as having increased the rate of attrition of drug candidates in the pharmaceutical pipeline. Further, he states that the solution to the problem is an increase in “specific, disease-relevant targets” relative to all genomic data (A. D. Roses et al., 2005, p. 179). In other words, the proliferation of genomic data has drowned out this highly specific disease-relevant genomic information to the point that it increases drug discovery failure. The way to resolve the issue is to reduce information overload in PGx-DD by restricting the flow of information to PGx researchers to highly specific disease-relevant genomic information. As Roses says, providing researchers with validating evidence is crucial.

4. Validating evidence, novelty, and a PGx-info quality model
What, however, frames, delimits, or describes validating evidence for candidate targets? Roses states that disease-specific targets chosen based on well-trod beliefs “have a significant probability of being the totally wrong target” (A. D. Roses et al., 2005, p. 180). It is therefore not enough to identify highly specific disease-relevant data efficiently; the data must support infrequent or entirely novel theories. The data must in essence have the characteristic of supporting novelty, of supporting ideas not commonly held, of bolstering theories that appear to be unreasonable.

The quality of PGx information should be evaluated using the following three criteria:

(a) the disease-relevance of the information,

(b) the specificity of the information, and

(c) the novelty of the information or the novelty of the theory supported by the information.
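
As a rough illustration only, the three criteria can be sketched as a simple filter. The 0-to-1 ratings, the unweighted mean, the 0.5 cutoff, and the names PGxItem, quality, and select are all assumptions made for this example; the model above does not specify how the criteria should be measured or combined.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PGxItem:
    """A piece of PGx information with hypothetical 0-1 ratings per criterion."""
    label: str
    disease_relevance: float  # criterion (a)
    specificity: float        # criterion (b)
    novelty: float            # criterion (c)


def quality(item: PGxItem) -> float:
    """Combine the three criteria; the unweighted mean is an assumption."""
    return (item.disease_relevance + item.specificity + item.novelty) / 3


def select(items, minimum=0.5):
    """Keep items that meet the minimum on every criterion, best first."""
    kept = [
        i for i in items
        if min(i.disease_relevance, i.specificity, i.novelty) >= minimum
    ]
    return sorted(kept, key=quality, reverse=True)


# Invented example ratings, not data from any study.
items = [
    PGxItem("well-trod target", 0.9, 0.8, 0.1),
    PGxItem("novel specific target", 0.8, 0.7, 0.9),
    PGxItem("generic genomic data", 0.2, 0.1, 0.6),
]
selected = select(items)
print([i.label for i in selected])
```

Requiring a minimum on every criterion, rather than only a high average, reflects the point above that relevance and specificity alone are not enough: a well-trod target scores well on (a) and (b) but is still filtered out for lack of novelty.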

Sources

Bawden, D., Holtham, C., & Courtney, N. (1999). Perspectives on information overload. Aslib Proceedings, 51(8), 249-255.

Ekins, S., Bugrim, A., Nikolsky, Y., & Nikolskaya, T. (2005). Systems biology: Applications in drug discovery. In S. C. Gad (Ed.), Drug discovery handbook (pp. 123-183). Hoboken, New Jersey: Wiley Interscience.

Roses, A. D., Burns, D. K., Chissoe, S., Middleton, L., & Jean, P. S. (2005). Disease-specific target selection: A critical first step down the right road. Drug Discovery Today, 10(3), 177-189.

Shannon, C. E., & Weaver, W. (1949). The mathematical theory of communication. Urbana and Chicago: University of Illinois Press.

Ziman, J. M. (1980). The proliferation of scientific literature – a natural process. Science, 208(4442), 369-371.

(NOTE: the preceding document is a revised excerpt from my master’s thesis.)

“The best way to predict the future is to invent it”
– Alan Kay

“Don’t listen to the physics majors until you also check with the Vaudevillians.”
– Tom Notti, The Bubble Guy

In my introduction to Hypothia I briefly referred to a paradigm shift in the web that I wish to participate in. That paradigm shift as I imagine it is the change from information retrieval (IR) to information generation as the core technology for utilizing the web.

Web 1.0 was basically the infancy of the web, the Mosaic-Netscape-Alta Vista web. Web 2.0 has ushered in a user-centric era, where the winners will be those who effectively repurpose user data assets generated from agile services.

We can safely say that the web is no longer in its infancy but rather in its adolescence. The somewhat adolescent appeal of paradigmatic Web 2.0 applications (e.g., YouTube, MySpace) is no mere accident but rather a reflection of a youth-oriented culture of innovation and capital that is Silicon Valley mistaking youth appeal for broad appeal. Mistaking it, or perhaps pushing it and making it so.

The good news about the mistaken horizontal appeal is that there’s still quite a lot of room for the web to grow in terms of utility and users. The world is far from wired. Heck, Sequoia’s grandmom probably isn’t too keen on “the computer nonsense” yet. And why on earth would she be, other than for photos of the grandkids? OK, Google may be usable by most grandmothers, something which has been every bit as important to Google’s success as its PageRank algorithm.

The trend towards Web 3.0, the Semantic Web, has already begun to sprout. Instead of social web applications with so-called “horizontal” appeal (“so-called”, because if you are over 40 and on Friendster, you might get some strange looks from other users), we are already beginning to see niche social tools. In other words, Web 2.0 tools are slowly becoming vertically oriented. Services such as LinkedIn, geared towards a semi-broad niche of white collar professionals looking to “network”, are succeeding as even more specialized and vertically-oriented tools appear.

Such tools as LinkedIn, Gwagle and Trip usher in the Semantic Web simply because in narrowing the content scope, the content to be searched maps better to meaning. In other words, terms that users choose to search these tools become less ambiguous as the content scope shrinks. At the same time that terms map better to their intended senses, such sites make it more and more possible for ontology building and use. We can see this, for example, on Trip, an evidence-based medical search tool that provides faceted search features and leverages the admittedly rudimentary MeSH ontology. (I would add that disambiguation is best performed by the contours of context rather than by any set of rules applied to document collections. This emergent nature of disambiguation, and the concomitant necessity for ambiguity in understanding, is best saved for a later discussion.)

But it’s not just the narrowing semantic spaces that help usher in the Semantic Web. It’s also the more complex sets of user data, things like tags and search terms, applied to specific domains, that help automate user-responsive architectures expanding the possibilities for advanced analytics and responsive content.

Another indication that the Semantic Web in the sense of vertically-oriented semantic retrieval is on its way is the work of George Miller’s research group at Princeton. Miller is renowned for a number of things, among them the creation and development of WordNet. Christiane Fellbaum, a colleague of Miller’s and long-time participant in the WordNet project, has apparently initiated work on a project called Medical WordNet. Unlike WordNet, Medical WordNet will benefit from the fact that it will be applied to a much narrower semantic space. It will add specialized terms not in WordNet while limiting senses and relations between terms shared with WordNet.

Yet another indication of the rise of the Semantic Web is the finalization of the XQuery standard along with the development of XML content servers. Simply put, why invest months learning, say, data warehousing and OLAP cubes, when you can just implement advanced linguistic representations in XML and query them in an amazingly simple scripting language? Further, with XML content servers such as Mark Logic or eXist, you can query document collections and synthesize new documents, taking pieces of multiple documents and assembling them together, bound only by the limits of XPath and whatever heuristics you can add.
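
To make the document-synthesis idea concrete, here is a minimal sketch. It is not XQuery and it does not use Mark Logic or eXist; it swaps in Python’s standard-library ElementTree, which supports only a small subset of XPath, and the sample documents, element names, and the synthesize function are all invented for illustration.

```python
import xml.etree.ElementTree as ET

# Two tiny made-up documents standing in for a document collection.
doc_a = ET.fromstring(
    "<article id='a'>"
    "<section topic='dosage'><p>Dose finding text.</p></section>"
    "<section topic='history'><p>Background text.</p></section>"
    "</article>"
)
doc_b = ET.fromstring(
    "<article id='b'>"
    "<section topic='dosage'><p>Second dosage text.</p></section>"
    "</article>"
)


def synthesize(docs, topic):
    """Assemble a new document from every section on the given topic."""
    report = ET.Element("report", {"topic": topic})
    for doc in docs:
        # XPath-style query: sections anywhere in the document with a matching topic.
        for section in doc.findall(f".//section[@topic='{topic}']"):
            excerpt = ET.SubElement(report, "excerpt", {"source": doc.get("id")})
            # Attach the section's child elements under the new excerpt.
            excerpt.extend(list(section))
    return report


report = synthesize([doc_a, doc_b], "dosage")
print(ET.tostring(report, encoding="unicode"))
```

The result is a document that did not exist in the collection, stitched together from pieces of two that did. A real XML content server would do the same with full XPath and XQuery over far larger collections.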

But OK so with Web 3.0 we will have basic semantics incorporated into our content and the ability to leverage meaning in order to find what we want to find. In the article in which he coins the phrase “Semantic Web,” Tim Berners-Lee speculated extensively about the possibility of using meaning-annotated content to make basic deductions. But while the infusion of meaning into information retrieval is well under way, the infusion of domain rules for drawing conclusions from that which is retrieved is not nearly so imminent.

So Web 3.0 will be yet another era of information retrieval in the literal sense. Finding information will become refined to domain-specific, context-limited, user-experience-friendly, meaning-aware, document synthesis retrieval. But without the maturation of reasoning, retrieval will remain nothing more than that–retrieval. Information processing tools will remain stuck on regurgitation, however elaborate the regurgitation may be.

With all of the power of retrieval extended and leveraged, and the introduction of vertically-oriented ontology tools such as Medical WordNet, the next crush will be to develop systems that think. Well, thinking? No, not really thinking. Rather, with the application of context, domain, and user-tuned semantics to search, the need for the development of domain-specific heuristics will become readily apparent. Instead of the emergence of general question answering systems that each try to answer many kinds of questions, I imagine we’ll see services dedicated to solving single problems quite well.

What’s crucial in solving problems using information retrieval tools is that the output of rule-using systems is novel content. In other words, such tools are no longer merely finding existing content but rather creating new content. And the creation of useful information places a heavy burden on evaluating statements for their quality.

The long-term goal of Hypothia is to pursue the development of problem-specific information generation services with a particular eye on scientific discovery. Hypothia in short aims to become the first innovation service and help usher in Web 4.0. It’s far-off, far-fetched, far-out, and maybe a bit ridiculous as a vision, but someone’s got to create the future.