Text Mining

Virtual Peace (http://virtualpeace.org) is alive as of last evening.

For the last gosh-don’t-recall-how-many-months I’ve been working as a Project Collaborator for a project envisioned by the other half (more than half) of the Jenkins Chair here at Duke, Tim Lenoir.  For those of you who don’t know Tim, he’s been a leading historian of science for decades now, helping found the History and Philosophy of Science program at Stanford.  Tim is notable in part for changing areas of expertise multiple times over his career, and most recently he’s shifted into new media studies.  This is the shift that brought him here to Duke and I can’t say enough how incredible an opportunity it is to work for him.  We seem to serve a pivotal function for Duke as people who bring together innovation with interdisciplinarity.

What does that mean? Well, like the things we study, there are no easy simple narratives to cover it.  But I can speak through examples.  And the Virtual Peace Project is one such example.

Tim, in his latest intellectual foray, has developed an uncanny and unparalleled understanding of the role of simulation in society.  He has studied the path, no, the wide swath of simulation in the history of personal computing, and he developed a course teaching contemporary video game criticism in relation to the historical context of simulation development.

It’s not enough to just attempt to study these things in some antiquated objective sense, however.  You’ve got to get your hands on these things, do these things, make these things, get some context. And the Virtual Peace project is exactly that. A way for us to understand and a way for us to actually do something, something really fantastic.

The Virtual Peace project is an initiative funded by the MacArthur Foundation and HASTAC through their DML grant program. Tim’s vision was to appropriate the first-person shooter (FPS) interface for immersive collaborative learning.  In particular, Virtual Peace simulates an environment in which multiple agencies coordinate and negotiate relief efforts for the aftermath of Hurricane Mitch in Honduras and Nicaragua.  The simulation, built on the Unreal game engine in collaboration with Virtual Heroes, allows for 16 people to play different roles as representatives of various agencies all trying to maximize the collective outcome of the relief effort.  It’s sort of like Second Life crossed with America’s Army, everyone armed not with guns but with private agendas and a common goal of humanitarian relief. The simulation is designed to take about an hour, perfect for classroom use. And with review components instructors have detailed means for evaluating the efforts and performance of each player.

I can’t say enough how cool this thing is.  Each player has a set of gestures he or she may deploy in front of another player.  The simulation has some new gaming innovations including proximity-based sound attenuation and full-screen full-session multi-POV video capture.  And the instructor can choose from a palette of “curveballs” to make the simulation run interesting.  Those changes to the scenario are communicated to each player through a PDA his or her avatar has. I was pushing for heads-up display but that’s not quite realistic yet I guess. 😉

The project pairs the simulation with a course-oriented website.  While a significant amount of web content is visible to the public, most of the web site is intended as a sort of simulation preparation and role-assignment course site.  We custom-built an authentication and authorization package that is simple and lightweight and user-friendly, a system that allows instructors to assign each student a role in the simulation, track the assignments, distribute hidden documents to people with specific roles, and allow everyone to see everything, including an after-action review, after the simulation run.
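To make the role-and-document idea concrete, here is a minimal sketch in Python. It is not the code we actually wrote for the site; the class names, the `visible_to` field, and the example roles are all hypothetical, chosen only to illustrate role assignment, role-restricted documents, and opening everything up after the run.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    title: str
    # Hypothetical field: roles allowed to read this before the run.
    # An empty set means every enrolled student may read it.
    visible_to: frozenset = frozenset()


@dataclass
class Course:
    # Maps student name -> assigned simulation role.
    assignments: dict = field(default_factory=dict)
    simulation_finished: bool = False

    def assign(self, student: str, role: str) -> None:
        """Record (or change) the role a student plays in the simulation."""
        self.assignments[student] = role

    def can_read(self, student: str, doc: Document) -> bool:
        """Decide whether an enrolled student may see a document."""
        role = self.assignments.get(student)
        if role is None:
            # Not enrolled in this course: no access to course documents.
            return False
        if self.simulation_finished:
            # After the run, everyone enrolled sees everything.
            return True
        return not doc.visible_to or role in doc.visible_to
```

An instructor would call `assign` once per student, attach role sets to the hidden briefing documents, and set `simulation_finished` to `True` when the run ends.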

Last evening, Wednesday October 08, 2008, the Virtual Peace game simulation enjoyed its first live classroom run at the new Link facility in Perkins Library at Duke University.  A class of Rotary Fellows affiliated with the Duke-UNC Rotary Center were the first players in the simulation and there was much excitement in the air.

Next up:

I never miss a beat here it seems, for now I am already onto my next project, something that has been my main project since starting here: reading research and patent corpora mediated through text mining methods.  Yes that’s right, in an age where we struggle to get people to read at all (imagine what it’s like to be a poet in 2008) we’re moving forward with a new form of reading: reading everything at once, reading across the dimensions of text. I bet you’re wondering what I mean.  Well, I just can’t tell you what I mean, at least, not yet.

At the end of October I’ll be presenting with Tim in Berlin for the “Writing Genomics: Historiographical Challenges for New Historical Developments” workshop at the Max Planck Institute for the History of Science. We’ll be presenting on some results related to our work with the Center for Nanotechnology in Society at UCSB.  Basically we’ll be showing some of our methods for analyzing large document collections (scientific research literature, patents) as applied to the areas of bio/geno/nano/pharma both in China and the US. We’ll demonstrate two main areas of interest: the semiotic maps of idea flows over time that I’ve developed in working with Tim and Vincent Dorie, and the spike in the Chinese nano scientific literature at the intersection of bio/geno/nano/pharma.  This will be perfect for a historiography workshop. The stated purpose of the workshop:

Although a growing corpus of case-studies focusing on different aspects of genomics is now available, the historical narratives continue to be dominated by the “actors” perspective or, in studies of science policy and socio-economical analysis, by stories lacking the fine-grained empirical content demanded by contemporary standards in the history of science.[…] Today, we are at the point in which having comprehensive narratives of the origin and development of this field would be not only possible, but very useful. For scholars in the humanities, this situation is simultaneously a source of difficulties and an opportunity to find new ways of approaching, in an empirically sound manner, the complexities and subtleties of this field.

I can’t express enough how excited I am about this. The end of easy narratives and the opportunity for intradisciplinary work (nod to Oury and Guattari) is just fantastic.  So, to be working on two innovations, platforms of innovation really, in just one week.  I told you my job here was pretty cool. Busy, hectic, breakneck, but also creative and multimodal.

After much anticipation an amazing new patent retrieval tool launched yesterday. SparkIP is a patent search tool of which my colleague (he is my boss-man really) Tim Lenoir is a founder. SparkIP combines robust on-the-fly clustering of search results, similar to Vivisimo’s Clusty, with a pretty incredible twist. The search engine results are navigated by the user in a visual way. Results are clustered, and first the user is presented not with patent results per se but rather with patent cluster results. The company refers to each cluster as a “SparkCluster Map.” Each of these cluster “maps” has numerous clusters within it. This set of cluster maps (shown here), referred to as a landscape, is an excellent and robust way of reducing often-overwhelmingly-sized relevant document results while providing complex visual information about each cluster. This is truly a forward-looking tool in many respects but particularly in terms of generating intelligent and useful information about technologies, people, and institutions related to a keyword search. SparkIP has raised the bar on information retrieval right here. But your search is not done yet.

Given the landscape you can then select any of the specific cluster maps (seven in all were returned on “text mining”) by clicking directly on the map graphic. I selected the second cluster map, “information retrieval.” This then brings an enlarged view of the cluster map revealing the clusters within the map, shown here:


Then clicking onto one of the map nodes/clusters (I selected the “document information retrieval” node at the very center of the cluster map) you see a view called “Technology Detail” (shown below):


More information-overload-reducing brilliance on display here in SparkIP. First, note that while 61 patents were retrieved, only 10 were returned. Further, there are likely hundreds more patents relevant to “text mining.” What appears to be happening here is that SparkIP has developed patent-filtering heuristics “under the hood” that get rid of the high volume of junk patents cluttering any patent database. After all, many if not most patents are created by their originators for purposes other than to stake a claim on a highly specific technology. Many a business game is played with patents as the pieces. An organization might want to try and occupy an intellectual property space to see if it can land licensing suckers. Other patents are premature. Some others overreach or are incredibly vague and therefore unenforceable. And so on.

There are a number of small problems with the interface, as with many a beta product. The back button removes you entirely from your search results rather than helping you navigate backwards from, say, technology detail view to cluster map view. The meaning of visual iconography such as cluster map node size or color, while intuitive, is not altogether clear just from naively using the tool.

But wait folks, that’s not all. In addition to keyword-to-landscape patent search SparkIP will also open up an eBay-esque marketplace for intellectual property. I don’t know if that part is already live or not. I hope to have more time to play around with the site in the coming days.

SparkIP was founded at Duke University through collaboration between Dr. Lenoir, current Pratt School of Engineering Dean Rob Clark, and Johns Hopkins Provost and Senior Vice President for Academic Affairs Kristina Johnson. Since joining Lenoir at Duke I’ve had a couple of small windows of opportunity to provide some technical advice on cluster metrics with SparkIP engineer (and allpatents.org founder) Kevin Webb. But I never even got to see a demo of this thing. And let me tell you, man, this thing is amazing. I put this tool right up there with Clusty and the TRIP evidence-based medicine site as a retrieval tool among the best since the arrival of Google beta.

Congratulations to you Tim, and to you Kevin, and to the rest of the SparkIP team.


Cereb Cortex. 2005 Aug;15(8):1261-9. Epub 2005 Jan 5. 

The neural mechanisms of speech comprehension: fMRI studies of semantic ambiguity.

Rodd JM, Davis MH, Johnsrude IS.

Department of Psychology, University College London, UK. j.rodd@ucl.ac.uk

A number of regions of the temporal and frontal lobes are known to be important for spoken language comprehension, yet we do not have a clear understanding of their functional role(s). In particular, there is considerable disagreement about which brain regions are involved in the semantic aspects of comprehension. Two functional magnetic resonance studies use the phenomenon of semantic ambiguity to identify regions within the fronto-temporal language network that subserve the semantic aspects of spoken language comprehension. Volunteers heard sentences containing ambiguous words (e.g. ‘the shell was fired towards the tank’) and well-matched low-ambiguity sentences (e.g. ‘her secrets were written in her diary’). Although these sentences have similar acoustic, phonological, syntactic and prosodic properties (and were rated as being equally natural), the high-ambiguity sentences require additional processing by those brain regions involved in activating and selecting contextually appropriate word meanings. The ambiguity in these sentences goes largely unnoticed, and yet high-ambiguity sentences produced increased signal in left posterior inferior temporal cortex and inferior frontal gyri bilaterally. Given the ubiquity of semantic ambiguity, we conclude that these brain regions form an important part of the network that is involved in computing the meaning of spoken sentences. (My emphasis.)


Here we may have a possible biological locus for exactly the sort of phenomenon I was positing in my previous post. Interestingly enough, ambiguity seems to be a core process, and again we have evidence that language users are able to actively engage with ambiguous language and that an important step in cognition is pre-disambiguated. Importantly, in all likelihood linguistic comprehension engages in parallel visualization of multiple possibilities. This is probably responsible for so much of what makes poetry interesting and road signs uninteresting.

The inferior temporal cortex is a higher-level part of the ventral stream of the visual processing system of the human brain. The ventral stream engages in classification and identification of phenomena. The adjacent inferior frontal gyrus contains Brodmann areas 44 and 45, which contain a number of non-visual areas heavily engaged in linguistic understanding. Broca’s area is contained in Brodmann area 44. Broca’s area is connected to Wernicke’s area via the arcuate fasciculus.

One way to disprove my present theory is to see the neural precursors to these differentiated brain areas in fetal development. Do human brains develop the visual system first? Do these linguistic areas develop out of the visual tissues? Or do they come out of a wholly different set of neural tissues? Anyone know a neuroembryologist?

When reading some Steven Pinker a couple of years back I wondered whether language could be better understood via sound, sentence, and vision rather than by words and rules as Pinker suggests (see his Words and Rules). Rules seem to be elements of narration we use or rather abuse to divine a neat model of causality. However there seems to be very little in biology that’s rather rule-like. Biology is inherently anti-functional, at least in the strict mathematical sense of the word function. Cells and subcellular systems can and do appear to regularly do different things given the same input. And that’s assuming we can even truly tightly control an input to a biological system in any meaningful (re: in vivo) way. Weak and strong AI proponents would have us think that neurons are analogues for computer circuits, but the complexity of neural matter is hardly reducible to such a model without sacrificing crucial information.

Rules just don’t seem inherent to language. Words, however, do seem on some level fundamental to language. From a textual perspective certainly. We can see evidence for this in many ways; in my experience the evidence is in building representations of document collections for various text mining experiments. But from an oral perspective, are words fundamental?

Spoken language seems far more continuous than written language not only from a processing standpoint but also from a sensory point of view. Spoken language is experienced and performed in a rather continuous way; words are deduced in learning language, but it remains to be shown whether words are in and of themselves mere narrative convenience for explaining how we understand language rather than language itself. These sounds continue rather fluidly within sentences. The auditory experience of language is that the most coarse break, the most distinct break, is the break between sentences. But spoken language is not just continuous in the way it is serially composed and experienced in an auditory fashion. It is also continuous in that it spreads across the sensory spectrum, from sound to vision. Inflection and gesture are essential to processing meaning, and such experience and interpretation is so incredibly integrated and automatic it operates as intuition does.

While the fundamental descriptive unit of language seems to be the word, with the description generating itself through the appearances of language acquisition, the fundamental unit of language seems to be the sequence of sounds, the sentence. The word “book” or for that matter the sound of the word has some basic meaning but no real rich semantics. What book? What’s it doing? Where is it? What’s in it? How thick is it? Do you even mean a thing with pages? Frankly we have no idea what questions even make sense to ask in the first place. The word and the sound alike seem devoid of context, seem completely empty of a single thought. But once we launch into a sentence, the book comes to life, to at least a bare minimum of utility, representation with correspondence to some reality. It seems the sentence is the first level at which language has information.

But it seems that the sentence, the meaning-melody of distinct thought, is composed more essentially with some visual representational content, something rudimentary that is pre-experiential (children blind from birth seem to have no profound barriers to becoming healthy and fully literate adult language users). There seems to be something visual that is degenerative in nature involved in language. Not generative. It seems that language comprehension is based on breaking down the continuous auditory signal into something very roughly visual and then the utterance becomes informative.

My take on such a process is really not so unusual but rather fundamental to one of the most important linguistic discoveries of the modern era. Wernicke believed that the input to both language comprehension and language production systems was the “auditory word image.”

So here’s what I’m thinking. Language’s syntax is not fundamentally linguistic per se nor compositional but rather sensory (audio-visual) and decompositional. So I wonder, is there some sort of syntax for vision, some decompositional apparatus? Or are we just getting back into rule-sets?

I think we can understand something fundamental in this syntax between the sensory and the linguistic. Linguistic decomposition, which is really either auditory or visual decomposition, becomes visual composition in understanding. Likewise, the visual must be decomposed before it can be composed into a sentence.

In other words, if we knew rules for visual decomposition we could automatically compose descriptions of scenes. Likewise we should be able to compose images from decomposition of linguistic signals.

And how do we do that without rules or functions?

But language is not pure sign, it is also a thing. This exteriority -word as object rather than sense- is an irreducible element within the signifying scene. Language is tied to voice, to typeface, to bitmaps on a screen, to materiality. But graphic traces, visualizations are irreducible to words. Their interpretation is never fully controllable by the writing scientist.

– Timothy Lenoir and Hans Ulrich Gumbrecht,
from the introduction to the Writing Science series

I am hung up on a concern about the application of text mining to scientific discovery from which I seem unable to shake free. That simple hang-up is due to the importance of visual analogy to scientific discovery and the rather trivial or secondary narration that follows it. That narrative content (see narrative fallacy – explaining an event post hoc so that it will seem to have a cause) is the very material that text mining seeks to leverage. Language is supposed to capture in some way the network of causes, many of them supposedly sufficient to help presage novel treatments, procedures, further explanations, and so on. But if the generative seed of discovery is visual analogy itself, no amount of linguistic-based reasoning, whether contextual, deductive, or inductive, can ever make new discoveries. Because the explanation is not equivalent to the image.

And yet. And yet we know that we can make discoveries by deductions from multiple texts, as Don Swanson has repeatedly shown us. But Swanson’s discoveries using disjoint literatures are marginal and hypothetical and remain in desperate need of empirical review. Disjoint literatures don’t appear to be radically increasing the speed at which scientific discovery is made, which means that the process of leveraging implicit multi-document logics is missing something essential.
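For readers who haven’t seen it, the logic behind Swanson’s disjoint literatures can be sketched in a few lines of Python. This is a toy under loud assumptions: documents are reduced to hand-made sets of terms, the function name `abc_candidates` and the scoring rule (the weaker of the two document frequencies) are my own illustration, and nothing here reproduces Swanson’s actual procedure. The idea is only that if literature A mentions B, and literature C mentions B, then B is a candidate bridge between A and C even when A and C never mention each other.

```python
from collections import defaultdict


def abc_candidates(literature_a, literature_c):
    """Rank linking terms shared by two otherwise disjoint literatures.

    Each literature is a list of documents; each document is a set of terms.
    Returns the shared terms, best supported first.
    """
    def doc_frequency(docs):
        # Count how many documents in a literature mention each term.
        counts = defaultdict(int)
        for terms in docs:
            for term in terms:
                counts[term] += 1
        return counts

    freq_a = doc_frequency(literature_a)
    freq_c = doc_frequency(literature_c)
    shared = set(freq_a) & set(freq_c)
    # Assumed scoring rule: a bridge is only as strong as its weaker side.
    scored = [(min(freq_a[t], freq_c[t]), t) for t in shared]
    # Highest score first; ties broken alphabetically for stable output.
    return [t for score, t in sorted(scored, key=lambda p: (-p[0], p[1]))]


# Toy data, loosely modeled on the fish oil / Raynaud's example.
fish_oil = [
    {"fish oil", "blood viscosity"},
    {"fish oil", "platelet aggregation", "blood viscosity"},
]
raynaud = [
    {"raynaud", "blood viscosity"},
    {"raynaud", "blood viscosity", "vasoconstriction"},
    {"raynaud", "platelet aggregation"},
]
bridges = abc_candidates(fish_oil, raynaud)
```

Note what the sketch cannot do: it surfaces shared vocabulary, and every candidate bridge still needs a human, and an experiment, to decide whether it means anything.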

I’ll venture a guess and say that pictures are missing.

If a picture is worth a thousand words, is the relation symmetric? That is to ask, given a thousand words, can we draw a picture? Could we, say, use hypothesis generation to augment the creation of visual metaphor apparently crucial to scientific discovery? Alternately, it seems that a picture is not inherently worth any word whatsoever, and that inequivalence is symmetric.

Most pictures generated these days via automated means are entirely dimensionless, metaphorically speaking. Graphs, trees, constellations of points in a space. But what makes our understanding of constellations rich? Ahh yes, those stars in our southern summer sky appear to look like a scorpion, become known as Scorpius, and that’s how we remember those specks, and that’s how we use them as well. Memory, after all, is inseparable from use. And yet those stars are no more a scorpion than a snake or a lock of hair or whatever else you can make up.

So it’s not enough perhaps to plot networks on a 2D screen. Why not compare those assemblages of seemingly random points to visual shapes? Why not retrieve the visual metaphor for an item automatically?

This however is utterly unconvincing. There’s no way, for example, special relativity could be arrived at in such a way. And yet, hold on just a sec, elements of the discovery of special relativity are in part a result of a visual search activity–Einstein imagining many rich ways of illustrating previous mathematical expressions and testing the illustrations to measure their utility, their usability, their ability to survive multiple looks and provide a rich metaphor capturing the scientific phenomenon. And then using those images to tell further stories, and then using those stories to generate more mathematical expressions. A picture is worth a thousand words and a thousand words is worth many pictures.

For my master’s thesis I performed a case study of a very large multinational drug company to evaluate how it innovates in text mining to drive its central mission of drug innovation. Drug discovery is hard and therefore expensive, but with high performance computing now a commodity, drug companies should be at the bleeding edge of text mining innovation, particularly in the area of virtual hypothesis formation and testing (deriving novel insights from mining multiple inputs, from clinical data stores to genetics databases to research literature collections and even the so-called grey literature). But guess what? At least with respect to the case I studied, they aren’t. They are highly focused on circa-1997 extraction tasks with little to no interest in statistical learning and a confused interest in taxonomies and automated inductive reasoning. They invest in formal logics and in information extraction but the meat in the middle, the statistical learning, is kept strictly to data mining of data sets severely limited in scope. Simply put, the company has little to no coherent and well-articulated vision of how it can tackle its most daunting problem for drug discovery: information overload.

How can this problem arise? Isn’t the central mission of a drug company, its core competency, to create new drugs? Well, historically it has been. But competing with the core competency of the drug company is another, oft conflicting, central mission, to make money. What this means for drug discovery is that it is only kind of important. The company I studied was laying off key drug innovators globally as it was focusing its investments further down the drug pipeline, placing more and more emphasis on Phase 2 & 3 projects, more on lower risk short term gains. What this means is that the central mission has become, to get drugs to market, particularly ones with a recurring revenue model.

Historically the drug companies could hang their hats on introducing drug treatments that have contributed to huge improvements in human health over the last century. Drug companies have been in the business of saving lives. Drugs are largely responsible for the 50%+ increase in life expectancy in the US over the last century.

Sometimes, however, human health improvements are not profitable. Sometimes drug companies will select strategies far less beneficial to human health that are far more financially beneficial to the organization. Consider the focus on marketing deregulation in the US, or FDA deregulation. Why invest in developing drugs when you can invest in removing barriers to sales? Now that deregulation has just about all but run its course, drug companies will soon face the fact that they will need to depend more and more on releasing new drugs. When the two largest drug companies in the world can’t combine for more than a dozen new drugs in any given calendar year, you can tell that something’s clearly broke. You can’t hang the shortage on regulation or on a shortage of actionable research.

So why the institutional emphasis away from innovation? One can only speculate; I will use Portfolio Theory to speculate. The dominant forces controlling large multinational drug companies are people of a certain kind, namely, aging investors. They have invested their dollars and expect something in return. Portfolio Theory tells us that our optimal investment trend as we age is to go from high risk to lower risk, income-generating investment. For example, I’m 35, and if I expect to live to, say, 80, I’m probably at least three decades from retirement, from a time where I need my investments to generate income. Because I have decades to invest, I can handle the risk of higher risk investments, namely because I don’t need the reliable income, and because I have time to recover if I lose. The game of investing depends entirely on how much time you perceive yourself as having, namely because on your death bed all the money in the world is worth nothing, but having a lot of cash on hand that last week before your death bed is pretty damned important. A promise for a check next week won’t do you good if you’re dead. And so I think, unlike me, the investors in large multinationals are old men, frankly. They need to allocate their assets on the income-generating end of the spectrum. They need that cash and they need it now.
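To put rough numbers on that age effect, here is a tiny Python sketch using the common “100 minus your age” rule of thumb for the share of a portfolio held in higher-risk assets. The rule is a popular heuristic and my own assumption for illustration; it is not a result of Portfolio Theory proper, and the function name `risky_share` is made up.

```python
def risky_share(age: int) -> float:
    """Rule-of-thumb fraction of a portfolio held in higher-risk assets.

    Assumed heuristic: hold (100 - age) percent in risky assets,
    clamped to the range 0..100 percent.
    """
    percent = max(0, min(100, 100 - age))
    return percent / 100


# The two ages that come up in this post: me, and my parents' generation.
share_at_35 = risky_share(35)
share_at_65 = risky_share(65)
```

Under this heuristic the risky share at 65 is a bit over half of what it is at 35, which is the shift toward income-generating assets described above.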

This asset allocation model is confirmed by the reduced interest in technological innovation and the increase of interest in being merely early adopters. Adopting established technologies carries a lower risk, as it has a higher probability of some payoff.

And so why invest in high risk, in innovation? The argument for it would be three-fold: to attract younger investors, to focus on the longevity and long-term stability of the company, and to be true to the core mission, which should be to treat health ailments. Maybe my experience is anecdotal, but at least to me it appears that investors in my parents’ generation (they’re 65) are far more likely to invest in, say, Pfizer or GSK, than investors my age. They’re safe, they’ll do OK next quarter, but that picture is very murky a decade from now. Not to mention that younger investors no longer see drug companies as beneficial to human health. There’s nothing attractive for the average younger investor.

One of the saddest consequences of this reluctance to innovate, this focus on profit, is the impact on human health. Drug companies are far more willing to repackage old drugs and market the heck out of them, renewing their proprietary charges, than to find new drugs. And when the drug companies choose new drugs to invest in, they are going to look for “comeback” drugs, drugs that cure nothing but treat indefinitely. No new antibiotics are reaching market because there’s no incentive
