Info Science Theory

I am attending the New Media Consortium (NMC) Conference at Princeton University today and tomorrow (Thursday, June 12 and Friday, June 13) and am giving a talk tomorrow with three of my favorite colleagues at Duke. Currently I am sitting in a morning session hosted by Susan Barnes and Stephen Jacobs of RIT discussing LMSs and social media. The session gave special focus to the formation, identification, and communication of student identity and its role in educational media, specifically LMSs.

Stephen and Susan posed an interesting question to the audience as to how students construct and communicate their identities to others. The answers tended to focus within contexts, ignoring the holistic nature of computing contexts and what a student’s presence or absence from those contexts communicates to others. If we have backend access to Facebook or an LMS we can certainly build a model of what a student is like, at least to some extent.

What is absent or tacit, and yet what may be most telling about students’ identities, how they construct them, and how they communicate them, is the presence/absence of students from multiple web contexts. What sites do they use? What don’t they use? How much do they even use the web? How long were they members (or active members) at sites? What years? Were they late-comers to MySpace? Are they students not on Facebook?

I think most web users at least have tacit knowledge of this sense of identifying people by identifying the character, location and scope of participation across the whole web: this is why people Google one another. The search results on a person project this sort of finely detailed whole that might tell us most.

Great presentation on social networks in LMSs by Susan and Stephen. I found it useful in part because it leads me to better understand & appreciate the value of person search engines and the creation of sites like ClaimID. This knowledge may also be of great help as we move forward with building our MacArthur DML Virtual Conflict Resolution website. Maybe we want to be aware of how we help students manage their identities as complex heterogeneous wholes.

The practical implication is that I may be using a combination of Ning and Moola for the Virtual Conflict Resolution project site. Maybe I will encourage students within the course sites to actually use ClaimID as they traverse their academic years to construct a sort of timeline of their web presences–and absences. Makes me want to apply our (Casey Alt’s) timeline software to ClaimID, or to the Virtual Conflict Resolution site.

After much anticipation an amazing new patent retrieval tool launched yesterday. SparkIP is an amazing new patent search tool of which my colleague (he is my boss-man really) Tim Lenoir is a founder. SparkIP combines the robust on-the-fly clustering of search results similar to Vivisimo’s Clusty but with a pretty incredible twist. The search engine results are navigated by the user in a visual way. Results are clustered, and first the user is presented not with patent results per se but rather with patent cluster results. The company refers to each cluster as a “SparkCluster Map.” Each of these cluster “maps” has numerous clusters within it. This set of cluster maps (shown here)


referred to as a landscape, is an excellent and robust way of reducing often-overwhelmingly-sized relevant document results while providing complex visual information about each cluster. This is truly a forward-looking tool in many respects but particularly in terms of generating intelligent and useful information about technologies, people, and institutions related to a keyword search. SparkIP has raised the bar on information retrieval right here. But your search is not done yet.

Given the landscape you can then select any of the specific cluster maps (seven in all were returned on “text mining”) by clicking directly on the map graphic. I selected the second cluster map, “information retrieval.” This then brings an enlarged view of the cluster map revealing the clusters within the map, shown here:


Then clicking onto one of the map nodes/clusters (I selected the “document information retrieval” node at the very center of the cluster map) you see a view called “Technology Detail” (shown below):


More information-overload-reducing brilliance on display here in SparkIP. First, note that while 61 patents were retrieved, only 10 were returned. Further, there are likely hundreds more patents relevant to “text mining.” What appears to be happening here is that SparkIP has developed patent-filtering heuristics “under the hood” that get rid of the high volume of junk patents cluttering any patent database. After all, many if not most patents are created by their originators for purposes other than to stake a claim on a highly specific technology. Many a business game is played with patents as the pieces. An organization might want to try and occupy an intellectual property space to see if it can land licensing suckers. Other patents are premature. Some others overreach or are incredibly vague and therefore unenforceable. And so on.

There are a number of small problems with the interface as with many a beta product. The back button removes you entirely from your search results rather than helping you navigate backwards from, say, technology detail view to cluster map view. The meaning of visual iconography such as cluster map node size or color, while intuitive, is not altogether clear just from naively using the tool.

But wait folks, that’s not all. In addition to keyword-to-landscape patent search SparkIP will also open up an eBay-esque marketplace for intellectual property. I don’t know if that part is already live or not. I hope to have more time to play around with the site in the coming days.

SparkIP was founded at Duke University through collaboration between Dr. Lenoir, current Pratt School of Engineering Dean Rob Clark, and Johns Hopkins Provost and Senior Vice President for Academic Affairs Kristina Johnson. Since joining Lenoir at Duke I’ve had a couple of small windows of opportunity to provide some technical advice on cluster metrics with SparkIP engineer (and founder) Kevin Webb. But I never even got to see a demo of this thing. And let me tell you, man, this thing is amazing. I put this tool right up there with Clusty and the TRIP evidence-based medicine site as a retrieval tool among the best since the arrival of Google beta.

Congratulations to you Tim, and to you Kevin, and to the rest of the SparkIP team.

Last evening during the weekly Duke FOCUS cluster meeting we enjoyed a talk from Duke OIT AVP and Croquet principal architect Julian Lombardi. Julian is also aligned with ISIS at Duke which is where I enjoy the opportunity to teach on occasion. I can’t say enough how much of a neat guy Julian is, or how his presentation on Croquet was absolutely fascinating. Suffice it to say he had my head nodding in agreement and his ideas were controversial enough to get the freshmen in the room to make smart-alecky remarks. If that’s not a positive sign of innovation I don’t know what is.

I promised that this would be a note, and it will be a note. I promise.

Julian asked those attending his presentation last evening why we use computers that are overpowered and undercollaborated (to coin a word), why we use machines with seemingly prehistoric interface tools like a mouse and keyboard. Further he asked why we don’t have better technologies that work better with the way we work. I’m not sure how he answered this question except to say that we need to engineer software that supports “deep collaboration,” as Julian called it. I think Julian was suggesting that we were sort of stuck in our ways and that we just weren’t picking up available technologies, sticking instead to old guns.

I don’t think the problem is that simple. In fact I suspect there are two significant problems, one intellectual, the other sociological.

The intellectual problem is that I suspect few if any actually understand what “deep collaboration” really is. It appears to me that we are only starting to understand collaboration as a phenomenon, and then only a phenomenon of a digital variety, and then only through data about how people use collaborative technologies. That type of understanding seems to be a sort of cart-leading-the-horse phenomenon.

I don’t think (but I certainly do not know for certain) we have a very good understanding of phenomena such as tacit knowledge, communities of practice, activity theory, and so forth. Do we possess a very good grounding insofar as understanding how people work together? How they have worked together? How people might work together?

Funny that Julian called the web pages “brochures.” He’s right, they are brochures. I love that perspective he shared as it made me laugh and then blush. It also appears to me that we’re a pamphlet-publishing culture in general so web-publishing activities seem to actually comprise an adequate reflection of the way we seem to work. After all, where in our culture are we not engaged in this sort of pamphlet-publishing work mode? You’re going to have to go far outside information technology in order to respond (e.g., construction). It appears to me that knowledge workers of all sorts work in a rather linear fashion and this is perhaps not surprising since our concepts of ourselves as subjects arise from engaging in linear tasks such as writing, reading, watching, all coded in terms of first-person perspective.

Collaboration at least in US technoculture seems to find its apex of sophistication in the assembly of multiple independently-produced pamphlets. We can see this even in open-source software development projects where repositories are open to collaboration. With tools like CVS we lock out as we write to a file and resolve conflicts with any other code-pamphlets that have been written concurrently.

So the intellectual problem appears to have at least three components each of which should be explored independent of computational technology: knowing how we know how to work together; knowing how humans have worked together in the past; and divining how we might expand on the combination of past collaboration modes and knowledge of tacit knowledge to innovate new collaborative paradigms. I think this is an area ripe for intellectual innovation, and I don’t think such an effort should be limited to software engineering.

If this sort of intellectual problem has already been conquered then I admit I just completely missed it.  But I currently see that there is a huge gap between our understanding of the cognitive dimensions of collaboration and the understanding of how people use, say, Facebook, to collaborate with one another.  What is the biology, the phenomenology, and the behavior of human interaction?

The sociological problem is simply that such innovative interfaces have lacked, for a huge number of reasons, crossover to early adopters. Who are early adopters? The cool hip techies to whom the masses look for what’s hot, what’s cool, those who bellwether their intellectual and geographic locales. Those of us who are into inventing are not very good at engineering social transitions and we don’t make early adopters at all. And when we lack early adopters we lack, well, adoption itself, don’t we?

Was this just a note?

When reading some Steven Pinker a couple of years back I wondered whether language could be better understood via sound, sentence, and vision rather than by words and rules as Pinker suggests (see his Words and Rules). Rules seem to be elements of narration we use or rather abuse to divine a neat model of causality. However there seems to be very little in biology that’s rather rule-like. Biology is inherently anti-functional, at least in the strict mathematical sense of the word function. Cells and subcellular systems can and do appear to regularly do different things given the same input. And that’s assuming we can even truly tightly control an input to a biological system in any meaningful (i.e., in vivo) way. Weak and strong AI proponents would have us think that neurons are analogues for computer circuits, but the complexity of neural matter is hardly reducible to such a model without sacrificing crucial information.

Rules just don’t seem inherent to language. Words, however, do seem on some level fundamental to language. From a textual perspective certainly. We can see evidence for this in many ways; in my experience the evidence is in building representations of document collections for various text mining experiments. But from an oral perspective, are words fundamental?

Spoken language seems far more continuous than written language not only from a processing standpoint but also from a sensory point of view. Spoken language is experienced and performed in a rather continuous way; words are deduced in learning language, but it remains to be shown whether words are in and of themselves mere narrative convenience for explaining how we understand language rather than language itself. These sounds continue rather fluidly within sentences. The auditory experience of language is that the most coarse break, the most distinct break, is the break between sentences. But spoken language is not just continuous in the way it is serially composed and experienced in an auditory fashion. It is also continuous in that it spreads across the sensory spectrum, from sound to vision. Inflection and gesture are essential to processing meaning, and such experience and interpretation is so incredibly integrated and automatic it operates as intuition does.

While the fundamental descriptive unit of language seems to be the word, with the description generating itself through the appearances of language acquisition, the fundamental unit of language seems to be the sequence of sounds, the sentence. The word “book” or for that matter the sound of the word has some basic meaning but no real rich semantics. What book? What’s it doing? Where is it? What’s in it? How thick is it? Do you even mean a thing with pages? Frankly we have no idea what questions even make sense to ask in the first place. The word and the sound alike seem devoid of context, seem completely empty of a single thought. But once we launch into a sentence, the book comes to life, to at least a bare minimum of utility, representation with correspondence to some reality. It seems the sentence is the first level at which language has information.

But it seems that the sentence, the meaning-melody of distinct thought, is composed more essentially with some visual representational content, something rudimentary that is pre-experiential (children blind from birth seem to have no profound barriers to becoming healthy and fully literate adult language users). There seems to be something visual that is degenerative in nature involved in language. Not generative. It seems that language comprehension is based on breaking down the continuous auditory signal into something very roughly visual and then the utterance becomes informative.

My take on such a process is really not so unusual but rather fundamental to one of the most important linguistic discoveries of the modern era. Wernicke believed that the input to both language comprehension and language production systems was the “auditory word image.”

So here’s what I’m thinking. Language’s syntax is not fundamentally linguistic per se nor compositional but rather sensory (audio-visual) and decompositional. So I wonder, is there some sort of syntax for vision, some decompositional apparatus? Or are we just getting back into rule-sets?

I think we can understand something fundamental in this syntax between the sensory and the linguistic. Linguistic decomposition, which is really either auditory or visual decomposition, becomes visual composition in understanding. Likewise, the visual must be decomposed before it can be composed into a sentence.

In other words, if we knew rules for visual decomposition we could automatically compose descriptions of scenes. Likewise we should be able to compose images from decomposition of linguistic signals.

And how do we do that without rules or functions?

But language is not pure sign, it is also a thing. This exteriority -word as object rather than sense- is an irreducible element within the signifying scene. Language is tied to voice, to typeface, to bitmaps on a screen, to materiality. But graphic traces, visualizations are irreducible to words. Their interpretation is never fully controllable by the writing scientist.

– Timothy Lenoir and Hans Ulrich Gumbrecht,
from the introduction to the Writing Science series

For my master’s thesis I performed a case study of a very large multinational drug company to evaluate how it innovates in text mining to drive its central mission of drug innovation. Drug discovery is hard and therefore expensive, but with high performance computing now a commodity, drug companies should be at the bleeding edge of text mining innovation, particularly in the area of virtual hypothesis formation and testing (deriving novel insights from mining multiple inputs, from clinical data stores to genetics databases to research literature collections and even the so-called grey literature). But guess what? At least with respect to the case I studied, they aren’t. They are highly focused on circa-1997 extraction tasks with little to no interest in statistical learning and a confused interest in taxonomies and automated inductive reasoning. They invest in formal logics and in information extraction but the meat in the middle, the statistical learning, is kept strictly to data mining of data sets severely limited in scope. Simply put, the company has little to no coherent and well-articulated vision of how it can tackle its most daunting problem for drug discovery: information overload.

How can this problem arise? Isn’t the central mission of a drug company, its core competency, to create new drugs? Well, historically it has been. But competing with the core competency of the drug company is another, oft conflicting, central mission, to make money. What this means for drug discovery is that it is only kind of important. The company I studied was laying off key drug innovators globally as it was focusing its investments further down the drug pipeline, placing more and more emphasis on Phase 2 & 3 projects, more on lower risk short term gains. What this means is that the central mission has become, to get drugs to market, particularly ones with a recurring revenue model.

Historically the drug companies could hang their hats on introducing drug treatments that have contributed to huge improvements in human health over the last century. Drug companies have been in the business of saving lives. Drugs are largely responsible for the 50%+ increase in life expectancy in the US over the last century.

Sometimes, however, human health improvements are not profitable. Sometimes drug companies will select strategies far less beneficial to human health that are far more financially beneficial to the organization. Consider the focus on marketing deregulation in the US, or FDA deregulation. Why invest in developing drugs when you can invest in removing barriers to sales? Now that deregulation has just about all but run its course, drug companies will soon face the fact that they will need to depend more and more on releasing new drugs. When the two largest drug companies in the world can’t combine for more than a dozen new drugs in any given calendar year, you can tell that something’s clearly broke. You can’t hang the shortage on regulation or on a shortage of actionable research.

So why the institutional emphasis away from innovation? One can only speculate; I will use Portfolio Theory to speculate. The dominant forces controlling large multinational drug companies are people of a certain kind, namely, aging investors. They have invested their dollars and expect something in return. Portfolio Theory tells us that our optimal investment trend as we age is to go from high risk to lower risk, income-generating investment. For example, I’m 35, and if I expect to live to, say, 80, I’m probably at least three decades from retirement, from a time where I need my investments to generate income. Because I have decades to invest, I can handle the risk of higher risk investments, namely because I don’t need the reliable income, and because I have time to recover if I lose. The game of investing depends entirely on how much time you perceive yourself as having, namely because on your death bed all the money in the world is worth nothing, but having a lot of cash on hand that last week before your death bed is pretty damned important. A promise for a check next week won’t do you good if you’re dead. And so I think, unlike me, the investors in large multinationals are old men, frankly. They need to allocate their assets on the income-generating end of the spectrum. They need that cash and they need it now.

This asset allocation model is confirmed by the reduced interest in technological innovation and the increase of interest in being merely early adopters. Adopting established technologies carries a lower risk, as it has a higher probability of some payoff.

And so why invest in high risk, in innovation? The argument for it would be three-fold: to attract younger investors, to focus on the longevity and long-term stability of the company, and to be true to the core mission, which should be to treat health ailments. Maybe my experience is anecdotal, but at least to me it appears that investors in my parents’ generation (they’re 65) are far more likely to invest in, say, Pfizer or GSK, than investors my age. Those companies are safe, they’ll do OK next quarter, but that picture is very murky a decade from now. Not to mention that younger investors no longer see drug companies as beneficial to human health. There’s nothing attractive for the average younger investor.

One of the saddest consequences of this reluctance to innovate, this focus on profit, is the impact on human health. Drug companies are far more willing to repackage old drugs and market the heck out of them, renewing their proprietary charges, than to find new drugs. And when the drug companies choose new drugs to invest in, they are going to look for “comeback” drugs, drugs that cure nothing but treat indefinitely. No new antibiotics are reaching market because there’s no incentive

The following comprises a collection of my intuitions and “big picture” insights resulting from graduate study focused on text mining at SILS. These are insights related to feature representation, knowledge engineering, model building, the application of statistics to real-life phenomena, and the greater whole of information science.

Many of these apparently go without saying, yet so many discussions of supposed problems would go away if some of these observations were made explicit. This is my attempt to make them explicit. Maybe it goes without saying that expressing the obvious is sometimes quite necessary.

1. Statistical models often fail because they’re missing key attributes necessary to describe the phenomena they represent

Attributes that are altogether unrecognized, difficult to quantify, difficult to analyze, truncated out, or simply forgotten arguably dominate and confound the predictive/explanatory power of statistical models. These missing variables abound. Their absence dominates to the point where theory itself must give way to empiricism and its sister, skepticism. It also means that we simply don’t see everything and that it never hurts to try and see more things.
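A toy illustration of the point, as a minimal sketch rather than anything drawn from a real data set: the four rows below are invented, and the label is deliberately constructed as the XOR of two attributes (the names `x1`, `x2` and the helper `best_lookup_accuracy` are hypothetical). With both attributes a simple lookup model is perfect; with one attribute missing, no model over the remaining attribute can beat a coin flip.

```python
from collections import Counter, defaultdict

# Invented toy data: the label is the XOR of two binary attributes.
ROWS = [
    {"x1": 0, "x2": 0, "label": 0},
    {"x1": 0, "x2": 1, "label": 1},
    {"x1": 1, "x2": 0, "label": 1},
    {"x1": 1, "x2": 1, "label": 0},
]


def best_lookup_accuracy(rows, attributes):
    """Accuracy of the best lookup-table model that sees only `attributes`.

    Rows are grouped by their values on the chosen attributes, and each
    group predicts its majority label.
    """
    groups = defaultdict(Counter)
    for row in rows:
        key = tuple(row[a] for a in attributes)
        groups[key][row["label"]] += 1
    correct = sum(counts.most_common(1)[0][1] for counts in groups.values())
    return correct / len(rows)


full = best_lookup_accuracy(ROWS, ["x1", "x2"])  # model sees both attributes
missing = best_lookup_accuracy(ROWS, ["x1"])     # x2 was never recorded
print(full, missing)
```

The example is extreme on purpose. Real missing attributes rarely destroy all predictive power, but the mechanism is the same: the model is scored against a phenomenon it can only partly see.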
2. Feature reduction of highly dimensional linguistic data sets is a misguided, outdated and counterproductive approach

There. I said it.

Claude Shannon’s model of information as that which is located among noise is a metaphor that appears to have been misleading a number of people in information science, particularly those involved with anything even remotely tangential to text mining (or, if you must, “knowledge discovery”). Information in an atomic form (e.g., bits) allows for the differentiation of signal and noise. A bit either is a signal or it isn’t. Attributes of real-life phenomena (e.g., average first down yardage in football for a team) are not like bits, at least not in the way we experience them and interpret those phenomena, whether in written explanations or in databases. “Real-life” phenomena comprise different sorts of real-world features that can never be honestly reduced to their atomic constituents. And, pragmatically speaking, they won’t be reduced to quantum atomic states any time soon.

Given that every attribute of real-world phenomena we identify partakes of both signal and noise, the removal of any attribute (save for the case of redundancy) always corresponds to the loss of information. Ultimately the statistical modeling of phenomena such as competitive sports and stock markets and clinical emergency room chief complaints is wholly unlike modeling communication channels. There’s something immediately discontinuous about binary electronic signals while these other phenomena need dramatic interpretive steps before they can be represented with discontinuous electronic signals. Finally, signal and noise are terms that don’t apply very well because that which we are modeling can only be realistically described by features that are both informative and misleading at the same time.
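Somewhat ironically, the claim that dropping an attribute loses information can be stated in Shannon’s own vocabulary. The notation here is my assumption, not something from the argument above: let X_1 be the attributes we keep, X_2 the attribute we drop, and Y the thing we are trying to predict. The chain rule for mutual information gives

```latex
I(X_1, X_2; Y) \;=\; I(X_1; Y) \;+\; I(X_2; Y \mid X_1) \;\ge\; I(X_1; Y)
```

Since conditional mutual information is never negative, removing X_2 can only keep or reduce what the representation says about Y. Equality holds exactly when I(X_2; Y | X_1) = 0, that is, when X_2 tells us nothing about Y beyond what X_1 already does, which is the redundancy exception noted above.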

There’s something rather continuous about language (something that latent semantic indexing attempts to capture), and even the simplest of approaches, such as applying stop word lists to bag-of-words representations, lose critical information that dictates the semantics of the document. “Dog,” “a dog” and “the dog” quite clearly mean different things, as do “of the dog”, “out of a dog” and so forth. Representing all of those quotations as “dog” or going a step further and representing all of these quotes with the very same word-sense identifier, dumbs down human language beyond recognition. Garbage in, garbage out is a phrase I learned more than a quarter century ago when learning to program games for the Commodore Vic-20.
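The “dog” example can be made concrete with a few lines of Python. This is only a sketch: the stop word list is a tiny one I made up for illustration (not any standard list), and the bag of words is represented as a sorted tuple of tokens so that two phrases with the same bag compare equal.

```python
import re

# A small illustrative stop word list (an assumption, not a standard list).
STOP_WORDS = frozenset({"a", "an", "the", "of", "out"})


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def bag_of_words(text, stop_words=frozenset()):
    """Return the bag of words as a sorted tuple, dropping stop words."""
    return tuple(sorted(t for t in tokenize(text) if t not in stop_words))


PHRASES = ["Dog", "a dog", "the dog", "of the dog", "out of a dog"]

raw = {bag_of_words(p) for p in PHRASES}
filtered = {bag_of_words(p, STOP_WORDS) for p in PHRASES}

print(len(raw))       # number of distinct representations without filtering
print(len(filtered))  # number of distinct representations after filtering
```

Without the stop list the five phrases keep five distinct representations; with it they all collapse to the single bag containing only “dog”, and whatever separated them is gone before any learning algorithm sees the data.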

Reading a text book from 1993 on the C4.5 algorithm, I came across reflections that some crucial elements of C4.5 appeared to be motivated by economizing on computer resource issues. Not enough memory, too slow of processing, etc. In 2007 high performance computing is a commodity. The pressures for feature reduction in machine learning needed to be heeded 14 years ago, but they’re considerably less of an issue today.

Finally, at the very end of my stretch of graduate school studies I accidentally came across a new strategy for feature representation that is so painfully obvious in retrospect it leaves me wondering why no one else has been doing this. Fortunately for Hypothia it spells one very big competitive advantage. But I digress.

3. There’s always something missing from your set of attributes (cf. 1 & 2)

4. There’s no substitute for knowing your data set (cf. 1)

I credit this oft-neglected, oft-devalued approach to my first and truly excellent data mining instructor, Miles Efron, who may be to blame for turning me on to text mining in the first place. What have you wrought? He made sure to repeat this lesson of knowing thy data a few times, and the lesson was surely not lost on me. In fact it seems as if it frames and justifies my confidence in my approach.

5. [DELETED] and let your algorithms optimize your attributes for maximal classification margin (cf. 2 & 3)

Can’t say the deleted part yet. But I will, eventually. It probably should be obvious by now. But still I’m not prepared to say.

6. SVM+SMO is very good for binary classification of highly dimensional data (cf. 5)

Improvements to SVM+SMO are always welcome of course, and it appears there are now numerous implementations of SVM that improve on it. I should note that, according to Eibe Frank, SMO in Weka (written in Java) is just as fast as Joachims’ SVM-light written in C. SMO’s pretty good.

SMO solves the QP problem created by SVM efficiently.
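For reference, the QP in question is usually written as the soft-margin SVM dual. The notation is the standard textbook one rather than anything specific to Weka or SVM-light: n training examples x_i with labels y_i in {-1, +1}, a kernel K, a regularization constant C, and one multiplier alpha_i per example.

```latex
\max_{\alpha}\;\; \sum_{i=1}^{n} \alpha_i
  \;-\; \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n}
        \alpha_i \alpha_j \, y_i y_j \, K(x_i, x_j)
\qquad \text{subject to} \qquad
0 \le \alpha_i \le C, \quad \sum_{i=1}^{n} \alpha_i y_i = 0
```

SMO (sequential minimal optimization) works on two multipliers at a time. The equality constraint means a single multiplier cannot change on its own, and the resulting two-variable subproblem has a closed-form solution, so no general-purpose QP solver is needed.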

7. You always need more computing power (cf. 2, 5 & 6)

The curse is not dimensionality, the curse is not intellectual. The curse is economic, a problem of resources.
Likely it will be difficult to produce a dataset that is intractable for a good HPC setup running SVM+SMO but it doesn’t exactly hurt to try as long as you’re trying to harness more and more power.

8. You don’t know everything (cf. 3 & 4)

9. Models forecast well in forecast-influenced environments only when the model has an information advantage over other models (information asymmetry, competitive advantage)

10. You’ll never get it quite right ( cf. 8 )

11. There’s always more left to do (cf. 5, 7 & 10)

12. Disambiguation can be better pursued not in any pure sense by machinic strategies but rather by the messier approach of utilizing the greater context surrounding term, document, and corpus, which in turn permits some degree of ambiguity, which is necessary for understanding

13. Word sense disambiguation is quite possibly the wrong way to go to conjure semantics in one’s text representation (cf. 2 & 12)

As I’ve written before, there are other approaches available to leverage semantic information that are better than word-sense disambiguation (WSD).

14. More formally, the incorporation of ambiguity into linguistic representations (i.e., representing all possible word senses/meanings and POSs for any given word) allows for better representations of intelligence than ones produced at least in part through WSD strategies
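A minimal sketch of the contrast, with loud caveats: the sense inventory below is invented for illustration (a real system would draw senses and POS tags from a lexicon), the feature naming scheme `word/POS/gloss` is my own, and `first_sense` is a deliberately naive stand-in for a disambiguator. This is not the representation strategy alluded to in item 5.

```python
from collections import Counter

# Hypothetical sense inventory: each entry is "word/POS/gloss".
SENSE_INVENTORY = {
    "book": ["book/NOUN/volume", "book/VERB/reserve"],
    "bank": ["bank/NOUN/finance", "bank/NOUN/river", "bank/VERB/tilt"],
}


def senses_of(token):
    """Candidate senses for a token; unknown tokens keep their surface form."""
    return SENSE_INVENTORY.get(token, [token])


def wsd_features(tokens, pick_sense):
    """WSD-style representation: one committed sense per token."""
    return Counter(pick_sense(t, senses_of(t)) for t in tokens)


def ambiguous_features(tokens):
    """Ambiguity-preserving representation: every candidate sense is kept."""
    return Counter(s for t in tokens for s in senses_of(t))


def first_sense(token, candidates):
    """Naive stand-in for a disambiguator: always take the first listed sense."""
    return candidates[0]


tokens = "book a table by the bank".split()
committed = wsd_features(tokens, first_sense)
preserved = ambiguous_features(tokens)
print(committed)
print(preserved)
```

Here the naive disambiguator commits “book” to the noun reading, so the verb reading that the sentence actually calls for is unrecoverable downstream. The ambiguity-preserving version is larger, but it leaves the choice to whatever model consumes the features.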

15. For artificial intelligence to become smarter than humans, it must at least be as smart as humans first.  A person’s ability to understand multiple senses of a given word at once (of which poetry is perhaps the most striking example) is strikingly intelligent and far more intelligent than most WSD approaches I’ve seen (cf. 14).  And when you consider that the basic unit of meaning is truly not the word but the sentence, WSD seems all the more foolish, and yet makes me feel there’s a huge opportunity to understand language from its wholes and holes.  Discourse analysis anyone?

16. Not knowing everything, not always getting it right, and always having more left to do makes the hard work a great deal of fun. Discoveries are everywhere waiting to be written into existence. (cf. 8, 10, 11)

17. Don’t panic, be good, and have fun. (cf. 16)

18. The essence of human language is nothing less than the totality of the human language in all of its past present and future configurations and possibilities.
