Journal of Data and Information Science, 2017, 2(4): 43-64
doi: 10.1515/jdis-2017-0019
Rediscovering Don Swanson: The Past, Present and Future of Literature-based Discovery
Neil R. Smalheiser
Department of Psychiatry, University of Illinois at Chicago, Chicago, IL 60612, USA


Purpose: The late Don R. Swanson was well appreciated during his lifetime as Dean of the Graduate Library School at University of Chicago, as winner of the American Society for Information Science Award of Merit for 2000, and as author of many seminal articles. In this informal essay, I will give my personal perspective on Don’s contributions to science, and outline some current and future directions in literature-based discovery that are rooted in concepts that he developed.

Design/methodology/approach: Personal recollections and literature review.

Findings: The Swanson A-B-C model of literature-based discovery has been successfully used by laboratory investigators analyzing their findings and hypotheses. It continues to be a fertile area of research in a wide range of application areas including text mining, drug repurposing, studies of scientific innovation, knowledge discovery in databases, and bioinformatics. Recently, additional modes of discovery that do not follow the A-B-C model have also been proposed and explored (e.g. so-called storytelling, gaps, analogies, link prediction, negative consensus, outliers, and revival of neglected or discarded research questions).

Research limitations: This paper reflects the opinions of the author and is neither a comprehensive nor a technically based review of literature-based discovery.

Practical implications: The general scientific public is still not aware of the availability of tools for literature-based discovery. Our Arrowsmith project site maintains a suite of discovery tools that are free and open to the public, as does BITOLA, which is maintained by Dimitar Hristovski, and Epiphanet, which is maintained by Trevor Cohen. Bringing user-friendly tools to the public should be a high priority, since even more than advancing basic research in informatics, it is vital that we ensure that scientists actually use discovery tools and that these are actually able to help them make experimental discoveries in the lab and in the clinic.

Originality/value: This paper discusses problems and issues which were inherent in Don’s thoughts during his life, including those which have not yet been fully taken up and studied systematically.

Key words: Literature-based discovery ; Biography ; Text mining ; Knowledge discovery in databases ; Implicit information ; Information science

1 Introduction

Don R. Swanson (1924-2012) was well appreciated during his lifetime as Dean of the Graduate Library School at University of Chicago, as winner of the American Society for Information Science Award of Merit for 2000, and as author of many seminal articles (Figure 1). Don became Emeritus in 1996, but did not truly retire until around 2007, when he suffered a series of strokes. Around 10 years ago, Tanja Bekhuis (2006) wrote a review article that discussed Don’s contributions and their subsequent influence on bioinformatics and text mining. Recently, Sebastian, Siew, & Orimaye (2017a) have published a comprehensive review from a technical standpoint, and the reader is urged to consult this article for an overview of existing and emerging methods that are being applied to the field of literature-based discovery. Here I give a more personal perspective. In particular, I will include a discussion of problems and issues which were inherent in Don’s thoughts during his life, but which have not yet been fully taken up and studied systematically.

The first thing to realize about Don is that Don is not short for Donald. Don was his legal first name. Do not make that mistake, please—it irritated him to no end!

The second thing to realize is that my relationship with Don was idyllically intellectual in nature. I call my collaboration with Don my “Garage Band” period—the term referring to buddies who spend their free time playing rock music in their garages, playing out of sheer enjoyment, and oblivious of the outer world at large. We were unconcerned whether our research would be seen as important by others, whether it would be published in high-impact journals, whether we would secure grant funding, or other non-scientific concerns that too often drive research efforts.

Figure 1.
Don R. Swanson.

2 Undiscovered Public Knowledge

Perhaps the most influential and enduring contribution that Don made to information science is the concept of “undiscovered public knowledge” (UPK), which he approached from a very broad, philosophical standpoint (Swanson, 1986a). The philosopher of science Karl Popper had envisioned that man exists in three worlds: World I is the objective, real world which scientists seek to learn about; World II includes the thoughts and mental activities of scientists; and World III consists of the products of scientists, in particular, the published articles that express findings, models, assertions, and so forth (Popper, 1978). Just as man cannot hope to have perfect knowledge of reality (World I), Don realized that man cannot have perfect knowledge of World III either. Knowledge can be public (e.g. it is published) and at the same time, inaccessible or imperfectly known for one reason or another.

Undiscovered public knowledge encompasses several distinct scenarios. For example, one may ask: How many articles are published that no one reads—no one at all besides the author and (we hope) the reviewers? Information contained in such articles is, indeed, public yet undiscovered.

How much information is contained in articles that few can find, because the article is poorly indexed by Web of Science or by online search engines? Such articles may have been published without a digital presence, or placed in a journal that has limited circulation or low visibility.

A related type of information loss occurs when someone publishes an important article in an obscure or topically inappropriate journal, so that no one will take the finding seriously even if they see it. Few people have the self-confidence to recognize a breakthrough when it comes without the imprimatur of acceptance by a prestigious journal. An example of this happened quite recently: “This German retiree solved one of world’s most complex maths problems—and no one noticed” (Wolchover, 2017). Thomas Royen wrote a paper proving the Gaussian correlation inequality (GCI) and posted a preprint in the arXiv repository; when his work failed to get recognized, he chose to get his proof out in an obscure journal called the Far East Journal of Theoretical Statistics. He might as well have put it in a bottle and thrown it in the ocean!

Some of my own informatics discoveries have been closely related to undiscovered public knowledge. For example, my group discovered that many mammalian microRNAs are derived from genomic repeat elements (Smalheiser & Torvik, 2005). Although we came to this realization through computational studies (Smalheiser & Torvik, 2004), in retrospect the discovery could have been made simply by inspection of the public data available at the UCSC Genome Browser. This website brings together dozens of different types of genomic data that are calculated or measured, for example, predicted transcription factor binding regions, cross-species conservation levels, and so on. Each type of data is superimposed on the reference genome, and users can open up and visualize any number of the data sets to observe them in juxtaposition with each other. Two of the data tracks show a) positions of known microRNA genes in red and b) Repeatmasker output, which identifies genomic repeat elements in two shades of grey (Figure 2). If anyone had opened up these two tracks and looked carefully, they would have seen that many of the microRNAs lie within sequences encompassed entirely by specific genomic repeat elements. The fact that no one did this indicates that this knowledge was public, yet undiscovered.

Figure 2.
Screenshot of UCSC Genome Browser showing the sequence for human mir-95 juxtaposed to tracks for genomic repeats. The genomic region of the mir-95 sequence corresponds to two LINE2 elements in opposite orientations. This provides evidence that, when transcribed into RNA, these LINE2 elements bind each other, creating the hairpin secondary structure that permits the processing of this sequence by enzymes (Drosha & Dicer) to form a microRNA (Smalheiser & Torvik, 2005).

3 Two Medical Literatures Logically but not Bibliographically Connected

The most novel and fruitful type of undiscovered public knowledge discussed by Don occurs when information is not explicitly discussed in any single article at all. Rather, different assertions and findings need to be assembled across documents to create a new coherent assertion, much as different pieces of a puzzle are assembled to create a single picture.

But how to find these pieces residing in scattered places across the literature, and how to assemble them? Don focused his analyses on first identifying two sets of articles, or literatures, which appear to be complementary (see below) yet are not directly connected to each other. Such literatures are unconnected if they do not have any articles in common, do not have authors in common, and articles in one literature do not cite any articles in the other literature (Swanson, 1987).

In a series of articles in the 1980s, Don analyzed two classic examples of medical literatures that were not (or only slightly) connected, yet contained multiple links of the form “A affects B” in one literature and “B affects C” in the other, such that, when brought together and assembled, they created a persuasive, novel hypothesis. These have become widely analyzed benchmarks for nearly all subsequent studies of literature-based discovery.

The first case was the set of articles on Raynaud’s disease vs the set of articles on fish oil (Swanson, 1986b). Don noticed that several of the pathological alterations that occur in Raynaud’s disease corresponded to physiological alterations that are produced by ingesting fish oil, only in opposite directions. That suggests that ingesting fish oil should counteract some of the signs and symptoms of Raynaud’s disease. Subsequent clinical studies supported this hypothesis (Swanson, 1993).

The second case was the set of articles on dietary magnesium vs the set of articles on migraine headaches (Swanson, 1988). Again, Don noticed that magnesium deprivation has multiple effects in the body that are similar to alterations that are known to worsen migraine headaches, and magnesium itself has effects which should be expected to prevent or treat migraines. For example, magnesium is a calcium channel blocker, and reduces neuronal excitability via blockade of NMDA glutamate receptors. Thus, he proposed that supplementation with dietary magnesium may prevent or alleviate migraines. Again, subsequent clinical studies supported this hypothesis (Swanson, 1993).

Don made further analyses of complementary unconnected literatures, both by himself (Swanson, 1990) and in collaboration with me (e.g. Swanson, Smalheiser, & Bookstein, 2001; Smalheiser & Swanson, 1994, 1996a, 1996b, 1998). It is noteworthy that late in his career, Don proposed a link between atrial fibrillation and running (Swanson, 2006). Exercise is known to be a risk factor for atrial fibrillation, and he proposed that this may be mediated by gastroesophageal reflux, which in turn may be alleviated by taking proton pump inhibitors. Besides being another masterful, insightful example of putting together separate pieces of evidence to form a new whole, it is worth mentioning that these analyses were all based on conditions he experienced himself. He had Raynaud’s syndrome, and he had migraine headaches. And, his chronic atrial fibrillation eventually caused his strokes and led to his withdrawal from active life.

4 Use of Implicit Information to Bridge Disparate Literatures

It is important to acknowledge a tension between two different meanings of the term “knowledge discovery.” One meaning, the one I started with, is to assemble pieces of information into new wholes that represent new, promising, or surprising research directions or provide potentially transformative or breakthrough insights. The other meaning is to analyze and synthesize existing data to impute new but otherwise predictable, everyday information. An example of this is using first names to predict the gender of individuals. Most of the “Jane” and “Linda” individuals will be female, and most of the “Boris” and “John” individuals will be male. But regardless of which type of discovery we are talking about, to my knowledge, all systematic algorithmic methods for knowledge discovery involve linking different literatures or entities via implicit features that they share. In the case of gender prediction, US Census data can be used to associate first names of individuals in the United States with their reported genders; by aggregating the results over all individuals, each first name is associated with a gender balance score (% females/% males). This becomes reference information that is used to impute gender for a given name instance in some other database. The reference information is implicit because it derives from information that is not explicitly present within the database.
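The gender imputation example can be sketched in a few lines. This is only an illustration under stated assumptions: the records are invented toy data standing in for census-style reference information, the score is simplified to the fraction of females per name, and the function names and the 0.9 threshold are hypothetical choices.

```python
from collections import defaultdict

def build_reference(records):
    """Aggregate (first_name, gender) pairs into a per-name fraction of females."""
    counts = defaultdict(lambda: {"F": 0, "M": 0})
    for name, gender in records:
        counts[name.lower()][gender] += 1
    return {name: c["F"] / (c["F"] + c["M"]) for name, c in counts.items()}

def impute_gender(name, reference, threshold=0.9):
    """Impute gender for a name found in some other database (hypothetical rule)."""
    frac_female = reference.get(name.lower())
    if frac_female is None:
        return "unknown"      # name absent from the reference information
    if frac_female >= threshold:
        return "F"
    if frac_female <= 1 - threshold:
        return "M"
    return "ambiguous"        # name is used for both genders too often to call

# Invented toy records; real reference data would come from an external source.
toy_records = ([("Jane", "F")] * 98 + [("Jane", "M")] * 2
               + [("John", "M")] * 99 + [("John", "F")] * 1
               + [("Jordan", "F")] * 40 + [("Jordan", "M")] * 60)
reference = build_reference(toy_records)
```

The point of the sketch is that `reference` is built entirely outside the database whose names are being labeled, which is what makes the information implicit.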

Commonly, implicit information is used as a bridge to measure the similarity of two entities. For example, two diseases A and B may be related in terms of how many Medical Subject Headings they share (in articles that describe disease A and disease B, respectively). Or, they may be related in terms of how many single-nucleotide polymorphisms (SNPs) have been shown to affect disease risk in both disease A and B. Or, they may be related in terms of how many clinical signs and symptoms they share. Or, how many single-gene mutations which affect disease A or B affect genes that lie in the same biochemical pathway. There are many possible types of implicit information that connect disease A with disease B, and it is even possible to combine multiple types of information to create a heterogeneous graph in which diseases are nodes and implicitly shared items form links between the nodes (Shi et al., 2017).
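One simple way to turn shared implicit features into a similarity measure is set overlap. The sketch below uses the Jaccard index, which is an assumption chosen for illustration rather than a measure prescribed by any of the cited studies; the heading sets are invented.

```python
def jaccard(features_a, features_b):
    """Overlap of two feature sets: |intersection| / |union| (0.0 if both empty)."""
    a, b = set(features_a), set(features_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# Invented headings attached to the articles describing each disease.
disease_a_headings = {"Inflammation", "Vasoconstriction", "Platelet Aggregation"}
disease_b_headings = {"Inflammation", "Platelet Aggregation", "Fever", "Pain"}

similarity = jaccard(disease_a_headings, disease_b_headings)  # 2 shared / 5 total
```

The same function applies unchanged to shared SNPs, shared symptoms, or shared pathway genes; only the feature sets differ.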

The use of implicit information is a powerful general technique of knowledge discovery, which has spawned several entire fields in bioinformatics and genomics (Bekhuis, 2006; Zweigenbaum et al., 2007). Don is the father of the field of drug repurposing, which proposes new uses for existing approved drugs (e.g. Weeber et al., 2003; Yang et al., 2017). Prediction of adverse drug effects follows a similar type of logic (e.g. Hristovski et al., 2016; Shang et al., 2014), as does detection of co-morbidities and other relations among drugs, diseases, and genes (Ding et al., 2013; Frijters et al., 2010; Vos et al., 2014). Almost all approaches to genomic discovery involve implicit information as well. Furthermore, implicit information is a central concept generally in text mining and natural language processing.

5 The One-node Search

In Don’s original A-B-C model, implicit information was used in what is known as the “one-node search” approach (Figure 3):

• Begin with a set of articles that discusses or presents information regarding a problem, e.g. prostate cancer or poverty = literature C.

• Look for another literature, unknown at the outset, which has information that can contribute to solving the problem = literature A.

• Use words and phrases in the titles of articles in literature C = B-terms [use filtering to keep only “important” words in some sense]. The B-terms are the implicit information.

• Carry out many searches to create B1, B2, B3…. -literatures.

• Tabulate the title words and phrases in each Bi-literature = candidate A-terms and rank them according to how many B-literatures they are in.

• Carry out a search using each Ai-term to define the Ai-literature.

• An Ai-literature which shares many B-terms with the original C-literature is hypothesized to contain information that may help solve the problem.
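The steps above can be sketched on a toy corpus of titles. This is a simplified illustration, not the Arrowsmith implementation: a "search" is just a title-word match, the stopword list stands in for the filtering step, B-terms are taken from the C-literature titles, and candidate A-terms are ranked by the number of B-literatures in which they appear. All titles and names are invented.

```python
from collections import Counter

# Hypothetical filter standing in for "keep only important words".
STOPWORDS = {"and", "show", "high", "patients", "reduces", "inhibits"}

def title_terms(title):
    """Lower-cased title words that survive the stopword filter."""
    return {w for w in title.lower().split() if w not in STOPWORDS}

def search(corpus, term):
    """Toy stand-in for a literature search: titles containing the term."""
    return [t for t in corpus if term in title_terms(t)]

def one_node_search(corpus, c_term):
    c_lit = search(corpus, c_term)                      # literature C
    b_terms = set().union(*(title_terms(t) for t in c_lit)) - {c_term}
    a_counts = Counter()
    for b in b_terms:                                   # one B-literature per B-term
        b_lit = [t for t in search(corpus, b) if t not in c_lit]
        candidates = set().union(*(title_terms(t) for t in b_lit))
        for a in candidates - b_terms - {c_term}:
            a_counts[a] += 1                            # A-term seen in this B-literature
    return a_counts.most_common()                       # ranked candidate A-terms

toy_corpus = [
    "raynaud patients show high blood viscosity",
    "raynaud and platelet aggregation",
    "fish oil reduces blood viscosity",
    "fish oil inhibits platelet aggregation",
    "aspirin inhibits platelet aggregation",
]
ranked = one_node_search(toy_corpus, "raynaud")
```

On this toy corpus, "fish" and "oil" are reached through all four B-terms while "aspirin" is reached through two, so the fish oil terms rank first. Treating phrases such as "fish oil" as single terms is one of the refinements a real system would need.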

Figure 3.
Schematic diagram illustrating the one-node search. Reprinted from Swanson & Smalheiser (1997) with permission.

Despite its conceptual appeal, the one-node search has several nuances and limitations in practice, and many variations of the ABC model have been explored (see reviews in Bruza & Weeber, 2008; Sebastian, Siew, & Orimaye, 2017a; Smalheiser, 2012b):

a) For example, different words that have essentially the same meaning (lexical variants, synonyms, abbreviations, and alternative spellings) should ideally be counted and treated as a single B-term. Conversely, Preiss and Stevenson (2016) have demonstrated that word sense disambiguation, i.e. to separate different senses of the same word as used in different instances, can improve performance of discovery systems.

b) Titles do not capture all information in an article. Words contained in the abstract and full text will also contribute information, although they will also contribute significant noise (Cohen, Johnson, et al., 2010).

c) Words and phrases are not the only, or necessarily the best, type of information to employ for linking literatures. Many other investigators have used concepts, MeSH terms, entities, and relations extracted from text (reviewed in Bruza & Weeber, 2008; Sebastian, Siew, & Orimaye, 2017a).

d) Similarly, ranking Ai-literatures according to the number of Bi-terms in their titles is a relatively crude and nonrobust measure. The hope is the B-terms will point to the existence of causal mechanisms that link the literatures, but this is not necessarily the case. Other investigators have proposed ranking measures based on e.g. mutual information, relations, and/or network properties, including citations (e.g. Cameron et al., 2015; Ding et al., 2013; Hristovski et al., 2015; Smalheiser, 2012b; van der Eijk et al., 2004; Wren, 2004).

e) The one-node search involves multiple searches and calculations of title words and phrases, which introduce computational complexity. In practice, investigators generally restrict the number or type of B-terms to be used for linking, with either semantic or statistical criteria. Furthermore, rather than searching for all possible A-literatures that might exist, generally they are restricted to being in some predefined semantic category (such as drugs).

f) Presenting many Ai-literatures to the investigator, even when ranked, causes great cognitive complexity, since each candidate A-literature requires detailed manual examination to assess.

6 The Two-node Search

Perhaps the most important limitation of the one-node search is not technical, but sociological: The one-node search is intended to help investigators who are looking for a new hypothesis—yet most investigators are already drowning in a sea of existing potential hypotheses and findings, and their goal is not to find still more hypotheses, but rather to decide which of the existing ones is most promising to pursue. Thus, in my own work, I have emphasized the importance of the two-node search strategy, which can be summarized as follows:

• An investigator already has a hypothesis (or an experimental finding) that links A and C, but which has not been explicitly investigated directly in any single published article.

• He or she carries out a two-node search between the set of articles that discusses A and the set of articles that discusses C, and examines the shared title words and phrases Bi.

• The goals are to rank the list of Bi-terms to home in on the most relevant and promising links, and to examine possible mechanisms that link A to C.
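The core of the two-node search, finding the title words shared by two literatures, can be sketched as a set intersection. The titles and the stopword list below are invented, and the function name is hypothetical; the real tool also handles phrases and then ranks the resulting B-terms.

```python
def two_node_b_terms(a_titles, c_titles, stopwords=frozenset()):
    """Return the title words that occur in both the A and the C literature."""
    def terms(titles):
        return {w for t in titles for w in t.lower().split()} - stopwords
    return sorted(terms(a_titles) & terms(c_titles))

# Invented titles standing in for two literatures.
a_titles = ["magnesium blocks calcium channels", "magnesium and vascular tone"]
c_titles = ["calcium channels in migraine", "vascular tone changes during migraine"]

b_terms = two_node_b_terms(a_titles, c_titles,
                           stopwords=frozenset({"and", "in", "during"}))
```

The returned list is unranked; deciding which of these shared terms point to a meaningful link is the harder problem.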

To create a quantitative model that would allow us to rank Bi-terms, I assembled a team of neuroscientists, who used the two-node search tool freely in the course of their scientific work. Vetle Torvik and I chose 8 of their searches as a gold standard, in which Bi-terms were manually marked as being relevant for linking A to C. Each Bi-term (marked as relevant or not relevant) was scored according to eight features (Table 1). These features are domain-independent insofar as they do not rely on any reported knowledge about entities, facts, or relations; rather, they are based on statistical properties such as the frequency of the term within MEDLINE (Table 1; Torvik & Smalheiser, 2007). As a negative control training set, we chose random pairs of query literatures (having similar size and topics as the gold standard set), and scored all Bi-terms in the negative set. We created a logistic regression model, based on a weighted sum of these features, to predict the probability that a given Bi-term would be marked as relevant, i.e. that it would be deemed relevant by users for linking A and C in a meaningful manner (Torvik & Smalheiser, 2007).
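A logistic regression model of this kind maps a weighted sum of feature values to a probability. The sketch below shows only that general form; the weights, bias, and feature values are invented placeholders, not the fitted parameters or the actual eight features of the published model.

```python
import math

def relevance_probability(features, weights, bias):
    """Logistic model: probability = sigmoid(bias + sum of weight * feature)."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

# Placeholder parameters for eight features (purely illustrative values).
weights = [0.8, -0.5, 0.3, 0.1, 0.6, -0.2, 0.4, 0.05]
bias = -1.0

p_low = relevance_probability([0, 1, 0, 0, 0, 1, 0, 0], weights, bias)
p_high = relevance_probability([1, 0, 1, 1, 1, 0, 1, 1], weights, bias)
```

In practice the weights would be estimated from the gold standard and negative control sets, for example with a standard statistics package.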

Table 1
Eight features used to characterize each B-term.

The two-node search interface makes it easy for investigators to carry out two-node searches among PubMed articles.

The two-node search also provides an aggregate measure of the implicit semantic similarity of any two literatures, based upon the body of Bi-terms, taken as a whole. Suppose we perform a two-node search (e.g. peanut butter vs health literacy) and find that there are 1,263 terms on the B-list, of which 402 are predicted to be relevant (i.e. the estimated probability of relevance is > 0.5). The ratio 402/1263 = 0.32 is called the pR score, and it provides an overall measure of the shared implicit information between the two literatures. Randomly chosen pairs of literatures tend to have pR scores around 0.07, whereas literatures that are very closely related in terms of topics tend to have pR scores of 0.4-0.5. We have used the pR score as an important feature for literature-based discovery (Peng, Bonifield, & Smalheiser, 2017).
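The pR calculation is simply the fraction of B-terms whose estimated relevance probability exceeds 0.5. A minimal sketch, with a hypothetical function name and a synthetic probability list that reproduces the counts in the worked example:

```python
def pr_score(relevance_probabilities, threshold=0.5):
    """Fraction of B-terms whose estimated probability of relevance exceeds the threshold."""
    if not relevance_probabilities:
        return 0.0
    n_relevant = sum(1 for p in relevance_probabilities if p > threshold)
    return n_relevant / len(relevance_probabilities)

# Synthetic B-list: 1,263 terms, of which 402 are predicted relevant.
probabilities = [0.9] * 402 + [0.1] * 861
score = pr_score(probabilities)  # 402 / 1263
```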

6.1 The One-node Search Reconceptualized as a Series of Two-node Searches

Don’s original Web-based one-node search tool is no longer available. I have implemented a simpler version in which the investigator starts with a literature that represents a problem to be solved (e.g. Huntington’s disease). Next, the user will be prompted to choose a category of Medical Subject Headings (MeSH) to search within, which encompasses a set of literatures describing entities (or classes of entities) that represent possible approaches or solutions to the problem. (Alternatively, the user can choose the Free Format option, to enter any list of PubMed search queries manually, one on each line.) For example, to search among different classes of drugs according to their molecular mechanism using the MeSH Tree option, the user would drill down from Chemicals and Drugs to Chemical Actions and Uses to Pharmacologic Actions to finally, Molecular Mechanisms of Pharmacological Action [D27.505.519]. This category includes about twenty classes of drugs, including Alkylating Agents [D27.505.519.124], Angiotensin Receptor Antagonists [D27.505.519.162], Antacids [D27.505.519.170], Antifoaming Agents [D27.505.519.178], and so on. Once the user chooses this MeSH term category, the software will carry out a series of two-node searches, each consisting of A = Huntington’s disease vs C = one of the drug classes. These two-node searches are characterized according to the total number of articles in A and C (and nAC, the intersection of A and C), as well as the total number of B-terms. Finally, the searches are ranked according to pR, the percentage of B-terms that are predicted to be relevant for meaningful linkage. The two-node search results are all individually stored temporarily by job ID so users can go back without the need to re-run the search each time. Thus, carrying out a one-node search is simply a matter of carrying out a series of two-node searches, one for each MeSH term within the category of interest (Smalheiser, 2012b). This greatly simplifies the computational issues involved.

7 Examples from the Front Lines of Scientific Investigation

A variety of investigators have used literature-based discovery (LBD) methods to propose specific hypotheses which were then tested experimentally. Some of these studies introduced new LBD methodology (e.g. Wren et al., 2004), whereas others used the public Arrowsmith two-node search interface. Dong et al. (2014) investigated links between anandamide and gastric cancer. Maver et al. (2013) identified novel treatments for neovascularization in diabetic retinopathy. Miller et al. (2012) found mechanisms to link hypogonadism and diminished sleep quality in aging men. Cairelli et al. (2013) proposed a possible explanation for the “obesity paradox” whereby obese patients have better outcomes in intensive care. Manev & Manev (2010) studied a 5-lipoxygenase-leptin-Alzheimer connection. Kell (2009) used LBD to assess abnormal iron chelation as a common pathogenetic factor in a variety of diseases.

In my own laboratory studies, separately from Don, I have also put together assertions and knowledge from disparate literatures to formulate hypotheses that I have tested and verified experimentally. Unlike the examples stated above, in which we or others deliberately searched for complementary literatures, the following examples arose haphazardly during the course of laboratory investigations.

For example, we had discovered that an enzyme, dicer, which is known to cleave double-stranded RNA to form small RNAs, is expressed and even highly enriched at postsynaptic densities present at synaptic contacts in the central nervous system (Lugli et al., 2005). However, paradoxically, although the dicer protein was present, it appeared to lack enzymatic activity. On the other hand, we knew that treating purified dicer protein with certain proteases in a test tube will cause dicer to form fragments that show greatly enhanced catalytic activity. And, there was an extensive body of studies that had shown that a naturally-occurring protease called calpain is activated during synaptic stimulation and cleaves a variety of other proteins in a controlled manner. Putting the two lines of studies together, we predicted that during synaptic stimulation, calpain might cleave dicer such that the activated, cleaved form of dicer would exhibit enzymatic activity. This was confirmed in experiments carried out in mouse brain tissue (Lugli et al., 2005).

Another example of connecting two disparate literatures to create a novel testable hypothesis occurred when we proposed that a phenomenon called RNA interference, which had been studied in worms and other lower organisms, might be involved in mediating learning and memory in the mammalian brain (Smalheiser, Manev, & Costa, 2001). It took us a decade to find provisional experimental evidence that this may, indeed, be the case (Smalheiser, 2012a, 2014).

Finally, a third example occurred when we noticed detailed similarities between a class of small vesicles (called secretory exosomes)—secreted by many cell types and reported to contain microRNAs and other types of RNAs—and the structures called synaptic spinules that form at synapses during periods of intense synaptic stimulation (Smalheiser, 2007). This led to the hypothesis that neurons may transfer RNAs and proteins across synapses in an activity-dependent manner (Smalheiser, 2007).

It should be acknowledged that none of these three examples involved computer-generated or automatic LBD algorithms, or even employed an explicit A-B-C model. Instead, both Don’s and my discoveries have largely been made by manual examination of complementary literatures and assembling of quite complex information into coherent wholes (Smalheiser, 2012b). Thus, it should be kept in mind that although most LBD research has focused on situations that are readily recognized by text mining and that follow standard templates (e.g. A affects B and B affects C), these situations represent only the “low hanging fruit,” and more sophisticated models of discourse and assertion will be needed to deal with the rest.

8 New Directions in Literature-based Discovery
8.1 Storytelling

One- and two-node A-B-C search strategies all consider a single intermediate link between two literatures. Perhaps the most straightforward extension of this idea is to construct and assess multi-step paths that exist between two sets of articles (e.g. Baek et al., 2017; Hossain et al., 2012; Sebastian, Siew, & Orimaye, 2017a). Multiple paths can also be constructed to connect entities, authors, and so on. This can be conceptualized variously as an exercise in storytelling, as navigating paths within graphs or networks, or as detecting functional mechanisms.

8.2 “Gaps”—Linking Two Sub-fields that Reside inside of a Larger Field of Investigation

My own group has focused recently on linking sub-fields that reside within a larger field of investigation. For example, consider the field of prostate cancer research. Some articles study experimental tumors in mice; some follow people for effects of diet and smoking on risk; some study molecular changes inside tumor cells; some are medicinal chemistry studies, modifying drugs for better solubility or potency or fewer side effects. Not all people in the field of prostate cancer research read all these articles! More to the point, not all topics are explored in all combinations within the prostate cancer field.

If two topics appear at moderately high frequencies within the prostate cancer field and are totally independent of each other, one would expect that they should co-occur in some articles simply by chance. When two MeSH terms co-occur, they often indicate that there is some direct or implicit relationship between them. Specifically, if two topics (defined as MeSH terms) are expected to co-occur in at least 10 articles within a given field, but do not co-occur in any articles at all, we call the pair of topics a “gap.” As reported recently (Peng, Bonifield, & Smalheiser, 2017), gaps can arise for several different reasons. A few gaps reflect idiosyncrasies in the rules given to MEDLINE indexers, such that certain closely related MeSH terms are rarely applied to the same article. Some gaps represent “low hanging fruit,” i.e. research directions that have not yet been investigated but are known to be promising and are likely to be followed up on in the near future. Other gaps may indicate the presence of undiscovered public knowledge; that is, investigators may be unaware of connections that exist among different sub-areas of a single field. We are continuing to investigate the phenomenon of gaps and attempting to use them as a means of discovering new, promising research directions.
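The gap definition can be sketched directly. The sketch assumes that the expected number of co-occurrences under independence is n(x) · n(y) / N, where n(x) and n(y) are the numbers of articles indexed with each term and N is the size of the field; this formula is an assumption consistent with the independence argument above, not necessarily the exact calculation used in the cited study. The article sets are invented.

```python
from collections import Counter
from itertools import combinations

def find_gaps(articles, min_expected=10.0):
    """Return (term_x, term_y, expected) for pairs expected to co-occur but never observed together."""
    n = len(articles)
    term_count = Counter(t for terms in articles for t in set(terms))
    observed = Counter(frozenset(pair)
                       for terms in articles
                       for pair in combinations(sorted(set(terms)), 2))
    gaps = []
    for x, y in combinations(sorted(term_count), 2):
        expected = term_count[x] * term_count[y] / n   # assumes independence
        if expected >= min_expected and observed[frozenset((x, y))] == 0:
            gaps.append((x, y, expected))
    return gaps

# Invented field of 100 articles split into two sub-fields that never overlap.
toy_field = [{"Diet", "Humans"}] * 50 + [{"Mice", "Apoptosis"}] * 50
gaps = find_gaps(toy_field)
```

Here every cross-sub-field pair has an expected count of 25 and an observed count of 0, so all four such pairs are reported as gaps. Deciding which gaps are interesting still requires the kind of manual assessment described above.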

8.3 Discovery via Analogy

A popular and important approach in literature-based discovery (and text mining in general) is the semantic representation of words, concepts, relations or predications by vectors (Cole & Bruza, 2005; Gordon & Dumais, 1998; Widdows & Cohen, 2015), either high-dimensional vectors (Cohen & Widdows, 2009) or low-dimensional vectors (Mikolov et al., 2013; Pennington, Socher, & Manning, 2014). One of the endearing features of semantic vector representations is that vectors that lie near each other exhibit similar meanings or similar relations. For example, the relation “King : Queen” is implemented by subtracting the vector for Queen from the vector for King, resulting in a difference vector (King - Queen) that embodies the relation. Other difference vectors that encode similar relations, e.g. “Man : Woman”, also lie near this difference vector. In particular, one can pose the question “King : Queen as Man : X?” and solve for X by identifying the difference vector which includes Man and lies closest to (King - Queen). Trevor Cohen has extensively explored the use of an analogy model for literature-based discovery based on vector proximity (e.g. Cohen & Widdows, 2009, 2017; Cohen, Whitfield, et al., 2010; Mower et al., 2016).
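The analogy query can be illustrated with hand-made two-dimensional vectors. Everything here is a toy assumption: real systems use learned vectors with hundreds or thousands of dimensions, and the Euclidean distance between difference vectors is only one possible proximity measure.

```python
import math

def difference(u, v):
    """Component-wise difference u - v."""
    return [a - b for a, b in zip(u, v)]

def solve_analogy(a, b, c, vectors):
    """Solve 'a : b as c : X' by finding X whose (c - X) lies closest to (a - b)."""
    target = difference(vectors[a], vectors[b])
    best_word, best_dist = None, math.inf
    for word, vec in vectors.items():
        if word in (a, b, c):
            continue                      # skip the query words themselves
        dist = math.dist(difference(vectors[c], vec), target)
        if dist < best_dist:
            best_word, best_dist = word, dist
    return best_word

# Invented vectors; the two axes loosely represent (royalty, maleness).
toy_vectors = {
    "king":  [0.9, 0.9],
    "queen": [0.9, 0.1],
    "man":   [0.1, 0.9],
    "woman": [0.1, 0.1],
    "apple": [0.5, 0.5],
}
answer = solve_analogy("king", "queen", "man", toy_vectors)
```

In this toy space (man - woman) coincides with (king - queen), so "woman" is returned.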

8.4 Link Prediction

Many discoveries involve combining new concepts or bridging disparate fields. One may hope to identify such publications by looking for newly published articles that contain novel combinations of text terms (Packalen & Bhattacharya, 2015), novel combinations of Medical Subject Headings (Mishra & Torvik, 2016; Peng, Bonifield, & Smalheiser, 2017), or whose reference lists cite novel combinations of journals (Uzzi et al., 2013). This leads to a model of literature-based discovery that is based on link prediction on networks. For example, Kastrin, Rindflesch, & Hristovski (2016) model LBD as considering all pairs of MeSH terms that have never co-occurred within a single article before, and seek to learn the factors that best predict the likelihood of an article appearing in the near future that is indexed by both of the MeSH terms. Sebastian, Siew, & Orimaye (2017b) combined text and citation networks for link prediction.
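
To make the link-prediction framing concrete, here is a minimal sketch that ranks pairs of terms that have never co-occurred by the number of neighbors they share in the co-occurrence network. The common-neighbors score is a deliberately simple stand-in: the studies cited above learn from richer sets of features, and nothing here reproduces their methods. The function names and the toy articles are invented for illustration.

```python
from itertools import combinations


def build_cooccurrence(articles):
    """Map each term to the set of terms it has co-occurred with in at least one article."""
    neighbors = {}
    for terms in articles:
        for t in terms:
            neighbors.setdefault(t, set())
        for a, b in combinations(sorted(set(terms)), 2):
            neighbors[a].add(b)
            neighbors[b].add(a)
    return neighbors


def rank_unlinked_pairs(neighbors):
    """Rank never-co-occurring pairs by their number of shared neighbors (highest first)."""
    scored = []
    for a, b in combinations(sorted(neighbors), 2):
        if b in neighbors[a]:
            continue  # already linked, so not a candidate for prediction
        shared = len(neighbors[a] & neighbors[b])
        if shared:
            scored.append((a, b, shared))
    scored.sort(key=lambda item: (-item[2], item[0], item[1]))
    return scored


# Invented toy articles, each represented by its set of index terms.
toy_articles = [
    {"Fish Oils", "Blood Viscosity"},
    {"Fish Oils", "Platelet Aggregation"},
    {"Raynaud Disease", "Blood Viscosity"},
    {"Raynaud Disease", "Platelet Aggregation"},
    {"Magnesium", "Migraine Disorders"},
]
ranked = rank_unlinked_pairs(build_cooccurrence(toy_articles))
print(ranked)
```

A ranking of this kind can be scored against later literature simply by checking which predicted pairs eventually appear together, which is what makes link prediction comparatively easy to evaluate.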

8.5 Scientific Arbitrage

Don often referred to literature-based discovery as an exercise in “scientific arbitrage,” in which certain ideas or findings are under-valued in one scientific arena, and gain in value by applying them to another field. (In fact, I believe he performed arbitrage in financial markets too!) In his final published article (Swanson, 2011), Don discussed the problem of identifying neglected, dead, or discarded findings and hypotheses as sources of new knowledge. Neglected findings, which are explicitly stated in one or more articles yet not well cited or followed up, may reflect a variety of issues: the articles in which they appeared may not be easy to find (particularly in full-text form), the findings themselves may have been refuted by later studies, or they may simply have been ahead of their time. The use of text mining to identify these neglected findings, and predict which (if any) ought to be resurrected and rehabilitated, remains an open question for further investigation.

A particular type of neglected finding is what I have called “negative consensus” (Smalheiser & Gomes, 2014), in which the investigators in a given field mention that a particular event or happenstance does NOT occur in nature. Sometimes this is documented by definitive experimental studies, in which case one would expect that negative assertions would cite the negative evidence. Often, however, the negative assertions simply reflect prevailing dogma or investigators’ expectations or “common sense”, and such cases do not cite any supporting evidence at all. My (somewhat contrarian) view is that negative consensus statements that lack experimental testing are in fact good subjects for further research. A small input of experimental testing may challenge the prevailing paradigm or dogma that made the finding seem so unlikely. For example, we noted that the protein Argonaute binds DNA in the test tube, yet investigators have simply assumed that it binds RNA within living cells—in part, this is because Argonaute is thought to reside in the cytoplasm whereas cytoplasmic DNA is thought not to exist. However, Argonaute does have functions in the nucleus, and there are indeed reports that extrachromosomal DNA exists in both nucleus and cytoplasm. Hence, the idea that Argonaute may bind DNA is not absurd but is well worth investigating (Smalheiser & Gomes, 2014). I believe that it is worthwhile to develop text mining tools that can identify negative consensus statements and help investigators decide which are likely to be promising to study. Agarwal, Yu, & Kohane (2011) have compiled a database of biomedical negated sentences, which might be mined to identify those assertions that are reliably negative across multiple documents.

8.6 The Penumbra of a Field as a Source of New Knowledge

A scientist working in a field (say, Alzheimer’s disease) is acutely aware that some lines of investigation are “mainstream” and reside in the core of the field, whereas other lines of work are marginal, either because they are new, or not considered interesting or credible, or because they are pursued by people who are not themselves recognized full-time Alzheimer researchers. For example, amyloid or tau protein aggregates are intensively studied, and the resulting work is published in high-impact journals as well as in journals devoted to aging and Alzheimer’s disease. In contrast, studies of gut microbes (the so-called microbiota) are not a mainstream topic in Alzheimer’s disease, at least not yet. Standard techniques such as text mining, summarization, and clustering, together with citation analysis, can help to identify which articles, topics, keywords, and concepts reside in the core of a given field and which reside in the periphery, or penumbra.

Initially, literature-based discovery techniques sought to make linkages across literatures, without asking whether the links predominantly involve the cores or the peripheries of the literatures. Don’s first inclination was to filter out B-terms that did not have adequate frequency of mentions in each literature, implying that he was focusing on the cores (Swanson & Smalheiser, 1997). In contrast, Kostoff et al. (2009), Petrič et al. (2010), and Workman et al. (2016) have argued that low-frequency terms which reside in the penumbra of one or both fields may sometimes be more promising for finding links that are interesting and unexpected.

8.7 Evidence Synthesis and Reproducibility in Science

In the early days of literature-based discovery, when assembling ideas, assertions, and published findings, we did not worry much about the reliability of each reported item, or how many articles obtained similar results. If a paper reported that protein A binds protein B in adult female rat lung, the extracted assertion would be “protein A binds protein B” without worrying much about its scope or generalizability to other situations. The goal has been to identify interesting and promising hypotheses, which after all need to be experimentally confirmed on their own terms.

Over the past 10 years, however, it has become clear that a significant minority (if not the majority) of published findings are hard to replicate and have low reliability, due to a combination of flaws in experimental design, small sample sizes, naïve data analysis practices, and over-interpretation of statistical testing (e.g. Ioannidis, 2005; Rzhetsky et al., 2006; Smalheiser, 2017). Thus, going forward, it will be important not merely to identify terms and concepts for linking, but to assess the reliability of the articles that contain them and to filter or rank them accordingly. Kilicoglu (2017) has recently proposed that text mining may aid in at least four ways, namely, plagiarism/fraud detection, ensuring adherence to reporting guidelines, managing information overload, and accurate citation/enhanced bibliometrics.

Even more broadly, literature-based discovery is moving closer to the field of evidence synthesis, which collects reported findings across multiple studies (e.g. the set of all clinical trials that have employed nonsteroidal anti-inflammatory agents for chronic arthritic knee pain) and attempts to reach a consensus, if possible. This field employs techniques such as systematic review, meta-analysis, and summarization. Although most of this work is currently done manually, there is a recent push for the use of automated text mining tools to accelerate the process (Jonnalagadda, Goyal, & Huffman, 2015; O’Mara-Eves et al., 2015). In fact, text mining-based detection of reliable trends in the literature, i.e. detecting when “signal” is truly above “noise,” is itself a type of literature-based discovery, albeit one in which explicit (rather than implicit) assertions are mined.

9 Discussion and Conclusions

The recent advent of big data has provided massive, openly available data sets that provide rich fodder for literature-based discovery, as well as serving as training sets for machine learning approaches to discovery. Furthermore, major big data techniques include linking data sets together and combining heterogeneous data sets (including electronic medical records and data warehouses), both of which are increasingly tractable with current computational resources, and both of which are fundamental to obtaining implicit information used for discovery. The new directions discussed in this review (e.g. outliers, analogies, negative consensus, and others) go beyond the A-B-C model and open up the field to an exciting variety of models of discovery.

Historically, the big stumbling-block of literature-based discovery has been the fact that its models seek to predict novel, untested, even surprising findings, which inherently are difficult to score as “right” or “wrong” without costly experimentation. This has bedeviled methodological studies that seek to improve predictive performance. Existing benchmarks are relatively few (Sebastian et al., 2017a). Time-slicing is an alternative technique in which articles up to a certain date are used to construct a hypothesis, and then the literature is examined a few years later to determine whether that hypothesis is tested or at least mentioned in the literature (Yetisgen-Yildiz & Pratt, 2009). Some of the new research directions that I have discussed in this article are easier to evaluate than the classic one-node or two-node searches. For example, link prediction seeks to predict which pairs (of, say, MeSH terms) are most likely to appear together in the same article in the future, which can be assessed quantitatively without considering the “truth” of the article. It is gratifying that the techniques of literature-based discovery have been absorbed into the mainstream of bioinformatics, medical informatics, and computer science, whose practitioners find abundant value even in predicting findings that are relatively non-surprising and incremental. For example, if protein A is known to have a certain function, and protein X is similar to protein A in several respects, then protein X may be hypothesized to share functions with A. Different discovery models of protein functions can be assessed on how well they predict functions across a database of known proteins, without relying on having experimental data for the unknown or novel proteins.
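
The time-slicing evaluation described above can be sketched roughly as follows. The sketch assumes that each hypothesis is a pair of terms and that a hypothesis counts as “confirmed” when the two terms co-occur in any article published after the cutoff; both are simplifying assumptions, and the function names and toy data are hypothetical.

```python
from itertools import combinations


def cooccurring_pairs(articles):
    """Collect every unordered pair of terms that shares at least one article."""
    pairs = set()
    for terms in articles:
        pairs.update(frozenset(p) for p in combinations(sorted(terms), 2))
    return pairs


def time_sliced_precision(dated_articles, cutoff_year, predicted_pairs):
    """Fraction of novel predictions (unseen up to the cutoff) that co-occur after the cutoff."""
    past = cooccurring_pairs(t for y, t in dated_articles if y <= cutoff_year)
    future = cooccurring_pairs(t for y, t in dated_articles if y > cutoff_year)
    # Pairs already known at the cutoff are not predictions, so they are dropped.
    novel = {frozenset(p) for p in predicted_pairs} - past
    if not novel:
        return 0.0
    return len(novel & future) / len(novel)


# Invented toy literature: (publication year, set of index terms).
toy_literature = [
    (2010, {"A", "B"}),
    (2011, {"B", "C"}),
    (2015, {"A", "C"}),
    (2016, {"C", "D"}),
]
predictions = [("A", "C"), ("A", "D"), ("A", "B")]
precision = time_sliced_precision(toy_literature, cutoff_year=2012, predicted_pairs=predictions)
print(precision)
```

Here the pair (A, B) is discarded because it was already known in 2012, and one of the two remaining predictions, (A, C), appears in the later literature. Note that later co-occurrence shows only that a connection was mentioned, not that the hypothesis was experimentally supported.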

The general scientific public is still not aware of the availability of tools for literature-based discovery. Our Arrowsmith project site maintains a suite of tools(4) that are free and open to the public, as does BITOLA(5), which is maintained by Dimitar Hristovski, and EpiphaNet(6), which is maintained by Trevor Cohen. Bringing user-friendly tools to the public should be a high priority, since even more than advancing basic research in informatics, it is vital that we ensure that scientists actually use discovery tools and that these are actually able to help them make experimental discoveries in the lab and in the clinic.


My informatics research is supported by NIH grants R01LM010817 and P01AG039347.

The authors have declared that no competing interests exist.


Agarwal S., Yu H., & Kohane I. (2011). BioNØT: A searchable database of biomedical negated sentences. BMC Bioinformatics, 12:420. Retrieved on August 9, 2017, from https://bmcbio
DOI:10.1186/1471-2105-12-420
Baek S.H., Lee D., Kim M., Lee J.H., & Song M. (2017). Enriching plausible new hypothesis generation in PubMed. PLoS ONE, 12(7), e0180539.
PMID:80326079
Bekhuis T. (2006). Conceptual biology, hypothesis discovery, and text mining: Swanson’s legacy. Biomedical Digital Libraries, 3:2. Retrieved on August 9, 2017, from
DOI:10.1186/1742-5581-3-2      PMID:1459187
Bruza P., & Weeber M. (Eds.) (2008). Literature-based discovery. Berlin: Springer-Verlag.
Cairelli M.J., Miller C.M., Fiszman M., Workman T.E., & Rindflesch T.C. (2013). Semantic MEDLINE for discovery browsing: Using semantic predications and the literature-based discovery paradigm to elucidate a mechanism for the obesity paradox. In AMIA Annual Symposium Proceedings (pp. 164-173). Retrieved on August 9, 2017, from .
Cameron D., Kavuluru R., Rindflesch T.C., Sheth A.P., Thirunarayan K., & Bodenreider O. (2015). Context-driven automatic subgraph creation for literature-based discovery. Journal of Biomedical Informatics, 54(C), 141-157.
DOI:10.1016/j.jbi.2015.01.014      PMID:4888806
Cohen K.B., Johnson H.L., Verspoor K., Roeder C., & Hunter L.E. (2010). The structural and content aspects of abstracts versus bodies of full text journal articles are different. BMC Bioinformatics, 11: 492. Retrieved on August 9, 2017, from
Cohen T., Whitfield G.K., Schvaneveldt R.W., Mukund K., & Rindflesch T. (2010). EpiphaNet: An interactive tool to support biomedical discoveries. Journal of Biomedical Discovery and Collaboration, 5(1), 21-49.
PMID:2990276
Cohen T., & Widdows D. (2009). Empirical distributional semantics: Methods and biomedical applications. Journal of Biomedical Informatics, 42(2), 390-405.
DOI:10.1016/j.jbi.2009.02.002      PMID:19232399
Cohen T., & Widdows D. (2017). Embedding of semantic predications. Journal of Biomedical Informatics, 68, 150-166.
DOI:10.1016/j.jbi.2017.03.003      PMID:28284761
Cole R., & Bruza P. (2005). A bare bones approach to literature-based discovery: An analysis of the Raynaud’s/Fish-oil and migraine-magnesium discoveries in semantic space. In A. Hoffmann, H. Motoda, & T. Scheffer (Eds.), Discovery Science (pp. 84-98). Berlin: Springer-Verlag.
Ding Y., Song M., Han J., Yu Q., Yan E., Lin L., & Chambers T. (2013). Entitymetrics: Measuring the impact of entities. PLoS ONE, 8(8), e71416.
DOI:10.1371/journal.pone.0071416      PMID:3756961
Dong W., Liu Y., Zhu W., Mou Q., Wang J., & Hu Y. (2014). Simulation of Swanson’s literature-based discovery: Anandamide treatment inhibits growth of gastric cancer cells in vitro and in silico. PLoS ONE, 9(6), e100436.
DOI:10.1371/journal.pone.0100436      PMID:4065097
Frijters R., van Vugt M., Smeets R., van Schaik R., de Vlieg J., & Alkema W. (2010). Literature mining for the discovery of hidden connections between drugs, genes and diseases. PLoS Computational Biology, 6(9), e1000943.
DOI:10.1371/journal.pcbi.1000943      PMID:20885778
Gordon M.D., & Dumais S. (1998). Using latent semantic indexing for literature based discovery. Journal of the American Society for Information Science, 49(8), 674-685.
DOI:10.1002/(SICI)1097-4571(199806)49:83.0.CO;2-Q
Hossain M.S., Gresock J., Edmonds Y., Helm R., Potts M., & Ramakrishnan N. (2012). Connecting the dots between PubMed abstracts. PLoS ONE, 7(1), e29509.
DOI:10.1371/journal.pone.0029509
Hristovski D., Kastrin A., Dinevski D., & Rindflesch T.C. (2015). Constructing a graph database for semantic literature-based discovery. Studies in Health Technology and Informatics, 216:1094. Retrieved on August 9, 2017, from
PMID:26262393      URL    
Literature-based discovery (LBD) generates discoveries, or hypotheses, by combining what is already known in the literature. Potential discoveries have the form of relations between biomedical concepts; for example, a drug may be determined to treat a disease other than the one for which it was intended. LBD views the knowledge in a domain as a network; a set of concepts along with the relations between them. As a starting point, we used SemMedDB, a database of semantic relations between biomedical concepts extracted with SemRep from Medline. SemMedDB is distributed as a MySQL relational database, which has some problems when dealing with network data. We transformed and uploaded SemMedDB into the Neo4j graph database, and implemented the basic LBD discovery algorithms with the Cypher query language. We conclude that storing the data needed for semantic LBD is more natural in a graph database. Also, implementing LBD discovery algorithms is conceptually simpler with a graph query language when compared with standard SQL.
Hristovski D., Kastrin A., Dinevski D., Burgun A., Žiberna L., & Rindflesch T.C. (2016). Using literature-based discovery to explain adverse drug effects. Journal of Medical Systems, 40(8), 185.
DOI:10.1007/s10916-016-0544-z      PMID:27318993      URL    
We report on our research in using literature-based discovery (LBD) to provide pharmacological and/or pharmacogenomic explanations for reported adverse drug effects. The goal of LBD is to generate novel and potentially useful hypotheses by analyzing the scientific literature and optionally some additional resources. Our assumption is that drugs have effects on some genes or proteins and that these genes or proteins are associated with the observed adverse effects. Therefore, by using LBD we try to find genes or proteins that link the drugs with the reported adverse effects. These genes or proteins can be used to provide insight into the processes causing the adverse effects. Initial results show that our method has the potential to assist in explaining reported adverse drug effects.
Ioannidis J.P. (2005). Why most published research findings are false. PLoS Medicine, 2(8), e124.
DOI:10.1371/journal.pmed.0020124      URL    
There is increasing concern that most current published research findings are false. The probability that a research claim is true may depend on study power and bias, the number of other studies on the same question, and, importantly, the ratio of true to no relationships among the relationships probed in each scientific field. In this framework, a research finding is less likely to be true when the studies conducted in a field are smaller; when effect sizes are smaller; when there is a greater number and lesser preselection of tested relationships; where there is greater flexibility in designs, definitions, outcomes, and analytical modes; when there is greater financial and other interest and prejudice; and when more teams are involved in a scientific field in chase of statistical significance. Simulations show that for most study designs and settings, it is more likely for a research claim to be false than true. Moreover, for many current scientific fields, claimed research findings may often be simply accurate measures of the prevailing bias. In this essay, I discuss the implications of these problems for the conduct and interpretation of research.
Jonnalagadda S.R., Goyal P., & Huffman M.D. (2015). Automating data extraction in systematic reviews: A systematic review. Systematic Reviews, 4:78. Retrieved on August 9, 2017, from
DOI:10.1186/s13643-015-0066-7      PMID:4514954      URL    
Background Automation of the parts of systematic review process, specifically the data extraction step, may be an important strategy to reduce the time necessary to complete a systematic review. However, the state of the science of automatically extracting data elements from full texts has not been well described. This paper performs a systematic review of published and unpublished methods to automate data extraction for systematic reviews. Methods We systematically searched PubMed, IEEEXplore, and ACM Digital Library to identify potentially relevant articles. We included reports that met the following criteria: 1) methods or results section described what entities were or need to be extracted, and 2) at least one entity was automatically extracted with evaluation results that were presented for that entity. We also reviewed the citations from included reports. Results Out of a total of 1190 unique citations that met our search criteria, we found 26 published reports describing automatic extraction of at least one of more than 52 potential data elements used in systematic reviews. For 25 (48%) of the data elements used in systematic reviews, there were attempts from various researchers to extract information automatically from the publication text. Out of these, 14 (27%) data elements were completely extracted, but the highest number of data elements extracted automatically by a single study was 7. Most of the data elements were extracted with F-scores (a mean of sensitivity and positive predictive value) of over 70%. Conclusions We found no unified information extraction framework tailored to the systematic review process, and published reports focused on a limited (1–7) number of data elements. Biomedical natural language processing techniques have not been fully utilized to fully or even partially automate the data extraction step of systematic reviews.
Kastrin A., Rindflesch T.C., & Hristovski D. (2016). Link prediction on a network of co-occurring MeSH Terms: Towards literature-based discovery. Methods of Information in Medicine, 55(4), 340-346.
DOI:10.3414/ME15-01-0108      PMID:27435341      URL    
Literature-based discovery (LBD) is a text mining methodology for automatically generating research hypotheses from existing knowledge. We mimic the process of LBD as a classification problem on a graph of MeSH terms. We employ unsupervised and supervised link prediction methods for predicting previously unknown connections between biomedical concepts. We evaluate the effectiveness of link prediction through a series of experiments using a MeSH network that contains the history of link formation between biomedical concepts. We performed link prediction using proximity measures, such as common neighbor (CN), Jaccard coefficient (JC), Adamic/Adar index (AA) and preferential attachment (PA). Our approach relies on the assumption that similar nodes are more likely to establish a link in the future. Applying an unsupervised approach, the AA measure achieved the best performance in terms of area under the ROC curve (AUC = 0.76), followed by CN, JC, and PA. In a supervised approach, we evaluate whether proximity measures can be combined to define a model of link formation across all four predictors. We applied various classifiers, including decision trees, k-nearest neighbors, logistic regression, multilayer perceptron, naïve Bayes, and random forests. Random forest classifier accomplishes the best performance (AUC = 0.87). The link prediction approach proved to be effective for LBD processing. Supervised statistical learning approaches clearly outperform an unsupervised approach to link prediction.
Kell D.B. (2009). Iron behaving badly: Inappropriate iron chelation as a major contributor to the aetiology of vascular and other progressive inflammatory and degenerative diseases. BMC Medical Genomics, 2:2. Retrieved on August 9, 2017, from
Kilicoglu H. (2017). Biomedical text mining for research rigor and integrity: Tasks, challenges, directions. Brief Bioinform, bbx057. Retrieved on August 9, 2017, from
DOI:10.1093/bib/bbx057      PMID:28633401      URL    
An estimated quarter of a trillion US dollars is invested in the biomedical research enterprise annually. There is growing alarm that a significant portion of this investment is wasted, due to problems in reproducibility of research findings and in the rigor and integrity of research conduct and reporting. Recent years have seen a flurry of activities focusing on standardization and guideline development to enhance the reproducibility and rigor of biomedical research. Research activity is primarily communicated via textual artifacts, ranging from grant applications to journal publications. These artifacts can be both the source and the end result of practices leading to research waste. For example, an article may describe a poorly designed experiment, or the authors may reach conclusions not supported by the evidence presented. In this article, we pose the question of whether biomedical text mining techniques can assist the stakeholders in the biomedical research enterprise in doing their part towards enhancing research integrity and rigor. In particular, we identify four key areas in which text mining techniques can make a significant contribution: plagiarism/fraud detection, ensuring adherence to reporting guidelines, managing information overload, and accurate citation/enhanced bibliometrics. We review the existing methods and tools for specific tasks, if they exist, or discuss relevant research that can provide guidance for future work. With the exponential increase in biomedical research output and the ability of text mining approaches to perform automatic tasks at large scale, we propose that such approaches can add checks and balances that promote responsible research practices and can provide significant benefits for the biomedical research enterprise.
Kostoff R.N., Block J.A., Solka J.L., Briggs M.B., Rushenberg R.L., Stump J.A., Johnson D., Lyons T.J., & Wyatt J.R. (2009). Literature-related discovery. Annual Review of Information Science and Technology, 43(1), 1-71.
Lugli G., Larson J., Martone M.E., Jones Y., & Smalheiser N.R. (2005). Dicer and eIF2c are enriched at postsynaptic densities in adult mouse brain and are modified by neuronal activity in a calpain-dependent manner. Journal of Neurochemistry, 94(4), 896-905.
DOI:10.1111/jnc.2005.94.issue-4      URL    
Manev H., & Manev R. (2010). Benefits of neuropsychiatric phenomics: Example of the 5-lipoxygenase-leptin-Alzheimer connection. Cardiovasc Psychiatry Neurol, No. 838164. Retrieved on August 9, 2017, from
Maver A., Hristovski D., Rindflesch T.C., & Peterlin B. (2013). Integration of data from omic studies with the literature-based discovery towards identification of novel treatments for neovascularization in diabetic retinopathy. BioMed Research International, No. 848952. Retrieved on August 9, 2017, from
Mikolov T., Sutskever I., Chen K., Corrado G.S., & Dean J. (2013). Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26 (NIPS 2013). Retrieved on August 9, 2017, from
Miller C.M., Rindflesch T.C., Fiszman M., Hristovski D., Shin D., Rosemblat G., Zhang H., & Strohl K.P. (2012). A closed literature-based discovery technique finds a mechanistic link between hypogonadism and diminished sleep quality in aging men. Sleep, 35(2), 279-285.
DOI:10.5665/sleep.1640      PMID:22294819      URL    
Sleep quality commonly diminishes with age, and, further, aging men often exhibit a wider range of sleep pathologies than women. We used a freely available, web-based discovery technique (Semantic MEDLINE) supported by semantic relationships to automatically extract information from MEDLINE titles and abstracts. We assumed that testosterone is associated with sleep (the A-C relationship in the paradigm) and looked for a mechanism to explain this association (B explanatory link) as a potential or partial mechanism underpinning the etiology of eroded sleep quality in aging men. Review of full-text papers in critical nodes discovered in this manner resulted in the proposal that testosterone enhances sleep by inhibiting cortisol. Using this discovery method, we posit, and could confirm as a novel hypothesis, cortisol as part of a mechanistic link elucidating the observed correlation between decreased testosterone in aging men and diminished sleep quality. This approach is publicly available and useful not only in this manner but also to generate from the literature alternative explanatory models for observed experimental results.
Mishra S., & Torvik V.I. (2016). Quantifying conceptual novelty in the biomedical literature. D-Lib Magazine, 22, No. 9/10. Retrieved on August 9, 2017, from
Mower J., Subramanian D., Shang N., & Cohen T. (2016). Classification-by-analogy: Using vector representations of implicit relationships to identify plausibly causal drug/side-effect relationships. AMIA Annual Symposium Proceedings, 1940-1949.
PMID:5333205      URL    
The integration of disparate research domains is a prerequisite for the success of the translational science initiative. MEDLINE abstracts contain content from a broad range of disciplines, presenting an opportunity for the development of methods able to integrate the knowledge they contain. Latent Semantic Analysis (LSA) and related methods learn human-like associations between terms from...
O’Mara-Eves A., Thomas J., McNaught J., Miwa M., & Ananiadou S. (2015). Using text mining for study identification in systematic reviews: A systematic review of current approaches. Systematic Reviews, 4:5. Retrieved on August 9, 2017, from
DOI:10.1186/2046-4053-4-5      PMID:4411935      URL    
Background The large and growing number of published studies, and their increasing rate of publication, makes the task of identifying relevant studies in an unbiased way for inclusion in systematic reviews both complex and time consuming. Text mining has been offered as a potential solution: through automating some of the screening process, reviewer time can be saved. The evidence base around the use of text mining for screening has not yet been pulled together systematically; this systematic review fills that research gap. Focusing mainly on non-technical issues, the review aims to increase awareness of the potential of these technologies and promote further collaborative research between the computer science and systematic review communities. Methods Five research questions led our review: what is the state of the evidence base; how has workload reduction been evaluated; what are the purposes of semi-automation and how effective are they; how have key contextual problems of applying text mining to the systematic review field been addressed; and what challenges to implementation have emerged? We answered these questions using standard systematic review methods: systematic and exhaustive searching, quality-assured data extraction and a narrative synthesis to synthesise findings. Results The evidence base is active and diverse; there is almost no replication between studies or collaboration between research teams and, whilst it is difficult to establish any overall conclusions about best approaches, it is clear that efficiencies and reductions in workload are potentially achievable. On the whole, most suggested that a saving in workload of between 30% and 70% might be possible, though sometimes the saving in workload is accompanied by the loss of 5% of relevant studies (i.e. a 95% recall). Conclusions Using text mining to prioritise the order in which items are screened should be considered safe and ready for use in 'live' reviews. The use of text mining as a 'second screener' may also be used cautiously. The use of text mining to eliminate studies automatically should be considered promising, but not yet fully proven. In highly technical/clinical areas, it may be used with a high degree of confidence; but more developmental and evaluative work is needed in other disciplines.
Packalen M., & Bhattacharya J. (2015). Neophilia ranking of scientific journals. NBER Working Paper No. w21579. Retrieved on August 9, 2017, from
DOI:10.1007/s11192-016-2157-1      PMID:5506293      URL    
The ranking of scientific journals is important because of the signal it sends to scientists about what is considered most vital for scientific progress. Existing ranking systems focus on measuring the influence of a scientific paper (citations); these rankings do not reward journals for publishing innovative work that builds on new ideas. We propose an alternative ranking based on the proclivity of journals to publish papers that build on new ideas, and we implement this ranking via a text-based analysis of all published biomedical papers dating back to 1946. In addition, we compare our neophilia ranking to citation-based (impact factor) rankings; this comparison shows that the two ranking approaches are distinct. Prior theoretical work suggests an active role for our neophilia index in science policy. Absent an explicit incentive to pursue novel science, scientists underinvest in innovative work because of a coordination problem: for work on a new idea to flourish, many scientists must decide to adopt it in their work. Rankings that are based purely on influence thus do not provide sufficient incentives for publishing innovative work. By contrast, adoption of the neophilia index as part of journal-ranking procedures by funding agencies and university administrators would provide an explicit incentive for journals to publish innovative work and thus help solve the coordination problem by increasing scientists' incentives to pursue innovative work.
Popper K.R. (1978). Three worlds. The Tanner Lecture on Human Values. The University of Michigan. Ann Arbor. Retrieved on July 17, 2017, from
Pennington J., Socher R., & Manning C.D. (2014, October). GloVe: Global vectors for word representation. In Conference on Empirical Methods in Natural Language Processing, Vol. 14 (pp. 1532-1543). Retrieved on August 9, 2017, from
Petrič I., Cestnik B., Lavrač N., & Urbančič T. (2010). Outlier detection in cross-context link discovery for creative literature mining. The Computer Journal, 55(1), 47-61.
DOI:10.1093/comjnl/bxq074      URL    
This paper investigates the role of outliers in literature-based knowledge discovery. It shows that detecting interesting outliers which appear in the literature on a given phenomenon can help the expert to find implicit relationships among concepts of different domains. The underlying assumption is that while the majority of articles in the given scientific domain describe matters related to a common understanding of the domain, the exploration of outliers may lead to the detection of scientifically interesting bridging concepts among disjoint sets of scientific articles. The proposed approach contributes to cross-context link discovery by proving the utility of outlier detection for finding bisociative links in the process of autism literature exploration, as well as by uncovering implicit relationships in the articles from the migraine domain.
Preiss J., & Stevenson R. (2016). The effect of word sense disambiguation accuracy on literature based discovery. BMC Medical Informatics and Decision Making, 16(1), 59-63.
DOI:10.1186/s12911-016-0302-7      PMID:4893223      URL    
Rzhetsky A., Iossifov I., Loh J.M., & White K.P. (2006). Microparadigms: Chains of collective reasoning in publications about molecular interactions. Proceedings of the National Academy of Sciences of the United States of America, 103(13), 4940-4945.
DOI:10.1073/pnas.0600591103      URL    
Sebastian Y., Siew E.G., & Orimaye S.O. (2017a). Emerging approaches in literature-based discovery: Techniques and performance review. Knowledge Engineering Review, 32, article no. e12. Retrieved on July 17, 2017, from
Sebastian Y., Siew E.G., & Orimaye S.O. (2017b). Learning the heterogeneous bibliographic information network for literature-based discovery. Knowledge-Based Systems, 115, 66-79.
DOI:10.1016/j.knosys.2016.10.015      URL    
This paper presents HBIN-LBD, a novel literature-based discovery (LBD) method that exploits the lexico-citation structures within the heterogeneous bibliographic information network (HBIN) graphs. Unlike other existing LBD methods, HBIN-LBD harnesses the metapath features found in HBIN graphs for discovering the latent associations between scientific papers published in otherwise disconnected research areas. Further, this paper investigates the effects of incorporating semantic and topic modeling components into the proposed models. Using time-sliced historical bibliographic data, we demonstrate the performance of our method by reconstructing two LBD hypotheses: the Fish Oil and Raynaud Syndrome hypothesis and the Migraine and Magnesium hypothesis. The proposed method is capable of predicting the future co-citation links between research papers of these previously disconnected research areas with up to 88.86% accuracy and 0.89 F-measure.
Shang N., Xu H., Rindflesch T.C., & Cohen T. (2014). Identifying plausible adverse drug reactions using knowledge extracted from the literature. Journal of Biomedical Informatics, 52, 293-310. Retrieved on July 17, 2017, from
Shi C., Li Y., Zhang J., Sun Y., & Yu P.S. (2017). A survey of heterogeneous information network analysis. IEEE Transactions on Knowledge and Data Engineering, 29(1), 17-37.
DOI:10.1109/TKDE.2016.2598561      URL    
Most real systems consist of a large number of interacting, multi-typed components, while most contemporary researches model them as homogeneous networks, without distinguishing different types of objects and links in the networks. Recently, more and more researchers begin to consider these interconnected, multi-typed data as heterogeneous information networks, and develop structural analysis approaches by leveraging the rich semantic meaning of structural types of objects and links in the networks. Compared to widely studied homogeneous network, the heterogeneous information network contains richer structure and semantic information, which provides plenty of opportunities as well as a lot of challenges for data mining. In this paper, we provide a survey of heterogeneous information network analysis. We will introduce basic concepts of heterogeneous information network analysis, examine its developments on different data mining tasks, discuss some advanced topics, and point out some future research directions.
Smalheiser N.R. (2007). Exosomal transfer of proteins and RNAs at synapses in the nervous system. Biology Direct, 2(1), 35.
DOI:10.1186/1745-6150-2-35      PMID:2219957      URL    
Background Many cell types have been reported to secrete small vesicles called exosomes, that are derived from multivesicular bodies and that can also form from endocytic-like lipid raft domains of the plasma membrane. Secretory exosomes contain a characteristic composition of proteins, and a recent report indicates that mast cell exosomes harbor a variety of mRNAs and microRNAs as well. Exosomes express cell recognition molecules on their surface that facilitate their selective targeting and uptake into recipient cells. Results In this review, I suggest that exosomal secretion of proteins and RNAs may be a fundamental mode of communication within the nervous system, supplementing the known mechanisms of anterograde and retrograde signaling across synapses. In one specific scenario, exosomes are proposed to bud from the lipid raft region of the postsynaptic membrane adjacent to the postsynaptic density, in a manner that is stimulated by stimuli that elicit long-term potentiation. The exosomes would then transfer newly synthesized synaptic proteins (such as CAM kinase II alpha) and synaptic RNAs to the presynaptic terminal, where they would contribute to synaptic plasticity. Conclusion The model is consistent with the known cellular and molecular features of synaptic neurobiology and makes a number of predictions that can be tested in vitro and in vivo. Open peer review Reviewed by Etienne Joly, Gaspar Jekely, Juergen Brosius and Eugene Koonin. For the full reviews, please go to the Reviewers' comments section.
Smalheiser N.R. (2012a). The search for endogenous siRNAs in the mammalian brain. Experimental Neurology, 235(2), 455-463.
DOI:10.1016/j.expneurol.2011.10.015      PMID:22062046      URL    
Endogenous siRNAs are detected within adult mouse hippocampus. These derive from genes involved in synaptic structure and signaling. Small RNAs derived from abundant cellular noncoding RNAs are also detected. 25–30 nt RNAs showed very large (>100 fold) changes during learning. Endo-siRNAs and ncRNA-derived small RNAs may regulate synaptic plasticity.
Smalheiser N.R. (2012b). Literature-based discovery: Beyond the ABCs. Journal of the Association for Information Science and Technology, 63(2), 218-224.
DOI:10.1002/asi.21599      URL    
Literature-based discovery (LBD) refers to a particular type of text mining that seeks to identify nontrivial assertions that are implicit, and not explicitly stated, and that are detected by juxtaposing (generally a large body of) documents. In this review, I will provide a brief overview of LBD, both past and present, and will propose some new directions for the next decade. The prevalent ABC model is not “wrong”; however, it is only one of several different types of models that can contribute to the development of the next generation of LBD tools. Perhaps the most urgent need is to develop a series of objective literature-based interestingness measures, which can customize the output of LBD systems for different types of scientific investigations.
Smalheiser N.R. (2014). The RNA-centred view of the synapse: Non-coding RNAs and synaptic plasticity. Philosophical Transactions of the Royal Society B Biological Sciences, 369(1652).
Smalheiser N.R. (2017). Data literacy: How to make your experiments robust and reproducible. Cambridge, MA: Academic Press.
Smalheiser N.R., & Gomes O.L. (2014). Mammalian Argonaute-DNA binding? Biology Direct, 10:27. Retrieved on July 17, 2017, from
Smalheiser N.R., Manev H., & Costa E. (2001). RNAi and brain function: Was McConnell on the right track? Trends in Neurosciences, 24(4), 216-218.
DOI:10.1016/S0166-2236(00)01739-2      PMID:11250005      URL    
RNA interference (RNAi), one of the hottest topics of molecular biology research today, has unique features that are eerily reminiscent of the phenomenon of ‘RNA-mediated memory transfer,’ a controversial line of work that was investigated with great enthusiasm in the 1960s. If not a coincidence, then this suggests taking a new look at RNA-mediated modulation of neural function and raises the possibility that RNAi might be one of the physiologic mechanisms that regulate long-term gene expression in the brain.
Smalheiser N.R., & Swanson D.R. (1994). Assessing a gap in the biomedical literature: Magnesium deficiency and neurologic disease. Neuroscience Research Communications, 15(1), 1-9.
DOI:10.1016/0168-0102(94)90027-2      URL    
Recent studies have focused great attention upon the role of NMDA receptor-mediated excitotoxicity in the pathogenesis of a variety of acute and chronic neurologic diseases, and upon the role of endogenous Mg ions in regulating this process. Yet, very few studies have sought to ascertain whether exogenous, e.g., dietary, manipulations of Mg levels can modulate brain function or the expression of neurologic diseases (apart from hyperexcitability and seizures that are elicited directly when Mg levels are extremely low). We argue that this issue is important, and should be addressed in existing animal models of acute and chronic CNS insults.
Smalheiser N.R., & Swanson D.R. (1996a). Indomethacin and Alzheimer’s disease. Neurology, 46(2), 583.
Smalheiser N.R., & Swanson D.R. (1996b). Linking estrogen to Alzheimer’s disease: An informatics approach. Neurology, 47(3), 809-810.
DOI:10.1212/WNL.47.3.809      PMID:8797484      URL    
Epidemiologic studies suggest that estrogen protects against AD. We employ ARROWSMITH, a novel computer-assisted approach, to identify possible links between estrogen and AD that are not explicit in the biomedical literature, by searching for substances or processes that are known targets of estrogen action and that have also been separately studied in relation to AD. Several links appear particularly promising (e.g., estrogen's antioxidant activity) and merit attention by neuroscientists.
Smalheiser N.R., & Swanson D.R. (1998). Calcium-independent phospholipase A2 and schizophrenia. Archives of General Psychiatry, 55(8), 752-753.
DOI:10.1001/archpsyc.55.8.752      PMID:9707387      URL    
Comment on Arch Gen Psychiatry. 1997 May;54(5):487-94.
Smalheiser N.R., & Torvik V.I. (2004). A population-based statistical approach identifies parameters characteristic of human microRNA-mRNA interactions. BMC Bioinformatics, 5:139. Retrieved on July 17, 2017, from
DOI:10.1186/1471-2105-5-139      PMID:15453917      URL    
pAbstract/p pBackground/p pMicroRNAs are ~17鈥24 nt. noncoding RNAs found in all eukaryotes that degrade messenger RNAs via RNA interference (if they bind in a perfect or near-perfect complementarity to the target mRNA), or arrest translation (if the binding is imperfect). Several microRNA targets have been identified in lower organisms, but only one mammalian microRNA target has yet been validated experimentally./p pResults/p pWe carried out a population-wide statistical analysis of how human microRNAs interact complementarily with human mRNAs, looking for characteristics that differ significantly as compared with scrambled control sequences. These characteristics were used to identify a set of 71 outlier mRNAs unlikely to have been hit by chance./p pUnlike the case in itC. elegans /itand itDrosophila/it, many human microRNAs exhibited long exact matches (10 or more bases in a row), up to and including perfect target complementarity. Human microRNAs hit outlier mRNAs within the protein coding region about 2/3 of the time. And, the stretches of perfect complementarity within microRNA hits onto outlier mRNAs were not biased near the 5-end of the microRNA. In several cases, an individual microRNA hit multiple mRNAs that belonged to the same functional class./p pConclusions/p pThe analysis supports the notion that sequence complementarity is the basis by which microRNAs recognize their biological targets, but raises the possibility that human microRNA-mRNA target interactions follow different rules than have been previously characterized in itDrosophila /itand itC. elegans/it./p
[Cite within: 1]
Smalheiser N.R., & Torvik V.I. (2005). Mammalian microRNAs derived from genomic repeats. Trends in Genetics, 21(6), 322-326.
DOI:10.1016/j.tig.2005.04.008
Swanson D.R. (1986a). Undiscovered public knowledge. Library Quarterly, 56(2), 103-118.
DOI:10.1086/601720
Swanson D.R. (1986b). Fish oil, Raynaud’s syndrome, and undiscovered public knowledge. Perspectives in Biology & Medicine, 30(1), 7-18.
DOI:10.1353/pbm.1986.0087      PMID:3797213
Divide and conquer—the strategy that science uses to cope with the mountains of printed matter it produces—appears on the surface to serve us well. Science organizes itself into manageable units—scientific specialties—and so its literature is created and assimilated in manageable chunks or units. But a few clouds on the horizon ought not to go unexamined. First, most of the units are no doubt logically related to other units. Second, there are far more combinations of units, therefore far more potential relationships among the units, than there are units. Third, the system is not organized to cope with combinations. I suggest that important relationships might be escaping our notice. Individual units of literature are created to some degree independently of one another, and, insofar as that is so, the logical connections among the units, though inevitable, may be unintended by and even unknown to their creators. Until those fragments, like scattered pieces of a puzzle, are brought together, the relationships among them may remain undiscovered—even though the isolated pieces might long have been public knowledge. My purpose in this essay is to show, by means of an example, how this might happen. I shall identify two units of literature that are logically connected but noninteractive; neither seems to acknowledge the other to any substantial degree. Yet the logical connections, once apparent, lead to a potentially useful and possibly new hypothesis.
A Hidden Hypothesis. Dietary fish oil has been shown in many experiments, human and animal, to lead to reductions in blood lipids, platelet aggregability, blood viscosity, and vascular reactivity—changes that are likely to improve blood circulation. Raynaud's syndrome is a peripheral circulatory disorder associated with and exacerbated by high platelet aggregability, high blood viscosity, and vasoconstriction. These two ideas—the fish oil/blood connection and the Raynaud/blood connection—are each supported by a substantial body of scientific evidence and literature; each idea separately represents knowledge that is publicly available. What is notable about the two ideas is that, apparently, they have not heretofore been brought together in print. Together they obviously suggest the hypothesis that dietary fish oil might ameliorate or prevent Raynaud's syndrome. So far as I have been able to determine, that hypothesis also has never appeared in print. In some sense it has existed implicitly for years simply because the above two premises that lead to it have existed in published form for years. We can presume that the hypothesis has remained hidden because the separate literatures on fish oil and on Raynaud's syndrome have never been brought together in a way that would reveal their logical connection and so reveal the hypothesis.
Noninteracting Literatures. During the past decade almost 2,000 papers on Raynaud's syndrome and around 1,000 papers related to dietary fish oil have been published, as estimated roughly from searching a few large data bases. The two groups of papers have many attributes in common that are relevant to the proposed hypothesis, attributes related to blood viscosity, platelet aggregability, and vascular reactivity. These connections notwithstanding, the two literatures appear to be remarkably isolated from one another, so far as either common authors or references from one literature to the other are concerned. The two main groups of papers in the attached list of references—the fish-oil group [1–25] and the Raynaud group [26–59]—were selected specifically for their logical connections with one another and with the proposed hypothesis, connections that will be discussed and made explicit in the next two sections of this paper. Yet none of the articles in the first group mentions or refers to any Raynaud work, and no article in the second group mentions or refers to research on fish oil. The isolation of the two literatures goes well beyond the two groups of references just mentioned. A Dialog search of Medline and of Embase (Excerpta...
Swanson D.R. (1987). Two medical literatures that are logically but not bibliographically connected. Journal of the American Society for Information Science, 38(4), 228-233.
DOI:10.1002/(ISSN)1097-4571
Swanson D.R. (1988). Migraine and magnesium: Eleven neglected connections. Perspectives in Biology & Medicine, 31(4), 526-557.
DOI:10.1353/pbm.1988.0009      PMID:3075738
. . . the natural sciences . . . can be said to be a living organism developing by the addition of little cells, a veritable body of knowledge proving itself to be such by the very fact of this almost unconscious growth, with thousands of parts oblivious to the whole, nevertheless contributing to it.—Allan Bloom [1, p. 345-346]
Allan Bloom's cytologic epistemology invites further analysis. A scientific article is like a cell that interacts with its neighbors to form an organ-like cluster—a set of articles or a "literature" addressed to a common set of problems and topics. These articles interact by citing one another—by conversing in print. The clusters themselves can be seen as interacting, to varying degrees, with other clusters. This essay will focus on certain failures of intercluster communication. I shall call two literatures "logically" related if the arguments they advance about the phenomena to which they respectively refer are related in some interesting way. One can imagine that two distinct clusters or literatures might be logically related yet mutually isolated or "noninteractive"—like two clusters of cells oblivious to their relatedness, nevertheless contributing to it. The failure of two literatures to interact or communicate would suggest that any logical relationship between them may be unknown or, at least, undocumented. For any documentation acceptable to science would have to refer to or mention both literatures and so violate the assumption of noncommunication. Undocumented connections arise neither by chance nor by design but as a result of the inherent connectedness within the physical or biological world; they are of particular interest because of their potential for being discovered by bringing together the relevant noninteractive literatures, like assembling pieces of a puzzle to reveal an unnoticed, unintended, but not unintelligible pattern. The fragmentation of science into specialties makes it likely that there exist innumerable pairs of logically related, mutually isolated literatures. In earlier articles, I called attention to one such pair [2, 3]. The first literature contained evidence that dietary fish oil causes certain blood and vascular changes, and the second contained evidence that these same changes might ameliorate Raynaud's disease. The two literatures were mutually isolated but logically related by the implicit hypothesis that dietary fish oil might benefit Raynaud patients. That hypothesis apparently had not previously been published—perhaps because the two literatures had not before been considered together. In the present article I demonstrate something similar for the pair of literatures on migraine and magnesium. The goal of this work is not simply to find unnoticed connections but to develop a systematic approach to the process of hunting for them. As in the preceding case, one begins with a disease for which neither cause nor cure is known. The problem is to find, within the literature, indirect evidence that an unknown cure might already exist. The literatures on fish oil and magnesium, respectively, were not fortuitous choices; they were the survivors of a process of elimination.
A Systematic Trial-and-Error Search Strategy. I have described in an information science article an exploratory trial-and-error process to aid in the discovery of logically related noninteractive medical literatures [4]. To illustrate that process, I showed how one could begin with the literature on Raynaud's disease and follow a search strategy that leads to a cure hypothesis without knowing the specific destination in advance. The first part of the process, aided by Medline searching, is intended to stimulate hypotheses about all plausible chains of causation and mechanisms of therapeutic action. The second part includes online searching of the SCI (Science Citation Index) and is intended to eliminate interactive literature pairs. In the present study, I began with an online search of the literature on migraine and followed a similar strategy. That strategy is based in part on a search for possible intermediate links in the causal chain of events that might lead from some unknown therapeutic agent to the amelioration of migraine...
Swanson D.R. (1990). Somatomedin C and arginine: Implicit connections between mutually-isolated literatures. Perspectives in Biology & Medicine, 33(2), 157-186.
DOI:10.1353/pbm.1990.0031      PMID:2406696
The purpose of this review is to show how a synthesis of the arginine and somatomedin literatures can lead one to identify an important but neglected area of research, an area that ultimately might enhance our understanding of certain emaciating diseases and age-related degenerative processes. Hundreds of biomedical articles document the stimulatory effect of infused arginine on the release of growth hormone (GH) in humans. That GH in turn can stimulate the production of circulating somatomedin C (SmC) is equally well documented. One can plausibly infer that arginine intake may influence blood levels of SmC. In this article, I show that there are many additional reasons to believe that arginine may have such an effect and that increased SmC levels can have important health benefits. Remarkably, however, there are almost no published articles that explicitly mention the possible influence of arginine on SmC. The idea that the two literatures on arginine and SmC can be linked by implicit arguments yet have few or no articles in common may have more general significance. If two such nonintersecting literatures do not cite each other and are not cited together ("co-cited") by other articles, they are said to be "noninteractive." In that event, the possibility that the two literatures have not before been brought together and synthesized may be worth considering. One may hope, through such a synthesis, to bring to light unnoticed connections that cannot be seen in the two literatures considered separately. This paper is the third in a series published in Perspectives based on the idea of bringing together logically related noninteractive literatures [1, 2]. The first two papers have each examined one example of such a pair of literatures, the first example being on Raynaud's disease and dietary fish oils, and the second on migraine and magnesium. The implied connection for each of these examples has received some degree of independent clinical corroboration [3, 4]. The first example preceded by more than 18 months the first report of a clinical test of dietary fish oil in Raynaud patients [1, 4].
The Somatomedin Hypothesis. In 1957, Salmon and Daughaday reported that a substance in the serum of normal rats stimulated the incorporation of sulfate by cartilage from hypophysectomized (hypox) rats, and that this same factor appeared in the plasma of hypox rats over a 24-hour period following treatment with GH [5]. However, GH added in vitro to cartilage had virtually no such effect. These experiments implied that the action of GH in stimulating the uptake of sulfate by rat cartilage was mediated by some other substance, which Salmon and Daughaday called a "sulfation factor." During the ensuing decade, the presence in serum of a GH-dependent sulfation factor was confirmed by various laboratories; moreover, it became clear that the sulfation factor stimulated a broad range of anabolic processes; its biological action was not limited to cartilage, or to rats, or to the incorporation of sulfate. To denote the broader anabolic and growth implications, as well as the relationship to GH (somatotropin), Daughaday, Hall, and other leading investigators in 1972 proposed the name "somatomedins" [6]. From experiments on glucose uptake in rat tissue, Hall concluded at about the same time that somatomedins were probably also the substance that caused "non-suppressible insulin-like activity" [7]. Several types of somatomedins eventually were isolated and sequenced under the name "insulin-like growth factors;" of interest here is insulin-like...
Swanson D.R. (1993). Intervening in the life cycles of scientific knowledge. Library Trends, 41(4), 606-631.
DOI:10.1016/0364-6408(93)90049-C
The growth of scientific knowledge is sometimes described in terms of a life cycle analogous to that of a living organism. This article examines certain shortcomings of such a model and proposes an alternative view of knowledge growth that has quite different implications. Literature cannot grow disproportionately to the growth of the communities and resources that produce it, but combinations of potentially related segments of literature can grow at a rate far higher than the capacity of the community to identify and assimilate such relatedness. Scientific knowledge, as it grows, becomes increasingly fragmented into specialties. Bringing together complementary specialized literatures previously isolated from one another may rejuvenate the merged components. Some contributions to knowledge therefore can acquire more than one life cycle as new relationships are formed that were not apparent at the time of original publication. Examples are given that show how such recombinant ideas can lead to previously unknown solutions to scientific problems. The dominant information problems of the future inevitably will derive from the fragmentation of knowledge, a problem shift that may lead to new discovery-oriented views of information searching and use.
Swanson D.R. (2006). Atrial fibrillation in athletes: Implicit literature-based connections suggest that overtraining and subsequent inflammation may be a contributory mechanism. Medical Hypotheses, 66(6), 1085-92.
DOI:10.1016/j.mehy.2006.01.006      PMID:16504414
The following hypothesis is plausible, readily testable, and apparently novel: Older athletes diagnosed with AF but otherwise healthy who have engaged in rigorous aerobic endurance exercise for more than a decade will have CRP levels that are higher than those of a similar population of athletes without AF. Corroboration of this hypothesis would then justify a prospective clinical trial of anti-inflammation therapy. It is of particular interest to extend recent studies of inflammation in AF to athletes; athletic behavior that can induce inflammation may contribute to understanding the origins of AF.
Swanson D.R. (2011). Literature-based resurrection of neglected medical discoveries. Journal of Biomedical Discovery & Collaboration, 6(6), 34-47.
DOI:10.5210/disco.v6i0.3515      PMID:21509725
It is possible to find in the medical literature many articles that have been neglected or ignored, in some cases for many years, but which are worth bringing to light because they report unusual findings that may be of current scientific interest. Resurrecting previously published but neglected hypotheses that have merit might be overlooked because it would seem to lack the novelty of "discovery" -- but the potential value of so doing is hardly arguable. Finding neglected hypotheses may be not only of great practical value, but also affords the opportunity to study the structure of such hypotheses in the hope of illuminating the more general problem of hypothesis generation.
Swanson D.R., & Smalheiser N.R. (1997). An interactive system for finding complementary literatures: A stimulus to scientific discovery. Artificial Intelligence, 91(2), 183-203.
DOI:10.1016/S0004-3702(97)00008-8
An unintended consequence of specialization in science is poor communication across specialties. Information developed in one area of research may be of value in another without anyone becoming aware of the fact. We describe and evaluate interactive software and database search strategies that facilitate the discovery of previously unknown cross-specialty information of scientific interest. The user begins by searching MEDLINE for article titles that identify a problem or topic of interest. From downloaded titles the software constructs input for additional database searches and produces a series of heuristic aids that help the user select a second set of articles complementary to the first set and from a different area of research. The two sets are complementary if together they can reveal new useful information that cannot be inferred from either set alone. The software output further helps the user identify the new information and derive from it a novel testable hypothesis. We report several successful tests and applications of the system.
Swanson D.R., Smalheiser N.R., & Bookstein A. (2001). Information discovery from complementary literatures: Categorizing viruses as potential weapons. Journal of the American Society for Information Science and Technology, 52(10), 797-812.
DOI:10.1002/asi.1135
Using novel informatics techniques to process the output of Medline searches, we have generated a list of viruses that may have the potential for development as weapons. Our findings are intended as a guide to the virus literature to support further studies that might then lead to appropriate defense and public health measures. This article stresses methods that are more generally relevant to information science. Initial Medline searches identified two kinds of virus literatures: the first concerning the genetic aspects of virulence, and the second concerning the transmission of viral diseases. Both literatures taken together are of central importance in identifying research relevant to the development of biological weapons. Yet, the two literatures had very few articles in common. We downloaded the Medline records for each of the two literatures and used a computer to extract all virus terms common to both. The fact that the resulting virus list includes most of an earlier independently published list of viruses considered by military experts to have the highest threat as potential biological weapons served as a test of the method; the test outcome showed a high degree of statistical significance, thus supporting an inference that the new viruses on the list share certain important characteristics with viruses of known biological warfare interest.
Torvik V.I., & Smalheiser N.R. (2007). A quantitative model for linking two disparate sets of articles in Medline. Bioinformatics, 23(13), 1658-1665.
PMID:17463015
Background: Identifying information that implicitly links two disparate sets of articles is a fundamental and intuitive data mining strategy that can help investigators address real scientific questions. The Arrowsmith two-node search finds title words and phrases (so-called B-terms) that are shared across two sets of articles within MEDLINE and displays them in a manner that facilitates human assessment. A serious stumbling-block has been the lack of a quantitative model for predicting which of the hundreds if not thousands of B-terms computed for a given search are most likely to be relevant to the investigator. Methodology/Principal Findings: Using a public two-node search interface, field testers devised a set of two-node searches under real life conditions and a certain number of B-terms were marked relevant. These were employed as 'gold standards;' each B-term was characterized according to eight complementary features that were strongly correlated with relevance. A logistic regression model was developed that permits one to estimate the probability of relevance for each B-term, to rank B-terms according to their likely relevance, and to estimate the overall number of relevant B-terms inherent in a given two-node search. Conclusions/Significance: The model greatly simplifies and streamlines the process of carrying out a two-node search, and may be applicable to a number of other literature-based discovery applications, including the so-called one-node search and related gene-centric strategies that incorporate implicit links to predict how genes may be related to each other and to human diseases. This should encourage much wider exploration of text mining for implicit information among the general scientific community.
Uzzi B., Mukherjee S., Stringer M., & Jones B. (2013). Atypical combinations and scientific impact. Science, 342(6157), 468-472.
DOI:10.1126/science.1240474      PMID:24159044
Novelty is an essential feature of creative ideas, yet the building blocks of new ideas are often embodied in existing knowledge. From this perspective, balancing atypical knowledge with conventional knowledge may be critical to the link between innovativeness and impact. Our analysis of 17.9 million papers spanning all scientific fields suggests that science follows a nearly universal pattern: The highest-impact science is primarily grounded in exceptionally conventional combinations of prior work yet simultaneously features an intrusion of unusual combinations. Papers of this type were twice as likely to be highly cited works. Novel combinations of prior work are rare, yet teams are 37.7% more likely than solo authors to insert novel combinations into familiar knowledge domains.
van der Eijk C.C., van Mulligen E.M., Kors J.A., Mons B., & van den Berg J. (2004). Constructing an associative concept space for literature-based discovery. Journal of the Association for Information Science and Technology, 55(5), 436-444.
DOI:10.1002/asi.10392
Scientific literature is often fragmented, which implies that certain scientific questions can only be answered by combining information from various articles. In this paper, a new algorithm is proposed for finding associations between related concepts present in literature. To this end, concepts are mapped to a multidimensional space by a Hebbian type of learning algorithm using co-occurrence data as input. The resulting concept space allows exploration of the neighborhood of a concept and finding potentially novel relationships between concepts. The obtained information retrieval system is useful for finding literature supporting hypotheses and for discovering previously unknown relationships between concepts. Tests on artificial data show the potential of the proposed methodology. In addition, preliminary tests on a set of Medline abstracts yield promising results.
Vos R., Aarts S., van Mulligen E., Metsemakers J., van Boxtel M.P., Verhey F., & van den Akker M. (2014). Finding potentially new multimorbidity patterns of psychiatric and somatic diseases: Exploring the use of literature-based discovery in primary care research. Journal of the American Medical Informatics Association, 21(1), 139-145.
DOI:10.1136/amiajnl-2012-001448      PMID:23775174
BACKGROUND: Multimorbidity, the co-occurrence of two or more chronic medical conditions within a single individual, is increasingly becoming part of daily care of general medical practice. Literature-based discovery may help to investigate the patterns of multimorbidity and to integrate medical knowledge for improving healthcare delivery for individuals with co-occurring chronic conditions. OBJECTIVE: To explore the usefulness of literature-based discovery in primary care research through the key-case of finding associations between psychiatric and somatic diseases relevant to general practice in a large biomedical literature database (Medline). METHODS: By using literature-based discovery for matching disease profiles as vectors in a high-dimensional associative concept space, co-occurrences of a broad spectrum of chronic medical conditions were matched for their potential in biomedicine. An experimental setting was chosen in parallel with expert evaluations and expert meetings to assess performance and to generate targets for integrating literature-based discovery in multidisciplinary medical research of psychiatric and somatic disease associations. RESULTS: Through stepwise reductions a reference set of 21,945 disease combinations was generated, from which a set of 166 combinations between psychiatric and somatic diseases was selected and assessed by text mining and expert evaluation. CONCLUSIONS: Literature-based discovery tools generate specific patterns of associations between psychiatric and somatic diseases: one subset was appraised as promising for further research; the other subset surprised the experts, leading to intricate discussions and further eliciting of frameworks of biomedical knowledge. These frameworks enable us to specify targets for further developing and integrating literature-based discovery in multidisciplinary research of general practice, psychology and psychiatry, and epidemiology.
Weeber M., Vos R., Klein H., de Jong-van den Berg L.T.W., Aronson A.R., & Molema G. (2003). Generating hypotheses by discovering implicit associations in the literature: A case report of a search for new potential therapeutic uses for thalidomide. Journal of the American Medical Informatics Association, 10(3), 252-259.
DOI:10.1197/jamia.M1158      PMID:12626374
Widdows D., & Cohen T. (2015). Reasoning with vectors: A continuous model for fast robust inference. Logic Journal of the IGPL, 23(2), 141-73.
DOI:10.1093/jigpal/jzu028      PMCID:PMC4646228
This paper describes the use of continuous vector space models for reasoning with a formal knowledge base. The practical significance of these models is that they support fast, approximate but robust inference and hypothesis generation, which is complementary to the slow, exact, but sometimes brittle behavior of more traditional deduction engines such as theorem provers. The paper explains the way logical connectives can be used in semantic vector models, and summarizes the development of Predication-based Semantic Indexing, which involves the use of Vector Symbolic Architectures to represent the concepts and relationships from a knowledge base of subject-predicate-object triples. Experiments show that the use of continuous models for formal reasoning is not only possible, but already demonstrably effective for some recognized informatics tasks, and showing promise in other traditional problem areas. Examples described in this paper include: predicting new uses for existing drugs in biomedical informatics; removing unwanted meanings from search results in information retrieval and concept navigation; type-inference from attributes; comparing words based on their orthography; and representing tabular data, including modelling numerical values. The algorithms and techniques described in this paper are all publicly released and freely available in the Semantic Vectors open-source software package.
Wren J.D. (2004). Extending the mutual information measure to rank inferred literature relationships. BMC Bioinformatics, 5:145. Retrieved on July 17, 2017, from
Wren J.D., Bekeredjian R., Stewart J.A., Shohet R.V., & Garner H.R. (2004). Knowledge discovery by automated identification and ranking of implicit relationships. Bioinformatics, 20(3), 389-398.
DOI:10.1093/bioinformatics/btg421
Wolchover N. (2017). A long-sought proof, found and almost lost. Quanta Magazine, March 28, 2017. Retrieved on July 17, 2017, from
Workman T.E., Fiszman M., Cairelli M.J., Nahl D., & Rindflesch T.C. (2016). Spark, an application based on serendipitous knowledge discovery. Journal of Biomedical Informatics, 60, 23-37.
DOI:10.1016/j.jbi.2015.12.014      PMID:26732995
Findings from information-seeking behavior research can inform application development. In this report we provide a system description of Spark, an application based on findings from Serendipitous Knowledge Discovery studies and data structures known as semantic predications. Background information and the previously published IF-SKD model (outlining Serendipitous Knowledge Discovery in online environments) illustrate the potential use of information-seeking behavior in application design. A detailed overview of the Spark system illustrates how methodologies in design and retrieval functionality enable production of semantic predication graphs tailored to evoke Serendipitous Knowledge Discovery in users.
Yang H.T., Ju J.H., Wong Y.T., Shmulevich I., & Chiang J.H. (2017). Literature-based discovery of new candidates for drug repurposing. Briefings in Bioinformatics, 18(3), 488-497.
DOI:10.1093/bib/bbw030      PMID:27113728
Drug development is an expensive and time-consuming process; these could be reduced if the existing resources could be used to identify candidates for drug repurposing. This study sought to do this by text mining a large-scale literature repository to curate repurposed drug lists for different cancers. We devised a pattern-based relationship extraction method to extract disease–gene and gene–drug direct relationships from the literature. These direct relationships are used to infer indirect relationships using the ABC model. A gene-shared ranking method based on drug target similarity was then proposed to prioritize the indirect relationships. Our method of assessing drug target similarity correlated to existing anatomical therapeutic chemical code-based methods with a Pearson correlation coefficient of 0.9311. The indirect relationships ranking method achieved a significant mean average precision score of top 100 most common diseases. We also confirmed the suitability of candidates identified for repurposing as anticancer drugs by conducting a manual review of the literature and the clinical trials. Eventually, for visualization and enrichment of huge amount of repurposed drug information, a chord diagram was demonstrated to rapidly identify two novel indications for further biological evaluations.
Yetisgen-Yildiz M., & Pratt W. (2009). A new evaluation methodology for literature-based discovery systems. Journal of Biomedical Informatics, 42(4), 633-643.
DOI:10.1016/j.jbi.2008.12.001      PMID:19124086
While medical researchers formulate new hypotheses to test, they need to identify connections to their work from other parts of the medical literature. However, the current volume of information has become a great barrier for this task. Recently, many literature-based discovery (LBD) systems have been developed to help researchers identify new knowledge that bridges gaps across distinct sections of the medical literature. Each LBD system uses different methods for mining the connections from text and ranking the identified connections, but none of the currently available LBD evaluation approaches can be used to compare the effectiveness of these methods. In this paper, we present an evaluation methodology for LBD systems that allows comparisons across different systems. We demonstrate the abilities of our evaluation methodology by using it to compare the performance of different correlation-mining and ranking approaches used by existing LBD systems. This evaluation methodology should help other researchers compare approaches, make informed algorithm choices, and ultimately help to improve the performance of LBD systems overall.