Background: We are witnessing an exponential increase in biomedical research citations in PubMed. However, translating biomedical discoveries into practical treatments is estimated to
take around 17 years, according to the 2000 Yearbook of Medical Informatics, and much information is lost during this transition. Pharmaceutical companies spend huge sums to identify opinion leaders and centers of excellence. Conventional methods such as literature search, survey, observation, self‐identification, expert opinion, and sociometry not only need much human effort, but are also non‐comprehensive. Such huge delays and costs can be reduced by “connecting those who produce the knowledge with those who apply it”. A humble step in this direction is large‐scale discovery of persons and organizations involved in specific areas of research. This can be achieved by automatically extracting and disambiguating author names and affiliation strings retrieved through Medical Subject Heading (MeSH) terms and other keywords associated with articles in PubMed. In this study, we propose NEMO (Normalization Engine for Matching Organizations), a system for extracting organization names from the affiliation strings provided in PubMed abstracts, building a thesaurus (list of synonyms) of organization names, and subsequently normalizing them to a canonical organization name using
the thesaurus.
Results: We used a parsing process that involves multi‐layered rule matching with multiple dictionaries. The normalization process involves clustering based on weighted local sequence
alignment metrics to address synonymy at word level, and local learning based on finding connected components to address synonymy. The graphical user interface and java client library
of NEMO are available at http://lnxnemo.sourceforge.net.
Conclusion: NEMO associates each biomedical paper and its authors with a unique organization name and the geopolitical location of that organization. This system provides more accurate
information about organizations than the raw affiliation strings provided in PubMed abstracts. It can be used for : a) bimodal social network analysis that evaluates the research relationships
between individual researchers and their institutions; b) improving author name disambiguation; c) augmenting National Library of Medicine (NLM)’s Medical Articles Record
System (MARS) system for correcting errors due to OCR on affiliation strings that are in small fonts; and d) improving PubMed citation indexing strategies (authority control) based on
normalized organization name and country.