The USI project was initiated at the LGI2P research center during the PhD of Nicolas Fiorini. The main motivation behind this work is to provide a kNN-based approach for annotating entities, be they textual documents, songs or movies. While other methods often combine machine learning with feature analysis of a given document (e.g. textual features), USI's approach is completely independent of the document content. The only requirement for an accurate annotation is an accurate, already annotated neighborhood. Finding a good neighborhood is an independent task, related to information retrieval, for which an extensive list of tools already exists.
With the rise of thesauri, ontologies and knowledge representations in general, more and more data can be annotated with concepts. Semantic indexing originated in the biomedical field, but much more content can now benefit from conceptual indexing using DBpedia or Freebase. USI aims to do so whatever the content and whatever the thesaurus, relying on the Semantic Measures Library to compute semantic similarities between concepts.
USI is presented as a heuristic algorithm optimizing an objective function. We propose an algorithmic optimization of this heuristic to make it fast enough for practical use, implemented in the USI Java library. This library is hosted on GitHub and can be freely downloaded and integrated into your project.
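To make the idea of annotating from a neighborhood more concrete, here is a minimal sketch of a greedy heuristic over an objective function. It is not USI's actual algorithm, objective or API: the class and method names, the toy genre taxonomy, the Jaccard-style concept similarity (used here in place of the Semantic Measures Library), the best-match-average group similarity, the size penalty and the backward-elimination strategy are all assumptions chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Illustrative sketch only: none of these names come from the USI library. */
public class NeighborhoodAnnotationSketch {

    // Hypothetical toy taxonomy stored as child -> parent links; "genre" is the root.
    static final Map<String, String> PARENT = new HashMap<>();
    static {
        PARENT.put("comedy", "genre");
        PARENT.put("drama", "genre");
        PARENT.put("romantic comedy", "comedy");
        PARENT.put("slapstick", "comedy");
    }

    static Set<String> setOf(String... items) {
        return new HashSet<>(Arrays.asList(items));
    }

    // The concept itself plus all of its ancestors up to the root.
    static Set<String> ancestors(String concept) {
        Set<String> result = new HashSet<>();
        for (String c = concept; c != null; c = PARENT.get(c)) {
            result.add(c);
        }
        return result;
    }

    // Assumed pairwise measure: Jaccard index of the two ancestor sets.
    static double sim(String a, String b) {
        Set<String> common = ancestors(a);
        common.retainAll(ancestors(b));
        Set<String> all = ancestors(a);
        all.addAll(ancestors(b));
        return (double) common.size() / all.size();
    }

    // Average, over concepts in 'from', of the best similarity found in 'to'.
    static double bestMatch(Set<String> from, Set<String> to) {
        double sum = 0;
        for (String a : from) {
            double best = 0;
            for (String b : to) {
                best = Math.max(best, sim(a, b));
            }
            sum += best;
        }
        return sum / from.size();
    }

    // Assumed objective: mean symmetric best-match similarity to the
    // neighbors' annotations, minus a penalty per selected concept.
    // Expects a non-empty annotation and non-empty neighbor annotations.
    static double objective(Set<String> annotation, List<Set<String>> neighbors, double penalty) {
        double total = 0;
        for (Set<String> n : neighbors) {
            total += 0.5 * (bestMatch(annotation, n) + bestMatch(n, annotation));
        }
        return total / neighbors.size() - penalty * annotation.size();
    }

    // Greedy backward elimination: start from every concept used by the
    // neighbors and drop, one at a time, the concept whose removal most
    // improves the objective, until no removal helps.
    static Set<String> annotate(List<Set<String>> neighbors, double penalty) {
        Set<String> current = new HashSet<>();
        for (Set<String> n : neighbors) {
            current.addAll(n);
        }
        double currentScore = objective(current, neighbors, penalty);
        while (current.size() > 1) {
            String toRemove = null;
            double bestScore = currentScore;
            for (String c : current) {
                Set<String> candidate = new HashSet<>(current);
                candidate.remove(c);
                double score = objective(candidate, neighbors, penalty);
                if (score > bestScore) {
                    bestScore = score;
                    toRemove = c;
                }
            }
            if (toRemove == null) {
                break; // local optimum reached
            }
            current.remove(toRemove);
            currentScore = bestScore;
        }
        return current;
    }

    public static void main(String[] args) {
        // Annotations of three already annotated neighbors (made-up data).
        List<Set<String>> neighbors = new ArrayList<>();
        neighbors.add(setOf("slapstick", "drama"));
        neighbors.add(setOf("slapstick"));
        neighbors.add(setOf("romantic comedy", "slapstick"));
        System.out.println(annotate(neighbors, 0.05));
    }
}
```

On this made-up neighborhood the sketch prints `[slapstick]`, the only concept shared by all three neighbors. Note that the document content never appears in the computation: only the neighbors' annotations and the taxonomy are used, which is the property described above.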
The project source code is made available on GitHub. We try to update it frequently to keep it working and generic. However, please consider it as a prototype/beta version.
The project is developed in Java as a Maven project; feel free to report bugs. To include USI in your project and use its functionalities (loading a knowledge representation using the SML, building an index of annotated documents, and annotating a document whose neighborhood is known), you may be interested in the jar build instead.
We developed two demos to show how efficient and effective USI can be. The first demo relies on data from Freebase, more specifically on movies. Movies are annotated on Freebase with media genres. Media genres belong to a taxonomy of genres that we used as the knowledge representation, and movies are the documents to annotate. Note that the movie demonstration requires a recent browser such as Google Chrome and a high-bandwidth connection. The other use case we focused on is the one on which we evaluated the method, that is, the indexing of biomedical papers using MeSH.
When we validated our approach, we were hampered by the lack of evaluation datasets and tools. Here, we would like to share everything needed to replicate our results or to evaluate new contributions. First of all, the dataset on which we tested our method is the one available on Dr. Lu's webpage:
Recommending MeSH terms for annotating biomedical articles.
Huang M, Névéol A, Lu Z.
J Am Med Inform Assoc. 2011
Learning to Annotate Scientific Publications
Huang M, Lu Z.
In Proceedings of the 23rd International Conference on Computational Linguistics (COLING 2010), 2010
We also provide two compressed files: USI-MeSH and MeSH-validation. The former contains a jar file with USI to replicate our results on the L1000 dataset, a properties file to configure it, and brief documentation. The latter contains a small jar file for evaluating results obtained on L1000, along with a properties file. Finally, we also made available a jar version of USI so that it can easily be included in a Java project.
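For readers who want to sanity-check their own results before running the validation jar, the sketch below computes set-based precision, recall and F-measure for a single document. These are standard measures for comparing predicted and reference annotations; whether MeSH-validation reports exactly these is an assumption, and the class name, method names and sample headings are purely illustrative.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Illustrative sketch only: not the code shipped in the MeSH-validation archive. */
public class AnnotationEvaluationSketch {

    // Number of concepts present in both sets.
    static int overlap(Set<String> predicted, Set<String> reference) {
        Set<String> hits = new HashSet<>(predicted);
        hits.retainAll(reference);
        return hits.size();
    }

    // Fraction of predicted concepts that are correct.
    static double precision(Set<String> predicted, Set<String> reference) {
        return predicted.isEmpty() ? 0.0 : (double) overlap(predicted, reference) / predicted.size();
    }

    // Fraction of reference concepts that were found.
    static double recall(Set<String> predicted, Set<String> reference) {
        return reference.isEmpty() ? 0.0 : (double) overlap(predicted, reference) / reference.size();
    }

    // Harmonic mean of precision and recall.
    static double fMeasure(Set<String> predicted, Set<String> reference) {
        double p = precision(predicted, reference);
        double r = recall(predicted, reference);
        return (p + r == 0.0) ? 0.0 : 2 * p * r / (p + r);
    }

    public static void main(String[] args) {
        // Made-up annotations for a single document.
        Set<String> predicted = new HashSet<>(Arrays.asList("Humans", "Mice", "Neoplasms"));
        Set<String> reference = new HashSet<>(
                Arrays.asList("Humans", "Neoplasms", "Apoptosis", "Signal Transduction"));
        System.out.printf("P=%.3f R=%.3f F=%.3f%n",
                precision(predicted, reference),
                recall(predicted, reference),
                fMeasure(predicted, reference));
    }
}
```

Here two of the three predicted headings are correct and two of the four reference headings are found, so precision is 2/3, recall is 1/2 and the F-measure is 4/7. Averaging these values over all documents of a dataset is one common way to obtain a corpus-level score.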
Feel free to contact the team that initiated the project if you have any requests or suggestions concerning USI, or if you want to collaborate with us. This project results from the collaboration between the École des Mines d'Alès and Montpellier SupAgro.