BioContext is a text mining system for extracting information about molecular processes in biomedical articles.
Using the data extracted by BioContext, it is possible to get an overview of a range of biomolecular processes relating to a particular gene (example) or anatomical location (example).
BioContext is the subject of the following paper:
Martin Gerner, Farzaneh Sarafraz, Casey M. Bergman, Goran Nenadic. (2012) BioContext: an integrated text mining system for large-scale extraction and contextualisation of biomolecular events. Bioinformatics. (html, pdf)
The following PhD thesis also includes an extensive analysis of BioContext data, aimed at identifying statements in different papers that contrast or directly conflict with each other:
Farzaneh Sarafraz. (2012) Finding Conflicting Statements in the Biomedical Literature. University of Manchester, UK. (html, pdf)
January, 2012: Release of the BioContext (v. 1.0) and TextPipe (v. 1.0) software packages.
December, 2011: Release of the search interface for the BioContext data.
Data Downloads
All data was extracted from the complete MEDLINE (2011 baseline files) and the open-access subset of PubMed Central (download as of May 2011). Rows that are filled with NULL values indicate the absence of data (e.g., a null row in the gene table for PMID 123456 means that no genes were found for that document).
events.tar.gz (4.2 GB; 36.1 million entries): The complete event output of BioContext, containing events from the Turku [4] and EventMine [5] event extraction systems, negation/speculation detection by Negmole [6], anatomical associations, document links, the sentences they were found in, and more. Click here for more information about the type of information contained in the download.
Document parses (133 million sentences per tool): For access to document parses from the GDep [7] (23 GB compressed), Enju [8] (149 GB compressed), or McClosky-Charniak [9] (36 GB compressed) parsers, contact us.
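The NULL-row convention described above can be handled when reading the data, as in the minimal Python sketch below. It is an illustration only, not part of the BioContext distribution: it assumes the tables are tab-separated text with a header line and that absent values are written as the literal string NULL. The column names (pmid, gene_id, mention), the sample rows, and the helper names are hypothetical.

```python
import csv
import io

# Assumption: absent values appear as the literal string "NULL".
NULL_MARKER = "NULL"


def is_placeholder(row, key_column):
    """Return True if every column except the document ID is NULL."""
    return all(
        value == NULL_MARKER
        for name, value in row.items()
        if name != key_column
    )


def split_rows(lines, key_column="pmid"):
    """Separate real entries from placeholder rows.

    Returns (entries, empty_docs): the rows that carry data, and the IDs
    of documents that were processed but yielded no data.
    """
    reader = csv.DictReader(lines, delimiter="\t")
    entries, empty_docs = [], []
    for row in reader:
        if is_placeholder(row, key_column):
            empty_docs.append(row[key_column])
        else:
            entries.append(row)
    return entries, empty_docs


# Hypothetical sample in the assumed layout (not real BioContext output).
sample = io.StringIO(
    "pmid\tgene_id\tmention\n"
    "123456\tNULL\tNULL\n"
    "234567\tG1\tgeneA\n"
)
entries, empty_docs = split_rows(sample)
print(len(entries), empty_docs)
```

Keeping the placeholder IDs in a separate list preserves the distinction between a document that was processed and contained no genes and a document that was never processed.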
Source code and binary downloads
biocontext-1.0.tar.gz: Source code, binaries, and documentation for the BioContext system.
textpipe-1.0.tar.gz: Source code, binaries, and documentation for the TextPipe connection framework.
References
[1] Gerner, M., Nenadic, G. and Bergman, C. M. (2010). "LINNAEUS: a species name identification system for biomedical literature." BMC Bioinformatics 11: 85.
[2] Hakenberg, J., Gerner, M., Haeussler, M., Solt, I., Plake, C., Schroeder, M., Gonzalez, G., Nenadic, G. and Bergman, C. M. (2011). "The GNAT library for local and remote gene mention normalization." Bioinformatics 27(19): 2769-71.
[3] Huang, M., Liu, J. and Zhu, X. (2011). "GeneTUKit: a software for document-level gene normalization." Bioinformatics 27(7): 1032-3.
[4] Björne, J., Heimonen, J., Ginter, F., Airola, A., Pahikkala, T. and Salakoski, T. (2009). "Extracting complex biological events with rich graph-based feature sets." In Proceedings of the Workshop on BioNLP: Shared Task, Boulder, Colorado: 10-18.
[5] Miwa, M., Pyysalo, S., Hara, T. and Tsujii, J. (2010). "Evaluating Dependency Representation for Event Extraction." In the 23rd International Conference on Computational Linguistics (COLING 2010), August 2010: 779-787.
[6] Sarafraz, F. and Nenadic, G. (2010). "Using SVMs with the Command Relation Features to Identify Negated Events in Biomedical Literature." In The Workshop on Negation and Speculation in Natural Language Processing, Uppsala, Sweden.
[7] Sagae, K. and Tsujii, J. (2007). "Dependency parsing and domain adaptation with LR models and parser ensembles." In CoNLL 2007 Shared Task.
[8] Sagae, K., Miyao, Y. and Tsujii, J. (2008). "Comparative Parser Performance Analysis across Grammar Frameworks through Automatic Tree Conversion using Synchronous Grammars." In COLING 2008.
[9] McClosky, D., Charniak, E. and Johnson, M. (2006). "Effective Self-Training for Parsing." In HLT-NAACL.