ProSpecTome Homepage
Introduction
ProSpecTome is a protein-specific corpus designed to facilitate the fair evaluation of protein name taggers. It was compiled by re-annotating 243 MEDLINE abstracts from the widely used JNLPBA evaluation corpus (available from the Evaluation Data link on this page).
The annotation guidelines used in the construction of ProSpecTome differ substantially from those used to construct the JNLPBA corpus. For example, ProSpecTome distinguishes two levels of specificity within the category protein: general references to proteins are annotated separately from the names of individual proteins and protein families (see the ProSpecTome annotation guidelines for full details). Using both corpora together, a researcher can carry out a richer analysis of tagger performance than was previously possible.
ProSpecTome was constructed by Renata Kabiljo with the help of Diana Stoycheva under the supervision of Dr Adrian Shepherd. Inter-annotator agreement, assessed through the independent annotation of 43 of the 243 abstracts, was 0.89 (F-measure).
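For readers unfamiliar with F-measure as an agreement statistic, the following is a minimal sketch of how such a figure can be computed from two annotators' outputs. It is not the procedure used for ProSpecTome: the matching criterion (exact match on abstract, span boundaries and label), the tuple representation of an annotation and the function name f_measure_agreement are all assumptions made for illustration.

```python
from typing import Set, Tuple

# Hypothetical representation of one annotation:
# (abstract_id, start_offset, end_offset, label)
Annotation = Tuple[str, int, int, str]


def f_measure_agreement(a: Set[Annotation], b: Set[Annotation]) -> float:
    """F-measure between two annotation sets, assuming exact matching.

    Annotator A is treated as the reference and annotator B as the
    response. The result is symmetric, so the choice does not matter.
    """
    if not a and not b:
        return 1.0  # both annotators marked nothing: full agreement
    matches = len(a & b)  # annotations identical in both sets
    if matches == 0:
        return 0.0
    precision = matches / len(b)
    recall = matches / len(a)
    return 2 * precision * recall / (precision + recall)


# Toy example: two of three annotations agree for each annotator.
annotator_a = {("1", 0, 5, "protein"), ("1", 10, 15, "protein"), ("1", 20, 25, "protein")}
annotator_b = {("1", 0, 5, "protein"), ("1", 10, 15, "protein"), ("1", 30, 35, "protein")}
agreement = f_measure_agreement(annotator_a, annotator_b)
```

With exact matching, the score reduces to twice the number of shared annotations divided by the total number of annotations made by both annotators, which is why it does not depend on which annotator is taken as the reference.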
Downloads
The following files are available for download:
- ProSpecTome.xml: XML file containing the 243 abstracts with ProSpecTome annotations
- forIAA_A.xml: XML file containing 43 abstracts annotated by Annotator A for the purpose of calculating inter-annotator agreement
- forIAA_B.xml: XML file containing 43 abstracts annotated by Annotator B for the purpose of calculating inter-annotator agreement
- ProSpecTome.css: CSS2 stylesheet for viewing the ProSpecTome XML file (Note: requires a browser that supports CSS2, e.g. Mozilla Firefox)
- guidelines.pdf: PDF file containing the annotation guidelines used in the compilation of the ProSpecTome corpus
Contact
All enquiries should be directed to Renata Kabiljo (email r.kabiljo@mail.cryst.bbk.ac.uk).
Publications
If you use ProSpecTome in your research, please cite:
- Renata Kabiljo, Diana Stoycheva and Adrian J Shepherd (2007) ProSpecTome: a new tagged corpus for protein named entity recognition. In Proceedings of The ISMB BioLINK, Special Interest Group on Text Data Mining, Vienna, July 2007, pp. 24-27.