We present a database of annotated biomedical text corpora merged into a portable data structure with uniform conventions. MedTag combines three corpora, MedPost, ABGene and GENETAG, within a common relational database data model. The GENETAG corpus has been modified to reflect new definitions of genes and proteins. The MedPost corpus has been updated to include 1,000 additional sentences from the clinical medicine domain. All data have been updated with original MEDLINE text excerpts, PubMed identifiers, and tokenization independence to facilitate data accuracy, consistency and usability. The data are available in flat files along with software to facilitate loading the data into a relational SQL database from ftp://ftp.ncbi.nlm.nih.gov/pub/lsmith/MedTag/medtag.tar.gz.