Modelling linguistic phenomena requires highly structured and complex data representations. Document representation frameworks (DRFs) provide an interface to store and retrieve multiple annotation layers over a document. Researchers face a difficult choice: use a heavyweight DRF or implement a custom one. Either way the cost is substantial: learning a new, complex system, or continually adding features to a home-grown system that risks overrunning its original scope. We introduce docrep, a lightweight and efficient DRF, and compare it against existing DRFs. We discuss our design goals and implementations in C++, Python, and Java. We transform the OntoNotes 5 corpus using docrep and UIMA, providing a quantitative comparison and discussing modelling trade-offs. We conclude with qualitative feedback from researchers who have used docrep for their own projects. Ultimately, we hope docrep is useful for the busy researcher who wants the benefits of a DRF, but has better things to do than to write one.