Manual annotation is a tedious and time-consuming process, usually needed for generating training corpora to be used in a machine learning scenario. The distant supervision paradigm aims at automatically generating such corpora from structured data. The active learning paradigm aims at reducing the effort needed for manual annotation. We explore active and distant learning approaches jointly to limit the amount of automatically generated data needed for the use case of relation extraction by increasing the quality of the annotations. The main idea of using distantly labeled corpora is that they can simplify and speed up the generation of models, e.g. for extracting relationships between entities of interest, while the selection of instances is typically performed randomly. We propose the use of query-by-committee to select instances instead. This approach is similar to the active learning paradigm, with the difference that unlabeled instances are annotated weakly rather than by human experts. Different strategies using low or high confidence are compared to random selection. Experiments on publicly available data sets for detection of protein-protein interactions show a statistically significant improvement in F1 measure when adding instances with a high agreement of the committee.
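The selection step described above can be illustrated with a minimal sketch. The abstract does not specify the committee members or the agreement measure, so everything here is an assumption made for illustration: committee members are modeled as plain functions returning 0 or 1, agreement is taken to be the fraction of members voting for the majority label, and the names `committee_agreement` and `select_instances` are hypothetical.

```python
import random
from collections import Counter
from typing import Callable, List, Sequence, Tuple

# Assumption: a committee member maps a candidate sentence to a label (1 = interaction).
Classifier = Callable[[str], int]
# Assumption: a distantly labeled instance is a (sentence, weak_label) pair.
Instance = Tuple[str, int]


def committee_agreement(committee: Sequence[Classifier], text: str) -> float:
    """Fraction of committee members that vote for the majority label."""
    votes = Counter(member(text) for member in committee)
    majority_count = votes.most_common(1)[0][1]
    return majority_count / len(committee)


def select_instances(
    committee: Sequence[Classifier],
    pool: Sequence[Instance],
    k: int,
    strategy: str = "high",
    seed: int = 0,
) -> List[Instance]:
    """Pick k weakly labeled instances by high agreement, low agreement, or at random."""
    if strategy == "random":
        return random.Random(seed).sample(list(pool), k)
    if strategy not in ("high", "low"):
        raise ValueError(f"unknown strategy: {strategy}")
    # Sort by agreement; descending for "high", ascending for "low".
    ranked = sorted(
        pool,
        key=lambda inst: committee_agreement(committee, inst[0]),
        reverse=(strategy == "high"),
    )
    return ranked[:k]
```

In this sketch the selected instances keep their weak labels and would be added to the training set; how the committee is trained, and whether its vote is compared against the weak label, is left open because the abstract does not say.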