“ with” in prep_with) for better performance. We find that basic dependencies give the best performance for event extraction, with little difference between the other variants. This result is surprising, as variants other than basic have features, such as the resolution of conjunctions, that are specifically designed for practical applications. However, basic dependencies were also found to consistently provide the best performance for the other parsers6. Thus, in the following evaluation, the basic dependencies are adopted for all SD results.
5.3 Parser Comparison on Event Extraction
Results with different parsers and different formats on the development data set are summarized in Table 3. Baseline results are produced by removing dependency information from the parse results. The baseline results differ between the representations because the word base forms and POS tags produced by the GENIA tagger for use with SD and CoNLL differ from those of PAS, and because head word information in the Enju format is used. The evaluation finds the best results for both tasks with Enju, using its native output format. However, as discussed in Section 2.1, the treatment of PAS differs slightly from that of the other two formats, so this result does not necessarily indicate that PAS is the best alternative for event extraction.
The Bikel and Stanford WSJ parsers, lacking models adapted to the biomedical domain, perform mostly worse than the other parsers. The other parsers, even though trained on the treebank, do not perform as well as when the GENIA treebank itself is used, but, with the exception of Stanford eng with CoNLL, their results are only slightly worse than the treebank results. The results with the data derived from the GENIA treebank can be considered upper bounds for the parsers and formats on the task, although conversion errors are expected to lower these bounds to some extent. The results suggest that there is relatively little remaining benefit to be gained from improving parser performance.
6 Collapsed tree dependencies are not evaluated for the C&C parser since the conversion is not provided.
5.4 Effects of Dependency Representation
Intrinsic evaluation results (Section 5.1) cannot be used directly for comparing the parsers, since some of the parsers contain models trained on the GENIA treebank. To investigate how the intrinsic evaluation results affect event extraction, we performed event extraction with the dependency types removed. Table 4 summarizes the results with the dependency structures (without the dependency types) on the development data set. Interestingly, we find that performance increases for Bikel and Stanford when the dependency types are eliminated. This implies that the inaccurate dependency types shown in Table 1 confused the event extraction system. SD and PAS drop more than CoNLL, and Enju with CoNLL structures performs best overall when the dependency types are removed. This result shows that the formats have their own strengths in finding events, and that the CoNLL structure with SD or PAS types can be a good representation for event extraction.
By comparing Table 3, Table 1, and Table 4, we found that better dependency performance does not always produce better event extraction performance, especially when the difference in dependency performance is small. The MC and Enju results show that dependency performance is important for event extraction. SD can be better than CoNLL for event extraction (as shown with the gold treebank data in Table 3), but the types and relations of CoNLL were well predicted, and MC and Enju performed better overall with CoNLL than with SD.
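The removal of dependency types described above can be illustrated with a minimal sketch. This is not the implementation used in the experiments; the `Dependency` tuple, the `strip_types` helper, and the placeholder label `"dep"` are hypothetical names introduced only for illustration. The sketch assumes a parse is a list of (head, dependent, label) triples and shows that only the labels are discarded while the head-dependent structure is kept.

```python
from typing import List, NamedTuple


class Dependency(NamedTuple):
    """One labeled dependency edge (hypothetical representation)."""
    head: int        # index of the head token
    dependent: int   # index of the dependent token
    label: str       # dependency type, e.g. "nsubj"


# Hypothetical placeholder used once the real type has been discarded.
UNTYPED = "dep"


def strip_types(deps: List[Dependency], placeholder: str = UNTYPED) -> List[Dependency]:
    """Return a copy of the parse with every type replaced by a placeholder.

    The head-dependent pairs are left untouched, so the dependency
    structure is preserved while the type information is removed.
    """
    return [d._replace(label=placeholder) for d in deps]


# Toy parse with three typed edges (indices and labels are illustrative).
parse = [
    Dependency(head=2, dependent=1, label="nsubj"),
    Dependency(head=2, dependent=4, label="dobj"),
    Dependency(head=4, dependent=3, label="amod"),
]
untyped = strip_types(parse)
```

Features built from `untyped` can then only use the existence and direction of an edge, which is the condition summarized in Table 4.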
5.5 Performance of Event Extraction System
Several systems are compared by their extraction performance on the shared task test data in Table 5. GDep and Enju with PAS are used for the evaluation, which is the same evaluation setting as that of the original system by Miwa et al. (2010b). The performance of the best systems in the original shared task is shown for reference (Björne et al. (2009) in Task 1 and Riedel et al. (2009) in Task 2). The event extraction system performs significantly better than the best systems in the shared task, further outperforming the original system. This shows that the comparison of the parsers is performed with a state-of-the-art sys-