4 Evaluation Setting


4.1 Event Extraction Evaluation

Event extraction performance is evaluated using the evaluation script provided by the BioNLP’09 shared task organizers for the development data set, and the online evaluation system of the task for the test data set.² Results are reported under the official evaluation criterion of the task, i.e. the “Approximate Span Matching/Approximate Recursive Matching” criterion.

The event extraction system described in Section 2.1 is used with the default settings given in (Miwa et al., 2010b). The C-values of SVMs are set to 1.0, but the positive and negative examples are balanced by placing more weight on the positive examples. The examples predicted with confidence greater than 0.5, as well as the examples with the most confident labels, are extracted. Task 1 and Task 2 are solved at once for the evaluation.
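The weighting and extraction rules above can be illustrated with a minimal sketch. It is not the implementation of (Miwa et al., 2010b): the inverse class-ratio weight, the dictionary of per-label confidences, and the function and label names are all assumptions made for illustration.

```python
def positive_class_weight(n_positive, n_negative):
    """Weight for positive examples so both classes carry equal total weight.

    Assumption: balancing uses the inverse class ratio; the text only says
    that more weight is placed on the positive examples.
    """
    if n_positive == 0:
        return 1.0
    return n_negative / n_positive


def select_labels(confidences, threshold=0.5):
    """Return labels to extract for one candidate.

    confidences: dict mapping a label to its confidence (hypothetical format).
    Every label above the threshold is kept, and the single most confident
    label is always kept even if it falls below the threshold.
    """
    if not confidences:
        return []
    best = max(confidences, key=confidences.get)
    chosen = {label for label, c in confidences.items() if c > threshold}
    chosen.add(best)
    return sorted(chosen)


# Two labels exceed the threshold, so both are extracted.
print(select_labels({"Theme": 0.7, "Cause": 0.6, "None": 0.1}))
# No label exceeds the threshold, so only the most confident one is kept.
print(select_labels({"Theme": 0.3, "Cause": 0.2}))
```

Under this reading, every candidate yields at least one label, and the threshold only adds further labels on top of the most confident one.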

Some of the parse results do not include word base forms or part-of-speech (POS) tags, which are required by the event extraction system. To use the output of these parsers, the GENIA Tagger (Tsuruoka et al., 2005) output is adopted to add this information to the results.

4.2 Dependency Representation Evaluation

The parser outputs in SD and CoNLL can be assumed to be trees, so each node in the tree has only one parent node. However, in the converted trees, nodes can have more than one parent, so we cannot simply apply accuracy or the (un)labeled attachment score.³ Word-based normalization is performed to avoid the negative impact of the parsers' word segmentations. When (a) and (d) in Figure 6 are compared, the count of correct relations for the parser, used for precision, is 1.0 (0.5 for the upper NMOD and 0.5 for the lower NMOD in Figure 6 (d)), and the count of correct relations for the gold standard, used for recall, is 1.0 (for the NMOD in Figure 6 (a)). The resulting F-score is a good approximation of accuracy.
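The Figure 6 arithmetic can be reproduced with a short sketch. It assumes one particular reading of word-based normalization: each parser relation is weighted by 1/k, where k is the number of parser relations between the same pair of gold words. The function name and the (head, dependent, label) tuple format are hypothetical.

```python
from collections import Counter


def normalized_prf(gold, parsed):
    """Precision, recall and F-score with word-based normalization.

    gold, parsed: lists of (head, dependent, label) tuples whose indices
    refer to words in the gold segmentation.
    Assumption: parser relations sharing a (head, dependent) word pair
    split a total weight of 1.0 between them.
    """
    gold_set = set(gold)
    parsed_set = set(parsed)
    pair_counts = Counter((h, d) for h, d, _ in parsed)

    weights = [1.0 / pair_counts[(h, d)] for h, d, _ in parsed]
    correct_parser = sum(w for w, rel in zip(weights, parsed) if rel in gold_set)
    total_parser = sum(weights)
    correct_gold = sum(1.0 for rel in gold_set if rel in parsed_set)

    precision = correct_parser / total_parser if total_parser else 0.0
    recall = correct_gold / len(gold_set) if gold_set else 0.0
    denom = precision + recall
    f_score = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_score


# Figure 6 (a): one gold NMOD between the two gold words (0 and 1).
gold = [(1, 0, "NMOD")]
# Figure 6 (d): two transferred NMOD relations between the same two words.
parsed = [(1, 0, "NMOD"), (1, 0, "NMOD")]
print(normalized_prf(gold, parsed))  # (1.0, 1.0, 1.0)
```

Each of the two parser relations contributes 0.5, so the parser-side count is 1.0 and the duplicated relation is neither rewarded nor penalized.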


GENIA treebank processing

For comparison and evaluation, the texts in the GENIA treebank (Tateisi et al., 2005) are converted to the various formats as follows. To create PAS, the treebank is converted with Enju, and for trees that fail conversion, parse results are used instead. The GENIA treebank is also converted into PTB,⁴ and then converted into SD and CoNLL as described in Section 3. While based on manually annotated gold data, the converted treebanks are not always correct due to conversion errors.

The parsers are evaluated with precision, recall, and F-score for each dependency type. We note that the parsers may produce more fine-grained word segmentations than the gold standard: for example, the two words “p70(S6)-kinase activation” in the gold standard tree (Figure 6 (a)) are segmented into five words by Enju (Figure 6 (b)). In the evaluation, the word segmentations in the gold tree are used, and dependency transfer and word-based normalization are performed to match parser outputs to these. Dependencies related to the segmentations are transferred to the enclosing word as follows. If one word is segmented into several segments by a parser, all the dependencies between the segments are removed (Figure 6 (c)), and each dependency between another word and one of the segments is converted into a dependency between the two words (Figure 6 (d)).
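The dependency transfer step can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function name, the segment-to-word index list, and the example segmentation and relations are hypothetical and are not taken from Figure 6.

```python
def transfer_dependencies(seg_to_word, seg_deps):
    """Map parser dependencies over segments onto gold words.

    seg_to_word: list where position i gives the index of the gold word
                 that encloses parser segment i.
    seg_deps:    list of (head_segment, dependent_segment, label) tuples.
    Dependencies between segments of the same gold word are removed; all
    others are re-attached to the enclosing gold words. Duplicates are kept
    so that word-based normalization can weight them afterwards.
    """
    word_deps = []
    for head, dep, label in seg_deps:
        head_word = seg_to_word[head]
        dep_word = seg_to_word[dep]
        if head_word == dep_word:
            continue  # internal to one gold word: drop it
        word_deps.append((head_word, dep_word, label))
    return word_deps


# Hypothetical case: gold word 0 is split into four segments (0-3),
# gold word 1 is a single segment (4).
seg_to_word = [0, 0, 0, 0, 1]
seg_deps = [
    (3, 0, "NMOD"),  # inside gold word 0, removed
    (3, 2, "NMOD"),  # inside gold word 0, removed
    (4, 1, "NMOD"),  # crosses words, becomes (1, 0, "NMOD")
    (4, 3, "NMOD"),  # crosses words, becomes (1, 0, "NMOD")
]
print(transfer_dependencies(seg_to_word, seg_deps))
# [(1, 0, 'NMOD'), (1, 0, 'NMOD')]
```

Keeping the duplicate relations rather than merging them mirrors the two NMOD relations described for Figure 6 (d), which the normalization step then weights at 0.5 each.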

5 Evaluation

This section presents evaluation results. Intrinsic evaluation is first performed in Section 5.1. Section 5.2 considers the effect of different SD variants. Section 5.3 presents the results of experiments with different parsers. Section 5.4 shows the performance of different parsers. Finally, the performance of the event extraction system is discussed in the context of other proposed methods for the task in Section 5.5.


5.1 Intrinsic Evaluation

We first briefly consider the results of an intrinsic evaluation comparing parser outputs to reference data automatically derived from the gold standard treebank. Table 1 shows results for the parsers whose outputs could be converted into the

