
               Baseline  Bikel  Stanford WSJ  Stanford eng  GDep   MC     C&C    Enju   GENIA
Task 1  SD     51.05     53.29  53.51         55.02         -      55.60  56.09  55.48  56.34
        CoNLL  -         53.22  54.38         53.66         55.70  56.01  -      55.74  56.09
        PAS    50.42     -      -             -             -      -      -      56.57  57.94
Task 2  SD     49.17     51.40  52.02         53.41         -      53.94  54.27  54.06  55.04
        CoNLL  -         51.27  52.04         52.74         54.37  54.51  -      54.37  54.57
        PAS    48.88     -      -             -             -      -      -      55.31  56.40

Table 3: Comparison of F-score results with six parsers in three different formats on the development data set. Results without dependency information are shown as baselines. The results with the GENIA treebank (converted into PTB and PAS) are shown for comparison. The best score in each task is shown in bold, and the best score in each task and format is underlined.

               Bikel          Stanford WSJ   Stanford eng   GDep           MC             C&C            Enju           GENIA
Task 1  SD     53.41 (+0.12)  53.03 (-0.48)  54.48 (-0.54)  -              54.22 (-1.38)  54.64 (-1.45)  53.74 (-1.74)  55.79 (-0.55)
        CoNLL  53.92 (+0.70)  54.52 (+0.14)  54.02 (+0.36)  54.97 (-0.73)  55.24 (-0.77)  -              55.66 (-0.08)  55.64 (-0.45)
        PAS    -              -              -              -              -              -              55.23 (-1.34)  56.42 (-1.52)
Task 2  SD     51.59 (+0.19)  51.43 (-0.59)  52.88 (-0.53)  -              52.73 (-1.21)  52.98 (-1.29)  52.29 (-1.77)  54.17 (-0.87)
        CoNLL  52.21 (+0.94)  52.60 (-0.14)  52.28 (+0.24)  53.71 (-0.66)  53.42 (-1.09)  -              53.97 (-0.40)  53.83 (-0.74)
        PAS    -              -              -              -              -              -              53.69 (-1.62)  55.34 (-1.06)

Table 4: Comparison of F-score results with six parsers in three different dependency structures (without the dependency types) on the development data set. The changes from Table 3 are shown.

                Simple                 Binding                Regulation             All
Task 1
  Ours          66.84 / 78.22 / 72.08  48.70 / 52.65 / 50.60  38.48 / 55.06 / 45.30  50.13 / 64.16 / 56.28
  Miwa          65.31 / 76.44 / 70.44  52.16 / 53.08 / 52.62  35.93 / 46.66 / 40.60  48.62 / 58.96 / 53.29
  Björne        64.21 / 77.45 / 70.21  40.06 / 49.82 / 44.41  35.63 / 45.87 / 40.11  46.73 / 58.48 / 51.95
  Riedel        N/A                    23.05 / 48.19 / 31.19  26.32 / 41.81 / 32.30  36.90 / 55.59 / 44.35
  Baseline      62.94 / 68.38 / 65.55  48.41 / 34.50 / 40.29  29.40 / 40.00 / 33.89  43.93 / 50.11 / 46.82
Task 2
  Ours          65.43 / 75.56 / 70.13  46.42 / 50.31 / 48.29  38.18 / 54.45 / 44.89  49.20 / 62.57 / 55.09
  Riedel        N/A                    22.35 / 46.99 / 30.29  25.75 / 40.75 / 31.56  35.86 / 54.08 / 43.12
  Baseline      60.88 / 63.78 / 62.30  44.99 / 31.78 / 37.25  29.07 / 39.52 / 33.50  42.62 / 47.84 / 45.08

Table 5: Comparison of Recall / Precision / F-score results on the test data set. Results on simple, binding, regulation, and all events are shown. GDep and Enju with PAS are used. Results by Miwa et al. (2010b), Björne et al. (2009), Riedel et al. (2009), and Baseline for Task 1 and Task 2 are shown for comparison. Baseline results are produced by removing dependency information from the parse results of GDep and Enju. The best score in each result is shown in bold.
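Each cell above reports Recall / Precision / F-score, where the F-score is the standard balanced F1 (the harmonic mean of recall and precision). As a quick sanity check, the short Python sketch below recomputes the F-score for the "Ours" all-events cell in Task 1 (recall 50.13, precision 64.16); the function name and the rounding to two decimals are illustrative assumptions, not part of the original paper.

    def f1(recall, precision):
        # Balanced F-score: harmonic mean of recall and precision.
        return 2 * recall * precision / (recall + precision)

    # "Ours", all events, Task 1 (Table 5): recall 50.13, precision 64.16
    print(round(f1(50.13, 64.16), 2))  # 56.28, matching the reported value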

tem.

6 Related Work

Many approaches for parser comparison have been proposed, and most comparisons have used gold treebanks with intermediate formats (Clegg and Shepherd, 2007; Pyysalo et al., 2007). Parser comparison has also been proposed on specific tasks such as unbounded dependencies (Rimell et al., 2009) and textual entailment (Önder Eker, 2009)⁷. Among them, application-oriented parser comparison across several formats was first introduced by Miyao et al. (2009), who compared eight parsers and five formats for the protein-protein interaction (PPI) extraction task. PPI extraction, the

⁷ http://pete.yuret.com/

