
                          Typed                                      Untyped
        SD                    CoNLL                 SD                    CoNLL
        P      R      F       P      R      F       P      R      F       P      R      F
Bikel   70.31  70.37  70.34   77.81  77.56  77.69   80.54  80.60  80.57   82.43  82.18  82.31
SP WSJ  74.11  73.94  74.03   81.41  81.47  81.44   81.36  81.16  81.26   84.05  84.05  84.05
---------------------------------------------------------------------------------------------
SP eng  79.08  78.89  78.98   84.92  84.82  84.87   84.16  83.96  84.06   86.54  86.47  86.51
MC      79.56  79.63  79.60   88.13  87.87  88.00   87.43  87.50  87.47   89.81  89.42  89.62
Enju    85.59  85.62  85.60   88.59  89.51  89.05   88.28  88.30  88.29   90.24  90.77  90.50
C&C     80.31  78.04  79.16   -      -      -       84.91  82.28  83.57   -      -      -

Table 1: Comparison of precision, recall, and F-score results with five parsers (two models for Stanford) in two different formats on the development data set (SP abbreviates Stanford Parser). Results are shown separately for evaluation including dependency types (Typed) and for evaluation ignoring them (Untyped). Parser/model combinations above the line do not use in-domain data; the others do.
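As a concrete illustration of the criteria behind Table 1, the following is a minimal sketch (in Python) of typed versus untyped dependency scoring, assuming each dependency is compared as a (head, dependent, type) triple; the function names and example dependencies are illustrative only, and the paper's exact matching conventions may differ.

# Minimal sketch of typed vs. untyped dependency scoring; dependencies are
# treated as (head, dependent, type) triples, and the untyped evaluation
# simply drops the type. Example data is made up for illustration.

def prf(gold, pred):
    # Precision, recall and F-score over two sets of dependencies.
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

def untyped(deps):
    # Keep only the head-dependent pair, discarding the dependency type.
    return {(h, d) for h, d, _ in deps}

gold = {(2, 1, "det"), (0, 2, "root"), (2, 4, "prep_of")}
pred = {(2, 1, "det"), (0, 2, "root"), (2, 4, "prep_on")}

print(prf(gold, pred))                    # typed: the mislabelled dependency counts as an error
print(prf(untyped(gold), untyped(pred)))  # untyped: the same attachment now counts as correct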

Figure 6: Example of word segmentations of the words by gold and Enju, and dependency transfer: (a) Gold Word Segmentations, (b) Parser Word Segmentations, (c) Inner Dependency Removal, (d) Dependency Transfer.


SD and CoNLL dependency representations using the Stanford tools and Treebank Converter, respectively. For Stanford, both the Penn Treebank WSJ section and “augmented English” (eng) models were tested; the latter includes biomedical domain data. In F-score, the Enju results for PAS are 91.48 with types and 93.39 without. GDep is not shown as its output is not compatible with that of Treebank Converter.

Despite numerical differences, the two representations and two criteria (typed/untyped) all produce largely the same ranking of the parsers.5 The evaluations also largely agree on the magnitude of the reduction in error afforded through the use of in-domain training data for the Stanford parser, with all estimates falling in the 15-19% range. Similarly, all show substantial differences between the parsers, indicating e.g. that the error rate of Enju is 50% or less of that of Bikel.
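The following is a minimal sketch of the relative-error arithmetic behind these figures, assuming the error rate is taken as 100 minus F-score and using the SD typed column of Table 1; the paper may compute these quantities differently (e.g. per dependency rather than from F-scores).

# Minimal sketch of the relative-error arithmetic, assuming error rate = 100 - F-score.

def error_rate(f_score):
    return 100.0 - f_score

def relative_error_reduction(f_before, f_after):
    # Fraction of the remaining error removed by moving from f_before to f_after.
    before, after = error_rate(f_before), error_rate(f_after)
    return (before - after) / before

# Stanford parser, SD typed F-scores from Table 1 (WSJ model vs. eng model).
print(relative_error_reduction(74.03, 78.98))  # about 0.19, i.e. roughly 19% error reduction

# Enju vs. Bikel error-rate ratio in the SD typed column.
print(error_rate(85.60) / error_rate(70.34))   # about 0.49, i.e. roughly half the error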

These results serve as a reference point for extrinsic evaluation results. However, it should be noted that as the parsers make use of annotated domain training data to different extents, this evaluation does not provide a sound basis for direct comparison of the parsers themselves.

5 One larger divergence is between typed and untyped SD results for MC. Analysis suggests one cause is frequent errors in tagging hyphenated noun-modifiers such as NF-kappaB as adjectives.

          BD      CD      CDP     CTD
Task 1    55.60   54.35   54.59   54.42
Task 2    53.94   52.65   52.88   52.76

Table 2: Comparison of the F-score results with different SD variants on the development data set with the MC parser. The best score in each task is shown in bold.


5.2 Stanford Dependency Setting

SD have four different variants: basic dependencies (BD), collapsed dependencies (CD), collapsed dependencies with propagation of conjunct dependencies (CDP), and collapsed tree dependencies (CTD) (de Marneffe and Manning, 2008). Except for BD, these variants do not necessarily connect all the words in the sentence, and CD and CDP do not necessarily form a tree structure. Table 2 shows the comparison results with the MC parser. Dependencies are generalized by removing expressions after “ ” of the dependencies (e.g.
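To make the structural contrast above concrete (connectivity over all words, and tree structure), the following is a minimal sketch over made-up (head, dependent) pairs, with words numbered 1..n and 0 standing for an artificial root; the helper names and example data are illustrative only and are not taken from the paper.

# Minimal sketch of the two structural properties discussed above: whether a
# dependency graph connects all words of the sentence, and whether it forms a
# tree. Example data is made up for illustration.
from collections import defaultdict

def is_connected(n, deps):
    # Every word 1..n reachable from word 1 in the undirected graph (root ignored).
    adj = defaultdict(set)
    for head, dep in deps:
        if head != 0:
            adj[head].add(dep)
            adj[dep].add(head)
    seen, stack = {1}, [1]
    while stack:
        for nxt in adj[stack.pop()]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return len(seen) == n

def is_tree(n, deps):
    # Every word has exactly one head, and following heads always reaches the root.
    head_of = {}
    for head, dep in deps:
        if dep in head_of:      # a second head (as with propagated conjuncts) breaks the tree
            return False
        head_of[dep] = head
    if set(head_of) != set(range(1, n + 1)):
        return False            # some word is left without a head
    for start in range(1, n + 1):
        seen, node = set(), start
        while node != 0:        # walk up the heads; a cycle would never reach the root
            if node in seen:
                return False
            seen.add(node)
            node = head_of[node]
    return True

# A BD-style analysis: a spanning tree over four words.
basic = [(0, 2), (2, 1), (2, 3), (3, 4)]
# A collapsed-style analysis: word 3 (say, a preposition) is absorbed into a direct relation.
collapsed = [(0, 2), (2, 1), (2, 4)]

print(is_connected(4, basic), is_tree(4, basic))          # True True
print(is_connected(4, collapsed), is_tree(4, collapsed))  # False False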

