Low Resources Prepositional Phrase Attachment
Pavlos Nalmpantis, Romanos Kalamatianos, Konstantinos Kordas and Katia Kermanidis
Department of Informatics, Ionian University
Corfu, Greece
{cs200664, cs200611, cs200539, kerman}@ionio.gr
Abstract—Prepositional phrase attachment is a major disambiguation problem in parsing natural language, for many languages. In this paper a low-resources policy is proposed, using supervised machine learning algorithms to resolve the disambiguation problem of prepositional phrase attachment in Modern Greek. It is a first attempt to resolve prepositional phrase attachment in Modern Greek without using sophisticated syntactic annotation or semantic resources. Furthermore, there are no restrictions regarding the prepositions addressed, as is common in previous approaches.
Decision Trees, Modern Greek, PP attachment, Supervised learning
I. INTRODUCTION

The correct attachment of prepositional phrases (PPs) to another constituent in a sentence is a significant disambiguation problem for parsing natural languages. For example, take the following two sentences:

1. She eats soup with a spoon.
2. She eats soup with tomatoes.

In sentence 1, the PP "with a spoon" is attached to the verb phrase (VP) "eats", denoting the instrument utilized for the eating action, thus making the VP the anchor phrase of the PP. Sentence 2 seems to differ only minimally from the first sentence, but, as can be seen from their syntax trees in Fig. 1 and Fig. 2 respectively, their syntactic structure is quite different. The PP "with tomatoes" does not attach to the verb but to the noun phrase (NP) "soup", denoting the type of soup.

Figure 1. Syntax tree of sentence 1
Figure 2. Syntax tree of sentence 2

PP attachment has many significant uses. It can be used, as noted above, for improving the performance of syntactic parsers, as it is a major source of ambiguity in natural language. It facilitates further semantic processing, and also constitutes an important pre-processing step in many information extraction systems. It has also been employed in speech processing as a filter in prosodic phrasing [1].

Over the years, many solutions have been proposed to resolve the disambiguation problem of PP attachment. Solutions include the use of machine learning algorithms [2]-[3], statistical analysis using corpus-based pattern distributions and lexical signatures [4], as well as the back-off model [5], the maximum entropy model [6], etc. However, these methods usually require many resources (i.e. syntactic annotation and often even semantic disambiguation) which are often unavailable for many languages.

Regarding machine learning techniques, previous approaches have experimented with various learning schemata. Memory-based learning has been proposed [2], employing the 1-NN algorithm (IB1) and its tree variation (IB1-IG); its attachment accuracy compared favorably to that of other methods. A nearest-neighbor algorithm has also been proposed, employing a cosine similarity measure over pointwise mutual information [3]. The work described in [7] recreates the EDTBL (Error-Driven Transformation-Based Learning) experiment and compares the results with three machine learning algorithms, namely Naïve Bayes, ID3 IG (Information Gain) and ID3 GR (Gain Ratio); the latter two are tree variations of the k-NN algorithm. The experiment was conducted in two phases. In phase 1 the training set was gradually increased until all training examples were used. In phase 2 10-fold cross validation was used. The summary of results for all the aforementioned approaches is shown in Table I.

978-0-7695-4172-3/10 $26.00 © 2010 IEEE
DOI 10.1109/PCI.2010.34
In this paper we propose a methodology to resolve the disambiguation problem of PP attachment in Modern Greek using supervised machine learning algorithms, given a dataset of feature vectors extracted from a morphologically annotated corpus. The presented methodology is, to the authors' knowledge, a first attempt to resolve the PP attachment problem in Modern Greek using minimal linguistic resources, i.e. without sophisticated grammars, parsing tools, treebanks or semantic thesauri. Finally, we impose no restriction on the type of prepositional phrase, thus taking all prepositions into account.
TABLE I. SUMMARY OF RESULTS OF RELATED PREVIOUS WORK

Algorithm                    Results
IB1 (1-NN) [2]               83.7%
IB1-IG (1-NN tree) [2]       84.1%
Cosine similarity (NN) [3]   86.5%
ID3 IG (k-NN) [7]            79% (phase 1), 79% (phase 2)
ID3 GR (k-NN) [7]            78% (phase 1), 79% (phase 2)
Naïve Bayes [7]              74% (phase 1), 76% (phase 2)
The rest of this paper is organized as follows. Section 2 introduces the properties of the Modern Greek language. The corpus used in the presented experiments, the feature extraction process, as well as an example of the creation of the learning vector is discussed in Section 3. Section 4 describes in detail the experimental process and shows the results that were achieved in the experiments. The results are discussed quantitatively and qualitatively in Section 5, and future research prospects are proposed. Finally, the paper concludes with some interesting comments and remarks.
II. MODERN GREEK PROPERTIES
Modern Greek is a relatively free-word-order language. The ordering of the phrases within a sentence may vary without affecting its correct syntax or its meaning. Internal phrase structure is stricter. Within the noun phrase, adjectives usually precede the noun (e.g. “το μεγάλο σπίτι”, [to megalo spiti], 'the big house'), while possessors follow it (e.g. “το σπίτι μου”, [to spiti mu], 'my house'). Regarding VPs, certain grammatical elements attach to the verb as clitics and form a rigidly ordered group together with it. This applies particularly to unstressed object pronouns, negation particles, the tense particle “θα” [θa], and the subjunctive particle “να” [na] [8].

In Modern Greek, prepositions normally require the accusative case: “από” (from), “για” (for), “με” (with), “μετά” (after), “χωρίς” (without), “ως” (as) and “σε” (to, in or at). The preposition “σε”, when followed by a definite article, fuses with it into forms like “στο” (σε + το) and “στη” (σε + τη). For this reason, lemmatization of the preposition is required [8].
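Since the fused forms of “σε” are a small closed class, undoing the fusion before feature extraction reduces to a table lookup. The following is a minimal sketch of such a lemmatization step; the table is an illustrative subset of the fused forms, not an exhaustive list from the paper:

```python
# Map fused preposition-article forms back to the preposition lemma "σε".
# The table below is an illustrative subset, not a complete inventory.
FUSED_FORMS = {
    "στο": "σε",    # σε + το
    "στη": "σε",    # σε + τη
    "στην": "σε",   # σε + την
    "στον": "σε",   # σε + τον
    "στα": "σε",    # σε + τα
    "στις": "σε",   # σε + τις
    "στους": "σε",  # σε + τους
}

def preposition_lemma(token: str) -> str:
    """Return the lemma of a prepositional token, undoing article fusion."""
    return FUSED_FORMS.get(token.lower(), token.lower())

print(preposition_lemma("Στο"))  # σε
print(preposition_lemma("με"))   # με
```

Unfused prepositions such as “με” pass through unchanged, so the same function can normalize every PP-introducing token.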
III. MODERN GREEK PP-ATTACHMENT

A. Corpus and Pre-processing
The text corpus used in the experiments is the ILSP/ELEFTHEROTYPIA corpus [9]. It consists of 5244 sentences; it is balanced in domain and genre, and manually annotated with complete morphological information. Further (phrase structure) information is obtained automatically by a multi-pass chunker [10].
During chunking, NPs, VPs, PPs, adverbial phrases (ADP) and conjunctions (CON) are detected via multi-pass parsing. The chunker exploits minimal linguistic resources: a keyword lexicon containing 450 keywords (i.e. closed-class words such as articles, prepositions etc.) and a suffix lexicon of 300 of the most common word suffixes in Modern Greek. The chunked phrases are non-overlapping. Embedded phrases are flatly split into distinct phrases. Nominal modifiers in the genitive case are included in the same phrase with the noun they modify; base nouns joined by a coordinating conjunction are grouped into one phrase. The chunker identifies basic phrase constructions during the first passes (e.g. adjective-nouns, article-nouns), and combines smaller phrases into longer ones in later passes (e.g. coordination, inclusion of genitive modifiers, compound phrases).
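The multi-pass idea can be illustrated with a deliberately small sketch: an early pass groups article/adjective + noun runs into NPs, and a later pass absorbs a following genitive modifier into the NP built by the earlier pass. The tag names and the two toy rules are our own illustration of the pass structure, not the chunker's actual rule set or lexicons:

```python
# Toy illustration of multi-pass chunking: each pass rewrites the sequence,
# later passes combining chunks built by earlier ones.
# Assumed tags: T=article, A=adjective, N=noun, NG=genitive modifier, V=verb.

def pass_basic_np(tokens):
    """Pass 1: group article/adjective runs followed by a noun into NP chunks."""
    out, i = [], 0
    while i < len(tokens):
        word, tag = tokens[i]
        if tag in ("T", "A"):
            j, words = i, []
            while j < len(tokens) and tokens[j][1] in ("T", "A"):
                words.append(tokens[j][0])
                j += 1
            if j < len(tokens) and tokens[j][1] == "N":
                words.append(tokens[j][0])
                out.append(("NP", words))
                i = j + 1
                continue
        if tag == "N":
            out.append(("NP", [word]))
        else:
            out.append((tag, [word]))
        i += 1
    return out

def pass_genitive(chunks):
    """Pass 2: absorb a genitive modifier into the NP chunk just before it."""
    out = []
    for label, words in chunks:
        if label == "NG" and out and out[-1][0] == "NP":
            out[-1] = ("NP", out[-1][1] + words)
        else:
            out.append((label, list(words)))
    return out

# "το σπίτι μου άνοιξε" with the possessive treated as a genitive modifier.
tokens = [("το", "T"), ("σπίτι", "N"), ("μου", "NG"), ("άνοιξε", "V")]
print(pass_genitive(pass_basic_np(tokens)))
```

The output is a flat, non-overlapping chunk sequence, mirroring the property described above that embedded phrases are split into distinct phrases.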
B. Feature Selection
The feature vector is necessary to enable the classification algorithms to classify the PP attachment. The attributes used to form the feature vector need to be related to the PP attachment ambiguity resolution task, and they vary in the bibliography. The ones encountered most commonly in previous work (e.g. [11]) are: the number of commas between the PP and the anchor candidate, the number of other punctuation marks between them, the number of words between them, the POS-tag of the last token of the anchor candidate, the lemma of the PP, the number of PPs between the PP and the anchor candidate, the label of the phrase immediately before the PP, the anchor candidate itself, etc. In the present work, a feature vector is formed for every anchor candidate and every preposition in a given corpus sentence; thus, the syntactic freedom of Modern Greek is taken into account. After a number of experiments conducted with the WEKA machine learning workbench (Waikato Environment for Knowledge Analysis) [12], which enables experimentation with various classification algorithms, the following features were selected from the above to form our feature vector:
- The lemma of the preposition introducing the PP.
- The type of the phrase immediately before the PP.
- The anchor candidate.
- The POS-tag of the last token of the anchor candidate.
- The number of words between the PP and the anchor candidate.
- The number of PPs between the PP and the anchor candidate.
- The number of commas between the PP and the anchor candidate.
- The number of other punctuation marks between the PP and the anchor candidate.
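Over chunked sentences, most of these features reduce to simple counts over the chunks lying between the PP and the candidate. The sketch below shows one way to compute them for a single (PP, anchor candidate) pair; the chunk representation and helper name are our assumptions, not the authors' C implementation:

```python
# Sketch of computing the features for one (PP, anchor candidate) pair.
# A sentence is a list of chunks; each chunk is (label, words, pos_of_last_token).
# This representation is an illustrative assumption.

def pp_features(chunks, pp_idx, cand_idx):
    lo, hi = sorted((pp_idx, cand_idx))       # candidate may precede or follow the PP
    between = chunks[lo + 1:hi]
    return {
        "lemma": chunks[pp_idx][1][0],        # preposition introducing the PP
        "prev_phrase": chunks[pp_idx - 1][0], # type of the phrase right before the PP
        "anchor": chunks[cand_idx][0],
        "anchor_last_pos": chunks[cand_idx][2],
        "words_between": sum(len(c[1]) for c in between),
        "pps_between": sum(1 for c in between if c[0] == "PP"),
        "commas_between": sum(c[1].count(",") for c in between),
        "other_punct_between": sum(
            1 for c in between for w in c[1] if w in ".;:!?"
        ),
    }

# Adjacent NP candidate right before the PP, so all counts are zero.
sent = [
    ("NP", ["ο", "διάλογος"], "N"),
    ("PP", ["με", "την", "σύζυγο"], "N"),
]
print(pp_features(sent, pp_idx=1, cand_idx=0))
```

Running this over every (preposition, candidate) pair in a sentence yields one vector per pair, as described above.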
The feature “Correct attachment” was used as the classification class of the vector in the experiments conducted. The reason we selected the feature “POS-tag of the last token of the anchor candidate” is that, in most cases, the headword of an NP and the main verb of a VP is the last token of the phrase. The anchor candidate may precede or follow the PP, as the syntactic freedom of the language does not impose restrictions on the ordering; therefore, no such restrictions have been imposed on the described feature set.

TABLE II. POS-TAG SYMBOLS

Symbol   Value
N        Noun
A        Adjective
S        Preposition
T        Article
R        Adverb
F        Punctuation mark

C. Feature Vector Extraction

The process of extracting values for each feature vector is automated. We created a program, written in C, that automatically identifies the first eight features of our vector and stores the results in an Excel file. This program produces all possible attachments (anchor candidates) of a PP in a sentence that are NPs or VPs, as these phrase types constitute the most significant error source for PP attachment. However, it is fairly easy, in future research, to include other anchor candidate phrase types. The correct class label was assigned to every extracted feature vector manually, by three language experts. This feature takes the values TRUE or FALSE and indicates whether the attachment example represented in the given vector is correct or not. Inter-annotator agreement between the experts was 90%. For the remaining 10%, where the experts didn't initially agree, a discussion among them followed, resulting in a common decision about the correct attachment.

An example of the feature vector extraction process follows. Take the following annotated sentence (presented first in Modern Greek and then translated into English).

Modern Greek:

The extracted vectors are presented to the experts, who decide which the correct attachment is. In this case the PP “with his wife” (“με την σύζυγο” in Modern Greek) is correctly attached to the NP “The dialogue of a contemporary worker”, because it denotes the person with whom the dialogue took place.

TABLE III. FEATURE VECTORS

L    Pre-P   AC   POS-tag   #W   #PPs   #C   #PM
με   NP      NP   N         0    0      0    0
με   NP      NP   F         1    0      0    0

Extracted feature vectors from the example sentence; L = Lemma, Pre-P = Previous Phrase, AC = Anchor Candidate, POS-tag = POS-tag of the anchor candidate's last word, #W = number of words between PP and anchor candidate, #PPs = number of PPs between PP and anchor candidate, #C = number of commas, #PM = number of other punctuation marks.
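The reported 90% agreement can be read as the fraction of vectors on which all three experts assigned the same label. A minimal sketch of that computation follows; the label sequences are invented toy data, not the paper's annotations:

```python
# Observed agreement among three annotators over boolean attachment labels:
# a vector counts as "agreed" only when all three annotators gave the same label.

def full_agreement(labels_a, labels_b, labels_c):
    agree = sum(a == b == c for a, b, c in zip(labels_a, labels_b, labels_c))
    return agree / len(labels_a)

# Invented toy labels: the three experts disagree on 1 of 10 vectors.
a = [True, False, False, True, False, True, False, False, True, False]
b = [True, False, False, True, False, True, False, False, True, False]
c = [True, False, False, True, False, True, True,  False, True, False]
print(full_agreement(a, b, c))  # 0.9
```

The vectors falling outside this agreed fraction are exactly those that went to the adjudication discussion described above.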
IV. EXPERIMENTAL PROCESS
Machine learning is used in order to train the system to make “smart” decisions regarding the anchor candidate phrase to which the PP is to be attached. The produced dataset consisted of 8500 vectors corresponding to 500 corpus sentences. Consequently, the type of machine learning we used is supervised learning: having observed the correct outputs provided by the experts, the system is able to predict attachments automatically.
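To make the supervised setup concrete, the following sketch trains and applies a 1-nearest-neighbour classifier over vectors shaped like those of Table III. This is our own illustration in the spirit of the IB1 learner cited in the related work, not the WEKA configuration actually used in the experiments:

```python
# Illustrative 1-NN classification over mixed categorical/numeric vectors.
# Distance: 1 per mismatching categorical feature, plus absolute numeric differences.

def distance(v1, v2):
    d = 0.0
    for x, y in zip(v1, v2):
        if isinstance(x, str):
            d += 0.0 if x == y else 1.0
        else:
            d += abs(x - y)
    return d

def predict_1nn(train, vector):
    """Return the class label of the closest labelled training example."""
    _, label = min(train, key=lambda ex: distance(ex[0], vector))
    return label

# Toy training data in the format of Table III, plus the expert-assigned class.
train = [
    (("με", "NP", "NP", "N", 0, 0, 0, 0), True),
    (("με", "NP", "NP", "F", 1, 0, 0, 0), False),
]
print(predict_1nn(train, ("με", "NP", "NP", "N", 0, 0, 0, 0)))  # True
```

At prediction time the system classifies each candidate vector as a correct or erroneous attachment, which is exactly the automated decision described above.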
Of the 8500 vectors, only 17.9% indicated a correct attachment (positive examples) and the remaining 82.1% indicated an erroneous attachment (negative examples), so we had an imbalanced dataset and biased results (recall and precision were much higher for the negative class than for the positive class). To balance the dataset, random undersampling [13] was performed, i.e. the