Open Access System for Information Sharing


Improving Chinese Dependency Parsing on Raw Sentences using Heterogeneous Part-of-Speech Annotations

Title
Improving Chinese Dependency Parsing on Raw Sentences using Heterogeneous Part-of-Speech Annotations
Authors
Wu, Zhen
Date Issued
2013
Publisher
Pohang University of Science and Technology (POSTECH)
Abstract
Part-of-speech (POS) tagging and parsing are essential steps toward representing the meaning of a sentence. Most parsers take sentences with POS tags as input and derive a parse tree mainly from the words (surface forms) and the POS information provided. However, some POS tags are too general to capture a word's syntactic behavior, which leads to low parsing accuracy. Subdividing POS tags to a finer-grained level can provide more information for parsing and improve parsing performance, but a subdivided tagset also makes the POS tagging task more difficult. In practical NLP tasks, inputs are raw sentences without POS tags, so POS tagging is an unavoidable step before parsing, and errors in POS tagging may propagate into parsing. The challenge is to balance the granularity of the POS tagset against the accuracy of the POS tagger so as to improve the performance of the downstream parser. In this thesis, we propose using heterogeneous POS information to subdivide POS tags and thereby improve dependency parsing performance on raw sentences. We first used a Tsinghua Chinese Treebank (TCT) POS tagger to tag the Chinese Dependency Treebank (CDT) training set and converted some CDT tags to TCT tags, which yielded a CDT corpus with some TCT POS tags. We then trained a parser on this new CDT corpus. For decoding, given a raw sentence as input, we proposed two methods to perform word segmentation and POS tagging, and used the newly trained parser to parse the tagged sentences. Experimental results showed that the parser trained on the CDT corpus with some TCT POS tags performed better than one trained on the original CDT corpus, with an absolute improvement of 0.67%. Better results can be expected from exploring more heterogeneous POS tags together with optimized segmentation, POS tagging, and parsing models.
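The tag conversion step described in the abstract might look roughly like the sketch below. It is only an illustration under stated assumptions: the function name `merge_heterogeneous_tags`, the set of CDT tags chosen for subdivision, and the tag labels in the example are hypothetical placeholders, not the thesis's actual tagsets or selection rule. It assumes the TCT tagger has already been run over the same segmented tokens as the CDT sentence.

```python
from typing import List, Set, Tuple

Token = Tuple[str, str]  # (word, POS tag)


def merge_heterogeneous_tags(
    cdt_sentence: List[Token],
    tct_tags: List[str],
    subdivide: Set[str],
) -> List[Token]:
    """Replace selected coarse CDT tags with the TCT tag predicted for the same token.

    cdt_sentence: tokens with their original CDT tags.
    tct_tags:     TCT tags predicted for the same tokens, in the same order.
    subdivide:    CDT tags considered too general (hypothetical selection).
    """
    if len(cdt_sentence) != len(tct_tags):
        raise ValueError("CDT tokens and TCT tags must be aligned one-to-one")

    merged: List[Token] = []
    for (word, cdt_tag), tct_tag in zip(cdt_sentence, tct_tags):
        # Keep the CDT tag unless it was selected for subdivision.
        new_tag = tct_tag if cdt_tag in subdivide else cdt_tag
        merged.append((word, new_tag))
    return merged


# Placeholder tags for illustration only; not taken from the real CDT or TCT tagsets.
sentence = [("我", "r"), ("想", "v"), ("去", "v")]
predicted_tct = ["rN", "vM", "v"]
mixed = merge_heterogeneous_tags(sentence, predicted_tct, subdivide={"v"})
```

A parser would then be trained on sentences carrying the mixed tags, so that the tags seen at training time match those produced at decoding time.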
URI
http://postech.dcollection.net/jsp/common/DcLoOrgPer.jsp?sItemId=000001622454
http://oasis.postech.ac.kr/handle/2014.oak/1949
Article Type
Thesis
