Open Access System for Information Sharing


Cited 2 times in Web of Science; cited 4 times in Scopus

Integrated multi-strategic Web document pre-processing for sentence and word boundary detection SCIE SCOPUS

Title
Integrated multi-strategic Web document pre-processing for sentence and word boundary detection
Authors
Shim, J; Kim, D; Cha, J; Lee, GG; Seo, J
Date Issued
2002-07
Publisher
PERGAMON-ELSEVIER SCIENCE LTD
Abstract
Most work in NLP requires that texts have already been segmented into sentences and words. Segmenting a text into sentences and words, however, is a complex task because many punctuation marks and spaces are ambiguous. Web texts such as HTML documents are even harder to turn into well-refined, segmented texts because they are written in a freer style, with many sentence boundary and spacing errors. This paper introduces a multi-strategic integrated text preprocessing method for the difficult problems of sentence boundary disambiguation and word boundary disambiguation in Web texts. We apply a hybrid method (regular expression rules, heuristic rules, and inductive learning of statistical decision trees using a C4.5 learner) synergistically to the task of raw corpus preprocessing. This work contributes to more accurate morphological analysis and guarantees more stable operation of application systems. We tackle easily definable problems with automatically acquired constraints, and we use inductively learned decision trees to solve ill-defined ambiguity problems by incorporating multiple features (n-grams, relative frequency, entropy, tri-dictionary index). The multistrategy approach was thoroughly tested on Korean news script Web documents: it achieved approximately 99.12% (with punctuation marks) and 98.04% (without any punctuation marks) accuracy in sentence boundary disambiguation, 95.39% accuracy in word spacing correction, and 94.61% accuracy on the whole intermixed text preprocessing problem. (C) 2002 Elsevier Science Ltd. All rights reserved.
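As a rough illustration of two ingredients named in the abstract (a regular expression rule combined with a heuristic rule, and an entropy feature), the following Python sketch may help. It is not the authors' implementation: the abbreviation list, the function names `split_sentences` and `successor_entropy`, and the English toy input are all assumptions made for this example, and the C4.5 decision tree stage that the paper uses for ill-defined cases is not reproduced here.

```python
import math
import re
from collections import Counter

# Hypothetical abbreviation list; the paper's actual rules are not given in the abstract.
ABBREVIATIONS = {"dr", "mr", "mrs", "vs", "etc"}

# Regular expression rule: '.', '!' or '?' followed by whitespace or end of text
# is a candidate sentence boundary.
CANDIDATE = re.compile(r"[.!?](?=\s|$)")


def split_sentences(text):
    """Split text at candidate boundaries that survive the abbreviation heuristic."""
    sentences, start = [], 0
    for match in CANDIDATE.finditer(text):
        end = match.end()
        words = text[start:end].split()
        last_word = words[-1].rstrip(".!?").lower() if words else ""
        # Heuristic rule: a period right after a known abbreviation is not a boundary.
        if match.group() == "." and last_word in ABBREVIATIONS:
            continue
        sentence = text[start:end].strip()
        if sentence:
            sentences.append(sentence)
        start = end
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


def successor_entropy(corpus, prefix):
    """Shannon entropy (in bits) of the character that follows `prefix` in `corpus`."""
    counts = Counter(
        corpus[i + len(prefix)]
        for i in range(len(corpus) - len(prefix))
        if corpus.startswith(prefix, i)
    )
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

In a hybrid setup of the kind the abstract describes, values such as `successor_entropy(corpus, prefix)` would be one of several features passed to a learned classifier for the cases that the rules cannot settle. Here they are only computed, not used for a decision.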
Keywords
text normalization; sentence boundary disambiguation; word boundary disambiguation; spacing-word correction
URI
https://oasis.postech.ac.kr/handle/2014.oak/19095
DOI
10.1016/S0306-4573(01)00044-9
ISSN
0306-4573
Article Type
Article
Citation
INFORMATION PROCESSING & MANAGEMENT, vol. 38, no. 4, pp. 509-527, 2002-07