Real-world sentence boundary detection using multitask learning: A case study on French

Lim, KyungTae; Park, Jungyeul

doi:10.1017/s1351324922000134

Cited by 2 publications

(4 citation statements)

References 20 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…The task of sentence segmentation can be performed by detecting sentence boundaries [33]. The general pattern of a sentence is that it begins with a capital letter and ends with a special punctuation mark such as a period, question mark, or exclamation mark.…”

Section: A Sentence Segmentationmentioning

confidence: 99%

A Novel Approach: Tokenization Framework based on Sentence Structure in Indonesian Language

Petrus¹,

Ermatita²,

Sukemi³

et al. 2023

IJACSA

View full text Add to dashboard Cite

This study proposes a new approach in the sentence tokenization process. Sentence tokenization, which is known so far, is the process of breaking sentences based on spaces as separators. Space-based sentence tokenization only generates single word tokens. In sentences consisting of five words, tokenization will produce five tokens, one word each. Each word is a token. This process ignores the loss of the original meaning of the separated words. Our proposed tokenization framework can generate one-word tokens and multi-word tokens at the same time. The process is carried out by extracting the sentence structure to obtain sentence elements. Each sentence element is a token. There are five sentence elements that is Subject, Predicate, Object, Complement and Adverbs. We extract sentence structures using deep learning methods, where models are built by training the datasets that have been prepared before. The training results are quite good with an F1 score of 0.7 and it is still possible to improve. Sentence similarity is the topic for measuring the performance of one-word tokens compared to multi-word tokens. In this case the multiword token has better accuracy. This framework was created using the Indonesian language but can also use other languages with dataset adjustments.

show abstract

Section: A Sentence Segmentationmentioning

confidence: 99%

A Novel Approach: Tokenization Framework based on Sentence Structure in Indonesian Language

Petrus¹,

Ermatita²,

Sukemi³

et al. 2023

IJACSA

View full text Add to dashboard Cite

show abstract

“…Many studies of sentence boundary detection are for English, but they are rarely explored for languages other than English [5], [8] including Indonesian. Research for Indonesian has been done by [14] which presents the development of a training dataset to optimize sentence boundary detection using the Indonesian translation of the Al-Quran with F measure 86.4%, [6] using a rule base by looking for patterns of sentence endings based only on a combination of spaces, capital letters or quotation marks.…”

Section:  Issn: 2252-8938mentioning

confidence: 99%

“…[18] Using rule-based with 21 features and classification with k-means able to produce an average F1-score of 96.58%. [5] Proposed a multitasking neural model to detect sentence beginnings without relying on punctuation in written texts, obtaining an F1 score of up to 98.07%.…”

Section:  Issn: 2252-8938mentioning

confidence: 99%

“…A sentence is a series of words that express a complete thought [3]. Sentence segmentation is the task of breaking text into individual sentences for further processing which is done by detecting sentence boundaries by determining the beginning and end of the sentence [4], [5]. This task is the first step for several NLP applications such as document summarization, information extraction, machine translation, and syntactic parsing.…”

Section: Introductionmentioning

confidence: 99%

See 1 more Smart Citation

An adaptable sentence segmentation based on Indonesian rules

Petrus¹,

Ermatita

Sukemi

et al. 2023

IJ-AI

View full text Add to dashboard Cite

<p>Sentence segmentation that breaks textual data strings into individual sentences is an important phase in natural language processing (NLP). Each word in the string that is added a punctuation mark such as a period, question mark, or exclamation point, becomes the location for splitting the string. Humans can easily see the punctuation and split the string into sentences, but not machines. Basically, the three punctuation marks also perform other functions so that the sentence segmentation process must really be able to detect whether a word marked with punctuation is a sentence boundary or not. This research proposes a sentence segmentation system called segmentasi kalimat bahasa Indonesia (SKBI) or Indonesian language Sentence Segmentation by applying a set of rules and can be used in Indonesian texts and can be adapted for English. There are 34 rules built with a combination of 27 fairly complete features that contribute to this research. The experimental results for the Indonesian text show that the SKBI is able to achieve an F1-Score of 96.89% and 97.07% for English. Both need to be improved but now better than previous research.</p>

show abstract

Real-world sentence boundary detection using multitask learning: A case study on French

Cited by 2 publications

References 20 publications

A Novel Approach: Tokenization Framework based on Sentence Structure in Indonesian Language

A Novel Approach: Tokenization Framework based on Sentence Structure in Indonesian Language

An adaptable sentence segmentation based on Indonesian rules

Contact Info

Product

Resources

About