2023
DOI: 10.14569/ijacsa.2023.0140264

A Novel Approach: Tokenization Framework based on Sentence Structure in Indonesian Language

Abstract: This study proposes a new approach to sentence tokenization. As commonly practiced, sentence tokenization breaks a sentence apart using spaces as separators, so it generates only single-word tokens: a sentence of five words produces five tokens, one word each. This process disregards the original meaning that is lost when words that belong together are separated. Our proposed tokenization framework can generate one-word tokens a…
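The space-based baseline described in the abstract can be illustrated with a minimal sketch. This is not the authors' framework, only the conventional whitespace split that the abstract contrasts it with. The function name `whitespace_tokenize` and the example sentence are assumptions chosen for illustration; in Indonesian, "rumah sakit" means "hospital", while "rumah" alone means "house" and "sakit" alone means "sick".

```python
def whitespace_tokenize(sentence: str) -> list[str]:
    """Split a sentence on runs of whitespace (hypothetical helper)."""
    return sentence.split()


# Illustrative five-word sentence: "Mother goes to the hospital".
sentence = "Ibu pergi ke rumah sakit"
tokens = whitespace_tokenize(sentence)

print(tokens)       # ['Ibu', 'pergi', 'ke', 'rumah', 'sakit']
print(len(tokens))  # 5

# The phrase "rumah sakit" (hospital) is no longer a single unit:
# it has been split into "rumah" (house) and "sakit" (sick).
print("rumah sakit" in tokens)  # False
```

Five words yield five tokens, and the two-word expression loses its combined meaning once split, which is the limitation the abstract points to.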

Cited by 2 publications (1 citation statement)
References 29 publications
“…Tokenization was performed after the data normalization process. Tokenization is a crucial stage in RE that involves breaking sentences into word pieces, or tokens, for each line [13]. BERT Tokenizer was used to break a sentence into words (tokens).…”
Section: Preprocessing (citation type: mentioning)
confidence: 99%