Sentence Length

Borbély, Gábor; Kornai, András

doi:10.18653/v1/w19-5710

Cited by 2 publications

(2 citation statements)

References 17 publications


“…Average sentence length can be used as a measure of grammatical complexity based on the assumption that longer sentence has a more complex syntactic and semantic structure than shorter sentences. It also shows richness and descriptiveness of sentences in the corpus [19]-[21].…”

Section: Average Sentence Length (mentioning)

confidence: 99%

A Survey of Current Datasets for Code-Switching Research

Jose, Chakravarthi, Suryawanshi, et al. 2020

2020 6th International Conference on Advanced Computing and Communication Systems (ICACCS)


Code switching is a prevalent phenomenon in the multilingual community and social media interaction. In the past ten years, we have witnessed an explosion of code switched data in the social media that brings together languages from low resourced languages to high resourced languages in the same text, sometimes written in a non-native script. This increases the demand for processing code-switched data to assist users in various natural language processing tasks such as part-of-speech tagging, named entity recognition, sentiment analysis, conversational systems, and machine translation, etc. The available corpora for code switching research played a major role in advancing this area of research. In this paper, we propose a set of quality metrics to evaluate the dataset and categorize them accordingly.



“…According to the sentence length distribution of various language corpora, a well-written sentence contains 15-20 words on average [7]. The average sequence length of a SMILES string is typically 3 times longer than a natural language, whereas the token space is at least 1000 times smaller than any developed language [8].…”

Section: Introduction (mentioning)

confidence: 99%

Atom-in-SMILES tokenization

Ucak, Ashyrmamatov, Lee 2022

Preprint


Tokenization is an important preprocessing step in natural language processing that may have a significant influence on prediction quality. In this study we show that the conventional SMILES tokenization itself is at fault, resulting in tokens that fail to reflect the true nature of molecules. To address this we propose atom-in-SMILES approach, resolving the ambiguities in the genericness of SMILES tokens. Our findings in multiple translation tasks suggest that proper tokenization has a great impact on the prediction quality. Considering the prediction accuracy and token degeneration comparisons, atom-in-SMILES appears as an effective method to draw higher quality SMILES sequences out of AI-based chemical models than other tokenization schemes. We investigate the token degeneration, highlight its pernicious influence on prediction quality, quantify the token-level repetitions, and include generated examples for qualitative analysis. We believe that atom-in-SMILES tokenization can readily be utilized by the community at large, providing chemically accurate, tailor-made tokens for molecular prediction models.
