2014
DOI: 10.3233/ifs-141227

Weblog and short text feature extraction and impact on categorisation

Abstract: The characterisation and categorisation of weblogs and other short texts has become an important research theme in the areas of topic/trend detection, and pattern recognition, amongst others. The value of analysing and characterising short text is to understand and identify the features that can identify and distinguish them, thereby improving input to the classification process. In this research work, we analyse a large number of text features and establish which combinations are useful to discriminate betwee…

Cited by 4 publications (3 citation statements)
References 13 publications
“…To capture this difference in informality, we compute the following scores from the webpages and use them as features. These scores are indicative of the readability/informality level of a text and have been used previously to measure the informality level and reading difficulty of text (Miltsakaki and Troutt 2008; Mosquera and Moreda 2012; Pérez Téllez et al 2014; Lahiri, Mitra, and Lu 2011). CLScore, LIX and RIX are used to gauge the reading difficulty of the text. CLScore and (discretized) RIX index output the approximate US grade level required to comprehend the text whereas (discretized) LIX outputs scores corresponding to five levels of readability: very easy (0-24), easy (25-34), standard (35-44), difficult (45-54) and very difficult (more than 55).…”
Section: Feature Engineering
Citation type: mentioning
confidence: 99%
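
The LIX and RIX scores named in the statement above can be illustrated with a minimal Python sketch. It uses the standard definitions (LIX = words per sentence plus the percentage of words longer than six letters; RIX = long words per sentence). The function names, the regex-based tokenisation, and the handling of fractional scores at the band boundaries are assumptions made for illustration, not the citing authors' implementation.

```python
import re

def readability_counts(text):
    """Return (sentences, words, long_words) using naive regex splitting.

    Assumption: sentences end at '.', '!' or '?', and a word is any run of
    letters, so contractions such as "don't" are split in two.
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[^\W\d_]+", text)
    long_words = [w for w in words if len(w) > 6]  # "long" = more than 6 letters
    return len(sentences), len(words), len(long_words)

def lix(text):
    """LIX = words / sentences + 100 * long_words / words."""
    n_sent, n_words, n_long = readability_counts(text)
    if n_sent == 0 or n_words == 0:
        return 0.0
    return n_words / n_sent + 100.0 * n_long / n_words

def rix(text):
    """RIX = long_words / sentences."""
    n_sent, _, n_long = readability_counts(text)
    return n_long / n_sent if n_sent else 0.0

def lix_level(score):
    """Map a LIX score to the five bands listed in the quoted statement.

    Assumption: fractional scores fall into the band below the next
    integer threshold (for example, 24.5 counts as "very easy").
    """
    if score < 25:
        return "very easy"
    if score < 35:
        return "easy"
    if score < 45:
        return "standard"
    if score < 55:
        return "difficult"
    return "very difficult"
```

For example, calling `lix_level(lix(page_text))` on the text of a page yields a categorical feature, while `rix(page_text)` gives a numeric one that would still need discretizing to a grade level, which this sketch does not attempt.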
“…Fernando et al (2014) identified various features that helped them in characterization and categorization of Weblogs and other short texts. Most of the features rely on the words (tokens) within the text [35]. For effective transformation and for representation, word frequencies must be normalized in terms of their frequency within a document and within the entire collection [29].…”
Section: Similarity Analysis
Citation type: mentioning
confidence: 99%
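
The normalisation described in the statement above, weighting a word by its frequency within a document and across the whole collection, is commonly realised as TF-IDF. The quoted text does not name a specific scheme, so the following Python sketch is one plausible reading; the function name `tf_idf` and the unsmoothed logarithmic IDF are assumptions.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Compute TF-IDF weights for a list of tokenised documents.

    docs: list of token lists. Returns one {token: weight} dict per document.
    TF is the token count divided by the document length; IDF is
    log(N / document frequency), with no smoothing (an assumption).
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each token once per document

    weights = []
    for tokens in docs:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            tok: (cnt / total) * math.log(n_docs / doc_freq[tok])
            for tok, cnt in counts.items()
        })
    return weights
```

With this weighting, a token that occurs in every document receives a weight of zero, so only words that discriminate between texts contribute to a similarity measure.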
“…The process of identifying the set of nodes that have the greatest influence on a node for a given topic is introduced in [61], which ultimately results in finding the nodes that represent the source for each topic. In [72], the characterisation and categorisation of weblogs and other short texts is improved by analysing and identifying the features that can distinguish them. In [78], the authors propose a model that uses two latent Dirichlet allocation (LDA) models to cluster similar TV users and similar TV programme descriptions simultaneously.…”
Section: Approaches to community detection in social networks
Citation type: unclassified