2005
DOI: 10.1002/asi.20126

Text characteristics of English language university Web sites

Abstract: The nature of the contents of academic Web sites is of direct relevance to the new field of scientific Web intelligence, and for search engine and topic-specific crawler designers. We analyze word frequencies in national academic Webs using the Web sites of three English-speaking nations: Australia, New Zealand, and the United Kingdom. Strong regularities were found in page size and word frequency distributions, but with significant anomalies. At least 26% of pages contain no words. High frequency words include…

Cited by 8 publications (5 citation statements)
References: 38 publications

“…Although the difference is small in size, it is large due to the logarithmic scale. This is in contrast to similar graphs for British English and academic websites, for example, in which the lines are straight and the point for word frequency 1 does not deviate (Thelwall, 2005). This confirms that there are more unique words in comments than in “normal” text.…”
Section: Results
Citation type: contrasting
confidence: 98%
“…The straight line graph in Figure 2 reflects a power law that is common in word frequency statistics (Li, 1992; Zipf, 1949) and found almost everywhere on the Web (Baldi et al, 2003; Barabási, 2002; Thelwall, 2005), but the nonlinear Figure 1 is more surprising, showing that there is an unnaturally high number of low‐frequency GM words. This would be consistent with a set of activists' attempting to promote the usage of certain words, but these words' not gaining a resonance with the public.…”
Section: Case Study: Frankenscience Words
Citation type: mentioning
confidence: 81%
“…Statistical analyses of Web patterns for many Web‐related phenomena, including page sizes, link counts, and word usage patterns, often show small numbers of documents dominating, an aspect of the “power law” effect (Baldi, Frasconi, & Smyth, 2003; Barabási, 2002; Levene & Poulovassilis, 2004; Thelwall, 2005; Zipf, 1949). This suggests that an effective strategy for finding hybrid word family members would be to look for individual pages that contain many such terms, which may be seen as authorities on the topic and hence likely to contain many different related words.…”
Section: A Methods For Hybrid Word Family Usage Identification
Citation type: mentioning
confidence: 99%
“…A few Web text analysis studies have investigated the text of Web sites from an issue perspective, by using either content analysis (Weare & Lin, 2000) or metrics based on word frequency counting (Price & Thelwall, 2005; Thelwall, 2004c, 2005). Although content analyses can yield detailed and insightful information, they are labor‐intensive and time‐consuming.…”
Section: Introduction
Citation type: mentioning
confidence: 99%
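
Several of the statements above refer to word frequency graphs that appear as straight lines on logarithmic axes, the signature of a power law. As a minimal sketch of what that check might look like (not the method used in the cited paper), the Python below counts word frequencies and estimates the slope of log frequency against log rank by ordinary least squares. The function names `rank_frequency` and `loglog_slope`, the simple letters-only tokenizer, and the synthetic data are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def rank_frequency(text):
    """Return word counts sorted from most to least frequent.

    Tokenization is an assumption: lowercase runs of the letters a-z.
    """
    words = re.findall(r"[a-z]+", text.lower())
    return sorted(Counter(words).values(), reverse=True)


def loglog_slope(freqs):
    """Least-squares slope of log(frequency) against log(rank).

    Expects at least two positive frequencies, ordered by rank.
    """
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Synthetic frequencies following f(r) = 1000 / r, an idealized Zipf curve.
ideal = [1000 / r for r in range(1, 51)]
slope = loglog_slope(ideal)

# Counts from a toy sentence, most frequent word first.
counts = rank_frequency("the cat and the dog and the bird")
```

For the idealized data the fitted slope is approximately -1, which corresponds to a straight line on log-log axes. Real page text would be expected to deviate from this, and a least-squares fit on log-log data is only a rough diagnostic rather than a rigorous test for a power law.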