Web document clustering

Zamir, Oren; Etzioni, Oren

doi:10.1145/290941.290956

Cited by 743 publications

(28 citation statements)

References 12 publications

“…They condense documents into a few words and phrases, offering a brief and precise description of a document's contents. They have many further applications, including the classification or clustering of documents (Jones & Mahoui, 2000; Zamir & Etzioni, 1998, 1999), search and browsing interfaces (Gutwin, Paynter, Witten, Nevill‐Manning, & Frank, 1999; Jones, 1999; Jones & Paynter, 1999), retrieval engines (Arampatzis, Tsoris, Koster, & Van der Weide, 1998; Croft, Turtle, & Lewis, 1991; Jones & Staveley, 1999), and thesaurus construction (Kosovac, Vanier, & Froese, 2000; Paynter, Witten, & Cunningham, 2000).…”

Section: Introduction

mentioning

confidence: 99%

Automatic extraction of document keyphrases for use in digital libraries: Evaluation and applications

Jones

Paynter

2002

J. Am. Soc. Inf. Sci.

This article describes an evaluation of the Kea automatic keyphrase extraction algorithm. Document keyphrases are conventionally used as concise descriptors of document content, and are increasingly used in novel ways, including document clustering, searching and browsing interfaces, and retrieval engines. However, it is costly and time consuming to manually assign keyphrases to documents, motivating the development of tools that automatically perform this function. Previous studies have evaluated Kea's performance by measuring its ability to identify author keywords and keyphrases, but this methodology has a number of well-known limitations. The results presented in this article are based on evaluations by human assessors of the quality and appropriateness of Kea keyphrases. The results indicate that, in general, Kea produces keyphrases that are rated positively by human assessors. However, typical Kea settings can degrade performance, particularly those relating to keyphrase length and domain specificity. We found that for some settings, Kea's performance is better than that of similar systems, and that Kea's ranking of extracted keyphrases is effective. We also determined that author-specified keyphrases appear to exhibit an inherent ranking, and that they are rated highly and therefore suitable for use in training and evaluation of automatic keyphrasing systems.

“…Clusters are nodes of a suffix tree formed from suffix trees of the input documents (trees containing all suffixes of a string). An original STC [6] method has a great contextual dependence and low accuracy, so it has developed its DIG [7] modification, which precision is about 70%. Its shortcoming is a high price of the tree or graph building in the case of receiving documents by network [8].…”

Section: Hierarchical Methods (Single Link Complete Link

mentioning

confidence: 99%

Search automation of the generalized method of device operational characteristics improvement

Petrova

Puchkova

Zaripova

2017

J. Phys.: Conf. Ser.

“…A significant portion of the unstructured content collected from social media is text. Text mining techniques can be applied for automatic organization, navigation, retrieval, and summary of huge volumes of text documents [59][60][61]. This concept covers a number of topics and algorithms for text analysis including natural language processing (NLP), information retrieval, data mining, and machine learning [62].…”

Section: Text Analytics

mentioning

confidence: 99%

Social big data: Recent achievements and new challenges

2016

Big data has become an important issue for a large number of research areas such as data mining, machine learning, computational intelligence, information fusion, the semantic Web, and social networks. The rise of different big data frameworks such as Apache Hadoop and, more recently, Spark, for massive data processing based on the MapReduce paradigm has allowed for the efficient utilisation of data mining methods and machine learning algorithms in different domains. A number of libraries such as Mahout and SparkMLib have been designed to develop new efficient applications based on machine learning algorithms. The combination of big data technologies and traditional machine learning algorithms has generated new and interesting challenges in other areas as social media and social networks. These new challenges are focused mainly on problems such as data processing, data storage, data representation, and how data can be used for pattern mining, analysing user behaviours, and visualizing and tracking data, among others. In this paper, we present a revision of the new methodologies that is designed to allow for efficient data mining and information fusion from social media and of the new applications and frameworks that are currently appearing under the "umbrella" of the social networks, social media and big data paradigms.

… petabytes (and even exabytes) in size, and the massive sizes of these datasets extend beyond the ability of average database software tools to capture, store, manage, and analyse them effectively.

The concept of big data has been defined through the 3V model, which was defined in 2001 by Laney [5] as: "high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making". More recently, in 2012, Gartner [6] updated the definition as follows: "Big data is high volume, high velocity, and/or high variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization". Both definitions refer to the three basic features of big data: Volume, Variety, and Velocity. Other organisations, and big data practitioners (e.g., researchers, engineers, and so on), have extended this 3V model to a 4V model by including a new "V": Value [7]. This model can be even extended to 5Vs if the concepts of Veracity is incorporated into the big data definition.

Summarising, this set of *V-models provides a straightforward and widely accepted definition related to what is (and what is not) a big-data-based problem, application, software, or framework. These concepts can be briefly described as follows [5,7]:

• Volume: refers to large amounts of any kind of data from any different sources, including mobile digital data creation devices and digital devices. The benefit from gathering, processing, and analysing these large amounts of data generates a number …

