Understanding bag-of-words model: a statistical framework

Zhang, Yin; Jin, Rong; Zhou, Zhi

doi:10.1007/s13042-010-0001-0

Cited by 1,091 publications

(457 citation statements)

References 17 publications

Supporting

Mentioning

439

Contrasting

Unclassified


“…3, a bag-of-words feature model is used to represent each unstructured feature extracted from the ticket. A bag-of-words representation is known to extract good patterns from unstructured text data [80]. The bag-of-words model can be learnt over a vector of unigrams or bigrams or both extracted from text data.…”

Section: Feature Extraction (mentioning)

confidence: 99%

Reducing user input requests to improve IT support ticket resolution process

et al. 2017


Management and maintenance of IT infrastructure resources such as hardware, software and network is an integral part of software development and maintenance projects. Service management ensures that the tickets submitted by users, i.e. software developers, are serviced within the agreed resolution times. Failure to meet those times induces penalty on the service provider. To prevent a spurious penalty on the service provider, non-working hours such as waiting for user inputs are not included in the measured resolution time, that is, a service level clock pauses its timing. Nevertheless, the user interactions slow down the resolution process, that is, add to user experienced resolution time and degrade user experience. Therefore, this work is motivated by the need to analyze and reduce user input requests in tickets' life cycle. To address this problem, we analyze user input requests and investigate their impact on user experienced resolution time. We distinguish between input requests of two types: real, seeking information from the user to process the ticket and tactical, when no information is asked but the user input request is raised merely to pause the service level clock. Next, we propose a system that preempts a user at the time of ticket submission to provide additional information that the analyst, a person responsible for servicing the ticket, is likely to ask, thus reducing real user input requests. Further, we propose a detection system to identify tactical user input requests. To evaluate the approach, we conducted a case study in a large global IT company. We observed that around 57% of the tickets have user input requests in the life cycle, causing user experienced resolution time to be almost twice as long as the measured service resolution time.
The proposed preemptive system preempts the information needs with an average accuracy of 94-99% across five cross validations while traditional approaches such as logistic regression and naive Bayes have accuracy in the range of 50-60%. The detection system identifies around 15% of the total user input requests as tactical. Therefore, the proposed solution can efficiently bring down the number of user input requests and, hence, improve the user-experienced resolution time.
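The citation statement for this entry says a bag-of-words model can be learnt over unigrams, bigrams, or both. A minimal Python sketch of that idea follows. The function names, the whitespace tokenizer, and the two example ticket texts are assumptions made for illustration; they are not taken from the cited paper.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def bag_of_words(text, use_unigrams=True, use_bigrams=True):
    """Count unigrams and/or bigrams in a lowercased, whitespace-split text."""
    tokens = text.lower().split()
    bag = Counter()
    if use_unigrams:
        bag.update(ngrams(tokens, 1))
    if use_bigrams:
        bag.update(ngrams(tokens, 2))
    return bag


def vectorize(bag, vocabulary):
    """Map a bag onto a fixed vocabulary order, giving a count vector."""
    return [bag.get(term, 0) for term in vocabulary]


# Hypothetical ticket texts, invented for this example only.
tickets = ["reset my password", "password reset failed"]
bags = [bag_of_words(t) for t in tickets]
vocabulary = sorted(set().union(*bags))
vectors = [vectorize(b, vocabulary) for b in bags]
```

Each ticket becomes a fixed-length count vector over the shared vocabulary, so word order is kept only to the extent that bigrams capture it.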


“…There are many different techniques for feature extraction, however, BoW models [6] and distributed representation [7] is the most popular methods used in NLP. TF-IDF [8] is one of the widely used method of BoW models, it's simplistic but surprisingly useful in practice.…”

Section: Feature Vector Representation Of News (mentioning)

confidence: 99%

General Simhash-based Framework for News Aggregators

Hu, You 2017

Proceedings of the 2017 2nd International Conference on Machinery, Electronics and Control Simulation (MECS 2017)


We proposed a general simhash-based framework for news aggregator, the system has no necessary to process crawled news for retrieval, deduplication and event detection respectively, each piece of news is processed only one time and without extra storage space. Duplicates and breaking events can be detected online before new crawled news was stored in system's database. Machine learning are widely used in news aggregator for tasks like topic classification and each piece of news is mapped into a feature vector with fixed length. Simhash fingerprints are generated on feature vectors rather than original text of news, therefore news retrieval, deduplication and breaking news detection can be integrated into any running aggregator systems without extra efforts. Our aggregator collected around 9.6 million of news from Internet and the framework function well in real scenario.
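The abstract above describes generating simhash fingerprints from feature vectors rather than from raw text. The sketch below shows the general simhash technique on a weighted feature mapping. It is an illustration only: the 64-bit width, the use of MD5 as the per-feature hash, and the example weights are assumptions, not details from the cited paper.

```python
import hashlib


def simhash(features, bits=64):
    """Compute a simhash fingerprint from a {feature: weight} mapping.

    Each feature is hashed to `bits` bits (assumed here: a truncated MD5).
    Bit i of the fingerprint is 1 when the weights of features whose hash
    has bit i set outweigh the weights of those whose hash has it clear.
    """
    totals = [0.0] * bits
    for feature, weight in features.items():
        digest = hashlib.md5(feature.encode("utf-8")).digest()
        h = int.from_bytes(digest[:bits // 8], "big")
        for i in range(bits):
            if (h >> i) & 1:
                totals[i] += weight
            else:
                totals[i] -= weight
    fingerprint = 0
    for i in range(bits):
        if totals[i] > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


# Hypothetical weighted feature vectors for two news items.
news_a = {"election": 2.0, "vote": 1.5, "results": 1.0}
news_b = {"election": 2.0, "vote": 1.5, "turnout": 1.0}
distance = hamming_distance(simhash(news_a), simhash(news_b))
```

Items whose fingerprints lie within a small Hamming distance would be treated as near-duplicates; the cutoff is a tuning choice that this sketch leaves open.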


“…Code Fragment: A continuous segment of source code, specified by the triple (l, s, e), including the source file l, the line the fragment starts on, s, and the line it ends on, e. [Footnote 1: Similar to the popular bag-of-words model [39] in Information Retrieval.] Clone Pair: A pair of code fragments that are similar, specified by the triple (f1, f2, φ), including the similar code fragments f1 and f2, and their clone type φ.…”

Section: Definitions (mentioning)

confidence: 99%

SourcererCC

Sajnani, Saini, Svajlenko et al. 2016

Proceedings of the 38th International Conference on Software Engineering

354


Despite a decade of active research, there is a marked lack in clone detectors that scale to very large repositories of source code, in particular for detecting near-miss clones where significant editing activities may take place in the cloned code. We present SourcererCC, a token-based clone detector that targets three clone types, and exploits an index to achieve scalability to large inter-project repositories using a standard workstation. SourcererCC uses an optimized inverted-index to quickly query the potential clones of a given code block. Filtering heuristics based on token ordering are used to significantly reduce the size of the index, the number of code-block comparisons needed to detect the clones, as well as the number of required token-comparisons needed to judge a potential clone. We evaluate the scalability, execution time, recall and precision of SourcererCC, and compare it to four publicly available and state-of-the-art tools. To measure recall, we use two recent benchmarks, (1) a large benchmark of real clones, BigCloneBench, and (2) a Mutation/Injection-based framework of thousands of fine-grained artificial clones. We find SourcererCC has both high recall and precision, and is able to scale to a large inter-project repository (250MLOC) using a standard workstation.
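The abstract above describes querying an inverted index for candidate clones of a code block and then comparing tokens. The sketch below illustrates that general pattern with code blocks treated as bags of tokens. It is not the SourcererCC implementation: the function names, the overlap criterion, the 0.7 threshold, and the example blocks are all assumptions, and the token-ordering filtering heuristics mentioned in the abstract are omitted.

```python
from collections import Counter, defaultdict


def build_index(blocks):
    """Build an inverted index mapping each token to the ids of blocks containing it."""
    index = defaultdict(set)
    for block_id, tokens in blocks.items():
        for token in set(tokens):
            index[token].add(block_id)
    return index


def overlap(a, b):
    """Size of the multiset intersection of two token lists."""
    return sum((Counter(a) & Counter(b)).values())


def find_clones(query_id, blocks, index, threshold=0.7):
    """Return ids of blocks whose token overlap with the query block reaches
    threshold * max(len(query), len(other)). The criterion is an assumption
    made for this sketch."""
    query = blocks[query_id]
    candidates = set()
    for token in set(query):
        candidates |= index[token]  # only blocks sharing a token are compared
    candidates.discard(query_id)
    results = []
    for cid in sorted(candidates):
        other = blocks[cid]
        if overlap(query, other) >= threshold * max(len(query), len(other)):
            results.append(cid)
    return results


# Hypothetical, already tokenized code blocks.
blocks = {
    "a": ["int", "x", "=", "0", ";"],
    "b": ["int", "y", "=", "0", ";"],
    "c": ["return", "x", ";"],
}
index = build_index(blocks)
clones_of_a = find_clones("a", blocks, index)
```

The index restricts comparisons to blocks that share at least one token with the query, which is what avoids comparing every pair of blocks.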
