Stochastic Gradient Boosted Decision Trees (GBDT) is one of the most widely used learning algorithms in machine learning today. It is adaptable, easy to interpret, and produces highly accurate models. However, most current implementations are computationally expensive and require all training data to be in main memory. As training data grows ever larger, there is motivation to parallelize the GBDT algorithm. Parallelizing decision tree training is intuitive, and various approaches have been explored in the existing literature. Stochastic boosting, on the other hand, is inherently a sequential process and has not been applied to distributed decision trees. In this work, we present two different distributed methods that generate exact stochastic GBDT models: the first is a MapReduce implementation, and the second utilizes MPI on the Hadoop grid environment.
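To illustrate why stochastic boosting is described above as inherently sequential, the following is a minimal single-machine sketch, not the distributed MapReduce or MPI methods of the paper. It assumes squared-error loss, one-dimensional inputs, and one-split regression stumps as the weak learners; all function names and parameters (`fit_stump`, `fit_stochastic_gbdt`, `subsample`, and so on) are hypothetical and chosen only for illustration.

```python
import random


def fit_stump(xs, residuals):
    """Fit a one-split regression stump on 1-D inputs by minimizing squared error."""
    best = None
    for threshold in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        if not left or not right:
            continue
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    if best is None:
        # All sampled inputs are identical: fall back to a constant prediction.
        mean = sum(residuals) / len(residuals)
        return (float("inf"), mean, mean)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value


def fit_stochastic_gbdt(xs, ys, n_rounds=50, learning_rate=0.1,
                        subsample=0.5, seed=0):
    """Illustrative stochastic gradient boosting with squared-error loss."""
    rng = random.Random(seed)
    base = sum(ys) / len(ys)
    predictions = [base] * len(xs)
    stumps = []
    sample_size = min(len(xs), max(2, int(subsample * len(xs))))
    for _ in range(n_rounds):
        # Residuals depend on the predictions of every earlier round,
        # so round t cannot start before round t-1 has finished.
        residuals = [y - p for y, p in zip(ys, predictions)]
        # "Stochastic": each round trains on a random subsample of the rows.
        idx = rng.sample(range(len(xs)), sample_size)
        stump = fit_stump([xs[i] for i in idx], [residuals[i] for i in idx])
        stumps.append(stump)
        predictions = [p + learning_rate * predict_stump(stump, x)
                       for p, x in zip(predictions, xs)]
    return base, learning_rate, stumps


def predict(model, x):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)


# Toy data: a step function from 0 to 1 at x = 10.
xs = list(range(20))
ys = [0.0 if x < 10 else 1.0 for x in xs]
model = fit_stochastic_gbdt(xs, ys)
```

In this sketch, the work inside a single round (searching for the best split) is the part that parallelizes naturally across data, while the loop over rounds carries a dependency from each round to the next.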
Building a video search engine on the Web is a very challenging problem. Compared with web page search, video search has its own unique characteristics (such as the high volume of data for each video and the existence of multi-modal information including meta-data, visual content, audio, closed captions, etc.). In this paper, we investigate some promising approaches to boosting the search relevance of a large-scale video search engine on the Web. The contribution of our work is three-fold. (1) We developed a specialized video categorization framework which combines multiple classifiers based on different modalities. (2) By learning from users' query history and click logs, we proposed an automatic query profile generation technique and applied the profile to query categorization. (3) A highly scalable system was developed, which integrates the online query categorization and offline video categorization. Naive Bayes with mixture of multinomials, Maximum Entropy, and Support Vector Machine categorization methods and the profile learning technique were evaluated on a large-scale set of video data on the Web. The evaluation of the developed system and a user study indicate that the joint categorization of queries and video data boosts video search relevance and the user search experience. The high efficiency of our approaches is also demonstrated by the good responsiveness of the system for the video search engine on the Web.