While substantial progress has been made in mining code on an Internet scale, efforts to date have been overwhelmingly focused on data sets where source code is represented natively as text. Large volumes of source code available online and embedded in technical videos have remained largely unexplored, due in part to the complexity of extraction when code is represented with images. Existing approaches to code extraction and indexing in this environment rely heavily on computationally intense optical character recognition. To improve the ease and efficiency of identifying this embedded code, as well as identifying similar code examples, we develop a deep learning solution based on convolutional neural networks and autoencoders. Focusing on Java for proof of concept, our technique is able to identify the presence of typeset and handwritten source code in thousands of video images with 85.6%-98.6% accuracy based on syntactic and contextual features learned through deep architectures. When combined with traditional approaches, this provides a more scalable basis for video indexing that can be incorporated into existing software search and mining tools. CCS CONCEPTS • Information systems → Video search; • Computing methodologies → Machine learning approaches; • Computer systems organization → Neural networks; • Software and its engineering → Software libraries and repositories;
We demonstrate the ability of deep architectures, specifically convolutional neural networks, to learn and differentiate the lexical features of different programming languages presented in coding video tutorials found on the Internet. We analyze over 17,000 video frames containing examples of Java, Python, and other textual and non-textual objects. Our results indicate that not only can computer vision models based on deep architectures be taught to differentiate among programming languages with over 98% accuracy, but can learn language-specific lexical features in the process. This provides a powerful mechanism for carrying out program comprehension research on repositories where source code is represented with imagery rather than text, while simultaneously avoiding the computational overhead of optical character recognition. CCS CONCEPTS • Computer systems organization → Neural networks; • Software and its engineering → Software libraries and repositories;
IntroductionIn the past couple of years, applications of deep learning to mining software repositories have grown in number and diversity of methods [1][2][3][4][5]. Fueled in part by easy-touse libraries and graphics processing unit (GPU) computing, deep architectures have facilitated new avenues for research, often producing results that far surpass previous techniques. However, despite their advantages, the huge amount of labeled truth data traditionally required to train deep architectures for classification tasks, as well as the computational time required to iteratively improve such models, remains a substantial bottleneck [6]. As a result, some researchers are forced to turn away from deep architectures, despite the fact that for certain tasks (like image analysis and computer vision), deep learning consistently outperforms alternative algorithms and methodologies.Low-shot learning refers to the practice of training machine learning models, including deep neural networks, using far fewer samples of each classification category than what is typically standard practice. In the extreme case, training data consists of only one instance for each target class, which is known as one-shot learning [7]. These approaches Abstract Background: Despite the well-documented and numerous recent successes of deep learning, the application of standard deep architectures to many classification problems within empirical software engineering remains problematic due to the large volumes of labeled data required for training. Here we make the argument that, for some problems, this hurdle can be overcome by taking advantage of low-shot learning in combination with simpler deep architectures that reduce the total number of parameters that need to be learned. Findings:We apply low-shot learning to the task of classifying UML class and sequence diagrams from Github, and demonstrate that surprisingly good performance can be achieved by using only tens or hundreds of examples for each category when paired with an appropriate architecture. Using a large, off-the-shelf architecture, on the other hand, doesn't perform beyond random guessing even when trained on thousands of samples. Conclusion:Our findings suggest that identifying problems within empirical software engineering that lend themselves to low-shot learning could accelerate the adoption of deep learning algorithms within the empirical software engineering community. which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
We apply cluster analysis to a sample of 2,116 children with Autism Spectrum Disorder in order to identify patterns of challenging behaviors observed in home and centerbased clinical settings. The largest study of this type to date, and the first to employ machine learning, our results indicate that while the presence of multiple challenging behaviors is common, in most cases a dominant behavior emerges. Furthermore, the trend is also observed when we train our cluster models on the male and female samples separately. This work provides a basis for future studies to understand the relationship of challenging behavior profiles to learning outcomes, with the ultimate goal of providing personalized therapeutic interventions with maximum efficacy and minimum time and cost.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
customersupport@researchsolutions.com
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.