While substantial progress has been made in mining code on an Internet scale, efforts to date have been overwhelmingly focused on data sets where source code is represented natively as text. Large volumes of source code available online and embedded in technical videos have remained largely unexplored, due in part to the complexity of extraction when code is represented with images. Existing approaches to code extraction and indexing in this environment rely heavily on computationally intensive optical character recognition. To improve the ease and efficiency of identifying this embedded code, as well as identifying similar code examples, we develop a deep learning solution based on convolutional neural networks and autoencoders. Focusing on Java for proof of concept, our technique is able to identify the presence of typeset and handwritten source code in thousands of video images with 85.6% to 98.6% accuracy based on syntactic and contextual features learned through deep architectures. When combined with traditional approaches, this provides a more scalable basis for video indexing that can be incorporated into existing software search and mining tools.
CCS Concepts: • Information systems → Video search; • Computing methodologies → Machine learning approaches; • Computer systems organization → Neural networks; • Software and its engineering → Software libraries and repositories
Techniques based on artificial neural networks represent the current state-of-the-art in machine learning due to the availability of improved hardware and large data sets. Here we employ doc2vec, an unsupervised neural network, to capture the semantic content of text messages sent by adolescents during high school, and encode this semantic content as numeric vectors. These vectors effectively condense the text message data into highly leverageable inputs to a logistic regression classifier in a matter of hours, as compared to the tedious and often quite lengthy task of manually coding data. Using our machine learning approach, we are able to train a logistic regression model to predict adolescents' engagement in substance abuse during distinct life phases with accuracy ranging from 76.5% to 88.1%. We show the effects of grade level and text message aggregation strategy on the efficacy of document embedding generation with doc2vec. Additional examination of the vectorizations for specific terms extracted from the text message data adds quantitative depth to this analysis. We demonstrate the ability of the method used herein to overcome traditional natural language processing concerns related to unconventional orthography. These results suggest that the approach described in this thesis is a competitive and efficient alternative to existing methodologies for predicting substance abuse behaviors. This work reveals the potential for the application of machine learning-based manipulation of text messaging data to development of automatic intervention strategies against substance abuse and other adolescent challenges.
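The two-stage pipeline this abstract describes (turn each message into a numeric vector, then fit a logistic regression on those vectors) can be illustrated with a minimal, standard-library-only sketch. This is not the authors' implementation and it does not use doc2vec: a normalised bag-of-words vector stands in for the learned document embedding, purely to show the data flow. The messages, labels, function names, learning rate, and epoch count below are all invented for illustration.

```python
import math

# Invented toy messages with invented labels (1 = flagged, 0 = not flagged).
# None of this comes from the study's data.
messages = [
    ("party drinks tonight", 1),
    ("drinks after party", 1),
    ("homework due tomorrow", 0),
    ("finish homework tomorrow", 0),
]

# Vocabulary built from the toy corpus; each token gets one vector slot.
vocab = sorted({tok for text, _ in messages for tok in text.lower().split()})
index = {tok: i for i, tok in enumerate(vocab)}


def embed(text):
    """Stand-in for a doc2vec embedding: L2-normalised bag-of-words counts.

    Tokens outside the vocabulary are ignored, so an all-unknown message
    maps to the zero vector.
    """
    vec = [0.0] * len(vocab)
    for tok in text.lower().split():
        if tok in index:
            vec[index[tok]] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(w, b, x):
    """Probability of label 1 under the logistic model (w, b)."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


def train_logreg(X, y, lr=0.5, epochs=200):
    """Fit logistic regression with plain per-sample gradient descent."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            err = predict_proba(w, b, xi) - yi  # gradient of log-loss w.r.t. z
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
            b -= lr * err
    return w, b


# Stage 1: vectorise the messages. Stage 2: fit the classifier.
X = [embed(text) for text, _ in messages]
y = [label for _, label in messages]
w, b = train_logreg(X, y)

for text, label in messages:
    print(f"{text!r}: label={label}, p(1)={predict_proba(w, b, embed(text)):.2f}")
```

In practice the `embed` step would be replaced by vectors inferred from a trained doc2vec model, and the classifier by a library implementation. The sketch only shows why fixed-length vectors make the downstream model straightforward to fit.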
Background: A fundamental understanding of live-cell dynamics is necessary in order to advance scientific techniques and personalized medicine. For this understanding to be possible, image processing techniques, probes, tracking algorithms and many other methodologies must be improved. Currently there are no large open-source datasets containing live-cell imaging to act as a standard for the community. As a result, researchers cannot evaluate their methodologies on an independent benchmark or leverage such a dataset to formulate scientific questions.
Findings: Here we present T-Time, the largest free and publicly available data set of T cell phase contrast imagery designed with the intention of furthering live-cell dynamics research. T-Time consists of over 40 GB of imagery data, and includes annotations derived from these images using a custom T cell identification and tracking algorithm. The data set contains 71 time-lapse sequences containing T cell movement and calcium release activated calcium channel activation, along with 50 time-lapse sequences of T cell activation and T reg interactions. The database includes a user-friendly web interface, summary information on the time-lapse images, and a mechanism for users to download tailored image datasets for their own research. T-Time is freely available on the web at http://ttime.mlatlab.org.
Conclusions: T-Time is a novel data set of T cell images and associated metadata. It allows users to study T cell interaction and activation.