Ben Gelman scite author profile

Ben Gelman

5 Publications

49 Citation Statements Received

34 Citation Statements Given


Affiliations

George Mason University, Massachusetts Institute of Technology

Publications


Uncovering Trajectories of Informal Learning in Large Online Communities of Creators

Yang, Domeniconi, Revelle, et al. 2015


We analyzed informal learning in Scratch Online, an online community with over 4.3 million users and 6.7 million instances of user-generated content. Users develop projects, which are graphical interfaces consisting of interacting programming blocks. We investigated two fundamental questions: how we can model informal learning, and which patterns of informal learning emerge. We proceeded in two phases. First, we modeled learning as a trajectory of cumulative programming block usage by long-term users who created at least 50 projects. Second, we applied K-means++ clustering to uncover patterns of learning and corresponding subpopulations. We found four groups of users manifesting four different patterns of learning, ranging from the smallest to the largest improvement. At one end of the spectrum, users learned more and in a faster manner. At the opposite end, users did not show much learning progress, even after creating dozens of projects. The modeling and clustering of trajectory patterns that enabled us to quantitatively analyze informal learning may be applicable to other similar communities. The results can also support administrators of online communities in implementing customized interventions for specific subpopulations.
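The two phases described in this abstract can be illustrated with a small sketch. It is not the authors' implementation: it assumes that "cumulative programming block usage" means the number of distinct blocks a user has used so far, and the function names (`cumulative_trajectory`, `kmeans_pp_init`, `kmeans`) and the toy data are hypothetical. Only the Python standard library is used.

```python
import random


def cumulative_trajectory(projects):
    """Assumed measure: distinct blocks seen so far, after each project.

    `projects` is a list of sets of block names in creation order.
    """
    seen = set()
    trajectory = []
    for blocks in projects:
        seen |= set(blocks)
        trajectory.append(len(seen))
    return trajectory


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_pp_init(points, k, rng):
    """K-means++ seeding: each new center is sampled with probability
    proportional to its squared distance from the nearest chosen center."""
    centers = [list(rng.choice(points))]
    while len(centers) < k:
        d2 = [min(sq_dist(p, c) for c in centers) for p in points]
        if sum(d2) == 0:
            # All points coincide with a center; fall back to a uniform pick.
            centers.append(list(rng.choice(points)))
        else:
            centers.append(list(rng.choices(points, weights=d2, k=1)[0]))
    return centers


def kmeans(points, k, iterations=50, seed=0):
    """Lloyd's algorithm started from K-means++ seeds."""
    rng = random.Random(seed)
    centers = kmeans_pp_init(points, k, rng)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest center for every trajectory.
        labels = [
            min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in points
        ]
        # Update step: move each center to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


# Toy trajectories (invented): two fast learners and two slow learners.
trajectories = [
    [5, 10, 15, 20],
    [6, 11, 16, 21],
    [2, 2, 3, 3],
    [1, 2, 2, 3],
]
labels, centers = kmeans(trajectories, k=2)
```

On this toy input the two steep trajectories end up in one cluster and the two flat ones in the other. The study itself reports four clusters over users with at least 50 projects.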


A language-agnostic model for semantic source code labeling

Gelman, Hoyle, Moore, et al. 2018


Code search and comprehension have become more difficult in recent years due to the rapid expansion of available source code. Current tools lack a way to label arbitrary code at scale while maintaining up-to-date representations of new programming languages, libraries, and functionalities. Comprehensive labeling of source code enables users to search for documents of interest and obtain a high-level understanding of their contents. We use Stack Overflow code snippets and their tags to train a language-agnostic, deep convolutional neural network to automatically predict semantic labels for source code documents. On Stack Overflow code snippets, we demonstrate a mean area under ROC of 0.957 over a long-tailed list of 4,508 tags. We also manually validate the model outputs on a diverse set of unlabeled source code documents retrieved from Github, and obtain a top-1 accuracy of 86.6%. This strongly indicates that the model successfully transfers its knowledge from Stack Overflow snippets to arbitrary source code documents.
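The "mean area under ROC" figure in this abstract is a multi-label metric. The sketch below shows one way such a number can be computed, assuming it is the unweighted average of per-tag ROC AUC values; the abstract does not state the averaging scheme. The helper names and the toy tags and scores are hypothetical.

```python
def roc_auc(labels, scores):
    """ROC AUC for one tag, computed as the probability that a random
    positive example scores higher than a random negative one
    (ties count as half). Returns None if a class is missing."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return None
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def mean_auc(y_true, y_score):
    """Unweighted mean of per-tag AUCs (assumed averaging scheme).

    Both arguments map a tag name to a list with one entry per snippet.
    Tags with only one class present are skipped.
    """
    aucs = [roc_auc(y_true[tag], y_score[tag]) for tag in y_true]
    aucs = [a for a in aucs if a is not None]
    return sum(aucs) / len(aucs)


# Toy data (invented): four snippets, two tags.
y_true = {"python": [1, 0, 1, 0], "sql": [1, 0, 0, 1]}
y_score = {"python": [0.9, 0.1, 0.8, 0.3], "sql": [0.6, 0.7, 0.2, 0.9]}
result = mean_auc(y_true, y_score)
```

Here the "python" tag is ranked perfectly (AUC 1.0) and the "sql" tag has one misordered pair out of four (AUC 0.75), so the mean is 0.875.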


Data science foundry for MOOCs

Boyer, Gelman, Schreck, et al. 2015


Logical Segmentation of Source Code

Dormuth, Gelman, Moore, et al. 2019


Many software analysis methods have come to rely on machine learning approaches. Code segmentation, the process of decomposing source code into meaningful blocks, can augment these methods by featurizing code, reducing noise, and limiting the problem space. Traditionally, code segmentation has been done using syntactic cues; current approaches do not intentionally capture logical content. We develop a novel deep learning approach to generate logical code segments regardless of the language or syntactic correctness of the code. Due to the lack of logically segmented source code, we introduce a unique data set construction technique to approximate ground truth for logically segmented code. Logical code segmentation can improve tasks such as automatically commenting code, detecting software vulnerabilities, repairing bugs, labeling code functionality, and synthesizing new code.


Source code analysis dataset

Gelman, Obayomi, Moore, et al. 2019

Data in Brief


The data in this article pair source code with three artifacts from 108,568 projects downloaded from Github that have a redistributable license and at least 10 stars. The first set of pairs connects snippets of source code in C, C++, Java, and Python with their corresponding comments, which are extracted using Doxygen. The second set of pairs connects raw C and C++ source code repositories with the build artifacts of that code, which are obtained by running the make command. The last set of pairs connects raw C and C++ source code repositories with potential code vulnerabilities, which are determined by running the Infer static analyzer. The code and comment pairs can be used for tasks such as predicting comments or creating natural language descriptions of code. The code and build artifact pairs can be used for tasks such as reverse engineering or improving intermediate representations of code from decompiled binaries. The code and static analyzer pairs can be used for tasks such as machine learning approaches to vulnerability discovery.


