Source code analysis using machine learning techniques has been gaining attention recently, as several powerful code embedding methods have been created. The availability of different embedding methods for source code has opened the way to tackling many practical problems related to source code analysis. This paper addresses the problem of determining the number of distinct algorithmic strategies that may be found in a set of correct solutions to a competitive programming problem. We investigated five embedding methods with three algorithmic strategies in a data analysis pipeline that tackles this problem on a newly created dataset consisting of 15 algorithmic problems. We propose an unsupervised algorithm that treats each embedding as a different view of the dataset. On each view, a hard clustering algorithm can be employed to split the correct solutions into K distinct algorithmic approaches, where K is a hyperparameter. In the ideal case, if all the algorithmic approaches in each view are determined correctly, a mapping can be built between each pair of consecutive views such that each algorithmic approach in one view maps to a single corresponding algorithmic approach in the other. We are interested in determining the maximum subset of solutions that satisfies this assumption and in using this subset as the base for a co-training procedure, in which an estimator employed on each view votes for the algorithmic approach of each solution that is not part of the subset. According to the results, the proposed unsupervised voting algorithm substantially improves on the baseline clustering approach. This improvement was observed for all problems in the dataset except one. Scaling the data analysis pipeline up to datasets of thousands of problems may make it possible to understand in depth, and learn about, the innovative process of correctly designing and writing code in the context of competitive programming or even industry code.
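The view-alignment and voting idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`best_mapping`, `consensus_subset`, `vote_labels`) are hypothetical, the per-view cluster labels are assumed to come from any hard clustering run beforehand, the label mapping between consecutive views is found by brute force over permutations (only practical for small K), the nearest-centroid rule stands in for whatever estimator is actually used, and a single voting pass replaces the full co-training procedure. It also assumes the agreed subset is non-empty.

```python
from collections import Counter
from itertools import permutations


def best_mapping(labels_a, labels_b, k):
    """Brute-force the one-to-one relabeling of labels_a that best matches labels_b."""
    best, best_score = None, -1
    for perm in permutations(range(k)):
        score = sum(1 for a, b in zip(labels_a, labels_b) if perm[a] == b)
        if score > best_score:
            best, best_score = perm, score
    return best


def consensus_subset(view_labels, k):
    """Align each view to the previous one; return the aligned labels and the
    indices of solutions that receive the same label in every view."""
    aligned = [list(view_labels[0])]
    for labels in view_labels[1:]:
        perm = best_mapping(labels, aligned[-1], k)
        aligned.append([perm[label] for label in labels])
    n = len(aligned[0])
    agreed = [i for i in range(n) if len({v[i] for v in aligned}) == 1]
    return aligned, agreed


def centroids(points, labels, members):
    """Mean embedding per cluster, computed only over the given member indices."""
    groups = {}
    for i in members:
        groups.setdefault(labels[i], []).append(points[i])
    return {c: [sum(col) / len(pts) for col in zip(*pts)]
            for c, pts in groups.items()}


def vote_labels(views, view_labels, k):
    """Keep the agreed labels; for every other solution, let each view vote
    with a nearest-centroid rule fitted on the agreed subset (assumption)."""
    aligned, agreed = consensus_subset(view_labels, k)
    base = aligned[0]
    final = list(base)
    agreed_set = set(agreed)
    per_view = [centroids(X, base, agreed) for X in views]
    for i in range(len(base)):
        if i in agreed_set:
            continue
        votes = Counter()
        for X, cents in zip(views, per_view):
            nearest = min(
                cents,
                key=lambda c: sum((a - b) ** 2 for a, b in zip(X[i], cents[c])),
            )
            votes[nearest] += 1
        # Ties fall to the label that received its first vote earliest.
        final[i] = votes.most_common(1)[0][0]
    return final


# Toy data (made up for illustration): three 1-D views of six solutions, K = 2.
demo_views = [
    [[0.0], [0.2], [0.1], [5.0], [5.2], [0.3]],
    [[9.0], [9.1], [9.2], [1.0], [1.1], [8.8]],
    [[1.0], [1.1], [0.9], [4.0], [4.1], [1.2]],
]
demo_labels = [
    [0, 0, 0, 1, 1, 1],  # the last solution is mis-clustered in this view
    [1, 1, 1, 0, 0, 1],  # same partition as the third view, labels permuted
    [0, 0, 0, 1, 1, 0],
]
demo_result = vote_labels(demo_views, demo_labels, 2)
```

In this toy setup the first five solutions form the agreed subset, and the sixth is relabeled by the three per-view votes. Maximizing pairwise label overlap is only one heuristic for approximating the maximum consistent subset; the paper's exact criterion may differ.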