Understanding the behavior of source code written in an unfamiliar programming language is difficult. One way to aid understanding of difficult code is to add corresponding pseudo-code, which describes in detail the workings of the code in a natural language such as English. In spite of its usefulness, most source code does not have corresponding pseudocode because it is tedious to create. This paper demonstrates a tool Pseudogen that makes it possible to automatically generate pseudo-code from source code using statistical machine translation (SMT). 1 Pseudogen currently supports generation of English or Japanese pseudo-code from Python source code, and the SMT framework makes it easy for users to create new generators for their preferred source code/pseudo-code pairs.
Multiple pattern matching in strings is a fundamental problem in text processing applications such as regular expressions or tokenization. This article studies efficient implementations of double-array Aho-Corasick automata (DAACs), data structures for quickly performing the multiple pattern matching. The practical performance of DAACs is improved by carefully designing the data structure, and many implementation techniques have been proposed thus far. A problem in DAACs is that comprehensive descriptions and experimental analyses on their ideas are not provided. Engineers face difficulties in implementing an efficient DAAC. In this article, we review implementation techniques for DAACs and provide a comprehensive description of them. We also propose several new techniques for further improvement. We conduct exhaustive experiments through real-world datasets and reveal the best combination of techniques to achieve a higher performance in DAACs. The best combination is different from those used in the most popular libraries of DAACs, which demonstrates that their performance can be further enhanced. On the basis of our experimental analysis, we developed a new Rust library for fast multiple pattern matching using DAACs, named Daachorse, as open-source software at https://github. com/daac-tools/daachorse. Experiments demonstrate that Daachorse outperforms other AC-automaton implementations, indicating its suitability as a fast alternative for multiple pattern matching in many applications.
Multiple pattern matching in strings is a fundamental problem in text processing applications such as regular expressions or tokenization. This paper studies efficient implementations of double-array Aho-Corasick automata (DAACs), data structures for quickly performing the multiple pattern matching. The practical performance of DAACs is improved by carefully designing the data structure, and many implementation techniques have been proposed thus far. A problem in DAACs is that their ideas are not aggregated. Since comprehensive descriptions and experimental analyses are unavailable, engineers face difficulties in implementing an efficient DAAC.In this paper, we review implementation techniques for DAACs and provide a comprehensive description of them. We also propose several new techniques for further improvement. We conduct exhaustive experiments through real-world datasets and reveal the best combination of techniques to achieve a higher performance in DAACs. The best combination is different from those used in the most popular libraries of DAACs, which demonstrates that their performance can be further enhanced. On the basis of our experimental analysis, we developed a new Rust library for fast multiple pattern matching using DAACs, named Daachorse, as open-source software at https://github.com/ daac-tools/daachorse. Experiments demonstrate that Daachorse outperforms other AC-automaton implementations, indicating its suitability as a fast alternative for multiple pattern matching in many applications.
Error analysis is used to improve accuracy of machine translation (MT) systems. Various methods of analyzing MT errors have been proposed; however, most of these methods are based on differences between translations and references that are translated independently by human translators, and few methods have been proposed for manual error analysis. This work proposes a method that uses a machine learning framework to identify errors in MT output, and improves efficiency of manual error analysis. Our method builds models that classify low and high quality translations, then identifies features of low quality translations to improve efficiency of the manual analysis. Experiments showed that by using our methods, we could improve the efficiency of MT error analysis.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
customersupport@researchsolutions.com
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.