Daniel Le scite author profile

Daniel Le

4Publications

97Citation Statements Received

38Citation Statements Given

How they've been cited

How they cite others

Affiliations

Innovative Research (United States), United States National Library of Medicine, National Center for Biotechnology Information

Publications

Order By: Most citations

CoTexT: Multi-task Learning with Code-Text Transformer

Phan¹,

Tran²,

Le³

et al. 2021

View full text Add to dashboard Cite

We present CoTexT, a pre-trained, transformerbased encoder-decoder model that learns the representative context between natural language (NL) and programming language (PL). Using self-supervision, CoTexT is pretrained on large programming language corpora to learn a general understanding of language and code. CoTexT supports downstream NL-PL tasks such as code summarizing/documentation, code generation, defect detection, and code debugging. We train CoTexT on different combinations of available PL corpus including both "bimodal" and "unimodal" data. Here, bimodal data is the combination of text and corresponding code snippets, whereas unimodal data is merely code snippets. We first evaluate CoTexT with multi-task learning: we perform Code Summarization on 6 different programming languages and Code Refinement on both small and medium size featured in the CodeXGLUE dataset. We further conduct extensive experiments to investigate Co-TexT on other tasks within the CodeXGlue dataset, including Code Generation and Defect Detection. We consistently achieve SOTA results in these tasks, demonstrating the versatility of our models.

show abstract

Locating and parsing bibliographic references in HTML medical articles

Zou

Thoma

2010

IJDAR

View full text Add to dashboard Cite

The set of references that typically appear toward the end of journal articles is sometimes, though not always, a field in bibliographic (citation) databases. But even if references do not constitute such a field, they can be useful as a preprocessing step in the automated extraction of other bibliographic data from articles, as well as in computer-assisted indexing of articles. Automation in data extraction and indexing to minimize human labor is key to the affordable creation and maintenance of large bibliographic databases. Extracting the components of references, such as author names, article title, journal name, publication date and other entities, is therefore a valuable and sometimes necessary task. This paper describes a two-step process using statistical machine learning algorithms, to first locate the references in HTML medical articles and then to parse them. Reference locating identifies the reference section in an article and then decomposes it into individual references. We formulate this step as a two-class classification problem based on text and geometric features. An evaluation conducted on 500 articles drawn from 100 medical journals achieves near-perfect precision and recall rates for locating references. Reference parsing identifies the components of each reference. For this second step, we implement and compare two algorithms. One relies on sequence statistics and trains a Conditional Random Field. The other focuses on local feature statistics and trains a Support Vector Machine to classify each individual word, followed by a search algorithm that systematically corrects low confidence labels if the label sequence violates a set of predefined rules. The overall performance of these two reference-parsing algorithms is about the same: above 99% accuracy at the word level, and over 97% accuracy at the chunk level.

show abstract

<title>Automated labeling in document images</title>

Kim

Thoma

2000

View full text Add to dashboard Cite

CoTexT: Multi-task Learning with Code-Text Transformer

Phan¹,

Tran²,

Le³

et al. 2021

Preprint

View full text Add to dashboard Cite

scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.

Contact Info

customersupport@researchsolutions.com

10624 S. Eastern Ave., Ste. A-614

Henderson, NV 89052, USA

This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.

Blog Terms and Conditions API Terms Privacy Policy Contact Cookie Preferences Do Not Sell or Share My Personal Information

Made with 💙 for researchers

Part of the Research Solutions Family.

Daniel Le

CoTexT: Multi-task Learning with Code-Text Transformer

Locating and parsing bibliographic references in HTML medical articles

<title>Automated labeling in document images</title>

CoTexT: Multi-task Learning with Code-Text Transformer

Contact Info

Product

Resources

About