Revisiting Regex Generation for Modeling Industrial Applications by Incorporating Byte Pair Encoder

Wang, Desheng; Liu, Jiawei; Qi, Xiang; Sun, Baolin; Zhang, Peng

doi:10.48550/arxiv.2005.02558

Cited by 1 publication

(2 citation statements)

References 15 publications

“…Prior work explored building compact regular expressions for short-strings - phone numbers, IP addresses, etc. [9,14]. Their scalability is limited by example-driven state- and regex-induction methods.…”

Section: Discussion (mentioning)

confidence: 99%

“…
• URLs and Instagram hashtags that often merge words
• Account, page and group names as well as search queries that usually separate words, but don't form full sentences
• Entire posts and individual comments
Checking entire posts calls for more subtlety than short-text categories, due to a broader vocabulary and longer phrases. The ability to handle many content types well sets our work apart from much of prior art [2,3,7,9,10,12,14].…”

Section: Introduction (mentioning)

confidence: 99%

Regular Expressions for Fast-response COVID-19 Text Classification

Markov, Liu, Vagner

2021

Preprint

Text classifiers are at the core of many NLP applications and use a variety of algorithmic approaches and software. This paper describes how Facebook determines if a given piece of text - anything from a hashtag to a post - belongs to a narrow topic such as COVID-19. To fully define a topic and evaluate classifier performance we employ human-guided iterations of keyword discovery, but do not require labeled data. For COVID-19, we build two sets of regular expressions: (1) for 66 languages, with 99% precision and recall >50%, (2) for the 11 most common languages, with precision >90% and recall >90%. Regular expressions enable low-latency queries from multiple platforms. Response to challenges like COVID-19 is fast and so are revisions. Comparisons to a DNN classifier show explainable results, higher precision and recall, and less overfitting. Our learnings can be applied to other narrow-topic classifiers.

Section: Discussion (mentioning)

confidence: 99%