2020
DOI: 10.48550/arxiv.2005.02558
Preprint

Revisiting Regex Generation for Modeling Industrial Applications by Incorporating Byte Pair Encoder

Abstract: Regular expressions are important for many natural language processing tasks, especially when dealing with unstructured and semi-structured data. This work focuses on automatically generating regular expressions and proposes a novel genetic algorithm for this problem. Unlike methods that generate regular expressions at the character level, we first use a byte pair encoder (BPE) to extract frequent items, which are then used to construct regular expressions. The fitness function of …
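The BPE step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `bpe_frequent_items` and the parameters `num_merges` and `min_count` are hypothetical, and the sketch covers only the extraction of frequent items by greedy pair merging. How the paper combines these items into regular expressions with its genetic algorithm and fitness function is not shown here.

```python
from collections import Counter


def bpe_frequent_items(strings, num_merges=10, min_count=2):
    """Return multi-character tokens learned by greedy BPE-style merges.

    Hypothetical helper for illustration; names and defaults are assumptions.
    """
    # Each string starts as a list of single-character symbols.
    corpus = [list(s) for s in strings]
    merged_items = []

    for _ in range(num_merges):
        # Count every adjacent symbol pair across the whole corpus.
        pairs = Counter()
        for symbols in corpus:
            pairs.update(zip(symbols, symbols[1:]))
        if not pairs:
            break

        (left, right), count = pairs.most_common(1)[0]
        if count < min_count:
            break  # The remaining pairs are too rare to count as frequent.

        merged = left + right
        merged_items.append(merged)

        # Rewrite each string, replacing the chosen pair with the merged symbol.
        new_corpus = []
        for symbols in corpus:
            out, i = [], 0
            while i < len(symbols):
                if (i + 1 < len(symbols)
                        and symbols[i] == left
                        and symbols[i + 1] == right):
                    out.append(merged)
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_corpus.append(out)
        corpus = new_corpus

    return merged_items
```

For example, calling `bpe_frequent_items(["abab", "abab"])` returns `["ab", "abab"]`: the pair `a`,`b` is merged first, then the pair `ab`,`ab`. Items like these are coarser building blocks than single characters, which is the contrast with character-level generation that the abstract draws.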

Cited by 1 publication (2 citation statements)
References 15 publications
“…Prior work explored building compact regular expressions for short-strings -phone numbers, IP addresses, etc [9,14]. Their scalability is limited by example-driven state-and regex-induction methods.…”
Section: Discussion (mentioning)
confidence: 99%
“…
• URLs and Instagram hashtags that often merge words
• Account, page and group names as well as search queries that usually separate words, but don't form full sentences
• Entire posts and individual comments
Checking entire posts calls for more subtlety than short-text categories, due to a broader vocabulary and longer phrases. The ability to handle many content types well sets our work apart from much of prior art [2,3,7,9,10,12,14].…”
Section: Introduction (mentioning)
confidence: 99%