Natural SQL: Making SQL Easier to Infer from Natural Language Specifications

Gan, Yujian; Xie, Jinxia; Purver, Matthew; Woodward, John R.; Drake, John H.; Zhang, Qiaofu

doi:10.18653/v1/2021.findings-emnlp.174

Cited by 29 publications

(18 citation statements)

References 28 publications


“…Table 2 presents the difference between the SQL in Spider and the SQL generated by NatSQL in Spider-SS. Our evaluation results are lower than the original NatSQL dataset (Gan et al, 2021b) because the Spider-SS uses equivalent SQL and corrects some errors, as discussed in Section 2.3. Some equivalent and corrected SQL cannot get positive results in exact match metric and execution match.…”

Section: Dataset Analysis (mentioning)

confidence: 86%

“…The difficulty criteria are defined by Spider benchmark, including easy, medium, hard and extra hard. Experiments show that the more difficult the SQL is, the more difficult it is to predict correctly (Shi et al, 2021; Gan et al, 2021b). It can be found from Table 3 that the difficulty distribution of CG-SUB_T and CG-SUB_D is similar to that of Spider_D.…”

Section: Dataset Analysis (mentioning)

confidence: 90%

“…Therefore, the model trained on Spider-SS may not be ideal for chasing the Spider benchmark, especially based on the exact match metric. Similarly, the RATSQL_G extending NatSQL had achieved a previous SOTA result in the execution match of the Spider test set but got a worse result than the original in the exact match (Gan et al, 2021b). Thus, we recommend using NatSQL-based datasets to evaluate models trained on NatSQL.…”

Section: Dataset Analysis (mentioning)

confidence: 94%

“…Unlike Spider, which annotates a whole SQL query to an entire sentence, Spider-SS annotates the SQL clauses to sub-sentences. Spider-SS uses NatSQL (Gan et al, 2021b) instead of SQL for annotation, because it is sometimes difficult to annotate the sub-sentences with corresponding SQL clauses due to the SQL language design. The Spider-SS provides a combination algorithm that collects all NatSQL clauses and then generates the NatSQL query, where the NatSQL query can be converted into an SQL query.…”

Section: Overview (mentioning)

confidence: 99%

“…Next, we annotate every sub-sentence with its corresponding SQL clause, reducing the difficulty of this task by using the intermediate representation language NatSQL (Gan et al, 2021b), which is simpler and syntactically aligns better with natural language (NL). Spider-SS thus provides a new resource for designing models with better generalization capabilities without designing a complex alignment algorithm.…”

Section: Introduction (mentioning)

confidence: 99%


Measuring and Improving Compositional Generalization in Text-to-SQL via Component Alignment

Gan, Huang, Purver

2022

Preprint

Self Cite


In text-to-SQL tasks, as in much of NLP, compositional generalization is a major challenge: neural networks struggle with compositional generalization where training and test distributions differ. However, most recent attempts to improve this are based on word-level synthetic data or specific dataset splits to generate compositional biases. In this work, we propose a clause-level compositional example generation method. We first split the sentences in the Spider text-to-SQL dataset into sub-sentences, annotating each sub-sentence with its corresponding SQL clause, resulting in a new dataset, Spider-SS. We then construct a further dataset, Spider-CG, by composing Spider-SS sub-sentences in different combinations, to test the ability of models to generalize compositionally. Experiments show that existing models suffer significant performance degradation when evaluated on Spider-CG, even though every sub-sentence is seen during training. To deal with this problem, we modify a number of state-of-the-art models to train on the segmented data of Spider-SS, and we show that this method improves the generalization performance.
