JSON is a popular data format used pervasively in web APIs, cloud computing, NoSQL databases, and increasingly also machine learning. JSON Schema is a language for declaring the structure of valid JSON data. There are validators that can decide whether a JSON document is valid with respect to a schema. Unfortunately, like all instance-based testing, these validators can only show the presence and never the absence of a bug. This paper presents a complementary technique: JSON subschema checking, which can be used for static type checking with JSON Schema. Deciding whether one schema is a subschema of another is non-trivial because of the richness of the JSON Schema specification language. Given a pair of schemas, our approach first canonicalizes and simplifies both schemas, then decides the subschema question on the canonical forms, dispatching simpler subschema queries to type-specific checkers. We apply an implementation of our subschema checking algorithm to 8,548 pairs of real-world JSON schemas from different domains, demonstrating that it can decide the subschema question for most schema pairs and is always correct for schema pairs that it can decide. We hope that our work will bring more static guarantees to hard-to-debug domains, such as cloud computing and artificial intelligence.
Introduction

JSON (JavaScript Object Notation) is a data serialization format that is widely adopted to store data on disk or send it over the network. Derived from JavaScript, JSON is both human- and machine-readable, and there are now JSON parsers for many programming languages. JSON supports primitive data types, such as strings, numbers, and Booleans, and two data structures: arrays, which represent lists of values, and objects, which represent maps of key-value pairs. The data types can be nested, e.g., to have an array of two objects that each map a key to some primitive value.

JSON is used in numerous applications. It is the most popular data exchange format in web APIs, ahead of XML [21]. Cloud-hosted applications also use JSON pervasively, e.g., in micro-services that communicate via JSON data [17]. On the data storage side, not only do traditional database management systems, such as Oracle, IBM DB2, MySQL, and PostgreSQL, now support JSON, but two of the most widely deployed NoSQL database management systems, MongoDB and Cloudant, are entirely based on JSON. Beyond web, cloud, and database applications, JSON is also gaining adoption in artificial intelligence (AI) [13,23].
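To make the JSON data model described above concrete, the following minimal sketch parses a nested document of the kind mentioned earlier: an array of two objects that each map a key to a primitive value. It uses only Python's standard `json` module; the keys `name` and `count` and their values are made up for illustration and do not come from any dataset discussed in this paper.

```python
import json

# Hypothetical document: an array of two objects, each mapping
# one key to a primitive value (a string and a number).
text = '[{"name": "Ada"}, {"count": 3}]'

data = json.loads(text)

# JSON arrays are parsed into Python lists, JSON objects into dicts.
assert isinstance(data, list)
assert all(isinstance(item, dict) for item in data)

# Collect the Python type of every primitive value in the document.
kinds = [type(value).__name__ for item in data for value in item.values()]
print(kinds)

# Serializing and parsing again yields an equal structure.
round_trip = json.loads(json.dumps(data))
assert round_trip == data
```

Parsing of this kind only inspects a single document. It says nothing about the set of all documents a producer may emit, which is the gap that schemas, and reasoning over schemas, are meant to address.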