Query data with DuckDB
Once you’ve loaded data into DuckDB, you can query it using SQL expressions. The format does not differ from “normal” queries:
SELECT * FROM users WHERE ID>1000 ORDER BY Name DESC LIMIT 5;
If you want to use a client API for DuckDB queries, there are two options: You can pass SQL strings via the API – or use the client’s relational interface to build data queries programmatically. Specifically, this could look like this in Python with a JSON file:
import duckdb
file = duckdb.read_json("users.json")
file.select("*").filter("ID>1000").order("Name").limit(5)
In the case of Python, you also have the option of using the PySpark API to query DuckDB directly. It should be noted that the implementation does not (yet) support the full range of functions.
DuckDB’s SQL dialect also contains some analytics-related additional elements. For example, if you only want to query a subset of data within a table, you can do this using the SAMPLE clause. The resulting query runs significantly faster, but may also be less accurate. In addition, DuckDB also supports, among other things:
-
the
PIVOT-Keyword to create corresponding tables, -
Window functions as well
-
QUALIFY-Clauses to filter these.
DuckDB-Extensions
As mentioned earlier, DuckDB is not limited to the built-in data formats and behaviors. It is also possible to integrate various addons via the Extension API. For example, extensions are available for:
This article originally appeared at our sister publication Infoworld.com.
