Tembo’s Analytics Stack enables you to efficiently query large amounts of “off-site” data, such as Parquet files in S3. The Stack is powered by ParadeDB’s pg_analytics extension and is tuned for analytics workloads.
Extensions
- pg_analytics - provides foreign data wrappers for querying diverse table formats (Iceberg, Delta Lake, etc.) in S3 and other backing stores; DuckDB ensures efficient query plans
- pg_partman - simplifies and automates partitioning of large database tables
- pg_cron - for scheduling maintenance operations (a brief example follows this list)
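These typically come preinstalled and enabled on the Analytics Stack, but if you are setting things up by hand, the standard commands apply. A minimal sketch, assuming the extensions are available on your instance (pg_cron must also be loaded via shared_preload_libraries, which managed environments normally handle):

-- Enable the Stack's extensions (usually already present on Tembo).
CREATE EXTENSION IF NOT EXISTS pg_analytics;
CREATE EXTENSION IF NOT EXISTS pg_partman;
CREATE EXTENSION IF NOT EXISTS pg_cron;

-- Example pg_cron job: nightly VACUUM ANALYZE
-- (pg_cron evaluates schedules in GMT by default).
SELECT cron.schedule('nightly-vacuum', '0 3 * * *', 'VACUUM ANALYZE');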
Getting started
This guide will demonstrate how to query a parquet table stored in an S3 bucket.
Preparing your database
Simply launch an Analytics Stack in Tembo Cloud and wait for the database to be ready in the UI.
Once that’s up and running, you’ll need a client machine with psql to connect to your database. The Parquet file used in this guide is available in a public S3 bucket.
Getting psql
If you need help installing psql, finding instructions for your platform should be relatively straightforward:
- Mac — Homebrew’s libpq package includes psql
- Ubuntu/Debian — the postgresql-client apt package provides psql
- Windows — EDB’s installers can help
Create the foreign table
Foreign data wrappers can be a little verbose to use, but the gist is a wrapper must be declared first, then a server using that wrapper, then a table using that server:
CREATE FOREIGN DATA WRAPPER parquet_wrapper
HANDLER parquet_fdw_handler
VALIDATOR parquet_fdw_validator;
CREATE SERVER parquet_server
FOREIGN DATA WRAPPER parquet_wrapper;
CREATE FOREIGN TABLE trips ()
SERVER parquet_server
OPTIONS (files 's3://tembo-demo-bucket/yellow_tripdata_2024-01.parquet');
After creating the table, querying can begin immediately. Note that the column names and types for this table have automatically been inferred from the parquet file itself.
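To see exactly what was inferred, the table’s columns can be listed with a standard catalog query (plain Postgres, nothing pg_analytics-specific here):

SELECT column_name, data_type
FROM information_schema.columns
WHERE table_name = 'trips'
ORDER BY ordinal_position;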
SELECT vendorid, passenger_count, trip_distance FROM trips LIMIT 1;
vendorid | passenger_count | trip_distance
----------+-----------------+---------------
2 | 1 | 1.72
(1 row)
SELECT COUNT(*) FROM trips;
count
---------
2964624
(1 row)
Executing Analytical Queries
Executing aggregate and analytical queries on the data in S3 works exactly the same as if that data were in Postgres. For example, the busiest times of day for trips can be determined with a simple query:
SELECT
EXTRACT(HOUR FROM tpep_pickup_datetime) AS pickup_hour,
COUNT(*) AS total_records
FROM trips
GROUP BY pickup_hour
ORDER BY pickup_hour;
pickup_hour | total_records
-------------+---------------
0 | 79094
1 | 53627
2 | 37517
3 | 24811
4 | 16742
5 | 18764
6 | 41429
7 | 83719
8 | 117209
9 | 128970
10 | 138778
11 | 150542
12 | 164559
13 | 169903
14 | 182898
15 | 189359
16 | 190201
17 | 206257
18 | 212788
19 | 184032
20 | 159989
21 | 160888
22 | 143261
23 | 109287
(24 rows)
Time: 694.308 ms
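To check whether the aggregation is being pushed down to DuckDB rather than executed row-by-row in Postgres, a plain EXPLAIN on the same query is a quick diagnostic; the exact plan output varies by pg_analytics version, so treat it as a rough indicator rather than a guarantee:

EXPLAIN
SELECT
    EXTRACT(HOUR FROM tpep_pickup_datetime) AS pickup_hour,
    COUNT(*) AS total_records
FROM trips
GROUP BY pickup_hour
ORDER BY pickup_hour;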
Performance enabled by DuckDB
We can quickly compare this to running the same query with the data stored locally in Postgres, without ParadeDB’s DuckDB-powered extensions.
First, create a local table and populate it with the same data from S3.
CREATE TABLE local_trips (LIKE trips INCLUDING ALL);
INSERT INTO local_trips SELECT * FROM trips;
Then run the same query against the local Postgres table. Notice that it takes nearly 5x longer than the DuckDB-powered query against the foreign table.
SELECT
EXTRACT(HOUR FROM tpep_pickup_datetime) AS pickup_hour,
COUNT(*) AS total_records
FROM local_trips
GROUP BY pickup_hour
ORDER BY pickup_hour;
pickup_hour | total_records
-------------+---------------
0 | 79094
1 | 53627
2 | 37517
3 | 24811
4 | 16742
5 | 18764
6 | 41429
7 | 83719
8 | 117209
9 | 128970
10 | 138778
11 | 150542
12 | 164559
13 | 169903
14 | 182898
15 | 189359
16 | 190201
17 | 206257
18 | 212788
19 | 184032
20 | 159989
21 | 160888
22 | 143261
23 | 109287
(24 rows)
Time: 3317.911 ms (00:03.318)
Other Formats
Though the above example specifies a Parquet file for use, pg_analytics supports several different formats, each with its own FOREIGN DATA WRAPPER (a CSV example follows this list):
- CSV — comma-separated values, the classic
- Delta — Delta tables, enhanced Parquet tables with transactional capabilities
- Iceberg — from the Apache Iceberg project, designed for large analytic data sets
- Spatial — for querying geojson or similar
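As a sketch, declaring a CSV table follows the same wrapper, then server, then table pattern used for Parquet above; the handler and validator names below follow pg_analytics’s <format>_fdw_handler naming convention (check the extension’s docs for your version), and the CSV object path is hypothetical:

CREATE FOREIGN DATA WRAPPER csv_wrapper
HANDLER csv_fdw_handler
VALIDATOR csv_fdw_validator;

CREATE SERVER csv_server
FOREIGN DATA WRAPPER csv_wrapper;

-- Hypothetical CSV file path, for illustration only.
CREATE FOREIGN TABLE trips_csv ()
SERVER csv_server
OPTIONS (files 's3://tembo-demo-bucket/yellow_tripdata_2024-01.csv');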
By varying the URI scheme in the files option provided when creating a FOREIGN TABLE, users can easily make use of alternate object stores: s3 (AWS), az (Azure), r2 (Cloudflare), gs (Google Cloud Storage), https, and hf (Hugging Face data sets) are all supported.
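As a sketch (the Azure container below is hypothetical, and credentials for the store must be configured via a user mapping, covered next), pointing a foreign table at a different store is just a matter of changing the URI:

-- Hypothetical Azure Blob Storage path, for illustration only.
CREATE FOREIGN TABLE trips_az ()
SERVER parquet_server
OPTIONS (files 'az://demo-container/yellow_tripdata_2024-01.parquet');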
Authentication
pg_analytics relies on the SQL-standard CREATE USER MAPPING command to specify credentials to be used when connecting to remote object stores. For instance, one might configure S3 credentials like so:
CREATE USER MAPPING FOR marketing_analytics
SERVER s3_foreign_server
OPTIONS (
type 'S3',
key_id 'AKIAIOSFODNN7EXAMPLE',
secret 'wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY',
region 'us-east-1'
);
See pg_analytics’s documentation for individual object stores (for instance, S3) for full information about supported parameters. PostgreSQL also has support for a special PUBLIC user, whose mapping serves as the fallback if no mapping for a specific user is found.
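A PUBLIC mapping looks just like a per-user one; this sketch reuses AWS’s documented example credentials from above:

CREATE USER MAPPING FOR PUBLIC
SERVER s3_foreign_server
OPTIONS (
  type 'S3',
  key_id 'AKIAIOSFODNN7EXAMPLE',
  secret 'wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY',
  region 'us-east-1'
);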
Additional Reading
ParadeDB’s data ingest documentation has complete information about the configuration and use of pg_analytics.