Analytics

Tembo’s Analytics Stack enables you to efficiently query large amounts of “off-site” data, such as Parquet files stored in S3. The Stack is powered by ParadeDB’s pg_analytics extension and is tuned for analytics workloads.

Extensions

  • pg_analytics - provides foreign data wrappers for querying diverse table formats (Iceberg, Delta Lake, etc.) in S3 and other backing stores; an embedded DuckDB engine ensures efficient query plans
  • pg_partman - simplifies and automates partitioning of large database tables
  • pg_cron - for scheduling maintenance operations
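
All three ship with the Analytics Stack. If you ever need to enable them by hand (for example, after restoring into a fresh database), the standard CREATE EXTENSION command applies; this is a minimal sketch that assumes the extension packages are already installed on the server and that pg_cron is configured in shared_preload_libraries:

-- Enable the Stack's extensions if they are not already active.
CREATE EXTENSION IF NOT EXISTS pg_analytics;
CREATE EXTENSION IF NOT EXISTS pg_partman;
CREATE EXTENSION IF NOT EXISTS pg_cron;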

Getting started

This guide will demonstrate how to query a parquet table stored in an S3 bucket.

Preparing your database

Launch an Analytics Stack instance in Tembo Cloud and wait for the database to show as ready in the UI.

Once that’s up and running, you’ll need a client machine with psql (to connect to your database). Our parquet file is available in a public S3 bucket.

Getting psql

If you need help installing psql, finding instructions for your platform should be relatively straightforward:

  • Mac — Homebrew’s libpq package includes psql
  • Ubuntu/Debian — the postgresql-client apt package provides psql
  • Windows — EDB’s installers can help
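
Once psql is installed, connect using the connection string for your instance (available from your instance’s page in Tembo Cloud). The host and password below are placeholders, not real values:

psql 'postgresql://postgres:<your-password>@<your-host>:5432/postgres'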

Create the foreign table

Foreign data wrappers can be a little verbose to set up: first declare the wrapper, then a server that uses that wrapper, and finally a foreign table that uses that server:

CREATE FOREIGN DATA WRAPPER parquet_wrapper
  HANDLER parquet_fdw_handler
  VALIDATOR parquet_fdw_validator;

CREATE SERVER parquet_server
  FOREIGN DATA WRAPPER parquet_wrapper;

CREATE FOREIGN TABLE trips ()
  SERVER parquet_server
  OPTIONS (files 's3://tembo-demo-bucket/yellow_tripdata_2024-01.parquet');

After creating the table, querying can begin immediately. Note that the column names and types for this table have automatically been inferred from the parquet file itself.
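
If you’d like to confirm what was inferred before querying, describe the table with \d trips in psql, or query the information schema directly:

-- List the column names and types inferred for the foreign table.
SELECT column_name, data_type
FROM information_schema.columns
WHERE table_name = 'trips'
ORDER BY ordinal_position;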

SELECT vendorid, passenger_count, trip_distance FROM trips LIMIT 1;
 vendorid | passenger_count | trip_distance 
----------+-----------------+---------------
        2 |               1 |          1.72
(1 row)

SELECT COUNT(*) FROM trips;
  count  
---------
 2964624
(1 row)

Executing Analytical Queries

Executing aggregates and analytical queries on the data in S3 works exactly the same as if that data were stored in Postgres. For example, the busiest times of day, when the most trips are taken, can be determined with a simple query:

SELECT 
  EXTRACT(HOUR FROM tpep_pickup_datetime) AS pickup_hour,
  COUNT(*) AS total_records
FROM trips
GROUP BY pickup_hour
ORDER BY pickup_hour;
 pickup_hour | total_records 
-------------+---------------
           0 |         79094
           1 |         53627
           2 |         37517
           3 |         24811
           4 |         16742
           5 |         18764
           6 |         41429
           7 |         83719
           8 |        117209
           9 |        128970
          10 |        138778
          11 |        150542
          12 |        164559
          13 |        169903
          14 |        182898
          15 |        189359
          16 |        190201
          17 |        206257
          18 |        212788
          19 |        184032
          20 |        159989
          21 |        160888
          22 |        143261
          23 |        109287
(24 rows)

Time: 694.308 ms

Performance enabled by DuckDB

We can compare this to running the same query against the same data stored locally in Postgres, without ParadeDB’s DuckDB-powered extension.

First, create a local table and populate it with the same data from S3.

CREATE TABLE local_trips (LIKE trips INCLUDING ALL);
INSERT INTO local_trips SELECT * FROM trips;

Then run the same query on the local Postgres table. Notice that the DuckDB-backed query against S3 completes nearly 5x faster than plain Postgres does against local data.

SELECT 
  EXTRACT(HOUR FROM tpep_pickup_datetime) AS pickup_hour,
  COUNT(*) AS total_records
FROM local_trips
GROUP BY pickup_hour
ORDER BY pickup_hour;
 pickup_hour | total_records 
-------------+---------------
           0 |         79094
           1 |         53627
           2 |         37517
           3 |         24811
           4 |         16742
           5 |         18764
           6 |         41429
           7 |         83719
           8 |        117209
           9 |        128970
          10 |        138778
          11 |        150542
          12 |        164559
          13 |        169903
          14 |        182898
          15 |        189359
          16 |        190201
          17 |        206257
          18 |        212788
          19 |        184032
          20 |        159989
          21 |        160888
          22 |        143261
          23 |        109287
(24 rows)

Time: 3317.911 ms (00:03.318)

Other Formats

Though the example above uses a Parquet file, pg_analytics supports several other formats, each with its own FOREIGN DATA WRAPPER; a CSV example follows the list:

  • CSV — comma-separated values, the classic
  • Delta — Delta tables, enhanced Parquet tables with transactional capabilities
  • Iceberg — from the Apache Iceberg project, designed for large analytic data sets
  • Spatial — for querying GeoJSON or similar
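
For example, a CSV file stored in S3 can be attached with the same three-step pattern used for Parquet above. This is a sketch: the handler and validator names follow pg_analytics’s naming convention for its CSV wrapper, and the bucket path is hypothetical, so consult the pg_analytics documentation for the exact names for the format you need:

CREATE FOREIGN DATA WRAPPER csv_wrapper
  HANDLER csv_fdw_handler
  VALIDATOR csv_fdw_validator;

CREATE SERVER csv_server
  FOREIGN DATA WRAPPER csv_wrapper;

CREATE FOREIGN TABLE csv_trips ()
  SERVER csv_server
  OPTIONS (files 's3://your-bucket/trips.csv');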

By varying the URI scheme in the files option provided when creating a FOREIGN TABLE, users can make use of alternate object stores: s3 (AWS), az (Azure), r2 (Cloudflare), gs (Google Cloud Storage), https, and hf (Hugging Face datasets) are all supported.
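
For instance, a publicly hosted Parquet file could be attached over https rather than s3, reusing the parquet_server created earlier; the URL below is a placeholder:

-- Only the URI scheme in the files option changes.
CREATE FOREIGN TABLE https_trips ()
  SERVER parquet_server
  OPTIONS (files 'https://example.com/data/yellow_tripdata_2024-01.parquet');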

Authentication

pg_analytics relies on the use of the SQL-standard CREATE USER MAPPING command to specify credentials to be used when connecting to remote object stores. For instance, one might configure S3 credentials like so:

CREATE USER MAPPING FOR marketing_analytics
SERVER s3_foreign_server
OPTIONS (
  type 'S3',
  key_id 'AKIAIOSFODNN7EXAMPLE',
  secret 'wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY',
  region 'us-east-1'
);

PostgreSQL also supports a special PUBLIC pseudo-user, which serves as the fallback when no mapping exists for a specific user. See pg_analytics’s documentation for individual object stores (for instance, the S3 object store) for full information about supported parameters.
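
A fallback mapping for all roles can be declared the same way; the credentials below are placeholders:

-- Used by any role that has no mapping of its own on this server.
CREATE USER MAPPING FOR PUBLIC
SERVER s3_foreign_server
OPTIONS (
  type 'S3',
  key_id '<access key id>',
  secret '<secret access key>',
  region 'us-east-1'
);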

Additional Reading

ParadeDB’s data ingest documentation has complete information about the configuration and use of pg_analytics.
