
Analytics

Tembo’s Analytics Stack provides a Postgres database tuned for efficient querying of large amounts of data, whether stored locally in Postgres or on remote storage such as S3. The S3 querying functionality is powered by pg_analytics.

Extensions

  • pg_stat_statements - provides statistics on SQL statements executed by the database, helping users analyze query performance and identify areas for optimization.
  • pg_analytics - provides foreign data wrappers for querying diverse table formats (Iceberg, Delta Lake, etc.) in S3 and other backing stores, with DuckDB ensuring efficient query plans.
  • pg_partman - simplifies and automates partitioning of large database tables.
  • pg_cron - schedules maintenance operations.
  • hydra_columnar - open-source, column-oriented Postgres storage designed for high-speed aggregate operations.
  • postgres_fdw - provides the foreign data wrapper necessary to access data stored in external Postgres servers.
  • wrappers - a development framework for Postgres Foreign Data Wrappers (FDWs), written in Rust. It also comes with a collection of FDWs built by Supabase.
  • pg_parquet - allows you to read and write Parquet files, located in S3 or on the file system, via COPY TO/FROM commands.
  • pg_later - executes queries asynchronously.
  • multicorn - multicorn2 is a foreign data wrapper that allows you to fetch foreign data in Python from your PostgreSQL server.
  • pg_tier - enables data tiering to AWS S3.
  • Extensions from Trunk can be installed on-demand; once installed, an extension is activated with CREATE EXTENSION, as sketched below.
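Once an extension is installed on the instance, activating it in a database is a single statement; pg_partman is shown here as an arbitrary example:

CREATE EXTENSION IF NOT EXISTS pg_partman;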

Getting started

Tembo supports running analytics workloads with data stored both locally and on object storage. We’ll cover both examples in this guide.

Preparing your database

Simply launch an Analytics stack in Tembo Cloud and wait for the database to be ready in the UI.

Once that’s up and running, you’ll need a client machine with psql (to connect to your database). The Parquet file used in this guide is available in a public S3 bucket.

Getting psql

If you need help installing psql, finding instructions for your platform should be relatively straightforward:

  • Mac — Homebrew’s libpq package includes psql
  • Ubuntu/Debian — the postgresql-client apt package provides psql
  • Windows — EDB’s installers can help

Analytics on Parquet data in S3

This guide will demonstrate how to query a Parquet file stored in an S3 bucket.

Create the foreign table

Foreign data wrappers can be a little verbose to use, but the gist is that a wrapper is declared first, then a server using that wrapper, and finally a table using that server:

CREATE FOREIGN DATA WRAPPER parquet_wrapper
  HANDLER parquet_fdw_handler
  VALIDATOR parquet_fdw_validator;

CREATE SERVER parquet_server
  FOREIGN DATA WRAPPER parquet_wrapper;

CREATE FOREIGN TABLE trips ()
  SERVER parquet_server
  OPTIONS (files 's3://tembo-demo-bucket/yellow_tripdata_2024-01.parquet');

After creating the table, querying can begin immediately. Note that the column names and types for this table have been inferred automatically from the Parquet file itself.
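To double-check what was inferred, the standard information_schema catalog covers foreign tables as well:

SELECT column_name, data_type
FROM information_schema.columns
WHERE table_name = 'trips'
ORDER BY ordinal_position;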

SELECT vendorid, passenger_count, trip_distance FROM trips LIMIT 1;
 vendorid | passenger_count | trip_distance 
----------+-----------------+---------------
        2 |               1 |          1.72
(1 row)

SELECT COUNT(*) FROM trips;
  count  
---------
 2964624
(1 row)

Executing Analytical Queries

Executing aggregate and analytical queries on data in S3 works exactly the same as if the data were local to Postgres. For example, the busiest time of day for trips can be determined with a simple query:

SELECT 
  EXTRACT(HOUR FROM tpep_pickup_datetime) AS pickup_hour,
  COUNT(*) AS total_records
FROM trips
GROUP BY pickup_hour
ORDER BY pickup_hour;
 pickup_hour | total_records 
-------------+---------------
           0 |         79094
           1 |         53627
           2 |         37517
           3 |         24811
           4 |         16742
           5 |         18764
           6 |         41429
           7 |         83719
           8 |        117209
           9 |        128970
          10 |        138778
          11 |        150542
          12 |        164559
          13 |        169903
          14 |        182898
          15 |        189359
          16 |        190201
          17 |        206257
          18 |        212788
          19 |        184032
          20 |        159989
          21 |        160888
          22 |        143261
          23 |        109287
(24 rows)

Time: 694.308 ms

Performance enabled by DuckDB

We can quickly compare this to running the same query with the data stored locally in Postgres, without ParadeDB’s DuckDB-powered extension.

First, create a local table and populate it with the same data from S3.

CREATE TABLE local_trips (LIKE trips INCLUDING ALL);
INSERT INTO local_trips SELECT * FROM trips;

Then run the same query on the local Postgres table. Notice that the DuckDB-powered query against S3 completed nearly 5x faster than plain Postgres execution against the local copy.

SELECT 
  EXTRACT(HOUR FROM tpep_pickup_datetime) AS pickup_hour,
  COUNT(*) AS total_records
FROM local_trips
GROUP BY pickup_hour
ORDER BY pickup_hour;
 pickup_hour | total_records 
-------------+---------------
           0 |         79094
           1 |         53627
           2 |         37517
           3 |         24811
           4 |         16742
           5 |         18764
           6 |         41429
           7 |         83719
           8 |        117209
           9 |        128970
          10 |        138778
          11 |        150542
          12 |        164559
          13 |        169903
          14 |        182898
          15 |        189359
          16 |        190201
          17 |        206257
          18 |        212788
          19 |        184032
          20 |        159989
          21 |        160888
          22 |        143261
          23 |        109287
(24 rows)

Time: 3317.911 ms (00:03.318)
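If you want to confirm where a query executes, EXPLAIN works as usual; a DuckDB-eligible query against the foreign table should show a foreign scan rather than a local sequential scan (the exact plan text varies with the pg_analytics version):

EXPLAIN SELECT COUNT(*) FROM trips;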

Other Formats

Though the example above uses a Parquet file, pg_analytics supports several other formats, each with its own FOREIGN DATA WRAPPER (an example follows the list):

  • CSV — comma-separated values, the classic
  • Delta — Delta tables, enhanced Parquet tables with transactional capabilities
  • Iceberg — from the Apache Iceberg project, designed for large analytic data sets
  • Spatial — for querying geojson or similar
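As a sketch, a CSV file in S3 can be exposed following the same three-step pattern; the handler and validator names below follow pg_analytics’s naming for its CSV wrapper, and the bucket path is hypothetical:

CREATE FOREIGN DATA WRAPPER csv_wrapper
  HANDLER csv_fdw_handler
  VALIDATOR csv_fdw_validator;

CREATE SERVER csv_server
  FOREIGN DATA WRAPPER csv_wrapper;

CREATE FOREIGN TABLE trips_csv ()
  SERVER csv_server
  OPTIONS (files 's3://my-bucket/trips.csv');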

By varying the URI scheme in the files option provided when creating a FOREIGN TABLE, users can make use of alternate object stores: s3 (AWS), az (Azure), r2 (Cloudflare), gs (Google Cloud Storage), https, and hf (Hugging Face datasets) are all supported.
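For example, pointing the Parquet wrapper at a file served over plain HTTPS only requires changing the files URI (the URL below is hypothetical):

CREATE FOREIGN TABLE trips_https ()
  SERVER parquet_server
  OPTIONS (files 'https://example.com/yellow_tripdata_2024-01.parquet');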

Authentication

pg_analytics relies on the use of the SQL-standard CREATE USER MAPPING command to specify credentials to be used when connecting to remote object stores. For instance, one might configure S3 credentials like so:

CREATE USER MAPPING FOR marketing_analytics
SERVER s3_foreign_server
OPTIONS (
  type 'S3',
  key_id 'AKIAIOSFODNN7EXAMPLE',
  secret 'wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY',
  region 'us-east-1'
);

PostgreSQL supports a special PUBLIC user, which acts as the fallback when no mapping exists for a specific user. See pg_analytics’s documentation for each individual object store (S3, for instance) for full information about supported parameters.
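For example, a fallback mapping that applies to all users could be created like so (the credentials shown are placeholders, as above):

CREATE USER MAPPING FOR PUBLIC
SERVER s3_foreign_server
OPTIONS (
  type 'S3',
  key_id 'AKIAIOSFODNN7EXAMPLE',
  secret 'wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY',
  region 'us-east-1'
);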

Additional Reading

ParadeDB’s data ingest documentation has complete information about the configuration and use of pg_analytics.

Analytics on Columnar Data stored in Postgres

This guide will walk through setting up an analytical workload on Postgres using ClickBench’s “hits” dataset (representing page views on a system) and the Hydra columnar extension for Postgres.

Check that columnar is enabled

On your Tembo Analytics instance, the columnar extension should be pre-installed and enabled on the postgres database.

postgres=# \dx columnar
              List of installed extensions
   Name   | Version | Schema |       Description
----------+---------+--------+--------------------------
 columnar | 11.1-10 | public | Hydra Columnar extension
(1 row)
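If for some reason it is not listed, it can be enabled on demand (assuming the extension binaries are present on the instance, as they are on the Analytics Stack):

CREATE EXTENSION IF NOT EXISTS columnar;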

Load a sample dataset

Next, download and decompress the sample dataset. It is 15GB compressed.

wget --continue 'https://datasets.clickhouse.com/hits_compatible/hits.tsv.gz'
gzip -d hits.tsv.gz

There are 99,997,497 records in hits.tsv. For simplicity (and to keep this guide quick), take a subset of the data.

head -n 1000000 hits.tsv > hits_1mil.tsv

Next, create a table configured for columnar storage using the columnar extension. Note the USING columnar clause at the end of the CREATE TABLE statement.

CREATE TABLE hits
(
    WatchID BIGINT NOT NULL,
    JavaEnable SMALLINT NOT NULL,
    Title TEXT NOT NULL,
    GoodEvent SMALLINT NOT NULL,
    EventTime TIMESTAMP NOT NULL,
    EventDate Date NOT NULL,
    CounterID INTEGER NOT NULL,
    ClientIP INTEGER NOT NULL,
    RegionID INTEGER NOT NULL,
    UserID BIGINT NOT NULL,
    CounterClass SMALLINT NOT NULL,
    OS SMALLINT NOT NULL,
    UserAgent SMALLINT NOT NULL,
    URL TEXT NOT NULL,
    Referer TEXT NOT NULL,
    IsRefresh SMALLINT NOT NULL,
    RefererCategoryID SMALLINT NOT NULL,
    RefererRegionID INTEGER NOT NULL,
    URLCategoryID SMALLINT NOT NULL,
    URLRegionID INTEGER NOT NULL,
    ResolutionWidth SMALLINT NOT NULL,
    ResolutionHeight SMALLINT NOT NULL,
    ResolutionDepth SMALLINT NOT NULL,
    FlashMajor SMALLINT NOT NULL,
    FlashMinor SMALLINT NOT NULL,
    FlashMinor2 TEXT NOT NULL,
    NetMajor SMALLINT NOT NULL,
    NetMinor SMALLINT NOT NULL,
    UserAgentMajor SMALLINT NOT NULL,
    UserAgentMinor VARCHAR(255) NOT NULL,
    CookieEnable SMALLINT NOT NULL,
    JavascriptEnable SMALLINT NOT NULL,
    IsMobile SMALLINT NOT NULL,
    MobilePhone SMALLINT NOT NULL,
    MobilePhoneModel TEXT NOT NULL,
    Params TEXT NOT NULL,
    IPNetworkID INTEGER NOT NULL,
    TraficSourceID SMALLINT NOT NULL,
    SearchEngineID SMALLINT NOT NULL,
    SearchPhrase TEXT NOT NULL,
    AdvEngineID SMALLINT NOT NULL,
    IsArtifical SMALLINT NOT NULL,
    WindowClientWidth SMALLINT NOT NULL,
    WindowClientHeight SMALLINT NOT NULL,
    ClientTimeZone SMALLINT NOT NULL,
    ClientEventTime TIMESTAMP NOT NULL,
    SilverlightVersion1 SMALLINT NOT NULL,
    SilverlightVersion2 SMALLINT NOT NULL,
    SilverlightVersion3 INTEGER NOT NULL,
    SilverlightVersion4 SMALLINT NOT NULL,
    PageCharset TEXT NOT NULL,
    CodeVersion INTEGER NOT NULL,
    IsLink SMALLINT NOT NULL,
    IsDownload SMALLINT NOT NULL,
    IsNotBounce SMALLINT NOT NULL,
    FUniqID BIGINT NOT NULL,
    OriginalURL TEXT NOT NULL,
    HID INTEGER NOT NULL,
    IsOldCounter SMALLINT NOT NULL,
    IsEvent SMALLINT NOT NULL,
    IsParameter SMALLINT NOT NULL,
    DontCountHits SMALLINT NOT NULL,
    WithHash SMALLINT NOT NULL,
    HitColor CHAR NOT NULL,
    LocalEventTime TIMESTAMP NOT NULL,
    Age SMALLINT NOT NULL,
    Sex SMALLINT NOT NULL,
    Income SMALLINT NOT NULL,
    Interests SMALLINT NOT NULL,
    Robotness SMALLINT NOT NULL,
    RemoteIP INTEGER NOT NULL,
    WindowName INTEGER NOT NULL,
    OpenerName INTEGER NOT NULL,
    HistoryLength SMALLINT NOT NULL,
    BrowserLanguage TEXT NOT NULL,
    BrowserCountry TEXT NOT NULL,
    SocialNetwork TEXT NOT NULL,
    SocialAction TEXT NOT NULL,
    HTTPError SMALLINT NOT NULL,
    SendTiming INTEGER NOT NULL,
    DNSTiming INTEGER NOT NULL,
    ConnectTiming INTEGER NOT NULL,
    ResponseStartTiming INTEGER NOT NULL,
    ResponseEndTiming INTEGER NOT NULL,
    FetchTiming INTEGER NOT NULL,
    SocialSourceNetworkID SMALLINT NOT NULL,
    SocialSourcePage TEXT NOT NULL,
    ParamPrice BIGINT NOT NULL,
    ParamOrderID TEXT NOT NULL,
    ParamCurrency TEXT NOT NULL,
    ParamCurrencyID SMALLINT NOT NULL,
    OpenstatServiceName TEXT NOT NULL,
    OpenstatCampaignID TEXT NOT NULL,
    OpenstatAdID TEXT NOT NULL,
    OpenstatSourceID TEXT NOT NULL,
    UTMSource TEXT NOT NULL,
    UTMMedium TEXT NOT NULL,
    UTMCampaign TEXT NOT NULL,
    UTMContent TEXT NOT NULL,
    UTMTerm TEXT NOT NULL,
    FromTag TEXT NOT NULL,
    HasGCLID SMALLINT NOT NULL,
    RefererHash BIGINT NOT NULL,
    URLHash BIGINT NOT NULL,
    CLID INTEGER NOT NULL,
    PRIMARY KEY (CounterID, EventDate, UserID, EventTime, WatchID)
) USING columnar;

From psql, copy the data set into the table.

\copy hits FROM 'hits_1mil.tsv'
COPY 1000000
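After a bulk load like this, it is generally worth refreshing the planner’s statistics so the analytical queries below get sensible plans:

VACUUM ANALYZE hits;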

Create indices on the table

Create indices on the table to optimize the workload for analytical queries. The indices below are relevant to the Clickbench workload.

CREATE INDEX userid ON hits (UserID);
CREATE INDEX eventtime ON hits (EventTime);
CREATE INDEX eventdate ON hits (EventDate);
CREATE INDEX search ON hits (SearchPhrase);
CREATE INDEX search2 ON hits (SearchPhrase) WHERE SearchPhrase <> ''::text;

Execute analytical queries on the table

Now you can try typical analytical queries on the table. Here are a few examples of queries you can try.

Get the minimum and maximum event dates from the table.

SELECT MIN(EventDate), MAX(EventDate) FROM hits;
    min     |    max
------------+------------
 2013-07-15 | 2013-07-15

Get the top 10 regions with the most unique users.

SELECT RegionID, COUNT(DISTINCT UserID) AS u
FROM hits GROUP BY RegionID
ORDER BY u DESC
LIMIT 10;
 regionid |   u
----------+-------
      229 | 27961
        2 | 10413
      208 |  3073
        1 |  1720
       34 |  1428
      158 |  1110
      184 |   987
      107 |   966
       42 |   956
       47 |   943

Calculate the top 10 pages with the most views for CounterID 62, excluding refreshes and considering only certain traffic sources and a specific referring page.

SELECT URLHash, EventDate, COUNT(*) AS PageViews
FROM hits
WHERE CounterID = 62
    AND IsRefresh = 0
    AND TraficSourceID IN (-1, 6)
    AND RefererHash = 3594120000172545465
GROUP BY URLHash, EventDate
ORDER BY PageViews DESC LIMIT 10 OFFSET 100;
       urlhash        | eventdate  | pageviews
----------------------+------------+-----------
 -5794910153905534566 | 2013-07-15 |        19
 -8675455922189315655 | 2013-07-15 |        19
 -1354147336435390881 | 2013-07-15 |        19
 -1419388746330668048 | 2013-07-15 |        18
  7644052073203380311 | 2013-07-15 |        18
  2183693295573901880 | 2013-07-15 |        18
 -8213908143099318937 | 2013-07-15 |        18
  4329780285977997346 | 2013-07-15 |        18
  1237664075729419728 | 2013-07-15 |        18
  3229224080707842023 | 2013-07-15 |        17

Support

Join the Tembo Community in Slack to ask a question or see how others are building on https://cloud.tembo.io.
