Tembo DataWarehouse Stack is tuned and configured for data warehouse workloads. Extract, transform, and load data from external sources using extensions, and build a centralized datastore for analytical and tactical queries.
Container Image
This stack is built with a custom image, dw-cnpg; more detailed information is available in the dw-cnpg Dockerfile. If you are interested in the other Stack-specific images, please visit the official tembo-images repository.
Extensions
- pg_stat_statements provides statistics on SQL statements executed by the database. It helps users analyze query performance and identify areas for optimization.
- hydra_columnar is open source, column-oriented Postgres, designed for high-speed aggregate operations.
- pg_partman simplifies and automates partitioning of large database tables. It helps manage data efficiently by dividing it into smaller, more manageable partitions.
- pg_cron automates database tasks within PostgreSQL, enabling scheduled maintenance, recurring tasks, and interval-based SQL queries.
- postgres_fdw provides the foreign data wrapper necessary to access data stored in external Postgres servers.
- redis_fdw provides the foreign data wrapper necessary to access data stored in external Redis servers.
- wrappers is a development framework for Postgres Foreign Data Wrappers (FDW), written in Rust. It also comes with a collection of FDWs built by Supabase.
- multicorn2 is a Foreign Data Wrapper that allows you to fetch foreign data in Python in your PostgreSQL server.
Extensions from Trunk can be installed on-demand.
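For instance, with pg_cron you can schedule recurring SQL directly from the database. A minimal sketch, assuming pg_cron has been enabled on the instance; the job name and schedule here are illustrative:
-- Run ANALYZE every night at 03:00 to keep planner statistics fresh
SELECT cron.schedule('nightly-analyze', '0 3 * * *', 'ANALYZE');
-- Inspect scheduled jobs, and remove the one above when no longer needed
SELECT jobid, jobname, schedule, command FROM cron.job;
SELECT cron.unschedule('nightly-analyze');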
Getting started
Let’s say you are wearing a data engineer hat and working on a click-through rate (CTR) project, and you have a bunch of CTR data dumped in S3. The data can be in CSV, Parquet, or JSON format. The goal is to get insights from the S3 data with minimal setup.
About Data in S3
The data we will use for this example is available in a public S3 bucket. You can download it using the aws-cli with the following command. Please note that you will need the aws-cli installed and configured with your AWS credentials.
aws s3api get-object --bucket tembo-demo-bucket --key CTR1M.csv.gz ./CTR1M.csv.gz --no-sign-request
Then create a bucket of your own and upload the data using the following commands. Note that S3 bucket names are globally unique, so substitute your own bucket name for tembo-demo-bucket.
aws s3api create-bucket --bucket tembo-demo-bucket
aws s3api put-object --bucket tembo-demo-bucket --key CTR1M.csv.gz --body CTR1M.csv.gz
Please see the AWS S3 documentation on downloading and uploading objects for more details.
Data fields
id: ad identifier
click: 0/1 for non-click/click
hour: format is YYMMDDHH, so 14091123 means 23:00 on Sept. 11, 2014 UTC (see the parsing sketch after this list)
C1: anonymized categorical variable
banner_pos
site_id
site_domain
site_category
app_id
app_domain
app_category
device_id
device_ip
device_model
device_type
device_conn_type
C14-C21: anonymized categorical variables
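Because hour is stored as text in YYMMDDHH form, it can be decoded with to_timestamp, which is exactly what the trend queries below do. A quick sanity check against the example value above, assuming the session time zone is UTC:
-- 14091123 decodes to 23:00 on Sept. 11, 2014 UTC
SELECT to_timestamp('14091123', 'YYMMDDHH24');
-- => 2014-09-11 23:00:00+00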
Setup
First, connect to your Tembo cluster:
psql 'postgresql://postgres:<your-password>@<your-host>:5432/postgres'
Accessing S3 data from the Tembo stack is a three-step process.
Step 1
Define a handler and validator for the wrapper.
create foreign data wrapper s3_wrapper
  handler s3_fdw_handler
  validator s3_fdw_validator;
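To confirm the wrapper was registered, you can query the standard Postgres catalog (nothing here is specific to the S3 wrapper):
-- List foreign data wrappers with their handler and validator functions
SELECT fdwname, fdwhandler::regproc, fdwvalidator::regproc
FROM pg_foreign_data_wrapper;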
Step 2
Create a server on top of the wrapper.
create server s3_server
foreign data wrapper s3_wrapper
options (
aws_access_key_id 'INSERT_ACCESS_KEY_ID',
aws_secret_access_key 'INSERT_SECRET_ACCESS_KEY',
aws_region 'us-east-1'
);
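If the credentials rotate later, the server options can be updated in place instead of dropping and recreating the server; for example:
-- Swap in new credentials without touching dependent foreign tables
ALTER SERVER s3_server OPTIONS (
  SET aws_access_key_id 'INSERT_NEW_ACCESS_KEY_ID',
  SET aws_secret_access_key 'INSERT_NEW_SECRET_ACCESS_KEY'
);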
Step 3
Create a foreign table that matches the columns of the CSV data, keeping all columns as type text.
create foreign table s3_ctr_1M_gzip_csv(
id text,
click text,
hour text,
c1 text,
banner_pos text,
site_id text,
site_domain text,
site_category text,
app_id text,
app_domain text,
app_category text,
device_id text,
device_ip text,
device_model text,
device_type text,
device_conn_type text,
c14 text,
c15 text,
c16 text,
c17 text,
c18 text,
c19 text,
c20 text,
c21 text
) server s3_server
options (
uri 's3://tembo-demo-bucket/CTR1M.csv.gz',
format 'csv',
has_header 'true',
compress 'gzip'
);
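Before diving into the analysis, a quick sanity check confirms that the foreign table can read the S3 object:
-- Fetch a handful of rows; a result here means the wrapper, server,
-- and table definition are all working
SELECT id, click, hour, device_type FROM s3_ctr_1M_gzip_csv LIMIT 5;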
Good to go…
Query 1
Gather overall CTR across 1 million impressions.
SELECT count(*) impressions,
       sum(click::integer) clicks,
       sum(click::integer) / count(*)::numeric overall_ctr
FROM s3_ctr_1M_gzip_csv;
Result:
impressions | clicks | overall_ctr
-------------+--------+------------------------
1000000 | 169566 | 0.16956600000000000000
(1 row)
Query 2
Gather hourly click trend.
SELECT EXTRACT(HOUR FROM to_timestamp(hour, 'YYMMDDHH24')) hourly,
       count(*) hourly_impressions,
       sum(click::integer) clicks_per_hour,
       sum(click::integer) / count(*)::numeric hourly_ctr
FROM s3_ctr_1M_gzip_csv
GROUP BY hourly
ORDER BY hourly;
Result:
hourly | hourly_impressions | clicks_per_hour | hourly_ctr
--------+--------------------+-----------------+------------------------
0 | 20954 | 3639 | 0.17366612579937004868
1 | 24454 | 4518 | 0.18475505029851966958
2 | 30281 | 5342 | 0.17641425316204880948
3 | 34689 | 6093 | 0.17564645853152296117
4 | 47580 | 7626 | 0.16027742749054224464
5 | 49165 | 8055 | 0.16383606223939794569
6 | 43411 | 7189 | 0.16560318813204026629
7 | 45395 | 8115 | 0.17876418107721114660
8 | 51687 | 8405 | 0.16261342310445566583
9 | 56745 | 9046 | 0.15941492642523570359
10 | 53366 | 8651 | 0.16210695948731402016
11 | 50360 | 8549 | 0.16975774424146147736
12 | 54729 | 9355 | 0.17093314330610827897
13 | 59380 | 10046 | 0.16918154260693836309
14 | 54753 | 9651 | 0.17626431428414881376
15 | 51496 | 9223 | 0.17910128942053751748
16 | 50269 | 9031 | 0.17965346436173387177
17 | 49995 | 8762 | 0.17525752575257525753
18 | 43401 | 7283 | 0.16780719338264095297
19 | 32943 | 5417 | 0.16443554017545457305
20 | 27687 | 4477 | 0.16170043702820818435
21 | 24543 | 3897 | 0.15878254492115878254
22 | 22523 | 3736 | 0.16587488345247080762
23 | 20194 | 3460 | 0.17133802119441418243
(24 rows)
Query 3
Gather click trend over DayOfWeek.
SELECT EXTRACT(DOW FROM to_timestamp(hour, 'YYMMDDHH24')) DayOfWeek,
       count(*) dow_impressions,
       sum(click::integer) dow_clicks,
       sum(click::integer) / count(*)::numeric dow_ctr
FROM s3_ctr_1M_gzip_csv
GROUP BY DayOfWeek
ORDER BY DayOfWeek;
Result:
dayofweek | dow_impressions | dow_clicks | dow_ctr
-----------+-----------------+------------+------------------------
0 | 94449 | 17216 | 0.18227826657772978009
1 | 79607 | 14508 | 0.18224527993769392139
2 | 232847 | 37659 | 0.16173281167461895580
3 | 226852 | 35536 | 0.15664838749493061556
4 | 200471 | 35201 | 0.17559148205974929042
5 | 82523 | 14410 | 0.17461798528894974734
6 | 83251 | 15036 | 0.18061044311780038678
(7 rows)
Query 4
Gather click trend on various Devices.
SELECT device_type,
       count(*) impressions_per_device,
       sum(click::integer) clicks_per_device,
       sum(click::integer) / count(*)::numeric device_ctr
FROM s3_ctr_1M_gzip_csv
GROUP BY 1
ORDER BY 1;
Result:
device_type | impressions_per_device | clicks_per_device | device_ctr
-------------+------------------------+-------------------+------------------------
0 | 55111 | 11636 | 0.21113752245468236831
1 | 922596 | 155849 | 0.16892442629276519733
2 | 1 | 0 | 0.00000000000000000000
4 | 19096 | 1791 | 0.09378927524088814411
5 | 3196 | 290 | 0.09073842302878598248
(5 rows)
Query 5
Gather click trend for Features.
SELECT c1 feature,
       count(*) impressions_per_feature,
       sum(click::integer) clicks_per_feature,
       sum(click::integer) / count(*)::numeric feature_ctr
FROM s3_ctr_1M_gzip_csv
GROUP BY 1
ORDER BY 1;
Result:
feature | impressions_per_feature | clicks_per_feature | feature_ctr
---------+-------------------------+--------------------+------------------------
1001 | 232 | 4 | 0.01724137931034482759
1002 | 55111 | 11636 | 0.21113752245468236831
1005 | 918496 | 155315 | 0.16909708915444378636
1007 | 893 | 47 | 0.05263157894736842105
1008 | 145 | 16 | 0.11034482758620689655
1010 | 22292 | 2081 | 0.09335187511214785573
1012 | 2831 | 467 | 0.16495937831155068880
(7 rows)
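Each of the queries above re-reads the gzipped CSV from S3. For repeated analysis, a natural next step is to materialize the data into a local table. Here is a sketch using the stack's hydra_columnar access method, assuming the extension is enabled; the table name is illustrative:
-- Copy the S3 data into a local column-oriented table
CREATE TABLE ctr_1m USING columnar
AS SELECT * FROM s3_ctr_1M_gzip_csv;
-- Subsequent aggregates scan local columnar storage instead of S3
SELECT count(*) impressions, sum(click::integer) clicks FROM ctr_1m;
This keeps S3 as the source of truth while serving repeated analytical queries locally.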