
· 9 min read
Samay Sharma

In case you missed it, Postgres 16 came out last week - and this year it arrived earlier than the last few years. There are many features that I’ve been looking forward to for the last few months and I’m excited to see them get into the hands of users. Before we dive into the specific features of this release, let’s discuss what a Postgres major release actually means.

postgres-16

Postgres Releases

The PostgreSQL Global Development Group releases a new major version every year with new features.

In addition, Postgres releases minor versions of each major release every 3 months or so with bug fixes and security fixes. No new features are released in minor versions, and that’s what makes major version releases so exciting as it’s the culmination of about a year’s worth of development work on the project.

While there's a case to be made for faster release of features, Postgres prides itself on stability and this release cadence provides enough time for features to be proposed, reviewed, committed and tested before they get shipped.

Should I upgrade to Postgres 16?

If you are building a new application, yes, I would recommend that you start with the latest major version of Postgres. This will guarantee the latest and greatest features, and a continuous flow of minor releases that fix bugs and improve the security of your database.

If you are upgrading an existing system, there are more factors to consider. The general advice is to always upgrade to new minor versions - they contain security and bug fixes, and the risk of not upgrading is higher than the risk of the upgrade itself.

However, for major versions, you will need to consider the tradeoffs as the major versions usually change the internal format of system tables and data files. That means, you can’t just use previous versions of the data directory — you’ll need to use pg_dump / pg_restore or pg_upgrade to upgrade. In addition, depending on the features you are using and the Postgres release, manual changes to your code or queries may also be required.

Obviously, another important factor if you are using a managed service provider is when they provide support for Postgres 16. At Tembo Cloud, we’ve already started working on supporting Postgres 16 and expect it to be available in a few weeks.

What’s most exciting about Postgres 16?

Postgres 16 delivers exciting features across all aspects of the database: performance improvements, monitoring enhancements, better security and privilege handling, replication improvements, new server features and commands, and a lot more.

If you’re interested in the complete list of features, you can read the detailed release notes. Below, I’ll talk about the aspects of this release which excite me the most and we will talk about a few not-so-talked about features which lay the groundwork for more exciting features in Postgres 17.

Logical replication improvements

Logical replication is a feature I’ve always been interested in, as it allows you to expand the capabilities of Postgres by moving data between different Postgres databases or from Postgres to other databases. It finds interesting use cases in: replicating selectively from one database to another, replicating across Postgres versions, online migrations and allowing consolidation from multiple databases.

Arguably the most exciting logical replication feature in this release is support for logical replication from standby servers. Prior to this feature, you could only create a logical replication slot on the primary, which meant adding more replicas would add more load on the primary. With Postgres 16, secondaries also have the ability to create replication slots, allowing that load to be distributed. What’s more is that the replication slots on the secondary are persisted even when the standby is promoted to the primary. This means that subscribers won’t be affected even during a failover! You can read more about this feature in Bertrand’s blog post.

The short of it is that now you can do this on a Postgres 16 standby:

postgres@standby=# select pg_is_in_recovery();
pg_is_in_recovery
-------------------
t
(1 row)

postgres@standby=# SELECT * FROM pg_create_logical_replication_slot('active_slot', 'test_decoding', false, true);
slot_name | lsn
-------------+------------
active_slot | 0/C0053A78
(1 row)

On Postgres 15, the same thing would have errored out:

postgres@standby=# SELECT * FROM pg_create_logical_replication_slot('active_slot', 'test_decoding', false, true);
ERROR: logical decoding cannot be used while in recovery

In addition to this, there are a number of other logical replication performance improvements. This includes faster initial table sync using binary format, use of btree indexes during logical replication apply when tables don’t have a primary key (previously the table would be scanned sequentially) and parallel application of large transactions (~25-40% speedups).
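
For example, a subscription created against Postgres 16 can opt into both the binary initial table sync and parallel apply of large transactions. This is just a sketch - the connection string, publication and subscription names are placeholders:

CREATE SUBSCRIPTION analytics_sub
    CONNECTION 'host=primary dbname=app user=replicator'
    PUBLICATION app_pub
    WITH (binary = true, streaming = parallel);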

Monitoring improvements

The other bucket of features which intrigued me are the monitoring enhancements. While Postgres provides a number of statistics tables with monitoring information, I believe more can be done to provide actionable insights to users. As an example, Lukas pointed out several interesting gaps in Postgres monitoring in his PGCon 2020 talk.

Coming back to this release, pg_stat_io has to be the most useful piece of information added to the Postgres stats views in Postgres 16. It allows you to understand the I/O done by Postgres at a more granular level, broken down by backend_type and context. This means you can calculate a more accurate cache hit ratio by ignoring the I/O done by VACUUM, differentiate between extends and flushes, and separate out bulk operations while deciding which configurations to tune. Melanie talks about this and much more in her talk, and this blog post walks through how you would use it as a DBA.

Here is an example of the statistics you can see in pg_stat_io:

$ SELECT * FROM pg_stat_io ;
backend_type │ io_object │ io_context │ reads │ writes │ extends │ op_bytes │ evictions │ reuses │ fsyncs │ stats_reset
─────────────────────┼───────────────┼────────────┼─────────┼─────────┼─────────┼──────────┼───────────┼─────────┼────────┼───────────────────────────────
autovacuum launcher │ relation │ bulkread │ 0 │ 0 │ [NULL] │ 8192 │ 0 │ 0 │ [NULL] │ 2023-02-27 13:25:39.725072+01
...
autovacuum worker │ relation │ bulkread │ 0 │ 0 │ [NULL] │ 8192 │ 0 │ 0 │ [NULL] │ 2023-02-27 13:25:39.725072+01
...
client backend │ temp relation │ normal │ 0 │ 0 │ 0 │ 8192 │ 0 │ [NULL] │ [NULL] │ 2023-02-27 13:25:39.725072+01
background worker │ relation │ bulkread │ 268221 │ 268189 │ [NULL] │ 8192 │ 0 │ 268189 │ [NULL] │ 2023-02-27 13:25:39.725072+01
...
checkpointer │ relation │ normal │ [NULL] │ 32121 │ [NULL] │ 8192 │ [NULL] │ [NULL] │ 3356 │ 2023-02-27 13:25:39.725072+01
standalone backend │ relation │ bulkread │ 0 │ 0 │ [NULL] │ 8192 │ 0 │ 0 │ [NULL] │ 2023-02-27 13:25:39.725072+01
...
startup │ relation │ vacuum │ 0 │ 0 │ 0 │ 8192 │ 0 │ 0 │ [NULL] │ 2023-02-27 13:25:39.725072+01
walsender │ relation │ bulkread │ 0 │ 0 │ [NULL] │ 8192 │ 0 │ 0 │ [NULL] │ 2023-02-27 13:25:39.725072+01
...
walsender │ temp relation │ normal │ 0 │ 0 │ 0 │ 8192 │ 0 │ [NULL] │ [NULL] │ 2023-02-27 13:25:39.725072+01
...
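
As a rough sketch of how you might slice this view, the query below sums reads and writes per backend type while excluding vacuum-driven I/O. Column names follow the sample output above; they shifted a bit during the Postgres 16 development cycle, so double-check them with \d pg_stat_io on your build:

SELECT backend_type,
       io_context,
       sum(reads)   AS reads,
       sum(writes)  AS writes,
       sum(extends) AS extends
FROM pg_stat_io
WHERE io_context <> 'vacuum'
GROUP BY backend_type, io_context
ORDER BY reads DESC NULLS LAST;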

In addition to this, there are other improvements, including the addition of last_seq_scan and last_idx_scan columns to the pg_stat_* views, which allow you to understand index usage better and figure out when plans for a query might have changed.
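
For example, here is a quick sketch (against pg_stat_user_tables; adjust the filter for your own schema) to spot tables that are still being scanned sequentially and when that last happened:

SELECT relname, seq_scan, last_seq_scan, idx_scan, last_idx_scan
FROM pg_stat_user_tables
ORDER BY last_seq_scan DESC NULLS LAST
LIMIT 10;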

Special mentions

Like I said, each release comes with many improvements - and I could not outline them all in a blog post (and if I did, nobody would read it!). But I do want to mention a few other items in Postgres 16 that I won’t be able to dive deeper into but are interesting as well.

  • Load balancing with multiple hosts in libpq: This feature lets you balance load across Postgres read replicas directly within libpq (the foundational Postgres client library) without having to use another load balancer. You can read this blog post on how this new feature is implemented and can be used.
  • Performance: I won’t repeat what’s in the release notes, but there’s a long list of performance improvements in this release. There’s support for more parallelism on FULL and OUTER JOINs and on more aggregates, greater usage of incremental sorts, window function optimizations, and even an up to 300% performance improvement in COPY.
  • VACUUM improvements: The last thing I’d mention is improvements to VACUUM, which include freezing performance improvements, the ability to increase (or decrease) how much of shared buffers VACUUM may use (see the example just after this list), and faster loading of VACUUM configs.
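
As a sketch of that buffer-usage knob, Postgres 16 exposes it both as a per-command option and as a setting; my_table is a placeholder and the sizes are just examples you would tune against your own shared_buffers:

VACUUM (BUFFER_USAGE_LIMIT '1GB', VERBOSE) my_table;

-- the same limit also exists as a configuration parameter
ALTER SYSTEM SET vacuum_buffer_usage_limit = '256MB';
SELECT pg_reload_conf();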

Laying the groundwork for an exciting Postgres 17 (and beyond)

All of the features mentioned above can immediately add a lot of value to your Postgres usage, but there are new features which lay the groundwork for powerful features in future releases. I’ll quickly touch upon three items that I believe are noteworthy:

  • Direct IO and Async IO in Postgres: In Postgres 16, several building blocks for implementing direct and asynchronous IO for Postgres were committed. This includes reduced contention on the relation extension lock and addition of Direct IO as a developer only option via the debug_io_direct setting. This important but hard work has been ongoing for several releases and Postgres 17 will likely be the first release where users will be able to utilize these features.
  • Moving towards active-active replication: A feature was committed in Postgres 16 that allows logical replication to avoid replication loops, which occur when a transaction gets replicated from source to target and back again. Postgres 16 lets subscribers process only changes that have no origin, which prevents these loops (see the sketch after this list). Bi-directional active-active replication is still very complicated and requires solving a lot of problems, but this feature tackles one of those important sub-problems.
  • Migration of the build system to Meson: This might have slipped under your radar, but in this release Postgres added support for a new build system which is expected to replace the Autoconf- and Windows-based build systems. Why, you might ask? Andres makes a compelling case for it in this thread if you’re interested, but some reasons include faster build times, simpler dependency management, and moving towards a common build system across Linux and Windows.
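
Referring back to the replication-loop item above, here is a sketch of how a subscriber can ask for only locally-originated changes; the connection string, publication and subscription names are placeholders:

CREATE SUBSCRIPTION node_b_sub
    CONNECTION 'host=node-a dbname=app user=replicator'
    PUBLICATION app_pub
    WITH (origin = none);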

Postgres continues to deliver

With every new release, Postgres becomes more and more compelling and adds great features that improve the experience of its users. I recommend you check out the press release, and the release notes to see everything coming along with this new release.

And if you want to try out the full power of Postgres including its powerful extension ecosystem on a managed service, try out Tembo Cloud.

· 13 min read
Steven Miller

Before working for a Postgres company, I had never used extensions.

Now, I'm part of a team working to fully automate turning on any extension. I didn't find great resources to explain the process of turning on an extension, and why it varies between different extensions.

I want to share my mental model for the different types of extensions, how to know what type of extension you're working with, and how to get it turned on.

Turn on an extension

one-does-not-simply

What's traditionally involved:

  • Find the extension you want
  • Figure out how to build it
  • Sometimes, installation of dependencies (for example with apt-get or yum)
  • Sometimes, installation of other extensions (goto ‘figure out how to build it’)
  • Install your extension
  • Sometimes, load a library
  • Sometimes, provide extension-specific configurations
  • Sometimes, run the CREATE EXTENSION command

Building and installing extensions is well covered by other resources. In this blog, I want to focus on steps to get an extension up and running after it's installed, and how I believe that all extensions fit into four mostly-tidy categories.

Terminology

Extensions consist of SQL and / or libraries.

A library simply means compiled code, for example written in C or Rust.

SQL objects, let's just call them SQL, are extensions to SQL, for example new functions and data types. These are often implemented by a library, but can also be implemented in other ways, for example using a procedural language like PL/pgSQL.

Hooks: A Postgres feature informally called hooks can be used to hook into Postgres' existing functionality. Hooks allow for overwriting default Postgres functionality, or calling back into an extension's code at the appropriate time. For example, one type of hook can modify Postgres startup behavior to launch a background worker, and a different type of hook can be used to redirect queries to a different table.

note

Sometimes extensions are instead referred to as 'modules', but I like to simply refer to everything as an 'extension'. Feel free to @ me on X to tell me why that's wrong (@sjmiller609).

Enter the matrix

im-in-matrix

A big part of what I have been working on is fully automating enabling any extension. In order to do that, we have to understand exactly how extensions vary. We can break it down into a 2x2 matrix by defining two boolean categories.

Requires LOAD true or false and requires CREATE EXTENSION true or false:

                       | Requires CREATE EXTENSION                               | Does not require CREATE EXTENSION
-----------------------+---------------------------------------------------------+-----------------------------------------------------------
 Requires LOAD         | Extensions that use SQL and their libraries have hooks | Extensions that do not use SQL, may or may not have hooks
 Does not require LOAD | SQL-only extensions, and SQL + libraries without hooks | Output plugins

By categorizing an extension into this 2x2 matrix, we can know how to turn it on.

LOAD

LOAD is a command that tells Postgres to load a library, meaning make the code accessible to Postgres by loading the compiled code on disk into memory. If a library has hooks, performing a load will activate the hooks.

Requires LOAD: true means you have to do one of the following steps to load a library:

  • LOAD: using the LOAD command directly loads a library for the current connection only
  • session_preload_libraries: configuration, specifies which libraries to LOAD for new connections
  • shared_preload_libraries: configuration, specifies which libraries to LOAD at server start, and therefore requires a restart
note

Even though code is loaded in other ways during CREATE EXTENSION, that does not count as requires LOAD: true under this definition; I mean that the user must do something other than CREATE EXTENSION to load in libraries. Also, we are conflating local_preload_libraries with session_preload_libraries to simplify things in this blog post.

For example, if you installed the extension auto explain, then you may have a library file called auto_explain.so in your library directory, which can be found with pg_config --pkglibdir. Libraries are not always named exactly the same as the extension.

$ trunk install auto_explain

Using pkglibdir: /var/lib/postgresql/data/tembo/15/lib

[+] auto_explain.so => /var/lib/postgresql/data/tembo/15/lib
$ ls $(pg_config --pkglibdir) | grep auto_explain
auto_explain.so

Auto explain can be loaded into your session with LOAD 'auto_explain';. The name in this command always matches exactly the name of the library file, minus the file type, in this example .so. With a couple of configurations, this extension will then automatically log the EXPLAIN ANALYZE output for long-running queries.

postgres=# LOAD 'auto_explain';
LOAD

However, the LOAD command is not typically used directly, and many extensions require that you do not load them this way. Instead, the Postgres configuration shared_preload_libraries is typically used.

postgres=# LOAD 'pg_cron';
ERROR: pg_cron can only be loaded via shared_preload_libraries
HINT: Add pg_cron to the shared_preload_libraries configuration variable in postgresql.conf.

The best reason to use LOAD directly is for debugging. It can be nice to LOAD on-demand while troubleshooting.

info

What to do when an extension requires load:

Extensions that require LOAD: true can always be configured in shared_preload_libraries, but this configuration requires a restart to take effect. Some extensions can be loaded without a restart using LOAD directly, but in this case it's usually better to use the session_preload_libraries configuration and reload the Postgres configuration with SELECT pg_reload_conf();. You should run LOAD directly only when you are intentionally loading for just the current connection.
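
As a concrete sketch of those two options (note that ALTER SYSTEM overwrites the whole comma-separated list, so include any libraries you already have configured):

-- load for all new connections, no restart needed
ALTER SYSTEM SET session_preload_libraries = 'auto_explain';
SELECT pg_reload_conf();

-- load at server start (what pg_cron and friends require); takes effect only after a restart
ALTER SYSTEM SET shared_preload_libraries = 'pg_cron';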

CREATE EXTENSION

When you run CREATE EXTENSION, this basically just runs an extension's SQL script. The script will typically create new SQL objects such as functions, data types, operators and so on.

CREATE EXTENSION looks at the extension's control file, which is installed to the extension directory of sharedir.

$ trunk install pg_jsonschema;

Using pkglibdir: "/var/lib/postgresql/data/tembo/15/lib"
Using sharedir: "/var/lib/postgresql/data/tembo"

[+] pg_jsonschema.so => /var/lib/postgresql/data/tembo/15/lib
[+] extension/pg_jsonschema--0.1.4.sql => /var/lib/postgresql/data/tembo
[+] extension/pg_jsonschema.control => /var/lib/postgresql/data/tembo

sharedir can be located with pg_config --sharedir

$  ls $(pg_config --pkglibdir) | grep pg_jsonschema
pg_jsonschema.so

$ ls $(pg_config --sharedir)/extension | grep pg_jsonschema
pg_jsonschema--0.1.4.sql
pg_jsonschema.control

The information in a control file is used to determine what start up or upgrade scripts to run. We'll cover upgrades in-depth in a future blog, so let's focus on first-time enabling. For example, in the above installation output, we notice a file pg_jsonschema--0.1.4.sql. Postgres knows to run this because the name of the control file matches the name of the script suffixed by the default_version defined in the control file.

$ cat $(pg_config --sharedir)/extension/pg_jsonschema.control
comment = 'JSON schema validation on json and jsonb data types'
default_version = '0.1.4'
module_pathname = '$libdir/pg_jsonschema'
relocatable = false
superuser = true

When running CREATE EXTENSION, the extension name always matches exactly the name of a control file, minus the .control file type.

postgres=# CREATE EXTENSION pg_jsonschema;
CREATE EXTENSION

I mentioned that a start up script creates new SQL, including new functions. For example in the case of pg_jsonschema, the start up script pg_jsonschema--0.1.4.sql includes the following SQL to create a new function called jsonb_matches_schema. Even though we have a library file, we don't need LOAD because CREATE FUNCTION is another way to load code from a file. This is an example of requires LOAD: false, requires CREATE EXTENSION: true.

CREATE FUNCTION ... AS 'obj_file' documentation

obj_file is the name of the shared library file containing the compiled [code]

CREATE FUNCTION "jsonb_matches_schema"(
    "schema" json,
    "instance" jsonb
) RETURNS bool
IMMUTABLE STRICT
LANGUAGE c
AS 'MODULE_PATHNAME', 'jsonb_matches_schema_wrapper';
info

You can always know whether or not an extension requires CREATE EXTENSION by the presence of a control file in $(pg_config --sharedir)/extension

Hooks that require a restart

An extension is in the category requires CREATE EXTENSION: true and requires LOAD: true if the extension has libraries that use hooks which require a restart and it has a control file.

You will be able to identify this is the case when the extension's documentation mentions both CREATE EXTENSION and shared_preload_libraries. Sometimes an error message or hint is provided if you run CREATE EXTENSION before loading the library, or if you try to run LOAD directly, but you can't count on that.

For example, both pg_cron and pg_partman include background workers. These are examples of extensions using hooks in the start up process of Postgres. So, in both of these cases the user is expected to configure shared_preload_libraries to start the background worker, then run CREATE EXTENSION on a cluster where that background worker is already running.

LOAD is needed when there isn't a control file

In the case of auto_explain, it uses hooks that do not require a restart. In this case, there is no control file and no extra SQL objects to be created. So LOAD is required simply because we have to load it into memory somehow. To demonstrate, it is technically possible to make a control file for auto_explain to allow for CREATE EXTENSION behavior instead of LOAD:

auto_explain.control:

comment = 'auto explain'
default_version = '0.0.1'
module_pathname = '$libdir/auto_explain'
relocatable = false
superuser = true

auto_explain--0.0.1.sql

LOAD 'auto_explain';
caution

In practice, do not use LOAD in an extension start up script to activate hooks. LOAD is only applicable for the current connection.

postgres=# CREATE EXTENSION auto_explain;
CREATE EXTENSION

postgres=# \dx
List of installed extensions
Name | Version | Schema | Description
-----------------+---------+------------+------------------------------
auto_explain | 0.0.1 | public | auto explain
plpgsql | 1.0 | pg_catalog | PL/pgSQL procedural language
(2 rows)

postgres=# SET auto_explain.log_min_duration = 0;
SET
postgres=# SET auto_explain.log_analyze = true;
SET

After running the above, now my subsequent queries have their EXPLAIN ANALYZE logged.

So, if that could work, why not just have control files for all extensions?

Having a control file requires version upgrade handling.

When you have a control file, you also have to write upgrade scripts for every new version. In the case of pg_cron, we can find all these files in sharedir. When enabling version 1.5, it will run pg_cron--1.0.sql, then each migration script up to 1.5.

pg_cron--1.0--1.1.sql
pg_cron--1.0.sql
pg_cron--1.1--1.2.sql
pg_cron--1.2--1.3.sql
pg_cron--1.3--1.4.sql
pg_cron--1.4-1--1.5.sql
pg_cron--1.4--1.4-1.sql
pg_cron.control

Since that's not really applicable to auto_explain, because it's just logging output and there is nothing to migrate or handle between versions, it's cleaner to not have a control file. Upgrading auto_explain only involves replacing the library, then loading it again.

info

Upgrade logic is not applicable for extensions that do not require CREATE EXTENSION. These cases just involve re-loading a new version of the library.

You don't load hooks during CREATE EXTENSION

It made sense to me that hooks requiring a restart have to be configured in shared_preload_libraries. But for extensions that do not require a restart, it's not obvious why the hooks can't just be loaded during the CREATE EXTENSION start up script, as I just demonstrated is possible with auto_explain.

Even though it's technically possible to LOAD hooks during CREATE EXTENSION, it's a bad idea.

First of all, when using the LOAD command directly, it's only applicable to the current connection. So, in the above example with auto explain, the queries are only logged in the connection where I ran CREATE EXTENSION. To apply to all connections without a restart, it would need to go into session_preload_libraries. It is technically possible to do that inside of CREATE EXTENSION by running ALTER SYSTEM SET session_preload_libraries and then SELECT pg_reload_conf() in your start up script, but it is not a good approach for CREATE EXTENSION to automatically perform a configuration update. For one, it would confuse a user to change a config on the fly, and secondly there is currently no concept of automatically merging multi-value, comma-separated configurations like session_preload_libraries.

info

The 2x2 matrix makes it easier to understand how to enable an extension.

Just ask yourself "do I need to run CREATE EXTENSION?", determined by the presence of a control file, and "do I need to do a LOAD?", determined by any mention of LOAD, shared_preload_libraries, or session_preload_libraries in the extension's documentation or in an error message.

In all cases of needing a LOAD, you can get away with setting it in shared_preload_libraries. You can optimize to avoid restarts in some cases.
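
If you'd rather answer the first question from SQL, the pg_available_extensions view lists every control file Postgres can see, so an extension showing up there is one that expects CREATE EXTENSION (pg_cron is just the example here):

SELECT name, default_version, installed_version
FROM pg_available_extensions
WHERE name = 'pg_cron';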

Output plugins

There are some extensions, for example wal2json, that require neither CREATE EXTENSION nor LOAD. In all known cases so far, these are output plugins. I think it's more of a stretch to call these 'extensions', but since they provide additional functionality to Postgres, that counts in my book.

In the case of output plugins, the library is loaded when a replication slot is created:

PostgreSQL documentation:

SELECT * FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding', false, true);

Installing extensions

Some of the above examples use the free and open source Trunk project that Tembo created, which allows us to skip the build process. It also installs extension dependencies, and provides metadata about other dependencies. When I'm trying out extensions, I am starting from one of Tembo’s container images to handle the system dependencies installation.

I wrote this guide for trying out extensions locally. If you have any issues just reach out on our community Slack channel and we can help.

Perhaps it's not so bad after all...

files-in-computer

Automate everything

We want to make it possible to automatically install and turn on any Postgres extension. For this reason, we are seeking to qualify all known extensions by these two dimensions: requires CREATE EXTENSION true or false, and requires LOAD true or false.

To enable the community, that metadata is being published on Trunk. On Tembo Cloud, we leverage that information to automatically enable extensions. Currently, we've got this working for over 150 extensions.

Dear experts, tell me how I'm wrong (seriously!)

I'm serious that I want you to tell me where this is incorrect! If you're a Postgres extensions expert, or maybe just know a thing or two about extensions that seems to conflict with something in this blog, please reach out on X @sjmiller609 and let me know. Even if it's just minor correction or subjective information, I'd love to hear from you. I also want to hear if there is an easier mental model than this. I hope this blog can serve as a minimal yet comprehensive explanation of what it takes to get extensions turned on.

Another way to contribute is to click the "Edit this page" link below, and suggest changes. I will happily accept improvements to this blog.

· 6 min read
Samay Sharma

For years, Postgres has been considered among the most popular and advanced open source OLTP databases. Interestingly, 20 years ago, the competition was between Postgres and MySQL, and that debate is still going on today. 🙂 Over the last few years, Postgres added more features and its ecosystem added more extensions, and now the comparison list keeps growing. After adding support for jsonb in 9.4, Postgres began to be compared with MongoDB as users started using Postgres to store unstructured data. With Citus, people started using Postgres over other distributed SQL technologies for use cases such as multi-tenancy and analytics. And this phenomenon continues to grow with pgvector, timescaledb and others.

Use Postgres for Everything

Postgres has evolved to where people have started talking about using “Postgres for everything.” There’s strong agreement that it’s wise to use Postgres until you find a strong reason for why you can’t use it anymore - either a very specialized use case, or scale which can’t be achieved by Postgres. This topic was well-discussed on hacker news thanks to a blog by a friend, Steven Schmidt.

And to clarify, when I say Postgres, I mean the entire Postgres ecosystem - this includes Postgres, its extensions and ecosystem tooling for high-availability, backups, connection pooling, etc. All of these together turn Postgres into a powerful data platform which you can use to run diverse workloads. However, it’s easier said than done.

Making “Postgres for Everything” real

Advanced users may know the recipe which turns Postgres into a distributed database, a vector database, an OLAP database or a search store, but this is still inaccessible to most developers. Over years of working with Postgres users, I’ve realized that even tuning Postgres for OLTP (which is a fairly well known use case) is very hard for most developers. For example, OLTP deployments often have badly chosen indexes, over- or under-provisioned hardware, insufficiently tuned autovacuum, etc., even though they are running on managed services which do the heavy lifting for them.

So, expecting developers to be able to figure out how to use Postgres for all these use cases when there are other databases focused on user experience for one particular use case is unrealistic. Even though developers know that dealing with 5 different databases is a pain in the medium to long run, when they are starting, they care about getting off the ground fast. So, to expect everybody to use Postgres for everything, we need to make it possible for every user to have easy access to the recipes which turn Postgres into the database for the data service of their choice.

Enter Tembo Stacks

Tembo stacks are pre-built use-case-specific Postgres deployments which are optimized and tuned to serve a specific workload. They aim to be a replacement for other databases which you actually don’t need, but you are considering because you don’t know how to solve that problem using Postgres. They help you avoid the pains associated with learning, managing and deploying other database systems. Some examples of Stacks are: OLTP, Message Queue, Data Warehouse, Enterprise LLM and Document Store.

Defining a stack

A stack is a recipe of how to run an optimized Postgres for a workload, expressed as a spec file. The spec includes the following components:

  • Docker Base Image containing a particular version of Postgres
  • Curated set of extensions which turn Postgres into best-in-class for that workload.
  • Hardware profile which is best suited for that workload
  • Postgres configs optimized according to hardware and workload
  • Use-case specific metrics, alerts and recommendations
  • On-instance sidecar: Kubernetes services to deploy a containerized application near Postgres, expanding capabilities while minimizing network latency

Let’s look at an example of a stack spec for a Message queue stack - an SQS / RabbitMQ replacement based on Postgres.

It has the following components:

  • A standard Postgres 15 base image
  • Curated extensions: pgmq, pg_partman and pg_stat_statements
  • Optimized Postgres configs: Aggressive autovacuum settings, reduced random_page_cost and Checkpoint and WAL configs tuned for a high throughput Postgres instance
  • CPU::Memory ratio recommendations for hardware
  • Message queue specific metrics like queue length, oldest message age, newest message age
  • (Coming soon) pg_bouncer

This and many other such stack specifications are open source and can be used to deploy a stack locally on a self-managed instance or fully managed on Tembo Cloud. This is a reflection of one of our core values - to always put developers first. We open source our stacks to create the best development experience: locally and on the cloud, and to invite community feedback and collaboration.

Deploying a stack on Tembo Cloud

Tembo Cloud is a dev-first, fully-extensible, fully-managed, secure, and scalable Postgres service. It has the ability to take this stack spec and deploy a Postgres instance which is built using this stack spec. With Tembo Cloud, you can get all the benefits of a managed service like: backups, high availability, scaling of storage and compute, metrics and alerts with an easy to use UI and CLI. And you get access to an ever growing list of extensions which you can add to your stack or bring your own extensions to deploy on Tembo Cloud.

image

Evolving Postgres to be the data platform for everything

We know we’ve embarked on a challenging journey to turn Postgres into the data platform for Everything. But, we strongly believe that with the power of Postgres and its ecosystem, it’s possible to replace most deployments of esoteric databases with just a flavor of Postgres and save developers a lot of time and effort.

We’ll be building many stacks, benchmarking them against “competitive” solutions, and making sure Postgres grows to tackle these workloads. We’ll have to optimize Postgres, support a wide variety of extensions, write several new extensions to close feature and performance gaps with other databases and also evolve Postgres. But, we have no doubt that we, along with the Postgres community can make this happen!

· 10 min read
Adam Hendel

Database systems use various techniques to ensure transactionality and performance. For Postgres, this is called MVCC (Multi-Version Concurrency Control). MVCC allows Postgres to provide great performance even when multiple clients could be working with the same table concurrently.

It is useful to be aware of Postgres’ MVCC implementation to understand how Postgres manages a table’s physical storage. Internally, Postgres refers to rows as “tuples”. And as a baseline, there are two big ideas to keep in mind about how Postgres implements changes to rows in a table:

  • An UPDATE operation in Postgres is equivalent to a DELETE of the previous tuple, plus an INSERT of the new one.
  • A DELETE operation in Postgres does not cause the data to be removed from physical storage. It only causes it to be marked as deleted.

This is why Postgres has the autovacuum process: it is the automatic process in charge of cleaning up and optimizing table storage for Postgres. You can follow along with this blog post in a local test environment of Postgres. I will demonstrate with code how MVCC and VACUUM work, and how to tune the autovacuum process.

Why does it matter?

Vacuum is not just about cleaning up storage space. In environments where data undergoes constant change, Postgres tables often experience an excessive amount of Inserts, Updates, and Deletes. This activity can lead to table bloat. Table bloat happens when a table’s physical footprint far exceeds the size of the data that it actually holds.

Table bloat is a condition that if not managed, will likely hamper performance of our database. And so, this takes us back to the autovacuum process: to extract its maximum benefits, you may need to fine-tune its settings.

Postgres’s default autovacuum settings are pretty good. If you’re like me, it could have been years into your postgres journey before having a negative experience with bloat. However, when the time came I found it challenging to understand and tune these configuration settings. That’s why we will study them safely in a dev environment.

Unraveling the Mystery of Bloat & Vacuum's Role

PostgreSQL's mechanism for handling bloat is unique due to its adherence to MVCC. Contrary to immediate space reclamation after data is deleted or becomes obsolete, Postgres tags these rows as "dead", or “dead tuples”. However, even though they are dead they still occupy disk space and will degrade the performance of your queries. Many queries will continue to scan through these tuples, despite their “dead” status. The auto-vacuum steps in here, ensuring that these rows are removed and both the table and its associated indexes are streamlined for performance. You can’t scan the dead tuples if they no longer exist!

To illustrate, let us create some bloat and see how it affects query performance:

Create a table we can easily manipulate, and let’s disable autovacuum so we can observe the consequences.

CREATE TABLE bencher (
    record_id bigserial,
    updated_at timestamp with time zone
);
ALTER TABLE bencher SET (autovacuum_enabled = false);
SELECT pg_reload_conf();

Now insert 10 million rows of example data.

INSERT INTO bencher(updated_at)
SELECT now()
FROM generate_series(1, 10000000);

We just created the table, and have only inserted records, so there are currently no dead tuples.

SELECT
schemaname,
relname,
n_dead_tup,
n_live_tup
FROM pg_stat_user_tables
WHERE relname = 'bencher';

schemaname | relname | n_dead_tup | n_live_tup
------------+---------+------------+------------
public | bencher | 0 | 10000000

Let’s see how long it takes to fetch a single row with no bloat.

\timing

select * from bencher where record_id = 5000000;
record_id | updated_at
-----------+-------------------------------
5000000 | 2023-08-30 14:42:05.584778+00
(1 row)

Time: 191.006 ms

Great, 191ms. Slow, yes, but we have no indices because we’re demonstrating bloat. Now let’s create a bunch of dead tuples. This can take a minute or so.

DO $$
DECLARE
    i integer;
BEGIN
    FOR i IN 1..5 LOOP
        UPDATE bencher SET updated_at = now();
    END LOOP;
END $$;

Now, let’s see how long it takes to fetch the same record:

select * from bencher where record_id = 5000000;
record_id | updated_at
-----------+-------------------------------
5000000 | 2023-08-30 14:42:58.964919+00
(1 row)

Time: 283.728 ms

It’s getting closer to 300ms now. Let’s check how many dead tuples are on the table.

SELECT
schemaname,
relname,
n_dead_tup,
n_live_tup
FROM pg_stat_user_tables
WHERE relname = 'bencher';


schemaname | relname | n_dead_tup | n_live_tup
------------+---------+------------+------------
public | bencher | 50000000 | 10000000

Now let’s manually clean up the dead tuples and restore our query performance.

VACUUM FULL bencher;

And check the dead tuple count, there are no dead tuples.

SELECT
schemaname,
relname,
n_dead_tup,
n_live_tup
FROM pg_stat_user_tables
WHERE relname = 'bencher';
schemaname | relname | n_dead_tup | n_live_tup
------------+---------+------------+------------
public | bencher | 0 | 10000000

Finally, let’s retrieve the record. Performance restored!

select * from bencher where record_id = 500000;
record_id | updated_at
-----------+-------------------------------
500000 | 2023-08-30 14:42:58.964919+00
(1 row)

Time: 194.101 ms

Setting up a benchmarking environment

The rest of the examples will be run on Postgres in Tembo Cloud. We’ll use 8 vCores and 16GB of memory, and execute all the psql and pgbench commands from an EC2 instance within the same region as Postgres.

Let’s set up a script that will create an absurdly large amount of churn on our table and be able to execute it with pgbench. For every iteration, let’s insert a row to our “bencher” table. Then, let’s read and update a single record. Finally, let’s delete the same record. This will create a situation similar to many queue implementations (like PGMQ), where there are at least 2 transactions for every 1 insert. Additionally, the total record count on the table will typically be low - for every record we insert, we also delete one.

This creates a situation where a table consists of primarily dead tuples!

-- churn.sql
DO $$
DECLARE
    rec_id INT;
BEGIN
    INSERT INTO bencher(updated_at)
    SELECT now();

    -- read and update a row
    WITH cte AS
    (
        SELECT record_id
        FROM bencher
        WHERE updated_at < now()
        ORDER BY record_id ASC
        LIMIT 1
        FOR UPDATE SKIP LOCKED
    )
    UPDATE bencher
    SET
        updated_at = now()
    WHERE record_id in (select record_id from cte)
    RETURNING record_id INTO rec_id;

    -- Delete the row with the returned ID
    DELETE
    FROM bencher
    WHERE record_id = rec_id;
END $$;

Set Postgres to all the default vacuum configurations:

ALTER SYSTEM SET autovacuum_vacuum_scale_factor = 0.2;
ALTER SYSTEM SET autovacuum_vacuum_cost_delay = '20ms';
ALTER SYSTEM SET autovacuum_vacuum_cost_limit = '-1';
ALTER SYSTEM SET vacuum_cost_limit = 200;
ALTER SYSTEM SET autovacuum_naptime = '1min';
SELECT pg_reload_conf();

Let’s run a benchmark to get a baseline. We will reuse this benchmark through the process.

pgbench 'postgresql://postgres:pw@host:5432' -c 100 -j 1 -P 1 -T 300 -r -f churn.sql

default

Average latency is about 3.4 ms. We are benchmarking an expensive set of queries, and you’ve probably noticed the sawtooth pattern in the plot and a high standard deviation relative to the latency. This is a symptom of bloat accumulating on our table. Query latency grows until the vacuum process clears dead tuples, and then grows once again. This also has an inverse impact on transactions per second (TPS). Ideally we can reduce and provide some stability to latency.

Balancing Vacuum Delay for Optimal System Performance

Vacuuming is indispensable. However, it is not free and if left unchecked, it can burden your system. The balance lies in autovacuum_vacuum_cost_delay and autovacuum_vacuum_cost_limit. autovacuum_vacuum_cost_delay is the amount of time that the autovacuum process will halt processing when the autovacuum_vacuum_cost_limit is reached. Imagine this series of events - a table reaches 10% bloat, meaning 10% of the tuples are dead. When the 10% threshold is reached, the autovacuum worker begins to work and starts accruing cost. When that cost reaches autovacuum_vacuum_cost_limit, it will pause for the duration specified by autovacuum_vacuum_cost_delay, and then continue working until it is complete.
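
Before modifying these, it can help to check what they are currently set to; one quick way to do that (these are the same settings we reset to their defaults earlier):

SELECT name, setting, unit
FROM pg_settings
WHERE name IN ('autovacuum_vacuum_scale_factor',
               'autovacuum_vacuum_cost_delay',
               'autovacuum_vacuum_cost_limit',
               'autovacuum_naptime');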

Modifying these can craft the perfect balance between seamless vacuuming and system efficiency. Let’s increase the cost limit to the max, and reduce the delay by half. This will let the autovacuum process run longer and pause for a shorter period of time when it does reach the cost limit, to ideally reduce bloat faster and reduce query latency.

ALTER SYSTEM SET autovacuum_vacuum_cost_delay = '10ms';
ALTER SYSTEM SET autovacuum_vacuum_cost_limit = 10000;
SELECT pg_reload_conf();

delay_cost_limit

We have a slight reduction in average latency, but we can still see the obvious growth in latency over time and decrease in TPS. It clears roughly every 60 seconds.

Fine-Tuning Auto Vacuum Scale Factors

In the previous example, we manually vacuumed our table. But Postgres gives us an automated way to configure the vacuum process. One of the most critical parameters is autovacuum_vacuum_scale_factor; it denotes the portion of the table size that, when surpassed by "dead" rows, prompts a vacuum action. For tables that see frequent data changes, it might be beneficial to lessen this value.

ALTER SYSTEM SET autovacuum_vacuum_scale_factor = 0.1;
SELECT pg_reload_conf();

scale-factor

Reducing the scale factor had minimal impact on our result, so allowing the autovacuum to trigger sooner did not help. We can see that the period of the sawtooth pattern is still about 60 seconds, which means we are probably limited by autovacuum_naptime, which we'll talk about next.

A Quick Siesta for Your System

The autovacuum_naptime parameter in Postgres specifies the minimum delay between autovacuum runs on any given database. The default (which we set earlier) is 1min. Generally, depending on just how high-churn your workloads are, it might be necessary to decrease this value, whereas a longer interval could be suited for environments that are not churning at such a high rate. But our table has a crazy amount of churn.

We want to reduce the height of the latency peaks. One way to do this is to make the vacuum more aggressive and tell it to run sooner. We tried to influence that by setting the autovacuum_vacuum_scale_factor, but we can also lower the autovacuum_naptime value, which will also allow it to run sooner. Let’s cut it in half.

ALTER SYSTEM SET autovacuum_naptime = '30s';
SELECT pg_reload_conf();

naptime

Allowing the autovacuum to run more frequently reduced our average latency and increased TPS. However, we’re still seeing a noticeable sawtooth pattern and a high standard deviation of latency. Let’s completely disable the cost limitations on the vacuum process and let it have as much compute as it needs.

ALTER SYSTEM SET autovacuum_vacuum_cost_delay = '0';
SELECT pg_reload_conf();

disable-cost

Finally, reduce naptime to 10s:

ALTER SYSTEM SET autovacuum_naptime = '10s';
SELECT pg_reload_conf();

naptime-final

Overall, we’ve iterated on autovacuum settings and reduced the average latency from 3.4ms to 2.8ms and stddev from 0.8ms to 0.7ms, which helped increase TPS from 4.3k to about 5.3k.

Configuring the autovacuum settings can be a lot of fun, and the appropriate values are wildly dependent on the workload. We covered the absurdly high churn use case on a single table today, which is very similar to what we see when running applications using PGMQ. Vacuum is complicated and can be tuned differently when considering multiple tables with different workloads. Other OLTP use cases will call for different settings, and OLAP workloads may be less influenced by the vacuum settings. Follow us, and sign up for Tembo Cloud because we will surely be writing about these other topics soon.

More on this topic

Watch the video on Optimizing autovacuum: PostgreSQL’s vacuum cleaner by Samay Sharma.

· 5 min read
Binidxaba

In my recent search for something interesting to do with Rust, I discovered that people write postgres extensions using pgrx.

I found that very cool, and while looking for some real-world examples to study and dissect, I came across pgmq: "A lightweight message queue. Like AWS SQS and RSMQ but on Postgres." So, I decided to give it a shot so that perhaps in the future, I can contribute to the project ;)

When reviewing the repository, I noticed a Python client to interact with pgmq and began to play with it.

Let me quickly describe how easy it was for me to use it.

Setting up the environment

The first step was to start Postgres with a docker container. You can check the README for detailed instructions, but in summary, just run:

docker run -d --name postgres -e POSTGRES_PASSWORD=postgres -p 5432:5432 quay.io/tembo/pgmq-pg:latest

A quick test to make sure that Postgres is running:

psql postgres://postgres:postgres@0.0.0.0:5432/postgres

After that, I simply installed the pgmq Python client in my virtual environment:

pip install tembo-pgmq-python

That's all the setup that I needed.

The PGMQueue class

To use the library, we need to instantiate a PGMQueue object and from that object we can call the methods described in the following table:

SQL function                       | PGMQueue method                                                     | Description
-----------------------------------+---------------------------------------------------------------------+----------------------------------------------------------------------
pgmq_create(queue)                 | create_queue(self, queue: str)                                      | Creates a new queue with the name queue.
pgmq_send(queue, message)          | send(self, queue: str, message: dict, delay: Optional[int] = None)  | Appends a message to the queue.
pgmq_read(queue, vt, num_messages) | read(self, queue: str, vt: Optional[int] = None)                    | Reads num_messages from queue and sets the visibility timeout to vt.
pgmq_archive(queue, msg_id)        | archive(self, queue: str, msg_id: int)                              | Archives the message with msg_id.
pgmq_pop(queue)                    | pop(self, queue: str)                                               | Pop the next message in the queue.
pgmq_delete(queue, msg_id)         | delete(self, queue: str, msg_id: int)                               | Deletes the message with msg_id from the queue.
pgmq_drop_queue(queue)             | Not available yet                                                   | Drops the queue.

Next, let me show you how to implement a simple producer/consumer setup using the methods above.

Implementing a Producer

In summary, the required steps are:

  1. Import the Message and PGMQueue classes.
  2. Instantiate a PGMQueue object.
  3. Create a queue.
  4. Send N messages via the queue in a loop.
from tembo_pgmq_python import Message, PGMQueue

queue = PGMQueue(**connection_info)
queue.create_queue(test_queue)
...

for x in range(num_messages):
    ...
    msg_id = queue.send(test_queue, test_message)
    ...

The PGMQueue constructor is the one that receives the connection information. For the Postgres instance initiated by the docker container, the connection details are:

queue = PGMQueue(host="localhost",
                 port=5432,
                 username="postgres",
                 password="postgres",
                 database="postgres")

Implementing a Consumer

In short, the code should basically: import the Message and PGMQueue classes, then consume the messages from the queue in a loop.

from tembo_pgmq_python import Message, PGMQueue

...
queue = PGMQueue(**connection_info)
...

while True:
    ...
    message: Message = queue.pop(queue_name)
    ...

Harnessing Producer and Consumer

For simplicity, I used a simple shell script to initiate my experiment:

#!/bin/bash

# Spawn one producer
python3 producer.py > /tmp/producer.out &
sleep 2


# Spawn 5 consumers
for i in $(seq 1 5)
do
python3 consumer.py > /tmp/consumer_${i}.out &
done

# Wait for everyone to finish
wait

The script basically starts 1 producer and 5 consumers in the background. The output is saved in the /tmp directory.

And that was it...

From this point, you can explore the other available methods.

Some final words...

It was a pleasant surprise how easy it was to create this example: only a couple of shell commands and a couple of short Python scripts. The PGMQueue methods were very intuitive and straightforward. Personally, my next step is to understand how it works internally. But that's a topic for the future :)

I invite everyone to explore this project at: https://github.com/tembo-io/pgmq. Give it a star and also check out the other available clients for Rust and Go.

Appendix

Here is the complete code if you want to give it a try (or see it in this repository):

"""
This is the Producer's code
"""

import random
import string
from tembo_pgmq_python import Message, PGMQueue


if __name__ == '__main__':
    host = "localhost"
    port = 5432
    username = "postgres"
    password = "postgres"
    database = "postgres"

    num_messages = 100000
    test_queue = "bench_queue_sample"

    queue = PGMQueue(host=host,
                     port=port,
                     username=username,
                     password=password,
                     database=database)

    try:
        # queue.drop_queue(test_queue)
        queue.create_queue(test_queue)

        for x in range(num_messages):
            payload = ''.join(random.choices(string.ascii_uppercase + string.digits, k=10))
            msg = {"payload": payload}
            msg_id = queue.send(test_queue, msg)

            if (x + 1) % 1000 == 0:
                print("Sent {} messages".format(x + 1))

    except Exception as e:
        print(f"{e}")

"""
This is the Consumer's code
"""

import random
import time
from tembo_pgmq_python import Message, PGMQueue


if __name__ == '__main__':
    host = "localhost"
    port = 5432
    username = "postgres"
    password = "postgres"
    database = "postgres"

    no_message_timeout = 0
    test_queue = "bench_queue_sample"

    queue = PGMQueue(host=host, port=port, username=username, password=password, database=database)

    while no_message_timeout < 5:
        try:
            message: Message = queue.pop(test_queue)  # type: ignore
            print("Consumed message: {}".format(message.message["payload"]))
            no_message_timeout = 0

        except IndexError:
            no_message_timeout += 1
            print(f"No more messages for {no_message_timeout} consecutive reads")
            time.sleep(0.500)

· 5 min read
Adam Hendel

We’re working on asynchronous query execution in Postgres and have packaged the work up in an extension we’re calling pg_later. If you’ve used Snowflake’s asynchronous queries, then you might already be familiar with this type of feature. Submit your queries to Postgres now, and come back later and get the query’s results.

Visit pg_later’s Github repository and give it a star!

elephant-tasker

Why async queries?

Imagine that you’ve initiated a long-running maintenance job. You step away while it is executing, only to come back and discover it was interrupted hours ago due to your laptop shutting down. You don’t want this to happen again, so you spend some time googling or asking your favorite LLM how to run the command in the background with screen or tmux. Having asynchronous query support from the beginning would have saved you a bunch of time and headache!

Asynchronous processing is a useful development pattern in software engineering. It has advantages such as improved resource utilization, and unblocking of the main execution thread.

Some examples where async querying can be useful are:

  • For a DBA running ad-hoc maintenance.
  • Developing in interactive environments such as a Jupyter notebook. Rather than submit a long-running query only to have your notebook hang idly and then crash, you can use asynchronous tasks to avoid blocking your Notebook, or simply come back later to check on your task.
  • Having a long-running analytical query, for example fulfilling an ad-hoc request like seeing how many new users signed up each day for the past month. You can submit that query and have it run in the background while you continue other work.

Extending Postgres with async features

At Tembo, we’ve built a similar feature for Postgres and published it as an extension called pg_later. With pg_later, you can dispatch a query to your Postgres database and, rather than waiting for the results, your program can return and retrieve the results at your convenience.

A common example is manually executing VACUUM on a table. Typically one might execute VACUUM in one session, and then use another session to check the status of the VACUUM job via pg_stat_progress_vacuum. pg_later gives you the power to do that in a single session. You can use it to queue up any long-running analytical or administrative task on your Postgres database.
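
For instance, using the pglater.exec API shown below, queuing that maintenance job could look something like this sketch (my_table is a placeholder):

select pglater.exec('vacuum analyze my_table') as job_id;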

Stacking Postgres Extensions

pg_later is built on top of PGMQ, another one of Tembo's open source extensions. Once a user submits a query, pg_later seamlessly enqueues the request in a Postgres-managed message queue. This mechanism then processes the query asynchronously, ensuring no unnecessary wait times or hold-ups.

The pg_later background worker picks up the query from the queue and executes it. The results are persisted by being written to a table as JSONB and can be easily retrieved using the pg_later API. You can simply reference the unique job id given upon query submission, and retrieve the result set, or query the table directly. By default, the results are retained forever, however we are building retention policies as a feature into pg_later.

diagram

Using pg_later

To get started, check out our project’s README for a guide on installing the extension.

First, you need to initialize the extension. This handles the management of PGMQ objects like a job queue and some metadata tables.

select pglater.init();

You're now set to dispatch your queries. Submit the query using pglater.exec, and be sure to take note of the job_id that is returned. In this case, it’s the first job so the job_id is 1.

select pglater.exec(
'select * from pg_available_extensions limit 2'
) as job_id;
 job_id 
--------
1
(1 row)

And whenever you're ready, your results are a query away:

select pglater.fetch_results(1);
{
    "query": "select * from pg_available_extensions limit 2",
    "job_id": 1,
    "result": [
        {
            "name": "pg_later",
            "comment": "pg_later: Run queries now and get results later",
            "default_version": "0.0.6",
            "installed_version": "0.0.6"
        },
        {
            "name": "pgmq",
            "comment": "Distributed message queues",
            "default_version": "0.10.1",
            "installed_version": "0.10.1"
        }
    ],
    "status": "success"
}

Up next

pg_later is a new project and still under development. A few features that we are excited to build:

  • Status and progress of in-flight queries
  • Security and permission models for submitted queries
  • Cursor support for finished jobs (fetch results row by row)
  • Kill a query that is in the queue or is currently in flight
  • Support for transactions
  • Configurable concurrency levels for background workers to increase the throughput of jobs
  • Push notifications for completed and failed jobs
  • Retention policies for completed jobs

Give us a star and try out pg_later by running the example in the README. If you run into issues, please create an issue. We would greatly welcome contributions to the project as well.

Please stay tuned for a follow up post on benchmarking PGMQ vs leading open source message queues.

· 5 min read
Adam Hendel

We’ve released PGMQ, a packaged extension for message queues on Postgres.

People have been implementing queues on Postgres in many different ways and we’re excited about combining lessons learned from those projects into a simple, feature-rich extension.

Some exciting features of the project include:

  • Guaranteed exactly-once delivery of messages within a visibility timeout
  • Optional archival of messages for replayability and retention
  • Familiar SQL interface
  • Single and Batch read of messages
  • Client SDKs in both Rust and Python for an ORM-like feel

The need for message queues

Message queues are a very common architectural feature to manage operational pipelines, particularly within the context of asynchronous tasks and distributed systems. There are products in the market that support message queues (Kafka, RSMQ, RabbitMQ, SQS); however, when adopting one of these technologies, you increase your cognitive load, required skills and production support overhead.

Building your queues on Postgres

As a Postgres startup, we had the same issue, and we decided to build our own Message Queue based on Postgres. We are not the first: others have implemented queues on Postgres, and many have written about it including Dagster, CrunchyData, and PostgresFM dedicated an entire podcast episode to them.

At Tembo, we needed a job queue to manage tasks between our control-plane and data-plane in our managed cloud offering. Our control-plane publishes tasks such as create postgres cluster and update postgres cluster.

To keep our architecture simple and reduce technology sprawl, we built a Postgres extension so that we could run queues for our cloud and more easily share the implementation with the community.

Queues implemented with best practices

PGMQ was implemented on Postgres and follows industry best practices. One of the most important is the use of Postgres's FOR UPDATE ... SKIP LOCKED (similar in spirit to NOWAIT in other databases): the row lock from FOR UPDATE ensures a message is not read by two consumers at once, while SKIP LOCKED lets consumers skip past locked rows instead of blocking on them, so they never hang. PGMQ also supports partitioning, which is particularly beneficial for large queues and can be used to efficiently archive or expire old messages.
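
To make that concrete, here is a simplified sketch of the kind of query this pattern boils down to. It is not PGMQ's actual internal implementation, and the table name my_queue_table is hypothetical; it only illustrates how FOR UPDATE SKIP LOCKED and a visibility-timeout column (vt) work together.

-- Sketch only: claim the next visible message without blocking on rows
-- that other consumers have already locked.
UPDATE my_queue_table
SET vt = now() + interval '30 seconds',  -- push the visibility timeout forward
    read_ct = read_ct + 1                -- count how many times the message has been read
WHERE msg_id = (
    SELECT msg_id
    FROM my_queue_table
    WHERE vt <= now()                    -- only messages that are currently visible
    ORDER BY msg_id
    LIMIT 1
    FOR UPDATE SKIP LOCKED               -- skip rows locked by other consumers
)
RETURNING msg_id, read_ct, vt, enqueued_at, message;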

PGMQ also provides exactly once delivery semantics within a visibility timeout. Similar to Amazon’s SQS and RSMQ, PGMQ consumers set the period of time during which Postgres will prevent all consumers from receiving and processing a message. This is done by the consumer on read, and once the visibility timeout expires the message becomes available for consumption once again. That way, if a consumer crashes, there is no data loss. This effectively means at-least-once delivery semantics once the first visibility timeout has expired.

vt

Using PGMQ

To get started, check out our project’s README for a guide on installing the extension.
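
Assuming the extension is installed on your server as described in the README (check it for any prerequisites), enabling it in a database is the standard step:

CREATE EXTENSION pgmq;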

You can create a new queue by simply calling

SELECT pgmq_create('my_queue');

Then, use pgmq_send() to send a couple of messages to the queue. The message id is returned by the send function.

SELECT * from pgmq_send('my_queue', '{"foo": "bar1"}');
SELECT * from pgmq_send('my_queue', '{"foo": "bar2"}');
 pgmq_send
-----------
1
(1 row)

pgmq_send
-----------
2
(1 row)

Read 2 messages from the queue. Make them invisible for 30 seconds. If the messages are not deleted or archived within 30 seconds, they will become visible again and can be read by another consumer.

SELECT * from pgmq_read('my_queue', 30, 2);
 msg_id | read_ct |              vt               |          enqueued_at          |    message
--------+---------+-------------------------------+-------------------------------+---------------
1 | 1 | 2023-02-07 04:56:00.650342-06 | 2023-02-07 04:54:51.530818-06 | {"foo":"bar1"}
2 | 1 | 2023-02-07 04:56:00.650342-06 | 2023-02-07 04:54:51.530818-06 | {"foo":"bar2"}

If the queue is empty, or if all messages are currently invisible, no rows will be returned.

SELECT * from pgmq_read('my_queue', 30, 1);
 msg_id | read_ct | vt | enqueued_at | message
--------+---------+----+-------------+---------

Archiving removes the message from the queue and inserts it into the queue’s archive table. This gives you an opt-in retention mechanism for messages, and it is an excellent way to debug applications.

Archive the message with id 2.

SELECT * from pgmq_archive('my_queue', 2);

Then inspect the message in the archive table.

SELECT * from pgmq_my_queue_archive;
 msg_id | read_ct |         enqueued_at          |          deleted_at           |              vt               |     message     
--------+---------+------------------------------+-------------------------------+-------------------------------+-----------------
      2 |       1 | 2023-04-25 00:55:40.68417-05 | 2023-04-25 00:56:35.937594-05 | 2023-04-25 00:56:20.532012-05 | {"foo": "bar2"}

Alternatively, you can delete a message forever.

SELECT * from pgmq_send('my_queue', '{"foo": "bar3"}');
 pgmq_send
-----------
3
(1 row)
SELECT pgmq_delete('my_queue', 3);
 pgmq_delete
-------------
t
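
Putting the pieces together, a consumer's poll cycle is simply: read a message with a visibility timeout, do the work, and then delete (or archive) it so it is never processed again. A minimal sketch using only the functions shown above:

-- Claim a message for 30 seconds
SELECT * from pgmq_read('my_queue', 30, 1);

-- ... process the message in your application ...

-- Acknowledge it using the msg_id returned by the read
-- (or use pgmq_archive to keep a copy in the archive table)
SELECT pgmq_delete('my_queue', 1);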

Getting involved

Give us a star and try out PGMQ by cloning the repo and following the example in the README. Please use GitHub issues if you run into any problems or have any feedback.

We’ve also built client side libraries in Rust and Python, which will give you an ORM-like experience.

You can also try PGMQ on Tembo Cloud as part of our Message Queue Stack. Tembo Cloud’s Message Queue Stack is powered by PGMQ, but also ships with Postgres configurations optimized for message queue workloads. We’re also working on adding metrics and data visualizations specific to message queues.

Interested in learning more?

Stay tuned for our upcoming post on pg_later, an extension we built on top of PGMQ, as well as benchmarks comparing PGMQ to SQS and Redis.

· 6 min read
Ry Walker

tembo brand

The Database Market

The global database market is expected to grow to $100+ billion in the coming years due to enterprises generating, storing, and leveraging more data, and the need for applications to operate at a global scale.

Enterprises typically store data across various databases, grouped into transactional and analytical systems. There's roughly 10x more transactional than analytical data, mostly in Oracle, MySQL, and Postgres.

  1. The first major shift in the data stack involved analytical workloads moving to the cloud, triggering the big data era and the rise of platforms like Snowflake and Databricks.
  2. The second shift involves transactional workloads moving to streaming and real-time data, requiring hybrid transactional and analytical processing platforms that are managed by application developers, not database experts.
  3. The third shift entails abstracting application infrastructure, allowing developers to build and scale applications efficiently. However, while services like Vercel and Netlify streamline the software development lifecycle, they focus more on building on top of databases rather than the databases themselves.

Postgres

Postgres, the world's favorite database with millions of deployments, features a liberal OSS license and a large community. It efficiently manages SQL and JSON queries across diverse workloads due to its growing, decade-old ecosystem of add-ons and extensions.

Postgres is popular for its open-source, standards-compliant, extensible nature, and ACID compliance, making it a reliable, cost-effective system. It handles low latency, high throughput analytical cases, offering HTAP-lite capabilities through window functions and foreign data wrappers.

Its extensibility resulted in numerous add-ons and plugins for GIS data, image processing, and more, with some extensions evolving into companies like CitusDB and Timescale. The extension ecosystem plays a crucial role in Postgres's growth and self-managed usage.

Postgres is most admired and desired

Source: Stack Overflow Developer Survey 2023

Problem

Companies are hesitant to adopt new databases due to cost and complexity. The need to adapt to new architectures, configurations, and optimizations often makes the value of a transition negligible. Hence, costly Oracle instances remain in use for critical applications.

Postgres, favored by DBAs and Data Engineers, is widely adopted for transactional systems. However, deploying and managing it is complicated beyond its basic use case.

To create a self-managed Postgres cluster, DBAs have to consider infrastructure, environment, security, data management, backups, and workload-specific tuning. Further, maintaining and scaling Postgres involves meeting high availability requirements, managing data storage, updating schemas, optimizing query performance, and managing failover protection and caching. Lastly, extensions exist to support additional functionality in Postgres but they are hard to discover, evaluate, certify and deploy.

Our Vision

Many developers aren't served by current commercial Postgres solutions due to migration costs, restrictions on the open-source ecosystem, complexity, and a lack of developer-first and use-case-first focus.

Tembo aims to enhance developers' Postgres experience by allowing full functionality, including custom extensions, and promoting developer-first workflows. It simplifies deploying Postgres with a virtualized runtime experience, enabling one-click migrations and access to the Postgres ecosystem.

Developers can control the data model lifecycle, and deploy to multiple zones and clouds. Advanced options will include autoscaling, hybrid transactional and analytical processing (HTAP), and edge caching.

Additionally, Tembo invests in the Postgres extension ecosystem, aiming to standardize and simplify the use and creation of extensions. By unbundling and decoupling the database into services and abstraction layers, Tembo enables new simplicity and capability.

Product

We are productizing Postgres and its extended OSS ecosystem of add-ons and extensions into one grand managed cloud offering.

Tembo Cloud

We are building a dev-first, fully-extensible, fully-managed, secure, and scalable Postgres service. Available on all clouds and bare metal providers, Tembo Cloud provides the largest library of easily-installed extensions and “flavored Postgres” Tembo Stacks, allowing our customers to expand their use cases of Postgres.

Org home

Tembo Stacks

“Postgres for Everything,” delivered as highly polished, “flavored” Postgres: Tembo Stacks. We help teams avoid adopting new databases, and the pains that come with them, reducing database sprawl.

| Name | Components | “North Star” Competition |
| --- | --- | --- |
| Tembo OLTP | Feature-Rich, Optimized Postgres | Other managed Postgres providers |
| Tembo Vector | Optimized Postgres + pgVector/pg_embedding | Pinecone, ChromaDB |
| Tembo Data Warehouse | Optimized Postgres + columnar | Snowflake |
| Tembo Documents | Optimized Postgres + FerretDB | MongoDB, AWS DocumentDB |
| Tembo ML | Optimized Postgres + PostgresML | MindsDB |
| Tembo Messaging | Optimized Postgres + pgmq | Redis, SQS, RabbitMQ |
| Tembo Search | Optimized Postgres + ZomboDB + Text Search Extensions | Elastic |
| Tembo Timeseries | Optimized Postgres + Timescale | InfluxDB |

Create cluster

What is a Tembo Stack?

Stacks are pre-built, use case-specific configurations of Postgres, enabling you to quickly deploy specialized data services that can replace external, non-Postgres data services. They include:

  • Docker Base Image (Postgres Container)
  • Pre-Installed Extensions
  • Stack-Specific additional metrics, alerts + recommendations
  • Stack-Optimized Postgres Configs
  • Dynamic Config Engine
  • Sidecars (Kubernetes Services - i.e. nearby workloads)
  • Infrastructure and hardware configuration options - things like the HA setup, pgbouncer/pgcat (server-side connection pooling), etc.
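
As a purely hypothetical illustration of what “Stack-Optimized Postgres Configs” can mean in practice (these are not Tembo's actual settings), a message-queue-oriented Stack might ship tuning along these lines:

-- Hypothetical example only; real Stack configs are tuned per workload.
ALTER SYSTEM SET autovacuum_vacuum_scale_factor = 0.05;  -- vacuum high-churn queue tables more aggressively
ALTER SYSTEM SET random_page_cost = 1.1;                 -- assume SSD-backed storage
SELECT pg_reload_conf();                                 -- both settings above apply without a restart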

Users

Initial users for our platform are application developers using or intending to use self-managed Postgres deployments, specifically those interested in maintaining the native open-source version. Tembo replaces self-managed Postgres clusters in the cloud or on-prem.

Our initial users are interested in one of two key propositions:

  1. Developer-first workflows tailored to specific use-cases, facilitating complete data model lifecycle control.
  2. Prioritizing the expansive Postgres extension ecosystem for add-on integration and usage.

Fast database deployment and automatic migration make for a smooth deployment experience and high developer satisfaction, which leads to swift value realization. We plan to stay aligned with the open-source principles of the Postgres community.

The channel of distribution is a fully-managed cloud service, providing enterprise-level hybrid multi-cloud solutions based on Postgres and its open-source ecosystem.

Bottom Line

Postgres is the most universally loved database for developers, teams, and enterprises of all sizes. Tembo can become the new de facto standard for deploying and managing databases for developers and companies of all sizes, changing the paradigm of how developers interact with and build databases, and unlocking the power of Postgres for all developers.

· 3 min read
Ry Walker

pink nodes

Hello, world :) We are thrilled to announce the launch of our new startup, Tembo, and our mission to build a game-changing managed Postgres service.

Background

While wrapping up my involvement in my previous company, I learned about the rich Postgres extension ecosystem and was drawn to an idea: these extensions represent a big opportunity to expand the use cases for Postgres, maybe even to collapse the modern data stack into a Postgres-centric platform and eliminate a lot of unnecessary and expensive data movement.

The vast majority of developers and companies primarily use "Vanilla" Postgres. This is partly because the current user experience of discovering, installing, and gaining proficiency in Postgres extensions is subpar, especially compared to other ecosystems, as many community members have voiced. As a result, most existing managed services offer only a sliver of the potential value that extensions could provide.

We've started a company

Tembo’s mission is to continuously improve the developer experience of Postgres, especially for the expanded use cases that rely on community extensions. Postgres is the most loved database by a long shot, and all Postgres users (including me) owe a huge debt of gratitude to the companies and people who have invested in Postgres over the past 26 years. Our goal is to serve the Postgres community in an increasingly valuable manner: to empower developers and businesses worldwide with the full power of Postgres and its extension ecosystem, enabling them to use Postgres for a growing number of use cases.

We've raised some money

We have raised a $6.5M seed round led by Venrock, with participation from CincyTech, Wireframe Ventures, and other influential tech angel investors, and we already have a team of 10 working on our SaaS product and related open-source projects.

We've been building a product

We are building what we hope will be a world-class managed Postgres service:

  • Use any extension, even custom or private extensions
  • Provide UI for popular extensions, and a framework to include the community in those efforts
  • Stacks - packaging of extensions, tools and configurations, to make it easy to get started using Postgres for specific new use cases
  • An intelligent advisor that helps you to better use the features that you’re already using in Postgres

create cluster

org home

The Tembo team is incredibly excited to share our platform with you as we enter our closed beta period, and support your journey towards data-driven success.

We invite you to try Tembo Cloud, to access the future of database management.