
Build the thing you wish you had right now. That’s the story behind Hacking Postgres, Ep 9, where Ry sits down with Bertrand Drouvot of AWS Cloud. Today they talked about the struggles of trying to build from scratch vs having a template, what should be core vs an extension, and being able to see what’s really happening inside your database.

Watch below, or listen on Apple/Spotify (or your podcast platform of choice). Special thanks to Bertrand for joining us today!

Want to know more about something they mentioned? Here’s a starting point:

Did you enjoy the episode? Have ideas for someone else we should invite? Let us know your thoughts on X at @tembo_io or share them with the team in our Slack Community.

Transcript

[00:00:12] - Ry

Welcome to Hacking Postgres, a podcast about extensions, extension creators, and other interesting, innovative things in the world of Postgres. I'm Ry Walker, founder of Tembo, a managed Postgres company. And today I'm here with Bertrand Drouvot. And I know I said it poorly, but Bertrand, welcome to the show.

[00:00:34] - Bertrand

Thanks for having me. And no, that was perfectly said.

[00:00:39] - Ry

Yeah, I'd love to start with if you could give us a quick background. Where'd you grow up? I'm guessing it's France.

[00:00:49] - Bertrand

Yes.

[00:00:51] - Ry

What were you doing before you started working on Postgres? I'm always curious to hear background.

[00:00:57] - Bertrand

Yeah. So as you can hear, I am French. I grew up in the east of France. Before doing Postgres, I did about 20 years as an Oracle DBA, as a freelancer or working in some companies. Then I switched to Postgres about five years ago, and I have been doing it as a full-time job for about four years now.

[00:01:27] - Ry

Awesome. Yeah. Do you remember when you first started using Postgres? I mean, you probably touched it earlier in your career, or was it really like a first touch just five years ago?

[00:01:38] - Bertrand

Yeah, it was really a first touch, like five years ago. Before that it was mainly Oracle, but I wanted to see something else and wanted to contribute one way or another, not just use a black box. That's why I moved to Postgres.

[00:01:56] - Ry

Do you like Oracle better than Postgres?

[00:02:00] - Bertrand

It's not the same, but I prefer Postgres because I can be involved in the new features, bug fixes, and all this kind of stuff.

[00:02:10] - Ry

Yeah. Awesome. Let's talk about some of your Postgres work. I know you've worked on pgsentinel, but yeah, I'd love to know what you've built in the past and then maybe what you're working on now. But we can start with the past.

[00:02:26] - Bertrand

Yeah. So when I started to study Postgres, I was comparing it with what I knew from the Oracle world, and something like active session history was not there. So I thought, let's try to build this, and that is how pgsentinel came up. It's an extension that records active sessions: it regularly pulls active sessions from pg_stat_activity. You can define the sampling period, as well as the number of records to retain in the in-memory ring buffer. And before PG 14, the query ID was not present in pg_stat_activity, so I also had to make use of, I think, the [inaudible 00:03:18] analyze hook to get the query ID and expose it, so that in every pgsentinel record you see what the session was doing and which query ID was associated at the time of the sampling.
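
For readers who want to try this, here is a minimal sketch of the setup and a first query. The parameter, view, and column names below are taken from the pgsentinel README as best we recall them, so treat them as assumptions to verify against the project:

```sql
-- postgresql.conf: pgsentinel must be preloaded, and it builds on
-- pg_stat_statements for query IDs (restart required):
--   shared_preload_libraries = 'pg_stat_statements,pgsentinel'
--   pgsentinel_ash.sampling_period = 1    -- sample pg_stat_activity every second
--   pgsentinel_ash.max_entries = 10000    -- size of the in-memory ring buffer

CREATE EXTENSION pgsentinel;

-- Recent samples: who was active, waiting on what, and which query ID
SELECT ash_time, pid, state, wait_event_type, wait_event, queryid
FROM pg_active_session_history
ORDER BY ash_time DESC
LIMIT 20;
```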

[00:03:33] - Ry

It's interesting. We're literally building dashboards right now, and I think we need pgsentinel, because I want to know the history of the sessions. Does it get exposed? I don't know. Are people getting this data pushed out? Because we consume all these metrics from Postgres via Prometheus, I wonder. You said it's stored in memory, but how do I access that, besides running a query, I guess?

[00:04:06] - Bertrand

Yeah, so there is basically a view on top of the in-memory ring buffer. So you can query the data and I guess you can push it to Prometheus.
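
For example, a query like the one below (again assuming the pg_active_session_history view sketched earlier) could be scraped by a Prometheus exporter to approximate database time by wait event:

```sql
-- Sample counts per wait event over the last five minutes; at the default
-- sampling period each sample approximates one second of session time
SELECT coalesce(wait_event_type, 'CPU') AS wait_class,
       coalesce(wait_event, 'CPU')      AS wait_event,
       count(*)                         AS samples
FROM pg_active_session_history
WHERE ash_time > now() - interval '5 minutes'
GROUP BY 1, 2
ORDER BY samples DESC;
```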

[00:04:17] - Ry

Yeah, it's awesome. Yeah. Okay, I think I just found a solution to a problem by inviting you to the podcast. Thanks. Are you actively working on that or is that pretty much completed? What are you working on these days?

[00:04:36] - Bertrand

Yeah, so for [inaudible 00:04:37] it's more maintenance: when there is a new major release, ensuring it compiles and still works as it should, and fixing bugs when they are reported. I also wrote another extension, pg_orphaned, which is useful to detect orphaned files. In Postgres it's easy to create orphaned files: you just begin, create a table, insert some data, and crash the server. Then you get orphaned files, and the extension detects them. You can then move them to a backup directory, verify everything is still okay, and then delete them, or push them back to the original directory if anything went wrong. It also takes care of existing transactions, using a dirty snapshot so that you can see transactions that have started but not committed yet, so you don't move files that are part of an active transaction. And another one that is fun is pg_directpaths. It enables inserts that bypass shared buffers and write directly to disk. It has performance benefits because at the end you do less work: you don't need to find a free buffer and all this stuff, you just write to disk. And it also generates less [inaudible 00:06:13] because it only flushes [inaudible 00:06:17] images in bunches during that time.
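
The orphaned-file scenario Bertrand describes is easy to reproduce. A sketch follows; the pg_orphaned function names are as we recall them from the project's README, so treat them as assumptions to check:

```sql
-- Reproduce an orphaned file: the table's data file is created on disk,
-- but the crash happens before the catalog entry is ever committed.
BEGIN;
CREATE TABLE will_be_orphaned (id int);
INSERT INTO will_be_orphaned SELECT generate_series(1, 100000);
-- Crash the server here (e.g. kill -9 the backend) before any COMMIT.

-- After restart, with the extension installed:
CREATE EXTENSION pg_orphaned;
SELECT * FROM pg_list_orphaned();  -- list candidate orphaned files
-- pg_move_orphaned() then moves them to a backup directory, and
-- pg_remove_moved_orphaned() deletes them once you've verified all is well.
```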

[00:06:21] - Ry

Interesting. I'm curious about the origin of these extensions, like for example pgsentinel. You know, you work at AWS, right? Are you on like the RDS team? I'm curious how it's organized. Are you working directly on the features of that, or are you working with customers trying to use RDS and Aurora?

[00:06:47] - Bertrand

Yes. So currently I am part of the RDS Postgres contributors team; our role is mainly to contribute upstream, fixing bugs and providing new features.

[00:06:59] - Ry

Got it. Okay. But pgsentinel, was that something the RDS team wanted, or was that a customer thing? What caused you to decide to work on that? Or was it your own idea, like you just personally wanted it to exist? I'm curious about the origin of the project.

[00:07:17] - Bertrand

Yeah. So for pgsentinel, I was not even employed by AWS at that time. I did it when I started to learn Postgres, in fact, because I like to learn by doing something, not just by reading books or whatever.

[00:07:34] - Ry

Got it. All right, that's cool. So are you building extensions for Postgres as part of your work? I get that you're contributing to the core project, but is the extension work similarly just your own personal interest, or is it related to making the RDS platform better?

[00:07:58] - Bertrand

I guess, yeah. So pg_orphaned is an extension that is being used internally, because we see orphaned files and we have to deal with them. That's why this extension was built. pg_directpaths was more for fun, and to mimic an Oracle feature that allows direct path inserts.

[00:08:27] - Ry

Going back to building your first extension, what challenges were you faced with? What was hard about it?

[00:08:35] - Bertrand

Yeah, it was not that easy. First of all, there is a lack of any kind of template. You have to find an existing extension and, at least that's what I have done, try to understand how the existing extensions work, and try to find documentation around it. It was not that easy. I think it is getting better with pgrx, because now it provides an initial template you can start building on top of. But yeah, it was not that easy at the beginning.

[00:09:12] - Ry

Yeah. Would you use pgrx and Rust to build your next extension? Or, I don't know, maybe you're a hardcore C++ guy and prefer that. I'm curious, what's your thought on that?

[00:09:27] - Bertrand

Yeah, so the last one I built, I built in C and in Rust, just because I am a newcomer to Rust and I wanted to learn Rust and pgrx as well. So yes, I started to learn about it. I think maybe there are some limitations in Rust; it all depends what you need to do for memory management and this kind of stuff. Maybe it's more difficult outside of the C world for Postgres.

[00:09:54] - Ry

Yeah, I think I've talked to Eric about how not everything has been exposed in pgrx, so you may have to... if you're doing something super deep, it may not have the right hooks for you.

[00:10:07] - Bertrand

But the cool thing is there are modules that you can rely on. So it's easy if you want to interface with someone else's work.

[00:10:20] - Ry

The other thing I think it brings is a standardized template that helps a new extension developer: you don't have to go hunt for one similar to yours. I imagine you probably spent some energy looking at all the different extensions to find the right pattern, the close-enough pattern to borrow. So, are there big milestones, related to your extensions or even to Postgres itself, that you're looking forward to?

[00:10:54] - Bertrand

Yeah. What I would like to see is improvement in observability and this kind of stuff. For example, there are wait events, but for the moment there are no real details around them. For example, if you have a buffer content wait event, I would like to know for which relation. This is the kind of stuff I would like to contribute myself and improve. That's an area I am interested in.

[00:11:29] - Ry

Is your goal to become a Postgres committer? Are you working towards that in your head?

[00:11:37] - Bertrand

Yeah, my goal is first to be a contributor, because for the moment that's not the case. And yeah, trying to move step by step, to improve my knowledge and improve the product if I can.

[00:11:52] - Ry

Yeah. Our CTO at Tembo did an analysis recently and, I forget exactly, I think he saw it was about seven and a half years of average contributor time before you get to be a committer in Postgres. I'd say that's as much time as going to school to become a doctor, maybe even longer. It's a pretty high bar. I think it's great in some ways that Postgres is so selective.

[00:12:24] - Bertrand

So maybe I'm already too old to reach this goal.

[00:12:27] - Ry

I don't know, I don't know. Yeah. Well, that's good. I'm curious, shifting away a little from Postgres: as an engineer, what are some of the most important lessons you've learned over your career? I know it's kind of an open-ended question, but does anything come to mind?

[00:12:48] - Bertrand

Yeah. Listen to others, learn from others, and accept remarks and different points of view. I think that's very important.

[00:12:59] - Ry

Yeah. What do you think about the way Postgres work is managed: mailing lists, diffs, and patches? Does it bother you at all, or do you find it a refreshing difference from the GitHub way?

[00:13:20] - Bertrand

Yeah. So I am not that young, so working with email is perfectly fine for me, and I think it works pretty well. I mean, the project works well that way. But yes, I can imagine that maybe younger people would like to work differently than by email. In my personal case, it's okay for me.

[00:13:47] - Ry

Yeah, I think the reason it works so well is that it's truly a global community, right? Obviously, if you have everybody in approximately the same time zone on a team, you can make decisions and have conversations outside of email. But it's discussion first, almost a demo-first, design-first sort of mentality, versus: hey, a PR just popped in, let's talk about that code. I don't know, I think it's refreshingly different. I think some other open source projects that want to be notable could adopt this methodology, even though on the surface it feels very full of friction. But I think it's intentionally that way: let's talk about this before we do it, right?

[00:14:38] - Bertrand

Yeah, that's right. And also explain why you would find this useful and everything. Yes, exactly.

[00:14:45] - Ry

Yeah.

[00:14:46] - Bertrand

And there are also conferences. At conferences we can reach out to people and talk about ideas and what we're working on as well.

[00:14:55] - Ry

Yeah. Do you go to the US conferences?

[00:15:00] - Bertrand

Not that much. I went to PGConf New York, not the last one but the one before, so two years ago. For the US, that's the only one I've done for the moment.

[00:15:12] - Ry

Got it. Yeah, I was in New York, this most recent one, for the first time. It was an interesting experience to see so many of the vendors and builders of Postgres at the event. Again, it's just impressive to me that Postgres has gotten to the point that it can have an event like that and that it's sustainable. A lot of other open source projects can't pull that off without being somehow a commercial endeavor. So I think it's cool.

[00:15:49] - Bertrand

And there is also PGConf Europe. That is a very big one as well.

[00:15:55] - Ry

So talk to me. I'd like to give people a chance to talk about the commercial product they're working on too. Do you spend more time on RDS or Aurora? Or are the lines between those pretty blurry in terms of how you work on your product?

[00:16:14] - Bertrand

No, I spend almost all my time on the community side of things, and I am still involved in some Aurora or [inaudible 00:16:23] related projects, because I was working in this area before I joined the Postgres contributors team.

[00:16:31] - Ry

Got it.

[00:16:33] - Bertrand

And yes, we are involved in case there is a bug or things like this.

[00:16:37] - Ry

Right, of course. Yeah. That's the idea: it's nice to have a pack of contributors nearby when things go wrong.

[00:16:47] - Bertrand

Yeah. That helps.

[00:16:49] - Ry

Yeah. Okay. All right. That makes sense. It sounds like the role is pretty pure now. Pure community.

[00:16:57] - Bertrand

Yes.

[00:16:58] - Ry

Except when asked. What are some of the developments in the Postgres ecosystem that you're excited about, whether it's other extensions or even stuff in the core?

[00:17:13] - Bertrand

Yeah. So I contributed to logical decoding on standby, which is now part of PG 16. I think it's a very useful feature and a much-requested one, so I think that's good. And now I am mostly involved in slot synchronization: synchronizing slots on the standby from slots on the primary, so that after a failover you can resume working from the slot.
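
To give a sense of the PG 16 piece, here is a minimal sketch, assuming a hot standby with wal_level = logical set on the primary; these are the standard slot functions, now usable directly on the standby:

```sql
-- On a PostgreSQL 16 hot standby:
SELECT slot_name, lsn
FROM pg_create_logical_replication_slot('standby_slot', 'test_decoding');

-- Consume changes decoded directly on the standby
SELECT * FROM pg_logical_slot_get_changes('standby_slot', NULL, NULL);
```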

[00:17:47] - Ry

Tell me, I'm sort of new to Postgres, I'm meeting all these people, meeting you for the first time here. Who are some of your favorite people that deserve attention for their good work? Name two or three people, maybe potential future guests of the podcast.

[00:18:09] - Bertrand

Yeah. So, I mean, there are people working on the community side, and also people who are not well known, so it really depends. The community itself is very good and really kind to help and all this stuff. So that's good. I mean, everyone on [inaudible 00:18:26] is good people.

[00:18:28] - Ry

Yeah. So you could just randomly select someone and then you're going to find someone good. Yeah, I got it. Did someone recruit you? Who was it that got you into Postgres? Who hooked you in from Oracle Land?

[00:18:45] - Bertrand

Myself.

[00:18:46] - Ry

Oh, you just decided to defect one day, right?

[00:18:50] - Bertrand

Yeah, exactly. I was like, I really want to contribute to something open source. Having done databases for 20 years, I chose a database, and Postgres was the evident choice for me.

[00:19:04] - Ry

Yeah. Okay. Have you ever messed with MySQL? Did you consider MySQL at the same time?

[00:19:10] - Bertrand

Not at all.

[00:19:11] - Ry

Not at all. Yeah. Why not? Okay. All right, so if you had a magic wand and you could magically add any feature to Postgres, what would you use that magic wand on?

[00:19:26] - Bertrand

Yeah. So, observability, to improve this area. I think Postgres is now very rich in terms of features, but observability is really an area to improve, in my opinion.

[00:19:43] - Ry

It's tricky with an open source project. I'm sure at AWS you guys do it a certain way too, but we get, I don't know, maybe 350 Prometheus metrics out of it right now. How many more bits of information should we get? What's great observability? It seems like a lot, but I also think sometimes I'm looking for information that's just not there. So maybe a lot of low-hanging fruit has been done, but not the hard stuff. I'm kind of curious what's missing in the observability that's available today.

[00:20:26] - Bertrand

Yeah, so I think it's really, for example, wait events. I would like to have more details for each wait event. For a data file wait, which file, for example. Also, how long have you been waiting on this wait event, and stuff like this.
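
For context, this is what stock Postgres exposes today: the wait event name, but not which relation or file, nor how long the wait has lasted.

```sql
-- pg_stat_activity shows *that* a backend is waiting on a buffer or file,
-- but not which relation or file, and not since when
SELECT pid, state, wait_event_type, wait_event, left(query, 60) AS query
FROM pg_stat_activity
WHERE wait_event IS NOT NULL;
```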

[00:20:47] - Ry

Do you feel like that data is really for the people running Postgres, not so much for the end user, necessarily? You're looking for better observability for the managed service provider?

[00:20:59] - Bertrand

Yeah, for DBAs, I think. Especially the ones coming from Oracle; they are used to having a lot of observability, I think.

[00:21:14] - Ry

Yeah, it's interesting. I'm thinking about the role of the DBA. I think the more managed Postgres there is, the fewer DBAs there are, per se, in terms of having access to those low-level details. But a lot of people still self-host Postgres, and you probably expect that'll never go away, right?

[00:21:44] - Bertrand

But even in managed services, for the moment, we can only provide what is available. If you look at Performance Insights, for example, you have the wait events, but at some point you need to guess what is really happening, because there is a lack of details, I think.

[00:22:04] - Ry

Yeah. So in your magic wand scenario, you would add that information, and you would make it available even to end users in RDS, because it can help them understand how to optimize how they're using RDS. Right?

[00:22:21] - Bertrand

Yeah, exactly.

[00:22:22] - Ry

Okay, I like that. Any other areas of observability that you think are big opportunities for improvements?

[00:22:33] - Bertrand

Yeah. For example, when a query is running, we'd like to know which part of the plan it is currently running. And if pg_stat_activity displayed the plan ID, I could also know which plan ID was used in the past and which one is used currently; maybe there was a bad plan being used, and that's why it was slow before. This kind of area.
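
For reference, the query ID half of this wish already exists: since Postgres 14, pg_stat_activity can expose it, while the plan ID Bertrand describes is the missing piece.

```sql
-- Requires compute_query_id = on (or 'auto' with pg_stat_statements loaded)
SELECT pid, query_id, state, left(query, 60) AS query
FROM pg_stat_activity
WHERE state = 'active';
```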

[00:23:01] - Ry

Yeah.

[00:23:02] - Bertrand

Also, being able to export and import statistics, table statistics.

[00:23:09] - Ry

Do you ever run a query and... I would say I'm more on the novice side. I mean, I can write some pretty nice queries with all kinds of features, but I really don't understand how the files are interacting and so on. But sometimes I run a query and it's like, oh, this one's taking a long time, I wonder why. Has anyone thought it'd be kind of cool to run that same query but get a visual of what's going on inside? Are there services, or have you seen anything like that, where a novice user could visualize what's going on inside a long-running query as it's happening? Or does it have to be an analytical process to figure out why it's slow?

[00:24:02] - Bertrand

I think so. Once you have the explain plan of the query, there are some tools, for example one that Dalibo provides, which can graphically represent your execution plan, with the hotspots and the nodes that are taking most of the time. It's very helpful to visualize this kind of thing.
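
The usual input to visualizers like the Dalibo one Bertrand mentions is a JSON-format plan. Something like this produces output you can paste into such a tool (the table names here are just placeholders):

```sql
EXPLAIN (ANALYZE, BUFFERS, FORMAT JSON)
SELECT c.name, count(*)
FROM orders o
JOIN customers c ON c.id = o.customer_id
GROUP BY c.name;
```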

[00:24:28] - Ry

So which tool did you say does that?

[00:24:30] - Bertrand

I don't know the name by heart, but it's done by Dalibo.

[00:24:35] - Ry

Okay, Dalibo.

[00:24:37] - Bertrand

Yeah.

[00:24:37] - Ry

Okay, I'll have to check that out. Again, one of the things we're trying to do with Tembo is bring advanced Postgres things to novice users. As you know, most people using Postgres have gone 1% deep into the depths of what's there, and it's pretty intimidating for new users, I think. In a good way: I think it's nice that it's deep and not shallow, but it's a lot. Same thing with Linux, right? Or git; if you actually try to understand how git works, you'd be like, oh man. Is there anything about Postgres where almost nobody agrees with you? Do you have any contrarian opinions on anything? I know this is me trying to get you to fight with the community, but I'm just curious. I love hearing contrarian views.

[00:25:38] - Bertrand

No, not that I can think of. For example, when I say there is a lack of observability, I don't think I have heard anyone say the contrary.

[00:25:54] - Ry

Well, where does the observability belong, though? Does it belong inside of the core? I guess this could be a point. There's a bunch of extensions inside the core already, you might argue, and the pgsentinel capability probably should be available to everyone without having to install an extension. But if there are 100 things like that, it bloats up Postgres. So yeah, do we need to put more observability stuff in the main project, or does it belong outside?

[00:26:27] - Bertrand

Yeah, I was mainly thinking in the main project, yes. Provide more details about what's going on.

[00:26:34] - Ry

And turned on by default.

[00:26:39] - Bertrand

Okay. I mean, depending on the cost of it, obviously.

[00:26:44] - Ry

Well, like pg_stat_statements, that's not turned on by default, right?

[00:26:51] - Bertrand

On core Postgres, no. But on managed providers, I think so, yeah.
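
For reference, here is a minimal sketch of enabling it on vanilla Postgres, plus the kind of query you'd start with when something goes wrong (column names per Postgres 13 and later):

```sql
-- postgresql.conf (requires a restart):
--   shared_preload_libraries = 'pg_stat_statements'

CREATE EXTENSION pg_stat_statements;

-- Top statements by cumulative execution time
SELECT queryid, calls, total_exec_time, mean_exec_time,
       left(query, 60) AS query
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;
```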

[00:26:59] - Ry

So would you argue that it should be turned on by default all the time? I know there's a little cost, right? Everything has a cost. That's a tricky one.

[00:27:10] - Bertrand

I think in production it has to be. I mean, at the end, when something goes wrong, you need to have something to analyze. If not...

[00:27:19] - Ry

Yeah, I agree, it's difficult. Okay. But once again, production issues. It's just a tricky thing to me, because the open source project is kind of like the inner core, and then there are things related to running it in production. Take, for example, a Kubernetes operator. I think we could probably all agree a Kubernetes operator shouldn't be inside the Postgres project per se, but that means a backup strategy can't be inside the project either. These edges are interesting. It's just: how much observability should be inside versus outside of the core? Would you argue all of it should be inside? Would you say pgsentinel ought to be merged into core?

[00:28:15] - Bertrand

No, I don't think so, because an extension is easy enough for this kind of observability.

[00:28:28] - Ry

But people don't know it exists. Right. Unless they find it. That's the tricky part.

[00:28:33] - Bertrand

Yeah, that's the problem.

[00:28:36] - Ry

Yeah. Discoverability of extensions, that's one of the things we're working pretty hard on too. I think it'd be great to have metadata, maybe at some point. We haven't really talked to the Amazon team about this, but I want to basically have a place where managed service providers can report which extensions are being used, just so we can know what's popular, what goes well with something else, what pairs nicely. If I'm using five extensions, what's the sixth extension I ought to be using, given that profile? That would be nice, because I think digging into these extensions is a big opportunity for developers. Someone might be banging their head against the problem that pgsentinel solves. I've banged my head against that problem for hours, not days or anything like that, but I've thought about it, checked for it: certainly there's got to be a history of active sessions somewhere in here. And I couldn't really find it, or even a way to make it very easily. That was my thought: there should be some way to create it. But anyway, curious: do you listen to podcasts very much?

[00:30:00] - Bertrand

From time to time, yes. But it really depends on what I am doing, and if one pops up, depending on the subject, yes, I can listen to one.

[00:30:12] - Ry

Yeah. What are your favorite podcasts? Even if it's non technical, I'm always looking for good new podcasts to listen to.

[00:30:21] - Bertrand

There is the Postgres podcast done by Postgres.ai. That one I like, yeah.

[00:30:35] - Ry

Any others?

[00:30:36] - Bertrand

I am also following you all, by the way.

[00:30:39] - Ry

Great, thank you. Yeah, it's interesting: about half of my feed is Postgres-related podcasts, but I also love to get some other stuff mixed in. It's hard to work all the time. But it's nice. I think it's great that you can just passively gain Postgres knowledge now, just by having that on while you're mowing the lawn or doing dishes or whatever.

[00:31:07] - Bertrand

Yeah. And the one done by Nikolai, by Postgres.ai, is very good.

[00:31:14] - Ry

Are you watching his Postgres marathon, where he's posting content every day for a year? He's a madman. That's a lot of work.

[00:31:30] - Bertrand

Exactly. Yes.

[00:31:31] - Ry

But I appreciate that he's doing it.

[00:31:34] - Bertrand

Yeah, that's very good.

[00:31:35] - Ry

Yeah, it's really cool. Well, are you very active online, where people can follow you? And what's your favorite place to interact with others in the community, besides the mailing list?

[00:31:52] - Bertrand

Yeah, I am mainly on Twitter, and I think that's it for the social media: Twitter.

[00:32:04] - Ry

Do you think Twitter is better or worse after Elon took over?

[00:32:12] - Bertrand

I think it works, but not as well as before.

[00:32:18] - Ry

Yeah. Okay. You liked the original. Yeah. I mean, the one thing I think I like is that he's playing with it, you know, and trying some new stuff. Like the fact that we're posting this podcast to Twitter first; at least the world's changing and we can do that. Otherwise it would have been, just clearly, drop it on YouTube. But yeah, I kind of like that he's trying some new ideas. Obviously a lot of people are pissed off that you've got to pay to get priority, blah, blah, blah. But anyway, cool. Well, I don't know if there's anything else you wanted to chat about, but I enjoyed learning more about this, and I'm going to immediately go try to install pgsentinel and see what it does for us.

[00:33:13] - Bertrand

Yeah, no problem. I can help you with that if you need to.

[00:33:17] - Ry

Yeah.

[00:33:17] - Bertrand

Awesome.

[00:33:18] - Ry

Well, thank you. Thanks for joining and appreciate it.

[00:33:24] - Bertrand

Thank you, Ry. Thank you for everything.


Search is simple in theory. In practice? Anything but. In today’s Episode 8 of Hacking Postgres, Ry sits down with Philippe Noël of ParadeDB to talk about how search is evolving, the influence of AI, and the lessons you'd tell your younger self.

Watch below, or listen on Apple/Spotify (or your podcast platform of choice). Special thanks to Philippe for joining us today!

Want to know more about something they mentioned? Here’s a starting point:

Did you enjoy the episode? Have ideas for someone else we should invite? Let us know your thoughts on X at @tembo_io or share them with the team in our Slack Community.

Transcript

[00:00:12] - Ry

Welcome to Hacking Postgres, an interview podcast with the people working on open source projects around Postgres, which in my opinion is the world's best open source database. I'm Ry Walker, founder of Tembo, a managed Postgres company, and today I have Phil Noël of ParadeDB as my guest. Phil, welcome to the show.

[00:00:37] - Phil

Thanks. Yeah, thanks. Nice to meet you. Thanks for having me.

[00:00:42] - Ry

Great to actually meet you. I'd like to start... maybe you could give us a quick background. Where'd you grow up in the world, and what were you doing before?

[00:00:53] - Phil

Yeah, absolutely, happy to. So I'm originally from Quebec City, the French part of Canada. I actually grew up in this small town called [inaudible 01:02], which is two hours east. Spent most of my life there, and then eventually I left to go to university. I stayed in Boston, studied computer science and neuroscience, did a few things. I started a browser company before ParadeDB, which I ran for about three years, and through that experience I got acquainted with the joys of Postgres, you could say. Sounds somewhat similar to yours, maybe, although not at the same scale. And then my previous co-founder and I started ParadeDB.

[00:01:35] - Ry

Nice. Tell me about the browser company. What was it called, and how far did you get in that project?

[00:01:43] - Phil

Oh, that is both a short and a long story. It was called Wisp. We were building a cloud-based web browser, so the initial idea was to offload heavy processing to the cloud by streaming, which is an age-old idea, but we were young and dumb, so we thought we had invented something. We did that for a while, raised a couple of rounds of funding, grew the team to about 25 people, but we never quite found PMF and eventually pivoted to cybersecurity. It was a much better market, but not our battle to win. So we closed down the company.

[00:02:21] - Ry

Okay. I'm sure we both have a lot of battle stories we could commiserate over. Do you remember when you first started using Postgres? I can't remember myself; that's why I ask this question, because it kind of just happened at some point in my career. But do you remember specifically when you started with Postgres?

[00:02:45] - Phil

I don't know, actually. That's a good question. As a user? I don't know. As a developer, since earlier this year or last year.

[00:02:58] - Ry

Well, cool. Obviously as a user you've used it for years, I'm sure. Tell me about your Postgres work. What have you built in the past, and what are you building now? Or is this all just the first thing?

[00:03:12] - Phil

I guess I would say it's the first major thing, so we can keep it to what we're doing now. We're building ParadeDB. ParadeDB is essentially a Postgres database, or a Postgres extension, where we're integrating native full-text search within Postgres. Some people are quick to say this already exists, and some people are quick to say the one that exists is very bad, and so they're excited about what we're doing; it depends which camp you fall in. The first thing we released is an extension called pg_bm25, where we're essentially integrating proper full-text search within Postgres and then combining it with some of the existing innovations, like pgvector, to build hybrid search. Our goal is to build this Postgres-type database that is the go-to choice for companies where search is critical to the product they're building.
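
To make this concrete, here is a rough sketch of what using pg_bm25 looked like around this time. The index options and the @@@ search operator are taken from ParadeDB's early examples as best we recall, and the API has evolved since, so verify against the current documentation:

```sql
CREATE EXTENSION pg_bm25;

-- Build a BM25 inverted index over a table's text columns
-- (the WITH options here are assumptions based on early pg_bm25 examples)
CREATE INDEX idx_products_search ON products
USING bm25 ((products.*))
WITH (text_fields = '{"description": {}}');

-- Full-text search with BM25 relevance scoring
SELECT *
FROM products
WHERE products @@@ 'description:keyboard'
LIMIT 10;
```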

[00:04:08] - Ry

Nice. Why do you think you're working on this? What led you here?

[00:04:14] - Phil

Yeah, that's a good question, actually. This is a problem we faced ourselves, and that's why we wanted to solve it. After my first company, my co-founder and I were trying to decide what we'd do next, so we started doing some contracting work left and right. I was living in Paris at the time, working with a lot of French and German companies. And when we were working with this German automaker, they just really needed high-quality full-text search within Postgres for the product we were building for them, and instead we had to glue Elasticsearch or some vector database on top. It was just a nightmare. We were very frustrated with how bad it was and decided, you know what, maybe we should just fix it ourselves. That kind of led us to today.

[00:05:00] - Ry

Got it. Yeah, it's interesting. There was a famous tweet that I can't remember the exact wording of, but it basically said something to the effect of: hey, tech founders, you'd do well to pivot to building something that would support whatever you're building today. Basically, go one level deeper than your current idea. That's similar to what we did at Astronomer: we were basically building a competitor to Segment, clickstream data processing, and then we started using Airflow. Then we said, hey, let's just provide Airflow as a service, and we made that switch to go down a level, and that was the winning pivot for that company. So it sounds like you've jumped down to more of an infrastructural issue, which I think is good. Well, what challenges? Obviously there's a lot of ground to cover, but what are your biggest challenges so far, building this thing?

[00:06:12] - Phil

You mean like business wise or technical wise or process wise?

[00:06:16] - Ry

Actually, yeah, I'd love to hear all categories of challenge. Spill the beans: what's hard about this? I imagine fundraising; I mean, it's a tough time to fundraise, so that's probably not awesome. But maybe start with the technical.

[00:06:33] - Phil

Yeah, I can start technically. I think software is the easiest part of building a company, so I would say these are our simplest set of challenges. Not to say it's easy. I actually listened to the Hacking Postgres episode with Eric, and I talked to him afterwards. At first we didn't know pgrx was a thing, and we were like, oh my God, how on earth is this even going to be possible? And then we found out pgrx was a thing, and we were like, wow, it went from near impossible to quite doable, really. So on the technical front, there's just so much you need to do when you're building search, right? We've covered a small part of the features people expect, and there are so many more that we need to build, so we're trying to prioritize that.

[00:07:25] - Ry

Obviously Elastic is probably the big competitor. Can you estimate what percentage of their functionality you aim to replace? Is it essentially 100% and what's your progress towards that?

[00:07:41] - Phil

Yeah, that's a good question. I mean, our progress is small; it's small, but enough that you can use it. Elastic is enormous, right? I don't think we aim to replace, sorry, to match, 100% of their features. At least for transactional data, which is what we need today, I think we actually have a decent bit of the way done, and definitely enough to keep us busy for a couple of months. That's the first step of what we're looking to match. Beyond that, everything Elastic does for observability is another story, and that's another milestone that we'll see about, depending on what our customers want.

[00:08:29] - Ry

Obviously I'm building a fresh new tech startup as well, and I understand you've got to match the functionality of what I call the table-stakes features. But then you also have to come up with some differentiators, or else no one will care. So what are the biggest differentiators you're thinking about with ParadeDB?

[00:08:49] - Phil

Yeah, that's a good question. I think for ParadeDB right now, the differentiator has been, ironically, the combination of two undifferentiated things, if that makes sense. If you want to do search over transactional data, ParadeDB is sort of the best solution today, because it's the only thing that actually truly combines the two. It wasn't done before, and so our initial interest has come from that. Typically, when people wanted to do this, they would have some managed Postgres provider, of which there are many, and then they would use Elastic, and then they would need to combine the two and run the two and so on. It can be a ton of work. ParadeDB makes that a single system, and especially for small companies, which is where we've begun, really, it's been a lot easier to manage, I would say. That really draws them to us. That's our initial point, but we have some other things coming up which will hopefully expand on this initial interest.

[00:09:52] - Ry

Yeah. It seems to me, as I think about building my next search bar in a product, that I have to decide: do I want to go what I'd call the old-school way, versus vector search? Maybe I can get away with some sort of AI integration for search. Are people asking that question, or do they have the perception that vector search somehow replaces or displaces traditional full-text search features?

[00:10:28] - Phil

Yeah, that's a very good question. I've talked to a lot of people who weigh in on both sides of the coin. My personal take is that both will remain quite important, just because at the end of the day people will still have data, and we'll still need to be able to find instances of that data, right? There's one way to find it via similarity, but sometimes you just want to search from a clean place. So I think both are quite valuable. For most of the people we talk to, hybrid search, the combination of the two, is very important, for both big and small companies. That's one of the reasons we built ParadeDB the way we did, where pgvector exists and so many others. I would almost say vector search is kind of table stakes now; we don't really have to innovate on that so much ourselves. But traditional search is the piece that's been missing, which is why we released this first extension, pg_bm25, and then we offer the combination of both as well [inaudible 11:29] with another extension called pg_search, which we're going to be publicizing more heavily soon.
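
As a generic illustration of the hybrid idea (deliberately using stock Postgres full-text search plus pgvector rather than ParadeDB's own API; the table, the weights, and the query vector are all hypothetical), you can blend a lexical rank and a vector similarity into one score:

```sql
-- Assumes: CREATE EXTENSION vector;
-- and a documents(id, title, body, embedding vector(3)) table
SELECT id, title,
       0.5 * ts_rank(to_tsvector('english', body),
                     plainto_tsquery('english', 'database search'))
     + 0.5 * (1 - (embedding <=> '[0.1, 0.2, 0.3]'::vector)) AS hybrid_score
FROM documents
ORDER BY hybrid_score DESC
LIMIT 10;
```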

[00:11:35] - Ry

Yeah. It seems to me like everything has to have an AI story, or else perhaps it's outdated, right? So even search, even full-text search. I like that the hybrid answer is a good answer. Obviously then you have to figure out how to hybridize it all, and that's not necessarily trivial. But I'm curious to see what you come up with on this.

[00:12:07] - Phil

Yeah, it takes some effort. We definitely have an AI story, but our personal take on the whole AI wave is that AI is a step-function improvement on existing things being done; you need a foundation of value proposition for customers in the fundamental non-AI part of the product, if that makes sense. That's how we think about it. There are so many people trying to bring in AI when there was nothing useful to begin with, whereas I really think of AI as taking something that was useful to begin with and making it that much better. And so it plays quite well with the way we've been going about delivering this product: hey, we'll just give you really good full-text search, and obviously so much more can be done; but if that isn't useful in Postgres in the first place, then who cares that I can add AI to it.

[00:13:01] - Ry

Yeah, I agree. I think about it like this: you can't build AI without training data, right? And you can't have training data without a product to collect some base level of data. So basically you have to start without AI, or you don't have data at all. And I think that's a challenge for a lot of founders; maybe they're thinking they're going to go straight to AI, but there's nothing there. You've got to start basic. Yeah, I agree. What are some big milestones? How many big milestones do you have envisioned for ParadeDB at this moment?

[00:13:44] - Phil

That's also a good question.

[00:13:46] - Ry

I mean you probably have a next one, right? You usually know your next big milestone.

[00:13:50] - Phil

Hopefully you'd have that. Yeah, of course.

[00:13:53] - Ry

And you could also tell me what it is. I'm curious.

[00:13:58] - Phil

Happy to share. To be honest, our big milestone was to get a thousand stars on GitHub, and then some stranger posted our repo on Hacker News, and somehow that happened in a couple of days, which was really exciting and rather unexpected. Now our big milestone is that we're about to release our own small version of managed Postgres, ish, specifically for ParadeDB, to test how else we can deliver value to people. Many of those who have been using the product so far have been using our self-hosted version, and doing Tembo, I'm sure you're extremely familiar with how people don't want to do that. So that's the next big milestone coming: trying to learn to manage this. After that we have a couple more; we'll see the exact number. We're just trying to get more people on it and see what they think, really.

[00:14:53] - Ry

Yeah, just grow users. This is another thing I've been thinking about: we have stacks at Tembo with different use cases for Postgres, and some things ought to be workloads separated from other things. For example, do you keep your analytical workload separate from transactional? That's clear. But search is interesting: is that just work the transactional database should be doing, or do you think the workload for search should be isolated from traditional transactional workloads, from an application standpoint?

[00:15:36] - Phil

That's a good question. I think it depends what product you're building, to be honest, because you don't want to slow down transactions; that's something that's very important to us. I think it's possible to integrate it, actually. I think it's up to you. The way we've built ParadeDB is that we offer weak consistency over search. What this means is: the way we build search, we have inverted indices that store the tokenized data you want to search over, but those are weakly consistent. Just like if you use Elastic with something like ZomboDB and synchronize it, there will be some lag between transactions being updated in your database and eventually being reflected in your search engine. The same thing happens here, but it happens automatically for ParadeDB, and instead of sometimes being minutes, it will be a few seconds at the very most. It ensures that your transactions are never slowed down, which is quite critical for people. So I think if you build it that way, it's actually quite possible to centralize it and have it integrated with the actual database. If you don't use a product where search is built that way, then I don't think you want to slow down your transactions, so you have to isolate them.

[00:16:48] - Ry

Yeah, I think a lot of people just do it inside their transactional database, because spinning up a new Postgres seems like overkill, especially for the traditional, shitty full-text search that's available. But if I wanted to build really good search into an application, I would go buy Algolia or something like that, and that clearly is a microservice at that point, right? I'm sending data to them to index, and they're sending me back results really quickly. It's not taxing my main database as part of that. So I kind of like thinking of great search as a microservice candidate, because the better you make it, the more compute it requires, and the more compute it requires, the more it's competing. Right? So I don't know, I like the idea of taking the worldview that great search needs to be separated from the transactional or analytical.

[00:18:03] - Phil

Yeah, I think you're right in a lot of ways. You do need to be able to orchestrate it properly at scale. This is definitely something we will ponder more and more as we have more people using it, and using it at greater scale. I do think, even if it needs to be fully separate from what we have done today, one of the great things, and where DBMS and managed Postgres companies like ParadeDB or Tembo show up, is the ability to make the actual complexity of the orchestration behind the scenes transparent to the customer.

[00:18:37] - Ry

So, you know, you've been writing code for a number of years now. What would you say your most important lesson has been? Kind of switching from the technical to the interpersonal, or just the professional. Do you have a top lesson you'd share if you were meeting with a young developer?

[00:19:00] - Phil

Specifically on writing code? Yeah. On writing code specifically, I would say: only write the code that you need to write. I think that's a big one. Take our previous company, maybe I'll go with this one. I like to think of products and startups now as science experiments versus works of art, and I think at our previous company we thought of it as a work of art. We came out of college, we were so incredibly into it, and we wanted it to be perfect. We cared so tremendously about what we did that we always went above and beyond, and it always made the product better, and it involved writing more code. But it did not always make the product better in a way that meaningfully improved the experience of the user or brought us closer to product-market fit, right? If you think of a work of art, it's never finished, it needs to be perfect; whereas a science experiment is more like: what's the bare minimum I can do to validate or invalidate this hypothesis I have, or deliver value to a customer? So what I wish I could say to my younger self is: if no one has asked for this, I don't think you need to do it.

[00:20:12] - Ry

Yeah, it's a good point. So you're saying be more of a scientist than an artist, perhaps, to some degree, when you're starting on something and it's not validated.

[00:20:26] - Phil

I think a lot of great products are almost works of art; people were so obsessed over them, and you can feel that. So I don't think you want to drop all aesthetic approach or care. But definitely early on, before you sink so much time into it, you just want to get feedback quickly, right? And that involves writing only the code you need to write. Yeah.

[00:20:49] - Ry

I'll tell you a quick story. When we started Astronomer, I had a good friend, Greg Neiheisel. He was CTO and basically my partner in crime, and we very much mind-melded on a lot of things. Our attitude, following your principle, was that what we were really building was this beautiful lattice of glue, and everything inside was open source. For example, we used Google's Material Design, which had just come out, and we built a UI adhering to it perfectly, and it ended up being a beautiful UI, because guess what, those guys built a great framework and we had no designer on the team. Google was our designer; we would joke about it all the time. We actually chose a color scheme for the company that was kind of Googley, with the same four main colors. So it was really easy to build something great. We really built the first version of our product entirely from open source components glued together in a unique way, and that was our art: this inner skeleton of orchestration, or, like I said, a lattice of glue holding all these things together. And it was a very successful product. Basically, how you arrange open source components is art; it's a creative endeavor, and for a brand-new product it's enough art. You don't have to go reinvent, for example, the Airflow UI during that really hard early project, and you don't have to go reinvent the Postgres UI. Maybe you want to, but first things first. So, did you think of a non-technical lesson that you learned, that you would tell your younger self? Or not really?

[00:22:49] - Phil

Oh, I have so many. I spent a lot of time thinking after our previous company didn't work out. I don't know, this is a broad question, I can go in every direction. Let me think which one is the best one to share. Okay, one interesting lesson; I'll focus on the youth as well, and what you mentioned. When we were building our first company, our first product, we felt very righteous. We were doing something very difficult, objectively, technically difficult, and a lot of people had tried before, and we were like: they must have done it wrong. Who were they, people with more experience and context than us, to know how to do this really well? And so we started every premise of what we were doing from the premise that we were correct, and then we strived to further validate the fact that we were correct. I think this leads to a lot of cognitive dissonance when you turn out to be wrong, because you had this idea of what you were thinking, and it's now being challenged. Versus the way we think about it now...

[00:24:07] - Phil

And I mean this from a personal standpoint, from the way we talk to and hire people, from the way we talk to customers, from the way we build product now: we just assume we're wrong. I always assume I'm wrong in what I think, and then I strive to prove myself correct. And if we do manage to prove ourselves correct, then it's a very positive moment; it's like, that's amazing, I found this great surprise that I wasn't fully convinced of before. And if you turn out to be wrong, well, you kind of thought you could be wrong all along, so it's not a big deal, and you drop it and move to something else. I think it's a mental framework that has served us a lot better in everything we do now.

[00:24:45] - Ry

Yeah, I love that.

[00:24:47] - Phil

Yeah.

[00:24:48] - Ry

I think it's a tricky catch-22. As a founder, you have to live in a reality distortion field just to change the world. You have to have a crazy idea that most people disagree with; those are the best ideas. Unfortunately, they're also the worst ideas.

[00:25:05] - Phil

Exactly.

[00:25:06] - Ry

Just accept that, hey, you might be wrong. At the same time, you have to also fight for your idea, even when you hear negatives. It's like, oh man, I built all this infrastructure for it to be righteous, and now it's hard to give up on it. And I think a lot of times it's wrong to give up on it, because, again, some of the best ideas are ones almost everyone disagrees with at the start. I agree it's a much better mindset to understand that it's lottery-ticket sort of odds. It's not: hey, if you were a good student, you know you can get an A in a class, right? It's just a matter of doing the right amount of work, and you'll get an A. It doesn't work that way with startups.

[00:25:59] - Phil

Yeah.

[00:26:00] - Ry

Cool. So, I'm curious: are you paying attention to the Postgres ecosystem, and are there other developments you're excited about that you've seen since you started working on this?

[00:26:13] - Phil

I have. I mean, there's just so much happening in Postgres. Obviously pretty excited about Tembo; I've been following you guys' journey. Just recently, I forget the name of the developer, which is very embarrassing, but someone released pg branching, which maybe you've seen. I was talking to Nikita, the CEO of Neon, right? And he was raving about their branching system, which I'm sure is world class and amazing. But it turns out other people might be able to do this someday, which is good for the community, perhaps. So I think that's quite an exciting extension that wasn't made by us, for example, that we're very excited about. And there are many others I could list.

[00:26:56] - Ry

Yeah, it's interesting. My first reaction: I saw AWS announce, oh, we have active-active replication available on RDS, it's a new extension, you know. And I responded on Twitter, is it going to be open source? And crickets. So it makes me feel like we'll study that extension, and the community needs to just build the open source version of it if they won't open it. I'm not salty about how the big clouds operate, extracting value from open source, because I'm extracting value from open source, and you are too. We all are, all the way down; like I said, it's value extraction all the way down to Linux. But obviously their goal is to build proprietary alternatives, and I really wish there was a Postgres company out there that was big and was worrying about the whole open source ecosystem. That's what we're trying to do. There's so much going on; I don't know if it's because I'm looking at it now, or if things have changed in 2023 versus 2022. It's great that it's happening during what I would consider a lull in the venture ecosystem, too. I think the stuff that's happening now will be strong. They always say the startups that start during the slow parts of venture usually end up being more successful than the ones in the hype cycle. So you and I are both lucky we're starting here, even though it might not feel that way sometimes.

[00:28:39] - Phil

I think it does feel that way, though. We raised our seed round and things went quite well. It was a challenging raise, but challenging because people were very critical about what we're doing, and I think that's good, right? At our first company, we raised money on a terrible idea, and we thought raising money was the good part. But actually, if you have a terrible idea and people convince you of that early on, they're doing you a favor, right? So I think it's good. I'm very bullish on the companies that have been founded this year, that will be founded next year, and so on, because I think you're right: overall, they typically turn out to be quite good.

[00:29:22] - Ry

So do you have a couple of favorite people that you've met in Postgres Land? Give them a shout out if you do.

[00:29:29] - Phil

Yes, sure. I mean, I met Eric, who I think takes the prize for my favorite. He's very humble about it; he keeps saying TCDI needed pgrx, but I think they really took it beyond what was needed, for the benefit of the community. So he's definitely gone above and beyond, and he continues to go above and beyond. Another one, maybe you know him or not, is the writer of one of the Postgres internals books, Hironobu, a Japanese man who lives in Europe now. I haven't engaged with him very much, but he's made this resource called, I think, InterDB.jp or something like that; I can send you the link if you're curious. It goes into Postgres internals in great detail, and it's been a good resource for us. So shout out to that guy.

[00:30:23] - Ry

Yeah, I definitely read the website, but I haven't spoken to the man.

[00:30:29] - Phil

Maybe we should try to email him a few times. Yeah, I tried to convince him to join us, but it was too difficult a sell.

[00:30:37] - Ry

Yeah, maybe. I mean, I tried to convince Eric Ridge to join Tembo early on, and he was too hard a sell.

[00:30:43] - Phil

I'm sure you did. Of course. The naive part of me was like, I can't wait to convince Eric to join Parade. And then I talked to him and realized he basically is the co-founder and CTO of TCDI, and I was like, okay, this isn't happening.

[00:31:01] - Ry

Yeah, I was probably the 25th person to go after him, and you were probably the 27th. Whatever.

[00:31:10] - Phil

Exactly.

[00:31:11] - Ry

Yeah, I think it's great. I'm kind of excited that these people, I'll call them the gods of Postgres, are kind of locked up on Mount Olympus. We don't get to see them or talk to them very much. That's exciting to me, to know that someday we'll get there. Someday, if our companies are strong enough, maybe we get to work with those people. But it's great to be in an ecosystem where there is a tier of people who are just amazing. I was talking to Samay, who's the CTO here at Tembo, and he did a quick analysis: for most people, it takes around seven or eight years of contributing to Postgres before you get the commit bit, which is amazing. You think of venture-backed startups: that's like two startups' worth of effort, sticking around past your full vesting, before you even get in. So it has nothing to do with the company you're working for; you're dedicating yourself to an open source project for the core of your professional life. Because once you get there, you probably want to stick around for another eight years, I'm sure.

[00:32:27] - Ry

Yeah, it's pretty cool.

[00:32:28] - Phil

Yeah. But these people are great, and I think that's one of the things that's so great about the Postgres community. Even though Eric and the others are amazing people doing great things, and they're very busy and so on, they're quite accessible, really. Eric is very accessible in the pgrx Discord and so on. So I do feel like by building open source, we still get to work with them, even if not in the same organization. And some other people I forgot to shout out, who perhaps deserve a lot of it, are Umur and Ozgun from Citus, right? They were truly visionaries of the Postgres world, building extensions back when critics were calling Postgres a toy database.

[00:33:17] - Ry

Yeah, yeah, we're definitely standing on their shoulders. They did a lot of hard work early on, and it's great. If you had a magic wand and could add any feature to Postgres tomorrow, is there anything you would use that wand on? Do you have any things you wish Postgres was that it isn't?

[00:33:40] - Phil

That's very interesting. I think the answer would be better distributed support. What Citus has done is amazing, but the whole reason Yugabyte exists today, for example, is that there's only so much you can do as an extension, like Citus did, right? Yugabyte is basically trying to be the Postgres equivalent of CockroachDB. And if that were baked natively into Postgres, it would be good for a lot of people, those building on Postgres included.

[00:34:13] - Ry

Yeah, that's a good one. Do you listen to podcasts very much?

[00:34:19] - Phil

Not very much, not very much. Only yours.

[00:34:22] - Ry

Okay, good, thank you. I listen to PostgresFM all the time, I'm a huge fan of those guys, and I do listen to a lot of podcasts. I was going to ask you what your favorites are, but you already said. All right, well, cool. It was great talking to you. Where can listeners find you online? Maybe just tell us about yourself, the ParadeDB URL, and so on.

[00:34:49] - Phil

Yeah, so you can find us at paradedb.com, or on GitHub as ParadeDB as well. It's open source; we welcome contributions and users and so on, and we're very responsive. As for me specifically, I think you're going to link my Twitter in the bio, but my Twitter is PhilippeMNoel. It's basically my handle across every social media. I feel like when you can find one, you sort of keep it; there are so many people on the internet now. So that's my handle everywhere. People can find me there, and I love chatting.

[00:35:20] - Ry

Awesome. Well, thanks for joining. I'd love to have you on again in the future. Like I said, it's super early in this podcast's life, but we'll track what's going on with ParadeDB, and if there's more stuff to talk about, we'll get back together.

[00:35:36] - Phil

Of course. Yeah, I would love to be there. Thank you for having me.

· 25 min read

Postgres for everything? How about building a new cloud provider? In episode 7 of Hacking Postgres, Ry talks with Burak Yucesoy of Ubicloud about new clouds, getting attached to extensions, and the future potential of Postgres.

Watch below, or listen on Apple/Spotify (or your podcast platform of choice). Special thanks to Burak for joining us today!

Want to know more about something they mentioned? Here’s a starting point:

Did you enjoy the episode? Have ideas for someone else we should invite? Let us know your thoughts on X at @tembo_io or share them with the team in our Slack Community.

Transcript

[00:00:12] - Ry

Hello. I'm Ry Walker, founder of Tembo, a managed Postgres company. And today I have Burak from Ubicloud. How do you say it? Ubi?

[00:00:23] - Burak

Ubicloud.

[00:00:24] - Ry

Yeah, Ubi. Welcome to the podcast, Burak.

[00:00:33] - Burak

Yeah, thanks a lot for having me here.

[00:00:36] - Ry

Yeah, well, I'd love to start with a quick background. I'm curious: where'd you grow up, and what were you doing before?

[00:00:48] - Burak

Well, hello everyone, this is Burak. I work as a software developer. In reverse chronological order, I worked at Microsoft, Citus Data, and SAP, on distributed databases at all three, and currently I work at a startup called Ubicloud. We're basically building a new cloud provider, and the primary thing we do differently from other providers is that all the code is open. You can think of it as an open alternative to existing cloud providers: you can go to GitHub, check out the code, set it up on a bare metal server, and then you have your own cloud. Or you can use our managed offering, of course. So yeah, that's a brief professional history of me.

[00:01:46] - Ry

Yeah, nice. So you're essentially open sourcing what AWS and GCP and Azure have done.

[00:01:55] - Burak

Yeah, definitely. I see it as: what Linux is to proprietary operating systems, this is what we're doing for cloud providers.

[00:02:07] - Ry

Yeah, and you only have to build 300 different services on top of it. Right, but luckily a lot of those are open source too.

[00:02:15] - Burak

Kind of. Well, our current plan is not to build all of those services, because most of the time people use only ten, at most 20. If you implement 20 of them, I guess you cover 80% of the use cases. So that's our initial plan, but we never know what the future will show.

[00:02:42] - Ry

Yeah, I think it's a great idea. I'm excited to watch it evolve and hopefully partner with you guys at some point.

[00:02:50] - Burak

Yeah, that would be awesome.

[00:02:52] - Ry

Do you remember when you first started using Postgres?

[00:02:57] - Burak

Yeah, I do. Well, to be honest, when I started programming I started as a web developer, and at that time the LAMP stack was very common, so my first database was MySQL. But then I started working at Citus Data, which is where I started working on the development side of Postgres. For people who don't know, Citus Data was the company behind many popular extensions such as Citus, pg_cron, and postgresql-hll. When I first joined Citus Data, I initially worked on building the Citus extension, and while doing that you need to dig into the Postgres code: first you need to understand it, and then you build your extension on top of it. Then we built our own managed service, so I switched to that team to build a Postgres managed service. Some of our customers were heavily using the postgresql-hll extension, and at that time the original authors of postgresql-hll went through an acquisition and didn't have enough time at hand to maintain the extension. Well, we knew the people, and at that time the Postgres community was much smaller.

[00:04:37] - Burak

So we just called them and said, hey, we want to maintain this extension, what do you think? And they were pretty happy to find a new maintainer. So, long story short, I found myself as the maintainer of the postgresql-hll extension. Then Citus Data got acquired by Microsoft, and I continued my Postgres journey at Microsoft; we built a managed service there as well. Yeah, I guess that's how I started with Postgres development. Well, not just the start, that's almost the whole journey.

[00:05:18] - Ry

Yeah. And are you taking a little bit of a break from it right now or are you working on Postgres stuff over at Ubicloud?

[00:05:31] - Burak

Yeah, well, first of all, we use Postgres at Ubicloud, and most of my professional community, my friends, are from the Postgres community, so there's no way I can leave that community at all. But as a cloud provider, Ubicloud also needs to offer a Postgres service. So we're planning some things. It's not very well defined yet, but I think eventually we'll have something for Postgres as well.

[00:06:12] - Ry

Out of curiosity, and I'm sure this is a hotly debated topic: Kubernetes or no Kubernetes inside of Ubicloud?

[00:06:22] - Burak

Well, in our case, right now, no Kubernetes. We have a pretty simple control plane which, if you wanted, I think you could move to Kubernetes. But right now we don't use it. Not that we have anything against Kubernetes; it's just that right now we don't need that complexity.

[00:06:50] - Ry

Yeah, well, and I imagine you'll have managed Kubernetes. You'd have to have that. That'll be one of your first 20, most likely.

[00:06:57] - Burak

Yeah, definitely. Managed Kubernetes is one of the most demanded products, so it needs to be one of the first.

[00:07:11] - Ry

Well, so again, you were working on Citus... When you started working on it, was it an extension, or, I don't know the full history, was it a fork at any point that then became an extension?

[00:07:25] - Burak

Yeah, at the beginning it was a fork, and just before I joined the team it became an extension. So basically Citus became an extension and became open source at the same time, and I think I joined the Citus team about one month after that.

[00:07:45] - Ry

Got it. So you never were part of that previous era. Were a lot of the hard problems already solved by then, would you say, with Citus? I just did a talk with the PostGIS team, and they said early on is where they solved most of the problems, and it was more gradual after that. Is that the case with Citus too, or was there a big project that happened after you joined?

[00:08:19] - Burak

Well, I think most of the difficult problems were already solved. And to be honest, I think one of the problems with extension development is that there isn't good documentation about it. Postgres has pretty good documentation for user-facing features, but for developers there aren't that many resources, so you usually need to read lots of Postgres code. To be honest, the Postgres code is pretty readable for a project of its size and age, so that's a huge plus, but still, you need to read lots of code to understand what you need to do. In our case, thankfully, Citus hired one of the most prominent Postgres contributors, Andres Freund, and he primarily led the effort to make Citus an extension. At that time, I believe the Postgres extension framework also didn't have some of the features we needed, so we had to do some hacky workarounds. But eventually the extension framework improved, and we had the chance to remove those hacky workarounds.

[00:09:48] - Ry

Yeah, that's great. It sounds like that's happening now with lots of projects: a lot of the innovation in Postgres happens as a fork, with the hope that in a version or two it can become an extension, and then maybe a couple of versions after that it becomes a less hacky extension, right? You can streamline it, but it's a four-year journey or so. Tell me about postgresql-hll. What does HLL stand for?

[00:10:19] - Burak

Yeah, well, HLL stands for HyperLogLog, and it's an extension for cardinality estimation, which is a fancy way of saying count distinct, but approximate. Let me first explain why approximate. The reason is that doing an exact count distinct is pretty difficult. Well, not difficult, but maybe unfeasible. If your data size is small, that's okay, but if you have lots of data, the usual way is keeping a hash set: every time you see an item, you put it in the hash set, and at the end you count the number of items in it. With lots of data, that becomes unfeasible. If you have a distributed system, say you're doing count distinct on two different nodes, it becomes even more difficult, because even if you bite the bullet and calculate the count distinct on one node and the other node, it's not possible to merge the results: there might be common elements, so you cannot just sum them up. So what HyperLogLog does is use an approximation algorithm, and I can go into its details as well, to keep an internal representation of the number of unique elements that is both memory efficient compared to a hash set and also easy to merge.

[00:12:04] - Burak

So it allows you to do parallel computing. And the only gotcha is that it's not an exact number, it's an approximation, but it turns out we don't need an exact number most of the time. For example, especially in analytical use cases, say you want to count the number of unique users that visit your website: it doesn't matter if it's 4 million or 4.1 million, you want a ballpark number. And the good thing is the error rate of HyperLogLog is quite small, usually around 2%, and you can make it even smaller if you give it a bit more memory. The HyperLogLog algorithm itself is just out there; what postgresql-hll does is implement this algorithm for PostgreSQL.
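
For readers who want to try this, here is a minimal sketch of how postgresql-hll is typically used; the `page_visits` table and its columns are hypothetical:

```sql
CREATE EXTENSION hll;

-- One ~1.5 KB sketch per day, regardless of how many users there are
CREATE TABLE daily_uniques (
    day   date PRIMARY KEY,
    users hll
);

-- Fold hashed user ids into a sketch, one row per day
INSERT INTO daily_uniques
SELECT visited_at::date, hll_add_agg(hll_hash_bigint(user_id))
FROM page_visits
GROUP BY 1;

-- Approximate distinct users per day
SELECT day, hll_cardinality(users) FROM daily_uniques;

-- Sketches merge, so a weekly count is just a union of daily sketches
SELECT hll_cardinality(hll_union_agg(users))
FROM daily_uniques
WHERE day >= date '2023-10-02' AND day < date '2023-10-09';
```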

[00:13:09] - Ry

So how much of a resource reduction would you estimate you get from using an approximation? You lose a percentage or two of accuracy, but how much less compute is required?

[00:13:24] - Burak

Well, usually it's about 1.5 KB. The HyperLogLog data structure by default takes about 1.5 KB of memory.

[00:13:38] - Ry

Orders of magnitude smaller.

[00:13:40] - Burak

Yeah, actually the "log log" part of the name comes from that. If you're dealing with 32-bit integers, the count can go up to two to the 32. Take the log of that and you get 32; take the log again and you get five. So you need five bits of memory to store one bucket. Then what HyperLogLog does is keep multiple buckets to increase the accuracy, so at the end it ends up at about 1.5 kilobytes.
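
Written out, Burak's arithmetic looks like this. This is a sketch using standard HyperLogLog parameters; the 2048-register count is an assumption chosen to match the 1.5 KB figure he quotes, and it also reproduces the "around 2%" error rate:

```latex
\begin{aligned}
\text{bits per register} &= \log_2 \log_2 2^{32} = \log_2 32 = 5
    \quad (\text{often rounded up to } 6)\\
\text{sketch size} &= 2048 \text{ registers} \times 6 \text{ bits}
    = 1536 \text{ bytes} \approx 1.5\,\text{KB}\\
\text{standard error} &\approx 1.04/\sqrt{2048} \approx 2.3\%
\end{aligned}
```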

[00:14:17] - Ry

Got it. So how involved were you in that extension? Did you start it or did you inherit it?

[00:14:25] - Burak

I inherited it. Another startup called Aggregate Knowledge built that extension; then I think they got acquired by Neustar, and at that time the project was not maintained frequently. There were some fixes that needed to be merged in, and our customers were using it. So we contacted the original authors and said, hey, we want to maintain this, and they were happy to hand the maintainership over to us. After that we did bug fixes, we did regular releases, and I presented a few conference talks about HyperLogLog at PGConf EU and PGConf US. Yeah, that's the overall story.

[00:15:24] - Ry

I'm curious, have you been able to disconnect? I imagine it's easier to disconnect from Citus as an extension after leaving Microsoft, but what about this postgresql-hll extension? Are you still kind of watching it, because you have that knowledge?

[00:15:47] - Burak

Yeah, I have a little bit of an emotional bond to that extension. For example, there were a few improvements I wanted to do but didn't have time for while working at Microsoft, and it just itched me from time to time. And it is open source, so I guess at some point in the near future I'll open a pull request, and hopefully it'll get merged. I hope.

[00:16:19] - Ry

Yeah, but the Microsoft team controls that extension as it sits.

[00:16:25] - Burak

Yeah, right now the Microsoft team controls it. They continue to do regular releases, and every time a new PostgreSQL version comes out they ensure it works well, update the packages, and release new packages. If there's a bug report, they're the ones who fix it.

[00:16:45] - Ry

How many extensions? I mean, obviously there's the Citus extension, and this one. How many total extensions would you say Microsoft has in Postgres? I know it's maybe hard to nail down a number, but there are a bunch of others too.

[00:17:01] - Burak

There's Citus, there's postgresql-hll. There's pg_cron, which is also quite popular; it's a cron-based job scheduler.
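
As an aside for readers: pg_cron jobs are managed entirely in SQL. A minimal sketch, where the job name and command are arbitrary examples:

```sql
CREATE EXTENSION pg_cron;

-- Run VACUUM ANALYZE every night at 03:00
SELECT cron.schedule('nightly-vacuum', '0 3 * * *', 'VACUUM ANALYZE');

-- Inspect and remove scheduled jobs
SELECT jobid, jobname, schedule, command FROM cron.job;
SELECT cron.unschedule('nightly-vacuum');
```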

[00:17:10] - Ry

Yeah, I just started using that.

[00:17:12] - Burak

Yeah, that's pretty cool. It's primarily developed by Marco Slot. There's also an extension to ensure Postgres is well integrated with Azure, an extension called pg_azure, I think. It's not open source, but if you start a Postgres instance on Azure and check the extensions, you'll see it there. And there's TopN, which is also an approximation-based extension: it gives you the top N elements of a sorted list. If you think about it, that's an expensive operation on a big data set too, because you need to sort everything first and then take the top N. There are more optimized algorithms where you keep a heap, which is more memory efficient, but you still need to go over lots of data. At Citus we developed this TopN extension as well. Actually, if you look at it, the general theme is being able to do things at scale, because Postgres is already pretty good at doing things on a single node. Citus's primary use case was making a distributed PostgreSQL, and we developed a few extensions to take operations that aren't feasible at scale and find ways to make them more feasible.
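
A minimal sketch of the TopN extension Burak describes; the `page_visits` table and its `url` column are hypothetical:

```sql
CREATE EXTENSION topn;

-- Build a compressed summary of the most frequent URLs
CREATE TABLE url_counts AS
SELECT topn_add_agg(url) AS top_urls FROM page_visits;

-- Pull the ten most frequent items back out of the summary
SELECT (topn(top_urls, 10)).* FROM url_counts;
```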

[00:19:06] - Ry

So I'm curious, are there some big milestones in Postgres that you're looking forward to?

[00:19:14] - Burak

Yeah, well, actually I was looking forward to the Postgres 16 release, mostly because of the logical replication improvements. There are a few, and I think more are coming, because logical replication is a very powerful concept, but it's still a bit cumbersome to use with PostgreSQL, especially when there's a change in the schema, like when you run a DDL command, or when you have a failover. So there are a few gotchas, and I think with Postgres 16 some of these are less problematic, and I hope in upcoming versions it will get even easier to use. When you have very solid logical replication, it opens up lots of cool features. One thing I'm personally invested in is being able to do zero-downtime failovers, and when I say zero downtime, I mean real zero downtime, not just one second of downtime. And I think solid logical replication would open that up.
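
For context, the moving parts Burak is referring to look like this in SQL; the server names and connection details are placeholders:

```sql
-- On the source (publisher):
CREATE PUBLICATION app_pub FOR ALL TABLES;

-- On the target (subscriber), which must already have the same schema,
-- since DDL is not replicated (one of the gotchas mentioned above):
CREATE SUBSCRIPTION app_sub
    CONNECTION 'host=source-host dbname=app user=replicator'
    PUBLICATION app_pub;
```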

[00:20:45] - Ry

Yeah, I agree. It's one thing to do that with a single node versus a Citus cluster, right? Zero downtime gets more complicated the more complicated your deployment is. But I agree, on a single deployment. I have this idea where we could basically build scaffolding around the existing thing, whatever it takes, to get another one working. In other words, temporarily have replication happening, and then somehow, seamlessly, up above you have to have some fulcrum you can pivot on. It's a complicated thing to figure out, but I think it'd be great for a managed service provider to build up temporary infrastructure that makes the zero-downtime thing happen. If that's true, then you can restart for all kinds of reasons with impunity: auto scaling becomes possible, stuff like that.

[00:21:50] - Burak

Yeah, and one reason I'm especially interested is that in our previous managed service we did lots of operations via failover. For example, if you want to scale up, what we would do is create another server with a higher number of cores and then do a failover. Or we might need to do maintenance, maybe because we found a security vulnerability, or there's just regular maintenance, and instead of going and patching the existing server, we would create a new one and do a failover to it. Each of those failovers takes some amount of downtime, which is obviously not preferable and not a good experience for customers. But if they were virtually free from the perspective of the customer, if they don't even notice there's a failover, then you can do as many failovers as you want.

[00:22:55] - Ry

You share the same vision I do there. I think it would be exciting to get there. So I'm curious, though: if you had a magic wand and this weekend something new could be in Postgres core, what would it be? What would you use that wand on?

[00:23:11] - Burak

Yeah, that's difficult. I want lots of things, so picking one is difficult, but I guess one thing that bothers me is that for high availability and backups you always need to depend on a third-party tool. I would really love for that to be solved in Postgres. For example, Redis comes with very good default settings you can use for high availability. For Postgres there are solutions, good solutions, but I would love them to be in core.

[00:24:03] - Ry

Yeah, I get that for sure. It's tricky when you adopt a product and you immediately have to adopt several additional products right off the bat; that's not a good feeling. It feels complex, right? Yeah, that's cool, I would say. Do you have any opinions about Postgres that almost nobody agrees with you about? Are you a contrarian in any area you can think of?

[00:24:35] - Burak

Let me see. I don't think so. Well.

[00:24:45] - Ry

Maybe that's one of those areas: that you think backups should be inside core.

[00:24:48] - Burak

There is that. But I know there are people who also share that opinion. Yeah, I'm not sure. It's okay.

[00:25:04] - Ry

Yeah, I was just curious. It's always fun to talk about those things, if they exist. Give you a soapbox.

[00:25:14] - Burak

Actually, there is one thing, and I think even Tembo kind of agrees with me on this one, which is that many different use cases can be implemented in Postgres. So instead of having a lot of specialized databases, you can have Postgres. With some configuration and maybe a few extensions, you can implement Kafka in Postgres, you can implement Redis, you can implement NoSQL in Postgres. If I'd said this maybe two or three years ago, I'd probably have gotten more raised eyebrows, but now more people are starting to think like that. I think Tembo is also thinking along similar lines, right?

[00:26:14] - Ry

Yeah, I think there's openness to it. I talk to a lot of people, and the question is what percentage of developers are actually on the Postgres-for-everything bandwagon. Obviously on Twitter it just takes one person, and it's a big community too, especially if it's a contrarian view. But I'm kind of curious. One of the things I want to find out over the next few months is, if this Postgres-for-everything thing were real, what percentage of developers would actually use it versus still saying "Ehh." And I think it comes down to this: yeah, you can do Kafka-like work on Postgres right now, but it doesn't feel as clean as buying Confluent, which seems like a very safe decision. Doing something like Kafka on Postgres seems like you're stringing together a Rube Goldberg machine; it doesn't feel solid. But the question is, if those solutions were solid, would people use them? That's our big thesis: if they were solid, people would use them. But I just don't know what percentage of people would do that.

[00:27:38] - Ry

A big percentage or a small percentage?

[00:27:40] - Burak

Yeah, I'm not sure. But there's one interesting thing that comes to mind. Today Postgres supports the JSONB type, but it wasn't like that all the time. In the earlier days, if you wanted to store semi-structured data, we had an extension called hstore, which we still have, though it's not as commonly used as before. And this is one of the very powerful parts of the PostgreSQL extension framework: hstore defined the data type, and on top of that, it defined how to hash that data type and how to compare it. When you do that in Postgres, Postgres lets you create an index on the data type. So suddenly you were not only able to store semi-structured data, you could index it. And at that time, that was kind of rare even for NoSQL databases. So I think it's a bit funny, and it also shows the power of the PostgreSQL extension framework: suddenly you were able to do what a NoSQL database does, but better. I mean, not in all respects, since connection scaling was still a problem, but being able to index semi-structured data was a rare feature even for NoSQL databases.

[00:29:19] - Burak

But you had it in Postgres. I don't know, maybe for some of these other databases or other use cases, Postgres might have something unexpected that makes it better.
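
The pattern Burak describes, sketched in SQL; the table names are illustrative:

```sql
CREATE EXTENSION hstore;

CREATE TABLE docs (id serial PRIMARY KEY, attrs hstore);

-- The extension-defined type supports GIN indexing...
CREATE INDEX docs_attrs_idx ON docs USING gin (attrs);

-- ...which makes containment queries indexable
SELECT * FROM docs WHERE attrs @> 'color => red';

-- The modern JSONB equivalent of the same idea
CREATE TABLE docs_json (id serial PRIMARY KEY, body jsonb);
CREATE INDEX docs_json_idx ON docs_json USING gin (body);
SELECT * FROM docs_json WHERE body @> '{"color": "red"}';
```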

[00:29:36] - Ry

An unexpected advantage, yeah. It's the same with pgvector right now: the great thing about doing vector embeddings inside Postgres is that you don't have to move the data out of Postgres as part of the process. You can just add a column and keep the data where it is, whereas with an external vector database that's specific to that use case, you have to have data pipelines and all that kind of machinery. To me, one of the big benefits of keeping it all in Postgres is less data movement, and less data movement means data delays and all that kind of stuff go away. So yeah, I agree with you: there are a lot of unexpected benefits to keeping things together.
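
A minimal sketch of what Ry means, using pgvector; the `items` table, the dimension, and the query parameter are assumptions:

```sql
CREATE EXTENSION vector;

-- Keep embeddings next to the source rows instead of in a separate store
ALTER TABLE items ADD COLUMN embedding vector(1536);

-- Approximate nearest-neighbor index (HNSW, cosine distance)
CREATE INDEX ON items USING hnsw (embedding vector_cosine_ops);

-- Ten most similar rows to a query embedding passed in as $1
SELECT id FROM items ORDER BY embedding <=> $1 LIMIT 10;
```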

[00:30:25] - Burak

Yeah, and since Postgres provides pretty strong ACID guarantees, it allows you to build things on top of that. When you have ACID, you can be much more free to develop complex features. What I've realized is that while developing software, most of the time, as a developer, I'm trying to ensure that what I'm doing is atomic, or that what I'm doing is isolated, so it doesn't cause problems if something arrives in an unexpected order. But you have the chance to delegate all of this to Postgres, and I think that gives you quite a bit of an advantage.

[00:31:21] - Ry

Well, one of the things I love, too, is that the Postgres core team is quite separated from the commercial products, so it's a very stable chassis to build these things on top of. You really can't say that about a more captive open source project, say, what Kafka is to Confluent. They can move Kafka quickly if they need to help their commercial product, but that can introduce more instability. I just don't see the Postgres team doing anything very risky, which to me is a great counterbalance to developers trying to move fast and build crazy cool new things. It's nice to have that as a stability factor inside the product.

[00:32:17] - Burak

I think so, yeah. I guess historically the Postgres community was kind of divided into camps: some of them wanted to implement the new shiny thing, and some of them said, hey, let's just stabilize. And I believe JSONB support came from the part that wanted to innovate and try new things. At the beginning, I think they got some hesitation from the other camp, but in the end what they did proved to be very valuable, and now JSONB support is one of the most widely used features of Postgres. So I guess there's some sort of balance between trying risky things and staying stable.

[00:33:10] - Ry

If you look at the changelog for Postgres 16, they did a lot of things, almost more than anyone can keep in their head. That's what I'm saying: the good stuff gets through, it's just that the bar is high, and there are a lot of assurances that they didn't break any part of Postgres in the process. I really appreciate that, and it's one of the reasons I'm so excited to be building a product on top of it.

[00:33:41] - Burak

Yeah, well, at the same time it's sometimes a bit frustrating, because sometimes there's a feature you want to be merged in, and you might be the author, or you might be someone watching from the side who desperately needs that feature, and then you see there's a huge discussion going on, people can't convince each other, and it falls to the next cycle, which is one year later. So that's a bit frustrating, but I guess it's the cost of having a quite stable system.

[00:34:18] - Ry

It's the cost. And like I said, obviously I haven't been here for the 26 years of watching the mailing lists, and maybe I'm jumping in relatively late in the cycle, but I appreciate all the efforts and all the debates and all the fights that have happened there, because I think they've created such a great core engine. All right, well, where can listeners find you online? I imagine you're on X.

[00:34:50] - Burak

Yeah, they can find me on X and on LinkedIn. I guess those two would be the places, but mostly X. My alias is @BYucesoy, basically the first letter of my first name and then my last name.

[00:35:10] - Ry

Great. Well, we're excited to see the outcome as you guys ship Ubicloud. And yeah, we appreciate you joining us today.

[00:35:24] - Burak

Thanks a lot for having me here. It was a great talk, I enjoyed it a lot, and I'm definitely looking forward to the other episodes releasing so I can listen.

· 31 min read

In this episode, Ry, Regina, and Paul talk about geospatial development, the challenges of creating and maintaining an extension across multiple Postgres development cycles, and what they’re hoping for in the future of Postgres.

Watch below, or listen on Apple/Spotify (or your podcast platform of choice). Special thanks to Regina and Paul for joining us today!

Want to know more about something they mentioned? Here’s a starting point:

Did you enjoy the episode? Have ideas for someone else we should invite? Let us know your thoughts on X at @tembo_io or share them with the team in our Slack Community.

Transcript

[00:00:12] - Ry

Hi, I'm Ry Walker, founder of Tembo, a managed Postgres company. And today I have Regina Obe and Paul Ramsey, two who are working and have been working on PostGIS for quite some time. Welcome to the show, both of you.

[00:00:28] - Regina

Thanks, Ry.

[00:00:28] - Paul

Thanks, Ry. Nice to be here.

[00:00:32] - Ry

Great. So I think you've both been using Postgres for quite some time. Paul, when did you start using Postgres?

[00:00:42] - Paul

I started using Postgres, and I'm sure I can give you a version number, at I think 6.4, which puts me in like the '98, '99 era. I was using it as a consultant; my first career was in consulting, working for the provincial government. We were doing a big data analysis project. It was a geospatial project, but all the geospatial processing was done by an external piece of software, and we used Postgres as the backing store for what was a very long, like 20-day, compute cycle to spit out all the results for the entire province.

[00:01:17] - Ry

Nice. How about you, Regina?

[00:01:18] - Regina

I started out as a SQL Server person, so yeah, Paul dragged me into Postgres. My first introduction was via PostGIS in 2001, I think it was. So it was 7.1 or 7.3; I can't remember, somewhere between 7.1 and 7.3.

[00:01:39] - Paul

It was 7.1, because that was the first version of Postgres where you could actually do a geospatial extension.

[00:01:45] - Ry

Oh, nice. And do you regret leaving SQL Server?

[00:01:49] - Regina

Well, I still consult on SQL Server, and what's interesting is I think I've become a better SQL Server expert by knowing Postgres, because Postgres always introduced things first: the LEAD function, all the window functions, Postgres introduced those before SQL Server had them, and the CTEs and everything.

[00:02:18] - Ry

Yeah, got it. What's the state of geospatial in Microsoft SQL Server compared to Postgres, would you say? I don't know too much about that ecosystem, but I'm curious how much parity they've achieved.

[00:02:32] - Regina

Well, I think their geography type is still better than ours, but in terms of geometry support they're way behind, and that's what most people care about.

[00:02:42] - Ry

Awesome. Cool. All right, well, I'm going to go off my standard script here. How long have you guys been working on PostGIS? It's one of the oldest extensions. Or is it the oldest extension?

[00:02:59] - Paul

Yeah, well, when we started, extensions didn't exist; PostGIS predates the extension-as-a-package concept. But the idea of adding things to the database at runtime was there from the very beginning. That was a Stonebraker original, one of the things Postgres can do. PostGIS is from 2001; as Regina mentioned, she started using it within three months of the initial release. So, yeah, May of 2001. We're at 22 years.

[00:03:28] - Ry

So has it changed much, would you say? Has it been pretty steady progress toward what it is now, or were there any big milestones, big changes you remember as especially noteworthy?

[00:03:42] - Paul

It has not been one linear rise; it's definitely been punctuated equilibrium. And as these things go, most of the new features and capabilities happened right at the start, because we started with nothing, right? The first release had geospatial types, it had a geospatial index, it had eight functions, I think only one of which was analytical in any way. That was release 0.1 in 2001. I think we got to 0.8 in two years, by 2003, and that had the first full set of spatial functions, where you could really test all the relationships between geometries and do constructive geometry operations like buffering, intersections, and unions. You could have stopped there in many respects. If you look at what SQL Server has, that's kind of what they did: they had no spatial at all until 2008, and then they came out with their spatial extension, which was basically what they have now, because it was a capable, complete vector spatial database, full stop. But since then, we've kept adding stuff. Maybe Regina will take a tour of what she thinks the biggest things to happen since 0.8 were.
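
Those 0.8-era building blocks are still the heart of PostGIS today. A minimal sketch of them in SQL; the `parcels` table is hypothetical:

```sql
CREATE EXTENSION postgis;

CREATE TABLE parcels (
    id   serial PRIMARY KEY,
    geom geometry(Polygon, 4326)   -- the geospatial type
);

-- The geospatial index (GiST)
CREATE INDEX parcels_geom_idx ON parcels USING gist (geom);

-- Relationship test: which parcels touch each other?
SELECT a.id, b.id
FROM parcels a
JOIN parcels b ON ST_Intersects(a.geom, b.geom)
WHERE a.id < b.id;

-- Constructive operations: union and buffer
SELECT ST_Union(geom) FROM parcels;
SELECT ST_Buffer(geom, 0.001) FROM parcels;
```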

[00:05:08] - Regina

Yeah, so I think PostGIS just improved as Postgres improved. They introduced GiST, and all our indexes changed from the old R-tree to the GiST index. They improved aggregation in Postgres, which I think was a huge milestone for us, because a lot of the processing we do involves aggregating geometries together; we'd see something like a tenfold speed improvement in aggregation of geometries. And then the other things... I think CTEs were pretty useful. Nobody does queries in spatial without using a CTE anymore.

[00:06:01] - Ry

Those advances in Postgres: let's say they did 80% of the work toward a thing getting ten times faster, and you had to do 20%. What was the ratio of effort? Where were you getting great improvements virtually for free, without any effort on your side?

[00:06:21] - Regina

Yeah, I think we were getting great improvements for free, pretty much.

[00:06:25] - Ry

Oh, that's great.

[00:06:25] - Regina

And then there's the whole KNN thing, which they drastically improved starting in 9.2. I was making fun of Bruce, and Bruce said, I can't believe you have to do that, we need to fix it. And so after 9.2, KNN became real KNN instead of just box-based KNN. But yeah, in all those cases there wasn't much we needed to do to get the improvement.

[00:06:49] - Paul

Parallelism.

[00:06:50] - Regina

Oh, yeah, parallelism.

[00:06:51] - Paul

Another one of which is like a huge freebie. Like, look at that. We do parallel processing.

[00:06:57] - Regina

Yeah, the only thing we needed to do was tell people how to configure their Postgres confs to take advantage of parallelism.
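
The KNN feature Regina mentions shows up in queries as the `<->` distance operator, which lets the GiST index return nearest neighbors without a full scan. A sketch, reusing the hypothetical `parcels` table from above:

```sql
-- Ten parcels nearest to a query point, served from the GiST index
SELECT id
FROM parcels
ORDER BY geom <-> ST_SetSRID(ST_MakePoint(-122.33, 47.61), 4326)
LIMIT 10;
```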

[00:07:05] - Ry

I'm curious, what kind of challenges did you guys face early on building this?

[00:07:14] - Paul

The biggest thing? Well, I don't know, there are a lot of things. As a geospatial extension, our code isn't packed into one hunk of code that we control; we use a lot of library dependencies. So we end up having, and we still do have, a really complex build almost right out of the gate compared to other extensions. Other extensions are like, hey, we're a self-contained thing, you just type make and you're done. Whereas with us it was always, hey, guess what, you get to track down three or four other libraries in addition to the thing we ship, and make sure they're the right versions, and here's the configuration, and so on. So we end up with this really knotty configuration setup. And initially, if you were going to start using this stuff back in the day, step one was always to build it. Nothing was prepackaged, Postgres itself wasn't prepackaged, everything was from source. So it put a fairly steep entryway in for new users.

[00:08:14] - Regina

Yeah, although early on, we didn't have any other dependencies except Postgres, so it was much easier.

[00:08:19] - Paul

Well, GEOS after two years, and proj at about the same time. I mean, we had GEOS and proj almost from the start, and then started picking up format libraries after that.

[00:08:28] - Regina

Did we have proj?

[00:08:29] - Paul

Yeah, I don't remember about projection, because proj already existed, so tying it in was a pretty straightforward thing to do.

[00:08:35] - Regina

Okay, I don't remember that. Or maybe it was an optional dependency, so I just never built with it.

[00:08:43] - Paul

Everything was optional.

[00:08:44] - Ry

Tell me what GEOS is. I've seen that mentioned, but I didn't dive in. Is that powering other geospatial solutions besides PostGIS?

[00:08:55] - Paul

Yeah, it is. GEOS is an acronym, of course; it stands for Geometry Engine, Open Source. It provides the computational geometry underpinnings for a bunch of the key functions. I mentioned buffer: that's a GEOS function at the back end. Intersection is a GEOS function at the back end. The Boolean predicates, intersects, contains, within, those are all GEOS functions at the back end. Some of the fancier stuff we've added to GEOS is now exposed in PostGIS, so if you ask for Delaunay triangles or [...] polygons, you get those from GEOS. And because GEOS is this useful Swiss Army knife of computational geometry, it's used by other programs in the geospatial ecosystem. If you use the Google library, it backs onto GEOS for its geospatial operations. And most prominently, if you use the QGIS desktop GIS, you'll find that GEOS backstops its geospatial algorithms.

[00:09:54] - Ry

So did you guys refactor code into that, or did you just end up replacing some stuff you had built early on with that library later on?

[00:10:01] - Regina

Well, Paul started GEOS too.

[00:10:04] - Ry

Okay, got it. So it sounds like it was refactored into that.

[00:10:09] - Regina

Well, not so much refactored. I think it was always kind of a separate thing, right? It was always intended to do more.

[00:10:15] - Paul

And as an external library, there's actually a slight penalty to using it, because you've got to take your data out of the Postgres memory space and copy it over into the GEOS memory space to work with it. And that's just expensive enough that for simple things we actually kept the original native PostGIS-side implementations, for things like area or length; the most complex one we kept is distance. Yeah.

[00:10:45] - Ry

Are there any milestones coming in the future for PostGIS that you're looking forward to, or is it just kind of stability and continuous improvement?

[00:10:56] - Paul

You go, Regina.

[00:10:58] - Regina

Oh no, you're not going to ask me that. I think speed is always good, and my concern, I think, is mostly improving raster. And I'm looking forward to the TOAST API changes that are coming along, and how we could leverage those.

[00:11:16] - Paul

It's funny you bring up raster. Maybe we should talk about this when we have a little get-together afterward, because it's one of the things in front of me lately. The underlying infrastructure we have for handling rasters was built, well, around 2010, anyway. It was built in an era when the idea of, say, doing cloud-based raster processing was kind of not what would be done. It was built around the idea that you would store your own rasters locally, in your local area network, and increasingly, organizations just don't do that. They still want raster access, and while you can do remote raster access with what we have, it's kind of clunky; it's not optimized for that at all. I feel like a relook at what the raster use cases are, and a reevaluation of whether how we handle rasters is right, is due. I was looking at the API that Alibaba offers on their cloud databases and thinking, yeah, that's kind of an interesting way of tackling rasters.

[00:12:17] - Ry

Yeah, you're talking about Alibaba. Do they have a proprietary database, or is it Postgres?

[00:12:25] - Paul

They don't say. They're really cagey about it, so I'm not sure whether it's backed by Postgres or not, but their take on rasters is very much their own. I haven't seen anyone else go at it that way.

[00:12:35] - Regina

Oh, I haven't seen that; I should check it out. But yeah, that's one of the complaints people have, at least the clients I have: that out-of-db rasters are much slower. That could be improved.

[00:12:48] - Paul

Yeah, great.

[00:12:51] - Ry

I'm curious whether you're paying attention to the vector search stuff that's happening in the ML space. I've not studied how PostGIS or raster works, but how closely related is the math of that to the math of what you do? I'm sure you've paid a little bit of attention to that space. Is it wildly different, or remarkably similar, or neither?

[00:13:23] - Paul

Yeah, well, insofar as a 2D vector is the same as a [...] vector, at a conceptual level they're the same. But from a practicality point of view, the practicalities of handling super-high-dimensional stuff are just different. It's one of the first things you learn: even when we go to four dimensions, the indexing properties start to break down; the standard R-tree stuff just doesn't work as nicely. You don't have to get very far up into a complex dimensional space before that form of indexing just doesn't work; it's not doing what we need it to do. And you can really see that in how different the indexing approaches for ML vectors are compared to 2D vectors.

[00:14:10] - Ry

So high dimensionality just requires an entirely different mindset and solution set, I'm hearing you say.

[00:14:17] - Paul

Yes, totally.

[00:14:18] - Ry

Just curious whether it somehow scales into that or not. That's cool. Well, tell me, real quick: I'd love to learn a little more about the commercial products. How does PostGIS manifest commercially for both you and Regina?

[00:14:41] - Regina

For my side, it's mostly consulting. Yeah, I don't have any commercial things around.

[00:14:52] - Paul

I have a sideline in talking about and studying the economics of open source, and one of the things that's kind of obvious once you start looking at this stuff is that there's a project-size threshold before you start to see enough money around a project to support full-time maintainership, people whose jobs are mostly around the project. PostGIS is interesting in being one of the few Postgres extensions that has achieved that level. But even at that level, it's quite small. So you've got Regina, who has a good consulting business. I work for Crunchy Data, which is a professional open source support company, which is to say they sell support contracts to Fortune 100 companies, the US federal government, and increasingly a number of international organizations of similar size and scale, big organizations. They also have a software-as-a-service called Crunchy Bridge, which is basically database-in-the-cloud of the sort everyone's gotten used to. So in that respect, I'm kind of like Regina: I work for Crunchy because they value my expertise as a PostGIS committer and my ability to help their customers who deploy PostGIS.

[00:16:20] - Paul

So it's still very much skills for hire. No one has wrapped PostGIS itself up as a specific product.

[00:16:28] - Regina

Yeah, I mean, others have, it's just.

[00:16:30] - Paul

We haven't. And yeah, other members of the development team are also still sort of on the consulting bandwagon; that's how it's bundled.

[00:16:41] - Ry

I'm not familiar with anyone who's bundled it up as a product per se, who's done that?

[00:16:47] - Regina

I mean, it's not so much a product, but all the cloud providers have it, Amazon has it, Microsoft has it, as an installable extension.

[00:16:59] - Paul

From the point of view of ubiquity and market spread.

[00:17:03] - Regina

Yeah. CartoDB used to be the closest though. But do they still use Postgres or did they switch to something else?

[00:17:09] - Paul

Which DB?

[00:17:10] - Regina

CartoDB. Carto.

[00:17:12] - Paul

They still use Postgres, yeah. They haven't moved off. That would be a good example of a productization of PostGIS. Certainly in their earliest incarnation they had a software-as-a-service that did a very good job of letting you put data in and visualize it in a whole bunch of ways, and it exposed SQL as the language for customizing what you were seeing. It was all sitting on top of PostGIS, but it was marketed as CartoDB. So they productized around a software service that more or less made the database, not invisible, but made the actual brand of the database irrelevant.

[00:17:51] - Ry

Do you see really old versions of PostGIS surface? I'm sure you probably don't see 0.8 anymore, but how good are people at staying up on the latest, would you say?

[00:18:08] - Regina

I haven't seen any 1.5s recently. I think there might have been one.

[00:18:16] - Paul

Your standards are different from mine, Regina, because I'd freak out if someone brought me a 1.5. I'm shocked at how many 2.x installs are still in the wild. Let me count back: each point release is worth about a year, and we're at 3.4 now, so 3.0 is about five years back. So if someone shows up with a two-point-something, it's at least a five-year-old installation.

[00:18:35] - Regina

Yeah.

[00:18:35] - Ry

And they're probably five versions of Postgres old too, right?

[00:18:39] - Paul

Yeah, exactly.

[00:18:41] - Ry

What's the biggest jump you'll do? Will you take someone from, say, Postgres 11 to 15, or the PostGIS equivalent?

[00:18:48] - Regina

Yeah, because the latest wouldn't even run on that. Well, in theory, anybody on PostGIS 2...

[00:19:00] - Regina

should be able to go straight to 3.4 without any issue, as long as they upgrade their Postgres too. What happens to their development, their applications breaking: that's on them.

[00:19:15] - Ry

Well, it must be nice to be relying on Postgres. You can criticize various aspects of how Postgres is built if you'd like, but I think it's really great how stable, I want to say slow, the progress is, because it gives you a very stable and reliable chassis to build this on top of. I'm sure you guys agree with that.

[00:19:44] - Regina

Yeah, I think in terms of slowness, they're actually much faster than other relational databases.

[00:19:49] - Paul

Much, much faster in terms of how fast things change. Yeah, there's always that push; particularly the new SQL dialect stuff has come in pretty quick.

[00:19:58] - Regina

Yeah, because I remember talking to my SQL Server friends, and they were still waiting for the LEAD function, which Postgres had gotten like five years earlier. But yeah, I think it moves on the SQL standard a lot faster than the others, even faster than Oracle, though I don't have too many Oracle friends to ask.

[00:20:20] - Paul

Yeah. I'm frequently surprised by how much internal churn there is, because I always feel like, oh, we're the super-mature extension, we don't reach super deep into the core. We're not hooking into executor hooks or planner hooks or stuff like that. And yet there's always this medium-to-long list of things in our code that have to be tweaked at each major Postgres release for it to build and still run correctly.

[00:20:57] - Regina

Yeah, because I can't think of any release we've done where we didn't have to tweak for the in-development major release of Postgres. They change it enough that it always affects us.

[00:21:10] - Paul

Yeah.

[00:21:10] - Ry

So even if the outer shell seems pretty stable, the insides are changing, and you guys have some stuff poking down at least to the middle, if not the bottom. That's great. Like I said, to me it's a really perfect pace, because we get that annual innovation, and if something's really important, it'll get taken care of. I'm curious, are there things happening in Postgres core that you're excited about? Maybe in 17, or the future 18: anything in particular you're excited to see?

[00:21:53] - Regina

Well, bi-directional replication seems to be making its way in. I don't know if it's going to make it into 17, but it looks like they're putting in the hooks for it anyway, so that's kind of exciting.

[00:22:05] - Paul

Yeah, for me, that's what's exciting. I've been watching the decade-long crawl toward being able to do a replicated OLAP system where you get full pushdown to all the nodes. Every release, the postgres_fdw folks from Toshiba and other places add a couple more turns of stuff you can push down, which is exciting to me because we're that much closer to being able to do big multi-node queries. There are OLAP workloads and OLTP workloads for spatial databases, and really the bias is toward OLAP, for sure, in terms of how you see people using the database. I mean, they like transactionality, they like being able to manipulate their data, but when it comes time to ask what it's for, it's like, oh yeah, we run big analyses. So the ability to push stuff out to multiple nodes, as that gets more mature and more possible, becomes really exciting on the spatial side. I watch every little tick of the gears toward that endpoint and get very excited, and there's been more pushdown in the last couple of releases.

[00:23:22] - Paul

The last release had good stuff around parallelization in the planner and executor for partitions as well, which is a big deal, because the holy grail is that you've got your big table partitioned out across the remote nodes, each of your partitions is actually a foreign table, and when you run the query, the planner says, oh look, I can just ask all the partitions to run it simultaneously, gets back the results, assembles them, and says, here you go. We're getting close to that, which would be a big deal, because there's a whole bunch of workloads you could turn into a standard OLAP star schema: one big fact table, a bunch of dimensions, and you'd be able to do lots of really cool spatial stuff. Right now, not so much. I don't know if we'll ever get to the real holy grail, which is being able to ship data between nodes so that nodes can do things like joins across two big fact tables. That might never happen, I don't see anyone doing that, but that's the one that would have me tearing my clothes off and dancing in the street.
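
The shape of the setup Paul describes, sketched in SQL; the server name, table names, and the omitted user mapping are all placeholders:

```sql
CREATE EXTENSION postgres_fdw;

CREATE SERVER node1 FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'node1.internal', dbname 'warehouse');
-- (a CREATE USER MAPPING would also be needed; omitted here)

-- One big fact table, partitioned by range...
CREATE TABLE events (
    event_time timestamptz NOT NULL,
    payload    jsonb
) PARTITION BY RANGE (event_time);

-- ...where each partition is a foreign table on a remote node
CREATE FOREIGN TABLE events_2023
    PARTITION OF events
    FOR VALUES FROM ('2023-01-01') TO ('2024-01-01')
    SERVER node1;
```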

[00:24:34] - Ry

Yeah, that's interesting. I haven't followed the development of postgres_fdw. I used it recently, and it's pretty powerful at low scale, but you're talking about high scale, OLAP scale, optimization across nodes.

[00:24:53] - Regina

And Paul has this fantastic foreign data wrapper, a spatial foreign data wrapper, which can read how many formats? 800?

[00:25:03] - Paul

Yeah, no, there aren't that many formats in the world, but, I don't know, quite a few.

[00:25:08] - Ry

Geospatial formats, is that what you said? Or other types of formats?

[00:25:14] - Paul

Geospatial formats. But to an extent, geospatial formats are a category that also includes non-geospatial things, because all you have to do is not have the geospatial column, and then, what do you call it? Oh, it's just a format. So Excel can be a geospatial format, and it's something you can read. SQL Server too.

[00:25:32] - Ry

What's the name of that extension?

[00:25:34] - Paul

Oh, that extension is called OGR FDW: ogr_fdw.

[00:25:37] - Ry

OGR.

[00:25:38] - Paul

It stands for... well, OGR is not even worth trying to unpack. OGR is just the vector side of the GDAL library, which also has an acronym that's not worth unpacking anymore, because it's a 20-year-old acronym, but it refers to the library that it's binding. It's the OG library for vector formats. Yeah.
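
A minimal sketch of ogr_fdw in action; the server name and file path are placeholders:

```sql
CREATE EXTENSION ogr_fdw;

-- Point a server at any OGR-readable datasource, here a spreadsheet
CREATE SERVER spreadsheet_srv
    FOREIGN DATA WRAPPER ogr_fdw
    OPTIONS (datasource '/data/parcels.xlsx', format 'XLSX');

-- Let the FDW introspect the layers and create matching foreign tables
IMPORT FOREIGN SCHEMA ogr_all
    FROM SERVER spreadsheet_srv INTO public;
```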

[00:25:58] - Ry

Cool, yeah, I'll check that out. Kind of wrapping this up on core: if you had a magic wand and could add any feature (of course, if it were added, you'd immediately have work to do in PostGIS as well), what would your magic wand manifest in Postgres this weekend, if you could pick one thing?

[00:26:20] - Regina

Yeah, I can't think of any.

[00:26:21] - Paul

I know what Regina's is. Oh, come on. I'll tell you what yours is then, Regina, if you're not going to say.

[00:26:30] - Ry

You guys can answer for each other if you want.

[00:26:32] - Paul

It's handling extensions in a slightly different way. Or sorry, handling extension updates in a slightly different way. Extension versioning.

[00:26:40] - Regina

Oh, yeah. How did you read my mind? I completely forgot about it because I gave up hope on it.

[00:26:47] - Ry

Yeah, tell me about that.

[00:26:48] - Regina

So Sandro, who's also on the PostGIS team, has been working on an update to the extension machinery that would allow us to reduce our upgrade scripts to one file, because right now we ship like 500 files, which are all pretty much just symlinks to the same thing. It's a way to have the extension machinery understand that, hey, this script can be used to upgrade this version, this version, this version, and this version, instead of having to itemize every single version upgrade.

[00:27:30] - Paul

Yeah, the extension upgrade machinery is very clever, but it starts with an immediate assumption, which is that as a developer, you will manage your extension upgrades as incrementals between versions and that it will cleverly find the path through the incrementals from the version you have to the version you're going to, applying all the little slices on the way.

[00:27:55] - Regina

And it uses Dijkstra's algorithm for that.

[00:27:58] - Paul

Yeah, it's super clever. But even in the best-case scenario, where you'd already been doing that, it's probably not super ideal for any project where the development path isn't linear. So PostGIS has, I don't know, 25 or so minor releases, like X.Y, and then within those there's maybe five patch releases across each of those. So we'll have a whole bunch of parallel version trees, right? You're on 2.3.5 and you're going to go to 2.3.6, but then you might want to go to 2.4.8. And that means you have to have all these individual... even if you're doing things like one tiny step at a time, you would have all these individual little hops across the branches. If you had just one line, it kind of worked; you just sort of chained together this one little line, and you don't have very many files. When you have all these little branches, all of a sudden you need all these extra little hops across the branches. And it's made worse because of all our management of the SQL side of it, the SQL side of the extension, where you define the functions in SQL land and say, oh, it's over here in the dynamic library.

[00:29:24] - Paul

We've been managing that since the pre-extension world. And our way of managing it was to have a SQL file which can always be cleanly applied against any previous version. So it's quite long, because the SQL file has every single definition of every single thing in it. So how do you handle the incremental step from 1.2 to 1.3? Well, you have one copy for 1.3, one copy for the upgrade as well. So every little upgrade has a full copy of that fairly large SQL file. On Unix systems now we just ship symlinks instead of shipping the whole file. But you end up with just a huge pile.

[00:30:05] - Regina

Yeah, actually we changed from that to just a file that has nothing in it, right? To just an ANY version, which kind...

[00:30:13] - Paul

Of walks the chain and eventually arrives at the full one.

[00:30:17] - Regina

So now that's the same across versions. But I think ours is more complicated too, because for each version we support multiple versions of Postgres, and we also enable new features: if you're on, like, Postgres 12 you get something; if you're on 11 you don't get that something.

[00:30:40] - Paul

Certainly something which is not contemplated by the original designers is our file. Our input file actually goes through... we put it through the C preprocessor before we give it to Postgres, because we have a whole bunch of #ifdefs against what Postgres version you're on, living inside the SQL file, that have to be preprocessed before it's usable.
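
A hedged sketch of what Paul describes, with an invented function name; the file is not valid SQL until the C preprocessor has stripped the directives:

```sql
-- Illustrative template, run through cpp before installation:
#if POSTGRES_VERSION_NUM >= 120000
CREATE OR REPLACE FUNCTION uses_a_pg12_feature() RETURNS integer
    AS $$ SELECT 1 $$ LANGUAGE sql PARALLEL SAFE;
#endif
```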

[00:31:02] - Ry

Yeah, I understand. Maybe you're saying the current design is just naive, thinking that you're not going to try to support multiple versions of Postgres with one version of the extension, and there's no home for that information, I guess, for what the range is, to some degree.

[00:31:22] - Paul

Yeah, I mean, although the extension framework does contemplate the idea of versioned extensions, again, it doesn't really contemplate them as anything except for a linear chain. And once you have a more complex situation than that, it's kind of hard. Like, we for a very long time supported being able to run different versions of PostGIS inside the same Postgres cluster. We still do actually support that, but it's a feature that it seems like mostly only developers use. So it's optional now, and we default to just one version of PostGIS for each Postgres cluster. But that functionality was always there. But the extension facility did not grok that.

[00:32:05] - Regina

Yeah, and packagers did not grok that either. So they always, always ship one.

[00:32:10] - Ry

Great. I'm curious, trying to wrap up here a little. I realize now I've kept you here for quite some time, but do either of you listen to podcasts very much?

[00:32:20] - Paul

I do all the time. It's my gym thing. I go down to the garage gym and that's what keeps me from going crazy.

[00:32:28] - Ry

Yeah? Give me some of your favorite podcasts.

[00:32:31] - Paul

I tend to go on the current affairs side, so I listen to the Ezra Klein show from New York Times a lot and Odd Lots from Bloomberg, a little bit of financial news.

[00:32:40] - Regina

Yeah, you can tell he's the son of a politician.

[00:32:42] - Paul

It's interesting stuff.

[00:32:44] - Ry

Yeah. How about you, Regina?

[00:32:47] - Regina

No, I'm not a podcast person. I go swimming. But yeah, I can't really hook up a podcast.

[00:32:55] - Ry

Yeah, underwater is probably possible, but not super comfortable. Right, great. All right, well, so where can listeners find you online? Maybe share your websites, or Twitter or Mastodon handles.

[00:33:13] - Paul

On the site formerly known as Twitter, I'm PWRamsey. I'm also PWRamsey at Mastodon Social, and in the blog world I'm at cleverelephant.ca.

[00:33:25] - Regina

Yeah, I have... how many blogs do I have? I have BostonGIS.com, PostgresOnline.com. Twitter is just ReginaObe, I think. And my website, ParagonCorporation.com. I guess those are it. Oh, and my book site, PostGIS.us.

[00:33:47] - Paul

Yeah, the book site.

[00:33:49] - Ry

What's that? I always mispronounce PostGIS, by the way; I've got to learn to say it right. But do you have a book around it, or have you written many books?

[00:33:57] - Regina

Oh, yeah. So I wrote "PostGIS in Action." I'm working on "pgRouting." I'm also supposedly working on a Postgres book, both of which I'm very behind on. And I did "SQL in a Nutshell." And let's see, what else... is that it? Oh, and "PostgreSQL: Up and Running." That's a pretty popular book. Surprisingly popular.

[00:34:25] - Ry

Yeah, it's sitting right there. I own that one.

[00:34:28] - Regina

Oh, really?

[00:34:29] - Ry

Thanks for writing it.

[00:34:29] - Regina

Oh, yeah.

[00:34:31] - Ry

Thank you both for joining. I appreciate you and all the work you've done for Postgres and PostGIS, and appreciate having you on the show.

[00:34:42] - Paul

Thanks for having us, Ry.

[00:34:44] - Regina

Thanks, Ry.

[00:34:44] - Paul

Thanks. Bye.

· 24 min read

In this episode, Ry and Alexander talk about OrioleDB (and the challenge of fighting bloat), fuzzy and vector search, and the challenges behind database management. Watch below, or listen on Apple/Spotify (or your podcast platform of choice). Special thanks to Alexander for joining us today!

Want to know more about something they mentioned? Here’s a starting point:

Did you enjoy the episode? Have ideas for someone else we should invite? Let us know your thoughts on X at @tembo_io or share them with the team in our Slack Community.

Transcript

[00:00:12] - Ry

Hi, I'm Ry Walker, founder of Tembo, a managed Postgres company. And today I have Alexander Korotkov from OrioleDB as my guest. Alexander, welcome to the show.

[00:00:24] - Alexander

Hello, thank you very much for inviting me. It's a pleasure to participate.

[00:00:32] - Ry

Awesome. Maybe I'd like to start by having you give us a quick background, like where did you grow up and what were you doing before you started working on Postgres?

[00:00:41] - Alexander

Even before Postgres?

[00:00:43] - Ry

Yeah, way back. Were you only like ten years old then?

[00:00:49] - Alexander

Yes, I actually started from web development. And for web development, Postgres became my favorite database management system. At that time there were basically two popular open source database management systems, MySQL and Postgres, and Postgres's behavior looked way more consistent to me. And this is why Postgres became my favorite. Another thing which attracted my attention was generalized indexing in Postgres: even at that time, Postgres had GiST and GIN indexes, which you could apply to different data types and different search operators. And that was very interesting for me. I also studied for a PhD at university, and I was interested in fuzzy search and features like this. And in Postgres I found the fuzzystrmatch contrib module. That module contained the levenshtein function, which defines the editing distance between two strings: the number of editing operations. And I found that it doesn't work with multibyte encodings. It just compared the two strings byte per byte. And my first patch was to just fix this bug. I actually didn't know if I could produce the fix; I just downloaded the sources. Thankfully, by that time compiling Postgres wasn't a problem anymore. So I just had Linux on my desktop and just cloned it.

[00:03:08] - Alexander

Probably at that time it was a CVS repository. I cloned it, and it wasn't difficult to locate the place in the source code responsible for this function. I just had to find which functions are responsible for getting the length of a multibyte string in Postgres, and so on. The work was very easy, and I submitted my first patch to Postgres. Then Robert Haas picked my patch up to review and commit. And at that time I learned that this community process is not so easy, because we had quite a long thread studying possible performance regressions and so on, and I rewrote this small patch many times. But finally we found a way where the patch not only fixes the problem with multibyte encodings, but also doesn't produce noticeable overhead when it's a single-byte encoding, or the strings just use single-byte characters. In both of these cases the overhead was negligible. And then Robert committed my patch.

[00:04:40] - Ry

That's amazing. Yeah, it's interesting. How many lines of code was the patch, I wonder?

[00:04:46] - Alexander

I don't remember exactly. You could find it on the list, but this patch should be in the dozens of lines, probably 20 or something like this, really small. But also, with this fix, I started to work on improving this function, improving the performance. Because if you use the levenshtein function, you are typically looking for strings which are close to your string, right? And the thing is, you basically don't care about large distances. For instance, you are looking for strings with editing distance three or less. And that means that if it would be four, you don't care how much bigger that distance is. And if you need only this, then you could easily... not easily, but you could significantly accelerate the Levenshtein calculation, many times over. And this functionality took even more work from me and Robert. But you can use it: it's levenshtein_less_equal. Nice function.
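
Both functions ship in the fuzzystrmatch contrib module he mentioned; a quick illustration:

```sql
CREATE EXTENSION IF NOT EXISTS fuzzystrmatch;

-- Full edit distance between two strings:
SELECT levenshtein('kitten', 'sitting');                -- 3

-- Bounded variant: once the distance exceeds the limit it stops
-- counting and just returns some larger value, which is much faster
-- when you only care about close matches:
SELECT levenshtein_less_equal('kitten', 'sitting', 2);  -- a value > 2
```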

[00:06:09] - Ry

That's great. Yeah. I also came to Postgres from being a web developer, and I've realized in recent weeks, actually, that the reason why I went to Postgres and not MySQL is primarily that I was using Ruby on Rails, which was kind of Postgres-first, and I really didn't like PHP. Hopefully you weren't a PHP developer, but the LAMP stack had MySQL and PHP together, and I always just thought, I don't want to go anywhere near anything near PHP. That's really not a great reason, but it's just a fact.

[00:06:48] - Alexander

Yes, probably one of the great features I discovered in Postgres in the early days was DDL transactions. So if you need to do a database schema migration, you can wrap all the changes into a transaction, and if something goes wrong you can roll back. Yes, that's just amazing. And it's interesting that even old and mature database management systems like Oracle or MS SQL Server lack this functionality. I'm not sure about the present day, but at that time they definitely were lacking it, and this was very useful.
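
A tiny example of the transactional DDL Alexander is praising; the table name is hypothetical:

```sql
BEGIN;
ALTER TABLE accounts ADD COLUMN status text NOT NULL DEFAULT 'active';
CREATE INDEX accounts_status_idx ON accounts (status);
-- If either statement had failed, ROLLBACK would leave the schema
-- exactly as it was; nothing is ever half-applied.
COMMIT;
```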

[00:07:36] - Ry

Yeah, that's great. So you built obviously you started with that first patch and worked on other fuzzy search stuff. Have you worked on any Postgres extensions?

[00:07:52] - Alexander

Yes, I continued to work on Postgres. Then I got to know Oleg and Teodor, who were the Russian contributors to Postgres, and I got familiar with their work. Some of their work I already knew: GiST and GIN. But from their other work, it was interesting to me that we could also accelerate searches for LIKE patterns. Not just exact matches, but imagine you are looking for something that could be in the middle of the string. And there was at that time the pg_trgm module, but it only supported trigram similarity search using indexes. And I found that if you decompose a string into trigrams, it's pretty easy to implement LIKE search. So you could just extract trigrams from the LIKE pattern and search for them. And since trigrams are extracted all over the string, you can find the substring anywhere. At that time, I would say, my feeling was just great. It's just amazing: with quite a small patch you could teach the database server some amazing, advanced functionality. And I got even more encouragement when... I'd be sorry if I'm confusing the names... Hubert, the blogger from Poland who was posting the "waiting for" posts to Planet PostgreSQL, also posted "waiting for" this feature, LIKE search for pg_trgm.
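
The feature he's describing is pg_trgm's trigram acceleration of LIKE/ILIKE; a minimal sketch with a made-up table:

```sql
CREATE EXTENSION IF NOT EXISTS pg_trgm;

CREATE TABLE docs (body text);
CREATE INDEX docs_body_trgm_idx ON docs USING gin (body gin_trgm_ops);

-- The planner can now satisfy unanchored substring searches from the
-- trigram index instead of a sequential scan:
SELECT * FROM docs WHERE body LIKE '%postgres%';
```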

[00:10:15] - Alexander

This was one of my first patches. Another thing: during my PhD research I also researched split algorithms for R-trees, and I found a more advanced algorithm, and I pushed this algorithm into the Postgres core and into PostGIS. That took time, yes, because communities are quite conservative, but it was also good.

[00:11:00] - Ry

Yeah, great. Well, I want to talk about OrioleDB, but before we do that: are you paying attention to the whole vector search thing, since you spent so much time on Postgres search features? I'm curious, are you kind of tracking what's going on with vector search?

[00:11:20] - Alexander

Yes. That's interesting. That's an interesting subject. Yes. While I researched the split algorithms, I also experimented with the cube contrib module, which supported basically multidimensional rectangles with different numbers of dimensions, up to one hundred dimensions. And what I found is that if you have low dimensionality, two, three, four or five, then when you produce a split, your split will be good for one dimension but will almost not differentiate the other dimensions. You can imagine this: if you have a two-dimensional space filled with points and you need to split them into two rectangles, there are just two good ways to do this split, vertically or horizontally. Any other way, your rectangles would have a huge overlap. But things change when your dimensionality increases. There is the so-called Guttman quadratic split algorithm, and this algorithm actually does clustering. It just tries to divide the set of points into two clusters. And it doesn't work well for low dimensionality, but for a high-dimensionality space it becomes better than if you just pick one axis and split in that dimension. And that was interesting for me, and I got familiar with the curse of dimensionality: if you have a low-dimensionality space, you can have some guaranteed and quite good search performance.

[00:13:36] - Alexander

Right? So imagine: the best example is a one-dimensional space. You can index this with just a B-tree, and you have guaranteed O(log n) time for a point search request, right? But when you have high dimensionality, that becomes a problem, and uniform random data, which you could handle very well with low dimensionality, becomes almost unindexable in high dimensionality. So if you have vectors with a thousand dimensions and you need to search for similarity, then search in uniform data would be almost impossible to accelerate, because you can't identify which particular dimension could give you a match. You just get the cumulative result over all dimensions, and it's almost impossible to accelerate. I have participated a bit in the improvement of pgvector, and with some developers from Supabase we really found that indexing methods, if they are applied to uniform data, give nothing. So indexing methods, when you have a lot of dimensions, are based on some clustering. The idea is to find, how to say, the non-uniformity in the distribution, and then exploit it. And this is how indexing works. There are different interesting methods for indexing multidimensional spaces, but I think the favorite is the HNSW method.

[00:16:05] - Alexander

And I read about this method in the scientific paper way before this AI hype, ten years ago. And it was interesting how far ahead of all the others this method was. So it speaks for itself.
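
For reference, HNSW indexing is available in pgvector (0.5.0 and later); a minimal sketch with made-up data:

```sql
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE items (id bigserial PRIMARY KEY, embedding vector(3));
INSERT INTO items (embedding) VALUES ('[1,1,1]'), ('[2,2,2]');

-- Build an HNSW index over the embeddings:
CREATE INDEX ON items USING hnsw (embedding vector_l2_ops);

-- Nearest neighbors by Euclidean distance:
SELECT id FROM items ORDER BY embedding <-> '[1,1,2]' LIMIT 5;
```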

[00:16:26] - Ry

It seems like vector search is just on the exact edge of search and the LLM AI stuff; like you said, it's kind of blending both sides. Well, that's cool. So I wanted to chat about OrioleDB. Are you trying to spend 100% of your time on OrioleDB? It's probably hard to... I mean, it sounds like there are lots of great things happening in Postgres to distract you from your own work.

[00:16:57] - Alexander

The thing is, vector search definitely distracts me. And with this AI hype, it's hard not to be... yeah.

[00:17:06] - Ry

Yeah, I know. I would just love to hear the top three or four things you're trying to accomplish with Oriole, maybe, and maybe give us a progress report.

[00:17:20] - Alexander

Yes. But before this, I'd like to mention about how I use AI in my work.

[00:17:26] - Ry

Oh, yeah, sure.

[00:17:27] - Alexander

So I have written the blog post "No more vacuum, no more bloat." You may be aware it was at the top of the news. And interestingly, I mostly generated it with ChatGPT. So I just wrote a short item list and asked ChatGPT to write me a blog post. Then I corrected it a little bit, then I made some graphs, added the graphs, asked ChatGPT to add another paragraph to go with them, and then it was done. And I was curious... I expected that I would be found out and blamed for this. But actually the comments I got were, oh, this blog post is so well written.

[00:18:23] - Ry

I know, it's funny. I agree with you. I'll take just, like, notes. For example, here I've got a list of 20 questions that I'm trying to ask you during this interview, possibly ask you. I guarantee you, if I sent this list of 20 questions to ChatGPT and said, hey, make these questions better, it would crush it, give me all kinds of much better questions. But anyway, yeah, I agree. People who aren't using it are really missing out on a great assistant. All right, so back to Oriole. What are the top three or four...

[00:19:02] - Alexander

Things. I'd like to first say something about what brought me to Oriole. When I studied Postgres, it was just like magic how MVCC works. You can run multiple sessions in parallel, and each session will have its own snapshot of the data, and that was just amazing. And I was very interested in what is under the hood. How this works from the user side was perfect, but I always wondered how it is implemented internally. Because in Postgres, when you're doing an update, you just have to mark the old tuple and insert the new tuple into the heap. And if a HOT update is not applicable, then you also have to insert all the index tuples, even if the indexed values are not updated. And I wondered if we could do this better. And I think I had these background thoughts for years. So I studied how MVCC is implemented in MySQL, and then how it's implemented in Oracle. And it was very interesting for me to learn how it's implemented in Oracle, because I had heard that Oracle has block-level undo. Yes, and then I thought: how could it work on a block level?

[00:20:54] - Alexander

Because if two transactions modify the same tree page and this page gets split, and then one of the transactions gets rolled back, then it's not a linear list of log records which you could just apply one after another. Because if you need to roll back some change which happened before the page split, you need to do some adjustments to it. And then I understood that I needed to go deeper and learn how it works, and then I learned more things, how the write-ahead log works and so on. And I think in 2017 I started to work on the design of my own storage, which could work around most of the shortcomings which I see in the Postgres engine. For sure there are no plain wins; in some situations this engine works worse. But I would just like to design some new trade-off which could be better in the typical situation. And I can highlight some things in OrioleDB. So the first thing it is fighting is bloat, and in order to eliminate bloat, it has an undo log. So if you update a row, you just have to add a new version of the row into the undo chain, and you only need to update indexes if their values are updated.

[00:23:03] - Alexander

And also, in OrioleDB, indexes are versioned trees as well. I had heard that it is so in Oracle, and I found that this is a very attractive idea. Thanks to that, index-only scan becomes very simple, because your secondary index already contains all the information and you don't need to look into any secondary structures. Because I have heard that in Postgres, index-only scans are amazing until you run into some problem: if you have some write-intensive workloads and so on, you might have a significant part of the visibility map zeroed, and your query could get into trouble. Yes, that can happen. And this is why I found versioned secondary indexes an attractive idea. And OrioleDB has a mixed undo log containing both row-level and block-level records, and block-level records allow handling changes like eliminating the dead tuples of a page. So, for instance, your page contains some tuples which are deleted but still visible to some transaction. Using the block-level undo log, you can issue a new version of this page, get rid of all dead rows, and reclaim their space. But the transactions which need the old version can traverse the page-level undo log and find the tuples that they need. And another way to eliminate bloat is automatic page merging.

[00:25:12] - Alexander

Anyway, even if you have an undo log, it can happen that your workload is such that you have a lot of sparse pages: a lot of data was inserted, but then the data was deleted. And OrioleDB supports automatic merging of sparse pages to eliminate bloat. Okay, and you had some questions.
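
Trying the engine Alexander describes looks roughly like this, assuming a Postgres build with the orioledb extension installed; the table is hypothetical:

```sql
CREATE EXTENSION IF NOT EXISTS orioledb;

CREATE TABLE accounts (
    id      bigint PRIMARY KEY,
    balance numeric NOT NULL
) USING orioledb;  -- route storage through OrioleDB's table access method

-- Updates now add versions to the undo chain rather than leaving dead
-- heap tuples behind for VACUUM:
UPDATE accounts SET balance = balance + 10 WHERE id = 1;
```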

[00:25:34] - Ry

Yeah, I was just remembering that there was a blog post or a presentation where you... here are the ten worst things about Postgres. I don't know the exact phrasing you chose, but yeah, basically, here are ten problems with Postgres.

[00:25:52] - Alexander

That was a blog post by Rick Branson, I hope I spelled the name correctly. And yes, I found that it was a quite popular blog post. The most popular blog post is definitely the Uber blog post about why they moved from Postgres to MySQL. But this blog post from Rick was very interesting for me, because the issues he highlighted fit very well with my vision and with the things which I'm going to improve with OrioleDB.

[00:26:36] - Ry

You'll probably never be done with OrioleDB, because products are never finished, but how has your progress been through your initial target list of things you wanted to accomplish with Oriole? Are you making good progress through that? Give us a progress report.

[00:26:53] - Alexander

Yes, I'm making good progress, but currently the main target is not to add new features or handle more things; the current target is stability, to get more users. It doesn't work to have an amazing product if nobody uses it. And I would say that the database world is quite conservative, because obviously people don't want to lose their data. And this is why, before using some new database technology, you need to ensure that it's really stable and mature. And this is the main challenge for a new database management system or a new storage engine, especially when it's OLTP. And OrioleDB currently mainly targets OLTP, because OLTP is typically the source of your data, right? It's where your business operations happen. OLAP, for instance, can be less critical: you pull your data from the OLTP system and put it into OLAP for analysis, and if it disappears in the OLAP system you can just repeat the procedure. But that doesn't work for OLTP, which is the initial source of the data. So this is the main difficulty, I think. But we have already done very good work with the team, very good work on eliminating the bugs, but we definitely need more beta testers.

[00:28:46] - Ry

Yeah, I mean, it's a big challenge. It's similar for us as well: you have to create enough differentiation to get people to want to try it. But you can't go too far away; like you're saying, the further the distance is from what they consider stable, the riskier it is. But if you're too close to the same thing, there's not enough value to do the switch.

[00:29:12] - Alexander

Right.

[00:29:12] - Ry

So it's a tricky catch 22 that you have to do something a little bit dangerous to get the attention. If it's too close, then people will be like, if it's 10% faster, who cares? Right?

[00:29:28] - Alexander

Yes, exactly. This is why I am highlighting the cases where OrioleDB is several times faster, or even dozens of times.

[00:29:42] - Ry

Have you now released kind of a stable candidate? You said you'd need more testers. Is that kind of your stage now?

[00:29:51] - Alexander

Yes. We are currently discussing with our team and advisors when we can mark a release candidate or the release. At some point, this is just, how to say... at some point, this is a decision of will. There is no strict procedure you can go through and say, okay, your product now should be beta, or your product now should be generally available. You just need to decide for yourself and make this decision weighing all the risks.

[00:30:30] - Ry

Yeah, it was funny. I'll tell you a quick story about that. With my previous company, Astronomer, we were 0.1, 0.2, 0.3... I think we got to like 0.17. And there was nothing big happening in any release that would cause us to say, oh, this should be the 1.0. And then we were also thinking, and this is me thinking this, because I was like, let's just make the next one 1.0, goddamn it; but let's save the 1.0 for a big marketing push. I don't know that they ever did it. We waited so long that we never had a 1.0 of that platform. By the time we got to it, it was like, oh, let's rewrite it. I think it's a tricky thing to ship, but I'm a big fan of shipping early and often. Just mark your thing as an RC candidate. It doesn't matter. It will attract more attention with that RC than with that zero-dot.

[00:31:35] - Alexander

Yes. Actually, the number in the version is marketing. You can find it in the shop: the price for the goods could be $20, but you go away, you come back, and now it's $30 crossed out to $25. One piece of advice I heard about version numbering was: if you have a new product, never number your version 1.0.0; nobody will trust that it's stable. You should number it something like 1.3.11.

[00:32:21] - Ry

Yeah, well, I'm kind of a big fan of CalVer, where the version of your software tracks the calendar: it's 23-dot-whatever.

[00:32:35] - Ry

And here's our 6th release in 2023. And that kind of gets rid of all the traditional SemVer stuff. But again, that's kind of hard to do as well. Well, I had a bunch of extra little questions I want to ask you, just to get to know you a little better. If you had a magic wand and you could add any feature to Postgres, and tomorrow morning we wake up and it's there, what would it be? What's the most important thing, or what would you be most excited to see?

[00:33:06] - Alexander

My answer... it could probably sound selfish, but I have a patch set to improve the Postgres table access method API. This patch set definitely needs some work, and if I had a magic wand, I would like this work to be just done, and everything perfect and in the Postgres core. And that would make OrioleDB a pure extension. But not only that: I think it could clear the path for more table access method implementations.
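
The SQL-level hook for the API he's talking about already exists; a common demonstration is to register the built-in heap handler under a second name:

```sql
-- Register a table access method (here reusing the built-in handler):
CREATE ACCESS METHOD heap2 TYPE TABLE HANDLER heap_tableam_handler;

-- Tables can then opt into it:
CREATE TABLE t (id int) USING heap2;
```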

[00:33:48] - Ry

Got it. That's good, that's a great one.

[00:33:50] - Alexander

And one more thing, if you have a few minutes: Postgres extensibility is probably the idea which comes from Postgres's early design and remains Postgres's differentiating feature till now. Because other ideas from the original Postgres design sound weird today, and it's not...

[00:34:19] - Ry

There, but extensibility is... honestly, I always just thought MySQL and Postgres were just two different... Yeah, I didn't realize how much more extensible Postgres at least tries to be and has been, and how many extensions exist for it. Once you start looking at that list, it's pretty amazing, all the work that people have done to extend it. Is there anything you would say about Postgres that almost nobody agrees with you about? Do you have an opinion that's controversial, or that it's hard for people to agree with you about?

[00:35:03] - Alexander

Probably I don't know of such an opinion. There are opinions where we have disagreement in the community, but I can't remember an opinion which leaves me alone with it. So probably not. Yeah, there are opinions where I appear to be in a minority, but not more.

[00:35:28] - Ry

Yeah, a lot of times that's just because you don't understand everything yet, right? People come in with a hot take and then the mailing list will educate them on other factors.

[00:35:40] - Alexander

Yes, there are disagreements, because we write patches for Postgres and we understand these patches as improvements. But actually there is never a pure win. If you achieve something, you always lose something. That might not be obvious, but, I don't know... if you add too many SQL commands, your parser gets complicated, the parser state machine cannot fit into the processor cache and gets slower, and your source code also becomes bigger and harder to understand, and so on. And finally, we are just a group of people which needs to make some decision of will, of where to go. No way is a clear win. We try our best not to discourage existing users, but even with the best efforts, there are some regressions for sure. At the end of the day, we need to negotiate and make a decision.

[00:36:54] - Ry

Where can people find you online? And in particular, what's the best way for them to get involved with testing OrioleDB?

[00:37:04] - Alexander

Yes, any person can reach me via email. And for testing Oriole, you just go to GitHub, download and compile the sources, or, even easier, go to Docker Hub and get the Docker image, and just go experiment with your data and your workload, and raise an issue or a discussion and share your experience.

[00:37:32] - Ry

And refresh us one last time: what are the use cases where Oriole is going to outperform vanilla Postgres most dramatically, I guess?

[00:37:43] - Alexander

So the situations are: when you have a lot of updates, a very update-intensive workload, then OrioleDB can outperform thanks to the undo log. Another case is huge multi-core machines: OrioleDB eliminates a lot of bottlenecks and can perform several times faster. And also, OrioleDB implements a row-level write-ahead log, so if you have a huge write-ahead log stream and you need geographical replication, OrioleDB can also help a lot. Awesome.

[00:38:31] - Ry

Okay, great. Thank you. Yeah. If there's anything else you want to share, feel free, but otherwise it was great chatting with you.

[00:38:40] - Alexander

No more from me. Thank you very much for inviting me. It's always a very lovely and friendly talk with you.

· 17 min read

In this episode, Ry and Pavlo talk about pg_timetable, about the value and place of risk in the ecosystem, and about building for the Postgres core. If you haven’t seen or listened to it yet, you can watch below, or listen on Apple/Spotify (or your podcast platform of choice). Special thanks to Pavlo for joining us today!

Want to know more about something they mentioned? Here’s a starting point:

Did you enjoy the episode? Have ideas for someone else we should invite? Let us know your thoughts on X at @tembo_io or share them with the team in our Slack Community.

Transcript

[00:00:12] - Ry

Hello, welcome to Hacking Postgres. I'm Ry Walker, founder of Tembo, a managed Postgres company. And today I have Pavlo Golub. Did I say that right, Pavlo? Yeah, you can fix it. From Cybertec, as my guest. So Pavlo, welcome to the show.

[00:00:30] - Pavlo

Yes, thank you for having me. Hi there.

[00:00:32] - Ry

Hi there. Yeah, so maybe you could start by giving us a quick background, like where you grew up and what were you doing before you started working on Postgres.

[00:00:41] - Pavlo

Okay, so I was born in the Soviet Union, so I'm a little bit old. But yeah, I lived in Ukraine, and the first time I saw Postgres was when I was in my high school; it was like the fourth year of high school, and at that time it wasn't very popular. Everybody was talking about MySQL, like the only open source database. And yeah, at that time I saw PostgreSQL, I tried to install it and run it. It was hard for me to do that, because at that time the version was like 7.4 something, there was no installer for Windows, and we were mostly, like, Windows guys at that time. So yeah, I needed to install Linux and then install Postgres. Yeah, I managed to do that, but I wasn't happy about it. The whole process felt cumbersome, or something like that. So I left it for like a couple of years, and then at my first job in IT I was a support guy, and it was directly connected with Postgres. So I started my IT career with Postgres, and I'm happy that I had this chance.

[00:02:36] - Ry

Yeah, that's crazy. I don't personally remember when I first started using Postgres. I think it could have just been when I started using Ruby on Rails, I don't know. But it's funny, I don't remember when I first used Linux or a lot of things, but it's great that you have that distinct memory. You started in support, you've built some stuff. Now on top of it, tell me about the body of work you've done on Postgres or with Postgres. I'm curious to know what you've built.

[00:03:13] - Pavlo

So right now at Cybertec I'm a consultant; my position is called consultant, but I prefer to think of myself as a developer. I'm not an admin, I'm not a DBA, I'm a developer. So I'm mostly developing things for Postgres. I had a couple of small patches to the source, but I don't consider them major things. So I did some client libraries for Delphi, if somebody still knows what that is, and later I switched to C, Go, Python, that kind of stuff. And right now I mostly develop in Go. And at the moment we have a couple of utilities made at Cybertec, and I'm in charge of them. One of them is pg_timetable, which is a scheduler for PostgreSQL, and another is pgwatch, which is a monitoring solution for Postgres.

[00:04:30] - Ry

Yeah. Yeah. What caught my attention was pg_timetable, because I come from a company... my previous company was built on Apache Airflow. So the idea of doing DAGs, jobs that are chained off of other jobs... to me, there's no other way. Trying to schedule jobs and hoping that they finish by the time the next, dependent one starts is something that people used to do, and they should never, ever do anymore. But I'm sure it still happens, right? So certainly if people are using pg_cron: no way to chain them, as far as I can see.

[00:05:11] - Pavlo

We had a client, and we had a process lasting like 14, 15 hours, consisting of many steps which were dependent in some complicated ways. So we used a Makefile to split them into jobs and to specify that this stage must be executed after that and that. And that worked fine. But it's cool to have a special utility to run those kinds of jobs, or chains, whatever.
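
A minimal sketch of scheduling with pg_timetable, assuming the `timetable.add_job` helper available in recent (v4+) versions; the job name and command are illustrative:

```sql
-- One-liner job, cron-style schedule:
SELECT timetable.add_job('nightly-vacuum', '0 3 * * *', 'VACUUM ANALYZE');

-- Multi-step chains are modeled as ordered timetable.task rows within a
-- timetable.chain, so a step starts only after the previous one
-- finishes, unlike independent cron entries.
```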

[00:05:56] - Ry

Yeah. Well, one of the things I think about a lot is how the Postgres server is doing lots of jobs, right? I don't know how many processes it's dealing with, someone with better internal knowledge would, but I know it's certainly vacuuming, it's certainly processing queries. It's doing a lot of things, right? And then the question is... I'm a fan of SRP, the single-responsibility principle, in programming in general, but I also like monoliths too. We all have conflicting ideas in our heads. And having a queue managed inside Postgres, you're kind of straddling, right? Because you have a container next to it. So some of the work is happening over there, but some of it is inside of Postgres too. Why not have it all be inside? What are you thinking about for future versions of pg_timetable? Why not just do all the work inside the running Postgres cluster?

[00:07:00] - Pavlo

So, yeah, thank you for the question, because that was like the second thought when we released pg_timetable. The idea is that PostgreSQL uses a separate process for everything. If you have a new connection, you need a new process. So if you have a background worker for cron or for another scheduler, you will need a background worker, a separate process. Yeah, it's possible. We can create extensions with Go. It's possible, yes. But right now, I'm not sure what benefits we would have.

[00:07:58] - Ry

It also feels kind of gross, right, to think of throwing it all together. But I have this conversation all the time with my team. If you have a system that's dependent on Postgres, let's say there's an API, right? And if the Postgres is only invoked when the API is hit, you could theoretically put all that compute into one bucket and just let the API talk to it. I'm thinking, say it's a Rust API: should you serve that Rust API as a separate container and have the separation of concerns? Or should you just create a custom Postgres extension and let the Rust run inside the Postgres cluster, where there's zero latency to the data? But it's a little bit of a Frankenapp at that point. It's a tricky problem.

[00:08:52] - Pavlo

I will tell you this: I want to implement this scheduler as an extension for Postgres, not because I think we will have some benefits, but because this is a very interesting task for me as a developer, first of all.

[00:09:08] - Ry

Yeah. Is pg_timetable supported by any cloud providers right now?

[00:09:16] - Pavlo

Yeah, we can run pg_timetable against any cloud provider. If you want, you can run it in Docker and Kubernetes, as a standalone binary, or you can even grab the source code and put it into that fancy AWS one-time run-and-forget thing, I don't remember what they call it.

[00:09:46] - Ry

Yeah, but because it has the sidecar, it probably can't be like a standard extension in RDS, for example.

[00:09:54] - Pavlo

Right, it might be, but I think to make this happen, we need to somehow communicate with AWS and make that happen. It's more like politics, not...

[00:10:10] - Ry

I think it's also just... I don't know if any of the other extensions that they offer have a sidecar requirement, right? So that's one of the things we're thinking about at Tembo: let's allow sidecars to extensions in our managed service. Yes, it's, again, a lot more dangerous in the sense that there are more things that could break, but it's also more powerful. And if we had the rule that dangerous things are bad, then you wouldn't have a combustion engine inside of a car, where little explosions are happening all the time to power it.

[00:10:50] - Pavlo

On the other hand, you can build your own image, right, and somehow limit the binary in it from doing dangerous things, et cetera.

[00:11:02] - Ry

Yeah. I joke sometimes and say we're building the dangerous Postgres, just because... it's not, really. We support 150 extensions now, so yeah, our support footprint is increased and scarier than if you had 75, but I still say it's worth it to the developer to have the capability. So I'd love to get pg_timetable as a standard option. I want to use it, basically. What's funny is I'm building a data warehouse right now for our company, and I started with pg_cron, and I have these four jobs that all start at the same time. And it's very early, so it's not a problem, but I would love to have them be chained. Like I said, I could fake that by spacing things out, but I really feel like I've just stepped back 20 years into the world of early data processing. So I appreciate what you built, and I'll use it soon.

[00:12:08] - Pavlo

Okay. Let me know if you're happy with it.

[00:12:11] - Ry

Yeah, I think it's a good idea, and it could also help developers avoid having to go pick up some other tool for the next stage of their data processing, which to me is a big win. And we're trying to make the "Postgres for everything" meme real. I know you are too, in a sense, because you built the extension. But yeah, I think there's so much it can do, and it can do a lot more than just writing and reading data.

[00:12:54] - Pavlo

One more thing about having pg_timetable as an external application is that you can use your own binaries inside jobs, like psql or pg_dump or something, to grab something, to convert something, et cetera. So we want developers to have the possibility to create their own Docker images with only the binaries or tools they need plus pg_timetable. So in this way it sounds like a Swiss Army knife, right? I need pg_timetable and a couple of utilities for grabbing something, for manipulating files, et cetera. And then we pack it, and that's all. We don't need to think about how we should allow the application to install things, or to have them on the system version, et cetera, et cetera.

[00:13:58] - Ry

Yeah, basically what you're saying is it gets even more dangerous quickly, right? The base use case is one thing, but then people start doing wilder things, including their own packages, right? Completely custom, nobody knows what's inside of it. It could be a bitcoin miner, it could be anything, right?

[00:14:23] - Pavlo

Yeah.

[00:14:26] - Ry

Are you working on the next version of it or is it pretty stable right now and doesn't require a lot?

[00:14:32] - Pavlo

So at the moment we are stable. We are open to ideas and feature requests, et cetera. But at the moment we are fine, I believe. So it's like in maintenance mode: we update versions, packages, et cetera. No new bugs. I'm happy with it.

[00:14:56] - Ry

Did you build it? Was it sort of a small team there or did you kind of build it solo?

[00:15:02] - Pavlo

So the initial idea was by Hans, our CEO, and yeah, that was my first project in Go. So I was trying to learn a new language and to see, how can I do this? And here we are now. That's cool, I believe.

[00:15:36] - Ry

What were the biggest challenges, would you say, that you faced while building it? Was it hard to, for example, understand the internals, or was that part easy? I'm kind of curious what was toughest.

[00:15:49] - Pavlo

For me, I would say it was the new tools that you need to use. If you are familiar with your favorite IDE, then for the Go language you need to switch to something like VS Code or GoLand or whatever. It's a completely new tool for you, and you don't have the muscle memory for the shortcuts, et cetera, et cetera. It's kind of difficult. For like six months I was trying to get used to the new tooling, but after that I was pretty happy.

[00:16:26] - Ry

Yeah. So when was it that you started working on timetable... pg_timetable?

[00:16:35] - Pavlo

Maybe three years ago, something like that.

[00:16:39] - Ry

Yeah. Okay, well, that's great. So have you guys implemented it many times now for customers or how has it been? I know it's not a commercial product, but I assume it should be helping the consulting business, right? Yeah.

[00:16:58] - Pavlo

So it usually goes in a package with something. Usually, if we have a new client, we are talking about high availability, and we are talking about monitoring, and we are talking about this and that, and then we say, okay, we have a scheduler if you want it, and usually people want it. I don't remember us having a standalone sale of pg_timetable. But yeah, it's usually packaged into something bigger, usually with high availability and monitoring.

[00:17:33] - Ry

Got it. It’s like a sweetener, right? Cool. Well, maybe I'll shift into some more general topics. I'd love to know what are some of the developments in Postgres ecosystem that you're excited about?

[00:17:50] - Pavlo

So, Postgres itself, the core... I constantly want to do something for the core, but...

[00:18:03] - Ry

It's like you're racking your brain around there.

[00:18:07] - Pavlo

The process itself is not exactly user friendly. First of all, you need to communicate through the email lists: no GitHub, no issues, no pull requests. And you need to constantly update your patches, because the development is very active, and you need to defend your patch against many people in the community: why you did it, for what purpose, why you did it that way and not another. So that might take some time.

[00:18:46] - Ry

Who are your favorite people in the Postgres ecosystem? You hate to play favorites, but are there people I should interview, get to know? Anybody come to mind?

[00:18:56] - Pavlo

Yeah, a lot of them, frankly speaking. Well, I think that the community is the reason number one why I am still doing Postgres things. I was working for like five years without knowing anybody from the community. And then in 2011, I was at the PG conference in Europe for the first time, in Amsterdam, and that changed my mind absolutely. I saw what the community is, and what the main things are, and why they work that way and not another. So if you want names: Bruce Momjian, for sure. Magnus Hagander, Ilya Kosmodemiansky, Vik Fearing, Andreas Scherbaum, a lot of them. Lætitia Avrot. A lot of them.

[00:19:59] - Ry

Well, I'll reach out to those people. It's obviously a very big community too, and certainly, considering its extent, anyone who's touching it is very excited. First, I'm going to, you know, the New York City conference here in a couple of months. Yeah. So I'm excited to meet a bunch of people there. And we went to an event in San Jose last spring, which was fun. All right, so here's another one. If you had a magic wand and you could add any feature to Postgres, what would you change about it, or what would you add?

[00:20:39] - Pavlo

I don't think that I can change something in a way that would be a breaking change, but I would like to see if we can implement Postgres using threads, not processes. I would like to see if that might be a better solution than what we have now. I'm not sure.

[00:21:04] - Ry

It's being debated, right?

[00:21:07] - Pavlo

Yeah, absolutely.

[00:21:08] - Ry

Yeah, we'll see.

[00:21:09] - Pavlo

It's a hot topic.

[00:21:10] - Ry

Hot topic, yeah. I did join the mailing list for a short period of time. I'm starting a company too and it just was too much. I was like, oh my gosh, this is a fire hose and someday I want to be able to have the time to consume it. But today wasn't that day. Is there anything about Postgres that no one agrees with you about? Do you have any contrarian views about any part of it that you can think of?

[00:21:38] - Pavlo

Yeah, probably my views about Windows, because I was a Windows guy and I'm still a Windows guy, and I'm always saying that we need to support Windows as well as we support Linux. So that means installers for Windows, and extensions and tools, et cetera, et cetera. And people usually say, no, Linux is fine, and that's enough.

[00:22:08] - Ry

Yeah, it's interesting, it's fun. I'm interested in a lot of different areas of programming. I was looking at, say, Unity, you know, game development, and it's C#, and I'm like, "ugh," and it seems very Windows-centric. And because I'm a Mac user... for me it'd be like Mac, Linux, Windows in terms of my priority, and certainly not C#, right? But anyway, yeah, the religious wars will never end there, I guess, in terms of platforms.

[00:22:47] - Pavlo

The bad thing about our approach right now is that I think we don't have enough good GUI programs for PostgreSQL. Every time, we are using psql to show something in a demo. And I think that for newbies it would be much easier to follow a demonstration if you used some fancy GUI application.

[00:23:15] - Ry

Well, it was great chatting. Where can listeners find you online? Are you on social media anywhere?

[00:23:23] - Pavlo

Yeah, I'm like on every kind of social media. Like GitHub, LinkedIn, Twitter, Instagram, Facebook, Blue Sky, Mastodon.

[00:23:36] - Ry

Great. All right, well, like I said, someday soon I'm going to try pg_timetable and I'll give you some feedback and maybe some ideas in the GitHub and we'll see you there, I suppose.

[00:23:47] - Pavlo

Sure. Thank you.

[00:23:48] - Ry

Great, great chatting with you. Thanks for joining.

[00:23:51] - Pavlo

Thank you. Thank you.

· 19 min read

In this episode, Ry and Eric talk about ZomboDB, the complicated road that led to the development of pgrx, and what it means for the future of extensions within Postgres. You can watch it below or listen to it on Apple/Spotify (or your podcast platform of choice). Special thanks to Eric for joining us today!

Want to know more about something they mentioned? Here’s a starting point:

Did you enjoy the episode? Have ideas for someone else we should invite? Let us know your thoughts on X at @tembo_io or share them with the team in our Slack Community.

Transcript

[00:00:12] - Ry

Hey, Eric. What's going on, man?

[00:00:14] - Eric

Hi, Ry. How are you?

[00:00:15] - Ry

Long time no talk.

[00:00:16] - Eric

Yeah, it has been a while.

[00:00:18] - Ry

How are things going?

[00:00:19] - Eric

Well, I think.

[00:00:21] - Ry

Well, maybe let's give our listeners a quick intro on what you've done. I mean, Eric Ridge is pretty much a household name, I mean, in case there's someone listening who doesn't know what you've been working on. You want to give us a quick background on your involvement in Postgres over the years?

[00:00:36] - Eric

I think a household name is absolutely not true. First of all, I'm Eric. I am the reason that the Postgres extension named ZomboDB exists. And over the years of developing that, something that's now called pgrx kind of fell out of all that work, which lets us develop extensions for Postgres using the Rust programming language. And these things have been where my interest has been, as it relates to Postgres, for I guess the past seven, eight years now. ZomboDB has been out since 2015, so we're right at eight years.

[00:01:14] - Ry

Do you know how many there are? An uncountable number of Rust extensions now, probably. You probably at one point knew how many there were, but now you've lost track.

[00:01:24] - Eric

Yeah, that's a good question. It's hard to know. There's no real clearinghouse around extensions.

[00:01:30] - Ry

If only there was, that would be a great thing for someone to build.

[00:01:33] - Eric

Yeah, it would. I wish somebody would get on that. Right. Yeah, that's a good question. There are some bigger names using pgrx; Timescale is one, Supabase is one. We recently released plrust, and that's found its way out on RDS, so that's exciting for us. What we see popping up are small extensions written in pgrx, and we see a lot of people writing, I don't want to call them one-off extensions, but one-off extensions that are just specific to their business use case and solve some interesting data domain problem that only they would have. But now they have the ability to build that business logic or data manipulation or what have you into their database, into the database itself.

[00:02:27] - Ry

Without having to go find a C programmer that's willing to do such work. Right?

[00:02:31] - Eric

Yeah. And that was one of the motivators for developing pgrx. I guess I had been working on Postgres extensions in some form or fashion since Postgres 8.0, and I think that was even before the extension packaging and the CREATE EXTENSION command and all that came into existence, not long before. And yeah, I mean, just over the years, I got tired of working in C. When you're developing a Postgres extension in C, you're not only putting on your C programmer hat, but you're trying to put on the hat of a Postgres hacker, right? You need to understand the Postgres sources just as much as you do the language itself. So, Rust was becoming popular a number of years ago, so I was like, you know what, I'm going to try to do this.

[00:03:22] - Ry

Well, it's interesting. It's like how a lot of games, if they have mods, the language for the mods has to be easier than the language for the core game, right? So a game might be written in something like Unity, but then they give you Lua as a tool, or JavaScript, or something simple to build with. But in Postgres, that wasn't true. Now, is Rust as simple as Lua or JavaScript? Maybe not, but it's still very beloved by people who try to learn it. In my experience so far, it's 100%: anyone who tries to learn Rust learns it and loves it. I haven't found someone yet, and I have a small sample size, but yeah, it seems like a pretty good hit rate in terms of the perfect mix of control and speed with the ease of onboarding.

[00:04:10] - Eric

If you're not a C programmer, then writing a Postgres extension in C is going to be difficult, again, because of C, and also because you're really programming Postgres's version of C, which is great, but it's not JavaScript, it's not Ruby, it's not Java. If you're not a Rust programmer, first you've got to learn Rust, and Rust is different enough from all the other languages to make that a point. But what we're trying to do with pgrx, and it's going to be a multi-year effort to do this, it may never get done, but we're trying to present the Postgres internals in the most idiomatic way possible in Rust. So that if you are a Rust programmer, or you are beginning to learn Rust, it'll be pretty obvious what you need to do in order to make your Postgres extension useful.

[00:05:10] - Ry

So you said you were working on extensions as early as Postgres 8. Were you a Postgres user or developer even earlier than that, or was that sort of when you got involved?

[00:05:21] - Eric

Yeah, Postgres user, I think we started using Postgres at work in the year 2000 and that might have been I mean, is that 7.1, 7.2? It's quite a long time ago and we've been using it every day since. And, yeah, somewhere along the way we needed to do some work inside the database. So here we are today.

[00:05:46] - Ry

Yeah. For me, when I think about when did I start using Postgres, it's like thinking about when did I first start using a laptop, it's like I don't remember exactly when, but definitely transitioned to having a laptop at some point. But, yeah, it didn't seem monumental to start using Postgres back then, but obviously it's come a long way and it's a pretty exciting ecosystem right now. So, yeah. What do you think the most powerful thing about extensions are? Do you have any thoughts on what are the most powerful things you can do with extensions, out of curiosity?

[00:06:19] - Eric

It's an interesting question.

[00:06:21] - Ry

What's the coolest part of, I guess, what API? Or maybe another good question is: what hasn't been exposed yet that should be, or needs to be, in pgrx? That would be powerful.

[00:06:32] - Eric

Yeah, I'll start there and then work backwards. There are a lot of internal APIs within Postgres, and there are a number of interesting ones. There's the index access method API, which is really easy to implement. We haven't added safe wrappers around that to pgrx, but it's real easy to implement; it's really just a handful of functions. There's the Foreign Data Wrapper API, for which Supabase released a project that they call Wrappers, which lets you create foreign data wrappers using Rust, backed up by pgrx. So kind of the next big API around that genre of internals would be the Table Access Method API, and we haven't exposed it at all yet. It's actually difficult to get at, even through our bindings, just because of the way it's set up in the Postgres sources. So when we get around to doing that, and I hope that we will, there's going to be some development effort on our part to safely expose the Table Access Method API. But that's the API that would let you write your own heap storage engine or system in Rust.
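
For orientation, the SQL-visible surface of the two APIs Eric mentions is tiny; the handler names below are hypothetical and would be provided by an extension's native code (C, or Rust via pgrx):

```sql
-- An index access method is registered with a single handler function:
CREATE ACCESS METHOD my_index_am TYPE INDEX HANDLER my_index_am_handler;

-- Likewise for a foreign data wrapper:
CREATE FOREIGN DATA WRAPPER my_wrapper HANDLER my_fdw_handler;
```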

[00:07:49] - Ry

Tell me about it. I mean, probably not everybody knows what ZomboDB is. Maybe quickly describe what it does and what it doesn't do and what's coming next with it.

[00:07:59] - Eric

Yeah, it's the worst named software product.

[00:08:03] - Ry

First off, why'd you call it Zombo, everyone wants to know.

[00:08:06] - Eric

I think, Ry, you're old enough to remember Zombo.com. If you don't remember it, then that just means you weren't playing on the internet; you were probably doing real work. It still exists, so listeners can go and check it out, but the name just came from that. So ZomboDB is a Postgres extension, and it's been through numerous iterations over the past eight years, but it has always been a Postgres extension that lets you create an index on a table that's backed by Elasticsearch. Typically the usage pattern is you create the index indexing all of the columns in the table and you point it to an Elasticsearch endpoint over HTTP or HTTPS, and ZomboDB keeps that index up to date. It keeps the index MVCC-correct: you only see search results that would otherwise be visible within the executing transaction. It has its own query language for doing full text queries, so users don't have to learn the Elasticsearch query DSL, which is pretty intense.
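For readers following along, the usage pattern Eric describes looks roughly like this (table and index names are made up; the shape follows ZomboDB's documented examples):

```sql
-- index every column of the table into an external Elasticsearch cluster
CREATE INDEX idx_products
          ON products
       USING zombodb ((products.*))
        WITH (url = 'http://localhost:9200/');

-- full text search through ZomboDB's own query language
SELECT * FROM products WHERE products ==> 'sports box';
```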

[00:09:14] - Ry

Can you use the Elasticsearch DSL with it, though?

[00:09:17] - Eric

You can, and there's a number of ways to do that. The query language itself allows you to just insert, as part of an expression, a JSON block to represent a query DSL node. There's a full set of SQL builder functions, so you can build query DSL through SQL statements. You can mix and match these things together. It's pretty sophisticated on that front, and it also exposes nearly all of Elasticsearch's aggregate endpoints to enable you to do really interesting aggregations with your full text. And since all the fields of a table are also stored in the Elasticsearch index, you can do some really interesting things that involve full text queries and aggregating on various metadata fields, and do all this with a large distributed Elasticsearch cluster behind the scenes, and do it in near real time. It's really slick. The downside to ZomboDB is that you now have an Elasticsearch cluster running alongside your Postgres cluster. And ES is a living, breathing system. It requires significant attention and maintenance in production. It gives you a lot of capabilities to scale out horizontally very easily, but it is its own living, breathing system there.

[00:10:42] - Ry

Yeah, it'd be great if you could build a version of this without requiring Elastic.

[00:10:46] - Eric

Yeah, I know a guy that's working on that. I know a company that's working on that.

[00:10:50] - Ry

Yeah, someday that'd be possible.

[00:10:53] - Eric

Yeah, well, you had asked where the name came from, and I said all that to say: once we kind of got ZomboDB working, we were like, whoa, it's really amazing what you can do with this and how easy it is to use. And one of the mantras from Zombo.com is "the only limit is yourself." And ZomboDB really kind of makes it feel that way, because it makes querying your data, especially real dense text data, just super simple.

[00:11:26] - Ry

And so you use Zombo pretty aggressively within TCDI, which is where you work, right?

[00:11:33] - Eric

Yeah, TCDI is my employer. Yeah. We've had Zombo in production since 2013, and we didn't open source it until 2015. Yeah, I don't want to give away numbers, but we have more ZomboDB installations and Elasticsearch instances than we probably want.

[00:11:53] - Ry

Is it sort of a competitive advantage for the business, the power of it? I know you guys deal with legal documents, I might get it wrong, but searching legal documents to some degree, right? Is the fact that Zombo is inside a competitive advantage for the business, would you say?

[00:12:08] - Eric

Yeah, absolutely. Our industry is ediscovery litigation support, and among many other areas within that domain, searching large volumes of text is a key thing. And we're the only player in our industry that has text search capabilities to the degree that we do. And I'm not going to go name our competitors or anything, but our ability to do complicated text search that involves fuzzy queries, plus regular expressions, plus boolean expressions, plus long-tail proximity queries, plus hit highlighting, just far exceeds our competitors', because of ZomboDB, because of Postgres, and because of Elasticsearch.
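To make that concrete, a query mixing those features might look something like the sketch below. The syntax is a paraphrase of ZomboDB's query language (fuzzy terms with `~`, wildcards with `*`, proximity with `w/N`), so treat it as illustrative rather than exact:

```sql
-- fuzzy match, wildcards, booleans, and a proximity clause in one query
SELECT doc_id
  FROM documents
 WHERE documents ==> 'agreement~1 and (payment* or invoice*) and wire w/5 transfer';
```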

[00:12:55] - Ry

Do you use the extension in both transactional use cases as well as analytical? Or is it mostly just powering the search bar inside the product or is there some back end? I guess I call it like analytical type use cases for it as well.

[00:13:12] - Eric

Yeah. So with postgres. Right. Everything is transactional.

[00:13:15] - Ry

Yeah.

[00:13:16] - Eric

We expose search in a very complex manner, I guess, to our users. It's much more than just a search box. They can search by particular fields and design proximity queries, and they can do term lookups. They can create real-time dashboards and whatnot through our software. That's all backed by ZomboDB. One interesting use case is in the world of litigation. The plaintiffs will provide the defendants with a list of keywords that need to be searched in order to find documents that are responsive to these particular topics. And these keywords are sometimes just literally a keyword, one word. Sometimes they're phrases, sometimes they're wildcard terms. Sometimes they're phrases with embedded wildcards. Sometimes they're patterns that need to be recognized. And so the analytics side of what we use ZomboDB for is taking these ginormous lists of things and kicking it off and getting back a list of the 6 million documents, out of the 200 million, that might match this set of keywords that the plaintiffs have requested production for.

[00:14:25] - Ry

Yeah, that's intense. I can see the power of that. What are your other favorite extensions? Probably all the other ones you've made, but give me your top five Postgres extensions. Zombo's one. What are a couple of others that you like?

[00:14:39] - Eric

I don't use PostGIS, but I am super impressed with it. I think of it as kind of the gold standard of what a Postgres extension ought to look like. ZomboDB does integrate with it, but the data world that I live in doesn't involve the need to do geographical-type queries. But I've always been really impressed with it. I think the Postgres foreign data wrapper as an extension is really great, and that comes from the core community, but I'm pretty impressed with that. Citus has always been interesting to me. I guess they've long since been bought out by Microsoft and are still going strong. I think that Citus really shows off the power of what a Postgres extension can do, and it also shows off the power of having committers on hackers to be able to help move the extension system in Postgres forward, so that you can then implement something like Citus on top of it.

[00:15:36] - Ry

When they were building that, did they have to add things to extensions in order to build that? I don't know the whole history of Citus, but was that kind of like.

[00:15:46] - Eric

Yeah, they've made a number of changes in Postgres, I think especially in the planner and executor, to be able to support the ability to extend it better. And I think some of that came from the original Citus company people. That's great, right?

[00:16:03] - Ry

Yeah, that's awesome.

[00:16:05] - Eric

What I like to see are people using Pgrx to just solve a problem that they have with their application. I see things like somebody just needed to talk to an S3 bucket, and so they put together a little Pgrx extension to do that. Little things like that are what get me excited about Postgres extensions.

[00:16:26] - Ry

Yeah, it's pretty timely. I have a co-op that started literally today, shout out to Jay, and I told him to build a Clerk FDW using the Supabase Wrappers SDK. And he's like, what the hell did you just say to me? I said, just try it, read the docs, we'll see where it goes. But I'll report back how that goes. But yeah, Clerk, we're using Clerk for auth for our company, and I just need simple read access to three collections. So I figure he should be able to get it done by the end of the day, but I might be pushing too hard.

[00:17:00] - Eric

Well, it's possible. And that's a great example of where having a simple framework for making an extension is really powerful. Right. I mean, once you figure out the documentation and how to tie the two things together, it probably is just an afternoon's worth of work. And now you've got something that solves a problem and you can move on.

[00:17:21] - Ry

Yeah, I figure his second one maybe will take an afternoon, first one might take a few days, but we'll see.

[00:17:26] - Eric

Well, sure, yeah.

[00:17:28] - Ry

Well, that's great. Yeah, we're super excited, obviously, by all the possibilities that Pgrx brings to Postgres, and the great work you guys have done historically to get us to this point is much appreciated.

[00:17:42] - Eric

Well, thanks. That's one of the reasons why we, TCDI, donated Pgrx to the PG Central Foundation: to provide it a long-term home for the Postgres community. TCDI is a litigation support company. We're not a database company, so to speak. So we wanted to get this stuff in the hands of a foundation that can help shepherd it forward and make sure that it can survive.

[00:18:09] - Ry

I imagine you have some open issues there that you could use help with. They're probably pretty hard to do, right? Like building a framework is a lot harder than building a thing with a framework. But are you looking for additional help from people to build out Pgrx?

[00:18:23] - Eric

Yeah, for sure. I can't even tell you off the top of my head how many contributions we've had over the past four years that Pgrx has been alive. And it seems like every couple of weeks somebody new comes by with a new contribution. But yeah, what we're focused on right now is, I guess, two things kind of concurrently. One is we're improving Pgrx's testing fundamentals, which is both tying more interesting things into CI, like Valgrind, and different approaches to unit testing, including a big push on property testing right now within Pgrx. I think there's going to be some pull requests landing in the next week or so with that. And we're also focused on improving some of the memory safety and soundness impedance mismatches between Postgres internals and Rust's compiler. So it's a big concerted effort to take a look at these issues.

[00:19:25] - Ry

Well, it seems like contributing to Pgrx could be like a gateway drug towards getting closer to helping with Postgres itself. Right. Because you need to understand the internals to do anything, you know, at least unless you're working on tests. Yeah, I mean, in general, adding new features to Pgrx means understanding Postgres well enough to create that interface, which is a big challenge.

[00:19:48] - Eric

It does. And myself and the others on the core development team there, we spend probably as much time in the Postgres sources as we do actually writing code for Pgrx.

[00:20:03] - Ry

If you guys come up with a big new release of Pgrx or any other extensions that you can think of, I'd love to have you back. Eric, it was great catching up with you.

[00:20:12] - Eric

Yeah, sure. Likewise. Thank you so much for having me.

· 19 min read

In this episode, Ry and Adam talk about developing extensions for Postgres, being in the trenches of the modern data stack sprawl, and the future of what’s possible with Tembo Stacks. If you haven’t seen or listened to it yet, you can do so below, or on Apple/Spotify (or your podcast platform of choice). Special thanks to Adam for joining us today!

Want to know more about something they mentioned? Here’s a starting point:

Did you enjoy the episode? Have ideas for someone else we should invite? Let us know your thoughts on X at @tembo_io or share them with the team in our Slack Community.

Transcript

[00:00:13] - Ry

Welcome to the show, Adam. Excited to chat with you today. How many postgres extensions have you authored in 2023?

[00:00:23] - Adam

Oh, at least two.

[00:00:28] - Ry

At least two?

[00:00:29] - Adam

Well, in early 2023, I was just getting started and trying to figure out how to write one. And I know I wrote a handful of them that didn't really do anything. I think the first one was just, put a REST server inside an extension and see what happens. It kind of didn't work very well.

[00:00:57] - Ry

Yeah, I remember that.

[00:00:59] - Adam

I think I made, like, a Prometheus exporter, and then recently there were two that are kind of more useful.

[00:01:05] - Ry

Yeah, that's great. Well, so obviously we know each other. We both worked together at Tembo. I mean, I could ask you questions as if I don't already know the answers, but since I do, I'm not going to do that. I was going to say, getting into Postgres, though, joining Tembo wasn't your first touch of Postgres. Do you remember starting with it? I can't really quite remember when I started, but can you remember when you did?

[00:01:33] - Adam

Well, the first time I heard of Postgres, I'm pretty sure, was in undergrad, like a database management systems class. And it was like, there's Oracle, there's these companies around it, and then there's the open source stuff, PostgreSQL, you know? And I remember thinking, oh, that's interesting. And I probably Googled it and saw the elephant. And then I know I started using it in undergrad. I was doing some scientific computing stuff, and I just needed somewhere to put the data. I didn't want to just keep writing to CSVs. I wanted to put it somewhere that was easier to search. And I just used Postgres for it. Didn't really know what I was doing. It was just, dump it into this database.

[00:02:21] - Ry

Nice, nice. And then as far as hacking postgres, that just started recently, or have you messed with the internals of postgres prior to this year?

[00:02:35] - Adam

Definitely not prior to this year. I mostly focused on the application layer, building stuff on top of Postgres. This year I'm still getting deeper into the internals of Postgres. I had installed some extensions before this year, but definitely didn't even think about how you'd write or build an extension...

[00:03:00] - Ry

Pretty easy to build them?

[00:03:01] - Adam

Yeah, that's very recent for me.

[00:03:02] - Ry

Pretty easy to build them. Right. Would you say?

[00:03:04] - Adam

Yeah. A shallow one is easy, if you can learn Rust. Yeah, it's a challenge. I think it's hard.

[00:03:17] - Ry

Yeah, I was kind of kidding.

[00:03:19] - Adam

And then there's like a learning curve, and then once you get over the curve, you can kind of get moving with it.

[00:03:27] - Ry

Maybe. Let's talk about the extensions that you're, I guess, most proud of or have the most traction so far. What are those?

[00:03:39] - Adam

Pgmq, I think, has the most traction: a message queue extension. It's a lot like SQS, Amazon's Simple Queue Service, or RSMQ, Redis Simple Message Queue, except on Postgres. Well, we wrote it for Tembo to help run our cloud. We needed a message queue between the control plane and data plane, so we wrote it for that. And then just in the last couple of months, we started talking about it in the community. And after we wrote a blog about it, we've had a few people from the community that didn't know us before who are now consistently contributing to the project. So Pgmq is definitely the one that I think has the most traction.
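The core API is just SQL functions. A quick sketch of the SQS-like flow (queue name and payload are made up; recent Pgmq releases namespace these functions under a `pgmq` schema, though early versions spelled them differently):

```sql
SELECT pgmq.create('tasks');                            -- create a queue
SELECT pgmq.send('tasks', '{"task": "resize_image"}');  -- enqueue, returns a msg_id
SELECT * FROM pgmq.read('tasks', 30, 1);                -- read 1 message, hidden for 30s
SELECT pgmq.delete('tasks', 1);                         -- ack by deleting msg_id 1
```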

[00:04:36] - Ry

And I think you're also working on, well, you kind of have a stack of them, right? Because your other extension relies on Pgmq. True?

[00:04:48] - Adam

pg_later.

[00:04:50] - Ry

Is that true?

[00:04:51] - Adam

Yeah. Okay. Yeah, pg_later, that's another extension that lets you submit a query to Postgres and then forget about it, and come back later and see the results from that query. And that's built on top of Pgmq. So it's like a stack of extensions.
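Something like this, going from memory of pg_later's README, so the exact function names may differ from the released API:

```sql
SELECT pglater.init();                                  -- one-time setup
SELECT pglater.exec('SELECT count(*) FROM big_table');  -- returns a job id immediately
-- ...come back later...
SELECT pglater.fetch_results(1);                        -- results for job id 1
```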

[00:05:20] - Ry

Nice. I guess you're in the position now to be the extension mentor for others in the company who may be building extensions.

[00:05:29] - Adam

I know.

[00:05:29] - Ry

I just hired a co-op who's working on one, and I imagine you're mentoring him to some degree.

[00:05:37] - Adam

Yeah, Jay you're talking about? Yeah, Jay's great. Hey, Jay, if you're listening. Yeah, Jay's been working on another extension written in Rust, Clerk FDW. So it's a foreign data wrapper around Clerk, which is our identity provider. We just want to be able to build this data warehouse that has our users and their organizations, and have that data persisted in our data warehouse. So we built, well, Jay is building, that foreign data wrapper around Clerk, which is pretty cool.

[00:06:17] - Ry

Yeah, we probably should give a shout out to Supabase for the Wrappers project that we're using to build that with. Sounds like it's been a pretty nice experience. Jay's a co-op that just joined, like, last week, and he's already being very productive. So I'm excited. I think there's the possibility of an explosion of new extensions for Postgres now that Pgrx allows us to use Rust. Do you agree with that?

[00:06:51] - Adam

Yeah, it's pretty powerful, because take this Clerk FDW, that extension that Jay's building. We can bootstrap the software together the same way as you might do outside of Postgres, but we don't have to spin up another resource somewhere, manage a pod or another VM. We can package it up into an extension, use Supabase Wrappers, so that's software that's already built, already tested. We don't have to reinvent the wheel on that. We write the extension in Rust, so we pull in Rust libraries that are already built, already tested. Then we write our own tests on top of these things, package it into our extension, and deploy it as part of the database. And then we can kind of just monitor the data there and not have to manage another service that could crash.

[00:07:54] - Ry

Yeah, it's pretty nice. So I was going to change a little bit to...you've also been using a lot of other people's extensions. Again, probably a lot more since joining Tembo than before, but what are some of your, I don't know, favorite extensions that you've been exposed to?

[00:08:15] - Adam

Yeah, pg_cron, I use that in a few different places. There's another extension that I'm working on. It's really early, but it's kind of wrapping up some of the stuff that Langchain does in Python into an extension, pulling in pgvector, pg_cron, Pgmq, and PostgresML as well. So it's like a bootstrapped extension. All of those are pretty good: PostgresML, pg_cron, pgvector. And they all have pretty good docs, so you can get up and running with them quickly and figure out what you need to do.

[00:09:09] - Ry

Yeah. So have you found that you've had to study the internals of postgres much as part of the process of building these extensions?

[00:09:23] - Adam

Not really too much. Only when I'm having trouble getting it up and running for the first time do I need to really look at Postgres. But a lot of these extensions don't; they're not hooking into and replacing core functionality of Postgres. They're extending things. So a lot of what PostgresML does is just give you functions that go off and make REST calls or download data from elsewhere on the internet and put it in your database for you, give you access to some pre-trained machine learning models. And it's not changing fundamentally how Postgres runs. It's still the normal Postgres. It's kind of like a layer on top. Yeah.

[00:10:19] - Ry

I think some extensions can do surgery, and some are, you know, laying on top and not messing with the core functionality.

[00:10:28] - Adam

Right? Yeah, it's an area I'm kind of new to, so I haven't gotten super deep into replacing core functionality of Postgres quite yet. But soon we'll probably start working in that space a little bit more.

[00:10:47] - Ry

What's the testing story like? If you're a dev that likes TDD, test-driven development, are you able to do that with extensions, in particular, I guess, with Pgrx? What does the test-driven story there look like?

[00:11:03] - Adam

Pgrx, I think, is pretty good. They have a way for you to write unit tests or integration tests for the extension that you're writing, so you can execute some queries and then make assertions on the outputs of those queries, kind of like the normal assertions you would do in whatever language you're testing in. Pgrx has tooling around, oh, I want to spin up Postgres 15.3, spin up a new environment for that, install your extension into it, and then run your test suite on it. Yeah, I guess it can get a little bit trickier depending on how complex the extension is. You could have system dependencies, like, oh, I need a specific version of GCC or something, or some OpenSSL version has to be installed. I haven't really quite found a good way to test all of those things to make it super portable, to say, yep, all these tests pass, and that means my extension is just good for everybody.
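The pattern Adam describes looks like this in Pgrx (a minimal sketch; `add_one` is a stand-in for whatever your extension exposes). `cargo pgrx test` spins up the target Postgres version with the extension installed and runs each `#[pg_test]` inside it:

```rust
#[cfg(any(test, feature = "pg_test"))]
#[pgrx::pg_schema]
mod tests {
    use pgrx::prelude::*;

    // Executes a query against the throwaway test instance and
    // asserts on its output via SPI.
    #[pg_test]
    fn adds_one() {
        let result = Spi::get_one::<i32>("SELECT add_one(41)").unwrap();
        assert_eq!(result, Some(42));
    }
}
```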

[00:12:28] - Ry

Are the tests running on your local machine? I know they're running on your local machine, but in a Docker container or just natively?

[00:12:41] - Adam

You could do both, I guess. My workflow for it, I guess, would be to run the tests locally, then have the extension in a GitHub repo and have a CI pipeline that runs the same tests, but in Docker or within a GitHub workflow, to be like, hey, before you merge this, you need to pass all these tests. Kind of gate yourself a little bit.

[00:13:10] - Ry

That's great. Well, cool. So I was going to ask, as far as your background prior to Tembo, you were doing a little ML engineering too. Did you use Postgres much in that role? I guess the question is, did you use any of the special parts of Postgres, or was it really just a standard transactional store with little knowledge of the extension ecosystem?

[00:13:39] - Adam

Nothing too crazy. I've always tried to do a queue on Postgres, so, like, build up this giant queue of things that a machine learning model needs to make predictions on, or make a bunch of different predictions, dump them into a queue, and then have another process look at all these predictions and pick the best one before giving it back to the user. But a lot of OLTP, high-transaction workloads, in the machine learning space. Yeah.

[00:14:18] - Ry

So you said you've been trying to build queues. Historically, a lot of times in bigger companies, there are set tools for various tasks. For example, use Redis if you want to build a queue, or use Snowflake if you want to build a data warehouse. I don't even know that these things were officially declared by any sort of powers that be. It's more like the sales reps at those companies decided that this company is going to be a Salesforce company or a Snowflake company. Right. It's kind of a slow boil where you suddenly realize, oh, I guess we have Kafka now, I guess we're a Kafka shop. It starts with a POC and then ends up with you having all these tools. We call it the modern data stack. But yeah, I'm curious what your take on that is, in particular around queues, since you've just implemented one in Postgres and probably had to use other tools in the past.

[00:15:16] - Adam

Yeah, a couple of places that I've worked at, it's like, hey, we need to pass messages between two services. And any engineer will be like, all right, well, we're going to evaluate what we have here. We want to keep things simple, not make things too complex. And then depending on the size of the organization, it'll be like, well, we use Kafka for messaging, so use Kafka. And so, okay, to make progress, a lot of times it's better to not fight that and just be like, we want to make progress, we just need our application to work. It's not like Kafka wouldn't work for this, but it's definitely overkill for a lot of situations. So then it's like, okay, we're going to build this in Kafka, and maybe the rest of the application is in Postgres, but this message piece of it, we're going to put in Kafka. And from the application perspective, it's kind of no different. Like I said, it's an overkill tool for a lot of use cases. But then when things start to go wrong and you need to troubleshoot it, it's like, okay, now we have to bring in the Kafka expert.

[00:16:35] - Adam

And if it's a big company, it's probably a handful of people who are on this Kafka team and they got a lot of other stuff going on. So it's like, what do you have to do? You have to learn Kafka. Yeah.

[00:16:47] - Ry

And you're a small use case, right? You're like a small, unimportant use case, and so you can't get their attention. You kind of accepted the fact that, okay, I can learn it. It's kind of fun to learn new technologies and try new things out, right? That's the day-zero joy of learning something new. But then now all of a sudden you've got to support it, right?

[00:17:09] - Adam

Yeah.

[00:17:10] - Ry

You support what you make, a lot of times. Yeah, you got judoed into that.

[00:17:16] - Adam

In my career, it was fun to learn Kafka and there's some things I really like about it, but then at the same time, it's a very complex tool and it does take a team to run and manage it. Same with RabbitMQ. If you're going to do those things, you need some people dedicated to making sure that they're functioning the way you expect it to.

[00:17:45] - Ry

Yeah, well, I think that kind of leads into one of our core missions at the company, which you're leading at Tembo, which is our stacks. I guess it probably would make sense for you to say a few words on what we're trying to accomplish with stacks at the company.

[00:18:05] - Adam

Yeah. So stacks are workload-optimized Postgres clusters. The message queue stack is one that we have, and our goal with that is, if you need to run a message queue, we want this to be the most optimized way to do a message queue on Postgres. Of course, there'll be a point in time when it's like, hey, your use case has grown so big that maybe that stack's not going to fit you, but that stack will take you to the very edge of what Postgres can do for that workload. We're taking that same approach with our OLAP stack that we're building right now. We have an OLTP stack, there's a machine learning stack. So each one of these stacks is: do it on Postgres, and we're going to make it the best possible, squeeze every last bit of juice out of Postgres that we possibly can for that workload.

[00:19:08] - Ry

Yeah. And you're curating extensions for each stack. What else are you doing besides that?

[00:19:13] - Adam

Yeah, extensions are a big piece. A lot of that has to do with the kinds of developer experience that we think people want around workloads. And then there's the Postgres configuration itself. So, what should shared buffers be, or how many parallel worker processes, and how should autovacuum be tuned for that workload? That's a whole class of things that are unique to every stack. Of course, there's a user interface component of every stack as well. So if you want to come in and look and observe the stack, there'll be a user interface. Metrics are really tightly related to the UI, so there's different metrics for every stack as well. Some of them are going to be the same across stacks. But for example, the current number of messages in a queue, that's a metric that you can monitor, you can write alerts around that metric, and it's mostly unique to the message queue workload.
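The configuration piece Adam mentions boils down to per-workload settings like these (the values are illustrative, not Tembo's actual tuning):

```sql
ALTER SYSTEM SET shared_buffers = '4GB';                 -- takes effect after a restart
ALTER SYSTEM SET max_parallel_workers_per_gather = 4;
ALTER SYSTEM SET autovacuum_vacuum_scale_factor = 0.05;  -- vacuum hot tables sooner
SELECT pg_reload_conf();                                 -- applies the reloadable settings
```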

[00:20:32] - Ry

Yeah. And if the message queue stack is competing against a commercial queue product, they probably have some sort of visualization of the state of each queue.

[00:20:45] - Adam

Right?

[00:20:46] - Ry

And so a Postgres-backed queue ought to have that same UI, that same tailored monitoring system. I don't know how many times harder that makes it versus just configuration and curating some extensions, but I think it's all worth it to the user to be able to really defend their choice to use Postgres for this use case against one of the modern data stack alternatives.

[00:21:20] - Adam

Right? Yeah. I think something really useful that I like about having stacks, and having Postgres aligned to specific workloads, is that the complexity of an overall application can really come down a lot by running everything on Postgres. You could still have separate Postgres clusters for workloads, like a certain set of CPU and memory dedicated to mission-critical process A and a separate one for this other mission-critical process. But still, when it comes to troubleshooting these things, you're troubleshooting Postgres, and it's not like, hey, I have to switch and be like, okay, now I'm troubleshooting Kafka, or jumping to Redis or jumping to Mongo or Snowflake. The context switching for the developers, I think, is big-time minimized when it's still the same underlying data store between all the different workloads, the same technology. Yeah.

[00:22:27] - Ry

And we have this vision of potentially having some built-in stack connectivity, right, where these databases, if they're all kind of sitting back to back to back, so you have five different stacks, they could and should be able to communicate really well with each other. And you should be able to write queries across them, in the same way that Citus allows you to write queries across an array of Postgres clusters very efficiently. You should be able to do the same thing here and pull data from multiple stacks with a very nice user experience, again, without having to move data around. So that's one of the exciting things for me as a former Airflow company founder. All these data pipelines are very painful between modern data stack companies. And one of the things I'm excited about is the possibility that we can give developers the option to not have to create all those pipelines.

[00:23:36] - Adam

Yeah, I'm really excited to work on that problem when we start doing that. That'll be, I think, a really big differentiator: to say, I have ten different machines running Postgres and I have a single pane of view across all of them. I think it's definitely doable. It'll be challenging.

[00:24:02] - Ry

Yeah, it's a dream we're chasing. Yeah. Good. Well, I think it was great chatting with you, Adam. I'm sure we'll have you on the show again. I know that, again, you've got a lot more extensions coming and appreciate the work you've done for the community so far and yeah, looking forward to seeing your future work and talking about it.

[00:24:28] - Adam

Sounds good. Thanks a lot, Ry.

· 26 min read

Episode Notes

In this episode, Ry and Marco talk about the early days of Citus and its development, creating pg_cron (on a plane!), and the new possibilities on the horizon for extensions in the Postgres landscape. If you haven’t seen or listened to it yet, you can play the video below, or listen on Apple/Spotify (or your podcast platform of choice). Special thanks to Marco for joining us today!

Want to know more about something they mentioned? Here’s a starting point:

Did you enjoy the episode? Have ideas for someone else we should invite? Let us know your thoughts on X at @tembo_io or share them with the team in our Slack Community.

Transcript

[00:00:12] - Ry

Welcome to Hacking Postgres. Today we have Marco Slot joining us. I'm meeting Marco for the first time today. I'm super excited to be meeting you, Marco. Welcome to the show.

[00:00:25] - Marco

Hey Ry. Yeah, very nice to meet you. Yeah. So I'm Marco, I work at Microsoft, and I've been working on Postgres extensions for about nine years now, starting with Citus, pg_cron, and a bunch of other ones. Yeah.

[00:00:44] - Ry

That's awesome.

[00:00:45] - Marco

Yeah.

[00:00:45] - Ry

So this is not really supposed to be an interview. It's more just us geeking out about Postgres, extensions, what they mean to users and opinions about anything and everything related to our work is fair game. Yeah, sounds good. So what'd you do before you got into Postgres stuff, out of curiosity?

[00:01:09] - Marco

Yeah. My background is more in distributed systems, so I spent a few years at AWS. It was quite early days at AWS when I joined, I think. There were three web services: S3, SQS, and EC2.

[00:01:24] - Ry

Wow.

[00:01:25] - Marco

And then we built number four, which was CloudFront, because people had started using S3 for websites, which it wasn't really built for. But then we were asked to solve this, and we built the CDN, and then we also built Route 53 for the DNS.

[00:01:40] - Ry

Wow.

[00:01:41] - Marco

And then yeah, I did a PhD in sort of self driving cars, but specifically like, cooperation between self driving cars, like, what if they can communicate with each other and sort of coordinate? So it's sort of advanced distributed systems in a way as well. Yeah. So that led to me joining Citus, which is sort of founded by former Amazon folks.

[00:02:05] - Ry

Got it. So building Citus was probably easy compared to all that other stuff, actually.

[00:02:12] - Marco

Yeah. If you do cooperation between self-driving cars, which is this kind of life-critical distributed system where your nodes kind of just move around and away from each other, it does make things look relatively easy. But then, yeah, building databases is never actually very easy.

[00:02:31] - Ry

No, I know. How good do you want to make it, right? That's how hard it becomes.

[00:02:38] - Marco

Yeah. I guess the challenge is always, everything is sort of in the word relational, right? Everything relates to everything else. There's not like a feature you can implement without considering all the other aspects of the database, and all the things that can happen concurrently with the operation that you're working on, and all the possible ways in which things can fail. Especially if you're in a kind of distributed setup.

[00:03:01] - Ry

Yeah, I think that's a really interesting perspective. You think of all of the configuration options, too. Adding a new configuration option adds complexity, adds flexibility, adds potential complications, and it's tough.

[00:03:20] - Marco

Yeah, definitely. I think one thing we learned is there's a certain danger in creating modes. Every setting you add that can be on/off, now you have two modes in your system. You add another setting that can be on/off, now you have four modes in your system. It's kind of this exponential explosion of possible ways in which your database might be running. And so that's a hard-learned lesson: don't introduce major modes. Sometimes you have to, because of backwards compatibility reasons, maybe, but you want to get rid of them over time.

[00:03:57] - Ry

Were you one of the first team members on Citus? Actually, I don't really know the full history. Did it start as a fork and made its way as an extension, or was it always an extension?

[00:04:10] - Marco

Yeah, I was one of the first, though I think Samay was already there; I think he's joined Tembo. But yeah, we started out as a fork, and then we called it CitusDB, and then we made it an extension and called it Citus. Though I think the name CitusDB kind of stuck around; I still hear it sometimes. But yeah, it was very much the early days for extensions, like 2015. PostGIS obviously existed; it was one of the first major ones. But yeah, we were lucky at the time, we had Andres Freund working for us. Well, he's also now working for Microsoft, but he's one of the top Postgres committers, and we put him in a room for a while, like, can you turn this into an extension, please? And he came up with some terrible, terrible hacks. But over time, those also drove some of the extension APIs in Postgres itself, where things that used to require really terrible hacks are now more structured, and you can introduce your own data structures into Postgres without doing lots of weird stuff.

[00:05:22] - Ry

Well, I was thinking probably not everybody knows what Citus is. I'd love to hear a quick overview of what it is, and then maybe a couple of things that are part of it that are unusual, that people might not know about. I don't know if that's a stupid question, but yeah, if you could try.

[00:05:41] - Marco

Yeah, so it comes from basically this notion that Postgres is ultimately limited to a single machine. I mean, you can have read replicas, but you definitely cannot scale the writes. Usually if you have read replicas, they'll kind of end up having the same stuff in memory, so you cannot really scale the memory either. And so Citus is a solution that basically adds sharding to Postgres in a very transparent way, where you can have tables that are transparently sharded across many nodes. You still connect to a Postgres server; there's just a table that looks, walks, and talks like a table. But if you insert into it, it actually gets routed to a particular node, and if you insert another value, it might get routed to another node; it uses this hash distribution. And that way you can scale across infinitely many machines, and some users have petabytes of data in their Citus cluster. Part of the advantage is also that queries can get parallelized on these tables, because we can use all the nodes at the same time and multiple cores per node. But more and more it's also actually being used for OLTP workloads.
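The transparent part is that a distributed table is created with one function call and then queried like any other table. A minimal sketch, with made-up table and column names but Citus's real API:

```sql
CREATE EXTENSION citus;

CREATE TABLE events (
    tenant_id bigint,
    event_id  bigserial,
    payload   jsonb
);

-- hash-distributes rows across the worker nodes by tenant_id;
-- inserts and queries still go through the coordinator as plain SQL
SELECT create_distributed_table('events', 'tenant_id');
```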

[00:06:55] - Marco

And particularly the most popular use case is actually multitenancy, where you have this kind of B2B app where you have lots of relatively independent users, and you kind of want to just transparently spread those across different machines without having to actually manage that yourself and figure out how to move data around, because that's all very automated. You can just ask it to rebalance your data, and it does that using logical replication. And so then you have the advantage of just as much memory as you need to make your application fast, and as much disk IOPS as you need, by just adding more machines with more disks, and so that helps you scale. And then we have a sort of managed service in Azure around that. But yeah, we use every hook that's available in Postgres to achieve things, starting with the planner hook. That's one of the interesting things about Postgres: you can just replace the whole planner with something else. You get a parsed query tree, and then you can do whatever you want with it. You can just do some, okay, look at this query and log something interesting, and then just go to the regular planner.

[00:08:12] - Marco

Or you could just do something completely different. And so we use all these hooks to kind of create this facade around Postgres tables, where the Postgres table is still there, you just cannot touch it; we intercept everything that's going on. The most recent and most terrible hack we did was around logical decoding. So we have shards: the writes go into shards, but then if you do logical decoding, you see writes on shards; you don't see writes on these, what we call distributed tables. But the user doesn't even know about shards, probably. So how can we fix that? And there wasn't really any kind of hook we could use to change the way that the logical decoding process works, except we realized that the client specifies a decoder name, usually just hard-coded, like pgoutput or wal2json or test_decoding, and that actually refers to the name of a .so, like a shared library. And we realized we could create a .so with the same name, put it in a different directory, and then change the dynamic library path to be prefixed with that directory, so it would load our library instead.

[00:09:34] - Marco

This is kind of the worst thing we did. But it works nicely, actually. I mean, that new .so file calls the original .so file, and it kind of makes some changes to hide the sharding from the decoding. But yeah, it is a hack. It's very much a hack. But it's nice that Postgres ultimately lets you do these kinds of things.

[00:09:59] - Ry

I think Citus says there's a requirement that Citus loads first as an extension. Is that true? I'm trying to remember if I'm making that up.

[00:10:08] - Marco

It is true. I mean, maybe it's a bit of laziness on our part, but it's also...

[00:10:13] - Ry

Because...

[00:10:16] - Marco

We generate very weird plans that wouldn't make any sense to other extensions. So if you have multiple layers of planner hooks, you kind of want Citus to be last. By putting it first, it actually becomes the last, because everyone else overwrites Citus. And then hopefully, when you go through the chain of planner hooks, Citus is the last one remaining, and it can produce some quirky distributed query plan that doesn't make sense to anyone else and pick that up.
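In practice that ordering is expressed in `shared_preload_libraries`; each library loaded after Citus installs its planner hook on top of it, so the first-listed entry ends up running last. A sketch (the specific library list is just an example):

```sql
-- Citus first, so later entries wrap its planner hook
ALTER SYSTEM SET shared_preload_libraries = 'citus,pg_cron';
-- requires a server restart to take effect
```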

[00:10:48] - Ry

Are there known extensions that aren't compatible with Citus that you're aware of? Or, as far as you know, does it kind of work with most everything?

[00:10:56] - Marco

Yeah, I mean, Citus is a little bit of a paradigm shift, because now you have many servers, and some extensions just don't make sense if you're on many servers, because they keep their state on one of them and then they're not aware of the others. But most things just work. But I kind of feel like there's this notion of a deep extension, if you will, like TimescaleDB or Citus. They go really deep into planner hooks and change behavioral characteristics. Whereas something like PostGIS, and a lot of extensions that just introduce new types and functions, they're sitting on top of this much cleaner interface, and usually the interoperability of those things is pretty good. But Timescale has this notion of a hypertable, and Citus has a distributed table, and you cannot really make a distributed hypertable. But yeah, compatibility sometimes also comes down to the individual features. It's not like if you install one, then the other doesn't work anymore. I mean, that happens for some extensions, but for Citus, most other things just kind of work with it. Yeah.

[00:12:08] - Ry

So let's talk about pg_Cron. How early were you involved in that project?

[00:12:13] - Marco

Well, I created it. I used to have this thing where I am much more productive on flights, and I flew to San Francisco a lot for a while, and I had a few of these projects that I was just working on on flights, and pg_cron was one of them. At the time we had customers who did these kind of real-time analytics scenarios on Postgres, on Citus. And those involve a lot of materialization: a bunch of raw data keeps coming in, like time series data, and at some point you want to take all the new data and pre-aggregate it inside of the database, and that needs to happen periodically. I think the background worker concept was pretty new at the time. I don't know, this was many years ago. And so I realized you could do something like cron. It maybe also comes from the Amazon background, because Amazon, at least at the time, was all glued together with cron, Perl, and rsync. There were cron jobs that would rsync metadata files to other servers, and it was all pretty hacky. But anyway, I figured it would be pretty useful to have this kind of thing in Postgres, also for just vacuuming, or calling certain stored procedures that, I don't know, delete your old data.

[00:13:42] - Marco

Those kinds of maintenance tasks. It's just much easier if you run a command once and then it works forever, rather than, I need to maintain this separate infrastructure that periodically connects to the database and does that. So that became quite popular. I think pretty much all major managed services have pg_cron now. Yeah, I kind of built it as, I mean, it was definitely my side project, so I also built it as a very low-maintenance thing that I wouldn't have to constantly fix bugs in. I'm also a little bit selective about adding new features. For a long time I resisted what people kept asking for: can I run a job every few seconds? Normally cron is at the minute granularity. But recently I caved and added running it every few seconds, because I realized it's the kind of feature you need to be careful with: if there's some kind of issue, some kind of memory leak or whatever, running it every few seconds is going to be way worse than running it once a minute. But you can now also do, like, every second, for example, which is kind of useful.
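Both granularities look like this in SQL (the job names and commands are made-up examples):

```sql
-- classic cron syntax: vacuum a table every night at 3:00
SELECT cron.schedule('nightly-vacuum', '0 3 * * *', 'VACUUM app.events');

-- the newer sub-minute interval syntax Marco mentions
SELECT cron.schedule('poll-queue', '10 seconds', 'CALL app.process_new_rows()');
```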

[00:14:56] - Ry

That's awesome. Yeah, we're just building a new data warehouse, trying to use Postgres to do that. I'm exploring the columnar extension as well, which is sort of part of Citus; I don't know if you did any work on that. But yeah, looking at pg_cron versus pg_timetable, have you looked at pg_timetable and its features? I'm kind of curious to get your assessment of the two.

[00:15:28] - Marco

Yeah, I've looked at it. I mean, for a long time it wasn't so much an extension; I think nowadays it is more or less an autonomous extension. Yeah, it's somewhat more complex, in a way. I don't feel competitive, like you should use pg_cron or you should use pg_timetable. I think pg_cron is just intentionally simple. It's this very simple thing that just works and does what you'd expect it to. Almost. Unless your time zone is different, unless you're not in GMT, then it sort of doesn't quite do what you'd expect anyway. So yeah, I think pg_cron is just simpler. But if you need something more specific, I guess it comes down to: do you need the extra features that pg_timetable adds? But I think for most people, pg_cron is good enough. I've also seen people do their own kind of job queues on top of pg_cron, where you have this thing that runs every few seconds or every minute, and then looks, are there some jobs to do? And it actually executes the stuff that's in the job table.

[00:16:39] - Marco

So it's kind of composable as well. You can build your own things on top if you want. Yeah.

[00:16:45] - Ry

And especially if you can trigger it every second to check for work, yeah, that's just fine. Yeah, my previous company was doing Apache Airflow, so it's kind of getting back to some of that stuff. I mean, obviously that's a very complicated system with a lot of capabilities, and DAG-like chaining of tasks and fanning out and all that kind of stuff. But yeah, I think it's interesting to try to do some of that stuff inside of Postgres without requiring anything else. We're trying to go with the mantra of, just use Postgres for everything.

[00:17:29] - Marco

It's not a bad mantra. You've got, I guess, different levels of commitment to that idea of how much of our backend can we shove into Postgres? And there's these GraphQL extensions where you pretty much put your entire backend in Postgres. Yeah.

[00:17:51] - Ry

Your API, is there?

[00:17:52] - Marco

Yeah, I guess Supabase is going a pretty long way in that direction, but at some point you want to debug stuff, or you want to have certain rollout procedures. And sometimes Postgres is not as mature a tool as just shipping Python files somewhere. There are just better CI tools for most normal programming languages than for Postgres. So you have to find the right balance, I think.

[00:18:24] - Ry

Yeah, and maybe this shouldn't have happened, but apparently yesterday we were working on a new FDW, and I think it hit an exception that wasn't handled, and it appeared to shut down the Postgres server. Which is kind of fun. Should that have happened? Should that be possible? Should an extension error take it down? We're still investigating that. But yeah, I think there's always risk with trying to do a lot inside, but I think with a mature extension that's well tested and battle tested, it should be fine. I mean, if you think about everything that's happening, there's no SRP in Postgres, right? It's not just doing one thing. It's got vacuum, there are lots of processes running. And so then the question is, is it a sin to add one more thing? Especially if it has the capability to have a background worker. It's just staring you in the face, like, use me.

[00:19:29] - Marco

The extension APIs are slowly evolving from just a complete hacky approach for people to try out their Postgres patches before merging them into core. To some extent, I guess, extensibility was there from the start, but that was more for custom types and custom functions than for planner hooks and background workers and those kinds of things. Those are a little bit more, what if we add this function pointer and see, maybe extensions can do something interesting with it. And it's not extremely well structured. But the whole Rust thing, like Pgrx and building extensions in Rust, does create an opportunity to have a little bit more of a well-defined method of developing extensions.

[00:20:25] - Ry

Yeah, I think what's interesting to me about Postgres is the history of forks, and it sucks to be on a true fork, as I'm sure you're aware. Creating the extension framework helps people out of the land of forks, but it does create the possibility for even riskier extensions, in terms of all that. So it's like a double-edged sword. I think it's great, though. That, to me, is the reason why Postgres is doing so well now: all that freedom that the core team has given the community in terms of how to use it. It's unprecedented, almost dangerous, but I also think empowering to developers who want to have some fun with their database.

[00:21:21] - Marco

Yeah, definitely. What we also often find is just the deployment model of an extension versus an external tool. We're sort of in the business of running a Postgres managed service, but even if you just have your own Postgres servers, an extension has a fairly straightforward deployment model, right? You don't have to create a whole separate cluster with its own configuration that connects to your Postgres database and does things with it. You just shove it into your existing Postgres server, and it'll restart the background workers, and you don't have to worry about all those things. It does have a lot of merit, actually, to develop software as a Postgres extension. I mean, it comes with its own protocol. I think the most interesting thing is you get all this synergy between all these different extensions, that the sum of the parts is greater than the, or what is it, the whole is greater than the sum of the parts. But also things like PostGIS and Citus, they kind of layer: you can do distributed geospatial joins, whatever. But then also maybe you have some interesting function and you can run it periodically using pg_cron.

[00:22:39] - Marco

And it's like all these things kind of give you this little platform where, just by running SQL queries, you can do increasingly interesting things. Maybe you have an FDW that does an HTTP call, and you run it using pg_cron, for example. And now pg_cron is not just Postgres cron, it's actually anything cron. You can reach out to other systems. So that pluggability...
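That combination, a foreign data wrapper plus a schedule, looks something like this (the foreign table and target table are hypothetical):

```sql
-- pull from an HTTP-backed foreign table into a local table every 15 minutes
SELECT cron.schedule(
    'ingest-users',
    '*/15 * * * *',
    $$INSERT INTO users_warehouse SELECT * FROM users_api$$
);
```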

[00:23:10] - Ry

I literally have a co-op that started last week, and his first task was, I said, create an FDW for this API and use pg_cron and ingest that data. And he's like, well, how do I do it? Just brand new to data warehousing in general. But he's got the FDW built, and now he's working on it. If he has any problems, I'll send him your way. Kidding. I won't. Kidding, kidding. I won't do that.

[00:23:41] - Marco

If there are bugs, open issues; I'd like to know, of course.

[00:23:44] - Ry

But there's no bugs. No. So do you have any long flights coming up where you have some new extensions coming?

[00:23:51] - Marco

No, we had a baby last year. It's like my flight budget is very low at the moment. That's true.

[00:23:58] - Ry

Your latest extension is human.

[00:24:01] - Marco

Yeah, but yeah, I'd like to play more with Rust because that the problem with extensions in the past. Like developing them in C has always been like c doesn't have any good sort of dependency framework. There's these old established library like, I don't know, LibZ and LibXML or something, like something that was developed 20 years ago and now, okay, every system has it, you can use it, but for anything newer than ten years, it's extremely annoying and hard to use a C library. And then even if you can figure out your configure and your make files, the memory allocation is probably going to not play nicely with Postgres. So that's where Rust is kind of interesting. Now there's this whole ecosystem of open source libraries that can potentially become extensions and we're still sort of at the very start of that, what's going to happen?

[00:25:05] - Ry

It's kind of scary because it could be very much of a Cambrian explosion. What happens if there are 300 new extensions that are worth adding? It's sort of a pain for the managed service providers.

[00:25:19] - Marco

Yeah, definitely. Yeah. And they can have funny incompatibilities. It helps that Rust is a little bit more safe, not managed exactly, but the memory allocation is a bit safer as well. But then, yeah, you can have a lot of weird bugs if you combine all these unknown bits of code.

[00:25:46] - Ry

That's great. Are there any extensions that you've run across either recently or in the past that you love, that you would mention as exciting? I mean, you mentioned PostGIS and anything else that you can think of. It's no big deal if not.

[00:26:04] - Marco

Yeah, there's many. I'm sort of intrigued by PostgresML. It's this machine learning extension, and they seem to be doing a lot of interesting things. I don't have a good use case for it myself yet, but I really like what they're doing. Also in Rust, I think MIT-licensed, so it can be used in a lot of places. There's of course a lot of buzz around pgvector, and more recently pg_embedding, which is sort of an interesting development, because now I think they both added these HNSW indexes, which are faster than the IVFFlat indexes that pgvector provided. But then it also means they are now incompatible with the latest versions. You can create one or the other, not both, which is a little awkward. But that is, of course, one of the fastest-moving areas, and I think some of these things will be figured out.

[00:27:06] - Ry

Have you gotten those extensions added to the Citus managed service?

[00:27:11] - Marco

Yeah, we definitely have pgvector. Yeah, I mean, I think that took the landscape by storm. I think it's pretty much on every managed service now.

[00:27:19] - Ry

It's interesting, we're working on this thing called Trunk, where, and I don't know if we can get you guys to participate, the idea would be to send some metadata to some sort of central repository, so that we'd know which extensions are trending, waning, or completely dead. I just think it's interesting: as a new person coming into the Postgres ecosystem, there's this library of potential extensions that's pretty expansive and kind of trails off into the unknown, because you can find them loosely here or there on GitHub. But I think having a better directory of them would be good for the community.

[00:28:04] - Marco

Yeah, I guess. One thing that's also tricky about extensions: in an ideal world, you write it once and then it works forever. In the real world, every Postgres version potentially breaks your extension, so someone has to actually go and fix it for new Postgres versions. Sometimes it's okay. I think with pg_cron, that happened once or twice. I think Postgres 16, did it actually? No, it did require some fixes. I think there was one Postgres version which didn't require any fixes, maybe PG 15; I was already happy. But usually, C is a bit of the wild west of programming languages. Sometimes just some function header changes, there's an extra argument, and then none of the extensions that use that function compile. You have to have some level of maintenance behind each extension, and, well, not every extension does.

[00:29:03] - Ry

I mean, it was great to chat with you. Happy to have you back on again. If you have a big new release of Citus, I don't know, do you guys have any big plans for it or has it reached a certain point of stability where it is what it is? Yeah, I'm kind of curious.

[00:29:19] - Marco

Well, our most recent major release added this notion of schema-based sharding. So far it's always been, you have distributed tables and you have a distribution column, and you need to pick which columns you use. But with schema-based sharding, every schema becomes its own shard group, essentially, so it might be on a different node. So for apps that use a schema per tenant, that's a very nice model. And so we're investing more in that at the moment.
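As I understand the Citus 12 docs, turning this on looks roughly like the following; the tenant schema is a made-up example:

```sql
SET citus.enable_schema_based_sharding TO on;

-- each schema created now becomes its own shard group,
-- placed as a unit on one of the nodes
CREATE SCHEMA tenant_42;
CREATE TABLE tenant_42.orders (id bigserial PRIMARY KEY, total numeric);
```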

[00:29:49] - Ry

Well, great. Good to meet you. Looking forward to continuing to get to know you. And thanks for joining us on the show.

[00:29:56] - Marco

Yeah, it was great. Thanks for having me.