
So many thoughts on this. The community has definitely ebbed and flowed on this for a while. A few varying pieces of insight follow, with no intention other than to share a bit more on the PG community. And I'm sure some current and former colleagues already in the comment threads will correct me on the nuance of a lot of this.

For several years there were no new committers at all. In recent years the team has tried to be a little more intentional about adding new ones and culling those no longer involved.

About 15 years ago there was a phase of letting a lot of younger people earn their commit bit. I can recall 3 people by name who all got a commit bit before the age of 25, and they may actually all have been under 22. One of those three moved on to work outside the Postgres community shortly after, another was quietly busy on other things for over 10 years before coming back, and the third stayed actively involved going forward. I suspect there was some unease about folks getting a commit bit and then sort of falling off a cliff, so adding new folks slowed for a few years. Edit - sounds like it was less age driven, though maybe still slightly related to some folks falling off, that new committers slowed down. tldr: you're not getting a commit bit right out of college for Postgres.

What would be interesting to me, but likely hard to gather, is at what age people become a committer to Postgres. It wouldn't surprise me if the average age of getting a commit bit is closer to 45 than not. Many folks contributing come to Postgres after other systems work, or just don't consider contributing until they're a bit more seasoned because it feels intimidating. I mean, patches sent on a mailing list, who does that any more? Postgres, that's who.


I have the honor of working with a Postgres ~committer~ contributor who was just over 25 when they first contributed! The story about their first commit is great:

They were testing SQL behavior for Materialize and thought to check that both systems handle interval functions identically. Being thorough, they tried something like:

  select interval '0.5 months 2147483647 days';
You can try it yourself on dbfiddle[0]. Instead of erroring, Postgres returned a bogus value `{"days":-2147483634}`; you can read why here[1].

So naturally they decided to fix it in Postgres, which is why they contributed and why it's handled properly in 15+ [2]

[0] https://www.db-fiddle.com/f/ijT76fsmL99bHvXxhAtf7j/0
[1] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit...
[2] https://www.db-fiddle.com/f/i3KikCb72AN1EZpywErZvr/1


> I have the honor of working with a Postgres committer ...

That's not a committer, that's someone who submitted a patch that got committed. A committer is the one who actually applies the patch and can push the branch into the mainline repo. Committers decide if something is worthy of being merged.

Now that aside, yes this plus reviewing patches to get a wider feel for the codebase is how you eventually become a committer.

Best way to eat an elephant is one bite at a time.


This is a common source of confusion for a ton of folks. Anyone can submit a patch, but commit bits are reserved for a much smaller list. The attitude is something like: you commit it, you maintain it. So if bugs come in, you'll spend your time fixing those for whatever time it takes vs. working on the next shiny feature that you're excited about for the next release.

There's a sort of fuzzy list of "major" contributors (https://www.postgresql.org/community/contributors/), people who contributed major features, and then a list of other contributors. Depending on who you talk to, this is either dated or a pretty close, though not perfect, attempt at reflecting reality. In recent years they expanded the contributors list to include others contributing in non-code ways, though it's still a decent place to find people contributing to major feature sets.

Of course this is not to be confused with the core team, which is more like a steering committee, though not so much a steering committee for code and feature sets.


Ahh, thanks for clarifying - now I better understand the significance of the OP's point about the rarity of younger COMMITTERs.


The thing about becoming a PG contributor is that the barrier to entry is fairly high.

I love Postgres so much I have a PG tattoo, but from the perspective of the two ways you can contribute:

- As a random user, in your free time: There's not a ton of "good first issue" type tickets where you can ease your way into PG dev by working on something that doesn't require context on many parts of the PG architecture and at least a little historical knowledge of why things are written the way they are. Also, it can be a bit intimidating to have your patches reviewed by the likes of Tom or Andres.

- As a developer for a paid PG company like EDB/PG Pros/Crunchy etc: It's a sort of Catch-22 scenario here, where it's difficult to get hired as a junior without having previous PG hacking experience, but the path to doing that is not the easiest thing in the world.

If I were going to work somewhere that wasn't $CURRENT_CO, it'd be somewhere doing PG work, but there aren't a lot of viable avenues/inroads there.


PostgreSQL isn't that special as a codebase. Every codebase has its quirks, every project has its own processes and there's a learning curve. When you switch to a new job as a software engineer, you pick it up. PostgreSQL is no different: you can hire an engineer to work on PostgreSQL.

I'm not sure how well that path works in growing new contributors, though. In a usual company setting, the goals are better defined, and the company is in control. Once you reach the goals, mission accomplished. With an open source project it's more nebulous. Others might have different criteria and different priorities. You are not in control. Choosing the right problems to work on is important.

Other storage or database projects would be a good source of new contributors. If you have worked on another DBMS, you're already familiar with the domain, and the usual techniques and tradeoffs. But to stick around, you need some internal desire to contribute, not just achieve some specific goals.


The biggest hurdle I see is that it is a C project, unfortunately something we can do nothing about. It is so much harder to trust random code not to have serious implications for the database. It will take ages for someone to get comfortable with the pg-codebase way of handling errors, basic string manipulation, memory alloc/free, etc.

I want to highlight the difference between "making a non-core contribution" and "understanding database internals". It is the former, not the latter, that is the first hurdle.

I wanted to reuse built-in pg code to parse the printed statements from logs - I ended up writing a parser (in a non-C language) myself, which was faster.


Couple of points in this post, so will address a few of them:

  "(Paraphrased) C is bad, and it takes forever to pick up the PG-specific C idioms"
There's probably not a productive conversation to be had about C as a language. I will say that as of C23, the language is not quite as barebones as it used to be and incorporates a lot of modern improvements.

On the topic of PG-specific C -- there are a handful of replacements for common operations that you use in PG. Things like "palloc/pfree", and the built-in macros for error and warning logging, etc.

I genuinely don't think it would take a motivated party more than a day or two to pick all of these up -- there aren't that many of them and they tend to map to things you're already used to.
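
To make that concrete, here's a rough sketch of what those idioms look like in a tiny backend helper. The function itself is hypothetical and this won't build outside the server source tree, but palloc/pfree, ereport, and pg_toupper are the real names:

  #include "postgres.h"    /* the first include in every backend .c file */

  /* Hypothetical helper: return an upper-cased copy of a string. */
  static char *
  copy_and_upper(const char *src)
  {
      Size        len = strlen(src);
      Size        i;
      char       *dst;

      if (len == 0)
          ereport(ERROR,
                  (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
                   errmsg("input string must not be empty")));

      /*
       * palloc allocates in the current memory context and errors out on
       * OOM rather than returning NULL; the memory is released
       * automatically when the context is reset.
       */
      dst = palloc(len + 1);
      for (i = 0; i < len; i++)
          dst[i] = pg_toupper((unsigned char) src[i]);
      dst[len] = '\0';

      return dst;    /* caller can pfree(dst), or let the context clean it up */
  }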

  "I wanted to reuse builtin pg code to parse the printed statements from logs - I ended up writing a parser (in a non-C language) myself which was faster."
It's true that the core PG code isn't written in a modular way that's friendly to piecemeal integration into other projects (outside of libpq).

For THIS PARTICULAR case, the pganalyze team has actually extracted the PG parser for inclusion in your own projects:

https://github.com/pganalyze/libpg_query


libpg_query is a godsend of a library. I spent a lot of time writing a custom parser before I found it - was very happy to replace the whole thing. A major boon was the fingerprinting ability - one of my needs was to track query versions in metadata.


I disagree on this. Yes it's C. But I've heard people comment "I don't like writing C, but I don't mind Postgres C".

The bigger hurdle which Peter mentioned in another thread is simply building up enough expertise with the system and having the right level of domain expertise.


> Yes it's C. But I've heard people comment "I don't like writing C, but I don't mind Postgres C".

While "Postgres C" might be wonderful, in practice learning the project's unique idioms is yet another hurdle for newcomers to overcome.


Every project has unique idioms. Let alone ones that are 30+ years old.

Idioms are a baked in cost of learning to contribute to any project.


I found that I learned a lot when trying to write a logical decoding plugin. So I guess if you are a user of Postgres and there’s some small friction you could reduce by writing a plugin, it’s a good way to get started. Scratch your own itch, you don’t have to publish the results :-)
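
For anyone curious what that looks like, here's a minimal sketch of an output plugin, compiled as a shared library against the server headers. The plugin name and output text are made up for illustration; _PG_output_plugin_init and the callback signatures are the actual API:

  /* demo_decoder: a toy logical decoding output plugin (hypothetical). */
  #include "postgres.h"
  #include "fmgr.h"
  #include "lib/stringinfo.h"
  #include "replication/logical.h"
  #include "replication/output_plugin.h"

  PG_MODULE_MAGIC;

  static void
  demo_begin(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
  {
      /* nothing to emit at transaction start for this toy */
  }

  static void
  demo_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
              Relation relation, ReorderBufferChange *change)
  {
      /* emit one line of output per decoded row change */
      OutputPluginPrepareWrite(ctx, true);
      appendStringInfoString(ctx->out, "row changed");
      OutputPluginWrite(ctx, true);
  }

  static void
  demo_commit(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
              XLogRecPtr commit_lsn)
  {
      /* nothing to emit at commit either */
  }

  void
  _PG_output_plugin_init(OutputPluginCallbacks *cb)
  {
      /* begin_cb, change_cb and commit_cb are the required callbacks */
      cb->begin_cb = demo_begin;
      cb->change_cb = demo_change;
      cb->commit_cb = demo_commit;
  }

Once built and installed, you'd exercise it with pg_create_logical_replication_slot('demo_slot', 'demo_decoder') and pg_logical_slot_get_changes('demo_slot', NULL, NULL).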


I don't have the data for the average age, but I was recently in a conversation about how long it takes to become a committer after getting involved in Postgres by writing code for it.

So, I wrote a couple of git commands like the ones below [1] to figure out when someone was first named in a commit message vs. when they made their first commit (as a committer), for the last 10 people who became committers.

The average time of involvement was ~8.9 years (just comparing month / year), with the lowest being ~6.5 years.

Obviously one could do better analysis but my goal was just to get an approximate understanding.

[1]

  git log --grep 'Name' --format=%cs | sort | head -1
  git log --author 'Name' --format=%cs | sort | head -1


How much bigger (in lines of code) is Postgres now versus the one from 15 years ago?

Maybe it was more approachable for a 22yo then; you could figure out more of it.

Also, C was a standard language back then; today the kids are more likely to program in Rust than in C.


> How much bigger (in lines of code) is Postgres now versus the one from 15 years ago?

I was curious as well and wrote a very crude script to measure it:

  for t in $(git tag -l|grep -E 'REL.*_0$|REL[67]_[0-4]$'|grep -v REL2);do echo -ne "$t\t"; git ls-tree -r $t --object-only |xargs git show |grep -a -v '^\s+$'|wc -l;done
  REL6_1          270033
  REL6_2          320297
  REL6_3          386532
  REL7_0          630771
  REL7_1          843219
  REL7_2          986991
  REL7_3          1363668
  REL7_4          1492418
  REL8_0_0        1649775
  REL8_1_0        1702325
  REL8_2_0        1806170
  REL8_3_0        2017685
  REL8_4_0        1924918
  REL9_0_0        2011704
  REL9_1_0        2225796
  REL9_2_0        2290872
  REL9_3_0        2405598
  REL9_4_0        2487304
  REL9_5_0        2527906
  REL9_6_0        2632559
  REL_10_0        2534653
  REL_11_0        2771914
  REL_12_0        2697892
  REL_13_0        2822066
  REL_14_0        2980221
  REL_15_0        3054963
  REL_16_0        3351147
This is counting non-empty lines. It's definitely not a good measure of overall code size, as it includes things like regression test "expected" files. But as that's true for all versions, it should still allow for a decent comparison.

8.3.0 was released 2008-02-01 with 2M non-empty lines; we're now at 3.4M.


I suspect you'd get much more useful results by checking out the version tags and running `cloc` - https://github.com/AlDanial/cloc


great contribution here from Craig, in terms of the ebbs and flows and useful history. i had no idea about that cluster of folks under 22 with commit bits.


Just coming to say hi, Paul! I recall you being very excited getting the email after you submitted your first patch, and loving it.

I think a ton of the current set of "next generation" came to Postgres a good bit later. Even Tom himself will tell you he did some stuff with images for a few years, which undersells it (tiff, jpg, png - he was in some form involved in the creation of each of those), then found this Postgres thing and started working on it.


Craig, you really made me feel welcome in the Postgres community. Maybe people have. It's such an amazing group. Thanks for all the pg advocacy you've done!


I definitely recall being on the talk committee one year for DjangoCon, and there was some rough discussion. (Context: The conference was generally a two track conference but keynotes and a few other sessions were single track). One of the single track talks was about Postgres. The discussion was roughly "If we have a Postgres talk we should have another talk like Mongo or MySQL" and the response was roughly "Everyone in Django is using Postgres and if you're not you should be at the talk to learn why you should".

Way more Rails apps used MySQL or other databases; it was largely Heroku winning Rails that led to the strong Postgres adoption amongst that community.


Having dealt with ETL tooling that starts as "simple" and, next thing you know, needs a few dedicated hires for what was supposed to be a simple pipeline, the phrase resonates a lot. If you've already built out a team to do this, tried multiple different tools, and then found something that just works, I'd be all for it; to me it would have been frustrating to have gone down some other path first.

As a vision statement it resonates with me; now I'm curious to give it a try and see if it delivers on that vision.


Some of even the Salesforce Tower space is being sublet and used by WeWorks at the moment. It's really a great time to take advantage of some of the WeWork setups if you want to be in SF. The rates haven't changed much, yet the general space and quality of amenities to me seem not far off from some of the posh setups of Heroku or GitHub (maybe not quite that level, but not far off).


https://www.craigkerstiens.com/

Been up and down on cadence over the years, but a few posts that have shown up here.

- Give me back my monolith - https://www.craigkerstiens.com/2019/03/13/give-me-back-my-mo...

- Why Postgres - https://www.craigkerstiens.com/2012/04/30/why-postgres/

- Unfinished business with Postgres - https://www.craigkerstiens.com/2022/05/18/unfinished-busines...

- A guide to PR for startups - https://www.craigkerstiens.com/2015/07/21/a-guide-to-pr-for-...


I'd dispute that MOST of the revenue is from Postgres, though it is likely a very large double digit percentage of revenue.

Their managed Postgres got an unfair advantage early on by being the default: with a new Rails app you were just given a Heroku Postgres database. Heroku's Postgres offering was over 5 years ahead of RDS Postgres, and a large part of the reason Amazon added support for Postgres; for years they fought it, but eventually had to give in to the constant customer requests. Unfortunately that Postgres innovation has stalled out a bit over time due to Salesforce starvation.

I do admit that for what people think of as PaaS, the Postgres revenue is significant; the part people would probably be surprised about is how much revenue the add-ons marketplace constitutes.

Interestingly New Relic got the same privileged experience in the early days and it resulted in the same type of revenue and mass adoption for New Relic.


Yep, still around, maybe should do a bit more personal blogging, but still here and not going anywhere any time soon :)


For being "boring and stable" technology, Postgres 16 includes 200 features, which is on par with each of prior years releases. If you're looking for the full set of features it's available here - https://www.postgresql.org/docs/16/release-16.html.

Personal favorites on the list include:

- load_balance_hosts, which is an improvement to libpq so you can load balance across multiple Postgres instances (rough sketch after this list).

- Logical replication on standbys

- pg_stat_io, a new view that shows IO details.
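
On the first of those, here's a hedged sketch of what it looks like from libpq. The hostnames and database details are made up, but load_balance_hosts=random is the actual new connection parameter in 16:

  #include <stdio.h>
  #include <stdlib.h>
  #include <libpq-fe.h>

  int
  main(void)
  {
      /* hypothetical hosts; libpq picks one at random per connection */
      PGconn *conn = PQconnectdb(
          "host=pg1.example.com,pg2.example.com,pg3.example.com "
          "dbname=app user=app "
          "load_balance_hosts=random");

      if (PQstatus(conn) != CONNECTION_OK)
      {
          fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
          PQfinish(conn);
          return EXIT_FAILURE;
      }

      printf("connected to %s\n", PQhost(conn));  /* which host won the draw */
      PQfinish(conn);
      return EXIT_SUCCESS;
  }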


Could you elaborate on the load balancing?

Is this a replacement for PG bouncer and similar?


IMO the biggest reason folks use pgbouncer is not load balancing (which it can do, -ish) but connection pooling. Postgres connections are expensive for the db server (one process per connection, not one thread), so if you have, say, thousands of web application pods, you need pgbouncer or similar as a proxy to multiplex those thousands of connections down onto a more manageable number (~200). So no, not really.

(EDIT: if you don't know this already - the _establishment_ of connections is also super expensive, so another reason to pgbounce is to keep connections persistent if you have app servers that are constantly opening and closing conns, or bursting open conns, or suchlike. Even if the total conns to pg don't go super high, the cost of constantly churning them can really hurt your db)


Out of curiosity, if the problem of connections being expensive is solvable by PGBouncer-style connection multiplexing, why doesn't Postgres just do that by itself?


If it were easy to do well in-core, I think we'd do it immediately. Unfortunately the interesting pooling modes in pgbouncer also break a few things - which would likely not be acceptable in core postgres. Avoiding such breakage requires non-trivial architectural adjustments.

Historically connection state and "process state" have been tightly coupled, for good server-side pooling they have to be divorced. While good pooling is doable with the current process model (passing the client file descriptor between processes using SCM_RIGHTS), it's much harder with processes than with threads - this is one of the reasons I think we will eventually need to migrate to threads.
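
The SCM_RIGHTS trick is plain POSIX rather than anything Postgres-specific; roughly, handing an open client socket to another process over a Unix-domain socket looks like this sketch (function and variable names are just for illustration):

  #include <string.h>
  #include <sys/socket.h>
  #include <sys/uio.h>

  /* Send fd_to_pass over 'channel', an already-connected Unix-domain
   * socket; the receiving process ends up with its own duplicate fd. */
  static int
  send_fd(int channel, int fd_to_pass)
  {
      char            dummy = 'x';
      struct iovec    iov = { .iov_base = &dummy, .iov_len = 1 };
      union
      {
          struct cmsghdr  align;             /* forces proper alignment */
          char            buf[CMSG_SPACE(sizeof(int))];
      } u;
      struct msghdr   msg = {0};
      struct cmsghdr *cmsg;

      msg.msg_iov = &iov;                    /* must send at least 1 byte */
      msg.msg_iovlen = 1;
      msg.msg_control = u.buf;
      msg.msg_controllen = sizeof(u.buf);

      cmsg = CMSG_FIRSTHDR(&msg);
      cmsg->cmsg_level = SOL_SOCKET;
      cmsg->cmsg_type = SCM_RIGHTS;          /* "pass these descriptors" */
      cmsg->cmsg_len = CMSG_LEN(sizeof(int));
      memcpy(CMSG_DATA(cmsg), &fd_to_pass, sizeof(int));

      return sendmsg(channel, &msg, 0);      /* -1 on error */
  }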

Eventually I want to get to a point where we have a limited number of "query execution workers" that handle query execution, utilized by a much larger number of client connections (which do not have dedicated threads each). Obviously it's a long way to that. Ah, the fun of working on a complicated application with a ~35 year history.

There also are use cases for pgbouncer that cannot be addressed on the server-side - one important one is to run pgbouncer on "application servers", to reduce the TCP+TLS connection establishment overhead and to share connections between application processes / threads. That can yield very substantial performance gains - completely independent of server side pooling support.


Because pgbouncer's transaction-based pooling, which is what the previous poster was referring to, breaks a few postgres features. This is fine for most applications, but not all. See the table on https://www.pgbouncer.org/features.html


I really wish there was a way to handle transaction pooling without losing session pooling features.

Essentially, a "please serialize everything (temp tables, SET GUC values, etc) from this session to disk and load it back when necessary".


yeah, good callout. I'd be totally fine with a server mode, or connection option, that opts you into pooling but then disables the incompatible features. You choose pooled mode if you need it, and the tradeoffs are documented.


I believe there has been talk, and perhaps progress, towards building it into PG, but I can't find the dev thread right now, and I don't know what the status of that is.

interestingly enough this is what Oracle does AFAIK. They are also process-per-conn & have an optional sidecar proxy thingy that you can run on your Oracle host to do the pooling. I would rather it be built more tightly into the rdbms, but that's not a terrible solution.


There were changes made to idle sessions in postgres 14.0 that were supposed to reduce the resource usage of open but idle connections.

Crunchydata mentioned it on their blog a while back (https://www.crunchydata.com/blog/five-tips-for-a-healthier-p...) and the pg 14 release notes mention a few changes to idle sessions (https://www.postgresql.org/docs/release/14.0/)

I don't know if they were sufficient that pgbouncer is no longer necessary, haven't had a need to try it.


Postgres forks an OS process for each connection, which is relatively heavyweight. Oracle has a similar architecture to Postgres, and they solved it with a "multi-threaded listener". MySQL, in contrast, uses threads, which makes connections lighter weight.

It's fundamentally a question of how the connection listener communicates with the rest of the database, e.g., using shared memory or some other IPC mechanism, work queues, etc. Having too many connections results in problems with concurrent access and lock contention independent of how heavyweight the actual listening process is.


That's pretty much been my take as well. I've been looking at Cockroach Labs (CockroachDB hosted cloud), and it's been kind of hard to get any detail on whether pgbouncer is beneficial for this, since I'm planning to connect via node_compat on Cloudflare Workers. I've kind of decided to punt the issue and connect directly for now in development, and closer to prod release do some tests with/without pgbouncer at each cloud location for the larger prod cluster instances.

Not the same service(s) as PG even if it's the same protocol, so I know it's really beneficial for connection queueing WRT my scenario for PG, but no idea on the CDB side.


Postgres 14 already made huge improvements on this front.

https://techcommunity.microsoft.com/t5/azure-database-for-po...


Personally love awesome Postgres - https://github.com/dhamaniasad/awesome-postgres

