I posted this in a comment already, but the results here line up with the original BOLT paper.
“For the GCC and Clang compilers, our evaluation shows that BOLT speeds up their binaries by up to 20.4% on top of FDO and LTO, and up to 52.1% if the binaries are built without FDO and LTO.”
“Up to” though is always hard to evaluate.
"Up to" is one of those "technically correct", it's probably more genuine and ethical to give a range in the same circumstances. If 95% of binaries get at least 18%. but the remaining 5% get much less than that, and that's important, then say that, maybe.
When I see stuff like this, I usually infer that 95% of cases get a median of 0% speedup, and a couple of cases get 20.4% or whatever. But a chart of speedups for each sort of thing it speeds up (or doesn't) doesn't make for good copy, I think.
Up to 10000% I think
https://xkcd.com/870/
Would the profiles and resulting binaries be highly CPU specific? I couldn't find any cross hardware notes in the original paper.
The examples I'm thinking of are CPUs with vastly different L1/L2/L3 cache profiles. Epyc vs. Xeon. Maybe Zen 3 vs. Zen 5.
Just wondering if it looks great on a benchmark machine (and a hyperscaler with a common hardware fleet) but might not look as great when distributing common binaries to the world. Doing profiling/optimising after release seems dicey.
Interesting question. I think most optimizations described in the BOLT paper are fairly hardware agnostic - branch prediction does not depend on the architecture, etc. But I'm not an expert on microarchitectures.
My first instinct is that the effect is too large to be real. But that should be something other people could reproduce and verify. The second thought is that it might overfit the benchmark code here, but they address it in the post. But any kind of double-digit improvement to Postgres performance would be very interesting.
(author here)
I agree the +40% effect feels a bit too good, but it only applies to the simple OLTP queries on in-memory data, so the inefficiencies may have an unexpectedly large impact. 30-40% would be a massive speedup, and I expected it to disappear with a more diverse profile, but it did not ...
The TPC-H speedups (~5-10%) seem much more plausible, considering the binary layout effects we sometimes observe during benchmarking.
Anyway, I'd welcome other people trying to reproduce these tests.
I looked and there is no mention of BOLT yet in the pgsql-hackers mailing list, that might be the more appropriate place to get more attention on this. Though there are certainly a few PostgreSQL developers reading here as well.
True. At the moment I don't have anything very "actionable" beyond "it's magically faster", so I wanted to investigate this a bit more before posting to -hackers. For example, after reading the paper I realized BOLT has a "-report-bad-layout" option to report cases of bad layout, so I wonder if we could use it to identify places where the code should be reorganized.
OTOH my blog is syndicated to https://planet.postgresql.org, so it's not particularly hidden from the other devs.
10% - 20% performance improvement for PostgreSQL "for free" is amazing. It almost sounds too good to be true.
There’s a section of the article at the end about how Postgres doesn’t have LTO enabled by default. I’m assuming they’re not doing PGO/FDO either?
From the Bolt paper: “For the GCC and Clang compilers, our evaluation shows that BOLT speeds up their binaries by up to 20.4% on top of FDO and LTO, and up to 52.1% if the binaries are built without FDO and LTO.”
I've always wondered how people actually get the profiles for Profile-Guided Optimization. Unit tests probably won't exercise the high-performance paths; you'd need a set of performance-stress tests. Is there a write-up on how everyone does it in the wild?
Google and Meta do in-production profiling. I think that tech is coming to everyone else slowly.
You might be surprised how much speedup you can get from (say) just running a test suite as PGO samples. If I had to guess, this is probably because compilers spend a lot of time optimising cold paths about which they would otherwise have no information.
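For anyone unfamiliar with what "running a test suite as PGO samples" involves, a typical instrumented-PGO cycle with clang might be sketched as follows. The file names, output paths, and the workload step are placeholders, not commands from the article:

```shell
# 1. Build with instrumentation; each run drops .profraw files.
clang -O2 -fprofile-generate=./profdata -o myapp main.c

# 2. Run something representative (a test suite, a benchmark, etc.).
./myapp < sample-workload.txt

# 3. Merge the raw profiles into one indexed profile.
llvm-profdata merge -output=merged.profdata ./profdata/*.profraw

# 4. Rebuild, letting the compiler use the observed hot/cold counts.
clang -O2 -fprofile-use=merged.profdata -o myapp main.c
```

GCC has an analogous `-fprofile-generate` / `-fprofile-use` pair, though the profile formats are not interchangeable between compilers.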
Yeah, getting the profile is obviously a very important step. Because if it wasn't, why collect the profile at all? We could just do "regular" LTO.
I'm not sure there's one correct way to collect the profile, though. ISTM we could either (a) collect one very "general" profile, to optimize for arbitrary workload, or (b) profile a single isolated workload, and optimize for it. In the blog I tried to do (b) first, and then merged the various profiles to do (a). But it's far from perfect, I think.
But even with the very "rough" profile from "make installcheck" (which is the basic set of regression tests), it still helps a lot. Which is nice. I agree it's probably because even that basic profile is sufficient for identifying the hot/cold paths.
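For reference, the sampling-based workflow the BOLT paper describes could be sketched roughly like this. The binary name, flag choices, and paths here are illustrative (not the exact commands from the post), and the binary needs to be linked with relocations preserved (e.g. `-Wl,--emit-relocs`):

```shell
# 1. Sample branch events with perf while a workload runs
#    (here: the PostgreSQL regression tests).
perf record -e cycles:u -j any,u -o perf.data -- make installcheck

# 2. Convert the perf samples into BOLT's profile format.
perf2bolt -p perf.data -o postgres.fdata ./postgres

# 3. Rewrite the binary with a profile-driven layout.
llvm-bolt ./postgres -o ./postgres.bolt -data=postgres.fdata \
    -reorder-blocks=ext-tsp -reorder-functions=hfsort -split-functions
```

Unlike compiler PGO, this operates on the already-linked binary, which is why no rebuild of the sources is needed.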
I think you have to be a bit careful here, since if the profiles are too different from what you'll actually see in production, you can end up regressing performance instead of improving it. E.g., imagine you use one kind of compression in test and another in production, and the FDO decides that your production compression code doesn't need optimization at all.
If you set up continuous profiling though (which you can use to get flamegraphs for production) you can use that same dataset for FDO.
Yeah, I was worried using the "wrong" profile might result in regressions. But I haven't really seen that in my tests, even when using profiles from quite different workloads (like OLTP vs. analytics, different TPC-H queries, etc.). So I guess most optimizations are fairly generic, etc.
There are some projects (not sure if they're available to use in anger) to generate PGO data using AI.
That's not how it works. BOLT is mainly about figuring out the most likely instructions that will run after branches and putting them close together in the binary. Unlikely instructions like error and exception paths can be put at the end of the binary. Putting the most used instructions close together leverages prefetching and cache so that unused instructions aren't what is being prefetched and cached.
In short it is better memory access patterns for instructions.
I suspect you know this based on the detail in your comment and just missed it, but parent is talking about FDO, not BOLT.
Yes, but I'm not talking about BOLT
With LTO, I think it's more complicated - it depends on the packagers/distributions; e.g. on Ubuntu we've apparently had -flto for years.
For distros, you're probably talking about small programs with shared libraries. I talked to the Bolt guy at an LLVM meeting and Bolt is set up for big statically linked programs like what you'd see at Facebook or Google (which has Propeller). It may have changed but even though they were upstreaming Bolt to LLVM, they didn't really have support for small programs with shared libraries.
How easy would it be to have an entire distro (re)built with BOLT? Say for example Gentoo?
Based on what "fishgoesblub" commented, building - read: `emerge -e @world` - a Gentoo system with profiling forced, and then using it in that "degraded" state for a while, ought to be able to inform PGO, right? If there's a really good speedup from putting hot code together, the hottest code after moderate use should suffice to speed things up, and this could be continually improved.
I'm also certain that if there were a way to anonymously share profiling data upstream (or with the maintainers), that would reduce the "degradation" from the first step above. I am 100% spitballing here. I'm a dedicated Gentoo sysadmin, but I know only a small bit about optimization of the sort being discussed here. So it's possible that every user would have to do the "unprofiled profiler" build first, which, if one cares, is probably a net negative for the planet - unless the idea pans out, in which case it's a huge positive: man hours, electricity, wear/endurance on parts, etc.
It would be difficult, as every package/program would need a step to generate the profile data by running the program the way a user would.
Is it theoretically possible to perform the profile generation+apply steps dynamically at runtime?
I wouldn't want to support it, but similar things have been done before.
Alexia Massalin's Synthesis[0] (pdf) operating system did JIT-like optimizations for system calls. Here's an LWN article[1] with a summary. Anyone who's interested in operating systems should read this thesis.
HP's Dynamo[2] runtime optimizer did JIT-like optimizations on PA-RISC binaries; it was released in 2000. DynamoRIO[3] is an open source descendant. Also, DEC had a similar tool for the Alpha, but I've forgotten the name.
[0] https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&d...
[1] https://lwn.net/Articles/270081/
[2] https://dl.acm.org/doi/pdf/10.1145/349299.349303
[3] https://dynamorio.org/
Nat Friedman developed "GNU Rope"[1] in 1998 which, if memory serves, was inspired by a tool that did the same thing in IRIX (cord, I believe).
[1] http://lwn.net/1998/1029/als/rope.html
This is getting way outside the traditional compiler model, but I believe the .NET JIT has been adding more support for this in the last couple versions. One aspect of it is covered at https://devblogs.microsoft.com/dotnet/performance-improvemen...
I believe some JIT systems already do PGO / might be extended to do what BOLT does.
It would be hard to trust the result.
Does it work with rustc binaries?
Already done. https://github.com/rust-lang/rust/pull/116352
completely out of the loop here so asking, what is BOLT, how does it actually improve postgres? what do the optimizations do under the hood? and how do we know they haven't disabled something mission critical?
Literally the second sentence
On the subject of completely free speedups to databases, someone sent a patch to MySQL many years ago that loads the text into hugepages, to reduce iTLB misses. It has large speedups and no negative consequences so of course it was ignored. The number of well-known techniques that FOSS projects refuse to adopt is large.
MySQL has adopted a lot of performance work from FB, Google, and others. Though I suspect they want their own implementations for license reasons.