Comments (22)
Compliments aside, where this article stops is where things get exciting. Postgres shines here, as do Vitess, Cassandra, and ScyllaDB; even MongoDB has materialized views now. Vitess and Scylla are so good, it's a shame they're not more popular among smaller startups!
What I haven’t seen yet is a really good library for managing materialized views.
For this reason I would strongly advise, in the spirit of https://wiki.postgresql.org/wiki/Don't_Do_This, that you Don't Do Materialized Views.
Sure, Differential/Timely Dataflow exist and they're very interesting, but I haven't gotten to build a database system with them, and the systems that expose them in a usable form to end users (e.g. Materialize) are too non-boring for me to want to deploy in a production app.
https://github.com/sraoss/pg_ivm
However, it's not available on RDS, so I've never had the chance to try it myself.
Only certain kinds of conditions, such as a rolling window over a timestamp field, seem amenable to efficient incremental updates. What am I missing?
> There are a few startups these days peddling a newfangled technology called “incremental view maintenance” or “differential dataflow”
I think they should be a little more explicit about the differences, though, because it can be very misleading for those who aren't aware of the distinction.
This technique is very well supported in the big commercial engines. In MSSQL's Indexed View case, the views are synchronously updated when the underlying tables are modified. This has implications at insert/update/delete time, so if you are going to be doing a lot of these you might want to do it on a read replica to avoid impact to production writes.
https://learn.microsoft.com/en-us/sql/relational-databases/v...
https://learn.microsoft.com/en-us/sql/t-sql/statements/creat...
Oracle's implementation scales horizontally and can do incremental view maintenance using commit logs:
https://oracle-base.com/articles/misc/materialized-views
But it may not even be necessary because Oracle also supports query caching both server and client side. By default you have to opt-in on the query level:
SELECT /*+ RESULT_CACHE */ .... FROM .....
but if you do, queries might not even hit the database at all; the results are cached by the drivers. The main limitation of that feature is that it's meant for read-mostly tables, and cache keys are invalidated on any write to the table. So for aggregating counts in a large multi-tenant table the hit rate would be low, and a materialized view is more appropriate.

It's a bit unclear why the author of the article dismisses triggers as not "in vogue", though. They're supported by every DB and will work fine for this use case. You do have to learn how to use these features, but it can save a lot of work, especially compared to introducing non-consistent data sources like a Redis cache. Consistency is so valuable.
Disclosure: work for Oracle, opinions are my own (albeit in this case, presumably highly aligned).
Bro that's your entire job description. What is left?
With PostgreSQL the materialized view won't be automatically updated though, you need to do `REFRESH MATERIALIZED VIEW` manually.
3rd party solutions like https://readyset.io/ and https://materialize.com/ exist specifically because databases don’t actually have what we all want materialized views to be.
But I don't think there is anything similar in the GCP transactional db options like Spanner or CloudSQL.
I'm currently parking PostgREST behind Fastly (varnish) for pretty much the same benefits plus edge CDNs for my read APIs. I really just use materialized views for report generation now.
It's important to understand how your implementation works before committing to it.
Now that would be awesome!
EDIT: come to think of it, it would require going through a CDC stream and figuring out whether any of the affected tables are a dependency of a given materialized view. Maybe with some AST parsing as well to handle tenants/partitions. Sounds like it can work?
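A very naive sketch of that idea: pull table names out of each view definition with a regex and intersect them with the tables a CDC batch touched. All names here are made up, and a real implementation would need a proper SQL parser (CTEs, subqueries, and quoted identifiers all defeat this regex).

```python
import re

def referenced_tables(view_sql: str) -> set[str]:
    """Naively extract table names following FROM/JOIN keywords."""
    return {m.lower() for m in re.findall(
        r"\b(?:from|join)\s+([a-zA-Z_][a-zA-Z0-9_.]*)", view_sql, re.IGNORECASE)}

def views_to_refresh(cdc_event_tables: set[str], view_defs: dict[str, str]) -> set[str]:
    """Given tables touched by a CDC batch, return the dependent materialized views."""
    return {name for name, sql in view_defs.items()
            if referenced_tables(sql) & cdc_event_tables}

# Hypothetical view definitions for illustration:
views = {
    "project_task_counts":
        "SELECT project_id, count(*) FROM tasks GROUP BY project_id",
    "user_stats":
        "SELECT user_id, count(*) FROM logins JOIN users ON logins.user_id = users.id GROUP BY user_id",
}
print(views_to_refresh({"tasks"}, views))  # {'project_task_counts'}
```

From there you'd queue a `REFRESH` for just the affected views instead of refreshing everything on a timer.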
At least on SQLAnywhere it seems to be implemented using triggers, ie not unlike what one would do if rolling your own.
https://oracle-base.com/articles/23/materialized-view-concur...
From the article: "Concurrent refreshes will only happen if there are no overlaps in the materialized view rows updated by each session."
I've found them VERY useful for a narrow range of use cases, but I probably don't realize what I'm missing.
'There are a few startups these days peddling a newfangled technology called “incremental view maintenance” or “differential dataflow”. '
Incremental view maintenance can change recomputation cost of view updates from O(N) to O(1). DBSP is based on z-sets, a generalization of relational algebra. The paper won best paper at SIGMOD. There is a startup, Feldera, commercializing it.
Dismissing this as "newfangled" is just ignorance.
Also I love their website design.
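For a simple aggregate like COUNT, the O(N)-to-O(1) idea can be illustrated with a toy z-set-style view: updates arrive as (key, weight) deltas instead of re-scanning the base table. This is only a sketch of the concept, not Feldera's or DBSP's actual API.

```python
from collections import defaultdict

class CountView:
    """Toy incremental COUNT(*) GROUP BY view in the z-set style:
    updates arrive as (key, weight) deltas, where weight +1 is an
    insert and -1 is a delete. Each delta is O(1) to apply, versus
    O(N) to recount the base table from scratch."""

    def __init__(self):
        self.counts = defaultdict(int)

    def apply_delta(self, key, weight: int):
        self.counts[key] += weight
        if self.counts[key] == 0:
            del self.counts[key]  # z-sets drop zero-weight rows

view = CountView()
for delta in [("proj_a", +1), ("proj_a", +1), ("proj_b", +1), ("proj_a", -1)]:
    view.apply_delta(*delta)
print(dict(view.counts))  # {'proj_a': 1, 'proj_b': 1}
```

As the reply below notes, queries with joins don't get true O(1) updates, but the per-delta cost still beats recomputation.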
It is misleading to say that IVM reduces the cost of view updates from O(n) to O(1). While that is not technically incorrect, for any nontrivial query (e.g. anything with an index join) the cost of a view update will be smaller than that of the original query, but not constant time.
Also, the tone of “newfangled” was not dismissive in the context of an article praising IVM. At worst, it was sarcastic; I interpreted it as teasing.
I do research related to IVM / DBSP.
There is constantly so much new stuff in software, you have to be a bit willfully ignorant of some things some of the time.
>There are a few startups these days peddling a newfangled technology called “incremental view maintenance” or “differential dataflow”. Basically the way it works is you just say “hey, I’d like to keep track of how many tasks each project has” by writing any SQL query you want:
>[query]
>The “magic” is actually really cool. Basically the SQL query is analyzed to produce a DAG of the data flow with different nodes for filters, groups, joins, etc, and then each node “knows” how to map any change in its input to the appropriate change in the output.
So the person who calls the technology "magic" (with sparkle emoji around it!) and "really cool", who spent all this time writing an article about this particular annoying problem and all the other bad solutions to it, and who then mentions these startups at the end... is being denigrating and dismissive?
Which one of these companies do you work for that causes you to feel so hurt? Because the author didn't mention them by name?
IMO a slept-on database feature is table partitioning to improve query performance. If you have a frequently-used filter field that you can partition on (e.g. a creation timestamp), then you can radically improve query performance on large databases by having the DB full-scan only the relevant partitions. The database itself manages where records are placed, so there is no additional operational complexity beyond the initial setup. I've only used this with PostgreSQL, but I assume that other databases have similar partitioning mechanisms.
> And then by magic the results of this query will just always exist and be up-to-date. You can just query it and it’s instant;
This is not true of a materialized view, which needs to be refreshed after changes. PostgreSQL does not support incremental refresh out of the box, so you must refresh the entire view. A standard view would be "magically" up to date, as it is going to defer execution to runtime.
In fact, the query would return the result straight from counting the project_id index entries, never even needing to scan the table itself (as the author acknowledges).
If you're really dealing with many, many thousands of tasks per project, then materialized views are going to be just as slow to refresh as the underlying query is to run. They're not "magic". The standard performant solution would be to keep a num_tasks field that is always incremented or decremented in the same transaction as inserting or deleting a task row. That will actually be lightning fast.
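A minimal sketch of that counter approach, using SQLite for illustration (table and column names are made up): the count update rides in the same transaction as the task insert, so it can never drift.

```python
import sqlite3

# Denormalized counter: num_tasks lives on the project row and is
# updated in the same transaction as the task insert.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE projects (id INTEGER PRIMARY KEY, num_tasks INTEGER NOT NULL DEFAULT 0);
    CREATE TABLE tasks (id INTEGER PRIMARY KEY, project_id INTEGER REFERENCES projects(id));
    INSERT INTO projects (id) VALUES (1);
""")

def add_task(project_id: int):
    with conn:  # both statements commit (or roll back) together
        conn.execute("INSERT INTO tasks (project_id) VALUES (?)", (project_id,))
        conn.execute("UPDATE projects SET num_tasks = num_tasks + 1 WHERE id = ?",
                     (project_id,))

add_task(1)
add_task(1)
# Reading the count is now a single-row lookup, not an aggregate scan.
print(conn.execute("SELECT num_tasks FROM projects WHERE id = 1").fetchone()[0])  # 2
```

The cost is a little write contention on the project row, which is the tradeoff the sibling comments discuss.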
Materialized views aren't even supported in many common relational databases. They're a very particular solution that has very particular tradeoffs. Unfortunately, this article doesn't go into the tradeoffs at all, and picks a bad example where they're not even an obviously good solution in the first place.
Of course, this isn't really relevant until there is a very large number of rows to count for a given query, much larger than what is likely for "tasks in a project". I've run into this only with queries that end up counting on the order of 10^7 to 10^9 rows, i.e. more like OLAP workloads.
A better, canonical use case for matviews is subdividing the projects table into project-tasks-reporting by month by customer by market vertical, or what have you.
All said, still not a massive needle-mover
So I just pre-emptively break "normal form" and maintain counts in the database right from the start. I either update the count with multiple statements in the same transaction every time, or with triggers. In SQLite I prefer triggers, mostly because I know the operation and edge cases of SQLite's trigger implementation better than I know other DBs'.
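The trigger variant of that, sketched with SQLite via Python (schema names are illustrative): the database keeps the count in sync, so application code only ever inserts or deletes tasks.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE projects (id INTEGER PRIMARY KEY, num_tasks INTEGER NOT NULL DEFAULT 0);
    CREATE TABLE tasks (id INTEGER PRIMARY KEY, project_id INTEGER);
    INSERT INTO projects (id) VALUES (1);

    -- Triggers maintain the counter on every insert/delete.
    CREATE TRIGGER tasks_ins AFTER INSERT ON tasks BEGIN
        UPDATE projects SET num_tasks = num_tasks + 1 WHERE id = NEW.project_id;
    END;
    CREATE TRIGGER tasks_del AFTER DELETE ON tasks BEGIN
        UPDATE projects SET num_tasks = num_tasks - 1 WHERE id = OLD.project_id;
    END;
""")

conn.execute("INSERT INTO tasks (project_id) VALUES (1)")
conn.execute("INSERT INTO tasks (project_id) VALUES (1)")
conn.execute("DELETE FROM tasks WHERE id = 1")
print(conn.execute("SELECT num_tasks FROM projects WHERE id = 1").fetchone()[0])  # 1
```

One SQLite-specific edge case worth knowing: an UPDATE of project_id won't fire these two triggers, so you'd also want an AFTER UPDATE OF project_id trigger if tasks can move between projects.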
Caching can just be invalidation-based in this case.
Most of the times the solution is not pre-aggregation but proper indexes on foreign key columns (missing fk indexes is one of the most common mistakes in RDB design).
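The effect of a missing FK index is easy to see with SQLite's EXPLAIN QUERY PLAN (exact output text varies by SQLite version; table names are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tasks (id INTEGER PRIMARY KEY, project_id INTEGER)")

def plan(sql: str) -> str:
    # EXPLAIN QUERY PLAN rows carry the human-readable detail in column 3.
    return " ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

query = "SELECT count(*) FROM tasks WHERE project_id = 42"
print(plan(query))  # e.g. "SCAN tasks" -- a full table scan

conn.execute("CREATE INDEX tasks_project_id ON tasks (project_id)")
print(plan(query))  # e.g. "SEARCH tasks USING COVERING INDEX tasks_project_id ..."
```

The same check (`EXPLAIN` / `EXPLAIN ANALYZE`) applies in Postgres and friends; filtering or joining on an unindexed FK column forces a sequential scan.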
> increase contention and harm scalability
Contention, concurrent connections, high throughput, and the associated race conditions are absolute table stakes for RDBMSs. They just won't be taken seriously if they can't handle thousands of concurrent updates. So IMHO, for 90% of projects this just won't be an issue.
> PostgreSQL you increase bloat as every update creates a new row version
This is true, but the complexity has to live somewhere. There will be as many rows added to a materialized view as there will be in the original table.
> Most of the times the solution is not pre-aggregation
This is wrong. Caching = pre-aggregation, and it is almost always either the solution or part of the solution to scalability, latency, etc. Don't compute what you can retrieve.
[All just my understanding]
Isn't there tech to address that, like Go's "singleflight"?
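For reference, the core of singleflight fits in a few lines. This is a toy Python sketch of the idea behind Go's golang.org/x/sync/singleflight, not its real API: concurrent callers asking for the same key share one in-flight computation instead of each running it (error propagation omitted for brevity).

```python
import threading, time

class SingleFlight:
    """Toy singleflight: dedupes *concurrent* calls per key. Not a cache --
    a call arriving after the flight finishes starts a new one."""

    def __init__(self):
        self._lock = threading.Lock()
        self._inflight = {}

    def do(self, key, fn):
        with self._lock:
            event = self._inflight.get(key)
            leader = event is None
            if leader:
                event = threading.Event()
                self._inflight[key] = event
        if leader:
            try:
                event.result = fn()  # followers read this after event.set()
            finally:
                with self._lock:
                    del self._inflight[key]
                event.set()
        else:
            event.wait()
        return event.result

sf = SingleFlight()
calls = []

def expensive():
    calls.append(1)   # count how many times we actually compute
    time.sleep(0.25)  # long enough that the 8 callers overlap
    return "report-data"

results = []
threads = [threading.Thread(target=lambda: results.append(sf.do("report", expensive)))
           for _ in range(8)]
for t in threads: t.start()
for t in threads: t.join()
print(len(calls), results[0])  # typically "1 report-data": one computation, shared
```

This softens cache stampedes, but as the reply below says, it still assumes you know when to invalidate.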
Of course, that assumes you somehow have an event when interesting things change, and that you've gone through the effort of figuring out how you want to record (or update the record) for later.
https://zero.rocicorp.dev/docs/introduction
IVM is what allows the core feature of our DX: that users can just do a fairly large query with sorts and joins, like:
zero.query.posts
  .where('authorID', 42)
  .orderBy('created', 'desc')
  .limit(1000)
  .related('comments', c => c.orderBy('created', 'desc').limit(10))

... and we sync updates to this query incrementally to the client.

TanStack DB also uses IVM, but only client-side currently.
If you are interested in such things, you can take a poke around the source here: https://github.com/rocicorp/mono/tree/main/packages/zql. Or come find us in Discord: https://discord.rocicorp.dev/
I'm not affiliated with any of these names, I'm just really interested in IVM:
Materialize, Readyset, Feldera, RisingWave
> Uh oh, someone is tapping you on the shoulder and saying this is too slow because it has to do a complete index scan of the tasks for the project, every time you load the page
Just ask them if that's actually the bottleneck and go for a walk outside, before sweating over anything else discussed in this post.