I don't understand these "DB in browser" products.
If the data "belongs" to the server, why not send the query to the server and run it there?
If the data "belongs" on the client, why have it in database form, particularly a "data-lake" structured db, at all?
A lot of the benefit of such databases is their ability to optimise queries to improve performance in contexts where the data can't fit in memory (and possibly not even on a single disk/machine), along with additional durability and atomicity guarantees. If the data is small enough that sending it to a client is reasonable, then it's small enough to fit in memory, which means it'll be fast to query no matter how you go about it.
The page says one advantage is "Ad-hoc queries on data lakes", but isn't that possible with the most basic form that simply sends a query to the database?
What am I failing to understand about this category of products?
The essence of the approach is that a large majority of FE apps have constructed quite complex caching layers to improve performance over "querying backend data"; take a look at things like Next.js or React Query, which, as the post above mentions, are essentially rebuilding databases. So instead this approach just moves the db to the browser, along with a powerful syncing layer.
I think it's an approach that deserves more attention, especially to improve DX where we end up writing a whole lot of database-related logic on clients. Might as well just use a database on the client as well.
To the other good points, I'd add that when people start making queries that aren't O(1) or O(n), modern computers and networks can conveniently move around enough data that it becomes practical for a remote client to use its rather substantial power to answer questions over a gigabyte or two of data itself. And if enough people are doing this at once, the combined client power can easily overpower any reasonable amount of cloud compute you might be willing to deploy for the problem as they run non-trivial queries. I may be happy for dozens of clients to download some data and chew through some O(n log n) with mixed O(n^2) components on their own, whereas I would not care to provision enough cloud compute to handle it all.
I think people forget that as cheap as cloud compute is, client compute is even cheaper. Generally it's already paid for. (Marginal electricity costs are lost in the noise of everything else you're supporting a client user with.) And while the years of doubling every 1.5 years may be gone, clients do continue to speed up, and they are especially speeding up in ways that databases can take advantage of (more cores, more RAM, more CPU cache). Moving compute to clients can be a valuable thing in some circumstances on its own.
Yes, but as the numbers advance from "the remote desktop has like 128KB" to "the remote desktop can chew through gigabytes without much stress", the delta between O(n) and O(n^2) opens up a lot more. It is perhaps a bit counterintuitive, but as systems grow in capability, that delta grows.
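To put rough numbers on that (my own back-of-envelope, not from the comment above): at n = 1,000 rows, n^2 is only about 100x more work than n log n; at n = 100 million rows it is roughly 4 million times more. So the cost of funnelling everyone's O(n^2)-ish questions through one aggregation point blows up far faster than the cost of letting each client chew on its own copy.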
> If the data "belongs" to the server, why not send the query to the server and run it there?
We could, but if the data size is not that huge, sending it once to the client and then letting the client perform the queries can be desirable. The tool works without internet access, the latency is much better, all results are coherent, etc.
> If the data "belongs" on the client, why have it in database form, particularly a "data-lake" structured db, at all?
Just because it fits in memory doesn't mean that the shape of the data does not matter. A data structure optimised for whatever query/analysis you want to perform has a significant impact on how fast and efficiently you can perform those operations
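A toy sketch of why shape matters even for in-memory data (my own illustration, not from the product being discussed): scanning one contiguous column beats chasing a million object pointers, and a columnar layout is exactly what an analytic engine hands you for free.

    // Same data twice: an array of row objects vs. a columnar typed array.
    type Row = { id: number; price: number };
    const rows: Row[] = Array.from({ length: 1_000_000 }, (_, i) => ({ id: i, price: i % 100 }));
    const prices = Float64Array.from(rows, (r) => r.price); // columnar copy of one field

    // Row-oriented scan: one pointer dereference per row, poor cache locality.
    let rowSum = 0;
    for (const r of rows) rowSum += r.price;

    // Column-oriented scan: one contiguous buffer, the access pattern analytic engines optimise for.
    let colSum = 0;
    for (let i = 0; i < prices.length; i++) colSum += prices[i];

    console.log(rowSum === colSum); // same answer, very different memory behaviour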
The objections above seem to rest on a few implicit rules:
- database engines should always run on a server, never on the user’s own machine.
- local data is never larger than the database engine, and it’s better in all cases to move all of the local data to the server where the database engine runs.
- local data can always be moved to another machine, regardless of sensitivity.
But are these always true, all the time? It seems to me that there are many use cases (e.g. training local AI models) where these rules should not be so absolute.
There is an increasing subset of people (think those who used to work in MS Access, Excel power users) who learned SQL in a business IT course or on the job. Their data isn't at a scale that demands a DB server, but they do want to do analyses that use window functions and other things SQL makes more natural than Excel vtables or functions. They may have to give reports to an equally technical boss who would like to play with the report and explore assumptions using SQL.
The data size is typically not large; SQL and integrating with cloud-based spreadsheets is the selling point.
Once you accept the idea that you're going to run SQL in the browser, the forces are going to inevitably pull you toward having an entire DB engine in the browser anyhow. You're going to need the entire query parser, enough of the IO code to be able to read the DB (which is a lot of it), you're going to want the optimizer, you're going to need RPC to send the SQL and read the results... yes, there is certainly some stuff you can trim, but it's not like it's a horrible idea.
So even if someone, say, started a startup with the idea "I'm going to present an MVP of an in-browser DB engine people can use", in 5-10 years it'd be a nearly fully-fledged DB anyhow.
Moreover, DuckDB isn't exactly a “full-scale” DB in the sense that someone would think about Postgres/MySQL. It is an embedded database more like sqlite that happens to be good for analytical workloads. It doesn't have to worry about things like HA, user auth, network wire formats, etc. that a usual “full-scale” DB would.
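For anyone who hasn't tried it: embedding DuckDB in a page is not much heavier than pulling in any other dependency. A minimal sketch, loosely following the duckdb-wasm docs at the time of writing (treat the exact API as illustrative, it may have drifted):

    import * as duckdb from '@duckdb/duckdb-wasm';

    // Pick a prebuilt WASM bundle from jsDelivr and run the engine in a web worker.
    const bundle = await duckdb.selectBundle(duckdb.getJsDelivrBundles());
    const workerUrl = URL.createObjectURL(
      new Blob([`importScripts("${bundle.mainWorker!}");`], { type: 'text/javascript' })
    );
    const db = new duckdb.AsyncDuckDB(new duckdb.ConsoleLogger(), new Worker(workerUrl));
    await db.instantiate(bundle.mainModule, bundle.pthreadWorker);
    URL.revokeObjectURL(workerUrl);

    // Query it like any embedded DB; results come back as Arrow tables.
    const conn = await db.connect();
    const result = await conn.query(`SELECT 42 AS answer`);
    console.log(result.toArray());
    await conn.close();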
There are also some data lakes (though I have not really worked with any) that are homegrown and consist of smallish parquet files in a blob store; they might have several databases looking at them at any given time, and an embedded browser DB provides a frontend that is easier to stand up and auth against in some ways.
There may not be a reporting DB available to send queries to, in the traditional sense.
I built a web application that became a progressive web app because it needed to work offline: it was used by pilots during their work, and at high altitude an internet connection is not guaranteed. We cached a bit in localStorage, and I'd loop making HEAD requests (there was no "am I online" API available...); once we knew you were online, we would push the locally cached changes up to the DB. In the meantime we still displayed the current data as you'd changed it. It worked out nicely.
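A stripped-down version of that pattern, with hypothetical endpoint names (this is my sketch, not the actual app):

    // Queue writes locally, then flush them once a cheap HEAD probe succeeds.
    const QUEUE_KEY = 'pendingChanges';

    function queueChange(change: object) {
      const pending: object[] = JSON.parse(localStorage.getItem(QUEUE_KEY) ?? '[]');
      pending.push(change);
      localStorage.setItem(QUEUE_KEY, JSON.stringify(pending));
    }

    async function isOnline(): Promise<boolean> {
      try {
        // Probe a tiny endpoint; navigator.onLine alone is not reliable.
        const res = await fetch('/api/ping', { method: 'HEAD', cache: 'no-store' });
        return res.ok;
      } catch {
        return false;
      }
    }

    async function flushQueue() {
      if (!(await isOnline())) return;
      const pending: object[] = JSON.parse(localStorage.getItem(QUEUE_KEY) ?? '[]');
      for (const change of pending) {
        await fetch('/api/changes', {
          method: 'POST',
          headers: { 'Content-Type': 'application/json' },
          body: JSON.stringify(change),
        });
      }
      localStorage.setItem(QUEUE_KEY, '[]');
    }

    // Retry every 30 seconds; the UI keeps showing the locally changed data in the meantime.
    setInterval(flushQueue, 30_000);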
There's a lot of static data that never changes, and if you can request it once and cache it somewhere, it becomes really useful: fewer requests hitting the server. Heck, Tumblr in the 2010s would display a lot of blog data on the dashboard, such as follower count, like count, etc., and they had outages all the time. I started to see fewer and fewer outages when they stopped showing all that blog metadata on the dashboard, because you no longer had hundreds of thousands (millions?) of users refreshing to see new content and querying across dozens of databases for likes, follows, etc. It was very common to frequently refresh Tumblr to see the latest content.
Imagine if they had cached this data instead and only updated it when a lastUpdated field's value changed.
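Something as simple as this would have covered it (hypothetical endpoints, my own sketch):

    // Re-fetch blog metadata only when the server's lastUpdated value has changed.
    async function getBlogMeta(blogId: string) {
      const cached = JSON.parse(localStorage.getItem(`meta:${blogId}`) ?? 'null');

      // Cheap freshness check against an endpoint returning { lastUpdated }.
      const head = await fetch(`/api/blogs/${blogId}/last-updated`).then((r) => r.json());
      if (cached && cached.lastUpdated === head.lastUpdated) return cached;

      // Stale or missing: fetch the full metadata and cache it along with its timestamp.
      const fresh = await fetch(`/api/blogs/${blogId}/meta`).then((r) => r.json());
      localStorage.setItem(`meta:${blogId}`, JSON.stringify(fresh));
      return fresh;
    }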
I think the "db in browser" idea is interesting because it makes the user pay for the compute for many more of the typical DB tasks.
Any time you're operating an aggregation point you need to be wary of Moore's law; this applies to load balancers, database systems, log aggregation, etc.
If your computers are getting N cheaper / faster per year, but you're getting M new clients and all your clients are also getting N cheaper / faster per year, you're going to be getting more traffic faster than you can scale, unless you figure out how to cheat. Cheating may be sharding your internal workloads, but it may also be "make someone else do the work."
Right - so not a database, but a columnar analytics compute engine. Half of why we wrote arrow js was for pumping server arrow data to webgl, and the other half for powering arrow columnar compute libraries & packages like this.
For smooth sub-100ms interactivity, for data in a certain size-range sweet spot, it can be very nice!
The "delete" button does not work. The "home" button inserts a whitespace. Pasting with "Ctrl+v" also does not work. Every keypress results in blinking, and there is a notable input lag.
When I tried a query
duckdb> SELECT * FROM 'https://clickhouse-public-datasets.s3.amazonaws.com/github_events/partitioned_json/*.gz'
...> ;
Catalog Error: Table with name https://clickhouse-public-datasets.s3.amazonaws.com/github_events/partitioned_json/*.gz does not exist!
Did you mean "sqlite_master"?
LINE 1: SELECT * FROM 'https://clickhouse-public-datasets.s3....
Suggesting the "sqlite_master" table is also misleading.
IndexedDB is a key/value store; DuckDB is a relational analytic database. You would use IndexedDB for something like application data, and DuckDB for (e.g.) a data-heavy dashboard or ad-hoc query tool.
So what? We should be structuring a lot of databases as layers on top of transactional key value stores. We should be able to swap out the key value store without changing the higher level query engine.
We maybe[1] should, but we are not. AFAIK there's no relational query engine for IndexedDB.
Also, there are a lot more people who can get things done with SQL than using indexed KV-stores. KV stores tending to have horrible APIs doesn't help the situation.
[1] E.g. SQLite started with a gdbm backend, but later rolled their own because of limitations caused by it.
If you use https://dexie.org/ on top of IndexedDB, you can turn it into a fairly useful document store with secondary indices, an improved query-building interface, and reactive queries. It's doing exactly what you would expect on top of IndexedDB if you did it yourself, but saves a ton of boilerplate. They have a sync service to go with it, but in my usage I managed that myself.
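For a flavour of what that looks like (a minimal sketch; the table and field names are made up, check the Dexie docs for exact signatures):

    import Dexie, { Table } from 'dexie';

    interface Product { id?: number; name: string; price: number; }

    class ShopDB extends Dexie {
      products!: Table<Product, number>;
      constructor() {
        super('shop');
        // '++id' is an auto-incremented primary key; name and price get secondary indices.
        this.version(1).stores({ products: '++id, name, price' });
      }
    }

    const db = new ShopDB();
    await db.products.bulkPut([
      { name: 'coffee', price: 9 },
      { name: 'grinder', price: 45 },
    ]);

    // An indexed query instead of hand-rolled IndexedDB cursor boilerplate.
    const cheap = await db.products.where('price').below(10).toArray();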
It feels very impractical indeed. There's also the size of the binary needed to load the compiled wasm. All this would be much better done on the server, and if really needed, users could be given a way to download results (ideally with their own preferred tool for fetching files).
Well the data has to come from somewhere right?
If the goal is to facilitate client-side (bring your own data) scenarios, I'd make a proper native desktop app and take full advantage of the system. A hybrid something running in the browser feels like a compromise between both solutions.
Why would you make a "proper" native desktop app (or, more specifically, apps for every platform you want to support) if you can do it with a PWA (which you can for the vast majority of apps)?
Because PWAs are limited in more ways than it's practical to list here. Just to name a few: they're confined to the browser they run in; they're restricted by the availability and quality of the internet connection; they're not sustainable, as build tools don't even bother with backwards compatibility given the pace of evolution; and there's practically no user control over a PWA "app" running in the browser.
Remember, PWAs exist as a work-around for gatekeepy OS vendors making it hard to create cross-platform apps. PWAs don't resolve anything - they move the problem to the browser space, where (today at least) we only have closed, proprietary, very much revenue-driven browser implementations. The related web standards have also largely been influenced by FAANGs, with the likes of Google wanting to turn the web into their webstore.
Please do some listing. Most of the time, people worrying about PWA limitations are a decade or so out of date on what contemporary web APIs are capable of.
For example, PWAs are trivial to cache with service workers (a lot easier than app install for both the developer and the user), and after the first load they work totally offline.
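To make that concrete, the offline-after-first-load part is roughly this much service worker (a minimal sketch; the asset list is a placeholder):

    // sw.ts: cache the app shell on install, serve it cache-first afterwards.
    const CACHE = 'app-shell-v1';
    const ASSETS = ['/', '/index.html', '/app.js', '/app.css']; // placeholder asset list

    self.addEventListener('install', (event: any) => {
      event.waitUntil(caches.open(CACHE).then((c) => c.addAll(ASSETS)));
    });

    self.addEventListener('fetch', (event: any) => {
      event.respondWith(
        caches.match(event.request).then((hit) => hit ?? fetch(event.request))
      );
    });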
You have almost total control of the PWA. Just launch the browser DevTools and you can even edit the code on the fly. With e.g. iOS apps you have no control over, or even knowledge of, what the apps are doing, unless you manage to decrypt them, decompile or inspect the binary, and/or root your own device through exploits.
All major browser engines are open source and the APIs are open standards (even Google deprecated their proprietary APIs). PWAs are way less tied to the likes of Google than native apps.
I think you should update your information about browser APIs. You'll be pleasantly surprised.
Another interesting option is PouchDB[0], which is a Javascript implementation of the CouchDB[1] synchronization API. It allows you to achieve eventual consistency between a client with intermittent connectivity and a backend database.
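A rough sketch of how that looks in practice (the remote URL is a placeholder; see the PouchDB docs for the full API):

    import PouchDB from 'pouchdb';

    // The local database lives in the browser (IndexedDB under the hood).
    const local = new PouchDB('notes');

    // Writes land locally first, so the app keeps working offline.
    await local.put({ _id: 'note:1', text: 'works offline', updatedAt: Date.now() });

    // Continuous, resumable two-way replication with a CouchDB-compatible server.
    local.sync('https://couch.example.com/notes', { live: true, retry: true })
      .on('change', (info) => console.log('replicated', info))
      .on('error', (err) => console.error('sync error', err));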
If you query a Parquet file from your lake via DuckDB-in-browser, does DuckDB run in WASM on the web client and pull the compressed parquet to your browser where it is decompressed?
Or are you connecting some DuckDB on the web client to some DuckDB component on a server somewhere?
I presume yes to the first and no to the second but just checking I have my mental model correct.
DuckDB has certain optimizations which allow it to read only parts of a Parquet file. It can also read remote files in a streaming fashion, so it does not have to wait for the entire file to be downloaded or store a large amount of data in memory.
Relevant documentation: https://duckdb.org/2021/06/25/querying-parquet.html
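Concretely, with a duckdb-wasm connection like the one sketched further up the thread, a remote Parquet query is just SQL; the URL and column names here are placeholders:

    // Only the Parquet footer plus the row groups / column chunks the query touches
    // are fetched over HTTP, not the whole file.
    const result = await conn.query(`
      SELECT repo, count(*) AS events
      FROM read_parquet('https://example.com/datasets/github_events.parquet')
      WHERE created_at >= DATE '2024-01-01'
      GROUP BY repo
      ORDER BY events DESC
      LIMIT 20
    `);
    console.log(result.toArray());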
DuckDB can use the HTTP range header trick against Parquet, which means a lot of analytical questions against multi-GB files can be answered by fetching only small portions of the overall data.
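The underlying trick is plain HTTP: ask for byte ranges instead of the whole object. A hand-rolled sketch of the idea (assumes the server supports Range requests and exposes Content-Length; the URL is a placeholder):

    const url = 'https://example.com/datasets/github_events.parquet'; // placeholder

    // Find the file size, then fetch only the tail, where a Parquet file keeps its
    // footer/metadata (schema, row group offsets, column statistics).
    const head = await fetch(url, { method: 'HEAD' });
    const size = Number(head.headers.get('content-length'));
    const tail = await fetch(url, {
      headers: { Range: `bytes=${size - 64 * 1024}-${size - 1}` },
    });
    const footer = new Uint8Array(await tail.arrayBuffer());

    // A real reader parses the footer here, then issues more Range requests for
    // just the row groups / column chunks a given query actually needs.
    console.log('fetched', footer.length, 'of', size, 'bytes');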
Howdy! I work at MotherDuck and DuckDB Labs (part time as a blogger). At MotherDuck, we have both client side and server side compute! So the initial reduction from PB/TB to GB/MB can happen server side, and the results can be sliced and diced at top speed in your browser!
Please spend a sentence or two explaining the server side filtering mechanism and linking to documentation! I would like to know the conditions required for streaming queries! From the sibling comment and a search of the docs it seems like this is a Parquet only feature, which seems pretty important to note!
Parquet is designed with predicate push-down in mind. Partitions are laid out on disk, and then blocks within files are laid out so that consumers can very, very easily narrow in on which files they need to read, before doing any more IO than a list, or a small metadata read.
Once you know what you are reading, many parquet/arrow libraries will support streaming reads/aggregations, so the client doesn’t need to load the whole working set in memory.
As far as I can tell they're not storing anything locally, they're pulling the data from Google Cloud Storage as you access it.
That said, a Wasm module doesn't have access to any storage facilities that the browser doesn't already expose (IndexedDB, OPFS, cookies, etc.) so even if it could be used by marketers, they would gain nothing by using DuckDB for that over just using the same underlying browser storage API.
So when I want to browse your website on my phone with a limited data plan, I have to download an entire database client and database, as well as any of your other huge JS libraries?
I'd prefer that to page reloads, lost focus, lost scroll position, and losing any input between clicking a checkbox and receiving a response. Bonus points if I only have to download a db-diff next time, compared to downloading 5x the db's worth through POST /get-products?<filters>&page=<n> in every e-shopping session. Rarely does a client-facing business database exceed your average YouTube cat video in size (assuming pics are stored as URLs).
Ten years ago your concern might have been valid, but in 2024, given that most people in developed economies have ample 4G/5G data plans and generous RAM/storage on their mobile phones, it makes more sense to just send them the whole DB once.
This is the same spirit as Electron vs native. Of course native uses less RAM and is more efficient, but it's not noticeable by the end user (because they don't care how it's built). So we let developers write desktop apps in HTML/JS, and end users usually don't notice until they open up the task manager.
We have an abundance of data, memory, and storage that continually drops in cost. It makes a lot of sense to simply download the DB once and then access it locally, with optional syncing.
This is the future, because we can't count on mobile phones to continuously maintain connectivity, especially with limited data plans. Far better to simply allocate a percentage of your fixed data plan and decide which local-first apps to install on your phone.
I live in a 4g/5g country (with 5g disabled on my phone because it seems to perform worse than 4g), with a 2gb/month data plan. I've usually stayed inside my data budget, but I'm cutting it a little close this month and I've blown past it a couple of times when I've forgotten to switch back to wifi. Sure, I live in a country where I can buy infinite data, but I don't see the point when I rarely use more than 2gb, and a 2gb plan is cheaper
I think their tool is designed for working professionals on laptops - seems like analysts doing traditional business insights reporting. Perhaps there's a world where that target audience is using their phones, but I doubt it's a use case they're too concerned with.
I would be worried about this approach for something that doesn't need to expose a SQL interface though! It's not the right fit for most webapps.
If it's an app, you're not in the browser anymore anyway, so you don't have to worry about this browser wasm. You can store data locally anyway and don't have to pull it from the server.
If the data "belongs" to the server, why not send the query to the server and run it there?
If the data "belongs" on the client, why have it in database form, particularly a "data-lake" structured db, at all?
A lot of the benefits of such databases are their ability to optimise queries for improving performance in a context where the data can't fit in memory (and possibly not even on single disks/machines), as well as additional durability and atomicity improvements. If the data is small enough to be reasonable to send to a client, then it's small enough to fit in memory, which means it'll be fast to query no matter how you go about it.
The page says one advantage is "Ad-hoc queries on data lakes", but isn't that possible with the most basic form that simply sends a query to the database?
What am I failing to understand about this category of products?