I don't understand these "DB in browser" products.
If the data "belongs" to the server, why not send the query to the server and run it there?
If the data "belongs" on the client, why have it in database form, particularly a "data-lake" structured db, at all?
A lot of the benefit of such databases is their ability to optimise queries to improve performance in contexts where the data can't fit in memory (and possibly not even on a single disk/machine), along with additional durability and atomicity guarantees. If the data is small enough that sending it to a client is reasonable, then it's small enough to fit in memory, which means it'll be fast to query no matter how you go about it.
The page says one advantage is "Ad-hoc queries on data lakes", but isn't that possible with the most basic form that simply sends a query to the database?
What am I failing to understand about this category of products?
The essence of the approach is that a large majority of FE apps have constructed quite complex caching layers to improve performance over "querying backend data"; take a look at things like Next.js or React Query, which, as the post above mentions, are essentially rebuilding databases. So instead this approach just moves the db to the browser, along with a powerful syncing layer.
I think it's an approach that deserves more attention, especially to improve DX where we end up writing a whole lot of database-related logic on clients. Might as well just use a database on the client as well.
To the other good points, I'd add that when people start making queries that aren't O(1) or O(n), modern computers and networks can conveniently move around enough data that it becomes practical for a remote client to use its rather substantial power to answer questions over a gigabyte or two of data itself. And if enough people are doing this at once, the combined client power can easily overpower any reasonable amount of cloud compute you might be willing to deploy for the problem as they run non-trivial queries. I may be happy for dozens of clients to download some data and chew through some O(n log n) with mixed O(n^2) components on their own, whereas I would not care to provision enough cloud compute to handle it all.
I think people forget that as cheap as cloud compute is, client compute is even cheaper. Generally it's already paid for. (Marginal electricity costs are lost in the noise of everything else you're supporting a client user with.) And while the years of doubling every 1.5 years may be gone, clients do continue to speed up, and they are especially speeding up in ways that databases can take advantage of (more cores, more RAM, more CPU cache). Moving compute to clients can be a valuable thing in some circumstances on its own.
Yes, but as the numbers advance from "the remote desktop has like 128KB" to "the remote desktop can chew through gigabytes without much stress", the delta between O(n) and O(n^2) opens up a lot more. It is perhaps a bit counterintuitive, but as systems grow in capability, that delta grows.
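To put rough numbers on that (my own back-of-envelope, not from the comment above): at n = 1,000 rows, n^2 is only about 100x more work than n log n; at n = 100 million rows it is roughly 4 million times more. So the cost of funnelling everyone's O(n^2)-ish questions through one aggregation point blows up far faster than the cost of letting each client chew on its own copy.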
> If the data "belongs" to the server, why not send the query to the server and run it there?
We could, but if the data size is not that huge, sending it once to the client and then letting the client perform the queries can be desirable. The tool works without internet access, the latency is much better, all results are coherent, etc.
> If the data "belongs" on the client, why have it in database form, particularly a "data-lake" structured db, at all?
Just because it fits in memory doesn't mean that the shape of the data does not matter. A data structure optimised for whatever query/analysis you want to perform has a significant impact on how fast and efficiently you can perform those operations
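A toy sketch of why shape matters even for in-memory data (my own illustration, not from the product being discussed): scanning one contiguous column beats chasing a million object pointers, and a columnar layout is exactly what an analytic engine hands you for free.

    // Same data twice: an array of row objects vs. a columnar typed array.
    type Row = { id: number; price: number };
    const rows: Row[] = Array.from({ length: 1_000_000 }, (_, i) => ({ id: i, price: i % 100 }));
    const prices = Float64Array.from(rows, (r) => r.price); // columnar copy of one field

    // Row-oriented scan: one pointer dereference per row, poor cache locality.
    let rowSum = 0;
    for (const r of rows) rowSum += r.price;

    // Column-oriented scan: one contiguous buffer, the access pattern analytic engines optimise for.
    let colSum = 0;
    for (let i = 0; i < prices.length; i++) colSum += prices[i];

    console.log(rowSum === colSum); // same answer, very different memory behaviour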
The objections above seem to rest on a few implicit rules:
- database engines should always run on a server, never on the user’s own machine.
- local data is never larger than the database engine, and it’s better in all cases to move all of the local data to the server where the database engine runs.
- local data can always be moved to another machine, regardless of sensitivity.
But are these always true, all the time? It seems to me that there are many use cases (e.g. training local AI models) where these rules should not be so absolute.
There is an increasing subset of people (think those who used to work in MS Access, Excel power users) who learned SQL in a business IT course or on the job. Their data isn't at a scale that demands a DB server, but they do want to do analyses that use window functions and other things SQL makes more natural than Excel vtables or functions. They may have to give reports to an equally technical boss who would like to play with the report and explore assumptions using SQL.
The data size is typically not large; SQL and integrating with cloud-based spreadsheets is the selling point.
Once you accept the idea that you're going to run SQL in the browser, the forces are going to inevitably pull you toward having an entire DB engine in the browser anyhow. You're going to need the entire query parser, enough of the IO code to be able to read the DB (which is a lot of it), you're going to want the optimizer, you're going to need RPC to send the SQL and read the results... yes, there is certainly some stuff you can trim, but it's not like it's a horrible idea.
So even if someone, say, started a startup with the idea "I'm going to present an MVP of an in-browser DB engine people can use", in 5-10 years it'd be a nearly fully-fledged DB anyhow.
Moreover, DuckDB isn't exactly a “full-scale” DB in the sense that someone would think about Postgres/MySQL. It is an embedded database more like sqlite that happens to be good for analytical workloads. It doesn't have to worry about things like HA, user auth, network wire formats, etc. that a usual “full-scale” DB would.
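For anyone who hasn't tried it: embedding DuckDB in a page is not much heavier than pulling in any other dependency. A minimal sketch, loosely following the duckdb-wasm docs at the time of writing (treat the exact API as illustrative, it may have drifted):

    import * as duckdb from '@duckdb/duckdb-wasm';

    // Pick a prebuilt WASM bundle from jsDelivr and run the engine in a web worker.
    const bundle = await duckdb.selectBundle(duckdb.getJsDelivrBundles());
    const workerUrl = URL.createObjectURL(
      new Blob([`importScripts("${bundle.mainWorker!}");`], { type: 'text/javascript' })
    );
    const db = new duckdb.AsyncDuckDB(new duckdb.ConsoleLogger(), new Worker(workerUrl));
    await db.instantiate(bundle.mainModule, bundle.pthreadWorker);
    URL.revokeObjectURL(workerUrl);

    // Query it like any embedded DB; results come back as Arrow tables.
    const conn = await db.connect();
    const result = await conn.query(`SELECT 42 AS answer`);
    console.log(result.toArray());
    await conn.close();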
There are also some data lakes (though I have not really worked with any) that are homegrown and consist of smallish parquet files in a blob store; they might have several databases looking at them at any given time, and an embedded browser DB provides a frontend that is easier to stand up and auth against in some ways.
There may not be a reporting DB available to send queries to, in the traditional sense.
I built a web application that became a progressive web app because it needed to work offline: it was used by pilots during their work, and at high altitude an internet connection is not guaranteed. We cached a bit in localStorage, and I'd loop making HEAD requests (there was no "am I online" API available...); once we knew you were online, we would push the locally cached changes up to the DB. In the meantime we still displayed the current data as you'd changed it. It worked out nicely.
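A stripped-down version of that pattern, with hypothetical endpoint names (this is my sketch, not the actual app):

    // Queue writes locally, then flush them once a cheap HEAD probe succeeds.
    const QUEUE_KEY = 'pendingChanges';

    function queueChange(change: object) {
      const pending: object[] = JSON.parse(localStorage.getItem(QUEUE_KEY) ?? '[]');
      pending.push(change);
      localStorage.setItem(QUEUE_KEY, JSON.stringify(pending));
    }

    async function isOnline(): Promise<boolean> {
      try {
        // Probe a tiny endpoint; navigator.onLine alone is not reliable.
        const res = await fetch('/api/ping', { method: 'HEAD', cache: 'no-store' });
        return res.ok;
      } catch {
        return false;
      }
    }

    async function flushQueue() {
      if (!(await isOnline())) return;
      const pending: object[] = JSON.parse(localStorage.getItem(QUEUE_KEY) ?? '[]');
      for (const change of pending) {
        await fetch('/api/changes', {
          method: 'POST',
          headers: { 'Content-Type': 'application/json' },
          body: JSON.stringify(change),
        });
      }
      localStorage.setItem(QUEUE_KEY, '[]');
    }

    // Retry every 30 seconds; the UI keeps showing the locally changed data in the meantime.
    setInterval(flushQueue, 30_000);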
There's a lot of static data that never changes, and if you can request it once and cache it somewhere, it becomes really useful: fewer requests hitting the server. Heck, Tumblr in the 2010s would display a lot of blog data on the dashboard, such as follower count, like count, etc., and they had outages all the time. I started to see fewer and fewer outages when they stopped showing all that blog metadata on the dashboard, because you no longer had hundreds of thousands (millions?) of users refreshing to see new content and querying across dozens of databases for likes, follows, etc. It was very common to frequently refresh Tumblr to see the latest content.
Imagine if they had cached this data instead and only updated it when a lastUpdated field's value changed.
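Something as simple as this would have covered it (hypothetical endpoints, my own sketch):

    // Re-fetch blog metadata only when the server's lastUpdated value has changed.
    async function getBlogMeta(blogId: string) {
      const cached = JSON.parse(localStorage.getItem(`meta:${blogId}`) ?? 'null');

      // Cheap freshness check against an endpoint returning { lastUpdated }.
      const head = await fetch(`/api/blogs/${blogId}/last-updated`).then((r) => r.json());
      if (cached && cached.lastUpdated === head.lastUpdated) return cached;

      // Stale or missing: fetch the full metadata and cache it along with its timestamp.
      const fresh = await fetch(`/api/blogs/${blogId}/meta`).then((r) => r.json());
      localStorage.setItem(`meta:${blogId}`, JSON.stringify(fresh));
      return fresh;
    }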
I think the "db in browser" idea is interesting because it makes the user pay for the compute for many more of the typical DB tasks.
Any time you're operating an aggregation point you need to be wary of Moore's law; this applies to load balancers, database systems, log aggregation, etc.
If your computers are getting N cheaper / faster per year, but you're getting M new clients and all your clients are also getting N cheaper / faster per year, you're going to be getting more traffic faster than you can scale, unless you figure out how to cheat. Cheating may be sharding your internal workloads, but it may also be "make someone else do the work."
Right - so not a database, but a columnar analytics compute engine. Half of why we wrote arrow js was for pumping server arrow data to webgl, and the other half for powering arrow columnar compute libraries & packages like this.
For smooth sub-100ms interactivity, for data in a certain size-range sweet spot, it can be very nice!
The "delete" button does not work. The "home" button inserts a whitespace. Pasting with "Ctrl+v" also does not work. Every keypress results in blinking, and there is a notable input lag.
When I tried a query
duckdb> SELECT * FROM 'https://clickhouse-public-datasets.s3.amazonaws.com/github_events/partitioned_json/*.gz'
...> ;
Catalog Error: Table with name https://clickhouse-public-datasets.s3.amazonaws.com/github_events/partitioned_json/*.gz does not exist!
Did you mean "sqlite_master"?
LINE 1: SELECT * FROM 'https://clickhouse-public-datasets.s3....
Suggesting the "sqlite_master" table is also misleading.
IndexedDB is a key/value store; DuckDB is a relational analytic database. You would use IndexedDB for something like application data, and DuckDB for (e.g.) a data-heavy dashboard or ad-hoc query tool.
So what? We should be structuring a lot of databases as layers on top of transactional key value stores. We should be able to swap out the key value store without changing the higher level query engine.
We maybe[1] should, but we are not. AFAIK there's no relational query engine for IndexedDB.
Also, there are a lot more people who can get things done with SQL than using indexed KV-stores. KV stores tending to have horrible APIs doesn't help the situation.
[1] E.g. SQLite started with a gdbm backend, but later rolled their own because of limitations caused by it.
If you use https://dexie.org/ on top of IndexedDB, you can turn it into a fairly useful document store with secondary indices, an improved query-building interface, and reactive queries. It's doing exactly what you would expect on top of IndexedDB if you did it yourself, but saves a ton of boilerplate. They have a sync service to go with it, but in my usage I managed that myself.
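For a flavour of what that looks like (a minimal sketch; the table and field names are made up, check the Dexie docs for exact signatures):

    import Dexie, { Table } from 'dexie';

    interface Product { id?: number; name: string; price: number; }

    class ShopDB extends Dexie {
      products!: Table<Product, number>;
      constructor() {
        super('shop');
        // '++id' is an auto-incremented primary key; name and price get secondary indices.
        this.version(1).stores({ products: '++id, name, price' });
      }
    }

    const db = new ShopDB();
    await db.products.bulkPut([
      { name: 'coffee', price: 9 },
      { name: 'grinder', price: 45 },
    ]);

    // An indexed query instead of hand-rolled IndexedDB cursor boilerplate.
    const cheap = await db.products.where('price').below(10).toArray();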
It feels very impractical indeed. There's also the size of the binary needed to load the compiled wasm. All this would be much better done on the server, and if really needed, users could be given a way to download results (ideally with their own preferred tool for fetching files).
Well the data has to come from somewhere right?
If the goal is to facilitate client-side (bring your own data) scenarios, I'd make a proper native desktop app and take full advantage of the system. A hybrid something running in the browser feels like a compromise between both solutions.
Why would you make a "proper" native desktop app (or, more specifically, apps for every platform you want to support) if you can do it with a PWA (which you can for the vast majority of apps)?
Because PWAs are limited in more ways than it's practical to list here. Just to name a few: they're confined to the browser they run in; they're restricted by the availability and quality of the internet connection; they're not sustainable, as build tools don't even bother with backwards compatibility given the pace of evolution; and there's practically no user control over a PWA "app" running in the browser.
Remember, PWAs exist as a work-around for gatekeepy OS vendors making it hard to create cross-platform apps. PWAs don't resolve anything - they move the problem to the browser space, where (today at least) we only have closed, proprietary, very much revenue-driven browser implementations. The related web standards have also largely been influenced by FAANGs, with the likes of Google wanting to turn the web into their webstore.
Please do some listing. Most of the time, people worrying about PWA limitations are a decade or so out of date on what contemporary web APIs are capable of.
For example, PWAs are trivial to cache with service workers (a lot easier than app install for both the developer and the user), and after the first load they work totally offline.
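To make that concrete, the offline-after-first-load part is roughly this much service worker (a minimal sketch; the asset list is a placeholder):

    // sw.ts: cache the app shell on install, serve it cache-first afterwards.
    const CACHE = 'app-shell-v1';
    const ASSETS = ['/', '/index.html', '/app.js', '/app.css']; // placeholder asset list

    self.addEventListener('install', (event: any) => {
      event.waitUntil(caches.open(CACHE).then((c) => c.addAll(ASSETS)));
    });

    self.addEventListener('fetch', (event: any) => {
      event.respondWith(
        caches.match(event.request).then((hit) => hit ?? fetch(event.request))
      );
    });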
You have almost total control of the PWA. Just launch the browser DevTools and you can even edit the code on the fly. With e.g. iOS apps you have no control over, or even knowledge of, what the apps are doing, unless you manage to decrypt them, decompile or inspect the binary, and/or root your own device through exploits.
All major browser engines are open source and the APIs are open standards (even Google deprecated their proprietary APIs). PWAs are way less tied to the likes of Google than native apps.
I think you should update your information about browser APIs. You'll be pleasantly surprised.
Another interesting option is PouchDB[0], which is a Javascript implementation of the CouchDB[1] synchronization API. It allows you to achieve eventual consistency between a client with intermittent connectivity and a backend database.
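A rough sketch of how that looks in practice (the remote URL is a placeholder; see the PouchDB docs for the full API):

    import PouchDB from 'pouchdb';

    // The local database lives in the browser (IndexedDB under the hood).
    const local = new PouchDB('notes');

    // Writes land locally first, so the app keeps working offline.
    await local.put({ _id: 'note:1', text: 'works offline', updatedAt: Date.now() });

    // Continuous, resumable two-way replication with a CouchDB-compatible server.
    local.sync('https://couch.example.com/notes', { live: true, retry: true })
      .on('change', (info) => console.log('replicated', info))
      .on('error', (err) => console.error('sync error', err));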
If you query a Parquet file from your lake via DuckDB-in-browser, does DuckDB run in WASM on the web client and pull the compressed parquet to your browser where it is decompressed?
Or are you connecting some DuckDB on the web client to some DuckDB component on a server somewhere?
I presume yes to the first and no to the second but just checking I have my mental model correct.
DuckDB has certain optimizations which allow it to read only parts of a Parquet file. It can also read remote files in a streaming fashion, so it does not have to wait for the entire file to be downloaded or store a large amount of data in memory.
Relevant documentation: https://duckdb.org/2021/06/25/querying-parquet.html
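Concretely, with a duckdb-wasm connection like the one sketched further up the thread, a remote Parquet query is just SQL; the URL and column names here are placeholders:

    // Only the Parquet footer plus the row groups / column chunks the query touches
    // are fetched over HTTP, not the whole file.
    const result = await conn.query(`
      SELECT repo, count(*) AS events
      FROM read_parquet('https://example.com/datasets/github_events.parquet')
      WHERE created_at >= DATE '2024-01-01'
      GROUP BY repo
      ORDER BY events DESC
      LIMIT 20
    `);
    console.log(result.toArray());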
DuckDB can use the HTTP range header trick against Parquet, which means a lot of analytical questions against multi-GB files can be answered by fetching only small portions of the overall data.
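The underlying trick is plain HTTP: ask for byte ranges instead of the whole object. A hand-rolled sketch of the idea (assumes the server supports Range requests and exposes Content-Length; the URL is a placeholder):

    const url = 'https://example.com/datasets/github_events.parquet'; // placeholder

    // Find the file size, then fetch only the tail, where a Parquet file keeps its
    // footer/metadata (schema, row group offsets, column statistics).
    const head = await fetch(url, { method: 'HEAD' });
    const size = Number(head.headers.get('content-length'));
    const tail = await fetch(url, {
      headers: { Range: `bytes=${size - 64 * 1024}-${size - 1}` },
    });
    const footer = new Uint8Array(await tail.arrayBuffer());

    // A real reader parses the footer here, then issues more Range requests for
    // just the row groups / column chunks a given query actually needs.
    console.log('fetched', footer.length, 'of', size, 'bytes');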
Howdy! I work at MotherDuck and DuckDB Labs (part time as a blogger). At MotherDuck, we have both client side and server side compute! So the initial reduction from PB/TB to GB/MB can happen server side, and the results can be sliced and diced at top speed in your browser!
Please spend a sentence or two explaining the server side filtering mechanism and linking to documentation! I would like to know the conditions required for streaming queries! From the sibling comment and a search of the docs it seems like this is a Parquet only feature, which seems pretty important to note!
Parquet is designed with predicate push-down in mind. Partitions are laid out on disk, and then blocks within files are laid out so that consumers can very, very easily narrow in on which files they need to read, before doing any more IO than a list, or a small metadata read.
Once you know what you are reading, many parquet/arrow libraries will support streaming reads/aggregations, so the client doesn’t need to load the whole working set in memory.
As far as I can tell they're not storing anything locally, they're pulling the data from Google Cloud Storage as you access it.
That said, a Wasm module doesn't have access to any storage facilities that the browser doesn't already expose (IndexedDB, OPFS, cookies, etc.) so even if it could be used by marketers, they would gain nothing by using DuckDB for that over just using the same underlying browser storage API.
So when I want to browse your website on my phone with a limited data plan, I have to download an entire database client and database, as well as any of your other huge JS libraries?
I'd prefer that to page reloads, lost focus, lost scroll position, and losing any input between clicking a checkbox and receiving a response. Bonus points if I only have to download a db-diff next time, compared to downloading 5x the db's worth through POST /get-products?<filters>&page=<n> in every e-shopping session. Rarely does a client-facing business database exceed your average YouTube cat video in size (assuming pics are stored as URLs).
Ten years ago your concern might have been valid, but in 2024, given that most people in developed economies have ample 4G/5G data plans and generous RAM/storage on their mobile phones, it makes more sense to just send them the whole DB once.
This is the same spirit as Electron vs native. Of course native uses less RAM and is more efficient, but it's not noticeable by the end user (because they don't care how it's built). So we let developers write desktop apps in HTML/JS, and end users usually don't notice until they open up the task manager.
We have an abundance of data, memory, and storage that continually drops in cost. It makes a lot of sense to simply download the DB once and then access it locally, with optional syncing.
This is the future, because we can't count on mobile phones to continuously maintain connectivity, especially with limited data plans. Far better to simply allocate a percentage of your fixed data plan and decide which local-first apps to install on your phone.
I live in a 4g/5g country (with 5g disabled on my phone because it seems to perform worse than 4g), with a 2gb/month data plan. I've usually stayed inside my data budget, but I'm cutting it a little close this month and I've blown past it a couple of times when I've forgotten to switch back to wifi. Sure, I live in a country where I can buy infinite data, but I don't see the point when I rarely use more than 2gb, and a 2gb plan is cheaper
I think their tool is designed for working professionals on laptops - seems like analysts doing traditional business insights reporting. Perhaps there's a world where that target audience is using their phones, but I doubt it's a use case they're too concerned with.
I would be worried about this approach for something that doesn't need to expose a SQL interface though! It's not the right fit for most webapps.
If it's an app, you're not in the browser anymore anyway, so you don't have to worry about this browser wasm. You can store data locally anyway and don't have to pull it from the server.
If the data "belongs" to the server, why not send the query to the server and run it there?
If the data "belongs" on the client, why have it in database form, particularly a "data-lake" structured db, at all?
A lot of the benefits of such databases are their ability to optimise queries for improving performance in a context where the data can't fit in memory (and possibly not even on single disks/machines), as well as additional durability and atomicity improvements. If the data is small enough to be reasonable to send to a client, then it's small enough to fit in memory, which means it'll be fast to query no matter how you go about it.
The page says one advantage is "Ad-hoc queries on data lakes", but isn't that possible with the most basic form that simply sends a query to the database?
What am I failing to understand about this category of products?