98% Cloud Cost Saved By Writing Our Own Database

Published 2024-04-20
Recorded live on twitch, GET IN

Article
hivekit.io/blog/how-weve-saved-98-percent-in-cloud…
By: twitter.com/hivekit_io

My Stream
twitch.tv/ThePrimeagen

Best Way To Support Me
Become a backend engineer. It's my favorite site
boot.dev/?promo=PRIMEYT

This is also the best way to support me: support yourself by becoming a better backend engineer.

MY MAIN YT CHANNEL: Has well edited engineering videos
youtube.com/ThePrimeagen

Discord
discord.gg/ThePrimeagen


Have something for me to read or react to?: www.reddit.com/r/ThePrimeagenReact/

Kinesis Advantage 360: bit.ly/Prime-Kinesis

Hey, I am sponsored by Turso, an edge database. I think they are pretty neat. Give them a try for free, and if you want, you can get a decent amount off (the free tier is the best, better than PlanetScale or any other).
turso.tech/deeznuts

All Comments (21)
  • @Fik0n
    The best thing about saving 98% cloud cost is that developer hours are free and that this will be super easy to maintain when the original devs quit.
  • @TomNook.
    I saved 99% of my cloud costs by connecting my frontend to an excel spreadsheet. Such a great idea!
  • @ivanjermakov
    TLDR: they wrote their own log file. No ACID = not a DB.
  • @bfors8498
    I call this impressive-sounding-blogpost-driven-development
  • @michaelcohen7676
    Chat is misunderstanding RTK. RTK is literally just correcting GPS data using a known point in real time. It is not better than GPS; it just enhances the way the measurements are interpreted. [A toy correction sketch follows the comments.]
  • @christ.4977
    Isn't streaming data like this what Kafka was made for?
  • @dv_xl
    You mentioned at the beginning of the video that making your own language makes sense if it's designed to actually solve a problem in a better way. This is that. They did not attempt to write a general-purpose DB. They wrote a really fast log file that is queryable in real time for their domain. This wins them points on cost (margins matter) but, more importantly, gives them a marked advantage against competitors. Note that they're storing and querying way more efficiently. Quality of product is improving while the cost of competing is increasing. Seems like a no-brainer on the business side.
  • @PeterSteele111
    I do GIS at work and have several handheld units on my desk right now that connect over Bluetooth to iOS and Android and can get sub-meter accuracy. I have even played with centimeter accuracy. I have Trimble and Juniper Geode units on hand. I built the mobile apps we use for marking assets in the field and syncing back to our servers, and am currently working on an offline mode for that. So yeah, GPS has come a long way since you last looked. Internal hardware is like 10-20 meters on a phone, but dedicated hardware that can pass itself off as a mock location on Android or whatever can get much, much more accurate results.
  • @jsax01001010
    5:35 Prepping for scale can be worthwhile if they manage to get a contract with a very large company. A company I work for recently contracted with a company that provides a similar service. The small-scale test with 2,000 GPS trackers was straining their infrastructure. The full rollout of 200,000 trackers broke their service for a week or two while they had to rush to scale up their service by about 20x.
  • I signed in and made a YouTube account just now to say THANK YOU! 15:00 I DIDN'T THINK ABOUT VERSIONING MY DATA! Sometimes, the things you don't know when self-taught are just jaw-dropping. This has been very humbling.
  • @andrasschmidthu
    Great solution! If they want to optimize further, they should use fixed point instead of floating point and do variable-length difference encoding. Most numbers would fit in 8 or 16 bits; with that, the memory requirement could easily be half or even less. The size of the entry should be stored in a uint16, or even a uint8; if sizes ≥ 65536 are possible, then use variable-length encoding for the size too. The whole data stream should be stored like that: as a stream. 30,000 34-byte entries a second is about 1 MB/s, which is a joke. Write all logs into a stream and, in parallel, collect them per data source in RAM until a disk block's worth of data has accumulated. Only flush whole blocks to disk. This would optimize storage access and you could reach the bandwidth limit of the hardware. In case of power failure, the logs have to be re-processed, like a transaction log is reprocessed by a database. We once optimized such a logger to use raw access to a spinning HDD (no filesystem) and could sustain very good write bandwidth on cheap hardware. [A delta-encoding sketch follows the comments.]
  • @mikeshardmind
    The thing about the version in the header is spot on, but unlikely to help them here, since they want to be able to directly access specific bytes for fast indexing, so all the bytes used for that can't ever change meaning. Assuming they haven't already used all of the available flags, the last flag value could be used to indicate another flag-prefixed data section. [A sketch of this follows the comments.]
  • @Michaeltje01
    8:58 KeyboardG: "high write and buffered is Kafka" Yeah I'm with this comment. I still don't understand why they couldn't use Kafka instead of some custom DB.
  • @paulmdevenney
    I thought the first rule was "don't invent your own security", but I think a close second might be "don't invent your own database". If your entire business workflow isn't focused on making that thing better, then you're in for a bad time.
  • @michaellatta
    In their case I would use Kafka to collect the data, then materialize it into a database for queries. [A sketch follows the comments.]
  • @bkucenski
    There's a workaround for the version-in-header thing: you can run V2 on a different port. But that's less safe than getting your minimum header right out of the gate. Error checking is also a good idea, so that if something gets munged up in transit (or you send the wrong version to the wrong destination), a simple math function will mostly guarantee the bits won't math and the message can be rejected. You can also then check which version the bits do math for and give a nice error message. [A checksum sketch follows the comments.]
  • @complexity5545
    The title is a play on [not knowing] the difference between "database" and "database engine". Databases are just files that store content. A database engine is a process that manages connections (and users) that read/write specific blocks of a data file. It was still an interesting article. Good video.
  • @EraYaN
    You really don't need the version per record; per chunk is more than good enough. You are going to do time-based migrations anyway, so it's all good (as in, start a new chunk at timestamp x with version n+1). [A chunk-header sketch follows the comments.]
  • @avwie132
    Saved cloud cost; now they have maintenance costs, ultra-specific highly-paid-developer costs, and self-induced vendor lock-in. Well done. Tens of thousands of vehicles and people isn't special and isn't big at all. Somehow everybody thinks their problem is a unique one. But it isn't. Looking at their proposition, it looks like something FlightTracker has been doing for ages..... Writing a blog post about something you just built is always easy, because everything appears to work like it should. Now fast forward to five years in the future and see how you handled all the incoming business-requirement changes in your bespoke binary format.
  • @darkwoodmovies
    At first I thought saving $10k per month was worth it, but then I realized that a single entry-level software engineer costs more
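
Code sketches (referenced in the comments above)

Re @michaelcohen7676: a toy differential-correction example. Real RTK works on carrier-phase measurements rather than finished coordinates, so this is closer to plain DGPS, but it illustrates the "correct against a known point in real time" idea. All names and numbers here are invented.

```go
package main

import "fmt"

type Position struct {
	Lat, Lon float64 // degrees
}

func main() {
	// A base station knows its surveyed position exactly...
	baseTrue := Position{52.52000, 13.40500}
	// ...and compares it with what its receiver reports right now.
	baseMeasured := Position{52.52004, 13.40493}

	// The observed error is broadcast as a correction.
	corrLat := baseTrue.Lat - baseMeasured.Lat
	corrLon := baseTrue.Lon - baseMeasured.Lon

	// A nearby rover applies the correction to its own fix, assuming
	// both receivers see roughly the same atmospheric error.
	roverMeasured := Position{52.52100, 13.40600}
	fmt.Printf("corrected rover fix: %.5f, %.5f\n",
		roverMeasured.Lat+corrLat, roverMeasured.Lon+corrLon)
}
```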
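Re @andrasschmidthu: a minimal sketch of fixed point plus variable-length difference encoding. The 1e7 scale factor and the lat/lon-only layout are assumptions, not Hivekit's actual format; Go's binary.PutVarint already does zigzag varints, so small deltas cost one or two bytes.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

const scale = 1e7 // fixed point: one unit is roughly 1.1 cm of latitude

// encodeTrack writes the first fix absolutely (as a delta from zero),
// then only the difference from the previous fix for each later one.
func encodeTrack(lats, lons []float64) []byte {
	buf := make([]byte, 0, len(lats)*4)
	tmp := make([]byte, binary.MaxVarintLen64)
	var prevLat, prevLon int64
	for i := range lats {
		lat, lon := int64(lats[i]*scale), int64(lons[i]*scale)
		n := binary.PutVarint(tmp, lat-prevLat)
		buf = append(buf, tmp[:n]...)
		n = binary.PutVarint(tmp, lon-prevLon)
		buf = append(buf, tmp[:n]...)
		prevLat, prevLon = lat, lon
	}
	return buf
}

func main() {
	lats := []float64{52.5200000, 52.5200140, 52.5200270}
	lons := []float64{13.4050000, 13.4050090, 13.4050200}
	enc := encodeTrack(lats, lons)
	// Three fixes as raw float64 pairs would be 48 bytes; consecutive
	// fixes are close together, so the delta stream is far smaller.
	fmt.Printf("%d fixes in %d bytes\n", len(lats), len(enc))
}
```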
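Re @mikeshardmind: a sketch of the flag escape hatch. The fixed fields keep their byte offsets so direct indexing still works, and the reserved last flag bit signals a length-prefixed extension section after them. Field names and widths are invented for illustration.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

const flagHasExtension = 0x80 // the reserved last flag bit

// buildRecord keeps a fixed layout (1 flag byte, 8-byte id, 8-byte
// timestamp) whose offsets never move. New data goes in an optional
// length-prefixed tail announced by the extension flag; old readers
// never see the flag set, so they never try to parse the tail.
func buildRecord(id, ts uint64, flags byte, ext []byte) []byte {
	if len(ext) > 0 {
		flags |= flagHasExtension
	}
	rec := make([]byte, 17)
	rec[0] = flags
	binary.BigEndian.PutUint64(rec[1:9], id)
	binary.BigEndian.PutUint64(rec[9:17], ts)
	if len(ext) > 0 {
		var size [2]byte
		binary.BigEndian.PutUint16(size[:], uint16(len(ext)))
		rec = append(rec, size[:]...)
		rec = append(rec, ext...)
	}
	return rec
}

func main() {
	rec := buildRecord(42, 1713571200, 0x01, []byte{0xCA, 0xFE})
	fmt.Printf("flags=%#02x, total=%d bytes\n", rec[0], len(rec))
}
```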
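Re @michaellatta (and the other Kafka comments): a sketch of the collect-then-materialize shape. Broker, topic, and group names are placeholders, and it assumes the github.com/segmentio/kafka-go client; the "database" is stubbed as a map where a real system would upsert into Postgres or similar.

```go
package main

import (
	"context"
	"fmt"

	kafka "github.com/segmentio/kafka-go"
)

func main() {
	// Kafka absorbs the high-rate location writes...
	r := kafka.NewReader(kafka.ReaderConfig{
		Brokers: []string{"localhost:9092"}, // placeholder broker
		Topic:   "location-updates",         // placeholder topic
		GroupID: "materializer",             // consumer group, so this can scale out
	})
	defer r.Close()

	// ...and a consumer materializes the latest position per vehicle
	// into a queryable store (a map standing in for a real database).
	latest := map[string][]byte{}
	for {
		msg, err := r.ReadMessage(context.Background())
		if err != nil {
			fmt.Println("read error:", err)
			return
		}
		latest[string(msg.Key)] = msg.Value // key = vehicle id, value = encoded fix
	}
}
```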
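Re @bkucenski: a sketch of the "bits won't math" check. Each record carries a version byte and a CRC32 over everything before it, so a reader rejects corrupted or wrong-version payloads instead of misparsing them. The record layout is an assumption for illustration.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
	"hash/crc32"
)

// seal prefixes a version byte and appends a CRC32 of version+payload.
func seal(version byte, payload []byte) []byte {
	rec := append([]byte{version}, payload...)
	return binary.BigEndian.AppendUint32(rec, crc32.ChecksumIEEE(rec))
}

// open verifies the checksum first, then the version.
func open(rec []byte, wantVersion byte) ([]byte, error) {
	if len(rec) < 5 {
		return nil, errors.New("record too short")
	}
	body, sum := rec[:len(rec)-4], binary.BigEndian.Uint32(rec[len(rec)-4:])
	if crc32.ChecksumIEEE(body) != sum {
		return nil, errors.New("checksum mismatch: mangled in transit")
	}
	if body[0] != wantVersion {
		return nil, fmt.Errorf("record is version %d, reader wants %d", body[0], wantVersion)
	}
	return body[1:], nil
}

func main() {
	rec := seal(2, []byte("lat=52.52,lon=13.405"))
	rec[3] ^= 0x01 // flip one bit in transit
	if _, err := open(rec, 2); err != nil {
		fmt.Println("rejected:", err)
	}
}
```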
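Re @EraYaN: a sketch of per-chunk versioning. The version lives once in each chunk's header instead of in every record, and a format migration simply starts a new chunk at the cutover timestamp. The magic bytes and header fields are invented.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

// ChunkHeader is written once per chunk; every record inside the
// chunk shares its Version, so records carry no version overhead.
type ChunkHeader struct {
	Magic   [4]byte // invented file-type tag
	Version uint16  // record format for the whole chunk
	StartTS uint64  // first timestamp the chunk covers
}

func newChunk(version uint16, startTS uint64) *bytes.Buffer {
	var buf bytes.Buffer
	h := ChunkHeader{Magic: [4]byte{'H', 'V', 'K', 'T'}, Version: version, StartTS: startTS}
	binary.Write(&buf, binary.BigEndian, h) // fixed-size struct, so no error here
	return &buf
}

func main() {
	// Migration at timestamp x: v1 chunks end, a fresh chunk starts as v2.
	v1 := newChunk(1, 1713570000)
	v2 := newChunk(2, 1713571200)
	fmt.Println("v1 header:", v1.Len(), "bytes | v2 header:", v2.Len(), "bytes")
}
```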