Modern Open-Source Databases
Modern open-source databases have evolved from simple storage engines into sophisticated platforms that handle distributed workloads, real-time analytics, vector search, and multi-model data, in some cases within a single system. The pace of innovation in this space over the last three years has been remarkable – both in extending the capabilities of established databases and in the arrival of purpose-built systems for new data workloads.
Choosing the Right Open-Source Database
PostgreSQL with pgvector + TimescaleDB + PostGIS covers 80% of workloads without adding a new database to your stack.
PostgreSQL is among the most widely used open-source relational databases and has become significantly more powerful through extensions. pgvector adds vector similarity search for AI applications. TimescaleDB adds time-series hypertables and continuous aggregates for IoT and observability data. Citus adds horizontal sharding for distributed PostgreSQL. PostGIS adds geospatial queries. The result is that many organisations now run a single PostgreSQL deployment for workloads that previously required separate specialised databases.
For distributed relational workloads, CockroachDB and YugabyteDB offer PostgreSQL-compatible APIs with automatic geographic distribution and strong consistency. MySQL continues to power a huge proportion of the web and has seen strong improvements in the InnoDB engine; Oracle's MySQL HeatWave cloud service adds in-memory analytics on top of it.
On the NoSQL side, MongoDB 7.x has improved its transaction support and offers time-series collections (available since MongoDB 5.0), while Cassandra remains the go-to for write-heavy, geographically distributed key-value and wide-column workloads.
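To make the vector similarity search mentioned above more concrete, here is a minimal pure-Python sketch of the underlying idea: rank stored embeddings by cosine distance to a query vector. It is an illustration only, not how pgvector is implemented or called (pgvector is used through SQL inside PostgreSQL). The function names and the toy `documents` data are hypothetical.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity for two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def nearest(query, rows, k=2):
    """Return the ids of the k rows closest to the query (exact brute-force scan)."""
    ranked = sorted(rows, key=lambda row: cosine_distance(query, row[1]))
    return [row_id for row_id, _ in ranked[:k]]


# Hypothetical toy "table" of (id, embedding) pairs.
documents = [
    ("doc_a", [1.0, 0.0, 0.0]),
    ("doc_b", [0.9, 0.1, 0.0]),
    ("doc_c", [0.0, 1.0, 0.0]),
]

result = nearest([1.0, 0.0, 0.0], documents, k=2)
print(result)
```

A brute-force scan like this compares the query against every row, which is fine for small tables; at larger scale, a vector extension would typically rely on an index to avoid scanning everything.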
New-generation databases have emerged for specific modern workloads. ClickHouse is now widely deployed for real-time analytics on event and log data – with columnar storage and vectorised execution it can ingest millions of rows per second and, on suitable hardware and schemas, answer aggregation queries over billions of rows in under a second. DuckDB has become the go-to in-process analytical database for data scientists: it runs embedded in a host process such as a Python session, keeps its data in memory or in a single database file, queries Parquet and CSV files directly, and handles analytical workloads on a laptop that previously required a cluster. Neon adds branching and serverless scaling to PostgreSQL, making it popular for development environments where you want to clone a production database quickly for testing. The common thread across all these developments is that open-source databases are now capable enough to handle many workloads that previously required expensive commercial products.
Frequently Asked Questions
Why has PostgreSQL become so dominant among open-source databases?
PostgreSQL combines strong ACID compliance, a rich SQL dialect, solid performance, and one of the most active extension ecosystems of any open-source database. Extensions like pgvector, TimescaleDB, and PostGIS let you add specialised capabilities without switching databases. The core database is released under the PostgreSQL licence, a permissive licence similar to BSD or MIT with no commercial restrictions, which makes it a safe default for organisations worried about open-source licence changes.
What is ClickHouse and when should you use it?
ClickHouse is an open-source columnar analytical database optimised for very high-throughput inserts and extremely fast aggregation queries on billions of rows. You should use it when you need real-time analytics on event streams, log data, user behaviour, or telemetry at a scale where row-oriented databases like PostgreSQL become too slow. It is not a good fit for transactional workloads with frequent updates or complex joins across normalised schemas.
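The advantage of a columnar layout for aggregation can be sketched in a few lines of plain Python. This is a conceptual illustration only, not ClickHouse code; the event records and the `to_columns` helper are hypothetical.

```python
# Row-oriented layout: all fields of one record are stored together.
rows = [
    {"user": "u1", "event": "click", "ms": 120},
    {"user": "u2", "event": "view", "ms": 300},
    {"user": "u1", "event": "click", "ms": 80},
    {"user": "u3", "event": "view", "ms": 200},
]


def to_columns(records):
    """Pivot a list of records into one list per field (column-oriented layout)."""
    columns = {}
    for record in records:
        for field, value in record.items():
            columns.setdefault(field, []).append(value)
    return columns


columns = to_columns(rows)

# Aggregating one field reads only that field's list; the "user" and
# "event" columns are never touched.
total_ms = sum(columns["ms"])
average_ms = total_ms / len(columns["ms"])
print(total_ms, average_ms)
```

The same property explains the trade-off described above: updating a single record in a columnar layout means touching every column, which is one reason such systems tend to suit append-heavy analytics better than transactional workloads.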
What makes DuckDB different from other analytical databases?
DuckDB runs in-process – embedded inside your Python, R, Java, or Node.js application with no server to manage. It reads Parquet, CSV, and JSON files directly from disk or object storage without a separate import step. It is fast enough for serious analytical workloads on a single machine, which makes it well suited to data science exploration, local pipeline testing, and analytical queries in applications that do not need a full data warehouse.
Should you use a dedicated time-series database or extend PostgreSQL with TimescaleDB?
For most teams, TimescaleDB is the right first choice because it adds time-series capabilities (hypertables, continuous aggregates, compression, data retention policies) on top of PostgreSQL, which you may already operate. You retain the full SQL ecosystem, existing tooling, and operational knowledge. Dedicated time-series databases like InfluxDB or QuestDB can offer better write throughput at extreme scale but add another system to operate. Start with TimescaleDB and migrate only if you hit its limits.
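As a rough illustration of the kind of rollup a time-series aggregate produces, the sketch below groups readings into fixed-width time buckets and averages each bucket. It is plain Python under simplifying assumptions, not TimescaleDB code; the function names and sample readings are hypothetical, and a real continuous aggregate would be defined in SQL and maintained by the database.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def bucket_start(ts, width):
    """Floor a timezone-aware timestamp to the start of its fixed-width bucket."""
    buckets_elapsed = (ts - EPOCH) // width  # timedelta // timedelta gives an int
    return EPOCH + buckets_elapsed * width


def average_per_bucket(readings, width):
    """Group (timestamp, value) pairs by bucket and return the mean per bucket."""
    groups = defaultdict(list)
    for ts, value in readings:
        groups[bucket_start(ts, width)].append(value)
    return {start: sum(vals) / len(vals) for start, vals in sorted(groups.items())}


# Hypothetical sensor readings.
readings = [
    (datetime(2024, 1, 1, 0, 1, tzinfo=timezone.utc), 20.0),
    (datetime(2024, 1, 1, 0, 4, tzinfo=timezone.utc), 22.0),
    (datetime(2024, 1, 1, 0, 7, tzinfo=timezone.utc), 30.0),
]

averages = average_per_bucket(readings, timedelta(minutes=5))
for start, mean in averages.items():
    print(start.isoformat(), mean)
```

This version recomputes everything from the raw readings on each call; the appeal of a database-maintained aggregate is that the rollup is kept up to date as new data arrives rather than rebuilt from scratch.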
