A few days ago, I started using PostgreSQL. My current project avoided a database because it became an I/O bottleneck; instead I kept everything in RAM and wrote it to disk in one pass (compressed with LZ4 to cut write time). The problem was that RAM usage skyrocketed as the target data grew into the terabytes. Just imagine how it feels to hit out-of-memory after a ten-minute run. What's worse, I was working on a portable server, so total RAM was limited to a certain degree, unlike a rack-type server.
So I changed my mental model: use the database like a swap file! Currently the process keeps all the data in a Robin Hood hash table, but if we move it into a database table and drop that table after use, we can save RAM dramatically (at the very least, out-of-memory errors won't appear if we cap the database's working memory).
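A minimal sketch of the idea. I use Python's built-in sqlite3 here so it is self-contained, but the pattern is engine-agnostic; the table and column names are purely illustrative:

```python
import sqlite3

# An on-disk database, so the data lives on disk instead of RAM.
conn = sqlite3.connect("spill.db")

# A disposable "swap" table standing in for the in-RAM hash table.
conn.execute("CREATE TABLE scratch (key TEXT PRIMARY KEY, value BLOB)")

# Spill entries that would otherwise occupy RAM.
conn.executemany(
    "INSERT INTO scratch VALUES (?, ?)",
    [("k1", b"payload-1"), ("k2", b"payload-2")],
)

# Look an entry up later, exactly as with a hashmap.
row = conn.execute("SELECT value FROM scratch WHERE key = ?", ("k1",)).fetchone()

# Drop the table once the job is done, reclaiming the space.
conn.execute("DROP TABLE scratch")
conn.close()
```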
At first, I tried SQLite. Its RAM footprint is so small that even Android and iOS ship it, and I had already written a class optimized for bulk inserts with that engine. However, I soon found that it takes practically infinite time once a query gets even slightly complicated. When I tried an inner join between two tables of 7 million records each, I gave up after 30 minutes. And it couldn't parallelize, since the engine runs on a single thread.
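For reference, that bulk-insert class boils down to the standard tricks: one big transaction, `executemany`, and relaxed durability pragmas. A minimal sketch, with illustrative settings rather than my actual class:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # on disk in the real workload

# Trade durability for insert speed -- acceptable for disposable data.
conn.execute("PRAGMA journal_mode = MEMORY")
conn.execute("PRAGMA synchronous = OFF")

conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, payload TEXT)")

rows = ((i, f"row-{i}") for i in range(100_000))

# One transaction around the whole batch: without it, SQLite commits
# (and syncs) after every single INSERT, which is the usual killer.
with conn:
    conn.executemany("INSERT INTO records VALUES (?, ?)", rows)

count = conn.execute("SELECT COUNT(*) FROM records").fetchone()[0]
```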
Later, I found out why: SQLite only implements nested loops for table joins. Without a usable index, the previous case (two tables of 7 million records each) means looping 7M × 7M = 49 trillion times. On a single thread. Oh God.
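You can see this directly with `EXPLAIN QUERY PLAN`. A small demo (automatic indexes are disabled here to expose the raw nested loop; table names are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id INTEGER, v TEXT);
    CREATE TABLE b (id INTEGER, v TEXT);
""")

# Stop SQLite from quietly building a temporary index for the join,
# so the plan shows the bare nested loop.
conn.execute("PRAGMA automatic_index = OFF")

query = "SELECT * FROM a JOIN b ON a.id = b.id"

# Both tables are fully scanned: O(len(a) * len(b)) comparisons.
plan_without_index = [row[3] for row in
                      conn.execute("EXPLAIN QUERY PLAN " + query)]

# With an index on the join key, the inner loop becomes an index
# lookup instead of a full scan of b.
conn.execute("CREATE INDEX b_id ON b(id)")
plan_with_index = [row[3] for row in
                   conn.execute("EXPLAIN QUERY PLAN " + query)]
```

It is still a nested loop either way; the index just makes the inner step logarithmic instead of linear.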
Cornered into using an "ordinary" database, my candidates were MySQL, MariaDB, and PostgreSQL. MySQL was dropped at an early stage, as no tweak or tuning seemed to work the way I intended, so my choice narrowed further. I chose PostgreSQL for a few exceptional features:
- It's not snail-slow, as people say. Tweak just a few settings and it flies. No, I can't agree with "it's slower than MySQL or MariaDB": if you work hard on a complicated tuning to maximize MySQL/MariaDB performance, but run PostgreSQL out of the box without the same effort, well, that's not a fair comparison. Is it?
- Parallel query processing! When you send a query, the planner "smartly" judges whether to parallelize and spawns workers as needed. My inner join of two tables with 7 million records each used 3 cores and returned in 1 minute 30 seconds.
- It's simply reliable under bad conditions. In my stress test, PostgreSQL survived and returned results, while the MariaDB service had to be restarted.
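The "few tweaks" I mean are mostly memory and parallelism knobs in postgresql.conf. The values below are purely illustrative (they depend entirely on your hardware), not a recommendation:

```
# postgresql.conf -- illustrative values for a machine with ~16 GB RAM
shared_buffers = 4GB                  # page cache owned by PostgreSQL
work_mem = 256MB                      # per-sort/per-hash memory; caps RAM per operation
max_parallel_workers_per_gather = 4   # workers a single query may recruit
max_parallel_workers = 8              # workers across all queries
```

Note that `work_mem` is exactly the cap I mentioned earlier: when an operation needs more, PostgreSQL spills to disk instead of eating RAM.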
Currently I'm using it as if it were some kind of hashmap, probably because that's what I worked with before. Since a database index (a B-tree by default in PostgreSQL, with hash indexes also available) makes key lookups fast much like a hashmap does, this may be a reasonable pattern for any SQL database. As a bonus, my code got simpler, since the database does most of the work automatically......
Now I agree with the slogan: "The World's Most Advanced Open Source Relational Database." Keep up the good work, PostgreSQL Global Development Group!