Code{nested}: September 2020

For a few days ago, I started to use PostgreSQL. Current development is not using database as it becomes I/O bottleneck, so I merely write down everything from RAM to disk at once(and additionally it is compressed in LZ4 to shrink write time). The problem was, that RAM usage skyrocketed as target data becomes TBs. Just guess what you feel when you encounter out of memory after 10 minute operation. What's worse, since I was working on portable server, the total amount of RAM was limited to certain degrees, not like rack-type servers.

So, I changed the mind model - use DB like as something like swap file! Currently the process is processing all the data on robin hood hash, but if we can control them under a database table and remove that table after use, we can save RAM dramatically(well, at least out of memory won't be shown if we limit working memory for DB).

At first, I tried SQLite. It consumes so low on RAM so that even Android and iOS can use, and before that I developed a class optimized for bulk insert with the engine. However, soon I found that it consumes practically infinite time when the query is even slightly complicated. When I tried an inner join query with two tables with 7 million records each, I gave up after 30 minutes. And it couldn't process in parallel since the engine is working only on single thread.

Later, I found out that SQLite uses only nested loop on table joins. If we're working on previous case(two tables with 7 million records each), we need to loop for 49 trillion times. And it should work on single thread. Oh God.

As I was cornered and forced to use "ordinary" database, my choice was MySQL, MariaDB, and PostgreSQL. MySQL was dropped at early stage as no tweak and tuning seems to have worked as I intended, so my choice became more limited. And I chose PostgreSQL for its a few exceptional features:

It's not snail-slow as people say. Instead, if you tweak just few, it "flashes." No, I can't agree with the words "it's slower than MySQL or MariaDB." When you're working hard to tweak MySQL/MariaDB to maximize its performance with your complicated tweak but if you don't do same amount of efforts and try PostgreSQL just out-of-box, well, it's not fair. Isn't it?
Parallel query processing! If you send a query the scheduler "smartly" judges and multi processing is done as needed. For my inner join of two tables with 7 million records, it consumed 3 cores and gave result in 1 minute 30 seconds.
It's simply reliable in any bad condition. After stress test, PosgreSQL survived and gave result when MariaDB service was restarted

Currently I'm using it as if it's some kind of hashmap. Well, maybe that's because I previously worked with them. Thinking that the basis of database indexing is hashmap, it may be one of the best practice for any SQL based database. Additionally, my coding became simplified since database does most of the things automatically......

Now I agree with the slogan, The World's Most Advanced Open Source Relational Database. Keep up the good work, PostgreSQL Global Development Group!

얼마 전부터 PostgreSQL을 사용하기 시작했습니다. 현재 개발중인 시스템은 I/O 속도 문제때문에 DB를 쓰지 않고 RAM 구조를 통채로 디스크에 복사해뒀다가 필요할때 다시 읽어들이는 방식을 쓰고 있는데(그리고 그 와중에 디스크 쓰기 시간을 줄이기 위해 데이터를 LZ4로 미리 압축해 놓습니다), 처리하는 데이터의 양이 TB급으로 늘어나기 시작하면 RAM 사용량이 기하급수적으로 늘어나더군요. 10분정도 작업 수행하다 말고 out of memory를 만나면 참 난감해지더군요. 게다가 프로그램이 돌아가는 환경이 포터블 서버라 랙타입 서버처럼 RAM을 무식하게(......) 꽂을 수도 없는 일이고......

해서, DB를 swap file처럼 쓰는 방식을 생각해냈습니다. 지금은 모든 데이터를 robin hood hash에 올려서 처리하는데, 그 대신 DB에 필요한 데이터를 넣어두고 결과를 처리한 후 처리가 끝나면 테이블을 삭제하는 형태로 쓰면 RAM을 획기적으로 절약할 수 있겠다는 결론을 내렸습니다(아니 working memory를 제한하면 최소한 out of memory는 안 내놓겠죠 -_-).

처음에는 SQLite를 적용해 보았습니다. Android와 iOS에서 사용할 정도로 RAM 사용량이 적고, 마침 예전에 SQLite를 이용해서 bulk insert를 할 수 있는 별도의 클래스를 만들어 두기도 했거든요. 하지만 SQLite는 쿼리가 조금만 복잡해져도 수행시간이 거의 무한하게 늘어지는(......) 문제점이 있었습니다. 레코드 7백만개짜리 테이블 두 개에 대해서 간단한 inner join을 수행해 보니, 수행을 시작한지 30분이 지나도 결과가 나오지 않더군요. 게다가 single thread에서 동작할 수밖에 없는 특성상 처리를 병렬로 수행할 수도 없었고......

왜 그러나 확인해보니, SQLite는 table join에서 nested loop만을 사용하더군요. 7백만 개짜리 테이블 두 개라면 loop를 49경번(......)을 돌아야 된다는 이야기죠. 그것도 싱글 스레드로. '이건 아니지 않느냐'는 생각에 깔끔하게 포기했습니다.

결국 본격적인(?) DB를 적용할 수밖에 없게 된 상황에서, 선택지는 MySQL, MariaDB, PostgreSQL의 세 가지 정도로 좁혀지더군요. 개중에 MySQL은 아무리 튜닝을 해도 제가 원하는데로 움직여주지 않아서 일찌감치 탈락(-_-)시키고, MariaDB와 PostgreSQL을 비교해본 뒤 PostgreSQL을 선택했습니다. 이유가 몇가지 있는데......

굼뜨다는 세간의 평가와 달리, 설정을 조금만 건드려주면 말 그대로 날아다닙니다. 솔직히, MySQL이나 MariaDB보다 느리다는 말에는 동의하지 못하겠습니다. MySQL/MariaDB에서는 InnoDB 속도 향상을 위해서 엄청나게들 tweak을 하시면서 PostgreSQL에서는 그런 작업을 하지 않으면서 느리다는 평가를 내리는건 어불성설일 겁니다
단일 쿼리를 병렬로 수행합니다. 쿼리를 보내면 스케줄러가 알아서 판단해서 필요에 따라 멀티프로세싱으로 작업을 수행합니다. 제가 수행했던 저 7백만개 레코드 테이블 두 개의 inner join은 core 3개를 점유한 뒤 1분 30초만에 답을 내놓더군요.
그 어떤 악조건에서도 의지가 됩니다. 제가 부하 테스트를 여러번 해본 결과, MariaDB에서 프로세스가 터져버리는 상황에서도 PostgreSQL은 끝까지 기다려주면 무조건 결과를 내놓았습니다

원래 hashmap에서 작업을 수행해서 그런지는 모르겠지만, 지금도 DB를 hashmap 비슷하게 사용하고 있습니다. 뭐, DB indexing의 기본이 hashmap인걸 생각해보면 적절한 용례일지도 모르겠네요. 그 외에, DB가 상당부분을 알아서 처리해주는 터라 프로그램의 주 business logic이 간편해진 것은 덤이고......

이래저래, The World's Most Advanced Open Source Relational Database라는 슬로건에 부합하는 코끼리 파워를 느껴보았습니다. 앞으로도 계속 승승장구하길 기원합니다.

Code{nested}

Short thoughts about PostgreSQL(+a few lines for SQLite)

PostgreSQL에 대한 주절주절(+SQLite 이야기 약간)

블로그를 이전합니다

Popular in Code{nested}

Labels