Code{nested}: PostgreSQL

Showing posts with label PostgreSQL. Show all posts

Thursday, March 7, 2024

PostgreSQL vs. SQLite: read & write in multithreaded environment

The start was humble. I needed to cache some data, and I thought just push them to database table and give index, and the rest will be database's job. There were only 2 TEXT fields, and I needed to refer to only one field to search for specific row - which is some kind of key-value store -, so I thought whatever database engine should be fine.

And yes. It was a BIG mistake.

First I tried SQLite, and I found out that, in multithreaded environment some records are evaporated when trying to write to the table simultaneously, even with -DSQLITE_THREADSAFE=2 compile time option. I pushed the same data in same condition, and sometimes I have only 20 records, other times 40, and yet 26 for some others....... What drove me crazier was that the SQLite itself worked fine without any I/O problems. A good moment to shout "WHAT THE HELL?!" in real time.

So I changed the engine to PostgreSQL. Our trustworthy elephat friend saved all the records without any loss. I was satisfied with that, but...... Though I applied b-tree index to necessary field of the table, it took 100 milliseconds for just running SELECT field2 WHERE field1='something'. No, the table was small enough. There were only 680 records and data lengh was at most 30 characters for field 1 and only 4 characters for field 2. I configured the engine with some optimization, so it worked fine for bigger tables so I felt assured for its performance, but I didn't expect something like this, even in my dreams.

Elephant is tough, but as a side effect it's too slow.......

So, one last chance: I ran pg_dump to move data from PostgreSQL to SQLite, and with same condition(same index, same table structure, ......), I turned on at .timer SQLite shell and it took less than 0.001 second. Yohoo!

After some more experiments, SQLite can't fully resist from data loss by itself even with multithread support option enabled, and you need more external support like std::mutex. I guess that it's fread() call doesn't support full serialization in multithread environment, but I have neither time nor abilities to do the proper inspection. :P

Anyway, now I use the combination of SQLite + WAL mode + more SQLite internal cache + std::mutex. Still the write performance looks good, but if needed, I think I could use more files with load balancing via non-cryptographic hash.

PostgreSQL vs. SQLite: 멀티스레드 환경에서의 읽기-쓰기

그러니까....... 시작은 소소했습니다. 뭔가 데이터를 캐싱할 일이 있었는데, DB에 쌓아두고 index 걸면 나머지는 DB가 알아서 하지 않겠느냐 하는 거였습니다. 데이터라고 해봐야 별거 없이 그냥 TEXT 필드 두 개가 전부인데다가 실제로 데이터를 찾을 때는 둘 중 하나만 가지고 찾으면 되는 매우 간단한 key-value store 형태의 구조라, 어떤 DB를 써도 상관없겠지 하고 안일하게(.......) 생각했습니다.

옙. 그건 큰 착각이었습니다.

처음에는 SQLite를 사용해봤는데, 멀티스레드 환경에서 동시에 쓰기를 수행하니 컴파일시 -DSQLITE_THREADSAFE=2 옵션을 추가해도 일부 레코드가 유실되더군요. 동일한 데이터를 동일한 조건에서 동일하게 넣는데 어떨 때는 레코드가 20개만 있고, 어떨 때는 40개, 어떨 때는 또 26개...... 게다가 심지어 동작에는 이상이 없습니다(......). 정말이지 What the hell을 라이브로 외치기 딱 좋은 순간이죠.

그래서 DB를 PostgreSQL로 바꿨습니다. 우리의 우직하고 단단한 코끼리 친구는 레코드 유실 없이 모든 데이터를 다 받아서 잘 보관해줍니다. 그리고 만족하고 있던 그 찰나....... 인덱스까지 b-tree로 잘 걸어줬음에도 불구하고 SELECT field2 WHERE field1='something' 하나 수행하는데 무려 100밀리초가 걸립니다. 그렇다고 해서 테이블의 크기가 컸던 것도 아닌게, 레코드라고 해봐야 겨우 680여개 뿐이었고, 레코드 길이도 하나는 길어봐야 30글자, 나머지는 4글자 고정이었거든요. 나름 환경설정 최적화도 해 주었고, 그래서 대형 테이블에서는 비교적 빠르게 돌아갔던 터라 안심하고 있었는데, 이런 사소한 부분에서 성능 문제를 맞닥뜨릴줄은 꿈에도 몰랐습니다.

코끼리는 딴딴하지만 그 대신 미친듯이 느린걸로......

해서...... 혹시나 해서 PostgreSQL에서 pg_dump로 데이터를 덤프해서 SQLite로 옮긴 뒤에 동일하게 SELECT를 수행해 봤습니다. SQLite shell에서 .timer 걸고 돌려보니 0.001초가 채 안 됩니다(......).

추가로 더 시험을 해 본 결과, SQLite에서는 멀티스레드 지원 옵션이 추가된 상황에서 데이터 유실을 완벽하게 방어하지는 못하고, std::mutex 같은 별도 외부 지원을 추가해야 되더군요. 아마 fread() call이 멀티스레드 상황에서의 serialization을 제대로 지원하지 않지 않는 것 아닐까 하고 추측하고 있습니다만, 거기까지 추적하기엔 시간도 없거니와 능력도 안 되어서...... :P

하여간, 지금은 SQLite + WAL mode + 내부 캐시 증량 + std::mutex 조합을 사용하고 있습니다. 쓰기 속도는 아직 충분하다고 여겨집니다만, 만일 더 필요하다면 파일을 여러개로 늘린 뒤에 non-cryptographic hash로 load balancing을 하면 될 것 같습니다.

Wednesday, November 18, 2020

Strangely beautiful SQL UPDATE (PostgreSQL 12)

When you need to fill data to null fields of multiple records, you had better avoid UPDATE. Rather:

Prepare for a table to insert all the data to fill
Use SELECT ~ OUTER JOIN to join with the original table
And prepare for the "result" table to push the JOINed results via INSERT INTO.

I found out that I could finish the job in 5:35 when traditional UPDATE didn't finish in 10 minutes.

Tested under PostgreSQL 12.

이상하고 아름다운 SQL UPDATE (PostgreSQL 12)

대량의 레코드를 대상으로 빈 필드에 데이터를 추가로 넣는 작업을 할 경우, UPDATE로 각 레코드를 찾아서 갱신하는 것보다

수정할 데이터를 삽입할 테이블을 하나 더 생성해서 추가 데이터를 여기에 넣은 뒤에
SELECT ~ OUTER JOIN으로 원래 테이블과 합치고
결과를 INSERT INTO로 별도 테이블로 밀어넣는게 훨씬 빠릅니다.

UPDATE로 10분 넘게 걸려도 안 끝나던게 이 방법을 쓰니까 5분 35초만에 끝나네요. -o-;

PostgreSQL 12로 테스트해봤습니다

Saturday, September 19, 2020

Short thoughts about PostgreSQL(+a few lines for SQLite)

For a few days ago, I started to use PostgreSQL. Current development is not using database as it becomes I/O bottleneck, so I merely write down everything from RAM to disk at once(and additionally it is compressed in LZ4 to shrink write time). The problem was, that RAM usage skyrocketed as target data becomes TBs. Just guess what you feel when you encounter out of memory after 10 minute operation. What's worse, since I was working on portable server, the total amount of RAM was limited to certain degrees, not like rack-type servers.

So, I changed the mind model - use DB like as something like swap file! Currently the process is processing all the data on robin hood hash, but if we can control them under a database table and remove that table after use, we can save RAM dramatically(well, at least out of memory won't be shown if we limit working memory for DB).

At first, I tried SQLite. It consumes so low on RAM so that even Android and iOS can use, and before that I developed a class optimized for bulk insert with the engine. However, soon I found that it consumes practically infinite time when the query is even slightly complicated. When I tried an inner join query with two tables with 7 million records each, I gave up after 30 minutes. And it couldn't process in parallel since the engine is working only on single thread.

Later, I found out that SQLite uses only nested loop on table joins. If we're working on previous case(two tables with 7 million records each), we need to loop for 49 trillion times. And it should work on single thread. Oh God.

As I was cornered and forced to use "ordinary" database, my choice was MySQL, MariaDB, and PostgreSQL. MySQL was dropped at early stage as no tweak and tuning seems to have worked as I intended, so my choice became more limited. And I chose PostgreSQL for its a few exceptional features:

It's not snail-slow as people say. Instead, if you tweak just few, it "flashes." No, I can't agree with the words "it's slower than MySQL or MariaDB." When you're working hard to tweak MySQL/MariaDB to maximize its performance with your complicated tweak but if you don't do same amount of efforts and try PostgreSQL just out-of-box, well, it's not fair. Isn't it?
Parallel query processing! If you send a query the scheduler "smartly" judges and multi processing is done as needed. For my inner join of two tables with 7 million records, it consumed 3 cores and gave result in 1 minute 30 seconds.
It's simply reliable in any bad condition. After stress test, PosgreSQL survived and gave result when MariaDB service was restarted

Currently I'm using it as if it's some kind of hashmap. Well, maybe that's because I previously worked with them. Thinking that the basis of database indexing is hashmap, it may be one of the best practice for any SQL based database. Additionally, my coding became simplified since database does most of the things automatically......

Now I agree with the slogan, The World's Most Advanced Open Source Relational Database. Keep up the good work, PostgreSQL Global Development Group!

PostgreSQL에 대한 주절주절(+SQLite 이야기 약간)

얼마 전부터 PostgreSQL을 사용하기 시작했습니다. 현재 개발중인 시스템은 I/O 속도 문제때문에 DB를 쓰지 않고 RAM 구조를 통채로 디스크에 복사해뒀다가 필요할때 다시 읽어들이는 방식을 쓰고 있는데(그리고 그 와중에 디스크 쓰기 시간을 줄이기 위해 데이터를 LZ4로 미리 압축해 놓습니다), 처리하는 데이터의 양이 TB급으로 늘어나기 시작하면 RAM 사용량이 기하급수적으로 늘어나더군요. 10분정도 작업 수행하다 말고 out of memory를 만나면 참 난감해지더군요. 게다가 프로그램이 돌아가는 환경이 포터블 서버라 랙타입 서버처럼 RAM을 무식하게(......) 꽂을 수도 없는 일이고......

해서, DB를 swap file처럼 쓰는 방식을 생각해냈습니다. 지금은 모든 데이터를 robin hood hash에 올려서 처리하는데, 그 대신 DB에 필요한 데이터를 넣어두고 결과를 처리한 후 처리가 끝나면 테이블을 삭제하는 형태로 쓰면 RAM을 획기적으로 절약할 수 있겠다는 결론을 내렸습니다(아니 working memory를 제한하면 최소한 out of memory는 안 내놓겠죠 -_-).

처음에는 SQLite를 적용해 보았습니다. Android와 iOS에서 사용할 정도로 RAM 사용량이 적고, 마침 예전에 SQLite를 이용해서 bulk insert를 할 수 있는 별도의 클래스를 만들어 두기도 했거든요. 하지만 SQLite는 쿼리가 조금만 복잡해져도 수행시간이 거의 무한하게 늘어지는(......) 문제점이 있었습니다. 레코드 7백만개짜리 테이블 두 개에 대해서 간단한 inner join을 수행해 보니, 수행을 시작한지 30분이 지나도 결과가 나오지 않더군요. 게다가 single thread에서 동작할 수밖에 없는 특성상 처리를 병렬로 수행할 수도 없었고......

왜 그러나 확인해보니, SQLite는 table join에서 nested loop만을 사용하더군요. 7백만 개짜리 테이블 두 개라면 loop를 49경번(......)을 돌아야 된다는 이야기죠. 그것도 싱글 스레드로. '이건 아니지 않느냐'는 생각에 깔끔하게 포기했습니다.

결국 본격적인(?) DB를 적용할 수밖에 없게 된 상황에서, 선택지는 MySQL, MariaDB, PostgreSQL의 세 가지 정도로 좁혀지더군요. 개중에 MySQL은 아무리 튜닝을 해도 제가 원하는데로 움직여주지 않아서 일찌감치 탈락(-_-)시키고, MariaDB와 PostgreSQL을 비교해본 뒤 PostgreSQL을 선택했습니다. 이유가 몇가지 있는데......

굼뜨다는 세간의 평가와 달리, 설정을 조금만 건드려주면 말 그대로 날아다닙니다. 솔직히, MySQL이나 MariaDB보다 느리다는 말에는 동의하지 못하겠습니다. MySQL/MariaDB에서는 InnoDB 속도 향상을 위해서 엄청나게들 tweak을 하시면서 PostgreSQL에서는 그런 작업을 하지 않으면서 느리다는 평가를 내리는건 어불성설일 겁니다
단일 쿼리를 병렬로 수행합니다. 쿼리를 보내면 스케줄러가 알아서 판단해서 필요에 따라 멀티프로세싱으로 작업을 수행합니다. 제가 수행했던 저 7백만개 레코드 테이블 두 개의 inner join은 core 3개를 점유한 뒤 1분 30초만에 답을 내놓더군요.
그 어떤 악조건에서도 의지가 됩니다. 제가 부하 테스트를 여러번 해본 결과, MariaDB에서 프로세스가 터져버리는 상황에서도 PostgreSQL은 끝까지 기다려주면 무조건 결과를 내놓았습니다

원래 hashmap에서 작업을 수행해서 그런지는 모르겠지만, 지금도 DB를 hashmap 비슷하게 사용하고 있습니다. 뭐, DB indexing의 기본이 hashmap인걸 생각해보면 적절한 용례일지도 모르겠네요. 그 외에, DB가 상당부분을 알아서 처리해주는 터라 프로그램의 주 business logic이 간편해진 것은 덤이고......

이래저래, The World's Most Advanced Open Source Relational Database라는 슬로건에 부합하는 코끼리 파워를 느껴보았습니다. 앞으로도 계속 승승장구하길 기원합니다.

Thursday, September 13, 2018

Choosing a database

Nowadays I'm developing a new product and I'm stuck with choosing the "right" database.
Well, today we have a wide variety of choices including NoSQL, but too many choices makes me more difficult on selection. I have to consider more things, and I have no idea on how the users will use the product. In my situation I concluded NoSQL is not the way so I'm considering three among SQL engines, but all the candidates bite me on some area.

SQLite: unlike what others think, SQLite has no problems at all on using in production environment. It's stable and FAST. Especially, it's so "lightning" fast that the spiral booster is always turned on(c.f. Cyber Formula). In almost all cases it overwhelms the others. However, it doesn't support concurrent write. To do it, I have to make a wrapper to do it for the database. OTL
PostgreSQL: magnificent. I dare to say it's the best in the world of open source. Though it is based on SQL(I know it's ORDB, but anyway) yet supports NoSQL too. The problem is it's damn sloooooooooooooow like snail and consumes a lot of resources. I can tune it to make it faster but still it lacks some speed.
MariaDB: Fast. I see the speed on most of SQL commands I use. If I use Aria with some setup, in some (limited) conditions it can be faster than SQLite. However, it sometimes hits my back hard. Pretty hard. Even just now, I found a 2-second query suddenly took 11 minutes. And now it takes 42 seconds. Hey you seal?

From the viewpoint of a developer, speed is important, but the more important is whether it's foreseeable or not. If it's slow I can tune it hard, and if it's still slow I can make the application to do another thing concurrently when doing slow stuff, but if the performance is that random, simply I can't use it.

Well, I think I have to ride the elephant again and drop my belief to the Holy Mother.

(or making with the database for missiles......)

데이터베이스 선택의 문제

작금 제품을 개발하면서 가장 커다란 과제중 하나가 데이터베이스 선택입니다.

뭐, 요즘은 NoSQL 포함해서 다양한 상황에 부합하는 여러 제품들이 나와있습니다만, 그렇기 때문에 더 선택이 어려워지기도 합니다. 고려해야 할 사항도 많을 뿐더러, 사용자가 실제로 이 제품을 어떻게 쓸지도 솔직히 예측이 쉽지 않기 때문에 더 그렇습니다. 저같은 경우는 NoSQL은 아직 아닌 것 같아서 SQL들 중에 고르고 있는데...... 이게 난감하네요. 현재 세 가지를 고려하고 있는데, 셋 다 뭔가 하나씩 부족합니다.

SQLite: 많은 사람들의 생각과는 다르게, SQLite를 제품에 적용하는데는 아무런 문제가 없습니다. 안정적이고, 빠릅니다. 속력만 놓고 보면 스파이럴 부스터를 항시 켜놓은 느낌입니다. 모든 경우에서 밑의 두 개를 말 그대로 '발라버립니다.' 그런데 concurrency 지원이 안됩니다. 굳이 하려면 직접 만들어야 해요. OTL
PostgreSQL: 오픈소스계 최강입니다. SQL이면서 NoSQL 기능까지 같이 들어있는...... 단점이라면 (특히 쓰기가) 거북이처럼 느리고, 자원 소모가 상당하다는 겁니다. 튜닝을 좀 하면 빨라지긴 하는데, 그래도 제가 원하는 수준까지 끌어내기에는 부족합니다. OTL
MariaDB: 빠릅니다. 주로 사용하는 구문이 빠르게 움직입니다. Storage engine을 Aria로 놓고 설정을 좀 해놓으면 특정 상황에서는 SQLite보다 더 빨라지기도 합니다(......). 단점이라면 가끔 예측 불가능한 행동을 한다는 거죠. 방금전에도 2초면 끝나던 query 하나가 갑자기 11분 넘게 뻘뻘대더니만 지금은 인풋 한번 수행시간이 평균 42초가 되었습니다(......OTL).

무언가를 만드는 사람의 입장에서는 속도도 중요하지만 더 중요한건 예측 가능성인 것 같습니다. 만일 뭔가 느리다면 속도를 최대한 끌어올리고, 그래도 안되면 느린 작업을 하는 동안에 다른 작업을 수행하게 만들면 되는데, 작업이 저렇게 예측 불가능하게 들쭉날쭉하게 움직이면 답이 안 나오더군요.

뭐, 아무래도 믿었던 성모님으로의 신앙(?)을 버리고 다시 코끼리 등 위에 올라타야 될 것 같습니다.

(아니면 미사일 DB로 뚝딱뚝딱 만들거나......)

Saturday, July 28, 2018

DB insertion speed test

To be used in my work, I tested which database inserts the records the fastest. Well, actually it was not much - I just created a table with integer(PK)/timestamp/integer and inserted 1 million records with only PK changing.

Benchmark environment is as follows:

Dell Inspiron 7373/i5 8th generation
Windows 10 Home
Target database(all 32bits)

PostgreSQL 10.6: used libpq
FIrebird 3.0.3: used OO API (C++)
SQLite3: #include <sqlite3.h> // ...... :P

And here comes the results and some remarks. I hope it would help visitors.
(I'm not that much good enough to make something open sourced, so this is the best I can offer to the community...... ;) )

Insertion speed

1: SQLite, in-memory database: 2 seconds
2: PostgreSQL, COPY: 5 seconds
3: PostgreSQL, bulk insert: 7 seconds
4: Firebird, bulk insert: 10 seconds
5: SQLite, save to file: 56 seconds

Disk I/O

Thanks to Windows kernel, if the system receives too many small I/Os, it burdens disk too much, making it to the bottleneck
When using SQLite to save the result to disk where bulk insertion is impossible, disk I/O hit 100%
In Firebird, bulk insertion is implemented via PSQL, where you send a number of records and the each INSERT line in PSQL sequentially inserts the records; still it burdens the disk I/O very high(around 50%), though lower than SQLite
In Firebird, set the page size to at least 8192 to lessen disk I/O burdens

Bulk insertion

We can speed up the insertion using bulk insertion, but we have some speed bottleneck, regardless of disk I/O.
Firebird: 10 records at once, regardless of page size

Using stored procedure

PostgreSQL: 40 records show the maximum performance

PostgreSQL

COPY(source: CSV) is incredibly fast. Even the official documentation recommends COPY in bulk insertion
In libpq, the speed is around same whether you send the records in either text or binary

Miscellany

It's far faster to use the official interface directly rather than using any wrapper. When testing Firebird at first, with SOCI it took 59 seconds, while after applying the same logic to Firebird OO API directly the time taken is shrunk to only 20-something seconds.

That's all. Please let me know if you have any questions or comments. =_=/

DB insertion 속도 테스트

업무상 필요로 인해 어느 DB가 record insertion을 제일 빠르게 하는지를 테스트하게 되었습니다. 뭐 대단한걸 한건 아니고, integer(PK)/timestamp/integer로 구성된 테이블에서 PK값만 바꾸고 나머지는 고정값으로 해서 1백만개 레코드를 일괄 등록하는 프로그램을 만들어 돌려봤습니다.

대상 환경은 이렇습니다

Dell Inspiron 7373/i5 8세대 모델
Windows 10 Home
대상 DB(모두 32비트 모델)

PostgreSQL 10.6: libpq 사용
FIrebird 3.0.3: OO API (C++) 사용
SQLite3: #include <sqlite3.h> // ...... :P

결과 및 시사점을 정리합니다. 도움이 되셨으면 합니다.
(아직 뭔가를 만들어 open source로 만들어 내놓을 정도의 실력이 되지는 않으니 이런거라도......)

Insertion 속도

1위: SQLite, in-memory database: 2초
2위: PostgreSQL, COPY: 5초
3위: PostgreSQL, bulk insert: 7초
4위: Firebird, bulk insert: 10초
5위: SQLite, 파일에 저장: 56초

Disk I/O 관련

Windows 커널 특성상 small I/O가 많아지면 disk에 부담이 너무 많이 가 bottleneck이 됨
Bulk insert가 불가능한 SQLite 파일 저장 조건에서 disk I/O가 100%가 됨
Firebird의 경우, PSQL을 사용하여 bulk insert를 구현하는 형태로, 다수의 레코드를 한번에 받아 디스크에 개별 INSERT 명령어를 사용하여 순차적으로 쓰는 형태로 구현되므로, SQLite급까지는 아니더라도 disk I/O가 높게 일어남(disk I/O 50%)
Firebird의 경우 page size를 최소 8192이상으로 가져가는 것이 disk I/O 부하 감소에 도움이 됨

Bulk insertion 관련

Bulk insertion을 이용하여 insertion 속도를 올릴 수 있으나, 일정 수준 이상을 넘어서면 속도가 더이상 올라가지 않음. 이는 disk I/O와는 별개임
Firebird: 10개 수준에서 최고속력을 보임(page size와 상관없음)

Sotred procedure

PostgreSQL: 40개 수준에서 최고속력을 보임

PostgreSQL 관련

COPY(원본: CSV 기반)가 규격외로 빠른 속력을 보임. 실제로 공식 메뉴얼에서도 bulk insertion에서는 COPY를 추천함
libpq에서 insertion 데이터를 보낼때 데이터를 text와 binary 중 어느 형태로 보내도 소요 시간은 동일함

기타

각 DB에서 제공하는 직접 연결 인터페이스를 사용하는 것이 wrapper를 사용하는 것보다 훨씬 빠름. Firebird로 최초 테스트시, SOCI 적용시 59초가 소요되었으나 Firebird OO API로 동일 로직 구현시 20초대에서 완료됨

이정도입니다. 혹시 궁금하신 점이 있으시면 문의 주세요. =_=/

Pages

Thursday, March 7, 2024

Wednesday, November 18, 2020

Saturday, September 19, 2020

Thursday, September 13, 2018

Saturday, July 28, 2018