As this is going to be a long question, I will start at the end.
I have a script, with a few custom modules I have written, that throws the following error after running for about 2 minutes.
<--- Last few GCs --->
[1:0x789e000] 160886 ms: Scavenge (interleaved) 967.1 (992.5) -> 963.6 (994.7) MB, pooled: 0 MB, 30.27 / 0.00 ms (average mu = 0.354, current mu = 0.353) allocation failure;
[1:0x789e000] 161486 ms: Mark-Compact (reduce) 972.1 (997.6) -> 962.3 (980.8) MB, pooled: 0 MB, 12.50 / 0.00 ms (+ 515.7 ms in 0 steps since start of marking, biggest step 0.0 ms, walltime since start of marking 588 ms) (average mu = 0.352, current mu
<--- JS stacktrace --->
FATAL ERROR: Ineffective mark-compacts near heap limit Allocation failed - JavaScript heap out of memory
----- Native stack trace -----
1: 0xe2f0f2 node::OOMErrorHandler(char const*, v8::OOMDetails const&) [node]
2: 0x1240cf0 v8::Utils::ReportOOMFailure(v8::internal::Isolate*, char const*, v8::OOMDetails const&) [node]
3: 0x1240fc7 v8::internal::V8::FatalProcessOutOfMemory(v8::internal::Isolate*, char const*, v8::OOMDetails const&) [node]
4: 0x1470715 [node]
5: 0x1470743 [node]
6: 0x14897fa [node]
7: 0x148c9c8 [node]
8: 0x1e25421 [node]
I realise that the first thought is to increase the memory with something like NODE_OPTIONS=--max_old_space_size=2048, however this error happens after inserting 59,000 records into my database, and the number of records I need inserted is a whopping 2.5 million, so I do not think increasing the memory is going to work, unless I gave it 1 TB of memory.
What and Why?
I am connecting to an accommodation provider, who has an API, and their system asks that all hotel searches are performed by the end user, and their API is only responsible for giving single hotel live rates and availability once the user has chosen from the 25 search results provided on my site.
So they give me an API endpoint that passes back a URL to an AWS server where I can download a compressed file holding all 2.5 million hotels they manage rates and availability for. I am meant to download the 2 GB .zst file, decompress it back to a 28 GB .jsonl file, and use the data it contains to set up my own hotel search dashboard with filtering etc.
Once the user has chosen a hotel to view, that is when I start pulling single-hotel live rates and room availability from their live API endpoints.
My Solution
My server is a DigitalOcean Docker instance with 4 GB memory, a 100 GB disk, and 2 vCPUs, running Node.js.
I have a server action that looks like this:
- Call their API and get the download link
- Download the file and add the .zst file to my Docker volume using a custom module from Hyperbytes
- Decompress the .zst into a .jsonl file with my own custom module, which takes about 4.5 minutes and adds it to the same path as the .zst (a simplified sketch of this step is shown after this list)
- Add the .jsonl file location to a database table as a single row
- Another of my custom modules then reads the entire .jsonl file line by line and stores the byte offsets of all 2.5 million lines in a single array (see the indexing sketch after this list)
- Database Bulk Insert step takes that array and adds all 2.5 million rows to a second database table, in 34 seconds
- Custom Database Query outputs the first row with a byteStart of 0 and then every 300th row's byteStart, which returns 8620 byteStart integers
- Repeat the following two steps for each of those 8620 byteStart values:
- Custom module by myself and Hyperbytes uses
fs.createReadStream(load_path, {start: startBytes, end: endBytes});
so instead of reading the entire 28 GB .jsonl file it only reads about 30 MB and fetches 300 lines, then adds each line to an empty array (see the chunk-read sketch after this list)
- Database Bulk Insert to a third table which stores one line per row in its original JSON format
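For context, I don't want to paste the whole custom modules, but here are simplified sketches of the interesting steps. First the decompress step; this is not the actual module code, just a minimal version that assumes the zstd CLI binary is installed in the container and shells out to it:

```js
// Simplified sketch of the decompress step, not the actual module.
// Assumes the `zstd` CLI is installed in the Docker image.
const { spawn } = require('child_process');

function decompressZst(zstPath, jsonlPath) {
  return new Promise((resolve, reject) => {
    // zstd -d <input> -o <output> streams the decompressed data to disk,
    // so the 28 GB file never has to fit in memory
    const child = spawn('zstd', ['-d', '-f', zstPath, '-o', jsonlPath]);
    child.on('error', reject);
    child.on('close', (code) =>
      code === 0 ? resolve(jsonlPath) : reject(new Error(`zstd exited with code ${code}`))
    );
  });
}
```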
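The module that builds the byte index is roughly this (trimmed down, and it assumes the .jsonl uses plain \n line endings so each line occupies Buffer.byteLength(line) + 1 bytes on disk):

```js
// Trimmed-down sketch of the indexing module.
const fs = require('fs');
const readline = require('readline');

async function buildLineIndex(jsonlPath) {
  const rl = readline.createInterface({
    input: fs.createReadStream(jsonlPath),
    crlfDelay: Infinity,
  });

  const index = [];
  let byteStart = 0;
  for await (const line of rl) {
    const byteLength = Buffer.byteLength(line, 'utf8') + 1; // +1 for the \n
    index.push({ byteStart, byteLength });
    byteStart += byteLength;
  }
  return index; // ~2.5 million { byteStart, byteLength } entries for the bulk insert
}
```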
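And the repeat step that eventually blows up looks more or less like this (again a sketch with made-up names; startBytes is one of the 8620 byteStart values and endBytes is the byteStart of the row 300 lines further on):

```js
// Sketch of the chunk-read step. `end` in createReadStream is inclusive,
// so stop one byte before the next chunk starts.
const fs = require('fs');

function readChunkLines(load_path, startBytes, endBytes) {
  return new Promise((resolve, reject) => {
    const stream = fs.createReadStream(load_path, { start: startBytes, end: endBytes - 1 });
    const chunks = [];
    stream.on('data', (chunk) => chunks.push(chunk));
    stream.on('error', reject);
    stream.on('end', () => {
      const lines = Buffer.concat(chunks)
        .toString('utf8')
        .split('\n')
        .filter(Boolean); // drop the empty string after the trailing \n
      resolve(lines); // ~300 raw JSON strings, handed to the Database Bulk Insert
    });
  });
}
```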
And that is basically the entire flow. When I run this it is very fast, and in roughly 2 minutes I have 59,000 records inserted into my third table, but then it runs out of memory and dies with the error message above, which means it has only run the repeat step 196 times out of the 8620 times it needs to run to complete the task.
So the question is, how do I overcome this? Is there a way to release the memory once each chunk has been written to the database, or can I use some kind of Bull queues from @mebeingken to try to split this up somehow? I'm just not sure what to do from here, considering it's working, sort of, lol.