Just want to get some feedback on best practices for database row deletions and performance.
I have a hotel dump of 2.5 million hotels that I get every 7 days. There is also a daily incremental dump, which can contain more than 100k hotel updates by days 5 and 6.
When I get this incremental dump I need to delete the 100k rows from the hotels database table by ID, and then insert the new 100k rows.
I tried a repeat that deletes one row at a time, and it took about an hour.
I tried using an IN operator, and surprisingly that took even longer.
I tried a parallel solution and ended up with a database lock error.
I also tried updating instead of deleting, but that was also one row at a time and the slowest of all.
Any suggestions would be great.
I insert the new 100k rows with the bulk insert step in about a minute; it's like an F1 car.
For answers to questions like this, I believe an LLM is very useful indeed, so I can't take credit for this one. Although Prompt Engineer is now apparently a job...
ChatGPT pumped this out:
Deleting hundreds of thousands of rows in MySQL efficiently can be challenging, but here are some strategies you can use to optimize the process:
Use Batches: Instead of deleting all rows at once, delete in smaller batches (e.g., 1,000 to 10,000 rows at a time). This reduces lock contention and allows the database to maintain performance. You can do this with a loop in your application or using a stored procedure.
-- Note: a WHILE loop like this only runs inside a stored procedure or event body, not as a plain script
SET @rows_affected = 1;
WHILE @rows_affected > 0 DO
    DELETE FROM your_table WHERE condition LIMIT 10000;
    SET @rows_affected = ROW_COUNT();
END WHILE;
Use LIMIT with DELETE: If you can define a condition (like an ID range), use the LIMIT clause in your DELETE statement to control how many rows are deleted at once.
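As a rough sketch, assuming a hotels table with an integer id primary key (the range and batch size below are placeholders):
DELETE FROM hotels
WHERE id BETWEEN 1000000 AND 1100000
LIMIT 10000;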
Disable Foreign Key Checks Temporarily: If your table has foreign keys, disabling them can speed up the delete operation, but be careful with this approach as it can lead to data integrity issues.
SET FOREIGN_KEY_CHECKS = 0;
-- Perform your deletes here
SET FOREIGN_KEY_CHECKS = 1;
Disable or Adjust Indexes: Consider dropping indexes before the delete operation and recreating them afterward if the table is heavily indexed. This can significantly speed up the delete process.
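For example, assuming a secondary index named idx_hotels_city (both names are placeholders for whatever secondary indexes the table actually carries):
ALTER TABLE hotels DROP INDEX idx_hotels_city;
-- ... run the bulk delete ...
ALTER TABLE hotels ADD INDEX idx_hotels_city (city);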
Use a Transaction: If you're deleting rows that are interrelated, wrap your deletes in a transaction to ensure data consistency and possibly speed up the operation.
START TRANSACTION;
-- Your delete operations here
COMMIT;
Consider Partitioning: If you're frequently deleting large amounts of data from a table, consider partitioning it. This allows for easier management of large datasets.
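As an illustration only, assuming id is the primary key (the ranges are made up, and partitioning an existing 2.5M-row table rebuilds it, so test offline first). The main payoff is that a whole slice can later be removed almost instantly with DROP PARTITION:
ALTER TABLE hotels
PARTITION BY RANGE (id) (
    PARTITION p0 VALUES LESS THAN (1000000),
    PARTITION p1 VALUES LESS THAN (2000000),
    PARTITION pmax VALUES LESS THAN MAXVALUE
);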
Analyze and Optimize: After deleting large amounts of data, run ANALYZE TABLE and OPTIMIZE TABLE to help improve performance and reclaim space.
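For a table named hotels that would simply be:
ANALYZE TABLE hotels;
OPTIMIZE TABLE hotels; -- on InnoDB this recreates the table to reclaim space, so it can take a while on 2.5M rows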
Monitor Performance: Keep an eye on system performance and adjust the batch size and method based on the impact observed during the delete operations.
Always make sure to back up your data before performing mass deletions and test your queries in a safe environment first!
Thank you so much, that gives me some ideas to try out, appreciate it.
Lol, I added an index to make it faster, haha, bad idea it seems.
I am going to try the small batch loop plan in parallel; that sounds like a good plan to me because my table is pretty self-contained with no foreign key relationships.
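Roughly the shape I have in mind: bulk load the 100k IDs from the incremental dump into a small staging table (incoming_hotels is just a made-up name here) and repeat this delete until ROW_COUNT() returns 0:
DELETE FROM hotels
WHERE id IN (SELECT hotel_id FROM incoming_hotels)
LIMIT 10000;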
Let us know how things turn out and which steps give you the best performance.
Been a while since I managed a DB, but here is my high-level approach. The safest option would be to disconnect the DB from the app first, eliminating any contention for data, locked rows, or dirty reads. Then...
Make sure the indexes on the 2.5M row table are only what your app needs to be performant. If the hotel ID is not the primary key, make sure that column has an index. Scrutinize any further indexes on that table, since every delete has to update all of the table's indexes. Drop all indexes except the one needed to quickly find the 100k records.
If the ID does not change when applying the updated 100k rows, then you will probably be safe turning off FOREIGN_KEY_CHECKS.
Also, if you can disconnect the DB from the app AND the rows are not interrelated, you don't need to worry about running the delete loop within a transaction. You just want to quickly get rid of the 100k rows and insert them with updated data. If you're doing this using Wappler, you should probably keep transactions on and build in error handling in case the transaction fails.
Finally, below is the sequence of events I would test (a rough SQL sketch of the whole sequence follows the list):
Drop indexes except for index on ID
SET FOREIGN_KEY_CHECKS = 0
Transaction start
Batch Delete (loop or no loop, test performance)
Transaction commit
Insert 100k rows
SET FOREIGN_KEY_CHECKS = 1
Rebuild dropped indexes
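Roughly, in SQL terms (table and index names are placeholders, and the batched delete itself would be one of the loop patterns already shown above):
ALTER TABLE hotels DROP INDEX idx_hotels_city;  -- keep only the primary key / index on ID
SET FOREIGN_KEY_CHECKS = 0;
START TRANSACTION;
-- batched delete of the 100k IDs goes here
COMMIT;
-- bulk insert of the replacement 100k rows goes here
SET FOREIGN_KEY_CHECKS = 1;
ALTER TABLE hotels ADD INDEX idx_hotels_city (city);  -- recreate whatever was dropped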
Some databases' write performance is amazing, others' is not. You may find more performance by batching the writes, or by looking into the configuration settings of your chosen DB for improving write performance.
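For MariaDB/InnoDB in particular, two settings worth reading up on are the buffer pool size and the redo-log flush policy. The value below is only an example, and setting it to 2 trades a little crash safety for import speed:
-- flush the redo log roughly once per second instead of on every commit
SET GLOBAL innodb_flush_log_at_trx_commit = 2;
-- innodb_buffer_pool_size is usually set in my.cnf and sized to the available RAM on the import server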
I am actually doing all this on a completely separate database server, as the 2.5 million records are 28GB in their original JSONL file, and when inserted into the MariaDB database as a single InnoDB table they take 35GB in the DB.
So everything happens on this interim database server, and then I insert or update the primary database with only the data I need when stripping apart the single JSON line. This JSON line can be as long as 4 million characters, or as short as 1500 characters, depending on how much data is available for that single hotel.
The primary database this JSONL line populates is probably only using a third or less of the data provided.
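In case it helps picture it, the stripping apart is conceptually just pulling a handful of fields out of each stored JSON line; something along these lines (the raw_dump table and the JSON paths are made up for illustration, and the actual move to the primary server happens through the app rather than a cross-server query):
SELECT
    JSON_UNQUOTE(JSON_EXTRACT(raw_json, '$.id'))            AS hotel_id,
    JSON_UNQUOTE(JSON_EXTRACT(raw_json, '$.name'))          AS hotel_name,
    JSON_UNQUOTE(JSON_EXTRACT(raw_json, '$.address.city'))  AS city
FROM raw_dump;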
Wow!! This is insane... and here I'm worried about 5K records, haha, this is just another level. First of all, WELL DONE in managing such a large amount of data and getting it to work, and secondly, well done in always pushing the limits. Thanks for the inspiration Paul...
I would love to, because I can currently bulk insert 2.5 million rows in half the time it takes to delete 100k rows by ID. But because I am supplied a full hotel dump every Monday, and then on Tuesday they send an incremental dump which may only have 15k records, and on Wednesday I get another incremental dump with Tuesday's 15k + Wednesday's 20k, etc., truncating would leave me with only 15k hotels on Tuesday instead of 2.5 million hotels of which 15k have slight changes.
Hope I am sort of making sense with that strange wording
Thanks Brian, I will give that a try for sure. I did not even think that was possible, but seeing the syntax it does make some sense. Will report back with findings.
Thanks Apple, I have not tried it yet, and it's great timing because I am busy working on that functionality at the moment, so wonderful news. Do you possibly have a quick link to it?