Import huge CSV files into MySQL at high speed.

  • 24 July 2019

Importing a CSV file into a DB (MySQL) is a common task, and there is plenty of information out there about it. In my last article, I wrote about how to quickly write a large amount of data out as a CSV file; this time, I cover the opposite direction.

However, importing a CSV into a DB is subject to various constraints, so there is no single answer that is both the fastest and the most reliable. In this article, I ignore the server's upload_max_filesize and available memory.


Methods for importing CSV into MySQL

The methods covered in this article are as follows.

  1. Insert one record at a time (very slow)
  2. Bulk insert (fast)
  3. Insert in batches (a bit slow)
  4. LOAD DATA LOCAL INFILE Insert (fastest)

You also need to take into account whether your environment actually allows each method to be used.

In addition, I am using PHP (Laravel) here on the assumption that the import runs inside a framework, but insert speed is ultimately a DB matter, so the results are essentially the same regardless of framework or language.


Very slow! Insert one record at a time.

This is naturally the slowest method, as it takes a tremendous amount of time to fetch the CSV data and issue an INSERT statement for each row.

$file = new \SplFileObject($path);
$file->setFlags(\SplFileObject::READ_CSV);

foreach ($file as $row => $line) {
    Hoge::create([
        'huga' => $line[0],
        ...
    ]);
}

The processing is somewhat heavier than a plain INSERT because it goes through Laravel's Model::create, but the real problem with this kind of one-row-at-a-time INSERT is that it wastes resources: every row pays the cost of its own query.

I tried to import the test data (a CSV file with 300,000 lines) using this method, but it took more than 5 minutes and the request timed out.


BULK INSERT is fast!

A bulk INSERT is a way to insert multiple rows in a single query.

This is naturally fast. Also, since it is a single query, there is no need to wrap it in a transaction.

$csvData = [
    ['huga' => 'hoge', ...],
    ...
];
DB::table('hoge')->insert($csvData);

I thought this would be the best way to handle it, but bulk INSERT has problems as well.

In the case of MySQL, there is an upper limit on the size of a single query, controlled by the "max_allowed_packet" setting. A commonly configured value is 16MB.

https://qiita.com/_am_/items/91824da643256d46b847

If you try to INSERT the 300,000-row CSV file I tested with in a single query, the data size of the SQL statement easily exceeds 16MB.
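As a quick check, you can look up the limit actually configured on your server (a minimal sketch; the 16MB figure above depends on your configuration):

// Check the configured limit before deciding how big a single INSERT can be
$packet = DB::selectOne("SHOW VARIABLES LIKE 'max_allowed_packet'");
echo $packet->Value; // size in bytes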

If you have a huge amount of data, it is impossible to INSERT it all in one query. When the amount of data is not known in advance, the batch processing described next is the most common approach.


Compromise with batch INSERT processing?

INSERTing one row at a time is too slow, and INSERTing everything in one query runs into the data-size limit. In general, it is better to process the data in batches that combine the two approaches (e.g. one INSERT per 2,000 records).

$file = new \SplFileObject($path);
$file->setFlags(\SplFileObject::READ_CSV);
$cnt = 0;
$insertData = [];
foreach ($file as $row => $line) {
    $cnt++;
    $insertData[] = [
        'huga' => $line[0],
        ...
    ];
    if ($cnt >= 2000) {
        DB::table('hoge')->insert($insertData);
        $cnt = 0;
        $insertData = [];
    }
}
// Don't forget the rows left over after the loop (fewer than 2,000)
if ($insertData !== []) {
    DB::table('hoge')->insert($insertData);
}

In most cases batch processing is the way to go, since it can INSERT a CSV of any size while keeping the processing time realistic.

However, with my verification data (300,000 rows), it still took about 2 minutes to finish all the queries. Note that the code above does not include transaction processing.
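If the import has to be all-or-nothing, one option (a sketch of my own, not included in the measurement above) is to wrap the batch loop in a transaction:

DB::beginTransaction();
try {
    // ... run the batch INSERT loop shown above here ...
    DB::commit();
} catch (\Throwable $e) {
    // Roll back every batch that was already inserted if anything fails
    DB::rollBack();
    throw $e;
}

Keep in mind that one huge transaction has its own overhead, so measure it in your own environment.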


Load the CSV file directly with LOAD DATA LOCAL INFILE

If you are not satisfied with the execution time of batch processing, you can consider having MySQL load the CSV file directly.

This seems to be the fastest INSERT method, faster than bulk insert and without the max_allowed_packet limitation.

$path = "test.csv";
$sql = <<< EOM
LOAD DATA LOCAL INFILE '$path' REPLACE INTO TABLE hoge 
CHARACTER SET utf8 FIELDS TERMINATED BY ',' 
OPTIONALLY ENCLOSED BY '"' 
IGNORE 1 LINES;
EOM;
DB::insert($sql, []);

The file is read as UTF-8, and the first line of the CSV (the header row) is skipped.

With this, I was able to INSERT the 300,000 rows in about 2 seconds. That's incredibly fast! Although only the file path appears in the query, with LOCAL the client reads the file and sends its contents to the DB server, which then processes them much like a bulk INSERT.
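Note that LOAD DATA LOCAL INFILE only works when both the MySQL server (local_infile) and the client allow it. In a Laravel project, one way to enable it on the client side (an assumption about your setup, not part of the verification above) is to set the PDO option in config/database.php:

// config/database.php, 'mysql' connection (sketch; adjust to your environment)
'mysql' => [
    // ... existing host, database, username, etc. ...
    'options' => [
        \PDO::MYSQL_ATTR_LOCAL_INFILE => true, // let the PDO client send local files
    ],
],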

Of course, it is impossible to use this method to INSERT a CSV file that is too large for the DB server to handle (100GB or so), and the process will fail if the memory capacity of the DB server is exceeded.

So, if you need to import a huge CSV file whose size is not known in advance, it is probably safer to insert it with batch processing. If you still want to pursue processing speed, LOAD DATA LOCAL INFILE is the better choice. It is useful, for example, when a user uploads a large CSV file that has to be inserted into MySQL before the response is returned and you cannot make the user wait for several minutes.


(Article migrated from another blog)
