Loading half a billion rows into MySQL (reposted)

Background

We have a legacy system in our production environment that keeps track of when a user takes an action on Causes.com (joins a Cause, recruits a friend, etc.). I say legacy, but I really mean a prematurely-optimized system that I’d like to make less smart. This 500m-record database is split across monthly sharded tables. That seems like a great solution to scaling (and it is) – except that we don’t need it. Based on our usage pattern (e.g. to count a user’s total number of actions, we have to query N tables; see the sketch after the goals below), this leads to pretty severe performance degradation. Even with a memcache layer sitting in front of the old month tables, new features keep discovering new N-query performance problems. Noticing that we have another database happily chugging along with 900 million records, I decided to migrate the existing system into a single-table setup. The goals were:

  • reduce complexity. Querying one table is simpler than N tables.
  • push as much complexity as possible to the database. The wrappers around the month-sharding logic in Rails are slow and buggy.
  • increase performance. Also related to one table query being simpler than N.
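
To make the N-query problem concrete, here is a rough sketch of the two access patterns. The table and column names are illustrative, not our actual schema:

  -- old setup: counting one user's actions touches every monthly shard
  SELECT SUM(cnt) FROM (
    SELECT COUNT(*) AS cnt FROM actioncredits_2012_01 WHERE user_id = 42
    UNION ALL
    SELECT COUNT(*) AS cnt FROM actioncredits_2012_02 WHERE user_id = 42
    -- ... one branch per month table
  ) AS monthly;

  -- new setup: one indexed table, one query
  SELECT COUNT(*) FROM actioncredits WHERE user_id = 42;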

Alternative Proposed Solutions

MySQL Partitioning: this was the most similar to our existing setup, since MySQL internally stores the data in different tables. We decided against it because it seemed likely that it wouldn’t be much faster than our current solution (although MySQL can internally do some optimizations to make sure you only look at partitions that could possibly have the data you want). And it’s still the same complexity we were looking to reduce (and it would also be the only database setup in our system using partitioning).
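
For reference, the partitioned version would have looked something like the sketch below. The column names here are made up for illustration, not a schema we actually ran:

  CREATE TABLE actioncredits (
    id         BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
    user_id    INT UNSIGNED NOT NULL,
    action     VARCHAR(64) NOT NULL,
    created_at DATETIME NOT NULL,
    PRIMARY KEY (id, created_at)
  ) ENGINE=InnoDB
  PARTITION BY RANGE (TO_DAYS(created_at)) (
    PARTITION p201201 VALUES LESS THAN (TO_DAYS('2012-02-01')),
    PARTITION p201202 VALUES LESS THAN (TO_DAYS('2012-03-01')),
    PARTITION pmax    VALUES LESS THAN MAXVALUE
  );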

Redis: not really proposed as an alternative, because the full dataset won’t fit into memory, but something we’re considering loading a subset of the data into to answer queries we run a lot that MySQL isn’t particularly good at (e.g. ‘which of my friends have taken an action’ is quick using Redis’s built-in SET UNION function). The new MySQL table might be performant enough that it doesn’t make sense to build a fast Redis version, so we’re avoiding this as a possible premature optimization, especially with a technology we’re not as familiar with.

Dumping the old data

MySQL provides the `mysqldump’ utility to allow quick dumping to disk:

  mysqldump -T /var/lib/mysql/database_data database_name

This will produce a tab-separated data file (plus a .sql file with the table definition) for each table in the database, and the data file is the format that `LOAD DATA INFILE’ will be able to quickly load later on.

Installing Percona 5.5

We’ll be building the new system with the latest-and-greatest in Percona databases on CentOS 6.2:

  rpm -Uhv http://www.percona.com/downloads/percona-release/percona-release-0.0-1.x86_64.rpm
  yum install Percona-Server-shared-compat Percona-Server-client-55 Percona-Server-server-55 -y

[ open bug with the compat package: https://bugs.launchpad.net/percona-server/+bug/908620]

Specify a directory for the InnoDB data

This isn’t exactly a performance tip, but I had to do some digging to get MySQL to store data on a different partition. The first step is to make sure your my.cnf contains a

datadir = /path/to/data

directive. Make sure /path/to/data is owned by mysql:mysql (chown -R mysql:mysql /path/to/data) and run:

mysql_install_db --user=mysql --datadir=/path/to/data

This will set up the directory structures that InnoDB uses to store data. This is also useful if you’re aborting a failed data load and want to wipe the slate clean (if you don’t specify a directory, /var/lib/mysql is used by default). Just

rm -rf *

inside the data directory and re-run the mysql_install_db command.

[* http://dev.mysql.com/doc/refman/5.5/en/mysql-install-db.html]

SQL Commands to Speed up the LOAD DATA

You can tell MySQL to not enforce foreign key and uniqueness constraints:

  SET FOREIGN_KEY_CHECKS = 0;
  SET UNIQUE_CHECKS = 0;

and drop the transaction isolation level to READ UNCOMMITTED:

  SET SESSION tx_isolation='READ-UNCOMMITTED';

and turn off the binlog with:

  SET sql_log_bin = 0;

And when you’re done, don’t forget to turn them all back on:

  SET UNIQUE_CHECKS = 1;
  SET FOREIGN_KEY_CHECKS = 1;
  SET SESSION tx_isolation='REPEATABLE-READ';

It’s worth noting that a lot of resources will tell you to use the “DISABLE KEYS” directive and have the indices all built once all the data has been loaded into the table. Unfortunately, InnoDB does not support this. I tried it, and while it took only a few hours to load 500m rows, the data was unusable without any indices. You could drop the indices completely and add them back later, but with a table this big I didn’t think it would help much.
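
If you want to experiment with that route anyway, the InnoDB equivalent is to drop the secondary indices before the load and add them back afterwards. A sketch, with a made-up index name:

  -- before the load
  ALTER TABLE actioncredits DROP INDEX index_actioncredits_on_user_id;

  -- after the load finishes
  ALTER TABLE actioncredits ADD INDEX index_actioncredits_on_user_id (user_id);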

Another red herring was turning off autocommit and committing after each `LOAD DATA’ statement. This was effectively the same thing as autocommitting, and manually committing led to `LOAD DATA’ slowdowns a quarter of the way in.
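
For completeness, this is roughly what that looked like per chunk (the file name is a placeholder); I’m including it only to show what not to bother with:

  SET autocommit = 0;
  LOAD DATA INFILE 'chunk_aa' INTO TABLE actioncredits;
  COMMIT;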

[http://dev.mysql.com/doc/refman/5.1/en/alter-table.html, search for “DISABLE KEYS”] [http://www.mysqlperformanceblog.com/2007/11/01/innodb-performance-optimization-basics/]

Performance adjustments made to my.cnf

  # http://dev.mysql.com/doc/refman/5.5/en/innodb-parameters.html#sysvar_innodb_flush_log_at_trx_commit
  # this loosens the frequency with which the data is flushed to disk
  # it's possible to lose a second or two of data this way in the event of a
  # system crash, but this is in a very controlled circumstance
  innodb_flush_log_at_trx_commit=2
  # rule of thumb is 75% - 80% of total system memory
  innodb_buffer_pool_size=16G
  # don't let the OS cache what InnoDB is caching anyway
  # http://www.mysqlperformanceblog.com/2007/11/01/innodb-performance-optimization-basics/
  innodb_flush_method=O_DIRECT
  # don't double-write the data
  # http://dev.mysql.com/doc/refman/5.5/en/innodb-parameters.html#sysvar_innodb_doublewrite
  innodb_doublewrite = 0

Use LOAD DATA INFILE

This is the most optimized path toward bulk loading structured data into MySQL. The manual section ‘8.2.2.1. Speed of INSERT Statements’ predicts a ~20x speedup over a bulk INSERT (i.e. an INSERT with thousands of rows in a single statement). See also ‘8.5.4. Bulk Data Loading for InnoDB Tables’ for a few more tips.

Not only is it faster, but in my experience with this migration, the INSERT approach slows down as the table grows and effectively never finishes (the last estimate I made was 60 days, and it was still slowing down).

The input file must be in the directory where MySQL stores that database’s data. If MySQL’s data directory is /var/lib/mysql, then mydatabase would be in /var/lib/mysql/mydatabase. If you don’t have access to that directory on the server, you can use LOAD DATA LOCAL INFILE. In my testing, putting the file in the proper place and using `LOAD DATA INFILE’ increased load performance by about 20%.
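
The two variants differ only in the LOCAL keyword; the file names and table below are placeholders:

  -- file lives in the server's database directory
  LOAD DATA INFILE 'actions.txt' INTO TABLE actioncredits;

  -- file lives on the client machine and is streamed to the server (slower)
  LOAD DATA LOCAL INFILE '/tmp/actions.txt' INTO TABLE actioncredits;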

[http://dev.mysql.com/doc/refman/5.5/en/load-data.html]

Perform your data transformation directly in MySQL

Our old actioncredit system was unique on (MONTH(created_at), id), but the new system is going to generate new autoincrementing IDs for each record as it’s loaded in chronological order. The problem was that my 50 GB of TSV data didn’t match up with the new schema, and the Ruby scripts I had to transform the old rows into new rows were laughably slow. I did some digging and found out that you can tell MySQL to (quickly) throw away the data you don’t want in the load statement itself, using user variables:

  LOAD DATA INFILE 'data.csv' INTO TABLE mytable
  FIELDS TERMINATED BY '\t' ENCLOSED BY '\"'
  (@throwaway, user_id, action, created_at);

This statement tells MySQL which fields are present in data.csv. @throwaway is a user variable; in this case we want to discard the first field, so we simply never use it. If we wanted to insert a prefix, we could execute:

  LOAD DATA INFILE 'data.csv' INTO TABLE mytable
  FIELDS TERMINATED BY '\t' ENCLOSED BY '\"'
  (id, user_id, @action, created_at)
  SET action = CONCAT('prefix_', @action);

and every loaded row’s `action’ column will begin with the string ‘prefix_’.

Checking progress without disrupting the import

If you’re loading large data files and want to check the progress, you definitely don’t want to use `SELECT COUNT(*) FROM table’. This query degrades as the table grows and will slow down the LOAD process. Instead you can query (for InnoDB this row count is an estimate, but it’s close enough to track progress):

mysql> SELECT table_rows FROM information_schema.tables WHERE table_name = 'table';
+------------+
| table_rows |
+------------+
|   27273886 |
+------------+
1 row in set (0.23 sec)

If you want to watch/log the progress over time, you can craft a quick shell command to poll the number of rows:

$ while :; do mysql -hlocalhost databasename -e "SELECT table_rows FROM information_schema.tables WHERE table_name = 'table' \G ; " | grep rows | cut -d':' -f2 | xargs echo `date +"%F %R"` , | tee -a load.log && sleep 30; done
2012-05-29 18:16 , 32267244
2012-05-29 18:16 , 32328002
2012-05-29 18:17 , 32404189
2012-05-29 18:17 , 32473936
2012-05-29 18:18 , 32543698
2012-05-29 18:18 , 32616939
2012-05-29 18:19 , 32693198

The `tee’ will echo to STDOUT as well as append to `load.log’, the `\G’ formats the columns in the result set as rows, and the sleep gives it a pause between polls.

LOAD DATA chunking script

I quickly discovered that throwing a 50m-row TSV file at LOAD DATA was a good way to have performance degrade to the point of not finishing. I settled on using `split’ to chunk the data into one million rows per file:

for month_table in action*.txt; do
  echo "$(date) splitting $month_table..."
  split -l 1000000 $month_table curmonth_
  for segment in curmonth_*; do
    echo "On segment $segment"
    time mysql -hlocalhost action_credit_silo <<-SQL
      SET FOREIGN_KEY_CHECKS = 0;
      SET UNIQUE_CHECKS = 0;
      SET SESSION tx_isolation='READ-UNCOMMITTED';
      SET sql_log_bin = 0;
      LOAD DATA INFILE '$segment' INTO TABLE actioncredits
      FIELDS TERMINATED BY '\t' ENCLOSED BY '\"'
      (@throwawayId, action, user_id, target_user_id, cause_id, item_type, item_id, activity_id, created_at, utm_campaign) ;
SQL
    rm $segment
  done
  mv $month_table $month_table.done
done

Wrap-up

Over the duration of this script, I saw chunk load time increase from 1m40s to around an hour per million inserts. This is still better than not finishing at all, which is what happened until I made all the changes suggested in this post and used the chunking script above (`load.sh’). Other tips:

  • use as few indices as you can
  • loading the data in sequential (chronological) order not only makes the loading faster, but the resulting table will be faster to query
  • if you can load any of the data from MySQL (instead of a flat-file intermediary), it will be much faster. You can use the `INSERT INTO .. SELECT’ statement to copy data between tables quickly, as sketched below.
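
A sketch of that table-to-table copy; the table and column names are placeholders:

  INSERT INTO actioncredits (user_id, action, created_at)
  SELECT user_id, action, created_at
  FROM legacy_actioncredits
  ORDER BY created_at;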