原地址:http://blog.163.com/[email protected]/blog/static/16387704020142585616183/
pg_statsinfo架构如下 :
pg_statsinfo作为被监控数据库的worker进程跟随数据库一起启动, 被监控数据库的状态信息定时自动或手动的上传到repo数据库. repo数据库可用和被监控机放一起, 也可以独立使用, 对于数据库比较多的环境, 建议使用独立的repo数据库.
pg_stats_reporter是一个报告插件, 联合pg_statsinfo一起使用, 后面会讲到.
pg_statsinfo的运行机制如下 :
一个启动了pg_statsinfo插件的数据库进程列表举例 :
pg93 17471 1 0 07:34 pts/1 00:00:05 /home/pg93/pgsql9.3.3/bin/postgres
pg93 17472 17471 0 07:34 pts/1 00:00:00 postgres: pg_statsinfo launcher process
pg93 17473 17471 0 07:34 ? 00:00:00 postgres: logger process
pg93 17475 17471 0 07:34 ? 00:00:25 postgres: checkpointer process
pg93 17476 17471 0 07:34 ? 00:00:24 postgres: writer process
pg93 17477 17471 3 07:34 ? 00:03:16 postgres: wal writer process
pg93 17478 17471 0 07:34 ? 00:00:04 postgres: autovacuum launcher process
pg93 17479 17471 0 07:34 ? 00:00:03 postgres: archiver process last was 00000001000001D000000076
pg93 17480 17471 0 07:34 ? 00:00:16 postgres: stats collector process
pg93 17482 17472 0 07:34 pts/1 00:00:08 /home/pg93/pgsql9.3.3/bin/pg_statsinfod 17471
pg_statsinfo launcher 进程是postmaster进程fork而来, /home/pg93/pgsql9.3.3/bin/pg_statsinfod 17471 这个进程则是pg_statsinfo launcher 进程fork而来.
pg_statsinfo负责从数据库获取快照写入repo数据库, 同时过滤csvlog的信息(例如错误信息)到另一个文件或syslog, 这个过滤后的文件或syslog通常被其他监控软件用于监控数据库CSVLOG爆出的问题.
同时pg_statsinfo进程还可以接收启动或关闭agent的信号.
[[email protected]-172-16-3-150 ~]# /home/pg93/pgsql9.3.3/bin/pg_statsinfo --help
pg_statsinfo reports a PostgreSQL database.
Usage:
pg_statsinfo -r REPORTID [-i INSTANCEID] [-b SNAPID] [-e SNAPID] [-B DATE] [-E DATE]
[-o FILENAME] [connection-options]
pg_statsinfo -l [-i INSTANCEID] [connection-options]
pg_statsinfo -s [connection-options]
pg_statsinfo -S COMMENT [connection-options]
pg_statsinfo -D SNAPID [connection-options]
pg_statsinfo --start [connection-options]
pg_statsinfo --stop [connection-options]
General options:
-r, --report=REPORTID generate a report that specified by REPORTID
---------------------------
* Summary
* DatabaseStatistics
* InstanceActivity
* OSResourceUsage
* DiskUsage
* LongTransactions
* NotableTables
* CheckpointActivity
* AutovacuumActivity
* QueryActivity
* LockConflicts
* ReplicationActivity
* SettingParameters
* SchemaInformation
* Profiles
* All
---------------------------
(can prefix match. For example, "su" means ‘Summary‘)
-i, --instid limit to instances of specified instance ID
-b, --beginid begin point of report scope (specify by snapshot ID)
-B, --begindate begin point of report scope (specify by timestamp)
-e, --endid end point of report scope (specify by snapshot ID)
-E, --enddate end point of report scope (specify by timestamp)
-l, --list show the snapshot list
-s, --size show the snapshot size
-S, --snapshot=COMMENT get a snapshot
-D, --delete=SNAPID delete a snapshot
--start start the pg_statsinfo agent
--stop stop the pg_statsinfo agent
Output options:
-o, --output=FILENAME destination file path for report
Connection options:
-d, --dbname=DBNAME database to connect
-h, --host=HOSTNAME database server host or socket directory
-p, --port=PORT database server port
-U, --username=USERNAME user name to connect as
-w, --no-password never prompt for password
-W, --password force password prompt
Generic options:
--echo echo queries
--elevel=LEVEL set output message level
--help show this help, then exit
--version output version information, then exit
Read the website for details. <http://pgstatsinfo.projects.postgresql.org/>
Report bugs to <[email protected]>.
例如关闭agent, pg_statsinfod将被停掉
[[email protected]-172-16-3-150 ~]# /home/pg93/pgsql9.3.3/bin/pg_statsinfo --stop -h 127.0.0.1 -p 1921 -U postgres
LOG: waiting for pg_statsinfod to shut down
LOG: pg_statsinfod stopped
[[email protected] ~]# ps -ewf|grep "pg93 "
pg93 17471 1 0 07:34 pts/1 00:00:05 /home/pg93/pgsql9.3.3/bin/postgres
pg93 17472 17471 0 07:34 pts/1 00:00:00 postgres: pg_statsinfo launcher process
pg93 17473 17471 0 07:34 ? 00:00:00 postgres: logger process
pg93 17475 17471 0 07:34 ? 00:00:26 postgres: checkpointer process
pg93 17476 17471 0 07:34 ? 00:00:25 postgres: writer process
pg93 17477 17471 3 07:34 ? 00:03:27 postgres: wal writer process
pg93 17478 17471 0 07:34 ? 00:00:04 postgres: autovacuum launcher process
pg93 17479 17471 0 07:34 ? 00:00:03 postgres: archiver process last was 00000001000001D100000023
pg93 17480 17471 0 07:34 ? 00:00:17 postgres: stats collector process
启动agent, 即将pg_statsinfod这个后台进程起来 :
[[email protected]-172-16-3-150 ~]# /home/pg93/pgsql9.3.3/bin/pg_statsinfo --start -h 127.0.0.1 -p 1921 -U postgres
LOG: waiting for pg_statsinfod to start
LOG: pg_statsinfod started
[[email protected] ~]# ps -ewf|grep "pg93 "
pg93 3031 17472 0 09:16 pts/1 00:00:00 /home/pg93/pgsql9.3.3/bin/pg_statsinfod 17471
接下来以如下环境为例, 说明pg_statsinfo 如何部署 :
被监控数据库版本 : PostgreSQL 9.3
repo数据库版本 : PostgreSQL 9.3
被监控数据库服务器操作系统 : CentOS 6.4 x64
被监控服务器和REPO服务器时区 +8
被监控数据库时区参数 : timezone=‘PRC‘, log_timezone=‘UTC‘
REPO数据库时区参数 : timezone=‘PRC‘
在被监控服务器, REPO服务器的环境中, 保持操作系统的时区, 数据库的timezone一致. 否则可能导致各统计表的采集时间不一致的问题.
步骤 :
1. 在被监控机下载pg_statsinfo, 并安装.
# wget http://pgfoundry.org/frs/download.php/3540/pg_statsinfo-2.5.0.tar.gz
# tar -zxvf pg_statsinfo-2.5.0.tar.gz
# cd pg_statsinfo-2.5.0
# export PATH=/home/pg93/pgsql/bin:$PATH
# which pg_config
/home/pg93/pgsql/bin/pg_config
# make USE_PGXS=1
# make USE_PGXS=1 install
2. 配置数据库的postgresql.conf.
su - pg93
cd $PGDATA
vi postgresql.conf
关键配置项 :
如果要监控SQL级别的统计信息, 需要打开pg_stat_statements插件.
shared_preload_libraries = ‘pg_stat_statements,pg_statsinfo‘ # (change requires restart)
timezone = ‘PRC‘
log_timezone = ‘UTC‘
log_destination = ‘csvlog‘ # Valid values are combinations of
logging_collector = on # Enable capturing of stderr and csvlog
log_checkpoints = on
log_connections = on
log_disconnections = on
log_error_verbosity = verbose # terse, default, or verbose messages
log_statement = ‘ddl‘ # none, ddl, mod, all
track_activities = on
track_counts = on
track_functions = all # none, pl, all
log_min_messages = log # values in order of decreasing detail:
log_truncate_on_rotation = on # If on, an existing log file with the
pg_stat_statements.max = 1000
pg_stat_statements.track = all
pg_statsinfo.snapshot_interval = 10min # set snapshot interval
pg_statsinfo.enable_maintenance = ‘on‘ # enable maintenance mode(‘on‘ or ‘off‘)
pg_statsinfo.maintenance_time = ‘00:02:00‘ # Delete old snapshots every day in this time.
pg_statsinfo.repository_keepday = 7 # keep old snapshots in this period in maintenance.
pg_statsinfo.syslog_min_messages = ‘disable‘
pg_statsinfo.textlog_line_prefix = ‘%t %p %c-%l %x %q(%u, %d, %r, %a) ‘ # This format is same as syslog‘s format.
pg_statsinfo.syslog_line_prefix = ‘%t %p %c-%l %x %q(%u, %d, %r, %a) ‘ # This format is same as syslog‘s format.
pg_statsinfo.long_lock_threashold = 30 # threthold parameter for getting LOCK infomation
pg_statsinfo.repository_server=‘host=172.16.3.39 port=5432 user=statsrepo dbname=statsrepo‘
如果不配置repo的连接密码, 使用.pgpass的方式验证, 那么pg_statsinfo.repository_server必须使用host, 不能使用hostaddr.
其他配置项 :
listen_addresses = ‘0.0.0.0‘ # what IP address(es) to listen on;
port = 1921 # (change requires restart)
max_connections = 100 # (change requires restart)
superuser_reserved_connections = 3 # (change requires restart)
unix_socket_directories = ‘.‘ # comma-separated list of directories
unix_socket_permissions = 0700 # begin with 0 to use octal notation
tcp_keepalives_idle = 60 # TCP_KEEPIDLE, in seconds;
tcp_keepalives_interval = 10 # TCP_KEEPINTVL, in seconds;
tcp_keepalives_count = 10 # TCP_KEEPCNT;
shared_buffers = 2048MB # min 128kB
maintenance_work_mem = 512MB # min 1MB
vacuum_cost_delay = 10 # 0-100 milliseconds
vacuum_cost_limit = 10000 # 1-10000 credits
bgwriter_delay = 10ms # 10-10000ms between rounds
wal_level = hot_standby # minimal, archive, or hot_standby
synchronous_commit = off # synchronization level;
wal_buffers = 16MB # min 32kB, -1 sets based on shared_buffers
wal_writer_delay = 10ms # 1-10000 milliseconds
checkpoint_segments = 128 # in logfile segments, min 1, 16MB each
archive_mode = on # allows archiving to be done
archive_command = ‘/bin/date‘ # command to use to archive a logfile segment
max_wal_senders = 32 # max number of walsender processes
wal_keep_segments = 128 # in logfile segments, 16MB each; 0 disables
hot_standby = on # "on" allows queries during recovery
wal_receiver_status_interval = 1s # send replies at least this often
hot_standby_feedback = on # send info from standby to prevent
random_page_cost = 2.0 # same scale as above
effective_cache_size = 96GB
track_activity_query_size = 1024 # (change requires restart)
autovacuum = on # Enable autovacuum subprocess? ‘on‘
log_autovacuum_min_duration = 0 # -1 disables, 0 logs all actions and
autovacuum_naptime = 3s # time between autovacuum runs
autovacuum_vacuum_scale_factor = 0.0002 # fraction of table size before vacuum
autovacuum_analyze_scale_factor = 0.0001 # fraction of table size before analyze
datestyle = ‘iso, mdy‘
lc_messages = ‘C‘ # locale for system error message
lc_monetary = ‘C‘ # locale for monetary formatting
lc_numeric = ‘C‘ # locale for number formatting
lc_time = ‘C‘ # locale for time formatting
default_text_search_config = ‘pg_catalog.english‘
3. 配置启动数据库的OS用户环境变量. 确保默认可以直接使用超级用户连接, 无需密码.
例如 :
[email protected]-172-16-3-150-> psql
psql (9.3.3)
Type "help" for help.
digoal=# \q
环境变量 :
[email protected]-172-16-3-150-> cat .bash_profile
# .bash_profile
# Get the aliases and functions
if [ -f ~/.bashrc ]; then
. ~/.bashrc
fi
# User specific environment and startup programs
PATH=$PATH:$HOME/bin
export PATH
export PGPORT=1921
export PGDATA=/ssd2/pg93/pg_root
export LANG=en_US.utf8
export PGHOME=/home/pg93/pgsql
export LD_LIBRARY_PATH=$PGHOME/lib:/lib64:/usr/lib64:/usr/local/lib64:/lib:/usr/lib:/usr/local/lib:$LD_LIBRARY_PATH
export DATE=`date +"%Y%m%d%H%M"`
export PATH=$ORACLE_HOME/bin:$PGHOME/bin:$PATH:.
export MANPATH=$PGHOME/share/man:$MANPATH
export PGUSER=postgres
export PGHOST=$PGDATA
export PGDATABASE=digoal
4. 配置repo数据库, 如果repo数据库是独立的, 那么新建库和用户即可.
[email protected]-172-16-3-39-> psql
psql (9.3.2)
Type "help" for help.
postgres=# create role statsrepo nosuperuser login encrypted password ‘statsrepo‘;
postgres=# create database statsrepo with template template0 encoding ‘UTF8‘;
postgres=# grant all on database statsrepo to statsrepo;
配置repo数据库的pg_hba.conf, 允许被监控服务器访问statsrepo数据库.
host statsrepo statsrepo 0.0.0.0/0 md5
配置repo数据库的postgresql.conf, timezone=‘PRC‘. 保持与被监控库timezone一致.
配置repo数据库服务器的时区. /etc/sysconfig/clock, 保持与被监控服务器一致.
实际上在启动安装了pg_statsinfo插件的被监控库时, statsinfo进程会自动执行pg_statsrepo_partition.sql和pg_statsrepo_alert.sql
创建repo相关的表, 函数. 因此statsrepo的schema是自动初始化的, 不需要提前初始化.
[email protected]-172-16-3-150-> cd /home/pg93/pgsql/share/contrib/
[email protected]-172-16-3-150-> ll
total 180K
-rw-r--r-- 1 root root 5.4K Mar 4 17:21 pg_statsinfo.sql
-rw-r--r-- 1 root root 73K Mar 4 17:21 pg_statsrepo83.sql
-rw-r--r-- 1 root root 16K Mar 4 17:21 pg_statsrepo_alert.sql
-rw-r--r-- 1 root root 71K Mar 4 17:21 pg_statsrepo_partition.sql
-rw-r--r-- 1 root root 152 Mar 4 17:21 uninstall_pg_statsinfo.sql
-rw-r--r-- 1 root root 152 Mar 4 17:21 uninstall_pg_statsrepo.sql
同时在被监控库的本地postgres数据库里面会自动执行pg_statsinfo.sql脚本, 创建statsinfo schema, 以及一些表和函数.
例如snapshot是一个手工创建快照的函数, 还有维护快照的函数.
postgres=# \df statsinfo.*
List of functio
ns
Schema | Name | Result data type |
Argument data types
| Type
-----------+--------------------+------------------+--------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------------------------------+--------
statsinfo | activity | record | OUT idle double precision, OUT idle_in_xact double precision, OUT waiting doubl
e precision, OUT running double precision, OUT backends integer, OUT client text, OUT pid integer, OUT start timestamp with time zon
e, OUT duration double precision, OUT query text
| normal
statsinfo | array_agg | anyarray | anyelement
| agg
statsinfo | cpustats | SETOF record | OUT cpu_id text, OUT cpu_user bigint, OUT cpu_system bigint, OUT cpu_idle bigin
t, OUT cpu_iowait bigint, OUT overflow_user smallint, OUT overflow_system smallint, OUT overflow_idle smallint, OUT overflow_iowait
smallint
| normal
statsinfo | cpustats | SETOF record | prev_cpustats statsinfo.cpustats_type, OUT cpu_id text, OUT cpu_user bigint, OU
T cpu_system bigint, OUT cpu_idle bigint, OUT cpu_iowait bigint, OUT overflow_user smallint, OUT overflow_system smallint, OUT overf
low_idle smallint, OUT overflow_iowait smallint
| normal
statsinfo | devicestats | SETOF record | OUT device_major text, OUT device_minor text, OUT device_name text, OUT device_
readsector bigint, OUT device_readtime bigint, OUT device_writesector bigint, OUT device_writetime bigint, OUT device_ioqueue bigint
, OUT device_iototaltime bigint, OUT overflow_drs smallint, OUT overflow_drt smallint, OUT overflow_dws smallint, OUT overflow_dwt s
mallint, OUT overflow_dit smallint, OUT device_tblspaces name[] | normal
statsinfo | devicestats | SETOF record | prev_devicestats statsinfo.devicestats_type[], OUT device_major text, OUT devic
e_minor text, OUT device_name text, OUT device_readsector bigint, OUT device_readtime bigint, OUT device_writesector bigint, OUT dev
ice_writetime bigint, OUT device_ioqueue bigint, OUT device_iototaltime bigint, OUT overflow_drs smallint, OUT overflow_drt smallint
, OUT overflow_dws smallint, OUT overflow_dwt smallint, OUT overflow_dit smallint, OUT device_tblspaces name[] | normal
statsinfo | last_xact_activity | SETOF record | OUT pid integer, OUT xid xid, OUT in_xact boolean, OUT queries text
| normal
statsinfo | loadavg | SETOF record | OUT loadavg1 real, OUT loadavg5 real, OUT loadavg15 real
| normal
statsinfo | maintenance | void | repository_keep_period timestamp with time zone
| normal
statsinfo | memory | SETOF record | OUT memfree bigint, OUT buffers bigint, OUT cached bigint, OUT swap bigint, OUT
dirty bigint
| normal
statsinfo | profile | SETOF record | OUT processing text, OUT execute bigint, OUT total_exec_time double precision
| normal
statsinfo | sample | void |
| normal
statsinfo | snapshot | void |
| normal
statsinfo | snapshot | void | comment text
| normal
statsinfo | start | void | timeout integer
| normal
statsinfo | stop | void | timeout integer
| normal
statsinfo | tablespaces | SETOF record | OUT oid oid, OUT name text, OUT location text, OUT device text, OUT avail bigin
t, OUT total bigint, OUT spcoptions text[]
| normal
(17 rows)
5. 配置被监控机的.pgpass, 启动postgresql数据库的操作系统用户的HOME目录下.
su - pg93
vi ~/.pgpass
172.16.3.39:5432:statsrepo:statsrepo:statsrepo
chmod 400 ~/.pgpass
重启数据库, 如果在此前修改了log_timezone, 请把CSVLOG清除后重启.
cd $PGDATA/pg_log
rm -rf *
pg_ctl start
可以观察到自动初始化statsrepo的过程.
使用举例 :
在被监控库手工创建快照.
postgres=# select * from statsinfo.snapshot(‘test‘);
snapshot
----------
(1 row)
在repo查看快照收集到的信息 :
\c statsrepo statsrepo
select * from snapshot;
select * from checkpoint;
select * from autovacuum;
...
[参考]
1. http://pgstatsinfo.projects.pgfoundry.org/pg_statsinfo.html
2. http://pgstatsinfo.projects.pgfoundry.org/pg_stats_reporter.html
3. http://pgfoundry.org/frs/?group_id=1000422
4. http://blog.163.com/[email protected]/blog/static/16387704020142411454515/
5. http://blog.163.com/[email protected]/blog/static/1638770402014257448826/
6. http://blog.163.com/[email protected]/blog/static/16387704020142583013467/