作者:徐田原
当发起“select * from XXX”时,数据会加载到操作系统缓存然后才到shared buffer。PostgreSQL缓存读顺序share_buffers -> 操作系统缓存 -> 硬盘。同样当将脏页向磁盘刷写时,也是先到操作系统缓存,然后由操作系统调用fsync()将操作系统缓存中数据持久化到磁盘。这样PG实际上由两份数据,看起来有些浪费空间,但是操作系统缓存是一个简单的LRU而不是数据库优化的clock sweep algorithm。一旦在shared_buffers中命中,那么读就不会下沉到操作系统缓存。如果shared buffer和操作系统缓存有相同页,操作系统缓存中的页很快会被驱逐替换。
什么参数能影响操作系统的fsync将脏页刷回磁盘吗?
当然,通过postgresql.conf中参数bgwriter_flush_after,该参数整型,默认512KB。当后台写进程写了这么多数据时,会强制OS发起sync将cache中数据刷到底层存储。这样会限制内核页缓存中的脏数据数量,从而减小checkpoint时间或者后台大批量写回数据的时间。
不仅仅是bgwriter,即使checkpoint进程和用户进程也从shared buffer刷写脏页到OS cache。可以通过checkpoint_flush_after影响checkpoint进程的fsync,通过backend_flush_after影响后台进程的fsync。
缓存
pg_buffercache模块提供了一种实时检测共享缓冲区的方法。
这个模块提供了一个C函数:pg_buffercache_pages,它返回一个记录的集合和一个视图:pg_buffercache,它包装了这个函数来更方便的使用。
共享缓存中的每个缓冲区都有一行。未使用的缓冲区除了bufferid以外的所有列为null。共享系统目录被显示为属于数据库零。
因为缓存被所有的数据库共享,通常会有关系中的一些页面不属于当前的数据库。这意味着对许多行来说,在pg_class中可能不会有匹配的连接行,或者甚至可能会有错误连接。如果你试图连接pg_class,一个好的办法就是将连接限制为reldatabase等于当前数据库的OID或者为零的行。
当访问pg_buffercache视图时,内部缓冲区管理器会锁住足够长的时间来拷贝所有这个视图会展示的缓冲区状态数据。这确保了这个视图产生一个一致的结果集,同时不会不必要的长时间阻碍正常的缓冲区活动。虽然如此,但是如果这个视图被频繁读取的话,会对数据库性能产生一些影响。
1.1 pg_buffercache的部署
cd /home/postgres/postgresql-13.3/contrib/pg_buffercachegmake && gmake installpsql -c 'create extension pg_buffercache;'
1.2 pg_buffercache的使用
pg_buffercache 主要是用来查看pgsql数据库 shared_pool 的使用情况
select name,setting,unit,current_setting(name) from pg_settings where 1 = 1 and name = 'shared_buffers'; name | setting | unit | current_setting ----------------+---------+------+----------------- shared_buffers | 16384 | 8kB | 128MB(1 row)
查看使用情况
select d.datname, c.relname, c.relkind, count(*) as buffersfrom pg_class cinner join pg_buffercache b on b.relfilenode = c.relfilenodeinner join pg_database d on (b.reldatabase = d.oid and d.datname = current_database())where 1 = 1group by d.datname, c.relname, c.relkindorder by d.datname, 4 desc ; datname | relname | relkind | buffers ----------+-----------------------------------------------+---------+--------- postgres | tmp_t0 | r | 16150 postgres | pg_operator | r | 14 postgres | pg_depend_reference_index | i | 11 postgres | pg_statistic | r | 10 postgres | pg_depend | r | 8 postgres | pg_rewrite | r | 7 postgres | pg_toast_2619 | t | 7 postgres | pg_init_privs | r | 5 postgres | pg_operator_oprname_l_r_n_index | i | 5 postgres | pg_amop | r | 5 postgres | pg_depend_depender_index | i | 5 postgres | pg_index | r | 4 postgres | pg_amop_fam_strat_index | i | 3 postgres | pg_operator_oid_index | i | 3 postgres | pg_amproc_fam_proc_index | i | 3 postgres | pg_amop_opr_fam_index | i | 3 postgres | pg_namespace_oid_index | i | 2 postgres | pg_aggregate_fnoid_index | i | 2 postgres | pg_cast | r | 2 postgres | pg_cast_source_target_index | i | 2 postgres | pg_constraint_conrelid_contypid_conname_index | i | 2 postgres | pg_description_o_c_o_index | i | 2 postgres | pg_index_indexrelid_index | i | 2 postgres | pg_index_indrelid_index | i | 2 postgres | pg_namespace | r | 2 postgres | pg_namespace_nspname_index | i | 2 postgres | pg_opclass | r | 2 postgres | pg_opclass_am_name_nsp_index | i | 2 postgres | pg_opclass_oid_index | i | 2 postgres | pg_rewrite_oid_index | i | 2 postgres | pg_rewrite_rel_rulename_index | i | 2 postgres | pg_statistic_relid_att_inh_index | i | 2 postgres | pg_toast_2619_index | i | 2 postgres | pg_extension | r | 1 postgres | pg_am | r | 1 postgres | pg_aggregate | r | 1 postgres | pg_amproc | r | 1 postgres | pg_description | r | 1 postgres | pg_statistic_ext_relid_index | i | 1 postgres | pg_init_privs_o_c_o_index | i | 1 postgres | pg_extension_oid_index | i | 1 postgres | pg_extension_name_index | i | 1(42 rows)
pg_prewarm 它可以用于在系统重启时,手动加载经常访问的表到操作系统的cache或PG的shared buffer,从而减少检查系统重启对应用的影响。
2.1 pg_prewarm 函数
CREATE FUNCTION pg_prewarm(regclass, mode text default 'buffer', fork text default 'main', first_block int8 default null, last_block int8 default null)RETURNS int8AS 'MODULE_PATHNAME', 'pg_prewarm'LANGUAGE C
备注: regclass 参数为数据库对像,通常情况为表名; modex 参数指加载模式,可选项有 ‘prefetch’, ‘read’,’buffer’, 默认为 ‘buffer’ 具体稍后介绍; fork 表示对像模式,可选项有 ‘main’, ‘fsm’, ‘vm’, 默认为 ‘main’, first_block 表示开始prewarm的数据块,last_block 表示最后 prewarm 的数据块.
pg_prewarm 加载模式
mode 参数指加载模式,可选项有 ‘prefetch’, ‘read’,’buffer’, 默认为 ‘buffer’.
prefetch: 异步地将数据预加载到操作系统缓存;
read: 最终结果和 prefetch 一样,但它是同步方式,支持所有平台.
buffer: 将数据预加载到数据库缓存
2.2 pg_prewarm的部署
psql -c 'create EXTENSION pg_prewarm;'
2.3 pg_prewarm的演示
先将PG的shared buffer设为128M,OS总的memory有8G。然后创建下面的大小近1G的表test:
postgres=# create table tmp_t0 ( id int8,name varchar(100),memo1 varchar(200),memo2 varchar(200));CREATE TABLEpostgres=# insert into tmp_t0 select id,md5(id::varchar),md5(md5(id::varchar)),md5(md5(md5(id::varchar))) from generate_series(1,10000000) as id;INSERT 0 10000000
在每次都清掉操作系统cache和PG的shared buffer的情况下,分别测试下面几种场景:
不进行pg_prewarm的情况:postgres=# \d tmp_t0 Table 'public.tmp_t0' Column | Type | Collation | Nullable | Default --------+------------------------+-----------+----------+--------- id | bigint | | | name | character varying(100) | | | memo1 | character varying(200) | | | memo2 | character varying(200) | | | postgres=# SELECT pg_size_pretty(pg_total_relation_size('tmp_t0')); pg_size_pretty ---------------- 1347 MB(1 row)postgres=# explain analyze select count(*) from tmp_t0; QUERY PLAN ----------------------------------------------------------------------------------------------------------------------------------------------- Finalize Aggregate (cost=225497.55..225497.56 rows=1 width=8) (actual time=1291.463..1292.254 rows=1 loops=1) -> Gather (cost=225497.33..225497.54 rows=2 width=8) (actual time=1291.353..1292.247 rows=3 loops=1) Workers Planned: 2 Workers Launched: 2 -> Partial Aggregate (cost=224497.33..224497.34 rows=1 width=8) (actual time=1286.237..1286.238 rows=1 loops=3) -> Parallel Seq Scan on tmp_t0 (cost=0.00..214080.67 rows=4166667 width=0) (actual time=6.579..1137.312 rows=3333333 loops=3) Planning Time: 0.214 ms Execution Time: 1292.363 ms(8 rows)可以看到,近1G的表,全表扫描一遍,耗时1.2spostgres=# select pg_prewarm('tmp_t0', 'read', 'main'); pg_prewarm ------------ 172414(1 row)postgres=# explain analyze select count(*) from tmp_t0; QUERY PLAN ---------------------------------------------------------------------------------------------------------------------------------------------- Finalize Aggregate (cost=225497.55..225497.56 rows=1 width=8) (actual time=437.033..437.697 rows=1 loops=1) -> Gather (cost=225497.33..225497.54 rows=2 width=8) (actual time=436.866..437.689 rows=3 loops=1) Workers Planned: 2 Workers Launched: 2 -> Partial Aggregate (cost=224497.33..224497.34 rows=1 width=8) (actual time=434.013..434.014 rows=1 loops=3) -> Parallel Seq Scan on tmp_t0 (cost=0.00..214080.67 rows=4166667 width=0) (actual time=0.039..305.793 rows=3333333 loops=3) Planning Time: 0.083 ms Execution Time: 437.726 ms(8 rows)时间降至4秒多!这时反复执行全表扫描,时间稳定在0.4秒多。尝试buffer模式:postgres=# select pg_prewarm('tmp_t0', 'buffer', 'main'); pg_prewarm ------------ 172414(1 row)postgres=# explain analyze select count(*) from tmp_t0; QUERY PLAN ---------------------------------------------------------------------------------------------------------------------------------------------- Finalize Aggregate (cost=225497.55..225497.56 rows=1 width=8) (actual time=480.060..482.189 rows=1 loops=1) -> Gather (cost=225497.33..225497.54 rows=2 width=8) (actual time=479.862..482.180 rows=3 loops=1) Workers Planned: 2 Workers Launched: 2 -> Partial Aggregate (cost=224497.33..224497.34 rows=1 width=8) (actual time=475.604..475.605 rows=1 loops=3) -> Parallel Seq Scan on tmp_t0 (cost=0.00..214080.67 rows=4166667 width=0) (actual time=0.035..336.157 rows=3333333 loops=3) Planning Time: 0.068 ms Execution Time: 482.220 ms(8 rows)比read模式时间略少,但相差不大。可见,如果操作系统的cache够大,数据取到OS cache还是shared buffer对执行时间影响不大(在不考虑其他应用影响PG的情况下)尝试prefetch模式,即异步预取。这里,有意在pg_prewarm返回后,立即执行全表查询。这样在执行全表查询时,可能之前的预取还没完成,从而使全表查询和预取并发进行,缩短了总的响应时间:postgres=# select pg_prewarm('tmp_t0', 'prefetch', 'main'); pg_prewarm ------------ 172414(1 row)postgres=# explain analyze select count(*) from tmp_t0; QUERY PLAN ---------------------------------------------------------------------------------------------------------------------------------------------- Finalize Aggregate (cost=225497.55..225497.56 rows=1 width=8) (actual time=434.313..436.782 rows=1 loops=1) -> Gather (cost=225497.33..225497.54 rows=2 width=8) (actual time=434.155..436.774 rows=3 loops=1) Workers Planned: 2 Workers Launched: 2 -> Partial Aggregate (cost=224497.33..224497.34 rows=1 width=8) (actual time=431.066..431.067 rows=1 loops=3) -> Parallel Seq Scan on tmp_t0 (cost=0.00..214080.67 rows=4166667 width=0) (actual time=0.034..307.687 rows=3333333 loops=3) Planning Time: 0.061 ms Execution Time: 436.816 ms(8 rows)可以看到,总的完成时间是0.4秒多,使用pg_prewarm做预取大大缩短了总时间。因此在进行全表扫描前,做一次异步的prewarm,不失为一种优化全表查询的方法。问题点:执行1次select * from 不就可以将表的数据读入shared buffer和OS cache而实现预热了吗?岂不是比做这样一个插件更简单?实际上,对于较大的表(大小超过shared buff的1/4),进行全表扫描时,PG认为没必要为这种操作使用所有shared buffer,只会让其使用很少的一部分buffer,一般只有几百K,所以,预热大表是不能用一个查询直接实现的详情可以阅读:https://github.com/postgres/postgres/tree/17792bfc5b62f42a9dfbd2ac408e7e71c239330a/src/backend/storage/buffer
这是一种将数据库表,索引CACHE到OS层面的缓存里,只要内存足够大,可以将需要的数据都CACHE到OS 内存中,这极到的提高了应用的处理速度。
3.1 pgfincore函数
pgsysconf –查看操作系统CACHE情况pgsysconf_pretty –查看操作系统CACHE情况pgfincore –查看对象(表,索引)CACHE情况pgfadvise_willneed –将数据库对象(表,索引)载入OS CACHEpgfadvise_dontneed –将数据库对象(表,索引)刷出OS CACHEpgfadvise_normal –将数据库对象(表,索引)修改为普通内存方式pgfadvise_loader -将数据库对象(表,索引)自定义载入OS CACHE
函数返回列
relpath : the relation pathblock_size : the size of one block disk?block_disk : the total number of file system blocks of the relationblock_mem : the total number of file system blocks of the relation in buffer cache. (not the shared buffers from PostgreSQL but the OS cache)group_mem : the number of groups of adjacent block_mem
3.2 pgfincore部署
make cleanmakesumake install
For PostgreSQL >= 9.1, log in your database and:
postgres=# CREATE EXTENSION pgfincore;
For other release, create the functions from the sql script (it should be in your contrib directory):
psql mydb -f pgfincore.sql
3.3 pgfincore的演示
pgfincore
查询表CACHE情况
提供对象在操作系统缓存中的信息的。
它分为三个函数,参数分别为(relname, fork, getdatabit),(relname, getdatabit),(relname),三个参数意思为对象名,进程名(这个地方默认是main),是否要显示databit(很长,注意显示),第一个函数需要全部输入,第二个函数默认fork为main,第三个函数默认fork为main,getdatabit为false。
它输出的是文件位置及名称(relpath),文件顺序(segment),OS page或block大小(os_page_size),对象占用系统缓存需要的页面个数(rel_os_pages),对象已经占用缓存页面个数(pages_mem),在缓存中连续的页面组的个数(group_mem),OS剩余的page数(os_pages_free),加载信息的位图(databit)。
postgres=# select * from pgfincore ('tmp_t0'); relpath | segment | os_page_size | rel_os_pages | pages_mem | group_mem | os_pages_free | databit | pages_dirty | group_dirty --------------------+---------+--------------+--------------+-----------+-----------+---------------+---------+-------------+------------- base/13580/16426 | 0 | 4096 | 262144 | 0 | 0 | 1899395 | | 0 | 0 base/13580/16426.1 | 1 | 4096 | 82684 | 0 | 0 | 1899395 | | 0 | 0(2 rows)postgres=# select * from pgfincore('tmp_t0', false); relpath | segment | os_page_size | rel_os_pages | pages_mem | group_mem | os_pages_free | databit | pages_dirty | group_dirty --------------------+---------+--------------+--------------+-----------+-----------+---------------+---------+-------------+------------- base/13580/16426 | 0 | 4096 | 262144 | 262144 | 1 | 1553850 | | 0 | 0 base/13580/16426.1 | 1 | 4096 | 82684 | 82684 | 1 | 1553850 | | 0 | 0(2 rows)postgres=# select * from pgfincore ('tmp_t0', 'main', false); relpath | segment | os_page_size | rel_os_pages | pages_mem | group_mem | os_pages_free | databit | pages_dirty | group_dirty --------------------+---------+--------------+--------------+-----------+-----------+---------------+---------+-------------+------------- base/13580/16426 | 0 | 4096 | 262144 | 262144 | 1 | 1553819 | | 0 | 0 base/13580/16426.1 | 1 | 4096 | 82684 | 82684 | 1 | 1553819 | | 0 | 0(2 rows)
pgsysconf与pgsysconf_pretty
查询当前操作系统的块大小,剩余多少可用的CACHE(块)。输出OS block的大小(os_page_size),OS中剩余的page数(os_pages_free)和OS拥有的page总数(os_total_pages)。postgres=# select * from pgsysconf(); os_page_size | os_pages_free | os_total_pages --------------+---------------+---------------- 4096 | 1899465 | 1997514(1 row)postgres=# select * from pgsysconf_pretty(); os_page_size | os_pages_free | os_total_pages --------------+---------------+---------------- 4096 bytes | 6070 MB | 7803 MB(1 row)
pgfadvise_willneed
将数据库对象(表,索引)载入OS CACHE输出文件名(relpath),OS block大小(os_page_size),对象占用系统page数(rel_os_pages),OS剩余的page数(os_pages_free)postgres=# select * from pgfadvise_willneed('tmp_t0'); relpath | os_page_size | rel_os_pages | os_pages_free --------------------+--------------+--------------+--------------- base/13580/16426 | 4096 | 262144 | 1637008 base/13580/16426.1 | 4096 | 82684 | 1554178(2 rows)postgres=# select * from pgfincore ('tmp_t0'); relpath | segment | os_page_size | rel_os_pages | pages_mem | group_mem | os_pages_free | databit | pages_dirty | group_dirty --------------------+---------+--------------+--------------+-----------+-----------+---------------+---------+-------------+------------- base/13580/16426 | 0 | 4096 | 262144 | 262144 | 1 | 1553945 | | 0 | 0 base/13580/16426.1 | 1 | 4096 | 82684 | 82684 | 1 | 1553945 | | 0 | 0(2 rows)
pgfadvise_dontneed
将数据库对象(表,索引)刷出OS CACHE对当前对象设置dontneed标记。dontneed标记的意思就是当操作系统需要释放内存时优先释放标记为dontneed的pages。postgres=# select * from pgfadvise_dontneed('tmp_t0'); relpath | os_page_size | rel_os_pages | os_pages_free --------------------+--------------+--------------+--------------- base/13580/16426 | 4096 | 262144 | 1816091 base/13580/16426.1 | 4096 | 82684 | 1898771(2 rows)postgres=# select * from pgfincore ('tmp_t0'); relpath | segment | os_page_size | rel_os_pages | pages_mem | group_mem | os_pages_free | databit | pages_dirty | group_dirty --------------------+---------+--------------+--------------+-----------+-----------+---------------+---------+-------------+------------- base/13580/16426 | 0 | 4096 | 262144 | 0 | 0 | 1899411 | | 0 | 0 base/13580/16426.1 | 1 | 4096 | 82684 | 0 | 0 | 1899411 | | 0 | 0
pgfadvise_normal
使用普通内存方式,再pgfadv_dontneed方式刷出对象后,建议执行这个函数,将内存方式改为普通方式这里的pgfadvise主要调用了Linux下的函数posix_fadvise,标记值也是posix_fadvise所需要的。 postgres=# select * from pgfadvise_normal('tmp_t0'); relpath | os_page_size | rel_os_pages | os_pages_free --------------------+--------------+--------------+--------------- base/13580/16426 | 4096 | 262144 | 1899337 base/13580/16426.1 | 4096 | 82684 | 1899337(2 rows)postgres=# select * from pgfincore ('tmp_t0'); relpath | segment | os_page_size | rel_os_pages | pages_mem | group_mem | os_pages_free | databit | pages_dirty | group_dirty --------------------+---------+--------------+--------------+-----------+-----------+---------------+---------+-------------+------------- base/13580/16426 | 0 | 4096 | 262144 | 0 | 0 | 1899356 | | 0 | 0 base/13580/16426.1 | 1 | 4096 | 82684 | 0 | 0 | 1899356 | | 0 | 0
pgfadvise_loader
可以对页面缓存直接进行两方面的作用。因此,它能通过页面的位图在缓存中来对页面进行加载或卸载。它分为两个函数和上边的类似,就是设置缺省值,的输入参数是(relname, fork, segment, load, unload, databit)和(relname, segment, load, unload, databit),分别是对象名,文件分支名,文件序号,是否加载,是否卸载,位图信息。第二个函数默认fork为main。它输出的是物理文件名及path(relpath),OS page或block大小(os_page_size), OS中剩余的page数(os_pages_free),加载的page数(pages_load),卸载的page数(pages_unload)postgres=# select * from pgfadvise_loader('tmp_t0', 0, false, true, B'1100'); relpath | os_page_size | os_pages_free | pages_loaded | pages_unloaded ------------------+--------------+---------------+--------------+---------------- base/13580/16426 | 4096 | 1899322 | 0 | 2postgres=# select * from pgfadvise_loader('tmp_t0', 0, true, false, B'1100'); relpath | os_page_size | os_pages_free | pages_loaded | pages_unloaded ------------------+--------------+---------------+--------------+---------------- base/13580/16426 | 4096 | 1899287 | 2 | 0(1 row)postgres=# select * from pgfadvise_loader('tmp_t0', 0, true, true, B'1100'); relpath | os_page_size | os_pages_free | pages_loaded | pages_unloaded ------------------+--------------+---------------+--------------+---------------- base/13580/16426 | 4096 | 1899391 | 2 | 2(1 row)
pg_dropcache利用了bufmgr.h中的函数。但是这个插件的缺点就是不支持13.2,直接编译就会报错,这插件3年没有更新了。
DropDatabaseBuffers函数将移除所有的buffer cache。但是要注意得一点是脏页只是被简单的丢弃了,没有先写入磁盘,该操作风险极大。
4.1 pg_dropcache的部署
To install pg_dropcache clone this repository and run:
make install USE_PGXS=1
Then in psql (or any other client) execute:
create extension pg_dropcache;
pg_dropcache is a PostgreSQL extension that invalidates shared_buffers cache
4.2 pg_dropcache的演示
postgres=# select pg_prewarm('tmp_t0', 'buffer', 'main'); pg_prewarm ------------ 172414(1 row)postgres=# SELECT c.relname, pg_size_pretty(count(*) * 8192) as buffered, round(100.0 * count(*) / (SELECT setting FROM pg_settings WHERE name='shared_buffers')::integer, 1) AS buffer_percent, round(100.0 * count(*) * 8192 / pg_table_size(c.oid), 1) AS percent_of_relation FROM pg_class c INNER JOIN pg_buffercache b ON b.relfilenode = c.relfilenode INNER JOIN pg_database d ON (b.reldatabase = d.oid AND d.datname = current_database()) GROUP BY c.oid, c.relname ORDER BY 3 DESC LIMIT 10; relname | buffered | buffer_percent | percent_of_relation -----------------------------------------------+------------+----------------+--------------------- tmp_t0 | 128 MB | 100.0 | 9.5 pg_constraint_conrelid_contypid_conname_index | 8192 bytes | 0.0 | 50.0(2 rows)postgres=# explain (buffers true, timing true, analyze true )select count(*) from tmp_t0; QUERY PLAN ---------------------------------------------------------------------------------------------------------------------------------------------- Finalize Aggregate (cost=225497.55..225497.56 rows=1 width=8) (actual time=483.540..484.797 rows=1 loops=1) Buffers: shared hit=16219 read=156195 -> Gather (cost=225497.33..225497.54 rows=2 width=8) (actual time=483.376..484.791 rows=3 loops=1) Workers Planned: 2 Workers Launched: 2 Buffers: shared hit=16219 read=156195 -> Partial Aggregate (cost=224497.33..224497.34 rows=1 width=8) (actual time=470.719..470.720 rows=1 loops=3) Buffers: shared hit=16219 read=156195 -> Parallel Seq Scan on tmp_t0 (cost=0.00..214080.67 rows=4166667 width=0) (actual time=0.025..318.750 rows=3333333 loops=3) Buffers: shared hit=16219 read=156195 Planning Time: 0.037 ms Execution Time: 484.819 ms(12 rows)postgres=# select pg_dropcache(); pg_dropcache -------------- (1 row)postgres=# explain (buffers true, timing true, analyze true )select count(*) from tmp_t0; QUERY PLAN ---------------------------------------------------------------------------------------------------------------------------------------------- Finalize Aggregate (cost=225497.55..225497.56 rows=1 width=8) (actual time=985.393..990.863 rows=1 loops=1) Buffers: shared hit=96 read=172318 -> Gather (cost=225497.33..225497.54 rows=2 width=8) (actual time=985.216..990.856 rows=3 loops=1) Workers Planned: 2 Workers Launched: 2 Buffers: shared hit=96 read=172318 -> Partial Aggregate (cost=224497.33..224497.34 rows=1 width=8) (actual time=965.922..965.923 rows=1 loops=3) Buffers: shared hit=96 read=172318 -> Parallel Seq Scan on tmp_t0 (cost=0.00..214080.67 rows=4166667 width=0) (actual time=0.102..796.406 rows=3333333 loops=3) Buffers: shared hit=96 read=172318 Planning Time: 0.036 ms Execution Time: 990.888 ms(12 rows)postgres=# SELECT c.relname, pg_size_pretty(count(*) * 8192) as buffered, round(100.0 * count(*) / (SELECT setting FROM pg_settings WHERE name='shared_buffers')::integer, 1) AS buffer_percent, round(100.0 * count(*) * 8192 / pg_table_size(c.oid), 1) AS percent_of_relation FROM pg_class c INNER JOIN pg_buffercache b ON b.relfilenode = c.relfilenode INNER JOIN pg_database d ON (b.reldatabase = d.oid AND d.datname = current_database()) GROUP BY c.oid, c.relname ORDER BY 3 DESC LIMIT 10; relname | buffered | buffer_percent | percent_of_relation ---------------------------+------------+----------------+--------------------- tmp_t0 | 768 kB | 0.6 | 0.1 pg_am | 8192 bytes | 0.0 | 100.0 pg_amproc | 8192 bytes | 0.0 | 25.0 pg_index | 16 kB | 0.0 | 50.0 pg_namespace | 8192 bytes | 0.0 | 50.0 pg_opclass | 8192 bytes | 0.0 | 50.0 pg_aggregate_fnoid_index | 16 kB | 0.0 | 100.0 pg_amproc_fam_proc_index | 24 kB | 0.0 | 75.0 pg_aggregate | 8192 bytes | 0.0 | 33.3 pg_index_indexrelid_index | 16 kB | 0.0 | 100.0(10 rows)
pg_buffercache 既使用自身的缓冲区,也使用内核缓冲IO。这意味着数据会在内存中存储两次,首先是存入PostgreSQL缓冲区,然后是内核缓冲区。这被称为双重缓冲区处理。pg_prewarm 直接利用系统缓存的代码,对操作系统发出异步prefetch请求,在应用中,尤其在OLAP的情况下,对于大表的分析等等是非常耗费查询的时间的,而即使我们使用select table的方式,这张表也并不可能将所有的数据都装载到内存中,而pg_prewarm的功能就是完成一个张表全部进入到内存中的功能。pgfincore 主要是利用 POSIX 的 posix_fadvise 函数,pgfadvise_loader_file 中可以看到它的调用,支持两种模式:POSIX_FADV_WILLNEED与POSIX_FADV_DONTNEEDpg_dropcache 注意事项为脏页将被丢弃,因此它们不会被刷新到磁盘上,建议生产上慎用。