前言
最近,读了德哥的文章,里边提到了PG 17支持copy跳过错误行,但是还不支持记录跳过的错误行,以及跳过错误行的上限数的配置。嗯,有了这个结论,就想试着验证一下。看来,社区也意识到这方面的功能需要加强了。虽然是挤牙膏式的改进,但有总比没有强。
这里简单回顾一下,psql执行文件关于事物提交的特点,copy从文件导入时事务的特点,甚至psql命令行开启事务以后执行语句的特点。
实例
psql直接执行文件
这种方式,会自动跳过那些执行失败的SQL语句,而成功的则会自动提交。请看下边的示例:
psql -c "create table t(id int)"
CREATE TABLE
试着创建并执行简单的 SQL文件:
cat>>/tmp/tmp.sql<<EOF
INSERT INTO t VALUES (1) ;
select pg_current_xact_id();
INSERT INTO t VALUES('txt');
INSERT INTO t VALUES (2) ;
select pg_current_xact_id();
EOF
执行一下看看:
psql < /tmp/tmp.sql
INSERT 0 1
pg_current_xact_id
--------------------
794
(1 row)
ERROR: invalid input syntax for type integer: "txt"
LINE 1: INSERT INTO t VALUES('txt');
^
INSERT 0 1
pg_current_xact_id
--------------------
796
(1 row)
psql -c "select * from t"
id
----
1
2
(2 rows)
我们很明显的能看到,它会跳过中间失败的语句,并且每个成功的,会自动往上提交。
这个功能很便利,并且从执行结果来看,你也很清楚哪些语句是执行失败的。
psql命令行事务块执行多条语句
我们还是接着上边的示例,开启事务,重新来一下:
postgres=# truncate t;
TRUNCATE TABLE
postgres=# begin;
BEGIN
postgres=*# NSERT INTO t VALUES (1) ;
ERROR: syntax error at or near "NSERT"
LINE 1: NSERT INTO t VALUES (1) ;
^
postgres=!# select pg_current_xact_id();
ERROR: current transaction is aborted, commands ignored until end of transaction block
postgres=!# INSERT INTO t VALUES('txt');
ERROR: current transaction is aborted, commands ignored until end of transaction block
postgres=!# INSERT INTO t VALUES (2) ;
ERROR: current transaction is aborted, commands ignored until end of transaction block
postgres=!# select pg_current_xact_id();
ERROR: current transaction is aborted, commands ignored until end of transaction block
postgres=!# commit;
ROLLBACK
postgres=# select * from t;
id
----
(0 rows)
从上边的结果也能看出,中间只要有一条语句失败,那么整个事务必须先回滚一下,所有的操作全部回退。这跟别的DBMS的执行效果完全不一样。其它数据库基本上是失败的语句那条会自动回滚,成功的会在后边commit的时候自动提交。
PostgreSQL能否达到相同的目的呢?
有两个开关设置:
1、ON_ERROR_STOP
当它设置为ON时,会在第一处出错的时候,停止提交后边的语句,并且报错。而出错前边的语句,还是会自动提交。看下例
postgres=# \set ON_ERROR_STOP on
postgres=# begin;
BEGIN
postgres=*# INSERT INTO t VALUES (1) ;
INSERT 0 1
postgres=*# select pg_current_xact_id();
pg_current_xact_id
--------------------
798
(1 row)
postgres=*# INSERT INTO t VALUES('txt');
ERROR: invalid input syntax for type integer: "txt"
LINE 1: INSERT INTO t VALUES('txt');
^
postgres=!# INSERT INTO t VALUES (2) ;
ERROR: current transaction is aborted, commands ignored until end of transaction block
postgres=!# select pg_current_xact_id();
ERROR: current transaction is aborted, commands ignored until end of transaction block
postgres=!# commit;
ROLLBACK
2、ON_ERROR_ROLLBACK
这个选项很好,它能在碰到错误的时候,自动回滚。这个行为,就保持了与其它数据库的行为基本一致。如下例:
postgres=# \set ON_ERROR_ROLLBACK on
postgres=# begin;
BEGIN
postgres=*# NSERT INTO t VALUES (1) ;
ERROR: syntax error at or near "NSERT"
LINE 1: NSERT INTO t VALUES (1) ;
^
postgres=*# select pg_current_xact_id();
pg_current_xact_id
--------------------
799
(1 row)
postgres=*# INSERT INTO t VALUES('txt');
ERROR: invalid input syntax for type integer: "txt"
LINE 1: INSERT INTO t VALUES('txt');
^
postgres=*# INSERT INTO t VALUES (2) ;
INSERT 0 1
postgres=*# select pg_current_xact_id();
pg_current_xact_id
--------------------
799
(1 row)
postgres=*# commit;
COMMIT
所以,如果想跟其它DBMS基本一致,你就可以将ON_ERROR_ROLLBACK默认设为ON。
copy从文件导入数据
老版本16.3
请看下边一则简单的示例:
postgres=# drop table t;
DROP TABLE
postgres=# create table t(id int, col2 varchar(32), col3 varchar(32));
CREATE TABLE
postgres=# insert into t values(1, 'abc,def', 'abc,aaa');
INSERT 0 1
postgres=# \copy t to /tmp/t.dat csv;
COPY 1
postgres=# \! cat /tmp/t.dat
1,"abc,def","abc,aaa"
我们再简单的truncate一下,并导入:
postgres=# truncate t;
TRUNCATE TABLE
postgres=# \copy t from /tmp/t.dat;
ERROR: invalid input syntax for type integer: "1,"abc,def","abc,aaa""
CONTEXT: COPY t, line 1, column id: "1,"abc,def","abc,aaa""
postgres=# \copy t from /tmp/t.dat csv quote '"';
COPY 1
我们手动在里边加点错误的数据试试:
cat <<EOF>>/tmp/t.dat
abc,"ddd"
3,"a","b"
EOF
cat /tmp/t.dat
1,"abc,def","abc,aaa"
abc,"ddd"
3,"a","b"
测试验证:
postgres=# truncate t;
TRUNCATE TABLE
postgres=# \copy t from /tmp/t.dat csv quote '"';
ERROR: invalid input syntax for type integer: "abc"
CONTEXT: COPY t, line 2, column id: "abc"
postgres=# select * from t;
id | col2 | col3
----+------+------
(0 rows)
-- 你可以看到:ON_ERROR_ROLLBACK值是ON哦。
postgres=# \echo :ON_ERROR_ROLLBACK
on
postgres=# select version();
version
--------------------------------------------------------------------------------------------------------------------------------
PostgreSQL 16.3 [By SeanHe] on aarch64-apple-darwin23.6.0, compiled by Apple clang version 15.0.0 (clang-1500.1.0.2.5), 64-bit
(1 row)
默认情况,有错的话,会以事务的形式,不提交。一条数据也导不进去。这个就是让人恼火的地方(遇到海量数据,会头疼)
下边试验的版本是16.3
PG 17中是啥样的?
5:47:29 opt psql
psql (17beta2 [By Sean])
Type "help" for help.
postgres=#
再试一次:
create table t(id int, col2 varchar(32), col3 varchar(32));
postgres=# \copy t from /tmp/t.dat with (format csv, quote '"');
2024-08-13 05:58:15.411 CST [4057] ERROR: invalid input syntax for type integer: "abc"
2024-08-13 05:58:15.411 CST [4057] CONTEXT: COPY t, line 2, column id: "abc"
2024-08-13 05:58:15.411 CST [4057] STATEMENT: COPY t FROM STDIN with (format csv, quote '"');
ERROR: invalid input syntax for type integer: "abc"
CONTEXT: COPY t, line 2, column id: "abc"
postgres=# select * from t;
id | col2 | col3
----+------+------
(0 rows)
默认情况下,跟老版本行为一样。
PG 17当中加了一些特殊的选项:
postgres=# \h copy
Command: COPY
Description: copy data between a file and a table
Syntax:
COPY table_name [ ( column_name [, ...] ) ]
FROM { 'filename' | PROGRAM 'command' | STDIN }
[ [ WITH ] ( option [, ...] ) ]
[ WHERE condition ]
COPY { table_name [ ( column_name [, ...] ) ] | ( query ) }
TO { 'filename' | PROGRAM 'command' | STDOUT }
[ [ WITH ] ( option [, ...] ) ]
where option can be one of:
FORMAT format_name
FREEZE [ boolean ]
DELIMITER 'delimiter_character'
NULL 'null_string'
DEFAULT 'default_string'
HEADER [ boolean | MATCH ]
QUOTE 'quote_character'
ESCAPE 'escape_character'
FORCE_QUOTE { ( column_name [, ...] ) | * }
FORCE_NOT_NULL { ( column_name [, ...] ) | * }
FORCE_NULL { ( column_name [, ...] ) | * }
ON_ERROR error_action
ENCODING 'encoding_name'
LOG_VERBOSITY verbosity
URL: https://www.postgresql.org/docs/17/sql-copy.html
我们看看ON_ERROR的说明:
ON_ERROR
Specifies how to behave when encountering an error converting a column's input value into its data type. An
error_action
value ofstop
means fail the command, whileignore
means discard the input row and continue with the next one. The default isstop
.The
ignore
option is applicable only forCOPY FROM
when theFORMAT
istext
orcsv
.A
NOTICE
message containing the ignored row count is emitted at the end of theCOPY FROM
if at least one row was discarded. WhenLOG_VERBOSITY
option is set toverbose
, aNOTICE
message containing the line of the input file and the column name whose input conversion has failed is emitted for each discarded row.
设置ON_ERROR 为 'ignore', 试一下:
postgres=# \echo :ON_ERROR_ROLLBACK
on
postgres=# \copy t from /tmp/t.dat with (format csv, quote '"', ON_ERROR 'ignore');
NOTICE: 1 row was skipped due to data type incompatibility
COPY 2
postgres=# select * from t;
id | col2 | col3
----+---------+---------
1 | abc,def | abc,aaa
3 | a | b
(2 rows)
能看到第2行值自动跳过了。但是它不会明确告诉你出错的是第几行。不够完美啊。
总结
-
psql 命令行执行文件,可以跳过出错的行,继续执行
-
psql命令行事务块中多语句执行,默认如果中间有错,整个回滚。使用选项:
\set ON_ERROR_ROLLBACK on
, 可以与其它数据库保持一致,出错的自动回滚,其它语句继续执行 -
PG 17针对copy, 添加了ON_ERROR options支持,默认行为是stop, 将其指定为ignore 时(只对text/csv格式有效),可以忽略错误的行,继续执行。
针对这种copy, 其实,如果性能可控,我更愿意用自己实现简单的能用的程序使用batch insert/update来处理,完全可控。复杂的另说。
往期导读: