Redshift の UDF では非 ASCII 文字を含む unicode 型を返すとエラーになる
次のように unicode 型を返す UDF を定義すると、返り値に非 ASCII 文字を含む場合にエラーになります。
CREATE OR REPLACE FUNCTION f_decode_utf8 (s varchar(max)) RETURNS varchar(max) IMMUTABLE AS $$
return s.decode('utf-8')
$$ LANGUAGE plpythonu;
> select f_decode_utf8('a');
f_decode_utf8
---------------
a
(1 row)
> select f_decode_utf8('あ');
ERROR: UnicodeEncodeError: 'ascii' codec can't encode character u'\u3042' in position 0: ordinal not in range(128). Please look at svl_udf_log for more information
DETAIL:
-----------------------------------------------
error: UnicodeEncodeError: 'ascii' codec can't encode character u'\u3042' in position 0: ordinal not in range(128). Please look at svl_udf_log for more information
code: 10000
context: UDF
query: 0
location: udf_client.cpp:350
process: padbmaster [pid=29161]
-----------------------------------------------
わざわざ decode(‘utf-8’) する人はいないと思いますが、例えば JSON を渡して json.loads すると unicode 型になるので、意味不明なエラーに悩まされることになります。
-- json_extract_array_element_text を UDF で定義してみる
CREATE OR REPLACE FUNCTION f_json_extract_array_element_text (json_string varchar(max), pos integer) RETURNS varchar(max) IMMUTABLE AS $$
import json
return json.loads(json_string)[pos]
$$ LANGUAGE plpythonu;
> select f_json_extract_array_element_text('["あ"]', 0);
ERROR: UnicodeEncodeError: 'ascii' codec can't encode character u'\u3042' in position 0: ordinal not in range(128). Please look at svl_udf_log for more information
DETAIL:
-----------------------------------------------
error: UnicodeEncodeError: 'ascii' codec can't encode character u'\u3042' in position 0: ordinal not in range(128). Please look at svl_udf_log for more information
code: 10000
context: UDF
query: 0
location: udf_client.cpp:350
process: padbmaster [pid=29161]
-----------------------------------------------
次のように最後に encode(‘utf-8’) すればエラーになりません。
CREATE OR REPLACE FUNCTION f_json_extract_array_element_text (json_string varchar(max), pos integer) RETURNS varchar(max) IMMUTABLE AS $$
import json
return json.loads(json_string)[pos].encode('utf-8')
$$ LANGUAGE plpythonu;
> select f_json_extract_array_element_text('["あ"]', 0);
f_json_extract_array_element_text
-----------------------------------
あ
(1 row)