I have the following table in Postgres:
col1       |col2|col3                  |col4             |
-----------|----|----------------------|-----------------|
NCT04132960|Drug|ds-8201a              |ds 8201a         |
NCT04132961|Drug|ds 8201a              |ds 8201a         |
NCT04132962|Drug|hydrogen sulphate     |hydrogen sulphate|
NCT04132962|Drug|hydrogen sulphate     |hydrogen         |
NCT04132962|Drug|hydrogen sulphate     |sulphate         |
NCT04132963|Drug|bupropion (wellbutrin)|wellbutrin       |
NCT04132964|Drug|bupropion (wellbutrin)|bupropion        |
NCT04132964|Drug|bupropion (wellbutrin)|bupropione       |
NCT04132964|Drug|bupropion (wellbutrin)|bup              |
NCT04132965|Drug|Galantimine           |Galantimine      |
NCT04132965|Drug|Galantimine           |Galantimin       |
NCT04132966|Drug|Glucose               |null             |

I am trying to clean the table above based on the following filters:
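a. col3 should be similar or identical to col4
b. length of col4 > 3
c. include the row if col4 is null

When I run the query below, I match every row except row 1 (missed because of the special character), and I also get false positives. Here are my query and its output: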
SELECT *
FROM table
WHERE col3 ~* ('\y' || REGEXP_REPLACE(col4, '\+|\-', '\\+', 'g') || '\y') -- replace regexp-specific signs
and length(col4) > 3
or col4 is null
order by col1
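;

How can I drop the rows where col3 != col4 while ignoring special characters, so that pairs such as (ds-8201a, ds 8201a) and (bupropion (wellbutrin), wellbutrin) still count as matches, and keep the rows where col4 is null?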
Current output:
col1       |col2|col3                  |col4             |
-----------|----|----------------------|-----------------|
NCT04132961|Drug|ds 8201a              |ds 8201a         |
NCT04132962|Drug|hydrogen sulphate     |hydrogen sulphate|
NCT04132962|Drug|hydrogen sulphate     |hydrogen         |
NCT04132962|Drug|hydrogen sulphate     |sulphate         |
NCT04132963|Drug|bupropion (wellbutrin)|wellbutrin       |
NCT04132964|Drug|bupropion (wellbutrin)|bupropion        |
NCT04132964|Drug|bupropion (wellbutrin)|bupropione       |
NCT04132965|Drug|Galantimine           |Galantimine      |
NCT04132965|Drug|Galantimine           |Galantimin       |
NCT04132966|Drug|Glucose               |null             |

Desired output:
col1       |col2|col3                  |col4             |
-----------|----|----------------------|-----------------|
NCT04132960|Drug|ds-8201a              |ds 8201a         |
NCT04132961|Drug|ds 8201a              |ds 8201a         |
NCT04132962|Drug|hydrogen sulphate     |hydrogen sulphate|
NCT04132963|Drug|bupropion (wellbutrin)|wellbutrin       |
NCT04132964|Drug|bupropion (wellbutrin)|bupropion        |
NCT04132965|Drug|Galantimine           |Galantimine      |
NCT04132966|Drug|Glucose               |null             |

Can this kind of data cleaning be handled in Postgres, or should it be done programmatically?
Thanks

Posted on 2020-06-06 17:57:15
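I suggest that you "normalize" col3 and col4 before making this comparison, for example by replacing every non-alphanumeric character with a hash sign (#). It will then work as you expect.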
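Since the question also asks whether this could be done programmatically, here is a minimal Python sketch of the same normalize-then-match idea. It is an illustration under assumptions, not part of the original answer: the function names `normalize` and `keep_row` are hypothetical, Python's `\b` stands in for Postgres's `\y`, and a null col4 is represented as `None`.

```python
import re


def normalize(text):
    # Replace every character that is not a letter, digit or underscore
    # with '#', mirroring regexp_replace(col, '[^\w]', '#', 'g').
    return re.sub(r"[^\w]", "#", text)


def keep_row(col3, col4):
    # Hypothetical helper: applies filters a, b and c from the question.
    # Filter c: keep the row when col4 is null.
    if col4 is None:
        return True
    # Filter b: col4 must be longer than 3 characters.
    if len(col4) <= 3:
        return False
    # Filter a: the normalized col4 must occur in the normalized col3
    # as a whole word, compared case-insensitively.
    pattern = r"\b" + re.escape(normalize(col4)) + r"\b"
    return re.search(pattern, normalize(col3), flags=re.IGNORECASE) is not None


# Example calls using values from the question's table.
print(keep_row("ds-8201a", "ds 8201a"))
print(keep_row("Galantimine", "Galantimin"))
print(keep_row("Glucose", None))
```

In Postgres the same filter is written as the query that follows.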
SELECT *
FROM tbl
WHERE
  regexp_replace(col3, '[^\w]', '#', 'g') ~* ('\y' || regexp_replace(col4, '[^\w]', '#', 'g') || '\y')
  and length(col4) > 3
  or col4 is null
order by col1;

Result:
col1       |col2|col3                  |col4             |
-----------|----|----------------------|-----------------|
NCT04132960|Drug|ds-8201a              |ds 8201a         |
NCT04132961|Drug|ds 8201a              |ds 8201a         |
NCT04132962|Drug|hydrogen sulphate     |hydrogen         |
NCT04132962|Drug|hydrogen sulphate     |sulphate         |
NCT04132962|Drug|hydrogen sulphate     |hydrogen sulphate|
NCT04132963|Drug|bupropion (wellbutrin)|wellbutrin       |
NCT04132964|Drug|bupropion (wellbutrin)|bupropion        |
NCT04132965|Drug|Galantimine           |Galantimine      |
NCT04132966|Drug|Glucose               |                 |

https://stackoverflow.com/questions/62228390
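The result of the answer's query still contains the rows where col4 is a single word of col3 (hydrogen, sulphate), which the question's desired output leaves out. One possible stricter rule, sketched here in Python, is to accept col4 only when it equals the whole of col3, the part of col3 outside a trailing parenthesis, or the part inside it, after ignoring case and special characters. This rule is an assumption inferred from the desired output, not something stated in the thread, and the names `canon`, `candidates` and `keep_row_strict` are hypothetical. The sketch also assumes ASCII drug names.

```python
import re


def canon(text):
    # Lower-case the text and collapse each run of characters other than
    # ASCII letters and digits into a single space.
    return re.sub(r"[^0-9a-z]+", " ", text.lower()).strip()


def candidates(col3):
    # Names that col4 may equal: the full col3, plus the parts outside and
    # inside a trailing parenthesis such as "bupropion (wellbutrin)".
    names = {canon(col3)}
    m = re.fullmatch(r"\s*(.*?)\s*\((.*)\)\s*", col3)
    if m:
        names.add(canon(m.group(1)))
        names.add(canon(m.group(2)))
    names.discard("")  # never match on an empty name
    return names


def keep_row_strict(col3, col4):
    if col4 is None:        # filter c: keep rows with a null col4
        return True
    if len(col4) <= 3:      # filter b: col4 must be longer than 3
        return False
    return canon(col4) in candidates(col3)  # assumed stricter filter a


# The (col1, col3, col4) values from the question's table.
ROWS = [
    ("NCT04132960", "ds-8201a", "ds 8201a"),
    ("NCT04132961", "ds 8201a", "ds 8201a"),
    ("NCT04132962", "hydrogen sulphate", "hydrogen sulphate"),
    ("NCT04132962", "hydrogen sulphate", "hydrogen"),
    ("NCT04132962", "hydrogen sulphate", "sulphate"),
    ("NCT04132963", "bupropion (wellbutrin)", "wellbutrin"),
    ("NCT04132964", "bupropion (wellbutrin)", "bupropion"),
    ("NCT04132964", "bupropion (wellbutrin)", "bupropione"),
    ("NCT04132964", "bupropion (wellbutrin)", "bup"),
    ("NCT04132965", "Galantimine", "Galantimine"),
    ("NCT04132965", "Galantimine", "Galantimin"),
    ("NCT04132966", "Glucose", None),
]

kept = [row for row in ROWS if keep_row_strict(row[1], row[2])]
for row in kept:
    print(row)
```

On the question's sample data this keeps the seven rows listed under the desired output. Real data may need a different notion of "similar", for example to tolerate misspellings.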