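perl -i -pe 's/(,\h*"[^\n"]*)\n/$1 /g' /opt/data-integration/transfer/events/processing/Master_Events_List.csv
What is happening here? I tried a translator, but it was a bit vague. What are some examples of what this might return?
Posted on 2018-06-23 02:42:31
First of all, don't try to manipulate CSV (or XML or HTML) with regular expressions. While CSV might seem simple, it can be subtle. Use Text::CSV instead. The exception is if your CSV is malformed and you are fixing it.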
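Now, as for what your regex is doing. First, let's convert it from s// to s{}{}, which is a bit easier on the eyes, and use the /x modifier so we can space things out a bit.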
s{
    # Capture to $1
    (
        # A comma.
        ,
        # 0 or more `\h`, "horizontal whitespace": tabs and spaces
        \h*
        # A quote.
        "
        # 0 or more of anything which is not a quote or newline.
        [^\n"]*
    )
    # A newline (not captured)
    \n
}
# Put the captured bit in with a space after it.
# The `g` says to do it multiple times over the whole string.
{$1 }gx
It will change foo, "bar\n into foo, "bar followed by a space. I'm guessing it is turning text fields in the CSV that contain newlines into fields that contain only spaces.
foo, "first
field", "second
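field"
will become
foo, "first field", "second field"
This is something better handled with Text::CSV. I suspect the purpose of the transform is to help out CSV parsers that can't handle newlines. Text::CSV can, with a little coercing.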
#!/usr/bin/env perl
use strict;
use warnings;
use v5.10;
use autodie;
use Text::CSV;
use IO::Scalar;
use Data::Dumper;
# Pretend our scalar is an IO object so we can use `getline`.
my $str = qq[foo, "bar", "this\nthat"\n];
my $io = IO::Scalar->new(\$str);
# Configure Text::CSV
my $csv = Text::CSV->new({
    # Embedded newlines normally aren't allowed, this tells Text::CSV to
    # treat the content as binary instead.
    binary => 1,
    # Allow spaces between the cells.
    allow_whitespace => 1
});
# Use Text::CSV->getline() to do the parsing.
while( my $row = $csv->getline($io) ) {
    # Dump the contents of the row
    say Dumper $row;
}
It will correctly parse the row and its embedded newline.
$VAR1 = [
'foo',
'bar',
'this
that'
];
Posted on 2018-06-23 02:33:20
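Editing this to second Schwern (also upvoted): regular expressions seem a poor fit for processing CSV.
As for the regex in question, let's dissect it. Starting at the top level: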
's/(,\h*"[^\n"]*)\n/$1 /g'
The s/part1/part2/g expression means "replace part 1 with part 2 everywhere".
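As a minimal illustration of the s///g form, the sketch below contrasts a substitution with and without the g flag. The sample string and variable names are made up for the example and do not come from the question.

```perl
use strict;
use warnings;

my $text = "a-b-c";    # made-up sample input

# Without the g flag only the first match is replaced.
(my $first_only = $text) =~ s/-/+/;

# With the g flag every match in the string is replaced.
(my $everywhere = $text) =~ s/-/+/g;

print "$first_only\n";    # a+b-c
print "$everywhere\n";    # a+b+c
```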
Now let's look at "part 1":
(,\h*"[^\n"]*)\n
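The parentheses enclose a group. There is only one group, so it becomes group number 1. We'll get back to that in the next step.
Then, check out https://perldoc.perl.org/perlrebackslash.html for an explanation of the character classes. \h is horizontal whitespace and \n is a logical newline.
The expression inside the group says: "start with a comma, then any number of horizontal whitespace characters, then a double quote, then any number of characters other than a newline or a quote". After the group there must be a trailing newline. So it is basically a comma followed by the start of a quoted csv field that is still open when the line ends.
Finally, "part 2" reads: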
$1
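This is just a reference to the previously captured group number 1, followed by a space.
To sum up, the whole expression finds a trailing string field that does not end with a quote and replaces its newline terminator with a space.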
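To see this in action, here is a small sketch. The helper name join_broken_lines and the sample strings are my own assumptions for illustration; only the substitution itself comes from the question.

```perl
use strict;
use warnings;
use v5.10;    # \h needs Perl 5.10 or later

# Hypothetical helper: apply the question's substitution to a whole string.
sub join_broken_lines {
    my ($text) = @_;
    $text =~ s/(,\h*"[^\n"]*)\n/$1 /g;
    return $text;
}

# A quoted field broken over two lines gets joined with a space.
print join_broken_lines(qq{foo, "first\nfield", "second\nfield"\n});

# A field broken over three lines: only the first break is joined.
print join_broken_lines(qq{x, "a\nb\nc"\n});
```

Note the second call: a field that spans three or more lines only has its first newline replaced, because the later lines no longer begin with a comma and an opening quote. Lines without an unclosed quoted field are left untouched.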
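Posted on 2018-06-27 03:40:34
The best way to fix newlines in quoted fields that masquerade as end of record:
First of all, don't try to manipulate CSV (or XML or HTML) with modules. While CSV might seem tricky, it's extremely easy. Don't use Text::CSV. Instead, use a substitution regex with a callback.
Also, you can use the regex to parse the csv properly without replacing newlines, but you probably want to fix it with Perl so it can be used in other languages.
Regex (with trim)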
/((?:^|,|\r?\n))\s*(?:("[^"\\]*(?:\\[\S\s][^"\\]*)*"[^\S\r\n]*(?=$|,|\r?\n))|([^,\r\n]*(?=$|,|\r?\n)))/
Explanation
(                                # (1 start), Delimiter (comma or newline)
    (?: ^ | , | \r? \n )
)                                # (1 end)
\s*                              # Leading optional whitespaces ( this is for trim )
                                 # ( if no trim is desired, remove this, add
                                 #   [^\S\r\n]* to end of group 1 )
(?:
    (                            # (2 start), Quoted string field
        "                        # Quoted string
        [^"\\]*
        (?: \\ [\S\s] [^"\\]* )*
        "
        [^\S\r\n]*               # Trailing optional horizontal whitespaces
        (?= $ | , | \r? \n )     # Delimiter ahead (EOS, comma or newline)
    )                            # (2 end)
  |                              # OR
    (                            # (3 start), Non quoted field
        [^,\r\n]*                # Not comma or newline
        (?= $ | , | \r? \n )     # Delimiter ahead (EOS, comma or newline)
    )                            # (3 end)
)
(Note - this requires a script.)
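For the parsing use mentioned earlier, a rough sketch might look like the following. It assumes the no-trim variant described in the comment (drop \s* and add [^\S\r\n]* to the end of group 1). The names $field_re and parse_csv are hypothetical, and stripping the final newline first is my own choice so that it does not produce an empty trailing row.

```perl
use strict;
use warnings;

# No-trim variant of the regex above (an assumption, per its own comment).
my $field_re = qr/
    ( (?: ^ | , | \r? \n ) [^\S\r\n]* )     # (1) delimiter plus horizontal whitespace
    (?:
        (                                   # (2) quoted field
            " [^"\\]* (?: \\ [\S\s] [^"\\]* )* "
            [^\S\r\n]*
            (?= $ | , | \r? \n )
        )
      |
        (                                   # (3) non quoted field
            [^,\r\n]*
            (?= $ | , | \r? \n )
        )
    )
/x;

# Hypothetical helper: split CSV text into rows of fields.
sub parse_csv {
    my ($text) = @_;
    $text =~ s/\r?\n\z//;    # drop the final newline
    my @rows;
    while ( $text =~ /$field_re/g ) {
        my ($delim, $quoted, $plain) = ($1, $2, $3);
        # Start of string or a newline delimiter begins a new row.
        push @rows, [] if !@rows || $delim !~ /^,/;
        push @{ $rows[-1] }, defined $quoted ? $quoted : $plain;
    }
    return \@rows;
}

my $rows = parse_csv(qq{a,b\n"c\nd",e\n});
printf "%d rows, %d fields in row 2\n", scalar @$rows, scalar @{ $rows->[1] };
```

Quoted fields come back with their surrounding quotes intact, and the embedded newline stays inside the field instead of ending the record.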
Perl example
use strict;
use warnings;
$/ = undef;
sub RmvNLs {
    my ($delim, $quote, $non_quote) = @_;
    if ( defined $non_quote ) {
        return $delim . $non_quote;
    }
    $quote =~ s/\s*\r?\n/ /g;
    return $delim . $quote;
}
my $csv = <DATA>;
$csv =~ s/
    (                                # (1 start), Delimiter (comma or newline)
        (?: ^ | , | \r? \n )
    )                                # (1 end)
    \s*                              # Leading optional whitespaces ( this is for trim )
                                     # ( if no trim is desired, remove this, add [^\S\r\n]* to end of group 1 )
    (?:
        (                            # (2 start), Quoted string field
            "                        # Quoted string
            [^"\\]*
            (?: \\ [\S\s] [^"\\]* )*
            "
            [^\S\r\n]*               # Trailing optional horizontal whitespaces
            (?= $ | , | \r? \n )     # Delimiter ahead (EOS, comma or newline)
        )                            # (2 end)
      |                              # OR
        (                            # (3 start), Non quoted field
            [^,\r\n]*                # Not comma or newline
            (?= $ | , | \r? \n )     # Delimiter ahead (EOS, comma or newline)
        )                            # (3 end)
    )
/RmvNLs($1,$2,$3)/xeg;
print $csv;
__DATA__
497,50,2008-08-02T16:56:53Z,469,4,
"foo bar
foo
bar"
518,153,2008-08-02T17:42:28Z,469,2,"foo bar
bar"
hello
world
"asdfas"
ID,NAME,TITLE,DESCRIPTION,,
PRO1234,"JOHN SMITH",ENGINEER,"JOHN HAS BEEN WORKING
HARD ON BEING A GOOD
SERVENT."
PRO1235, "KEITH SMITH",ENGINEER,"keith has been working
hard on being a good
servent."
PRO1235,"KENNY SMITH",,"keith has been working
hard on being a good
servent."
PRO1235,"RICK SMITH",,,
Output
497,50,2008-08-02T16:56:53Z,469,4,"foo bar foo bar"
518,153,2008-08-02T17:42:28Z,469,2,"foo bar bar"
hello
world
"asdfas"
ID,NAME,TITLE,DESCRIPTION,,PRO1234,"JOHN SMITH",ENGINEER,"JOHN HAS BEEN WORKING HARD ON BEING A GOOD SERVENT."
PRO1235,"KEITH SMITH",ENGINEER,"keith has been working hard on being a good servent."
PRO1235,"KENNY SMITH",,"keith has been working hard on being a good servent."
PRO1235,"RICK SMITH",,,
https://stackoverflow.com/questions/50993647