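What is the best way to extract user@host.com-style email addresses from a large set of files?
I assume sed/awk can do this, but I am not very familiar with regular expressions.
We have a file, Staff_data.txt, that contains more than just email addresses. We want to strip out the rest of the data and collect only the addresses (e.g. h@south.com).
I think the simplest way is sed/awk in a terminal, but given how complicated regexps can get, I would appreciate some guidance.
Thanks.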
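Posted 2013-05-01 00:56:48
You want grep here, not sed or awk. The -o option prints only the part of each line that matches. For example, to show every email address from the domain south.com: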
grep -o '[^ ]*@south\.com' file

Posted 2013-05-01 02:10:00
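As a follow-up to the grep one-liner above: when the addresses are not limited to one domain, the same -o approach can be widened to any address-shaped token. This is a minimal sketch, assuming a grep that supports -o and -E (GNU and BSD grep both do). The function name extract_emails is made up for illustration, and the pattern is a pragmatic approximation, not a full RFC 5322 matcher.

```shell
# Hypothetical helper (not from the original answer): print every
# email-like token found in the given files (or stdin), one per line,
# with duplicates removed.
extract_emails() {
    # cat "$@" reads the named files, or stdin when no file is given,
    # so grep never prefixes its output with file names.
    cat "$@" |
    grep -oE '[[:alnum:]._%+-]+@[[:alnum:].-]+\.[[:alpha:]]{2,}' |
    sort -u
}
```

For the file in the question this could be run as: extract_emails Staff_data.txt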
Here is a somewhat clunky but apparently working script I wrote a few years ago to do this job:
# Get rid of any Message-Id line like this:
# Message-ID: <AANLkTinSDG_dySv_oy_7jWBD=QWiHUMpSEFtE-cxP6Y1@mail.gmail.com>
#
# Change any character that can't be in an email address to a space.
#
# Print just the character strings that look like email addresses.
#
# Drop anything with multiple "@"s and change any domain names (i.e.
# the part after the "@") to all lower case as those are not case-sensitive.
#
# If we have a local mail box part (i.e. the part before the "@") that's
# a mix of upper/lower and another that's all lower, keep them both. Ditto
# for multiple versions of mixed case since we don't know which is correct.
#
# Sort uniquely.
cat "$@" |
awk '!/^Message-ID:/' |
awk '{gsub(/[^-_.@[:alnum:]]+/," ")}1' |
awk '{for (i=1;i<=NF;i++) if ($i ~ /.+@.+[.][[:alpha:]]+$/) print $i}' |
awk '
    BEGIN { FS=OFS="@" }
    NF != 2 { printf "Badly formatted %s skipped.\n",$0 | "cat>&2"; next }
    { $2=tolower($2); print }
' |
tr '[A-Z]' '[a-z]' |
sort -u

It's not pretty, but it seems robust.
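The steps listed in the script's header comments can also be folded into one awk process. The following is only a sketch of that idea, not the author's script. The function name extract_emails_awk is hypothetical, and it assumes an awk that understands POSIX character classes such as [[:alnum:]], as the original does. It deliberately leaves out the final tr step, so the part before the "@" keeps its case as the comments describe. Add tr back if you want everything lower-cased.

```shell
# Hypothetical single-process variant of the pipeline above.
extract_emails_awk() {
    cat "$@" |
    awk '
        /^Message-ID:/ { next }                 # skip Message-ID header lines
        {
            # Turn every run of characters that cannot appear in an
            # address into a space; awk then re-splits the line into fields.
            gsub(/[^-_.@[:alnum:]]+/, " ")
            for (i = 1; i <= NF; i++) {
                if ($i !~ /.+@.+[.][[:alpha:]]+$/) continue
                n = split($i, part, "@")
                if (n != 2) {                   # more than one "@" in the token
                    printf "Badly formatted %s skipped.\n", $i | "cat >&2"
                    continue
                }
                print part[1] "@" tolower(part[2])   # lower-case the domain only
            }
        }
    ' |
    sort -u
}
```

Like the original, it reads the files named on the command line, or stdin when none are given, for example: extract_emails_awk Staff_data.txt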
https://stackoverflow.com/questions/16305155