Q: Bash - extract all URLs except specific URLs

Unix & Linux user

Asked 2018-03-08 02:16:30

Answers: 2 · Views: 412 · Followers: 0 · Votes: 0

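I have a file containing many lines with URLs. Some of those URLs I am not interested in processing or viewing, and I want them ignored. I only want the remaining ones displayed as output.

My command so far is: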

grep 'http://' data.txt | sed 's/.*\(http:.*\)\".*/\1/'

I want to exclude URLs that contain the following:

http://schemas.openxmlformats.org...

I am very new to this and would sincerely appreciate any help.

Update: this is the file I am working with:

Relationships Xmlns             : http://schemas.openxmlformats.org/package/2006/relationships
Relationships Relationship Type : http://schemas.openxmlformats.org/officeDocument/2006/relationships/extended-properties
Style Sheet Xmlns               : http://schemas.openxmlformats.org/spreadsheetml/2006/main
Relationships Xmlns             : http://schemas.openxmlformats.org/package/2006/relationships
Relationships Relationship Type : http://schemas.openxmlformats.org/officeDocument/2006/relationships/sharedStrings
Workbook Xmlns                  : http://schemas.openxmlformats.org/spreadsheetml/2006/main
Relationships Xmlns             : http://schemas.openxmlformats.org/package/2006/relationships
Relationships Relationship Type : http://schemas.openxmlformats.org/officeDocument/2006/relationships/hyperlink
Relationships Relationship Target: http://www.yahoo.com/
Worksheet Xmlns                 : http://schemas.openxmlformats.org/spreadsheetml/2006/main
Sst Xmlns                       : http://schemas.openxmlformats.org/spreadsheetml/2006/main
Types Xmlns                     : http://schemas.openxmlformats.org/package/2006/content-types
Properties Xmlns                : http://schemas.openxmlformats.org/officeDocument/2006/extended-properties

I want to extract only http://www.yahoo.com and ignore the rest, because they contain schemas.openxmlformats

shell-script

bash

2 Answers

Unix & Linux user

Accepted answer

Posted 2018-03-08 02:30:59

I think this can be done with sed alone:

sed -n '\,http://schemas.openxmlformats.org,!s/.*\(http:.*\).*/\1/p'

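-n disables automatic printing, so only the selected lines are printed.
\,http://schemas.openxmlformats.org,! runs the command that follows only on lines that do not match http://schemas.openxmlformats.org (hence the ! at the end). I used , instead of / as the regex delimiter here, hence the \, at the start. This reduces the need to escape / in the pattern.
The s command is the same as yours apart from the dropped \", but I use p after it to print the line, which by then contains only the URL.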
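To see that the custom delimiter only changes the notation, the sketch below runs the same address both ways against a three-line sample taken from the question's file. The variable names sample, with_comma and with_slash are illustrative only.

```shell
#!/bin/sh
# Three lines copied from the file in the question.
sample='Relationships Xmlns             : http://schemas.openxmlformats.org/package/2006/relationships
Relationships Relationship Target: http://www.yahoo.com/
Worksheet Xmlns                 : http://schemas.openxmlformats.org/spreadsheetml/2006/main'

# Address written with "," as the delimiter: no slashes need escaping.
with_comma=$(printf '%s\n' "$sample" |
    sed -n '\,http://schemas.openxmlformats.org,!s/.*\(http:.*\)/\1/p')

# The same address with the default "/" delimiter: every slash is escaped.
with_slash=$(printf '%s\n' "$sample" |
    sed -n '/http:\/\/schemas.openxmlformats.org/!s/.*\(http:.*\)/\1/p')

printf '%s\n' "$with_comma"   # prints: http://www.yahoo.com/
printf '%s\n' "$with_slash"   # prints: http://www.yahoo.com/
```

Both forms select the same lines; the comma version is simply easier to read when the pattern itself is a URL.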

I assume there is only one URL per line.
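If that assumption does not hold, one possible variation (a sketch, not part of the answer as posted) is to let grep -o print each match on its own line and then filter. Note that -o is not specified by POSIX, although GNU and BSD grep both support it. The function name extract_urls, the URL pattern, and the sample lines are all hypothetical.

```shell
#!/bin/sh
# Reads text on stdin, prints one URL per line, drops the schema URLs.
# The pattern stops a URL at whitespace or a double quote (an assumption
# about how URLs are delimited in the input).
extract_urls() {
    grep -o 'http://[^[:space:]"]*' | grep -v 'schemas\.openxmlformats\.org'
}

# Hypothetical input: two URLs on the first line, a quoted URL on the second.
urls=$(extract_urls <<'EOF'
A: http://schemas.openxmlformats.org/package/2006/relationships B: http://www.yahoo.com/
C: "http://example.com/page"
EOF
)

printf '%s\n' "$urls"
# prints:
# http://www.yahoo.com/
# http://example.com/page
```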

Removing the extra quote (the \" in your sed expression) makes the output correct:

$ sed -n '\,http://schemas.openxmlformats.org,!s/.*\(http:.*\).*/\1/p' input-file
http://www.yahoo.com/

Votes: 1

Unix & Linux user

Posted 2018-03-08 02:37:25

Using grep with the -v option selects the lines that do not match. For example, given a file file.txt with the following contents:

first line
second line
third line
fourth text

With this command:

grep "line" file.txt | grep -v "second"

the result will be:

first line
third line

To exclude several words at once, you can use a regular expression like this:

grep "line" file.txt | grep -vE "(second|first)"

the result will be:

third line
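As a side note, the same exclusion can also be written without an alternation by repeating -e, which POSIX grep accepts. The sketch below feeds the four sample lines through a pipe instead of file.txt; the variable name result is illustrative.

```shell
#!/bin/sh
# Same four lines as file.txt above, supplied on stdin.
result=$(printf '%s\n' 'first line' 'second line' 'third line' 'fourth text' |
    grep 'line' |
    grep -v -e 'second' -e 'first')   # drop lines matching either pattern

printf '%s\n' "$result"   # prints: third line
```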

After the question update:

In that case, you can use either of the following:

grep 'http://www.yahoo' data.txt | sed 's/.*\(http:.*\)/\1/'
grep 'http://' data.txt | sed 's/.*\(http:.*\)/\1/' | grep yahoo

The first gives you only the www.yahoo URL.

The second gives you every URL that contains yahoo anywhere in it.

To extract all URLs except some:

grep 'http://' data.txt | sed 's/.*\(http:.*\)/\1/' | grep -vE "(openxmlformats|<Another URL to exclude>)"
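When the list of exclusions grows, one option is to keep it in a file and pass it with -f, using -F so that each entry is treated as a literal string and the dots need no escaping. This is only a sketch: the temporary files, the example.invalid entry, and the variable names are hypothetical, and mktemp is assumed to be available.

```shell
#!/bin/sh
exclude=$(mktemp)   # one literal string to exclude per line
data=$(mktemp)      # stands in for data.txt

printf '%s\n' 'schemas.openxmlformats.org' 'example.invalid' > "$exclude"

cat > "$data" <<'EOF'
Relationships Xmlns             : http://schemas.openxmlformats.org/package/2006/relationships
Relationships Relationship Target: http://www.yahoo.com/
Hypothetical Target              : http://example.invalid/page
EOF

# Extract the URLs as before, then drop any that contain a listed string.
urls=$(grep 'http://' "$data" | sed 's/.*\(http:.*\)/\1/' | grep -vFf "$exclude")

printf '%s\n' "$urls"   # prints: http://www.yahoo.com/
rm -f "$exclude" "$data"
```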

Votes: 1

Original page content provided by Unix & Linux. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link:

https://unix.stackexchange.com/questions/428901
