Extract URLs from XML into a separate file on Linux using sed, awk, cat, or grep
Stack Overflow user
Asked 2014-06-06 15:59:37

I have an XML file containing many products, structured as shown below.

I want to extract every URL in this document and write them to a new file. Specifically, I want the text between these tags:

<url></url>

The URLs should go into a new txt file, one per line, so that the output is a list of URLs like this:

http://www.example.com/nav/rooms/kitchens/kitchen-worktops/gemstone_solid_surface_worktops/-specificproducttype-worktops/Cooke-and-Lewis-Gemstone-Triassic-Worktop-3050mm-13128613
http://www.example.com/nav/fix/nails-screws-fixings-hardware/furniture-hardware/legs___supports/-specificproducttype-furniture_legs/Rothley-Furniture-Leg-Angled-L501XN-Brushed-Nickel-Effect-H128mm-9281999
http://www.example.com/nav/fix/electrical/cable-management/cable_clips/Corelectric-Clips-Cable-Round-Polybag-Pk20-11348134
http://www.example.com/nav/fix/power-tool-accessories/router-bits/jointing_biscuits/Trend-T-Tech-Beech-Biscuit-No-10-TT-BSC-10-100-Pack-9288386
etc... 

Here is a sample of the XML; this block is repeated many times, once per product:

<product>
                          <id>13128613</id>
                          <name>Cooke &amp; Lewis Gemstone Triassic Worktop 3050mm</name>
                          <categoryId>9372151</categoryId>
                          <features>Edged 1 long, 2 short sides, No templating required reducing fitting complexities, time and cost, This stunning design is made from 85% recycled material including glass and shell, supporting environmental sustainability, A 6mm solid material bonded to a 28mm solid chipboard core, backed with a moisture resistant balance paper for complete water resistance, A hard surface that is resistant to daily wear and tear</features>
                          <url>http://www.example.com/nav/rooms/kitchens/kitchen-worktops/gemstone_solid_surface_worktops/-specificproducttype-worktops/Cooke-and-Lewis-Gemstone-Triassic-Worktop-3050mm-13128613</url>
                          <productHierarchy>Rooms &gt; Kitchens &gt; Kitchen Worktops &gt; Gemstone Solid Surface Worktops &gt; Worktops</productHierarchy>
                          <quantity/>
                          <sku>
                                    <id>13619319</id>
                                    <name>Cooke &amp; Lewis Gemstone Triassic Worktop 3050mm</name>
                                    <description>A 6mm solid material bonded to a 28mm high performance chipboard core, Cooke &amp; Lewis Gemstone is the perfect green choice, formulated with 85% recycled material.</description>
                                    <ean>5397007119039</ean>
                                    <condition>new</condition>
                                    <price>582.00</price>
                                    <wasPrice/>
                                    <deliveryCost>0.0</deliveryCost>
                                    <deliveryTime>Delivery usually within 5 weeks</deliveryTime>
                                    <stockAvailability>1</stockAvailability>
                                    <skuAvailableInStore>0</skuAvailableInStore>
                                    <skuAvailableOnline>1</skuAvailableOnline>
                                    <channel>Home Delivery Only</channel>
                                    <buyerCats>
                <catLevel0>KITCHENS</catLevel0>
                <catLevel1>SOLID SURFACE WORKTOPS</catLevel1>
                <catLevel2>SPEEDSTONE SOLID SURFACE</catLevel2>
            </buyerCats>
                                    <affiliateCats>
                <affiliateCat0>Home &amp; Garden</affiliateCat0>
            </affiliateCats>
                                    <manufacturersPartNumber/>
                                    <specificationsModelNumber/>
                                    <featuresBrand>Cooke &amp; Lewis Gemstone</featuresBrand>
                                    <imageUrl>http://example.com/is/image/5397007119039_001c_v001_zp</imageUrl>
                                    <thumbnailUrl>http://example.com/is/image/5397007119039_001c_v001_zp?$75x75_generic$=</thumbnailUrl>
                                    <skuNavAttributes>
                                              <ecoGrowFoods>false</ecoGrowFoods>
                                              <ecoDLME>false</ecoDLME>
                                              <ecoRecycle>false</ecoRecycle>
                                              <ecoSavesWater>false</ecoSavesWater>
                                              <ecoHealthyHomes>false</ecoHealthyHomes>
                                              <ecoNurtureNature>false</ecoNurtureNature>
                                              <ecoSavesEnergy>false</ecoSavesEnergy>
                                    </skuNavAttributes>
                          </sku>
                </product>

I only want each product's main URL. I don't care about the other URLs in the XML structure, such as imageUrl and thumbnailUrl.

I tried:

sed -rn '/<url>([^"]*)<\/url>/' file.xml > file.txt

However, the output so far is empty.

1 Answer

Stack Overflow user

Accepted answer

Posted 2014-06-06 16:09:13

You can first grep for the <url> lines (provided the XML file is formatted as you showed, with one element per line) and then strip the XML tags:

grep '<url>' file.xml | sed 's/.*>\([^<]*\)<.*/\1/' >> file.txt
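The attempt in the question has an address (`/.../`) but no command, and `-n` suppresses sed's automatic printing, so nothing reaches the output file. As a minimal sketch, a sed-only variant can use a substitution with the `p` flag instead. The names `sample.xml` and `urls.txt` are hypothetical stand-ins for the real files, and the approach assumes each `<url>` element sits on a single line, as in the question's sample.

```shell
# Hypothetical sample input; in practice this would be the real file.xml.
cat > sample.xml <<'EOF'
<product>
    <id>1</id>
    <url>http://www.example.com/a-1</url>
    <sku>
        <imageUrl>http://example.com/is/image/1</imageUrl>
        <thumbnailUrl>http://example.com/is/image/1?t</thumbnailUrl>
    </sku>
</product>
<product>
    <id>2</id>
    <url>http://www.example.com/b-2</url>
</product>
EOF

# -n turns off automatic printing; the trailing p prints only the lines
# on which the substitution succeeded. ':' is used as the delimiter so
# the '/' in </url> needs no escaping.
sed -n 's:.*<url>\([^<]*\)</url>.*:\1:p' sample.xml > urls.txt

cat urls.txt
```

Because the pattern requires the literal lowercase `<url>` tag, the `imageUrl` and `thumbnailUrl` lines are skipped.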

Alternatively, you can delete the tags themselves (this leaves any leading indentation in place):

grep '<url>' file.xml | sed 's/<\/*url>//g'
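If the XML is not laid out one element per line, a line-based `grep '<url>'` would pull in everything else on the same line. A possible variant, assuming a grep that supports `-o` (GNU and BSD grep do, though POSIX does not require it), prints only the matched elements, which also avoids the leading indentation. The file names here are hypothetical.

```shell
# Hypothetical sample: two products squeezed onto a single line.
cat > oneline.xml <<'EOF'
<products><product><url>http://www.example.com/a-1</url><imageUrl>http://example.com/i</imageUrl></product><product><url>http://www.example.com/b-2</url></product></products>
EOF

# grep -o prints each match on its own line; sed then removes the
# opening and closing url tags from every match.
grep -o '<url>[^<]*</url>' oneline.xml | sed 's/<\/*url>//g' > urls_oneline.txt

cat urls_oneline.txt
```

This is still text matching rather than XML parsing, so entities such as `&amp;` inside a URL would be left encoded.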

After replacing < and > with spaces, you can select the second column:

grep '<url>' file.xml | tr '<>' ' ' | awk '{print $2}'
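The same column idea can be sketched in awk alone by treating `<` and `>` as field separators, so that neither `grep` nor `tr` is needed. This assumes one element per line, as in the question's sample; `awk_sample.xml` and `urls_awk.txt` are hypothetical names.

```shell
# Hypothetical sample input with indented elements.
cat > awk_sample.xml <<'EOF'
<product>
    <id>1</id>
    <url>http://www.example.com/a-1</url>
    <imageUrl>http://example.com/is/image/1</imageUrl>
</product>
<product>
    <url>http://www.example.com/b-2</url>
</product>
EOF

# With '<' and '>' as separators, a line like "  <url>X</url>" splits into
# $1 = indentation, $2 = "url", $3 = X. Comparing $2 exactly against "url"
# excludes imageUrl and thumbnailUrl.
awk -F'[<>]' '$2 == "url" { print $3 }' awk_sample.xml > urls_awk.txt

cat urls_awk.txt
```

The exact comparison on the tag name is the design choice here: a plain substring match on `url` would be easier to get wrong if other tag names contained it.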

Also, instead of grep, you can use xpath to select the appropriate tags, for example:

xpath -q -e '//product/url' file.xml | ... > file.txt
Original page content provided by Stack Overflow.
Original link:

https://stackoverflow.com/questions/24086107
