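I am moving my bookmarks from kippt.com to pinboard.in.
I exported my bookmarks from Kippt, and for some reason it stores the tags (prefixed with #) and the description in the same field. Pinboard keeps tags and descriptions separate.
This is what a Kippt bookmark looks like after export: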
<DT><A HREF="http://www.example.org/" ADD_DATE="1412337977" LIST="Bookmarks">This is a title</A>
<DD>#tag1 #tag2 This is a description
Before importing into Pinboard, it should look like this:
<DT><A HREF="http://www.example.org/" ADD_DATE="1412337977" LIST="Bookmarks" TAGS="tag1,tag2">This is a title</A>
<DD>This is a description
So basically, I need to replace #tag1 #tag2 with TAGS="tag1,tag2" and move it into the <A> tag on the first line.
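I read about moving chunks of data here: sed or awk to move one chunk of text betwen first pattern pair into second pair?
So far I have not been able to come up with a good recipe. Any insights?
Edit:
Here is an actual sample of the input file (3 of 3500 entries):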
<DT><A HREF="http://phabricator.org/" ADD_DATE="1412973315" LIST="Bookmarks">Phabricator</A>
<DD>#bug #tracking
<DT><A HREF="http://qz.com/261426/the-hidden-commands-for-diagnosing-and-improving-your-netflix-streaming-quality/" ADD_DATE="1412838293" LIST="Inbox">The hidden commands for diagnosing and improving your Netflix streaming quality – Quartz</A>
<DT><A HREF="http://www.farmholidays.is/" ADD_DATE="1412337977" LIST="Bookmarks">Icelandic Farm Holidays | Local experts in Iceland vacations</A>
<DD>#iceland #tour #car #drive #self Self-driving tour of Iceland
Posted on 2014-10-12 06:46:07
This may not be the best solution, but since this seems to be a one-off task, it should be good enough.
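# Python 3
import re

def flush(dt_line, dd_line):
    # Print the pending entry, skipping whichever part is empty.
    if dt_line:
        print(dt_line)
    if dd_line:
        print(dd_line)

current_dt = ""
current_dd = ""
with open('bookmarks.xml', 'r') as f:
    for line in f:
        line = line.rstrip('\n')
        if line.startswith('<DT>'):
            # A new entry starts: print the previous one first, so that
            # entries without a <DD> line are not lost.
            flush(current_dt, current_dd)
            current_dt, current_dd = line, ""
        elif line.startswith('<DD>'):
            words = line[4:].split()
            tags = [w[1:] for w in words if w.startswith('#')]
            description = ' '.join(w for w in words if not w.startswith('#'))
            if tags:
                # Put TAGS="..." just before the ">" that closes the <A ...> tag.
                attr = ' TAGS="' + ','.join(tags) + '">'
                current_dt = re.sub('(<A[^>]+)>', lambda m: m.group(1) + attr, current_dt, count=1)
            # Keep the <DD> line only if a description is left.
            current_dd = '<DD>' + description if description else ""
        else:
            # Any other line: print the pending entry, then the line itself.
            flush(current_dt, current_dd)
            current_dt = current_dd = ""
            print(line)
flush(current_dt, current_dd)
If any part of the code is unclear, let me know. Of course, you can use Python to write the lines to a file instead of printing them, or even modify the original file.
Edit: added an if clause so that empty <DD> lines do not appear in the result.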
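The suggestion of writing to a file instead of printing could look roughly like the following. This is only a sketch in Python 3 under a few assumptions: the names convert_lines and convert_file and the output filename are made up for illustration, tags are taken to be the whitespace-separated words starting with #, and the file is assumed to be UTF-8.

```python
import re

def convert_lines(lines):
    """Yield converted lines; a <DT> line is held back until its tags are known."""
    pending = None  # the most recent <DT> line not yet emitted
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("<DD>") and pending is not None:
            words = line[4:].split()
            tags = [w[1:] for w in words if w.startswith("#")]
            rest = " ".join(w for w in words if not w.startswith("#"))
            if tags:
                attr = ' TAGS="%s">' % ",".join(tags)
                # Only the first match (the opening <A ...> tag) is rewritten.
                pending = re.sub(r"(<A[^>]*)>", lambda m: m.group(1) + attr,
                                 pending, count=1)
            yield pending
            pending = None
            if rest:  # drop the <DD> line when only tags were present
                yield "<DD>" + rest
            continue
        if pending is not None:  # previous <DT> had no <DD> line
            yield pending
            pending = None
        if line.startswith("<DT>"):
            pending = line
        else:
            yield line  # pass any other line through unchanged
    if pending is not None:
        yield pending

def convert_file(src, dst):
    """Read src line by line and write the converted result to dst."""
    with open(src, encoding="utf-8") as fin, \
         open(dst, "w", encoding="utf-8") as fout:
        for out in convert_lines(fin):
            fout.write(out + "\n")

# Hypothetical usage (output filename is an assumption):
# convert_file("bookmarks.xml", "bookmarks_pinboard.html")
```

Keeping the line logic in a generator separate from the file handling means the conversion can be tried on a few lines in memory before touching the real export.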
Posted on 2014-10-13 19:25:27
script.awk
BEGIN{FS="#"}
/^<DT>/{
    if(d==1) print "<DT>"s   # the previous entry had no <DD> line: print it unchanged
    s=substr($0,5);tags=""   # keep the line without its leading "<DT>"; TAGS is spliced in later
    d=1
}
/^<DD>/{
    d=0
    if(NF==1){print "<DT>"s; print; next}   # no "#" on the line: nothing to move
    m=match(s,/>/)   # first ">" in s, i.e. the end of the opening <A ...> tag
    for(i=2;i<=NF;i++){sub(/ $/,"",$i);tags=tags","$i}   # join the tags with commas
    td=match(tags,/ /)   # the first space marks the start of the description
    if(td==0){   # no description exists
        tags=substr(tags,2)
        tagdes=""
    }
    else{   # description exists
        tagdes=substr(tags,td+1)
        tags=substr(tags,2,td-2)
    }
    print "<DT>" substr(s,1,m-1) " TAGS=\"" tags "\"" substr(s,m)
    print "<DD>" tagdes
}
END{if(d==1) print "<DT>"s}   # the last entry may also lack a <DD> line
Run it with:
awk -f script.awk kippt > pinboard
Input:
<DT><A HREF="http://phabricator.org/" ADD_DATE="1412973315" LIST="Bookmarks">Phabricator</A>
<DD>#bug #tracking
<DT><A HREF="http://qz.com/261426/the-hidden-commands-for-diagnosing-and-improving-your-netflix-streaming-quality/" ADD_DATE="1412838293" LIST="Inbox">The hidden commands for diagnosing and improving your Netflix streaming quality – Quartz</A>
<DT><A HREF="http://www.farmholidays.is/" ADD_DATE="1412337977" LIST="Bookmarks">Icelandic Farm Holidays | Local experts in Iceland vacations</A>
<DD>#iceland #tour #car #drive #self Self-driving tour of Iceland
Output:
<DT><A HREF="http://phabricator.org/" ADD_DATE="1412973315" LIST="Bookmarks" TAGS="bug,tracking">Phabricator</A>
<DD>
<DT><A HREF="http://qz.com/261426/the-hidden-commands-for-diagnosing-and-improving-your-netflix-streaming-quality/" ADD_DATE="1412838293" LIST="Inbox">The hidden commands for diagnosing and improving your Netflix streaming quality – Quartz</A>
<DT><A HREF="http://www.farmholidays.is/" ADD_DATE="1412337977" LIST="Bookmarks" TAGS="iceland,tour,car,drive,self">Icelandic Farm Holidays | Local experts in Iceland vacations</A>
<DD>Self-driving tour of Iceland
https://stackoverflow.com/questions/26319280