I have a file, urls.tmp, that contains these 3 URLs:
https://site1.com.br/wp-content/uploads/2020/06/?SD
https://site2.com.br/wp-content/uploads/tp-datademo/home-4/data/tp-hotel-booking/?SD
https://site3.com.br/wp-content/uploads/revslider/hotel-home/?MD

I want to remove everything after each "com.br/".
I tried this code:
# open the file
sys.stdout = open("urls.tmp", "w")
# start remove
for i in "urls.tmp":
    url_parts = urllib.parse.urlparse(i)
    result = '{uri.scheme}://{uri.netloc}/'.format(uri=url_parts)
    print(result) #overwrite the file
# close the file
sys.stdout.close()

But the output gave me something strange:
:///
:///
:///
:///
:///
:///
:///
:///

I am a beginner. What am I doing wrong?
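Answered 2022-02-15 15:24:36
You are iterating over the string "urls.tmp" itself, character by character, whereas you want to iterate line by line over the opened file object.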
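To see why the output looks that way, here is a small sketch (the variable names chars and results are only illustrative). Looping over a str yields one character at a time, and a single character has neither a scheme nor a host, so the format string collapses to ":///". The name "urls.tmp" has 8 characters, which matches the 8 lines of output in the question.

```python
from urllib.parse import urlparse

# Iterating over a str yields its characters, not the lines of the file it names.
chars = [ch for ch in "urls.tmp"]

# Each single character parses to an empty scheme and an empty netloc,
# so only the literal "://" and "/" from the format string remain.
results = ["{uri.scheme}://{uri.netloc}/".format(uri=urlparse(ch)) for ch in chars]

print(len(chars))
for line in results:
    print(line)
```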
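So, try this:

import urllib.parse

with open("urls.tmp", "r") as urls_file:
    for line in urls_file:
        url_parts = urllib.parse.urlparse(line)
        result = "{uri.scheme}://{uri.netloc}/".format(uri=url_parts)
        print(result)

Edit: the author updated the original question to say that the source file should be overwritten with the processed URLs. Here is an example: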
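import urllib.parse

new_urls = []
# read all lines first, so the file can be reopened for writing afterwards
with open("urls.tmp", "r") as urls_file:
    old_urls = urls_file.readlines()
for line in old_urls:
    url_parts = urllib.parse.urlparse(line)
    proc_url = "{uri.scheme}://{uri.netloc}/\n".format(uri=url_parts)
    new_urls.append(proc_url)
# mode "w" truncates the file, then the processed URLs are written back
with open("urls.tmp", "w") as urls_file:
    urls_file.writelines(new_urls)

Answered 2022-02-15 15:15:03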
See Savva Surenkov's answer for the fix to your problem.
You can also use the string split method, for example:

url = r"https://site1.com.br/wp-content/uploads/2020/06/?SD"
split_by = "com.br/"
new_url = url.split(split_by)[0] + split_by
# split gives you the part before <split_by>, and then we attach <split_by> again
new_url == r"https://site1.com.br/"  # True

If you want to add some extra checks, you could look into regular expressions.
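As a rough sketch of what such a check could look like with the standard re module: the pattern ORIGIN_RE and the helper keep_origin below are hypothetical names, and the pattern assumes every URL starts with http:// or https:// and has a host ending in ".com.br". Unlike split, it returns None for input that does not have that shape.

```python
import re

# Hypothetical pattern: scheme, then a host ending in ".com.br", then the first slash.
ORIGIN_RE = re.compile(r"^(https?://[^/\s]+\.com\.br/)")


def keep_origin(url):
    """Return the URL up to and including 'com.br/', or None if it does not match."""
    match = ORIGIN_RE.match(url.strip())
    return match.group(1) if match else None


print(keep_origin("https://site1.com.br/wp-content/uploads/2020/06/?SD"))
print(keep_origin("not a url"))
```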
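Some things you did not ask about, but which may help you as a beginner. I recommend using

with open("urls.tmp", "w") as f:
    ...  # do something with f (note: mode "w" truncates the file on open)

or

import pathlib
urls = pathlib.Path("urls.tmp").read_text()
# which gives you all lines in a single string

over a plain open call. If you want to learn more about this, I suggest reading up on context managers.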
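Putting the pathlib suggestion together with the urlparse approach from the other answer, a minimal sketch could look like the following. The function name rewrite_urls is hypothetical, and it assumes the file holds one URL per line. Path.read_text and Path.write_text open and close the file for you, so no explicit close call is needed.

```python
import pathlib
import urllib.parse


def rewrite_urls(path):
    """Replace every URL in the file at `path` with its scheme and host only."""
    file = pathlib.Path(path)
    new_lines = []
    for line in file.read_text().splitlines():
        if not line.strip():
            continue  # skip blank lines
        parts = urllib.parse.urlparse(line.strip())
        new_lines.append("{uri.scheme}://{uri.netloc}/".format(uri=parts))
    # write_text overwrites the previous contents of the file
    file.write_text("\n".join(new_lines) + "\n")
    return new_lines
```

Calling rewrite_urls("urls.tmp") would overwrite the file in place, so it is worth trying it on a copy first.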
There are also f-strings, available since Python 3.6, which in my opinion are easier to read than "{}".format.
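For comparison, here is the same formatting step written both ways (the variable names with_format and with_fstring are only for illustration). Both build the same string; the f-string just puts the expressions inline.

```python
from urllib.parse import urlparse

url = "https://site2.com.br/wp-content/uploads/tp-datademo/home-4/data/tp-hotel-booking/?SD"
parts = urlparse(url)

# str.format with attribute access on a named argument
with_format = "{uri.scheme}://{uri.netloc}/".format(uri=parts)

# the equivalent f-string (Python 3.6+)
with_fstring = f"{parts.scheme}://{parts.netloc}/"

print(with_format == with_fstring)
```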
Answered 2022-02-15 15:21:46
You can also use the string find() method.

urllist = [
    'https://site1.com.br/wp-content/uploads/2020/06/?SD',
    'https://site2.com.br/wp-content/uploads/tp-datademo/home-4/data/tp-hotel-booking/?SD',
    'https://site3.com.br/wp-content/uploads/revslider/hotel-home/?MD']
newlist = []
breaktext = 'com.br/'
for item in urllist:
    # find() returns the index of the first match, or -1 if breaktext is absent
    position = item.find(breaktext)
    # keep everything up to and including breaktext
    newlist.append(item[:position + len(breaktext)])
print(newlist)

https://stackoverflow.com/questions/71128817