import urllib.request
url = "https://www.sec.gov/Archives/edgar/data/3662/0000950170-98-000413.txt"
data = urllib.request.urlopen(url)
list_line = [str(x) for x in data]
for line in list_line:
    line.replace("b'", "")
    line.replace("\\n", "")
    line.replace("\\t", "")
print(list_line)
It produces a list like this:
"b'-----BEGIN PRIVACY-ENHANCED MESSAGE-----\n'", "b'V5559QRyTgPe9PfVt0db9Q==\n'", "b'\n'", "b'<SEC-DOCUMENT>0000950170-98-000413.txt : 19980309\n'", "b'<SEC-HEADER>0000950170-98-000413.hdr.sgml : 19980309\n'" <- sample
I want to remove the b', \n and \t. String split and replace are not working. How can I do this?
Answered 2018-03-14 16:16:51
Rather than trying to replace things, decode the data as UTF-8 to get the resulting text:
import urllib.request
url = "https://www.sec.gov/Archives/edgar/data/3662/0000950170-98-000413.txt"
data = urllib.request.urlopen(url).read()
text = data.decode('utf-8')
text = text.replace('\t', '') # Remove tabs if still needed
print(text)
This displays text that starts as follows:
-----BEGIN PRIVACY-ENHANCED MESSAGE-----
Proc-Type: 2001,MIC-CLEAR
Originator-Name: webmaster@www.sec.gov
Originator-Key-Asymmetric:
MFgwCgYEVQgBAQICAf8DSgAwRwJAW2sNKK9AVtBzYZmr6aGjlWyK3XmZv3dTINen
TWSM7vrzLADbmYQaionwg5sDW3P6oaM5D3tdezXMm7z1T+B+twIDAQAB
MIC-Info: RSA-MD5,RSA,
EvPdKfnjzBIjWkEk2RgNCk1/52qXomHpN+LDwL/XTT/XBuAzk70AYYrsxlQbyiqr
V5559QRyTgPe9PfVt0db9Q==
<SEC-DOCUMENT>0000950170-98-000413.txt : 19980309
<SEC-HEADER>0000950170-98-000413.hdr.sgml : 19980309
ACCESSION NUMBER: 0000950170-98-000413
CONFORMED SUBMISSION TYPE: 10-K405
PUBLIC DOCUMENT COUNT:
If you want a list of lines, add:
lines = text.splitlines()
https://stackoverflow.com/questions/49282589
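To see why the loop in the question had no effect, and what decoding changes, here is a small offline sketch. The `raw_lines` sample is hypothetical: it is made up from the lines shown above (the tab characters in particular are an assumption) and stands in for what iterating over the HTTP response yields, so nothing is fetched from sec.gov.

```python
# Hypothetical sample standing in for the bytes lines yielded by urlopen(url).
raw_lines = [
    b"-----BEGIN PRIVACY-ENHANCED MESSAGE-----\n",
    b"Proc-Type: 2001,MIC-CLEAR\n",
    b"\n",
    b"ACCESSION NUMBER:\t\t0000950170-98-000413\n",  # tabs assumed for illustration
]

# str() on a bytes object returns its repr, so the b'...' wrapper and the
# escaped \n become literal characters inside the string.
as_repr = str(raw_lines[0])

# str.replace returns a new string; strings are immutable, so calling it
# without assigning the result leaves the original untouched.
unchanged = as_repr
unchanged.replace("b'", "")

# Decoding turns the bytes into real text. After that, the line ending and
# tabs are ordinary characters that can be stripped or replaced.
clean_lines = [
    line.decode("utf-8").rstrip("\r\n").replace("\t", "")
    for line in raw_lines
]

print(as_repr)
print(clean_lines)
```

If a file might contain bytes that are not valid UTF-8, `data.decode('utf-8', errors='replace')` avoids a `UnicodeDecodeError`, at the cost of substituting U+FFFD for the offending bytes.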