I have a bot that returns HTML code like this:
<div class="std">
<p>CAR:
<span>Onix</span>
</p>
<p>MODEL: LTZ</p>
<p>
<span>COLOR:
<span>Black</span>
</p>
<p>ACESSORIES:
<span>ABS</span>
</p>
<p>
<span>DESCRIPTION:</span>
<span>The Chevrolet Onix is a subcompact car launched by American automaker Chevrolet in Brazil at the 2012 São Paulo International Motor Show[1] to succeed some versions of Chevrolet Celta. Offered initially as a five-door hatchback, a four-door sedan was launched in 2013 and called the Chevrolet Prisma.[2] The Onix is currently only sold in some South American countries part of Mercosur, including Brazil, Argentina, Colombia, Paraguay and Uruguay.</span>
</p>
<p>TECHNICAL DETAIL:
<span>The Onix is available in three trim levels (LS, LT and LTZ) with two 4-cylinder engines, the 1.0-litre producing 78 PS (57 kW; 77 bhp) (petrol)/ 80 PS (59 kW; 79 bhp) (ethanol) and 1.4-litre 98 PS (72 kW; 97 bhp) (petrol)/106 PS (78 kW; 105 bhp) (ethanol) offering automatic or five-speed manual transmission..</span>
</p>
</div>
I applied the following code to strip the HTML tags:
cleanr = re.compile('<.*?>')
cleantext = re.sub(cleanr,'\n', html_code).strip()
It gave me back:
CAR: Onix
MODEL: LTZ
COLOR:
Black
ACESSORIES:
ABS
DESCRIPTION:
The Chevrolet Onix is a subcompact car launched by American automaker Chevrolet in Brazil at the 2012 São Paulo International Motor Show[1] to succeed some versions of Chevrolet Celta. Offered initially as a five-door hatchback, a four-door sedan was launched in 2013 and called the Chevrolet Prisma.[2] The Onix is currently only sold in some South American countries part of Mercosur, including Brazil, Argentina, Colombia, Paraguay and Uruguay.
TECHNICAL DETAIL:
The Onix is available in three trim levels (LS, LT and LTZ) with two 4-cylinder engines, the 1.0-litre producing 78 PS (57 kW; 77 bhp) (petrol)/ 80 PS (59 kW; 79 bhp) (ethanol) and 1.4-litre 98 PS (72 kW; 97 bhp) (petrol)/106 PS (78 kW; 105 bhp) (ethanol) offering automatic or five-speed manual transmission..
Now I need to remove the line breaks so that the content looks like this:
CAR: Onix
MODEL: LTZ
COLOR: Black
ACESSORIES: ABS
DESCRIPTION: The Chevrolet Onix is a subcompact car launched by American automaker Chevrolet in Brazil at the 2012 São Paulo International Motor Show[1] to succeed some versions of Chevrolet Celta. Offered initially as a five-door hatchback, a four-door sedan was launched in 2013 and called the Chevrolet Prisma.[2] The Onix is currently only sold in some South American countries part of Mercosur, including Brazil, Argentina, Colombia, Paraguay and Uruguay.
TECHNICAL DETAIL: The Onix is available in three trim levels (LS, LT and LTZ) with two 4-cylinder engines, the 1.0-litre producing 78 PS (57 kW; 77 bhp) (petrol)/ 80 PS (59 kW; 79 bhp) (ethanol) and 1.4-litre 98 PS (72 kW; 97 bhp) (petrol)/106 PS (78 kW; 105 bhp) (ethanol) offering automatic or five-speed manual transmission..
I tried the code below, but it does not match the line breaks:
cleantext = re.sub(r':\s*[\r\n]*', ': ', cleantext)
I also tried another piece of code:
cleantext = cleantext.replace(': \n', ': ')
That does not work either. How should I handle this?
Answered 2016-10-16 15:47:43
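I think your problem has two parts. The first is joining a string that is split across two lines, turning
COLOR:
Black
into
COLOR: Black
and the second is removing all the blank lines.
For the first part, you can replace your re.sub call with the following: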
cleantext = re.sub(r'(.*):\s*[\r\n](.*)', r'\g<1>: \g<2>', cleantext)
Removing the blank lines with re.sub is awkward, so for that part I suggest filtering the lines instead:
cleantext = "\n".join([line for line in cleantext.split('\n') if line.strip() != ''])
This gives you the output you expect.
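For reference, here is a minimal sketch that puts the two steps together and runs them on a shortened version of the sample. The function name `join_label_lines` and the `sample` string are made up for illustration, and the sample assumes the stripped text contains blank lines between entries, which is what replacing every tag with '\n' tends to produce.

```python
import re


def join_label_lines(text):
    """Join each 'LABEL:' line with the value on a following line, then drop blank lines."""
    # Step 1: a colon followed by optional whitespace and a line break is
    # rewritten as ': ' so the value moves up next to its label.
    joined = re.sub(r'(.*):\s*[\r\n](.*)', r'\g<1>: \g<2>', text)
    # Step 2: keep only the lines that contain something other than whitespace.
    return "\n".join(line for line in joined.split("\n") if line.strip() != "")


# Hypothetical, shortened input in the shape the tag stripping produces.
sample = "CAR: Onix\n\nMODEL: LTZ\n\nCOLOR:\n\nBlack\n\nACESSORIES:\nABS\n"
result = join_label_lines(sample)
print(result)
```

Lines such as "CAR: Onix" are left alone because the colon is not followed by a line break. One caveat: any line that merely ends with a colon is treated as a label, so a value that happens to end in ':' would be joined with the line after it.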
Answered 2016-10-16 15:42:24
I think this should work for you:
>>> string = """
CAR: Onix
MODEL: LTZ
COLOR:
Black
ACESSORIES:
ABS
DESCRIPTION:
The Chevrolet Onix is a subcompact car launched by American automaker Chevrolet in Brazil at the 2012 São Paulo International Motor Show[1] to succeed some versions of Chevrolet Celta. Offered initially as a five-door hatchback, a four-door sedan was launched in 2013 and called the Chevrolet Prisma.[2] The Onix is currently only sold in some South American countries part of Mercosur, including Brazil, Argentina, Colombia, Paraguay and Uruguay.
TECHNICAL DETAIL:
The Onix is available in three trim levels (LS, LT and LTZ) with two 4-cylinder engines, the 1.0-litre producing 78 PS (57 kW; 77 bhp) (petrol)/ 80 PS (59 kW; 79 bhp) (ethanol) and 1.4-litre 98 PS (72 kW; 97 bhp) (petrol)/106 PS (78 kW; 105 bhp) (ethanol) offering automatic or five-speed manual transmission..
"""
>>> list_string = string.split("\n\n\n")
>>> for each in list_string:
...     print(each.replace("\n", "").strip())
CAR: Onix
MODEL: LTZ
COLOR:Black
ACESSORIES:ABS
DESCRIPTION:
The Chevrolet Onix is a subcompact car launched by American automaker Chevrolet in Brazil at the 2012 São Paulo International Motor Show[1] to succeed some versions of Chevrolet Celta. Offered initially as a five-door hatchback, a four-door sedan was launched in 2013 and called the Chevrolet Prisma.[2] The Onix is currently only sold in some South American countries part of Mercosur, including Brazil, Argentina, Colombia, Paraguay and Uruguay.
TECHNICAL DETAIL:The Onix is available in three trim levels (LS, LT and LTZ) with two 4-cylinder engines, the 1.0-litre producing 78 PS (57 kW; 77 bhp) (petrol)/ 80 PS (59 kW; 79 bhp) (ethanol) and 1.4-litre 98 PS (72 kW; 97 bhp) (petrol)/106 PS (78 kW; 105 bhp) (ethanol) offering automatic or five-speed manual transmission..
https://stackoverflow.com/questions/40071919