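I'm new to Python and data scraping.
I'm trying to fetch data for some car models with a Python script.
The problem is that Python decodes the response into jumbled text that doesn't match the response content.
I found that the information I need is in a script tag inside the HTML head element.
Here is a simplified version of the script I'm using: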
import requests
import lxml.html

urls = "https://www.ultimatespecs.com/car-specs/Audi/119438/Audi-A3-(8Y)-Sedan-35-TDI.html"
res = requests.get(urls)
print(res.headers)
tree = lxml.html.fromstring(res.content)
# Serialize the first JSON-LD script element in <head> back to bytes
helem = lxml.html.tostring(tree.xpath("//head/script[@type='application/ld+json']")[0])
print(helem)
print(helem.decode('utf-8'))

Response headers:
{'Date': 'Sun, 14 Feb 2021 10:54:09 GMT', 'Content-Type': 'text/html; charset=UTF-8', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'Set-Cookie': '__cfduid=d938bb826c443ab15f20272199e2f18141613300048; expires=Tue, 16-21 10:54:08 GMT; path=/; domain=.ultimatespecs.com; HttpOnly; SameSite=Lax, PHPSESSID=ea60d27909207143c5ccd860e6fb3b76; path=/', 'Expires': 'Thu, 19 Nov 1981 08:52:00 GMT', 'Cache-Control': 'no-store, no-cache, must-revalidate', 'Pragma': 'no-cache', 'Vary': 'Accept-Encoding,User', 'CF-Cache-Status': 'DYNAMIC', 'cf-request-id': '0841c63a9c0000b61bda3810000001', 'Expect-CT': 'max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct"', 'Report-To': '{"group":"cf-nel","endpoints":[{"url":"https:\/\/a.nel.cloudflare.com\/report?s=kB6vGZn5zLDoI%2FeQt9AF8174Aanh5La%2Bvh2beLKlCdnrHv5jbEIhC0h3FUVb56wTidKKSMFq1zuWhbakIydNto3EBXMZRt%2BwLD2FZgMsmHH53aJpanc%3D"}],"max_age":604800}', 'NEL': '{"max_age":604800,"report_to":"cf-nel"}', 'Server': 'cloudflare', 'CF-RAY': '62163fd76b76b61b-TLL', 'Content-Encoding': 'gzip'}
helem as bytes:
b'<script type="application/ld+json">\r\t\t{\r\t\t\t"@context":"http://schema.org/",\r\t\t"@type":"Car",\r\t"brand":"Audi",\t"manufacturer":"Audi",\r"name":"Audi A3 (8Y) Sedan 35 TDI","description":"35 TDI Specs:Power 150 PS (148 hp); Diesel;Average consumption:3.6 l/100km (65 MPG);Dimensions: Length:449.5 cm (176.97 inches); Width:181.6 cm (71.5 inches);Height:142.5 cm (56.1 inches);Weight:1390 kg (3064 lbs);Model Years 2020,2021","productionDate":"2020","mainEntityOfPage":"https://www.ultimatespecs.com/car-specs/Audi/119438/Audi-A3-(8Y)-Sedan-35-TDI.html","image":{\r"@type":"ImageObject",\r"contentUrl":"https://www.ultimatespecs.com/wallpaper.php?id=7243"\r\t\t\t\t\t}\r\t\t\t\t\t,"height":{\r\t\t"@type":"QuantitativeValue",\r\t"unitCode":"CMT",\t\t"value":"142.5"\r\t},"width":{\r\t"@type":"QuantitativeValue",\r\t"unitCode":"CMT",\r\t"value":"181.6"\r\t},"weight":{\r\t"@type":"QuantitativeValue","unitCode":"KGM",\t"value":"1390"\t},"accelerationTime":{\r"@type":"QuantitativeValue","unitCode":"SEC",\t"value":"8.4"},"driveWheelConfiguration":{\r\t"@type":"DriveWheelConfigurationValue",\r\t"@id":"https://schema.org/FrontWheelDriveConfiguration"},"bodyType":"Sedan","cargoVolume":{"@type":"QuantitativeValue","unitCode":"LTR","value":"425"},"emissionsCO2":"96","fuelCapacity":{\r\t"@type":"QuantitativeValue",\r\t"unitCode":"LTR","value":"50"\r\t},"fuelConsumption":{\r"@type":"QuantitativeValue",\r\t"unitText":"L/100 km",\r\t"valueReference":"Average","value":"3.6"},"fuelEfficiency":{\r\t"@type":"QuantitativeValue","unitText":"US MPG","valueReference":"Average",\t"value":"65"},"fuelType":"Diesel","numberOfDoors":"4","vehicleSeatingCapacity":"5","numberOfForwardGears":"7","vehicleTransmission":"Dualclutch Automatic","wheelbase":{\r\t"@type":"QuantitativeValue",\r\t\t"unitCode":"CMT","value":"263.6"\r\t\t},"speed":{\r\t"@type":"QuantitativeValue","unitCode":"KMH","value":"232"\r},"vehicleConfiguration":"35 TDI","vehicleEngine":[{"@type":"EngineSpecification","fuelType":"Diesel","engineDisplacement":{\r"@type":"QuantitativeValue","unitCode":"CMQ","value":"1968"},"torque":{"@type":"QuantitativeValue","unitCode":"NU","value":"360"},"enginePower":{\r"@type":"QuantitativeValue","unitCode":"N12","value":"150"}}]}</script>'
helem as text:
"value":"150"}:{cement":{eEngine":[SeatingCapacity":"5","numberOfForwardGears":"7","vehicleTransmission":"Dualclutch Automatic","wheelbase":{(176.97 inches); Width:181.6 cm (71.5 inches);Height:142.5 cm (56.1 inches);Weight:1390 kg (3064 lbs);Model Years 2020,2021","productionDate":"2020","mainEntityOfPage":"https://www.ultimatespecs.com/car-specs/Audi/119438/Audi-A3-(8Y)-Sedan-35-TDI.html","image":{
As you can see, the decoded text overlaps itself several times.
What am I doing wrong?
Posted on 2021-02-14 13:21:22
If I understand correctly, you are looking for the following data.

Code
import json
import pprint as pp

import lxml.html
import requests

urls = "https://www.ultimatespecs.com/car-specs/Audi/119438/Audi-A3-(8Y)-Sedan-35-TDI.html"
res = requests.get(urls)
tree = lxml.html.fromstring(res.content)
# Take the text inside the first JSON-LD script tag in <head>, not the serialized element
helem = tree.xpath("//head/script[@type='application/ld+json']")[0].text
# Parse the JSON-LD text into a Python dict
data = json.loads(helem)
pp.pprint(data)

Output
{'@context': 'http://schema.org/',
 '@type': 'Car',
 'accelerationTime': {'@type': 'QuantitativeValue',
                      'unitCode': 'SEC',
                      'value': '8.4'},
 'bodyType': 'Sedan',
 'brand': 'Audi',
 'cargoVolume': {'@type': 'QuantitativeValue',
                 'unitCode': 'LTR',
                 'value': '425'},
 'description': '35 TDI Specs:Power 150 PS (148 hp); Diesel;Average '
                'consumption:3.6 l/100km (65 MPG);Dimensions: Length:449.5 cm '
                '(176.97 inches); Width:181.6 cm (71.5 inches);Height:142.5 cm '
                '(56.1 inches);Weight:1390 kg (3064 lbs);Model Years 2020,2021',
 'driveWheelConfiguration': {'@id': 'https://schema.org/FrontWheelDriveConfiguration',
                             '@type': 'DriveWheelConfigurationValue'},
 'emissionsCO2': '96',
 'fuelCapacity': {'@type': 'QuantitativeValue',
                  'unitCode': 'LTR',
                  'value': '50'},
 'fuelConsumption': {'@type': 'QuantitativeValue',
                     'unitText': 'L/100 km',
                     'value': '3.6',
                     'valueReference': 'Average'},
 'fuelEfficiency': {'@type': 'QuantitativeValue',
                    'unitText': 'US MPG',
                    'value': '65',
                    'valueReference': 'Average'},
 'fuelType': 'Diesel',
 'height': {'@type': 'QuantitativeValue', 'unitCode': 'CMT', 'value': '142.5'},
 'image': {'@type': 'ImageObject',
           'contentUrl': 'https://www.ultimatespecs.com/wallpaper.php?id=7243'},
 'mainEntityOfPage': 'https://www.ultimatespecs.com/car-specs/Audi/119438/Audi-A3-(8Y)-Sedan-35-TDI.html',
 'manufacturer': 'Audi',
 'name': 'Audi A3 (8Y) Sedan 35 TDI',
 'numberOfDoors': '4',
 'numberOfForwardGears': '7',
 'productionDate': '2020',
 'speed': {'@type': 'QuantitativeValue', 'unitCode': 'KMH', 'value': '232'},
 'vehicleConfiguration': '35 TDI',
 'vehicleEngine': [{'@type': 'EngineSpecification',
                    'engineDisplacement': {'@type': 'QuantitativeValue',
                                           'unitCode': 'CMQ',
                                           'value': '1968'},
                    'enginePower': {'@type': 'QuantitativeValue',
                                    'unitCode': 'N12',
                                    'value': '150'},
                    'fuelType': 'Diesel',
                    'torque': {'@type': 'QuantitativeValue',
                               'unitCode': 'NU',
                               'value': '360'}}],
 'vehicleSeatingCapacity': '5',
 'vehicleTransmission': 'Dualclutch Automatic',
 'weight': {'@type': 'QuantitativeValue', 'unitCode': 'KGM', 'value': '1390'},
 'wheelbase': {'@type': 'QuantitativeValue',
               'unitCode': 'CMT',
               'value': '263.6'},
 'width': {'@type': 'QuantitativeValue', 'unitCode': 'CMT', 'value': '181.6'}}
Process finished with exit code 0
https://stackoverflow.com/questions/66195006
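
As for why the printed text seemed to overlap: the bytes output in the question contains bare carriage returns (\r). Many terminals treat \r as "return to the start of the current line", so each chunk is drawn over the previous one. That is my reading of the output, not something I verified against the live page. The sketch below reproduces the effect offline; `raw` is a made-up, shortened stand-in for the script tag's text, not the real page content.

```python
import json

# Hypothetical stand-in for the text of the JSON-LD script tag.
# Like the page in the question, it separates lines with a bare "\r".
raw = (
    '\r\t\t{'
    '\r\t\t\t"@type": "Car",'
    '\r\t\t\t"name": "Audi A3 (8Y) Sedan 35 TDI",'
    '\r\t\t\t"speed": {"unitCode": "KMH", "value": "232"}'
    '\r\t\t}'
    '\r\t\t'
)

# repr() shows the control characters instead of letting the terminal act on them.
print(repr(raw))

# Count the carriage returns that would make a terminal redraw the same line.
cr_count = raw.count("\r")

# JSON allows \r and \t as whitespace between tokens, so parsing works unchanged.
data = json.loads(raw)
top_speed = float(data["speed"]["value"])  # values arrive as strings

# For a readable dump of the raw text, rejoin the lines with "\n".
readable = "\n".join(raw.splitlines())
print(readable)
```

So the response was most likely decoded correctly all along; only the way it was displayed was misleading. Parsing with json.loads, as in the code above, sidesteps the display issue entirely.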