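I am trying to make a request using the socket module in Python. It successfully sends the request, receives the response, and decodes it. When I look at the HTML document, everything is correct except that there are 3-4 long random strings in it. I think my code is correct, but I'm not 100% sure. Here is my code: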
import select
import socket
import ssl

from bs4 import BeautifulSoup

# Assumed definition: the original snippet uses `newline` without defining it.
newline = "\r\n"

def recive_data(get, timeout):
    # Wait up to `timeout` seconds for the socket to become readable.
    ready = select.select([get], [], [], timeout)
    if ready[0]:
        return get.recv(4096)
    return b""

def get_file(website, port, file, https=False):
    data = []
    new_data = ""
    if https:
        get = ssl.create_default_context().wrap_socket(socket.socket(socket.AF_INET, socket.SOCK_STREAM), server_hostname=website)
    else:
        get = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    get.connect((website, port))
    get.sendall(f"GET {file} HTTP/1.1\r\nHost: {website}:{port}\r\n\r\n".encode())
    # Keep reading until no more data arrives within the timeout.
    while True:
        new_data = recive_data(get, 5).decode()
        if new_data != "" and new_data != None:
            data.append(new_data)
            new_data = ""
        else:
            break
    data = "".join(data)
    # The headers end at the first blank line; the body follows.
    header = data[0:data.find(newline+newline)]
    data = data[data.find(newline+newline):data.rfind(f"{newline}0{newline}{newline}")]
    data = BeautifulSoup(data, 'html.parser').prettify()
    get.close()
    return (header, data)

If I request https://stackoverflow.com, it outputs:
30d
<!DOCTYPE html>
<html class="html__responsive html__unpinned-leftnav">
<head>
<title>
Stack Overflow - Where Developers Learn, Share, & Build Careers
</title>
<link href="https://cdn.sstatic.net/Sites/stackoverflow/Img/favicon.ico?v=ec617d715196" rel="shortcut icon"/>
<link href="https://cdn.sstatic.net/Sites/stackoverflow/Img/apple-touch-icon.png?v=c78bd457575a" rel="apple-touch-icon"/>
<link href="https://cdn.sstatic.net/Sites/stackoverflow/Img/apple-touch-icon.png?v=c78bd457575a" rel="image_src"/>
<link href="/opensearch.xml" rel="search" title="Stack Overflow" type="application/opensearchdescription+xml"/>
<meta content="Stack Overflow is the largest, most trusted online communi
20d0
ty for developers to learn, share their programming knowledge, and build their careers." name="description"/>
<meta content="width=device-width, height=device-height, initial-scale=1.0, minimum-scale=1.0" name="viewport"/>
<meta content="website" property="og:type">

And so on. Some websites have more of these strings than others, and I can't figure out why. Any help is much appreciated!
Posted on 2021-03-03 03:20:07
The last line of the headers in the response gives you a clue:
HTTP/1.1 200 OK
Connection: keep-alive
cache-control: private
...
transfer-encoding: chunked

A transfer-encoding of chunked means that the content following the headers is not plain HTML. From the specification:
The chunked encoding modifies the body of a message in order to
transfer it as a series of chunks, each with its own size indicator,
followed by an OPTIONAL trailer containing entity-header fields
...
The chunk-size field is a string of hex digits indicating the size of
the chunk. The chunked encoding is ended by any chunk whose size is
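zero, followed by the trailer, which is terminated by an empty line.

In other words, what you are seeing is a hexadecimal number giving the number of bytes in the next chunk. There may be more than one chunk. You will need to check for that HTTP header and, if it is present, find all the chunks and join them together before parsing the page as HTML.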
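As a rough illustration of that advice, here is a minimal sketch of checking the header and joining the chunks. The function names `is_chunked` and `decode_chunked` are made up for this example and are not part of the question's code. The sketch assumes the whole response has already been received and that the body is still raw bytes: chunk sizes count bytes, not characters, so it is safer to do this before calling `.decode()`. Trailer fields after the final chunk are ignored.

```python
def is_chunked(header_text: str) -> bool:
    """Return True if the response headers declare chunked transfer encoding."""
    # Skip the status line; header names are case-insensitive.
    for line in header_text.split("\r\n")[1:]:
        name, sep, value = line.partition(":")
        if sep and name.strip().lower() == "transfer-encoding":
            if "chunked" in value.lower():
                return True
    return False


def decode_chunked(body: bytes) -> bytes:
    """Join the chunks of a chunked HTTP body, dropping the size lines."""
    decoded = bytearray()
    pos = 0
    while True:
        line_end = body.find(b"\r\n", pos)
        if line_end == -1:
            break  # truncated input: no further size line found
        # The size line is a hex number, optionally followed by ";extension".
        size_field = body[pos:line_end].split(b";", 1)[0].strip()
        size = int(size_field, 16)
        if size == 0:
            break  # a zero-sized chunk marks the end of the body
        start = line_end + 2
        decoded += body[start:start + size]
        # Skip the chunk data and the CRLF that follows it.
        pos = start + size + 2
    return bytes(decoded)
```

For example, `decode_chunked(b"4\r\nWiki\r\n5\r\npedia\r\n0\r\n\r\n")` returns `b"Wikipedia"`. In the question's code this would mean collecting the raw bytes from `recv`, splitting headers from body at the first `b"\r\n\r\n"`, and only then decoding the joined result for BeautifulSoup.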
https://stackoverflow.com/questions/66449988