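Python HTTP Libraries: The requests Module
I. Introduction
`requests` is a Python library for sending HTTP requests. Its API is concise and more convenient to use than `urllib`; under the hood it is built on `urllib3`.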
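Notes
- `requests` only downloads the page content; it does not execute JavaScript. For content that is loaded by scripts, you have to analyze the target site yourself and issue the additional requests.
- Before studying `requests` in earnest, it helps to be familiar with the HTTP protocol.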
Installation
Install the library with pip: `pip install requests`
Common request methods
- `GET` and `POST` are the most commonly used request methods.
- Other request methods include `PUT`, `DELETE`, `HEAD`, and `OPTIONS`.
```python
import requests

r = requests.get('https://api.github.com/events')
r = requests.post('http://httpbin.org/post', data={'key': 'value'})
r = requests.put('http://httpbin.org/put', data={'key': 'value'})
r = requests.delete('http://httpbin.org/delete')
r = requests.head('http://httpbin.org/get')
r = requests.options('http://httpbin.org/get')
```
Official documentation
II. GET Requests
1. Basic request
```python
import requests

response = requests.get('http://dig.chouti.com/')
print(response.text)
```
2. GET requests with parameters
Building the query string by hand
```python
import requests

response = requests.get(
    'https://www.baidu.com/s?wd=python&pn=1',
    headers={
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.75 Safari/537.36',
    },
)
print(response.text)
```
URL encoding
If the search keyword contains Chinese or other special characters, it must be URL-encoded before it is placed in the URL.
```python
import requests
from urllib.parse import urlencode

wd = 'jerry老师'
# urlencode() returns 'k=<encoded value>'; keep only the encoded value.
encode_res = urlencode({'k': wd}, encoding='utf-8')
keyword = encode_res.split('=')[1]

url = f'https://www.baidu.com/s?wd={keyword}&pn=1'

response = requests.get(
    url,
    headers={
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.75 Safari/537.36',
    },
)
print(response.text)
```
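The encoding step can also be done with `urllib.parse.quote`, which returns the encoded value directly so that no `split('=')` is needed. The following is a minimal sketch that only builds the URL and does not send a request; the keyword and URL are the ones used in this section.

```python
from urllib.parse import quote

# quote() percent-encodes the UTF-8 bytes of every non-ASCII character.
keyword = quote('jerry老师')
url = f'https://www.baidu.com/s?wd={keyword}&pn=1'

print(keyword)
print(url)
```

ASCII letters and digits pass through unchanged; only the two Chinese characters are replaced by percent-escapes.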
Using the `params` parameter
```python
import requests

wd = 'jerry老师'
pn = 1

# requests URL-encodes the values in params and appends them to the URL.
response = requests.get(
    'https://www.baidu.com/s',
    params={'wd': wd, 'pn': pn},
    headers={
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.75 Safari/537.36',
    },
)
print(response.text)
```
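To see what `params` actually produces, one option is to build the request without sending it. The sketch below uses `requests.Request(...).prepare()`, which returns the prepared request that would go on the wire; it needs no network access.

```python
import requests

# Build (but do not send) the request so the final URL can be inspected.
prepared = requests.Request(
    'GET',
    'https://www.baidu.com/s',
    params={'wd': 'jerry老师', 'pn': 1},
).prepare()

print(prepared.url)
```

The printed URL contains the same percent-encoded keyword as the hand-built version, which is why manual encoding is unnecessary when `params` is used.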
3. GET requests with headers
```python
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.76 Mobile Safari/537.36',
}

response = requests.get('https://www.zhihu.com/explore', headers=headers)
print(response.status_code)
```
4. GET requests with cookies
```python
import requests

cookies = {
    'sessionid': 'abc123',
}

response = requests.get('https://example.com', cookies=cookies)
print(response.text)
```
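Both `headers` and `cookies` end up as ordinary request headers. As a small illustration, the sketch below prepares a request without sending it and prints the headers that would be transmitted; the `MyApp/1.0` User-Agent value is a made-up placeholder.

```python
import requests

prepared = requests.Request(
    'GET',
    'https://example.com/',
    headers={'User-Agent': 'MyApp/1.0'},  # placeholder value for illustration
    cookies={'sessionid': 'abc123'},
).prepare()

# The cookies dict is serialized into a single Cookie header.
for name, value in prepared.headers.items():
    print(f'{name}: {value}')
```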
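III. POST Requests
1. Introduction
GET requests:
- The default request method in browsers.
- Carry no request body in practice; the data travels in the URL's query string.
- Should stay small. HTTP itself sets no limit, but browsers and servers cap URL length, so the often-quoted 1 KB figure is a rule of thumb rather than part of the protocol.
- The data is visible in the browser's address bar.
- Typical triggers:
  - Typing a URL directly into the browser's address bar.
  - Clicking a hyperlink on a page.
  - Submitting a form, which uses GET by default but can be set to POST.
POST requests:
- The data does not appear in the address bar.
- The protocol places no upper limit on the data size, although servers usually enforce one.
- Have a request body.
- Chinese characters in a form-encoded request body are URL-encoded.
`requests.post()` is used in much the same way as `requests.get()`. The difference is that `requests.post()` takes a `data` parameter, which holds the request body.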
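The difference between the two methods can be made visible by preparing one request of each kind without sending it. This is a sketch only: the `payload` dict is an invented example, and the httpbin URLs are reused from earlier in this article.

```python
import requests

payload = {'name': '中文'}  # invented example data

get_req = requests.Request('GET', 'http://httpbin.org/get', params=payload).prepare()
post_req = requests.Request('POST', 'http://httpbin.org/post', data=payload).prepare()

print(get_req.url)    # the data is part of the URL
print(get_req.body)   # a GET request prepared this way has no body
print(post_req.url)   # the URL carries no data
print(post_req.body)  # the data is URL-encoded in the body
print(post_req.headers['Content-Type'])
```

With `data=` set to a dict, the body is form-encoded, which is where the URL encoding of Chinese characters mentioned above comes from.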
2. Sending a POST request
Simulating a browser login
Example: logging in to GitHub automatically
```python
import re

import requests

# Step 1: fetch the login page to get the initial cookies and the CSRF token.
r1 = requests.get('https://github.com/login')
r1_cookie = r1.cookies.get_dict()
authenticity_token = re.findall(r'name="authenticity_token".*?value="(.*?)"', r1.text)[0]

# Step 2: post the credentials together with the token and the initial cookies.
data = {
    'commit': 'Sign in',
    'utf8': '✓',
    'authenticity_token': authenticity_token,
    'login': 'your_username',
    'password': 'your_password',
}
r2 = requests.post('https://github.com/session', data=data, cookies=r1_cookie)
login_cookie = r2.cookies.get_dict()

# Step 3: use the login cookies to request a page that requires authentication.
r3 = requests.get('https://github.com/settings/emails', cookies=login_cookie)
print('your_username' in r3.text)
```
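The token extraction step is the part most likely to break, because `re.findall(...)[0]` raises `IndexError` when the pattern does not match. A slightly more defensive sketch is shown below; the HTML snippet is a made-up, shortened stand-in for the real login page, whose markup may differ.

```python
import re

# Made-up, shortened login form; the real page is much larger and may change.
html = '''
<form action="/session" method="post">
  <input type="hidden" name="authenticity_token" value="tok123" />
  <input type="text" name="login" />
</form>
'''

# re.S lets '.' match newlines in case the attributes span several lines.
match = re.search(r'name="authenticity_token".*?value="(.*?)"', html, re.S)
token = match.group(1) if match else None
print(token)
```

Checking for `None` before posting the form gives a clearer failure than an `IndexError` if the page layout changes.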
Using `requests.Session` to manage cookies automatically
```python
import re

import requests

# A Session stores cookies from every response and sends them on later requests.
session = requests.Session()

r1 = session.get('https://github.com/login')
authenticity_token = re.findall(r'name="authenticity_token".*?value="(.*?)"', r1.text)[0]

data = {
    'commit': 'Sign in',
    'utf8': '✓',
    'authenticity_token': authenticity_token,
    'login': 'your_username',
    'password': 'your_password',
}
# No cookies= argument is needed; the session carries them over.
r2 = session.post('https://github.com/session', data=data)

r3 = session.get('https://github.com/settings/emails')
print('your_username' in r3.text)
```
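The cookie handling of a session can be observed without logging in anywhere. In the sketch below a cookie is placed into the session's jar by hand (in the login flow above, the server's responses would do this), and a request is then prepared through the session without being sent.

```python
import requests

session = requests.Session()
# Set by hand for illustration; normally the server's Set-Cookie headers fill the jar.
session.cookies.set('sessionid', 'abc123')

# prepare_request() merges the session's cookies into the outgoing request.
prepared = session.prepare_request(requests.Request('GET', 'https://example.com/'))
print(prepared.headers.get('Cookie'))
```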
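3. Additional notes
Custom request headers
```python
import json

import requests

url = 'http://httpbin.org/post'
headers = {'Content-Type': 'application/json'}
data = {'key': 'value'}

# With an explicit JSON Content-Type, the body has to be serialized to JSON too;
# passing the dict directly via data= would send a form-encoded body instead.
response = requests.post(url, data=json.dumps(data), headers=headers)
print(response.json())

# json= serializes the dict and sets the Content-Type header automatically.
response = requests.post(url, json=data)
print(response.json())
```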
Sending JSON data
```python
import requests

url = 'http://httpbin.org/post'
data = {'key': 'value'}

response = requests.post(url, json=data)
print(response.json())
```
Sending form data
```python
import requests

url = 'http://httpbin.org/post'
data = {'key': 'value'}

response = requests.post(url, data=data)
print(response.json())
```
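The two examples above differ only in `json=` versus `data=`, yet they produce different bodies and `Content-Type` headers. The following sketch prepares both requests without sending them so the difference can be printed side by side.

```python
import requests

url = 'http://httpbin.org/post'
payload = {'key': 'value'}

form_req = requests.Request('POST', url, data=payload).prepare()
json_req = requests.Request('POST', url, json=payload).prepare()

# data= with a dict gives a form-encoded body.
print(form_req.headers['Content-Type'], form_req.body)
# json= gives a JSON document as the body.
print(json_req.headers['Content-Type'], json_req.body)
```

Which one to use depends on what the server expects: HTML form handlers usually read form-encoded bodies, while JSON APIs usually expect `application/json`.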
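IV. Handling Responses
1. Response attributes
```python
import requests

response = requests.get('http://www.jianshu.com')

print(response.text)         # body decoded to str using response.encoding
print(response.content)      # raw body as bytes
print(response.status_code)  # HTTP status code
print(response.headers)      # response headers
print(response.cookies)      # cookies set by the server
print(response.url)          # final URL after any redirects
print(response.history)      # redirect responses that led to this one
print(response.encoding)     # encoding used to decode response.text
```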
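2. Closing the response
```python
from contextlib import closing

import requests

# closing() calls response.close() on exit, releasing the connection
# even if the body has not been read completely.
with closing(requests.get('http://example.com', stream=True)) as response:
    for line in response.iter_content():
        pass
```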
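3. Encoding issues
```python
import requests

response = requests.get('http://www.autohome.com/news')
# This example assumes the page is GBK-encoded; set the encoding before reading response.text.
response.encoding = 'gbk'
print(response.text)
```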
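Setting `response.encoding` matters because `response.text` is simply `response.content` decoded with that encoding. The effect of a wrong guess can be reproduced with plain bytes, as in this sketch; the sample string is an arbitrary choice.

```python
# Bytes as a GBK-encoded page would deliver them (arbitrary sample text).
raw = '汽车之家'.encode('gbk')

# Decoding with the wrong charset produces replacement characters.
wrong = raw.decode('utf-8', errors='replace')
# Decoding with the right charset restores the text.
right = raw.decode('gbk')

print(wrong)
print(right)
```

When the correct charset is not known in advance, `response.apparent_encoding` offers a guess based on the body, though it is a heuristic and can be wrong.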
4. Fetching binary data
```python
import requests

response = requests.get('https://timgsa.baidu.com/timg?image&quality=80&size=b9999_10000&sec=1509868306530&di=712e4ef3ab258b36e9f4b48e85a81c9d&imgtype=0&src=http%3A%2F%2Fc.hiphotos.baidu.com%2Fimage%2Fpic%2Fitem%2F11385343fbf2b211e1fb58a1c08065380dd78e0c.jpg')

# response.content holds the raw bytes, so the file is opened in binary mode.
with open('a.jpg', 'wb') as f:
    f.write(response.content)
```
5. Streaming large file downloads
```python
import requests

response = requests.get('https://gss3.baidu.com/6LZ0ej3k1Qd3ote6lo7D0j9wehsv/tieba-smallvideo-transcode/1767502_56ec685f9c7ec542eeaf6eac93a65dc7_6fe25cd1347c_3.mp4', stream=True)

with open('b.mp4', 'wb') as f:
    # Write the body in 1 KB chunks instead of loading it into memory at once.
    for chunk in response.iter_content(chunk_size=1024):
        if chunk:
            f.write(chunk)
```
6. Parsing JSON
```python
import requests

response = requests.get('http://httpbin.org/get')

# json() decodes the response body into Python objects.
res1 = response.json()
print(res1)
```
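7. Redirects and history
```python
import requests

r = requests.get('http://github.com')
print(r.url)          # final URL after redirects
print(r.status_code)  # status code of the final response
print(r.history)      # intermediate redirect responses
```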
V. Advanced Usage
1. SSL certificate verification
```python
import requests
from requests.packages.urllib3.exceptions import InsecureRequestWarning

# Default: the certificate is verified; an invalid certificate raises SSLError.
response = requests.get('https://www.12306.cn')

# Skip verification; the request goes through but emits an InsecureRequestWarning.
response = requests.get('https://www.12306.cn', verify=False)
print(response.status_code)

# Skip verification and silence the warning.
requests.packages.urllib3.disable_warnings(InsecureRequestWarning)
response = requests.get('https://www.12306.cn', verify=False)
print(response.status_code)

# Send a client certificate, for sites that require one.
response = requests.get('https://www.12306.cn', cert=('/path/server.crt', '/path/key'))
print(response.status_code)
```
2. Using proxies
```python
import requests

# Keys are the scheme of the target URL; values are the proxy addresses.
proxies = {
    'http': 'http://10.10.1.10:3128',
    'https': 'http://10.10.1.10:1080',
}

response = requests.get('https://api.github.com', proxies=proxies)
print(response.status_code)
```
3. Session objects
```python
import requests

session = requests.Session()
# Headers set on the session are sent with every request made through it.
session.headers.update({'User-Agent': 'MyApp/1.0'})

response = session.get('https://api.github.com')
print(response.status_code)
print(response.json())
```
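Session-level headers act as defaults that individual requests can extend or override. The sketch below prepares a request through the session without sending it; the `Accept` value is an illustrative choice, not something the example above requires.

```python
import requests

session = requests.Session()
session.headers.update({'User-Agent': 'MyApp/1.0'})

request = requests.Request(
    'GET',
    'https://api.github.com',
    headers={'Accept': 'application/json'},  # per-request header, illustrative
)
prepared = session.prepare_request(request)

print(prepared.headers['User-Agent'])  # taken from the session defaults
print(prepared.headers['Accept'])      # taken from the individual request
```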
4. Exception handling
```python
import requests
from requests.exceptions import RequestException

try:
    response = requests.get('https://api.github.com')
    # Raise HTTPError for 4xx and 5xx status codes.
    response.raise_for_status()
except RequestException as e:
    print(f"An error occurred: {e}")
else:
    print(response.json())
```
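The same pattern can be wrapped in a small helper. The sketch below is one possible shape, not a prescribed one: `fetch_json` is a hypothetical name, and the `timeout` argument is an addition that the example above does not use. The demonstration call passes a URL without a scheme, which `requests` rejects before any network traffic is sent.

```python
import requests
from requests.exceptions import RequestException


def fetch_json(url, timeout=5):
    """Return the decoded JSON body, or None if the request fails."""
    try:
        response = requests.get(url, timeout=timeout)
        response.raise_for_status()  # turn 4xx/5xx responses into HTTPError
        return response.json()
    except (RequestException, ValueError) as exc:
        # ValueError covers JSON decoding errors on older requests versions.
        print(f'Request failed: {exc}')
        return None


# A URL without a scheme raises MissingSchema, a RequestException subclass.
result = fetch_json('not-a-url')
print(result)
```

Returning `None` keeps the caller simple; code that needs to distinguish error types could re-raise or return the exception instead.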
VI. Login Example
GitHub login
```python
import re

import requests

r1 = requests.get('https://github.com/login')
r1_cookie = r1.cookies.get_dict()
authenticity_token = re.findall(r'name="authenticity_token".*?value="(.*?)"', r1.text)[0]

data = {
    'commit': 'Sign in',
    'utf8': '✓',
    'authenticity_token': authenticity_token,
    'login': 'your_username',
    'password': 'your_password',
}

# Default: redirects are followed, so r2 is the final response
# and r2.history holds the intermediate redirect responses.
r2 = requests.post('https://github.com/session', data=data, cookies=r1_cookie)
print(r2.status_code)
print(r2.url)
print(r2.history)

# With allow_redirects=False, r2 is the first response itself
# (typically a 3xx redirect) and r2.history is empty.
r2 = requests.post('https://github.com/session', data=data, cookies=r1_cookie, allow_redirects=False)
print(r2.status_code)
print(r2.url)
print(r2.history)
```