The requests Module: a Python HTTP Library

墨颜丶 · 2019-12-11 · 2024-11-13

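I. Introduction

requests is a Python library for sending HTTP requests. Its API is concise and more convenient to use than urllib; under the hood it is a wrapper around urllib3.

Notes

requests only downloads the content of a page; it does not execute any JavaScript in it. For data that a page loads through JavaScript, you have to analyze the target site yourself and issue the additional requests.
Before studying requests in earnest, it helps to be familiar with the HTTP protocol.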

Installation

pip3 install requests

Common request methods

GET and POST are the most commonly used request methods.
The others include PUT, DELETE, HEAD, and OPTIONS.

import requests

# GET request
r = requests.get('https://api.github.com/events')

# POST request
r = requests.post('http://httpbin.org/post', data={'key': 'value'})

# PUT request
r = requests.put('http://httpbin.org/put', data={'key': 'value'})

# DELETE request
r = requests.delete('http://httpbin.org/delete')

# HEAD request
r = requests.head('http://httpbin.org/get')

# OPTIONS request
r = requests.options('http://httpbin.org/get')

Official documentation

See the official requests documentation.

II. GET Requests

1. Basic request

import requests

response = requests.get('http://dig.chouti.com/')
print(response.text)

2. GET request with parameters

Building the query string by hand

import requests

response = requests.get('https://www.baidu.com/s?wd=python&pn=1',
                        headers={
                            'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.75 Safari/537.36',
                        })
print(response.text)

URL encoding

If the search keyword contains Chinese or other special characters, it has to be URL-encoded before it is placed in the URL.

import requests
from urllib.parse import quote_plus

wd = 'jerry老师'
# Percent-encode the keyword as UTF-8 so it is safe inside a query string
keyword = quote_plus(wd, encoding='utf-8')

url = f'https://www.baidu.com/s?wd={keyword}&pn=1'

response = requests.get(url,
                        headers={
                            'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.75 Safari/537.36',
                        })
print(response.text)
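
To see what URL encoding actually does to the keyword, the step can be tried on its own with the standard library, without sending any request. This is only a small sketch; the second sample string ('a b&c') is made up for illustration.

```python
from urllib.parse import quote_plus, unquote_plus

keyword = 'jerry老师'

# quote_plus percent-encodes the UTF-8 bytes of non-ASCII characters
# and turns spaces into '+', the usual convention for query strings.
encoded = quote_plus(keyword)
print(encoded)                # jerry%E8%80%81%E5%B8%88

# Reserved characters such as '&' are escaped too, so they cannot
# be mistaken for parameter separators.
special = quote_plus('a b&c')
print(special)                # a+b%26c

# unquote_plus reverses the encoding.
decoded = unquote_plus(encoded)
print(decoded == keyword)     # True
```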

Using the params argument

import requests

wd = 'jerry老师'
pn = 1

response = requests.get('https://www.baidu.com/s',
                        params={
                            'wd': wd,
                            'pn': pn
                        },
                        headers={
                            'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.75 Safari/537.36',
                        })
print(response.text)
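
With params, requests does the encoding and the string concatenation for you. One way to check the result without touching the network is to build the request and prepare it instead of sending it; the sketch below does that with requests.Request and inspects the final URL.

```python
import requests

# Build the request but do not send it, so the final URL can be inspected offline.
prepared = requests.Request(
    'GET',
    'https://www.baidu.com/s',
    params={'wd': 'jerry老师', 'pn': 1},
).prepare()

print(prepared.url)
# https://www.baidu.com/s?wd=jerry%E8%80%81%E5%B8%88&pn=1
```

The keyword ends up percent-encoded exactly as in the manual version, which is why passing params is usually the less error-prone choice.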

3. GET request with headers

import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.76 Mobile Safari/537.36',
}

response = requests.get('https://www.zhihu.com/explore', headers=headers)
print(response.status_code)  # 200

4. GET request with cookies

import requests

cookies = {
    'sessionid': 'abc123',
}

response = requests.get('https://example.com', cookies=cookies)
print(response.text)
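
The cookies argument is turned into an ordinary Cookie request header. As a quick offline check, the request can be prepared rather than sent and the header printed; the cookie name and value are the same placeholders as above.

```python
import requests

# Prepare the request without sending it and look at the generated header.
prepared = requests.Request(
    'GET',
    'https://example.com',
    cookies={'sessionid': 'abc123'},
).prepare()

cookie_header = prepared.headers['Cookie']
print(cookie_header)  # sessionid=abc123
```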

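III. POST Requests

1. Introduction

GET requests:
- The default request method.
- Normally carry no request body.
- The data travels in the URL, so its size is bounded by the URL length limits of browsers and servers (HTTP itself does not define a fixed limit such as 1 KB).
- The data is exposed in the browser's address bar.
- Typical ways to trigger one:
  - Entering a URL directly in the browser's address bar.
  - Clicking a hyperlink on a page.
  - Submitting a form, which uses GET by default but can be set to POST.
POST requests:
- The data does not appear in the address bar.
- The protocol sets no upper limit on the data size, although servers usually enforce their own.
- Have a request body.
- Chinese and other non-ASCII characters in a form-encoded request body are URL-encoded.
- requests.post() is used in the same way as requests.get(); the difference is that requests.post() takes a data argument that holds the request-body data.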
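
The difference between the two methods can be made visible by preparing one request of each kind and comparing where the data ends up. This is a sketch that sends nothing; the payload is invented for illustration and the httpbin.org URLs are only used as placeholders.

```python
import requests

payload = {'name': '老师', 'pn': 1}

# Same data, once as query parameters of a GET and once as the body of a POST.
get_req = requests.Request('GET', 'http://httpbin.org/get', params=payload).prepare()
post_req = requests.Request('POST', 'http://httpbin.org/post', data=payload).prepare()

print(get_req.url)    # http://httpbin.org/get?name=%E8%80%81%E5%B8%88&pn=1
print(get_req.body)   # None: the GET request has no body

print(post_req.url)   # http://httpbin.org/post
print(post_req.body)  # name=%E8%80%81%E5%B8%88&pn=1
print(post_req.headers['Content-Type'])  # application/x-www-form-urlencoded
```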

2. Sending a POST request

Simulating a browser login

Example: logging in to GitHub automatically

import requests
import re

# First request: obtain the initial cookies and the CSRF token
r1 = requests.get('https://github.com/login')
r1_cookie = r1.cookies.get_dict()  # initial cookies (not yet authenticated)
authenticity_token = re.findall(r'name="authenticity_token".*?value="(.*?)"', r1.text)[0]  # CSRF token taken from the page

# Second request: POST the username and password to the login endpoint, together with the initial cookies and the token
data = {
    'commit': 'Sign in',
    'utf8': '✓',
    'authenticity_token': authenticity_token,
    'login': 'your_username',
    'password': 'your_password'
}

r2 = requests.post('https://github.com/session', data=data, cookies=r1_cookie)
# Note: r2.cookies only holds cookies set by the final response, after any redirects
login_cookie = r2.cookies.get_dict()

# Third request: from now on, sending login_cookie is enough
r3 = requests.get('https://github.com/settings/emails', cookies=login_cookie)
print('your_username' in r3.text)  # True if the login succeeded
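
The token-extraction step is the most fragile part of the flow, because re.findall(...)[0] raises IndexError as soon as the page markup changes. A slightly more defensive sketch is shown below; the HTML fragment is hypothetical and the helper name extract_token is made up for this example.

```python
import re

# Hypothetical fragment of a login page; the real markup may differ.
SAMPLE_HTML = '''
<form action="/session" method="post">
  <input type="hidden" name="authenticity_token" value="abc123token" />
  <input type="text" name="login" />
</form>
'''

# Same pattern as in the login example; re.S lets '.' span line breaks.
TOKEN_RE = re.compile(r'name="authenticity_token".*?value="(.*?)"', re.S)


def extract_token(html):
    """Return the CSRF token, or None if the page has no token field."""
    match = TOKEN_RE.search(html)
    return match.group(1) if match else None


print(extract_token(SAMPLE_HTML))      # abc123token
print(extract_token('<html></html>'))  # None
```

Returning None lets the caller decide how to report a changed login page instead of crashing on an index lookup.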

Letting requests.Session manage cookies automatically

import requests
import re

session = requests.Session()

# First request: obtain the initial cookies and the CSRF token
r1 = session.get('https://github.com/login')
authenticity_token = re.findall(r'name="authenticity_token".*?value="(.*?)"', r1.text)[0]  # CSRF token taken from the page

# Second request: POST the username and password to the login endpoint; the session sends the initial cookies by itself
data = {
    'commit': 'Sign in',
    'utf8': '✓',
    'authenticity_token': authenticity_token,
    'login': 'your_username',
    'password': 'your_password'
}

r2 = session.post('https://github.com/session', data=data)

# Third request: the session already carries the login cookies, so nothing has to be passed by hand
r3 = session.get('https://github.com/settings/emails')
print('your_username' in r3.text)  # True if the login succeeded

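3. Additional notes

Custom request headers

import json
import requests

url = 'http://httpbin.org/post'
headers = {
    'Content-Type': 'application/json'
}
data = {
    'key': 'value'
}

# With the data argument: serialize the body yourself so that it matches the Content-Type header
response = requests.post(url, data=json.dumps(data), headers=headers)
print(response.json())

# With the json argument: requests serializes the body and sets the Content-Type header itself
response = requests.post(url, json=data)
print(response.json())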

Sending JSON data

import requests

url = 'http://httpbin.org/post'
data = {
    'key': 'value'
}

response = requests.post(url, json=data)
print(response.json())

Sending form data

import requests

url = 'http://httpbin.org/post'
data = {
    'key': 'value'
}

response = requests.post(url, data=data)
print(response.json())
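
The practical difference between data= and json= is the body encoding and the Content-Type header that requests generates. The following sketch prepares both variants without sending them, so the two can be compared side by side.

```python
import json

import requests

payload = {'key': 'value'}
url = 'http://httpbin.org/post'

# Prepare both variants without sending them.
form_req = requests.Request('POST', url, data=payload).prepare()
json_req = requests.Request('POST', url, json=payload).prepare()

print(form_req.headers['Content-Type'])  # application/x-www-form-urlencoded
print(form_req.body)                     # key=value

print(json_req.headers['Content-Type'])  # application/json
# The JSON body may be bytes or str depending on the requests version;
# json.loads accepts both.
json_body = json.loads(json_req.body)
print(json_body)                         # {'key': 'value'}
```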

IV. Handling Responses

1. Response attributes

import requests

response = requests.get('http://www.jianshu.com')

print(response.text)        # response body as a string
print(response.content)     # response body as bytes
print(response.status_code) # HTTP status code
print(response.headers)     # response headers
print(response.cookies)     # cookies set by the response
print(response.url)         # final URL of the response
print(response.history)     # redirect history
print(response.encoding)    # encoding used to decode response.text

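2. Closing the response

import requests
from contextlib import closing

# With stream=True the connection stays open until the body is consumed or the response is closed
with closing(requests.get('http://example.com', stream=True)) as response:
    for chunk in response.iter_content(chunk_size=1024):
        pass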

3. Encoding issues

import requests

response = requests.get('http://www.autohome.com/news')
response.encoding = 'gbk'  # set the encoding used to decode response.text
print(response.text)
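
Why setting response.encoding helps: response.text is simply response.content decoded with response.encoding, and for text responses whose headers declare no charset requests falls back to ISO-8859-1. The effect can be reproduced with plain bytes; the sample string below is invented, and no request is made.

```python
# Bytes as a GBK-encoded page would deliver them (sample text is made up).
raw = '汽车新闻'.encode('gbk')

# Decoding with the wrong codec produces mojibake instead of an error.
wrong = raw.decode('ISO-8859-1')
# Decoding with the page's real encoding recovers the text.
right = raw.decode('gbk')

print(repr(wrong))
print(right)  # 汽车新闻
```

This is the same repair that assigning response.encoding = 'gbk' performs before response.text is read.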

4. Fetching binary data

import requests

response = requests.get('https://timgsa.baidu.com/timg?image&quality=80&size=b9999_10000&sec=1509868306530&di=712e4ef3ab258b36e9f4b48e85a81c9d&imgtype=0&src=http%3A%2F%2Fc.hiphotos.baidu.com%2Fimage%2Fpic%2Fitem%2F11385343fbf2b211e1fb58a1c08065380dd78e0c.jpg')

with open('a.jpg', 'wb') as f:
    f.write(response.content)

5. Streaming a large file download

import requests

response = requests.get('https://gss3.baidu.com/6LZ0ej3k1Qd3ote6lo7D0j9wehsv/tieba-smallvideo-transcode/1767502_56ec685f9c7ec542eeaf6eac93a65dc7_6fe25cd1347c_3.mp4', stream=True)

with open('b.mp4', 'wb') as f:
    for chunk in response.iter_content(chunk_size=1024):
        if chunk:
            f.write(chunk)

6. Parsing JSON

import requests

response = requests.get('http://httpbin.org/get')

res1 = response.json()  # parse the response body as JSON
print(res1)

7. Redirects and history

import requests

r = requests.get('http://github.com')
print(r.url)           # final URL
print(r.status_code)   # status code
print(r.history)       # redirect history
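
Redirect handling can also be observed without depending on an external site. The sketch below starts a throwaway HTTP server on localhost that answers /old with a 302 to /new; the handler class, the paths, and the page body are all made up for the demonstration.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

import requests


class RedirectHandler(BaseHTTPRequestHandler):
    """Toy handler: /old redirects to /new, everything else returns 200."""

    def do_GET(self):
        if self.path == '/old':
            self.send_response(302)
            self.send_header('Location', '/new')
            self.send_header('Content-Length', '0')
            self.end_headers()
        else:
            body = b'final page'
            self.send_response(200)
            self.send_header('Content-Length', str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo output quiet


# Port 0 asks the OS for any free port.
server = HTTPServer(('127.0.0.1', 0), RedirectHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
base = f'http://127.0.0.1:{server.server_port}'

session = requests.Session()
session.trust_env = False  # ignore proxy settings from the environment

followed = session.get(base + '/old')
not_followed = session.get(base + '/old', allow_redirects=False)

server.shutdown()
server.server_close()

history_codes = [r.status_code for r in followed.history]
print(followed.status_code, followed.url.endswith('/new'))  # 200 True
print(history_codes)                                        # [302]
print(not_followed.status_code, not_followed.history)       # 302 []
print(not_followed.headers['Location'])                     # /new
```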

V. Advanced Usage

1. SSL certificate verification

import requests
import urllib3

# By default, HTTPS requests verify the server certificate; an invalid certificate raises SSLError
response = requests.get('https://www.12306.cn')

# Skip certificate verification (an InsecureRequestWarning is emitted)
response = requests.get('https://www.12306.cn', verify=False)
print(response.status_code)

# Silence the warning
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
response = requests.get('https://www.12306.cn', verify=False)
print(response.status_code)

# Supply a client certificate
response = requests.get('https://www.12306.cn', cert=('/path/server.crt', '/path/key'))
print(response.status_code)

2. Using proxies

import requests

proxies = {
    'http': 'http://10.10.1.10:3128',
    'https': 'http://10.10.1.10:1080',
}

response = requests.get('https://api.github.com', proxies=proxies)
print(response.status_code)
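
The keys of the proxies dictionary are matched against the scheme of the URL being requested, not against the scheme of the proxy itself. A small sketch with requests.utils.select_proxy, a helper from requests.utils that is used here purely for inspection, shows which entry would be chosen; no connection is made.

```python
from requests.utils import select_proxy

proxies = {
    'http': 'http://10.10.1.10:3128',
    'https': 'http://10.10.1.10:1080',
}

# An https:// URL picks the 'https' entry, even though the proxy URL is http://.
https_proxy = select_proxy('https://api.github.com', proxies)
http_proxy = select_proxy('http://httpbin.org/get', proxies)

print(https_proxy)  # http://10.10.1.10:1080
print(http_proxy)   # http://10.10.1.10:3128
```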

3. Session objects

import requests

session = requests.Session()
session.headers.update({'User-Agent': 'MyApp/1.0'})

response = session.get('https://api.github.com')
print(response.status_code)  # print the status code
print(response.json())       # parse the JSON response
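
A session merges its own headers and cookies into every request it sends. As a rough illustration, the request below is prepared through the session but never sent; the cookie name and value and the extra Accept header are placeholders chosen for the example.

```python
import requests

session = requests.Session()
session.headers.update({'User-Agent': 'MyApp/1.0'})
session.cookies.set('sessionid', 'abc123')

# prepare_request applies the session-level settings to a single request.
prepared = session.prepare_request(
    requests.Request(
        'GET',
        'https://api.github.com',
        headers={'Accept': 'application/json'},
    )
)

print(prepared.headers['User-Agent'])  # MyApp/1.0 (from the session)
print(prepared.headers['Accept'])      # application/json (from the request)
print(prepared.headers['Cookie'])      # sessionid=abc123 (from the session)
```

Per-request headers override session headers with the same name, which is why the Accept value of the request wins here.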

4. Exception handling

import requests
from requests.exceptions import RequestException

try:
    response = requests.get('https://api.github.com')
    response.raise_for_status()  # raises HTTPError for 4xx and 5xx status codes
except RequestException as e:
    print(f"An error occurred: {e}")
else:
    print(response.json())
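
Which status codes make raise_for_status() raise can be checked offline with hand-built Response objects. This is only a sketch: the helper raises_http_error is made up for the example, and 'not-a-url' is a deliberately malformed address.

```python
import requests
from requests.exceptions import HTTPError, RequestException


def raises_http_error(status_code):
    """Report whether raise_for_status() raises for the given status code."""
    response = requests.Response()  # built by hand, nothing is sent
    response.status_code = status_code
    try:
        response.raise_for_status()
    except HTTPError:
        return True
    return False


results = {code: raises_http_error(code) for code in (200, 302, 404, 500)}
print(results)  # {200: False, 302: False, 404: True, 500: True}

# A malformed URL fails before any network I/O, and the error is still
# a RequestException, so one except clause covers it as well.
try:
    requests.get('not-a-url')
except RequestException as exc:
    error_name = type(exc).__name__
    print(error_name)  # MissingSchema
```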

VI. Login Example

GitHub login

import requests
import re

# First request
r1 = requests.get('https://github.com/login')
r1_cookie = r1.cookies.get_dict()  # initial cookies (not yet authenticated)
authenticity_token = re.findall(r'name="authenticity_token".*?value="(.*?)"', r1.text)[0]  # CSRF token taken from the page

# Second request: POST the username and password to the login endpoint, together with the initial cookies and the token
data = {
    'commit': 'Sign in',
    'utf8': '✓',
    'authenticity_token': authenticity_token,
    'login': 'your_username',
    'password': 'your_password'
}

# Without `allow_redirects=False`, requests follows the redirect when the response carries a Location header
r2 = requests.post('https://github.com/session', data=data, cookies=r1_cookie)
print(r2.status_code)  # 200
print(r2.url)  # the page after the redirect
print(r2.history)  # the responses before the redirect

# With `allow_redirects=False`, the redirect is not followed even if the response carries a Location header
r2 = requests.post('https://github.com/session', data=data, cookies=r1_cookie, allow_redirects=False)
print(r2.status_code)  # 302
print(r2.url)  # the page before the redirect
print(r2.history)  # []
