I Introduction
Out of the box, Scrapy's Scheduler maintains a task queue on the local machine (holding Request objects together with their callbacks and other information) plus a local dedup queue (holding the URLs that have already been visited).

The key to distributed crawling is therefore to run a shared queue, for example Redis, on a dedicated host,
and then rewrite Scrapy's Scheduler so that the new Scheduler stores and fetches Requests in that shared queue and filters out duplicate Requests. In summary, distribution comes down to three points:
1. A shared queue
2. A rewritten Scheduler that goes to the shared queue for both dedup checks and task scheduling
3. A dedup rule tailored for the Scheduler (implemented with Redis's set type)

These three points are exactly the core functionality of the scrapy-redis component.

# Installation:
pip3 install scrapy-redis

# Source code:
D:\python3.6\Lib\site-packages\scrapy_redis
II The scrapy-redis component
1. Using only the dedup feature of scrapy-redis

Using the shared dedup queue + source code analysis
# settings.py -- connect to Redis and enable the shared dedup filter
REDIS_HOST = 'localhost'                              # Redis host
REDIS_PORT = 6379                                     # Redis port
REDIS_URL = 'redis://user:pass@hostname:9001'         # connection URL (takes precedence over host/port)
REDIS_PARAMS = {}                                     # extra parameters for the Redis connection
REDIS_PARAMS['redis_cls'] = 'myproject.RedisClient'   # specify a custom Redis client class
REDIS_ENCODING = "utf-8"                              # encoding used for Redis

# use the dedup filter provided by scrapy-redis
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
# key of the dedup set stored in Redis
DUPEFILTER_KEY = 'dupefilter:%(timestamp)s'

Source analysis -- scrapy_redis/dupefilter.py, RFPDupeFilter.request_seen:

def request_seen(self, request):
    """Returns True if request was already seen.

    Parameters
    ----------
    request : scrapy.http.Request

    Returns
    -------
    bool
    """
    fp = self.request_fingerprint(request)
    # SADD returns 0 when the fingerprint is already in the set, i.e. the request was seen
    added = self.server.sadd(self.key, fp)
    return added == 0

The fingerprint is computed by scrapy.utils.request.request_fingerprint:

from scrapy.http import Request
from scrapy.utils.request import request_fingerprint

req = Request(url='http://www.baidu.com')
result = request_fingerprint(req)
print(result)

Notes:
- URLs whose query parameters appear in a different order produce the same fingerprint;
- request headers are not included in the fingerprint by default; include_headers can be used to specify which headers to include.
- Example:

from scrapy.utils import request
from scrapy.http import Request

req = Request(url='http://www.baidu.com?name=8&id=1',
              callback=lambda x: print(x),
              cookies={'k1': 'vvvvv'})
result1 = request.request_fingerprint(req, include_headers=['cookies', ])
print(result1)

req = Request(url='http://www.baidu.com?id=1&name=8',
              callback=lambda x: print(x),
              cookies={'k1': 666})
result2 = request.request_fingerprint(req, include_headers=['cookies', ])
print(result1 == result2)
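To see the shared dedup behaviour outside of a running crawl, the filter can also be driven by hand. Below is a minimal sketch, assuming a local Redis instance on the default port; the key name 'demo:dupefilter' is just a throwaway value chosen for this demo:

import redis
from scrapy.http import Request
from scrapy_redis.dupefilter import RFPDupeFilter

server = redis.StrictRedis(host='localhost', port=6379)
df = RFPDupeFilter(server, key='demo:dupefilter')      # demo key, not a real crawl key

req = Request(url='http://www.baidu.com')
print(df.request_seen(req))   # False -- first time seen, fingerprint added via SADD
print(df.request_seen(req))   # True  -- SADD returns 0, so the request is a duplicate

server.delete('demo:dupefilter')   # clean up the demo key

Because every process pointing at the same Redis server and key shares one fingerprint set, a second process using the same key would see these requests as already visited.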
2. Using scrapy-redis's dedup + scheduler for distributed crawling
SCHEDULER = "scrapy_redis.scheduler.Scheduler"                         # use the scheduler provided by scrapy-redis
SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.PriorityQueue'             # queue class kept in Redis: PriorityQueue (default), FifoQueue or LifoQueue
SCHEDULER_SERIALIZER = "scrapy_redis.picklecompat"                     # serializer used to put Requests into Redis
SCHEDULER_QUEUE_KEY = '%(spider)s:requests'                            # Redis key of the shared request queue
SCHEDULER_PERSIST = True                                               # keep the requests and dupefilter keys in Redis when the spider closes
SCHEDULER_FLUSH_ON_START = False                                       # whether to flush the queue and dupefilter when the spider starts
SCHEDULER_IDLE_BEFORE_CLOSE = 10                                       # seconds to block waiting for a request from Redis before the spider goes idle
SCHEDULER_DUPEFILTER_KEY = '%(spider)s:dupefilter'                     # Redis key of the dedup set
SCHEDULER_DUPEFILTER_CLASS = 'scrapy_redis.dupefilter.RFPDupeFilter'   # dedup class used by the scheduler
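With these settings in place, an ordinary scrapy.Spider already reads its requests from the shared Redis queue; nothing in the spider itself has to change. A minimal sketch, in which the spider name, start URL and parse logic are placeholders:

# myproject/spiders/demo.py
import scrapy

class DemoSpider(scrapy.Spider):
    name = 'demo'
    start_urls = ['http://www.baidu.com']          # placeholder start URL

    def parse(self, response):
        # placeholder parse logic
        yield {'url': response.url, 'title': response.css('title::text').get()}

Running scrapy crawl demo on several machines that point at the same Redis server makes them share one request queue and one dedup set, which is exactly the distributed setup described in the introduction.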
3. Persistence
REDIS_ITEMS_KEY = '%(spider)s:items'       # Redis key (a list) under which serialized items are stored
REDIS_ITEMS_SERIALIZER = 'json.dumps'      # serializer used to dump the items
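These two settings only take effect when items are actually written to Redis, which scrapy-redis does through its RedisPipeline. A minimal sketch of enabling it in settings.py; the priority value 300 is an arbitrary choice:

ITEM_PIPELINES = {
    'scrapy_redis.pipelines.RedisPipeline': 300,   # push scraped items into Redis
}

Every item yielded by the spiders is then serialized (here with json.dumps) and pushed onto the %(spider)s:items list in Redis, where a separate consumer process can pick it up for storage.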
4. Getting start URLs from Redis
Normally a Scrapy program stops as soon as it has finished crawling the target site; if the site is updated later and we want to crawl it again, we have to start Scrapy all over, which is tedious. scrapy-redis provides a mechanism that lets Scrapy fetch its start URLs from Redis: if none are available, it waits a while and tries again instead of shutting down. That way we only need a small script that periodically pushes a start URL onto the Redis queue.

REDIS_START_URLS_KEY = '%(name)s:start_urls'   # Redis key the spider reads its start URLs from
REDIS_START_URLS_AS_SET = False                # False: the key is a list (LPUSH/RPUSH); True: it is a set (SADD)
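To use this, the spider inherits from RedisSpider (provided by scrapy-redis) instead of scrapy.Spider and declares the key it listens on. A minimal sketch; the spider name, redis_key and parse logic are placeholders:

# myproject/spiders/demo_redis.py
from scrapy_redis.spiders import RedisSpider

class DemoRedisSpider(RedisSpider):
    name = 'demo_redis'
    redis_key = 'demo_redis:start_urls'   # overrides REDIS_START_URLS_KEY for this spider

    def parse(self, response):
        # placeholder parse logic
        yield {'url': response.url}

The spider then idles until a URL is pushed, for example from redis-cli:

lpush demo_redis:start_urls http://www.baidu.com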