教你一种1分钟下载1万个网页的方法，你学吗？_爬虫

一：模块介绍

Pycurl是一个用C语言编写的libcurl Python实现，功能非常强大，支持操作协议有FTP，HTTP，HTTPS，TELNET等。与urllib相比，Pycurl的速度要快很多。

二：安装

大家可以去官网下载与本地Python一直的whl或exe包。

也可以使用下面的命令行直接安装。

pip install pycurl

pycurl安装.png

三：主要方法

pycurl.Curl() #创建一个pycurl对象的方法

pycurl.Curl().setopt(pycurl.URL, http://www.pythontab.com) #设置要访问的URL

pycurl.Curl().setopt(pycurl.MAXREDIRS, 5) #设置最大重定向次数

pycurl.Curl().setopt(pycurl.CONNECTTIMEOUT, 60)

pycurl.Curl().setopt(pycurl.TIMEOUT, 300) #连接超时设置

pycurl.Curl().setopt(pycurl.USERAGENT, "Mozilla/5.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)") #模拟浏览器

pycurl.Curl().perform() #服务器端返回的信息

pycurl.Curl().getinfo(pycurl.HTTP_CODE) #查看HTTP的状态 类似urllib中status属性

pycurl.HTTP_CODE	 HTTP 响应代码

pycurl.REDIRECT_COUNT 重定向的次数

pycurl.SIZE_DOWNLOAD 下载的数据大小

pycurl.CONTENT_LENGTH_DOWNLOAD 下载内容长度

pycurl.RESPONSE_CODE 响应代码

pycurl.SPEED_DOWNLOAD 下载速度

四：示例DEMO

下面的示例代码是我2015年开发通用爬虫时，使用的一部分，现在已经改成并发模式了(后面会有简单的介绍)，不过大家还是可以做个参考。

# encoding=utf-8
import pycurl , traceback
from com.fy.utils.html.HtmlCode import HtmlCodeUtils
from io import BytesIO  # <-- 这个用到里面的write函数
class Utils_PyCurl:
	def __init__(self,):
    self.hcu = HtmlCodeUtils() #根据HTML源码获取网页编码
    self.c = pycurl.Curl() 
    self.c.setopt(pycurl.CONNECTTIMEOUT, 60)#链接超时
    self.c.setopt(pycurl.TIMEOUT, 60) #下载超时
    self.c.setopt(pycurl.MAXREDIRS, 5)#设置重定向次数
    self.c.setopt(pycurl.USERAGENT, 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)')#模拟浏览器
    
def PyCurl_Html(self, url):
    self.requestCode = 0#URL请求返回码
    host = url
    for i in range(2):
        del  i
        html, code = self.fetch_html_pycurl(host)
        if int(self.requestCode) == 200:
            return  html, code 
        elif html == None:
            continue
    return None, None

def fetch_html_pycurl(self, host):
    result = None
    try:
        self.c.setopt(self.c.URL, host) 
        self.c.setopt(self.c.HTTPHEADER, ['Accept:'])
        e = BytesIO()
        self.c.setopt(self.c.WRITEFUNCTION, e.write)
        self.c.setopt(self.c.FOLLOWLOCATION, 1) 
        self.c.perform()
        data = e.getvalue();
        html_code = self.hcu.getChardet(data)#获取网页编码
        result = str(data.decode(html_code, 'ignore'))
        self.requestCode = self.c.getinfo(self.c.HTTP_CODE)
        if result != None:print(len(result), self.requestCode)
        else:print(self.requestCode)
    except :print(traceback.print_exc())
    return result, self.requestCode
pyc = Utils_PyCurl()
html, code = pyc.PyCurl_Html("文章地址")
print("code：", code)
print(html)

五：并发

在大批量数据采集时，我们需要对上述的代码进行二次封装，实现并发处理，提高下载的速度。多线程是实现并发最基础的方式，这里就不再次介绍。今天我们介绍一下基Gevent协程的并发处理。

1：Gevent介绍

gevent是Python世界中最重要的异步网络库，可以大幅度提高系统的性能。最可贵的是，它允许我们几乎不修改代码，把同步程序变为异步程序。

2016年是Python3的重要时刻。Scrapy刚宣布支持Python3不久，Gevent又宣布支持Python3了。

给大家一个编码建议：如果用Python2，只用Python2.7；如果使用Python3，请至少使用Python3.4，最好使用Python3.5。

支持Python版本：

Python2.7及Python>=3.4

Gevent是基于协程的Python网络库。

包含的特性：

• 基于libev的快速事件循环(Linux上epoll，FreeBSD上kqueue）。

• 基于greenlet的轻量级执行单元。

•API的概念和Python标准库一致(如事件，队列)。

•可以配合socket，ssl模块使用。

•能够使用标准库和第三方模块创建标准的阻塞套接字(gevent.monkey)。

•默认通过线程池进行DNS查询,也可通过c-are(通过GEVENT_RESOLVER=ares环境变量开启）。

•TCP/UDP/HTTP服务器

•子进程支持（通过gevent.subprocess）

•线程池

** 2：安装**

pip install wheel

pip install gevent 或者 pip3 install gevent

具体安装命令视安装的Python 的版本而定。当出现 Successfully installed ...... 的提示后，即表示 gevent 安装成功。

如果中途出现异常，一定是依赖的包没有提前安装。具体的问题，根据提示处理，基本上就可以解决了。

3：基于的Gevent和pycurl 并发模式改造后的完整代码如下：

import pycurl , time, traceback
#如果没有给gevent打上补丁的话，它是检测不到除gevent它本省自带的IO操作的，当打上了补丁，它就能检测到程序其他所有的IO操作
from com.fy.utils.html.HtmlCode import HtmlCodeUtils
from io import BytesIO  # <-- 这个用到里面的write函数
from gevent import monkey; 
monkey.patch_all()#把当前程序的所有IO操作给我单独的做上标记
#如果没有给gevent打上补丁的话，它是检测不到除gevent它本省自带的IO操作的，当打上了补丁，它就能检测到程序其他所有的IO操作
import  gevent  
class Utils_PyCurl:
def __init__(self,):
    self.hcu = HtmlCodeUtils() #根据HTML源码获取网页编码
    self.c = pycurl.Curl() 
    self.c.setopt(pycurl.CONNECTTIMEOUT, 60)#链接超时
    self.c.setopt(pycurl.TIMEOUT, 60) #下载超时
    self.c.setopt(pycurl.MAXREDIRS, 5)#设置重定向次数
    self.c.setopt(pycurl.USERAGENT, 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)')#模拟浏览器

def geventControl(self, tasks, urlFieldName): 
    print("待处理的任务【" + str(len(tasks)) + "】个...")
    start_time = time.time()
    self.urlFieldName = urlFieldName#集合中存放地址的key
    jobs = [gevent.spawn(self.fetch_html_pycurl, task) for task in tasks]
    htmls = gevent.joinall(jobs, timeout=60, raise_error=False)
    end_time = time.time()
    print("下载页面【完毕】，共历时：【" + str((end_time - start_time))[0:4] + "】s.....\n")
    del htmls
    return tasks, (end_time - start_time)
    
def fetch_html_pycurl(self, task):
    host = task[self.urlFieldName]#待处理的地址
    html = None#HTML源码
    requestCode = 200#返回的状态码
    try:
        self.c.setopt(self.c.URL, host) 
        self.c.setopt(self.c.HTTPHEADER, ['Accept:'])
        e = BytesIO()
        self.c.setopt(self.c.WRITEFUNCTION, e.write)
        self.c.setopt(self.c.FOLLOWLOCATION, 1) 
        self.c.perform()
        data = e.getvalue();
        html_code = self.hcu.getChardet(data)#获取网页编码
        html = str(data.decode(html_code, 'ignore'))
        requestCode = self.c.getinfo(self.c.HTTP_CODE)
        if html != None:print("HTML-length：", len(html), "HTTP_CODE：", requestCode, "request-url：", host)
        else:print("HTML-length：", 0, "HTTP_CODE：", requestCode, "request-url：", host)
    except :
        requestCode = 400
        print(traceback.print_exc())
    task['HTML'] = html#HTML源码
    task['HTTP_CODE'] = requestCode#返回的状态码
    return task
pyc = Utils_PyCurl()
start_time = time.time()
for pageNo in range(1, 11):
	tasks = []
	for i in  range(1, 100):
    	task = {}
    	task['url'] = "文章地址"
    	tasks.append(task) 
	pyc.geventControl(tasks, "url")  
end_time = time.time()
print((end_time - start_time))

通过上述改造完以后，对其做了个简单的测试。

样本URL：1000个，分十批，每批次100个URL。共历时49秒。

用的是我的一个用了三年的笔记本，性能较差，如果是放在服务器的Linux环境下的话，30秒内下载完毕，绝对不在话下。

在分布式大批量采集中，使用多进程+协程并发的模式，单台服务器，一分钟下载1W个网页，绝对没有问题。

今天就说这么多了，改天介绍一下采集的整体架构。

博主QQ

博主QQ：

博主微信

博主微信：

博主公号

博主公众号：

回到顶部

教你一种1分钟下载1万个网页的方法，你学吗？

一：模块介绍

二：安装

三：主要方法

四：示例DEMO

五：并发

相关阅读：

全部评论: 0 条

博主公众号: 博主微信:

热门文章

最新发布

最新评论