Happy New Year, everyone! Wishing you great fortune and a roaring start to the Year of the Ox!
I received many New Year greetings from friends over the holiday and couldn't reply to each one individually; my apologies!
It has been a long time since I last wrote a crawler, so I'm a bit rusty. I found a practice target on the 52pojie (吾爱) forum: wallhaven, a wallpaper site outside China. Using its toplist of popular images as the example, let's revisit Python image scraping. If you're interested, try it yourself!
Target URL: https://wallhaven.cc/toplist
A quick look at the site makes its pagination pattern obvious:
https://wallhaven.cc/toplist?page=1
https://wallhaven.cc/toplist?page=2
https://wallhaven.cc/toplist?page=3
So we can use Python string formatting to construct the listing-page URLs:
f"https://wallhaven.cc/toplist?page={pagenum}"
where pagenum is the page number.
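For example, looping over the first few page numbers produces exactly the URLs observed above:

```python
# build the listing-page URLs from the page number
for pagenum in range(1, 4):
    print(f"https://wallhaven.cc/toplist?page={pagenum}")
# https://wallhaven.cc/toplist?page=1
# https://wallhaven.cc/toplist?page=2
# https://wallhaven.cc/toplist?page=3
```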
Looking further at the image data:
Thumbnail URL: https://th.wallhaven.cc/small/rd/rddgwm.jpg
Full-size URL: https://w.wallhaven.cc/full/rd/wallhaven-rddgwm.jpg
We can likewise use Python string manipulation to turn the thumbnail URL into the full-size image URL:
img = imgsrc.replace("th", "w").replace("small", "full")
imgs = img.split('/')
imgurl = f"{'/'.join(imgs[:-1])}/wallhaven-{imgs[-1]}"
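Running those three lines on the sample thumbnail URL confirms they produce the full-size URL. (Note that the blanket `.replace("th", "w")` is fragile: it would misfire if "th" or "small" ever appeared elsewhere in the URL; it happens to be safe for these hostnames and IDs.)

```python
imgsrc = "https://th.wallhaven.cc/small/rd/rddgwm.jpg"
# swap subdomain th -> w and path segment small -> full
img = imgsrc.replace("th", "w").replace("small", "full")
# prepend "wallhaven-" to the file name
imgs = img.split('/')
imgurl = f"{'/'.join(imgs[:-1])}/wallhaven-{imgs[-1]}"
print(imgurl)
# https://w.wallhaven.cc/full/rd/wallhaven-rddgwm.jpg
```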
There is a potential bug here, though: the thumbnail may use the .jpg extension while the full-size image is actually a .png. If you then request the image with the .jpg extension, the download will obviously fail. My workaround below may still have bugs of its own; the brute-force but reliable approach would be to visit each detail page and grab the real full-size URL from there.
If you have a better approach, feel free to share!
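One simple alternative, still guessing extensions rather than visiting the detail page, is to build a list of candidate URLs and fall back on a non-200 response. A minimal sketch; the helper names `candidate_urls` and `fetch_first_ok` are my own, not from the original code:

```python
def candidate_urls(imgurl):
    # try the guessed extension first, then the other common one
    base, ext = imgurl.rsplit('.', 1)
    order = [ext] + [e for e in ('jpg', 'png') if e != ext]
    return [f'{base}.{e}' for e in order]

def fetch_first_ok(imgurl, headers, timeout=6):
    # download the first candidate that answers HTTP 200,
    # or return None if every extension guess fails
    import requests  # imported here so candidate_urls stays dependency-free
    for url in candidate_urls(imgurl):
        r = requests.get(url, headers=headers, timeout=timeout)
        if r.status_code == 200:
            return url, r.content
    return None
```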
First, the basic version:
# -*- coding: utf-8 -*-
# wallhaven toplist image scraper/downloader
# author WeChat: huguo00289
import requests
from lxml import etree
from fake_useragent import UserAgent

url = "https://wallhaven.cc/toplist?page=1"
ua = UserAgent().random
html = requests.get(url=url, headers={'user-agent': ua}, timeout=6).content.decode('utf-8')
tree = etree.HTML(html)
imgsrcs = tree.xpath('//ul/li/figure/img/@data-src')
print(len(imgsrcs))
print(imgsrcs)
i = 1
for imgsrc in imgsrcs:
    # thumbnail URL -> full-size URL
    img = imgsrc.replace("th", "w").replace("small", "full")
    imgs = img.split('/')
    imgurl = f"{'/'.join(imgs[:-1])}/wallhaven-{imgs[-1]}"
    print(imgurl)
    try:
        r = requests.get(url=imgurl, headers={'user-agent': ua}, timeout=6)
        r.raise_for_status()  # raise on 404 so the png fallback below actually runs
        with open(f'{i}.jpg', 'wb') as f:
            f.write(r.content)
        print(f"Saved image {i}.jpg!")
    except Exception as e:
        print(f"Image download failed, error: {e}")
        # the full-size image is probably a png instead
        imgurl = imgurl.replace('jpg', 'png')
        r = requests.get(url=imgurl, headers={'user-agent': ua}, timeout=6)
        with open(f'{i}.png', 'wb') as f:
            f.write(r.content)
        print(f"Saved image {i}.png!")
    i = i + 1
The optimized version adds a class, multithreading, and timeout/retry handling:
# -*- coding: utf-8 -*-
# wallhaven toplist image scraper/downloader
# author WeChat: huguo00289
import requests
from lxml import etree
from fake_useragent import UserAgent
import time
from requests.adapters import HTTPAdapter
import threading


class Top(object):
    def __init__(self):
        self.ua = UserAgent().random
        self.url = "https://wallhaven.cc/toplist?page="

    def get_response(self, url):
        response = requests.get(url=url, headers={'user-agent': self.ua}, timeout=6)
        return response

    def get_third(self, url, num):
        # session that retries connection errors up to num times
        s = requests.Session()
        s.mount('http://', HTTPAdapter(max_retries=num))
        s.mount('https://', HTTPAdapter(max_retries=num))
        print(time.strftime('%Y-%m-%d %H:%M:%S'))
        try:
            r = s.get(url=url, headers={'user-agent': self.ua}, timeout=5)
            return r
        except requests.exceptions.RequestException as e:
            print(e)
            print(time.strftime('%Y-%m-%d %H:%M:%S'))

    def get_html(self, response):
        html = response.content.decode('utf-8')
        tree = etree.HTML(html)
        return tree

    def parse(self, tree):
        imgsrcs = tree.xpath('//ul/li/figure/img/@data-src')
        print(len(imgsrcs))
        return imgsrcs

    def get_imgurl(self, imgsrc):
        # thumbnail URL -> full-size URL
        img = imgsrc.replace("th", "w").replace("small", "full")
        imgs = img.split('/')
        imgurl = f"{'/'.join(imgs[:-1])}/wallhaven-{imgs[-1]}"
        print(imgurl)
        return imgurl

    def down(self, imgurl, imgname):
        r = self.get_third(imgurl, 3)
        r.raise_for_status()  # raise on 404 etc. so downimg can fall back
        with open(f'{imgname}', 'wb') as f:
            f.write(r.content)
        print(f"Saved image {imgname}!")
        time.sleep(2)

    def downimg(self, imgsrc, pagenum, i):
        imgurl = self.get_imgurl(imgsrc)
        imgname = f'{pagenum}-{i}{imgurl[-4:]}'
        try:
            self.down(imgurl, imgname)
        except Exception as e:
            print(f"Image download failed, error: {e}")
            # swap the extension and retry (note: the URL must be swapped
            # too, not just the local file name, or the retry cannot succeed)
            if "jpg" in imgname:
                ximgname = f'{pagenum}-{i}.png'
            else:
                ximgname = f'{pagenum}-{i}.jpg'
            ximgurl = f'{imgurl[:-4]}{ximgname[-4:]}'
            self.down(ximgurl, ximgname)

    def get_topimg(self, pagenum):
        # single-threaded download of one listing page
        url = f'{self.url}{pagenum}'
        print(url)
        response = self.get_response(url)
        tree = self.get_html(response)
        imgsrcs = self.parse(tree)
        i = 1
        for imgsrc in imgsrcs:
            self.downimg(imgsrc, pagenum, i)
            i = i + 1

    def get_topimgs(self, pagenum):
        # multithreaded download of one listing page
        url = f'{self.url}{pagenum}'
        print(url)
        response = self.get_response(url)
        tree = self.get_html(response)
        imgsrcs = self.parse(tree)
        i = 1
        threadings = []
        for imgsrc in imgsrcs:
            t = threading.Thread(target=self.downimg, args=(imgsrc, pagenum, i))
            i = i + 1
            threadings.append(t)
            t.start()
        for x in threadings:
            x.join()
        print("Multithreaded image download finished")

    def main(self):
        num = 3
        for pagenum in range(1, num + 1):
            print(f">> Scraping image data on page {pagenum}..")
            self.get_topimgs(pagenum)


if __name__ == '__main__':
    spider = Top()
    spider.main()
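As a side note, the hand-rolled thread list in get_topimgs can also be written with the standard library's concurrent.futures.ThreadPoolExecutor, which additionally caps how many downloads run at once (the loop above starts one thread per image simultaneously). A minimal sketch, with a dummy fake_downimg standing in for the real downimg so it runs without network access:

```python
from concurrent.futures import ThreadPoolExecutor

def fake_downimg(imgsrc, pagenum, i):
    # stand-in for Top.downimg: just report what would be downloaded
    return f'{pagenum}-{i}: {imgsrc}'

imgsrcs = ['a.jpg', 'b.jpg', 'c.jpg']
with ThreadPoolExecutor(max_workers=2) as pool:
    # at most 2 workers run concurrently; results keep submission order
    futures = [pool.submit(fake_downimg, src, 1, i)
               for i, src in enumerate(imgsrcs, start=1)]
    results = [f.result() for f in futures]
print(results)
# ['1-1: a.jpg', '1-2: b.jpg', '1-3: c.jpg']
```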
Scraping and download results:
Bonus
The source code is packaged up, together with two multithreaded versions and one multiprocessing version.
If you're interested, especially in studying multithreading, grab it yourself:
reply "多线程" (multithreading) in the official-account backend to get it!