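A keyword-scraping trip that ended in a crash: 站长之家 (Chinaz) now has anti-scraping measures, a member-login wall, and VIP purchase limits. Most likely too much scraping code has been shared and the site has been hammered too many times. Since the Chinaz redesign, the push to squeeze money out of users has grown, and so have the anti-scraping measures.
So that is how this crash came about. I had not written a scraper in a long time, so failing was inevitable.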
Anti-scraping limits:
Paging to page six requires logging in
Paging to page ten requires buying VIP
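Sending the cookies copied from the browser did nothing: a request for page six still jumps straight back to the page-one data!
Keeping a session open across requests: a request for page six still jumps straight back to the page-one data!
Setting the Referer header: a request for page six still jumps straight back to the page-one data!
I cannot work out the exact anti-scraping mechanism. Testing in a different browser and paging through continuously, page six only loads normally after logging in.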
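One way to at least notice this redirect from code is to compare the keywords parsed from a later page with those parsed from page one. The sketch below is only a guess at how such a check could look. `is_silent_fallback` is a hypothetical helper that is not part of the script, and it works purely on the parsed keyword lists, so it says nothing about why the site fell back.

```python
def is_silent_fallback(first_page_words, current_page_words):
    """Return True when a later page came back with exactly the page-one keywords.

    Hypothetical helper: both arguments are the keyword lists parsed from the
    HTML (for example, the result of the title XPath used in the script).
    """
    first = list(first_page_words)
    current = list(current_page_words)
    # An empty result is not counted as a fallback; it is a different failure.
    return bool(current) and current == first


if __name__ == "__main__":
    page_one = ["keyword a", "keyword b"]
    print(is_silent_fallback(page_one, ["keyword a", "keyword b"]))  # True
    print(is_silent_fallback(page_one, ["keyword c", "keyword d"]))  # False
```

If the check returns True for page six, the loop can stop early instead of printing the same 50 rows again.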
Full source of the failed attempt, for reference:
# -*- coding: utf-8 -*-
# Chinaz (站长之家) keyword mining 20201014
# author/WeChat: huguo00289
import math
import random
import time
from urllib import parse

import requests
from lxml import etree


class Httprequest(object):
    ua_list = [
        'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/14.0.835.163 Safari/535.1',
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11',
        'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:6.0) Gecko/20100101 Firefox/6.0',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0.1) Gecko/20100101 Firefox/4.0.1',
        'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50',
        'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50',
        'Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11',
    ]
    # Placeholder: paste the Cookie header copied from your own browser session.
    cookies = ''

    @property  # decorator that exposes the method as an attribute
    def random_headers(self):
        return {
            'User-Agent': random.choice(self.ua_list),
            'Cookie': self.cookies,
        }


class Getwords(Httprequest):
    def __init__(self):
        self.url = "https://data.chinaz.com/keyword/allindex/"

    def get_num(self, keyword):
        # Fetch the first result page and read the total long-tail keyword count.
        key = parse.quote(keyword)
        url = f'{self.url}{key}'
        html = requests.get(url, headers=self.random_headers, timeout=5).content.decode("utf-8")
        time.sleep(2)
        req = etree.HTML(html)
        num = req.xpath('//span[@class="c-red"]/text()')[0]
        num = int(num)
        if num > 0:
            print(f'>> {keyword} has {num} long-tail keywords!')
            if num < 50:
                self.get_words(req)
            else:
                self.get_pagewords(num, key)
        else:
            print(f'>> {keyword} has no long-tail keywords!')

    def get_words(self, req):
        # Pair each keyword title with its index value and print the pair.
        words = req.xpath('//li[@class="col-220 nofoldtxt"]/a/@title')
        indexs = req.xpath('//li[@class="col-88"]/a/text()')
        for word, index in zip(words, indexs):
            data = word, index
            print(data)

    def get_pagewords(self, num, key):
        # 50 keywords per page; round up so a final partial page is included.
        pagenum = math.ceil(num / 50)
        # Pages beyond ten require VIP, so stop there.
        for i in range(1, min(pagenum, 10) + 1):
            print(f'>> Scraping long-tail keyword data from page {i}..')
            url = f'{self.url}{key}/{i}'
            print(url)
            html = requests.get(url, headers=self.random_headers, timeout=5).content.decode("utf-8")
            time.sleep(2)
            req = etree.HTML(html)
            self.get_words(req)


if __name__ == '__main__':
    spider = Getwords()
    spider.get_num("培训")  # "培训" means "training"
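Given the limits listed earlier (login from page six onward), a tamer version of the paging step could work out up front which pages are worth requesting without an account. This is a minimal sketch, assuming 50 keywords per page as the script does. `pages_to_fetch` and its `max_page` parameter are hypothetical names, and the cutoff of five simply mirrors what I saw in the browser.

```python
import math


def pages_to_fetch(total_keywords, per_page=50, max_page=5):
    """Return the page numbers that can be requested without logging in.

    Assumptions: 50 keywords per page (as in the script) and a login wall
    starting at page six, hence max_page=5.
    """
    if total_keywords <= 0:
        return []
    # Round up so a final partial page still counts as a page.
    total_pages = math.ceil(total_keywords / per_page)
    return list(range(1, min(total_pages, max_page) + 1))


if __name__ == "__main__":
    print(pages_to_fetch(120))   # [1, 2, 3]
    print(pages_to_fetch(1000))  # [1, 2, 3, 4, 5]
```

Looping over this list instead of `range(1, pagenum+1)` avoids sending requests that will only bounce back to page one.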