
08 UA Pool and Proxy Pool in Scrapy

Downloader Middleware Introduction

In Scrapy, the component that sits between the engine and the downloader is called the Downloader Middleware. Because it hooks into Scrapy's request/response processing, it can do two things:

(1) When the engine passes a Request to the downloader, the downloader middleware can perform a series of operations on the Request, such as setting the request's User-Agent or setting a proxy IP.

(2) When the downloader passes the Response back to the engine, the downloader middleware can perform a series of operations on the Response, such as gzip decompression.

 

The downloader middleware manages the following methods:

– process_request: called for every request that passes through the downloader middleware

– process_response: called for every downloaded response that passes back through the middleware

– process_exception: called when an exception occurs during the download

 

When writing middleware, think about which of these methods is the most suitable place for the functionality you want to implement. Middleware can be used to process requests, process responses, or coordinate with some of Scrapy's signal methods, and it can be added to an existing crawler to adapt it to a project. The same functionality can often be written as an extension instead; extensions are more decoupled, so they are recommended where they fit. In crawlers, the downloader middleware is mainly used to process requests: typically it sets a random User-Agent and a random proxy IP for each request, as a response to the anti-crawling measures of the websites being scraped. A minimal skeleton of such a middleware is sketched below.
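For reference, this minimal downloader middleware skeleton shows the three hook methods and their conventional return values (the class name is illustrative and is not part of the project shown later):

class ExampleDownloaderMiddleware(object):

    # called for every outgoing request; return None to let processing continue,
    # or return a Response/Request object to short-circuit the download
    def process_request(self, request, spider):
        return None

    # called for every response coming back from the downloader
    def process_response(self, request, response, spider):
        return response

    # called when the downloader (or another process_request) raises an exception;
    # may return None, a Response, or a Request to be retried
    def process_exception(self, request, exception, spider):
        pass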

 

I. UA pool: User-Agent pool

– Role: disguise as many of the Scrapy project's requests as possible as coming from different types of browsers.

– Operating flow:

1. Intercept the request in the downloader middleware

2. Tamper with the User-Agent in the headers of the intercepted request to disguise it

3. Enable the downloader middleware in the configuration file

 

Partial code from middlewares.py:

import random

# import the built-in UserAgentMiddleware (in old Scrapy versions the module
# path was scrapy.contrib.downloadermiddleware.useragent)
from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware

# UA pool, packaged as a separate downloader middleware class
class RandomUserAgent(UserAgentMiddleware):

    def process_request(self, request, spider):
        # write a randomly chosen UA into the currently intercepted request
        ua = random.choice(user_agent_list)
        request.headers.setdefault('User-Agent', ua)

user_agent_list = [
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 "
    "(KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 "
    "(KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 "
    "(KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
]
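To take effect, the middleware must be registered in settings.py. A minimal sketch, assuming the project package is called myproject (a placeholder; use your own package name):

DOWNLOADER_MIDDLEWARES = {
    # enable the custom UA middleware (543 is an arbitrary priority)
    'myproject.middlewares.RandomUserAgent': 543,
    # disable Scrapy's built-in UserAgentMiddleware so the default
    # User-Agent setting does not take precedence over the random one
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
}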

 

II. Proxy pool

– Role: route as many of the Scrapy project's requests as possible through different proxy IPs.

– Operating flow:

1. Intercept the request in the downloader middleware

2. Change the IP of the intercepted request to a proxy IP

3. Enable the downloader middleware in the configuration file

 

Middleware code: replace the IP of intercepted requests in batches, packaged as a separate downloader middleware class.

import random

class Proxy(object):

    def process_request(self, request, spider):
        # decide from the URL of the intercepted request whether the protocol
        # is http or https; request.url looks like: http://www.xxx.com
        h = request.url.split(':')[0]   # protocol of the request
        if h == 'https':
            ip = random.choice(PROXY_https)
            request.meta['proxy'] = 'https://' + ip
        else:
            ip = random.choice(PROXY_http)
            request.meta['proxy'] = 'http://' + ip

# selectable proxy IPs
PROXY_http = [
    '153.180.102.104:80',
    '195.208.131.189:56055',
]
PROXY_https = [
    '120.83.49.90:9000',
    '95.189.112.214:35508',
]

Switching to a proxy IP is usually only needed when a request fails, so the proxy assignment can also be written in process_exception, as in the sketch below (and in the full middleware file later in this article).
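A minimal sketch of that variant, reusing the PROXY_http and PROXY_https lists defined above (the class name is illustrative):

import random

class ProxyOnError(object):

    # only called when downloading the request raised an exception
    def process_exception(self, request, exception, spider):
        if request.url.split(':')[0] == 'https':
            request.meta['proxy'] = 'https://' + random.choice(PROXY_https)
        else:
            request.meta['proxy'] = 'http://' + random.choice(PROXY_http)
        # returning the request re-schedules it, so it is retried with the
        # newly assigned proxy
        return request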

 

III. Example middleware with a UA pool and a proxy pool

Taking Maitian real estate (bj.maitian.cn) as an example, the code below shows in detail how to use the UA pool and the proxy pool in the Scrapy framework.

   

 

Spider file: maitian.py

import scrapy
from houseinfo.items import HouseinfoItem               # import the item

class MaitianSpider(scrapy.Spider):
    name = 'maitian'
    # start_urls = ['http://bj.maitian.cn/zfall/PG{}'.format(page) for page in range(1, 101)]
    start_urls = ['http://bj.maitian.cn/zfall/PG100']

    # parsing function
    def parse(self, response):
        li_list = response.xpath('//div[@class="list_wrap"]/ul/li')
        for li in li_list:
            item = HouseinfoItem(
                title  = li.xpath('./div[2]/h1/a/text()').extract_first().strip(),
                price  = li.xpath('./div[2]/div/ol/strong/span/text()').extract_first().strip(),
                square = li.xpath('./div[2]/p[1]/span[1]/text()').extract_first().replace('',''),
                area   = li.xpath('./div[2]/p[2]/span/text()[2]').extract_first().strip().split('\xa0')[0],
                adress = li.xpath('./div[2]/p[2]/span/text()[2]').extract_first().strip().split('\xa0')[2]
            )
            # hand the item over to the pipeline, which defines how it is stored
            yield item

Items file: items.py

import scrapy

class HouseinfoItem(scrapy.Item):
    title  = scrapy.Field()          # stores the title; a Field can hold any type of data
    price  = scrapy.Field()
    square = scrapy.Field()
    area   = scrapy.Field()
    adress = scrapy.Field()

Pipeline file: pipelines.py

class HouseinfoPipeline(object):
    def __init__(self):
        self.file = None

    # executed once, when the crawl starts
    def open_spider(self, spider):
        self.file = open('maitian.csv', 'a', encoding='utf-8')    # append mode
        self.file.write(",".join(["Title", "Monthly rent", "Area", "Region", "Address", "\n"]))
        print("Begin to crawl.")

    # this method is called many times, so opening and closing the file are
    # placed in the two methods that are executed only once each
    def process_item(self, item, spider):
        content = [item["title"], item["price"], item["square"], item["area"], item["adress"], "\n"]
        self.file.write(",".join(content))
        return item

    # executed once, when the crawl ends
    def close_spider(self, spider):
        self.file.close()
        print("End the crawler.")

Middleware file: middlewares.py

import random

from scrapy import signals

class HouseinfoDownloaderMiddleware(object):

    # UA pool
    user_agent_list = [
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 "
        "(KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
        "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 "
        "(KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 "
        "(KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 "
        "(KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 "
        "(KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 "
        "(KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 "
        "(KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 "
        "(KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 "
        "(KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
    ]

    # selectable proxy IPs
    PROXY_http = [
        '153.180.102.104:80',
        '195.208.131.189:56055',
    ]
    PROXY_https = [
        '120.83.49.90:9000',
        '95.189.112.214:35508',
    ]

    def process_request(self, request, spider):
        # set the request's UA from the UA pool
        request.headers['User-Agent'] = random.choice(self.user_agent_list)
        return None

    def process_response(self, request, response, spider):
        return response

    # intercept requests whose download raised an exception
    def process_exception(self, request, exception, spider):
        if request.url.split(':')[0] == 'http':
            request.meta['proxy'] = 'http://' + random.choice(self.PROXY_http)
        else:
            request.meta['proxy'] = 'https://' + random.choice(self.PROXY_https)
        # returning the request re-schedules it, so it is retried with the proxy
        return request

Configuration file: settings.py

# -*- coding: utf-8 -*-
BOT_NAME = 'houseinfo'

SPIDER_MODULES = ['houseinfo.spiders']
NEWSPIDER_MODULE = 'houseinfo.spiders'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# enable the item pipeline
ITEM_PIPELINES = {
    # 300 is the priority: the smaller the value, the higher the priority
    'houseinfo.pipelines.HouseinfoPipeline': 300,
}

# enable the downloader middleware
DOWNLOADER_MIDDLEWARES = {
    'houseinfo.middlewares.HouseinfoDownloaderMiddleware': 543,
}
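With the pipeline and downloader middleware enabled, the crawl is normally started from the project root with the command scrapy crawl maitian. It can also be run programmatically, for example (a sketch assuming the spider module is houseinfo/spiders/maitian.py as above):

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from houseinfo.spiders.maitian import MaitianSpider

# load settings.py, including the pipeline and middleware configuration
process = CrawlerProcess(get_project_settings())
process.crawl(MaitianSpider)
process.start()      # blocks until the crawl has finished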

 
