
Selenium crawlers

Selenium is a web automated testing tool that drives a browser and makes it carry out operations according to your instructions. It is only a tool and must be used together with a third-party browser, so it is a little slower than the crawling methods studied before. On the other hand, a crawler written this way does not need to care about ajax dynamic loading or other anti-crawling mechanisms: any tag you can find with F12 can be located directly, without checking whether it exists in the raw page source.

installation

Linux: sudo pip3 install selenium

Windows: python -m pip install selenium

phantomjs browser

phantomjs is a browser without a graphical interface (also known as a headless browser); pages are loaded and rendered in memory, so it runs efficiently.
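
A minimal sketch of driving phantomjs from selenium, assuming the phantomjs executable is on the PATH (note that newer selenium releases have dropped PhantomJS support; selenium 3.x still ships this driver):

from selenium import webdriver

browser = webdriver.PhantomJS()          # no browser window is opened
browser.get('http://www.baidu.com/')     # the page is rendered entirely in memory
print(browser.title)                     # the rendered page is still fully accessible
browser.quit()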

Driver installation (phantomjs: headless browser; chromedriver: Google Chrome; geckodriver: Firefox)

Windows

1. Download phantomjs, chromedriver or geckodriver in the version that matches your browser

2. For chromedriver, download the version that matches your Google Chrome and copy chromedriver.exe into the Scripts directory under the Python installation directory (so it is on the system PATH). To find the Python installation path: where python

3. Verify on the cmd command line: chromedriver

Linux

1. Extract the archive: tar -zxvf geckodriver.tar.gz

2. Copy the extracted file into /usr/bin/ (so it is on the PATH): sudo cp geckodriver /usr/bin/

3. Change its permissions

  sudo -i

  cd /usr/bin/

  chmod 777 geckodriver
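
A quick check that the driver is found, assuming Firefox itself is installed (this sketch simply raises a WebDriverException if geckodriver is not on the PATH):

from selenium import webdriver

browser = webdriver.Firefox()            # fails if geckodriver is not on the PATH
browser.get('http://www.baidu.com/')
print(browser.title)
browser.quit()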

Sample code: use selenium + Chrome to open Baidu and take a screenshot of the Baidu home page

from selenium import webdriver

browser = webdriver.Chrome()            # Create a browser object
browser.get('http://www.baidu.com/')    # Open Baidu
browser.save_screenshot('baidu.png')    # Take a screenshot
browser.quit()                          # Exit the browser

Sample code 2: open Baidu and search for Zhao Liying

from selenium import webdriver
import time

# Create a browser object - this opens the browser
browser = webdriver.Chrome()
browser.get('http://www.baidu.com/')                    # Open Baidu
ele = browser.find_element_by_xpath('//*[@id="kw"]')    # Find the search box
ele.send_keys('Zhao Liying')                            # Send text to the search box: Zhao Liying
time.sleep(1)
# Find the "Baidu Search" button and click it
browser.find_element_by_xpath('//*[@id="su"]').click()
time.sleep(2)
browser.quit()                                          # Close the browser

 

Browser object methods

    browser = webdriver.Chrome(executable_path='path')  # path: the path to the browser driver

    browser.get(url): open the page at url

    browser.page_source: view the response content (the rendered page source)

    browser.page_source.find('string'): search the page source for the given string; returns -1 if not found

    browser.quit(): close the browser
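
A small sketch combining the methods above, assuming chromedriver is installed (the executable_path argument is only needed when the driver is not on the PATH):

from selenium import webdriver

browser = webdriver.Chrome()             # or webdriver.Chrome(executable_path='path')
browser.get('http://www.baidu.com/')
html = browser.page_source               # the rendered page source
if html.find('baidu') == -1:             # find() returns -1 when the string is absent
    print('string not found in the page source')
browser.quit()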

Find elements

Find a single element (a node object)

  1. browser.find_element_by_id('')
  2. browser.find_element_by_name('')
  3. browser.find_element_by_class_name('')
  4. browser.find_element_by_xpath('')
  5. browser.find_element_by_link_text('')
  6. ...

Find multiple elements (a list of node objects)

  1. browser.find_elements_by_id('')
  2. browser.find_elements_by_name('')
  3. browser.find_elements_by_class_name('')
  4. browser.find_elements_by_xpath('')
  5. ...

Node object manipulation

    .send_keys(''): send content to an input box

    .click(): click the node

    .text: get the text content

    .get_attribute('src'): get the value of an attribute

    .find(''): search for a string in the response

from selenium import webdriver

browser = webdriver.Chrome()
browser.get('https://www.qiushibaike.com/text/')

# Find a single element
div = browser.find_element_by_class_name('content')
print(div.text)

# Find multiple elements: [node, node, ...]
divs = browser.find_elements_by_class_name('content')
for div in divs:
    print('*************************')
    print(div.text)
    print('*************************')
browser.quit()    # Exit the browser

JD (Jingdong) crawler case

Target URL: https://www.jd.com/  Crawl targets: product name, product price, number of reviews, product merchant

Approach

    Open JD and go to the product search page

    Match the list of all product node objects

    Pull the text out of the node objects and look for a pattern - is there a better way to extract it?

    After a full page has been extracted, click Next if this is not the last page

Implementation steps

Find node

    Home page search box: //*[@id="key"]

    Home page search button: //*[@id="search"]/div/div[2]/button

    Product node object list on the product page: //*[@id="J_goodsList"]/ul/li

Execute a JS script to get dynamically loaded data

  browser.execute_script('window.scrollTo(0,document.body.scrollHeight)')
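
On pages that keep loading more items as you scroll, a single scroll may not be enough. A hedged sketch (browser is an already-created driver object; the 2-second pause is an assumption) that scrolls repeatedly until the page height stops growing:

import time

last_height = browser.execute_script('return document.body.scrollHeight')
while True:
    browser.execute_script('window.scrollTo(0,document.body.scrollHeight)')
    time.sleep(2)                      # give the page time to load more items
    new_height = browser.execute_script('return document.body.scrollHeight')
    if new_height == last_height:      # nothing new was loaded, stop scrolling
        break
    last_height = new_height

The complete JD crawler: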

from selenium import webdriver
import time


class JdSpider(object):
    def __init__(self):
        self.i = 0
        self.url = 'https://www.jd.com/'
        self.browser = webdriver.Chrome()

    # Get the page - the product listing page
    def get_html(self):
        self.browser.get(self.url)
        # Enter "crawler book" in the search box
        self.browser.find_element_by_xpath('//*[@id="key"]').send_keys('crawler book')
        # Click Search
        self.browser.find_element_by_xpath('//*[@id="search"]/div/div[2]/button').click()
        time.sleep(3)    # Give the product page time to load

    # Parse the page
    def parse_html(self):
        # Scroll the page to the bottom by executing a JS script
        self.browser.execute_script('window.scrollTo(0,document.body.scrollHeight)')
        time.sleep(2)
        # Extract the list of all product node objects (li list)
        li_list = self.browser.find_elements_by_xpath('//*[@id="J_goodsList"]/ul/li')
        for li in li_list:
            info_list = li.text.split('\n')
            # 'Each full' / 'Single' are the translated JD promotion labels that can appear
            # before the price; '¥' is the assumed price symbol (the character was lost in translation)
            if info_list[0].startswith('Each full') or info_list[1].startswith('¥'):
                price = info_list[1]
                name = info_list[2]
                comment = info_list[3]
                shop = info_list[4]
            elif info_list[0].startswith('Single'):
                price = info_list[3]
                name = info_list[4]
                comment = info_list[5]
                shop = info_list[6]
            else:
                price = info_list[0]
                name = info_list[1]
                comment = info_list[2]
                shop = info_list[3]
            print(price, comment, shop, name)

    # Main function
    def main(self):
        self.get_html()
        while True:
            self.parse_html()
            # Decide whether Next can be clicked; if 'pn-next disabled' is not found, this is not the last page
            if self.browser.page_source.find('pn-next disabled') == -1:
                self.browser.find_element_by_class_name('pn-next').click()
                time.sleep(2)
            else:
                break
        print(self.i)


if __name__ == '__main__':
    spider = JdSpider()
    spider.main()

chromedriver headless (no-interface) mode

from selenium import webdriver

options = webdriver.ChromeOptions()     # Headless settings
options.add_argument('--headless')      # Add the headless argument
browser = webdriver.Chrome(options=options)
browser.get('http://www.baidu.com/')
browser.save_screenshot('baidu.png')
browser.quit()

The JD crawler above, rewritten in headless mode

from selenium import webdriver
import time


class JdSpider(object):
    def __init__(self):
        self.url = 'https://www.jd.com/'
        self.options = webdriver.ChromeOptions()    # Headless settings
        self.options.add_argument('--headless')     # Add the headless argument
        # The browser object itself is created as usual
        self.browser = webdriver.Chrome(options=self.options)
        self.i = 0    # Product counter

    # Get the page - the product listing page
    def get_html(self):
        self.browser.get(self.url)
        # Enter "crawler book" in the search box
        self.browser.find_element_by_xpath('//*[@id="key"]').send_keys('crawler book')
        # Click Search
        self.browser.find_element_by_xpath('//*[@id="search"]/div/div[2]/button').click()
        time.sleep(3)    # Give the product page time to load

    def parse_html(self):
        # Scroll the progress bar to the bottom so that all data is dynamically loaded
        self.browser.execute_script('window.scrollTo(0,document.body.scrollHeight)')
        time.sleep(2)    # Wait for the dynamically loaded data to finish

        # Extract the list of all product node objects (li list)
        li_list = self.browser.find_elements_by_xpath('//*[@id="J_goodsList"]/ul/li')
        item = {}
        for li in li_list:
            # find_element: find a single element
            item['name'] = li.find_element_by_xpath('.//div[@class="p-name"]/a/em').text.strip()
            item['price'] = li.find_element_by_xpath('.//div[@class="p-price"]').text.strip()
            item['comment'] = li.find_element_by_xpath('.//div[@class="p-commit"]/strong').text.strip()
            item['shop'] = li.find_element_by_xpath('.//div[@class="p-shopnum"]').text.strip()
            print(item)
            self.i += 1

    def main(self):
        self.get_html()
        while True:
            self.parse_html()
            # Determine whether this is the last page
            if self.browser.page_source.find('pn-next disabled') == -1:
                self.browser.find_element_by_class_name('pn-next').click()
                time.sleep(3)
            else:
                break
        print('Number of products:', self.i)
        self.browser.quit()


if __name__ == '__main__':
    spider = JdSpider()
    spider.main()


Keyboard operations

from selenium import webdriver
from selenium.webdriver.common.keys import Keys

browser = webdriver.Chrome()
browser.get('http://www.baidu.com/')

# 1. Enter "Zhao Liying" in the search box
browser.find_element_by_id('kw').send_keys('Zhao Liying')
# 2. Enter a space
browser.find_element_by_id('kw').send_keys(Keys.SPACE)
# 3. Ctrl+A to simulate select-all
browser.find_element_by_id('kw').send_keys(Keys.CONTROL, 'a')
# 4. Ctrl+C to simulate copy
browser.find_element_by_id('kw').send_keys(Keys.CONTROL, 'c')
# 5. Ctrl+V to simulate paste
browser.find_element_by_id('kw').send_keys(Keys.CONTROL, 'v')
# 6. Press Enter instead of clicking the Search button
browser.find_element_by_id('kw').send_keys(Keys.ENTER)

Mouse operation

import time
from selenium import webdriver
# Import mouse events
from selenium.webdriver import ActionChains

browser = webdriver.Chrome()
browser.get('http://www.baidu.com/')

# Find the "Settings" node
element = browser.find_element_by_xpath('//*[@id="u1"]/a[8]')

# Move the mouse to the Settings node: move_to_element()
actions = ActionChains(browser)
actions.move_to_element(element)
actions.perform()    # perform() actually executes the queued actions
time.sleep(1)

# Find the "Advanced Search" node and click it
browser.find_element_by_link_text('Advanced Search').click()

Switch pages

This applies when clicking a link on a page opens a new page, but the browser object still points at the previous page.

all_handles = browser.window_handles          # Get all current handles (windows)
browser.switch_to.window(all_handles[1])      # Switch to the new browser window and get the new window object
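
A minimal sketch of the handle switch, assuming some link on the page opens a new tab (the XPath used to pick the link is hypothetical):

from selenium import webdriver

browser = webdriver.Chrome()
browser.get('http://www.baidu.com/')
browser.find_element_by_xpath('//a[@target="_blank"]').click()   # opens a new window/tab
all_handles = browser.window_handles           # handles of every open window
browser.switch_to.window(all_handles[1])       # browser now points at the new page
print(browser.current_url)
browser.switch_to.window(all_handles[0])       # switch back to the original page
browser.quit()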

Ministry of Civil Affairs website

Crawl the administrative division codes from the Ministry of Civil Affairs website into the database, split into tables by level (a province table, a city table and a county table).

Build the database tables

# Create the database
create database govdb charset utf8;
use govdb;
# Create the tables
create table province(
        p_name varchar(20),
        p_code varchar(20)
        )charset=utf8;
create table city(
        c_name varchar(20),
        c_code varchar(20),
        c_father_code varchar(20)
        )charset=utf8;
create table county(
        x_name varchar(20),
        x_code varchar(20),
        x_father_code varchar(20)
        )charset=utf8;
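
The incremental crawl below also reads and writes a version table holding the URL of the last page crawled (see get_incr_url). Its definition is not shown in the notes; a plausible one, assuming a single url column, would be:

# version table assumed by the incremental crawl (a single url column)
create table version(
        url varchar(100)
        )charset=utf8;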

Approach

    Open the first page with selenium + Chrome and extract the link to the secondary page

    Incremental crawl: compare that link with the version stored in the database to decide whether to crawl (i.e. whether there has been an update)

    If there is no update, tell the user directly and do not continue crawling

    If the data has been updated, delete the old table records, then re-crawl and insert into the database tables

    When everything is finished: disconnect from the database and close the browser

from selenium import webdriver
import pymysql


class GovSpider(object):
    def __init__(self):
        # Headless settings
        options = webdriver.ChromeOptions()
        options.add_argument('--headless')    # Add the headless argument
        self.browser = webdriver.Chrome(options=options)
        self.one_url = 'http://www.mca.gov.cn/article/sj/xzqh/2019/'
        # Create the database connection and related variables
        self.db = pymysql.connect('localhost', 'root', '123456', 'govdb', charset='utf8')
        self.cursor = self.db.cursor()
        # Create three lists so executemany() can insert the records of the 3 tables
        self.province_list = []
        self.city_list = []
        self.county_list = []

    # Get the home page and extract the link to the secondary page (a pseudo link)
    def get_incr_url(self):
        self.browser.get(self.one_url)
        # Extract the latest link to decide whether an incremental crawl is needed
        td = self.browser.find_element_by_xpath('//td[@class="arlisttd"]/a[contains(@title,"Code")]')
        # Compare the link with the database to decide whether (and how) to crawl
        # get_attribute() auto-completes the extracted link
        two_url = td.get_attribute('href')
        # result is the number of matching rows returned
        result = self.cursor.execute('select url from version where url=%s', [two_url])
        if result:
            print('No crawling needed')
        else:
            td.click()
            # Switch handles
            all_handlers = self.browser.window_handles
            self.browser.switch_to.window(all_handlers[1])
            self.get_data()    # Capture the data
            # Store the URL in the version table
            self.cursor.execute('delete from version')
            self.cursor.execute('insert into version values(%s)', [two_url])
            self.db.commit()

    # Extract the administrative division codes from the secondary page
    def get_data(self):
        # Base xpath
        tr_list = self.browser.find_elements_by_xpath('//tr[@height="19"]')
        for tr in tr_list:
            code = tr.find_element_by_xpath('./td[2]').text.strip()
            name = tr.find_element_by_xpath('./td[3]').text.strip()
            print(name, code)
            # Determine the level and add the record to the matching list (table fields):
            # province: p_name p_code
            # city    : c_name c_code c_father_code
            # county  : x_name x_code x_father_code
            if code[-4:] == '0000':
                self.province_list.append([name, code])
                # The four municipalities also count as their own cities
                if name in ['Beijing', 'Tianjin', 'Shanghai', 'Chongqing']:
                    self.city_list.append([name, code, code])
            elif code[-2:] == '00':
                self.city_list.append([name, code, (code[:2] + '0000')])
            else:
                if code[:2] in ['11', '12', '31', '50']:
                    self.county_list.append([name, code, (code[:2] + '0000')])
                else:
                    self.county_list.append([name, code, (code[:4] + '00')])
        # Same indentation as the for loop: after all data has been crawled,
        # execute the database insert statements with executemany()
        self.insert_mysql()

    def insert_mysql(self):
        # 1. There was an update, so the old table records must be deleted first
        self.cursor.execute('delete from province')
        self.cursor.execute('delete from city')
        self.cursor.execute('delete from county')
        # 2. Insert the new data
        self.cursor.executemany('insert into province values(%s,%s)', self.province_list)
        self.cursor.executemany('insert into city values(%s,%s,%s)', self.city_list)
        self.cursor.executemany('insert into county values(%s,%s,%s)', self.county_list)
        # 3. Commit to the database
        self.db.commit()
        print('Data captured and stored in the database successfully')

    def main(self):
        self.get_incr_url()
        self.cursor.close()      # Disconnect from the database once all data is processed
        self.db.close()
        self.browser.quit()      # Close the browser


if __name__ == '__main__':
    spider = GovSpider()
    spider.main()

SQL command exercises

1. Query all provinces, cities and counties (multi-table query)

select province.p_name,city.c_name,county.x_name from province,city,county  where province.p_code=city.c_father_code and city.c_code=county.x_father_code;

2. Query all provinces, cities and counties (join query)

select province.p_name,city.c_name,county.x_name from province inner join city on province.p_code=city.c_father_code inner join county on city.c_code=county.x_father_code;

Web client authentication

The username and password can be written directly into the URL:

url = 'http://username:password@normal_address'

Example: download one day's course notes

from selenium import webdriver
​
url = 'http://tarenacode:[email protected]/AIDCode/aid1904/15-spider/spider_day06_note.zip'
browser = webdriver.Chrome()
browser.get(url)

iframe sub-frame

An iframe is a sub-frame used to nest one page inside another; you must first switch into the iframe sub-frame before performing any other operations on its contents.

browser.switch_to.frame(iframe_element)

Example – Login qq-mail

import time
from selenium import webdriver

browser = webdriver.Chrome()
browser.get('https://mail.qq.com/cgi-bin/loginpage')

# Find the iframe sub-frame and switch into it
login_frame = browser.find_element_by_id('login_frame')
browser.switch_to.frame(login_frame)

# QQ account + password + login
browser.find_element_by_id('u').send_keys('account number')
browser.find_element_by_id('p').send_keys('password')
browser.find_element_by_id('login_button').click()
time.sleep(5)    # Leave time for the page to load

# Extract data
ele = browser.find_element_by_id('useralias')
print(ele.text)

 
