
Web crawler: simulating login with cookies

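Cookies make it possible to crawl pages that sit behind a login.

Cookie and session mechanisms

HTTP is a stateless protocol, so the server cannot tell on its own that two requests come from the same user. Cookies and sessions fill that gap. Cookie: stored in the client browser. Session: stored on the web server.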
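The division of labour between cookie and session can be sketched with the standard library alone. This is only an illustration: the session id, the stored fields and the `Set-Cookie` header below are made-up values, not anything Renren actually sends.

```python
from http.cookies import SimpleCookie

# Server side: session data lives on the server, keyed by a session id.
# The id and the stored fields are made-up illustration values.
server_sessions = {"abc123": {"user": "alice", "logged_in": True}}

# After login, the server hands the session id to the browser in a Set-Cookie header.
set_cookie_header = "sessionid=abc123; Path=/; HttpOnly"

# Client side: the browser stores the cookie...
browser_cookies = SimpleCookie()
browser_cookies.load(set_cookie_header)

# ...and sends it back in the Cookie header of every later request.
cookie_header = "; ".join(
    f"{name}={morsel.value}" for name, morsel in browser_cookies.items()
)

# Server side again: the id carried by the cookie identifies the stored session.
session_id = browser_cookies["sessionid"].value
current_session = server_sessions.get(session_id)

print(cookie_header)
print(current_session)
```

The cookie itself only carries an identifier; whatever the server remembers about the user stays in its own session store.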

 

Renren login example

Method one: log in manually and capture the Cookie

1. Log in successfully once and capture the Cookie that carries the login information

Log in -> open your profile page (http://www.renren.com/971989504/profile) -> press F12 and open the Network panel -> refresh the profile page -> find the request for the profile page (home)

The Cookie is usually found under All -> the home request.

2. Send the request with that Cookie attached

import requests


class RenRenLogin(object):
    def __init__(self):
        # url is an address that can only be accessed normally after logging in
        self.url = 'http://www.renren.com/967469305/profile'
        # The Cookie in headers is the one captured after a successful login
        self.headers = {
            # Note: this Cookie is a placeholder, capture your own
            "Cookie": "xxx",
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36",
        }

    # Get the response for the profile page
    def get_html(self):
        html = requests.get(url=self.url, headers=self.headers, verify=False).text
        print(html)
        self.parse_html(html)

    # Parse the page here; any address on the site that requires login can be fetched and parsed this way
    def parse_html(self, html):
        pass


if __name__ == '__main__':
    spider = RenRenLogin()
    spider.get_html()

Method two: let the requests module handle the Cookie

The requests module provides a Session class that maintains the session between client and server.

1. Instantiate the session object

  session = requests.session()

2. Use the session object to send GET or POST requests

  res = session.post(url=url,data=data,headers=headers)

  res = session.get(url=url,headers=headers)

3. Summary of the approach

How the browser does it: when it visits a page that requires login, it sends along the cookie from the earlier login.

How the program does it: likewise, it sends the cookie from the earlier login, and the session object takes care of this.

1. Instantiate the session object

2. Log in: the session object sends the login request to the site, and the returned cookie is stored in the session object

3. Visit the page: the session object requests the page that requires login and automatically attaches the stored cookie to the request
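The three steps can be tried without touching a real site. The sketch below is a stand-in, not the Renren flow: it starts a toy local server (the `/login` and `/profile` paths, the `sid` cookie and its value are all made up) and uses the standard library's `urllib` opener with a `CookieJar` in place of `requests.session()`, so that it runs without third-party packages. The opener plays the role of the session object.

```python
import threading
import urllib.parse
import urllib.request
from http.cookiejar import CookieJar
from http.server import BaseHTTPRequestHandler, HTTPServer


class DemoHandler(BaseHTTPRequestHandler):
    """Toy site: POST /login sets a cookie, GET /profile requires it."""

    def do_POST(self):
        # Read and discard the form body, then hand out a login cookie.
        length = int(self.headers.get("Content-Length", 0))
        self.rfile.read(length)
        self.send_response(200)
        self.send_header("Set-Cookie", "sid=demo-token; Path=/")
        self.send_header("Content-Length", "0")
        self.end_headers()

    def do_GET(self):
        logged_in = "sid=demo-token" in self.headers.get("Cookie", "")
        body = b"profile page" if logged_in else b"please log in"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet


def run_demo():
    server = HTTPServer(("127.0.0.1", 0), DemoHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    base = f"http://127.0.0.1:{server.server_port}"
    no_proxy = urllib.request.ProxyHandler({})  # talk to the local server directly
    try:
        # A plain client without cookie storage cannot see the profile page.
        plain = urllib.request.build_opener(no_proxy)
        before = plain.open(base + "/profile").read().decode()

        # 1. Instantiate the "session": an opener that remembers cookies.
        jar = CookieJar()
        session = urllib.request.build_opener(
            no_proxy, urllib.request.HTTPCookieProcessor(jar)
        )

        # 2. Log in: the cookie from the response is stored in the jar.
        form = urllib.parse.urlencode({"email": "user", "password": "pw"}).encode()
        session.open(base + "/login", data=form).read()

        # 3. Request the protected page: the stored cookie is sent automatically.
        after = session.open(base + "/profile").read().decode()
        return before, after
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    print(run_demo())
```

With requests, the same roles are played by `requests.session()`, `session.post()` and `session.get()`.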

Specific steps

1. Find the login POST address

On the login page, view the page source, locate the form element, and find the address in its action attribute: http://www.renren.com/PLogin.do

2. POST the username and password to that address

* The username and password are sent as a dictionary

Key: the value of the input tag's name attribute (email, password)

Value: the real username and password

    post_data = {'email': '', 'password': ''}
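Both steps can be sketched on a small piece of HTML. The form markup below is a simplified, hypothetical stand-in for the real login page, the `LoginFormParser` class is an illustrative helper, and the credentials are placeholders; only the action URL and the field names `email` and `password` come from the text above. The standard library's `html.parser` is used here instead of lxml so the sketch has no dependencies.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode

# Hypothetical, simplified login form; the real page's markup will differ.
LOGIN_PAGE = """
<form id="loginForm" method="post" action="http://www.renren.com/PLogin.do">
  <input type="text" name="email">
  <input type="password" name="password">
  <input type="submit" value="Log in">
</form>
"""


class LoginFormParser(HTMLParser):
    """Collects the form's action URL and the names of its input fields."""

    def __init__(self):
        super().__init__()
        self.action = None
        self.field_names = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self.action = attrs.get("action")
        elif tag == "input" and "name" in attrs:
            self.field_names.append(attrs["name"])


parser = LoginFormParser()
parser.feed(LOGIN_PAGE)

# Keys are the inputs' name attributes; values are placeholder credentials.
credentials = {"email": "user@example.com", "password": "secret"}
post_data = {name: credentials[name] for name in parser.field_names}

# What the form-encoded POST body for this dictionary looks like.
encoded_body = urlencode(post_data)

print(parser.action)
print(post_data)
print(encoded_body)
```

When a dictionary like `post_data` is passed to requests through the `data` parameter, it is form-encoded in the same way before being sent.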

Implementation

1. POST first: send the username and password to the login address

2. Then GET: request the target page as usual and extract the information

import requests
from lxml import etree


class RenrenSpider(object):
    def __init__(self):
        self.post_url = 'http://www.renren.com/PLogin.do'
        self.get_url = 'http://www.renren.com/967469305/profile'
        # email and password are the name attribute values of the form's input nodes
        self.form_data = {
            'email': '******',  # username
            'password': '*******'  # password
        }
        # Instantiate the session object that maintains the session
        self.session = requests.session()
        self.headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36',
            'Referer': 'http://www.renren.com/SysHome.do'
        }

    # POST first, then GET
    def get_html(self):
        # First POST the username and password to the login address
        self.session.post(url=self.post_url, data=self.form_data, headers=self.headers)
        # Then session.get() the profile page
        html = self.session.get(url=self.get_url, headers=self.headers).text
        self.parse_html(html)

    def parse_html(self, html):
        parse_html = etree.HTML(html)
        r_list = parse_html.xpath('//li[@class="school"]/span/text()')
        print(r_list)


if __name__ == '__main__':
    spider = RenrenSpider()
    spider.get_html()

Method three

1. Convert the captured cookie string into a dictionary

2. Pass the dictionary to requests.get() through the cookies parameter
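Step 1 amounts to splitting the raw Cookie header on `;` and then splitting each pair on its first `=`. A minimal sketch on a short string follows; the function name `cookie_string_to_dict` and the `token` entry are made up for illustration.

```python
def cookie_string_to_dict(cookie_string):
    """Turn 'k1=v1; k2=v2' into {'k1': 'v1', 'k2': 'v2'}."""
    cookie_dict = {}
    for pair in cookie_string.split(";"):
        pair = pair.strip()
        if not pair:
            continue
        # Split on the first '=' only, so values that contain '=' stay intact.
        key, _, value = pair.partition("=")
        cookie_dict[key] = value
    return cookie_dict


# Short made-up cookie string; a real one is copied from the browser.
example = "id=967469305; token=abc=def; ver=7.0"
print(cookie_string_to_dict(example))
```

If the same key appears twice in the string, the later value overwrites the earlier one, which is usually acceptable for this purpose.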

import requests
from lxml import etree


class RenrenLogin(object):
    def __init__(self):
        # url is an address that can only be accessed normally after logging in
        self.url = 'http://www.renren.com/967469305/profile'
        self.headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36'
        }

    # Convert the cookie string into a dictionary
    def get_cookie_dict(self):
        cookie_dict = {}
        cookies = 'td_cookie=18446744073093166409; anonymid=jzc3yiknvd9kwr; depovince=GW; jebecookies=67976425-f482-44a7-9668-0469a6a14d16|||||; _r01_=1; JSESSIONID=abcp_jUgWA4RdcgwXqtYw; ick_login=f502b729-d6cb-4085-8d74-4308a0a8a17d; _de=4DBCFCC17D9E50C8C92BCDC45CC5C3B7; p=cae86d9f12c5a1ba30901ad3d6ac992f5; first_login_flag=1; ln_uact=13603263409; ln_hurl=http://hdn.xnimg.cn/photos/hdn221/20181101/1550/h_main_qz3H_61ec0009c3901986.jpg; t=6d191b90a0236cea74f99b9d88d3fbd25; societyguester=6d191b90a0236cea74f99b9d88d3fbd25; id=967469305; xnsid=6cbc5509; ver=7.0; loginfrom=null; jebe_key=bd6eb791-92b2-4141-b8ed-53d17551d830%7C2012cb2155debcd0710a4bf5a73220e8%7C1565838783310%7C1%7C1565838784555; jebe_key=bd6eb791-92b2-4141-b8ed-53d17551d830%7C2012cb2155debcd0710a4bf5a73220e8%7C1565838783310%7C1%7C1565838784558; wp_fold=0'
        for kv in cookies.split('; '):
            # kv: 'td_cookie=184xxx'; split on the first '=' only, in case a value contains '='
            key, value = kv.split('=', 1)
            cookie_dict[key] = value

        return cookie_dict

    # Get the response for the profile page
    def get_html(self):
        # Get the cookies dictionary
        cookies = self.get_cookie_dict()
        print(cookies)
        html = requests.get(
            url=self.url,
            headers=self.headers,
            cookies=cookies,
        ).text
        self.parse_html(html)

    # Parse the page here; any address on the site that requires login can be fetched and parsed this way
    def parse_html(self, html):
        parse_html = etree.HTML(html)
        r_list = parse_html.xpath('//*[@id="operate_area"]/div[1]/ul/li[1]/span/text()')
        print(r_list)


if __name__ == '__main__':
    spider = RenrenLogin()
    spider.get_html()

 

 
