
Python Web Crawler in Practice (4): Simulated Login

The home page of a site may require you to log in. Take Zhihu, for example: at the same URL, the personal information in the upper right corner is naturally different depending on whether or not you are logged in.

(Logged in)

(Not logged in)

So which version does the crawler actually get when it fetches this page?

Certainly the second one: you cannot possibly access a user's own home page information without logging in. So what makes the same URL show different content when the crawler visits it?

In the first article we mentioned the concept of the cookie. HTTP is stateless, so the server has no idea who a request actually comes from. It is as if you suddenly received a letter asking you to send something, but with no sender information on it to contact.

HTTP is the same: ordinary requests are like anonymous letters, and the cookie exists so that the letter can carry your name.

With your cookie included in the request, the server knows who the request comes from; when you click on the personal information page, the server knows to return the page of the user this cookie corresponds to.

In Google Chrome, you can find all the cookie key-value pairs for the current site under the Application tab of the developer tools. Usually only one key is used to identify you, but you can also send all of them, since there is no guarantee that the server does not check some of the other key-value pairs as well.
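As a minimal illustration (using the requests library, which is not part of the original walkthrough), you can paste the cookie string copied from the developer tools into a request header; the cookie value below is a placeholder, not a real one:

import requests

# Cookie string copied from Chrome's Application tab (placeholder value)
headers = {"Cookie": "t=xxxxx; anonymid=xxxxx"}

# With the cookie attached, the same URL returns the logged-in version
response = requests.get("http://www.renren.com/", headers=headers)
print(response.status_code)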

Many sites will not show you much data unless you are logged in, so the crawler needs to perform a simulated login.

A simulated login starts from the site's login page, because that is where we send a POST request carrying the account and password, so that the crawler can log in to the site.

Let's take Renren as an example.

Renren login address: http://www.renren.com/

Open the developer tools and watch the network requests while logging in. It is easy to spot a POST request to /login in there. Login requests are sent with POST because a GET request puts its parameters in the URL, where your account and password could easily be intercepted and read.

Among the form parameters we only need to pay attention to email, password, and rkey; the others can be filled in as given.

email is our account name; it can be an email address or a phone number.

password is the password, which is obviously encrypted. In this case we have to encrypt the password with the same algorithm when making the request. But how do we know which encryption algorithm the site uses?

In most such cases you can look in the Sources tab of the developer tools, which lists all the files loaded by the website; the encryption algorithm is usually in one of the js files.

In Sources there is clearly a file named login.js, which must be related to logging in. Opened directly, the js is ugly (minified down to a single line in Sources), so I pretty-printed it in the console instead.

Searching for password, we can locate it here.

Here you can find the algorithm that encrypts the login password. Incidentally, the function naming in Renren's login.js is truly hopeless: abcd … xyz used over and over. I do not know whether it sickens their own people more, or the crawler writers analyzing the login. Surprisingly, though, Renren has left crawlers a very convenient way to log in; we do not even have to produce the ciphertext ourselves!!!

This spares us their encryption algorithm entirely, but in fact many sites do not expose an action that can be called directly the way Renren does; most of the time you are expected to encrypt part of the payload into ciphertext yourself.
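When you do have to reproduce the ciphertext, a common trick is to run the site's own js rather than re-implement the algorithm in Python. Below is a minimal sketch using the PyExecJS library; it assumes you saved login.js locally from the Sources tab, and the function name encryptString is a hypothetical placeholder, not Renren's real API:

import execjs

# Load the site's own encryption code, saved locally from the Sources tab
with open("login.js", encoding="utf-8") as f:
    ctx = execjs.compile(f.read())

# Call the encryption function found in the js; the function name and
# arguments here are hypothetical placeholders for illustration
ciphertext = ctx.call("encryptString", "my_plain_password")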

Let's look at the convenient login interface Renren exposes to crawlers.

We only need to call this .do endpoint to complete the login.

In other words, we simulate this form and fill in the information it needs before making the request.

Install the scrapy dependency:

pip install scrapy

import scrapy

# Inside a scrapy.Spider callback, yield a POST request for the login form;
# FormRequest builds a Request object, it does not return the response itself
url = "http://www.renren.com/PLogin.do"
data = {"email": "xxxxx", "password": "xxxxx"}
yield scrapy.FormRequest(url, formdata=data, callback=self.parse_page)

After the login succeeds, we get a cookie from the response; every request after that carries the cookie, so the server knows who we are.
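Put together, a complete spider might look like the following minimal sketch; the spider name and the follow-up parsing are placeholders, and Scrapy's cookie middleware attaches the session cookie to the later requests automatically:

import scrapy

class RenrenSpider(scrapy.Spider):
    name = "renren"  # placeholder spider name

    def start_requests(self):
        # Send the login form as a POST request
        yield scrapy.FormRequest(
            "http://www.renren.com/PLogin.do",
            formdata={"email": "xxxxx", "password": "xxxxx"},
            callback=self.after_login,
        )

    def after_login(self, response):
        # Logged in; the session cookie rides along on this request automatically
        yield scrapy.Request("http://www.renren.com/", callback=self.parse_page)

    def parse_page(self, response):
        # Placeholder: extract whatever logged-in-only data you need here
        self.logger.info("Fetched %s as a logged-in user", response.url)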

If there have been too many failed login attempts before, the page may start demanding a verification code for the simulated login. We do not need to handle the verification code here, but if your login keeps failing, clearing the machine's cookies may be the fix.
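In Scrapy, one way to make sure no stale cookies are sent with the login is to give the request a fresh cookie jar through the cookiejar meta key (a standard feature of Scrapy's cookies middleware); this is a sketch using the same placeholder credentials as above:

# Use a fresh, empty cookie jar for this login attempt; follow-up
# requests that pass the same meta key share the new session
yield scrapy.FormRequest(
    "http://www.renren.com/PLogin.do",
    formdata={"email": "xxxxx", "password": "xxxxx"},
    meta={"cookiejar": 1},
    callback=self.after_login,
)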
