
Easy Data Analysis 12 | Web Scraper Pagination – Scraping Pages That Use a Pager (Web Scraper Advanced Usage)

This is article 12 in the Easy Data Analysis series.

In previous articles we covered several Web Scraper solutions for different kinds of pagination: modifying the page link to load data, clicking a “More” button to load data, and scrolling down to load data automatically. Today we'll talk about another very common type of pagination – the pager.

I originally meant to explain what exactly a pager is, but after wading through a pile of definitions it all felt too tedious – it's not like this is anyone's first day on the internet; one look at a picture and you'll get it. I found a very typical example that supports numbered page buttons, a Next button, and jumping to a specified page number.

Today we'll learn how Web Scraper handles this type of page.

In fact, in the very first example of this tutorial series, scraping the Douban Movie Top ranking, the Douban movie list splits its data with exactly this kind of pager:

But back then we scraped it by finding the pattern in the page links, not by using the pager. When a page's links change according to a regular pattern, controlling the link parameters is the lowest-cost way to crawl; only when a page can be flipped but its links follow no pattern do we have to fall back on the pager.
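For comparison, here is a minimal sketch of what that link-parameter approach looks like in an exported sitemap, assuming the Douban Top 250 list pages through its results with a start parameter in steps of 25 (treat the exact URL as an assumption). Web Scraper's [0-225:25] range syntax expands into one start URL per page, so no pager handling is needed at all:

{
  "_id": "douban_top250",
  "startUrl": ["https://movie.douban.com/top250?start=[0-225:25]"],
  "selectors": []
}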

Theory like this can sound dry, so let's look at an example of a page whose links have no pattern at all.

August 2 was Cai Xukun's birthday. To celebrate, fans pushed the repost count of one of Kun Kun's Weibo posts to 3,000,000. The repost data on Weibo happens to be split with a pager, so let's analyze this Weibo repost page and see how to crawl this type of data with Web Scraper.

The direct link to this Weibo post is:

https://weibo.com/1776448504/I0gyT8aeQ?type=repost

Having watched so many of his videos, we can click through and give Kun Kun a few extra views to express our gratitude.

First, let's look at the repost link for page 1, which looks like this:

https://weibo.com/1776448504/I0gyT8aeQ?type=repost

Page 2 looks like this; notice the extra #_rnd1568563840036 parameter:

https://weibo.com/1776448504/I0gyT8aeQ?type=repost#_rnd1568563840036

Page 3's parameter is #_rnd1568563861839:

https://weibo.com/1776448504/I0gyT8aeQ?type=repost#_rnd1568563861839

Page 4's parameter is #_rnd1568563882276:

https://weibo.com/1776448504/I0gyT8aeQ?type=repost#_rnd1568563882276

Looking at just a few of these links, you can see that the repost page URLs follow no pattern at all; the only way to get at the data is to load each page through the pager. Now the real tutorial begins.

1. Create a Sitemap

First we create a Sitemap, this time named cxk, with the start URL https://weibo.com/1776448504/I0gyT8aeQ?type=repost.
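If you export the sitemap at this point (Sitemap cxk -> Export Sitemap), the skeleton looks roughly like this – a minimal sketch; the exact JSON Web Scraper produces may differ slightly between versions:

{
  "_id": "cxk",
  "startUrl": ["https://weibo.com/1776448504/I0gyT8aeQ?type=repost"],
  "selectors": []
}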

2. Create a container selector

Because we need to click the pager, we choose Element Click as the type of the outer container. The specific parameters are explained in the figure; we already covered them in detail in Easy Data Analysis 08, so I won't repeat that here.

The container's preview looks like this:

The process of selecting the pager can be followed in the figure below:
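In exported form, the Element Click container looks roughly like the sketch below. The two CSS selectors (div.repost-item and a.next-page) are placeholders I made up for illustration – on the real page you would use whatever the Select tool picks for the repost items and the pager's next-page button – and some field names or values may differ slightly between Web Scraper versions:

{
  "id": "container",
  "type": "SelectorElementClick",
  "parentSelectors": ["_root"],
  "selector": "div.repost-item",
  "multiple": true,
  "delay": 2000,
  "clickElementSelector": "a.next-page",
  "clickType": "clickMore",
  "discardInitialElements": false,
  "clickElementUniquenessType": "uniqueText"
}

The clickMore setting tells Web Scraper to keep clicking the next-page element until no new items appear, which is exactly the behavior we want from a pager.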

3. Create child selectors

These child selectors are quite simple: they are all Text selectors, and we pick three pieces of content – the commenter's username, the comment content, and the comment time.
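In sitemap form, the three Text selectors nest under the container and look roughly like this (again, the CSS selectors here are made-up placeholders, not Weibo's real class names):

[
  { "id": "username", "type": "SelectorText", "parentSelectors": ["container"], "selector": "a.username", "multiple": false, "regex": "", "delay": 0 },
  { "id": "comment", "type": "SelectorText", "parentSelectors": ["container"], "selector": "div.comment-text", "multiple": false, "regex": "", "delay": 0 },
  { "id": "comment_time", "type": "SelectorText", "parentSelectors": ["container"], "selector": "span.comment-time", "multiple": false, "regex": "", "delay": 0 }
]

Because each one's parentSelectors points at container, Web Scraper extracts all three fields from every repost item the container finds.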

4. Scrape the data

Follow the menu path Sitemap cxk -> Scrape to start fetching the data.

5. Some questions

If you rush off to scrape the data above right after reading this tutorial, the first problem you'll run into is this: there are 3,000,000 records – do I really have to crawl them all?

That doesn't sound very realistic. After all, Web Scraper is meant for relatively small amounts of data; anything beyond a few tens of thousands of records is already a lot. With larger volumes you have to worry about the crawl taking too long, how to store the data, and how to deal with a site's anti-crawling measures (for example, when a CAPTCHA suddenly pops up, Web Scraper is helpless).

With this problem in mind, and having read the earlier tutorial on automatically controlling the number of records crawled, you might think of using :nth-of-type(-n+N) to limit the scrape to N records. If you try it, though, you'll find that this method doesn't work here.
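As a reminder, that limit trick simply appends :nth-of-type(-n+N) to the container's CSS selector. A sketch using the same placeholder selector as above, capping the crawl at 1,000 records, would just change the container's selector field to:

"selector": "div.repost-item:nth-of-type(-n+1000)"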

The reason it fails involves a little knowledge of how web pages work. If you're interested, read the explanation below; if not, skip straight to the conclusion.

For the “click More to load” and “scroll to load” pages I described earlier, newly loaded data is appended to the current page. As you keep scrolling and more data loads, the page's scroll bar gets shorter and shorter, which means all the data lives on the same page.

When we use :nth-of-type(-n+N) to control the number of records, it effectively sets up a counter on that one page: once the data accumulates to the number we want, crawling stops.

But for pages that use a pager, flipping to the next page is effectively refreshing the current page, and a brand-new counter is set up every time.

For example, say you want to grab 1,000 records, but page 1 has only 20. After scraping page 1 you are still 980 short; then the page flips, a new counter is set up, page 2 is scraped and you are still 980 short of the target, the counter resets again on the next flip, and so on. So this way of capping the crawl at 1,000 records becomes ineffective.

So the conclusion is: if you want to end a crawl of a pager-type page early, the only method is to cut the network connection. Of course, if you have a better solution, you can reply in the comments and we can discuss it.

6. Summary

The pager is a very common way for web pages to split data. We can handle this type of page with Web Scraper's Element Click selector, and end the crawl early by cutting the network connection.

7. Recommended reading

Easy Data Analysis 05 | Web Scraper Pagination – Controlling Links to Fetch Data in Batches

Easy Data Analysis 08 | Web Scraper Pagination – Scraping “Click More Button” Pages

Easy Data Analysis 10 | Web Scraper Pagination – Crawling “Scroll to Load” Pages

Easy Data Analysis 09 | Web Scraper Automatic Control of Crawl Count & Web Scraper Parent-Child Selectors
