Netpeak Spider 2.1.1.4: Custom Search and Extraction

Alex Wise, CEO & Founder at Netpeak Software

December 19, 2016

Dear friends, Christmas is just around the corner and we have a little present for you – new functionality in Netpeak Spider that lets you scrape websites' source code and extract data from web pages. In this review we will describe the functionality itself, go over its unique features, and show some examples of how you can use it. Let’s get started!

Knowledge is only useful when you apply it, so we recommend flicking through our guide on scraping with actionable examples: 'Comprehensive Guide: How to Scrape Data from Online Stores With a Crawler.'

1. Description

The ‘Custom Search and Extraction’ function covers a wide range of tasks that SEO specialists and webmasters perform on a daily basis: checking the integration of web analytics, microdata, meta tags for social networks, and the list goes on and on. The new functionality expands the use of Netpeak Spider to such an extent that even we ourselves are a little scared of it :-)

Types of search

In total, there are 4 types of search available:

Contains → counts the number of matches for the search expression. Works in the ‘Only Search’ format, which means it doesn’t extract any data. This is the simplest type of search: it is like pressing Ctrl+F in the source code of a page and entering the search expression, except that the program scans all pages automatically and shows how many matches it has found.
RegExp → extracts all values that match the regular expression. Works in the ‘Search and Extraction’ format. This type lets you customize the process and considerably expands the capabilities of the search; however, it requires some basic knowledge of regular expressions. Read more about regular expressions.
CSS Selector → extracts all values of the necessary HTML elements on the basis of their CSS selectors. Works in the ‘Search and Extraction’ format. A simple yet powerful way to extract data: specifying the single letter ‘a’ is enough to pull all links from the page. Read more about CSS selectors.
XPath → extracts all values of the necessary HTML elements on the basis of their XPath. Works in the ‘Search and Extraction’ format. The most powerful way to select data, though it requires some knowledge and experience. Read more about XPath.

Setting

To add a new search, go to ‘Crawling Settings’ → ‘Custom Search’ tab. Here you can add up to 15 simultaneous searches and set their parameters:

1. Name

This field is not required, but it can save you a lot of time when you have many different searches and need to find a specific one quickly.

2. Type of search

One of the four types: ‘Contains’, ’RegExp’, ‘CSS Selector’, or ‘XPath’.

3. Search expressions

Syntax depends on the chosen type of search. Each type has input validation that will quickly show you whether the expression has been entered correctly.

4. Search space

For ‘Contains’ and ‘RegExp’ only. You can choose what exactly you want to search:

All source code → all HTML tags will be searched
Only text (excluding HTML tags) → as the name implies, HTML tags won’t be included in custom search
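
To make the difference between the two search spaces concrete, here is a minimal Python sketch. It is not how Netpeak Spider is implemented; the `TextOnly` class, the `count_matches` helper and the sample page are all hypothetical, and the counting ignores case to mirror the tool's default.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collects text nodes and drops all tags (a rough 'only text' view)."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def count_matches(source, needle, only_text=False):
    """Count non-overlapping occurrences of needle, ignoring case."""
    if only_text:
        parser = TextOnly()
        parser.feed(source)
        parser.close()
        source = "".join(parser.parts)
    return source.lower().count(needle.lower())


# Hypothetical page: 'title' appears both in tag names and in visible text.
page = (
    "<html><head><title>Page title</title></head>"
    "<body><p>No Title here</p></body></html>"
)

all_code = count_matches(page, "title")                   # tags and text
text_only = count_matches(page, "title", only_text=True)  # text only
print(all_code, text_only)
```

In this sample the ‘All source code’ count is higher because the opening and closing `<title>` tags also contain the word. Note that this simple parser also treats the contents of `<script>` and `<style>` as text, which a real tool may handle differently.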

5. Data extraction

For ‘CSS Selector’ and ‘XPath’ only. You can choose what data you want to extract – as an example, we will search ‘a’ with a CSS selector or ‘//a’ with XPath in the following source code:

<a href="https://example.com/">Anchor text</a>

Inner text → only text will be extracted from the element and all child elements, excluding HTML tags. The result of the extraction for the above-mentioned example will be:

Anchor text

Inner HTML content → all content will be extracted from the element. In our example it is:

Anchor text

Entire HTML element → the whole element will be extracted. In our example it is:

<a href="https://example.com/">Anchor text</a>

Attribute → here you can specify what attribute you want to extract. For instance, in case ‘href’ is specified, only the link will be extracted:

https://example.com/
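
The four extraction modes can be illustrated with a short Python sketch built on the standard library's `xml.etree.ElementTree`. This is only an approximation under stated assumptions: the snippet is well-formed, and a nested `<b>` tag (not part of the example above) is added so that ‘Inner text’ and ‘Inner HTML content’ produce visibly different results.

```python
import xml.etree.ElementTree as ET

# Hypothetical snippet: the nested <b> is added for illustration only.
element = ET.fromstring('<a href="https://example.com/">Anchor <b>text</b></a>')

# Inner text: text of the element and all child elements, without tags.
inner_text = "".join(element.itertext())

# Inner HTML content: everything between the opening and closing tags.
inner_html = (element.text or "") + "".join(
    ET.tostring(child, encoding="unicode") for child in element
)

# Entire HTML element: the element together with its own tags.
entire_element = ET.tostring(element, encoding="unicode")

# Attribute: the value of one named attribute.
href = element.get("href")

print(inner_text)
print(inner_html)
print(entire_element)
print(href)
```

Real pages are rarely well-formed XML, so an HTML-aware parser would be needed outside of a toy example like this one.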

6. Ignore case

For ‘Contains’ and ‘RegExp’ only. You can use the ‘Ignore case’ setting – it is on by default and facilitates the search. If your search is case-sensitive, deselect this box.

View results

In the right panel of the main window, you can find a new ‘Search’ tab. The results of custom search appear there after the crawling is completed.

Note: the search works only for HTML pages that return a 2xx response code – the number of such pages is displayed in the same block under ‘Analyzed URLs’.

If you have used a custom search that extracts data, you can select this search and click the ‘Extraction Results’ button to see a summary table of the extracted data. It is similar to how the ‘Current Table Summary’ button works.

In the main table, the results of each search are added to a separate column that behaves like the parameter columns – it can be sorted, filtered, hidden, exported to Excel / CSV, etc. If you double-click a number in the column, you will see the data for the selected URL.

In additional tables dedicated to a particular search, you can find detailed information about its settings – so you will always be able to see what settings have been applied and what data has been extracted.

2. Unique features

✔ Scaling

We would like to share our vision of the new functionality – how you can look at the received data from a different angle:

‘Search’ tab in the right panel of the main window → this section lets you see all current searches at once, as well as the number of URLs that contain the expression (number of entries > 0) and don’t contain the expression (number of entries = 0): this information (Found / Not found) is clickable and will take you to the corresponding filters.
Main results table → next to every URL you can see the number of all entries of the expression.
Additional table with the extraction results → here the data is grouped, so you can see all unique entries, their number, and their length.

We have tried to show as much information as possible in each of these views.
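
The grouping in the extraction results table can be pictured with a few lines of Python. This is only a sketch of the idea (unique entry, number of occurrences, length); the sample values and the field names are hypothetical and do not reflect the program's internal format.

```python
from collections import Counter

# Hypothetical values extracted from several pages by one search.
extracted = ["GTM-ABC123", "GTM-ABC123", "GTM-XYZ9", "GTM-ABC123"]

# One row per unique entry, most frequent first.
summary = [
    {"value": value, "count": count, "length": len(value)}
    for value, count in Counter(extracted).most_common()
]

for row in summary:
    print(row)
```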

✔ Search Space

In the ‘Contains’ and ‘RegExp’ types of search you can choose the search space – ‘All source code’ or ‘Only text (excluding HTML tags)’.

This unique feature can be very useful in case you need to analyze the text only: for example, find all unigrams and count their number on every page.

✔ High productivity

This part is just like a cherry on top :-)

All data from custom searches is gathered into a separate database, which takes the load off the RAM, always a scarce resource. You can run many separate custom searches simultaneously, even if each of them performs heavy operations or stores a lot of information (for example, extracts all symbols and calculates their number, or retrieves the entire HTML source code).

3. Examples of use

We have prepared a selection of the most popular tasks that can be solved with the help of custom search. However, it is important to remember that with the new functionality you will be able to solve almost any site analysis problem: it’s just a matter of experience and the ability to use the different types of search.

RegExp

1. Integrating GTM / GTM ID

Sometimes you need to check whether GTM (Google Tag Manager) has been integrated correctly. You may want to be sure that the web analytics connected through GTM is working properly. You might also find out that some pages have extra GTM code.

Regular expression:

['"](GTM-\w+)['"]

Search space: all source code
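
To see what this expression captures, here is a small Python sketch that applies it to a hypothetical page fragment. The container ID `GTM-ABC123` and the shortened loader snippet are made up for illustration.

```python
import re

# The expression from above: a GTM ID wrapped in single or double quotes.
GTM_ID = re.compile(r"""['"](GTM-\w+)['"]""")

# Hypothetical fragment: a shortened loader script plus the <noscript> fallback.
page = """<script>(function(w,d,s,l,i){/* loader body omitted */})
(window,document,'script','dataLayer','GTM-ABC123');</script>
<noscript><iframe src="https://www.googletagmanager.com/ns.html?id=GTM-ABC123">
</iframe></noscript>"""

gtm_ids = GTM_ID.findall(page)
print(gtm_ids)
```

In this fragment only the ID inside the script is returned. The ID in the `<noscript>` URL is preceded by `=` rather than a quote, so the expression does not count it a second time.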

2. Unigrams

Unigrams are single words. Now, with the help of Netpeak Spider you can extract this data, which considerably extends the capabilities of analyzing texts.

Regular expression for unigrams:

\w+

Search space: only text (excluding HTML tags)
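
As a rough illustration of what can be done with the extracted unigrams, the Python sketch below counts them for a hypothetical sentence. Lowercasing the words before counting is an assumption made here so that ‘Spider’ and ‘spider’ are grouped together.

```python
import re
from collections import Counter

# Hypothetical page text with HTML tags already excluded.
text = "Spider crawls pages. Spider extracts data from pages."

words = re.findall(r"\w+", text)
unigrams = Counter(word.lower() for word in words)

print(len(words))
print(unigrams.most_common(2))
```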

3. Email

Here is a regular expression for finding email addresses. It can be useful for checking whether all pages have a contact email, or for collecting addresses from a large number of analyzed websites (don’t forget about the ‘List of URLs’ crawling mode).

Regular expression:

[a-zA-Z0-9][a-zA-Z0-9\.+-]+\@[\w-\.]+\.\w+

Search space: all source code
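
Here is a hedged Python sketch of the same search. One adjustment is needed: Python's `re` module rejects the `[\w-\.]` class as a bad character range, so the sketch uses the equivalent `[\w.-]` and drops the unnecessary escapes. The sample addresses are hypothetical.

```python
import re

# Same pattern as above, with the character classes rewritten for Python's re.
EMAIL = re.compile(r"[a-zA-Z0-9][a-zA-Z0-9.+-]+@[\w.-]+\.\w+")

# Hypothetical source code fragment.
page = (
    '<a href="mailto:support@example.com">support@example.com</a> '
    "or sales+eu@mail.example.org."
)

# The same address often appears several times, so keep unique values only.
emails = sorted(set(EMAIL.findall(page)))
print(emails)
```

Because ‘All source code’ is searched, the address is found both in the `mailto:` link and in the anchor text, which is why deduplication is useful here.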

CSS Selector

4. Extracting the entire source code

The architecture of the program makes it possible to extract the entire source code. After opening the extraction results, you can search and filter them (for example, with the help of regular expressions).

CSS Selector:

html

Data extraction: entire HTML element

5. Links in the <body> section

Use a space between two selectors (the descendant combinator) to find and extract elements that are nested inside another element.

CSS Selector:

body a

Data extraction: entire HTML element

6. Texts in <p> tags of the <body> section

One of the built-in parameters counts the number of symbols in the texts of <p> tags in the <body> section. Now you’ll be able to see the texts themselves.

CSS Selector:

body p

Data extraction: inner text

7. Tags strong/b, em/i

Recently, we have received a number of requests from users who wanted to count the number of certain tags (<strong>, for instance) and see their contents. Now such data can be easily extracted. Note that in our example all the listed tags are separated by commas – this is totally fine when you want to extract all of them at once.

CSS Selector:

strong, b, em, i

Data extraction: entire HTML element

8. hreflang

Correct integration of hreflang tags is very important for multilingual and multi-regional websites. You can now check whether the corresponding tags are present with the help of the new functionality: in the first example below, all values will be extracted; in the second one, only the line with the hreflang="en-GB" attribute, that is, with the link to the English version of the page for GB users.

CSS Selectors:

link[hreflang]

link[hreflang='en-GB']

Data extraction: entire HTML element
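
For readers who want to reproduce this check outside the program, here is a minimal Python sketch that mimics both selectors with the standard library's `html.parser`. The `HreflangCollector` class and the sample `<head>` are hypothetical; the standard library has no CSS selector engine, so the filtering is written by hand.

```python
from html.parser import HTMLParser


class HreflangCollector(HTMLParser):
    """Collects (hreflang, href) pairs from <link> tags that have hreflang."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "link" and "hreflang" in attributes:
            self.links.append((attributes["hreflang"], attributes.get("href")))


# Hypothetical <head> with two hreflang links and one unrelated link.
head = """<head>
<link rel="alternate" hreflang="en-GB" href="https://example.com/en-gb/">
<link rel="alternate" hreflang="de" href="https://example.com/de/">
<link rel="canonical" href="https://example.com/">
</head>"""

collector = HreflangCollector()
collector.feed(head)
collector.close()

all_hreflang = collector.links  # like link[hreflang]
en_gb_only = [href for lang, href in all_hreflang if lang == "en-GB"]  # like link[hreflang='en-GB']

print(all_hreflang)
print(en_gb_only)
```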

9. Microdata (Schema.org)

As with GTM, it may be necessary to check whether microdata has been integrated correctly, or whether there is any microdata at all. The first example shows how to extract all lines where microdata is used; the second one extracts only those lines that contain the itemprop="url" attribute.

CSS Selector:

[itemprop]

[itemprop='url']

Data extraction: entire HTML element

XPath

10. First h1 tag

This data is already available as the built-in ‘h1 value’ parameter. The goal of the example is to show what syntax to use to extract the data you need (not necessarily h1 headings).

XPath:

/descendant::h1[1]

Data extraction: inner text
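
The `/descendant::h1[1]` form matters because in XPath 1.0 the shorter `//h1[1]` selects every h1 that is the first h1 child of its parent, not the first h1 in the document. The Python sketch below shows the difference on a hypothetical, well-formed page. The standard library's `ElementTree` supports only a subset of XPath (no `descendant::` axis), so `find()` stands in for the document-order lookup.

```python
import xml.etree.ElementTree as ET

# Hypothetical page with two h1 headings in different containers.
root = ET.fromstring(
    "<html><body>"
    "<div><h1>First</h1></div>"
    "<div><h1>Second</h1></div>"
    "</body></html>"
)

# First h1 in document order, comparable to /descendant::h1[1].
first_in_document = root.find(".//h1")

# Every h1 that is the first h1 child of its parent, comparable to //h1[1].
first_per_parent = root.findall(".//h1[1]")

print(first_in_document.text)
print([h1.text for h1 in first_per_parent])
```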

11. Social meta tags

With the help of XPath, you can extract Open Graph, Facebook, and Twitter tags that help social networks understand your content better.

XPath:

/html/head/meta[starts-with(@property,"og:") or starts-with(@property,"fb:") or starts-with(@name,"twitter:")]

Data extraction: entire HTML element
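
The logic of this XPath can be approximated in Python as follows. `ElementTree` does not implement the `starts-with()` function, so the sketch selects `/html/head/meta` with a path and applies the prefix checks in plain Python; the sample tags are hypothetical and the markup is assumed to be well-formed.

```python
import xml.etree.ElementTree as ET

# Hypothetical page head: three social tags and one ordinary meta tag.
root = ET.fromstring("""<html><head>
<meta property="og:title" content="Example" />
<meta property="fb:app_id" content="123" />
<meta name="twitter:card" content="summary" />
<meta name="description" content="Not a social tag" />
</head><body /></html>""")

social_tags = [
    meta.attrib
    for meta in root.findall("./head/meta")
    if meta.get("property", "").startswith(("og:", "fb:"))
    or meta.get("name", "").startswith("twitter:")
]

for attributes in social_tags:
    print(attributes)
```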

12. Interactive telephone numbers

This XPath finds and extracts all <a> tags that use special links for immediate dialing from mobile phones or with the help of special desktop software.

XPath:

//a[starts-with(@href, 'tel:')]

Data extraction: entire HTML element

13. Embedded video

The example shows how to find and extract links to YouTube videos that are embedded with the help of <iframe>.

XPath:

//iframe[contains(@src, 'www.youtube.com/embed/')]

Data extraction: src attribute
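
Below is a small Python sketch of the same check on hypothetical, well-formed markup. The substring test replaces the XPath `contains()` function, which `ElementTree` does not implement. The sample also includes an embed from `www.youtube-nocookie.com`, the domain used by YouTube's privacy-enhanced mode, to show that this exact expression does not match it.

```python
import xml.etree.ElementTree as ET

# Hypothetical page body with three embedded players.
root = ET.fromstring("""<body>
<iframe src="https://www.youtube.com/embed/VIDEO_ID_1"></iframe>
<iframe src="https://www.youtube-nocookie.com/embed/VIDEO_ID_2"></iframe>
<iframe src="https://player.example.com/embed/42"></iframe>
</body>""")

youtube_srcs = [
    iframe.get("src")
    for iframe in root.iter("iframe")
    if "www.youtube.com/embed/" in iframe.get("src", "")
]

print(youtube_srcs)
```

If privacy-enhanced embeds matter for your audit, a second search with the other domain would be one way to cover them.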

In a nutshell

This update introduces custom search of source code / text with 4 types of search: ‘Contains’, ‘RegExp’, ‘CSS Selector’, and ‘XPath’. The new feature has several unique characteristics, so feel free to experiment and get the most out of your SEO.

This is the last major release of Netpeak Spider for Windows. Now we begin the global rebuild of Netpeak Checker – stay tuned, it’s going to be awesome! :-)
