Netpeak Spider 188.8.131.52: Custom Search and Extraction
Dear friends, Christmas is just round the corner and we have a little present for you – new functionality of Netpeak Spider that allows you to scrape websites' source code and extract data from web pages. In this review we are going to tell you about the functionality itself, its unique features and show some examples of how you can use it. Let’s get started!
The ‘Custom Search and Extraction’ function covers a wide range of tasks that SEO specialists and webmasters perform on a daily basis: checking the integration of web analytics, microdata, meta tags for social networks, and the list goes on and on. The new functionality expands the use of Netpeak Spider to such an extent that even we ourselves are a little scared of it :-)
Types of search
In total, there are 4 types of search available:
- Contains → counts the number of search expression matches. Works in the ‘Only Search’ format, which means it doesn’t extract any data. This is the simplest type of search: it is as if you pressed Ctrl+F while viewing the source code of a page and entered the search expression. The program automatically scans all pages and shows how many matches have been found.
- RegExp → extracts all values that match the regular expression. Works in the ‘Search and Extraction’ format. This type lets you customize the search and considerably expands its capabilities; however, it requires some basic knowledge of regular expressions. Read more about regular expressions.
- CSS Selector → extracts all values of the necessary HTML elements on the basis of their CSS selectors. Works in the ‘Search and Extraction’ format. A simple yet powerful way to extract data: all you need to do is specify the single letter ‘a’ to pull all links from the page. Read more about CSS selectors.
- XPath → extracts all values of the necessary HTML elements on the basis of their XPath. Works in the ‘Search and Extraction’ format. The most powerful way to select data, though it requires certain knowledge and experience. Read more about XPath.
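To make the difference between ‘Only Search’ and ‘Search and Extraction’ concrete, here is a rough Python sketch of the first two types. It is not how Netpeak Spider works internally: the page source and the function names `contains_count` and `regexp_extract` are made up for illustration.

```python
import re

# Hypothetical page source, used only for illustration.
page = (
    '<html><body><p>Call us: <a href="tel:+123">+123</a></p>'
    "<p>call now</p></body></html>"
)


def contains_count(source: str, expression: str) -> int:
    """'Contains'-style search: count the matches, extract nothing."""
    return source.count(expression)


def regexp_extract(source: str, pattern: str) -> list:
    """'RegExp'-style search: return every value that matches."""
    return re.findall(pattern, source)


paragraphs = contains_count(page, "<p>")
hrefs = regexp_extract(page, r'href="([^"]+)"')
print(paragraphs, hrefs)
```

The first call only yields a number, while the second returns the matched values themselves, which is what makes a results table possible.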
To add a new search, go to ‘Crawling Settings’ → ‘Custom Search’ tab. Here you can add up to 15 simultaneous searches and set their parameters:
1. Name
This field is not required but can save you a lot of time when you have many different searches and need to find a specific one quickly.
2. Type of search
One of the four types: ‘Contains’, ’RegExp’, ‘CSS Selector’, or ‘XPath’.
3. Search expressions
Syntax depends on the chosen type of search. Each type has input validation that will quickly show you whether the expression has been entered correctly.
4. Search space
For ‘Contains’ and ‘RegExp’ only. You can choose what exactly you want to search:
- All source code → all HTML tags will be searched
- Only text (excluding HTML tags) → as the name implies, HTML tags won’t be included in custom search
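The effect of the search space can be shown with a small Python sketch. It assumes that ‘Only text’ roughly means "text nodes with the tags removed"; the class `TextOnly`, the helper `text_only` and the sample markup are invented for this example.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collects text nodes and drops the tags around them."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def text_only(source: str) -> str:
    parser = TextOnly()
    parser.feed(source)
    parser.close()
    return "".join(parser.chunks)


# Hypothetical markup: "title" appears once in an attribute and once in text.
page = '<div class="title"><p>The title of the page</p></div>'

in_source = page.count("title")           # search space: all source code
in_text = text_only(page).count("title")  # search space: only text
print(in_source, in_text)
```

The same expression gives different counts depending on the search space, because the attribute value is visible only in the full source code.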
5. Data extraction
For ‘CSS Selector’ and ‘XPath’ only. You can choose what data you want to extract. As an example, we will search ‘a’ with a CSS selector or ‘//a’ with XPath in the following source code:
<a href="https://example.com/"><strong>Anchor text</strong></a>
- Inner text → only text will be extracted from the element and all its child elements, excluding HTML tags. For the example above, the result will be:
Anchor text
- Inner HTML content → all content inside the element will be extracted. In our example it is:
<strong>Anchor text</strong>
- Entire HTML element → the whole element will be extracted. In our example it is:
<a href="https://example.com/"><strong>Anchor text</strong></a>
- Attribute → here you can specify which attribute you want to extract. For instance, if ‘href’ is specified, only the link will be extracted:
https://example.com/
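The four extraction options can be reproduced on the same snippet with Python's standard library. This is only a sketch: it relies on the snippet being well-formed XML, which real pages often are not, and the variable names are ours.

```python
import xml.etree.ElementTree as ET

source = '<a href="https://example.com/"><strong>Anchor text</strong></a>'
element = ET.fromstring(source)

# Inner text: text of the element and all of its children, without tags.
inner_text = "".join(element.itertext())

# Inner HTML content: everything between the opening and closing tag.
inner_html = (element.text or "") + "".join(
    ET.tostring(child, encoding="unicode") for child in element
)

# Entire HTML element: the element together with its own tag.
entire_element = ET.tostring(element, encoding="unicode")

# Attribute: the value of one named attribute.
href = element.get("href")

print(inner_text, inner_html, entire_element, href, sep="\n")
```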
6. Ignore case
For ‘Contains’ and ‘RegExp’ only. You can use the ‘Ignore case’ setting – it is on by default and facilitates the search. If your search is case-sensitive, deselect this box.
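In regular expression terms, ‘Ignore case’ behaves like a case-insensitive flag. A minimal Python sketch, with a hypothetical helper `count_matches` and sample markup of our own:

```python
import re

# Hypothetical source with the same tag written in two different cases.
source = "<TITLE>Netpeak Spider</TITLE><title>spider</title>"


def count_matches(text: str, expression: str, ignore_case: bool = True) -> int:
    """Count literal matches, optionally ignoring letter case."""
    flags = re.IGNORECASE if ignore_case else 0
    return len(re.findall(re.escape(expression), text, flags))


with_ignore_case = count_matches(source, "title")
case_sensitive = count_matches(source, "title", ignore_case=False)
print(with_ignore_case, case_sensitive)
```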
In the right panel of the main window, you can find a new ‘Search’ tab. The results of custom search appear there after the crawling is completed.
Note: the search works only for HTML pages that return a 2xx response code – the number of such pages is displayed in the same block under ‘Analyzed URLs’.
If you have used a custom search that extracts certain data, you can select this search and click the ‘Extraction Results’ button to see a summary table of the extracted data. It works similarly to the ‘Current Table Summary’ button.
In the main table, the results of each search are added to a separate column that works like a parameter column: it can be sorted, filtered, hidden, exported to Excel / CSV, etc. Double-click a number in the column to see the data for the selected URL.
In additional tables dedicated to a particular search, you can find detailed information about its settings – so you will always be able to see what settings have been applied and what data has been extracted.
2. Unique features
We would like to share our vision of the new functionality – how you can look at the received data from a different angle:
- ‘Search’ tab in the right panel of the main window → this section allows you to simultaneously see all current searches, as well as the number of URLs that contain the expression (number of entries > 0) and don’t contain the expression (number of entries = 0): this information (Found / Not found) is clickable and will take you to the corresponding filters.
- the main results table → next to every URL you can see the number of all entries of the expression.
- additional table with the extraction results → here the data is grouped, so you can see all unique entries, their number, and length.
We have tried to present as much information as possible about the different ways to use the new functions.
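The three views described above boil down to simple aggregations over the same data. The sketch below uses an invented `extracted` mapping of URLs to extracted values; it illustrates the idea and is not the program's actual data model.

```python
from collections import Counter

# Hypothetical extraction results: URL -> values extracted from that page.
extracted = {
    "https://example.com/": ["GTM-ABC123"],
    "https://example.com/about": ["GTM-ABC123", "GTM-XYZ789"],
    "https://example.com/contact": [],
}

# 'Search' tab: URLs where the expression was found / not found.
found = [url for url, values in extracted.items() if values]
not_found = [url for url, values in extracted.items() if not values]

# Main results table: number of entries next to every URL.
entries_per_url = {url: len(values) for url, values in extracted.items()}

# Extraction results table: unique entries with their number and length.
groups = Counter(value for values in extracted.values() for value in values)
summary = [(value, count, len(value)) for value, count in groups.most_common()]

print(len(found), len(not_found), entries_per_url, summary)
```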
✔ Search Space
In the ‘Contains’ and ‘RegExp’ types of search you can choose the search space – ‘All source code’ or ‘Only text (excluding HTML tags)’.
This unique feature can be very useful when you need to analyze the text only: for example, to find all unigrams and count them on every page.
✔ High productivity
This part is just like a cherry on top :-)
All data from custom searches is gathered into a separate database, which takes the load off the RAM, always a scarce resource. You can run many separate custom searches simultaneously, even if each of them performs heavy operations or stores a lot of information (for example, extracting all characters and counting them, or retrieving the entire HTML source code).
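The general idea of keeping results out of memory can be sketched as follows. We do not describe which database Netpeak Spider uses; SQLite, the table layout and the helper `store` are assumptions chosen purely for illustration.

```python
import os
import sqlite3
import tempfile

# Assumption: an on-disk SQLite file stands in for the "separate database".
db_path = os.path.join(tempfile.mkdtemp(), "custom_search.db")
conn = sqlite3.connect(db_path)
conn.execute("CREATE TABLE results (url TEXT, search TEXT, value TEXT)")


def store(url: str, search: str, values: list) -> None:
    """Write one page's extracted values to disk instead of keeping them in RAM."""
    conn.executemany(
        "INSERT INTO results VALUES (?, ?, ?)",
        [(url, search, value) for value in values],
    )
    conn.commit()


store("https://example.com/", "emails", ["info@example.com", "sales@example.com"])
store("https://example.com/about", "emails", ["info@example.com"])

# Grouping happens in the database, so only the small summary is loaded.
rows = conn.execute(
    "SELECT value, COUNT(*) FROM results WHERE search = ? "
    "GROUP BY value ORDER BY value",
    ("emails",),
).fetchall()
print(rows)
```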
3. Examples of use
We have prepared a selection of the most popular tasks that can be solved with the help of custom search. However, it is important to remember that with the new functionality you will be able to solve almost any problem of site analysis: it’s just a matter of experience and the ability to use the different types of search.
1. Integrating GTM / GTM ID
Sometimes you need to check whether GTM (Google Tag Manager) has been integrated correctly. You may want to be sure that web analytics, which connects with the help of GTM, is working properly. You might also find out that some pages have extra GTM code.
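As a rough illustration of such a check, the sketch below extracts container IDs with a regular expression. The pattern `GTM-[A-Z0-9]+` is our assumption about the usual shape of a GTM ID, and the page source is shortened and hypothetical.

```python
import re

# Assumption: container IDs look like "GTM-" followed by capitals and digits.
GTM_ID = re.compile(r"GTM-[A-Z0-9]+")

# Hypothetical, shortened page source with the two usual GTM fragments.
page = """
<script>(function(w,d,s,l,i){/* loader */})(window,document,'script','dataLayer','GTM-ABC123');</script>
<noscript><iframe src="https://www.googletagmanager.com/ns.html?id=GTM-ABC123"></iframe></noscript>
"""

ids = GTM_ID.findall(page)
unique_ids = sorted(set(ids))

# More than one unique ID on a page may point to extra GTM code.
print(ids, unique_ids)
```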
2. Unigrams
Unigrams are single words. Now, with the help of Netpeak Spider you can extract this data, which considerably extends the capabilities of text analysis.
Regular expression for unigrams:
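The original expression is not reproduced here. One possible pattern, given purely as an assumption, is "a run of letters"; the Python sketch below applies it to a sample sentence and counts the unigrams.

```python
import re
from collections import Counter

# Assumption: a unigram is a run of letters (no digits, no underscore).
UNIGRAM = re.compile(r"[^\W\d_]+")

# Hypothetical page text (search space: only text).
text = "SEO tools make SEO audits faster"

words = [word.lower() for word in UNIGRAM.findall(text)]
counts = Counter(words)
print(len(words), counts.most_common(1))
```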
3. Email addresses
Here you can find a regular expression for finding email addresses. Sometimes it might be useful to check whether all pages have a contact email, or to collect addresses from a large number of analyzed websites (don’t forget about the ‘List of URLs’ crawling mode).
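A deliberately simple pattern is enough to show the idea. It is an assumption of ours, not the expression referred to above, and it does not cover every valid address.

```python
import re

# Simplified pattern (assumption): local part, "@", domain, dot, TLD.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# Hypothetical page source.
page = (
    'Write to <a href="mailto:info@example.com">info@example.com</a> '
    "or sales@example.org."
)

emails = EMAIL.findall(page)
unique_emails = sorted(set(emails))
print(emails, unique_emails)
```

Searching all source code finds the address twice here (in the `mailto:` link and in the anchor text), which is one reason the grouped results table is handy.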
4. Extracting the entire source code
The architecture of the program allows you to extract the entire source code. After opening the extraction results, you can search and filter them (for example, with the help of regular expressions).
5. Links in the <body> section
Use a space between two selectors to find and extract one HTML tag nested inside another.
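In CSS, the space is the descendant combinator, so a selector such as `body a` (our assumption for this example) means "every <a> anywhere inside <body>". The Python standard library has no CSS engine, so the sketch below expresses the same selection with an ElementTree path on a small, well-formed sample page.

```python
import xml.etree.ElementTree as ET

# Hypothetical well-formed page; real pages usually need an HTML parser.
page = """<html>
<head><link rel="canonical" href="https://example.com/"/></head>
<body>
  <p><a href="/about">About</a></p>
  <div><a href="/contact">Contact</a></div>
</body>
</html>"""

root = ET.fromstring(page)

# "./body//a" plays the role of the CSS selector "body a".
links = [a.get("href") for a in root.findall("./body//a")]
print(links)
```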
6. Texts in <p> tags of the <body> section
One of the built-in parameters counts the number of characters in the texts of <p> tags in the <body> section. Now you’ll be able to see the texts themselves.
7. Tags strong/b, em/i
Recently, we have received a number of requests from our users who wanted to count the number of <b> tags (for instance) and know their contents. Now, such data can be easily extracted: note that in our example all the listed tags are separated by commas – this is totally ok when you are extracting all tags at once.
strong, b, em, i
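A comma-separated selector matches an element if any of the listed selectors matches it. The Python sketch below mimics that on a made-up, well-formed fragment and also counts the tags.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical fragment with all four tags.
page = (
    "<body><p><strong>Bold</strong> and <b>bold</b>, "
    "<em>emphasis</em> and <i>italic</i> and <b>more</b></p></body>"
)
root = ET.fromstring(page)

# Equivalent of the selector list "strong, b, em, i".
TAGS = {"strong", "b", "em", "i"}
matches = [
    (element.tag, "".join(element.itertext()))
    for element in root.iter()
    if element.tag in TAGS
]
tag_counts = Counter(tag for tag, _ in matches)
print(matches, tag_counts["b"])
```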
8. Hreflang
Correct integration of hreflang tags is very important for multilingual and multi-regional websites. You can now check whether there are any corresponding tags with the help of the new functionality: in the first example below, all values will be extracted; in the second one, only the line with the hreflang="en-GB" attribute, that is, with the link to the English version of the page for GB users.
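Both variants can be sketched with attribute predicates, which ElementTree supports in its limited XPath subset. The paths and the sample page are our own illustration, not the expressions from the original examples.

```python
import xml.etree.ElementTree as ET

# Hypothetical well-formed page head.
page = """<html><head>
<link rel="alternate" hreflang="en-GB" href="https://example.com/en-gb/"/>
<link rel="alternate" hreflang="de" href="https://example.com/de/"/>
<link rel="canonical" href="https://example.com/"/>
</head><body/></html>"""
root = ET.fromstring(page)

# Variant 1: every <link> that has an hreflang attribute.
all_hreflang = [
    (link.get("hreflang"), link.get("href"))
    for link in root.findall(".//link[@hreflang]")
]

# Variant 2: only the <link> whose hreflang equals "en-GB".
en_gb = [link.get("href") for link in root.findall(".//link[@hreflang='en-GB']")]
print(all_hreflang, en_gb)
```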
9. Microdata (Schema.org)
Similar to GTM – it may be necessary to check whether microdata has been integrated correctly, or whether there is any microdata at all. The first example shows how to extract all lines where the microdata is used; the second one extracts only those lines that contain the itemprop="url" attribute.
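Here the selection is by attribute on any element rather than on one tag. The sketch below uses the wildcard `*` with an attribute predicate; the markup and the paths are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical microdata fragment, written as well-formed XML.
page = """<div itemscope="" itemtype="https://schema.org/Organization">
  <span itemprop="name">Example</span>
  <a itemprop="url" href="https://example.com/">Site</a>
</div>"""
root = ET.fromstring(page)

# Any element that carries an itemprop attribute.
with_itemprop = [element.get("itemprop") for element in root.findall(".//*[@itemprop]")]

# Only the elements marked as itemprop="url".
urls = [element.get("href") for element in root.findall(".//*[@itemprop='url']")]
print(with_itemprop, urls)
```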
10. First h1 tag
This parameter can already be found among the built-in parameters under the name ‘h1 value’. The goal of the example is to show what syntax to use to extract the data you need (not necessarily h1 headings).
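"First match in document order" can be sketched in Python with `find`, which stops at the first matching element. The sample page is hypothetical, and the same approach works for any tag name.

```python
import xml.etree.ElementTree as ET

# Hypothetical page with two h1 headings.
page = (
    "<html><body><h1>Main heading</h1>"
    "<div><h1>Second heading</h1></div></body></html>"
)
root = ET.fromstring(page)

# find() returns the first matching element in document order, or None.
first_h1 = root.find(".//h1")
first_h1_text = "".join(first_h1.itertext()) if first_h1 is not None else None
print(first_h1_text)
```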
11. Social meta tags
With the help of XPath, you can extract Open Graph, Facebook, and Twitter tags that help social networks understand your content better.
/html/head/meta[starts-with(@property,"og:") or starts-with(@property,"fb:") or starts-with(@name,"twitter:")]
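To see what this XPath selects, here is the same condition written out in Python. ElementTree does not support `starts-with()`, so the prefix checks are done with `str.startswith`; the sample page and the helper `is_social` are ours.

```python
import xml.etree.ElementTree as ET

# Hypothetical well-formed page head.
page = """<html><head>
<meta property="og:title" content="Example"/>
<meta property="fb:app_id" content="123"/>
<meta name="twitter:card" content="summary"/>
<meta name="description" content="Plain description"/>
</head><body/></html>"""
root = ET.fromstring(page)


def is_social(meta) -> bool:
    """Mirror the XPath predicate: og:/fb: in @property or twitter: in @name."""
    prop = meta.get("property", "")
    name = meta.get("name", "")
    return prop.startswith(("og:", "fb:")) or name.startswith("twitter:")


social = [meta for meta in root.findall("./head/meta") if is_social(meta)]
keys = [meta.get("property") or meta.get("name") for meta in social]
print(keys)
```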
12. Interactive telephone numbers
This XPath finds and extracts all <a> tags that use special links for immediate dialing from mobile phones or with the help of special desktop software.
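The expression itself is not shown here, so the sketch below only illustrates the kind of filter involved. It assumes that "special links" means `href` values starting with `tel:` or `callto:`; both the schemes and the sample markup are our assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment with two dialing links and one ordinary link.
page = """<body>
<a href="tel:+15550100">Call us</a>
<a href="callto:example.user">Call from desktop</a>
<a href="/contacts">Contacts</a>
</body>"""
root = ET.fromstring(page)

# Assumed URL schemes for immediate dialing.
DIAL_SCHEMES = ("tel:", "callto:")

dial_links = [
    a.get("href")
    for a in root.iter("a")
    if a.get("href", "").startswith(DIAL_SCHEMES)
]
print(dial_links)
```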
13. Embedded video
The example shows how to find and extract links to YouTube videos that are embedded with the help of <iframe>.
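A comparable filter in Python might look like the sketch below. It assumes that embedded videos use an `src` containing `youtube.com/embed/`; the placeholder `VIDEO_ID` and the second iframe are invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: one YouTube embed and one unrelated iframe.
page = """<body>
<iframe src="https://www.youtube.com/embed/VIDEO_ID" width="560" height="315"></iframe>
<iframe src="https://maps.example.com/embed"></iframe>
</body>"""
root = ET.fromstring(page)

# Keep only the iframes whose src points to a YouTube embed URL.
videos = [
    frame.get("src")
    for frame in root.iter("iframe")
    if "youtube.com/embed/" in frame.get("src", "")
]
print(videos)
```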
In a nutshell
This update introduces custom search of source code / text using four types of search: ‘Contains’, ‘RegExp’, ‘CSS Selector’, and ‘XPath’. The new feature is full of unique characteristics, so feel free to experiment and get the most out of your SEO.
This is the last major release of Netpeak Spider for Windows. Now we begin the global rebuild of Netpeak Checker – stay tuned, it’s going to be awesome! :-)