Configurable Crawling Jobs

You can define any number of jobs and run as many of them concurrently as your subscription limit allows. Place relevant jobs on auto-reload to collect fresh content. You can define topics, sites, regular expressions, crawling intervals, and general, seed, and news crawling modes. Built-in features make our crawlers more efficient: they ignore near-duplicate content, spam pages, and link farms, and a real-time domain relevancy algorithm gets you the most relevant content for your topic. Our backlink and SEO queries can help you detect crawl trends and update crawl settings in real time.

Machine Learning

Provide a few quality example links from different websites to teach our system what topic to crawl. A topic, as the machine understands it, is defined by the statistics of words found in pages, and is therefore language independent.

Note: if you provide pages about a single person or location, that name will NOT be the only hint for the machine; word statistics will also pick up other significant concepts, so results will contain pages that do not relate only to that person or location. Similarly, our algorithm neither detects a 'style' of writing nor finds all the literary works of an author.
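
To make the idea of a word-statistics topic concrete, here is a minimal sketch (not our actual algorithm): a topic is represented as a word-frequency profile built from the example pages, and candidate pages are scored by cosine similarity against it. All page texts below are hypothetical snippets.

```python
from collections import Counter
import math

def word_profile(text):
    """Term-frequency profile of a page: the language-independent topic signal."""
    words = [w.lower() for w in text.split() if w.isalpha()]
    return Counter(words)

def cosine(p, q):
    """Cosine similarity between two word-frequency profiles."""
    dot = sum(p[w] * q[w] for w in set(p) & set(q))
    norm = math.sqrt(sum(v * v for v in p.values())) * math.sqrt(sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0

# Topic profile built from (hypothetical) example pages:
topic = word_profile("solar panel inverter grid energy storage battery solar panel")
on_topic = word_profile("battery storage for solar energy and grid inverters")
off_topic = word_profile("recipe for chocolate cake with vanilla frosting")

assert cosine(topic, on_topic) > cosine(topic, off_topic)
```

Note how the profile captures many words at once, which is why a crawl seeded with pages about one person also surfaces pages about the surrounding concepts.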

Crawler Features

Parameter Details

examples: URLs to define your topic.
The better the examples you provide, the better the pages the crawler finds. Very specific topics require only a few links; for others, ideally 30+ example pages are needed.
seeds: URLs where crawling starts.
Provide a few URLs of hub pages as seeds (avoid search engines; some block access). Submit up to 100,000 links.
allowed sites: List or rules to narrow a crawl.
Allow specific subdomains and file paths, up to 5,000 of them.
forbidden sites: List or rules to narrow a crawl.
Block specific subdomains and file paths, up to 5,000 of them.
forbidden href patterns: List or rules to narrow a crawl.
Use regular expressions for forbidden (or allowed) URL patterns, up to 100 of them.
forbidden atext patterns: List or rules to narrow a crawl.
Use regular expressions for forbidden (or allowed) link-text patterns, up to 100 of them.
content phrases: Lists of phrases which must or must not appear in the detected core content of on-topic pages.
To narrow results to a few brands or persons, provide up to 10 lists of alternative phrases, one list per line. Content must then match at least one phrase from each list.
allowed languages: Select a language.
Processes only pages in the selected language. We support: Arabic, Czech, Danish, English, Finnish, French, German, Hindi, Hungarian, Italian, Norwegian, Polish, Portuguese, Russian, Spanish, Swedish, Turkish.
crawl duration: Time in minutes.
Maximum processing time. The crawl may finish sooner if other settings are more limiting. Processes may run in parallel or sleep, so actual elapsed time may be longer or shorter.
number of pages: Number of on-topic pages to find.
The crawl may finish sooner if seeds or other parameters are more restrictive. When set to 1, only seed pages are crawled.
explore seeds frontier: Controlled crawl.
Visits only seeds and pages linked from seed pages, or, even more narrowly, new links only.
reload interval: Time in minutes.
Repeats the crawl after it has finished. Available in seeds/seeds+frontier modes. Finds new links and pages until the subscription limit is reached.
purge & reload interval: Time in minutes.
Deletes all data and restarts the crawl, giving you continuous crawling of new pages.
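
The content-phrase rule (content must match at least one phrase from each list) can be sketched as an AND of ORs; the phrase lists and page texts below are hypothetical:

```python
def matches_content_phrases(core_content, phrase_lists):
    """A page passes only if every list contributes at least one matching phrase."""
    text = core_content.lower()
    return all(any(p.lower() in text for p in phrases) for phrases in phrase_lists)

# Hypothetical filter: the page must mention one brand AND one product type.
phrase_lists = [
    ["acme", "globex"],    # brands (list 1)
    ["laptop", "tablet"],  # product types (list 2)
]

matches_content_phrases("Acme announces a new laptop line", phrase_lists)  # passes both lists
matches_content_phrases("Globex quarterly earnings report", phrase_lists)  # fails: no product type
```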

Modes of operation

A crawling job is defined with a crawl depth type and a combination of crawl focus parameters.

CRAWL DEPTH TYPE: seeds only; seeds and links found in seed pages; new links found on seed pages after the initial crawl; unlimited.
CRAWL FOCUS PARAMETERS: general or topical; site specific; language specific; href & atext pattern specific.

Seed-limited crawls can be set to auto-reload at chosen time intervals.
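
Putting the pieces together, a job definition combines one depth type with the focus parameters described above. The sketch below is illustrative only; the field names are not the service's actual API:

```python
# Hypothetical job definition combining a crawl depth type with focus parameters.
crawl_job = {
    "name": "renewable-energy-news",
    "depth_type": "seeds+frontier",                   # seeds and links found in seed pages
    "examples": ["https://example.com/solar-guide"],  # pages that define the topic
    "seeds": ["https://example.org/energy-hub"],      # where crawling starts
    "allowed_sites": ["news.example.net"],
    "forbidden_href_patterns": [r".*/login.*", r".*\.pdf$"],
    "allowed_languages": ["English"],
    "crawl_duration_minutes": 120,
    "number_of_pages": 5000,
    "reload_interval_minutes": 1440,                  # re-crawl daily for fresh links
}
```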

Built-in Features

Feature Details
representative sampling: Collected web pages are neither a random nor a biased sample but a representative sample of the topical web. Our crawler preserves global web-graph properties during the crawl; sites are represented proportionally to their topical merit. In a well-connected topic, final results are robust to variations in the initial starting points (the provided seed URLs). Top sites from our crawls are most often highly ranked in Google, have high Alexa traffic, and have high PageRank. This is invaluable information for SEO consultants as well as for anyone exploring a niche and its competition, even more so now that Google has discontinued its public PageRank service and the Yahoo and Blekko backlink searches are no longer available.
low level of SPAM: Our algorithms keep most spam sites out of the results. More complex link-farm schemes may get through, but they are easy to detect with our 'similar domains to domain' query, which shows all sites closely connected by their linking pattern, and easy to remove with the 'REMOVE DOMAIN' link. During months of testing on dozens of verticals, we had to do this only once.
optimized for SEO: Some aggregate SEO queries require optimized database tables.

SEO queries

A few examples of link- and keyword-related queries; there are more than 50 in total.

prospect pages: Pages with a few relevant links to different domains, indicating a link-building opportunity for your niche.
atext: Most frequent phrases in anchor texts of links found in this topical graph (unique phrases per link).
atext to example.com: Most frequent phrases in anchor texts of links from this topical graph pointing to the domain.
my phrase in atext to http://example.com/page.html: Anchor texts, in relevant links to the page, that contain the given query phrase.
my phrase in atext from example.com: Anchor texts, in relevant links from the domain, that contain the query phrase.
title word stems: Common word bases found in verified on-topic page titles.
most linked pages from example.com: Pages from the domain with the most relevant inbound links.
links to http://example.com/page.html: Relevant links to the page.
popular pages from example.com: Pages from the domain with the highest topical PageRank.
popular pages linking to http://example.com/page.html with my phrase in atext: Pages with the highest topical PageRank pointing to the page, with the query phrase in the anchor text.
urls with my phrase in atext to example.com: Relevant links to the domain with the query phrase in the anchor text.
most relevant domains: Most relevant domains, ranked by page relevance to the topic (as defined by the example links) and by the number of relevant domains linking to them.
domains with most linking domains: Relevant domains with the most other relevant domains linking to them.
similar domains to example.com: Domains similar to the given domain, based on other relevant domains linking to them.
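
As an illustration of what a query like 'domains with most linking domains' aggregates, here is a sketch over a small hypothetical topical link graph (the real queries run against our optimized database tables):

```python
from collections import defaultdict

# Hypothetical topical link graph: (source_domain, target_domain) edges.
links = [
    ("a.com", "hub.com"), ("b.com", "hub.com"), ("c.com", "hub.com"),
    ("a.com", "blog.com"), ("a.com", "blog.com"),  # duplicate edges count once
    ("b.com", "a.com"),
]

linking_domains = defaultdict(set)
for src, dst in links:
    if src != dst:
        linking_domains[dst].add(src)  # distinct linking domains only

# Rank domains by the number of distinct relevant domains linking to them.
ranked = sorted(linking_domains, key=lambda d: len(linking_domains[d]), reverse=True)
# hub.com ranks first: three distinct domains link to it
```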

Structured data

If you need structured data, we can scrape the content of your crawl results at an additional cost.

 


© 2017 semanticjuice.com