Highly customizable focused crawler

Configurable Crawling Jobs

You can define topics, domains, URL paths, regular expressions, crawling intervals, and general, seed, and news crawling modes. Built-in features make our crawlers more efficient: they ignore near-duplicate content, spam pages, and link farms, and use a real-time domain relevancy algorithm that gets you the most relevant content for your topic.

Our backlink and SEO queries can help you detect crawl trends and update crawl settings in real time.

Machine Learning

Provide a few quality example links from different websites to teach our system what topic to crawl. A topic, as understood by the machine, is defined by word combinations found in pages and is language independent. Quick SEO mode gets these example links for a given phrase from major search engines.

For example, if you provide pages about a single person or location, that name will NOT be the only hint for the machine; word statistics will also detect other significant concepts, so results will include pages that are not solely about that person or location. Similarly, our algorithm neither detects a 'style' of writing nor finds all literary works of an author. Technically this can be done, but it requires a different machine learning approach.
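
The idea of a topic defined by word statistics rather than by a single name can be illustrated with a toy sketch. This is not the service's algorithm, which is not described here; it only assumes that a "word combination" can be approximated by adjacent word pairs and that pages are scored by cosine similarity against a profile built from the example pages. All function names and sample texts are hypothetical.

```python
import math
import re
from collections import Counter


def word_pairs(text: str) -> Counter:
    """Count adjacent word pairs, a crude stand-in for 'word combinations'."""
    words = re.findall(r"\w+", text.lower())
    return Counter(zip(words, words[1:]))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[key] for key, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_topic_profile(example_texts: list[str]) -> Counter:
    """Merge word-pair counts from all example pages into one topic profile."""
    profile: Counter = Counter()
    for text in example_texts:
        profile.update(word_pairs(text))
    return profile


# Hypothetical example pages (already reduced to plain text).
examples = [
    "solar panels convert sunlight into electricity for home energy",
    "home energy storage pairs solar panels with batteries",
]
profile = build_topic_profile(examples)

on_topic = cosine(profile, word_pairs("installing solar panels for home energy savings"))
off_topic = cosine(profile, word_pairs("classic pasta recipes with tomato sauce"))
print(on_topic > off_topic)
```

Because the profile is built from many word pairs, no single word or name dominates the score, which mirrors the behaviour described above in a very simplified way.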

Crawler Features

Parameter	Details
`examples`	URLs to define your topic. _{The better the examples you provide, the better the pages the crawler finds. Very specific topics require only a few links; for others, ideally 30+ example pages are needed.}
`seeds`	URLs where crawling starts. _{Provide a few URLs for hub pages as seeds (avoid search engines; some block access). Submit up to 100,000 links.}
`allowed sites`	List or rules to narrow a crawl. _{Allow specific subdomains and file paths, up to 5,000 of them.}
`forbidden sites`	List or rules to narrow a crawl. _{Block specific subdomains and file paths, up to 5,000 of them.}
`forbidden href patterns`	List or rules to narrow a crawl. _{Use regular expressions for forbidden (or allowed) URL patterns, up to 100 of them.}
`forbidden atext patterns`	List or rules to narrow a crawl. _{Use regular expressions for forbidden (or allowed) link text patterns, up to 100 of them.}
`content phrases`	List of phrases which must or must not appear in the detected core content of on-topic pages. _{To narrow results to a few brands or persons, provide lists of optional phrases, one phrase per line, up to 10 lists. Content must then match at least one phrase from each list.}
`allowed languages`	Select a language. _{Processes only pages in selected language. For topical crawls we support: Arabic, Czech, Danish, English, Finnish, French, German, Hindi, Hungarian, Italian, Norwegian, Polish, Portuguese, Russian, Spanish, Swedish, Turkish. For general crawls we also support Afrikaans, Aragonese, Belarusian, Breton, Catalan, Bulgarian, Bengali, Welsh, Greek, Estonian, Basque, Persian, Irish, Galician, Gujarati, Hebrew, Croatian, Haitian, Indonesian, Icelandic, Japanese, Khmer, Kannada, Korean, Lithuanian, Latvian, Macedonian, Malayalam, Marathi, Malay, Maltese, Nepali, Occitan, Punjabi, Romanian, Slovak, Slovene, Somali, Albanian, Serbian, Swahili, Tamil, Telugu, Thai, Tagalog, Ukrainian, Urdu, Vietnamese, Walloon, Yiddish, Simplified Chinese, Traditional Chinese}
`crawl duration`	Time in minutes. _{Maximum duration of processing time. Crawl may finish sooner if other settings are more limiting. Processes may work in parallel or be in sleep mode, so actual time may be longer or shorter.}
`number of pages`	Number of on-topic pages to find. _{Crawl may finish sooner if seeds or other parameters are more restrictive. When set to 1, crawls only seed pages.}
`explore seeds frontier`	Controlled crawl. _{Visits only seeds and pages linked from seeds, or even new links only.}
`reload interval`	Time in minutes. _{Repeats the crawl once it has finished. Available in seeds/seeds+frontier modes. Finds new links and pages until the subscription limit is reached.}
`purge & reload interval`	Time in minutes. _{Deletes all data and restarts the crawl. This gives you continuous crawling of new pages.}
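
Two of the filtering rules in the table above can be sketched in a few lines. This is a minimal illustration, not the crawler's implementation: the function names are hypothetical, case-insensitive substring matching is an assumption, and only the "must appear" side of `content phrases` is shown.

```python
import re


def matches_content_phrases(content: str, phrase_lists: list[list[str]]) -> bool:
    """True when the content contains at least one phrase from every list.

    Assumption: matching is a case-insensitive substring test.
    """
    text = content.lower()
    return all(
        any(phrase.lower() in text for phrase in phrases)
        for phrases in phrase_lists
    )


def is_forbidden_href(url: str, patterns: list[str]) -> bool:
    """True when any forbidden regular expression matches the URL."""
    return any(re.search(pattern, url) for pattern in patterns)


# Hypothetical settings: one list of brands, one list of page types.
phrase_lists = [["acme", "globex"], ["review", "comparison"]]
forbidden_href_patterns = [r"/tag/", r"\.pdf$"]

print(matches_content_phrases("An in-depth Acme laptop review", phrase_lists))
print(matches_content_phrases("Acme press release", phrase_lists))
print(is_forbidden_href("http://example.com/tag/laptops", forbidden_href_patterns))
print(is_forbidden_href("http://example.com/page.html", forbidden_href_patterns))
```

The first page passes because it matches a phrase from both lists; the second matches only the brand list and is rejected.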

Modes of operation

A crawling job is defined with a crawl depth type and a combination of crawl focus parameters.

CRAWL DEPTH TYPE	seeds only; seeds and links found in seed pages; new links found on seed pages after initial crawl; unlimited
CRAWL FOCUS PARAMETERS	general or topical; site(s) specific; language specific; href & atext pattern specific

Seeds limited crawls can be set for auto-reload at certain time intervals.
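
To make the combination of depth type and focus parameters concrete, here is a hypothetical in-memory model of a crawling job. The class and field names are illustrative and do not represent the service's API; the limits come from the parameter table, and treating only the unlimited depth type as ineligible for auto-reload is an assumption.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class CrawlDepth(Enum):
    SEEDS_ONLY = "seeds only"
    SEEDS_AND_FRONTIER = "seeds and links found in seed pages"
    NEW_LINKS_ONLY = "new links found on seed pages after initial crawl"
    UNLIMITED = "unlimited"


@dataclass
class CrawlJob:
    depth: CrawlDepth
    seeds: list[str] = field(default_factory=list)
    topical: bool = True  # False means a general crawl
    allowed_sites: list[str] = field(default_factory=list)
    allowed_language: Optional[str] = None
    forbidden_href_patterns: list[str] = field(default_factory=list)
    reload_interval_minutes: Optional[int] = None

    def validate(self) -> list[str]:
        """Return a list of problems; an empty list means the job looks valid."""
        problems = []
        if len(self.seeds) > 100_000:
            problems.append("at most 100,000 seed links are accepted")
        if len(self.allowed_sites) > 5_000:
            problems.append("at most 5,000 allowed-site rules are accepted")
        if len(self.forbidden_href_patterns) > 100:
            problems.append("at most 100 href patterns are accepted")
        # Assumption: every depth type except 'unlimited' counts as seeds limited.
        if self.reload_interval_minutes is not None and self.depth is CrawlDepth.UNLIMITED:
            problems.append("auto-reload applies to seeds-limited crawls only")
        return problems


news_job = CrawlJob(
    depth=CrawlDepth.SEEDS_AND_FRONTIER,
    seeds=["http://example.com/news/"],
    allowed_language="English",
    reload_interval_minutes=60,
)
print(news_job.validate())
```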

Built-in Features

Feature	Details
representative sampling	Collected web pages are not a random or biased sample but constitute a representative sample of the topical web. Our crawler preserves global web graph properties during the crawl; sites are represented proportionally to their topical merit. In a well-connected topic, final results are quite robust to variations in the initial starting points (the provided seed URLs). Top sites from our crawl most often happen to be highly ranked in Google, have high traffic in Alexa, and have high PageRank. This is invaluable information for SEO consultants as well as for anyone exploring the niche and competition, even more so now that Google has discontinued its PageRank service and Yahoo and Blekko backlink searches are no longer available.
low level of SPAM	Our algorithms prevent most spam sites from appearing in results. More complex link farm schemes may get through, but they are easy to detect with our 'similar domains to domain' query, which shows all sites closely connected by the linking pattern, and easy to remove with the 'REMOVE DOMAIN' link. During years of testing on dozens of verticals, we had to do this only once.
optimized for SEO	Some aggregate SEO queries require optimized database tables.

SEO queries

A few examples of link and keyword related queries. There are more than 50 in total.

`prospect pages`	Pages with a few relevant links to different domains, indicating a link building opportunity for your niche!
`atext`	Most frequent phrases in anchor texts of links found in this topical graph (unique phrases per link).
`atext to example.com`	Most frequent phrases in anchor texts of links from this topical graph pointing to the domain.
`my phrase in atext to http://example.com/page.html`	Anchor texts in relevant links to the page containing given query phrase.
`my phrase in atext from example.com`	Anchor texts in relevant links from domain containing query.
`title word stems`	Common word bases found in verified on-topic page titles.
`most linked pages from example.com`	Pages from domain with most relevant inbound links.
`links to http://example.com/page.html`	Relevant links to the page.
`popular pages from example.com`	Pages from domain with highest topical PageRank.
`popular pages linking to http://example.com/page.html with my phrase in atext`	Pages with highest topical PageRank pointing to the page with query in anchor text.
`urls with my phrase in atext to example.com`	Relevant links to the domain with query in anchor text.
`most relevant domains`	Most relevant domains based on page relevance to the topic as defined by example links and on the number of relevant domains linking to them.
`domains with most linking domains`	Relevant domains with most other relevant domains linking to them.
`similar domains to example.com`	Similar domains to domain, based on other relevant domains linking to them.
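
As a rough illustration of what a query such as `domains with most linking domains` computes, the sketch below counts distinct linking domains per target domain over a small, made-up list of links. The function names and data are hypothetical, and using the URL hostname as the "domain" is a simplification; how the service stores links and defines relevance is not covered here.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def domain_of(url: str) -> str:
    """Use the hostname as the domain (simplification: no subdomain folding)."""
    return urlsplit(url).hostname or ""


def domains_with_most_linking_domains(links: list[tuple[str, str]]) -> list[tuple[str, int]]:
    """Rank target domains by the number of distinct other domains linking to them."""
    linking: dict[str, set[str]] = defaultdict(set)
    for source_url, target_url in links:
        src, dst = domain_of(source_url), domain_of(target_url)
        if src and dst and src != dst:  # ignore internal links
            linking[dst].add(src)
    return sorted(
        ((domain, len(sources)) for domain, sources in linking.items()),
        key=lambda item: (-item[1], item[0]),
    )


# Hypothetical (source page, target page) pairs from a crawl.
links = [
    ("http://a.example/1", "http://c.example/x"),
    ("http://a.example/2", "http://c.example/y"),
    ("http://b.example/1", "http://c.example/x"),
    ("http://c.example/x", "http://a.example/1"),
]
ranking = domains_with_most_linking_domains(links)
print(ranking)
```

Counting distinct linking domains rather than raw links keeps a single site with many outbound links from inflating a target's rank.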

Data

You can download full results, or only new results since the last crawl, in CSV format, or access a real-time data feed and advanced search queries over titles, ahrefs, and urls in JSON format.

Hosted search

We can also set up a fully developed search engine on your server that interacts with this data, or set up hosted indexing for your crawls with advanced search features over the full content of your crawls, not only titles and URLs.

Structured data

Currently we provide extracted content, detected date, detected main image, and an extractive summary. If you need more structured data, we can scrape the content of your crawl results at additional cost.

Need a multilingual parallel corpus?

We have a cutting-edge sentence and word alignment algorithm able to amass multilingual parallel sentences from non-parallel corpora, so we can run a crawl on your custom niche in different languages and detect parallel sentences in the collected texts.
