Are you a company or a developer with crawling or scraping needs? Hello! My name is Pu_iN, and I crawl the Web for you. I don't overwhelm you with billions of irrelevant pages. I follow only links that you might care about, and deliver a million verified on-topic pages out of the millions of promising links identified, in just a few days! I am polite and obey netiquette rules. Webmasters can identify me in logs by the string:
Pu_iN (+http://semanticjuice.com/)
Some clients use my technology but crawl under different bot identifiers.
Our proprietary algorithms enable very efficient crawling. We use a "best-first search": we continuously update properties of graph nodes and edges, and follow the edges deemed either on-topic or promising for further exploration, while preserving a representative sample of the web. We crawl more valuable sites more often. A single machine with a business account can crawl more than one million pages per day. Since the main focus of semantic juice is on topical links, a single machine is enough for most verticals. A good setup can yield a very high percentage of on-topic pages in larger crawls as well, as much as ⅔ of visited pages!
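The best-first idea above can be sketched with a priority-queue frontier. This is a minimal illustration, not the actual proprietary algorithm: the `score` and `fetch_links` functions are hypothetical stand-ins for the relevance model and the page fetcher.

```python
import heapq

def best_first_crawl(seeds, score, fetch_links, max_pages):
    """Visit up to max_pages URLs, always expanding the highest-scored
    frontier URL next. score(url) is an assumed relevance estimate;
    fetch_links(url) is an assumed fetcher returning outlinks."""
    # heapq is a min-heap, so scores are negated to pop the best URL first
    frontier = [(-score(u), u) for u in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    visited = []
    while frontier and len(visited) < max_pages:
        _, url = heapq.heappop(frontier)
        visited.append(url)
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-score(link), link))
    return visited

# Tiny toy graph to illustrate the ordering: after the seed "a",
# the higher-scored "c" is expanded before the lower-scored "b".
graph = {"a": ["b", "c"], "b": [], "c": ["d"], "d": []}
scores = {"a": 1.0, "b": 0.2, "c": 0.9, "d": 0.5}
order = best_first_crawl(["a"], scores.get, graph.get, max_pages=3)
```

In a real crawler the scores would be updated continuously as new link and content evidence arrives, which is what makes the frontier "best-first" rather than breadth-first.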
Each web crawl can have these parameters customized:
- title
- example pages (for focused crawling)
- seed URLs
- allowed sites (partial URLs)
- forbidden sites (partial URLs)
- forbidden href parameters (PCRE regex)
- forbidden atext parameters (PCRE regex)
- required content phrases
- allowed languages
- maximum duration
- explore seeds frontier
- reload interval
- purge reload interval
- maximum on-topic pages (this is fewer than the total number of pages crawled, and much fewer than the total promising URLs collected)
- DB optimization for SEO backlink queries
- store full content of pages (URLs, titles, atexts, and the link graph are always stored)
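To make the parameter list concrete, here is a hypothetical configuration for a single focused crawl. The key names and values are illustrative only; they mirror the options listed above and are not an actual semanticjuice.com API.

```python
# Hypothetical crawl configuration; names mirror the parameter list above.
crawl_config = {
    "title": "vegan recipes crawl",
    "example_pages": ["http://example.com/recipes/vegan-chili"],
    "seed_urls": ["http://example.com/recipes/"],
    "allowed_sites": ["example.com/recipes"],        # partial URLs
    "forbidden_sites": ["example.com/forum"],        # partial URLs
    "forbidden_href_regex": r"\?sessionid=",         # PCRE regex on link targets
    "forbidden_atext_regex": r"(?i)login|sign up",   # PCRE regex on anchor text
    "required_content_phrases": ["vegan"],
    "allowed_languages": ["en"],
    "max_duration_days": 7,
    "explore_seeds_frontier": True,
    "reload_interval_days": 30,
    "purge_reload_interval_days": 90,
    "max_on_topic_pages": 1_000_000,
    "optimize_db_for_seo_backlinks": False,
    # URLs, titles, atexts, and the link graph are always stored regardless
    "store_full_content": False,
}
```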
The speed of each crawl is affected by the number of crawls running on a server, the distribution of distinct hosts in the URL queue (the robots.txt delay sets the maximum speed per host), and a few other factors. After de-duplication and other analysis, this amounts to tens of millions of the most promising pages crawled per month per server, out of many more promising URLs placed in the queue. This volume of links should cover almost any topic!
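The per-host cap mentioned above follows from simple arithmetic: a polite crawler makes at most one request per Crawl-delay interval on any single host, so overall throughput depends on spreading the queue across many hosts. A quick back-of-the-envelope check:

```python
def max_pages_per_day(crawl_delay_seconds):
    # One polite request per crawl_delay seconds caps a single host
    # at this many pages per day (86400 seconds in a day).
    return 86400 // crawl_delay_seconds

# A 10-second Crawl-delay caps one host at 8640 pages/day, so reaching
# a million pages/day requires the URL queue to span well over a hundred
# distinct hosts -- hence host distribution matters as much as hardware.
daily_cap = max_pages_per_day(10)
```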