Are you a company or a developer with crawling or scraping needs? Hello! My name is Pu_iN, and I crawl the Web for you. I don't overwhelm you with billions of irrelevant pages. I follow only the links that you might care about and deliver a million verified on-topic pages, selected from among millions of promising links identified, in a few days! I am polite and obey netiquette rules. Webmasters can identify me in logs by the string:
Pu_iN (+http://semanticjuice.com/)
Some clients use my technology but have different bot identifiers.
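For webmasters, a minimal sketch of counting my visits in an access log. It assumes the identifier above appears verbatim in the user-agent field of each log line; the sample log lines, the IP addresses, and the name `count_bot_hits` are made up for illustration. Crawls run by clients with their own bot identifiers would not be matched.

```python
BOT_SIGNATURE = "Pu_iN (+http://semanticjuice.com/)"


def count_bot_hits(log_lines):
    """Count log lines whose text contains the bot identifier."""
    return sum(1 for line in log_lines if BOT_SIGNATURE in line)


# Hypothetical access-log lines in a combined-log-like layout.
SAMPLE_LOG = [
    '203.0.113.5 - - [01/Jan/2020:10:00:00 +0000] "GET / HTTP/1.1" 200 512 "-" '
    '"Pu_iN (+http://semanticjuice.com/)"',
    '198.51.100.7 - - [01/Jan/2020:10:00:01 +0000] "GET /about HTTP/1.1" 200 204 "-" '
    '"Mozilla/5.0"',
]

hits = count_bot_hits(SAMPLE_LOG)
```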
Our proprietary algorithms enable very efficient crawling. We use a "best-first search": we continuously update the properties of graph nodes and edges and follow the edges deemed either to be on-topic or to be good candidates for further exploration, while preserving a representative sampling of the web. We crawl more valuable sites more often. A single machine with a business account can crawl more than one million pages per day. Given that the main focus of Semantic Juice is on topical links, a single machine is enough for most verticals. A good setup can yield a very high percentage of on-topic pages in bigger crawls as well, even ⅔ of visited pages!
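To illustrate the general idea of best-first search (not our proprietary algorithm), here is a minimal sketch using a priority queue. The function name, the toy link graph, and the fixed scores are all assumptions made for the example; a real crawler would fetch pages over the network and, as described above, keep updating scores as it learns more about the graph.

```python
import heapq
from typing import Callable, Dict, Iterable, List, Set, Tuple


def best_first_crawl(
    seeds: Iterable[str],
    get_links: Callable[[str], List[str]],
    score: Callable[[str], float],
    max_pages: int,
) -> List[str]:
    """Visit up to max_pages URLs, always expanding the highest-scoring one next."""
    # heapq is a min-heap, so scores are negated to pop the best URL first.
    # The counter breaks ties in insertion order.
    frontier: List[Tuple[float, int, str]] = []
    seen: Set[str] = set()
    counter = 0
    for url in seeds:
        if url not in seen:
            seen.add(url)
            heapq.heappush(frontier, (-score(url), counter, url))
            counter += 1

    visited: List[str] = []
    while frontier and len(visited) < max_pages:
        _, _, url = heapq.heappop(frontier)
        visited.append(url)
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-score(link), counter, link))
                counter += 1
    return visited


# Toy link graph and relevance scores standing in for the web (hypothetical).
TOY_GRAPH: Dict[str, List[str]] = {
    "a": ["b", "c"],
    "b": ["d"],
    "c": ["e"],
    "d": [],
    "e": [],
}
TOY_SCORES = {"a": 1.0, "b": 0.2, "c": 0.9, "d": 0.1, "e": 0.8}

order = best_first_crawl(
    ["a"],
    get_links=lambda u: TOY_GRAPH.get(u, []),
    score=lambda u: TOY_SCORES[u],
    max_pages=4,
)
```

With these toy scores the crawl visits a, c, e, b: the promising branch through c is followed before the low-scoring page b, and d is never reached within the page budget.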
Each web crawl can have these parameters customized: title, example pages (for focused crawling), seed URLs, allowed sites (partial URLs), forbidden sites (partial URLs), forbidden href parameters (PCRE regex), forbidden atext parameters (PCRE regex), required content phrases, allowed languages, maximum duration, explore seeds frontier, reload interval, purge reload interval, maximum on-topic pages (this is less than the total number of pages crawled, and much less than the total number of promising URLs collected), DB optimization for SEO backlink queries, store full content of pages (URLs, titles, atexts and graph are always stored).
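As a rough illustration of how a few of these parameters could interact, here is a sketch of a link filter. The class and field names are hypothetical and do not describe our actual configuration format. It assumes "partial URLs" are matched as substrings, and it uses Python's `re` module as a stand-in for PCRE (the two regex dialects differ in places).

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class CrawlConfig:
    # All field names are hypothetical; they mirror a subset of the parameters above.
    title: str
    seed_urls: List[str]
    allowed_sites: List[str] = field(default_factory=list)    # partial URLs
    forbidden_sites: List[str] = field(default_factory=list)  # partial URLs
    forbidden_href: Optional[str] = None   # regex applied to the link URL
    forbidden_atext: Optional[str] = None  # regex applied to the anchor text
    max_on_topic_pages: int = 1000


def link_allowed(cfg: CrawlConfig, href: str, atext: str = "") -> bool:
    """Return True if a link passes the site and regex filters in cfg."""
    # Assumption: a partial URL matches when it occurs anywhere in the href.
    if any(part in href for part in cfg.forbidden_sites):
        return False
    if cfg.allowed_sites and not any(part in href for part in cfg.allowed_sites):
        return False
    # Python's re stands in for PCRE here.
    if cfg.forbidden_href and re.search(cfg.forbidden_href, href):
        return False
    if cfg.forbidden_atext and re.search(cfg.forbidden_atext, atext):
        return False
    return True


# Example configuration with made-up values.
cfg = CrawlConfig(
    title="Cycling",
    seed_urls=["https://example.com/"],
    allowed_sites=["example.com"],
    forbidden_sites=["example.com/private"],
    forbidden_href=r"[?&]sessionid=",
    forbidden_atext=r"(?i)^log ?in$",
)
```

Here `link_allowed(cfg, "https://example.com/bikes", "Road bikes")` passes, while links to `example.com/private`, links carrying a `sessionid` parameter, and links whose anchor text is "Log in" are rejected.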
The speed of each crawl is affected by the number of crawls running on a server, the distribution of distinct hosts in the URL queue (the robots.txt delay sets the maximum speed per host), and a few other factors. We obey robots.txt Disallow / Allow rules, and do not crawl any host more than once per second, even if several jobs on a machine want to crawl the same site. We also avoid crawling sites at times when they are not responsive, so as not to make your website slow for other users. After de-duplication and other analysis, this comes to tens of millions of the most promising pages crawled per month, among many more promising URLs placed in a queue, per server.
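The politeness rules above can be sketched roughly as follows. This is an illustration under stated assumptions, not our production scheduler: `HostScheduler` and `try_acquire` are hypothetical names, one scheduler instance is assumed to be shared by all jobs on a machine, and the robots.txt text is parsed from a string rather than fetched. It uses the standard library's `urllib.robotparser` for the Disallow / Allow and Crawl-delay handling.

```python
from typing import Dict
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

MIN_DELAY = 1.0  # seconds; the "at most once per second per host" rule


class HostScheduler:
    """Tracks, per host, the earliest time the next request may be sent."""

    def __init__(self) -> None:
        self._next_allowed: Dict[str, float] = {}

    def try_acquire(self, url: str, robots: RobotFileParser, agent: str, now: float) -> bool:
        """Return True and reserve the host if url may be fetched at time now."""
        if not robots.can_fetch(agent, url):
            return False  # blocked by a Disallow rule
        host = urlsplit(url).netloc
        if now < self._next_allowed.get(host, float("-inf")):
            return False  # too soon since the last request to this host
        # Use the larger of our own minimum and the site's Crawl-delay, if any.
        delay = robots.crawl_delay(agent)
        self._next_allowed[host] = now + max(MIN_DELAY, float(delay or 0))
        return True


# Made-up robots.txt content for a hypothetical host.
robots = RobotFileParser()
robots.parse([
    "User-agent: *",
    "Crawl-delay: 2",
    "Disallow: /private/",
])

scheduler = HostScheduler()
first = scheduler.try_acquire("https://example.com/a", robots, "Pu_iN", now=0.0)
too_soon = scheduler.try_acquire("https://example.com/b", robots, "Pu_iN", now=1.5)
later = scheduler.try_acquire("https://example.com/b", robots, "Pu_iN", now=2.0)
blocked = scheduler.try_acquire("https://example.com/private/x", robots, "Pu_iN", now=10.0)
```

In real use `now` would come from a clock such as `time.monotonic()`; passing it in explicitly keeps the sketch deterministic. The second request is refused because the host's two-second Crawl-delay has not yet elapsed, regardless of which job asked for it.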