Make sure your example links still resolve to actual content: pages may have been removed and now return 404 errors, which often happens with old bookmarks.

We cannot currently crawl PDF or DOC files, so only links to plain HTML pages can be submitted as examples.
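The two checks above can be sketched as a small pre-flight test. This is an illustrative assumption, not part of the product: it classifies a fetched link by its HTTP status code and `Content-Type` header, rejecting removed pages and non-HTML documents.

```python
# Hypothetical pre-flight check for example links: a link is only usable
# if it resolved (no 404) and serves plain HTML (PDF/DOC cannot be crawled).
def is_usable_example(status_code: int, content_type: str) -> bool:
    """Return True if a fetched example link can be submitted."""
    if status_code != 200:  # removed pages typically return 404
        return False
    # Only plain HTML pages are crawlable; reject application/pdf,
    # application/msword, and anything else.
    return content_type.split(";")[0].strip().lower() == "text/html"
```

In practice you would obtain the status code and content type from a HEAD request to each candidate link before submitting it.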


If the detected on-topic pages are not on the topic you are looking for, you need to provide more examples. If you are targeting specific entities or locations, read more about the machine learning we employ.

If the situation persists, adjust your crawl settings.


  1. Providing more example pages usually helps. A few examples are enough for very specific, unambiguous topics; 30 are enough for many topics; 50 should do the job for most topics; and 200 for topics that are not clearly defined. Most public crawl setups with full access have ~50-200 examples, even if their topics are well defined. Public topics with overview only were generated mostly from Quick SEO examples and may be ambiguous (the same term describing different things, e.g. 'Trial').
  2. If the above change does not yield satisfying results, edit the forbidden sites, forbidden atext patterns, and forbidden href patterns settings.
  3. As a last resort, specify content phrases. This can make a crawl very inefficient if the above steps are not done properly! If you want geolocation targeting in particular, forbid the few most common top-level domains, countries, and cities (those that appear at the top of the 'Most relevant domains' and 'Keyword ideas' tools) in the forbidden sites, forbidden href patterns, and forbidden atext patterns settings. This simple 'brute force' solution can produce MUCH better crawl results.
  4. Once a crawl has been running for a while, you will start seeing the most relevant domains. This is the best place to detect spam sites that occasionally beat our algorithms and reach the top. Click the 'pages from domain' link next to the domain stats of a suspicious site; if the pages look spammy, click 'REMOVE DOMAIN' in the 'DOMAIN LINKS' submenu.
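The forbidden-pattern filters in steps 2-3 can be pictured as a simple link gate. The setting names mirror the UI (forbidden sites, forbidden href patterns, forbidden atext patterns), but the data structures and sample entries below are assumptions for illustration only:

```python
import re

# Hypothetical filter lists; the real settings live in the crawl UI.
FORBIDDEN_SITES = {"spam-example.com"}                 # forbidden sites
FORBIDDEN_HREF_PATTERNS = [re.compile(r"\.pdf$")]      # forbidden href patterns
FORBIDDEN_ATEXT_PATTERNS = [re.compile(r"casino", re.I)]  # forbidden atext patterns

def link_allowed(domain: str, href: str, anchor_text: str) -> bool:
    """Return False if a candidate link matches any forbidden rule."""
    if domain in FORBIDDEN_SITES:
        return False
    if any(p.search(href) for p in FORBIDDEN_HREF_PATTERNS):
        return False
    if any(p.search(anchor_text) for p in FORBIDDEN_ATEXT_PATTERNS):
        return False
    return True
```

For geolocation targeting, the same mechanism is used with entries for common top-level domains, country names, and city names taken from the 'Most relevant domains' and 'Keyword ideas' tools.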

The crawl may stop before onTopicPages is reached if it is detected to be anomalous (too low a percentage of detected on-topic pages).

If you find the process above too complicated or time-consuming, you can have an expert set up your crawl.

© 2018-2021