Path-ascending crawlers and the robots.txt file

techie | June 8 - 2010

Sometimes we want a search engine's crawler to download as many resources as possible from a particular Web site. A path-ascending crawler does this by ascending to every path in each URL that it intends to crawl. For example, when given a seed URL of ****.org/a/b/page. , it will attempt to crawl /a/b/, /a/, and /.

Path-ascending crawlers are very effective at finding isolated resources, that is, resources for which no inbound link would have been found in regular crawling.
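The path-ascending step can be sketched in a few lines of Python. This is only an illustration, not how any particular search engine does it: the function name ascending_paths and the example.org URL are made up for the example, and the sketch assumes the last path segment of the seed URL is the page itself.

```python
from urllib.parse import urlsplit, urlunsplit


def ascending_paths(seed_url):
    """Return the ancestor directory URLs of seed_url, deepest first.

    Hypothetical helper: assumes the last path segment is the page itself.
    """
    parts = urlsplit(seed_url)
    # Split the path into segments and drop the final one (the page).
    segments = [s for s in parts.path.split("/") if s][:-1]
    urls = []
    # Walk from the deepest directory up to the site root.
    for depth in range(len(segments), -1, -1):
        path = "/" + "/".join(segments[:depth])
        if not path.endswith("/"):
            path += "/"
        # Query string and fragment are dropped on purpose.
        urls.append(urlunsplit((parts.scheme, parts.netloc, path, "", "")))
    return urls


if __name__ == "__main__":
    # example.org is a placeholder host, not the site from the post.
    for url in ascending_paths("http://example.org/a/b/page.html"):
        print(url)
```

A real crawler would add each of these URLs to its queue, skip the ones it has already visited, and still check them against the site's crawling rules before fetching.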

Targeted web crawling
The importance of a page to a crawler can be expressed as a function of the similarity of the page to a given query. A web crawler that tries to download only pages that are similar to each other is called a focused crawler or topical crawler. The main problem in focused crawling is that we would like to predict the similarity of a page's text to the query before actually downloading the page. One possible predictor is the anchor text of links. Another approach is to use the complete content of the pages already visited to infer the similarity between the driving query and the pages that have not been visited yet. The performance of focused crawling depends mostly on the richness of links in the specific topic being searched, and a focused crawler usually relies on a general Web search engine to provide starting points.

Most websites place a robots.txt file in their root directory to define which paths crawlers may visit. Its Disallow directive tells web crawlers not to visit the specified pages.
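To show how a crawler might honor the Disallow rules in robots.txt, here is a minimal sketch using Python's standard urllib.robotparser module. The robots.txt content, the "ExampleBot" user agent, and the example.org URLs are all invented for the illustration; a real crawler would download the file from the root of the site it is about to crawl.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content, as it might appear in a site's root directory.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /tmp/
"""


def build_parser(robots_text):
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


def allowed(parser, url, user_agent="ExampleBot"):
    """Return True if the rules let this (hypothetical) user agent fetch url."""
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    rules = build_parser(ROBOTS_TXT)
    for url in (
        "http://example.org/a/b/page.html",
        "http://example.org/private/page.html",
    ):
        print(url, "allowed" if allowed(rules, url) else "disallowed")
```

Parsing the text directly keeps the example runnable without a network connection. RobotFileParser can also fetch the file itself through set_url() and read().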
