Spider trap, a nightmare for information retrieval

A spider trap is a website that generates random pages whenever a web crawler visits it. According to Wikipedia, a spider trap is a set of web pages that may intentionally or unintentionally be used to cause a web crawler or search bot to make an infinite number of requests, or cause a poorly constructed crawler to crash. Such a trap generates pages that each contain many links, sometimes thousands on a single page. These links are valid, but only for a short time. If they end up in the search results built from the crawl, following them returns an “error 404 not found” because the pages are already gone.

This behaviour is costly for a web crawler, which spends a great deal of time indexing the site, and it puts heavy load on the server as well. A page in a spider trap can also contain a large number of words copied from a dictionary. A search engine may then judge the site relevant to almost any query and rank it higher than it deserves. In reality the page has only copied words out of a dictionary to make itself look relevant to every query.

A crawler can avoid a spider trap by not indexing a website the first time it finds it. Instead, the crawler does a quick scan to see how many pages the site has, then comes back a few days later to check how much the pages have changed. If the difference is large, we can fairly safely say the site is a spider trap and the crawler should avoid it; if not, the crawler can index it. As an extra precaution the crawler can also count the links on each page and use that count to decide whether it is looking at a spider trap. This rests on the assumption that a normal page contains fewer than a hundred or so links, so a page with many hundreds or thousands of links gets its site marked as a spider trap.
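The two checks above can be sketched in a few lines of Python. This is only a rough illustration, not how any real crawler does it: the function names, the use of Jaccard distance to measure how much the links changed between two visits, and both thresholds (MAX_LINKS and MAX_CHURN) are assumptions made up for this example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.add(value)


def extract_links(html):
    """Return the set of distinct link targets found in an HTML string."""
    parser = LinkCollector()
    parser.feed(html)
    return parser.links


def link_churn(old_links, new_links):
    """Fraction of links that differ between two visits (Jaccard distance)."""
    union = old_links | new_links
    if not union:
        return 0.0
    return 1.0 - len(old_links & new_links) / len(union)


# Hypothetical thresholds, chosen only for illustration.
MAX_LINKS = 1000   # more distinct links than this on one page is suspicious
MAX_CHURN = 0.8    # more than 80% of links changed between visits is suspicious


def looks_like_trap(first_html, second_html):
    """Compare two snapshots of the same page taken a few days apart."""
    first = extract_links(first_html)
    second = extract_links(second_html)
    too_many = max(len(first), len(second)) > MAX_LINKS
    too_volatile = link_churn(first, second) > MAX_CHURN
    return too_many or too_volatile
```

The crawler would keep the HTML from the first quick scan, fetch the same page again a few days later, and pass both copies to looks_like_trap. The thresholds would need tuning, since legitimate pages such as sitemaps or large directories can also carry a lot of links.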

So if someone asks what the enemy of information retrieval is, the spider trap would be a great answer.


