How to create a web spider/crawler?

I'm wondering why this discussion hasn't come up on CE yet. Can anyone here talk about creating our own Web-Spider (crawler)?

I bet it would be an interesting discussion.

Replies

  • sookie
    Overview: A web crawler works by parsing a web page and noting any hyperlinks on it that point to other web pages. It then parses those pages in the same way, and so on recursively, and the pages it collects are indexed in sequence. A crawler is an important part of any search engine. It is an automated program; in its simplest form it runs on a single computer and is started whenever a URL is passed to it.
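To make the "noting down hyperlinks" part concrete, here is a minimal sketch in Python using only the standard library. The names `LinkCollector` and `extract_links` are made up for illustration, and it assumes links only matter when they appear as `href` attributes on `<a>` tags.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag being opened.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Turn relative links such as "/about" into absolute URLs.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return every hyperlink found in html as an absolute URL."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links
```

Calling `extract_links(page_text, page_url)` on a downloaded page gives the list of URLs the crawler would visit next.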

    Creating a web spider/crawler [pseudo code]

    Step # 1: Get the user's input: the starting URL [optional: content type].
    Step # 2: Add the starting URL to a list of URLs to visit.
    Step # 3: While the list created in the step above is not empty, repeat Step # 4 to Step # 6 for each URL in the list.
    Step # 4: Get the web page for that URL and check whether it is of the content type specified in Step # 1.
    Step # 5: Check that the page is not in any blocked list [optional].
    Step # 6: If the checks in Step # 4 and Step # 5 pass, search the page for any other hyperlinks. Add each one that is not already in the list, then go back to Step # 3 and continue in the same way for all URLs.

    This is the basic flow of a web crawler program. Further refinements can be added as requirements demand.
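The steps above can be sketched roughly as follows in Python, standard library only. This is one possible reading of the pseudo code, not a definitive implementation. The names `crawl`, `fetch_page`, `max_pages` and the `fetch` parameter are my own additions, the blocked list is assumed to be a list of substrings to look for in a URL, and the "recursion" is done with a queue (breadth-first) rather than actual recursive calls.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin
from urllib.request import urlopen


class _Links(HTMLParser):
    """Gathers the raw href values of all <a> tags on a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.hrefs.extend(v for k, v in attrs if k == "href" and v)


def fetch_page(url):
    """Download a page and return (content_type, text)."""
    with urlopen(url, timeout=10) as response:
        content_type = response.headers.get_content_type()
        charset = response.headers.get_content_charset() or "utf-8"
        return content_type, response.read().decode(charset, errors="replace")


def crawl(start_url, content_type="text/html", blocked=(), max_pages=50,
          fetch=fetch_page):
    """Return the URLs crawled, in the order they were visited."""
    to_visit = deque([start_url])      # Step 2: list of URLs to visit
    seen = {start_url}                 # avoids queueing the same URL twice
    crawled = []
    # Step 3: loop while the list is not empty (max_pages is a safety cap).
    while to_visit and len(crawled) < max_pages:
        url = to_visit.popleft()
        if any(pattern in url for pattern in blocked):   # Step 5
            continue
        try:
            page_type, text = fetch(url)                 # Step 4
        except OSError:
            continue                   # unreachable page: skip it
        if page_type != content_type:
            continue
        crawled.append(url)
        parser = _Links()
        parser.feed(text)
        for href in parser.hrefs:                        # Step 6
            link = urldefrag(urljoin(url, href)).url     # absolute, no #fragment
            if link.startswith(("http://", "https://")) and link not in seen:
                seen.add(link)
                to_visit.append(link)
    return crawled
```

For example, `crawl("http://example.com/", blocked=["/private/"], max_pages=20)` would visit up to 20 HTML pages reachable from that start page. Passing the download function in as `fetch` is just a convenience so the loop can be tried on canned pages without touching the network.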

    Hope this was of some use. Feel free to correct it (or add more information).

    Thanks!
