Building a Search Engine

This one is purely out of personal curiosity. What's involved in building a simple web search engine?

Let's say we have a network of 1000 (or 10?) terminals, each hosting a set of 5 web pages containing images, videos and, of course, text. How would you go about building a search engine to index all the content available on such a network? Specifically, I'd like to know your:

1. Approach
2. Choice of technology (with reasons)
3. Indexing methodology
4. Technique to improve quality of results

Any takers?

Replies

  • MaRo
    This is most of my work right now.

    As you may know, we're indexing media content, so we're developing some freakishly different indexing techniques.

    For text, we're researching several indexing techniques:

    1- Google's technique.
    2- Keyword analysis: the spider is supposed to understand the content of a single article and rank the page according to previously saved keywords.

    That means the spider knows, for example, the keyword 'Engineer' and has a network of keywords related to that word. It crawls for the topic's keywords, analyzes the contents of the pages it finds, and ranks them according to their relevance. For example, an article about vacancies for engineers will rank lower than one about engineers inventing a tool with technical details.
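
A minimal sketch of the keyword-analysis idea described above, assuming a hand-made keyword network. The weights, the example URLs and the function names are all hypothetical and only illustrate how an article about an invention could outrank a vacancy notice; this is not the actual spider.

```python
import re
from collections import Counter

# Hypothetical keyword network for the topic "engineer":
# each related word carries a weight (all values are made up).
KEYWORD_NETWORK = {
    "engineer": 1.0,
    "invented": 2.0,
    "tool": 2.0,
    "technical": 1.5,
    "design": 1.5,
    "vacancy": 0.2,
    "hiring": 0.2,
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def relevance(text, network=KEYWORD_NETWORK):
    """Sum the weights of matching words, normalized by page length."""
    words = tokenize(text)
    if not words:
        return 0.0
    counts = Counter(words)
    score = sum(weight * counts[word] for word, weight in network.items())
    return score / len(words)


def rank_pages(pages, network=KEYWORD_NETWORK):
    """Return page URLs ordered from most to least relevant."""
    return sorted(pages, key=lambda url: relevance(pages[url], network), reverse=True)


# Two made-up pages: a job advert and an article with technical details.
pages = {
    "jobs.example/vacancy": "Engineer vacancy: we are hiring an engineer for our office",
    "news.example/tool": "An engineer invented a tool; the technical design is described here",
}
ranked = rank_pages(pages)
```

Here `ranked` puts the invention article ahead of the vacancy page, because the words it shares with the network carry higher weights.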
  • Kaustubh Katdare
    MaRo - I bet you know plenty about search engines. I look forward to your posts in this thread.
  • MaRo
    I'm not very good at writing, so I'd rather take questions, please.
  • Kaustubh Katdare
    I'm surprised! No one wants to discuss this? Looks like we should rename this section to the Computer Troubleshooting section.
  • Mahesh Dahale
    Technology: ASP.NET supports searching files using the Windows Indexing Service; Microsoft .NET Framework or above is required. The code is in C#. Building a word index for a website by using a web crawler is also not dependent on the underlying technology used on the website.

    Methods like SetQuery(query as string), GetSearchResults(),

    I'm still thinking about the indexing methodology and a web crawling product - The Website Utility.
  • Ashraf HZ
    Biggie, can't you see MaRo is hinting for a Small Talk? 😉
  • MaRo
    The problem is not with indexing, Mahesh; for search engines the problem lies in ranking the cached results.

    @Ash: yeah, I'd love to, but not now. If Google goes down, I'll deserve one.
  • Mahesh Dahale
    OK MaRo sir, thanks, but an index is meant to optimize speed and performance in finding relevant documents for a search query. Can you explain in detail?
  • MaRo
    Spiders fetch web pages, pick out the keywords that appear significant to them, and index the pages against those keywords.

    They also use, to a small degree, the HTML meta tag keywords. Search engines have now added an auto-complete keywords feature, which gets results from the index much faster.
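
To make "indexing pages against keywords" concrete, here is a small sketch of an inverted index with a naive prefix lookup of the kind an auto-complete feature could use. The page texts, URLs and function names are assumptions for illustration; a real engine would also weigh which keywords are significant.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Map each keyword to the set of page URLs that contain it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in tokenize(text):
            index[word].add(url)
    return index


def search(index, query):
    """Return the pages that contain every word of the query."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


def autocomplete(index, prefix):
    """Return the indexed keywords that start with the given prefix."""
    prefix = prefix.lower()
    return sorted(word for word in index if word.startswith(prefix))


# Made-up pages standing in for what a spider has fetched.
pages = {
    "a.example/1": "Search engines index web pages",
    "a.example/2": "A spider crawls web pages",
    "b.example/1": "Engineers build search engines",
}
index = build_index(pages)
hits = search(index, "search engines")
suggestions = autocomplete(index, "en")
```

A query is answered by intersecting the URL sets of its words, so the pages themselves are never scanned at search time.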
  • Mahesh Dahale
    Thank you, sir.
  • Mahesh Dahale
    Sir, I need to connect to a given search engine and retrieve that search engine's HTML page.
    I am using Java. Any idea how?
  • Kaustubh Katdare
    MaRo wrote: "Spiders fetch web pages, pick out the keywords that appear significant to them, and index the pages against those keywords.

    They also use, to a small degree, the HTML meta tag keywords. Search engines have now added an auto-complete keywords feature, which gets results from the index much faster."

    Any new algorithm suggestions for ranking the pages? Or are quality inlinks the only way we can achieve better search results?
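
For reference, ranking by inlinks can be sketched as a simplified PageRank-style iteration. The link graph, the damping value and the function name below are assumptions for illustration, and this is not how any particular search engine actually ranks pages.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iteratively share each page's rank among the pages it links to."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each round with a small baseline rank.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page without outlinks spreads its rank over all pages.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Made-up link graph: page -> pages it links to.
links = {
    "home": ["about", "blog"],
    "about": ["home"],
    "blog": ["home", "about"],
    "orphan": ["home"],
}
ranks = pagerank(links)
best = max(ranks, key=ranks.get)
```

In this toy graph "home" collects the most inlinks and ends up with the highest rank, while "orphan", which nothing links to, keeps only the baseline.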
  • MaRo
    @mahesh: #-Link-Snipped-#

    You have to take the link generated by searching on the search engine you like. For example, this is the link generated by googling "engineer": engineer - Google Search. You then replace the word "engineer" with a string variable, taking care of spaces.

    @Big K: I think the keyword analysis algorithm, even if it is not faster than the present algorithm, will differ in the number of relevant results.
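
The URL trick above can be sketched as follows, in Python rather than Java. The base URL `https://search.example/search`, the parameter name `q` and both function names are placeholders, not the snipped link; the point is only that the query string is URL-encoded so spaces and special characters survive.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_search_url(base_url, query, param="q"):
    """Append the URL-encoded query to the search page's base URL."""
    return base_url + "?" + urlencode({param: query})


def fetch_html(url, timeout=10):
    """Download the page at the URL and return its HTML as text."""
    request = Request(url, headers={"User-Agent": "example-fetcher/0.1"})
    with urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


# Hypothetical search endpoint; the space becomes "+" in the URL.
url = build_search_url("https://search.example/search", "mechanical engineer")
```

Calling `fetch_html(url)` against a real search page would then return the result page's HTML; it is left uncalled here because the endpoint is made up. In Java the same encoding step is what `java.net.URLEncoder.encode` is for.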
  • Mahesh Dahale
    MaRo wrote: "@mahesh: #-Link-Snipped-#

    You have to take the link generated by searching on the search engine you like. For example, this is the link generated by googling "engineer": engineer - Google Search. You then replace the word "engineer" with a string variable, taking care of spaces."

    Why does so much data remain largely hidden from users, placed behind forms or web service interfaces (the deep web)?
    A study by BrightPlanet shows the hidden Web contains 7,500 terabytes of information and is 400 to 500 times larger than the visible Web.
  • clarence456
    Hi! Thanks for the informative post. It is one of the posts that impressed me a lot, and I'd like to create a search engine of my own, so it is very useful. The tips are among the best. Keep up the nice work.
  • ONKSSSSS
    Good, MaRo, but give some more info...
    Here's what I know:
    I have no time to explain it myself, but google 'WHAT IS GOOGLEBOT?'
    I hope it will help Big learn more about his topic...
    Until then, keep posting your findings...
  • MaRo
    Googlebot is the spider, the software Google relies on to crawl the Internet.
  • Mahesh Dahale
    Suppose I launch a new website: when does it get crawled by Google?
    And once crawling and updating have started, what happens if I change the URL? 😒
  • MaRo
    Google has URL submission, which puts your website on their to-do list.

    #-Link-Snipped-#
  • Kaustubh Katdare
    This discussion is going from building a search engine to 'what is a search engine'. Are my expectations of CEans too high?
  • Mahesh Dahale
    MaRo sir, today I found an article that says:

    Google crawls the Web at varying depths and on more than one schedule. The so-called deep crawl occurs roughly once a month. This extensive reconnaissance of Web content requires more than a week to complete and an undisclosed length of time after completion to build the results into the index. For this reason, it can take up to six weeks for a new page to appear in Google. Brand new sites at new domain addresses that have never been crawled before might not even be indexed at first.
  • Manish Goyal
    Yeah, you are right, Mahesh.
    So far there is no such site on which deep crawls occur...
    But I have read in a magazine that researchers are trying to develop such websites.
    Correct me if I am wrong.
  • madhumurundi
    Hi,
    Search engines work on the following techniques:
    1. Page ranking: the most active, most visited sites are considered Rank 1 pages. Whenever a user enters a query in the search box, the query first passes to the query optimizer, which looks up whether the requested page is among the most frequently accessed. If it is, it appears on the first page of results.
    Using certain techniques we can optimize the speed of a search engine. Some of them are:
    1. Modeling Score Distributions for Combining the Outputs of Search Engines

    2. Web Crawling
  • ONKSSSSS
    Yes, I agree with all of this. But concepts like web crawlers, paging, indexing, etc. are things I think every CSE knows. HOW TO BUILD IT is the question! (Hope Big will be satisfied after that.)
  • Kaustubh Katdare
    I'm convinced my expectations of the members are a bit too high. This discussion was meant to be about building a search engine, not "What is a search engine".
