CrazyEngineers
  • Building a Search Engine

    Kaustubh Katdare

    Administrator

    Updated: Oct 26, 2024
    Views: 1.1K
    This one is out of my own, personal curiosity. What's involved in building a simple web search engine?

    Let's say we have a network of 1000 (or 10?) terminals, each hosting a set of 5 web pages containing images, videos and, of course, text. How would you go about building a search engine to index all the content available on such a network? Specifically, I'd like to know your:

    1. Approach
    2. Choice of technology (with reasons)
    3. Indexing methodology
    4. Technique to improve quality of results

    Any takers?
Replies
  • MaRo

    Member · Oct 3, 2009

    This is most of my work now 😁

    As you may know, we're indexing media content, so we're building freakishly different indexing techniques.

    For text, we're researching several indexing techniques:

    1- Google's technique.
    2- Keyword analysis: the spider is supposed to understand the content of a single article and rank the page according to previously saved keywords.

    That means the spider knows, for example, the keyword 'Engineer' and has a network of keywords related to that word. It crawls for the topic's keywords, analyzes the contents of the pages it finds, and ranks them by relevance. For example, an article about engineering job vacancies would rank lower than an article about engineers inventing a tool, with technical details.
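A minimal sketch of the keyword-analysis idea described above, under stated assumptions: the keyword network, its weights, and the sample pages are all hypothetical, and the score is a simple weighted term frequency rather than the poster's actual method.

```python
import re
from collections import Counter

# Hypothetical keyword network: each related term carries a weight that says
# how strongly it signals a technical article about the topic.
KEYWORD_NETWORK = {
    "engineer": {
        "engineer": 1.0, "engineers": 1.0, "tool": 2.0, "invented": 2.0,
        "technical": 2.0, "design": 1.5, "vacancy": 0.2,
    },
}

def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())

def relevance(text, topic):
    """Weighted keyword count, normalised by the page length in words."""
    weights = KEYWORD_NETWORK[topic]
    counts = Counter(tokenize(text))
    total = sum(counts.values())
    if total == 0:
        return 0.0
    score = sum(weights.get(word, 0.0) * n for word, n in counts.items())
    return score / total

# Two made-up pages: a job advert and a technical article.
pages = {
    "jobs": "Engineers wanted: vacancy open, apply now",
    "invention": "Engineers invented a tool; technical design details inside",
}

# Highest relevance first.
ranked = sorted(pages, key=lambda name: relevance(pages[name], "engineer"), reverse=True)
print(ranked)
```

With these made-up weights the technical article outranks the vacancy notice, which matches the behaviour the post describes. Real weights would have to come from analysing a corpus, not from a hand-written table.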
  • Kaustubh Katdare

    Administrator · Oct 3, 2009

    MaRo - I bet you know plenty about search engines. I look forward to your posts in this thread.
  • MaRo

    Member · Oct 3, 2009

    I'm not very good at writing, so I'd rather get questions, please.
  • Kaustubh Katdare

    Administrator · Oct 4, 2009

    I'm surprised! No one wants to discuss this? Looks like we should rename this section to the Computer Troubleshooting section.
  • Mahesh Dahale

    Member · Oct 4, 2009

    Technology: ASP.NET supports searching files using the Windows Indexing Service; Microsoft .NET Framework or above is required. The code is in C#. Alternatively, building a word index for a website with a web crawler does not depend on the underlying technology used on the website.

    It would expose methods like SetQuery(query as string) and GetSearchResults().

    I'm still thinking about the indexing methodology and a web crawling product (The Website Utility).
  • Ashraf HZ

    Member · Oct 4, 2009

    Biggie, can't you see MaRo is hinting for a Small Talk? 😉
  • MaRo

    Member · Oct 4, 2009

    The problem is not with indexing, Mahesh; for search engines the problem lies in ranking the cached results.

    @Ash: yeah, I'd love to, but not now. If Google goes down, I'll deserve one 😁
  • Mahesh Dahale

    Member · Oct 4, 2009

    OK MaRo sir, thanks, but the purpose of an index is to optimize speed and performance in finding relevant documents for a search query. Can you explain in detail?
  • MaRo

    Member · Oct 4, 2009

    Spiders fetch webpages, pick out the keywords that appear significant in them, and index the webpages against those keywords.

    They also use, to a small extent, the keywords in the HTML meta tag. Search engines have now added keyword auto-complete, which gets results from the index much faster.
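Indexing pages against keywords is usually done with an inverted index. Here is a rough sketch, assuming tiny in-memory pages and made-up URLs; it indexes every word rather than only the "significant" ones, and the prefix lookup is just one simple way auto-complete could sit on top of the same index.

```python
import re
from collections import defaultdict

def build_index(pages):
    """Map each word to the set of page URLs that contain it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(url)
    return index

def search(index, query):
    """Return the URLs that contain every word of the query (AND semantics)."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result

def complete(index, prefix):
    """Suggest indexed keywords that start with the typed prefix."""
    return sorted(word for word in index if word.startswith(prefix.lower()))

# Hypothetical pages standing in for what a spider fetched.
pages = {
    "http://a.example/1": "Search engines index web pages",
    "http://b.example/2": "An engineer builds engines",
}
index = build_index(pages)

print(search(index, "index engines"))
print(complete(index, "eng"))
```

A lookup touches only the sets for the query words, so the cost depends on the query and not on how many pages were crawled. That is the speed benefit of the index mentioned earlier in the thread.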
  • Mahesh Dahale

    Member · Oct 4, 2009

    Thank you sir
  • Mahesh Dahale

    Member · Oct 4, 2009

    Sir, I need to connect to a given search engine and retrieve the HTML page it returns.
    I am using Java. Any idea how?
  • Kaustubh Katdare

    Administrator · Oct 4, 2009

    MaRo
    Spiders fetch webpages, pick out the keywords that appear significant in them, and index the webpages against those keywords.

    They also use, to a small extent, the keywords in the HTML meta tag. Search engines have now added keyword auto-complete, which gets results from the index much faster.
    Any new algorithm suggestions for ranking the pages? Or are quality inlinks the only way we can achieve better search results?
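For reference, ranking by inlinks is commonly illustrated with the simplified, textbook form of PageRank. The sketch below assumes a hypothetical four-page link graph and the conventional damping factor of 0.85; it is not a description of what any production search engine runs.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by power iteration.

    links maps each page to the list of pages it links to.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page gets a small baseline share (the "random jump").
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its rank evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page without outlinks spreads its rank over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank

# Hypothetical link graph: a, b and d all link to c; c links back to a.
links = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(links)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

Page c comes out on top because three pages link to it, and a comes second because its single inlink is from the highly ranked c. In that sense the quality of an inlink matters, not just the count.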
  • MaRo

    Member · Oct 5, 2009

    @mahesh : #-Link-Snipped-#

    You have to get the link generated by searching on the search engine you like. For example, this is the link generated by googling "engineer": https://www.google.com/search?hl=en&source=hp&q=engineer&btnG=Google+Search&aq=f&oq=&aqi=g10 (engineer - Google Search). Replace the word engineer with a string variable, taking care to encode spaces.

    @Big K: I think the keyword analysis algorithm, even if it's not faster than the present algorithm, will differ in the number of relevant results.
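A small sketch of the "replace the word with a string variable" step from the search link above. It is in Python for brevity; in Java, java.net.URLEncoder.encode does the same encoding job. Keeping only the hl and q parameters is an assumption that the other parameters in the sample link are optional, and build_search_url is a hypothetical helper name.

```python
from urllib.parse import urlencode

def build_search_url(query, base="https://www.google.com/search"):
    """Build a search URL with the query safely encoded.

    urlencode turns spaces into '+' and escapes characters such as '&',
    so the query cannot break the URL.
    """
    params = {"hl": "en", "q": query}
    return base + "?" + urlencode(params)

url = build_search_url("search engine")
print(url)
```

Fetching the page is then an ordinary HTTP GET on that URL (for example with urllib.request.urlopen), which is not run here.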
  • Mahesh Dahale

    Member · Oct 7, 2009

    MaRo
    @mahesh : #-Link-Snipped-#

    You have to get the link generated by searching on the search engine you like. For example, this is the link generated by googling "engineer": https://www.google.com/search?hl=en&source=hp&q=engineer&btnG=Google+Search&aq=f&oq=&aqi=g10 (engineer - Google Search). Replace the word engineer with a string variable, taking care to encode spaces.
    Why does so much data remain hidden from users behind forms or Web service interfaces (the deep web)?
    A study by BrightPlanet shows the hidden Web contains 7,500 terabytes of information and is 400 to 550 times larger than the visible Web.
  • clarence456

    Member · Oct 7, 2009

    Hi! Thanks for the informative post. This is one of the posts that impressed me a lot, and I'd like to create a search engine of my own, so it's very useful. The tips are among the best. Keep up the nice work.




  • ONKSSSSS

    Member · Oct 8, 2009

    Good, MaRo, but give some more info...
    Here's what I know:
    I have no time to explain it myself, but google 'WHAT IS GOOGLEBOT?'
    I hope it helps Big K learn more about his topic...
    Till then, keep posting your findings...
  • MaRo

    Member · Oct 8, 2009

    Googlebot is the spider, the software Google relies on to crawl the Internet.
  • Mahesh Dahale

    Member · Oct 8, 2009

    Suppose I launch a new website: when does it get crawled by Google?
    And once crawling and updating have started, what happens if I change a URL? 😒
  • MaRo

    Member · Oct 8, 2009

    Google has URL submission, which puts your website on their to-do list.

    #-Link-Snipped-#
  • Kaustubh Katdare

    Administrator · Oct 8, 2009

    This discussion is going from building a search engine to 'what is a search engine'. Do I expect too much from CEans?
  • Mahesh Dahale

    Member · Oct 8, 2009

    MaRo sir, today I found an article that says:

    Google crawls the Web at varying depths and on more than one schedule. The so-called deep crawl occurs roughly once a month. This extensive reconnaissance of Web content requires more than a week to complete, plus an undisclosed length of time after completion to build the results into the index. For this reason, it can take up to six weeks for a new page to appear in Google. Brand new sites at new domain addresses that have never been crawled before might not even be indexed at first.
  • Manish Goyal

    Member · Oct 8, 2009

    Yeah, you are right, Mahesh.
    There is no such site so far in which deep crawls occur...
    But I have read in a magazine that researchers are trying to develop such websites.
    Correct me if I am wrong.
  • madhumurundi

    Member · Oct 12, 2009

    Hi,
    A search engine works on the following techniques:
    1. Page ranking: the most active, most visited sites are considered rank 1 pages. Whenever a user enters a query in the search box, the query first passes to the query optimizer, which looks up whether the requested page is among the most frequently accessed; if it is, it appears on the first page of results.
    Some techniques can also optimize the speed of a search engine, for example:
    1. Modeling Score Distributions for Combining the Outputs of Search Engines

    2. Web Crawling
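Since web crawling keeps coming up, here is a rough breadth-first crawler sketch. To stay runnable without a network, it crawls a hypothetical in-memory "web" (FAKE_WEB, with made-up example hosts) through a fetch function; a real crawler would swap in an HTTP fetch and add politeness rules such as robots.txt handling and rate limiting.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect the href value of every <a> tag in a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed, fetch, max_pages=100):
    """Breadth-first crawl from seed; fetch(url) returns HTML or None."""
    seen = {seed}
    queue = deque([seed])
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:  # unreachable page: skip it
            continue
        pages[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            target = urljoin(url, href)  # resolve relative links
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return pages

# Hypothetical network of terminals, stored in memory instead of fetched.
FAKE_WEB = {
    "http://t1.example/": '<a href="/about.html">About</a> <a href="http://t2.example/">T2</a>',
    "http://t1.example/about.html": '<a href="/">Home</a>',
    "http://t2.example/": '<a href="http://t3.example/missing.html">broken</a>',
}

crawled = crawl("http://t1.example/", FAKE_WEB.get)
print(list(crawled))
```

The seen set is what keeps the crawler from looping when pages link back to each other, and max_pages bounds the crawl on a larger network.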
  • ONKSSSSS

    Member · Oct 14, 2009

    Yes, I agree with all of that. But concepts like web crawlers, paging, indexing, etc. are things I think every CSE knows. The question is HOW TO BUILD IT??? (Hope Big K will be satisfied after that.)
  • Kaustubh Katdare

    Administrator · Oct 14, 2009

    I'm convinced my expectations of the members are a bit too high. This discussion was meant to be about building a search engine, not "What is a search engine".