Building a Search Engine
This one is out of my own personal curiosity. What's involved in building a simple web search engine?
Let's say we have a network of 1000 (or 10?) terminals, each hosting a set of 5 web pages containing images, videos and, of course, text. How would you go about building a search engine to index all the content available on such a network? Specifically, I'd like to know what you would choose for each of these:
1. Approach
2. Choice of technology (with reasons)
3. Indexing methodology
4. Technique to improve quality of results
Any takers?
Replies
-
MaRo
This is most of my work now.
As you may know, we're indexing media content, so we're developing very different indexing techniques.
For text, we're researching several indexing techniques:
1- Google's technique.
2- Keyword analysis: the spider is supposed to understand the content of a single article and rank the page against previously saved keywords.
That means the spider knows, for example, the keyword 'Engineer' and has a network of keywords related to it. It crawls for the topic's keyword, analyzes the contents of the pages it finds, and ranks them by relevance. For example, an article about engineers having a vacancy would rank lower than one about engineers inventing a tool, with technical details.
-
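To make the keyword-analysis idea above a bit more concrete, here is a minimal Python sketch. It assumes a hand-built map from a topic keyword to weighted related terms. The weights, the term list and the function names (`relevance`, `rank`) are made up for illustration and are not taken from MaRo's actual system.

```python
import re
from collections import Counter

# Hypothetical keyword network: topic keyword -> related terms with weights.
KEYWORD_NETWORK = {
    "engineer": {
        "engineer": 1.0,
        "engineers": 1.0,
        "tool": 0.8,
        "invented": 0.8,
        "technical": 0.7,
        "design": 0.6,
        "vacancy": 0.1,
    },
}

def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())

def relevance(text, keyword):
    """Sum the weights of related terms found in the text, divided by its length."""
    weights = KEYWORD_NETWORK.get(keyword, {})
    words = tokenize(text)
    if not words:
        return 0.0
    counts = Counter(words)
    score = sum(weights.get(word, 0.0) * n for word, n in counts.items())
    return score / len(words)

def rank(pages, keyword):
    """Return page names sorted from most to least relevant."""
    return sorted(pages, key=lambda name: relevance(pages[name], keyword), reverse=True)

pages = {
    "vacancy": "Engineers wanted: vacancy open at our office",
    "invention": "Engineers invented a tool, with technical design details",
}
print(rank(pages, "engineer"))
```

With these made-up weights, the page about the invention scores higher than the vacancy notice, which matches the ordering described in the post. A real system would have to learn or curate the keyword network rather than hard-code it.
-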
Kaustubh Katdare
MaRo - I bet you know plenty about search engines. I look forward to your posts in this thread.
-
MaRo
I'm not very good at writing, so I'd rather take questions, please.
-
Kaustubh Katdare
I'm surprised! No one wants to discuss this? Looks like we should rename this section to the Computer Troubleshooting section.
-
Mahesh Dahale
Technology: ASP.NET supports searching files using the Windows Indexing Service; Microsoft .NET Framework or above is required. The code is in C#. Building a word index for a website with a web crawler is another option, and it does not depend on the underlying technology used on the website.
It would have methods like SetQuery(query as string) and GetSearchResults().
I'm still thinking about the indexing methodology and a web crawling product - The Website Utility.
-
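The crawler part of the post above can be sketched roughly like this in Python. It is only an illustration: the in-memory `FAKE_WEB` dictionary and the host name `host1` are made up so the example runs without a network, and `crawl` and `LinkAndTextParser` are hypothetical names. On a real network, the `fetch` argument would be a function that downloads a page over HTTP.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkAndTextParser(HTMLParser):
    """Collects href targets of <a> tags and the text of a page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    def handle_data(self, data):
        self.text_parts.append(data)

def crawl(start_url, fetch, max_pages=100):
    """Breadth-first crawl; returns {url: page text} for every page reached."""
    seen = {start_url}
    queue = deque([start_url])
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:  # page missing or unreachable
            continue
        parser = LinkAndTextParser()
        parser.feed(html)
        # Collapse runs of whitespace into single spaces.
        pages[url] = " ".join(" ".join(parser.text_parts).split())
        for href in parser.links:
            target = urljoin(url, href)  # resolve relative links
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return pages

# Hypothetical three-page site held in memory instead of on a network.
FAKE_WEB = {
    "http://host1/index.html": '<p>Home</p><a href="a.html">A</a><a href="b.html">B</a>',
    "http://host1/a.html": '<p>Page A</p><a href="index.html">back</a>',
    "http://host1/b.html": "<p>Page B</p>",
}

pages = crawl("http://host1/index.html", FAKE_WEB.get)
print(sorted(pages))
```

The `seen` set keeps the crawler from fetching the same URL twice, and the returned text per URL is what an indexer would consume next. Images and videos would need their own handling, which this sketch leaves out.
-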
Ashraf HZ
Biggie, can't you see MaRo is hinting at a Small Talk?
-
MaRo
The problem isn't with indexing, Mahesh; for search engines, the problem lies in ranking the cached results.
@Ash: yeah, I'd love to, but not now. If Google goes down, I'll deserve one.
-
Mahesh Dahale
OK MaRo sir, thanks, but the point of an index is to optimize speed and performance in finding relevant documents for a search query. Can you explain in detail?
-
MaRo
Spiders fetch webpages, pick out the keywords that appear significant in them, and index the webpages against those keywords.
They also use, to a small extent, the HTML meta tag keywords. Search engines have now added keyword auto-complete, which gets results from the index much faster.
-
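Indexing pages against keywords, as described above, is usually done with an inverted index. Below is a minimal Python sketch, assuming plain page text has already been extracted. The sample pages and the function names (`build_index`, `search`, `autocomplete`) are invented for the example, and the auto-complete here is a simple prefix lookup over the indexed words, not how any particular search engine implements it.

```python
import re
from bisect import bisect_left
from collections import defaultdict

def build_index(pages):
    """Map each word to the set of page URLs that contain it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(url)
    return index

def search(index, query):
    """Return the URLs containing every word of the query (AND search)."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result

def autocomplete(vocabulary, prefix):
    """Return indexed words starting with prefix; vocabulary must be sorted."""
    start = bisect_left(vocabulary, prefix)
    matches = []
    for word in vocabulary[start:]:
        if not word.startswith(prefix):
            break
        matches.append(word)
    return matches

pages = {
    "page1": "Search engines index web pages",
    "page2": "An engineer builds an engine",
    "page3": "Web pages link to other pages",
}
index = build_index(pages)
vocabulary = sorted(index)
print(search(index, "web pages"))
print(autocomplete(vocabulary, "eng"))
```

A query then only touches the sets for its own words instead of scanning every page, which is where the speed comes from. Ranking the matching pages is a separate step on top of this lookup.
-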
Mahesh Dahale
Thank you, sir.
-
Mahesh Dahale
Sir, I need to connect to a given search engine and retrieve its HTML results page. I am using Java. Any idea how?
-
Kaustubh Katdare
Any new algorithm suggestions for ranking the pages? Or are quality inlinks the only way we can achieve better search results?
-
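For reference, ranking by inlinks can be sketched as a simplified PageRank-style power iteration. This is a textbook-style simplification in Python, not Google's production algorithm. The four-page link graph, the damping value of 0.85 and the function name `link_rank` are assumptions chosen for the example.

```python
def link_rank(links, damping=0.85, iterations=50):
    """Simplified PageRank-style score computed by power iteration.

    links maps each page to the list of pages it links to.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each round with a small baseline score.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's score evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling page: spread its score evenly over all pages.
                for other in pages:
                    new_rank[other] += damping * rank[page] / n
        rank = new_rank
    return rank

# Hypothetical link graph: three pages link to "c", and "c" links to "a".
links = {
    "a": ["c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
}
scores = link_rank(links)
best = max(scores, key=scores.get)
print(best)
```

A page scores well when many pages link to it, and more so when those pages score well themselves, which is one way to read "quality inlinks". In practice such a score would be combined with text relevance rather than used alone.
-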
MaRo
@mahesh: #-Link-Snipped-#
You have to take the link generated by searching on the search engine you like. For example, this is the link generated from googling "engineer": engineer - Google Search. Replace the word "engineer" with a string variable, taking care of spaces.
@Big K: I think the keyword analysis algorithm, even if it isn't faster than the present algorithm, will differ in the number of relevant results.
-
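Replacing the search word with a string variable while handling spaces comes down to URL-encoding the query. A minimal Python sketch follows. The base URL `https://www.example.com/search` and the parameter name `q` are placeholders, so check the link your chosen search engine actually generates. In Java, `java.net.URLEncoder.encode` plays the same role.

```python
from urllib.parse import urlencode

def build_search_url(base_url, query, param="q"):
    """Append the query as a URL-encoded parameter; spaces become '+'."""
    return base_url + "?" + urlencode({param: query})

# Placeholder base URL; substitute the search engine's real results URL.
url = build_search_url("https://www.example.com/search", "search engine")
print(url)
```

Retrieving the HTML page is then an ordinary HTTP GET on the resulting URL.
-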
Mahesh Dahale
Why does so much data remain hidden from users behind forms or Web service interfaces (the deep web)?
A study by BrightPlanet shows the hidden Web contains 7,500 terabytes of information and is 400 to 500 times larger than the visible Web.
-
clarence456
Hi! Thanks for the informative post. It impressed me a lot, and since I'd like to create a search engine of my own, it's very useful. The tips are great. Keep up the nice work.
-
ONKSSSSS
Good, MaRo, but give some more info...
Here's what I know:
I have no time to explain it myself, but google 'WHAT IS GOOGLEBOT?'
I hope it helps Big K learn more about his topic.
Till then, keep posting your findings.
-
MaRo
Googlebot is the spider, the software Google relies on to crawl the Internet.
-
Mahesh Dahale
Suppose I launch a new website: when does it get crawled by Google?
And once crawling and updating have started, what happens if I change a URL?
-
MaRo
Google has URL submission, which puts your website on their to-do list.
#-Link-Snipped-#
-
Kaustubh Katdare
This discussion is going from building a search engine to 'what is a search engine'. Are my expectations of CEans too high?
-
Mahesh Dahale
MaRo sir, today I found an article that says:
Google crawls the Web at varying depths and on more than one schedule. The so-called deep crawl occurs roughly once a month. This extensive reconnaissance of Web content requires more than a week to complete and an undisclosed length of time after completion to build the results into the index. For this reason, it can take up to six weeks for a new page to appear in Google. Brand new sites at new domain addresses that have never been crawled before might not even be indexed at first.
-
Manish Goyal
Yeah, you are right, Mahesh.
There is no such site until now in which deep crawls occur.
But I have read in a magazine that researchers are trying to develop such websites.
Correct me if I am wrong.
-
madhumurundi
Hi,
A search engine works on the following techniques:
1. Page ranking: the most active, most visited sites are considered Rank 1 pages. When a user enters a query in the search box, the query first passes to the query optimizer, which then looks up whether the requested page is among the most frequently accessed. If it is, the page appears on the first page of results.
Some techniques can also optimize the speed of a search engine, for example:
1. Modeling Score Distributions for Combining the Outputs of Search Engines
2. Web Crawling
-
ONKSSSSS
Yes, I agree with all of that. But concepts like web crawlers, paging, indexing, etc. are things I think every CSE student knows. The question is HOW TO BUILD IT? (Hope Big K will be satisfied after that.)
-
Kaustubh Katdare
I'm convinced my expectations of the members are a bit too high. This discussion was meant to be about building a search engine and not "What is a search engine".