Building a Search Engine

Kaustubh Katdare

Administrator

Updated: Oct 26, 2024

This one is out of my own, personal curiosity. What's involved in building a simple web search engine?

Let's say we have a network of 1000 (or 10?) terminals, each hosting a set of 5 web pages containing images, videos and, of course, text. How would you go about building a search engine to index all the content available on such a network? Specifically, I'd like to know what would be your:

1. Approach
2. Choice of technology (with reasons)
3. Indexing methodology
4. Technique to improve quality of results

Any takers?

Replies

MaRo

Member • Oct 3, 2009

This is most of my work now 😁

As you may know, we're indexing media content, so we're developing some freakishly different indexing techniques.

For text, we're researching several indexing techniques:

1- Google's technique.
2- Keyword analysis: the spider is supposed to understand the content of a single article and rank the page against previously saved keywords.

That means the spider knows, for example, the keyword 'Engineer' and has a network of keywords related to that word. It crawls for the topic's keyword, analyzes the contents of the pages it finds, and ranks them by relevance. For example, an article about vacancies for engineers would rank lower than an article about engineers inventing a tool, with technical details.
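
A minimal sketch of the keyword-network idea described above, in Python. The thread does not say how the keyword network is stored or weighted, so the `KEYWORD_NETWORK` table, its weights, and the function names are all assumptions made for illustration. The score is simply the weighted count of related terms divided by the page length.

```python
import re
from collections import Counter

# Hypothetical keyword network: topic keyword -> related terms with weights.
# The terms and weights are illustrative assumptions, not data from the thread.
KEYWORD_NETWORK = {
    "engineer": {
        "engineer": 1.0,
        "invented": 2.0,
        "tool": 2.0,
        "technical": 2.0,
        "design": 1.5,
        "vacancy": 0.2,
        "hiring": 0.2,
    }
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def relevance(text, topic):
    """Sum the weights of related terms in the text, normalized by word count."""
    words = tokenize(text)
    if not words:
        return 0.0
    counts = Counter(words)
    weights = KEYWORD_NETWORK[topic]
    score = sum(weights.get(word, 0.0) * n for word, n in counts.items())
    return score / len(words)


def rank(pages, topic):
    """Return page names ordered from most to least relevant."""
    return sorted(pages, key=lambda name: relevance(pages[name], topic), reverse=True)


pages = {
    "vacancy": "Vacancy for an engineer, hiring now",
    "invention": "An engineer invented a tool, with technical details of the design",
}
ranking = rank(pages, "engineer")
print(ranking)
```

With these assumed weights, the invention article ranks above the vacancy article. A real spider would also need stemming (so that "engineers" matches "engineer") and a far larger keyword network.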

Kaustubh Katdare

Administrator • Oct 3, 2009

MaRo - I bet you know plenty about search engines. I look forward to your posts in this thread.

MaRo

Member • Oct 3, 2009

I'm not very good at writing, so I'd rather take questions, please.

Kaustubh Katdare

Administrator • Oct 4, 2009

I'm surprised! No one wants to discuss this? Looks like we should rename this section to the Computer Troubleshooting section.

Mahesh Dahale

Member • Oct 4, 2009

Technology: ASP.NET supports searching files through the Windows Indexing Service; Microsoft .NET Framework or above is required. The code is in C#. Building a word index for a website with a web crawler also doesn't depend on the underlying technology the website uses.

Methods would be something like SetQuery(query as string) and GetSearchResults().

I'm still thinking about the indexing methodology and a web crawling product, The Website Utility.

Ashraf HZ

Member • Oct 4, 2009

Biggie, can't you see MaRo is hinting for a Small Talk? 😉

MaRo

Member • Oct 4, 2009

The problem isn't with indexing, Mahesh. For search engines, the problem lies in ranking the cached results.

@Ash: yeah, I'd love to, but not now. If Google goes down, I'll deserve one 😁

Mahesh Dahale

Member • Oct 4, 2009

OK MaRo sir, thanks. But isn't the index there to optimize speed and performance in finding relevant documents for a search query? Can you explain in detail?

MaRo

Member • Oct 4, 2009

Spiders fetch web pages, pick out the keywords that appear significant in them, and index the pages against those keywords.

They also use, to a small extent, the keywords in the HTML meta tag. Search engines have now added keyword auto-complete, which gets results from the index much faster.
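
Indexing pages against keywords is usually done with an inverted index, which maps each word to the pages that contain it. Here is a minimal Python sketch under that assumption. The page URLs, the function names, and the prefix-based `complete` helper are made up for illustration; the auto-complete in particular is only a naive stand-in for what real search engines do.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Map each word to the set of page URLs that contain it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in tokenize(text):
            index[word].add(url)
    return index


def search(index, query):
    """Return the URLs containing every word of the query (AND semantics)."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


def complete(index, prefix):
    """Suggest indexed keywords that start with the prefix (naive auto-complete)."""
    prefix = prefix.lower()
    return sorted(word for word in index if word.startswith(prefix))


# Hypothetical pages standing in for crawled content.
pages = {
    "http://terminal1/a.html": "Search engine indexing basics",
    "http://terminal2/b.html": "Engine design for spiders",
}
index = build_index(pages)
print(search(index, "search engine"))
print(complete(index, "s"))
```

A lookup touches only the entries for the query words, so it never has to scan every page. That is why the index speeds up finding documents, while ordering the hits remains a separate ranking problem.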

Mahesh Dahale

Member • Oct 4, 2009

Thank you sir

Mahesh Dahale

Member • Oct 4, 2009

Sir, I need to connect to a given search engine and retrieve the HTML page of that search engine.
I am using Java. Any idea how?

Kaustubh Katdare

Administrator • Oct 4, 2009

MaRo wrote:
Spiders fetch web pages, pick out the keywords that appear significant in them, and index the pages against those keywords.

They also use, to a small extent, the keywords in the HTML meta tag. Search engines have now added keyword auto-complete, which gets results from the index much faster.

Any new algorithm suggestions for ranking the pages? Or are quality inlinks the only way we can achieve better search results?
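
For reference, the best-known inlink-based ranking is PageRank-style power iteration. The sketch below is a simplified textbook version in Python, not Google's actual implementation. The damping factor of 0.85, the iteration count, and the four-page link graph are assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Rank pages by inlinks; links maps each page to the pages it links to."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page gets a small base share, then shares passed along links.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling page (no outlinks): spread its rank over all pages.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical link graph: pages a, b and d all link to c; c links back to a.
links = {
    "a": ["c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
}
scores = pagerank(links)
best = max(scores, key=scores.get)
print(best, scores)
```

Page c comes out on top because it has the most inlinks, and a benefits from being linked by the highly ranked c. Keyword relevance and link-based scores can be combined, so inlinks need not be the only signal.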

MaRo

Member • Oct 5, 2009

@mahesh : #-Link-Snipped-#

You have to take the URL generated by searching on the search engine you like. For example, this is the URL generated by googling "engineer": https://www.google.com/search?hl=en&source=hp&q=engineer&btnG=Google+Search&aq=f&oq=&aqi=g10 (engineer - Google Search). Replace the word engineer with a string variable, taking care to handle spaces.
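
A small sketch of that URL-building step. The question was about Java, but this uses Python's standard library to keep the example short. The function names are made up for illustration, and only the `hl` and `q` parameters from the URL above are kept, on the assumption that the others are optional. Search engines may block or forbid automated requests, so check their terms before fetching.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_search_url(query, base="https://www.google.com/search"):
    """Build a search URL with the query safely encoded (spaces become '+')."""
    return base + "?" + urlencode({"hl": "en", "q": query})


def fetch_html(url, timeout=10):
    """Download a page and return its HTML as text. Needs network access."""
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


url = build_search_url("search engine")
print(url)
```

Calling `fetch_html(url)` would then return the result page as a string. It is left uncalled here so the example runs without a network connection.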

@Big K: I think the keyword analysis algorithm, even if it isn't faster than the present algorithm, will differ in the number of relevant results it returns.

Mahesh Dahale

Member • Oct 7, 2009

MaRo wrote:
@mahesh : #-Link-Snipped-#

You have to take the URL generated by searching on the search engine you like. For example, this is the URL generated by googling "engineer": https://www.google.com/search?hl=en&source=hp&q=engineer&btnG=Google+Search&aq=f&oq=&aqi=g10 (engineer - Google Search). Replace the word engineer with a string variable, taking care to handle spaces.

Why does so much data remain hidden from users behind forms or Web service interfaces (the deep web)?
A study by BrightPlanet shows that the hidden Web contains 7,500 terabytes of information and is 400 to 500 times larger than the visible Web.

clarence456

Member • Oct 7, 2009

Hi! Thanks for the informative post. It's one of the posts that impressed me a lot, and I'd like to create a search engine of my own, so it's very useful. The tips are some of the best. Keep up the nice work.


ONKSSSSS

Member • Oct 8, 2009

Good, MaRo, but give some more info...
Here's what I know:
I have no time to explain it myself, but google 'WHAT IS GOOGLEBOT?'
I hope it helps Big K learn more about his topic...
Till then, keep posting your findings...

MaRo

Member • Oct 8, 2009

Googlebot is the spider, the software Google relies on to crawl the Internet.
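
To make the idea of a spider concrete, here is a minimal breadth-first crawler sketch in Python. It is not how Googlebot works internally. The `FAKE_WEB` dictionary is a made-up in-memory stand-in for real HTTP requests, and the terminal URLs and function names are assumptions for illustration.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect the href value of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed, fetch, max_pages=100):
    """Breadth-first crawl from seed; fetch(url) returns HTML or None."""
    seen = {seed}
    queue = deque([seed])
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue  # Dead link: skip it.
        pages[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            target = urljoin(url, href)  # Resolve relative links.
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return pages


# Hypothetical in-memory "web" used instead of real network requests.
FAKE_WEB = {
    "http://terminal1/index.html": '<a href="page2.html">next</a> '
                                   '<a href="http://terminal2/index.html">t2</a>',
    "http://terminal1/page2.html": '<a href="index.html">back</a> '
                                   '<a href="missing.html">dead</a>',
    "http://terminal2/index.html": "<p>No links here</p>",
}
crawled = crawl("http://terminal1/index.html", FAKE_WEB.get)
print(list(crawled))
```

Swapping `FAKE_WEB.get` for a function that downloads pages over HTTP would turn this into a real, if impolite, crawler. A production spider would also respect robots.txt and rate limits.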

Mahesh Dahale

Member • Oct 8, 2009

Suppose I launch a new website. When does it get crawled by Google?
And once crawling and updating have started, what happens if I change a URL? 😒

MaRo

Member • Oct 8, 2009

Google has URL submission, which puts your website on their to-do list.

#-Link-Snipped-#

Kaustubh Katdare

Administrator • Oct 8, 2009

This discussion is going from building a search engine to 'what is a search engine'. Are my expectations of CEans too high?

Mahesh Dahale

Member • Oct 8, 2009

MaRo sir, today I found an article that says:

Google crawls the Web at varying depths and on more than one schedule. The so-called deep crawl occurs roughly once a month. This extensive reconnaissance of Web content requires more than a week to complete, plus an undisclosed length of time after completion to build the results into the index. For this reason, it can take up to six weeks for a new page to appear in Google. Brand new sites at new domain addresses that have never been crawled before might not even be indexed at first.

Manish Goyal

Member • Oct 8, 2009

Yeah, you are right, Mahesh.
As far as I know, there is no such site so far on which deep crawls occur...
But I have read in a magazine that researchers are trying to develop such websites.
Correct me if I am wrong.

madhumurundi

Member • Oct 12, 2009

Hi,
A search engine works on the following techniques:
1. Page ranking: the most active, most visited sites are considered Rank 1 pages. Whenever a user enters a query in the search box, the query first passes to the query optimizer, which looks up whether the requested page is among the most frequently accessed. If it is, it appears on the first page of results.

Some techniques can be used to optimize the speed of a search engine, for example:
1. Modeling Score Distributions for Combining the Outputs of Search Engines
2. Web Crawling

ONKSSSSS

Member • Oct 14, 2009

Yes, I agree with all of that. But concepts like web crawlers, paging, indexing, etc. are things I think every CSE student knows. HOW DO YOU BUILD IT??? (Hope Big K will be satisfied after that.)

Kaustubh Katdare

Administrator • Oct 14, 2009

I'm convinced I've set my expectations of the members a bit too high. This discussion was meant to be about building a search engine, not "What is a search engine".
