Create your own search engine!

Is there anyone who uses the Internet but not Google? When we need to search for anything on the web, the first site that comes to mind is Google. Ever thought of having your own search engine? Not one like Google, but a simple one that runs on your desktop and searches your files. Or what if you need to add search functionality to your website?

Apache Lucene is there to serve your needs. Lucene is an extremely rich and powerful full-text search API written in Java.

In this post, I will briefly explain how indexing and searching with Lucene work.

The first step in implementing full-text searching with Lucene is to build an index. This is easy - you just specify a directory and an analyzer class. The analyzer breaks text fields into indexable tokens; this is a core part of Lucene.

Several types of analyzers are provided out of the box. Some of the more interesting ones are listed below.
Lucene analyzers:
StandardAnalyzer: A sophisticated general-purpose analyzer.
WhitespaceAnalyzer: A very simple analyzer that just separates tokens using white space.
StopAnalyzer: Removes common English words that are not usually useful for indexing.
SnowballAnalyzer: An interesting experimental analyzer that works on word roots (a search on rain should also return entries with raining, rained, and so on).
There are even a number of language-specific analyzers, including analyzers for German, Russian, French, Dutch, and others.
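
To make the differences between these analyzers a bit more concrete, here is a minimal sketch in plain Java. It does not use Lucene at all: ToyAnalyzers, its stop-word list, and crudeStem are names and data I made up for illustration, and the real Lucene analyzers are considerably more careful (real stemming, for one, is far more than chopping off suffixes).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy stand-ins for the analyzers listed above; these are NOT Lucene classes.
final class ToyAnalyzers {
    // A tiny, illustrative stop-word list (an assumption, not Lucene's list).
    private static final Set<String> STOP_WORDS =
            Set.of("a", "an", "and", "is", "it", "of", "the", "to");

    // WhitespaceAnalyzer idea: split on runs of white space, nothing else.
    static List<String> whitespace(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return new ArrayList<>();
        }
        return new ArrayList<>(Arrays.asList(trimmed.split("\\s+")));
    }

    // StopAnalyzer idea: lower-case the tokens and drop common words.
    static List<String> stop(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : whitespace(text)) {
            String lower = token.toLowerCase(Locale.ROOT);
            if (!STOP_WORDS.contains(lower)) {
                tokens.add(lower);
            }
        }
        return tokens;
    }

    // SnowballAnalyzer idea, reduced to a crude suffix stripper.
    static String crudeStem(String token) {
        if (token.endsWith("ing") && token.length() > 5) {
            return token.substring(0, token.length() - 3);
        }
        if (token.endsWith("ed") && token.length() > 4) {
            return token.substring(0, token.length() - 2);
        }
        return token;
    }

    public static void main(String[] args) {
        System.out.println(whitespace("The rain is falling"));
        System.out.println(stop("The rain is falling"));
        System.out.println(crudeStem("raining") + " " + crudeStem("rained"));
    }
}
```

With this sketch, "raining" and "rained" both reduce to "rain", which is the effect the SnowballAnalyzer description is getting at.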

Replies

  • PraveenKumar Purushothaman
    Next, we need to create an IndexWriter object. The IndexWriter object is used to create the index and to add new index entries to this index. You can create an IndexWriter with the StandardAnalyzer analyzer as follows:
    IndexWriter indexWriter = new IndexWriter("index", new StandardAnalyzer(), true);
    The first argument is the directory location in the file system where the index files should be located. The second argument is a StandardAnalyzer object. The third argument is a boolean parameter set to true, which tells the IndexWriter to rebuild the index from scratch if it already exists.

    The next step is to index the business objects. For this, we use the Document class.

    The document is a container for holding a set of indexed fields.
    Document document = new Document();
    Reader reader = new FileReader(file);
    // FIELD_CONTENTS is a String constant with the value "contents"; it is the name of the field.
    // The field's value is the contents of the file, read through the reader.
    document.add(new Field(FIELD_CONTENTS, reader));
    In the above snippet, a Field is created and added to the Document. A field is made up of a name and a value (the first two parameters of the constructor). The value may be a String, or a Reader if the object to be indexed is a file. Field has many overloaded constructors for various needs. For more details on Field, refer to its API documentation.
  • PraveenKumar Purushothaman
    Now, add the document to the index writer.
    indexWriter.addDocument(document);
    So far, we have created an index writer and added the document to it.
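
    If you are wondering what "adding a document to an index" boils down to, here is a rough sketch in plain Java. This is not Lucene's code or its index format; ToyIndex and its methods are hypothetical names of mine. It only shows the basic idea of an inverted index: each term maps to the set of documents that contain it.

    ```java
    import java.util.HashMap;
    import java.util.Locale;
    import java.util.Map;
    import java.util.Set;
    import java.util.TreeSet;

    // A toy inverted index: NOT Lucene's on-disk format, just the underlying idea.
    final class ToyIndex {
        // term -> ids of the documents that contain it
        private final Map<String, Set<Integer>> postings = new HashMap<>();
        private int nextDocId = 0;

        // Lower-cases the text, splits it on anything that is not a letter or digit,
        // and records every term against the new document id.
        int addDocument(String text) {
            int docId = nextDocId++;
            for (String term : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
                if (!term.isEmpty()) {
                    postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
                }
            }
            return docId;
        }

        // Returns the ids of the documents containing the term, or an empty set.
        Set<Integer> search(String term) {
            return postings.getOrDefault(term.toLowerCase(Locale.ROOT), new TreeSet<>());
        }

        public static void main(String[] args) {
            ToyIndex index = new ToyIndex();
            index.addDocument("Lucene builds an index");
            index.addDocument("An index maps terms to documents");
            System.out.println(index.search("index"));
            System.out.println(index.search("lucene"));
        }
    }
    ```

    Looking up a term is then a single map lookup instead of a scan over every file, which is why building the index first pays off.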
    The only step remaining now is to search the indexed values. For this, Lucene provides the IndexSearcher and QueryParser classes. We pass an analyzer object to the QueryParser; note that this must be the same analyzer that was used during indexing. You also specify the field that you want to search and the (user-provided) full-text query.
    // directory is the name of the directory where the index is stored
    IndexReader indexReader = IndexReader.open(directory);
    IndexSearcher indexSearcher = new IndexSearcher(indexReader);

    Analyzer analyzer = new StandardAnalyzer();
    QueryParser queryParser = new QueryParser(FIELD_CONTENTS, analyzer);
    Query query = queryParser.parse(searchString); // searchString is given by the user
    Hits hits = indexSearcher.search(query);
    System.out.println("Number of hits: " + hits.length());
    In the above snippet, we use the QueryParser to create a new Query and then pass this Query object to IndexSearcher's search() method. The search method returns a Hits object, which contains the documents matching searchString. The length() method gives the number of matches.
    Voila! Our search engine is ready!
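
    About the note that the QueryParser must get the same analyzer that was used for indexing: here is a small sketch in plain Java (again not Lucene; AnalyzerMismatch and the two "analyzers" in it are names I invented for illustration) that shows what can go wrong when the two sides disagree.

    ```java
    import java.util.HashSet;
    import java.util.Locale;
    import java.util.Set;
    import java.util.function.Function;

    // Toy demonstration (plain Java, not Lucene) of why index-time and
    // query-time analysis should match.
    final class AnalyzerMismatch {
        // Two hypothetical "analyzers": one lower-cases tokens, one keeps them as typed.
        static final Function<String, String> LOWERCASING = s -> s.toLowerCase(Locale.ROOT);
        static final Function<String, String> AS_TYPED = s -> s;

        // Indexes the words of text with indexAnalyzer, then looks up query
        // after passing it through queryAnalyzer.
        static boolean matches(String text, String query,
                               Function<String, String> indexAnalyzer,
                               Function<String, String> queryAnalyzer) {
            Set<String> indexedTerms = new HashSet<>();
            for (String token : text.trim().split("\\s+")) {
                indexedTerms.add(indexAnalyzer.apply(token));
            }
            return indexedTerms.contains(queryAnalyzer.apply(query));
        }

        public static void main(String[] args) {
            // Same analyzer on both sides: the query is normalized like the indexed terms.
            System.out.println(matches("Apache Lucene", "LUCENE", LOWERCASING, LOWERCASING));
            // Different analyzers: the index holds "lucene" but the query asks for "LUCENE".
            System.out.println(matches("Apache Lucene", "LUCENE", LOWERCASING, AS_TYPED));
        }
    }
    ```

    The first call finds the term and the second does not, even though the text and the query are identical in both calls.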
  • PraveenKumar Purushothaman
    If you want to see the exact matches, use an Iterator over the Hit objects in hits and iterate over it to get the documents that matched the search string.

    The code would look somewhat like this:
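    Iterator it = hits.iterator();
    while (it.hasNext()) {
        Hit hit = (Hit) it.next(); // the raw Iterator returns Object, so a cast is needed
        Document document = hit.getDocument();
        // Get the required value from the document and store it in matchedValue.
        // Only stored fields can be read back; a field built from a Reader is indexed but not stored.
        String matchedValue = document.get("path"); // "path" is a hypothetical stored field
        System.out.println("Hit: " + matchedValue);
    }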
    Simple, isn’t it?

    I know a lot of new terms have come into the picture: Document, Field, IndexWriter, IndexSearcher, etc. But once you build a sample Java project, things will get simpler.
  • K!r@nS!ngu
    Awesome bro. Let me give it a try.....
  • PraveenKumar Purushothaman
    Sure, let us know how it goes... 😀
  • greatcoder
    No need to do such hard work.... Just download Google Desktop and you can search your entire computer with any keyword. It searches not only text files but all sorts of files (ppt, doc), including the content written inside them. It also searches Outlook emails!!

    Best tool to keep with you. After all, you cannot compete with Google in searching 😎
  • PraveenKumar Purushothaman
    Dude, this is about learning to build something yourself... We know that everything already exists... But imagine how it feels when others use something that you made...

    i.e., creating is better than using!!! 😀
