Fetch meta tags from crawled HTML docs

I have used Apache Nutch 1.4 to crawl a website. Now I want to fetch the meta tags from every HTML page. Is this possible?
I have only just started using Nutch, so I don't even know how to compile the code. For crawling, I downloaded the binary distribution and ran some very simple commands.
So one of my questions is: how do I run Nutch after modifying one of the source files?

And what modification can I make so that it shows the meta tag information for each page's URL?
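
To make the goal concrete, here is a minimal sketch of the extraction step on its own, outside Nutch. It assumes the raw HTML of each crawled page is already available keyed by URL, and it uses only Python's standard `html.parser` module. The names `MetaTagCollector`, `extract_meta_tags`, and the sample `pages` dict are hypothetical and are not part of Nutch's API; doing the same thing inside Nutch would presumably mean a parse plugin, which this sketch does not attempt to show.

```python
from html.parser import HTMLParser


class MetaTagCollector(HTMLParser):
    """Collects <meta> tags as a dict of key -> content (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag != "meta":
            return
        attr_map = {name: (value or "") for name, value in attrs}
        # A meta tag may be keyed by name, http-equiv, or property.
        key = (attr_map.get("name")
               or attr_map.get("http-equiv")
               or attr_map.get("property"))
        if key:
            self.meta[key.lower()] = attr_map.get("content", "")


def extract_meta_tags(html):
    """Return the meta tags found in one HTML document."""
    parser = MetaTagCollector()
    parser.feed(html)
    parser.close()
    return parser.meta


if __name__ == "__main__":
    # Assumed input: URL -> raw HTML, e.g. dumped from the crawl.
    pages = {
        "http://example.com/": (
            "<html><head>"
            '<meta name="description" content="Demo page">'
            '<meta name="keywords" content="nutch, crawl"/>'
            "</head><body></body></html>"
        ),
    }
    for url, html in pages.items():
        print(url, extract_meta_tags(html))
```

Tags without a `name`, `http-equiv`, or `property` attribute (for example `<meta charset="utf-8">`) are skipped in this sketch, and a repeated key keeps only the last value seen.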

Replies

  • Sachin Jain
    Has anybody worked with Apache Nutch, the Lucene library, and Solr?

