How to extract text from HTML using Ruby/Hpricot
I found this while solving a problem of my own (always the best way to learn).
Requirement: Extract the text from an HTML body, collapsing the large runs of whitespace between tags and words.
Solution: Use Hpricot to do the magic.
Assumption: Only the HTML body is passed in here.
One-liner: Hpricot(html).inner_text.gsub("\r", " ").gsub("\n", " ").split(" ").join(" ")
- inner_text returns the text content of the document, a very convenient way to get the actual meat out of the HTML.
- The two gsub calls replace carriage returns and newlines with spaces.
- The split/join collapses runs of spaces between tags and words down to a single space, and drops leading and trailing whitespace.
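The whitespace step can be tried on its own, without Hpricot. The sketch below uses only core Ruby. The helper name squeeze_whitespace and the sample string are made up for illustration and are not part of the original one-liner. It also shows that split(" ") already treats newlines, carriage returns and tabs as separators, so the two gsub calls appear to be harmless but optional.

```ruby
# Hypothetical helper (not from the original post): collapse all runs of
# whitespace in a string down to single spaces.
def squeeze_whitespace(text)
  # String#split(" ") splits on runs of any whitespace (spaces, tabs,
  # newlines, carriage returns) and ignores leading/trailing whitespace.
  text.split(" ").join(" ")
end

# Made-up stand-in for the kind of string inner_text might return.
raw = "  Hello\r\n\r\n   world,\n\tthis   is  text.  "

# The chain from the one-liner, applied to the raw text.
one_liner = raw.gsub("\r", " ").gsub("\n", " ").split(" ").join(" ")

# The shorter form without the gsub calls.
shorter = squeeze_whitespace(raw)

puts one_liner
puts shorter
```

Both lines print the same cleaned-up sentence, so either form should work for the text that comes back from inner_text.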
Now, if you want to scan through the whole HTML document, right from the <html> tag, you will also have to strip out the script, link, meta and style tags, otherwise their contents can end up in the extracted text. To do that, just do the following:
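require 'hpricot'

hpricot = Hpricot(html)
# Remove elements whose contents are not visible page text
hpricot.search("script").remove
hpricot.search("link").remove
hpricot.search("meta").remove
hpricot.search("style").remove
# Then extract and clean the text as before
text = hpricot.inner_text.split(" ").join(" ")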
April 25, 2008 | Filed Under Ruby/Ruby On Rails