“Smart” way of parsing and using website data?
How does one intelligently parse data returned by search results on a page?
For example, let’s say that I would like to create a web service that searches for online books by parsing the search results of many book providers’ websites. I could get the raw HTML of each page and use some regexes to make the data work for my web service, but if any of the websites change the formatting of their pages, my code breaks!
RSS is indeed a marvelous option, but many sites don’t have an XML/JSON based search.
Are there any kits out there that help extract information from pages automatically? A crazy idea would be to have a fuzzy AI module recognize patterns on a search results page, and parse the results accordingly…


1 on Mar 07, 2012
I’ve done some of this recently, and here are my experiences.
There are three basic approaches:
I’ve tinkered with Web-Harvest for option 2, but I find its syntax kind of weird: a mix of XML and some pseudo-Java scripting language. If you like Java, and like XML-style data extraction (XPath, XQuery), that might be the ticket for you.
Edit: if you use regular expressions, make sure you use a library with lazy quantifiers and capturing groups! PHP’s older regex libraries lack these, and they’re indispensable for matching data between open/close tags in HTML.
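To illustrate the point about lazy quantifiers and capturing groups, here is a minimal sketch using Python’s standard re module. The HTML string is made up for the example; it only shows how a greedy and a lazy pattern behave differently between open/close tags.

```python
import re

# Made-up fragment with two titles, each between <b> and </b>.
html = '<li><b>Dune</b></li><li><b>Emma</b></li>'

# Greedy: .* runs to the LAST </b>, swallowing everything in between.
greedy = re.findall(r'<b>(.*)</b>', html)

# Lazy: .*? stops at the FIRST </b>; the capturing group keeps only the title.
lazy = re.findall(r'<b>(.*?)</b>', html)

print(greedy)
print(lazy)
```

The lazy version returns one clean match per tag pair, while the greedy one returns a single match that spans both.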
0 on Mar 07, 2012
You don’t say what language you’re using. In Java land you can use TagSoup and XPath to help minimise the pain. There’s an example from this blog (of course the XPath can get a lot more complicated as your needs dictate):
I’d recommend externalising the XPath expressions so you have some measure of protection if the site changes.
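A rough sketch of what externalising the XPath expressions could look like, written in Python rather than Java and using only the standard library. The selector names, the JSON layout and the page fragment are all hypothetical, and xml.etree.ElementTree needs well-formed markup, so real pages would first have to go through a tolerant parser such as TagSoup.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical selector config. In practice this would live in its own file,
# so it can be edited when the site changes without touching the code.
SELECTORS_JSON = """
{
  "row": ".//div[@class='result']",
  "title": "./h3",
  "price": "./span[@class='price']"
}
"""

# Hypothetical, well-formed search results fragment.
PAGE = """
<html><body>
  <div class="result"><h3>Dune</h3><span class="price">7.99</span></div>
  <div class="result"><h3>Emma</h3><span class="price">4.50</span></div>
</body></html>
"""


def extract(page, selectors):
    """Apply the configured XPath expressions and return one dict per result row."""
    root = ET.fromstring(page.strip())
    rows = []
    for node in root.findall(selectors["row"]):
        rows.append({
            "title": node.findtext(selectors["title"]),
            "price": node.findtext(selectors["price"]),
        })
    return rows


results = extract(PAGE, json.loads(SELECTORS_JSON))
print(results)
```

If the site renames a class or moves an element, only the JSON needs to change.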
Here’s an example XPath I’m definitely not using to screenscrape this site. No way, not me:
2 on Mar 07, 2012
Without a fixed HTML structure to parse, I would hate to maintain regular expressions for finding data. You might have more luck parsing the HTML through a proper parser that builds the tree. Then select elements … that would be more maintainable.
Obviously the best way is some XML output from the engine with a fixed markup that you can parse and validate. I would think that an HTML parsing library with some ‘in the dark’ probing of the produced tree would be simpler to maintain than regular expressions.
This way, you just have to check on <a href="blah">... turning into <a href="blah">... or whatever.
Bottom line, grepping specific elements with regexp would be grim. A better approach is to build a DOM-like model of the page and look for ‘anchors’ to character data in the tags.
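As a small sketch of why a parser copes better than a regexp when tags change, the following uses Python’s standard html.parser. Note that it is event-based rather than a full DOM tree, so it is a simplification of the approach described above. The class name, helper function and both markup samples are made up for the example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects (href, text) pairs for every <a> tag that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> we are currently inside, if any
        self._text = []     # character data seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # Only the href matters; extra or reordered attributes are ignored.
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def links(html):
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical markup before and after a site redesign.
old_markup = '<p><a href="/book/1">Dune</a></p>'
new_markup = '<p><a class="hit" id="r1" href="/book/1"><em>Dune</em></a></p>'

print(links(old_markup))
print(links(new_markup))
```

Both versions of the markup yield the same link, even though the attributes and nesting differ.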
Or send an email to the site stating a case for an XML API … you might get hired!
2 on Mar 07, 2012
You haven’t mentioned which technology stack you’re using. If you’re parsing HTML, I’d use a parsing library:
There are also webservices that do exactly what you’re saying – commercial and free. They scrape sites and offer webservice interfaces.
And a generic webservice that offers some screen scraping is Yahoo Pipes. There is a previous Stack Overflow question on that.
0 on Mar 07, 2012
It isn’t foolproof, but you may want to look at a parser such as Beautiful Soup. It won’t magically find the same info if the layout changes, but it’s a lot easier than writing complex regular expressions. Note this is a Python module.
10 on Mar 07, 2012
Unfortunately ‘scraping’, i.e. attempting to parse HTML from websites as you said, is the most common solution. You could detect structural changes to the page and flag an alert for you to fix, so a change at their end doesn’t result in bum data. Until the semantic web is a reality, that’s pretty much the only way to guarantee a large dataset.
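One possible way to detect structural changes, sketched with Python’s standard library: keep a list of selectors that must match something on every page and raise an alert when any of them comes back empty. The selector names and both page fragments are hypothetical, and ElementTree only accepts well-formed markup, so this assumes the page has already been cleaned up.

```python
import xml.etree.ElementTree as ET

# Hypothetical structure checks: each selector should match at least once.
EXPECTED = {
    "result rows": ".//div[@class='result']",
    "titles": ".//div[@class='result']/h3",
}


def missing_structure(page, expected=EXPECTED):
    """Return the names of the expected selectors that match nothing."""
    root = ET.fromstring(page)
    return [name for name, path in expected.items() if not root.findall(path)]


# Made-up pages: the layout we expect, and the same data after a redesign.
good_page = "<html><body><div class='result'><h3>Dune</h3></div></body></html>"
changed_page = "<html><body><li class='hit'><h4>Dune</h4></li></body></html>"

for label, page in (("good", good_page), ("changed", changed_page)):
    problems = missing_structure(page)
    if problems:
        print(f"ALERT ({label}): layout may have changed, no match for {problems}")
    else:
        print(f"OK ({label})")
```

The point is to fail loudly and skip the page rather than store whatever the old selectors happen to pick up.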
Alternatively you can stick to small datasets provided by APIs. Yahoo are working very hard to provide searchable data through APIs (see YDN), and I think the Amazon API opens up a lot of book data, etc.
Hope that helps a little bit!
EDIT: And if you’re using PHP I’d recommend SimpleHTMLDOM
1 on Mar 07, 2012
Have you looked into using an HTML manipulation library? Ruby has some pretty nice ones, e.g. Hpricot.
With a good library you could specify the parts of the page you want using CSS selectors or XPath. These would be a good deal more robust than using regexps.
Example from hpricot wiki:
I am sure you could find a library that does similar things in .NET or Python, etc.
4 on Mar 07, 2012
Try googling for screen scraping + the language you prefer. I know several options for Python; you may find the equivalent for your preferred language:
Depending on the website to scrape you may need to use one or more of the approaches above.
3 on Mar 07, 2012
Parsley at http://www.parselets.com looks pretty slick.
It lets you define ‘parselets’ in JSON that describe what to look for on the page, and it then parses that data out for you.
4 on Mar 07, 2012
If you can use something like Tag Soup, that’d be a place to start. Then you could treat the page like an XML API, kinda.
It has Java and C++ implementations, so it might work!
2 on Mar 07, 2012
As others have said, you can use an HTML parser that builds a DOM representation and query it with XPath/XQuery. I found a very interesting article here: Java theory and practice: Screen-scraping with XQuery – http://www.ibm.com/developerworks/xml/library/j-jtp03225.html
0 on Mar 07, 2012
Fair enough, I am going to use the Tag Soup method as recommended.
As a follow-up question: how on earth do those big scraper-type sites do it? I have seen a job search engine (e.g. indeed.com) that scans thousands of sites! Is that thousands of regexes? It’s next to impossible…