Friday, June 19, 2009

C# Html Screen Scraping Part 1

This is the first post of a two part series.

Source Code: http://www.box.net/shared/i2p7t9kxkt

Very often, when a particular website has some information you'd like to use in your application, you'd see if they have some kind of API which you can use to query their data. However, it's very common for a website either to not have an API at all, or to not expose that little bit of info you need through their API. What's generally done to get around this is a technique known as "Screen Scraping". (Screen Scraping is a general term, not just for the web, but for the purposes of this blog, when I say Screen Scraping, I mean HTML Screen Scraping.)

The general gist of it is this: when a browser contacts a site, an HTML document is sent back to the browser. The browser then has the (tedious) task of parsing that HTML and rendering it out to the screen. At the end of the day, though, the HTML is just a text file. Screen scraping basically means writing an app that "acts" like a web browser: it contacts the web site and downloads the HTML file into memory, at which point you're free to parse it any which way you like and extract the data you need. In .NET this is incredibly easy to do, and here's a simple sample to demonstrate:


// Requires: using System.IO; using System.Net;
private string GetWebsiteHtml(string url)
{
    WebRequest request = WebRequest.Create(url);

    // The using blocks make sure the response, stream and reader all get
    // disposed properly, even if reading fails partway through.
    using (WebResponse response = request.GetResponse())
    using (Stream stream = response.GetResponseStream())
    using (StreamReader reader = new StreamReader(stream))
    {
        return reader.ReadToEnd();
    }
}


Yup, that's pretty much it. First, you create a WebRequest object with the given URL. Then, you get a Response object out of that Request. Finally you get the response stream and read it with a StreamReader.
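
As an aside, if you don't need fine-grained control over the request, you can get the same result in less code with WebClient. This is just an alternative sketch, not what the attached sample uses:

// Alternative sketch using WebClient -- same result, less code.
// Requires: using System.Net;
private string GetWebsiteHtmlViaWebClient(string url)
{
    using (WebClient client = new WebClient())
    {
        return client.DownloadString(url);
    }
}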

I attached a simple app so you can give it a whirl. Basically, it's a simple Windows app with a textbox and a button. Enter any URL in the textbox (make sure to write the full URL, including http://....) and hit the Go button. That will fetch the entire HTML of that site and display it in the richtextbox.
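
For reference, the Go button handler in the attached app is essentially just a call to the method above. Here's a minimal sketch; the control names (txtUrl, rtbHtml, btnGo) are my assumption, so check the attached sample for the real ones:

// Hypothetical control names -- the attached sample may name them differently.
private void btnGo_Click(object sender, EventArgs e)
{
    string html = GetWebsiteHtml(txtUrl.Text);
    rtbHtml.Text = html;
}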

This is obviously a rough sample, so make sure to add proper error handling, but other than that, it's pretty straightforward and real simple! The only thing to watch out for when scraping is that your parsing code will rely on the HTML being formatted in a VERY specific way. If the site changes in any way, your code WILL break.
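
To make that point concrete, here's a rough sketch of what a scraping call might look like once you add a try/catch and an actual parse step. The regex here pulls out the page's <title> purely for illustration, and it shows exactly why scrapers are fragile:

// Requires: using System.Net; using System.Text.RegularExpressions; using System.Windows.Forms;
private string TryGetPageTitle(string url)
{
    try
    {
        string html = GetWebsiteHtml(url);

        // This parse relies on the markup looking exactly like <title>...</title>.
        // If the site changes its HTML, the match (and your scraper) breaks.
        Match match = Regex.Match(html, @"<title>(.*?)</title>",
                                  RegexOptions.IgnoreCase | RegexOptions.Singleline);
        return match.Success ? match.Groups[1].Value.Trim() : null;
    }
    catch (WebException ex)
    {
        // Network errors, 404s, timeouts etc. all end up here.
        MessageBox.Show("Couldn't fetch the page: " + ex.Message);
        return null;
    }
}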

This was a very simple post; in the next post, I'll take this much further, and demonstrate how we can actually POST to a server, and even get the cookies.
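
As a quick preview of where part 2 is headed: a POST boils down to writing a form-encoded body to the request stream, and cookies are handled by hanging a CookieContainer off an HttpWebRequest. This is just a rough sketch with made-up form data, not the code from the next post:

// Rough preview sketch -- the form data format here is an example only.
// Requires: using System.IO; using System.Net; using System.Text;
private string PostToWebsite(string url, string formData, CookieContainer cookies)
{
    HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
    request.Method = "POST";
    request.ContentType = "application/x-www-form-urlencoded";
    request.CookieContainer = cookies;   // cookies the server sets land in here too

    byte[] body = Encoding.UTF8.GetBytes(formData);   // e.g. "user=me&pass=secret"
    request.ContentLength = body.Length;
    using (Stream requestStream = request.GetRequestStream())
    {
        requestStream.Write(body, 0, body.Length);
    }

    using (WebResponse response = request.GetResponse())
    using (StreamReader reader = new StreamReader(response.GetResponseStream()))
    {
        return reader.ReadToEnd();
    }
}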

1 comment:

Biju Alapatt said...

How can I read the content of a text box and a hyperlink by specifying its id from an aspx page?