Friday, June 19, 2009

C# Html Screen Scraping Part 2 / Performing POST with Cookies

Source code: http://www.box.net/shared/r7u052y507

I just want to point out that I purposely didn't break this out into separate classes and methods, and I know I'm duplicating a lot of code. I simply wanted to demonstrate each technique on its own.


In my previous post, I demonstrated how to connect to a website and download the HTML (aka screen scraping). That works nicely if it's a simple site you need to connect to. Sometimes, however, you can't simply connect to the site and download the HTML; you first need to log in, or maybe enter some kind of search term into a text box. In the HTML world, the page will generally have a simple form that posts to the server, which queries a database or something similar based on the info supplied in the post, and then dynamically builds up the page. How do you do that in your C# app? How do you pass on the values that the server is expecting in the form post?

Throughout this post, I'll be referring to the code that's attached to this post. It has two projects: one is a simple ASP.NET MVC website, and the other is a WinForms app. To run it properly, you'll need to launch the web app first, and then launch the Windows app. Here's a simple screenshot of what the Windows app looks like so you can get an idea:

[Screenshot of the Windows app]

First, a little note. In order to demonstrate this, I needed a site that had a simple form with cookies. Since I couldn't find anything that was really simple and easy to demo, I decided to create my own little "website". It's written in ASP.NET MVC, so if you want to run the code sample supplied in the link, head over to the ASP.NET MVC website and download it (if you don't already have it).

The site has two URLs that are of interest. The first is ../Home/SimplePost. If you navigate to that page, you'll see a simple textbox with a button. When you click the button, it posts the text in the textbox back to the server, which just outputs it back to the browser. Here's the HTML rendered for that form:

<form action="/Home/SimplePost" method="post">
<input type="text" id="text" name="text" />
<input type="submit" value="submit" />
</form>

The form posts to /Home/SimplePost on the server. We can also see from the form that the server is expecting a parameter called "text". It's usually safe to assume that every input field (except for the button) is needed by the server. So now we have enough info to write our C# function:



// Requires: using System; using System.IO; using System.Net; using System.Text;
private void PostWithoutCookies()
{
    HttpWebRequest request = (HttpWebRequest)WebRequest.Create(
        String.Format("http://localhost:{0}/Home/SimplePost", port));
    request.Method = "POST";
    request.ContentType = "application/x-www-form-urlencoded";

    // Fall back to a sample value when the textbox is empty, and URL-encode
    // the value so characters such as '@', '&' and '=' survive the trip.
    string text = String.IsNullOrEmpty(this.textBoxPost.Text)
        ? "somerandomemail@address.com" : this.textBoxPost.Text;
    byte[] bytes = Encoding.UTF8.GetBytes("text=" + Uri.EscapeDataString(text));
    request.ContentLength = bytes.Length;

    // The form data is written to the body of the request.
    using (Stream requestStream = request.GetRequestStream())
    {
        requestStream.Write(bytes, 0, bytes.Length);
    }

    // Read the whole response BEFORE the reader and stream are disposed.
    using (WebResponse response = request.GetResponse())
    using (StreamReader reader = new StreamReader(response.GetResponseStream()))
    {
        this.richTextBox1.Text = reader.ReadToEnd();
    }
}


(This is all part of the app that's attached at the top of this post. It basically outputs all the results to a richtextbox.)

A bit more complicated than last time, but not too bad. The main difference here is that we're actually writing TO the request stream. This puts the form values into the body of the request (not the headers), which is where the server looks for posted form data. The trick is that for each value you need to send, you use this syntax:

field1=value1&field2=value2&field3=value3

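where field is the name of the input field, and value is the value you want to send over to the server (i.e. the text that would be entered into the textbox). Values should be URL-encoded so that characters like & and = inside a value aren't mistaken for separators.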
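If you have more than one field, building that string by hand gets error-prone quickly. Here's a minimal sketch of a helper that does it for you. The FormEncoding class and the BuildFormBody method are names I made up for this illustration; they are not part of the attached sample. It uses Uri.EscapeDataString, which encodes a space as %20 rather than the + a browser would send; servers generally accept both, but treat that as an assumption to check against your own site.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of the attached sample.
public static class FormEncoding
{
    // Joins name/value pairs as name1=value1&name2=value2,
    // percent-encoding every name and value on the way.
    public static string BuildFormBody(IEnumerable<KeyValuePair<string, string>> fields)
    {
        var parts = new List<string>();
        foreach (KeyValuePair<string, string> field in fields)
        {
            string name = Uri.EscapeDataString(field.Key);
            // Treat a null value as an empty string.
            string value = Uri.EscapeDataString(field.Value ?? String.Empty);
            parts.Add(name + "=" + value);
        }
        return String.Join("&", parts.ToArray());
    }
}
```

The result can be passed straight to Encoding.UTF8.GetBytes and written to the request stream, exactly like the postData string in the method above.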

So if we were to run this method now, we'd see the text that we posted to the server (the text that was in the textbox of the app) in the richtextbox.

There is one more thing that we can do with this. Very often, in order to access certain areas of a site, you need to log in first. When you log in, the server sends a cookie to the browser, and then for each subsequent request that is for authenticated users only, the browser sends the cookie back to the server so that you can access those parts of the site. Here, we're acting like a browser, so we need to be able to get the cookie, retain it somehow, and then pass it on to the next request.

To demonstrate this, the web app of this demo has a page called "../Home/PostWithCookie". When you access this page, it sends a cookie to the browser. On that page there's a form identical to the first one. When you post back to the server, though, it checks if the cookie is there. If it is, it outputs "Cookie Found" along with the cookie value; if not, it outputs "Cookie not found."
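If the round trip feels abstract, you can watch CookieContainer do the bookkeeping without touching the network. This is just a sketch: the CookieRoundTrip class is made up for illustration, and the cookie name and port in the usage note are assumptions, since the demo site's actual cookie name isn't shown in this post.

```csharp
using System;
using System.Net;

// Hypothetical helper, not part of the attached sample.
public static class CookieRoundTrip
{
    // Stores a Set-Cookie header value as if it had come from 'site', then
    // returns the Cookie header a later request to 'site' would carry.
    public static string Echo(Uri site, string setCookieHeader)
    {
        var container = new CookieContainer();
        // This is what happens for you when a response arrives on a
        // request that has a CookieContainer attached.
        container.SetCookies(site, setCookieHeader);
        // And this is what gets sent back on the next request to that site.
        return container.GetCookieHeader(site);
    }
}
```

For example, CookieRoundTrip.Echo(new Uri("http://localhost:8080/Home/PostWithCookie"), "demo=hello; path=/") stores the (assumed) demo cookie and hands back the header value that would be sent to the same URL on the next request.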

So back in our Windows app, we first need to request the page that gives us our cookie, then GET the cookie out of the response, and finally pass the cookie along with the form post. Here's the code:



private CookieCollection GetCookies()
{
    HttpWebRequest request = (HttpWebRequest)WebRequest.Create(
        String.Format("http://localhost:{0}/Home/PostWithCookie", port));
    // Without a CookieContainer on the request, no cookies come back on the response.
    request.CookieContainer = new CookieContainer();
    // Dispose the response so the connection is released; we only need its cookies.
    using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
    {
        return response.Cookies;
    }
}

private void PostWithCookies()
{
    // First request: visit the page so the server hands us its cookie.
    CookieCollection cookies = this.GetCookies();
    var request = (HttpWebRequest)WebRequest.Create(
        String.Format("http://localhost:{0}/Home/PostWithCookie", port));
    request.CookieContainer = new CookieContainer();
    request.Method = "POST";
    request.ContentType = "application/x-www-form-urlencoded";

    // URL-encode the value so characters such as '@' and '&' survive the trip.
    string text = String.IsNullOrEmpty(this.textBoxPost.Text)
        ? "somerandomemail@address.com" : this.textBoxPost.Text;
    byte[] bytes = Encoding.UTF8.GetBytes("text=" + Uri.EscapeDataString(text));
    request.ContentLength = bytes.Length;

    // Hand the cookies from the first request back to the server.
    if (cookies != null)
    {
        request.CookieContainer.Add(cookies);
    }

    using (Stream requestStream = request.GetRequestStream())
    {
        requestStream.Write(bytes, 0, bytes.Length);
    }

    using (var response = request.GetResponse())
    using (var reader = new StreamReader(response.GetResponseStream()))
    {
        this.richTextBox1.Text = reader.ReadToEnd();
    }
}


The first bit of code, the GetCookies method, looks JUST like the original screen scrape method, except that here we're actually grabbing the cookies. The trick is to new up a CookieContainer on the request before we execute it. Once the request has a container and we execute it, we can get the cookies out of the response.

Now we have the cookie, but we aren't done. We want to pass this cookie back to the server when we post the form. The only difference from the first POST is that we have to new up a CookieContainer on the request and add the cookies to that container. Once they're there, the cookies get sent over as well when you execute the POST.
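Copying the cookies from one request to the next works fine for two requests, but if you're going to make several, another option is to keep a single CookieContainer around and attach it to every request. Here's a rough sketch of that idea; the CookieSession class is made up for illustration and isn't in the attached sample.

```csharp
using System;
using System.Net;

// Hypothetical helper, not part of the attached sample.
public class CookieSession
{
    // One container shared by every request this session creates.
    private readonly CookieContainer cookies = new CookieContainer();

    public CookieContainer Cookies
    {
        get { return cookies; }
    }

    // Creates a request that sends the cookies collected so far and
    // stores any new ones the server sets in the same container.
    public HttpWebRequest CreateRequest(string url)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        request.CookieContainer = cookies;
        return request;
    }
}
```

With something like this, the GET to the cookie page and the POST that follows would both go through CreateRequest on the same CookieSession, and there would be no need to move a CookieCollection between them by hand.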

The only way to really understand all this is to download the sample and mess with it. It's not very complicated, but you do need to play with it a bit before it clicks. Once you grasp it, though, you'll see just how powerful this is. You can access many websites straight from within your app, and get the data right into your application.

1 comment:

Patrick said...

Great stuff! Thanks!
Patrick