Geeks With Blogs
Michael Crump: Microsoft MVP, INETA Community Champion and XAML Advocate.

One question that seems to come up every once in a while is how to download a web page and parse certain information out of it, for example the text between the opening and closing <title> tags. The code snippet below will help you do just that.

 

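using System;
using System.Net;
using System.Text.RegularExpressions;

class Program
{
    static void Main(string[] args)
    {
        // Download the page, then pull the title out of the markup.
        string strFullHtml = WebSite.FetchHTML("http://michaelcrump.net");
        string sFullTitle = WebSite.FetchTitleFromHTML(strFullHtml);

        Console.WriteLine(strFullHtml);
        Console.ReadLine();
        Console.WriteLine(sFullTitle);
        Console.ReadLine();
    }
}

public class WebSite
{
    // Downloads the page at sUrl and returns its HTML as a string.
    public static string FetchHTML(string sUrl)
    {
        // WebClient is IDisposable, so release it as soon as the download is done.
        using (WebClient oClient = new WebClient())
        {
            return oClient.DownloadString(sUrl);
        }
    }

    // Returns the trimmed text between <title> and </title>,
    // or an empty string when the page has no title element.
    public static string FetchTitleFromHTML(string sHtml)
    {
        // [^>]* keeps the opening-tag match inside its own tag; the lazy *?
        // stops at the first </title> instead of the last one on the page.
        string regex = @"(?<=<title[^>]*>)([\s\S]*?)(?=</title>)";
        Regex ex = new Regex(regex, RegexOptions.IgnoreCase);
        return ex.Match(sHtml).Value.Trim();
    }
}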
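
Downloads can fail (bad URL, no connection, server error), and DownloadString reports that by throwing a WebException. Here is a minimal sketch of one way to guard the call. The SafeDownload class and the TryFetchHtml helper are hypothetical names made up for this example; they are not part of the snippet above.

```csharp
using System;
using System.Net;

public static class SafeDownload
{
    // Hypothetical helper: returns false instead of throwing when the download fails.
    public static bool TryFetchHtml(string url, out string html)
    {
        try
        {
            using (var client = new WebClient())
            {
                html = client.DownloadString(url);
                return true;
            }
        }
        catch (WebException ex)
        {
            // WebException covers failures while downloading the resource.
            Console.Error.WriteLine("Download failed: " + ex.Message);
            html = string.Empty;
            return false;
        }
    }

    public static void Main()
    {
        string html;
        if (TryFetchHtml("http://michaelcrump.net", out html))
            Console.WriteLine("Downloaded " + html.Length + " characters.");
        else
            Console.WriteLine("Could not download the page.");
    }
}
```

Returning a bool with an out parameter is just one option; letting the exception bubble up to the caller works as well if that fits your program better.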

 

If you set a breakpoint and inspect strFullHtml with the HTML Visualizer in the Visual Studio debugger, you can see that it holds the entire HTML of the page.


The title was parsed out of the full HTML by using a regular expression.
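
To see the regular expression at work without touching the network, here is a minimal sketch that runs it against an in-memory string. The TitleDemo class, the ExtractTitle helper, and the sample markup are assumptions made up for illustration. The pattern is a non-greedy variant ([\s\S]*? with [^>]* in the lookbehind), chosen so that a later </title>, such as one inside an inline SVG, does not end up in the result.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TitleDemo
{
    // Lookbehind for the opening tag, lazy capture, lookahead for the closing tag.
    private static readonly Regex TitlePattern = new Regex(
        @"(?<=<title[^>]*>)([\s\S]*?)(?=</title>)",
        RegexOptions.IgnoreCase);

    // Hypothetical helper: returns the trimmed title, or "" when none is found.
    public static string ExtractTitle(string html)
    {
        Match match = TitlePattern.Match(html);
        return match.Success ? match.Value.Trim() : string.Empty;
    }

    public static void Main()
    {
        // Made-up sample markup, not the real michaelcrump.net page.
        string sampleHtml =
            "<html><head><TITLE>\n  Sample Page  \n</TITLE></head>" +
            "<body><svg><title>Icon</title></svg></body></html>";

        Console.WriteLine(ExtractTitle(sampleHtml));               // prints "Sample Page"
        Console.WriteLine(ExtractTitle("<p>no title</p>").Length); // prints 0
    }
}
```

The lookbehind and lookahead keep the tags themselves out of the match, so Match.Value is only the text between them, and IgnoreCase lets <TITLE> match as well as <title>.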


Posted on Monday, July 5, 2010 4:10 PM
