How to Create a Web Scraper in ASP.NET MVC and jQuery

A web scraper is a piece of software that extracts data from websites. Scrapers can be used to extract typical information such as emails, telephone numbers, and addresses from different URLs. This extraction technique is also known as data harvesting.

This tutorial shows how to create your own web scraper in ASP.NET MVC and jQuery. The scraper extracts all emails and telephone numbers from a specified URL and shows them in an HTML div.

It is quite easy to build, and the code involved is short and simple.

Web Scraper HTML Design

The HTML design of the Web Scraper consists of:

  • An input of type text where the URL of the page to be crawled is entered.
  • A button which when clicked will start the data harvesting procedure.
  • A div where the extracted emails and telephone numbers will be shown.

ASP.NET MVC Controller

First, create a controller in your ASP.NET MVC application. Name it “WebScrapingController”, or give it any other name you prefer (if you choose another name, update the URL in the jQuery AJAX call to match).

Now create a function “GetUrlSource” in this controller and mark it with the [HttpPost] attribute. The jQuery AJAX method calls this function on the button click event.

The code of the GetUrlSource function is:

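// Requires: using System; using System.Net; using System.Web.Mvc;
[HttpPost]
public string GetUrlSource(string url)
{
    // Prepend a scheme when the user typed a bare host such as "example.com".
    // StartsWith avoids the exception Substring(0, 4) throws for inputs shorter than 4 characters.
    url = url.StartsWith("http", StringComparison.OrdinalIgnoreCase) ? url : "http://" + url;
    string htmlCode = "";
    using (WebClient client = new WebClient())
    {
        try
        {
            htmlCode = client.DownloadString(url);
        }
        catch (Exception)
        {
            // The download failed; fall through and return an empty string,
            // so the client simply finds no emails or telephone numbers.
        }
    }
    return htmlCode;
}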

Explanation – The “GetUrlSource” function receives the URL of the page in its parameter. It reads the HTML (page source) of the URL using the “WebClient.DownloadString()” function and returns this HTML. If the download fails, it returns an empty string.
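The first line of “GetUrlSource” prepends “http://” when the URL does not already begin with “http”. The same idea can be sketched in JavaScript, for example to tidy the input before it is sent. The helper name normalizeUrl is hypothetical and not part of the original code, and unlike the controller it checks for a full “http://” or “https://” scheme rather than only the leading text “http”:

```javascript
// Hypothetical helper (an assumption, not in the original tutorial):
// adds "http://" when the input has no http/https scheme.
function normalizeUrl(url) {
    var trimmed = String(url).trim();
    // Case-insensitive test for an explicit http:// or https:// prefix.
    return /^https?:\/\//i.test(trimmed) ? trimmed : "http://" + trimmed;
}

console.log(normalizeUrl("example.com"));                 // http://example.com
console.log(normalizeUrl("https://example.com/contact")); // https://example.com/contact
```

Checking for the full scheme means a host such as “httpbin.org” still gets a scheme added, which a plain four-character prefix check would miss.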

ASP.NET MVC View

Create a view named “Index” for the “WebScrapingController” controller and place the HTML code below in it.

<div id="message"></div>
<input id="urlInput" type="text" placeholder="Enter URL" />
<button id="submit">Submit</button>
<button id="reset">Reset</button>
<div class="textAlignCenter">
    <img src="~/Content/Image/loading.gif" />
</div>
<div id="twoColumn">
    <div></div>
    <div></div>
</div>

Explanation – The above HTML code contains a “twoColumn” div that holds two inner divs. The first inner div will show the fetched emails, while the second one will show the fetched telephone numbers.

Now add the jQuery code below to the view:

<script src="https://ajax.googleapis.com/ajax/libs/jquery/3.1.0/jquery.min.js"></script>
<script>
    $(document).ready(function () {
        $("#reset").click(function (e) {
            $("#urlInput").val("");
            $("#twoColumn > div").html("");
        });

        $("#submit").click(function (e) {
            var validate = Validate();
            $("#message").html(validate);

            if (validate.length == 0) {
                $.ajax({
                    type: "POST",
                    url: "/WebScraping/GetUrlSource",
                    contentType: "application/json; charset=utf-8",
                    data: JSON.stringify({ url: $("#urlInput").val() }),
                    dataType: "html",
                    success: function (result, status, xhr) {
                        GetUrlTelePhone(result);
                    },
                    error: function (xhr, status, error) {
                        $("#message").html("Result: " + status + " " + error + " " + xhr.status + " " + xhr.statusText);
                    }
                });
            }
        });

        // Return the distinct strings in an array, sorted alphabetically.
        // ($.uniqueSort only works on arrays of DOM elements, not strings.)
        function uniqueSorted(values) {
            return values.filter(function (v, i) { return values.indexOf(v) === i; }).sort();
        }

        function GetUrlTelePhone(html) {
            // "@@" is the Razor escape for a literal "@" inside a view.
            var emails = html.match(/([a-zA-Z0-9._-]+@@[a-zA-Z0-9._-]+\.[a-zA-Z0-9._-]+)/gi);
            emails = emails != null ? uniqueSorted(emails) : [];
            var email = $("<p><u>Emails Found:-</u></p>");
            for (var i = 0, il = emails.length; i < il; i++)
                email.append("<p>" + (i + 1) + ". " + emails[i] + "</p>");
            $("#twoColumn > div").first().html(email);

            // The "g" flag returns every match rather than the first match plus its capture groups.
            var tels = html.match(/\(?([0-9]{3})\)?([ .-]?)([0-9]{3})\2([0-9]{4})/g);
            tels = tels != null ? uniqueSorted(tels) : [];
            var tel = $("<p><u>Telephones Found:-</u></p>");
            for (var j = 0, jl = tels.length; j < jl; j++)
                tel.append("<p>" + (j + 1) + ". " + tels[j] + "</p>");
            $("#twoColumn > div:nth-child(2)").html(tel);
        }

        $(document).ajaxStart(function () {
            $("img").show();
        });

        $(document).ajaxStop(function () {
            $("img").hide();
        });


        function Validate() {
            var errorMessage = "";
            if ($("#urlInput").val() == "") {
                errorMessage += "► Enter URL<br/>";
            }
            else if (!(isUrlValid($("#urlInput").val()))) {
                errorMessage += "► Invalid URL<br/>";
            }

            return errorMessage;
        }

        function isUrlValid(url) {
            var urlregex = new RegExp(
          "^(http[s]?:\\/\\/(www\\.)?|ftp:\\/\\/(www\\.)?|www\\.){1}([0-9A-Za-z-\\.@@:%_\+~#=]+)+((\\.[a-zA-Z]{2,3})+)(/(.)*)?(\\?(.)*)?");
            return urlregex.test(url);
        }
    });
</script>

Explanation – On the button click event, the jQuery AJAX method calls the C# function “GetUrlSource” of the controller. In the success callback of the AJAX method, the “GetUrlTelePhone” function is called with the URL’s HTML code as its argument.
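Because the request is sent with a JSON content type, the URL has to be encoded as valid JSON. The sketch below, which runs without jQuery, compares a hand-concatenated body with one built by JSON.stringify for a URL that contains double quotes. The names buildPayload and isValidJson are hypothetical helpers introduced only for this illustration:

```javascript
// Hypothetical helper: builds the JSON body sent to GetUrlSource.
function buildPayload(url) {
    return JSON.stringify({ url: url });
}

// Hypothetical helper: reports whether a string parses as JSON.
function isValidJson(text) {
    try {
        JSON.parse(text);
        return true;
    } catch (e) {
        return false;
    }
}

var tricky = 'http://example.com/?q="quoted"';

// Hand concatenation leaves the inner quotes unescaped.
var concatenated = '{"url":"' + tricky + '"}';

console.log(isValidJson(concatenated));                       // false
console.log(isValidJson(buildPayload(tricky)));               // true
console.log(JSON.parse(buildPayload(tricky)).url === tricky); // true
```

JSON.stringify escapes quotes and backslashes, so the value survives the round trip unchanged.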

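The “GetUrlTelePhone” function extracts the emails and telephone numbers from the HTML with regular expressions, removes duplicates, and shows them in the two inner divs. Note that “@@” in the script is the Razor escape for a literal “@” character.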
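To see how the two regular expressions behave, here is a minimal standalone sketch that runs outside the view (for example in Node.js), so the email pattern uses a single “@”. The function names and the sample HTML are assumptions made for illustration. jQuery documents $.uniqueSort as working only on arrays of DOM elements, so the sketch removes duplicates with a plain-array helper instead:

```javascript
// Return the distinct strings in an array, sorted alphabetically.
function uniqueSorted(values) {
    return values.filter(function (v, i) {
        return values.indexOf(v) === i;
    }).sort();
}

// Hypothetical helper: all distinct email-like strings in the HTML.
function extractEmails(html) {
    var found = html.match(/[a-zA-Z0-9._-]+@[a-zA-Z0-9._-]+\.[a-zA-Z0-9._-]+/gi);
    return found ? uniqueSorted(found) : [];
}

// Hypothetical helper: all distinct 10-digit telephone numbers in the HTML.
// The "g" flag makes match() return every full match.
function extractPhones(html) {
    var found = html.match(/\(?([0-9]{3})\)?([ .-]?)([0-9]{3})\2([0-9]{4})/g);
    return found ? uniqueSorted(found) : [];
}

// Made-up sample page source.
var sampleHtml =
    '<a href="mailto:sales@example.com">sales@example.com</a>' +
    '<p>info@example.com</p>' +
    '<p>Call 555-123-4567 or 555.123.4567, again 555-123-4567</p>';

console.log(extractEmails(sampleHtml)); // [ 'info@example.com', 'sales@example.com' ]
console.log(extractPhones(sampleHtml)); // [ '555-123-4567', '555.123.4567' ]
```

The backreference \2 in the telephone pattern requires both separators to be identical, so a number written as “(555) 123-4567” (a space, then a hyphen) is not matched. The email pattern also allows a trailing dot, so an address followed directly by a full stop is captured with that dot included.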


ABOUT THE AUTHOR

This article has been written by the Technical Staff of YogiHosting. Check out other articles on "WordPress, SEO, jQuery, HTML" and more.