C# - extract source code from webbrowser control

Question

Sunday, May 3, 2015 11:21 PM

Good morning

I want to extract the html code from a web page and save it to a text file before it is displayed on the screen. Then I will edit it (remove some links) and set the new html page visible to the user.

Is it possible with c#?

Thank you.

All replies (13)

Monday, May 4, 2015 1:07 AM

Hello,

You can use WebClient.DownloadFile

https://msdn.microsoft.com/en-us/library/ms144194(v=vs.110).aspx



Monday, May 4, 2015 3:20 AM

Yes S7, you can easily do this with C#.

Here's the code I use to get the HTML of a webpage:

using System;
using System.Net;   // WebClient lives in System.dll, which is referenced by default
using System.Text;  // Encoding


        // Downloads the page at the given URL and returns its HTML ("" on failure).
        public string WebText(string url)
        {
            string html = "";

            if (string.IsNullOrEmpty(url))
                return "";

            try
            {
                using (WebClient client = new WebClient())
                {
                    // Some sites reject requests that carry no User-Agent header.
                    client.Headers["User-Agent"] = "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.12) Gecko/20100824 Firefox/3.5.12x";
                    client.Encoding = Encoding.UTF8;
                    html = client.DownloadString(url);
                }
            }
            catch (Exception ex)
            {
                // handle error
                Console.WriteLine(ex.Message);
            }

            return html.Trim();
        }

Once you get the html of the webpage, you can change the html as you wish.

Then save the html as a file.

Then open the file.
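
The save step above could be sketched as follows. This is only an illustration: HtmlFileSaver and SaveToTempFile are hypothetical names, and writing to the temp directory is an assumption, not something the reply specifies.

```csharp
using System;
using System.IO;
using System.Text;

public static class HtmlFileSaver
{
    // Writes the HTML to a file in the user's temp directory
    // and returns the full path of the file that was written.
    public static string SaveToTempFile(string html, string fileName)
    {
        string path = Path.Combine(Path.GetTempPath(), fileName);

        // A null string is treated as an empty document.
        File.WriteAllText(path, html ?? string.Empty, Encoding.UTF8);
        return path;
    }
}
```

The returned path could then be opened in the control with webBrowser1.Navigate(path), or in the default browser with System.Diagnostics.Process.Start(path).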

I hope this helps,
Andy



Monday, May 4, 2015 5:09 AM

Hi,

It seems you want to use the WebBrowser control in WinForms/WPF. The code below downloads the HTML content; you can then use a FileStream to save it as an HTML file.

// Requires: using System.IO; using System.Text; using System.Windows.Forms;
using (WebBrowser webBrowser = new WebBrowser())
{
    webBrowser.DocumentCompleted += CustomWebBrowser_DocumentCompleted1;
    webBrowser.Navigate(url);

    // Pump the message loop until the page has finished loading.
    while (webBrowser.ReadyState != WebBrowserReadyState.Complete)
    {
        Application.DoEvents();
    }
}

// Runs when a document finishes loading; saves its HTML to test.htm.
void CustomWebBrowser_DocumentCompleted1(object sender, WebBrowserDocumentCompletedEventArgs e)
{
    WebBrowser browser = sender as WebBrowser;
    if (browser == null)
        return;

    string source = browser.DocumentText;
    using (FileStream fs = new FileStream("test.htm", FileMode.Create))
    using (StreamWriter w = new StreamWriter(fs, Encoding.UTF8))
    {
        w.WriteLine(source);
    }
}

Thanks, Karikalan N


Monday, May 4, 2015 5:37 AM

In order to display the adjusted file, consider webBrowser.DocumentStream = File.OpenRead(…). If the new HTML is already loaded, then assign it to webBrowser.DocumentText.
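
A minimal sketch of the DocumentText route, assuming the page has already been downloaded as a string (for example with the WebText method earlier in this thread). LinkStripper and StripLinks are hypothetical names, and the regular expression is a deliberately naive stand-in for a real HTML parser: it is likely to mishandle nested or malformed anchors.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LinkStripper
{
    // Matches an <a ...>...</a> pair and captures the markup between the tags.
    private static readonly Regex Anchor = new Regex(
        @"<a\b[^>]*>(.*?)</a\s*>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Replaces every anchor element with its inner content,
    // so the link text stays visible but is no longer clickable.
    public static string StripLinks(string html)
    {
        if (string.IsNullOrEmpty(html))
            return string.Empty;

        return Anchor.Replace(html, "$1");
    }
}
```

The result could then be shown with webBrowser1.DocumentText = LinkStripper.StripLinks(html). Relative URLs in the page may stop resolving in that case, since the document is no longer loaded from its original address.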


Monday, May 4, 2015 11:53 PM

Sorry, I want to modify the page at runtime. It's a dynamic page. The HTML source contains a lot of links that I want to remove. But I can't save the page on hard disk and load it in webbrowser control, right?


Tuesday, May 5, 2015 3:45 AM

But I can't save the page on hard disk and load it in webbrowser control, right?

Hi s7evingra,

As far as I know, there is a method named WebBrowser.ShowSaveAsDialog.

You can call it like this:

// Displays the Save dialog box. 
private void saveAsToolStripMenuItem_Click(object sender, EventArgs e)
{
    webBrowser1.ShowSaveAsDialog();
}

This method can save the page in several formats, such as .mht, .html, or .txt.

Have a look at this article as well; I think it will be more helpful for you.

Convert any URL to a MHTML archive using native .NET code

Note: This response contains a reference to a third-party World Wide Web site. Microsoft is providing this information as a convenience to you.

Microsoft does not control these sites and has not tested any software or information found on these sites; therefore, Microsoft cannot make any representations regarding the quality, safety, or suitability of any software or information found there.

There are inherent dangers in the use of any software found on the Internet, and Microsoft cautions you to make sure that you completely understand the risk before retrieving any software from the Internet.

Best regards,

Kristin

We are trying to better understand customer views on social support experience, so your participation in this interview project would be greatly appreciated if you have time. Thanks for helping make community forums a great place.
Click HERE to participate the survey.


Tuesday, May 5, 2015 11:22 PM

Sorry again for my bad English. Let me try to explain. At runtime, I want to check whether the page that is being loaded (but is not loaded yet) contains a certain series of links. If it does, I want to remove them and show the user the page in the WebBrowser control without those links.


Wednesday, May 6, 2015 7:16 AM

Sorry again for my bad English. Let me try to explain. At runtime, I want to check whether the page that is being loaded (but is not loaded yet) contains a certain series of links. If it does, I want to remove them and show the user the page in the WebBrowser control without those links.

I am not completely sure what you are trying to achieve or what the real scenario is.

What you are dealing with is a bit complex, but it is not impossible with the WebBrowser control. The only WebBrowser event that helps here is DocumentCompleted, and it fires after the page has loaded.

In addition, there is no built-in method for this in the .NET Framework. A third-party library may help.

As far as I know, there is a third-party library named Html Agility Pack. You could use it to process the page however you need: for example, to remove links or change attribute values.

For example, here is how you would remove a specific href from the links in an HTML file:

// Requires: using HtmlAgilityPack;
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.Load("file.htm");  // or whatever HTML file you have
// SelectNodes returns null when nothing matches, hence the null check.
HtmlNodeCollection links = doc.DocumentNode.SelectNodes("//a[@href]");
if (links != null)
{
    foreach (HtmlNode link in links)
    {
        HtmlAttribute att = link.Attributes["href"];
        if (att.Value == "removewhichlinkwhatyouwant")
        {
            // Drops only the href attribute; the <a> element and its text stay.
            att.Remove();
        }
    }
}
doc.Save("file.htm");

If you run into issues with Html Agility Pack during development, consider posting them in the CodePlex forum for a more efficient response.

https://htmlagilitypack.codeplex.com/

Best regards,

Kristin

 



Wednesday, May 6, 2015 8:37 AM

I think you could use jQuery.

You can add scripts to the head of a page:

// IHTMLScriptElement comes from the mshtml interop assembly (add a reference to Microsoft.mshtml).
HtmlElement head = webBrowser1.Document.GetElementsByTagName("head")[0];
HtmlElement scriptEl = webBrowser1.Document.CreateElement("script");
IHTMLScriptElement element = (IHTMLScriptElement)scriptEl.DomElement;
element.text = "function sayHello() { alert('hello') }";
head.AppendChild(scriptEl);
webBrowser1.Document.InvokeScript("sayHello");

http://stackoverflow.com/questions/153748/how-to-inject-javascript-in-webbrowser-control

Use that to add a jQuery reference, then apply the technique described here:

http://stackoverflow.com/questions/20543194/remove-all-href-links-using-jquery

Hope that helps.



Wednesday, May 6, 2015 10:51 PM

Thank you for the reply. Using JavaScript may be the faster solution.
I have a problem, though: this code doesn't work:

HtmlDocument doc = webBrowser1.Document;
HtmlElement head = doc.GetElementsByTagName("head")[0];
HtmlElement s = doc.CreateElement("script");
s.SetAttribute("text", "function remlinks() {document.getElementById('onlyMobile').style.display = 'none';}");
head.AppendChild(s);
webBrowser1.Document.InvokeScript("remlinks");

If I use the following code instead, everything works:

s.SetAttribute("text", "function sayHello() { alert('hello'); }");
webBrowser1.Document.InvokeScript("sayHello");
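
One possible cause, offered as a guess rather than a diagnosis: if no element with the id 'onlyMobile' exists at the moment the script runs, getElementById returns null and the script raises an error, whereas the alert version touches no element at all. A sketch of the same idea using the managed DOM instead of injected script follows; LinkHider and HideLinks are hypothetical names, and hiding every hyperlink is an assumption about what the page needs.

```csharp
using System;
using System.Windows.Forms;

public static class LinkHider
{
    // Hides the element with the given id (when it exists) and every
    // hyperlink in the document. Returns how many elements were hidden.
    // Note: assigning Style replaces any inline style the element had.
    public static int HideLinks(HtmlDocument doc, string elementId)
    {
        if (doc == null)
            return 0;

        int hidden = 0;

        // GetElementById returns null when nothing matches, so guard it.
        HtmlElement target = doc.GetElementById(elementId);
        if (target != null)
        {
            target.Style = "display:none";
            hidden++;
        }

        foreach (HtmlElement link in doc.Links)
        {
            link.Style = "display:none";
            hidden++;
        }

        return hidden;
    }
}

// Possible call site, inside the DocumentCompleted handler:
//     LinkHider.HideLinks(webBrowser1.Document, "onlyMobile");
```

Because this runs in DocumentCompleted, the page may be visible briefly before the links disappear.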

Monday, May 11, 2015 9:40 AM

@s7evingra

Sorry, I am not familiar with JavaScript. I tested your code, and yes, alert('hello') works.

But when I try to remove links, it often raises an error, so it seems unstable. I don't know the page you are working with, though. Which link are you using? Do you get any error information?

I am afraid this approach is better suited to a web application.

For example:

http://www.devcurry.com/2010/06/remove-links-and-display-text-using.html

In addition, for questions related to JavaScript, I would suggest posting in the HTML, CSS and JavaScript forum.

Have a nice day!

Kristin



Monday, May 11, 2015 9:06 PM

No, I haven't resolved the problem.

The code that I posted doesn't work. Can you help me?

I put that code in the DocumentCompleted event handler of the WebBrowser control.


Tuesday, May 12, 2015 3:08 AM

@s7evingra

I edited my last reply. Did you receive the alert email? Please take a look. Thanks.
