How do I get all the links from HTML content using HtmlAgilityPack?

Question

Friday, May 22, 2015 9:54 AM

My code in Form1:

using System;
using System.Collections.Generic;
using System.ComponentModel;
using System.Data;
using System.Drawing;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using System.Windows.Forms;
using System.IO;
using System.Net;
using HtmlAgilityPack;

namespace Test
{
    public partial class Form1 : Form
    {
        HtmlWeb hw = new HtmlWeb();
        string htmlCode = "";
        List<string> htmls = new List<string>();

        public Form1()
        {
            InitializeComponent();

            using (WebClient client = new WebClient())
            {
                htmlCode = client.DownloadString("http://test.com");
            }

            HtmlAgilityPack.HtmlDocument doc = hw.Load("http://test.com");
            foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]"))
            {
                htmls.Add(link.Name);
            }
        }



        private void Form1_Load(object sender, EventArgs e)
        {

        }
    }
}

After this runs, the List htmls contains 253 items, and every one of them is the letter "a": index 0 is "a", index 1 is "a", and so on.

I don't get even one link.

I guess I need to use htmlCode. How do I use it?

If I pass htmlCode to hw.Load instead of the URL, I get an exception, because it is not a URI; it is the content I want to get the links from.

All replies (4)

Friday, May 22, 2015 10:09 AM ✅ Answered

The Name property is the tag name of the element, which is why every item in your list is "a". Add the value of the InnerText property instead, because that one contains the actual link text:

      HtmlAgilityPack.HtmlWeb hw = new HtmlAgilityPack.HtmlWeb();
      HtmlAgilityPack.HtmlDocument doc = hw.Load("http://blog.magnusmontin.net");
      List<string> htmls = new List<string>();
      foreach (HtmlAgilityPack.HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]")) {
        htmls.Add(link.InnerText);
      }
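
To make the difference between the three values concrete, here is a minimal sketch that reads Name, InnerText, and the href attribute of each link. It parses an HTML string with HtmlDocument.LoadHtml rather than a URL, which is also one way to use content that was already downloaded into a string such as htmlCode. The class LinkPropertiesSketch and the method Describe are hypothetical names used only for illustration.

```csharp
using System;
using System.Collections.Generic;

public static class LinkPropertiesSketch
{
    // Hypothetical helper: returns { Name, InnerText, href } for every <a href> in an HTML string.
    public static List<string[]> Describe(string html)
    {
        var doc = new HtmlAgilityPack.HtmlDocument();
        // LoadHtml parses markup held in a string; HtmlWeb.Load expects a URL instead.
        doc.LoadHtml(html);

        var rows = new List<string[]>();
        var links = doc.DocumentNode.SelectNodes("//a[@href]");
        if (links == null)
        {
            return rows; // SelectNodes returns null when no node matches
        }

        foreach (HtmlAgilityPack.HtmlNode link in links)
        {
            rows.Add(new[]
            {
                link.Name,                                   // tag name, "a" for every link
                link.InnerText,                              // visible link text
                link.GetAttributeValue("href", string.Empty) // link target
            });
        }
        return rows;
    }
}
```

Calling LinkPropertiesSketch.Describe(htmlCode) would then give one row per link, so you can pick whichever of the three values you actually need.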

Please remember to mark helpful posts as answer to close the thread and then start a new thread if you have a new question. Please don't ask several questions in the same thread.


Friday, May 22, 2015 10:22 AM

Magnus, I didn't get any links.

First, I want to make clear that I need the complete link addresses, for example:

http://www.test.com

Not just test or test.com, but the whole link.

With your solution I get 254 items, and all of them are only words. Many items contain \n\n\n\n and \r\r\r\r\r, and I didn't see a single link, only words that are part of the text.


Friday, May 22, 2015 10:36 AM

This works:

private void GetLinks()
{
    HtmlAgilityPack.HtmlWeb hw = new HtmlAgilityPack.HtmlWeb();
    HtmlAgilityPack.HtmlDocument doc = hw.Load("http://test.com");
    List<string> htmls = new List<string>();
    foreach (HtmlAgilityPack.HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]"))
    {
        // Read the link target from the href attribute.
        string hrefValue = link.GetAttributeValue("href", string.Empty);
        // Keep only links whose href contains both "http" and "attachment".
        if (hrefValue.Contains("http") && hrefValue.Contains("attachment"))
        {
            htmls.Add(hrefValue);
        }
    }
}
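
One thing to watch with the snippet above: the Contains("http") check skips relative links such as /files/attachment.zip, and SelectNodes returns null when the page has no matching links. A possible variation, as a sketch only, resolves every href against the page address with Uri.TryCreate so that the list holds complete addresses. The class AbsoluteLinksSketch, the method GetAbsoluteLinks, and the pageUrl parameter are hypothetical names, and dropping everything that is not http or https is an assumption about what counts as a link here.

```csharp
using System;
using System.Collections.Generic;

public static class AbsoluteLinksSketch
{
    // Hypothetical helper: returns absolute http/https addresses for all <a href> elements.
    // pageUrl is the address the HTML came from; relative hrefs are resolved against it.
    public static List<string> GetAbsoluteLinks(string html, string pageUrl)
    {
        var doc = new HtmlAgilityPack.HtmlDocument();
        doc.LoadHtml(html);

        var result = new List<string>();
        var baseUri = new Uri(pageUrl);
        var links = doc.DocumentNode.SelectNodes("//a[@href]");
        if (links == null)
        {
            return result; // no matching links on the page
        }

        foreach (HtmlAgilityPack.HtmlNode link in links)
        {
            string href = link.GetAttributeValue("href", string.Empty).Trim();
            Uri absolute;
            // Resolve relative hrefs; absolute hrefs pass through unchanged.
            if (Uri.TryCreate(baseUri, href, out absolute) &&
                (absolute.Scheme == Uri.UriSchemeHttp || absolute.Scheme == Uri.UriSchemeHttps))
            {
                result.Add(absolute.AbsoluteUri);
            }
        }
        return result;
    }
}
```

With the htmlCode string from the question, the call would look like AbsoluteLinksSketch.GetAbsoluteLinks(htmlCode, "http://test.com"). The "attachment" filter from the snippet above can still be applied to the returned list.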

Friday, May 22, 2015 10:44 AM

Yes, that is the way to do it if you want to add the value of each link's href attribute (the actual link target) to the list.

Please specify your requirements clearly in any future threads.

Please also remember to close your threads by marking helpful posts as answer and then start a new thread if you have a new question. Please don't ask several questions in the same thread.