Using regex on an img src

Question

Tuesday, December 16, 2014 8:11 PM

I have a C# web application where I am searching through some HTML and need to check all the images in the markup.

But if the image URL starts with a specific prefix (for example http://www.microsoft.com), I don't want the code to run.

Below is my starting point, which finds all img tags and reads everything within them.

Regex rgx = new Regex(@"<(img)\b[^>]*>", RegexOptions.IgnoreCase);

I'm not sure of the best way to move on to the next part, which checks the URL prefix. If it starts with http://www.microsoft.com, I want to exit the function; if it doesn't, I would like it to continue.

I have never really done much with regex yet.

All replies (12)

Thursday, December 18, 2014 11:16 PM ✅ Answered

How about this

        // Requires: using System; using System.Text.RegularExpressions;
        static void Main(string[] args)
        {
            string input =
     "<img src=\"http://www.microsoft.ca/newsletters/banner.png\" width=\"580\" height=\"120\" id=\"_x0000_i1025\" alt=\"Winter 2015\"  border=\"0/\" />\n" +
     "<img width=\"580\" height=\"120\" id=\"_x0000_i1025\" alt=\"Winter 2015\" src=\"http://www.google.ca/newsletters/banner.png\"  border=\"0/\" />";
            Console.WriteLine(GetUrls(input));

        }
        static string GetUrls(string input)
        {
            string output = "";
            // Group 1: the tag up to and including src=", URL: the src value,
            // group 2: the rest of the tag. Lazy matching keeps each match inside one tag.
            string pattern = "(<img\\b[^>]*?\\bsrc=\")(?'URL'[^\"]*)([^>]*>)";
            Regex rgx = new Regex(pattern, RegexOptions.IgnoreCase);
            MatchCollection matches = rgx.Matches(input);
            foreach (Match match4 in matches)
            {
                string URL = match4.Groups["URL"].Value;
                if (URL.StartsWith("http://www.microsoft."))
                {
                    // Microsoft image: keep the tag exactly as it is
                    output += match4.Value;
                }
                else
                {
                    // Any other image: keep the tag but swap in a placeholder src value
                    output += Regex.Replace(match4.Value, pattern, "$1//http/microsoft.com$2", RegexOptions.IgnoreCase);
                }
            }
            return output;
        }

jdweng


Tuesday, December 16, 2014 9:18 PM

Can you please give me a sample input text?

If you only want to exclude http://www.microsoft.com, then:

String strToExclude = "http://www.microsoft.com";

// webSource is the source code of your web page, or whatever you are trying to get data out of
String webSource = "";

// rgx is the Regex from your question; Match returns the first img tag it finds
String matchStr = rgx.Match(webSource).Value;

if (matchStr.Contains(strToExclude))
{
    // do whatever you want to do if it contains the Microsoft URL
}
else
{
    // run your code here if the excluded URL was not found
}


Wednesday, December 17, 2014 8:37 AM

Hi CKMock,

Why not use the String.StartsWith method?

It determines whether the beginning of this string instance matches a specified string.
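
As a minimal sketch of that suggestion, assuming the URL has already been pulled out of the tag: the class and method names here are made up for illustration, and the case-insensitive ordinal comparison is an assumption about what is wanted rather than something stated in the question.

```csharp
using System;

public static class PrefixCheck
{
    // Hypothetical helper: true when url begins with prefix.
    // OrdinalIgnoreCase avoids culture-specific comparison rules and
    // treats "HTTP://WWW.MICROSOFT.COM" like the lowercase form.
    public static bool HasPrefix(string url, string prefix)
    {
        return url != null && url.StartsWith(prefix, StringComparison.OrdinalIgnoreCase);
    }
}
```

StartsWith only looks at the very beginning of the string, so it has to be given the src value on its own. A whole tag begins with "<img" and would never match.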

Good day!

Kritsin


Wednesday, December 17, 2014 11:11 AM | 1 vote

Try something like the code below

            string input = "";  // put your HTML source here
            Regex rgx = new Regex(@"<(img)\b[^>]*>", RegexOptions.IgnoreCase);
            MatchCollection matches = rgx.Matches(input);
            foreach (Match match in matches)
            {
                if (match.ToString().StartsWith("http://www.microsoft.com"))
                {
                }
            }

jdweng


Thursday, December 18, 2014 2:55 AM

This is a sample of an img tag whose prefix I want to check:

<img src="http://www.microsoft.ca/newsletters/banner.png" width="580" height="120" id="_x0000_i1025" alt="Winter 2015"  border="0/" />

As you can see, it starts with http://www.microsoft.ca, therefore I don't want the function to continue. But if it's like the one below:

<img width="580" height="120" id="_x0000_i1025" alt="Winter 2015" src="http://www.google.ca/newsletters/banner.png"  border="0/" />

I want the function to continue.

I tried the example you provided with ".StartsWith", but it does not work.

I also want to mention that the position of the src attribute may vary from image to image. Notice that the first example has it at the beginning of the tag, whereas in the second tag the src attribute is near the end of the img element.

So the code has to be able to find and check the src prefix regardless of where it is in the tag.

Any feedback or suggestions are appreciated.
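
The requirement above (find the src value wherever it sits in the tag) could be sketched roughly as follows. ImgSrcFinder and GetSrcValues are hypothetical names, and the pattern assumes quoted attribute values. A regex is not a full HTML parser, so treat this as a starting point only.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ImgSrcFinder
{
    // Matches an <img ...> tag and captures the quoted src value wherever it
    // appears among the attributes. Single or double quotes are accepted;
    // \1 requires the closing quote to be the same as the opening one.
    private static readonly Regex ImgSrc = new Regex(
        @"<img\b[^>]*?\bsrc\s*=\s*([""'])(?<url>.*?)\1[^>]*>",
        RegexOptions.IgnoreCase);

    // Returns the src value of every img tag found in html, in document order.
    public static List<string> GetSrcValues(string html)
    {
        var urls = new List<string>();
        foreach (Match m in ImgSrc.Matches(html))
        {
            urls.Add(m.Groups["url"].Value);
        }
        return urls;
    }
}
```

Each returned value can then be tested with StartsWith("http://www.microsoft.") to decide whether that image should be skipped.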


Thursday, December 18, 2014 3:31 AM

I tend to prefer Joel Engineer's answer, but you might also check out Mathieu's FetchLinksFromSource Function posted over here on StackOverflow:

https://stackoverflow.com/questions/138839/how-do-you-parse-an-html-string-for-image-tags-to-get-at-the-src-information

If you reuse Mathieu's FetchLinksFromSource Function then you might be able to modify Joel Engineer's code as follows:

         string someHTML_sourceString = 
             "<img src='http://www.microsoft.ca/newsletters/banner.png' width='580' height='120' id='_x0000_i1025' alt='Winter 2015'  border='0/' />" + 
             "<b>some text</b>" + 
             "<img width='580' height='120' id='_x0000_i1025' alt='Winter 2015' src='http://www.google.ca/newsletters/banner.png'  border='0/' />";

         List<Uri> theLinks = FetchLinksFromSource(someHTML_sourceString);
         foreach (Uri someURI in theLinks)
         {
            // Dots are escaped so they match a literal "." and not any character
            if (Regex.IsMatch(someURI.AbsoluteUri, @"^https?://www\.microsoft\.ca"))
            {
               // Output message, Microsoft Canada is Awesome!!!
            }
         }
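
When the address is available as an absolute URL, another option is to compare the host name instead of pattern-matching the whole string. The sketch below is only an illustration: HostCheck and IsMicrosoftHost are made-up names, and treating any host that begins with "www.microsoft." as a Microsoft address is an assumption taken from the samples in this thread.

```csharp
using System;

public static class HostCheck
{
    // Hypothetical helper: true when url is an absolute http/https URL whose
    // host starts with "www.microsoft." (for example www.microsoft.ca).
    public static bool IsMicrosoftHost(string url)
    {
        Uri uri;
        if (!Uri.TryCreate(url, UriKind.Absolute, out uri))
        {
            return false; // relative or malformed URL
        }
        if (uri.Scheme != Uri.UriSchemeHttp && uri.Scheme != Uri.UriSchemeHttps)
        {
            return false; // ignore file:, mailto: and other schemes
        }
        return uri.Host.StartsWith("www.microsoft.", StringComparison.OrdinalIgnoreCase);
    }
}
```

Comparing the host avoids regex escaping issues and covers http and https in a single check.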

Thursday, December 18, 2014 5:55 AM

Try this

           string input = 
                "<img src=\"http://www.microsoft.ca/newsletters/banner.png\" width=\"580\" height=\"120\" id=\"_x0000_i1025\" alt=\"Winter 2015\"  border=\"0/\" />\n" +
                "<img width=\"580\" height=\"120\" id=\"_x0000_i1025\" alt=\"Winter 2015\" src=\"http://www.google.ca/newsletters/banner.png\"  border=\"0/\" />";

            Regex rgx1 = new Regex("<img.*[^>]>", RegexOptions.IgnoreCase | RegexOptions.Multiline);
            Regex rgx2 = new Regex("src=\"(?'URL'[^\"]*)", RegexOptions.IgnoreCase);
            MatchCollection matches = rgx1.Matches(input);
            foreach (Match imgMatch in matches)
            {
                // Pull the src value out of the img tag that was just matched
                Match match4 = rgx2.Match(imgMatch.Value);
                string URL = match4.Groups["URL"].Value;
                if (URL.StartsWith("http://www.microsoft."))
                {
                    Console.WriteLine("URL = {0}", URL); 
                }
            }

If you prefer one Regex then try this

            string input = 
                "<img src=\"http://www.microsoft.ca/newsletters/banner.png\" width=\"580\" height=\"120\" id=\"_x0000_i1025\" alt=\"Winter 2015\"  border=\"0/\" />\n" +
                "<img width=\"580\" height=\"120\" id=\"_x0000_i1025\" alt=\"Winter 2015\" src=\"http://www.google.ca/newsletters/banner.png\"  border=\"0/\" />";

            Regex rgx = new Regex("<img.*src=\"(?'URL'[^\"]*).*[^>]>", RegexOptions.IgnoreCase | RegexOptions.Multiline);
            MatchCollection matches = rgx.Matches(input);
            foreach (Match match4 in matches)
            {
                string URL = match4.Groups["URL"].Value;
                if (URL.StartsWith("http://www.microsoft."))
                {
                    Console.WriteLine("URL = {0}", URL); 
                }
            }

Or even this

           string input = 
                "<img src=\"http://www.microsoft.ca/newsletters/banner.png\" width=\"580\" height=\"120\" id=\"_x0000_i1025\" alt=\"Winter 2015\"  border=\"0/\" />\n" +
                "<img width=\"580\" height=\"120\" id=\"_x0000_i1025\" alt=\"Winter 2015\" src=\"http://www.google.ca/newsletters/banner.png\"  border=\"0/\" />";

            // Dots in the host name are escaped so they only match a literal "."
            Regex rgx = new Regex("<img.*src=\"(?'URL'http://www\\.microsoft\\.[^\"]*).*[^>]>", RegexOptions.IgnoreCase | RegexOptions.Multiline);
            MatchCollection matches = rgx.Matches(input);
            foreach (Match match4 in matches)
            {
                string URL = match4.Groups["URL"].Value;
                Console.WriteLine("URL = {0}", URL);
            }

Thursday, December 18, 2014 8:59 PM

OK, while waiting for answers I started on my function and tried to add some of the suggestions (by Joel Engineer), but now I am getting lost in how exactly to take what you are suggesting and add it to what I have.
Below is what I have.

private string GetBodyHtml(string htmlString)
{
    Regex rgx = new Regex(@"<(img)\b[^>]*>", RegexOptions.IgnoreCase);
    MatchCollection matches = rgx.Matches(htmlString);

    string img;
    for (int i = 0, l = matches.Count; i < l; i++)
    {
        string imgName = GetImageName(matches[i].Value);
        imageNames.Add(imgName);
        img = string.Format("<img src=\"cid:{0}\">", imgName);
        htmlString = htmlString.Replace(matches[i].Value, img);
    }
    return htmlString;
}

public string GetImageName(string imgSource)
{
    string src = XElement.Parse(imgSource).Attribute("src").Value;
    return Path.GetFileName(src);
}

So what I need incorporated into this is a check that confirms the src value of the img tag DOES NOT include the URL 'http://www.microsoft.com'.
If that URL is found, I want the src of that img tag to remain untouched and the code to move on to the next img tag and perform the same check. If the URL is not found, then proceed with the function listed above.

I'm hoping someone can help incorporate this.
Joel's first suggestion from the above post worked, but I'm not sure how to wrap it around the function I already have in place. I am getting lost in the lists and loops.
If someone could assist with adding Joel's code to this function, that would be great.


Thursday, December 18, 2014 9:18 PM

private string GetBodyHtml(string htmlString)
{
    Regex rgx = new Regex(@"<(img)\b[^>]*>", RegexOptions.IgnoreCase);
    MatchCollection matches = rgx.Matches(htmlString);

    string img;
    for (int i = 0, l = matches.Count; i < l; i++)
    {
        string imgName = GetImageName(matches[i].Value);

        // GetImageName returns only the file name, so read the full src value for the check
        string src = XElement.Parse(matches[i].Value).Attribute("src").Value;

        // add this check so that Microsoft URLs are not added to your imageNames list
        if (!src.StartsWith("http://www.microsoft.com"))
            imageNames.Add(imgName);

        img = string.Format("<img src=\"cid:{0}\">", imgName);
        htmlString = htmlString.Replace(matches[i].Value, img);
    }
    return htmlString;
}

I do not know what you want the GetBodyHtml method to return.

If you want to return an empty string when a Microsoft URL is matched, then you can use something like this:

private string GetBodyHtml(string htmlString)
{
    Regex rgx = new Regex(@"<(img)\b[^>]*>", RegexOptions.IgnoreCase);
    MatchCollection matches = rgx.Matches(htmlString);

    bool hasMicrosoftImage = false;
    string img;
    for (int i = 0, l = matches.Count; i < l; i++)
    {
        // remember whether any img tag points at the Microsoft website
        string src = XElement.Parse(matches[i].Value).Attribute("src").Value;
        if (src.StartsWith("http://www.microsoft.com"))
            hasMicrosoftImage = true;

        string imgName = GetImageName(matches[i].Value);
        imageNames.Add(imgName);
        img = string.Format("<img src=\"cid:{0}\">", imgName);
        htmlString = htmlString.Replace(matches[i].Value, img);
    }

    // returns an empty string when a Microsoft image was found
    return hasMicrosoftImage ? "" : htmlString;
}

One final thing: you can use the Contains method in the above code.

// instead of StartsWith you can use the Contains method

// that way you can skip both the http and https Microsoft websites

// if (!matches[i].Value.Contains("www.microsoft.com"))

There are alternative ways, but this will work too.


Thursday, December 18, 2014 9:20 PM

Sorry, I should have clarified: if the URL contains the Microsoft one, I want the URL and the whole img tag to remain the same, with nothing changed in that tag. Only change the tag if it does NOT contain the Microsoft.com URL.


Thursday, December 18, 2014 9:36 PM

    for (int i = 0, l = matches.Count; i < l; i++)
    {
        // check the whole tag text, because GetImageName returns only the file name
        if (matches[i].Value.Contains("http://www.microsoft.com"))
            continue; // Microsoft image: leave this tag unchanged

        string imgName = GetImageName(matches[i].Value);
        imageNames.Add(imgName);
        img = string.Format("<img src=\"cid:{0}\">", imgName);
        htmlString = htmlString.Replace(matches[i].Value, img);
    }

You can do something like that. I do not know what you mean by change. If this is the place where you change the value of the image tag, then this will do it.
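
Putting the pieces of this thread together, one possible shape for the whole operation is a single Regex.Replace call with a match evaluator. This is a sketch under assumptions, not the code from the accepted answer: CidRewriter, Rewrite and skipPrefix are hypothetical names, attribute values are assumed to be quoted, and img tags without a src attribute are simply left alone.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text.RegularExpressions;

public static class CidRewriter
{
    // Captures the quoted src value of an img tag, wherever it sits in the tag.
    private static readonly Regex ImgSrc = new Regex(
        @"<img\b[^>]*?\bsrc\s*=\s*([""'])(?<url>.*?)\1[^>]*>",
        RegexOptions.IgnoreCase);

    // Rewrites each matched img tag to <img src="cid:FILE">, except tags whose
    // src starts with skipPrefix, which are returned unchanged. File names of
    // the rewritten images are appended to imageNames.
    public static string Rewrite(string html, string skipPrefix, List<string> imageNames)
    {
        return ImgSrc.Replace(html, match =>
        {
            string url = match.Groups["url"].Value;
            if (url.StartsWith(skipPrefix, StringComparison.OrdinalIgnoreCase))
            {
                return match.Value; // leave this tag exactly as it was
            }
            string name = Path.GetFileName(url);
            imageNames.Add(name);
            return string.Format("<img src=\"cid:{0}\">", name);
        });
    }
}
```

Because the evaluator returns the original text for skipped tags, everything outside the rewritten tags passes through unchanged, and there is no need to call string.Replace on the whole document for each match.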