Question
Monday, April 27, 2009 1:12 AM
I'm trying to write a simple data scraping program, and one of the websites I'm trying to download from is returning a 404: Unknown error even though I know the URL exists. I'm using WebClient.DownloadString(string URL), and when I try a URL such as the following I get the error:
http://web1.ncaa.org/d1mfb/worksheet.jsp?year=2005&game=200500000000820050903.xml
If I write the URL string to a file before calling for the download, I can open the page from that file without any problems, so I know the URL does in fact exist. I am able to successfully download a URL that doesn't include an '=', '?', or '&', so I'm thinking it has some sort of problem with those characters, but I haven't been able to find anything to confirm this or work around it. Any help is appreciated.
All replies (7)
Monday, April 27, 2009 3:16 PM ✅Answered
Figured it out, I had to pass a cookie with a sessionid or else it failed.
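For anyone who hits the same wall: WebClient does not keep cookies between requests, so a session cookie set by the server (such as a JSESSIONID) is dropped on the next call. A minimal sketch of one common workaround is to subclass WebClient and attach a shared CookieContainer to every underlying HttpWebRequest — the class name CookieAwareWebClient here is my own, not part of the framework:

```csharp
using System;
using System.Net;

// WebClient discards cookies between calls. Routing every request through a
// shared CookieContainer makes the session cookie survive across downloads.
class CookieAwareWebClient : WebClient
{
    public readonly CookieContainer Cookies = new CookieContainer();

    protected override WebRequest GetWebRequest(Uri address)
    {
        WebRequest request = base.GetWebRequest(address);
        HttpWebRequest httpRequest = request as HttpWebRequest;
        if (httpRequest != null)
            httpRequest.CookieContainer = Cookies; // attach the shared cookie jar
        return request;
    }
}
```

Fetching the site's homepage once with this client should capture the session cookie; subsequent DownloadString calls on the same instance then send it automatically.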
Monday, April 27, 2009 2:12 AM
Took a quick peek with Fiddler to see the difference between the request made by the web browser and the one made by WebClient. It turns out the ncaa.org site gets unhappy when you do not supply a user-agent in your headers.
using System;
using System.Net;

static void Main(string[] args)
{
    WebClient wc = new WebClient();
    wc.Headers.Add("user-agent", "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; .NET CLR 1.0.3705;)");
    string s = wc.DownloadString(@"http://web1.ncaa.org/d1mfb/worksheet.jsp?year=2005&game=200500000000820050903.xml");
    Console.WriteLine(s);
}
That should fix it.
Monday, April 27, 2009 2:47 AM
Hmm, I'm still getting the same 404 answer even with that added. For the record, I can download some files from ncaa.org — I had no problem downloading the following page, for example, even before adding headers. I must admit I'm not all that knowledgeable about networking in general, so I'm not sure what differences, if any, there are between the two requests.
http://web1.ncaa.org/mfb/2005/Internet/worksheets/1200520050903.HTML
Monday, April 27, 2009 3:18 AM
Wow, they really don't like people scraping their site, eh? :)
I get a 404 now as well (while it was working before). If I fetch their homepage before fetching that second page, it seems to work again (but for how long?!)
using System;
using System.Net;

static void Main(string[] args)
{
    WebClient wc = new WebClient();
    wc.Headers.Add("user-agent", "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; .NET CLR 1.0.3705;)");
    string s = wc.DownloadString(@"http://web1.ncaa.org/");
    s = wc.DownloadString(@"http://web1.ncaa.org/d1mfb/worksheet.jsp?year=2005&game=200500000000820050903.xml");
    Console.WriteLine(s);
}
Monday, April 27, 2009 3:32 AM
That lets me scrape a couple of games before returning a 404, but I still don't get out of the A's; I go from Alabama to Arizona St.
Monday, April 27, 2009 3:42 AM
If it helps any, here's my full downloading code. Here's the basic flow it's trying to download, and it's giving me a 404 on step three every time:
1. http://web1.ncaa.org/mfb/2005/Internet/worksheets/DIVISION1.HTML (each week of the season)
2. http://web1.ncaa.org/mfb/2005/Internet/worksheets/1200520050903.HTML (each game of each week of the season)
3. http://web1.ncaa.org/d1mfb/worksheet.jsp?year=2005&game=200500000000820050903.xml
for (int y = 2005; y < 2009; y++)
{
    URL = "http://web1.ncaa.org/mfb/" + Convert.ToString(y) + "/Internet/worksheets/" + "DIVISION1.HTML";
    if (URL.Contains("2008"))
        URL = URL.Replace("B.HTML", "1.HTML"); // Replace returns a new string; the result must be assigned
    websiteText = Client.DownloadString(URL);
    // Scrapes webpage to find links to each week
    foreach (string weeklines in websiteText.Trim().Split(new string[] { "</tr>" }, StringSplitOptions.None))
    {
        if (weeklines.Contains("<tr><td>") && weeklines.Contains("stylesheet") == false)
        {
            partialURL = weeklines.Split('"')[1].Remove(0, 1);
            URL = "http://web1.ncaa.org/mfb/" + Convert.ToString(y) + "/Internet/worksheets" + partialURL;
            // Downloads page containing list of that week's games
            websiteText = Client.DownloadString(URL);
            // Scrapes that page to find links to each individual game
            foreach (string gamelines in websiteText.Trim().Split(new string[] { "</tr>" }, StringSplitOptions.None))
            {
                if (gamelines.Contains("xml"))
                {
                    partialURL = gamelines.Split('"')[1];
                    URL = "http://web1.ncaa.org" + partialURL;
                    sr = new StreamWriter(path + "urls.txt"); // note: overwrites the file on each iteration
                    sr.WriteLine(URL);
                    sr.Close();
                    Client.Headers.Clear();
                    Client.Headers.Add("user-agent", "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; .NET CLR 1.0.3705;)");
                    websiteText = Client.DownloadString(@"http://web1.ncaa.org");
                    // Attempts to download the xml data for a single individual game
                    // THIS IS WHERE FAILURE OCCURS
                    websiteText = Client.DownloadString(URL);
                }
            }
        }
    }
}
Monday, April 27, 2009 2:17 PM
Anyone else?