Tuesday, October 16, 2012 7:25 PM
Hi all,
I've got a routine that reads in a text file and imports the data to an SQL Server database. The text file consists of various sections following a format similar to the following:
- *File header
- *sec title1
- <data>
- *sec title1 footer
- *sec title2
- <data>
- *sec title2 footer
- *File footer
The file contains 5 sections of which I only use 3. I open the file, locate the first section of interest, import data until I find the end of section marker then proceed to locate and import the next section.
I ported this application from a previous VB6 app, so when I built the routines I used the first method I found. I am doing my file read with the following code:
Dim txtReader As System.IO.TextReader = System.IO.File.OpenText(strFile)
Dim strInBuf As String = txtReader.ReadLine()
Do Until strInBuf Is Nothing
    'Process data
    strInBuf = txtReader.ReadLine()
Loop
txtReader.Close()
I am now looking for ways to make this process faster. After doing some research I found other ways of processing text files, such as using StreamReader and StringReader. I found another way that uses an ADO object to read the text file. I've also encountered information which discusses reading a file line by line (as I have been doing) or reading it all at once.
I've read about the TextReader being an abstract class; however, I am not all that clear what that means. I built a sample app that reads a file using Stream and Text readers and the time difference seemed negligible, although it seemed as though StreamReader tended to be faster.
If I want to get as much speed as possible out of the file read process, which technique is recommended? Any assistance is appreciated. Thanks, Saga.
Insanity is the prelude to discovery
Tuesday, October 16, 2012 9:17 PM ✅Answered | 1 vote
It may be faster to read the entire file into an array and process the array, but I don't know how to tell.
Dim lines() As String = IO.File.ReadAllLines("C:\Users\John\Desktop\ClearFile1.txt")
For Each line As String In lines
'do what you want with each single line
Next
You've taught me everything I know but not everything you know.
Wednesday, October 17, 2012 7:08 AM ✅Answered | 1 vote
Saga,
Optimizing the reading of text files for speed is mostly a pointless exercise.
In the in-memory processing you can win a few nanoseconds; most of the time is spent getting the data from the device (whatever that is).
The only thing you can really improve is how easy your program is to maintain, but that does not mean more speed.
By ADO I assume you mean the OleDb approach, which can quickly turn so-called CSV (in Excel, Text) files into a DataTable. However, that does not really affect the speed either.
If we are talking about binary files such as images, there can be some performance gain, depending on the type of image, from reading the headers first. But a text file has no such readable header.
Success
Cor
Wednesday, October 17, 2012 3:36 PM ✅Answered
Frank, yes, please post your code. I'll use that as a base for the test app. Again, thanks! I will follow up when I have the test app functional. Saga
Sorry it took so long to get back to this.
I have the code (including usage) on a page of my website here.
I hope that helps. :)
Please call me Frank :)
Wednesday, October 17, 2012 6:21 PM ✅Answered | 1 vote
You need to find which step is taking the time - if the directory containing the files is on a remote server, then you will be limited to the speed of the network connection (~125MB/s on a Gigabit Ethernet link on a good day) and the speed at which the server can send the files if that is lower. One avenue to look at would be to copy the files locally before processing.
If the SQL Server is on a remote computer, you would want to minimise the network traffic to it, which you could do by sending all the updates for an input file in one go, e.g. by using table-valued parameters. That might also help if the database is local.
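For illustration, a minimal sketch of what a table-valued parameter call could look like from VB.NET (the table type dbo.ImportRowType and the stored procedure dbo.ImportRows are hypothetical names, and this needs SQL Server 2008 or later):

'Assumes the server already has a matching user-defined table type and procedure, e.g.:
'  CREATE TYPE dbo.ImportRowType AS TABLE (TerminalId INT, Reading NVARCHAR(50))
'  CREATE PROCEDURE dbo.ImportRows @Rows dbo.ImportRowType READONLY AS ...
Imports System.Data
Imports System.Data.SqlClient

Module TvpSketch
    Sub ImportBatch(rows As DataTable, connectionString As String)
        Using conn As New SqlConnection(connectionString)
            conn.Open()
            Using cmd As New SqlCommand("dbo.ImportRows", conn)
                cmd.CommandType = CommandType.StoredProcedure
                'One round trip carries the whole file's rows instead of one INSERT per row.
                Dim p As SqlParameter = cmd.Parameters.AddWithValue("@Rows", rows)
                p.SqlDbType = SqlDbType.Structured
                p.TypeName = "dbo.ImportRowType"
                cmd.ExecuteNonQuery()
            End Using
        End Using
    End Sub
End Module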
Also regarding SQL Server, if you determine that is a bottleneck, you may be able to speed it up by removing unnecessary indexes - adding or updating data also causes it to update the indexes.
If your program is using ~100% of a CPU core, you could try running the processing in threads (or use Tasks, if you're using a later version of .NET) so that you use more than one CPU core (if available).
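As a rough illustration only (assuming .NET 4 or later and that each file can safely be processed independently of the others), the per-file work could be spread across cores with Parallel.ForEach:

Imports System.IO
Imports System.Threading.Tasks

Module ParallelImportSketch
    Sub ImportAll(folder As String)
        'Each file is independent, so let the thread pool process several at once.
        Parallel.ForEach(Directory.GetFiles(folder, "*.txt"),
                         Sub(filePath)
                             ProcessOneFile(filePath) 'hypothetical per-file import routine
                         End Sub)
    End Sub

    Sub ProcessOneFile(filePath As String)
        'parse the file and write its rows to the database
    End Sub
End Module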
So, where to give it attention depends on where the bottleneck is.
HTH,
Andrew
Wednesday, October 17, 2012 7:58 PM ✅Answered | 1 vote
I would suspect that most of the time would be spent parsing the data. If you can parse the data into a DataTable then something like the SqlBulkCopy Class would be the most efficient way to save the data to SQL Server.
http://msdn.microsoft.com/en-us/library/ex21zs8x.aspx
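For illustration, a minimal sketch of SqlBulkCopy once the parsed rows are in a DataTable (the destination table name dbo.TerminalData is hypothetical, and the DataTable's columns are assumed to match it):

Imports System.Data
Imports System.Data.SqlClient

Module BulkCopySketch
    Sub SaveTable(table As DataTable, connectionString As String)
        Using conn As New SqlConnection(connectionString)
            conn.Open()
            Using bulk As New SqlBulkCopy(conn)
                bulk.DestinationTableName = "dbo.TerminalData"
                'Streams the whole DataTable to the server in a single bulk operation.
                bulk.WriteToServer(table)
            End Using
        End Using
    End Sub
End Module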
Paul ~~~~ Microsoft MVP (Visual Basic)
Wednesday, October 17, 2012 8:05 PM ✅Answered | 1 vote
Saga,
I agree with Andrew that his latest suggestion "can" be a very good approach, together with Frank's suggestion, which I would summarize as: find the problem first, before you start solving something that is not the problem.
I will add that it might help to create two programs: one which reads the files and simply puts them into a local disk file,
and one which puts those disk files into the database. That could make your problem much easier, and if it goes well you can start optimizing.
Success
Cor
Tuesday, October 16, 2012 9:51 PM
Saga,
I would use a TextFieldParser and read the file line by line: validate the fields it returns, then, once validated, do the rest of your work, entering the fields from that line into a new row of the database, and continue on to the next line.
If it stalls the main UI, then put it in a BackGroundWorker and show progress, but the built-in functionality of the TextFieldParser is hard to beat - whether there are or aren't faster methods (and if there are, I doubt it's of any real consequence).
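For illustration, a minimal sketch of reading a delimited file with the TextFieldParser (the comma delimiter and the validation step are assumptions; Frank's own sub, with usage, is linked elsewhere in the thread):

Imports Microsoft.VisualBasic.FileIO

Module ParserSketch
    Sub ReadFile(path As String)
        Using parser As New TextFieldParser(path)
            parser.TextFieldType = FieldType.Delimited
            parser.SetDelimiters(",")
            While Not parser.EndOfData
                Try
                    Dim fields() As String = parser.ReadFields()
                    'validate the fields here, then add a row to the database
                Catch ex As MalformedLineException
                    'skip (or log) lines that don't match the expected format
                End Try
            End While
        End Using
    End Sub
End Module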
I have sort of a "general" sub that I built a while back that I continue to use. I'll post it if you'd like?
For what it's worth...
:)
Please call me Frank :)
Tuesday, October 16, 2012 10:19 PM
I tried and timed two methods, your method and the method I posted, on a 3 MB file. You can see the results for yourself to the left of button1 (my method) and to the left of button2 (your method).
Imports System
Imports System.Diagnostics

Public Class Form1

    Private Sub Form1_Load(sender As Object, e As EventArgs) Handles MyBase.Load
    End Sub

    Private Sub Button1_Click(sender As Object, e As EventArgs) Handles Button1.Click
        Dim stopWatch As New Stopwatch()
        stopWatch.Start()
        Dim lines() As String = IO.File.ReadAllLines("C:\Users\John\Desktop\ClearFile1.txt")
        For Each line As String In lines
            'do what you want with each single line
            RichTextBox1.AppendText(line & vbCrLf)
        Next
        stopWatch.Stop()
        Dim ts As TimeSpan = stopWatch.Elapsed
        Label1.Text = ts.Duration.ToString
    End Sub

    Private Sub RichTextBox1_TextChanged(sender As Object, e As EventArgs) Handles RichTextBox1.TextChanged
    End Sub

    Private Sub RichTextBox2_TextChanged(sender As Object, e As EventArgs) Handles RichTextBox2.TextChanged
    End Sub

    Private Sub Button2_Click(sender As Object, e As EventArgs) Handles Button2.Click
        Dim stopWatch As New Stopwatch()
        stopWatch.Start()
        Dim txtReader As System.IO.TextReader = System.IO.File.OpenText("C:\Users\John\Desktop\ClearFile1.txt")
        Dim strInBuf As String = txtReader.ReadLine()
        Do Until strInBuf Is Nothing
            RichTextBox2.AppendText(strInBuf & vbCrLf)
            strInBuf = txtReader.ReadLine()
        Loop
        txtReader.Close()
        stopWatch.Stop()
        Dim ts As TimeSpan = stopWatch.Elapsed
        Label2.Text = ts.Duration.ToString
    End Sub

    Private Sub Label1_Click(sender As Object, e As EventArgs) Handles Label1.Click
    End Sub

    Private Sub Label2_Click(sender As Object, e As EventArgs) Handles Label2.Click
    End Sub

End Class
You've taught me everything I know but not everything you know.
Wednesday, October 17, 2012 12:21 AM
Mr. Monkeyboy and Frank...
Thanks for your replies and effort. I am going to add these routines to my test app to see how they behave with the file that I have.
Frank, yes, please post your code. I'll use that as a base for the test app. Again, thanks! I will follow up when I have the test app functional. Saga
Insanity is the prelude to discovery
Wednesday, October 17, 2012 8:30 AM
RichTextBox1.AppendText(line & vbCrLf)
That's likely to be the slow part. You can build up the string in a StringBuilder and then, after reading the file, set RichTextBox1.Text.
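Something along these lines, as a sketch against the test code above:

Dim sb As New System.Text.StringBuilder()
For Each line As String In IO.File.ReadAllLines("C:\Users\John\Desktop\ClearFile1.txt")
    sb.AppendLine(line)
Next
'One assignment to the control instead of one AppendText call per line.
RichTextBox1.Text = sb.ToString()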
--
Andrew
Wednesday, October 17, 2012 10:55 AM
I removed writing to RichTextBox1 from my previous code and changed both file reading methods to just clear the strings instead. The results were significantly faster, and from run to run the timings varied, sometimes favoring one method and sometimes the other.
Imports System
Imports System.Diagnostics

Public Class Form1

    Private Sub Form1_Load(sender As Object, e As EventArgs) Handles MyBase.Load
    End Sub

    Private Sub Button1_Click(sender As Object, e As EventArgs) Handles Button1.Click
        Dim stopWatch As New Stopwatch()
        stopWatch.Start()
        Dim lines() As String = IO.File.ReadAllLines("C:\Users\John\Desktop\ClearFile1.txt")
        For Each line As String In lines
            line = ""
        Next
        stopWatch.Stop()
        Dim ts As TimeSpan = stopWatch.Elapsed
        Label1.Text = ts.Duration.ToString
    End Sub

    Private Sub RichTextBox1_TextChanged(sender As Object, e As EventArgs) Handles RichTextBox1.TextChanged
    End Sub

    Private Sub RichTextBox2_TextChanged(sender As Object, e As EventArgs) Handles RichTextBox2.TextChanged
    End Sub

    Private Sub Button2_Click(sender As Object, e As EventArgs) Handles Button2.Click
        Dim stopWatch As New Stopwatch()
        stopWatch.Start()
        Dim txtReader As System.IO.TextReader = System.IO.File.OpenText("C:\Users\John\Desktop\ClearFile1.txt")
        Dim strInBuf As String = txtReader.ReadLine()
        Do Until strInBuf Is Nothing
            strInBuf = ""
            strInBuf = txtReader.ReadLine()
        Loop
        txtReader.Close()
        stopWatch.Stop()
        Dim ts As TimeSpan = stopWatch.Elapsed
        Label2.Text = ts.Duration.ToString
    End Sub

    Private Sub Label1_Click(sender As Object, e As EventArgs) Handles Label1.Click
    End Sub

    Private Sub Label2_Click(sender As Object, e As EventArgs) Handles Label2.Click
    End Sub

End Class
You've taught me everything I know but not everything you know.
Wednesday, October 17, 2012 5:37 PM
Yeah, interacting with the GUI is time consuming. When dealing with Windows Forms apps I usually try to avoid that unless I feel that the user really needs to know what is going on. Regards, Saga
Insanity is the prelude to discovery
Wednesday, October 17, 2012 5:55 PM
As always, thanks Cor.
The data source "device" for my app is a folder. As added information this folder is mapped in such a way that it allows portable terminals to connect to the LAN and upload data files (to the folder). This was never a problem when there were dozens of these terminals. My app, a Windows Service, would check the folder and if any files were found it would import them to the database.
Now there are hundreds of terminals and they are all uploading at approximately the same time (a 2-hour window, give or take). My application continues to work fine, but it is now being overwhelmed by so many files and is taking longer to import them. A speedy import is necessary because once the data is uploaded from a terminal there are people waiting to move on to the next step, which can't be done without the data. Sometimes they have to wait 5 minutes, but other times the waiting period rises to 20 minutes or more.
My app does basically two things: reads text files and inserts/updates the database. My next step is to analyze the DB access technique and code to see whether that can be optimized. I agree that making the file reads go faster may be pointless because the speed gains may be negligible; however, I at least need to assure my employer that the code I have in place is the best choice, hence this analysis.
Again, thanks Cor, and everyone else for their input. Much appreciated! Saga
Insanity is the prelude to discovery
Wednesday, October 17, 2012 6:13 PM
My app does basically two things: reads text files and inserts/updates the database. My next step is to analyze the DB access technique and code to see whether that can be optimized. I agree that making the file reads go faster may be pointless because the speed gains may be negligible; however, I at least need to assure my employer that the code I have in place is the best choice, hence this analysis.
I want to throw something out just as a "what if" - I don't know if this will make any difference at all, but it might be worth testing.
Create a class whose fields match the ones in your DataTable. Then create a class-level List(Of that class).
Use whatever method you want to read the file, then, per line, parse the fields and put them into the List by creating a new instance of the class and "filling in the blanks". Good or bad, just put them into the list.
That will then close that text file and now do your work in memory.
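For illustration, a minimal sketch of that idea (the class name, the field names and the comma delimiter are all hypothetical):

Imports System.Collections.Generic

'Hypothetical record whose properties mirror the DataTable's columns.
Public Class TerminalRecord
    Public Property TerminalId As Integer
    Public Property Reading As String
End Class

Public Class FileLoader
    'Class-level list that holds every parsed line.
    Private records As New List(Of TerminalRecord)

    Public Sub LoadFile(path As String)
        For Each line As String In System.IO.File.ReadAllLines(path)
            Dim fields() As String = line.Split(","c)
            If fields.Length >= 2 Then
                records.Add(New TerminalRecord With {
                    .TerminalId = CInt(fields(0)),
                    .Reading = fields(1)})
            End If
        Next
        'The file is now closed; all further work happens against the in-memory list.
    End Sub
End Class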
I don't know if that'll make one iota of difference but just to toss in something that might be worth at least testing.
I can put together an example if it interests you?
Please call me Frank :)
Wednesday, October 17, 2012 6:22 PM
Interesting. This is an approach which would not have occurred to me. Thanks.
Yes, I am interested in an example if it is not an inconvenience for you, thanks for offering. At this time I am building the test application with the code referenced in this thread. Regards, Saga
Insanity is the prelude to discovery
Wednesday, October 17, 2012 7:01 PM
Thanks Andrew! Yes, the bottleneck could be in my application, but it can also be else where, such as the connectivity infrastructure, location of data resources, SQL Server or some other element that I am presently missing. This is why I am doing this analysis. Once I've determined that the application is functioning as best as it can under the circumstances I can then proceed to suggest possible solutions in these areas.
For example, you mention copying the data file to a local resource. Yes, I am doing that. In my first round of optimizations I discovered that I was copying it twice: once to the work folder and once to the backup folder. I changed this so that the file is copied only once, to the backup folder, processed there and then renamed to .bak once it has been processed. The typical file size is around 1 - 2 kilobytes, and with about 5,000 files waiting to be processed, the copy operation was taking somewhere in the order of an hour :-S.
For that particular case I am sure that the problem is exactly what you mentioned about the files being on a remote server (most likely connected to the client PC via a T1, but I'm not really sure). But of course, the first suspect is always the application: "Slow app...", "Needs to be rewritten...". Maybe so, but I'd rather base such an extreme (and costly) conclusion on analysis and research :-).
Again, thanks for the input. I'll keep in mind your idea when it comes time to analyze the DB access component. Hmmm... I'll have to check what version of SQL Server they are using. I don't think I can use Table-Valued parameters if a version prior to 2008 is being used, but in fairness I suspect that I am ok with that. Regards, Saga
Insanity is the prelude to discovery
Wednesday, October 17, 2012 7:32 PM | 1 vote
Ah well, if getting the files in the first place is taking so long, maybe you could get the server to zip them up at, say, three-minute intervals and load those in your program. You're only looking at 10MB data in total, and a T1 line should be capable of ~150KB/s so the transfer should only take a little over a minute - but you may not have exclusive access to that bandwidth, so sending the data as aggregated zips could make it more efficient. You would then have a latency of three (or whatever) minutes, but if it speeds up the overall process then the savings could be worth it.
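As a rough sketch only: if the server can run .NET 4.5, the batching could use ZipFile from System.IO.Compression (the folder names are hypothetical; earlier frameworks would need a third-party zip library):

'Requires references to System.IO.Compression and System.IO.Compression.FileSystem (.NET 4.5+).
Imports System.IO
Imports System.IO.Compression

Module ZipBatchSketch
    Sub ZipPendingFiles(uploadFolder As String, batchFolder As String)
        Dim zipPath As String = Path.Combine(batchFolder,
            "batch_" & DateTime.Now.ToString("yyyyMMdd_HHmmss") & ".zip")
        'Bundle everything currently in the upload folder into one archive,
        'so the importer downloads a single file instead of thousands.
        ZipFile.CreateFromDirectory(uploadFolder, zipPath)
    End Sub
End Module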
--
Andrew
Wednesday, October 17, 2012 7:33 PM
Interesting. This is an approach which would not have occurred to me. Thanks.
Yes, I am interested in an example if it is not an inconvenience for you, thanks for offering. At this time I am building the test application with the code referenced in this thread. Regards, Saga
Insanity is the prelude to discovery
I'm assuming that you're talking to me - we don't all see the forum the same way and I don't know who you're referring to.
At any rate, this is what I've done: I took a text file which has one million lines of people's names (I created this a while back using a random name generator that I made). I then inserted some blank lines and then some garbage lines, because a fat lot of good any of this is if the data is invalid!
Here's an excerpt:
Daniel Smith
Paul Smith
Mark Smith
Donald Smith
';k';lkasf';oiasdg
George Smith
Kenneth Smith
Steven Smith
Edward Smith
The TextFieldParser automatically skips blank lines, and I modified one part of the method that I posted earlier to handle the garbage lines.
If you want to use it to test with for yourself, the text file is here, and is about 15 megs. The code for this is here and following are the results:
The reason that I show the number of lines is because as I showed earlier, it's more than a million lines since I introduced blank lines and garbage in it, but it left those out of the set and continued on.
I tend to think, as has been brought out here (including by you) that the real issue is elsewhere, but since you're testing - it's worth tossing this into the mix. ;-)
Please call me Frank :)
Wednesday, October 17, 2012 8:02 PM
I would suspect that most of the time would be spent parsing the data. If you can parse the data into a DataTable then something like the SqlBulkCopy Class would be the most efficient way to save the data to SQL Server.
http://msdn.microsoft.com/en-us/library/ex21zs8x.aspx
Paul ~~~~ Microsoft MVP (Visual Basic)
That's a good idea Paul - and Andrew has added many other good ideas about the process as a whole. He should have several good "finds" in this thread! :)
Please call me Frank :)
Wednesday, October 17, 2012 8:08 PM
Saga,
I agree with Andrew that his latest suggestion "can" be a very good approach, together with Frank's suggestion, which I would summarize as: find the problem first, before you start solving something that is not the problem.
I will add that it might help to create two programs: one which reads the files and simply puts them into a local disk file,
and one which puts those disk files into the database. That could make your problem much easier, and if it goes well you can start optimizing.
Success
Cor
Another good idea.
@Saga --> You'll have days of work to try out all this stuff - but it might just be worth the investment! :)
Please call me Frank :)
Wednesday, October 17, 2012 9:01 PM
I would suspect that most of the time would be spent parsing the data.
I /think/ we've found that the bottleneck is getting the data from the server in the first place. Processing the 10MB of data is probably not taking more than a speeding gnat's whisker of time. OP to confirm.
--
Andrew
Wednesday, October 17, 2012 11:45 PM
I think this is all way out of my scope but it sure is interesting reading. WOW. Went from speed reading .txt file to 'I'm lost'. :)
You've taught me everything I know but not everything you know.
Thursday, October 18, 2012 6:17 PM
Interesting. This is an approach which would not have occurred to me. Thanks.
Yes, I am interested in an example if it is not an inconvenience for you, thanks for offering. At this time I am building the test application with the code referenced in this thread. Regards, Saga
Insanity is the prelude to discovery
I'm assuming that you're talking to me - we don't all see the forum the same way and I don't know who you're referring to.
Sorry about that Frank. Yes, I was addressing you. I am now slowly processing all the information in this thread. Again, thanks for your time and assistance. Saga
Insanity is the prelude to discovery
Thursday, October 18, 2012 6:58 PM
Thanks Andrew. Zipping the files is an interesting idea that has been partially implemented. Some of the incoming files are getting compressed, but not the 1-2KB ones. The idea of splitting the app into two components would fit nicely with your idea of zipping a bunch of data files together.
I could have one app doing the zipping and moving of the files and a second app do the import. Now all I need is approval for this project :-). Regards, Saga
Insanity is the prelude to discovery
Thursday, October 18, 2012 6:59 PM
Frank,
"You'll have days of work to try out all this stuff "
Agreed.
"it might just be worth the investment"
Agreed! :-D Again thanks all for your input. Saga
Insanity is the prelude to discovery
Thursday, October 18, 2012 7:02 PM
That's a good idea Paul - and Andrew has added many other good ideas of the process as a whole. He should have several good "finds" in this thread! :)
Please call me Frank :)
Yes, a treasure trove indeed.
Paul, thanks for your suggestion. I will definitely look into it and the others posted here. Regards, Saga.
Insanity is the prelude to discovery
Thursday, October 18, 2012 7:04 PM
Mr. Monkeyboy, just a note to thank you for putting together the test app. Regards, Saga
Insanity is the prelude to discovery
Thursday, October 18, 2012 7:08 PM
Some of the incoming files are getting compressed, but not the 1-2KB ones.
Then I think you're missing part of the benefit.
It's not just for the sake of compression (although that helps too, obviously), it's to move fewer files.
Here's my point: Have you ever uploaded (or downloaded) say 10,000 small files? Compare that to a single file of the same size - try it!
There's an overhead associated with each file transfer (I assume it's doing error checking of some sort, but I don't know), so wrapping them into a single file is, itself, a help in the right direction.
:)
Please call me Frank :)
Thursday, October 18, 2012 7:10 PM
Thanks Andrew. Zipping the files is an interesting idea that has been partially implemented. Some of the incoming files are getting compressed, but not the 1-2KB ones. The idea of splitting the app into two components would fit nicely with your idea of zipping a bunch of data files together.
In case I didn't present the idea clearly, I should emphasize that it was to get the remote server onto which the files are uploaded to zip the files in batches. Your local program then downloads those zips instead of the text files, thus saving bandwidth.
--
Andrew
Thursday, October 18, 2012 8:49 PM
Thanks Andrew, actually your idea was presented clearly. I understood that the ZIPping occurs at the file server then a second app running elsewhere copies the ZIPped files and unzips them locally.
The partial implementation does just that, except that the files are compressed at their physical source, that is, at the portable terminal. Once my app has copied the file it uncompresses and processes it locally. Apologies if the two app comment above was confusing. Regards, Saga
Insanity is the prelude to discovery
Thursday, October 18, 2012 9:01 PM
Frank, I think I am missing much of the benefit! Actually, I have done the experiment that you mention. I've copied 3,000 small files and 100 large ones. The small files in total weigh in at about 2.26 MB while the total weight of the larger files is about 12 MB. The large files copy in about 6 minutes while the smaller files copy in about 45 minutes.
While I am also not familiar with the intricate details, I agree with you that there is overhead every time a file is copied, and it adds up. At this time only the large files are compressed (1.2 MB original size, 230 KB compressed) because that was part of the original design of the application. Obviously this is no longer working as well as it should. I see compressing the small files as a viable option (perhaps even a necessary one), although the corporate heads will have the final say on whether they want me to undertake this project. Thanks again, Saga
Insanity is the prelude to discovery
Friday, October 26, 2012 9:52 PM
Update: Hi all, I am now slowly researching the various ideas presented here; however, I ran into a small problem.
I just found out that the production servers are running SQL Server 2005. Basically I will have to rethink the solution, since any of the ideas discussed here that are specific to SQL Server 2008 will not work in the installed environment.
As far as file I/O goes, after some tests it seems that the "fastest" approach is reading the entire file at once, followed by using the StreamReader and then the TextReader, at least in my specific case. I also noticed that running the read process several times made the timings normalize. In the end, the timing difference for whichever file I/O method I use becomes negligible, as was mentioned in this thread.
While the bottleneck is due to an external cause (to the app) I will investigate further to verify that my app is performing as best as it can. I'd like to thank everyone for all assistance given. Saga
Insanity is the prelude to discovery