

Read CSV file to create object List

Question

Sunday, August 16, 2020 1:32 AM

What's the fastest way to read a CSV file delimited by | into an object List?

I have tried CsvHelper, which works fine for several thousand records, but a 3 GB CSV file takes hours. My requirement is to read it in the least possible time, as there are many such files to process every day.

Tried so far:

using (var reader = new StreamReader(csvFilePath))   // csvFilePath: path to the CSV file
using (var csv = new CsvReader(reader, CultureInfo.InvariantCulture))
{
    List<Customer> cList = csv.GetRecords<Customer>().ToList();
}

I can use any non-CsvHelper approach as well. Please let me know.

Thanks, AT

All replies (6)

Sunday, August 16, 2020 6:55 AM

I believe your real problem is the object materialization. Each CSV record is populated into a Customer object that must be instantiated, the object is populated from the CSV record, and the object is loaded into a List<T>. It takes time to do this for each CSV record. IMHO, the only thing that can help is a faster computer, preferably a standalone machine that does nothing but this processing when needed.


Sunday, August 16, 2020 8:27 AM

If you do not need all of the customers at once, check whether you can solve your problem without '.ToList()'. Try a loop like this:

foreach (Customer customer in csv.GetRecords<Customer>())
{
}

If this runs reasonably fast, then insert the code that processes each customer.
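For reference, a minimal sketch of that streaming loop, assuming CsvHelper with the pipe delimiter configured explicitly (CsvHelper defaults to a comma); Customer is the POCO from the question, while csvFilePath and ProcessCustomer are hypothetical stand-ins:

using CsvHelper;
using CsvHelper.Configuration;
using System.Globalization;
using System.IO;

// CsvHelper defaults to ",", so set the pipe delimiter explicitly.
var config = new CsvConfiguration(CultureInfo.InvariantCulture)
{
    Delimiter = "|"
};

using (var reader = new StreamReader(csvFilePath))   // csvFilePath: placeholder path
using (var csv = new CsvReader(reader, config))
{
    foreach (Customer customer in csv.GetRecords<Customer>())
    {
        // One record is materialized at a time; nothing accumulates in memory.
        ProcessCustomer(customer);   // hypothetical per-record handler
    }
}

Because GetRecords<T>() is lazy, memory stays flat no matter how large the file is; the time then goes only into parsing and into whatever ProcessCustomer does.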

If you want to consider some custom approach, then write a loop like this:

foreach (string line in File.ReadLines(csvFilePath))
{
}

and estimate its duration. Then insert the code that parses each line. string.Split('|') cannot always be used; it depends on whether '|' can appear inside the text fields.
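A sketch of that custom approach, under the assumption that '|' never appears inside a field (if it can, Split is unsafe and a real CSV parser is needed); the Id/Name column layout is hypothetical:

using System;
using System.IO;

foreach (string line in File.ReadLines(csvFilePath))   // streams line by line
{
    // Assumption: fields are never quoted and contain no embedded '|'.
    string[] fields = line.Split('|');

    var customer = new Customer
    {
        Id   = int.Parse(fields[0]),   // hypothetical column layout
        Name = fields[1]
    };

    // process the customer here instead of collecting everything into a List
}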


Sunday, August 16, 2020 10:47 AM | 1 vote

Hello,

Why not install SQL Server Express edition and import the files via BULK INSERT?

Create a permanent or temp table with columns matching what you are working with in the Customer class. The import is not going to take hours either.

BULK INSERT YourTableName FROM 'YourFileName'
WITH
(
    FIELDTERMINATOR = '|',
    ROWTERMINATOR = '\n'
)

Once the import is done, do whatever work needs to be done, then clear the rows of the permanent table or drop the temp table.

With your current method all the data is held in memory, and if something goes wrong all collected data is lost. That is a risk with any method, but with SQL Server it's less likely to happen.
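If the import should be kicked off from C#, a minimal ADO.NET sketch (the table name, file path, and connectionString are placeholders):

using System.Data.SqlClient;

const string sql = @"
BULK INSERT dbo.Customers              -- placeholder table
FROM 'C:\data\customers.csv'           -- placeholder path (must be visible to the server)
WITH (FIELDTERMINATOR = '|', ROWTERMINATOR = '\n')";

using (var connection = new SqlConnection(connectionString))
using (var command = new SqlCommand(sql, connection))
{
    connection.Open();
    command.CommandTimeout = 0;        // large files can exceed the 30-second default
    command.ExecuteNonQuery();
}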

Please remember to mark the replies as answers if they help and unmark them if they do not; this will help others who are looking for solutions to the same or similar problems. Contact via my Twitter (Karen Payne) or Facebook (Karen Payne) via my MSDN profile, but I will not answer coding questions on either.

NuGet BaseConnectionLibrary for database connections.

StackOverFlow


Sunday, August 16, 2020 11:09 AM


I've posted a number of CSV parsers on MSDN in the past 6 months.  The fastest way to do the job you describe depends entirely on the actual format of your CSV.

  • You say it's | (pipe) delimited, so you may not need to handle ' or " (single or double quotes), because it's unlikely that | occurs inside data fields. Leveraging that can make the parser faster (see the sketch after this list).
  • DA924x makes a good point: 3 GB text files are enormous, and you're pushing every line of such a monster into a managed-code List. There are better ways of reading such files, and it doesn't sound like CsvHelper is doing a good job here at all.
  • There probably isn't any solution, including compiled/native code, that will take less than 20-30 minutes per gigabyte to process any kind of delimited values.
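As an illustration of the first bullet, a hedged sketch of a quote-free fast path that slices each line as a ReadOnlySpan<char> instead of allocating arrays with string.Split (requires .NET Core 2.1 or later; the two-field Customer layout and csvFilePath are assumptions):

using System;
using System.IO;

foreach (string line in File.ReadLines(csvFilePath))   // csvFilePath: placeholder
{
    ReadOnlySpan<char> rest = line.AsSpan();

    // Assumption: every line has at least two pipe-separated fields.
    int cut = rest.IndexOf('|');
    ReadOnlySpan<char> idField = rest.Slice(0, cut);
    rest = rest.Slice(cut + 1);

    cut = rest.IndexOf('|');
    ReadOnlySpan<char> nameField = cut >= 0 ? rest.Slice(0, cut) : rest;

    var customer = new Customer            // hypothetical POCO from the question
    {
        Id   = int.Parse(idField),         // span overload avoids a temporary string
        Name = nameField.ToString()
    };

    // hand the customer off for processing here
}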

So the best way to help MSDN contributors help you is to post a sample (say 20-30 lines) of your actual CSV data.

Before you can learn anything new you have to learn that there's stuff you don't know.


Sunday, August 16, 2020 4:14 PM

Hi Aishwarya

If you need to insert into a database, then follow @Karen's post.

Otherwise, if you need to convert to objects, read the file line by line as suggested above.

Thanks and regards


Tuesday, August 18, 2020 10:25 AM

Dapper would help, using an OLE DB connection. It is actually quite simple: create a POCO matching the CSV field names and types, run it through the standard Dapper code, and it converts the rows nicely for you.

The trick is in the connection string, which uses an OLE DB text driver.
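A hedged sketch of that idea, assuming the Microsoft.ACE.OLEDB.12.0 text driver is installed; note that the Data Source is the folder, the file name acts as the table name, and a pipe delimiter normally requires a schema.ini next to the CSV containing Format=Delimited(|):

using System.Data.OleDb;
using Dapper;

var connectionString =
    @"Provider=Microsoft.ACE.OLEDB.12.0;Data Source=C:\data;" +   // placeholder folder
    @"Extended Properties=""text;HDR=Yes;FMT=Delimited""";

using (var connection = new OleDbConnection(connectionString))
{
    // Dapper extends any IDbConnection; it maps columns onto the Customer POCO.
    var customers = connection.Query<Customer>("SELECT * FROM [customers.csv]");
}

Note that Query<T> buffers results into a list by default; for a 3 GB file, pass buffered: false so rows stream one at a time.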