Question
Monday, September 26, 2011 4:16 PM | 1 vote
In applications such as genomics it is common to work with extremely large sets of data. For example, the human genome consists of 3 billion nucleotides. It is common to divide the genome into short words; depending on the application these words can range in length from 3 to a few tens (e.g. 20) of nucleotides. It is useful to build custom in-memory indexes of these words, which also number about 3 billion for the human genome. Other applications include various data-mining scenarios that involve very large amounts of medical, financial, and other data.
I have been conducting simple tests with 64-bit (x64 target platform) C# applications. Very simply, in each test I instantiate a List<something>, where something is either a class or a struct. The class or struct contains one or more int or long fields. I then create an infinite loop to see how many objects I can Add to the list. I enclose the loop in a try block and catch System.OutOfMemoryException. I will include the code below.
Results: I use a System.Diagnostics.Stopwatch to time how long it takes to add every 1000 objects. For the first 1,073,741,000 objects I find that adding 1000 objects takes 2.3 milliseconds. Then, suddenly the time to add the next 1000 objects jumps to nearly 2 minutes. Then it takes nearly 9 minutes to add the next 1000 objects.
I never get an exception and the program happily continues to create objects and add them to the List. However, the amount of time to add them continues to increase.
Note that my laptop is a Thinkpad W510 with 14 GB of memory. Physical memory is clearly not an issue.
Here is the code that I tested with. The only thing I change for each test is the fields in struct s1. If I define s1 to be class s1, then interestingly I hit the System.OutOfMemoryException.
using System;
using System.Collections.Generic;
using System.ComponentModel;
using System.Data;
using System.Drawing;
using System.Linq;
using System.Text;
using System.Windows.Forms;
using System.Diagnostics;
using System.IO;
namespace MemTester1
{
    struct s1
    {
        public byte v1;
    }

    public partial class Form1 : Form
    {
        List<s1> testList = new List<s1>();
        long sz = 0;

        public Form1()
        {
            InitializeComponent();
        }

        private void test1ToolStripMenuItem_Click(object sender, EventArgs e)
        {
            try
            {
                StreamWriter w = new StreamWriter("diagnostics.txt");
                Stopwatch sw = new Stopwatch();
                sw.Start();
                while (true)
                {
                    s1 t = new s1();
                    testList.Add(t);
                    sz = testList.Count;
                    if ((sz > 1000000000) && ((sz % 1000) == 0))
                    {
                        w.WriteLine(sw.Elapsed.ToString() + ", " + sz.ToString());
                        w.Flush();
                    }
                }
            }
            catch (System.OutOfMemoryException)
            {
                MessageBox.Show(sz.ToString());
            }
        }
    }
}
David Cox, Ph.D.
All replies (11)
Monday, September 26, 2011 5:34 PM ✅ Answered
David,
A List<T> internally keeps an array to store the items. The maximum number of items you can store in a list is 2^31 - 1.
If your structure is 8 bytes, your first 1,073,741,000 items will occupy about 8 GB of memory. Your system will start to page at a certain point.
When a list is first instantiated, its initial capacity is about 4 items.
When you add an item and the array is full (see the Capacity property), its size is doubled: 8, 16, 32, 64, ...
The expansion of the array is not optimal. A new array is created, the contents of the old array are copied to the new array, and the old array is removed from the heap whenever the garbage collector decides.
At a certain point in time, the array will be, say, 1 GB.
A new array of 2 GB gets created and initialized with 0x00.
The old array is copied (1 GB).
The old array is removed (later) and the heap gets compacted, which means copying 2 GB.
Does that shed some light on the issues that you see?
You can instantiate a List<T>() with a desired capacity - see the constructor.
Finally - if s1 is a class, the instances are kept on the heap: one million objects to allocate on the heap, one million to garbage collect, plus heap compaction.
If you can use a struct for s1, that will be better.
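For illustration, here is a minimal sketch (not from the thread; the element counts are just example figures) that first logs each time the list's Capacity doubles and then pre-sizes a second list with the capacity constructor so that no reallocation or copying happens while adding:

using System;
using System.Collections.Generic;

struct s1
{
    public byte v1;
}

class CapacityDemo
{
    static void Main()
    {
        // Watch the backing array double as items are added: 4, 8, 16, 32, ...
        var list = new List<s1>();
        int lastCapacity = list.Capacity;
        for (int i = 0; i < 100; i++)
        {
            list.Add(new s1());
            if (list.Capacity != lastCapacity)
            {
                Console.WriteLine("Count {0}: Capacity grew to {1}", list.Count, list.Capacity);
                lastCapacity = list.Capacity;
            }
        }

        // Pre-sizing with the capacity constructor allocates the backing array once.
        // Capacity is an int, so it can never exceed 2^31 - 1 elements.
        var preSized = new List<s1>(100000000);   // example figure: 100 million one-byte structs
        for (int i = 0; i < preSized.Capacity; i++)
            preSized.Add(new s1());
        Console.WriteLine(preSized.Count);
    }
}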
Monday, September 26, 2011 4:30 PM
And your question is?
Maybe why using a struct doesn't hit an OutOfMemory exception?
Mitja
Monday, September 26, 2011 4:33 PM
No question. It was just an observation. I will add that my conclusion is that allocating large lists is not practical with C# on a 64-bit platform, regardless of the amount of memory available. However, if anyone can prove otherwise, I would be very interested to know how.
David Cox, Ph.D.
Monday, September 26, 2011 4:37 PM
I can't help you here, since I am on 32-bit Win7.
And I actually didn't even try something like this so far. Maybe someone else who has tried it can give you some other "answer" on this interesting issue.
Mitja
Monday, September 26, 2011 4:41 PM
If you have really large datasets you should use a database; that's what they're for.
Monday, September 26, 2011 5:30 PM
I suggest storing very large object data in an XML file, so you can serialize and deserialize it easily and keep it manageable.
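For what it's worth, a minimal sketch of that idea using XmlSerializer (the WordRecord type and file name are made up for illustration; for billions of records a streaming format or database would likely be more practical):

using System;
using System.Collections.Generic;
using System.IO;
using System.Xml.Serialization;

public class WordRecord                      // hypothetical record type, for illustration only
{
    public string Word;
    public long Position;
}

class XmlDemo
{
    static void Main()
    {
        var records = new List<WordRecord>
        {
            new WordRecord { Word = "ACGT", Position = 0 }
        };

        var serializer = new XmlSerializer(typeof(List<WordRecord>));

        // Serialize the list to disk...
        using (var writer = new StreamWriter("words.xml"))
            serializer.Serialize(writer, records);

        // ...and read it back.
        using (var reader = new StreamReader("words.xml"))
            records = (List<WordRecord>)serializer.Deserialize(reader);

        Console.WriteLine(records.Count);
    }
}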
Monday, September 26, 2011 5:39 PM
Yes, right you are. Thanks for pointing that out to me. For some reason I was thinking that the number I was getting was in the millions, not billions. Makes a bit of a difference.
David Cox, Ph.D.
Monday, September 26, 2011 5:45 PM
Is there any way you can generate the genome/nucleotides one by one, process them, etc., without necessarily storing all of them in a list/array?
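If that is an option, here is a minimal sketch of the streaming idea (the file name, word length, and ReadWords helper are all hypothetical), using an iterator so only one word is in memory at a time:

using System;
using System.Collections.Generic;
using System.IO;

class StreamingDemo
{
    // Yield fixed-length, non-overlapping words from a sequence file
    // without ever materializing all of them in a list.
    static IEnumerable<string> ReadWords(string path, int wordLength)
    {
        using (var reader = new StreamReader(path))
        {
            var buffer = new char[wordLength];
            while (reader.ReadBlock(buffer, 0, wordLength) == wordLength)
                yield return new string(buffer);
        }
    }

    static void Main()
    {
        long count = 0;
        foreach (var word in ReadWords("genome.txt", 20))   // hypothetical input file
            count++;                                        // process each word here instead of storing it
        Console.WriteLine(count);
    }
}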
Monday, September 26, 2011 6:02 PM
If you want a really large dataset in managed code, I suggest using the PersistentDictionary. http://managedesent.codeplex.com/wikipage?title=PersistentDictionaryDocumentation&referringTitle=Home
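A minimal sketch of how that might look, assuming the PersistentDictionary<TKey, TValue> API from the ManagedEsent project linked above (the key/value types and directory name are just examples):

using System;
using Microsoft.Isam.Esent.Collections.Generic;   // ManagedEsent

class PersistentDictionaryDemo
{
    static void Main()
    {
        // The dictionary is backed by an ESENT database in the given directory,
        // so it is not limited by the size of a single in-memory array.
        using (var words = new PersistentDictionary<long, string>("WordIndex"))
        {
            words[0] = "ACGT";                    // example entry
            Console.WriteLine(words[0]);
        }
    }
}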
Monday, September 26, 2011 8:44 PM
Hi David,
I see your issue; please let me point you to the Capacity of the list. If you look at the source code of the Add method, the list resizes its capacity whenever the backing array is full:
public void Add(T item)
{
    if (this._size == this._items.Length)
        this.EnsureCapacity(this._size + 1);
    this._items[this._size++] = item;
    ++this._version;
}
EnsureCapacity will double the capacity. Actually, I cannot see why this breaks at 1,073,741,824, but you can avoid the behavior by enlarging the Capacity of the list, e.g. to 2,000,000,000, before adding the items.
If you want to see what is going on under the hood, please look at the source code ;-)
I hope this is helpful, happy coding...
(By the way, resizing the capacity of List<T> will recreate the internal array and copy the items by calling Array.Copy(...).)
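For reference, the growth step is roughly equivalent to this simplified sketch (not the actual BCL source, just its general shape):

private void EnsureCapacity(int min)
{
    if (this._items.Length < min)
    {
        // Double the backing array (start at 4 when it is empty).
        int newCapacity = this._items.Length == 0 ? 4 : this._items.Length * 2;
        if (newCapacity < min)
            newCapacity = min;

        T[] newItems = new T[newCapacity];             // allocate the larger array
        Array.Copy(this._items, newItems, this._size); // copy every existing element
        this._items = newItems;                        // the old array becomes garbage
    }
}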
UPDATE:
I did not reload this page for a long time and did not see Gregory Adam's answer. Sorry, he is right.
"It's time to kick ass and chew bubble gum... and I'm all outta gum." - Duke Nukem
Tuesday, September 27, 2011 1:03 PM
I wish to take a moment to thank everyone for responding to my post. The information was very helpful. Thanks everyone!
David
David Cox, Ph.D.