Reading a file sequentially using multiple threads
Written by WATYF on Friday, 21 September 2007 (32284 hits)
Category: .NET Programming

So I took a dive (head first, of course) into a new area of programming this week: the wonderful world of uber-large file processing. Now, I've worked with large files before... don't get me wrong... but I'm a database guy. Most of my file experience involves figuring out ways to get data into a database quickly and efficiently (and I've gotten fairly adept at it, if I do say so myself :P). Recently, however, I needed to write some code to process a file, and as usual, my first assumption was that I would just load it into a db (using one of the bagillion methods at my disposal), run a few SQL statements against the table, and be done with it. Then I found out that the file was almost 2GB. :O The problem with that is twofold: 1) Importing a 2GB file into a database would take a really freaking long time, and then you'd still have to run queries on it to do your processing, which would take even longer. And 2) putting the file in a db would have the net effect of doubling the file size on disk, because now it exists as the original file and as records in your db. And since most db's aren't super efficient at storage, chances are it'll take up more than 2GB in the db. So you're looking at turning your 2GB file into 4-6GB (if not more).

So anywhoo... once I realized that a database was not the way to go, I had to figure out how to process this huge monster in an efficient manner...


It took a bit (and by "a bit", I mean "days of constant trial and error"), but I eventually worked up a method that was fast enough for my tastes. Lots of tweaking was involved, trying out different types of objects and methods to see which one worked the fastest. That's the worst thing about large file processing, by the way... every little thing you do matters. Are you gonna empty that string by setting it to null or use a StringBuilder? Are you gonna use an ArrayList or a Hashtable? Or how about a DictionaryEntry, or maybe a SortedList? Which one is faster? And what about the buffer size for reading the file? Should you keep it large or small or variable or blah blah blah blah blah and so on and so forth. Basically, it involves approaching almost every line of your code in as many ways as you can think of, to make sure you found the fastest way to get that line done, since you're gonna be looping through that line about 50 bagillion times. :P

But enough about the annoyances of large file processing... you're here for something else. You're here because even after you find the fastest possible way to loop through that file, there's still one more thing you can do to make it faster: Multi-Threading.

Of course, that leads us to a problem... how do you read from the same file using more than one thread? Is that even possible? Wouldn't that cause the universe to implode or something like that? I mean, it's a simple StreamReader... you call ReadLine, it consumes that line and you move on to the next line... how are you gonna do that from more than one thread?

Here's how. :)

First off, we need to make some public variables. In this example, I'm using public variables for simplicity, but you could make them properties in a Class (so long as you place them in the proper scope, so that the threads can access them). The variables that you'll need depend on what you want to do, so I'll use my scenario as an example. I needed to load some data into an array, then compare that array against our ginormous file. Then, based on whether or not a match was found in the small array, I needed to write some output to a file. That, by the way, introduces another issue... multi-threaded file output. Now we don't just need to read from a file using multiple threads, but all those threads have to share the same file for writing their output. So, with that in mind, here are the variables we'll need.

Dim trHuge As TextReader
Dim twResults As TextWriter
Dim arrComp As ArrayList

Here we have the TextReader that we're going to use to read the huge file, the TextWriter that will be used for output, and the ArrayList which will hold the values needed to compare against the huge file. A simple array would be faster (assuming you knew how large the array would be each time), but you can't synchronize regular arrays for multi-threading, so we have to use an ArrayList here. You also might be wondering why I used TextReader/Writer instead of StreamReader/Writer. StreamReaders/Writers are more flexible, but again (like the arrays), they can't be synchronized for multi-threading (for whatever reason).

So, how do you synchronize these objects for multi-threading? You instantiate them by calling the "Synchronized" method of the type of object that you're creating.

        Dim fs As FileStream = New FileStream("C:\My Files\BigFile.txt", FileMode.Open, FileAccess.Read, FileShare.Read)
        trHuge = TextReader.Synchronized(New StreamReader(fs))

Now we have a reader that's ready for multi-threaded processing. So let's create the writer.

        fs = New FileStream("C:\My Files\OutputFile.txt", FileMode.Create, FileAccess.Write, FileShare.Write)
        twResults = TextWriter.Synchronized(New StreamWriter(fs)) 

I'm reusing the "fs" variable here for the demo, but you can set that up more appropriately in your own code, if you so desire. At this point, we have a synchronized reader and writer, both created using the TextReader/Writer.Synchronized function. If you had declared trHuge or twResults as a StreamReader or StreamWriter, that function would have failed, which is, again, why we're using TextReader/Writer.

Another thing to note is the absence of the FileOptions.Asynchronous argument in the FileStream creation. There's an argument that you can use when creating a FileStream that allows you to indicate certain attributes of how the stream will be handled, such as "Random Access", "Sequential Scan", etc. One of those options is "Asynchronous", which you would think would be needed for accessing the file asynchronously. At first, I had that option enabled, but in my speed tests, I found that the loops ran faster if I didn't use that option AND that there were no problems accessing the file from multiple threads even with that option "turned off". So I'm not sure what the use is for FileOptions.Asynchronous, but it doesn't appear to be necessary for multi-threaded file I/O (so long as you use the Synchronized function to create the reader/writer). If anyone happens to know what its use is, or if there is some kind of unforeseen danger that will befall me for leaving it off, please let me know.
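For reference, here's roughly what it looks like if you do want to pass that flag. The FileStream constructor overload that takes a FileOptions value also requires an explicit buffer size (4096 here is just a typical default, not a magic number). For what it's worth, FileOptions.Asynchronous mainly matters when you use the asynchronous BeginRead/EndRead APIs, not when multiple threads make ordinary blocking ReadLine calls, which would explain why leaving it off caused no problems here.

        'Same reader as before, but with the Asynchronous hint passed in.
        'In my tests this was no faster (and slightly slower) for plain
        'sequential, line-by-line reading from multiple threads.
        Dim fs As FileStream = New FileStream("C:\My Files\BigFile.txt", _
             FileMode.Open, FileAccess.Read, FileShare.Read, _
             4096, FileOptions.Asynchronous)
        trHuge = TextReader.Synchronized(New StreamReader(fs))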

After creating the file objects, you create your synchronized ArrayList.

        arrComp = ArrayList.Synchronized(New ArrayList)

There's still one more thing to do, though. You see, when you use a Synchronized Reader/Writer, the loop runs a bit slower. "So why would I use it, then?", you ask? Because more than one loop will be running at the same time, so the net time is faster. For example, let's say you're looping through 800,000 records, and that takes 1500 milliseconds using a standard StreamReader. But with a Synchronized TextReader, that loop may take 1900 milliseconds. After a while (i.e. hundreds of millions of records) that difference is gonna add up. But if two loops are running at 1900ms vs. one loop running at 1500ms, you're getting more done in less time. Do the math: one plain loop moves 800,000 records in 1.5 seconds, or about 533,000 records/sec, while two synchronized loops together move about 2 x (800,000 / 1.9) ≈ 842,000 records/sec (assuming both threads stay busy, which they mostly do).

This causes a problem, though... what if the user doesn't have a dual-core (or quad-core or whatever) system? Our code will run slower, since we used a Synchronized TextReader, and since they can only make use of one thread at a time. In that case, we check to see how many CPUs the user has, and if they only have one, we use a standard, non-synchronized reader. Like so...

     Dim iCPU As Integer = System.Environment.ProcessorCount
     Dim fs As FileStream = New FileStream("C:\My Files\BigFile.txt", FileMode.Open, FileAccess.Read, FileShare.Read)
     If iCPU = 1 Then trHuge = New StreamReader(fs) Else trHuge = TextReader.Synchronized(New StreamReader(fs))

Now that we have all of the ingredients, we can write the procedures that will employ them.

 

'(This assumes Imports System.IO, System.Collections, and System.Threading at the top of the file)
Dim trHuge As TextReader
Dim twResults As TextWriter
Dim arrComp As ArrayList

Sub BigFileProcess()

     Dim iCPU As Integer = System.Environment.ProcessorCount
     Dim bfProc As BigFileProcess
     Dim thrProc(iCPU - 1) As Thread
     Dim x As Integer

     'Load data into the ArrayList
     '(I'm assuming you know how to do that :P)
     If iCPU = 1 Then arrComp = New ArrayList Else arrComp = ArrayList.Synchronized(New ArrayList)
     Do While True 'placeholder: loop over whatever your data source is
          arrComp.Add("whatever")
     Loop

     'Create the TextReader
     Dim fs As FileStream = New FileStream("C:\My Files\BigFile.txt", FileMode.Open, FileAccess.Read, FileShare.Read)
     If iCPU = 1 Then trHuge = New StreamReader(fs) Else trHuge = TextReader.Synchronized(New StreamReader(fs))

     'Create the TextWriter
     fs = New FileStream("C:\My Files\OutputFile.txt", FileMode.Create, FileAccess.Write, FileShare.Write)
     If iCPU = 1 Then twResults = New StreamWriter(fs) Else twResults = TextWriter.Synchronized(New StreamWriter(fs))

     'For however many CPUs they have, create that many new processes
     'First you create a new instance of the Class that is used to do the processing
     'Then you create a thread at the AddressOf the Class's method.
     For x = 0 To iCPU - 1
          bfProc = New BigFileProcess
          thrProc(x) = New Thread(AddressOf bfProc.BFProc)
          thrProc(x).Start()
     Next 

     'Now join each thread so that you can wait until they're all done
     For x = 0 To iCPU - 1
          thrProc(x).Join()
     Next

End Sub 

Public Class BigFileProcess

     Sub BFProc()

          'And here's where we process the file.
          Dim sTmp As String = trHuge.ReadLine
          Do While Not sTmp Is Nothing
               'Do some kind of comparison against the array
               If Not arrComp.Contains(sTmp) Then
                    'Write the record to the output file if it isn't found in the array
                    twResults.WriteLine(sTmp)
               End If
               sTmp = trHuge.ReadLine
          Loop 

     End Sub

End Class

And now you have a huge file being read by multiple threads at the same time. For the record, this demo (on top of not having been debugged in any way, shape, or form) is not the best way to read from a huge file. In my actual code, I loaded chunks of the file at a time into an array and used that in the threaded process. This code is just to give you the general idea of how to work with reading and writing files in a multi-threaded environment... making it as efficient as possible is up to you. :P
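For the curious, here's a rough sketch of what that chunked variant could look like — these are hypothetical names, not the actual code. Each worker pulls a batch of lines from the shared synchronized reader into a local ArrayList, then does all of its comparisons against the local copy, so the threads spend less time contending over trHuge per line of work:

     'Alternate worker method: read a chunk of lines at a time, then
     'process the chunk without touching the shared reader.
     'Relies on trHuge being a Synchronized TextReader, so each
     'ReadLine call is thread-safe on its own.
     Sub BFProcChunked()
          Const ChunkSize As Integer = 1000
          Dim chunk As New ArrayList
          Dim sTmp As String
          Dim i As Integer
          Do
               chunk.Clear()
               For i = 1 To ChunkSize
                    sTmp = trHuge.ReadLine
                    If sTmp Is Nothing Then Exit For
                    chunk.Add(sTmp)
               Next
               If chunk.Count = 0 Then Exit Do
               'Now compare the whole chunk locally
               Dim sLine As String
               For Each sLine In chunk
                    If Not arrComp.Contains(sLine) Then twResults.WriteLine(sLine)
               Next
          Loop
     End Sub

Note that with this approach each thread's chunk holds whatever lines it happened to grab, so it only works when each line can be processed independently (as in this filter-style example).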

 

WATYF

 

NOTE: I've had a couple of people comment elsewhere about how they're dubious that multi-threading would speed up a process that involves disk I/O. After all, the disk can only serve up the data so fast (and a filestream can only be read one line at a time), so if it's too slow to keep up with one thread, how will throwing two threads at it make any difference? Well, that's a very good question, and it brings to light the advantages of being a hack programmer. :P See, I didn't even stop to consider the technical aspects of disk I/O and hard drive speeds and limitations when trying out multi-threading. If I had, I might not have even bothered testing it out. I might have just thought about it and concluded that multi-threading couldn't possibly help. But I'm glad I was too ignorant to know all that technical crap, because as a result of multi-threading the code, processing a ~2GB text file went from 4.1 minutes to 3.0 minutes. How do I explain that (in light of the fact that disk I/O is a bottleneck and that two threads shouldn't be able to read from the disk at the same time)? Well, I think it's because after the ReadLine happens, there's a logic check that gets performed (in the above sample, that logic is searching through the ArrayList to see if it contains a value). While one thread is performing that logic check, the other thread is already reading from the file. So they basically trade off... one thread reads from the file while the other thread is doing its logic, and vice versa. Now, if your process is just reading from a file, I don't think multi-threading will help (but hey, don't let me stop you from trying it out). On the other hand, there aren't many reasons to just read from a file without actually doing something with what you've read.

ANOTHER NOTE: I've since found out that your ArrayLists don't need to be Synchronized if you only plan on reading from them. I tested it out and found that to be the case. But in my actual code, I was writing something to the ArrayLists, so I still needed to Sync them.

YET ANOTHER NOTE: There is another way to sync objects for multiple threads, which could be used in place of TextReader.Synchronized (in this example). It's called SyncLock. The advantage is that you can use it on any line (or block) of code, so there doesn't need to be a "Synchronized" function available for the object you plan on using. Any line that needs to be sync'ed for threading should be surrounded by SyncLock, and it will be thread-safe. So in my example, it would be used like so:

SyncLock GetType(BigFileProcess)

     sTmp = trHuge.ReadLine

End SyncLock

So theoretically, I could abandon using TextReader.Synchronized altogether and just surround any ReadLine statements in SyncLock. But in testing, I found TextReader.Synchronized to be faster than using SyncLock. It's still handy, because it's more flexible, and I actually ended up using it to make an Array sync'ed for threading elsewhere in my code.
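One small tip if you go the SyncLock route: locking on GetType(BigFileProcess) works, but the usual recommendation is to lock on a private shared object instead, so that nothing outside your class can accidentally grab the same lock (anyone can call GetType and lock on the Type object). A minimal sketch (lockObj is just a name I made up):

     Public Class BigFileProcess

          'Dedicated lock object, shared by all instances/threads
          Private Shared lockObj As New Object

          Sub BFProc()
               Dim sTmp As String
               SyncLock lockObj
                    sTmp = trHuge.ReadLine
               End SyncLock
               '...process sTmp as before...
          End Sub

     End Class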

 

Comments


Given that you haven't actually provided the 'real' code, your claims that multiple threads will speed up I/O remain untestable by others. I am skeptical that adding threads to an I/O operation would improve performance. Any such attempt would almost certainly be I/O bound, unless you are doing some serious number crunching between reads, or you have stumbled upon some happy coincidence having to do with the disk cache, in which case it might still be unreproducible on another machine.

Any chance of seeing the real code?

Posted by Robert, on 09/15/2009 at 18:58

I believe I covered most of that in my note. Yes, if you're just reading from disk without doing any processing of any kind, I don't see how this would speed anything up. But usually, the point of programming is to do something with data (not just read it), so having multiple threads read and process is what allows the process to speed up. While the first thread is reading, the second is processing, and when the first is done reading, it processes while the second is back to reading. So we have an overlap, which essentially gets us to the very edge of the I/O bottleneck. If we just had one thread reading and processing, we wouldn't be saturating the I/O, because the I/O would be getting a rest whenever the processing is happening. Add that processing up over the course of a 2GB file, and you have a lot of time spent processing while no I/O is actually happening. That's where trading off between multiple threads comes in.


As for the real code... it's proprietary (and I'm not the owner), so I can't post it. Really though, you should get the same results with this dummy code. It's essentially doing the same thing (read then process, etc).

Posted by WATYF, on 09/15/2009 at 19:51



© 2017 Musical Nerdery