I’m creating a program that parses a web page, follows its links, and then parses the next set of pages to eventually build a picture of an entire site. This means that as the program runs, more work is generated and more tasks can be launched to process each new page as it is discovered.
My original solution was simply to create code like this:
private void ProcessLink(string link)
{
    var page = GetPageInformation(link);
    var newLinks = GetNewLinks(page);

    foreach(var newLink in newLinks)
    {
        Action action = () => { ProcessLink(newLink); };
        Task.Factory.StartNew(action, TaskCreationOptions.AttachedToParent);
    }
}
The premise is simple enough: build a list of new links from a page, then start a new task for each of those links. Each new task is attached to the parent task (the task that launched it).
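As a minimal illustration of what AttachedToParent gives you (this example is separate from the crawler code): a task started with that option keeps its parent open, so waiting on the parent also waits for all of its attached children.

using System;
using System.Threading.Tasks;

class AttachedChildExample
{
    static void Main()
    {
        var parent = Task.Factory.StartNew(() =>
        {
            // Because the child is attached, the parent task does not
            // complete until the child has finished.
            Task.Factory.StartNew(
                () => Console.WriteLine("child done"),
                TaskCreationOptions.AttachedToParent);
        });

        parent.Wait(); // blocks until the parent *and* its attached child finish
        Console.WriteLine("parent done");
    }
}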
However, it soon became apparent that this was quickly getting out of control: I had no idea what was still waiting to be processed, and the same link could be queued up multiple times on different threads, and so on. I ended up putting so many mechanisms in place to prevent the code from processing the same page again on different threads that it was getting silly. For a small number of new tasks being launched, I’m sure that Task.Factory.StartNew() is perfectly suitable.
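Purely for illustration, one such guard might look something like the snippet below; this is a sketch of the sort of check involved, not code from the actual crawler.

using System.Collections.Concurrent;

public class VisitedLinks
{
    private readonly ConcurrentDictionary<string, bool> seen =
        new ConcurrentDictionary<string, bool>();

    // Returns true only for the first caller to claim this url;
    // every later attempt, from any thread, gets false.
    public bool TryClaim(string url)
    {
        return seen.TryAdd(url.ToLowerInvariant(), true);
    }
}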
I eventually realised that I was heading down the wrong path and needed to rethink my strategy altogether. I still wanted the code to be parallelisable, so that while I was waiting on one page I could be parsing and processing another, so I refactored it to this:
 1: public class SiteScraper
 2: {
 3:     private ConcurrentDictionary<string, ScraperResults> completedWork =
 4:         new ConcurrentDictionary<string, ScraperResults>();
 5:
 6:     private List<string> currentWork;
 7:
 8:     private ConcurrentQueue<string> futureWorkQueue =
 9:         new ConcurrentQueue<string>();
10:
11:     public void GetSiteInformation(string startingUrl)
12:     {
13:         currentWork = new List<string>();
14:         currentWork.Add(startingUrl.ToLowerInvariant());
15:
16:         while(currentWork.Any())
17:         {
18:             Parallel.ForEach(currentWork, item => GetPageInformation(item));
19:             BuildWorkQueue();
20:         }
21:     }
22:
23:     private void BuildWorkQueue()
24:     {
25:         currentWork = new List<string>(futureWorkQueue
26:             .Select(link => link.ToLowerInvariant()).Distinct()
27:             .Where(link => IsLinkToBeProcessed(link)));
28:
29:         futureWorkQueue = new ConcurrentQueue<string>();
30:     }
31:
32:     private void GetPageInformation(string url)
33:     {
34:         // Do stuff
35:         ProcessNewLinks(newLinks);
36:     }
37:
38:     private void ProcessNewLinks(IEnumerable<string> newLinks)
39:     {
40:         foreach (string url in newLinks.Where(l => IsLinkToBeProcessed(l)))
41:         {
42:             futureWorkQueue.Enqueue(url);
43:         }
44:     }
45:
46:
47:     // Other bits
48:
49: }
There is still some code to ensure duplicates are removed and not processed, but it has become much easier than before to debug and to know what has been processed and what is still to be processed.
The method GetSiteInformation (lines 11-21) handles the main part of the parallelisation. This is the key to this particular algorithm.
Before discussing what that does, I just want to explain the three collections set up as fields on the class (lines 3 to 9). The completedWork is a dictionary keyed on the url, where each value is an object graph representing the bits of the page we are interested in. The currentWork (line 6) is a list of the urls currently being processed. Finally, the futureWorkQueue is a queue of all the new links that are discovered, which feeds into the next iteration.
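The post never shows ScraperResults, so purely as an illustration it might hold something like the following; the properties here are guesses, not the real type.

using System.Collections.Generic;

// Hypothetical shape only – the real ScraperResults is not shown here.
public class ScraperResults
{
    public string Url { get; set; }
    public string Title { get; set; }
    public IList<string> OutboundLinks { get; set; }
}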
The GetSiteInformation method creates the initial list of currentWork and processes it using Parallel.ForEach (line 18). On the first iteration only one item will be processed, but it should result in many new links to process. A call to BuildWorkQueue then builds the work list for the next iteration, which is controlled by the while loop (lines 16-20). When BuildWorkQueue produces no new items for currentWork, the work is complete and the while loop exits.
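Kicking the whole thing off is then just a call to GetSiteInformation with a starting url, along these lines (the url here is only an example):

var scraper = new SiteScraper();
scraper.GetSiteInformation("http://www.example.com/");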
BuildWorkQueue is called once all the existing work has completed, and builds the new set of urls to be processed. The futureWorkQueue is the collection that was populated as the links were processed (see later). All the links are forced into lower case (while this may not be advisable for all websites, for my case it is sufficient), only distinct elements are kept as the futureWorkQueue could quite easily have been filled with duplicates, and finally a check is made to ensure that each link has not already been processed (lines 25-27).
During the processing of a specific URL (lines 32-36 – mostly not shown) new links may be generated. Each of these will be added to the futureWorkQueue (lines 40-43). Before enqueuing any link, a check is made to ensure it has not already been processed.
There are other bits of the class that are not shown, for example the IsLinkToBeProcessed method (which checks the domain, whether the link has already been processed, and so on) and the code that populates the completedWork.
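To give a rough idea of those missing pieces, a minimal sketch slotted into the SiteScraper class might look like this; the domain check, the StorePageInformation helper name and the way completedWork is populated are my assumptions for illustration, not the original code.

private bool IsLinkToBeProcessed(string link)
{
    // Stay on the site being scraped and skip anything already recorded.
    Uri uri;
    return Uri.TryCreate(link, UriKind.Absolute, out uri)
        && uri.Host.Equals("www.example.com", StringComparison.OrdinalIgnoreCase)
        && !completedWork.ContainsKey(link);
}

private void StorePageInformation(string url, ScraperResults results)
{
    // TryAdd is safe to call from inside Parallel.ForEach; if two threads
    // race on the same url only the first entry is kept.
    completedWork.TryAdd(url, results);
}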
In this version of the code it is much easier to see what has been completed and what is still to do (or at least, what has been found to do).