The right tool for the right job.
Part 1: Log classifier in C#

Preface:

In software development there are no silver bullets, yet we developers often want to marry a single solution. You may wonder why there are so many languages, libraries and frameworks when a big set of them overlaps in functionality; the answer is that each of them has a different approach, with its own pros and cons. Before picking any language/framework/library, you should check whether it is a fit for your project.

I’m a NodeJS fan, not gonna lie, but to demonstrate this I’ll put NodeJS into a bad scenario. This will be a two-part series: in the first part (this one) we’ll see a short introduction, then a scenario, and finally a solution in C# .NET Core.

In the second part, we’ll write NodeJS code to achieve the same thing.

Introduction:

When we want to use all the power our machine has, be it your good ol’ laptop, your powerful workstation, or a production server hosted in the cloud, parallel processing is the key to performance.

We achieve parallel processing by taking a task and dividing it into smaller sub-tasks that don’t depend on each other. Take as an example a fast food restaurant that has:

  • One person taking orders. (Taker)
  • Four chefs. (Chef 1..4)
  • Two delivery people. (Delivery 1..2)

First, we have one main task: running the restaurant and earning money. That task is not really one big, monolithic job; it is the overall goal we want to achieve, so to reach it we need to split it into several smaller tasks. To keep things simple (and because I’m a programmer, not a chef) we’ll say we need to take orders, cook them and deliver them.

Because we only have one person taking orders, orders can only be taken one at a time. But once an order has been taken, it depends only on itself and not on any other order, so for correctness it doesn’t matter whether we process orders one-by-one (sequentially) or several at a time (concurrently).

One-by-one: order #1 -> Taker 1 -> Chef 1 -> Delivery 1 -> order #2 -> Taker 1 -> Chef 1 -> Delivery 1

Can you imagine the guy on the phone? "Sorry, I cannot take your order, we need to deliver a previous order first, please wait."

I’m sure you’d never buy from that restaurant again; it is anything but FAST food: slow food at best, no food at worst.

So in the real world you’ll usually have things happening several at a time, like this:

order #1 -> Taker 1    | order #2 ... waiting for Taker 1.
order #1 -> Chef 1     | order #2 -> Taker 1
order #1 -> Delivery 1 | order #2 -> Chef 2
<done>                 | order #2 -> Delivery 2

But in reality things go even further: if the chef sees that they can prepare the salad, the fried chips and the meat, they’ll do all three at the same time. I’m sure you’ve seen it.

That’s the key to successful and efficient parallelism:

Identifying which things can be divided into smaller parts that are not dependent on each other.

Usually we need to synchronize at some point; a common pattern is:

t1 = thread1.start( task1 );
t2 = thread2.start( task2 );
t1.join();
t2.join();

So you’ll normally have to manage a thread pool, to avoid the overhead of creating and destroying threads, and then you need to wait for the tasks to finish at the right moment.
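
In C#, a minimal sketch of that pattern could look like the following (DoTask1 and DoTask2 are just placeholder methods, not part of the project); Task.Run hands the work to the runtime’s thread pool, so we don’t manage the threads ourselves:

// using System.Threading; using System.Threading.Tasks;
// Raw threads, mirroring the pseudocode above (DoTask1/DoTask2 are hypothetical):
var t1 = new Thread( () => DoTask1() );
var t2 = new Thread( () => DoTask2() );
t1.Start();
t2.Start();
t1.Join(); // Wait for task 1 to finish.
t2.Join(); // Wait for task 2 to finish.

// Or, letting the thread pool handle thread reuse for us:
var task1 = Task.Run( () => DoTask1() );
var task2 = Task.Run( () => DoTask2() );
Task.WaitAll( task1 , task2 ); // Synchronize: continue once both are done.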

With NodeJS, one could leverage several cores with Cluster, which spawns multiple processes that can communicate via IPC, or with Workers, which implement threads and can share the same memory.

But NodeJS is not the best environment for this kind of operation. Beyond asynchronous I/O (processing requests, connecting to databases, serving files, doing minimal data transformation, physical disk writes/reads that are limited by the hardware rather than the language), it is just not the right tool. That said, if you need it, you can use it without having to change either your language or your working environment.

But let’s take a look at C# (or .NET Core). This framework has a whole set of useful utilities to process big amounts of data without much trouble; it just works (if used right, of course).

And you’re more than done: you can do the same in NodeJS (or other similar environments), but not as easily as C# allows.

We’ll set up a scenario and look at several solutions; specifically, three of them:

  1. Sequential: Using a sequential, simple, day-to-day for-loop.
  2. Parallel.For: Using C# Parallel.For method.
  3. LINQ .AsParallel(): Using C# LINQ (PLINQ) extensions method AsParallel.

My computer setup is the following:

  • i7-9700K (8 cores/8 threads).
  • 32 GB RAM (3200 MHz).
  • NVMe/SSD only drives.

You’ll find the full code at this GitHub link.

So YMMV; also consider that most of the time you won’t be using dedicated hardware in the cloud, though you may have more cores/threads available.

Without further ado, let’s begin:

What is the task about?

The task consists of classifying an Android log file; that log file contains lines like this:

12-17 19:31:36.263  1795  1825 I PowerManager_screenOn: DisplayPowerStatesetColorFadeLevel: level=1.0
12-17 19:31:36.263  5224  5283 I SendBroadcastPermission: action:android.com.huawei.bone.NOTIFY_SPORT_DATA, mPermissionType:0
12-17 19:31:36.264  1795  1825 D DisplayPowerController: Animating brightness: target=21, rate=40
12-17 19:31:36.264  1795  2750 I PowerManager_screenOn: DisplayPowerState Updating screen state: state=ON, backlight=823
12-17 19:31:36.264  1795  2750 I HwLightsService: back light level before map = 823
12-17 19:31:36.264  1795  1825 D DisplayPowerController: Animating brightness: target=21, rate=40
12-17 19:31:36.264  1795  1825 V KeyguardServiceDelegate: onScreenTurnedOn()
12-17 19:31:36.264  1795  1825 I WindowManger_keyguard: onScreenTurnedOn()
12-17 19:31:36.264  1795  1825 D DisplayPowerController: Display ready!
12-17 19:31:36.264  1795  1825 D DisplayPowerController: Finished business...
12-17 19:31:36.264  2852  3328 D KeyguardService: Caller checkPermission fail
12-17 19:31:36.264  2852  3328 D KeyguardService: KGSvcCall onScreenTurnedOn.
12-17 19:31:36.264  2852  3328 D KeyguardViewMediator: notifyScreenTurnedOn
12-17 19:31:36.265  2852  2852 D KeyguardViewMediator: handleNotifyScreenTurnedOn
12-17 19:31:36.265  2852  2852 I PhoneStatusBar: onScreenTurnedOn
12-17 19:31:36.265  2852  2852 D KGWallpaper_Magazine: getNextIndex: 0; from 5 to 5; size: 44
12-17 19:31:36.265  2852  2852 I HwLockScreenReporter: report msg is :{picture: Deepwater-05-2.3.001-bigpicture_05_8.jpg}
12-17 19:31:36.265  2852  2852 W HwLockScreenReporter: report result = falsereport type:162 msg:{picture: Deepwater-05-2.3.001-bigpicture_05_8.jpg, channelId: 05}

Our goal is to read the log file and classify every log line by its category/type, which is defined as the single capital letter located before the process/activity name.

  • 1825 I WindowManger_keyguard... → Category is I.
  • 2852 D KeyguardViewMediator: handleNotifyScreenTurnedOn... → Category is D.

Every log type must be written to a separate log file, which must preserve the original order of the lines.

All possible category/types are: I , D , V , W , E , F.

Initial code setup:

Part of the main class, a static field that collects the output file paths:

static HashSet<string> filePaths = new HashSet<string>();

Some method helpers, part of the main class:

// Possible log types: "I" , "D" , "V" , "W" , "E" , "F"
static Dictionary<char , SlotType> GenSlots<SlotType>() where SlotType : new()
{
  return new Dictionary<char , SlotType>
  {
    { 'I' , new SlotType() },
    { 'D' , new SlotType() },
    { 'V' , new SlotType() },
    { 'W' , new SlotType() },
    { 'E' , new SlotType() },
    { 'F' , new SlotType() },
  };
}

/* Basically this will be the ""HEAVY"" logic of our program. 
    Yes, just a regex expression.
*/
static char GetLineIdentifier( string line )
{
  var matches = Regex.Matches( line , @"[\d\-\.\s]+([A-Z])" ); // Match the first capital letter that follows the leading digits/dashes/dots/whitespace.
  var identifier = matches[ 0 ].Groups[ 1 ].Value; // Take the captured group: the category letter.
  return identifier[ 0 ]; // Return it as a char (character #0).
}
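
For example, applied to one of the sample lines above, it returns the category letter (a quick illustrative call, not part of the project):

var sample = "12-17 19:31:36.264  1795  1825 V KeyguardServiceDelegate: onScreenTurnedOn()";
Console.WriteLine( GetLineIdentifier( sample ) ); // Prints: V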

In main method:

// Read the log file and store it into string[].
var lines = File.ReadAllLines( "Android.log" );

// Creates the directory if it doesn't exist, otherwise does nothing.
Directory.CreateDirectory( "./output" );

That was the common part; there are a few more details, and for those, refer to the source code.

1) Sequential day-to-day for-loop:
static void AsSequentialVersion( string[] lines )
{
  var slots = GenSlots<List<string>>();

  foreach( var line in lines )
  {
    var charIdentifier = GetLineIdentifier( line );
    slots[ charIdentifier ].Add( line ); // Store the line in the target slot.
  }

  var prefix = "__sequential";

  foreach( var entry in slots )
  {
    var path = $"./output/{prefix}_{entry.Key}.log";
    File.WriteAllLines( path , entry.Value );
    filePaths.Add( path );
  }
}
2) Employing Parallel.For:
static void ParallelForVersion( string[] lines )
{
  var slots = GenSlots<CBag>(); // Generate slots to store the classified log lines.

  Parallel.For( 0 , lines.Length , idx =>
  {
    var line = lines[ idx ];
    var charIdentifier = GetLineIdentifier( line );
    slots[ charIdentifier ].Add( (idx, line) ); // Store the line in the target slot.
  } );

  var prefix = "parallel_for";

  foreach( var entry in slots )
  {
    // Because neither Parallel.For nor ConcurrentBag guarantees ordering, we must sort by the original index before saving.
    var linesToWrite = entry.Value.ToList().OrderBy( v => v.Item1 ).Select( v => v.Item2 );
    var path = $"./output/{prefix}_{entry.Key}.log";
    File.WriteAllLines( path , linesToWrite );
    filePaths.Add( path );
  }
}
3) Using .AsParallel() from PLINQ:
static void AsParallelVersion( string[] lines )
{
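  // AsOrdered() makes PLINQ preserve the original line order, so no manual index sorting is needed in this version.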
  var slots = lines.AsParallel().AsOrdered().GroupBy( line =>
  {
    return GetLineIdentifier( line );
  } ).AsSequential();

  var prefix = "_as_parallel";

  foreach( var entry in slots )
  {
    var path = $"./output/{prefix}_{entry.Key}.log";
    File.WriteAllLines( path , entry.ToArray() );
    filePaths.Add( path );
  }
}
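
To compare them, each version can be timed from the main method with a Stopwatch harness along these lines (just a sketch; the actual measurement code is in the GitHub repository):

// Hypothetical timing harness calling the three methods above:
var sw = System.Diagnostics.Stopwatch.StartNew();
AsSequentialVersion( lines );
Console.WriteLine( $"Sequential: {sw.ElapsedMilliseconds}ms" );

sw.Restart();
ParallelForVersion( lines );
Console.WriteLine( $"Parallel.For: {sw.ElapsedMilliseconds}ms" );

sw.Restart();
AsParallelVersion( lines );
Console.WriteLine( $"AsParallel: {sw.ElapsedMilliseconds}ms" );
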
The results are the following:
Method         Run #1 - Time   Run #2 - Time
Sequential     6313ms          6332ms
Parallel.For   2333ms          1737ms
AsParallel     1762ms          1708ms

Basically Parallel.For and AsParallel perform about the same, at least for this test scenario. By implementing parallelism, we reduced the processing time by more than 70%.

We should take into account that even though our “heavy” logic is just a basic regex, we still get a significant boost in performance, and the boost will be greater the more logic we add. In a more realistic scenario we might split every log line into columns (date, type, message, etc.) and send them to a database for further inspection.

I don’t know the internal implementation of File.WriteAllLines(), but it seems to be close to the following:

  • Open a file handle in write/truncate/create mode (w+).
  • Write line by line, appending an EOL (end-of-line) at the end of each line.
  • Close the file handle.
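
In code, that would be roughly equivalent to this sketch (an assumption, not the actual .NET implementation):

static void WriteAllLinesSketch( string path , IEnumerable<string> lines )
{
  using ( var writer = new StreamWriter( path , append: false ) ) // Open/create the file, truncating existing content.
  {
    foreach ( var line in lines )
      writer.WriteLine( line ); // Write the line followed by an EOL.
  }
  // Leaving the using block flushes and closes the file handle.
}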

To further improve performance we could write the files in parallel as well. We cannot write to the same file/stream in parallel, but we can write to different files/streams in parallel.

To do that, I thought of two approaches:

1) Without an intermediary in-memory array/buffer (see the sketch after this list):

  1. Open all needed files before processing starts (instead of using in-memory arrays).
  2. If using Parallel.For or AsParallel, use lock to guard every access to the same file stream.
  3. When the lock is acquired, write the result to the target stream.
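
A minimal sketch of this first approach, reusing GetLineIdentifier from above (the file names are made up, and note that with Parallel.For the line order inside each file is no longer guaranteed):

var categories = new[] { 'I' , 'D' , 'V' , 'W' , 'E' , 'F' };
var writers = categories.ToDictionary( c => c , c => new StreamWriter( $"./output/direct_{c}.log" ) );
var locks = categories.ToDictionary( c => c , c => new object() );

Parallel.For( 0 , lines.Length , idx =>
{
  var charIdentifier = GetLineIdentifier( lines[ idx ] );
  lock ( locks[ charIdentifier ] ) // Only one thread writes to a given stream at a time.
  {
    writers[ charIdentifier ].WriteLine( lines[ idx ] );
  }
} );

foreach ( var writer in writers.Values )
  writer.Dispose(); // Flush and close every file.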

2) With an intermediary in-memory array/buffer (sketched below):

  1. Store every classified line into its target array.
  2. For every array, run Task.Run(...) to write that array into the target file.
  3. Wait for all tasks to complete and you’re done.
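
A minimal sketch of this second approach, assuming the slots dictionary already filled by the Parallel.For version above (the file name prefix is made up):

// One write task per category; the files are written concurrently.
var writeTasks = slots.Select( entry => Task.Run( () =>
{
  var linesToWrite = entry.Value.ToList().OrderBy( v => v.Item1 ).Select( v => v.Item2 );
  File.WriteAllLines( $"./output/tasks_{entry.Key}.log" , linesToWrite );
} ) ).ToArray();

Task.WaitAll( writeTasks ); // Synchronize: all files are on disk after this line.
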
Why did I use C# for this?

Because C# provides one of the cleanest interfaces to leverage all the power of our machine, and LINQ is like having an in-memory C# database.
C# also provides SIMD-ready classes.
And even hardware intrinsics.

About the continuation of this series:

The next part of this tutorial will be a NodeJS version of the same task, to get a better picture of WHY picking the right tool for the right job is a must in software development.

Here is the link, enjoy!