The right tool for the right job.
Part 2 (final): Log classifier in NodeJS


In the previous part of this demonstration, we used C# .NET Core to create a log classifier. In this continuation, we’ll implement the same thing, but in our dear NodeJS.

I’ll start by copying-n-pasting the task from my previous entry (which you should read if you haven’t).

What is the task about?

The task consists of classifying an Android log file that contains lines like these:

12-17 19:31:36.263  1795  1825 I PowerManager_screenOn: DisplayPowerStatesetColorFadeLevel: level=1.0
12-17 19:31:36.263  5224  5283 I SendBroadcastPermission: action:android.com.huawei.bone.NOTIFY_SPORT_DATA, mPermissionType:0
12-17 19:31:36.264  1795  1825 D DisplayPowerController: Animating brightness: target=21, rate=40
12-17 19:31:36.264  1795  2750 I PowerManager_screenOn: DisplayPowerState Updating screen state: state=ON, backlight=823
12-17 19:31:36.264  1795  2750 I HwLightsService: back light level before map = 823
12-17 19:31:36.264  1795  1825 D DisplayPowerController: Animating brightness: target=21, rate=40
12-17 19:31:36.264  1795  1825 V KeyguardServiceDelegate: onScreenTurnedOn()
12-17 19:31:36.264  1795  1825 I WindowManger_keyguard: onScreenTurnedOn()
12-17 19:31:36.264  1795  1825 D DisplayPowerController: Display ready!
12-17 19:31:36.264  1795  1825 D DisplayPowerController: Finished business...
12-17 19:31:36.264  2852  3328 D KeyguardService: Caller checkPermission fail
12-17 19:31:36.264  2852  3328 D KeyguardService: KGSvcCall onScreenTurnedOn.
12-17 19:31:36.264  2852  3328 D KeyguardViewMediator: notifyScreenTurnedOn
12-17 19:31:36.265  2852  2852 D KeyguardViewMediator: handleNotifyScreenTurnedOn
12-17 19:31:36.265  2852  2852 I PhoneStatusBar: onScreenTurnedOn
12-17 19:31:36.265  2852  2852 D KGWallpaper_Magazine: getNextIndex: 0; from 5 to 5; size: 44
12-17 19:31:36.265  2852  2852 I HwLockScreenReporter: report msg is :{picture: Deepwater-05-2.3.001-bigpicture_05_8.jpg}
12-17 19:31:36.265  2852  2852 W HwLockScreenReporter: report result = falsereport type:162 msg:{picture: Deepwater-05-2.3.001-bigpicture_05_8.jpg, channelId: 05}

Our goal is to read the log file and classify every log line by its category/type, which is defined as the single capital letter located before the process/activity name.

  • 1825 I WindowManger_keyguard... → Category is I.
  • 2852 D KeyguardViewMediator: handleNotifyScreenTurnedOn... → Category is D.

Every log type must be written to a separate log file, which must preserve the original order.

All possible category/types are: I, D, V, W, E, F.
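
To make that concrete, here is a tiny sketch (using a sample line from the log above) of how the category letter can be pulled out; it uses the same kind of regex that will appear later in the code:

const line  = "12-17 19:31:36.264  1795  1825 V KeyguardServiceDelegate: onScreenTurnedOn()";

// The first capital letter that follows the numeric date/time/PID/TID columns is the category.
const match = line.match( /[\d\-\.\s]+([A-Z])/ );

console.log( match[1] ); // Prints "V".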

What is the most efficient way to do so?

Let’s copy paste (again) from my previous post.

BEGIN COPY PASTE

With NodeJS, one could leverage several cores with Cluster, which spawns multiple processes that can communicate via IPC, or with Workers, which implement threads and can share the same memory.

But NodeJS is not the best environment for this kind of operation. Beyond asynchronous I/O (processing requests, connecting to databases, serving files, doing minimal data transformation, physical disk reads/writes that are limited by the hardware rather than the language), it is simply not the right tool. I mean, if you need it, you can use it without having to change either your language or your working environment.

END COPY PASTE

In order to come close to the C# performance, we’d need to use Workers. But that is a lot of setup; we could make our development easier by using yet another JS library, but even then:

As we would probably want to share the array between threads, we would need to store that string data in a SharedArrayBuffer.

And, if we ever needed to move data between threads without cloning it, we would need to use a Transferable object.

Those things make development take more time and also make it more error prone.
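
To give an idea of the extra ceremony involved, here is a minimal sketch of what sharing memory with a Worker looks like (it is not used in the rest of the post, just an illustration of the setup cost):

const { Worker , isMainThread , parentPort , workerData } = require( "worker_threads" );

if( isMainThread )
{
  // Memory backed by a SharedArrayBuffer is visible to every thread without copying.
  const shared = new SharedArrayBuffer( 16 );
  const view   = new Uint8Array( shared );

  const worker = new Worker( __filename , { workerData : shared } );

  worker.on( "message" , () =>
  {
    console.log( "First byte written by the worker:" , view[0] );
  });
}
else
{
  // The worker sees the same memory through its own view.
  const view = new Uint8Array( workerData );
  view[0] = 42;

  parentPort.postMessage( "done" );
}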

In order to be fair and match the simplicity that C# offers, we’ll forget about that and just implement two versions of it:

  1. With writeable file streams.
  2. With simple files opened in w+ mode.

Without further ctrl+c & ctrl+v, let’s begin:

My computer setup is the following:

  • i7-9700K (8 cores/8 threads).
  • 32GB RAM (3200MHz).
  • NVMe/SSD only drives.
  • Node version: v16.5.0

As always, the source code of this post is available on GitHub.
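
Both snippets below rely on a few built-in modules plus LogPath, LogTypes and a small createOutput helper. The real definitions live in the repository; a hypothetical shape for them, just so the snippets read as self-contained, could be:

const fs       = require( "fs" );
const os       = require( "os" );
const path     = require( "path" );
const events   = require( "events" );
const readline = require( "readline" );

const LogPath  = path.join( __dirname , "android.log" ); // Assumed input file name.
const LogTypes = [ "I" , "D" , "V" , "W" , "E" , "F" ];

const createOutput =
{
  // One writable stream per category, keyed by its letter.
  streams : ( types ) =>
    Object.fromEntries( types.map( type => [ type , fs.createWriteStream( `${ type }.log` ) ] ) ) ,

  // One FileHandle per category, opened in "w+" mode.
  files : async ( types ) =>
    Object.fromEntries( await Promise.all(
      types.map( async type => [ type , await fs.promises.open( `${ type }.log` , "w+" ) ] ) ) ) ,
};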

With writeable file streams:
const classify_with_filestreams = async function()
{
  const reader = readline.createInterface({
    input  : fs.createReadStream( LogPath ) ,
    output : null ,
  });

  const outputStreams = createOutput.streams( LogTypes );

  // Handle when a new <line> is available in an asynchronous way.
  for await( const line of reader )
  {
    const matchesIterator = line.matchAll( /[\d\-\.\s]+([A-Z])/g );
    const matches         = matchesIterator.next().value;
    const identifier      = matches[1];
    const writeable       = outputStreams[ identifier ];

    if( !writeable.write( line ) ) // If the stream buffer is full, wait until it is drained.
      await events.once( writeable , "drain" ); // Wait for the drain event.

    writeable.write( os.EOL ); // Append a new line.
  }

  // Close all write streams.
  for( const [ key , value ] of Object.entries( outputStreams ) )
  {
    value.close();
  }

  reader.close();
};

This is the correct implementation if you don’t want to use workers (threads); later you’ll see why.

With simple files opened in w+ mode:
const classify_with_files = async function()
{
  const reader = readline.createInterface({
    input    : fs.createReadStream( LogPath ) ,
    output   : null ,
    terminal : false ,
  });

  const writePromises = {};
  const writeTask     = [];

  // Handle when a new <line> is available in an asynchronous way.
  for await( const line of reader )
  {
    const matchesIterator = line.matchAll( /[\d\-\.\s]+([A-Z])/g );
    const matches         = matchesIterator.next().value;
    const identifier      = matches[1];
    writeTask.push({ type : identifier , line : line + os.EOL });
  }

  const outputFiles = await createOutput.files( LogTypes );

  // Write each line on its target file.
  for await( const task of writeTask )
  {
    await writePromises[ task.type ];
    writePromises[ task.type ] = outputFiles[ task.type ].write( task.line );
  }

  // Close all file handles.
  for( const [ key , value ] of Object.entries( outputFiles ) )
  {
    value.close();
  }

  reader.close();
};

This would be a good implementation if we needed to await write() with longer strings, fewer times. In this scenario, however, we’re writing small strings many times and waiting for each one. That wait is needed for obvious reasons, and also because the official NodeJS documentation states it:

It is unsafe to use fs.write() multiple times on the same file without waiting for the callback. For this scenario, fs.createWriteStream() is recommended.

Quoted from here.
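
For reference, a minimal driver along these lines (a hypothetical harness, not the one from the repository) is enough to produce measurements in the spirit of the table below:

const measure = async function( label , classifier )
{
  const start = Date.now();

  await classifier();

  const elapsed = Date.now() - start;
  const rss     = ( process.memoryUsage().rss / 1024 / 1024 ).toFixed( 2 );

  console.log( `${ label }: ${ elapsed }ms, ${ rss } MB` );
};

( async () =>
{
  await measure( "File streams"            , classify_with_filestreams );
  await measure( "Files (without streams)" , classify_with_files );
} )();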

Results:
Method                     Time       Memory
File streams               3730ms     73.07 MB
Files (without streams)    36406ms    1028.51 MB

I usually run the code more than once, but this time it’s really not needed; the 2nd implementation was:

  • roughly 10x worse in time.
  • roughly 14x worse in memory.

In our second implementation, this code:

for await( const task of writeTask )
{
  await writePromises[ task.type ];
  writePromises[ task.type ] = outputFiles[ task.type ].write( task.line );
}

Is totally bottlenecking our application. An optimization would be to concatenate lines into larger chunks (sketched below), but it would be even better to simply use the 1st implementation.
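
A rough sketch of that chunking optimization, under the same assumptions as before (it removes the per-line await, although it still buffers the whole file in memory):

const classify_with_buffered_files = async function()
{
  const reader = readline.createInterface({
    input    : fs.createReadStream( LogPath ) ,
    output   : null ,
    terminal : false ,
  });

  // One big string per category instead of one write task per line.
  const chunks = Object.fromEntries( LogTypes.map( type => [ type , "" ] ) );

  for await( const line of reader )
  {
    const identifier     = line.match( /[\d\-\.\s]+([A-Z])/ )[1];
    chunks[ identifier ] += line + os.EOL;
  }

  const outputFiles = await createOutput.files( LogTypes );

  // A single awaited write per category keeps fs.write() safe and cheap.
  for( const [ type , chunk ] of Object.entries( chunks ) )
    await outputFiles[ type ].write( chunk );

  for( const file of Object.values( outputFiles ) )
    await file.close();

  reader.close();
};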

Conclusion of this series:

Even though I love NodeJS, and I could pull some hacks and whatnot to make it perform on par with C#, it won’t be as clean and easy as what C# provides. Also, this is a very small example; as I said in my previous post, in a real scenario we’ll probably do more data transformations than a simple regex. And your program should be about getting those transformations right rather than just making them fast. Yes, we need them to be fast too, so we choose the right tool, to make it both easier and faster.