Line-by-line Processing in node.js

Suppose you have to count the number of accesses for each IP address in an access.log file, and then list the IP addresses that accessed more than 100,000 times. You want to do it with JavaScript on node.js. How can you do that?

You might write a script like the following:

const fs = require('fs');

// Read the whole file into memory at once and split it into lines.
const lines = fs.readFileSync('access.log').toString().split('\n');

// Count accesses per IP address (the first field of each log line).
const freq = {};
for (const line of lines) {
  const ip = line.split(' ')[0];
  // When freq[ip] is undefined, undefined + 1 is NaN, and NaN || 1 yields 1.
  freq[ip] = (freq[ip] + 1) || 1;
}

// Collect the IPs seen at least 100,000 times, most frequent first.
const list = [];
for (const ip in freq) {
  if (freq[ip] >= 100000) {
    list.push([ip, freq[ip]]);
  }
}
list.sort((a, b) => b[1] - a[1]);

console.log(list);

This code works great for a small file, but it does not work for a large one. It uses as much memory as the size of access.log, because readFileSync() reads the entire file contents into memory before they can be split into lines. Worse, if the contents exceed node.js's maximum string length, toString() fails with the following error:

Error: Cannot create a string longer than 0x3fffffe7 characters
at Buffer.toString (buffer.js:645:17)
...

Using readFile(), the asynchronous version of readFileSync(), does not actually help: it still reads the entire file into memory before invoking its callback, and hits the same limit. Reading the file with fs.createReadStream() does solve the memory problem, but the 'data' handler is called once per chunk, and the chunks are not guaranteed to be split on line boundaries.
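To see the problem concretely, here is a rough sketch of handling the chunks by hand: each chunk may end in the middle of a line, so the trailing fragment has to be carried over to the next chunk.

const fs = require('fs');

let leftover = '';
fs.createReadStream('access.log', { encoding: 'utf8' })
  .on('data', (chunk) => {
    const parts = (leftover + chunk).split('\n');
    leftover = parts.pop(); // the last piece may be an incomplete line
    for (const line of parts) {
      // process one complete line here
    }
  })
  .on('end', () => {
    if (leftover) {
      // the file did not end with a newline; process the final line
    }
  });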

Rather than stitching chunks together by hand like this, you need a way to read a file line by line, asynchronously if possible. In this article, several ways to process text line by line are presented.

readline: Standard node.js Module

The standard node.js way to process text line by line is to use the readline module.

The major purpose of the readline module is to make interactive command-line programs easy to write. For reference, that primary use looks roughly like this (a minimal sketch, not part of our task):
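const readline = require('readline');

const rl = readline.createInterface({
  input: process.stdin,
  output: process.stdout
});

// Ask a question and print the answer once the user hits Enter.
rl.question('What is your name? ', (answer) => {
  console.log(`Hello, ${answer}!`);
  rl.close();
});

But we can also make use of its line-splitting feature on a file stream. The rewritten script is as follows: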

const fs = require('fs');
const readline = require('readline');

const rl = readline.createInterface({
  input: fs.createReadStream('access.log'),
  // Treat '\r\n' as a single line break.
  crlfDelay: Infinity
});

const freq = {};
rl.on('line', (line) => {
  // 'line' fires once per line, with the line break already stripped.
  const ip = line.split(' ')[0];
  freq[ip] = (freq[ip] + 1) || 1;
});

rl.on('close', () => {
  // The input has been fully consumed; do the post-processing here.
  const list = Object.entries(freq)
    .filter(x => x[1] >= 100000)
    .sort((a, b) => b[1] - a[1]);
  console.log(list);
});

Note that the line-processing part, which was in the for-loop, has moved into the 'line' event handler. And since the events are delivered asynchronously, the post-processing part must go in the 'close' event handler.
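A minimal illustration of why: any code placed after the rl.on() calls runs synchronously, before the first 'line' event has fired.

const freq = {};
rl.on('line', (line) => {
  freq[line.split(' ')[0]] = 1; // simplified handler
});

// This runs immediately, long before the file has been read:
console.log(freq); // prints {}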

split Transform Stream

If you are familiar with node.js streams, you may notice that the event names of the readline module ('line', 'close') differ from the standard stream event names ('data', 'end').

If you just want a stream that supplies one line at a time to its handlers, you can use the split module, which is a transform stream:

const fs = require('fs');
const split = require('split');

const freq = {};
fs.createReadStream('access.log')
  .pipe(split()) // re-chunk the stream so that each 'data' event is one line
  .on('data', (line) => {
    const ip = line.split(' ')[0];
    freq[ip] = (freq[ip] + 1) || 1;
  })
  .on('end', () => {
    const list = [];
    for (const ip of Object.keys(freq)) {
      if (freq[ip] >= 100000) {
        list.push([ip, freq[ip]]);
      }
    }
    list.sort((a, b) => b[1] - a[1]);
    console.log(list);
  });
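The split() function also accepts an optional mapper that is applied to each line before it is emitted; a common use is parsing newline-delimited JSON. A sketch (events.ndjson is a hypothetical file name):

const fs = require('fs');
const split = require('split');

fs.createReadStream('events.ndjson') // hypothetical NDJSON input
  .pipe(split(JSON.parse))           // each 'data' event is now a parsed object
  .on('data', (obj) => {
    // use the parsed object here
  })
  .on('error', (err) => {
    // a line that JSON.parse rejects lands here
  });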

readline for async/await

For those who love async/await or generator functions, an asyncIterator interface for readline was added to node.js as an experimental feature in v11.4.0. Using this feature, we can rewrite the script as follows:

const fs = require('fs');
const readline = require('readline');

async function main() {
  const rl = readline.createInterface({
    input: fs.createReadStream('access.log'),
    crlfDelay: Infinity
  });

  const freq = {};
  // for await consumes the interface one line at a time.
  for await (const line of rl) {
    const ip = line.split(' ')[0];
    freq[ip] = (freq[ip] + 1) || 1;
  }

  const list = Object.entries(freq)
    .filter(x => x[1] >= 100000)
    .sort((a, b) => b[1] - a[1]);

  console.log(list);
}

if (require.main === module) {
  main();
}
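Note that main() returns a promise. If reading can fail, for example when access.log does not exist, you may want to handle the rejection explicitly:

main().catch((err) => {
  console.error(err);
  process.exitCode = 1;
});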