Friday, 15 April 2016

Gigabit File uploads Over HTTP - The Node.js with NGINX version


Please see the ASP.NET version of the article. It provides background information that might not be covered here.

Please see the original Node.js version of this article.

After blogging about Gigabit File uploads with Node.js, we wanted to see how we could improve the performance of the application. The previous version was written with mostly synchronous code, and as a result it had high CPU usage, did quite a lot of I/O, and used a fair amount of memory. That version was meant to demonstrate the concept of Gigabit File uploads over HTTP rather than to perform well.

Now that the concept has been established, it is time to see how the application's performance can be improved.


The Performance Tuning

The areas that we want to look at to improve the Gigabit File upload performance are:
  1. Implementing a reverse proxy server in front of the Node.js server.
  2. Offloading the file upload requests to the reverse proxy.
  3. Converting the blocking, synchronous MergeAll code to non-blocking, asynchronous code.
  4. Creating an API for each backend request. At present, the UploadChunk API call is used to manage all uploads.
  5. Removing the checksum calculation from the MergeAll API call. A GetChecksum API will be created to calculate the checksum of the uploaded file.
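
The GetChecksum code is not shown in this post. To illustrate why it helps to move the checksum out of MergeAll, here is a minimal sketch that hashes a file as a stream, so the file never has to be held in memory. The function name checksumFile and the use of MD5 in the usage note are assumptions made for this example; they are not taken from the CelerFT source.

```javascript
var crypto = require('crypto');
var fs = require('fs');

// Hypothetical helper (not CelerFT's implementation): compute an upper-case
// hex digest of a file by streaming it through a hash, one buffer at a time.
function checksumFile(filePath, algorithm, callback) {
    var hash = crypto.createHash(algorithm);
    var stream = fs.createReadStream(filePath);

    // A read error means 'end' will not fire, so the callback runs once.
    stream.on('error', function (err) {
        callback(err);
    });

    stream.on('data', function (data) {
        hash.update(data);
    });

    stream.on('end', function () {
        callback(null, hash.digest('hex').toUpperCase());
    });
}
```

A GetChecksum handler could call checksumFile(filename, 'md5', ...) after the merge has completed and send the digest back to the client, which keeps the MergeAll response itself short.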

The performance testing was conducted on a CentOS 7 virtual machine running NGINX version 1.9.9 and Node.js version 5.3.0. This is a departure from our previous blog post, in which the work was done on a Windows Server 2012 platform.


The Reverse Proxy

Node.js allows you to build fast, scalable network applications capable of handling a huge number of simultaneous connections with high throughput. This means that from the very start Node.js is quite capable of handling the Gigabit File uploads.

So why would we want to use a reverse proxy in front of our Node.js server in this scenario? We want to do this because offloading the file handling to the NGINX web server will reduce the overhead on the Node.js backend and this should provide a performance boost. The following figure shows how this is achieved.



Figure 1 Offloading file upload to NGINX reverse proxy


  1. The client computer uploads the file chunks by calling the XFileName API. Once the NGINX reverse proxy sees a call to /api/CelerFTFileUpload/UploadChunk/XFileName it saves the file chunk to the NGINX private temporary directory, because we have enabled the NGINX client_body_in_file_only directive. The NGINX private temporary directory can be found under /tmp. This happens because the PrivateTmp configuration option is set to true in the NGINX systemd unit file. Please consult the systemd man pages for more information on the PrivateTmp configuration option.
  2. After the file chunk has been saved, NGINX sets the X-File-Name header to the path of the temporary file that holds the chunk. This header is sent to Node.js.
  3. Once all of the file chunks have been uploaded, the client calls the MergeAll API, which NGINX sends directly to Node.js. Once Node.js receives the MergeAll request it merges all of the uploaded file chunks to create the file.
  4. Once Node.js receives the X-File-Name header it moves the file chunk from the NGINX private temporary directory and saves it to the file upload directory with the correct name.

We used the following NGINX configuration:

# redirect CelerFT

    location  = /api/CelerFTFileUpload/UploadChunk/XFileName {
       aio on;
       directio 10M;
       client_body_temp_path      /tmp/nginx 1;
       client_body_in_file_only   on;
       client_body_buffer_size    10M;
       client_max_body_size 60M;

       proxy_pass_request_headers on;
       proxy_set_body             off;
       proxy_redirect             off;
       proxy_ignore_client_abort  on;
       proxy_http_version         1.1;
       proxy_set_header           Connection "";
       proxy_set_header           Host $host;
       ##proxy_set_header         Host $http_host;
       proxy_set_header           X-Real-IP $remote_addr;
       proxy_set_header           X-Forwarded-For $proxy_add_x_forwarded_for;
       proxy_set_header           X-Forwarded-Proto $scheme;
       proxy_set_header           X-File-Name $request_body_file;
       proxy_pass                 http://127.0.0.1:1337;
      # proxy_redirect             default;

       proxy_connect_timeout       600;
       proxy_send_timeout          600;
       proxy_read_timeout          600;
       send_timeout                600;

       access_log                  off;
       error_log                  /var/log/nginx/nginx.upload.error.log;

   }

The key parameter is the X-File-Name header, which is set to $request_body_file, the path of the temporary file that holds the chunk. The Node.js backend then has to process the individual chunks. The crucial part of the code is finding out where the NGINX private temporary directory was created, because this is where NGINX writes the file chunks. Under systemd the NGINX private temporary directory has a different name each time NGINX is restarted, so we have to get the name of that directory before we can move the file chunk to its final destination.

app.post('*/api/CelerFTFileUpload/UploadChunk/XFileName*', function (request, response) {

    // Check whether we are uploading using the X-File-Name header.
    // This means that we have offloaded the file upload to the
    // web server (NGINX) and it is sending us the path to the actual
    // file in the header. The file chunk will not be in the body
    // of the request.
    if (!request.headers['x-file-name']) {
        // Without the header there is no chunk to move, so fail the
        // request instead of leaving it without a response.
        response.status(400).send('Missing X-File-Name header');
        return;
    }

    // Temporary location of our uploaded file.
    // NGINX uses a private file path in /tmp on CentOS,
    // so we need to get the name of that path.
    var temp_dir = fs.readdirSync('/tmp');
    var nginx_temp_dir = [];

    for (var i = 0; i < temp_dir.length; i++) {
        if (temp_dir[i].indexOf('nginx.service') !== -1) {
            nginx_temp_dir.push(temp_dir[i]);
        }
    }

    if (nginx_temp_dir.length === 0) {
        response.status(500).send('NGINX private temporary directory not found');
        return;
    }

    var temp_path = '/tmp/' + nginx_temp_dir[0] + request.headers['x-file-name'];

    fs.move(temp_path, response.locals.localfilepath, {}, function (err) {

        if (err) {
            response.status(500).send(err);
            return;
        }

        // Send back a successful response with the file name
        response.status(200).send(response.locals.localfilepath);
    });
});
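
The handler above calls fs.readdirSync for every chunk, which blocks the event loop while /tmp is being read. As a minimal sketch, and not the code that CelerFT ships, the same lookup could be written with the asynchronous fs.readdir. The helper names pickNginxPrivateTmp and resolveChunkPath are hypothetical and were introduced only for this example.

```javascript
var fs = require('fs');
var path = require('path');

// Hypothetical helper: pick the systemd PrivateTmp directory that belongs to
// nginx.service out of the entries found under /tmp. Returns null if absent.
function pickNginxPrivateTmp(entries) {
    for (var i = 0; i < entries.length; i++) {
        if (entries[i].indexOf('nginx.service') !== -1) {
            return entries[i];
        }
    }
    return null;
}

// Hypothetical helper: turn the X-File-Name header value into the real path
// of the chunk on disk, without blocking while the directory is read.
function resolveChunkPath(tmpRoot, headerValue, callback) {
    fs.readdir(tmpRoot, function (err, entries) {
        if (err) {
            return callback(err);
        }

        var dir = pickNginxPrivateTmp(entries);
        if (!dir) {
            return callback(new Error('NGINX private temporary directory not found under ' + tmpRoot));
        }

        // path.join keeps the header value relative to the private directory
        // even though the value itself starts with a slash.
        callback(null, path.join(tmpRoot, dir, headerValue));
    });
}
```

Inside the route, resolveChunkPath('/tmp', request.headers['x-file-name'], ...) would replace the readdirSync loop, and the callback would perform the fs.move. Because the directory name only changes when NGINX is restarted, the result could also be cached between requests.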


The MergeAll Asynchronous API

In the previous blog post we used the fs.readdirSync and fs.readFileSync function calls quite extensively. fs.readdirSync was called each time we needed to check whether we had uploaded all of the file chunks, and fs.readFileSync was called when we merged the uploaded file chunks to create the file.
Both of those calls are synchronous, so the MergeAll API blocked each time they were called.
The getfilesWithExtensionName function that was called in the MergeAll API has been replaced with an fs.readdir call that checks that all of the file chunks have been uploaded.

The original getfilesWithExtensionName function:

function getfilesWithExtensionName(dir, ext) {
   
    var matchingfiles = [];
   
    if (fs.ensureDirSync(dir)) {
        return matchingfiles;
    }

    var files = fs.readdirSync(dir);
    for (var i = 0; i < files.length; i++) {
        if (path.extname(files[i]) === '.' + ext) {
            matchingfiles.push(files[i]);
        }
    }

    return matchingfiles;
}
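
For comparison, a non-blocking counterpart of this function could look like the following minimal sketch. The names filterByExtension and getFilesWithExtensionNameAsync are hypothetical; the MergeAll code in this post performs the same filtering inline inside its fs.readdir callback.

```javascript
var fs = require('fs');
var path = require('path');

// Hypothetical helper: keep only the names that end in the given extension.
function filterByExtension(names, ext) {
    var matchingfiles = [];
    for (var i = 0; i < names.length; i++) {
        if (path.extname(names[i]) === '.' + ext) {
            matchingfiles.push(names[i]);
        }
    }
    return matchingfiles;
}

// Hypothetical asynchronous variant: the directory is read without blocking
// and the matching file names are handed to the callback.
function getFilesWithExtensionNameAsync(dir, ext, callback) {
    fs.readdir(dir, function (err, names) {
        if (err) {
            return callback(err);
        }
        callback(null, filterByExtension(names, ext));
    });
}
```

The caller receives the list in a callback instead of as a return value, which is what allows Node.js to serve other requests while the directory is being read.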

The MergeAll API was rewritten to use the fs.readdir function to check whether all of the file chunks have been uploaded. Each call to fs.readdir populates an array named fileslist with the file names. Once all of the file chunks have been uploaded, we populate an array named files with the names of the chunk files, as shown.

for (var i = 0; i < fileslist.length; i++) {
    if (path.extname(fileslist[i]) == '.tmp') {
        //console.log(fileslist[i]);
        files.push(fileslist[i]);
    }
}

Next, fs.createWriteStream is used to create the output file.

// Create the output file
var outputFile = fs.createWriteStream(filename);

We then use a recursive function named mergefiles to merge the file chunks into the final output file. In the mergefiles function we use fs.createReadStream to read each file in the files array and write its contents to the output file. The mergefiles function is first called with the index set to 0, and each time a chunk has been read to the end it calls itself with the index incremented by one.

var index = 0;

// Recursive function used to merge the files
// in a sequential manner
var mergefiles = function (index) {

    // If the index matches the number of items in the array,
    // end the function and finalize the output file
    if (index == files.length) {
        outputFile.end();
        return;
    }

    console.log(files[index]);

    // Use a read stream to read the files and write them to the write stream
    var rstream = fs.createReadStream(localFilePath + '/' + files[index]);

    rstream.on('data', function (data) {
        outputFile.write(data);
    });

    rstream.on('end', function () {
        //fs.removeSync(localFilePath + '/' + files[index]);
        mergefiles(index + 1);
    });

    rstream.on('close', function () {
        fs.removeSync(localFilePath + '/' + files[index]);
        //mergefiles(index + 1);
    });

    rstream.on('error', function (err) {
        console.log('Error in file merge - ' + err);
        response.status(500).send(err);
        return;
    });
};

mergefiles(index);
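
One thing the 'data' handler does not do is look at the return value of outputFile.write, so a fast reader can queue data in memory faster than the disk can absorb it. The following minimal sketch shows the same sequential merge written with pipe, which pauses the reader whenever the writer falls behind. The function name mergeChunks is hypothetical, chunkPaths is assumed to already be in the order in which the chunks should be written, and this is a variant for illustration rather than the code used in CelerFT.

```javascript
var fs = require('fs');

// Hypothetical helper: append each chunk to the output file, one at a time,
// and call back once the output file has been completely written.
function mergeChunks(chunkPaths, outputPath, callback) {
    var output = fs.createWriteStream(outputPath);
    var finished = false;

    // Make sure the callback is only ever invoked once.
    function done(err) {
        if (finished) {
            return;
        }
        finished = true;
        callback(err);
    }

    output.on('error', done);
    output.on('finish', function () {
        done(null);
    });

    (function next(index) {
        // All chunks have been piped: finalize the output file.
        if (index === chunkPaths.length) {
            output.end();
            return;
        }

        var rstream = fs.createReadStream(chunkPaths[index]);

        rstream.on('error', function (err) {
            output.destroy();
            done(err);
        });

        rstream.on('end', function () {
            next(index + 1);
        });

        // end: false keeps the output stream open for the next chunk.
        rstream.pipe(output, { end: false });
    })(0);
}
```

Deleting the chunk files is left out of this sketch; it could be done in the callback once the merge has succeeded, so that a failed merge does not lose any chunks.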
           

The complete code for the MergeAll API call.

// Request to merge all of the file chunks into one file
app.get('*/api/CelerFTFileUpload/MergeAll*', function (request, response) {

    if (request.method == 'GET') {
       
        // Get the extension from the file name
        var extension = path.extname(request.param('filename'));
       
        // Get the base file name
        var baseFilename = path.basename(request.param('filename'), extension);
       
        var localFilePath = uploadpath + request.param('directoryname') + '/' + baseFilename;
       
        var filename = localFilePath + '/' + baseFilename + extension;
       
        // Array to hold files to be processed
        var files = [];
       
        // Use asynchronous readdir function to process the files
        // This provides better i/o
        fs.readdir(localFilePath, function (error, fileslist) {

            if (error) {
               
                response.status(400).send('Number of file chunks less than total count');
                //response.end();
                console.log(error);
                return;
            }
           
            //console.log(fileslist.length);
            //console.log(request.param('numberOfChunks'));
           

            if ((fileslist.length) != request.param('numberOfChunks')) {
               
                response.status(400).send('Number of file chunks less than total count');
                //response.end();
                return;
            }
           
            // Check if all of the file chunks have been uploaded
            // Note we only want the files with a *.tmp extension
            if ((fileslist.length) == request.param('numberOfChunks')) {

                for (var i = 0; i < fileslist.length; i++) {
                    if (path.extname(fileslist[i]) == '.tmp') {
                        //console.log(fileslist[i]);
                        files.push(fileslist[i]);
                    }
                }
               
                if (files.length != request.param('numberOfChunks')) {
                    response.status(400).send('Number of file chunks less than total count');
                    //response.end();
                    return;
                }
               
                // Create the output file
                var outputFile = fs.createWriteStream(filename);
               
                // Done writing the file. Move it to the top level directory
                outputFile.on('finish', function () {
                   
                    console.log('file has been written ' + filename);
                    //runGC();
                   
                    // New name for the file
                    var newfilename = uploadpath + request.param('directoryname') + '/' + baseFilename + extension;
                   
                    // Check if file exists at top level if it does delete it
                    // Use move with overwrite option
                    fs.move(filename, newfilename , {}, function (err) {
                        if (err) {
                            console.log(err);
                            response.status(500).send(err);
                            //runGC();
                            return;
                        }
                        else {
                           
                            // Delete the temporary directory
                            fs.remove(localFilePath, function (err) {
                               
                                if (err) {
                                    response.status(500).send(err);
                                    //runGC();
                                    return;
                                }
                               
                                // Send back a successful response with the file name
                                response.status(200).send('Successfully merged file ' + filename);
                                //response.end();
                                //runGC();

                            });

                            // Send back a successful response with the file name
                            //response.status(200).send('Successfully merged file ' + filename + ", " + md5results.toUpperCase());
                            //response.end();
                   
                        }
                    });
                });
                               

                var index = 0;
               
                // Recursive function used to merge the files
                // in a sequential manner
                var mergefiles = function (index) {

                    // If the index matches the number of items in the array,
                    // end the function and finalize the output file
                    if (index == files.length) {
                        outputFile.end();
                        return;
                    }
                   
                    console.log(files[index]);
                   
                    // Use a read stream to read the files and write them to the write stream
                    var rstream = fs.createReadStream(localFilePath + '/' + files[index]);
                   
                    rstream.on('data', function (data) {
                        outputFile.write(data);
                    });
                   
                    rstream.on('end', function () {
                        //fs.removeSync(localFilePath + '/' + files[index]);
                        mergefiles(index + 1);
                    });
                   
                    rstream.on('close', function () {
                        fs.removeSync(localFilePath + '/' + files[index]);
                        //mergefiles(index + 1);
                    });
                   
                    rstream.on('error', function (err) {
                        console.log('Error in file merge - ' + err);
                        response.status(500).send(err);
                        return;
                    });
                };
                
                mergefiles(index);
            }
            /*else {
                response.status(400).send('Number of file chunks less than total count');
                //response.end();
                return;
            }*/
               

        });
    }

});
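
The handler reads three request parameters: filename, directoryname and numberOfChunks. As a usage illustration, a client could build the MergeAll request URL along the following lines. The helper name buildMergeAllUrl and the baseUrl argument are assumptions made for this sketch, and passing the parameters in the query string is also an assumption, based on the request being a GET.

```javascript
// Hypothetical client-side helper: build the MergeAll request URL from the
// three parameters that the handler reads with request.param().
function buildMergeAllUrl(baseUrl, directoryName, fileName, numberOfChunks) {
    return baseUrl + '/api/CelerFTFileUpload/MergeAll' +
        '?filename=' + encodeURIComponent(fileName) +
        '&directoryname=' + encodeURIComponent(directoryName) +
        '&numberOfChunks=' + encodeURIComponent(numberOfChunks);
}
```

Encoding each value keeps file names that contain spaces or other reserved characters from breaking the query string.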


Other Improvements

As mentioned, we also created an API call for each type of file upload that CelerFT supports.

  1. The Base64 API call will handle uploads in which the CelerFT-Encoded header is set to base64.
  2. The FormData API call will handle all multipart/form-data uploads.
  3. The XFileName API call will be used to offload file uploads to the NGINX reverse proxy.

The preliminary tests showed marked improvements in the performance of the backend server during the file uploads. Please feel free to download CelerFT and provide feedback on its performance.

The code for this project can be found in my GitHub repository under the nginxasync branch.