Creating a PDF conversion tool using Serverless and Amplify - Part 3 - Serverless PDF Processing Lambda

Posted by Jonathan Weyermann on December 6, 2019 at 12:00 AM

This is part 3 of the PDF2JPG Conversion app - part 2 can be found here

For this part of the project, we will build a Lambda that automatically processes uploaded PDFs. It first converts each page of the PDF into a JPG, and then creates a zip file containing all of the JPGs.

For this project, the Lambda will be built as a trigger on the S3 bucket. If you would like more control over how the processing is initiated, you will need to trigger it another way.
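
For example, if you didn't want to rely on the S3 trigger, another service could start the conversion explicitly through the AWS SDK. This is only a rough sketch: the function name and payload shape are placeholders (the name assumes the usual Serverless naming of service-stage-function), and the handler would then need to read its input from that payload rather than from an S3 event record.

const AWS = require('aws-sdk');
const lambda = new AWS.Lambda({ region: 'us-west-2' });

// Hypothetical direct invocation in place of the S3 ObjectCreated trigger.
// The function name and payload are placeholders, not values from this project.
const triggerConversion = async (user) => {
  await lambda.invoke({
    FunctionName: 'pdf2img-dev-pdf2img', // assumed <service>-<stage>-<function> naming
    InvocationType: 'Event',             // asynchronous, fire-and-forget
    Payload: JSON.stringify({ user })
  }).promise();
};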


The Lambda performs a number of tasks sequentially. First, it reads the PDF into the Lambda's temporary storage. Then it uses Ghostscript to convert the PDF into individual JPG images and uploads each of these to S3. Finally, it streams the files into a zip archive that gets pushed straight to S3.

We will go through the code for each of these steps.

Required Imports

fs is for file operations; archiver-promise creates a promise-returning zip archive; await-exec provides a promise-returning exec, which we need so we can run Ghostscript and wait for it to finish.

const fs = require('fs');
const archiver = require('archiver-promise');
const path = require('path');
const AWS = require('aws-sdk');
const exec = require('await-exec');
const stream = require('stream');
// Raise the listener cap so streaming many files doesn't trigger a MaxListenersExceededWarning
require('events').EventEmitter.defaultMaxListeners = 100;

Reading in the pdf from s3

We read the file in from S3 as a stream and save it to a temp file in the Lambda's storage. Lambdas are provided 512 MB of temporary storage under /tmp, which is plenty for our purpose.

const grabPdfFromS3 = user => {
  const S3 = new AWS.S3({ region: 'us-west-2' });
  return new Promise((resolve, reject) => {
    const destPath = `/tmp/${user}.pdf`
    var params = {
      Bucket: process.env.BUCKET,
      Key: `pdfs/${user}.pdf`
    }
    S3.headObject(params)
      .promise()
      .then(() => {
        const s3Stream = S3.getObject(params).createReadStream()
        const fileStream = fs.createWriteStream(destPath);
        s3Stream.on('error', reject);
        fileStream.on('error', reject);
        fileStream.on('close', () => { resolve(destPath);});
        s3Stream.pipe(fileStream);
      })
      .catch(error => {
        console.log(`grabfroms3error: ${error}`);
        reject(error)
      });
  });
};

Saving a file to s3

The saveFileToS3 function takes a file path and an S3 key and saves that file to S3. S3 doesn't have an API call for uploading multiple files at once, so each file has to be uploaded individually.

const saveFileToS3 = (path, key) => {
  const S3 = new AWS.S3({ region: 'us-west-2' });
  return new Promise((resolve, reject) => {
    fs.readFile(path, (err, data) => {
      if (err) {
        console.log(`savetos3error: ${err}`);
        return reject(`savetos3error: ${path}`);
      }
      S3.putObject({
        Bucket: process.env.BUCKET,
        Key: `${key}`,
        ContentType: 'image/jpeg', // data is a Buffer with no type property; everything we upload here is a jpg
        Body: data
      }, (error) => {
        if (error) {
          console.log(`savetos3error: ${error}`);
          reject(`savetos3error: ${path}`);
        } else {
          resolve(`success: saved ${path} to S3:${key}`);
        }
      });
    });
  });
};
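
Since each image is its own putObject call, nothing stops you from starting several uploads at once and awaiting them as a group instead of one at a time. Here is a minimal sketch built on the saveFileToS3 above; the page count and key layout are just placeholders:

// Hypothetical concurrent variant: kick off every upload, then wait for all of them.
const uploadAllPages = (user, pageCount) => {
  const uploads = [];
  for (let index = 1; index <= pageCount; index++) {
    uploads.push(saveFileToS3(`/tmp/image${index}.jpg`, `${user}/image${index}.jpg`));
  }
  return Promise.all(uploads);
};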

Creating a Zip file

The createZipFile method creates a zip stream with archiver-promise and uploads it directly to S3. The archive is piped into an S3 upload stream, so the upload happens while the archive is being written. You can look at the instructions for archiver here (https://www.npmjs.com/package/archiver).

const createZipFile = async (index, subdir) => {
  return new Promise((resolve, reject) => {
    const s3 = new AWS.S3({ region: 'us-west-2' });
    var archive = archiver('/tmp/tmp.zip');
    archive.on('warning', function(err) {
      console.log(`err: ${err}`);
      if (err.code !== 'ENOENT') {
        reject(err);
      }
    });

    archive.on('error', function(err) {
      console.log(`err: ${err}`);
      reject(err);
    });

    archive.pipe(uploadFromStream(s3, subdir));
    archive.glob('*.jpg', { cwd: '/tmp' }, { prefix: '' });
    archive.finalize().then(function (result) {
      console.log(`result: ${result}`);
      console.log("finalizing archive");
      resolve(result);
    });
  });
}

These are some additional methods required for the archiver upload:

const uploadFromStream = (s3, subdir) => {
  var pass = new stream.PassThrough();
  var params = { Bucket: process.env.BUCKET, Key: `${subdir}.zip`, Body: pass };
  // Kick off the upload; it completes once the archive finishes writing into the pass-through stream
  s3uploadfunc(s3, params);
  return pass;
}

const s3uploadfunc = async (s3, params) => {
  try {
    var data = await s3.upload(params).promise();
    console.log(`s3ZipUpload data: ${JSON.stringify(data)}`);
    deleteTempFiles();
  } catch (err) {
    console.log(`s3ZipUpload err: ${err}`);
  }
}

deleteTempFiles exists to remove the JPGs after the zip upload. I do this because AWS sometimes gives me the same temporary storage in a subsequent Lambda invocation, and I want to make sure the JPG files I'm uploading actually come from the current PDF and not a previous one, so I delete them as soon as they have been uploaded.

const deleteTempFiles = () => {
  var files_to_delete = [];
  // Collect every jpg left in /tmp from this (or a previous, reused) invocation
  fs.readdirSync('/tmp/').forEach(file => {
    if (file.split('.')[1] == 'jpg') {
      files_to_delete.push(`/tmp/${file}`);
    }
  });

  files_to_delete.forEach(function(filepath) {
    fs.unlinkSync(filepath);
  });
}


Putting it all together in the main function

module.exports.execute = async (event, context, callback) => {
  try {
    var local_pdf_path = await grabPdfFromS3(user(event));
    await exec('./lambda-ghostscript/bin/gs -sDEVICE=jpeg -dTextAlphaBits=4 -r128 -o /tmp/image%d.jpg ' + local_pdf_path);

    var index = 1;
    while (fs.existsSync(currentUrl(index))) {
      var response = await saveFileToS3(currentUrl(index), s3Url(user(event), index));
      index++;
    }
    if (index > 1) {
      console.log(`creating zip: index ${index}, user(event): ${user(event)}`);
      await createZipFile(index, user(event));
      return { statusCode: 200, body: JSON.stringify({ message: 'Success', input: `${index - 1} images were uploaded` }, null, 2) };
    }
  } catch (err) {
    console.log(`conversion error: ${err}`);
    return { statusCode: 500, body: JSON.stringify({ message: err, input: event }, null, 2) };
  }
};

First, we grab the PDF file from S3, then we run Ghostscript to convert it into individual JPGs. Once that completes, we upload each file sequentially for as long as the next one exists. If we've generated at least one JPG, we create our zip file. The createZipFile method also triggers the temp file cleanup, so that's pretty much it.
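
The handler also leans on three small helpers (user, currentUrl and s3Url) that aren't reproduced in this post. Here is a minimal sketch of what they might look like, assuming uploaded keys of the form pdfs/<user>.pdf and the Ghostscript output naming from the exec call above; the S3 key layout for the images is my own guess:

// Sketch of the helpers assumed by the handler; the exact key layout is a guess.
const user = event =>
  decodeURIComponent(event.Records[0].s3.object.key.replace(/\+/g, ' '))
    .replace(/^pdfs\//, '')  // strip the pdfs/ prefix from the trigger key
    .replace(/\.pdf$/, '');  // strip the extension, leaving just the user name

// Ghostscript writes /tmp/image1.jpg, /tmp/image2.jpg, ... (from -o /tmp/image%d.jpg)
const currentUrl = index => `/tmp/image${index}.jpg`;

// Where each page image lands in the bucket; this layout is a placeholder
const s3Url = (user, index) => `${user}/image${index}.jpg`;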

While some small performance gains might be possible by running more of this concurrently, I suspect the gains would be minimal, since a Lambda only has limited CPU available. I also don't want to fan the work out across multiple Lambdas, as they normally don't share temp data.

The biggest issue I had writing this was making sure the Lambda didn't exit before all of the work was done. To that end, we have to await every promise before moving on. The Lambda will often finish the task it is currently on even if we don't wait, but the next task simply gets dropped when the Lambda quits.
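
To illustrate what that means in practice, compare the two calls below; only the awaited form guarantees the work finishes before the handler returns.

// Awaited: the handler doesn't return until the archive has been finalized.
await createZipFile(index, user(event));

// Fire-and-forget: the handler may return (and the container may be frozen)
// before the archive or its S3 upload ever completes.
createZipFile(index, user(event));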

The full source code is available on GitHub.


Serverless Template

To deploy this code as a Lambda on AWS, we will be using Serverless. To install Serverless, you will need Node.

Once you've installed Node, you can install Serverless by running the command

npm install -g serverless

You will also need to set up AWS credentials if you have not done so already.
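
One way to do that is through the Serverless CLI itself; the key and secret below are placeholders for your own credentials:

serverless config credentials --provider aws --key <ACCESS_KEY_ID> --secret <SECRET_ACCESS_KEY>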

To get started with a Serverless template, run the command

serverless create --template aws-nodejs

This will generate a basic Serverless template to get you started. Here is the template (serverless.yml) once I've made my necessary changes:

service: pdf2img

custom:
  myStage: ${opt:stage, self:provider.stage}
  myEnvironment:
    HOSTURL:
      prod: "https://pdf2jpgs.com"
      dev: "http://localhost:3000"

service - the name of the project

custom - variables we may use in our scripts

provider:
  name: aws
  runtime: nodejs8.10
  stage: dev
  environment:
    HOSTURL: ${self:custom.myEnvironment.HOSTURL.${self:custom.myStage}}
  region: us-west-2
  memorySize: 3008
  timeout: 180
  iamRoleStatements:
  - Effect: Allow
    Action:
      - s3:PutObject
      - s3:GetObject
    Resource: "arn:aws:s3:::quiztrainer-quiz-images-${self:provider.stage}/*"

provider - configuration options for AWS. I've set memorySize (for the Lambda) to 3 GB because AWS allocates CPU power in proportion to the amount of memory allocated. I've set the timeout to 3 minutes, because really large PDFs could take that long. I've also configured the IAM role so the Lambda can read from and write to the bucket I'm going to create here, the one that holds the PDFs, images and zip files.


functions:
  pdf2img:
    handler: pdf2img.execute
    events:
      - s3:
          bucket: pdf
          event: s3:ObjectCreated:*
          rules:
            - suffix: .pdf
    environment:
      BUCKET: "quiztrainer-quiz-images-${self:provider.stage}"

resources:
- ${file(resources/s3-bucket.yml)}

functions - configuration for the Lambda that Serverless is going to create. We're referencing our pdf2img.js (the Lambda we built above), which gets triggered on the event we describe here (the creation of a .pdf file in the specified bucket). Under resources, we have more configuration options for this bucket.

While this normally doesn't matter much, for this particular CloudFormation template the naming convention used to describe the bucket is very important. Adding a bucket to an event normally creates it, but we also want to configure that same bucket in this step to add CORS (Cross-Origin Resource Sharing) to it, and if you don't follow the convention, Serverless will have trouble recognizing it as the same bucket.


resources/s3-bucket.yml

Here is the rest of the bucket configuration:

Resources:
  S3BucketPdf:
    Type: AWS::S3::Bucket
    Properties:
      # Set the CORS policy
      BucketName: "quiztrainer-quiz-images-${self:provider.stage}"
      CorsConfiguration:
        CorsRules:
          -
            AllowedOrigins:
              - ${self:provider.environment.HOSTURL}
            AllowedHeaders:
              - '*'
            AllowedMethods:
              - GET
              - PUT
              - POST
              - DELETE
              - HEAD
            MaxAge: 3000
  Pdf2imgLambdaPermissionPdfS3:
    Type: "AWS::Lambda::Permission"
    Properties:
      FunctionName:
        "Fn::GetAtt":
          - Pdf2imgLambdaFunction
          - Arn
      Principal: "s3.amazonaws.com"
      Action: "lambda:InvokeFunction"
      SourceAccount:
        Ref: AWS::AccountId
      SourceArn: "arn:aws:s3:::quiztrainer-quiz-images-${self:provider.stage}"
  Pdf2imgRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service:
                - lambda.amazonaws.com  
            Action: sts:AssumeRole
      Policies:
        - PolicyName: Logs
          PolicyDocument:
            Version: '2012-10-17'
            Statement:
              - Effect: Allow
                Action:
                  - logs:CreateLogGroup
                  - logs:CreateLogStream
                  - logs:PutLogEvents
                Resource: '*'
  PdfBucketPolicy:
    Type: AWS::S3::BucketPolicy
    Properties:
      Bucket:
        Ref: S3BucketPdf
      PolicyDocument:
        Statement:
          - Principal: "*"
            Action:
              - s3:GetObject
            Effect: Allow
            Sid: "AddPerm"
            Resource:
              Fn::Join:
                - '/'
                - - Fn::GetAtt:
                    - S3BucketPdf
                    - Arn
                  - '*'
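
With the function code, serverless.yml and the bucket resources in place, everything can be deployed with a single command (this assumes the default dev stage; pass --stage prod to target production):

serverless deploy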


This concludes the tutorial on building a PDF2JPG converter using Lambda, Serverless, React and Amplify. Please leave questions and comments in the box below.
