This is part 3 of the PDF2JPG Conversion app - part 2 can be found here.
For this part of the project we will build a lambda that automatically processes uploaded PDFs. It first turns each page of the PDF into a jpg, and then creates a zip file containing all of the jpgs.
For this project, the lambda is built as a trigger on the S3 bucket. If you would like more control over how the processing is initiated, you will need to trigger it another way — for example, by invoking the lambda directly, as sketched below.
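As a rough illustration (not something this project actually uses), here is what a direct invocation could look like with the aws-sdk. The function name and payload shape are assumptions for illustration only, and the handler would need to be adapted to read the user from the payload instead of from the S3 event:

const AWS = require('aws-sdk');
const lambda = new AWS.Lambda({ region: 'us-west-2' });

const invokeConversion = async user => {
  // FunctionName follows serverless's <service>-<stage>-<function> pattern; assumed here
  const result = await lambda.invoke({
    FunctionName: 'pdf2img-dev-pdf2img',
    InvocationType: 'Event',              // asynchronous invoke: fire and forget
    Payload: JSON.stringify({ user })     // assumed payload shape
  }).promise();
  return result;
};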
The lambda performs a number of tasks sequentially. First, it reads the PDF into the lambda's temporary storage. It then uses GhostScript to convert the PDF into individual jpg images and uploads each of them to S3. Finally, it streams those files into a zip archive that is pushed straight to S3.
We will go through the code for each of these steps.
Required Imports
fs is for file operations, archiver-promise creates a promise-returning zip archive, and await-exec gives us a promise-returning exec, which we need so we can run GhostScript and wait for it to finish.
const fs = require('fs');
const archiver = require('archiver-promise');
const path = require('path');
const AWS = require('aws-sdk');
const exec = require('await-exec');
const stream = require('stream');
require('events').EventEmitter.defaultMaxListeners = 100;
Reading in the pdf from s3
We read the file from S3 as a stream and save it to a temp file inside the lambda's storage. Lambdas are provided 512 MB of temporary storage in /tmp, which is plenty for our purpose.
const grabPdfFromS3 = user => {
  const S3 = new AWS.S3({ region: 'us-west-2' });
  return new Promise((resolve, reject) => {
    const destPath = `/tmp/${user}.pdf`;
    const params = {
      Bucket: process.env.BUCKET,
      Key: `pdfs/${user}.pdf`
    };
    S3.headObject(params)
      .promise()
      .then(() => {
        const s3Stream = S3.getObject(params).createReadStream();
        const fileStream = fs.createWriteStream(destPath);
        s3Stream.on('error', reject);
        fileStream.on('error', reject);
        fileStream.on('close', () => { resolve(destPath); });
        s3Stream.pipe(fileStream);
      })
      .catch(error => {
        console.log(`grabfroms3error: ${error}`);
        reject(error);
      });
  });
};
Saving a file to s3
The saveFileToS3 method takes a single file path and key and saves that file to S3. S3 doesn't have an API call for uploading multiple files at once, so each file has to be uploaded individually.
const saveFileToS3 = (path, key) => {
  const S3 = new AWS.S3({ region: 'us-west-2' });
  return new Promise((resolve, reject) => {
    fs.readFile(path, (err, data) => {
      if (err) {
        return reject(err);
      }
      S3.putObject({
        Bucket: process.env.BUCKET,
        Key: `${key}`,
        ContentType: 'image/jpeg', // readFile returns a Buffer with no type property; only jpgs go through this helper
        Body: data
      }, (error) => {
        if (error) {
          console.log(`savetos3error: ${error}`);
          reject(`savetos3error: ${path}`);
        } else {
          resolve(`success: saved ${path} to S3:${key}`);
        }
      });
    });
  });
};
Creating a Zip file
The createZipFile method creates a zip file stream with archiver-promise and uploads it directly to S3. Archiver pipes its data through a stream, and the upload happens while the data passes through. The documentation for archiver is here (https://www.npmjs.com/package/archiver).
const createZipFile = async (index, subdir) => {
  return new Promise((resolve, reject) => {
    const s3 = new AWS.S3({ region: 'us-west-2' });
    const archive = archiver('/tmp/tmp.zip');
    archive.on('warning', function (err) {
      console.log(`err: ${err}`);
      if (err.code !== 'ENOENT') {
        reject(err);
      }
    });
    archive.on('error', function (err) {
      console.log(`err: ${err}`);
      reject(err);
    });
    // Pipe the archive into the S3 upload stream, then add every jpg in /tmp
    archive.pipe(uploadFromStream(s3, subdir));
    archive.glob('*.jpg', { cwd: '/tmp' }, { prefix: '' });
    archive.finalize().then(function (result) {
      console.log("finalizing archive");
      resolve(result);
    });
  });
};
These are some additional methods required for the archiver upload:
const uploadFromStream = (s3, subdir) => {
  const pass = new stream.PassThrough();
  const params = { Bucket: process.env.BUCKET, Key: `${subdir}.zip`, Body: pass };
  s3uploadfunc(s3, params);
  return pass;
};

const s3uploadfunc = async (s3, params) => {
  try {
    const data = await s3.upload(params).promise();
    console.log(`s3ZipUpload data: ${JSON.stringify(data)}`);
    deleteTempFiles();
  } catch (err) {
    console.log(`s3ZipUpload err: ${err}`);
  }
};
deleteTempFiles exists to remove the jpgs after the zip upload. Sometimes AWS reuses the same container (and therefore the same temporary storage) for a subsequent lambda call, and I want to make sure the jpg files I upload come from the current pdf rather than a previous one, so I delete them as soon as they have been uploaded.
const deleteTempFiles = () => {
  const files_to_delete = [];
  fs.readdirSync('/tmp/').forEach(file => {
    if (file.split('.')[1] == 'jpg') {
      files_to_delete.push(`/tmp/${file}`);
    }
  });
  files_to_delete.forEach(function (filepath) {
    fs.unlinkSync(filepath);
  });
};
Putting it all together in the main function
module.exports.execute = async (event, context, callback) => {
  try {
    const local_pdf_path = await grabPdfFromS3(user(event));
    // Run GhostScript: -sDEVICE=jpeg outputs jpgs, -r128 sets the resolution,
    // and -o /tmp/image%d.jpg writes one numbered jpg per page
    await exec('./lambda-ghostscript/bin/gs -sDEVICE=jpeg -dTextAlphaBits=4 -r128 -o /tmp/image%d.jpg ' + local_pdf_path);
    let index = 1;
    while (fs.existsSync(currentUrl(index))) {
      await saveFileToS3(currentUrl(index), s3Url(user(event), index));
      index++;
    }
    if (index > 1) {
      console.log(`creating zip: index ${index}, user(event): ${user(event)}`);
      await createZipFile(index, user(event));
      return { statusCode: 200, body: JSON.stringify({ message: 'Success', input: `${index - 1} images were uploaded` }, null, 2) };
    }
  } catch (err) {
    console.log(`conversion error: ${err}`);
    return { statusCode: 500, body: JSON.stringify({ message: err, input: event }, null, 2) };
  }
};
First we grab the PDF file from S3, then we run GhostScript to convert it into individual jpgs. Once that completes, we upload each jpg sequentially for as long as the next numbered file exists. If we've generated at least one jpg, we create our zip file. The createZipFile method also triggers the cleanup of the temp files, so that's pretty much it.
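The handler above leans on a few small helpers (user, currentUrl and s3Url) that live in the full source. Here is a rough sketch of what they might look like, inferred from how they are used; the images/ prefix in particular is an assumption, and the path module is the one already imported at the top:

// pulls the user id out of the S3 event; the uploaded key is pdfs/<user>.pdf
const user = event => path.basename(event.Records[0].s3.object.key, '.pdf');
// matches the /tmp/image%d.jpg pattern GhostScript writes to
const currentUrl = index => `/tmp/image${index}.jpg`;
// destination key for each jpg; the images/ prefix is assumed for illustration
const s3Url = (user, index) => `images/${user}/image${index}.jpg`;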
While some small performance gains might be possible by running more of this concurrently, I suspect the gains would be minimal because a lambda only gets access to a limited amount of CPU. I also don't want to fan this work out across multiple lambdas, since they don't share temp storage.
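For illustration, if you did want to try concurrent uploads, the sequential while loop could be swapped for something like the sketch below. It reuses the saveFileToS3, currentUrl, s3Url and user helpers from above and assumes the number of generated jpgs has already been counted:

const uploadAllConcurrently = (event, count) => {
  const uploads = [];
  for (let index = 1; index <= count; index++) {
    uploads.push(saveFileToS3(currentUrl(index), s3Url(user(event), index)));
  }
  // Promise.all makes sure we wait for every upload before the lambda moves on
  return Promise.all(uploads);
};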
The biggest issue I had writing this was making sure the lambda didn't exit before all of the tasks had completed. To that end, we have to wait for every promise before moving on. A lambda will often finish its current task even if we don't wait, but the next task simply gets dropped when the lambda quits.
The full source code is available on GitHub.
Serverless Template
To deploy this code as a lambda on AWS, we will be using serverless. To install serverless, you will need node.
Once you've installed node, you can install serverless by running the command
npm install -g serverless
You will also need to set up AWS credentials if you have not already done so.
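If you want to do this through serverless itself, the framework's config credentials command will store a key pair for you (replace the placeholders with your own IAM access key and secret):
serverless config credentials --provider aws --key <ACCESS_KEY_ID> --secret <SECRET_ACCESS_KEY>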
To get started with a serverless template, run the command
serverless create --template aws-nodejs
This will generate a basic serverless template to get you started. Here is the serverless template (serverless.yml) once I've made my necessary changes:
service: pdf2img
custom:
  myStage: ${opt:stage, self:provider.stage}
  myEnvironment:
    HOSTURL:
      prod: "https://pdf2jpgs.com"
      dev: "http://localhost:3000"
service - name of the project
custom - variables we may use in our scripts
provider:
  name: aws
  runtime: nodejs8.10
  stage: dev
  environment:
    HOSTURL: ${self:custom.myEnvironment.HOSTURL.${self:custom.myStage}}
  region: us-west-2
  memorySize: 3008
  timeout: 180
  iamRoleStatements:
    - Effect: Allow
      Action:
        - s3:PutObject
        - s3:GetObject
      Resource: "arn:aws:s3:::quiztrainer-quiz-images-${self:provider.stage}/*"
provider - configuration options for AWS. I've set memorySize (for the lambda) to 3008 MB because AWS allocates CPU power in proportion to the amount of memory configured. I've set the timeout to 3 minutes, because really large pdfs can take that long. I've also configured IAM role statements that allow the lambda to read from and write to the bucket I'm going to create here, the one that holds the pdfs, images and zip files.
functions:
  pdf2img:
    handler: pdf2img.execute
    events:
      - s3:
          bucket: pdf
          event: s3:ObjectCreated:*
          rules:
            - suffix: .pdf
    environment:
      BUCKET: "quiztrainer-quiz-images-${self:provider.stage}"
resources:
  - ${file(resources/s3-bucket.yml)}
functions - specific configuration for the lambda that serverless is going to create. We're referencing our pdf2img.js (the lambda we've created above), which gets triggered on the event we describe here (the creation of a pdf file in the specified bucket). In resources, we have more configuration options for this bucket.
While this normally isn't as important, for this particular CloudFormation template the naming convention used to describe the bucket matters a great deal. Adding a bucket to an event normally creates it, but we also want to configure that bucket in the same step to add CORS (Cross-Origin Resource Sharing) to it, and if you don't follow the convention, serverless will have trouble recognizing it as the same bucket.
resources/s3-bucket.yml
Here is the rest of the bucket configuration:
Resources:
  S3BucketPdf:
    Type: AWS::S3::Bucket
    Properties:
      # Set the CORS policy
      BucketName: "quiztrainer-quiz-images-${self:provider.stage}"
      CorsConfiguration:
        CorsRules:
          - AllowedOrigins:
              - ${self:provider.environment.HOSTURL}
            AllowedHeaders:
              - '*'
            AllowedMethods:
              - GET
              - PUT
              - POST
              - DELETE
              - HEAD
            MaxAge: 3000
  Pdf2imgLambdaPermissionPdfS3:
    Type: "AWS::Lambda::Permission"
    Properties:
      FunctionName:
        "Fn::GetAtt":
          - Pdf2imgLambdaFunction
          - Arn
      Principal: "s3.amazonaws.com"
      Action: "lambda:InvokeFunction"
      SourceAccount:
        Ref: AWS::AccountId
      SourceArn: "arn:aws:s3:::quiztrainer-quiz-images-${self:provider.stage}"
  Pdf2imgRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service:
                - lambda.amazonaws.com
            Action: sts:AssumeRole
      Policies:
        - PolicyName: Logs
          PolicyDocument:
            Version: '2012-10-17'
            Statement:
              - Effect: Allow
                Action:
                  - logs:CreateLogGroup
                  - logs:CreateLogStream
                  - logs:PutLogEvents
                Resource: '*'
  PdfBucketPolicy:
    Type: AWS::S3::BucketPolicy
    Properties:
      Bucket:
        Ref: S3BucketPdf
      PolicyDocument:
        Statement:
          - Principal: "*"
            Action:
              - s3:GetObject
            Effect: Allow
            Sid: "AddPerm"
            Resource:
              Fn::Join:
                - '/'
                - - Fn::GetAtt:
                      - S3BucketPdf
                      - Arn
                  - '*'
This concludes the tutorial for building a PDF2JPG converter using lambda, serverless, React JS and Amplify. Please leave questions and comments below.