How can I extract text from a searchable PDF and save it to my db (nodejs)

Hi nodejs experts!

I’m searching for a way to extract text from a searchable PDF and save it to my database. I know that there are some npm packages for this, but I have no idea how to implement this.

Does anyone have experience with this?

Thanks a lot for your help!

I would like to use it from the server side. I assume I have to create a custom module for this. Is this correct?

Yes, a custom module is the way to go.
Take a look at the #custom-modules topic.

Thanks @sid for your help. I’m really bad with JavaScript…

I looked at your image-to-base64 module and tried to create the .hjson file:

{
  type: 'pdf_getValue',
  module : 'pdf',
  action : 'getValue',
  groupTitle : 'My Modules',
  groupIcon : 'fas fa-lg fa-project-diagram comp-images',
  title : 'PDF Extraction',
  icon : 'fas fa-lg fa-file-pdf comp-images',
  dataPickObject: true,
  properties : [
    {
      group: 'Source File',
      variables: [
        { name: 'name', optionName: 'name', title: 'Name', type: 'text', required: true, defaultValue: ''},
        { name: 'path', optionName: 'path', title: 'Path', type: 'text', required: true, defaultValue: '', serverDataBindings: true},
        { name: 'output', optionName: 'output', title: 'Output', type: 'boolean', defaultValue: false }
      ]
    }
  ]
}

I have absolutely no idea how to translate the following code into the .js file to make it work:

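const fs = require("fs");
const pdf = require("pdf-extraction");

// Read the whole PDF into a Buffer; pdf-extraction works on the file contents
let dataBuffer = fs.readFileSync("path to PDF file...");
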
pdf(dataBuffer).then(function (data) {
    // number of pages
    console.log(data.numpages);
    // number of rendered pages
    console.log(data.numrender);
    // PDF info
    console.log(data.info);
    // PDF metadata
    console.log(data.metadata);
    // PDF.js version
    // check https://mozilla.github.io/pdf.js/getting_started/
    console.log(data.version);
    // PDF text
    console.log(data.text);
});

Well, with some trial and error I got it working. It was easier than I thought. Pretty cool!!!

This is the final code (for the moment):

.hjson file

{
  type: 'pdf_extraction_getValue',
  module : 'pdf_extraction',
  action : 'getValue',
  groupTitle : 'My Modules',
  groupIcon : 'fas fa-lg fa-project-diagram comp-images',
  title : 'PDF Extraction',
  icon : 'fas fa-lg fa-file-pdf comp-images',
  dataPickObject: true,
  properties : [
    {
      group: 'Source File',
      variables: [
        { name: 'name', optionName: 'name', title: 'Name', type: 'text', required: true, defaultValue: ''},
        { name: 'path', optionName: 'path', title: 'Path', type: 'text', required: true, defaultValue: '', serverDataBindings: true},
        { name: 'output', optionName: 'output', title: 'Output', type: 'boolean', defaultValue: false }
      ]
    }
  ]
}

.js file

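const fs = require('fs');
const pdf_extraction = require('pdf-extraction');
const { toSystemPath } = require('../../../lib/core/path');

exports.getValue = function (options) {
    // Resolve the bound "path" option to an absolute path on the server
    let path = toSystemPath(this.parseRequired(this.parse(options.path), 'string', 'fs.exists: path is required.'));
    // pdf-extraction expects the file contents as a Buffer and returns a promise
    return pdf_extraction(fs.readFileSync(path));
};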

To make it work, you need to install the npm package first: npm install pdf-extraction
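
For trying the extraction outside of Server Connect first, the same read-then-parse flow can be sketched as a small standalone function. This is only a sketch: `extractPdfText` and its `parse` parameter are hypothetical names, and `parse` is assumed to behave like the pdf-extraction export shown earlier (Buffer in, promise of a result object out).

```javascript
const fs = require("fs");

// Hypothetical helper: reads a PDF from disk and resolves with the page
// count and the extracted text. `parse` defaults to the pdf-extraction
// package and is only loaded when no parser is passed in.
async function extractPdfText(filePath, parse = require("pdf-extraction")) {
  const buffer = await fs.promises.readFile(filePath);
  const data = await parse(buffer);
  // Keep only the fields that are likely to end up in the database.
  return { pages: data.numpages, text: data.text.trim() };
}

// Usage (needs `npm install pdf-extraction`; the path is a placeholder):
// extractPdfText("./sample.pdf").then(console.log).catch(console.error);
```

Passing the parser as a parameter keeps the function easy to try with a stub before the real package is installed.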

Something must still be missing…

I get [object Object] in the App Connect expression.

This is the output of the Server Action:

{"identity":false,"test":{"numpages":2,"numrender":2,"info":{"PDFFormatVersion":"1.3","IsAcroFormPresent":false,"IsXFAPresent":false},"metadata":null,"text":"\n\nTest","version":"1.10.100"}}

What could be the cause? It seems that it does not read the schema…

What am I missing @sid, @patrick, @JonL?

What value do you want to output?

“text” and “PDFFormatVersion”

{{ scPDFTest.data.test.info.PDFFormatVersion }}
and
{{ scPDFTest.data.test.text }}
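
The [object Object] text most likely appears because the expression points at the whole step result, and JavaScript turns a plain object into that label when it is rendered as text. A small sketch in plain JavaScript using the Server Action output posted above (the variable names are only for illustration, this is not App Connect code):

```javascript
// Server Action output as posted above, parsed into a plain object.
const response = JSON.parse(
  '{"identity":false,"test":{"numpages":2,"numrender":2,' +
  '"info":{"PDFFormatVersion":"1.3","IsAcroFormPresent":false,"IsXFAPresent":false},' +
  '"metadata":null,"text":"\\n\\nTest","version":"1.10.100"}}'
);

// Rendering the whole step result as text gives the generic object label.
const whole = String(response.test);

// Drilling down to a leaf value gives something printable.
const formatVersion = response.test.info.PDFFormatVersion;
const text = response.test.text;

console.log(whole);                // [object Object]
console.log(formatVersion);        // 1.3
console.log(JSON.stringify(text)); // "\n\nTest"
```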

Thank you @JonL!

Is it also possible to make it available in the picker?

You need to define the dataScheme in the .hjson file.

dataScheme: [
  { name: 'numpages', type: 'number' },
  { name: 'numrender', type: 'number' },
  { name: 'info', type: 'object', sub: [
    { name: 'PDFFormatVersion', type: 'text' },
    { name: 'IsAcroFormPresent', type: 'boolean' },
    { name: 'IsXFAPresent', type: 'boolean' }
  ]},
  { name: 'text', type: 'text' },
  { name: 'version', type: 'text' }
]
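
The schema above mirrors the shape of the Server Action output one to one. As a rough illustration of that mapping, the sketch below derives the same entries from a sample result. `toDataScheme` is a hypothetical helper, not part of Wappler, and it only knows the type names used in the schema above.

```javascript
// Hypothetical helper: maps a sample result object to dataScheme entries.
function toDataScheme(sample) {
  const entries = [];
  for (const [name, value] of Object.entries(sample)) {
    // null values and arrays are skipped: this sketch cannot infer a type for them
    if (value === null || Array.isArray(value)) continue;
    if (typeof value === 'object') {
      entries.push({ name, type: 'object', sub: toDataScheme(value) });
    } else if (typeof value === 'number') {
      entries.push({ name, type: 'number' });
    } else if (typeof value === 'boolean') {
      entries.push({ name, type: 'boolean' });
    } else {
      entries.push({ name, type: 'text' });
    }
  }
  return entries;
}

// The "test" object from the Server Action output posted earlier.
const sample = {
  numpages: 2,
  numrender: 2,
  info: { PDFFormatVersion: '1.3', IsAcroFormPresent: false, IsXFAPresent: false },
  metadata: null,
  text: '\n\nTest',
  version: '1.10.100'
};

const scheme = toDataScheme(sample);
console.log(JSON.stringify(scheme, null, 2));
```

Because `metadata` is null in the sample, it gets no entry, which matches the hand-written schema.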

Thanks @patrick, @sid & @JonL for your help!

This is my final module:
