How can I extract text from a searchable PDF and save it to my db (nodejs)

Hi nodejs experts!

I’m searching for a way to extract text from a searchable PDF and save it to my database. I know that there are some npm packages for this, but I have no idea how to implement this.

Does anyone have experience with this?

Thanks a lot for your help!

I would like to use it from the server side. I assume I have to create a custom module for this. Is this correct?

Yes, a custom module is the way to go.
Take a look at the topic #custom-modules

Thanks @sid for your help. I’m really bad with JavaScript…

I looked at your image-to-base64 module and tried to create the .hjson file:

{
  type: 'pdf_getValue',
  module : 'pdf',
  action : 'getValue',
  groupTitle : 'My Modules',
  groupIcon : 'fas fa-lg fa-project-diagram comp-images',
  title : 'PDF Extraction',
  icon : 'fas fa-lg fa-file-pdf comp-images',
  dataPickObject: true,
  properties : [
    {
      group: 'Source File',
      variables: [
        { name: 'name', optionName: 'name', title: 'Name', type: 'text', required: true, defaultValue: ''},
        { name: 'path', optionName: 'path', title: 'Path', type: 'text', required: true, defaultValue: '', serverDataBindings: true},
        { name: 'output', optionName: 'output', title: 'Output', type: 'boolean', defaultValue: false }
      ]
    }
  ]
}

I have absolutely no idea how to translate the following code into the .js file to make it work:

const fs = require("fs");
const pdf = require("pdf-extraction");

// Read the PDF file into a Buffer
let dataBuffer = fs.readFileSync("path to PDF file...");

pdf(dataBuffer).then(function (data) {
    // number of pages
    console.log(data.numpages);
    // number of rendered pages
    console.log(data.numrender);
    // PDF info
    console.log(data.info);
    // PDF metadata
    console.log(data.metadata);
    // PDF.js version
    // check https://mozilla.github.io/pdf.js/getting_started/
    console.log(data.version);
    // PDF text
    console.log(data.text);
});

Well, with some trial and error I got it working. It was easier than I thought. Pretty cool!

This is the final code (for the moment):

.hjson file

{
  type: 'pdf_extraction_getValue',
  module : 'pdf_extraction',
  action : 'getValue',
  groupTitle : 'My Modules',
  groupIcon : 'fas fa-lg fa-project-diagram comp-images',
  title : 'PDF Extraction',
  icon : 'fas fa-lg fa-file-pdf comp-images',
  dataPickObject: true,
  properties : [
    {
      group: 'Source File',
      variables: [
        { name: 'name', optionName: 'name', title: 'Name', type: 'text', required: true, defaultValue: ''},
        { name: 'path', optionName: 'path', title: 'Path', type: 'text', required: true, defaultValue: '', serverDataBindings: true},
        { name: 'output', optionName: 'output', title: 'Output', type: 'boolean', defaultValue: false }
      ]
    }
  ]
}

.js file

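const fs = require('fs');
const pdf_extraction = require('pdf-extraction');
const { toSystemPath } = require('../../../lib/core/path');

exports.getValue = function (options) {
    // Resolve the bound path option to an absolute path on the server
    let path = toSystemPath(this.parseRequired(this.parse(options.path), 'string', 'pdf_extraction.getValue: path is required.'));
    // Read the file into a Buffer and return the promise with the parsed result
    return pdf_extraction(fs.readFileSync(path));
};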

To make it work, you need to install the npm package first: npm install pdf-extraction
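
If only a few fields should end up in the database, the parsed result could be reduced before it is stored. This is a minimal sketch, not part of the module above: the helper name toDbRow and the keys file_name, page_count and content are hypothetical, and the input is assumed to have the shape logged in the sample code (numpages, text).

```javascript
// Hypothetical helper: reduce the parsed PDF result to the fields worth storing.
// `data` is assumed to look like the object logged in the sample code above;
// the returned keys are illustrative column names, not a fixed convention.
function toDbRow(fileName, data) {
  return {
    file_name: fileName,
    page_count: data.numpages,
    // Extracted text can start with blank lines, so trim it before saving.
    content: (data.text || "").trim(),
  };
}

const row = toDbRow("test.pdf", { numpages: 2, text: "\n\nTest" });
console.log(row);
```

Whether this trimming happens inside the custom module or in a later step of the Server Action is a design choice. Returning the full result keeps the module generic, while returning a reduced object keeps the insert step simpler.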

Something must still be missing…

I get [object Object] in the App Connect expression.

This is the output of the Server Action:

{"identity":false,"test":{"numpages":2,"numrender":2,"info":{"PDFFormatVersion":"1.3","IsAcroFormPresent":false,"IsXFAPresent":false},"metadata":null,"text":"\n\nTest","version":"1.10.100"}}

What could be the cause? It seems that it does not read the schema…

What am I missing @sid, @patrick, @JonL?

What value do you want to output?

“text” and “PDFFormatVersion”

{{ scPDFTest.data.test.info.PDFFormatVersion }}
and
{{ scPDFTest.data.test.text }}
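
The [object Object] text is what JavaScript produces when a whole object is turned into a string, which is why binding to a single field fixes it. A plain Node.js sketch of the same effect, using the Server Action output posted above (the variable name response is only illustrative and stands in for scPDFTest.data):

```javascript
// The JSON returned by the Server Action, copied from the post above.
const response = JSON.parse(
  '{"identity":false,"test":{"numpages":2,"numrender":2,"info":{"PDFFormatVersion":"1.3","IsAcroFormPresent":false,"IsXFAPresent":false},"metadata":null,"text":"\\n\\nTest","version":"1.10.100"}}'
);

// Converting the whole object to a string gives the unhelpful placeholder.
const whole = String(response.test);

// Drilling down to a primitive value gives something displayable.
const pdfVersion = response.test.info.PDFFormatVersion;
const text = response.test.text;

console.log(whole);       // [object Object]
console.log(pdfVersion);  // 1.3
console.log(text.trim()); // Test
```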

Thank you @JonL!

Is it also possible to make it available in the picker?

You need to define the dataScheme in the .hjson file.

dataScheme: [
  { name: 'numpages', type: 'number' },
  { name: 'numrender', type: 'number' },
  { name: 'info', type: 'object', sub: [
    { name: 'PDFFormatVersion', type: 'text' },
    { name: 'IsAcroFormPresent', type: 'boolean' },
    { name: 'IsXFAPresent', type: 'boolean' }
  ]},
  { name: 'text', type: 'text' },
  { name: 'version', type: 'text' }
]
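
For larger results, writing the dataScheme by hand gets tedious. The following is only a sketch of how the entries could be generated from a sample output: the helper toDataScheme is hypothetical, and it uses only the type names that appear in this thread (text, number, boolean, object, array). Values that are null, such as metadata here, are skipped because their type cannot be inferred.

```javascript
// Hypothetical helper: build dataScheme-style entries from a sample result.
function toDataScheme(sample) {
  const entries = [];
  for (const [name, value] of Object.entries(sample)) {
    // The type of a null/undefined value is unknown, so leave it out.
    if (value === null || value === undefined) continue;
    if (Array.isArray(value)) {
      entries.push({ name, type: "array" });
    } else if (typeof value === "object") {
      // Nested objects get their own list of entries under `sub`.
      entries.push({ name, type: "object", sub: toDataScheme(value) });
    } else if (typeof value === "number" || typeof value === "boolean") {
      entries.push({ name, type: typeof value });
    } else {
      entries.push({ name, type: "text" });
    }
  }
  return entries;
}

// Sample taken from the Server Action output posted earlier in the thread.
const sample = {
  numpages: 2,
  numrender: 2,
  info: { PDFFormatVersion: "1.3", IsAcroFormPresent: false, IsXFAPresent: false },
  metadata: null,
  text: "\n\nTest",
  version: "1.10.100",
};

const scheme = toDataScheme(sample);
console.log(JSON.stringify(scheme, null, 2));
```

The printed array can be pasted into the .hjson file as a starting point and adjusted by hand, for example to add date types.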

Thanks @patrick, @sid & @JonL for your help!

This is my final module:

Hey @patrick, if I wanted to define the schema to have nesting within nesting, would this be the correct structure?

{name: 'experience', type: 'array', sub: [
  {name: 'experience.end_date', type: 'date' },
  {name: 'experience.start_date', type: 'date' },
  {name: 'experience.is_primary', type: 'boolean' },
  {name: 'experience.summary', type: 'text' },
  {name: 'experience.location_names', type: 'array' },
  {name: 'experience.company', type: 'obect', sub: [
    {name: 'experience.company.facebook_url', type: 'text' },
    {name: 'experience.company.founded', type: 'text' },
    {name: 'experience.company.id', type: 'text' },
    {name: 'experience.company.industry', type: 'text' },
    {name: 'experience.company.name', type: 'text' },
    {name: 'experience.company.size', type: 'text' },
    {name: 'experience.company.website', type: 'text' },
    {name: 'experience.company.location', type: 'object' },
    {name: 'experience.company.twitter_url', type: 'text' },
    {name: 'experience.company.facebook_url', type: 'text' }
  ]},
  {name: 'experience.title', type: 'obect', sub: [
    {name: 'experience.title.name', type: 'text' },
    {name: 'experience.title.role', type: 'text' },
    {name: 'experience.title.sub_role', type: 'text' },
    {name: 'experience.title.levels', type: 'array' }
  ]},
]},

Hey @patrick, I suspect the nesting above is incorrect when there are multiple layers, because some of the data is not handled well when I try to format or display it. Would you mind taking a look to see whether it is OK?

Each entry inside sub only needs its own property name, not the full dotted expression.

{name: 'experience', type: 'array', sub: [
  {name: 'end_date', type: 'date' },
  {name: 'start_date', type: 'date' },
  {name: 'is_primary', type: 'boolean' },
  {name: 'summary', type: 'text' },
  {name: 'location_names', type: 'array' },
  {name: 'company', type: 'object', sub: [
    {name: 'facebook_url', type: 'text' },
    {name: 'founded', type: 'text' },
    {name: 'id', type: 'text' },
    {name: 'industry', type: 'text' },
    {name: 'name', type: 'text' },
    {name: 'size', type: 'text' },
    {name: 'website', type: 'text' },
    {name: 'location', type: 'object' },
    {name: 'twitter_url', type: 'text' },
    {name: 'facebook_url', type: 'text' }
  ]},
  {name: 'title', type: 'object', sub: [
    {name: 'name', type: 'text' },
    {name: 'role', type: 'text' },
    {name: 'sub_role', type: 'text' },
    {name: 'levels', type: 'array' }
  ]},
]},
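
The reason short names are enough is that the position in the tree already carries the parent path. As an illustration only (the helper listPaths is hypothetical, and how the picker actually builds its expressions, especially for array items, is an assumption), walking a shortened version of the scheme above reproduces the full dotted paths:

```javascript
// Hypothetical helper: walk a dataScheme-style array and return the dotted
// path of every field, built from the nesting rather than from the names.
function listPaths(scheme, prefix = "") {
  const paths = [];
  for (const entry of scheme) {
    const path = prefix ? `${prefix}.${entry.name}` : entry.name;
    paths.push(path);
    // Recurse into nested entries, passing the current path as the new prefix.
    if (Array.isArray(entry.sub)) {
      paths.push(...listPaths(entry.sub, path));
    }
  }
  return paths;
}

// Shortened version of the scheme from this thread.
const scheme = [
  { name: "experience", type: "array", sub: [
    { name: "start_date", type: "date" },
    { name: "company", type: "object", sub: [
      { name: "name", type: "text" },
      { name: "website", type: "text" },
    ]},
    { name: "title", type: "object", sub: [
      { name: "role", type: "text" },
    ]},
  ]},
];

const paths = listPaths(scheme);
console.log(paths.join("\n"));
```

If the names already contained the full expression, the same walk would produce doubled paths such as experience.experience.company, which matches the kind of display problems described above.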

Shouldn’t it be object instead of obect? It looks like a typo.

Yes, it should be object.

Thank you @patrick, have updated based on this.