Script to extract data from PDF files, structure it, and export it to a .csv / .xls / .xlsx file

Closed job
Michał Wojda
Employer
3 deals
Job category:
Desktop/web applications
Expected budget:

Negotiable

Job description

I would like to commission a script that extracts all relevant data (contact details, etc.) from directories of PDF files.

The PDF files have a complex structure: a single file contains multiple companies, with more than 100 data elements per company, and all of them need to be extracted.

The result we are interested in is fully structured data in a .csv / .xls / .xlsx file.

If you have worked with data such as CEIDG / KRS / Court Monitor and with #NLP #python #PyPDF #PyMuPDF ..., then you can handle this job.

Links to 2 sample .pdf files:

https://wyszukiwarka-msig.ms.gov.pl/api/Monitor/Download?id=1943&fileId=true

https://wyszukiwarka-msig.ms.gov.pl/api/Monitor/Download?id=6969&fileId=true
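To make the expected shape of the work more concrete, here is a minimal sketch of a PDF-to-CSV pipeline. It is only an illustration under stated assumptions, not the required solution: the "Poz. <number>." entry marker, the 10-digit "KRS" pattern, the column names, and all function names are hypothetical and would have to be adapted to the real layout of the files. Text extraction uses PyMuPDF (`fitz.open`, `page.get_text`), which is imported lazily so the parsing part can be tried without it.

```python
import csv
import io
import re
from pathlib import Path

# Assumed layout (hypothetical): each entry starts with "Poz. <number>."
# and may contain a 10-digit KRS number somewhere in its body.
ENTRY_START = re.compile(r"^Poz\.\s*(\d+)\.\s*", re.MULTILINE)
KRS_NUMBER = re.compile(r"KRS\s+(\d{10})")
FIELDS = ["file", "position", "krs", "text"]  # illustrative columns only


def extract_text(pdf_path):
    """Return the plain text of all pages of one PDF using PyMuPDF."""
    import fitz  # PyMuPDF, installed with `pip install pymupdf`

    with fitz.open(pdf_path) as doc:
        return "\n".join(page.get_text() for page in doc)


def parse_entries(text, source=""):
    """Split extracted text into one record per entry marker."""
    matches = list(ENTRY_START.finditer(text))
    records = []
    for i, match in enumerate(matches):
        # An entry body runs until the next marker or the end of the text.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        body = text[match.end():end]
        krs = KRS_NUMBER.search(body)
        records.append({
            "file": source,
            "position": match.group(1),
            "krs": krs.group(1) if krs else "",
            "text": " ".join(body.split()),  # collapse line breaks
        })
    return records


def to_csv_text(records):
    """Serialize records to CSV text with a header row."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


def convert_directory(pdf_dir, csv_path):
    """Process every *.pdf in pdf_dir and write one combined CSV file."""
    records = []
    for pdf_path in sorted(Path(pdf_dir).glob("*.pdf")):
        records.extend(parse_entries(extract_text(pdf_path), pdf_path.name))
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        handle.write(to_csv_text(records))
    return len(records)
```

A call such as `convert_directory("pdfs", "result.csv")` would then produce one row per detected entry. A real solution would need many more fields than the three shown here, and an .xlsx export would require an additional library.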

Process:

1) we sign an assignment contract and an NDA

2) you receive a test set of 30 files

once you confirm that you can transform them

3) you receive a directory with a sample of 6000 files

4) you send the output of the script

we check that the data is correct and that values are not wrongly split across columns

if everything is OK

5) you receive the transfer

6) you send the script with instructions for installation / operation
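The column check in step 4 could be partly automated. The sketch below is one possible approach, not the employer's actual acceptance procedure: it flags rows whose number of cells differs from the header and rows where a required column is empty. The function name `find_malformed_rows` and the example column names are hypothetical.

```python
import csv
import io


def find_malformed_rows(csv_text, required=()):
    """Return (line_number, reason) pairs for rows that look broken."""
    reader = csv.reader(io.StringIO(csv_text, newline=""))
    header = next(reader, None)
    if header is None:
        return [(1, "empty file")]

    problems = []
    for row in reader:
        # A value that spilled into extra cells changes the cell count.
        if len(row) != len(header):
            problems.append(
                (reader.line_num,
                 f"expected {len(header)} columns, found {len(row)}")
            )
            continue
        # Required columns must not be blank.
        for name in required:
            if not row[header.index(name)].strip():
                problems.append(
                    (reader.line_num, f"empty required column '{name}'")
                )
    return problems
```

For example, `find_malformed_rows(open("result.csv", encoding="utf-8", newline="").read(), required=("krs",))` would list the lines that need a manual look. It cannot tell whether a value ended up in the wrong column when the cell count still matches, so spot checks against the source PDFs remain necessary.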

Required functions: