Extract values from a document

Introduction

The following presents the four steps to extract data from your documents, assuming the DocType has already been created on the platform.

The code snippets below use the os and requests packages. Import them with:

import os
import requests

Moreover, the following constants have to be defined:

URL_AUTH_SERVER: the URL of the authentication server, like https://extract.auth.recital.ai/
URL_SERVER: the URL of the Extract server, like https://extract.api.recital.ai/
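
For example, if the example URLs above match your environment, the constants can be defined directly in your script (replace them with the addresses of your own servers otherwise):

URL_AUTH_SERVER = 'https://extract.auth.recital.ai/'   # authentication server
URL_SERVER = 'https://extract.api.recital.ai/'         # Extract API server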

Authenticate on a server

The authentication step returns the headers that will be used for all further API calls. You need a username (user) and a password (pwd) to obtain them.
The following code returns the headers:

token = requests.post(URL_AUTH_SERVER+'auth/api/v1/login/?noAuth=true',
                data={'username':user,'password':pwd}).json()
headers = {'Authorization':'Bearer '+token['access_token']}
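
As with the other calls below, you may want to check the response status before building the headers. The following is a minimal sketch, assuming the login endpoint returns HTTP 200 on success:

login = requests.post(URL_AUTH_SERVER+'auth/api/v1/login/?noAuth=true',
                      data={'username': user, 'password': pwd})
if login.status_code == 200:
    headers = {'Authorization': 'Bearer ' + login.json()['access_token']}
else:
    print(f'ERROR - {login.status_code} - {login.reason}')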

Get the DocType id

Let’s suppose, for the sake of example, that the DocType of the document you want to extract data from is called “Invoice (fr)” (a DocType for French invoices).
The ID of the DocType is required to call the APIs. Hence, you can define this function:

def get_DT_id(a_DT_name):
    # List the DocTypes defined on the server and keep the id of the one matching the given name
    r = requests.get(url=URL_SERVER+'system_2/document_type/', headers=headers)
    res = [DT['id'] for DT in r.json() if DT['name'] == a_DT_name]
    if len(res) == 0: return None
    else: return res[0]

And call DT_id = get_DT_id('Invoice (fr)') to get the id of your DocType.
If DT_id is None, the DocType does not exist on the server referred to by URL_SERVER. In that case, log in to the platform (same address as URL_SERVER without .api) and make sure you actually defined the DocType and spelled its name correctly in the call.

Upload the document

To upload the document to the server, run the following code with the headers, the file_name and the DT_id obtained above:

f = open(file_name, 'rb')
files = {'file_in': (os.path.basename(f.name), f, 'multipart/form-data')}
file_post = requests.post(f'{URL_SERVER}files/',
                          data={'doctype_id': DT_id}, files=files, headers=headers)
if file_post.status_code == 201:
    # on success (201), the response body is the id of the uploaded file
    file_id = file_post.json()
    print(f'{file_post} - {file_id}')
else:
    print(f'ERROR - {file_post.status_code} - {file_post.reason}')
f.close()

Download the results

Once the document is uploaded, the extracted values are available for download with the file_id.

To get the values extracted from your file, just call:

r = requests.get(url=URL_SERVER+f'files/{file_id}/values/', headers=headers)

if r.status_code == 200:
    values = r.json()
else:
    print(f'ERROR - {r.status_code} - {r.reason}')

The values are returned as a list of dictionaries.

Each dictionary holds various information for each Data Point. The whole structure is described in the Swagger here.

In a nutshell, each Data Point is referenced with its ID (data_point_id) and its name (data_point_name) and contains two lists, values and verified_values:
values contains the list of values returned by Extract
verified_values contains the list of values validated by the user

In either case, each entry holds a page number (page_nb) and a nested values list, each item of which contains the extracted value.
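
As an illustration, a single Data Point entry may look like the structure below. The field names come from the description above; the id, name and value are made up, and any other fields returned by the API are omitted:

example_data_point = {
    'data_point_id': 42,                       # illustrative id
    'data_point_name': 'Total amount',         # illustrative name
    'values': [                                # values returned by Extract
        {'page_nb': 1,
         'values': [{'value': '1 250,00'}]}    # illustrative extracted value
    ],
    'verified_values': []                      # values validated by the user
}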

So, to collect the values for all the Data Points extracted from the uploaded document, just run the following on the values list obtained above:

res = []
for v in values:
    to_append = {'name': v['data_point_name'],
                 'id': v['data_point_id']}

    # keep only the first extracted value (and its page) for each Data Point
    if len(v['values'][0]['values']) == 0:
        to_append['page_nb'] = None
        to_append['value'] = None
    else:
        to_append['page_nb'] = v['values'][0]['page_nb']
        to_append['value'] = v['values'][0]['values'][0]['value']

    res.append(to_append)

At the end of the code, res contains a list of dicts with the name, id, page_nb and value of each Data Point extracted from the uploaded document.
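
From there you can use res however you like. As a minimal example, the snippet below prints each extracted value and writes the results to a CSV file (the output file name is arbitrary):

import csv

for dp in res:
    print(f"{dp['name']} (page {dp['page_nb']}): {dp['value']}")

with open('extracted_values.csv', 'w', newline='') as out:
    writer = csv.DictWriter(out, fieldnames=['name', 'id', 'page_nb', 'value'])
    writer.writeheader()
    writer.writerows(res)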

