
TabulaPro: Pro-version of Tabula-py


TabulaPro is a layer on top of the tabula-py library that extracts tables from scanned PDFs and images.

TabulaPro vs Tabula

Coding with TabulaPro is no different from coding with the original Tabula. To make your current tabula-py code TabulaPro-compatible, pass flavor="TabulaPro" or tabulapro=True to read_pdf() to process images or scanned PDFs.

Installation

💡 ProTip: ExtractTable-py is the official library; it is FASTER than this wrapper and has NO software dependencies.

Because the library depends on Tabula, which has its own software dependencies, the developer is expected to install them in order to use the regular Tabula flavors (“stream”, “lattice”) along with “TabulaPro”.

Using pip

After installing software dependencies, you can simply use pip to install TabulaPro:

$ pip install -U TabulaPro  

Prerequisites

The developer needs an api_key (free credits here) to use TabulaPro. Each image file or PDF page processed consumes one credit.
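Because each image or PDF page costs one credit, it can be useful to estimate the cost of a request before submitting it. The sketch below is only an illustration and not part of TabulaPro: `estimate_credits` is a hypothetical helper, it assumes the `pages` string follows the "1", "1,3-4", "all" forms accepted by read_pdf, and it assumes a page listed twice is counted once.

```python
def estimate_credits(pages: str, total_pages: int) -> int:
    """Estimate credits for a PDF, assuming one credit per distinct page.

    `pages` uses the "1", "1,3-4", "all" style; `total_pages` is the page
    count of the PDF, needed to resolve "all". Hypothetical helper.
    """
    spec = pages.strip().lower()
    if spec == "all":
        return total_pages

    selected = set()
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            # A range such as "3-4" covers both end points.
            start, end = (int(p) for p in part.split("-", 1))
            if start > end:
                raise ValueError(f"invalid page range: {part!r}")
            selected.update(range(start, end + 1))
        else:
            selected.add(int(part))

    # Pages outside the document cannot be processed, so they are ignored.
    return len({p for p in selected if 1 <= p <= total_pages})


print(estimate_credits("1,3-4", total_pages=10))
```

A single image file would simply count as one credit, so the helper only matters for multi-page PDFs.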

The api_key should be passed through pro_kwargs, a dict-type argument that accepts api_key, job_id, dup_check, and max_wait_time as keys. It can be used as below:

{
    "api_key": str,
        Mandatory, to trigger the "TabulaPro" flavor and process scanned PDFs and images, as well as text PDF files

    "job_id": str,
        Optional, if processing a new file
        Mandatory, to retrieve the result of an already submitted file

    "dup_check": bool, default: False - to bypass the duplicate check
        Useful to handle duplicate requests; the check is based on the file name

    "max_wait_time": int, default: 300
        Checks for the output every 15 seconds until successfully processed or for a maximum of 300 seconds.
}
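As a minimal sketch of how these keys fit together, the snippet below builds a pro_kwargs dict for a new submission and for retrieving an earlier job. `build_pro_kwargs` is a hypothetical helper written for this illustration, not a TabulaPro function, and the job ID shown is a made-up placeholder.

```python
def build_pro_kwargs(api_key, job_id=None, dup_check=False, max_wait_time=300):
    """Assemble a pro_kwargs dict from the documented keys (hypothetical helper)."""
    if not api_key:
        # api_key is the only key documented as always mandatory.
        raise ValueError("api_key is mandatory")

    kwargs = {
        "api_key": api_key,
        "dup_check": dup_check,
        "max_wait_time": max_wait_time,
    }
    # job_id is only needed when retrieving an already submitted file.
    if job_id is not None:
        kwargs["job_id"] = job_id
    return kwargs


# New file: no job_id.
new_job = build_pro_kwargs("YOUR_API_KEY_HERE")

# Already submitted file: pass the job_id returned earlier (placeholder value).
earlier_job = build_pro_kwargs("YOUR_API_KEY_HERE", job_id="EXAMPLE_JOB_ID")
```

Either dict would then be passed as the pro_kwargs argument of read_pdf().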

Let’s code

Quickly validate the API key and see the number of credits attached to it

api_key = "YOUR_API_KEY_HERE"  # replace with your own API key

from tabula_pro import check_usage
print(check_usage(api_key))

If the snippet above runs without error, the API key is valid.

Here’s how you can extract tables from Image files.

The example image (tabula-data-page-1.PNG) used in the code below can be found here. Note that tabula-data-page-1.PNG is the image version of the first page of Tabula’s PDF example, data.pdf.

from tabula_pro import read_pdf
pro_tables = read_pdf(
    'foo-image.jpg', 
    flavor="tabulapro", 
    pro_kwargs={"api_key": api_key}
)

# To process PDF, make use of pages ("1", "1,3-4", "all") params in the read_pdf function
# pro_tables = read_pdf('foo-image.PDF', flavor="tabulaPro", pages="1,3-4", pro_kwargs={'api_key': api_key})

pro_tables is a list of the DataFrames found in the file.

pro_tables[0]
                      mpg  cyl   disp   hp  drat     wt   gsec  VS  am  gear  carb
Mazda RX4            21.0    6  160.0  110  3.90  2.620  16.46   0   1     4     4
Mazda RX4 Wag        21.0    6  160.0  110  3.90  2.875  17.02   0   1     4     4
Datsun 710           22.8    4  108.0   93  3.85  2.320  18.61   1   1     4     1
Hornet 4 Drive       21.4    6  258.0  110  3.08  3.215  19.44   L   0     3     L
Hornet Sportabout    18.7    8  360.0  175  3.15  3.440  17.02   0   0     3     2
Valiant              18.1    6  225.0  105  2.76  3.460  20.22   1   0     3     1
Duster 360           14.3    8  360.0  245  3.21  3.570  15.84   0   0     3     4
Mere 240D            24.4    4  146.7   62  3.69  3.190  20.00   1   0     4     2
Mere 230             22.8    4  140.8   95  3.92  3.150  22.90   1   0     4     2
Merc 280             19.2    6  167.6  123  3.92  3.440  18.30   1   0     4     4
Merc 280C            17.8    6  167.6  123  3.92  3.440  18.90   1   0     4     4
Mere 450SE           16.4    8  275.8  180  3.07  4.070  17.40   0   0     3     3
Merc 450SL           17.3    8  275.8  180  3.07  3.730  17.60   0   0     3     3
Merc 450SLC          15.2    8  275.8  180  3.07  3.780  18.00   0   0     3     3
Cadillac Fleetwood   10.4    8  472.0  205  2.93  5.250  17.98   0   0     3     4
Lincoln Continental  10.4    8  460.0  215  3.00  5.424  17.82   0   0     3     4
Chrysler Imperial    14.7    8  440.0  230  3.23  5.345  17.42   0   0     3     4
Fiat 128             32.4    4   78.7   66  4.08  2.200  19.47   1   1     4     1
Honda Civic          30.4    4   75.7   52  4.93  1.615  18.52   1   1     4     2
Toyota Corolla       33.9    4   71.1   65  4.22  1.835  19.90   L   1     4     L
Toyota Corona        21.5    4  120.1   97  3.70  2.465  20.01   1   0     3     1
Dodge Challenger     15.5    8  318.0  150  2.76  3.520  16.87   0   0     3     2
AMC Javelin          15.2    8  304.0  150  3.15  3.435  17.30   0   0     3     2
Camaro 728           13.3    8  350.0  245  3.73  3.840  15.41   0   0     3     4
Pontiac Firebird     19.2    8  400.0  175  3.08  3.845  17.05   0   0     3     2
Fiat X1-9            27.3    4   79.0   66  4.08  1.935  18.90   1   1     4     1
Porsche 914-2        26.0    4  120.3   91  4.43  2.140  16.70   0   1     5     2
Lotus Europa         30.4    4   95.1  113  3.77  1.513  16.90   L   1     5     2
Ford Pantera L       15.8    8  351.0  264  4.22  3.170  14.50   0   1     5     4
Ferrari Dino         19.7    6  145.0  175  3.62  2.770  15.50   0   1     5     6
Maserati Bora        15.0    8  301.0  335  3.54  3.570  14.60   0   1     5     8
Volyo 142F           21.4    4  121.0  109  4.11  2.780  18.60   1   1     4     2

Most image files are processed in under 5 seconds. A blurry, large, or poor-quality image may take up to 15 seconds, and the processing time for a PDF depends on its page count. In these cases, the process checks the job status every 15 seconds, for a maximum of 300 seconds, until the job completes successfully and a final response is returned.

ProTip: For more control over the process wait time, check out ExtractTable-py.

Pull Requests & Rewards

Pull requests are most welcome, and we show our appreciation with API credits.

License

This project is licensed under the Apache License 2.0, see the LICENSE file for details.

Credits

Last but not least, we would like to thank the contributors of tabula-py.

Social Media

Follow us on social media for library updates and free credits.
