Guide to Using AWS Textract with PHP

This guide demonstrates how to use AWS Textract to extract text from a PDF document using PHP.

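Prerequisites:

  • PHP with Composer installed
  • The AWS SDK for PHP, installed with Composer (composer require aws/aws-sdk-php)
  • AWS credentials that are allowed to call Textract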

Steps:

  1. Include Libraries:

       require 'vendor/autoload.php';

     This line assumes you have installed the AWS SDK for PHP with Composer, so the autoloader is in the vendor directory.

  2. Create Textract Client:

       use Aws\Textract\TextractClient;

       $textract = new TextractClient([
           'version' => 'latest',
           'region'  => 'us-east-1', // Assumed region; change it to the one you use
       ]);

     This code creates a client object to interact with the Textract service. The SDK needs a region to build the client, so set it here unless your environment already provides one. Credentials are resolved through the SDK's default provider chain (environment variables, the shared credentials file, or an IAM role).

  3. Specify Document Path:

       $documentPath = 'sampleinvoice.pdf'; // Change this to the path of your document

     Update this line with the path to the PDF document you want to process.

  4. Read Document File:

       $fp_document = fopen($documentPath, 'rb');
       $document = fread($fp_document, filesize($documentPath));
       fclose($fp_document);

     This code opens the PDF document in binary mode, reads its content into a variable ($document), and then closes the file.

  5. Analyze Document:

       $result = $textract->analyzeDocument([
           'Document' => [
               'Bytes' => $document,
           ],
           'FeatureTypes' => ['TABLES', 'FORMS'], // Required: the feature types to analyze
       ]);

     This code calls the analyzeDocument function of the Textract client with the document bytes and the feature types to analyze (TABLES and FORMS). FeatureTypes is required for this operation; if you only need plain text, detectDocumentText is the simpler call and takes no feature types. The synchronous call shown here accepts single-page documents only. A multi-page PDF has to be stored in S3 and processed with the asynchronous startDocumentAnalysis operation.

  6. Extract Text:

       $extractedText = '';
       if (!empty($result['Blocks'])) {
           foreach ($result['Blocks'] as $block) {
               if ($block['BlockType'] === 'LINE') {
                   $extractedText .= $block['Text'] . ' ';
               }
           }
       }

     This section processes the response from Textract. It checks whether the response contains any blocks ('Blocks'). If so, it iterates through them and takes the text only from blocks of type 'LINE', appending each line to a single space-separated string ($extractedText).

  7. Output Extracted Text:

       echo $extractedText;

     This line prints the extracted text to standard output.
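
The block filtering in step 6 can be tried without calling AWS at all. The sketch below is a minimal illustration, not part of the original script: linesFromBlocks is a hypothetical helper name, and $sampleBlocks is a hand-written stand-in for $result['Blocks'] rather than real Textract output.

```php
<?php

/**
 * Collect the text of every LINE block, in the order the blocks are listed.
 *
 * @param array $blocks The 'Blocks' list from a Textract response.
 * @return string[] One entry per detected line.
 */
function linesFromBlocks(array $blocks): array
{
    $lines = [];
    foreach ($blocks as $block) {
        // WORD blocks repeat text already present in LINE blocks, so skip them.
        if (($block['BlockType'] ?? '') === 'LINE' && isset($block['Text'])) {
            $lines[] = $block['Text'];
        }
    }
    return $lines;
}

// Hand-written sample data (an assumption for illustration only).
$sampleBlocks = [
    ['BlockType' => 'PAGE'],
    ['BlockType' => 'LINE', 'Text' => 'Invoice Number: 12345'],
    ['BlockType' => 'WORD', 'Text' => 'Invoice'],
    ['BlockType' => 'LINE', 'Text' => 'Customer Name: Acme Corp'],
];

// Print one detected line per row.
echo implode(PHP_EOL, linesFromBlocks($sampleBlocks)), PHP_EOL;
```

Returning an array instead of a concatenated string leaves the choice of separator (space, newline, or something else) to the caller.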

Sample Output:

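The output will depend on the content of your PDF document. Because the script joins the detected lines with spaces, it prints them as one continuous line; the example below shows one detected line per row for readability: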

Invoice Number: 12345
Customer Name: Acme Corp
Date: 2023-10-26
... (other lines from the invoice)
Total: $100.00

Complete Code:

<?php

require 'vendor/autoload.php';

use Aws\Textract\TextractClient;

// Create a Textract client
$textract = new TextractClient([
    'version' => 'latest',
    'region'  => 'us-east-1', // Assumed region; change it to the one you use
]);

// Specify the path to the document file
$documentPath = 'sampleinvoice.pdf'; // Change this to the path of your document

// Read the document file in binary mode
$fp_document = fopen($documentPath, 'rb');
$document = fread($fp_document, filesize($documentPath));
fclose($fp_document);

// Call AnalyzeDocument operation to extract text from the document
$result = $textract->analyzeDocument([
    'Document' => [
        'Bytes' => $document,
    ],
    'FeatureTypes' => ['TABLES', 'FORMS'], // Required: the feature types to analyze
]);

// Initialize the variable to store the extracted text
$extractedText = '';

// Process the result to extract text from the response
if (!empty($result['Blocks'])) {
    foreach ($result['Blocks'] as $block) {
        if ($block['BlockType'] === 'LINE') {
            $extractedText .= $block['Text'] . ' ';
        }
    }
}

// Output the extracted text
echo $extractedText;
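
The script above reads the file without checking that it exists or that it is a sensible size. As a hedged sketch of what a safer read might look like, the hypothetical helper readDocumentBytes below validates the file before returning its bytes. The 10 MB default limit is an assumption; check the current Textract quotas for the operation you use.

```php
<?php

/**
 * Read a document into memory, failing loudly instead of passing bad data on.
 *
 * @param string $path     Path to the document.
 * @param int    $maxBytes Upper size bound; the 10 MB default is an assumption.
 * @throws RuntimeException If the file is missing, empty, too large, or unreadable.
 */
function readDocumentBytes(string $path, int $maxBytes = 10 * 1024 * 1024): string
{
    if (!is_file($path) || !is_readable($path)) {
        throw new RuntimeException("Cannot read document: $path");
    }

    $size = filesize($path);
    if ($size === false || $size === 0) {
        throw new RuntimeException("Document is empty: $path");
    }
    if ($size > $maxBytes) {
        throw new RuntimeException("Document is $size bytes; limit is $maxBytes");
    }

    $bytes = file_get_contents($path);
    if ($bytes === false) {
        throw new RuntimeException("Failed to read document: $path");
    }
    return $bytes;
}

// Usage: exercise the helper with a temporary file instead of a real PDF.
$tmp = tempnam(sys_get_temp_dir(), 'doc');
file_put_contents($tmp, '%PDF-1.4 placeholder');

try {
    $document = readDocumentBytes($tmp);
    echo strlen($document), ' bytes read', PHP_EOL;
    readDocumentBytes($tmp . '.missing'); // This path does not exist, so it throws.
} catch (RuntimeException $e) {
    echo 'Error: ', $e->getMessage(), PHP_EOL;
} finally {
    unlink($tmp);
}
```

The call to analyzeDocument can be guarded in the same way, by wrapping it in a try/catch for Aws\Exception\AwsException, which the SDK throws when a service request fails.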

Description of the Code:

This code demonstrates a basic example of using Textract with PHP. It showcases the steps involved in:

  • Creating a Textract client
  • Specifying the document path
  • Reading the document content
  • Calling the analyzeDocument function
  • Processing the response to extract text from lines
  • Printing the extracted text

Additional Notes:

  • This example extracts text only from lines ('BlockType' === 'LINE'). You can modify the code to read other block types based on your needs (e.g., WORD for individual words, or KEY_VALUE_SET and CELL for form fields and table cells).
  • The FeatureTypes option allows you to extract information from tables and forms within the document. Explore the Textract documentation for more details on these features.
  • Error handling is not included in this basic example. Consider implementing proper error handling for real-world applications.
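
For the FORMS feature mentioned above, the response describes each form field as a pair of KEY_VALUE_SET blocks linked through their Relationships entries. The following is a minimal sketch of how those links could be followed. The helper names childText and formFields are hypothetical, and $sampleBlocks is hand-written data shaped like a Textract response, not real output.

```php
<?php

/**
 * Join the text of a block's CHILD words into one string.
 */
function childText(array $block, array $blocksById): string
{
    $words = [];
    foreach ($block['Relationships'] ?? [] as $relationship) {
        if ($relationship['Type'] !== 'CHILD') {
            continue;
        }
        foreach ($relationship['Ids'] as $id) {
            if (isset($blocksById[$id]['Text'])) {
                $words[] = $blocksById[$id]['Text'];
            }
        }
    }
    return implode(' ', $words);
}

/**
 * Pair each form key with its value.
 *
 * @param array $blocks The 'Blocks' list from an analyzeDocument response.
 * @return array<string, string> Key text mapped to value text.
 */
function formFields(array $blocks): array
{
    // Index blocks by Id so relationships can be followed.
    $blocksById = array_column($blocks, null, 'Id');

    $fields = [];
    foreach ($blocks as $block) {
        $isKey = ($block['BlockType'] ?? '') === 'KEY_VALUE_SET'
            && in_array('KEY', $block['EntityTypes'] ?? [], true);
        if (!$isKey) {
            continue;
        }
        $value = '';
        foreach ($block['Relationships'] ?? [] as $relationship) {
            if ($relationship['Type'] !== 'VALUE') {
                continue;
            }
            // A KEY block points at its VALUE block, whose words hold the text.
            foreach ($relationship['Ids'] as $valueId) {
                if (isset($blocksById[$valueId])) {
                    $value = trim($value . ' ' . childText($blocksById[$valueId], $blocksById));
                }
            }
        }
        $fields[childText($block, $blocksById)] = $value;
    }
    return $fields;
}

// Hand-written sample data (an assumption for illustration only).
$sampleBlocks = [
    ['Id' => 'k1', 'BlockType' => 'KEY_VALUE_SET', 'EntityTypes' => ['KEY'],
     'Relationships' => [
         ['Type' => 'VALUE', 'Ids' => ['v1']],
         ['Type' => 'CHILD', 'Ids' => ['w1', 'w2']],
     ]],
    ['Id' => 'v1', 'BlockType' => 'KEY_VALUE_SET', 'EntityTypes' => ['VALUE'],
     'Relationships' => [['Type' => 'CHILD', 'Ids' => ['w3']]]],
    ['Id' => 'w1', 'BlockType' => 'WORD', 'Text' => 'Invoice'],
    ['Id' => 'w2', 'BlockType' => 'WORD', 'Text' => 'Number:'],
    ['Id' => 'w3', 'BlockType' => 'WORD', 'Text' => '12345'],
];

print_r(formFields($sampleBlocks));
```

With a real response, the same function would be called as formFields($result['Blocks']). Checkbox values (SELECTION_ELEMENT blocks) are not handled in this sketch.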
