This guide demonstrates how to use AWS Textract to extract text from a PDF document using PHP.
Prerequisites:
- An AWS account with Textract service enabled.
- The AWS SDK for PHP configured in your project with proper credentials.
- For help setting up the above, see this guide: https://blog.echomy.com/index.php/2024/03/14/getting-started-with-amazon-rekognition-image-and-video-ai-a-step-by-step-guide/
Steps:
- Include Libraries:
require 'vendor/autoload.php';
This line assumes you have installed the AWS SDK for PHP using Composer and that it is located in the vendor directory.
- Create Textract Client:
use Aws\Textract\TextractClient;
$textract = new TextractClient([
    'version' => 'latest',
    'region' => 'us-east-1', // Assumed region; change this to the region you use
]);
This code creates a client object to interact with the Textract service. The SDK needs to know which AWS region to call, which the 'region' option specifies.
- Specify Document Path:
$documentPath = 'sampleinvoice.pdf'; // Change this to the path of your document
Update this line with the actual path to the PDF document you want to process.
- Read Document File:
$fp_document = fopen($documentPath, 'rb'); // 'b' reads the PDF as binary data
$document = fread($fp_document, filesize($documentPath));
fclose($fp_document);
This code opens the PDF document for reading, reads its content into a variable ($document), and then closes the file.
- Analyze Document:
$result = $textract->analyzeDocument([
    'Document' => [
        'Bytes' => $document,
    ],
    'FeatureTypes' => ['TABLES', 'FORMS'], // Specify the feature types you want to extract
]);
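This code calls the analyzeDocument function of the Textract client. It provides the document bytes and specifies the feature types (TABLES and FORMS) you want to extract. AnalyzeDocument requires at least one feature type; if you only need plain text, the detectDocumentText function returns LINE and WORD blocks without one. Passing the bytes directly like this is a synchronous call, which accepts single-page PDFs only; multipage PDFs go through the asynchronous operations (such as startDocumentAnalysis) with the file stored in S3.
- Extract Text: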
$extractedText = '';
if (!empty($result['Blocks'])) {
    foreach ($result['Blocks'] as $block) {
        if ($block['BlockType'] === 'LINE') {
            $extractedText .= $block['Text'] . ' ';
        }
    }
}
This section processes the response from Textract. It checks if there are any extracted blocks ('Blocks') in the response. If so, it iterates through each block and extracts text only from blocks with the type 'LINE'. Finally, it concatenates the extracted lines, separated by spaces, into a single string ($extractedText).
- Output Extracted Text:
echo $extractedText;
This line simply prints the extracted text to the console.
Sample Output:
The output will depend on the content of your PDF document. Here's an example of a possible output (shown one line per LINE block for readability; the script itself joins the lines with spaces):
Invoice Number: 12345
Customer Name: Acme Corp
Date: 2023-10-26
... (other lines from the invoice)
Total: $100.00
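The script in this guide joins LINE blocks with spaces, so its output is a single long line. To print one line per block, as in the sample above, the lines can be collected into an array and joined with newlines. The following is a minimal sketch: extractLines is a hypothetical helper name, and $sampleBlocks is hand-written data shaped like a Textract response, not real output.

```php
<?php
declare(strict_types=1);

/**
 * Collect the text of every LINE block, in the order the blocks appear.
 *
 * @param array $blocks The 'Blocks' array from a Textract response.
 * @return string[] One entry per LINE block.
 */
function extractLines(array $blocks): array
{
    $lines = [];
    foreach ($blocks as $block) {
        if (($block['BlockType'] ?? '') === 'LINE' && isset($block['Text'])) {
            $lines[] = $block['Text'];
        }
    }
    return $lines;
}

// Hypothetical sample data shaped like a Textract response (not real output).
$sampleBlocks = [
    ['BlockType' => 'PAGE'],
    ['BlockType' => 'LINE', 'Text' => 'Invoice Number: 12345'],
    ['BlockType' => 'WORD', 'Text' => 'Invoice'],
    ['BlockType' => 'LINE', 'Text' => 'Customer Name: Acme Corp'],
];

// Print one extracted line per output line.
echo implode(PHP_EOL, extractLines($sampleBlocks)), PHP_EOL;
```

In the real script you would pass $result['Blocks'] instead of the sample array.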
Complete Code:
<?php
require 'vendor/autoload.php';
use Aws\Textract\TextractClient;
// Create a Textract client
$textract = new TextractClient([
    'version' => 'latest',
    'region' => 'us-east-1', // Assumed region; change this to the region you use
]);
// Specify the path to the document file
$documentPath = 'sampleinvoice.pdf'; // Change this to the path of your document
// Read the document file
$fp_document = fopen($documentPath, 'rb'); // 'b' reads the PDF as binary data
$document = fread($fp_document, filesize($documentPath));
fclose($fp_document);
// Call AnalyzeDocument operation to extract text from the document
$result = $textract->analyzeDocument([
    'Document' => [
        'Bytes' => $document,
    ],
    'FeatureTypes' => ['TABLES', 'FORMS'], // Specify the feature types you want to extract
]);
// Initialize the variable to store the extracted text
$extractedText = '';
// Process the result to extract text from the response
if (!empty($result['Blocks'])) {
    foreach ($result['Blocks'] as $block) {
        if ($block['BlockType'] === 'LINE') {
            $extractedText .= $block['Text'] . ' ';
        }
    }
}
// Output the extracted text
echo $extractedText;
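The script above does not check whether the file could be read before sending it to Textract. One possible guard is sketched below. readDocumentBytes is a hypothetical helper, and the 10 MB default limit is an assumption that should be checked against the current Textract quotas for synchronous calls.

```php
<?php
declare(strict_types=1);

/**
 * Read a document into memory, failing loudly instead of passing bad data on.
 *
 * @param string $path     Path to the document.
 * @param int    $maxBytes Upper size limit (assumed default; verify against the Textract quotas).
 * @throws RuntimeException If the file is missing, unreadable, empty, or too large.
 */
function readDocumentBytes(string $path, int $maxBytes = 10 * 1024 * 1024): string
{
    if (!is_file($path) || !is_readable($path)) {
        throw new RuntimeException("Cannot read document: {$path}");
    }

    $size = filesize($path);
    if ($size === false || $size === 0) {
        throw new RuntimeException("Document is empty: {$path}");
    }
    if ($size > $maxBytes) {
        throw new RuntimeException("Document exceeds {$maxBytes} bytes: {$path}");
    }

    $bytes = file_get_contents($path);
    if ($bytes === false) {
        throw new RuntimeException("Failed to read document: {$path}");
    }
    return $bytes;
}

try {
    $document = readDocumentBytes('sampleinvoice.pdf');
    echo 'Read ', strlen($document), " bytes\n";
} catch (RuntimeException $e) {
    // Report the problem instead of calling Textract with bad input.
    fwrite(STDERR, $e->getMessage() . PHP_EOL);
}
```

The same pattern can wrap the API call itself: the SDK reports service errors by throwing Aws\Exception\AwsException, whose getAwsErrorCode() and getAwsErrorMessage() methods are useful for logging.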
Description of the Code:
This code demonstrates a basic example of using Textract with PHP. It showcases the steps involved in:
- Creating a Textract client
- Specifying the document path
- Reading the document content
- Calling the analyzeDocument function
- Processing the response to extract text from lines
- Printing the extracted text
Additional Notes:
- This example extracts text only from lines ('BlockType' === 'LINE'). You can modify the code to read other block types based on your needs (e.g., WORD for individual words, or TABLE, CELL, and KEY_VALUE_SET for table and form data).
- The FeatureTypes option allows you to extract information from tables and forms within the document. Explore the Textract documentation for more details on these features.
- Error handling is not included in this basic example. Consider implementing proper error handling for real-world applications.
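As an illustration of reading form data, the sketch below pairs form keys with their values. It relies on the response layout as I understand it: with FORMS enabled, keys and values arrive as KEY_VALUE_SET blocks, a key block links to its value block through a VALUE relationship, and both link to their WORD blocks through CHILD relationships. The function names are hypothetical and $sampleBlocks is hand-written data, not real Textract output; selection elements such as checkboxes are ignored.

```php
<?php
declare(strict_types=1);

/** Return the IDs listed under one relationship type of a block. */
function relatedIds(array $block, string $type): array
{
    $ids = [];
    foreach ($block['Relationships'] ?? [] as $relationship) {
        if (($relationship['Type'] ?? '') === $type) {
            $ids = array_merge($ids, $relationship['Ids'] ?? []);
        }
    }
    return $ids;
}

/** Join the text of a block's WORD children with spaces. */
function childText(array $block, array $blocksById): string
{
    $words = [];
    foreach (relatedIds($block, 'CHILD') as $id) {
        if (($blocksById[$id]['BlockType'] ?? '') === 'WORD') {
            $words[] = $blocksById[$id]['Text'] ?? '';
        }
    }
    return implode(' ', $words);
}

/** Map each form key's text to the text of its value. */
function extractFormFields(array $blocks): array
{
    $blocksById = array_column($blocks, null, 'Id'); // index blocks by their Id
    $fields = [];
    foreach ($blocks as $block) {
        $isKey = ($block['BlockType'] ?? '') === 'KEY_VALUE_SET'
            && in_array('KEY', $block['EntityTypes'] ?? [], true);
        if (!$isKey) {
            continue;
        }
        $values = [];
        foreach (relatedIds($block, 'VALUE') as $valueId) {
            if (isset($blocksById[$valueId])) {
                $values[] = childText($blocksById[$valueId], $blocksById);
            }
        }
        $fields[childText($block, $blocksById)] = implode(' ', $values);
    }
    return $fields;
}

// Hypothetical sample data shaped like a Textract FORMS response (not real output).
$sampleBlocks = [
    ['Id' => 'k1', 'BlockType' => 'KEY_VALUE_SET', 'EntityTypes' => ['KEY'],
     'Relationships' => [
         ['Type' => 'VALUE', 'Ids' => ['v1']],
         ['Type' => 'CHILD', 'Ids' => ['w1', 'w2']],
     ]],
    ['Id' => 'v1', 'BlockType' => 'KEY_VALUE_SET', 'EntityTypes' => ['VALUE'],
     'Relationships' => [['Type' => 'CHILD', 'Ids' => ['w3']]]],
    ['Id' => 'w1', 'BlockType' => 'WORD', 'Text' => 'Invoice'],
    ['Id' => 'w2', 'BlockType' => 'WORD', 'Text' => 'Number:'],
    ['Id' => 'w3', 'BlockType' => 'WORD', 'Text' => '12345'],
];

print_r(extractFormFields($sampleBlocks));
```

In the real script you would call extractFormFields($result['Blocks']) after analyzeDocument returns.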