Day 2 – Extracting Text from Uploaded PDFs in Laravel


Today we’ll extract the text from uploaded PDF documents and store it next to each document record. Later steps depend on this: GPT can only validate a document once its content is available as plain text.


📦 Step 1: Install PDF Parser

We’ll use smalot/pdfparser, a standalone PHP library for parsing PDF files:

composer require smalot/pdfparser

🧠 Step 2: Create a Service to Handle PDF Text Extraction

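Laravel has no built-in make:service command. On Laravel 11 or later you can generate the class with:

php artisan make:class Services/PdfTextExtractor

On older versions, create the file by hand.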

In app/Services/PdfTextExtractor.php:

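<?php

namespace App\Services;

use Illuminate\Support\Facades\Storage;
use Smalot\PdfParser\Parser;

class PdfTextExtractor
{
    /**
     * Extract the plain text of a PDF stored under documents/ on the public disk.
     *
     * @throws \Exception if the parser cannot read the file
     */
    public function extract(string $filename): string
    {
        // Resolve the absolute path; parseFile() reads from the local filesystem.
        $pdfPath = Storage::disk('public')->path("documents/{$filename}");

        $pdf = (new Parser())->parseFile($pdfPath);

        // getText() returns the text of all pages as one string.
        return $pdf->getText();
    }
}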
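Raw text extracted from a PDF often carries layout noise such as runs of spaces, stray tabs, and stacks of blank lines. Since the goal is to prepare the text for AI analysis, it can help to tidy it before storing it. The sketch below is one possible approach in plain PHP; normalizeExtractedText() is a hypothetical helper, not part of smalot/pdfparser or Laravel, and the exact rules are assumptions you should adapt to your documents.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper (not part of smalot/pdfparser): tidy raw extracted text.
 * Each preg_replace() falls back to its input if the regex engine returns
 * null, which can happen when the text is not valid UTF-8.
 */
function normalizeExtractedText(string $raw): string
{
    // Unify Windows and old Mac line endings to "\n".
    $text = str_replace(["\r\n", "\r"], "\n", $raw);

    // Collapse runs of spaces and tabs into a single space.
    $text = preg_replace('/[ \t]+/u', ' ', $text) ?? $text;

    // Drop the single space left directly before or after a line break.
    $text = preg_replace('/ ?\n ?/u', "\n", $text) ?? $text;

    // Keep at most one blank line between paragraphs.
    $text = preg_replace('/\n{3,}/u', "\n\n", $text) ?? $text;

    return trim($text);
}

$clean = normalizeExtractedText("  Invoice \t No.  42\r\n\r\n\r\n\r\nTotal:   100  ");
```

If you adopt something like this, the service could return normalizeExtractedText($pdf->getText()) instead of the raw string.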

⚙️ Step 3: Add extracted_text to Documents Table

php artisan make:migration add_extracted_text_to_documents_table

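In the migration, add the column in up() and drop it again in down() so the change can be rolled back:

public function up()
{
    Schema::table('documents', function (Blueprint $table) {
        // Nullable, so existing rows stay valid.
        $table->longText('extracted_text')->nullable();
    });
}

public function down()
{
    Schema::table('documents', fn (Blueprint $table) => $table->dropColumn('extracted_text'));
}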

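Then run:

php artisan migrate

If the Document model restricts mass assignment with $fillable, add 'extracted_text' to that array. Otherwise Document::create() in the next step will silently drop the value.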

🧪 Step 4: Modify Store Logic to Extract Text

In DocumentController.php:

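use App\Services\PdfTextExtractor;
use Illuminate\Support\Str;

public function store(Request $request, PdfTextExtractor $extractor)
{
    $validated = $request->validate([
        'title' => 'required|string',
        'type' => 'required|in:contract,invoice',
        'document' => 'required|file|mimes:pdf|max:20480', // max is in kilobytes (20 MB)
    ]);

    $file = $request->file('document');

    // The client-supplied name is untrusted input, so keep only a slug of its
    // base name and set the .pdf extension ourselves.
    $basename = Str::slug(pathinfo($file->getClientOriginalName(), PATHINFO_FILENAME)) ?: 'document';
    $filename = time() . '-' . $basename . '.pdf';
    $file->storeAs('documents', $filename, 'public');

    // Parse the file that was just written to the public disk.
    $text = $extractor->extract($filename);

    Document::create([
        'title' => $validated['title'],
        'type' => $validated['type'],
        'filename' => $filename,
        'user_id' => auth()->id(),
        'extracted_text' => $text,
    ]);

    return redirect()->back()->with('success', 'Document uploaded and text extracted.');
}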
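Extraction will not always succeed. The parser can throw an exception on files it cannot read, and a scanned PDF that contains only images yields no text at all because no OCR is involved. Since the extracted_text column from Step 3 is nullable, one option is to store null in those cases instead of failing the whole upload. The sketch below shows that idea in plain PHP; extractOrNull() is a hypothetical helper, and logging through error_log() is an assumption made to keep the example free of framework code.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper: run an extraction callback and return null when it
 * throws or produces only whitespace.
 *
 * @param callable(string): string $extract Receives the filename, returns text.
 */
function extractOrNull(callable $extract, string $filename): ?string
{
    try {
        $text = trim($extract($filename));
    } catch (\Throwable $e) {
        // Record the failure, but let the caller carry on without text.
        error_log("PDF text extraction failed for {$filename}: {$e->getMessage()}");

        return null;
    }

    // Treat an empty result (for example a scanned, image-only PDF) as "no text".
    return $text === '' ? null : $text;
}

// Stand-in callbacks that simulate a readable and an unreadable PDF.
$ok = extractOrNull(
    fn (string $name): string => "  Payment due in 30 days \n",
    'contract.pdf'
);

$failed = extractOrNull(
    function (string $name): string {
        throw new \RuntimeException('Secured PDF');
    },
    'locked.pdf'
);
```

In the controller this could be called as extractOrNull(fn (string $name) => $extractor->extract($name), $filename), with the success message adjusted when the result is null.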

✅ Summary

✅ Today you:

  • Installed and used smalot/pdfparser to extract text from PDFs
  • Created a reusable service class
  • Stored extracted content alongside the document

✅ Up next (Day 3): We’ll send the extracted text to GPT for structure detection – like identifying contract parties, payment terms, dates, and clauses.
