PDF to text, Python code, 17th Sep 24

- September 17, 2024

Two Python code cells.

1st Python code

# Install google-cloud-vision and pdf2image
!pip install google-cloud-vision pdf2image

# Install poppler-utils
!apt-get update
!apt-get install -y poppler-utils
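
Optional check after the install step: pdf2image calls the poppler command-line tools (pdftoppm and pdfinfo), so it can help to confirm they are on PATH before running the OCR cell. This is only a rough sketch; the helper name find_missing_tools is my own and not part of any library.

```python
import shutil

# Hypothetical helper (not from pdf2image): list the tools that are NOT on PATH.
def find_missing_tools(tools=("pdftoppm", "pdfinfo")):
    """Return the names in `tools` that shutil.which cannot find."""
    return [tool for tool in tools if shutil.which(tool) is None]

missing = find_missing_tools()
if missing:
    print("Missing poppler tools:", ", ".join(missing))
else:
    print("poppler tools found.")
```

If anything is reported missing, re-running the apt-get lines above is the first thing to try.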

2nd Python code

!pip show google-cloud-vision pdf2image
!apt-cache policy poppler-utils
# OCR: convert a PDF document to text.
import io
import os
from google.cloud import vision
from google.cloud.vision_v1 import types
from pdf2image import convert_from_path
from google.oauth2 import service_account

# Paths to the files
credentials_path = '/content/credentials_doc.json'
pdf_file_path = '/content/Aditya.pdf'
output_text_file_path = '/content/Aditya_text.txt'

# Verify that the files exist
if os.path.exists(credentials_path):
    print("Credentials file found.")
else:
    print("Credentials file not found. Please check the path.")

if os.path.exists(pdf_file_path):
    print("PDF file found.")
else:
    print("PDF file not found. Please check the path.")

# Authenticate using service account
credentials = service_account.Credentials.from_service_account_file(credentials_path)
client = vision.ImageAnnotatorClient(credentials=credentials)

# Convert PDF to images
images = convert_from_path(pdf_file_path, dpi=300)

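# Function to perform OCR on a single page image (a PIL image from pdf2image)
def perform_ocr(image):
    # Serialize the page to PNG bytes in memory
    buffer = io.BytesIO()
    image.save(buffer, format='PNG')
    vision_image = types.Image(content=buffer.getvalue())
    response = client.document_text_detection(image=vision_image)
    # The API reports per-request failures in response.error
    if response.error.message:
        raise RuntimeError(f"Vision API error: {response.error.message}")
    return response.full_text_annotation.text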

# Process each image and save the text
with open(output_text_file_path, 'w', encoding='utf-8') as output_file:
    for i, image in enumerate(images):
        text = perform_ocr(image)
        output_file.write(f"Page {i + 1}\n")
        output_file.write(text)
        output_file.write("\n\n")

print(f'OCR text written to file "{output_text_file_path}"')
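
The script above writes each page as a "Page N" line, then the OCR text, then a blank line. If the text is needed per page later, that layout can be read back as sketched here. The function name split_pages is my own, and the sketch assumes the OCR text itself never contains a line that is exactly "Page N".

```python
import re

# Hypothetical helper: parse text written as "Page N\n<text>\n\n" blocks.
def split_pages(text):
    """Return {page_number: page_text} for text in the output file's layout."""
    pages = {}
    # With a capture group, re.split returns [prefix, num1, body1, num2, body2, ...]
    parts = re.split(r"^Page (\d+)\n", text, flags=re.MULTILINE)
    for number, body in zip(parts[1::2], parts[2::2]):
        pages[int(number)] = body.strip()
    return pages

# Small made-up sample in the same layout as the output file
sample = "Page 1\nHello world\n\nPage 2\nSecond page\nmore text\n\n"
print(split_pages(sample))
```

For the real file, read it with open(output_text_file_path, encoding='utf-8') and pass the contents to split_pages.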
