Validation Python code: corrected text checked against the PDF file again.

- August 17, 2024

1. After correcting the cloud (OCR) text, the corrected text was saved under a new name.

2. Validated it again with Google Cloud.

a) The lines I corrected show up in the report again. That is okay.

b) Spaces are ignored. // The code no longer reports space-only differences.

This is the 2nd validation.

----------

Example: the PDF file, the corrected text file and the authentication (credentials) file need to be uploaded first.

Line 385 mismatch:
  OCR: 'www.thraithashakam.org'
  Expected: 'శాస్త్రము అనీ అనడము జరుగుచున్నది. ఇక్కడ మీ ప్రశ్న ఆధ్యాత్మిక '
Line 439 mismatch:
  OCR: 'www.thraithashakam.org'
  Expected: 'పనినీ చేయనివాడుగా యున్నాడు అనుటయే సత్యము. అయినా '
Line 489 mismatch:
  OCR: 'www.thraithashakam.org'
  Expected: 'ఆముగ్గురిలో ఇంకా ఇద్దరి గురించి చూస్తే ఒకరు జీవాత్మకాగా, ఇంకొకరు '
Line 756 mismatch:
  OCR: 'www.thraithashakam.org'
  Expected: 'శరీరములో ఆత్మకూడా యున్నదను విషయము తెలియదు. తెలియక '
Line 865 mismatch:
  OCR: 'www.thraithashakam.org'
  Expected: 'జవాబు :బయట ప్రపంచములోని విద్యాలయములలో విద్యార్థులను '
Line 957 mismatch:
  OCR: 'www.thraithashakam.org'
  Expected: 'మతమును బోధించుచున్నాడని ఆరోపణలు చేశారు. అయినా అలా '
Line 984 mismatch:
  OCR: 'www.thraithashakam.org'
  Expected: 'ఉపయోగించుకొనుచుంటిని. ఇంటి యజమానికి ప్రతి నెల కిరాయి '
Line 993 mismatch:
  OCR: 'www.thraithashakam.org'
  Expected: 'ఆలోచన రావలసింది. ఎవరయినా అలాగే అనుకొంటారు. అయినా '
Line 1344 mismatch:
  OCR: 'www.thraithashakam.org'
  Expected: 'పోయినది. ఒక పనికి కావలసిన సంకల్పమునుండి పని అంతయూ '
Line 1348 mismatch:
  OCR: 'www.thraithashakam.org'
  Expected: 'అణిగియుండి బయటికి తెలియకుండా అన్ని పనులకు కారణము '
Line 1372 mismatch:
  OCR: 'www.thraithashakam.org'
  Expected: 'వాడు. దైవ గ్రంథములు కూడా చదివెడివాడు. అయితే ప్రపంచ '
Line 1377 mismatch:
  OCR: 'www.thraithashakam.org'
  Expected: 'అర్థమయినదే సరియైన అర్థమని అనుకొనెడివాడు. తనకంటే పెద్ద '
Line 1855 mismatch:
  OCR: 'www.thraithashakam.org'
  Expected: 'అసత్యమును వేయిమంది చెప్పినా, అది సత్యము కాదు, '
Line 1856 mismatch:
  OCR: 'www.thraithashakam.org'
  Expected: 'సత్యమును వేయిమంది కాదనినా, అది అసత్యము కాదు. '
Validation complete.

------------

Code version 2, 17th Aug 2024:

import gc
import io
import os
import re
from google.cloud import vision
from google.cloud.vision_v1 import types
from pdf2image import convert_from_path
from google.oauth2 import service_account

gc.collect()
!pip show google-cloud-vision pdf2image
!apt-cache policy poppler-utils

# Paths to the files
credentials_path = '/content/credentials_doc.json'
pdf_file_path = '/content/Vishwa-Vidyalayam.pdf'
input_text_file_path = '/content/CLOUD_విశ్వ విద్యాలయము_Corrected.txt'

# Verify the file existence
if os.path.exists(credentials_path):
    print("Credentials file found.")
else:
    print("Credentials file not found. Please check the path.")

if os.path.exists(pdf_file_path):
    print("PDF file found.")
else:
    print("PDF file not found. Please check the path.")

if os.path.exists(input_text_file_path):
    print("Input text file found.")
else:
    print("Input text file not found. Please check the path.")

# Authenticate using service account
credentials = service_account.Credentials.from_service_account_file(credentials_path)
client = vision.ImageAnnotatorClient(credentials=credentials)

# Convert PDF to images
images = convert_from_path(pdf_file_path, dpi=300)

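# Function to perform OCR on an image
def perform_ocr(image):
    # Serialize the page image to PNG bytes for the Vision API
    buffer = io.BytesIO()
    image.save(buffer, format='PNG')
    vision_image = types.Image(content=buffer.getvalue())
    response = client.document_text_detection(image=vision_image)
    # The API reports per-request failures in response.error
    if response.error.message:
        raise RuntimeError(f"Vision API error: {response.error.message}")
    return response.full_text_annotation.text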

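# Preprocess the text by removing punctuation and whitespace, then lowercasing.
# Note: Python's \w does not match combining marks, so Telugu vowel signs and
# the virama are removed here as well; lines are compared on what remains.
def preprocess_text(text):
    text = re.sub(r'[^\w\s]', '', text)  # Remove punctuation (and combining marks)
    text = re.sub(r'\s+', '', text)      # Remove all whitespace
    return text.lower()                  # Lowercase (no effect on Telugu letters)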

# Process each image and store the OCR text
ocr_text = []
for i, image in enumerate(images):
    text = perform_ocr(image)
    ocr_text.append(text)

# Join all OCR text into a single string
ocr_full_text = "\n".join(ocr_text)

# Read input text file
with open(input_text_file_path, 'r', encoding='utf-8') as input_file:
    input_lines = input_file.readlines()

# Split OCR text into lines
ocr_lines = ocr_full_text.split('\n')

# Validate each line
for line_number, input_line in enumerate(input_lines, start=1):
    input_line_processed = preprocess_text(input_line.strip())

    found = False
    for ocr_line in ocr_lines:
        ocr_line_processed = preprocess_text(ocr_line.strip())
        if input_line_processed == ocr_line_processed:
            # print(f"Line {line_number} validated and found.")
            # print(f"Line {line_number} validated and found. {input_line}")
            found = True
            break

    if not found:
        print(f"Line {line_number} mismatch:")
        # ocr_line is the last OCR line the loop above compared, not the closest match
        print(f"  OCR: '{ocr_line}'")
        print(f"  Expected: '{input_line}'")

print('Validation complete.')
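
One caveat with `preprocess_text` in version 2: in Python's `re` module, `\w` covers letters, digits and the underscore but not combining marks, so `[^\w\s]` also strips Telugu vowel signs and the virama. Two lines that differ only in a vowel sign would then compare as equal. The sketch below is one possible alternative, not the code used for the run above. It drops only whitespace and characters whose Unicode category is punctuation or symbol, and it builds a set of normalized OCR lines so each corrected line is checked with one lookup. The names `normalize_line` and `find_unmatched` are made up for this example, and skipping blank lines is an assumption.

```python
import unicodedata


def normalize_line(text):
    """Drop whitespace, punctuation and symbols; keep letters, digits and combining marks."""
    text = unicodedata.normalize("NFC", text)
    kept = []
    for ch in text:
        # Unicode categories starting with P are punctuation, S are symbols
        if ch.isspace() or unicodedata.category(ch)[0] in "PS":
            continue
        kept.append(ch)
    return "".join(kept).casefold()


def find_unmatched(expected_lines, ocr_lines):
    """Return (line_number, text) for each expected line with no matching OCR line."""
    ocr_keys = {normalize_line(line) for line in ocr_lines}
    unmatched = []
    for line_number, line in enumerate(expected_lines, start=1):
        key = normalize_line(line)
        # Assumption: blank lines are skipped rather than reported
        if key and key not in ocr_keys:
            unmatched.append((line_number, line.strip()))
    return unmatched
```

With the variables from the script, `find_unmatched(input_lines, ocr_lines)` would return the mismatching lines as a list. Because vowel signs now take part in the comparison, it may report more lines than version 2 does.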

Version 1 (does not ignore additional spaces):

import gc
import io
import os
from google.cloud import vision
from google.cloud.vision_v1 import types
from pdf2image import convert_from_path
from google.oauth2 import service_account

gc.collect()
!pip show google-cloud-vision pdf2image
!apt-cache policy poppler-utils

# Paths to the files
credentials_path = '/content/credentials_doc.json'
pdf_file_path = '/content/Vishwa-Vidyalayam.pdf'
input_text_file_path = '/content/CLOUD_విశ్వ విద్యాలయము_Corrected.txt'

# Verify the file existence
if os.path.exists(credentials_path):
    print("Credentials file found.")
else:
    print("Credentials file not found. Please check the path.")

if os.path.exists(pdf_file_path):
    print("PDF file found.")
else:
    print("PDF file not found. Please check the path.")

if os.path.exists(input_text_file_path):
    print("Input text file found.")
else:
    print("Input text file not found. Please check the path.")

# Authenticate using service account
credentials = service_account.Credentials.from_service_account_file(credentials_path)
client = vision.ImageAnnotatorClient(credentials=credentials)

# Convert PDF to images
images = convert_from_path(pdf_file_path, dpi=300)

# Function to perform OCR on an image
def perform_ocr(image):
    content = io.BytesIO()
    image.save(content, format='PNG')
    content = content.getvalue()
    image = types.Image(content=content)
    response = client.document_text_detection(image=image)
    return response.full_text_annotation.text

# Process each image and store the OCR text
ocr_text = []
for i, image in enumerate(images):
    text = perform_ocr(image)
    ocr_text.append(text)

# Join all OCR text into a single string
ocr_full_text = "\n".join(ocr_text)

# Read input text file
with open(input_text_file_path, 'r', encoding='utf-8') as input_file:
    input_lines = input_file.readlines()

# Split OCR text into lines
ocr_lines = ocr_full_text.split('\n')

# Validate each line
for line_number, input_line in enumerate(input_lines, start=1):
    input_line = input_line.strip()

    found = False
    for ocr_line in ocr_lines:
        if input_line == ocr_line.strip():
            # print(f"Line {line_number} validated and found.")
            # print(f"Line {line_number} validated and found. {input_line}")
            found = True
            break

    if not found:
        print(f"Line {line_number} mismatch:")
        # ocr_line is the last OCR line the loop above compared, not the closest match
        print(f"  OCR: '{ocr_line}'")
        print(f"  Expected: '{input_line}'")

print('Validation complete.')
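
In both versions, the mismatch branch prints `ocr_line` after the inner loop has run to the end, so it holds whatever OCR line came last. That would explain why every entry in the report shows `'www.thraithashakam.org'`. A more useful report could show the OCR line most similar to the expected one. The sketch below uses `difflib.get_close_matches` from the standard library for that. The helper name `closest_ocr_line` and the `cutoff` default of 0.6 are choices made for this example only.

```python
import difflib


def closest_ocr_line(expected, ocr_lines, cutoff=0.6):
    """Return the OCR line most similar to `expected`, or None if nothing is close."""
    # Ignore blank OCR lines so they cannot be offered as candidates
    candidates = [line.strip() for line in ocr_lines if line.strip()]
    matches = difflib.get_close_matches(expected.strip(), candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None
```

In the mismatch branch, `closest_ocr_line(input_line, ocr_lines)` could replace `ocr_line` in the `OCR:` print. A result of `None` would suggest the line is missing from the OCR text altogether rather than slightly different.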
