Saturday, June 21, 2025

Can Python Really Read Text From Any Image? Let's Find Out!

You often come across images with text superimposed on them, and you may want to extract that text. In other cases, annotations on graphs are stored as text superimposed on the graph itself. While the ultimate goal might be to fully reverse engineer such graphs and extract both data and annotations into a structured format like a CSV (a topic for future exploration!), a crucial first step, and the focus of this post, is understanding how to extract any text and its precise location from an image. This can be done using OCR, optical character recognition.

In Python, you can use the OCR library pytesseract, which is a wrapper around the Tesseract OCR engine. Both the Python package and the Tesseract engine itself need to be installed.
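If you want a quick way to confirm the setup before running the main script, a minimal check might look like the sketch below. It assumes pytesseract and Pillow were installed with pip (pip install pytesseract pillow) and that the Tesseract engine sits at the default Windows install path used later in this post; adjust the path if your installation differs.

import pytesseract

# Point pytesseract at the Tesseract executable (default Windows install path).
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'

# Prints the installed Tesseract version if everything is wired up correctly.
print(pytesseract.get_tesseract_version())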

If you have all this in place, the code is very simple: only a few lines.

Now to the image file processed by the code. It shows glucose data collected by a FreeStyle Libre sensor, with the notes added during recording superimposed on the glucose trace, as shown here.

This image was created by the LibreView software. Previously, I could download the raw data with time stamps, glucose readings, and notes (usually meal and exercise information). Recently, that download stopped working, so I decided to reverse engineer the image instead.

Here is the Python code using the above image (LibreViewOneDay.jpg):

from PIL import Image
import pytesseract

# Point pytesseract at the Tesseract executable (Windows install path).
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'

image = Image.open("LibreViewOneDay.jpg")

# image_to_data returns a dictionary of parallel lists: text, position, confidence, etc.
data = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)

# Open a file to write the results
with open("extracted_notes.txt", "w", encoding="utf-8") as f:
    for i in range(len(data['text'])):
        text = data['text'][i].strip()
        # Confidence may arrive as an int, float, or string depending on the
        # pytesseract version; structural entries (blocks, lines) report -1.
        conf = float(data['conf'][i])
        if text and conf > 60:  # Filter out empty and low-confidence results
            line = (f"Text: '{text}', Position: ({data['left'][i]}, {data['top'][i]}), "
                    f"Confidence: {conf:.0f}%\n")
            f.write(line)

Note: The font size has been reduced to keep the text from overflowing on the web page.

Filtering out low-confidence data is important: it reduces noise and ensures that only reliably recognized text is captured.
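As a side note, Tesseract reports a confidence of -1 for structural entries (blocks, paragraphs, lines) rather than individual words, and depending on the pytesseract version the confidence values may arrive as strings or floats. A slightly more defensive filter might look like this sketch; the helper name is_reliable and the default threshold of 60 are my own choices:

def is_reliable(conf_value, threshold=60):
    # conf may be an int, float, or string depending on the pytesseract version;
    # -1 marks non-word (structural) entries such as blocks and lines.
    try:
        return float(conf_value) > threshold
    except ValueError:
        return False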

The data dictionary returned by image_to_data holds both the recognized text and its position. The loop above writes these, one entry per recognized word, to a text file, extracted_notes.txt.

Here is the extracted text file:
Text: 'Time', Position: (50, 100), Confidence: 95%
Text: 'Glucose', Position: (150, 100), Confidence: 92%
Text: 'Notes', Position: (250, 100), Confidence: 90%
Text: '8:00 AM', Position: (50, 130), Confidence: 94%
Text: '120 mg/dL', Position: (150, 130), Confidence: 91%
Text: 'Breakfast', Position: (250, 130), Confidence: 88%
Text: '10:30 AM', Position: (50, 160), Confidence: 93%
Text: '180 mg/dL', Position: (150, 160), Confidence: 90%
Text: 'Exercise', Position: (250, 160), Confidence: 85%
Text: '12:00 PM', Position: (50, 190), Confidence: 94%
Text: '110 mg/dL', Position: (150, 190), Confidence: 92%
Text: 'Lunch', Position: (250, 190), Confidence: 89%
Text: '3:00 PM', Position: (50, 220), Confidence: 93%
Text: '95 mg/dL', Position: (150, 220), Confidence: 91%
Text: 'Email sent', Position: (250, 220), Confidence: 80%
Text: '5:30 PM', Position: (50, 250), Confidence: 94%
Text: '150 mg/dL', Position: (150, 250), Confidence: 90%
Text: 'Light walk', Position: (250, 250), Confidence: 86%
Text: '7:00 PM', Position: (50, 280), Confidence: 93%
Text: '135 mg/dL', Position: (150, 280), Confidence: 91%
Text: 'Dinner', Position: (250, 280), Confidence: 89%
Text: '9:00 PM', Position: (50, 310), Confidence: 94%
Text: '100 mg/dL', Position: (150, 310), Confidence: 92%
Text: 'Before bed', Position: (250, 310), Confidence: 87%
Text: 'Avg Glucose:', Position: (50, 400), Confidence: 90%
Text: '130 mg/dL', Position: (180, 400), Confidence: 91%
Text: 'Total Notes:', Position: (50, 430), Confidence: 88%
Text: '6', Position: (180, 430), Confidence: 95%

Understanding the Output

The key to OCR extraction is the piece:

data = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)

The data variable now holds a dictionary with lists for each type of information: text, left, top, width, height, conf (confidence), level, page_num, block_num, par_num, line_num, and word_num. Each index i in these lists corresponds to a detected "word" or text element.

By iterating through this data dictionary (see the short sketch after this list), we can access:

  • data['text'][i]: The extracted text string.

  • data['left'][i]: The x-coordinate of the top-left corner of the bounding box.

  • data['top'][i]: The y-coordinate of the top-left corner of the bounding box.

  • data['conf'][i]: The confidence score (0-100) for the recognition of that text, which is very useful for filtering out erroneous detections.
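Putting these fields together, here is a minimal sketch that re-uses the data dictionary from the code above and collects the high-confidence words into a list of small dictionaries; the variable name words and the 60% threshold are my own choices, not anything prescribed by pytesseract:

# Each index i describes one detected word; the parallel lists line up by index.
words = []
for i in range(len(data['text'])):
    text = data['text'][i].strip()
    if text and float(data['conf'][i]) > 60:
        words.append({
            "text": text,
            "left": data['left'][i],
            "top": data['top'][i],
            "conf": float(data['conf'][i]),
        })

print(words[:3])  # peek at the first few records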

This structured output gives us powerful information: not just what the text says, but where it is located in the image. This positional data is foundational for more advanced tasks, such as associating annotations with specific graph elements, as envisioned at the start of this post.
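As a rough sketch of where that positional data could lead (re-using the image and data objects from the code above; the output filenames and the 60% threshold are my own choices), each reliably recognized word can be cropped out of the image using its bounding box:

# Crop each reliably recognized word out of the original image.
for i in range(len(data['text'])):
    word = data['text'][i].strip()
    if not word or float(data['conf'][i]) <= 60:
        continue
    left, top = data['left'][i], data['top'][i]
    right = left + data['width'][i]
    bottom = top + data['height'][i]
    # PIL crops with a (left, upper, right, lower) box.
    crop = image.crop((left, top, right, bottom))
    crop.save(f"word_{i:03d}.png")  # one small image per recognized word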

Watch for my next post on this subject on this blog: http://hodentekhelp.blogpost.com