How to convert a scanned PDF to text

A scanned PDF is a folder of photographs pretending to be a document. You can see the words, but the computer cannot — Find returns nothing, copy-paste does not work, and screen readers read nothing. This guide explains why, and shows how to add a real text layer with OCR so the scan becomes searchable without leaving your device.

Published 2026-06-18 · Last reviewed 2026-06-18 · 1660 words

Why a scanned PDF has no text

A normal PDF stores text as characters: the letter "A" is a glyph reference plus a position, and a reader can search, copy, and read it. A scanned PDF stores pages as images: the letter "A" is just a cluster of pixels that looks like an A to a human eye but is meaningless to the reader's Find function.

That is why a scanned contract, a scanned receipt, or a scanned book returns zero results when you search it. The text is not in the file as text. Optical character recognition (OCR) is the step that looks at the image pixels, recognizes the letters, and writes a text layer that sits on top of the image.
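One rough way to see the difference for yourself is to look for text-drawing operators inside the file. The sketch below is a heuristic, not a PDF parser: the function name `has_text_operators` is made up for this example, and it assumes page content sits in plain or Flate-compressed streams. Binary image data can occasionally trigger a false positive, so treat a True result as a hint rather than proof.

```python
import re
import zlib

# Matches the body of each PDF stream object (between "stream" and "endstream").
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)
# A text object opened with BT that shows a string with Tj or TJ.
TEXT_OP_RE = re.compile(rb"\bBT\b.*?\b(?:Tj|TJ)\b", re.DOTALL)


def has_text_operators(pdf_bytes: bytes) -> bool:
    """Return True if any stream appears to draw text (heuristic only)."""
    for match in STREAM_RE.finditer(pdf_bytes):
        raw = match.group(1)
        try:
            # Many content streams are Flate-compressed; trailing bytes are ignored.
            data = zlib.decompressobj().decompress(raw)
        except zlib.error:
            data = raw  # not Flate data (for example a JPEG); inspect it as-is
        if TEXT_OP_RE.search(data):
            return True
    return False
```

A page that only paints an image typically contains drawing commands such as `/Im0 Do` and no `Tj` or `TJ` at all, which is why Find has nothing to match.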

What OCR actually adds to the file

A good OCR pass does not replace the image. It adds an invisible text layer aligned to the image, so the page still looks exactly as it did but is now searchable and selectable. You can Find a word, copy a paragraph, and a screen reader can read the page. The visible scan is unchanged.
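In PDF terms, "invisible" usually means text drawn with rendering mode 3 (the `3 Tr` operator), which neither fills nor strokes the glyphs. The sketch below shows how a recognized word box could be turned into such a fragment. It is an illustration under assumptions: the function name, the `/F1` font resource, and the input tuple layout are hypothetical, the text is assumed to be plain ASCII, and a real tool would also scale glyph widths to fit each box.

```python
def invisible_text_ops(words, page_height_pt, dpi):
    """Build a content-stream fragment that places each word invisibly.

    words: iterable of (text, left_px, baseline_px, height_px), where pixel
    coordinates are measured from the top-left corner of the scanned image.
    """
    scale = 72.0 / dpi  # PDF user space is 72 points per inch
    ops = ["BT", "3 Tr"]  # rendering mode 3: glyphs are neither filled nor stroked
    for text, left_px, baseline_px, height_px in words:
        x = left_px * scale
        # PDF origin is bottom-left, image origin is top-left: flip the y axis.
        y = page_height_pt - baseline_px * scale
        size = height_px * scale
        # Escape the characters that are special inside a PDF literal string.
        escaped = text.replace("\\", r"\\").replace("(", r"\(").replace(")", r"\)")
        ops.append(f"/F1 {size:.2f} Tf")
        ops.append(f"1 0 0 1 {x:.2f} {y:.2f} Tm")
        ops.append(f"({escaped}) Tj")
    ops.append("ET")
    return "\n".join(ops)


# A word found 300 px from the left and 600 px from the top of a 300 dpi A4 scan.
print(invisible_text_ops([("Invoice", 300, 600, 50)], page_height_pt=842, dpi=300))
```

Because the words are positioned from the same coordinates the engine found them at, selecting text in the reader highlights the matching area of the scan.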

This is different from "converting to a Word document," which tries to reconstruct layout and editable text and usually makes a mess of anything beyond simple paragraphs. For most scanned documents, a searchable PDF with an OCR text layer is the right target, not a DOCX.

Step-by-step: run OCR in your browser

A local browser OCR tool reads the scan in the current tab, recognizes the text, and writes a searchable PDF. When the engine runs locally, the file is not uploaded for the OCR step.

  • Open the OCR tool in your browser and drop the scanned PDF.
  • Pick the language of the document. OCR accuracy drops sharply when the language is wrong.
  • Run OCR. The browser processes the pages and prepares a searchable PDF.
  • Download the result and open it in your reader.
  • Use Find on a word you can see on the first page. If Find locates it, the text layer is working.
  • Copy a short sentence and paste it into a text editor to check for obvious recognition errors.

What good OCR accuracy looks like

On a clean 200–300 dpi scan of printed text in a supported language, modern OCR recognizes well over 95 percent of characters. The remaining errors cluster in predictable places: small fonts, low contrast, curled or skewed pages, handwriting, stamps, and tables. For most documents, 95+ percent is enough to search and skim; for legal or financial use, proofread the parts that matter.
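To put a number on "percent of characters" for your own documents, you can compare a passage you have proofread by hand with what OCR produced. The sketch below uses one common definition, one minus the edit distance divided by the reference length; the function names are illustrative and other definitions of accuracy exist.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or keep when equal
            ))
        prev = cur
    return prev[-1]


def char_accuracy(reference: str, ocr_text: str) -> float:
    """Share of reference characters recovered, clamped to the range 0..1."""
    if not reference:
        return 1.0 if not ocr_text else 0.0
    return max(0.0, 1.0 - edit_distance(reference, ocr_text) / len(reference))


reference = "Invoice total: 1,250.00"
ocr_text = "Invoice tota1: l,250.00"
print(f"{char_accuracy(reference, ocr_text):.1%}")
```

Two wrong characters in a 23-character line already puts this sample below 95 percent, which is why short strings with numbers deserve a closer look than long paragraphs of prose.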

Accuracy falls off a cliff below 150 dpi, on handwritten content, and on documents in a language the engine does not support. If OCR returns garbage, the first things to check are the language setting and the scan resolution. Re-scanning at 300 dpi in grayscale fixes a large fraction of bad-OCR cases.
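If you are not sure what resolution a scan has, you can estimate it from the image's pixel dimensions and the physical page size. A minimal sketch, with illustrative function names; the "borderline" band between 150 and 200 dpi is an assumption of this example, since the guidance above only names the two ends.

```python
def effective_dpi(pixels_wide, pixels_high, page_width_in, page_height_in):
    """Dots per inch along each axis; the lower of the two is the limiting one."""
    return min(pixels_wide / page_width_in, pixels_high / page_height_in)


def resolution_verdict(dpi):
    """Map a dpi value onto the thresholds used in this guide."""
    if dpi < 150:
        return "poor: re-scan if you can"
    if dpi < 200:
        return "borderline"  # assumption: the guide does not name this band
    return "ok"


# A US Letter page (8.5 x 11 in) scanned to a 2550 x 3300 pixel image.
dpi = effective_dpi(2550, 3300, 8.5, 11)
print(dpi, resolution_verdict(dpi))
```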

Skew, contrast, and other scan problems

OCR engines assume roughly upright, reasonably high-contrast text. A scan that was photographed at an angle, that has a dark gutter shadow on one side, or that has a colored background will produce worse results even at good resolution. Deskew and despeckle steps (sometimes built into the OCR tool, sometimes a separate pre-processing pass) correct these problems before recognition.

If your scan is a phone photo of a page rather than a flatbed scan, expect to spend a moment on pre-processing. Crop to the page, straighten the text, and raise contrast before OCR. The result is dramatically better than running OCR on the raw photo.
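"Raise contrast" can be as simple as a linear stretch: map the darkest pixels to black, the lightest to white, and spread everything else in between. The sketch below works on a grayscale image held as a list of rows of 0..255 values. The function name and the percentile defaults are assumptions chosen for illustration; real tools typically use an imaging library and more careful methods.

```python
def stretch_contrast(pixels, low_pct=2, high_pct=98):
    """Linearly rescale grayscale values so low_pct maps to 0 and high_pct to 255."""
    flat = sorted(value for row in pixels for value in row)
    lo = flat[(len(flat) - 1) * low_pct // 100]
    hi = flat[(len(flat) - 1) * high_pct // 100]
    if hi == lo:
        return [row[:] for row in pixels]  # flat image: nothing to stretch
    return [
        [min(255, max(0, round((value - lo) * 255 / (hi - lo)))) for value in row]
        for row in pixels
    ]


# A washed-out 2 x 2 patch: values span only 100..160 instead of 0..255.
print(stretch_contrast([[100, 120], [140, 160]], low_pct=0, high_pct=100))
```

Clipping a few percent at each end, as the defaults do, keeps a handful of stray dark or bright pixels from deciding the whole range.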

Choosing the right language

OCR is not language-agnostic. An engine recognizes a specific set of scripts and languages, and picking the wrong one turns recognizable text into nonsense. A French document run through an English-only engine will mangle accented characters; a Japanese document run through a Latin engine will return nothing useful.

Set the language to match the document. For multilingual documents, some engines accept multiple languages at once. If the document mixes scripts, pick the dominant one first and verify the result on a mixed-script page.
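When you verify a mixed-script page, it can help to count which scripts actually made it into the recognized text. The sketch below takes the first word of each letter's Unicode character name as a rough script label; that is an approximation chosen for this example, not a formal script property, and the function name is illustrative.

```python
import unicodedata
from collections import Counter


def script_counts(text: str) -> Counter:
    """Count letters by the first word of their Unicode name (LATIN, CJK, ...)."""
    counts = Counter()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                counts[name.split()[0]] += 1
    return counts


counts = script_counts("Tokyo 東京")
print(counts.most_common())
```

If a page you can see contains Japanese but the counts for the OCR output show only LATIN, the engine probably ran with the wrong language set.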

Verifying the OCR result

Trust, but verify. After OCR, run these checks.

  • Find a word visible on the first page. If Find locates it, the text layer is present.
  • Find a word visible on the last page. Some tools OCR only the first few pages on a free tier.
  • Copy a sentence with numbers and punctuation. Numbers and punctuation are where OCR errs most.
  • If the document has a table, try selecting a cell. Table recognition is harder than prose; expect imperfection.
  • Open the document properties and confirm the page count matches the source.
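The copy-and-inspect check can be partly automated. A minimal sketch that flags tokens mixing digits with letters OCR commonly confuses with digits (O and 0, l and 1, S and 5, B and 8); the letter set and the function name are assumptions for illustration, and legitimate codes such as "B12" will be flagged too, so treat the output as a list of places to look rather than a list of errors.

```python
import re

# A token that contains at least one digit and at least one lookalike letter.
SUSPECT_RE = re.compile(r"\b(?=\w*\d)(?=\w*[OoIlSB])\w+\b")


def suspicious_tokens(text: str) -> list:
    """Return tokens worth a manual look, in the order they appear."""
    return SUSPECT_RE.findall(text)


print(suspicious_tokens("Total 1O0.50 due 2O24-05-l1"))
```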

When OCR needs a server

Local OCR in a browser handles single documents and small batches well. It struggles or fails on very large scans (hundreds of pages), on languages the local engine does not support, and on handwritten content that needs a specialized model. In those cases a server-side OCR service is the right tool.

When you use a server OCR service on a sensitive scan, read the privacy page before uploading. Confirm where the file is processed, how long it is retained, and whether it is shared with subprocessors. For a tax scan, a medical scan, or a legal scan, prefer a local tool or a vendor with clear deletion guarantees. For a public-domain book scan, the server path is fine.

After OCR: what else you can do

Once the scan is searchable, the rest of the PDF toolkit works on it. You can extract the text to a plain text file, merge several OCR'd scans into one searchable document, compress the result for email, or split it into per-chapter files. None of those steps need a server either.

If you only need the text and not the PDF, an extract-text step gives you a .txt file you can search, grep, or paste into another tool. That is useful for receipts, for notes, and for feeding a small excerpt into another workflow without sharing the whole document.
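Once you have plain text, ordinary string tools apply. The sketch below reports which pages mention a word. It assumes the extractor separates pages with a form-feed character, which some command-line extractors do but not all; check your own tool's output, and treat the function name as illustrative.

```python
def pages_containing(text: str, word: str, page_break: str = "\f") -> list:
    """Return 1-based page numbers whose text contains word, ignoring case."""
    needle = word.casefold()
    return [
        number
        for number, page in enumerate(text.split(page_break), start=1)
        if needle in page.casefold()
    ]


extracted = "Invoice 42\fTerms and conditions\fInvoice total"
print(pages_containing(extracted, "invoice"))
```

The same function doubles as a first-and-last-page check: a word you can see on the final page should report that page number.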

Handwriting, stamps, and the limits of OCR

OCR engines are trained on printed type. Handwriting is a different problem, and most general-purpose engines recognize it poorly or not at all. A handwritten note on a scanned form, a signature, or margin annotations will usually come through as garbage or be skipped entirely. For handwritten content, expect to transcribe the important parts manually rather than rely on the OCR layer.

Stamps, watermarks, and colored backgrounds sit between the text and the engine. A faint stamp over a word can cause a recognition error on that word; a dark background behind light text can drop entire lines. Pre-processing that raises text contrast and suppresses background noise helps, but there is a point of diminishing returns where re-scanning the page cleanly is faster than tuning the pre-processing pipeline.

Tables are the hardest common case. An OCR engine reads tables as a stream of text and loses the column structure, so a row of numbers can come out in the wrong order or merged with the next row. If you need the table data, export the recognized text and rebuild the table in a spreadsheet, or use a tool specifically designed for table extraction. Do not trust a prose OCR pass to preserve table structure.

  • Handwriting is rarely recognized well; transcribe important handwritten parts manually.
  • Stamps and dark backgrounds cause errors; re-scan cleanly rather than over-tuning pre-processing.
  • Tables lose their column structure in a prose OCR pass; rebuild them in a spreadsheet if needed.
  • A signature is an image, not text — OCR will skip it, which is usually what you want.
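For the table case, a first pass at rebuilding rows for a spreadsheet might look like the sketch below. It assumes the exported text still separates columns with two or more spaces, which is often not true of OCR output, so it also reports rows whose cell count differs from the most common one. Those are the rows to fix by hand. All function names are illustrative.

```python
import csv
import io
import re
from collections import Counter


def rows_from_text(text: str) -> list:
    """Split each non-empty line into cells on runs of two or more spaces."""
    return [
        re.split(r"\s{2,}", line.strip())
        for line in text.splitlines()
        if line.strip()
    ]


def ragged_rows(rows: list) -> list:
    """Return 1-based row numbers whose cell count differs from the usual one."""
    if not rows:
        return []
    expected = Counter(len(row) for row in rows).most_common(1)[0][0]
    return [n for n, row in enumerate(rows, start=1) if len(row) != expected]


def to_csv(rows: list) -> str:
    """Serialize rows as CSV text that a spreadsheet can open."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


text = "Item   Qty   Price\nPaper A4   2   4.50\nStaples 1   0.99\n"
rows = rows_from_text(text)
print(ragged_rows(rows))
print(to_csv(rows))
```

In the sample, the third line lost the gap between "Staples" and "1", so it is reported as ragged instead of being silently written with a merged cell.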

A quick reference for scanned PDFs

Before you run OCR on a scan:

  • Confirm the source is image-only (Find returns nothing). If it already has text, OCR is unnecessary.
  • Set the OCR language to match the document.
  • Use a 200–300 dpi scan if you can re-scan; below 150 dpi, expect poor results.
  • Deskew and raise contrast on phone photos before OCR.
  • After OCR, Find a word on the first and last page to verify the text layer.

How it works in PdfWiseAI

  1. PdfWiseAI OCR tool with a scanned PDF loaded and language selected
    Drop the scan and pick the document language before OCR.
  2. Find dialog locating a word in the OCR'd PDF
    Verify the text layer by finding a word on the first page.


Frequently asked questions

How do I make a scanned PDF searchable?

Run OCR. OCR adds an invisible text layer aligned to the image so Find, copy, and screen readers work. The visible scan is unchanged.

Why does Find return nothing on my scanned PDF?

Because the scan stores pages as images, not characters. The text is not in the file as text. OCR adds the text layer that makes Find work.

Can I run OCR without uploading my scan?

Yes. A local browser OCR tool recognizes text in the current tab and writes a searchable PDF. For very large batches or unsupported languages, a server tool may be needed.

How accurate is OCR on a scanned PDF?

On a clean 200–300 dpi scan in a supported language, well over 95 percent of characters. Accuracy drops with low resolution, handwriting, skew, and wrong language settings.

Does OCR replace the image?

No. A good OCR pass adds a text layer on top of the image. The page looks the same but becomes searchable and selectable.

Should I convert a scan to Word instead?

Usually no. A searchable PDF with an OCR text layer preserves the layout. Converting to DOCX tries to reconstruct editable text and often makes a mess of anything beyond simple paragraphs.

What language should I pick for OCR?

The language of the document. Wrong-language OCR mangles accented characters or returns nothing. For multilingual documents, pick the dominant script first.
