receipt-ripper.com
Your receipts never leave your device
Two formats, two completely different code paths inside the parser, two very different accuracy profiles. Pick the right format for the job.
Most receipts arrive in one of two forms: a printed slip from a till that you photograph with your phone, or a PDF emailed by an online merchant. The two formats look superficially similar (both are "receipts"), but inside a parser like Receipt Ripper they go down completely different code paths with very different accuracy characteristics. This guide explains the difference and offers a rule of thumb.
Before comparing PDF to photo, it's worth knowing that "PDF" is two things wearing the same hat. Some PDFs contain an embedded text layer: the original characters, encoded as text in the file, exactly as the merchant's billing system emitted them. Other PDFs are essentially photographs wrapped in a PDF envelope, with no text layer at all. A tool that reads the text layer directly can parse the first kind near-perfectly; the second kind is no better than a photo.
You can tell them apart by opening the PDF in any reader and trying to select text with your cursor. If selection works and you can copy "Subtotal $12.50" to the clipboard as actual text, it's a text-layer PDF. If selection just draws a rectangle without picking up anything, it's an image-only PDF and the receipt content lives only in the rasterised pixels.
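The same cursor test can be approximated in code. The sketch below is illustrative rather than anything Receipt Ripper ships: it assumes you already have the text items for a page (in PDF.js, `page.getTextContent()` resolves to an object whose `items` carry a `str` field when marked-content items are not requested), and it uses a hypothetical threshold of 20 non-whitespace characters to decide whether a usable text layer exists.

```typescript
// Minimal shape of a text item; only `str` is needed for this check.
// Real PDF.js text items carry additional fields that are ignored here.
interface TextItemLike {
  str: string;
}

// Hypothetical cut-off: below this many non-whitespace characters the page
// is treated as image-only. Tune it for your own documents.
const MIN_TEXT_CHARS = 20;

// Returns true when the page appears to carry a real text layer.
function hasTextLayer(
  items: TextItemLike[],
  minChars: number = MIN_TEXT_CHARS
): boolean {
  let count = 0;
  for (const item of items) {
    // Count only visible characters; whitespace-only items add nothing.
    count += item.str.replace(/\s+/g, "").length;
    if (count >= minChars) {
      return true;
    }
  }
  return false;
}
```

A page whose items are "Subtotal", "$12.50" and "Total $13.63" passes the check (25 visible characters), while a page with no items, or with whitespace-only items, is classified as image-only.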
Receipt Ripper currently routes every PDF through the OCR pipeline regardless of which kind it is: we render the first page of the PDF to a canvas, then OCR the canvas. That means even text-layer PDFs go through OCR rather than having their text layer read directly. This is a known limitation worth tracking; for the purposes of this article, treat every PDF as taking the same OCR path that a high-quality photo would.
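Because the page is rasterised before OCR, the render scale decides how many pixels each character gets. PDF page sizes are expressed in points (72 per inch by default), so the scale for a target resolution is simply DPI divided by 72. The sketch below computes the scale and canvas size under that assumption; the function name and the 300 DPI figure in the usage note are illustrative, not Receipt Ripper's actual settings.

```typescript
// Default PDF user space: 72 units (points) per inch.
const POINTS_PER_INCH = 72;

interface CanvasSize {
  scale: number;  // factor to hand to the renderer
  width: number;  // canvas width in pixels
  height: number; // canvas height in pixels
}

// Computes the render scale and canvas size for a page measured in points.
// targetDpi is a free parameter chosen by the caller.
function canvasSizeForDpi(
  pageWidthPt: number,
  pageHeightPt: number,
  targetDpi: number
): CanvasSize {
  if (pageWidthPt <= 0 || pageHeightPt <= 0 || targetDpi <= 0) {
    throw new RangeError("page size and DPI must be positive");
  }
  return {
    scale: targetDpi / POINTS_PER_INCH,
    // Multiply before dividing to limit floating-point drift, then round.
    width: Math.round((pageWidthPt * targetDpi) / POINTS_PER_INCH),
    height: Math.round((pageHeightPt * targetDpi) / POINTS_PER_INCH),
  };
}
```

For a US Letter page (612 × 792 points) at 300 DPI this gives a 2550 × 3300 pixel canvas. In PDF.js the resulting `scale` would be passed to `page.getViewport({ scale })` before rendering.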
A photographed receipt is harder to parse than a PDF for several compounding reasons:
For a short receipt with large text — a parking meter ticket, a coffee shop slip — both formats parse fine and the difference is academic. For a long restaurant receipt with twenty line items in small print at the bottom, the PDF version is significantly more accurate because the parser doesn't have to make line-item decisions on 6-pixel-tall digits.
The accuracy gap also widens dramatically for foreign-currency or unusual-character receipts. A photo of an Italian restaurant receipt with "€" symbols, decimal commas, and Italian month abbreviations is harder for the OCR engine than its PDF equivalent, because the photo introduces ambiguity on every distinguishing character.
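To see why a single distinguishing character matters, consider how an amount string becomes a number. The sketch below is a hypothetical helper, not Receipt Ripper's parser. It assumes a simple heuristic: the last comma or period is the decimal separator only when exactly two digits follow it, and every other separator is a thousands mark.

```typescript
// Parses strings such as "12,50 €", "€ 1.250,00" or "$1,250.00".
// Returns null when the input contains no digits.
function parseAmount(raw: string): number | null {
  // Drop currency symbols, spaces and anything else that is not a digit
  // or a separator.
  const cleaned = raw.replace(/[^0-9.,]/g, "");
  if (!/\d/.test(cleaned)) {
    return null;
  }
  const lastSep = Math.max(cleaned.lastIndexOf(","), cleaned.lastIndexOf("."));
  // Heuristic (assumption): decimal separator only if exactly two digits follow.
  const hasDecimals = lastSep !== -1 && cleaned.length - lastSep - 1 === 2;
  const intDigits = (hasDecimals ? cleaned.slice(0, lastSep) : cleaned)
    .replace(/[.,]/g, "");
  const fracDigits = hasDecimals ? cleaned.slice(lastSep + 1) : "00";
  return Number((intDigits || "0") + "." + fracDigits);
}
```

With this helper, "12,50 €" parses to 12.5 and "€ 1.250,00" to 1250. If OCR drops the comma from a photographed "12,50" and emits "1250", the same helper returns 1250, a hundredfold error caused by one lost character.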
For tax filing — where every digit on a receipt eventually matters — the PDF version is the one to keep when both exist. Photograph the paper original as a backup, but use the PDF for parsing if the merchant sent one.
In practice, the choice is often made for you. Paper-only receipts from offline shops, restaurants, taxis, and parking meters can only be photographed. Email receipts from Amazon, Uber, Stripe-billed services, and most modern e-commerce arrive only digitally, as PDFs or as HTML emails you can save as PDF.
A few merchants are in both worlds. Hotel chains often print a paper receipt at check-out and email a PDF at the same time. Some restaurants print and email. When you have both, default to the PDF, but file the paper copy too if your jurisdiction's tax rules require originals. The rules vary: in the US, a legible digital copy of an originally-paper receipt is sufficient in most cases, while in some European jurisdictions the original VAT receipt has to be retained.
For freelancers and small businesses receiving a mix of formats, the workflow we see work best is roughly:
The tooling doesn't care which format you drop in; the parser routes JPG/PNG/HEIC through the image pipeline and PDF through the PDF.js pipeline automatically. What you should care about is that the parser sees the highest-fidelity copy of each receipt, which usually means the original PDF when one exists and the freshest possible photo when only paper exists.
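The routing described above can be pictured as a small lookup on the file extension. This is a minimal sketch under stated assumptions, not Receipt Ripper's actual code: the names `Pipeline` and `pickPipeline` are hypothetical, "jpeg" is added as an assumed alias for "jpg", and a real implementation might inspect the MIME type or file header instead of the name.

```typescript
type Pipeline = "image" | "pdf";

// Image extensions named in the article, plus "jpeg" as an assumed alias.
const IMAGE_EXTENSIONS: string[] = ["jpg", "jpeg", "png", "heic"];

// Chooses a pipeline from the filename, or returns null for unsupported files.
function pickPipeline(filename: string): Pipeline | null {
  const dot = filename.lastIndexOf(".");
  if (dot === -1 || dot === filename.length - 1) {
    return null; // no extension at all
  }
  const ext = filename.slice(dot + 1).toLowerCase();
  if (ext === "pdf") {
    return "pdf";
  }
  if (IMAGE_EXTENSIONS.indexOf(ext) !== -1) {
    return "image";
  }
  return null;
}
```

Here `pickPipeline("IMG_0412.HEIC")` returns "image", `pickPipeline("invoice.pdf")` returns "pdf", and anything else returns null so the caller can reject the file.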
For more on what makes photos parse cleanly, see "How to photograph a receipt so OCR actually reads it". For dealing with the residual misreads after you've done everything right, see "OCR accuracy troubleshooting".