Almost Timely News: 🗞️ Using Local AI for Document Scanning (2026-02-01)

Mundane? Yes. Useful? Also yes.

The Big Plug

Two new things to try out this week: 1. Got a stuck AI project? Try out Katie's new, free AI Readiness Assessment tool. A simple quiz to help predict project success.

Content Authenticity Statement

100% of this week's newsletter content was originated by me, the human. Learn why this kind of disclosure is a good idea and might be required for anyone doing business in any capacity with the EU in the near future.

Watch This Newsletter On YouTube 📺

Click here for the video 📺 version of this newsletter on YouTube » Click here for an MP3 audio 🎧 only version »

What's On My Mind: Using Local AI for Document Scanning

This week, let's dig into a very specific application of local AI. Last week we covered how to get started with private, local AI, which I recommend you review and do - we'll be building on that. If you don't want to, or can't, get a local AI model running, consider using an infrastructure provider like DeepInfra or Groq (note the spelling), as they can provide low-cost access to today's best models, often with zero-data-retention APIs.

The specific application of local AI we're looking at this week is something seemingly mundane: document scanning. Now, you might say, "Chris, that is the most boring, mundane, unsexiest use of generative AI, don't you have anything more interesting?" But something like document scanning is the epitome of the Shirky Principle: once a technology is technologically boring, it can be societally interesting. Using generative AI for document scanning is boring, but there are plenty of documents in the world that are very difficult to read through normal scanning. Photographic scans of paperwork. Documents with charts, graphs, and images embedded in them. Weirdly formatted tables. Partially redacted documents. All of those can throw regular document scanners for a loop. Generative AI models trained to be document scanners can overcome many of those issues. So let's dig in.

Part 1: A Pre-Emptive Glossary

Before we dig into the how-to, let's take a few moments to describe the what. Document scanning is a profession unto itself and has a lot of lingo and jargon - jargon that, if you know it, makes it easier to work with AI.

The most common term you'll hear is OCR - optical character recognition. This is what a lot of computer vision software started with: the need to scan letters and convert analog text (like the printed page) into digital text. Additionally, as models got more powerful and compute got bigger, OCR improved to the point where it can read handwriting. Today, you can take a photo of handwritten text even from centuries past and most AI models will be able to transcribe it. As a fun aside, I tried that with some Sumerian text from a local museum, the Museum of Fine Arts in Boston, and Google's Gemini was actually able to read it with reasonable accuracy.

Transcribe is itself a specific word in document scanning, something you'll want to note for AI prompts. To transcribe is to write down text, word for word, as closely as possible to the original. In general, you often want AI to transcribe something first before doing anything else, so you can check the quality of its work. Many people make the mistake of trying to have AI do too much and just process an entire document in one shot, rather than breaking the workflow down into steps.
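To make the transcribe-first idea concrete, here's the kind of two-step split I'm describing. The wording below is purely illustrative - adapt it to your own documents:

```
Step 1 (transcribe): "Transcribe this page word for word, preserving headings, tables,
and handwriting. Do not summarize, correct, or omit anything."

Step 2 (process): "Using only the transcript above, list every date and dollar amount
that appears, along with the sentence each one came from."
```

Splitting the job this way lets you spot-check the transcript before you trust anything built on top of it.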
When we talk about using AI for document scanning, we are often talking about VLMs, vision language models. VLMs are models that can work with images as well as text (and sometimes video). They can "see" in ways that a text-only model cannot, because they've been trained on images as well as text. When we're doing document scanning, we want to make sure we're using a vision language model for the processing part. Speaking of which, there is a distinct workflow for doing document scanning with AI: split your documents into individual pages, convert each page to an image, have the vision language model transcribe each page, and store each transcription somewhere you can work with it, such as a database.
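To show what the heart of that workflow looks like in code, here's a minimal sketch that transcribes a single PDF page with a local vision language model. It assumes LM Studio's built-in server is running at its default address and that you've loaded a vision-capable model; the file path and model name are placeholders you'd swap for your own:

```python
# Minimal sketch: transcribe one PDF page with a local vision language model.
# Assumes LM Studio's built-in server is running (default: http://localhost:1234/v1)
# and a vision-capable model is loaded. Install dependencies first:
#   pip install openai pymupdf
import base64

import fitz  # PyMuPDF, used to render PDF pages as images
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")


def transcribe_page(pdf_path: str, page_number: int = 0, model: str = "your-vision-model") -> str:
    """Render one page as a PNG and ask the local model to transcribe it."""
    doc = fitz.open(pdf_path)
    pixmap = doc[page_number].get_pixmap(dpi=200)
    image_b64 = base64.b64encode(pixmap.tobytes("png")).decode("utf-8")

    response = client.chat.completions.create(
        model=model,  # placeholder; use the identifier shown in LM Studio
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Transcribe this page word for word. Do not summarize."},
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content


if __name__ == "__main__":
    print(transcribe_page("sample_document.pdf"))  # placeholder file name
```

The automation we'll build in Part 3 is essentially this, wrapped in a loop over every page of every document, with the results written to a database.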
A few other terms you'll want to know:

SQLite is one of the most useful database formats there is, because it's a single file that lives on your computer. Unlike bigger systems that require servers (MySQL, PostgreSQL, Microsoft SQL Server, Google BigQuery, etc.), SQLite is just a single flat file that lives in any folder. You can pick it up and move it around if needed. It's also a database format that AI is especially fluent in and knows how to manipulate, which comes in handy when we're talking about document scanning and storage.

Open source software is any software that is licensed for other people to use and modify, often for free, even for commercial use (depending on the license). Many of the world's top systems and software are open source, such as the Apache web server, the Linux operating system, many programming languages, and other core technologies. Often abbreviated FOSS (free and open source software), open source is what powers a lot of the modern Internet. Generative AI has extensive knowledge of open source software, which comes in handy for not reinventing the wheel.

Python is probably the most common programming language in the world now, and certainly the most popular language in open source software. Python versions 3.12 and 3.13 are the ones that many of the libraries (basically like plugins) AI tools depend on are built for.

Finally, context window. All AI has two kinds of memory, long-term and short-term. Long-term memory is the data the AI has been trained on, and as of today, when you use any AI model, that memory does not change. It's why so many AI tools integrate web search. The short-term memory, or working memory, is called the context window. For the purposes of building and working with your own local AI, the bigger you make it (you set it in your software, like LM Studio/AnythingLLM), the more memory your AI consumes and the slower it runs. That's another reason why lots of small tasks are better than one big task - it's far more resource efficient.

Okay, now that we've got the book learning out of the way, let's dig in.

Part 2: Choosing a Model

Assuming you completed the setup from last week's newsletter, you should have either LM Studio or AnythingLLM set up on your computer. We now need to find an AI model that will work for document scanning. There are a ton of excellent choices out there, including Qwen's vision language models, DeepSeek's OCR models, and Mistral OCR 3.
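When it comes time to pick one, a tiny harness like the sketch below makes side-by-side testing easy. It reuses the transcribe_page function from the earlier sketch, and the model identifiers are placeholders for whatever you've actually downloaded:

```python
# Sketch: run the same page through several local models and compare the transcripts.
# transcribe_page() is the helper from the earlier sketch; model names are placeholders -
# use the identifiers shown in your LM Studio model list.
CANDIDATES = ["qwen-vl-placeholder", "deepseek-ocr-placeholder", "mistral-ocr-placeholder"]

for model_name in CANDIDATES:
    transcript = transcribe_page("test_page.pdf", model=model_name)
    print(f"\n===== {model_name} =====\n{transcript[:500]}")  # first 500 characters is usually enough to judge
```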
Ideally, if you have the time and resources, take a couple of pages from the kind of document you want to work with, install all 3 models, and do a test run to see which model is most accurate for the specific kind of task you're doing. For example, if you're working with old manuscripts handwritten in 18th-century penmanship, you might find that Mistral OCR 3 does a better job than Qwen or DeepSeek. If you're dealing with scanned notes from cold cases from the 1970s, typed on old typewriters, you might find that DeepSeek does a better job. Logically, you wouldn't be investigating OCR models unless you had a LOT of documents you needed scanned, so take the time to test out models and see how accurate each one is.

You might ask at this point, why wouldn't we just use Gemini or ChatGPT or Claude for these kinds of tasks? The answer goes back to last week's newsletter - my assumption is that there are plenty of documents that are confidential, that you wouldn't want in the hands of third parties, especially medical and legal documents. Maybe there are documents that you don't want someone else to know you're looking at. For those kinds of documents, local AI is the best choice. Additionally, there are plenty of documents that you might want to process that you simply don't want to pay for. When you use cloud-based artificial intelligence, you are going to pay per token. And while an individual token may not cost much, if you're talking about millions of pages of documents, that bill can get very large, very quickly. Local AI will save you that money in exchange for electricity and the computer you're running it on.

Part 3: The Automation

Here's the challenge with lots of document scanning. You can, pretty easily, take a short document that's a few pages long, drop it into chat, and have nearly any AI model give you a good transcript. This is true of local models and cloud-based models (like ChatGPT). But when a document is dozens or hundreds of pages long, or you have archives of thousands of documents, that's no longer practical. Even fifty pages is a lot for a model like Gemini to handle, and the risk of it hallucinating or skipping pages gets higher as you add more work to its plate. If, on the other hand, you could feed AI one page at a time? It's perfectly comfortable with that and will give you great results. As I mentioned in the trashy romance novel issue, AI does great when you take a big task and turn it into a ton of small tasks.

So our next step is to do exactly that. Using the AI of your choice, we want to design a Python script that can take apart a big document or set of documents, split them into single pages, and then process each page and store the results in a database. If you followed the instructions from last week's newsletter, you should have Python on your computer and ready to go. If you didn't, go back and do that.

Here's the prompt you can paste into any big AI model like Gemini, ChatGPT, or Claude. Modify the part in the curly braces, then copy and paste into your favorite AI.
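Your exact wording will vary, but the prompt needs to cover these points; everything in curly braces is a placeholder for your own details:

```
You are an expert Python developer. Help me design a single Python script that:
1. Reads every PDF in {path to my documents folder}.
2. Splits each document into individual pages and renders each page as an image.
3. Sends each page image to my local vision model running in {LM Studio or AnythingLLM}
   with instructions to transcribe the page word for word.
4. Stores each transcription in a SQLite database, {database file name}, along with the
   source file name and page number.
Ask me any clarifying questions you need answered before writing the code, and tell me
which libraries I will need to install.
```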
After a whole bunch of thinking, you should end up with a single Python script that you download and put somewhere on your computer, ideally alongside the folder of documents you want to scan. If you're not sure what to do next, give the AI this prompt and follow its guidance, editing the parts in curly braces:
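Again, the wording is flexible; something along these lines does the job:

```
Here is the Python script you wrote for me: {paste the script}. I am not a programmer.
Give me step-by-step instructions for installing anything the script needs and for
running it on {my operating system}, including how I can tell whether it worked.
```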
This will give you step-by-step instructions for how to actually use the script the AI spit out. Run a quick test on a folder with a couple of PDFs and see how it goes. Using AI for troubleshooting is just as important: if you paste the exact error messages the script produces back into your AI, it will help you straighten out any bugs it created as well.

Part 4: Using the Data

Once your script has used your local AI to do vision-based OCR, you should have a database of results. The question is, what do you do with that database? You have a couple of options. First, if you know your way around databases, SQLite is nothing more than a SQL database. You can use any FOSS SQLite app to browse the data, write queries, etc. My recommended client is DB Browser for SQLite - it's available for free for Mac, Windows, and Linux.

If you're not fluent in SQL (the language of SQLite), then as long as you have an AI agent that can see the database file itself - Claude Code, Claude Cowork, Google Antigravity, OpenAI Codex, etc. - you can ask natural language questions and have the agent interrogate the database directly. Here's an example prompt - note that you MUST have the AI agent running in the folder you're working in; see this Trust Insights livestream for getting started with Claude Cowork and this one for getting started with Claude Code.
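As an illustration, the prompt can be as plain as this (the curly braces are placeholders for your own details):

```
In this folder there is a SQLite database named {your database file}. Look at its schema,
then export every page transcription that mentions {topic you care about}, along with the
source file name and page number, to a CSV file named results.csv.
```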
This will give you a CSV spreadsheet you can open in the spreadsheet software of your choice - though I do recommend learning SQL, as it's a super handy language to know. Once you've got the data, it's up to you what you do with it. And you can ask generative AI to route the data any way you want; you don't have to use SQLite. If you've got existing systems or APIs that work well for you, you can have your AI of choice modify your Python script to use those instead. For example, suppose you were scanning in a ton of receipts and you have a system like QuickBooks. As long as QuickBooks has an API you're allowed to use, you could have your script send your receipt text there instead.

Part 5: Wrapping Up

Document transcription can be incredibly boring, but once you start learning how to work with local AI on it, you'll find tons of uses. For example, one of my favorite use cases for a vision language model, set up exactly as we did in this article, is to rename all those "Screenshot 2026-01-31 14:00:12.png" files littering my computer into things like "cat_sitting_on_my_head.png" so that I have a better sense of what's in my actual images folder. The same thing is true for PDFs - "EFTA01264412.pdf" becomes "att_wireline_phone_records.pdf", a far more helpful filename. The sky's the limit once you start using local AI, in partnership with code you write with big cloud AI, to get seemingly boring, mundane tasks done. Is it the fanciest thing that's gonna get you lots of clout on LinkedIn or the other social networks of your choice? No. Will you be more productive? Yes.

How Was This Issue?

Rate this week's newsletter issue with a single click/tap. Your feedback over time helps me figure out what content to create for you.

Here's The Unsubscribe

It took me a while to find a convenient way to link it up, but here's how to get to the unsubscribe. If you don't see anything, here's the text link to copy and paste: https://almosttimely.substack.com/action/disable_email

Share With a Friend or Colleague

Please share this newsletter with two other people. Send this URL to your friends/colleagues: https://www.christopherspenn.com/newsletter For enrolled subscribers on Substack, there are referral rewards if you refer 100, 200, or 300 other readers. Visit the Leaderboard here.

ICYMI: In Case You Missed It

Here's content from the last week in case things fell through the cracks:
On The Tubes

Here's what debuted on my YouTube channel this week:

Skill Up With Classes

These are just a few of the classes I have available over at the Trust Insights website that you can take.

Premium

Free
Advertisement: New AI Book!

In Almost Timeless, generative AI expert Christopher Penn provides the definitive playbook. Drawing on 18 months of in-the-trenches work and insights from thousands of real-world questions, Penn distills the noise into 48 foundational principles - durable mental models that give you a more permanent, strategic understanding of this transformative technology. In this book, you will learn to:
Stop feeling overwhelmed. Start leading with confidence. By the time you finish Almost Timeless, you won't just know what to do; you will understand why you are doing it. And in an age of constant change, that understanding is the only real competitive advantage. 👉 Order your copy of Almost Timeless: 48 Foundation Principles of Generative AI today!

Get Back To Work!

Folks who post jobs in the free Analytics for Marketers Slack community may have those jobs shared here, too. If you're looking for work, check out these recent open positions, and check out the Slack group for the comprehensive list.

Advertisement: New AI Strategy Course

Almost every AI course is the same, conceptually. They show you how to prompt, how to set things up - the cooking equivalents of how to use a blender or how to cook a dish. These are foundation skills, and while they're good and important, you know what's missing from all of them? How to run a restaurant successfully. That's the big miss. We're so focused on the how that we completely lose sight of the why and the what. This is why our new course, the AI-Ready Strategist, is different. It's not a collection of prompting techniques or a set of recipes; it's about why we do things with AI. AI strategy has nothing to do with prompting or the shiny object of the day - it has everything to do with extracting value from AI and avoiding preventable disasters. This course is for everyone in a decision-making capacity because it answers the questions almost every AI hype artist ignores: Why are you even considering AI in the first place? What will you do with it? If your AI strategy is the equivalent of obsessing over blenders while your steakhouse goes out of business, this is the course to get you back on course.

How to Stay in Touch

Let's make sure we're connected in the places that suit you best. Here's where you can find different content:
Listen to my theme song as a new single:

Advertisement: Ukraine 🇺🇦 Humanitarian Fund

The war to free Ukraine continues. If you'd like to support humanitarian efforts in Ukraine, the Ukrainian government has set up a special portal, United24, to help make contributing easy. The effort to free Ukraine from Russia's illegal invasion needs your ongoing support. 👉 Donate today to the Ukraine Humanitarian Relief Fund »

Events I'll Be At

Here are the public events where I'm speaking and attending. Say hi if you're at an event also:
There are also private events that aren't open to the public. If you're an event organizer, let me help your event shine. Visit my speaking page for more details. Can't be at an event? Stop by my private Slack group instead, Analytics for Marketers.

Required Disclosures

Events with links have purchased sponsorships in this newsletter and as a result, I receive direct financial compensation for promoting them. Advertisements in this newsletter have paid to be promoted, and as a result, I receive direct financial compensation for promoting them. My company, Trust Insights, maintains business partnerships with companies including, but not limited to, IBM, Cisco Systems, Amazon, Talkwalker, MarketingProfs, MarketMuse, Agorapulse, Hubspot, Informa, Demandbase, The Marketing AI Institute, and others. While links shared from partners are not explicit endorsements, nor do they directly financially benefit Trust Insights, a commercial relationship exists for which Trust Insights may receive indirect financial benefit, and thus I may receive indirect financial benefit from them as well.

Thank You

Thanks for subscribing and reading this far. I appreciate it. As always, thank you for your support, your attention, and your kindness. Please share this newsletter with two other people. See you next week,

Christopher S. Penn

Invite your friends and earn rewards

If you enjoy Almost Timely Newsletter, share it with your friends and earn rewards when they subscribe.
