From Inputs to Infrastructure
Designing a Scalable Foundation for an AI That Reads the World

Hi, I’m Florian, Fractional CTO and builder of AI-powered tools designed to solve real, everyday problems.
After outlining the core idea in Part 1, it’s time to move from why to how.
This second chapter explores not just what the tool should do, but how it should be designed under the hood so it stays useful, scalable, and trustworthy across different fields and use cases. You’ll see the kind of questions I ask myself and the thinking that leads to my design decisions.
What should the tool actually support?
To stay relevant for a wide variety of users, the tool has to do more than scrape a few websites or send alerts.
So the first question is: What kind of input should the system handle?
The goal is clear: support as many content types as possible, not just webpages. That means:
- Articles and blog posts
- PDFs and academic papers
- Newsletters and mailing lists
- Legal updates
- JSON feeds and structured reports
The tool should be able to take in all these formats, clean them up, and turn them into a consistent internal format. That’s what I mean when I talk about multi-input support.
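To make "consistent internal format" concrete, here's a minimal sketch of what that shape could look like, with one small converter per input type. The field and function names are illustrative, not final:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class NormalizedDocument:
    """The single internal shape every input type gets converted into."""
    source: str                      # e.g. "rss", "pdf", "newsletter", "json_feed"
    title: str
    text: str                        # cleaned plain text, markup stripped
    published: Optional[str] = None  # ISO 8601 date, if the source provides one
    metadata: dict = field(default_factory=dict)

def from_json_feed(item: dict) -> NormalizedDocument:
    """One converter per format; a PDF or HTML converter returns the same type."""
    return NormalizedDocument(
        source="json_feed",
        title=item.get("title", ""),
        text=item.get("content_text", ""),
        published=item.get("date_published"),
        metadata={"url": item.get("url")},
    )
```

Everything downstream (chunking, embedding, summarizing) then only ever sees this one type.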
Next question: Who is this tool actually for?
While my starting point is recruiting professionals (via our UPFLINX platform), I want the tool to work across sectors. That means designing for multi-purpose use from day one. A lawyer might care about new legislation, a doctor about medical papers, a policy expert about regulatory trends. Everyone reads different content, but they share one big challenge: staying informed without getting overwhelmed.
Should the system run in real-time or on a schedule?
So the next question becomes: How often do people actually need this information?
In many cases, real-time updates aren’t necessary. Most professionals prefer structured updates they can actually digest. Think:
- Daily briefings before work
- Weekly newsletters
- Monday-morning overviews
That’s why I’ve chosen batch processing over live streaming. It’s simpler, more efficient, and easier to control. Real-time pipelines make sense for emergency alerts, stock trading, or breaking news, not for thoughtful insight.
That said, the architecture will remain flexible so that real-time or near-real-time updates can be added later if needed.
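As a sketch of how simple the batch variant can stay (in production this would just be a cron entry or a container scheduler; the 06:00 run time is an assumption, not a spec):

```python
import time
from datetime import datetime, timedelta

def run_daily_briefing() -> None:
    # Fetch -> clean -> chunk -> embed -> summarize, then assemble the briefing.
    print(f"[{datetime.now():%Y-%m-%d %H:%M}] daily batch started")

def main() -> None:
    """Naive scheduler loop: wake up once a day before work hours."""
    while True:
        now = datetime.now()
        next_run = now.replace(hour=6, minute=0, second=0, microsecond=0)
        if next_run <= now:
            next_run += timedelta(days=1)
        time.sleep((next_run - now).total_seconds())
        run_daily_briefing()
```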
Where and how should the content be stored?
This brings us to one of the most foundational questions:
How do you store the content in a way that supports both analysis and discovery?
To make the assistant actually useful, we need to preserve not just the content, but also the context:
- When was it published?
- Who wrote it?
- What topics does it cover?
- Why does it matter to the user?
This is what metadata provides, and it’s critical. Without metadata, you can’t:
- Filter articles by date or topic
- Track how coverage changes over time
- Tailor content to individual user profiles
Think of metadata as the “index” of a smart library: without it, you’re just dumping books on a shelf with no way to find the right one.
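As a minimal sketch, the metadata attached to every document could mirror exactly those four questions (the field names are illustrative):

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class ArticleMetadata:
    published: date         # when was it published?
    author: str | None      # who wrote it?
    topics: list[str]       # what topics does it cover?
    relevance: str | None   # why does it matter to this user?

def recent_on_topic(items: list[ArticleMetadata],
                    topic: str, since: date) -> list[ArticleMetadata]:
    """The kind of filter that is impossible without metadata."""
    return [m for m in items if topic in m.topics and m.published >= since]
```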
To support both detailed filtering and deep semantic search, I’m building a dual storage setup:
- Structured database (like PostgreSQL) to hold all articles, tags, summaries, and metadata
- Vector database (like Qdrant or Weaviate) to store embeddings of the actual content chunks
Why both? Because they solve different problems:
- The structured DB helps you organize and manage content.
- The vector DB lets you search by meaning, not just keywords, which is key when dealing with nuanced or technical content.
Example: If a compliance officer searches for “data protection changes in hiring,” a vector search can find relevant articles even if they use different phrasing like “GDPR extensions impacting recruitment.”
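Here's a sketch of how a processed document could land in both stores, assuming the psycopg2 and qdrant-client libraries; the table name, collection name, and connection details are illustrative:

```python
import uuid

import psycopg2
from qdrant_client import QdrantClient
from qdrant_client.models import PointStruct

pg = psycopg2.connect("dbname=reader user=reader")  # connection string illustrative
qdrant = QdrantClient(url="http://localhost:6333")  # assumes a "chunks" collection exists

def store_document(doc_id: str, title: str, published: str,
                   tags: list[str], chunk_vectors: list[list[float]]) -> None:
    # Structured side: metadata, tags, and summaries go to PostgreSQL.
    with pg.cursor() as cur:
        cur.execute(
            "INSERT INTO articles (id, title, published, tags) VALUES (%s, %s, %s, %s)",
            (doc_id, title, published, tags),
        )
    pg.commit()
    # Semantic side: one point per content chunk, linked back via the payload.
    points = [
        PointStruct(id=str(uuid.uuid4()), vector=vec,
                    payload={"doc_id": doc_id, "chunk_index": i})
        for i, vec in enumerate(chunk_vectors)
    ]
    qdrant.upsert(collection_name="chunks", points=points)
```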
What technologies make sense for this kind of project?
Since I want this tool to be open, secure, and scalable, the stack must be:
- Open source, no black boxes
- Self-hostable, no data leaks to external services
- Cloud-friendly, but not dependent on a specific provider
- Affordable to run, even at 500 documents per day
Here’s the current setup:
- Backend: Python + FastAPI (clean, simple APIs)
- Structured DB: PostgreSQL (mature and robust)
- Vector DB: Qdrant or Weaviate (both easy to host and use)
- Embeddings: SentenceTransformers (for self-hosting) or OpenAI (for better accuracy if needed)
- Containerization: Docker (makes it portable and modular)
Why not MongoDB? Its vector search lives in the managed Atlas service rather than the self-hosted edition, which conflicts with the self-hosting requirement, and it can become costly at scale.
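Generating embeddings with SentenceTransformers takes only a few lines; all-MiniLM-L6-v2 is a common lightweight default here, not a final choice:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # small, fast, fully self-hostable

chunks = [
    "New GDPR extensions include stricter consent rules for applicant data.",
    "Employers must document the legal basis for processing CVs.",
]
vectors = model.encode(chunks)  # one 384-dimensional vector per chunk
print(vectors.shape)            # (2, 384)
```

If accuracy later outweighs hosting cost, swapping in the OpenAI embeddings API changes only this one step.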
What might a document look like inside the system?
Let’s break this down with an example.
In the structured database, an entry could look like this:
```json
{
  "id": "doc_8765",
  "title": "Data Privacy Laws Revised",
  "source": "official-journal.eu",
  "published": "2025-06-22",
  "tags": ["privacy", "law", "compliance"],
  "summary": "New GDPR extensions include...",
  "llm_insights": "Relevant for HR software tools",
  "vector_id": "vec_8765"
}
```
In the vector database, the content is broken into chunks for embedding:
```json
{
  "id": "vec_8765",
  "doc_id": "doc_8765",
  "chunks": ["<chunk 1 text>", "<chunk 2 text>", "..."]
}
```
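The chunking itself can stay simple; here's a sketch using overlapping word windows (the sizes are assumptions to tune, not recommendations):

```python
def chunk_text(text: str, max_words: int = 200, overlap: int = 40) -> list[str]:
    """Split cleaned article text into overlapping word windows.

    The overlap keeps sentences that straddle a boundary findable in both chunks.
    """
    words = text.split()
    step = max_words - overlap
    return [" ".join(words[start:start + max_words])
            for start in range(0, len(words), step)]
```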
This setup allows us to ask:
- “Find me all documents about GDPR changes since 2024” (structured filter)
- “Give me similar articles to this one” (semantic search via vectors)
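In code, those two access paths could look like this; the table schema and collection name are illustrative:

```python
from qdrant_client import QdrantClient

qdrant = QdrantClient(url="http://localhost:6333")

# Structured filter (PostgreSQL): exact conditions on metadata.
FILTER_SQL = """
    SELECT id, title FROM articles
    WHERE 'privacy' = ANY(tags) AND published >= '2024-01-01'
"""

# Semantic search (Qdrant): nearest stored chunks to a query embedding.
def similar_chunks(query_vector: list[float], limit: int = 5):
    return qdrant.search(collection_name="chunks",
                         query_vector=query_vector, limit=limit)
```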
Can it scale to hundreds of documents per day?
Yes. But only with smart design.
At 500 documents per day, you need to:
- Avoid unnecessary compute work (see the dedup sketch after this list)
- Automate data cleaning and chunking
- Use lightweight tools that won’t balloon your cloud bills
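On the first point, a cheap safeguard is to fingerprint every cleaned document and skip anything already ingested, so nothing gets re-embedded or re-summarized. A minimal sketch, with an in-memory set standing in for what would realistically be a unique column in PostgreSQL:

```python
import hashlib

seen: set[str] = set()  # in production: a UNIQUE fingerprint column in PostgreSQL

def is_new(cleaned_text: str) -> bool:
    """Return False for documents we have already embedded and summarized."""
    fingerprint = hashlib.sha256(cleaned_text.encode("utf-8")).hexdigest()
    if fingerprint in seen:
        return False
    seen.add(fingerprint)
    return True
```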
Those constraints are why open source and containerized deployment are key: low overhead, high control.
The system can run on a basic VPS or scale up on AWS or Google Cloud with minimal friction.
If needed, we can isolate deployments (e.g., for clients with strict data security needs) without changing the core design.
What’s next in the build?
In Part 3 and beyond, I’ll share:
- How I collect content from different sources and keep it clean
- How metadata is extracted and normalized, and why that improves everything
- How embeddings are generated and stored, and the trade-offs between quality and speed
- How I summarize articles with LLMs, and how to get useful summaries instead of generic ones
All shared openly, with the same goal: build tools that bring clarity, not complexity.