Building a Semantic Note Network: How Vector Search Turns Isolated Notes into a Knowledge Graph
Part 3 of 5 · Obsidian Notes Pipeline

1,024 notes. Zero manual links. 2,757 bidirectional connections discovered automatically.

By the Dotzlaw Team



Figure 1 — Initial disconnected state (left) vs. semantic knowledge graph (right). 1,024 notes, zero manual links, 2,757 auto-connections.

The Transformation

1,024 isolated notes. 2,757 bidirectional links. A knowledge graph where every note connects to 3-5 semantically similar notes — all discovered automatically.

Before we built this system, the vault looked like a galaxy of orphans. Open Obsidian’s graph view and you’d see clusters of notes huddled together by folder, with vast empty space between them. A note about RAG architectures sat in one corner. A note about vector databases floated in another. A third about LangChain drifted somewhere else entirely. They were obviously related — but nothing connected them.


Figure 2 — The disconnected state: related concept clusters (RAG Architectures, Vector Databases) sit in isolated corners with no awareness of each other.

After running the semantic linking pipeline, the graph view transformed. Those same notes now form a dense web of blue lines, every dot radiating connections outward. Click into any note and you can see exactly what else in your vault relates to it. The note about RAG architectures links to vector databases, which links to embedding models, which links to LangChain retrieval patterns. Knowledge that previously required searching or remembering now surfaces through navigation.

The key insight behind this transformation: when we link Note A to Note B, we also link Note B back to Note A. Bidirectional linking is what turns a flat collection of files into a navigable knowledge graph.


How Semantic Similarity Works

Two notes are related if they’re about similar things. “Similar things” is fuzzy for humans but precise for math: convert notes to vectors (embeddings) and measure the angle between them.

Note A: "RAG architectures combine retrieval and generation..."
    --> embed --> [0.23, -0.45, 0.78, 0.12, ...]   (1536 dimensions)

Note B: "Vector databases enable semantic search..."
    --> embed --> [0.21, -0.42, 0.75, 0.15, ...]   (1536 dimensions)

Cosine similarity: 0.87  <-- These notes are related!
Each note becomes a point in 1,536-dimensional space. Notes about similar topics cluster near each other. A note about Docker Compose will land far from a note about trading psychology, but close to a note about container orchestration. Cosine similarity measures how closely two vectors point in the same direction: 1.0 means they point the same way, 0.0 means they are orthogonal, i.e. unrelated.

We use OpenAI’s text-embedding-3-small model to generate these vectors. It’s fast, cheap, and produces embeddings that capture semantic meaning well enough to find non-obvious connections across a thousand-note vault.
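The similarity score itself is just linear algebra: a dot product normalized by the vectors' lengths. A minimal pure-Python sketch, using the truncated four-dimensional previews from the diagram above:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Measure how closely two vectors point in the same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Truncated 4-dimensional previews of the example embeddings above.
note_a = [0.23, -0.45, 0.78, 0.12]
note_b = [0.21, -0.42, 0.75, 0.15]

score = cosine_similarity(note_a, note_b)
```

On these four-dimensional previews the vectors are nearly parallel, so the score lands close to 1.0; it is the full 1,536-dimensional embeddings that produce the 0.87 in the example.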


Figure 3 — Vector representation and cosine similarity. Related topics (Docker Compose, Container Orchestration, PostgreSQL) cluster tightly in vector space while unrelated topics (Trading Psychology) are distant.

KEY INSIGHT: What you embed matters more than how you embed it. Title + description + tags + a content preview produces dramatically better similarity matches than embedding the raw note body, because frontmatter captures the intent of a note while the body contains noise.


Building the Index

Every note in the vault gets indexed, but we don’t embed the entire note. Raw body text is noisy — full of filler words, code blocks, and formatting artifacts that dilute the semantic signal. Instead, we extract the parts that carry the most meaning.

def build_embedding_text(frontmatter: dict, body: str) -> str:
    """Build text for embedding from note components."""
    parts = []

    # Title carries heavy semantic weight
    if title := frontmatter.get("title"):
        parts.append(f"Title: {title}")

    # Description is a concise summary
    if desc := frontmatter.get("description"):
        parts.append(f"Description: {desc}")

    # Tags indicate topic areas
    if tags := frontmatter.get("tags"):
        tag_str = ", ".join(tags)
        parts.append(f"Topics: {tag_str}")

    # First 500 chars of body for context
    if body:
        preview = body[:500].strip()
        parts.append(f"Content: {preview}")

    return " | ".join(parts)

This produces embedding text like: Title: Building RAG Systems | Description: A guide to retrieval-augmented generation | Topics: ai/rag, ai/llm, coding/python | Content: RAG combines the power of large language models with external knowledge retrieval...

The title carries the heaviest semantic weight. Tags act as topic classifiers. The description provides a concise summary. And the first 500 characters of body text add just enough context without drowning the signal in noise. This composition consistently outperformed full-body embeddings in our similarity testing.


Figure 4 — Signal vs. noise: optimizing embedding input by filtering raw body text down to high-semantic-weight components. Intent outperforms content.

The generated embedding gets stored in Qdrant, a self-hosted vector database running on our Proxmox server. We chose it because our notes contain personal and business information — keeping the vector database on-premise means the data never leaves our network. Alongside the vector, we store metadata (file path, title, tags, content hash) as a Qdrant payload, so search results come back with everything we need to create links.
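The payload side of that storage step is plain dictionary construction. A sketch under assumptions: the field names, collection name, and helper are illustrative, not the authors' actual schema, and the Qdrant call itself sits in comments because it needs the `qdrant-client` package and a running instance:

```python
import hashlib

def make_payload(path: str, title: str, tags: list[str], content: str) -> dict:
    """Metadata stored alongside the vector so search results are self-describing."""
    return {
        "path": path,
        "title": title,
        "tags": tags,
        # Content hash supports incremental re-indexing later.
        "content_hash": hashlib.sha256(content.encode("utf-8")).hexdigest(),
    }

payload = make_payload(
    "notes/Building RAG Systems.md",
    "Building RAG Systems",
    ["ai/rag", "ai/llm"],
    "RAG combines retrieval and generation...",
)

# Hypothetical upsert with qdrant-client (assumes a server at localhost:6333):
# from qdrant_client import QdrantClient
# from qdrant_client.models import PointStruct
# client = QdrantClient(url="http://localhost:6333")
# client.upsert(collection_name="notes",
#               points=[PointStruct(id=1, vector=embedding, payload=payload)])
```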


Figure 5 — Architectural flow of the privacy-focused embedding stack. Personal data stays on-premise; only transient text for embedding touches the API.

We also track content hashes in PostgreSQL to support incremental indexing. When re-indexing the vault, only notes whose content has actually changed get re-embedded. This makes a full vault re-index fast — unchanged notes are skipped entirely.
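The skip logic reduces to a hash comparison. A minimal sketch, with an in-memory dict standing in for the PostgreSQL hash table (the function name is an assumption):

```python
import hashlib

def needs_reindex(content: str, stored_hashes: dict[str, str], path: str) -> bool:
    """Re-embed only when the note's content hash has changed."""
    new_hash = hashlib.sha256(content.encode("utf-8")).hexdigest()
    return stored_hashes.get(path) != new_hash

# In-memory stand-in for the PostgreSQL hash table.
stored = {"notes/a.md": hashlib.sha256(b"unchanged body").hexdigest()}
```

An unchanged note is skipped, an edited note is re-embedded, and a path with no stored hash (a new note) always qualifies.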


Finding Similar Notes

When we need to find notes related to a given note, we generate an embedding for it and query Qdrant for the nearest neighbors. The query excludes the source note itself and filters results by a similarity threshold.
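In the real pipeline the exclusion and threshold live in the Qdrant query itself; this pure-Python sketch over mock scored hits shows the same logic, with tuples standing in for Qdrant's results:

```python
def filter_hits(hits, source_path, threshold=0.70, limit=10):
    """Keep the closest notes above the similarity threshold, excluding the source."""
    related = [
        (path, score)
        for path, score in hits
        if path != source_path and score >= threshold
    ]
    # Hits are assumed already sorted by score descending, as Qdrant returns them.
    return related[:limit]

hits = [
    ("notes/rag.md", 0.99),         # the source note itself
    ("notes/vector-db.md", 0.84),
    ("notes/langchain.md", 0.73),
    ("notes/philosophy.md", 0.41),  # below threshold
]
related = filter_hits(hits, source_path="notes/rag.md")
```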

The 0.70 Threshold

The threshold determines the quality-quantity tradeoff. We tuned it empirically across our vault:

| Threshold | Results |
| --- | --- |
| 0.90+ | Almost no matches (too strict) |
| 0.80-0.90 | Very high quality, few matches |
| 0.70-0.80 | Good balance of quality and quantity |
| 0.60-0.70 | More matches, some noise |
| Below 0.60 | Too many irrelevant matches |


Figure 6 — Impact of similarity threshold on connection quality. The 0.70-0.80 sweet spot discovers cross-domain connections a human might miss.

At 0.70, the matches are genuinely related. You won’t get a note about Docker linked to a note about philosophy. But you will get a note about “Testing AI Agents” linked to “Integration Testing Best Practices” — a cross-domain connection a human might miss but that makes perfect sense once you see it.

We cap results at 10 similar notes per query, though most notes return 3-5 matches above the threshold. This natural distribution means the system self-regulates: highly specific notes get fewer links, broad topics get more.


Bidirectional Linking

This is the key differentiator. Most similarity systems stop at “here are notes related to X.” We go further: when we link Note A to Note B, we also write a link from Note B back to Note A.

KEY INSIGHT: Bidirectional links are what turn a collection of notes into a knowledge graph. If A relates to B, B relates to A — always link both directions. Without this, you get a tree. With it, you get a web.


Figure 7 — Unidirectional vs. bidirectional linking. One-way links create a tree; two-way links create the dense web that makes a knowledge graph navigable.

The logic is straightforward:

for each similar_note found:
    # Forward link: source --> target
    add "[[target]]" to source's Related Notes section

    # Backward link: target --> source
    add "[[source]]" to target's Related Notes section

The linking function finds or creates a “Related Notes” section at the bottom of each markdown file and appends Obsidian wiki links ([[Note Title]]). Before adding a link, it checks whether the link already exists to avoid duplicates. The result in each note looks like:

## Related Notes
- [[Building RAG Systems]]
- [[Vector Database Comparison]]
- [[LangChain Deep Dive]]

These are standard Obsidian wiki links. Click one and you navigate directly to the related note. Obsidian’s graph view picks them up automatically, which is what produces the dense visual web of connections.
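The find-or-create-and-append step is string handling. A minimal sketch with the duplicate check described above (the authors' actual section handling may differ):

```python
def add_related_link(markdown: str, title: str) -> str:
    """Append a wiki link under 'Related Notes', creating the section if needed."""
    link = f"- [[{title}]]"
    if link in markdown:
        return markdown  # skip duplicates
    if "## Related Notes" not in markdown:
        markdown = markdown.rstrip() + "\n\n## Related Notes\n"
    return markdown.rstrip() + "\n" + link + "\n"

note = "# Docker Compose Patterns\n\nSome body text.\n"
note = add_related_link(note, "Container Orchestration")
note = add_related_link(note, "Container Orchestration")  # second call is a no-op
```

The batch script calls this twice per match, once on each note's file, which is what writes both directions of the link.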

Self-Reference Detection

A subtle edge case nearly undermined the whole system: notes linking to themselves. A note about “Building RAG Systems” might exist as Building-RAG-Systems.md or Building_RAG_Systems.md or Building RAG Systems.md, depending on how the filename was sanitized during creation. Without careful matching, the note would appear as its own top similarity match — because nothing is more similar to a note than itself.

We handle this at two levels. First, we exclude the source note’s file path from the Qdrant query, which catches exact path matches. But that’s not enough. The same note might be referenced by a slightly different path or a truncated filename. So we also compare sanitized versions of the title and link target — stripping special characters, normalizing spaces, and comparing the first 50 characters to catch truncation. This layered approach catches self-references even when the title, filename, and stored path don’t match exactly.
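The sanitized comparison can be sketched as follows; the normalization rules (strip special characters, collapse spaces, compare the first 50 characters) come from the description above, while the helper names and the `.md` suffix handling are assumptions for illustration:

```python
import re

def sanitize(name: str) -> str:
    """Normalize a title or filename for comparison."""
    name = name.removesuffix(".md")
    name = re.sub(r"[^a-zA-Z0-9 ]", " ", name)   # strip special characters
    name = re.sub(r"\s+", " ", name).strip().lower()
    return name[:50]                              # tolerate truncated filenames

def is_self_reference(source_title: str, candidate: str) -> bool:
    """True when a search hit is really the source note under another name."""
    return sanitize(source_title) == sanitize(candidate)
```

Under these rules, `Building-RAG-Systems.md`, `Building_RAG_Systems.md`, and `Building RAG Systems.md` all normalize to the same string and are rejected as self-references.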


Figure 8 — The self-reference loop bug and its multi-layered fix. Filename variations can trick vector search into linking a note to itself.


The Results

We ran the batch linking script on our entire vault: 1,024 files processed, 2,757 links added, average 2.7 links per note. Processing time: about 15 minutes.

| Metric | Value |
| --- | --- |
| Files processed | 1,024 |
| Links added | 2,757 |
| Average links per note | 2.7 |
| Processing time | ~15 minutes |


Figure 9 — The result: a connected brain. Every dot radiates connections. The blue web represents 2,757 relationships discovered entirely by algorithm.

What those numbers mean in practice: every note in the vault now connects to at least one other note, and most connect to three to five. The connections surface relationships that aren’t obvious from titles or folder structure alone.

Some examples from the vault:

| Note | Auto-Linked To |
| --- | --- |
| RAG Architecture Deep Dive | Vector Database Comparison, LangChain Retrieval, Embedding Models |
| Trading Psychology | Risk Management, Journal Review Process, Emotional Discipline |
| Docker Compose Patterns | Container Orchestration, Development Environment, PostgreSQL Setup |


Figure 10 — The UI transformation in Obsidian: before (manual metadata only) vs. after (automated WikiLinks enabling immediate knowledge surfing).

The system finds connections humans would miss. It links across topic boundaries and folder structures. And because new notes get linked automatically on save — the same embed-search-link pipeline runs every time a note is created — the knowledge graph grows denser over time without any manual effort.


Figure 11 — The automated curation cycle that scales the knowledge graph. Broad topics accumulate many links; niche topics get few — the system self-regulates.

KEY INSIGHT: The 0.70 similarity threshold is where signal separates from noise. Below it, you get spurious connections that erode trust. Above 0.80, you miss the cross-domain links that make a knowledge graph valuable. The sweet spot produces connections you didn’t expect but immediately recognize as correct.


Figure 12 — Navigation over search: the semantic note network surfaces knowledge through interconnection rather than recall.


The Series

This is Part 3 of a 5-part series on building an AI-powered knowledge management system:

  1. From YouTube to Knowledge Graph — Turning 1,000+ videos into an interconnected knowledge base for $1.50
  2. Anthropic Batch API in Production — 50% cost savings at scale, and the bug that almost corrupted everything
  3. Building a Semantic Note Network (this article) — Vector search turned 1,024 isolated notes into a dense knowledge graph
  4. Obsidian Vault Curation at Scale — Three years of tag chaos, fixed in 30 minutes for $1.50
  5. Ask Your Vault Anything — A RAG chatbot that answers from your notes in 2.5 seconds

Next: Obsidian Vault Curation at Scale — What happens when you hand an AI 1,280 chaotic tags, including a hex color that somehow became a category

https://dotzlaw.com/insights/obsidian-notes-03/
Author: Gary Dotzlaw, Katrina Dotzlaw, Ryan Dotzlaw
Published: 2026-03-12
License: CC BY-NC-SA 4.0