Knowledge MCP

Apache Camel documentation search via hybrid semantic search

The Knowledge MCP server provides AI agents with real-time access to Apache Camel documentation — component references, migration guides, CVE advisories, release notes, and JIRA issues. Instead of relying on potentially outdated training data, agents query a 166,973-document index using hybrid semantic search.

5 MCP Tools

Hybrid Search Algorithm

The Knowledge MCP uses a two-signal search combining keyword precision with semantic understanding:

BM25 (20% weight)

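Keyword matching: exact term lookup scored with BM25, a ranking function from the TF-IDF family that adds term-frequency saturation and document-length normalization.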

Best for:

  • Exact component names (kafka, http)
  • CVE identifiers (CVE-2024-22369)
  • JIRA issue IDs (CAMEL-22784)
  • Property names (autoOffsetReset)

Without BM25, searching for CAMEL-22784 would return semantically similar but wrong results.
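To make the keyword signal concrete, here is a minimal sketch of the textbook BM25 formula for a single term. It is illustrative only: the class and method names are hypothetical, the parameters k1 = 1.2 and b = 0.75 are the conventional defaults, and Lucene's own implementation may differ in details.

```java
/** Textbook single-term BM25 (illustrative sketch, not the project's code). */
final class Bm25Sketch {
    // Conventional defaults; assumed here, not taken from the Knowledge MCP source.
    static final double K1 = 1.2;   // term-frequency saturation
    static final double B = 0.75;   // strength of document-length normalization

    /** Inverse document frequency: the rarer the term, the larger the value. */
    static double idf(long docCount, long docFreq) {
        return Math.log(1.0 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
    }

    /** BM25 score of one term in one document. */
    static double score(double termFreq, double docLen, double avgDocLen,
                        long docCount, long docFreq) {
        // Documents longer than average are penalized, shorter ones boosted.
        double lengthNorm = 1.0 - B + B * (docLen / avgDocLen);
        // Grows with term frequency but flattens out (saturation).
        double tfPart = termFreq * (K1 + 1.0) / (termFreq + K1 * lengthNorm);
        return idf(docCount, docFreq) * tfPart;
    }
}
```

A term that appears in only a handful of the 166,973 documents, such as a specific JIRA ID, gets a very high IDF, which is why an exact identifier match tends to dominate the keyword ranking.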

KNN Vector (80% weight)

Semantic similarity: 384-dimensional vector embeddings produced by the Granite embedding model.

Best for:

  • Natural language questions (“how do I configure SSL?”)
  • Conceptual queries (“error handling best practices”)
  • Cross-reference discovery (“components similar to Kafka”)

Without vector search, misspelled or rephrased questions could return few or no results, because keyword matching only finds documents that contain the query terms.
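The 20/80 blend of the two signals can be sketched as a weighted sum. The source does not say how raw BM25 and KNN scores are brought onto a common scale, so the min-max normalization below is an assumption, and the class name and document IDs are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Weighted blend of keyword and vector scores (illustrative sketch). */
final class HybridBlendSketch {
    static final double BM25_WEIGHT = 0.2;
    static final double KNN_WEIGHT = 0.8;

    /** Min-max normalize scores to [0, 1]. Assumed; the real scaling is not documented. */
    static Map<String, Double> normalize(Map<String, Double> scores) {
        Map<String, Double> out = new HashMap<>();
        if (scores.isEmpty()) return out;
        double min = Collections.min(scores.values());
        double max = Collections.max(scores.values());
        for (Map.Entry<String, Double> e : scores.entrySet()) {
            out.put(e.getKey(), max == min ? 1.0 : (e.getValue() - min) / (max - min));
        }
        return out;
    }

    /** Returns document IDs ordered by blended score, best first. */
    static List<String> rank(Map<String, Double> bm25, Map<String, Double> knn) {
        Map<String, Double> combined = new HashMap<>();
        // A document missing from one result list contributes 0 for that signal.
        normalize(bm25).forEach((id, s) -> combined.merge(id, BM25_WEIGHT * s, Double::sum));
        normalize(knn).forEach((id, s) -> combined.merge(id, KNN_WEIGHT * s, Double::sum));
        List<String> ids = new ArrayList<>(combined.keySet());
        ids.sort((x, y) -> {
            int c = Double.compare(combined.get(y), combined.get(x)); // descending score
            return c != 0 ? c : x.compareTo(y);                       // stable tie-break
        });
        return ids;
    }
}
```

With these weights, a document that leads the vector ranking usually outranks one that only matches keywords, while a strong keyword hit still breaks ties among semantically similar candidates.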

What’s Indexed

70,798 documents — component reference pages across multiple Apache Camel versions.

Each component doc includes:

  • URI syntax and options
  • Producer/consumer properties
  • Code examples (Java DSL, XML, YAML)
  • Related EIPs and data formats

186 CVE advisories from the Apache Camel security page.

Each CVE includes:

  • CVE identifier and description
  • CVSS score and CWE classification
  • Affected versions
  • Fixed versions

104 release notes covering Apache Camel releases.

Each includes:

  • New features and improvements
  • Bug fixes with JIRA references
  • Breaking changes and migration notes
  • Dependency updates

~96,000 additional documents including:

  • Migration guides (2.x → 3.x → 4.x)
  • EIP pattern documentation
  • User manual chapters
  • Getting started guides
  • Best practices

Embedding Model

| Property | Value |
|---|---|
| Model | granite-embedding-small-english-r2 |
| Quantization | Q8 (ONNX) |
| Dimensions | 384 |
| Context window | 8,192 tokens |
| Size | 52 MB |
| Architecture | ModernBERT |

The model runs locally via ONNX Runtime — no external API calls, no data leaves the machine.
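Once text is embedded, "semantic similarity" reduces to comparing two 384-dimensional vectors. The sketch below uses cosine similarity, which is a common choice for embedding models; the source does not state which similarity function the index is configured with, so treat that as an assumption. The class name is hypothetical.

```java
/** Cosine similarity between two embedding vectors (illustrative sketch). */
final class CosineSketch {
    /** Returns a value in [-1, 1]; higher means the vectors point in closer directions. */
    static double cosine(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("dimension mismatch");
        }
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += (double) a[i] * b[i];
            normA += (double) a[i] * a[i];
            normB += (double) b[i] * b[i];
        }
        // Note: yields NaN if either vector is all zeros.
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```

Because the score depends only on direction and not on magnitude, two passages phrased differently but about the same topic can still score close to 1.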

Index Storage

The knowledge index is a pre-built Apache Lucene 9.12.1 index shipped as a Maven artifact:

| Property | Value |
|---|---|
| Storage engine | Apache Lucene 9.12.1 |
| Index size | 472 MB (88 segment files) |
| Vector storage | KnnFloatVectorField (384-dim per document) |
| Total documents | 166,973 |

Why Lucene?

Camel-Kit chose Lucene over vector databases (Pinecone, Weaviate, Chroma, Milvus) and full search platforms (Elasticsearch, OpenSearch) for specific reasons:

  • Zero infrastructure — Lucene is an embedded library, not a server. No Docker containers, no ports, no configuration. The index loads from the classpath at JVM startup. This keeps the MCP server self-contained — one JAR, one process.

  • Native hybrid search — Lucene 9.x supports both BM25 text search and KnnFloatVectorField vector search in the same index. No need for two separate systems or a coordination layer. The 20/80 BM25+KNN blend runs in a single query.

  • Pre-built, portable index — The index is built once by the indexer and shipped as a Maven artifact. Users don’t need to run an indexer or download docs — the knowledge is embedded in the JAR. This makes deployment trivial: jbang org.apache.camel:camel-jbang-mcp:{version}:runner and it’s ready.

  • Java ecosystem alignment — Camel-Kit is a Java/JBang project. Lucene is a Java library with no native dependencies (except ONNX for embeddings). No Python, no gRPC, no REST clients needed.

  • Proven at scale — 166,973 documents with 384-dim vectors, hybrid search under 50ms on commodity hardware. Lucene powers Wikipedia, Stack Overflow, and Elasticsearch. The scale is well within its comfort zone.

The tradeoff: no built-in replication or distributed search. But for a single-user MCP server running locally, that’s not needed.

The index module has no Java code — it’s a pure resource artifact containing the pre-built Lucene segments. The MCP server loads it at startup from the classpath.

Rebuild the index:

mvn package -pl camel-kit-knowledge/index -Prebuild-index -Drevision=$(date +%Y%m%d%H%M) -am

This triggers the indexer to re-crawl Apache Camel documentation, re-embed with the Granite model, and write new Lucene segments.

Knowledge repo structure:

| Module | Purpose |
|---|---|
| schema | Lucene field definitions (KnowledgeFields, KnowledgeDocument) |
| embedding | ONNX model loading and vector generation |
| indexer | Document crawling, parsing, chunking, and index building |
| index | Pre-built Lucene index artifact (no code) |
| mcp | Quarkus MCP server exposing 5 search tools |

/camel-knowledge Skill

The /camel-knowledge slash command is a prescriptive Q&A layer over the Knowledge MCP. It routes user questions to the appropriate tool:

| Question Type | Tool Used |
|---|---|
| “What options does camel-kafka have?” | camel_docs_component_info |
| “How do I configure SSL for HTTP?” | camel_docs_search |
| “Are there CVEs affecting camel-sql?” | camel_docs_cve_search |
| “What changed in Camel 4.18?” | camel_docs_release_info |
| “Was CAMEL-22784 fixed?” | camel_docs_jira_lookup |

The skill works identically across all 5 AI agents — entirely MCP-driven, no agent-specific logic.
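The routing table can be pictured as a small decision function. The source does not describe how the skill actually picks a tool (it may be driven entirely by the agent following the skill's instructions), so the keyword rules below are purely hypothetical; only the tool names come from the table.

```java
import java.util.Locale;
import java.util.regex.Pattern;

/** Hypothetical keyword router mirroring the question-to-tool table (illustrative sketch). */
final class KnowledgeRouterSketch {
    // Assumed patterns; not taken from the skill definition.
    private static final Pattern JIRA_ID = Pattern.compile("\\bCAMEL-\\d+\\b");
    private static final Pattern CVE = Pattern.compile("\\bCVEs?\\b", Pattern.CASE_INSENSITIVE);

    static String route(String question) {
        String q = question.toLowerCase(Locale.ROOT);
        // Most specific signals first: explicit identifiers beat general wording.
        if (JIRA_ID.matcher(question).find()) return "camel_docs_jira_lookup";
        if (CVE.matcher(question).find()) return "camel_docs_cve_search";
        if (q.contains("what changed") || q.contains("release")) return "camel_docs_release_info";
        if (q.contains("options") || q.contains("properties")) return "camel_docs_component_info";
        // Everything else falls through to general hybrid search.
        return "camel_docs_search";
    }
}
```

Checking identifiers before general wording reflects the same idea as the BM25 signal: an exact JIRA or CVE reference is a stronger hint than the surrounding phrasing.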

Next Steps