DEV Community 1h ago

Knowledge-and-Memory-Management v0.0.2: Portable Knowledge Collection and Memory Management

Collection Capabilities

The knowledge module ingests from three primary sources:

Web: Adaptive scraping with headless browser support for JavaScript-heavy sites, tree-based parsing for static HTML, and built-in rate limiting and robots.txt compliance.
Video: Transcript extraction from platforms like YouTube via API or subtitle heuristics, preserving metadata such as title, duration, and timestamps.
Articles: Readability-driven extraction of long-form content, including author, publication date, and tags, normalized into a structured format.

Each collector returns a uniform data object, which the memory system indexes before storage.

Memory Management with $AGENT_HOME

Storage now relies entirely on $AGENT_HOME, which defaults to ~/agent_home but is overridable via environment variables. The memory system maintains two layers:

Raw content at $AGENT_HOME/data/collections/ (JSON/text snapshots).
Processed memory at $AGENT_HOME/memory/ (vector index with embeddings and metadata).

Deduplication uses content hashes to merge identical sources, and nightly maintenance tasks prune outdated entries. This structure supports straightforward backup, migration, or Docker volume mounting.

Workflow Example

The following snippet collects a web article, stores it in memory, and performs a retrieval query:

from agent.knowledge import Collector
from agent.memory import MemoryManager

# Initialize with portable paths
collector = Collector(source="web", storage="$AGENT_HOME/data")
memory = MemoryManager(index="$AGENT_HOME/memory")

# Crawl and extract content
content = collector.crawl("https://example.com/technical-article")

# Create and persist memory entry
entry = memory.create_entry(
    content=content,
    source="article",
    tags=["distributed-systems", "tutorial"]
)
memory.save(entry)

# Semantic search over collected knowledge
results = memory.search("replication algorithms")
for r in results:
    print(f"Score: {r.score:.2f} | Summary: {r.summary[:120]}...")

The Collector handles scraping and normalization; MemoryManager manages indexing and retrieval. Both operate relative to $AGENT_HOME for zero-config portability.

Why $AGENT_HOME Matters

Hardcoded paths are a common source of friction when moving from local development to staging or production. The $AGENT_HOME variable resolves this cleanly. Developers can set it to any directory during testing, and operations teams can redirect it in production without touching code. In containerized setups, mounting a volume to $AGENT_HOME ensures persistent memory across restarts.

Extensibility and Performance

Adding a new source requires subclassing the base Collector and implementing extract(). The memory system automatically ingests new data types. The vector index uses HNSW for fast semantic search, and collection pipelines are fully asynchronous to handle concurrent crawling.

This release provides a solid foundation for agent developers who need reliable, environment-agnostic knowledge management. The three collection sources cover common ingestion needs, and the portable path architecture simplifies deployment across diverse infrastructure.

Read on DEV Community ↗ ← Back to News