Data Preparation

DeepTutor provides demo knowledge bases and sample questions to help you get started quickly.

Demo Knowledge Bases

We provide two pre-built knowledge bases on Google Drive:

1. Research Papers Collection

📄 5 Research Papers (20-50 pages each)

A curated collection of recent research papers from our lab on retrieval-augmented generation (RAG) and agents.

Included Papers:

Best for: Research scenarios, broad knowledge coverage

2. Data Science Textbook

📚 8 Chapters, 296 Pages

A comprehensive deep learning textbook from UC Berkeley.

Source: Deep Representation Learning Book

Topics Covered:

  • Neural Network Fundamentals
  • Representation Learning
  • Deep Learning Architectures
  • Advanced Topics

Best for: Learning scenarios, in-depth knowledge

Download & Setup

Step 1: Download

Visit our Google Drive folder and download:

  • knowledge_bases.zip - Pre-built knowledge bases with embeddings
  • questions.zip - Sample questions and usage examples (optional)

Step 2: Extract

Extract the downloaded files into the data/ directory:

DeepTutor/
└── data/
    ├── knowledge_bases/
    │   ├── research_papers/      # Research papers KB
    │   ├── data_science_book/    # Textbook KB
    │   └── kb_config.json        # Knowledge base config
    └── user/                     # User data (auto-created)
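
If you prefer to script the extraction, the following is a minimal sketch using Python's standard zipfile module. The helper name extract_archive is hypothetical, and the archive's internal layout is an assumption: the function returns the top-level names it found so you can confirm that knowledge_bases/ ended up where the tree above expects it.

```python
import zipfile
from pathlib import Path


def extract_archive(zip_path, dest_dir="data"):
    """Extract an archive into dest_dir and return its top-level entry names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
        # Archive member names always use "/" as the separator.
        return sorted({name.split("/")[0] for name in archive.namelist()})


# Example, run from the DeepTutor/ project root after downloading:
# print(extract_archive("knowledge_bases.zip"))
```

If the returned list does not include knowledge_bases, the archive is laid out differently than assumed here, and the extracted folders may need to be moved by hand.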

Step 3: Verify

Once the files are extracted, the demo knowledge bases are available automatically the next time you start DeepTutor.
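
To check the layout before starting DeepTutor, a small script along these lines may help. It is only a sketch: the helper name missing_kb_entries is hypothetical, and the expected entries are copied from the directory tree in Step 2 rather than from DeepTutor's own startup checks.

```python
from pathlib import Path

# Entries copied from the directory tree in Step 2.
EXPECTED = ("research_papers", "data_science_book", "kb_config.json")


def missing_kb_entries(project_root="."):
    """Return the expected demo entries that are absent under data/knowledge_bases/."""
    kb_dir = Path(project_root) / "data" / "knowledge_bases"
    return [name for name in EXPECTED if not (kb_dir / name).exists()]


# Example, run from the DeepTutor/ project root:
# print(missing_kb_entries() or "All demo knowledge base entries found")
```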

Embedding Compatibility

Our demo knowledge bases use text-embedding-3-large with dimensions = 3072.

If your embedding model produces vectors of a different dimension, the pre-built embeddings cannot be used with it, and you'll need to create your own knowledge base instead.
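
One way to check compatibility is to embed a short test string with your own model and compare the vector length against 3072. The sketch below shows only the comparison; the function name matches_demo_kb is hypothetical, and obtaining the sample vector depends on your embedding provider.

```python
DEMO_KB_DIMENSIONS = 3072  # text-embedding-3-large, as stated above


def matches_demo_kb(embedding):
    """Return True if a sample embedding has the demo knowledge bases' dimension."""
    return len(embedding) == DEMO_KB_DIMENSIONS


# Example: `sample` would be a vector returned by your own embedding model.
# if not matches_demo_kb(sample):
#     print("Dimension mismatch: create a custom knowledge base instead.")
```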

Creating Custom Knowledge Bases

Supported File Formats

Format     Extension   Notes
PDF        .pdf        Supports text extraction and layout analysis
Text       .txt        Plain text files
Markdown   .md         Markdown with formatting support
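
Before uploading a folder of documents, it can be useful to separate the files that match the supported formats from those that do not. The following is a minimal sketch; the helper name split_by_support is hypothetical, and matching extensions case-insensitively is an assumption rather than documented DeepTutor behavior.

```python
from pathlib import Path

# Extensions listed under Supported File Formats.
SUPPORTED_EXTENSIONS = {".pdf", ".txt", ".md"}


def split_by_support(paths):
    """Split paths into (supported, unsupported) lists based on file extension."""
    supported, unsupported = [], []
    for path in map(Path, paths):
        if path.suffix.lower() in SUPPORTED_EXTENSIONS:
            supported.append(path)
        else:
            unsupported.append(path)
    return supported, unsupported


# Example:
# ok, skipped = split_by_support(Path("my_docs").iterdir())
```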

Via Web Interface

  1. Navigate to http://localhost:3782/knowledge
  2. Click "New Knowledge Base"
  3. Enter a unique name for your knowledge base
  4. Upload your documents (single or batch upload)
  5. Wait for processing to complete

Processing Time

  • Small documents (< 10 pages): ~1 minute
  • Medium documents (10-100 pages): ~5-10 minutes
  • Large documents (100+ pages): May take longer

Via Command Line

# Initialize a new knowledge base with documents
python -m src.knowledge.start_kb init <kb_name> --docs <pdf_path>

# Add documents to existing knowledge base
python -m src.knowledge.add_documents <kb_name> --docs <new_document.pdf>
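
To load several documents in one go, the two commands above can be assembled programmatically. This is a sketch under assumptions: the helper name build_kb_commands is hypothetical, and because it is not stated whether --docs accepts more than one path, the sketch passes exactly one document per command.

```python
import sys


def build_kb_commands(kb_name, docs):
    """Build one init command for the first document and one add command per remaining document."""
    if not docs:
        raise ValueError("at least one document is required")
    first, *rest = docs
    commands = [
        [sys.executable, "-m", "src.knowledge.start_kb", "init", kb_name, "--docs", str(first)]
    ]
    for doc in rest:
        commands.append(
            [sys.executable, "-m", "src.knowledge.add_documents", kb_name, "--docs", str(doc)]
        )
    return commands
```

Each returned list can be passed to subprocess.run(command, check=True) from the DeepTutor/ project root, so that processing stops at the first failing document.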

Data Storage Structure

All user data is stored in the data/ directory:

data/
├── knowledge_bases/              # Knowledge base storage
│   └── <kb_name>/
│       ├── documents/            # Original documents
│       ├── chunks/               # Chunked content
│       ├── embeddings/           # Vector embeddings
│       └── graph/                # Knowledge graph data
└── user/                         # User activity data
    ├── solve/                    # Problem solving results
    ├── question/                 # Generated questions
    ├── research/                 # Research reports
    ├── notebook/                 # Notebook records
    └── logs/                     # System logs
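
For scripts that need to locate a knowledge base's files, the layout above maps directly onto paths. A minimal sketch, assuming the directory names shown in the tree; the helper name kb_paths is hypothetical and not part of DeepTutor.

```python
from pathlib import Path

# Subdirectories of each knowledge base, as shown in the tree above.
KB_SUBDIRS = ("documents", "chunks", "embeddings", "graph")


def kb_paths(kb_name, data_dir="data"):
    """Map each knowledge base subdirectory name to its path under data_dir."""
    base = Path(data_dir) / "knowledge_bases" / kb_name
    return {name: base / name for name in KB_SUBDIRS}


# Example:
# for name, path in kb_paths("research_papers").items():
#     print(name, path, path.exists())
```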

Released under the AGPL-3.0 License.