Getting Started

This guide walks through a complete workflow: find a book, download it, explore its structure, and search its text. Both CLI and Python examples use Pride and Prejudice (Project Gutenberg ID 1342).

Installation

Gutenbit is not published on PyPI yet, so start by trying the CLI directly from GitHub:

uvx --from git+https://github.com/keinan1/gutenbit gutenbit --help

Install it persistently once you want a normal gutenbit command:

uv tool install git+https://github.com/keinan1/gutenbit

Then run gutenbit --help. Remove it later with uv tool uninstall gutenbit. Gutenbit stores its database and catalog cache in a .gutenbit/ folder.

If you want to use the Python package inside a uv project instead of installing the CLI globally:

uv add git+https://github.com/keinan1/gutenbit

CLI walkthrough

Find a book

Search the Project Gutenberg catalog by author, title, subject, or language:

gutenbit catalog --author "Austen, Jane"
  Downloaded catalog from Project Gutenberg (English text corpus).
      ID  AUTHORS                                   TITLE
  ------  ----------------------------------------  -----
    1342  Austen, Jane                              Pride and Prejudice
     158  Austen, Jane                              Emma
     161  Austen, Jane                              Sense and Sensibility
     105  Austen, Jane                              Persuasion

Download and store

Pass one or more Project Gutenberg IDs to add:

gutenbit add 1342

The book's HTML is downloaded, parsed into paragraph-level chunks with structural metadata, and stored in a local SQLite database (.gutenbit/gutenbit.db by default).
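To make the storage step concrete, here is a minimal sketch of paragraph-level chunks landing in SQLite. The table name, column names, and the `store_chunks` helper are assumptions made for illustration, not gutenbit's actual schema or API.

```python
import sqlite3

# Hypothetical schema for illustration; gutenbit's real tables may differ.
SCHEMA = """
CREATE TABLE IF NOT EXISTS chunks (
    book_id  INTEGER NOT NULL,
    position INTEGER NOT NULL,   -- order of the chunk within the book
    kind     TEXT    NOT NULL,   -- e.g. 'heading' or 'text'
    div1     TEXT,               -- top-level heading the chunk sits under
    content  TEXT    NOT NULL,
    PRIMARY KEY (book_id, position)
)
"""


def store_chunks(conn, book_id, chunks):
    """Insert (kind, div1, content) tuples in reading order; return the row count."""
    conn.execute(SCHEMA)
    rows = [
        (book_id, position, kind, div1, content)
        for position, (kind, div1, content) in enumerate(chunks, start=1)
    ]
    # INSERT OR IGNORE keeps a repeated run from duplicating existing rows.
    conn.executemany("INSERT OR IGNORE INTO chunks VALUES (?, ?, ?, ?, ?)", rows)
    conn.commit()
    return len(rows)


conn = sqlite3.connect(":memory:")
stored = store_chunks(conn, 1342, [
    ("heading", "Chapter 1", "Chapter 1"),
    ("text", "Chapter 1", "It is a truth universally acknowledged..."),
])
count = conn.execute(
    "SELECT COUNT(*) FROM chunks WHERE book_id = 1342"
).fetchone()[0]
```

Keying rows on (book_id, position) is one simple way to make position-based reads cheap; the real database may organize this differently.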

Explore structure

View the table of contents with numbered sections:

gutenbit toc 1342

Each section number can be used with view --section to jump directly to that part of the book.

Read text

View the opening of the book:

gutenbit view 1342

Read a specific section:

gutenbit view 1342 --section 1 --forward 10

Read a full section:

gutenbit view 1342 --section 1 --all

Read from an exact chunk position:

gutenbit view 1342 --position 1 --forward 5

Read surrounding passage around a position or section start:

gutenbit view 1342 --position 1 --radius 2
gutenbit view 1342 --section 1 --radius 2

Use --forward for forward reading, --radius for a surrounding passage window, and --all for a full book or section. --all does not apply to --position.
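The difference between the two window styles can be sketched over a plain list. This assumes positions are 1-based and that a radius window includes the anchor chunk itself; both are assumptions, and the helper names are hypothetical rather than part of gutenbit.

```python
def forward_window(chunks, position, forward):
    """Return up to `forward` chunks starting at a 1-based position."""
    if position < 1:
        raise ValueError("position is assumed to be 1-based")
    start = position - 1
    return chunks[start:start + forward]


def radius_window(chunks, position, radius):
    """Return the chunk at `position` plus up to `radius` chunks on each side."""
    if position < 1:
        raise ValueError("position is assumed to be 1-based")
    index = position - 1
    start = max(0, index - radius)  # clamp at the start of the book
    return chunks[start:index + radius + 1]


chunks = [f"chunk {n}" for n in range(1, 11)]
opening = forward_window(chunks, position=1, forward=5)
around_start = radius_window(chunks, position=1, radius=2)
around_middle = radius_window(chunks, position=5, radius=2)
```

Under these assumptions a radius window near the start of a book is simply shorter, since there is nothing before the first chunk to include.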

Search

Full-text search runs across all stored books and targets text chunks by default:

gutenbit search "pride"

Search headings explicitly when needed:

gutenbit search "chapter" --book 1342 --kind heading

Narrow results to a single book:

gutenbit search "pride" --book 1342

Search for an exact phrase:

gutenbit search "truth universally acknowledged" --phrase

Search with nearby chunk context:

gutenbit search "truth universally acknowledged" --book 1342 --limit 3 --radius 1

All commands accept --json for machine-readable output.
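One way to consume that output from a script is to call the CLI through `subprocess` and decode the result. The `gutenbit_json` wrapper below is a hypothetical helper, and it makes no assumption about the shape of the JSON beyond it being valid.

```python
import json
import subprocess


def gutenbit_json(*args, run=subprocess.run):
    """Run `gutenbit <args> --json` and decode whatever JSON it prints.

    `run` is injectable so the wrapper can be exercised without the CLI installed.
    """
    completed = run(
        ["gutenbit", *args, "--json"],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    return json.loads(completed.stdout)


# Example call (requires gutenbit on PATH and book 1342 already added):
# hits = gutenbit_json("search", "pride", "--book", "1342")
```

Inspect the decoded value before relying on particular keys, since the output structure is not described here.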

Python walkthrough

Fetch the catalog

from gutenbit import Catalog

catalog = Catalog.fetch()
books = catalog.search(author="Austen, Jane")
for book in books[:5]:
    print(book.id, book.title)

The catalog is cached locally for two hours under .gutenbit/cache/, filtered to English text, and deduplicated by normalized title plus primary author, keeping the lowest Project Gutenberg ID as canonical. On the CLI, pass --refresh to force a redownload.
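The caching and deduplication rules described above can be sketched as follows. The `Entry` type, the normalization rule, and the "; " author separator are assumptions for illustration; gutenbit's own implementation may normalize differently.

```python
import re
import time
from typing import NamedTuple

CACHE_TTL_SECONDS = 2 * 60 * 60  # two hours, as stated above


class Entry(NamedTuple):
    id: int
    title: str
    authors: str  # hypothetical format: "; "-separated, primary author first


def is_fresh(fetched_at, now=None):
    """True while a cache written at `fetched_at` is under two hours old."""
    now = time.time() if now is None else now
    return now - fetched_at < CACHE_TTL_SECONDS


def normalize(text):
    """Lowercase and collapse punctuation and whitespace (an assumed rule)."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def dedupe(entries):
    """Keep the lowest ID for each (normalized title, primary author) pair."""
    canonical = {}
    for entry in entries:
        primary = entry.authors.split(";")[0]
        key = (normalize(entry.title), normalize(primary))
        kept = canonical.get(key)
        if kept is None or entry.id < kept.id:
            canonical[key] = entry
    return sorted(canonical.values(), key=lambda e: e.id)


entries = [
    Entry(99999, "Pride and prejudice", "Austen, Jane"),  # made-up duplicate ID
    Entry(1342, "Pride and Prejudice", "Austen, Jane"),
    Entry(158, "Emma", "Austen, Jane"),
]
unique = dedupe(entries)
```

Keeping the lowest ID means the result does not depend on the order in which catalog rows are read.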

Ingest books

from gutenbit import Database

with Database(".gutenbit/gutenbit.db") as db:
    db.ingest(books[:3])

ingest downloads each book's HTML, parses it into chunks, and stores everything in SQLite. Books already in the database are skipped. The snippets that follow assume db is still open, so run them inside the with block.

Search

results = db.search("pride")
for hit in results:
    print(f"{hit.title} | {hit.div1} | {hit.content[:80]}")

Results use BM25 rank ordering by default. Each SearchResult includes the matching text, its structural position (div1 through div4), book metadata, and a relevance score.
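For readers unfamiliar with BM25 ordering in SQLite, this is a minimal standalone sketch using the standard library's `sqlite3` module. It assumes an SQLite build with FTS5 enabled, and the `chunks_fts` table, toy rows, and `search` helper are illustrative rather than gutenbit's real schema or API.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# A one-column FTS5 table standing in for the real chunk index.
conn.execute("CREATE VIRTUAL TABLE chunks_fts USING fts5(content)")
conn.executemany(
    "INSERT INTO chunks_fts (content) VALUES (?)",
    [
        ("pride and vanity are different things",),  # toy rows, not book text
        ("a walk to the village",),
        ("pride pride pride",),
    ],
)


def search(conn, query, limit=10):
    """Return (rowid, content, score) rows, best match first.

    FTS5's bm25() returns lower (more negative) values for better matches,
    so ascending order puts the most relevant rows at the top.
    """
    return conn.execute(
        "SELECT rowid, content, bm25(chunks_fts) AS score "
        "FROM chunks_fts WHERE chunks_fts MATCH ? "
        "ORDER BY score LIMIT ?",
        (query, limit),
    ).fetchall()


hits = search(conn, "pride")
# Double quotes inside the query string request an exact phrase in FTS5.
phrase_hits = search(conn, '"pride and vanity"')
```

The score sign convention is easy to trip over: sorting descending would return the weakest matches first.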

Read structured chunks

# All chunks for a book
chunks = db.chunk_records(1342)

# Chunks in a specific section
section = db.chunks_by_div(1342, "Chapter 1")

# A window of chunks around a position
window = db.chunk_window(1342, position=50, around=2)

Full text

text = db.text(1342)
print(text[:500])

What just happened

The pipeline has four stages. The catalog provides book metadata and IDs. The downloader prefers official mirror HTML and falls back to the main site's HTML zip when needed. The chunker parses the HTML using its table of contents as a structural map, turning each paragraph into a discrete chunk with a position and a place in the book's heading hierarchy. The database stores chunks in SQLite with FTS5 indexing for fast full-text search with BM25 ranking.
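The chunking stage can be approximated with the standard library's `html.parser`. This sketch simplifies heavily: it treats each `<h2>` as a top-level heading rather than using the table of contents as a map, and the `ParagraphChunker` class and its field names are hypothetical.

```python
from html.parser import HTMLParser


class ParagraphChunker(HTMLParser):
    """Collect headings and paragraphs as ordered chunks (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.chunks = []     # dicts with position, kind, div1, content
        self._tag = None     # tag currently being captured, if any
        self._buffer = []    # text pieces seen inside that tag
        self._div1 = None    # text of the most recent <h2>

    def handle_starttag(self, tag, attrs):
        if tag in ("h2", "p"):
            self._tag = tag
            self._buffer = []

    def handle_data(self, data):
        if self._tag is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag != self._tag:
            return  # inline tags such as <i> do not end the chunk
        content = " ".join("".join(self._buffer).split())  # collapse whitespace
        self._tag = None
        if not content:
            return
        if tag == "h2":
            self._div1 = content
        self.chunks.append({
            "position": len(self.chunks) + 1,
            "kind": "heading" if tag == "h2" else "text",
            "div1": self._div1,
            "content": content,
        })


def chunk_html(html):
    """Parse an HTML string and return its chunks in reading order."""
    parser = ParagraphChunker()
    parser.feed(html)
    parser.close()
    return parser.chunks


sample = "<h2>Chapter 1</h2><p>First   paragraph.</p><p>Second <i>one</i>.</p>"
chunks = chunk_html(sample)
```

Each paragraph inherits the heading that precedes it, which is what lets a later query select every chunk under a given section.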