Getting Started
This guide walks through a complete workflow: find a book, download it, explore its structure, and search its text. Both CLI and Python examples use Pride and Prejudice (Project Gutenberg ID 1342).
Installation
Gutenbit is not published on PyPI yet, so start by trying the CLI directly from GitHub:
uvx --from git+https://github.com/keinan1/gutenbit gutenbit --help
Install it persistently once you want a normal gutenbit command:
uv tool install git+https://github.com/keinan1/gutenbit
Then run gutenbit --help. Remove it later with uv tool uninstall gutenbit.
Gutenbit stores its database and catalog cache in a .gutenbit/ folder.
If you want to use the Python package inside a uv project instead of installing the CLI globally:
uv add git+https://github.com/keinan1/gutenbit
CLI walkthrough
Find a book
Search the Project Gutenberg catalog by author, title, subject, or language:
gutenbit catalog --author "Austen, Jane"
Downloaded catalog from Project Gutenberg (English text corpus).
ID     AUTHORS                                  TITLE
------ ---------------------------------------- -----
1342   Austen, Jane                             Pride and Prejudice
158    Austen, Jane                             Emma
161    Austen, Jane                             Sense and Sensibility
105    Austen, Jane                             Persuasion
Download and store
Pass one or more Project Gutenberg IDs to add:
gutenbit add 1342
The book's HTML is downloaded, parsed into paragraph-level chunks with structural metadata, and stored in a local SQLite database (.gutenbit/gutenbit.db by default).
Explore structure
View the table of contents with numbered sections:
gutenbit toc 1342
Each section number can be used with view --section to jump directly to that part of the book.
Read text
View the opening of the book:
gutenbit view 1342
Read a specific section:
gutenbit view 1342 --section 1 --forward 10
Read a full section:
gutenbit view 1342 --section 1 --all
Read from an exact chunk position:
gutenbit view 1342 --position 1 --forward 5
Read the passage surrounding a position or section start:
gutenbit view 1342 --position 1 --radius 2
gutenbit view 1342 --section 1 --radius 2
Use --forward for forward reading, --radius for a surrounding passage window, and --all for a full book or section. --all does not apply to --position.
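The difference between the two windowing flags can be sketched in plain Python. This is an illustration only, not gutenbit's implementation: the helper names forward and around are hypothetical, and the sketch assumes that chunks sit in a list, that --forward counts the starting chunk, and that a window is clipped at the edges of the book.

```python
def forward(chunks, start, count):
    """Return `count` chunks beginning at index `start` (start included)."""
    return chunks[start:start + count]


def around(chunks, center, radius):
    """Return the chunk at `center` plus up to `radius` chunks on each side."""
    low = max(0, center - radius)  # clip at the start of the book
    return chunks[low:center + radius + 1]  # slicing clips at the end


# Stand-in data: ten chunk indices instead of real paragraphs.
chunks = list(range(10))

print(forward(chunks, 1, 5))  # [1, 2, 3, 4, 5]
print(around(chunks, 5, 2))   # [3, 4, 5, 6, 7]
print(around(chunks, 1, 2))   # [0, 1, 2, 3]
```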
Search
Run a full-text search across all stored books. By default, search targets text chunks:
gutenbit search "pride"
Search headings explicitly when needed:
gutenbit search "chapter" --book 1342 --kind heading
Narrow results to a single book:
gutenbit search "pride" --book 1342
Search for an exact phrase:
gutenbit search "truth universally acknowledged" --phrase
Search with nearby chunk context:
gutenbit search "truth universally acknowledged" --book 1342 --limit 3 --radius 1
All commands accept --json for machine-readable output.
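For scripting, the --json output can be consumed from Python with the standard library. The sketch below assumes only that a command prints a single JSON document to stdout when --json is passed. The run_json helper is hypothetical, and the shape of the parsed result is not specified here, so inspect it before relying on particular keys.

```python
import json
import subprocess


def run_json(argv):
    """Run a command with --json appended and return its parsed stdout.

    Raises subprocess.CalledProcessError if the command exits non-zero,
    and json.JSONDecodeError if stdout is not valid JSON.
    """
    completed = subprocess.run(
        [*argv, "--json"],
        capture_output=True,
        text=True,
        check=True,
    )
    return json.loads(completed.stdout)
```

For example, run_json(["gutenbit", "search", "pride", "--book", "1342"]) would return whatever structure the search command emits.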
Python walkthrough
Fetch the catalog
from gutenbit import Catalog
catalog = Catalog.fetch()
books = catalog.search(author="Austen, Jane")
for book in books[:5]:
    print(book.id, book.title)
The catalog is cached locally for two hours under .gutenbit/cache/. It is filtered to English text and deduplicated by normalized title plus primary author, keeping the lowest Project Gutenberg ID as the canonical entry. On the CLI, pass --refresh to force a redownload.
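The deduplication rule can be sketched in a few lines. This is an illustration under assumptions, not gutenbit's code: the Record type, the normalize function (case folding plus punctuation and whitespace collapsing), and the duplicate entry with ID 900001 are all made up for the example.

```python
import re
from typing import NamedTuple


class Record(NamedTuple):
    id: int
    title: str
    authors: tuple  # primary author first


def normalize(text):
    """Case-fold, replace punctuation with spaces, and collapse whitespace."""
    return " ".join(re.sub(r"[^\w\s]", " ", text.casefold()).split())


def dedupe(records):
    """Keep one record per (title, primary author), preferring the lowest ID."""
    canonical = {}
    for rec in records:
        primary = rec.authors[0] if rec.authors else ""
        key = (normalize(rec.title), normalize(primary))
        kept = canonical.get(key)
        if kept is None or rec.id < kept.id:
            canonical[key] = rec
    return sorted(canonical.values(), key=lambda rec: rec.id)


records = [
    Record(900001, "PRIDE AND PREJUDICE", ("Austen, Jane",)),  # made-up duplicate
    Record(1342, "Pride and Prejudice", ("Austen, Jane",)),
    Record(158, "Emma", ("Austen, Jane",)),
]

for rec in dedupe(records):
    print(rec.id, rec.title)
```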
Ingest books
from gutenbit import Database
with Database(".gutenbit/gutenbit.db") as db:
    db.ingest(books[:3])
ingest downloads each book's HTML, parses it into chunks, and stores everything in SQLite. Books already in the database are skipped.
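The skip-if-present behavior can be illustrated with the standard sqlite3 module. This is a sketch under assumptions, not gutenbit's schema or code: the books and chunks tables, their columns, and the open_db and ingest_once helpers are all hypothetical.

```python
import sqlite3


def open_db(path=":memory:"):
    """Open a database and create the illustrative tables if missing."""
    conn = sqlite3.connect(path)
    conn.executescript(
        """
        CREATE TABLE IF NOT EXISTS books (
            id INTEGER PRIMARY KEY,
            title TEXT NOT NULL
        );
        CREATE TABLE IF NOT EXISTS chunks (
            book_id INTEGER NOT NULL REFERENCES books(id),
            position INTEGER NOT NULL,
            content TEXT NOT NULL,
            PRIMARY KEY (book_id, position)
        );
        """
    )
    return conn


def ingest_once(conn, book_id, title, paragraphs):
    """Store a book and its chunks; return False if the book is already stored."""
    if conn.execute("SELECT 1 FROM books WHERE id = ?", (book_id,)).fetchone():
        return False
    with conn:  # one transaction: commit on success, roll back on error
        conn.execute("INSERT INTO books (id, title) VALUES (?, ?)", (book_id, title))
        conn.executemany(
            "INSERT INTO chunks (book_id, position, content) VALUES (?, ?, ?)",
            [(book_id, i, text) for i, text in enumerate(paragraphs)],
        )
    return True


conn = open_db()
paragraphs = ["First paragraph.", "Second paragraph."]
first = ingest_once(conn, 1342, "Pride and Prejudice", paragraphs)
second = ingest_once(conn, 1342, "Pride and Prejudice", paragraphs)
print(first, second)  # True False
```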
Search
results = db.search("pride")
for hit in results:
    print(f"{hit.title} | {hit.div1} | {hit.content[:80]}")
Results are ordered by BM25 rank by default. Each SearchResult includes the matching text, its structural position (div1 through div4), book metadata, and a relevance score.
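To see what BM25 ordering means in SQLite terms, here is a minimal sketch using an FTS5 table through the standard sqlite3 module. The table layout, the sample sentences, and the search helper are made up for illustration and are not gutenbit's schema. The sketch also assumes the local SQLite build includes FTS5.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE VIRTUAL TABLE chunks USING fts5("
    "content, book_id UNINDEXED, position UNINDEXED)"
)

# Made-up sample text; only the first two rows mention "pride".
samples = [
    "pride and more pride, nothing but pride",
    "a long made-up sentence that mentions pride only once among many "
    "other unrelated words and phrases",
    "a sentence about walking to the village",
    "another sentence about letters and visits",
    "a final sentence about dancing at the ball",
]
conn.executemany(
    "INSERT INTO chunks (content, book_id, position) VALUES (?, ?, ?)",
    [(text, 1342, i) for i, text in enumerate(samples)],
)


def search(conn, query, limit=10):
    """Return (position, content, score) rows, best match first.

    FTS5's bm25() returns smaller (more negative) values for better
    matches, so ascending order puts the most relevant row first.
    """
    sql = (
        "SELECT position, content, bm25(chunks) FROM chunks "
        "WHERE chunks MATCH ? ORDER BY bm25(chunks) LIMIT ?"
    )
    return conn.execute(sql, (query, limit)).fetchall()


for position, content, score in search(conn, "pride"):
    print(position, round(score, 3), content[:40])
```

The short row that repeats the term ranks ahead of the long row that mentions it once, which is the term-frequency and length normalization BM25 is known for.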
Read structured chunks
# All chunks for a book
chunks = db.chunk_records(1342)
# Chunks in a specific section
section = db.chunks_by_div(1342, "Chapter 1")
# A window of chunks around a position
window = db.chunk_window(1342, position=50, around=2)
Full text
text = db.text(1342)
print(text[:500])
What just happened
The pipeline has four stages. The catalog provides book metadata and IDs. The downloader prefers official mirror HTML and falls back to the main site's HTML zip when needed. The chunker parses the HTML using its table of contents as a structural map, turning each paragraph into a discrete chunk with a position and a place in the book's heading hierarchy. The database stores chunks in SQLite with FTS5 indexing for fast full-text search with BM25 ranking.
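As a rough illustration of the chunking stage, the sketch below walks an HTML string and emits one chunk per paragraph or heading, each tagged with its position and enclosing headings. It is not gutenbit's chunker: it infers hierarchy from h1 to h4 tags rather than from the table of contents, and the Chunk fields, the chunk_html function, and the zero-based positions are all assumptions.

```python
from dataclasses import dataclass
from html.parser import HTMLParser


@dataclass
class Chunk:
    position: int  # order of the chunk within the book (zero-based here)
    kind: str      # "heading" or "text"
    path: tuple    # enclosing headings, outermost first
    content: str


class _Chunker(HTMLParser):
    LEVELS = {"h1": 1, "h2": 2, "h3": 3, "h4": 4}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._path = []   # current heading stack
        self._tag = None  # block tag currently being captured
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag == "p" or tag in self.LEVELS:
            self._tag, self._buf = tag, []

    def handle_data(self, data):
        if self._tag is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag != self._tag:
            return  # inline tags such as <i> do not end the block
        self._tag = None
        text = " ".join("".join(self._buf).split())
        if not text:
            return
        kind = "text"
        if tag in self.LEVELS:
            kind = "heading"
            # Truncate the stack to the parent level, then push this heading.
            self._path = self._path[: self.LEVELS[tag] - 1] + [text]
        self.chunks.append(Chunk(len(self.chunks), kind, tuple(self._path), text))


def chunk_html(html):
    """Split an HTML string into heading and paragraph chunks."""
    parser = _Chunker()
    parser.feed(html)
    parser.close()
    return parser.chunks


sample = (
    "<h1>Book</h1><h2>Chapter 1</h2><p>First <i>para</i>.</p>"
    "<h2>Chapter 2</h2><p>Second.</p>"
)
for chunk in chunk_html(sample):
    print(chunk.position, chunk.kind, " > ".join(chunk.path), "|", chunk.content)
```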