Python API
Gutenbit exposes five public classes and one function from its top-level package.
Because the project is not published on PyPI yet, add it from GitHub when using it as a library:
uv add git+https://github.com/keinan1/gutenbit
from gutenbit import Catalog, BookRecord, Database, SearchResult, Chunk, chunk_html
Catalog
Catalog fetches and searches the Project Gutenberg metadata catalog.
Fetch
from gutenbit import Catalog
catalog = Catalog.fetch()
fetch() downloads the CSV catalog from Project Gutenberg, filters it to English text records, and deduplicates entries so each work maps to a single canonical ID (the lowest Gutenberg ID for that title/author pair). The result is a Catalog instance held in memory.
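The deduplication rule can be pictured with a small standalone sketch. It does not call gutenbit: `Row` and `dedupe_lowest_id` are hypothetical names, the rows are made-up sample data, and keying on the exact (title, authors) pair is an assumption about how works are matched.

```python
from typing import NamedTuple


class Row(NamedTuple):
    """Hypothetical stand-in for one catalog row (not a gutenbit type)."""
    id: int
    title: str
    authors: str


def dedupe_lowest_id(rows: list[Row]) -> tuple[list[Row], dict[int, int]]:
    """Keep the lowest-ID row per (title, authors) pair.

    Returns the surviving rows plus a map from every ID to its canonical ID.
    """
    canonical: dict[tuple[str, str], Row] = {}
    for row in rows:
        key = (row.title, row.authors)
        if key not in canonical or row.id < canonical[key].id:
            canonical[key] = row
    id_map = {row.id: canonical[(row.title, row.authors)].id for row in rows}
    kept = sorted(canonical.values(), key=lambda r: r.id)
    return kept, id_map


# Made-up sample rows: two editions of "Work A" and one of "Work B".
rows = [
    Row(200, "Work A", "Author X"),
    Row(100, "Work A", "Author X"),
    Row(150, "Work B", "Author Y"),
]
kept, id_map = dedupe_lowest_id(rows)
```

Here `kept` holds one row per work, and `id_map` sends the alternate edition 200 to the canonical ID 100.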
Search
results = catalog.search(author="Dickens")
results = catalog.search(title="Christmas", author="Dickens")
results = catalog.search(subject="Philosophy")
All filters use case-insensitive substring matching. When multiple filters are given, they combine with AND logic. Returns a list of BookRecord objects.
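As a rough illustration of these matching rules, the sketch below filters an in-memory list the same way. `Record` and this `search` function are hypothetical stand-ins written for the example, not gutenbit's implementation, and the records are sample data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """Hypothetical stand-in with a few BookRecord-like fields."""
    id: int
    title: str
    authors: str
    subjects: str


def search(records, *, title="", author="", subject=""):
    """Return records where every non-empty filter is a case-insensitive substring."""
    def contains(haystack: str, needle: str) -> bool:
        # An empty needle is a substring of everything, so unset filters pass.
        return needle.casefold() in haystack.casefold()

    return [
        r for r in records
        if contains(r.title, title)
        and contains(r.authors, author)
        and contains(r.subjects, subject)
    ]


# Sample data with made-up IDs.
records = [
    Record(1, "A Christmas Carol", "Dickens, Charles", "Christmas stories"),
    Record(2, "Oliver Twist", "Dickens, Charles", "Orphans -- Fiction"),
    Record(3, "Christmas Eve", "Browning, Robert", "Poetry"),
]
by_author = search(records, author="dickens")
by_both = search(records, title="CHRISTMAS", author="Dickens")
```

Adding a second filter can only narrow the result, which is what AND logic implies.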
Lookup
book = catalog.get(1342) # BookRecord or None
cid = catalog.canonical_id(1342) # canonical ID or None
canonical_id resolves alternate edition IDs to the canonical one.
BookRecord
A frozen dataclass with these fields:
| Field | Type | Description |
|---|---|---|
| id | int | Project Gutenberg ID |
| title | str | Book title |
| authors | str | Semicolon-separated author names |
| language | str | Language code (e.g. "en") |
| subjects | str | Semicolon-separated subjects |
| locc | str | Library of Congress Classification |
| bookshelves | str | Gutenberg bookshelves |
| issued | str | Publication date |
| type | str | Media type (e.g. "Text") |
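Because authors and subjects arrive as single semicolon-separated strings, a small helper is often handy. `split_field` below is a hypothetical helper, not part of gutenbit, and it assumes a plain ";" separator with optional surrounding whitespace.

```python
def split_field(value: str) -> list[str]:
    """Split a semicolon-separated field into trimmed, non-empty parts."""
    return [part.strip() for part in value.split(";") if part.strip()]


# Sample value in the documented semicolon-separated shape.
names = split_field("Marx, Karl; Engels, Friedrich")
```

With a real record this would be called as `split_field(book.authors)` or `split_field(book.subjects)`.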
Catalog policy
By default, fetch() applies a policy that keeps only English text and deduplicates by lowest ID per work. To customize:
from gutenbit.catalog import CatalogPolicy
policy = CatalogPolicy(
    allowed_language_codes=frozenset({"en", "fr"}),
    dedupe_strategy="none",
)
catalog = Catalog.fetch(policy=policy)
See the API Reference for full details on CatalogPolicy.
Database
Database wraps a SQLite file. Use it as a context manager:
from gutenbit import Database
with Database(".gutenbit/gutenbit.db") as db:
    # all operations here
    ...
Or manage the connection manually and call db.close() when you are done.
Ingest
catalog = Catalog.fetch()
books = catalog.search(author="Tolstoy")
with Database(".gutenbit/gutenbit.db") as db:
    db.ingest(books)
ingest downloads each book's HTML from Project Gutenberg, parses it into chunks, and stores everything in the database. Books already present (at the current chunker version) are skipped unless you pass force=True.
The delay parameter controls the pause between downloads. The default is 1 second, which is polite to Gutenberg's servers:
db.ingest(books, delay=2.0)
db.ingest(books, force=True) # reprocess even if already current
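The skip-and-pause behaviour can be sketched as a plain loop. This is only an illustration under stated assumptions: `ingest_politely`, `download`, and `already_current` are hypothetical names, and pausing only between actual downloads is a guess about where the delay applies, not a description of gutenbit's internals.

```python
import time
from typing import Callable, Iterable


def ingest_politely(
    book_ids: Iterable[int],
    download: Callable[[int], str],
    already_current: Callable[[int], bool],
    *,
    delay: float = 1.0,
    force: bool = False,
    sleep: Callable[[float], None] = time.sleep,
) -> dict[int, str]:
    """Download each book unless it is current, pausing between downloads."""
    fetched: dict[int, str] = {}
    for book_id in book_ids:
        if already_current(book_id) and not force:
            continue  # skip work that is already up to date
        if fetched:
            sleep(delay)  # pause only between actual downloads
        fetched[book_id] = download(book_id)
    return fetched


# Fake download and a recorded sleep, so the example runs offline and instantly.
pauses: list[float] = []
result = ingest_politely(
    [1, 2, 3],
    download=lambda book_id: f"<html>{book_id}</html>",
    already_current=lambda book_id: book_id == 2,
    delay=2.0,
    sleep=pauses.append,
)
```

Injecting `sleep` keeps the sketch testable; in real use the default `time.sleep` would apply.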
Search
results = db.search("battle")
Returns a list of SearchResult objects ordered by BM25 rank by default.
Filters narrow the result set:
results = db.search("battle", author="Tolstoy")
results = db.search("battle", book_id=2600)
results = db.search("battle", kind="text")
results = db.search("battle", title="War")
Metadata filters (author, title, language, subject) use substring matching. book_id and kind are exact.
Order controls result ordering:
db.search("battle", order="rank") # BM25 score (default)
db.search("battle", order="first") # book_id asc, position asc
db.search("battle", order="last") # book_id desc, position desc
Limit controls the maximum number of results:
db.search("battle", limit=50) # default is 20
FTS5 query syntax is supported directly:
db.search('"to be or not to be"') # exact phrase
db.search("war AND peace") # boolean
db.search("war NOT peace") # exclusion
db.search("philos*") # prefix match
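To see how these query forms behave, the sketch below runs them against a tiny in-memory FTS5 table using only the standard library. The table name `chunks`, the `match` helper, and the sample rows are made up for illustration and say nothing about gutenbit's actual schema. It assumes your Python's SQLite build includes FTS5.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE chunks USING fts5(content)")
conn.executemany(
    "INSERT INTO chunks(rowid, content) VALUES (?, ?)",
    [
        (1, "war and peace are old companions"),
        (2, "the war dragged on for years"),
        (3, "a philosopher seeks peace"),
        (4, "philosophy begins in wonder"),
    ],
)


def match(query: str) -> list[int]:
    """Return matching rowids, best match first.

    SQLite's bm25() gives better matches numerically lower values,
    so ascending order puts the best match at the front.
    """
    sql = "SELECT rowid FROM chunks WHERE chunks MATCH ? ORDER BY bm25(chunks)"
    return [row[0] for row in conn.execute(sql, (query,))]


phrase = match('"war and peace"')   # exact phrase
both = match("war AND peace")       # boolean AND
excluded = match("war NOT peace")   # exclusion
prefix = match("philos*")           # prefix match
```

Inside double quotes the word "and" is an ordinary token, while the bare uppercase `AND` is an operator.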
SearchResult
A frozen dataclass with these fields:
| Field | Type | Description |
|---|---|---|
| chunk_id | int | Internal row ID |
| book_id | int | Project Gutenberg ID |
| title | str | Book title |
| authors | str | Author names |
| language | str | Language code |
| subjects | str | Subjects |
| div1 | str | Broadest structural division |
| div2 | str | Second level |
| div3 | str | Third level |
| div4 | str | Deepest level |
| position | int | Chunk index in document order |
| content | str | Full text of the matching chunk |
| kind | str | "heading" or "text" |
| char_count | int | Character length of content |
| score | float | BM25 relevance score (higher is better) |
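These fields are enough to build a compact, citation-style line for each hit. `describe` below is a hypothetical helper, and it assumes unused division levels are empty strings. The sample object is a stand-in with the same attribute names, not a real SearchResult.

```python
from types import SimpleNamespace


def describe(result) -> str:
    """Format a result as 'Title [book_id] > div1 > div2: snippet'."""
    divs = [d for d in (result.div1, result.div2, result.div3, result.div4) if d]
    path = " > ".join([f"{result.title} [{result.book_id}]", *divs])
    snippet = " ".join(result.content.split())[:60]  # collapse whitespace, then trim
    return f"{path}: {snippet}"


# Stand-in object with made-up content.
hit = SimpleNamespace(
    title="War and Peace",
    book_id=2600,
    div1="BOOK ONE",
    div2="CHAPTER I",
    div3="",
    div4="",
    content="The battle  began\nat dawn.",
)
line = describe(hit)
```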
Reading chunks
Several methods retrieve chunks without a search query.
All chunks for a book:
records = db.chunk_records(1342)
for chunk in records:
    print(chunk.position, chunk.div1, chunk.kind, chunk.content[:60])
Filter by kind:
headings = db.chunk_records(1342, kinds=["heading"])
By position:
chunk = db.chunk_by_position(1342, position=50)
A window around a position:
window = db.chunk_window(1342, position=50, around=3)
# Returns chunks at positions 47, 48, 49, 50, 51, 52, 53
The CLI view --radius and search --radius options use this same centered-window concept, but present it as a simple surrounding passage in reading order.
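The centered window comes down to simple arithmetic on positions. The sketch below is a hypothetical helper, not gutenbit code. It assumes positions start at 0 and that the window is clipped at the start of the book. Clipping at the end would need the book's length, which is left out here.

```python
def window_positions(position: int, around: int) -> list[int]:
    """Positions from position - around to position + around, never below 0."""
    start = max(0, position - around)
    return list(range(start, position + around + 1))


middle = window_positions(50, 3)
near_start = window_positions(1, 3)
```

For `position=50, around=3` this yields the seven positions 47 through 53.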
By section path:
section = db.chunks_by_div(1342, "Chapter 1")
section = db.chunks_by_div(1342, "BOOK ONE/CHAPTER I", kinds=["text"], limit=10)
chunks_by_div matches by prefix on the div hierarchy. Matching ignores trailing punctuation and is case-insensitive.
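One plausible reading of this matching rule is sketched below. It is an assumption-heavy illustration: `normalize` and `matches_div` are hypothetical names, and treating the path as "/"-separated segments that must each equal the corresponding div level after normalization is a guess at the semantics, not gutenbit's documented algorithm.

```python
import string


def normalize(segment: str) -> str:
    """Casefold a heading and drop trailing punctuation and whitespace."""
    return segment.strip().rstrip(string.punctuation + " ").casefold()


def matches_div(divs: tuple[str, ...], path: str) -> bool:
    """True if the '/'-separated path is a prefix of the chunk's div levels."""
    wanted = [normalize(part) for part in path.split("/")]
    have = [normalize(d) for d in divs]
    return have[: len(wanted)] == wanted


# Made-up div tuples in (div1, div2, div3, div4) order.
chapter_one = ("BOOK ONE", "CHAPTER I.", "", "")
chapter_two = ("BOOK ONE", "CHAPTER II", "", "")
hit = matches_div(chapter_one, "book one/chapter i")
miss = matches_div(chapter_two, "book one/chapter i")
```

Under this reading, a one-segment path such as "Book One" selects every chunk in that book, while a two-segment path narrows to a single chapter.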
ChunkRecord
A frozen dataclass with these fields:
| Field | Type | Description |
|---|---|---|
| chunk_id | int | Internal row ID |
| book_id | int | Project Gutenberg ID |
| div1 | str | Broadest structural division |
| div2 | str | Second level |
| div3 | str | Third level |
| div4 | str | Deepest level |
| position | int | Chunk index in document order |
| content | str | Full text |
| kind | str | Chunk kind |
| char_count | int | Character length of content |
Full text
text = db.text(1342)
Returns the full reconstructed text (all chunks joined with double newlines), or None if the book is not stored.
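The reconstruction described above amounts to a join over chunk contents in position order. A minimal sketch, where `join_chunks` is a hypothetical helper and an empty list stands in for "book not stored":

```python
from typing import Optional


def join_chunks(contents: list[str]) -> Optional[str]:
    """Join chunk texts with blank lines, or return None when there are none."""
    if not contents:
        return None
    return "\n\n".join(contents)


# Made-up chunk contents, already in document order.
text = join_chunks(["CHAPTER I", "First paragraph.", "Second paragraph."])
```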
Book management
all_books = db.books() # list of BookRecord
stale_books = db.stale_books() # stored books that need reprocessing
book = db.book(1342) # BookRecord or None
db.has_text(1342) # True if stored
db.has_current_text(1342) # True if stored at current chunker version
db.delete_book(1342) # returns True if deleted, False if not found
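These calls combine naturally into a maintenance step that re-ingests anything chunked by an older version. `refresh_stale` is a hypothetical helper built only from the methods shown on this page. It assumes `stale_books()` returns a list that `ingest()` accepts. Because stale books are by definition not at the current chunker version, it does not pass force=True.

```python
def refresh_stale(db, *, delay: float = 1.0) -> int:
    """Re-ingest every stale book and return how many were queued."""
    stale = db.stale_books()
    if stale:
        db.ingest(stale, delay=delay)
    return len(stale)
```

Inside a `with Database(...) as db:` block, `refresh_stale(db)` would bring every stored book up to the current chunker version.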
Chunking HTML directly
For advanced use, you can chunk HTML without the database:
from gutenbit import chunk_html
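# Read the file with an explicit encoding (UTF-8 assumed) and close it promptly.
with open("book.html", encoding="utf-8") as f:
    html = f.read()
chunks = chunk_html(html)
for chunk in chunks[:10]:
    print(chunk.position, chunk.div1, chunk.kind, chunk.content[:60])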
Chunk
A frozen dataclass with these fields:
| Field | Type | Description |
|---|---|---|
| position | int | Chunk index in document order |
| div1 | str | Broadest structural division |
| div2 | str | Second level |
| div3 | str | Third level |
| div4 | str | Deepest level |
| content | str | Text content |
| kind | str | "heading" or "text" |
See Concepts for how divisions and chunk kinds work.
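Once you have a list of chunks, ordinary Python is enough to summarize a book's structure. The sketch below counts "text" chunks per top-level division. `text_chunks_per_div1` is a hypothetical helper, and the sample objects are stand-ins that only mimic the Chunk attribute names.

```python
from collections import Counter
from types import SimpleNamespace


def text_chunks_per_div1(chunks) -> Counter:
    """Count "text" chunks under each top-level division."""
    return Counter(c.div1 for c in chunks if c.kind == "text")


# Stand-in objects with made-up content, in document order.
sample = [
    SimpleNamespace(position=0, div1="CHAPTER I", kind="heading", content="CHAPTER I"),
    SimpleNamespace(position=1, div1="CHAPTER I", kind="text", content="First paragraph."),
    SimpleNamespace(position=2, div1="CHAPTER I", kind="text", content="Second paragraph."),
    SimpleNamespace(position=3, div1="CHAPTER II", kind="heading", content="CHAPTER II"),
    SimpleNamespace(position=4, div1="CHAPTER II", kind="text", content="Third paragraph."),
]
counts = text_chunks_per_div1(sample)
```

The same function should work on the output of `chunk_html(html)`, since it relies only on the `div1` and `kind` attributes.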