
Python API

Gutenbit exposes five public classes and one function from its top-level package.

Because the project is not published on PyPI yet, add it from GitHub when using it as a library:

uv add git+https://github.com/keinan1/gutenbit

from gutenbit import Catalog, BookRecord, Database, SearchResult, Chunk, chunk_html

Catalog

Catalog fetches and searches the Project Gutenberg metadata catalog.

Fetch

from gutenbit import Catalog

catalog = Catalog.fetch()

fetch() downloads the CSV catalog from Project Gutenberg, filters it to English text records, and deduplicates entries so each work maps to a single canonical ID (the lowest Gutenberg ID for that title/author pair). The result is a Catalog instance held in memory.
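The dedup rule can be sketched in a few lines of plain Python (this is illustrative, not gutenbit's internal code — the dict records stand in for catalog rows):

```python
# Among records sharing a (title, authors) pair, the lowest Gutenberg ID wins.
records = [
    {"id": 2701, "title": "Moby Dick", "authors": "Melville, Herman"},
    {"id": 15, "title": "Moby Dick", "authors": "Melville, Herman"},
]

canonical = {}
# Visit records in ascending ID order; setdefault keeps the first (lowest) one.
for rec in sorted(records, key=lambda r: r["id"]):
    canonical.setdefault((rec["title"], rec["authors"]), rec)

print([r["id"] for r in canonical.values()])  # [15]
```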

Search

results = catalog.search(author="Dickens")
results = catalog.search(title="Christmas", author="Dickens")
results = catalog.search(subject="Philosophy")

All filters use case-insensitive substring matching. When multiple filters are given, they combine with AND logic. search returns a list of BookRecord objects.
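The filter semantics amount to a case-insensitive substring test per field, AND-ed together. A hedged sketch (the dict stands in for a BookRecord; field names are simplified):

```python
def matches(record: dict, **filters: str) -> bool:
    # Every filter value must appear, case-insensitively, in its field.
    return all(needle.lower() in record[field].lower()
               for field, needle in filters.items())

book = {"title": "A Christmas Carol", "authors": "Dickens, Charles"}
print(matches(book, title="christmas", authors="Dickens"))  # True
print(matches(book, title="christmas", authors="Austen"))   # False
```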

Lookup

book = catalog.get(1342)           # BookRecord or None
cid = catalog.canonical_id(1342)   # canonical ID or None

canonical_id resolves alternate edition IDs to the canonical one.

BookRecord

A frozen dataclass with these fields:

Field Type Description
id int Project Gutenberg ID
title str Book title
authors str Semicolon-separated author names
language str Language code (e.g. "en")
subjects str Semicolon-separated subjects
locc str Library of Congress Classification
bookshelves str Gutenberg bookshelves
issued str Publication date
type str Media type (e.g. "Text")
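The table above corresponds to a frozen dataclass roughly like this sketch (the real class is gutenbit's; the field values here are illustrative):

```python
from dataclasses import FrozenInstanceError, dataclass

@dataclass(frozen=True)
class BookRecord:
    id: int
    title: str
    authors: str
    language: str
    subjects: str
    locc: str
    bookshelves: str
    issued: str
    type: str

rec = BookRecord(1342, "Pride and Prejudice", "Austen, Jane", "en",
                 "England -- Fiction", "PR", "Best Books Ever",
                 "1998-06-01", "Text")

try:
    rec.title = "changed"      # frozen=True rejects attribute assignment
except FrozenInstanceError:
    print("immutable")
```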

Catalog policy

By default, fetch() applies a policy that keeps only English text and deduplicates by lowest ID per work. To customize:

from gutenbit.catalog import CatalogPolicy

policy = CatalogPolicy(
    allowed_language_codes=frozenset({"en", "fr"}),
    dedupe_strategy="none",
)
catalog = Catalog.fetch(policy=policy)

See the API Reference for full details on CatalogPolicy.

Database

Database wraps a SQLite file. Use it as a context manager:

from gutenbit import Database

with Database(".gutenbit/gutenbit.db") as db:
    # all operations here
    ...

Or manage the connection manually with db.close().
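Database wraps a SQLite connection, so the manual lifecycle mirrors the usual stdlib pattern (shown here with sqlite3 directly, for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
try:
    row = conn.execute("SELECT 1").fetchone()
finally:
    conn.close()  # a closed connection rejects further statements
```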

Ingest

catalog = Catalog.fetch()
books = catalog.search(author="Tolstoy")

with Database(".gutenbit/gutenbit.db") as db:
    db.ingest(books)

ingest downloads each book's HTML from Project Gutenberg, parses it into chunks, and stores everything in the database. Books already present (at the current chunker version) are skipped unless you pass force=True.

The delay parameter controls the pause between downloads. The default is 1 second, which is polite to Gutenberg's servers:

db.ingest(books, delay=2.0)
db.ingest(books, force=True)  # reprocess even if already current

Search

results = db.search("battle")

Returns a list of SearchResult objects ordered by BM25 rank by default.

Filters narrow the result set:

results = db.search("battle", author="Tolstoy")
results = db.search("battle", book_id=2600)
results = db.search("battle", kind="text")
results = db.search("battle", title="War")

Metadata filters (author, title, language, subject) use substring matching. book_id and kind are exact.

The order parameter controls result ordering:

db.search("battle", order="rank")    # BM25 score (default)
db.search("battle", order="first")   # book_id asc, position asc
db.search("battle", order="last")    # book_id desc, position desc

The limit parameter controls the maximum number of results:

db.search("battle", limit=50)  # default is 20

FTS5 query syntax is supported directly:

db.search('"to be or not to be"')         # exact phrase
db.search("war AND peace")                 # boolean
db.search("war NOT peace")                 # exclusion
db.search("philos*")                       # prefix match

SearchResult

A frozen dataclass with these fields:

Field Type Description
chunk_id int Internal row ID
book_id int Project Gutenberg ID
title str Book title
authors str Author names
language str Language code
subjects str Subjects
div1 str Broadest structural division
div2 str Second level
div3 str Third level
div4 str Deepest level
position int Chunk index in document order
content str Full text of the matching chunk
kind str "heading" or "text"
char_count int Character length of content
score float BM25 relevance score (higher is better)

Reading chunks

Several methods retrieve chunks without a search query.

All chunks for a book:

records = db.chunk_records(1342)
for chunk in records:
    print(chunk.position, chunk.div1, chunk.kind, chunk.content[:60])

Filter by kind:

headings = db.chunk_records(1342, kinds=["heading"])

By position:

chunk = db.chunk_by_position(1342, position=50)

A window around a position:

window = db.chunk_window(1342, position=50, around=3)
# Returns chunks at positions 47, 48, 49, 50, 51, 52, 53
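The window is a simple centered position range. A sketch of the arithmetic, assuming positions are clamped at zero (gutenbit's exact edge handling near the start of a book is an assumption here):

```python
def window_positions(position: int, around: int) -> list[int]:
    # Centered range of width 2*around + 1, clamped so it never goes negative.
    return list(range(max(0, position - around), position + around + 1))

print(window_positions(50, 3))  # [47, 48, 49, 50, 51, 52, 53]
print(window_positions(1, 3))   # [0, 1, 2, 3, 4]
```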

The CLI view --radius and search --radius options use this same centered-window concept, but present it as a simple surrounding passage in reading order.

By section path:

section = db.chunks_by_div(1342, "Chapter 1")
section = db.chunks_by_div(1342, "BOOK ONE/CHAPTER I", kinds=["text"], limit=10)

chunks_by_div matches by prefix on the div hierarchy. Matching ignores trailing punctuation and is case-insensitive.
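A hedged sketch of that matching rule — compare each level of the div path case-insensitively, ignoring trailing punctuation, and treat the query as a prefix of the stored hierarchy (the exact punctuation set gutenbit strips is an assumption here):

```python
def div_matches(stored: str, query: str) -> bool:
    def norm(part: str) -> str:
        return part.rstrip(".,;:!?").lower()
    stored_parts = [norm(p) for p in stored.split("/")]
    query_parts = [norm(p) for p in query.split("/")]
    # Prefix match: every query level must equal the corresponding stored level.
    return stored_parts[:len(query_parts)] == query_parts

print(div_matches("BOOK ONE/CHAPTER I.", "book one/chapter i"))  # True
print(div_matches("BOOK TWO/CHAPTER I", "book one"))             # False
```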

ChunkRecord

A frozen dataclass with these fields:

Field Type Description
chunk_id int Internal row ID
book_id int Project Gutenberg ID
div1 str Broadest structural division
div2 str Second level
div3 str Third level
div4 str Deepest level
position int Chunk index in document order
content str Full text
kind str Chunk kind
char_count int Character length of content

Full text

text = db.text(1342)

Returns the full reconstructed text (all chunks joined with double newlines), or None if the book is not stored.

Book management

all_books = db.books()             # list of BookRecord
stale_books = db.stale_books()     # stored books that need reprocessing
book = db.book(1342)               # BookRecord or None
db.has_text(1342)                  # True if stored
db.has_current_text(1342)          # True if stored at current chunker version
db.delete_book(1342)               # returns True if deleted, False if not found

Chunking HTML directly

For advanced use, you can chunk HTML without the database:

from gutenbit import chunk_html

html = open("book.html").read()
chunks = chunk_html(html)

for chunk in chunks[:10]:
    print(chunk.position, chunk.div1, chunk.kind, chunk.content[:60])

Chunk

A frozen dataclass with these fields:

Field Type Description
position int Chunk index in document order
div1 str Broadest structural division
div2 str Second level
div3 str Third level
div4 str Deepest level
content str Text content
kind str "heading" or "text"

See Concepts for how divisions and chunk kinds work.