gutenbit — Download, parse, and store Project Gutenberg texts.

BookRecord(id: int, title: str, authors: str, language: str, subjects: str, locc: str, bookshelves: str, issued: str, type: str) dataclass

A book entry from the Project Gutenberg catalog.

Catalog(records: list[BookRecord], *, canonical_id_by_id: dict[int, int] | None = None, fetch_info: CatalogFetchInfo | None = None)

The Project Gutenberg catalog, searchable in memory.

canonical_id(book_id: int) -> int | None

Resolve any known id to the canonical id under current policy.

fetch(*, policy: CatalogPolicy = DEFAULT_CATALOG_POLICY, cache_dir: str | Path | None = None, refresh: bool = False) -> Catalog classmethod

Download the CSV catalog from Project Gutenberg.

get(book_id: int) -> BookRecord | None

Return a canonical book record for a requested id.

is_canonical_id(book_id: int) -> bool

Return True when an id is already canonical under current policy.

search(*, author: str = '', title: str = '', language: str = '', subject: str = '') -> list[BookRecord]

Search for books matching all given criteria.

All filters use case-insensitive matching. Each query is first tried as a contiguous substring; if it contains multiple words and the substring fails, every word must appear individually (so "Jane Austen" matches "Austen, Jane, 1775-1817").
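The two-stage rule above (contiguous substring first, then per-word fallback for multi-word queries) can be sketched in isolation; `field_matches` is a hypothetical helper written for illustration, not part of the gutenbit API:

```python
def field_matches(query: str, field: str) -> bool:
    """Case-insensitive match: try the query as a contiguous substring;
    if that fails and the query has multiple words, require every word
    to appear somewhere in the field."""
    q, f = query.lower(), field.lower()
    if q in f:
        return True
    words = q.split()
    return len(words) > 1 and all(w in f for w in words)
```

This is why "Jane Austen" matches the catalog's "Austen, Jane, 1775-1817": the substring fails, but both words appear individually.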

Chunk(position: int, div1: str, div2: str, div3: str, div4: str, content: str, kind: str) dataclass

A discrete block extracted from a book, labelled by kind.

Structural divisions (div1–div4) are compacted so the shallowest heading level always fills div1 first. For a chapter-only book, chapters go in div1; for a book with BOOK + CHAPTER, BOOK fills div1 and CHAPTER fills div2.

Kinds: "heading", "text"
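The compaction rule can be sketched as a pass over a whole book's division tuples: any level that is empty across every chunk is dropped, and deeper levels shift up to fill it. This is a minimal illustration of the described behavior, not gutenbit's internal implementation:

```python
def compact_divs(chunks: list[tuple[str, str, str, str]]) -> list[tuple[str, str, str, str]]:
    """Shift division levels left so the shallowest used level lands in div1."""
    # Levels actually used anywhere in the book, in order.
    used = [i for i in range(4) if any(c[i] for c in chunks)]
    remap = {old: new for new, old in enumerate(used)}
    out = []
    for c in chunks:
        divs = [""] * 4
        for old, new in remap.items():
            divs[new] = c[old]
        out.append(tuple(divs))
    return out
```

A chapter-only book whose headings originally landed at the second level ends up with chapters in div1.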

Database(path: str | Path)

SQLite database for storing and searching Project Gutenberg books.

book(book_id: int) -> BookRecord | None

Return one stored book by Project Gutenberg id.

books() -> list[BookRecord]

Return all stored books.

chunk_by_id(book_id: int, chunk_id: int) -> ChunkRecord | None

Return one chunk by internal row id within a specific book.

chunk_by_position(book_id: int, position: int) -> ChunkRecord | None

Return one chunk by structural position within a specific book.

chunk_records(book_id: int, *, kinds: list[str] | None = None) -> list[ChunkRecord]

Return all chunks for a book as ChunkRecord objects.

chunk_window(book_id: int, position: int, *, around: int = 0) -> list[ChunkRecord]

Return the chunk at the given position plus `around` neighboring chunks on each side.

chunks(book_id: int, *, kinds: list[str] | None = None) -> list[tuple[int, str, str, str, str, str, str, int]]

Return chunks as (position, div1, div2, div3, div4, content, kind, char_count).

chunks_by_div(book_id: int, div_path: str, *, kinds: list[str] | None = None, limit: int = 0) -> list[ChunkRecord]

Return chunks under a division path prefix.

Each segment is matched exactly, except that the deepest query segment also accepts a prefix match (so "CHAPTER I" matches "CHAPTER I DESCRIPTION OF A PALACE"). Trailing punctuation is always ignored.
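The path-matching rule can be sketched as follows. The `/` separator and the helper names are assumptions made for illustration; only the matching semantics (exact segments, word-boundary prefix on the deepest segment, trailing punctuation ignored) come from the description above:

```python
def _normalize(segment: str) -> str:
    """Case-fold and drop trailing punctuation, per the documented rule."""
    return segment.strip().rstrip(".,;:!?").lower()

def div_path_matches(query_path: str, chunk_divs: tuple[str, str, str, str]) -> bool:
    """True when the chunk's divisions fall under the query path prefix."""
    q = [_normalize(s) for s in query_path.split("/")]
    d = [_normalize(s) for s in chunk_divs if s]
    if len(q) > len(d):
        return False
    # All but the deepest query segment must match exactly.
    for i, qs in enumerate(q[:-1]):
        if qs != d[i]:
            return False
    last, target = q[-1], d[len(q) - 1]
    # Deepest segment: exact match, or prefix ending at a word boundary.
    return target == last or target.startswith(last + " ")
```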

delete_book(book_id: int) -> bool

Delete a stored book and all associated rows. Returns False if missing.

has_current_text(book_id: int) -> bool

Return True when stored text matches the current chunker version.

has_text(book_id: int) -> bool

Return True when a book has already been downloaded and stored.

ingest(books: list[BookRecord], *, delay: float = 1.0, force: bool = False) -> None

Download, chunk, and store books.

Enforces the package's ingestion boundaries: only English text records are accepted, and duplicate work ids within a single request are collapsed to one canonical edition.
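The pre-ingest filtering can be sketched as below. The `Rec` stand-in, the `"en"`/`"Text"` field values, and the `canonical_id` callable are all assumptions for illustration; the real method works on full `BookRecord` objects and the catalog's own id policy:

```python
from dataclasses import dataclass

@dataclass
class Rec:
    """Minimal stand-in for BookRecord, with only the fields used here."""
    id: int
    language: str
    type: str

def filter_ingestable(records: list[Rec], canonical_id) -> list[Rec]:
    """Drop non-English and non-text records; keep one edition per work."""
    seen: set[int] = set()
    out = []
    for r in records:
        if r.language != "en" or r.type != "Text":
            continue
        cid = canonical_id(r.id)
        if cid in seen:
            continue  # duplicate work id within this request
        seen.add(cid)
        out.append(r)
    return out
```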

search(query: str, *, author: str | None = None, title: str | None = None, language: str | None = None, subject: str | None = None, book_id: int | None = None, kind: str | None = None, div_path: str | None = None, order: SearchOrder = 'rank', limit: int = 20) -> list[SearchResult]

Search chunks via FTS5 with BM25 ranking.

When div_path is given, results are post-filtered using the same path-prefix matching as chunks_by_div (normalized, with a word-boundary prefix match on the deepest segment).
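gutenbit's actual table schema is not shown here, but the underlying mechanism, SQLite FTS5 with the bm25() ranking function, can be illustrated standalone with the sqlite3 standard library (assuming your Python build ships with FTS5 compiled in):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE VIRTUAL TABLE chunks USING fts5(content)")
con.executemany(
    "INSERT INTO chunks(content) VALUES (?)",
    [("Call me Ishmael.",), ("The white whale swam on.",)],
)
# bm25() returns lower values for better matches, so ascending order
# puts the most relevant chunk first.
rows = con.execute(
    "SELECT content, bm25(chunks) FROM chunks "
    "WHERE chunks MATCH ? ORDER BY bm25(chunks)",
    ("white whale",),
).fetchall()
```

In the real method the MATCH query is combined with the metadata filters (author, title, book_id, and so on) as ordinary SQL predicates.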

search_count(query: str, *, author: str | None = None, title: str | None = None, language: str | None = None, subject: str | None = None, book_id: int | None = None, kind: str | None = None, div_path: str | None = None) -> int

Return the total number of search hits before any CLI display limit.

search_page(query: str, *, author: str | None = None, title: str | None = None, language: str | None = None, subject: str | None = None, book_id: int | None = None, kind: str | None = None, div_path: str | None = None, order: SearchOrder = 'rank', limit: int = 20) -> SearchPage

Return one CLI search page plus an exact total-hit count.

stale_books() -> list[BookRecord]

Return stored books whose text is missing or stale for this chunker version.

text(book_id: int) -> str | None

Return the clean text for a book, or None if not found.

text_states(book_ids: list[int]) -> dict[int, TextState]

Return stored text presence/currentness for the requested ids.

SearchResult(chunk_id: int, book_id: int, title: str, authors: str, language: str, subjects: str, div1: str, div2: str, div3: str, div4: str, position: int, content: str, kind: str, char_count: int, score: float) dataclass

A single search hit — one chunk with its book metadata.

chunk_html(html: str) -> list[Chunk]

Split an HTML book into labelled chunks using TOC plus body heading cues.

Each <p> element becomes its own chunk. Returns chunks in document order.
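The paragraph-per-chunk idea can be sketched with the standard library's html.parser; this toy splitter only distinguishes headings from paragraphs and ignores the TOC cues the real chunker uses:

```python
from html.parser import HTMLParser

class ParaSplitter(HTMLParser):
    """Collect each <p> and heading element as its own (kind, text) chunk."""

    def __init__(self):
        super().__init__()
        self.chunks: list[tuple[str, str]] = []
        self._tag = None
        self._buf: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in ("p", "h1", "h2", "h3", "h4"):
            self._tag, self._buf = tag, []

    def handle_data(self, data):
        if self._tag:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == self._tag:
            kind = "text" if tag == "p" else "heading"
            text = "".join(self._buf).strip()
            if text:
                self.chunks.append((kind, text))
            self._tag = None

parser = ParaSplitter()
parser.feed("<h2>CHAPTER I</h2><p>Call me Ishmael.</p>")
```

Chunks come out in document order, matching the contract described above.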