gutenbit
gutenbit — Download, parse, and store Project Gutenberg texts.
BookRecord(id: int, title: str, authors: str, language: str, subjects: str, locc: str, bookshelves: str, issued: str, type: str)
dataclass
A book entry from the Project Gutenberg catalog.
Catalog(records: list[BookRecord], *, canonical_id_by_id: dict[int, int] | None = None, fetch_info: CatalogFetchInfo | None = None)
The Project Gutenberg catalog, searchable in memory.
canonical_id(book_id: int) -> int | None
Resolve any known id to the canonical id under current policy.
fetch(*, policy: CatalogPolicy = DEFAULT_CATALOG_POLICY, cache_dir: str | Path | None = None, refresh: bool = False) -> Catalog
classmethod
Download the CSV catalog from Project Gutenberg.
get(book_id: int) -> BookRecord | None
Return a canonical book record for a requested id.
is_canonical_id(book_id: int) -> bool
Return True when an id is already canonical under current policy.
search(*, author: str = '', title: str = '', language: str = '', subject: str = '') -> list[BookRecord]
Search for books matching all given criteria.
All filters use case-insensitive matching. Each query is first tried as
a contiguous substring; if it contains multiple words and the substring
fails, every word must appear individually (so "Jane Austen" matches
"Austen, Jane, 1775-1817").
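The two-stage matching rule described above can be sketched in a few lines. This is a minimal illustration, not gutenbit's actual code: the helper name field_matches is hypothetical, and treating an empty query as "match everything" is an assumption based on the empty-string filter defaults.

```python
def field_matches(query: str, field: str) -> bool:
    """Hypothetical helper: contiguous substring first, then all-words fallback."""
    q = query.casefold().strip()
    f = field.casefold()
    if not q:
        # Assumption: an empty filter places no constraint on the field.
        return True
    if q in f:
        # Stage 1: the whole query appears as one contiguous substring.
        return True
    words = q.split()
    # Stage 2: only for multi-word queries, every word must appear somewhere.
    return len(words) > 1 and all(word in f for word in words)


print(field_matches("Jane Austen", "Austen, Jane, 1775-1817"))
print(field_matches("Jane Eyre", "Austen, Jane, 1775-1817"))
```

The first call succeeds through the word-by-word fallback because "jane austen" is not a contiguous substring of the catalog's "Last, First" form; the second fails because "eyre" never appears.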
Chunk(position: int, div1: str, div2: str, div3: str, div4: str, content: str, kind: str)
dataclass
A discrete block extracted from a book, labelled by kind.
Structural divisions (div1–div4) are compacted so the shallowest heading level always fills div1 first. For a chapter-only book, chapters go in div1; for a book with BOOK + CHAPTER, BOOK fills div1 and CHAPTER fills div2.
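One way to picture the compaction is to map the heading ranks that actually occur in a book onto consecutive div slots. The sketch below is illustrative only: the function name compact_paths, the (rank, title) input shape, and the numeric ranks are assumptions, not part of gutenbit.

```python
def compact_paths(headings: list[tuple[int, str]]) -> list[tuple[str, str, str, str]]:
    """Hypothetical sketch: assign the shallowest rank in use to div1, the next to div2, etc."""
    # Only the ranks that occur in this book get a slot (at most four).
    ranks = sorted({rank for rank, _ in headings})[:4]
    slot = {rank: index for index, rank in enumerate(ranks)}

    current = ["", "", "", ""]
    paths = []
    for rank, title in headings:
        if rank not in slot:
            continue  # deeper than four levels: ignored in this sketch
        index = slot[rank]
        current[index] = title
        # A new heading resets every deeper division.
        for deeper in range(index + 1, 4):
            current[deeper] = ""
        paths.append((current[0], current[1], current[2], current[3]))
    return paths


# Chapter-only book: chapters land in div1 even though their raw rank is 3.
print(compact_paths([(3, "CHAPTER I"), (3, "CHAPTER II")]))
# BOOK + CHAPTER: BOOK fills div1 and CHAPTER fills div2.
print(compact_paths([(2, "BOOK ONE"), (3, "CHAPTER I")]))
```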
Kinds: "heading", "text"
Database(path: str | Path)
SQLite database for storing and searching Project Gutenberg books.
book(book_id: int) -> BookRecord | None
Return one stored book by Project Gutenberg id.
books() -> list[BookRecord]
Return all stored books.
chunk_by_id(book_id: int, chunk_id: int) -> ChunkRecord | None
Return one chunk by internal row id within a specific book.
chunk_by_position(book_id: int, position: int) -> ChunkRecord | None
Return one chunk by structural position within a specific book.
chunk_records(book_id: int, *, kinds: list[str] | None = None) -> list[ChunkRecord]
Return all chunks for a book as ChunkRecord objects.
chunk_window(book_id: int, position: int, *, around: int = 0) -> list[ChunkRecord]
Return the chunk at the given position plus its neighbors, with the around argument setting how many chunks to include on each side.
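The windowing idea can be illustrated on a plain list of positions. This sketch makes two assumptions that the reference does not confirm: that the window is clamped at the start and end of the book, and that an unknown position yields an empty list. The function name window is hypothetical.

```python
def window(positions: list[int], position: int, around: int = 0) -> list[int]:
    """Hypothetical sketch: the selected position plus `around` neighbors per side."""
    ordered = sorted(positions)
    if position not in ordered:
        return []  # assumption: an unknown position yields no chunks
    index = ordered.index(position)
    # Assumption: the window is clamped rather than padded at the edges.
    start = max(0, index - around)
    return ordered[start : index + around + 1]


print(window([0, 1, 2, 3, 4, 5], 3, around=1))
print(window([0, 1, 2, 3, 4, 5], 0, around=2))
```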
chunks(book_id: int, *, kinds: list[str] | None = None) -> list[tuple[int, str, str, str, str, str, str, int]]
Return chunks as (position, div1, div2, div3, div4, content, kind, char_count).
chunks_by_div(book_id: int, div_path: str, *, kinds: list[str] | None = None, limit: int = 0) -> list[ChunkRecord]
Return chunks under a division path prefix.
Each segment is matched exactly, except that the deepest query segment
also accepts a prefix match (so "CHAPTER I" matches
"CHAPTER I DESCRIPTION OF A PALACE"). Trailing punctuation is
always ignored.
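The matching rule above can be sketched as follows. To avoid guessing the separator used inside div_path, the sketch takes the query as a list of segments. The names path_matches and _norm are hypothetical, the set of punctuation stripped is an assumption, and the word-boundary check on the deepest segment follows the description given for Database.search.

```python
def _norm(segment: str) -> str:
    """Strip surrounding whitespace and trailing punctuation (assumed character set)."""
    return segment.strip().rstrip(".,:;!?").strip()


def path_matches(query: list[str], divs: list[str]) -> bool:
    """Hypothetical sketch: exact match on parent segments, prefix match on the deepest."""
    wanted = [_norm(segment) for segment in query]
    actual = [_norm(segment) for segment in divs]
    if not wanted or len(wanted) > len(actual):
        return False
    *parents, deepest = wanted
    # Every segment except the deepest must match exactly.
    if parents != actual[: len(parents)]:
        return False
    target = actual[len(parents)]
    # The deepest segment matches exactly or as a prefix ending on a word boundary.
    return target == deepest or target.startswith(deepest + " ")


print(path_matches(["CHAPTER I"], ["CHAPTER I DESCRIPTION OF A PALACE", "", "", ""]))
print(path_matches(["CHAPTER I"], ["CHAPTER II", "", "", ""]))
```

Requiring a space after the prefix keeps "CHAPTER I" from also matching "CHAPTER II".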
delete_book(book_id: int) -> bool
Delete a stored book and all associated rows. Returns False if missing.
has_current_text(book_id: int) -> bool
Return True when stored text matches the current chunker version.
has_text(book_id: int) -> bool
Return True when a book has already been downloaded and stored.
ingest(books: list[BookRecord], *, delay: float = 1.0, force: bool = False) -> None
Download, chunk, and store books.
Enforces package ingestion boundaries: English text records only, with in-request duplicate work IDs collapsed to a canonical edition.
search(query: str, *, author: str | None = None, title: str | None = None, language: str | None = None, subject: str | None = None, book_id: int | None = None, kind: str | None = None, div_path: str | None = None, order: SearchOrder = 'rank', limit: int = 20) -> list[SearchResult]
Search chunks via FTS5 with BM25 ranking.
When div_path is given, results are post-filtered using the same
path-prefix matching as chunks_by_div (normalized, with
word-boundary prefix on the deepest segment).
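For readers unfamiliar with FTS5, the following standalone sketch shows BM25-ranked full-text search using Python's sqlite3 module. It does not reproduce gutenbit's schema: the table name chunks_fts and its single content column are assumptions, and it requires an SQLite build with FTS5 enabled.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical one-column FTS5 table; gutenbit's real schema is not shown here.
conn.execute("CREATE VIRTUAL TABLE chunks_fts USING fts5(content)")
conn.executemany(
    "INSERT INTO chunks_fts(content) VALUES (?)",
    [
        ("Call me Ishmael.",),
        ("The white whale swam before him.",),
        ("It is a truth universally acknowledged.",),
        ("A whale ship was my Yale College and my Harvard.",),
        ("Whenever I find myself growing grim about the mouth.",),
    ],
)


def fts_search(query: str, limit: int = 20) -> list[tuple[int, str, float]]:
    """Return (rowid, content, score) rows, best match first."""
    # bm25() returns smaller values for better matches, so ascending order
    # puts the most relevant rows first.
    return conn.execute(
        "SELECT rowid, content, bm25(chunks_fts) FROM chunks_fts "
        "WHERE chunks_fts MATCH ? ORDER BY bm25(chunks_fts) LIMIT ?",
        (query, limit),
    ).fetchall()


for rowid, content, score in fts_search("whale"):
    print(rowid, content, score)
```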
search_count(query: str, *, author: str | None = None, title: str | None = None, language: str | None = None, subject: str | None = None, book_id: int | None = None, kind: str | None = None, div_path: str | None = None) -> int
Return the total number of search hits before any CLI display limit.
search_page(query: str, *, author: str | None = None, title: str | None = None, language: str | None = None, subject: str | None = None, book_id: int | None = None, kind: str | None = None, div_path: str | None = None, order: SearchOrder = 'rank', limit: int = 20) -> SearchPage
Return one CLI search page plus an exact total-hit count.
stale_books() -> list[BookRecord]
Return stored books whose text is missing or stale for this chunker version.
text(book_id: int) -> str | None
Return the clean text for a book, or None if not found.
text_states(book_ids: list[int]) -> dict[int, TextState]
Return stored text presence/currentness for the requested ids.
SearchResult(chunk_id: int, book_id: int, title: str, authors: str, language: str, subjects: str, div1: str, div2: str, div3: str, div4: str, position: int, content: str, kind: str, char_count: int, score: float)
dataclass
A single search hit — one chunk with its book metadata.
chunk_html(html: str) -> list[Chunk]
Split an HTML book into labelled chunks using TOC plus body heading cues.
Each <p> element becomes its own chunk. Returns chunks in document order.
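The one-chunk-per-paragraph behaviour can be approximated with the standard library's HTMLParser. This is a simplified sketch, not the chunk_html implementation: it ignores the table of contents, treats any h1 to h4 element as a heading, and returns (kind, content) tuples rather than Chunk objects. The names ParagraphChunker and simple_chunks are hypothetical.

```python
from html.parser import HTMLParser


class ParagraphChunker(HTMLParser):
    """Hypothetical sketch: collect headings and paragraphs in document order."""

    HEADINGS = {"h1", "h2", "h3", "h4"}  # assumption: these tags mark headings

    def __init__(self) -> None:
        super().__init__()
        self.chunks: list[tuple[str, str]] = []  # (kind, content)
        self._open_tag: str | None = None
        self._buffer: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "p" or tag in self.HEADINGS:
            self._open_tag = tag
            self._buffer = []

    def handle_data(self, data):
        # Text inside nested inline tags (<i>, <b>, ...) is kept as well.
        if self._open_tag is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self._open_tag:
            # Collapse runs of whitespace and skip empty elements.
            text = " ".join("".join(self._buffer).split())
            if text:
                kind = "text" if tag == "p" else "heading"
                self.chunks.append((kind, text))
            self._open_tag = None


def simple_chunks(html: str) -> list[tuple[str, str]]:
    parser = ParagraphChunker()
    parser.feed(html)
    parser.close()
    return parser.chunks


print(simple_chunks("<h2>CHAPTER I</h2><p>Call me <i>Ishmael</i>.</p><p>Some years ago.</p>"))
```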