gutenbit — Download, parse, and store Project Gutenberg texts.

BookRecord(id: int, title: str, authors: str, language: str, subjects: str, locc: str, bookshelves: str, issued: str, type: str) dataclass

A book entry from the Project Gutenberg catalog.

Catalog(records: list[BookRecord], *, canonical_id_by_id: dict[int, int] | None = None, fetch_info: CatalogFetchInfo | None = None)

The Project Gutenberg catalog, searchable in memory.

canonical_id(book_id: int) -> int | None

Resolve any known id to the canonical id under current policy.

fetch(*, policy: CatalogPolicy = DEFAULT_CATALOG_POLICY, cache_dir: str | Path | None = None, refresh: bool = False) -> Catalog classmethod

Download the CSV catalog from Project Gutenberg.

get(book_id: int) -> BookRecord | None

Return a canonical book record for a requested id.

is_canonical_id(book_id: int) -> bool

Return True when an id is already canonical under current policy.

search(*, author: str = '', title: str = '', language: str = '', subject: str = '') -> list[BookRecord]

Search for books matching all given criteria.

All filters use case-insensitive matching. Each query is first tried as a contiguous substring; if it contains multiple words and the substring fails, every word must appear individually (so "Jane Austen" matches "Austen, Jane, 1775-1817").
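The two-stage rule above (contiguous substring first, then per-word fallback for multi-word queries) can be sketched in isolation; `field_matches` is a hypothetical helper written for illustration, not part of the gutenbit API:

```python
def field_matches(query: str, field: str) -> bool:
    """Case-insensitive match: try the query as a contiguous substring;
    if that fails and the query has multiple words, require every word
    to appear somewhere in the field."""
    q, f = query.lower(), field.lower()
    if q in f:
        return True
    words = q.split()
    return len(words) > 1 and all(w in f for w in words)
```

This is why "Jane Austen" matches the catalog's "Austen, Jane, 1775-1817": the substring fails, but both words appear individually.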

Chunk(position: int, div1: str, div2: str, div3: str, div4: str, content: str, kind: str) dataclass

A discrete block extracted from a book, labelled by kind.

Structural divisions (div1–div4) are compacted so the shallowest heading level always fills div1 first. For a chapter-only book, chapters go in div1; for a book with BOOK + CHAPTER, BOOK fills div1 and CHAPTER fills div2.

Kinds: "heading", "text"
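The compaction rule can be sketched as a pass over a whole book's division tuples: any level that is empty across every chunk is dropped, and deeper levels shift up to fill it. This is a minimal illustration of the described behavior, not gutenbit's internal implementation:

```python
def compact_divs(chunks: list[tuple[str, str, str, str]]) -> list[tuple[str, str, str, str]]:
    """Shift division levels left so the shallowest used level lands in div1."""
    # Levels actually used anywhere in the book, in order.
    used = [i for i in range(4) if any(c[i] for c in chunks)]
    remap = {old: new for new, old in enumerate(used)}
    out = []
    for c in chunks:
        divs = [""] * 4
        for old, new in remap.items():
            divs[new] = c[old]
        out.append(tuple(divs))
    return out
```

A chapter-only book whose headings originally landed at the second level ends up with chapters in div1.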

Database(path: str | Path)

SQLite database for storing and searching Project Gutenberg books.

book(book_id: int) -> BookRecord | None

Return one stored book by Project Gutenberg id.

books() -> list[BookRecord]

Return all stored books.

chunk_by_id(book_id: int, chunk_id: int) -> ChunkRecord | None

Return one chunk by internal row id within a specific book.

chunk_by_position(book_id: int, position: int) -> ChunkRecord | None

Return one chunk by structural position within a specific book.

chunk_records(book_id: int, *, kinds: list[str] | None = None) -> list[ChunkRecord]

Return all chunks for a book as ChunkRecord objects.

chunk_window(book_id: int, position: int, *, around: int = 0) -> list[ChunkRecord]

Return the chunk at the given position plus `around` neighboring chunks on each side.

chunks(book_id: int, *, kinds: list[str] | None = None) -> list[tuple[int, str, str, str, str, str, str, int]]

Return chunks as (position, div1, div2, div3, div4, content, kind, char_count).

chunks_by_div(book_id: int, div_path: str, *, kinds: list[str] | None = None, limit: int = 0) -> list[ChunkRecord]

Return chunks under a division path prefix.

Each segment is matched exactly, except that the deepest query segment also accepts a prefix match (so "CHAPTER I" matches "CHAPTER I DESCRIPTION OF A PALACE"). Trailing punctuation is always ignored.
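The path-matching rule can be sketched as follows. The `/` separator and the helper names are assumptions made for illustration; only the matching semantics (exact segments, word-boundary prefix on the deepest segment, trailing punctuation ignored) come from the description above:

```python
def _normalize(segment: str) -> str:
    """Case-fold and drop trailing punctuation, per the documented rule."""
    return segment.strip().rstrip(".,;:!?").lower()

def div_path_matches(query_path: str, chunk_divs: tuple[str, str, str, str]) -> bool:
    """True when the chunk's divisions fall under the query path prefix."""
    q = [_normalize(s) for s in query_path.split("/")]
    d = [_normalize(s) for s in chunk_divs if s]
    if len(q) > len(d):
        return False
    # All but the deepest query segment must match exactly.
    for i, qs in enumerate(q[:-1]):
        if qs != d[i]:
            return False
    last, target = q[-1], d[len(q) - 1]
    # Deepest segment: exact match, or prefix ending at a word boundary.
    return target == last or target.startswith(last + " ")
```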

delete_book(book_id: int) -> bool

Delete a stored book and all associated rows. Returns False if missing.

has_current_text(book_id: int) -> bool

Return True when stored text matches the current chunker version.

has_text(book_id: int) -> bool

Return True when a book has already been downloaded and stored.

ingest(books: list[BookRecord], *, delay: float = 1.0, force: bool = False) -> None

Download, chunk, and store books.

Enforces the package's ingestion boundaries: only English text records are accepted, and duplicate work ids within a single request are collapsed to one canonical edition.
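The pre-ingest filtering can be sketched as below. The `Rec` stand-in, the `"en"`/`"Text"` field values, and the `canonical_id` callable are all assumptions for illustration; the real method works on full `BookRecord` objects and the catalog's own id policy:

```python
from dataclasses import dataclass

@dataclass
class Rec:
    """Minimal stand-in for BookRecord, with only the fields used here."""
    id: int
    language: str
    type: str

def filter_ingestable(records: list[Rec], canonical_id) -> list[Rec]:
    """Drop non-English and non-text records; keep one edition per work."""
    seen: set[int] = set()
    out = []
    for r in records:
        if r.language != "en" or r.type != "Text":
            continue
        cid = canonical_id(r.id)
        if cid in seen:
            continue  # duplicate work id within this request
        seen.add(cid)
        out.append(r)
    return out
```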

search(query: str, *, author: str | None = None, title: str | None = None, language: str | None = None, subject: str | None = None, book_id: int | None = None, kind: str | None = None, div_path: str | None = None, order: SearchOrder = 'rank', limit: int = 20) -> list[SearchResult]

Search chunks via FTS5 with BM25 ranking.

When div_path is given, results are post-filtered using the same path-prefix matching as chunks_by_div (normalized, with a word-boundary prefix match on the deepest segment).
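gutenbit's actual table schema is not shown here, but the underlying mechanism, SQLite FTS5 with the bm25() ranking function, can be illustrated standalone with the sqlite3 standard library (assuming your Python build ships with FTS5 compiled in):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE VIRTUAL TABLE chunks USING fts5(content)")
con.executemany(
    "INSERT INTO chunks(content) VALUES (?)",
    [("Call me Ishmael.",), ("The white whale swam on.",)],
)
# bm25() returns lower values for better matches, so ascending order
# puts the most relevant chunk first.
rows = con.execute(
    "SELECT content, bm25(chunks) FROM chunks "
    "WHERE chunks MATCH ? ORDER BY bm25(chunks)",
    ("white whale",),
).fetchall()
```

In the real method the MATCH query is combined with the metadata filters (author, title, book_id, and so on) as ordinary SQL predicates.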

search_count(query: str, *, author: str | None = None, title: str | None = None, language: str | None = None, subject: str | None = None, book_id: int | None = None, kind: str | None = None, div_path: str | None = None) -> int

Return the total number of search hits before any CLI display limit.

search_page(query: str, *, author: str | None = None, title: str | None = None, language: str | None = None, subject: str | None = None, book_id: int | None = None, kind: str | None = None, div_path: str | None = None, order: SearchOrder = 'rank', limit: int = 20) -> SearchPage

Return one CLI search page plus an exact total-hit count.

stale_books() -> list[BookRecord]

Return stored books whose text is missing or stale for this chunker version.

text(book_id: int) -> str | None

Return the clean text for a book, or None if not found.

text_states(book_ids: list[int]) -> dict[int, TextState]

Return stored text presence/currentness for the requested ids.

SearchResult(chunk_id: int, book_id: int, title: str, authors: str, language: str, subjects: str, div1: str, div2: str, div3: str, div4: str, position: int, content: str, kind: str, char_count: int, score: float) dataclass

A single search hit — one chunk with its book metadata.

chunk_html(html: str) -> list[Chunk]

Split an HTML book into labelled chunks using TOC plus body heading cues.

Each <p> element becomes its own chunk. Returns chunks in document order.
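The paragraph-per-chunk idea can be sketched with the standard library's html.parser; this toy splitter only distinguishes headings from paragraphs and ignores the TOC cues the real chunker uses:

```python
from html.parser import HTMLParser

class ParaSplitter(HTMLParser):
    """Collect each <p> and heading element as its own (kind, text) chunk."""

    def __init__(self):
        super().__init__()
        self.chunks: list[tuple[str, str]] = []
        self._tag = None
        self._buf: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in ("p", "h1", "h2", "h3", "h4"):
            self._tag, self._buf = tag, []

    def handle_data(self, data):
        if self._tag:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == self._tag:
            kind = "text" if tag == "p" else "heading"
            text = "".join(self._buf).strip()
            if text:
                self.chunks.append((kind, text))
            self._tag = None

parser = ParaSplitter()
parser.feed("<h2>CHAPTER I</h2><p>Call me Ishmael.</p>")
```

Chunks come out in document order, matching the contract described above.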