gutenbit.db

db

SQLite storage and full-text search for Project Gutenberg books.

ChunkRecord(chunk_id: int, book_id: int, div1: str, div2: str, div3: str, div4: str, position: int, content: str, kind: str, char_count: int) dataclass

One stored chunk with structural metadata.

Database(path: str | Path)

SQLite database for storing and searching Project Gutenberg books.

book(book_id: int) -> BookRecord | None

Return one stored book by Project Gutenberg id.

books() -> list[BookRecord]

Return all stored books.

chunk_by_id(book_id: int, chunk_id: int) -> ChunkRecord | None

Return one chunk by internal row id within a specific book.

chunk_by_position(book_id: int, position: int) -> ChunkRecord | None

Return one chunk by structural position within a specific book.

chunk_records(book_id: int, *, kinds: list[str] | None = None) -> list[ChunkRecord]

Return all chunks for a book as ChunkRecord objects.

chunk_window(book_id: int, position: int, *, around: int = 0) -> list[ChunkRecord]

Return the chunk at the selected position plus up to around neighboring chunks on each side.
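The windowing behaviour can be pictured with a small stand-alone sketch. This is not the library's implementation: it assumes chunk positions are consecutive integers, and it represents chunks as hypothetical (position, content) tuples rather than ChunkRecord objects.

```python
# Illustrative stand-in for chunk_window, operating on plain tuples.
# Assumption: positions are consecutive integers within one book.

def chunk_window(chunks, position, around=0):
    """Return chunks whose position lies within `around` of `position`."""
    lo, hi = position - around, position + around
    return [chunk for chunk in chunks if lo <= chunk[0] <= hi]


chunks = [(0, "a"), (1, "b"), (2, "c"), (3, "d"), (4, "e")]
middle = chunk_window(chunks, 2, around=1)   # one neighbor on each side
single = chunk_window(chunks, 2)             # around=0 selects one chunk
at_start = chunk_window(chunks, 0, around=2) # window is clipped at the start
```

Near the start or end of a book the window is simply shorter, since there are fewer neighbors on that side.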

chunks(book_id: int, *, kinds: list[str] | None = None) -> list[tuple[int, str, str, str, str, str, str, int]]

Return chunks as (position, div1, div2, div3, div4, content, kind, char_count).

chunks_by_div(book_id: int, div_path: str, *, kinds: list[str] | None = None, limit: int = 0) -> list[ChunkRecord]

Return chunks under a division path prefix.

Each segment is matched exactly, except that the deepest query segment also accepts a prefix match ending on a word boundary (so "CHAPTER I" matches "CHAPTER I DESCRIPTION OF A PALACE"). Trailing punctuation is always ignored.
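The matching rule described above can be sketched in a few lines. This is a rough illustration, not the package's code: the "/" separator, the helper names normalize and matches_div_path, and the exact normalization (whitespace trimming plus trailing-punctuation stripping) are all assumptions.

```python
import string

# Hypothetical helpers illustrating the division-path rule.
# Assumption: path segments are separated by "/".

def normalize(segment: str) -> str:
    """Trim whitespace and drop trailing punctuation."""
    return segment.strip().rstrip(string.punctuation + " ")


def matches_div_path(query: str, divs: tuple, sep: str = "/") -> bool:
    """Check a chunk's (div1, div2, div3, div4) against a query path."""
    wanted = [n for s in query.split(sep) if (n := normalize(s))]
    have = [normalize(d) for d in divs]
    if not wanted or len(wanted) > len(have):
        return False
    *parents, last = wanted
    # Every segment above the deepest one must match exactly.
    if any(want != got for want, got in zip(parents, have)):
        return False
    # The deepest segment may also match as a whole-word prefix.
    candidate = have[len(parents)]
    return candidate == last or candidate.startswith(last + " ")


palace = ("CHAPTER I DESCRIPTION OF A PALACE", "", "", "")
prefix_ok = matches_div_path("CHAPTER I", palace)
not_chapter_two = matches_div_path("CHAPTER I", ("CHAPTER II.", "", "", ""))
nested_ok = matches_div_path("BOOK ONE/CHAPTER I.", ("BOOK ONE", "CHAPTER I", "", ""))
parent_exact = matches_div_path("BOOK/CHAPTER I", ("BOOK ONE", "CHAPTER I", "", ""))
```

In this sketch "CHAPTER I" does not match "CHAPTER II" because the prefix must end on a word boundary, and "BOOK" does not match "BOOK ONE" because only the deepest segment gets the prefix allowance.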

delete_book(book_id: int) -> bool

Delete a stored book and all associated rows. Returns False if missing.

has_current_text(book_id: int) -> bool

Return True when stored text matches the current chunker version.

has_text(book_id: int) -> bool

Return True when a book has already been downloaded and stored.

ingest(books: list[BookRecord], *, delay: float = 1.0, force: bool = False) -> None

Download, chunk, and store books.

Enforces the package's ingestion boundaries: only English text records are ingested, and duplicate work IDs within the same request are collapsed to a single canonical edition.

search(query: str, *, author: str | None = None, title: str | None = None, language: str | None = None, subject: str | None = None, book_id: int | None = None, kind: str | None = None, div_path: str | None = None, order: SearchOrder = 'rank', limit: int = 20) -> list[SearchResult]

Search chunks via FTS5 with BM25 ranking.

When div_path is given, results are post-filtered using the same path-prefix matching as chunks_by_div (normalized, with word-boundary prefix on the deepest segment).
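For readers unfamiliar with FTS5, the following stand-alone sketch shows BM25-ranked full-text search using only the standard library's sqlite3 module. The table name chunks_fts, its single content column, and the helper functions are hypothetical and say nothing about the actual gutenbit schema; the sketch also assumes the local SQLite build includes FTS5.

```python
import sqlite3

# Stand-alone FTS5 illustration; the schema here is hypothetical.

def build_index(rows):
    """Create an in-memory FTS5 table and load (rowid, content) rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE VIRTUAL TABLE chunks_fts USING fts5(content)")
    conn.executemany(
        "INSERT INTO chunks_fts(rowid, content) VALUES (?, ?)", rows
    )
    return conn


def search(conn, query, limit=20):
    """Return (rowid, score) pairs, best match first.

    FTS5's bm25() yields lower (more negative) values for better
    matches, so ascending order puts the best hit first.
    """
    return conn.execute(
        "SELECT rowid, bm25(chunks_fts) AS score FROM chunks_fts "
        "WHERE chunks_fts MATCH ? ORDER BY score LIMIT ?",
        (query, limit),
    ).fetchall()


conn = build_index([
    (1, "Call me Ishmael."),
    (2, "The whale, the whale! A white whale."),
    (3, "It was the best of times."),
    (4, "A whale ship was my Yale College and my Harvard."),
    (5, "It was the worst of times."),
])
hits = search(conn, "whale")
```

Row 2 mentions the query term three times in a short chunk, so it ranks ahead of row 4, which mentions it once in a longer chunk.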

search_count(query: str, *, author: str | None = None, title: str | None = None, language: str | None = None, subject: str | None = None, book_id: int | None = None, kind: str | None = None, div_path: str | None = None) -> int

Return the total number of search hits before any CLI display limit.

search_page(query: str, *, author: str | None = None, title: str | None = None, language: str | None = None, subject: str | None = None, book_id: int | None = None, kind: str | None = None, div_path: str | None = None, order: SearchOrder = 'rank', limit: int = 20) -> SearchPage

Return one CLI search page plus an exact total-hit count.
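The difference between the page size and the total count can be illustrated with a tiny sketch. The Page dataclass and page_of helper are hypothetical stand-ins that only mirror the documented items and total_results fields. They page over an in-memory list instead of querying a database.

```python
from dataclasses import dataclass

# Hypothetical stand-in mirroring the documented SearchPage fields.

@dataclass
class Page:
    items: list
    total_results: int


def page_of(hits: list, limit: int = 20) -> Page:
    """Keep at most `limit` hits but report how many matched in total."""
    return Page(items=hits[:limit], total_results=len(hits))


page = page_of([f"hit {n}" for n in range(45)], limit=20)
```

A caller can then display something like "showing 20 of 45" without running a second query by hand.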

stale_books() -> list[BookRecord]

Return stored books whose text is missing or stale for this chunker version.

text(book_id: int) -> str | None

Return the clean text for a book, or None if not found.

text_states(book_ids: list[int]) -> dict[int, TextState]

Return stored text presence/currentness for the requested ids.
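One way to think about presence versus currentness is sketched below. It is an assumption-laden illustration, not the library's logic: the CHUNKER_VERSION constant, the stored_versions mapping, and the local TextState stand-in are all hypothetical, and the real package may track staleness differently.

```python
from dataclasses import dataclass

# Hypothetical model: each stored book records the chunker version
# that produced its text; a missing entry means no text is stored.

CHUNKER_VERSION = 3  # assumed value, for illustration only


@dataclass(frozen=True)
class TextState:
    has_text: bool
    has_current_text: bool


def text_states(stored_versions: dict, book_ids: list) -> dict:
    """Map each requested id to its presence/currentness snapshot."""
    states = {}
    for book_id in book_ids:
        version = stored_versions.get(book_id)
        has_text = version is not None
        states[book_id] = TextState(
            has_text=has_text,
            has_current_text=has_text and version == CHUNKER_VERSION,
        )
    return states


stored = {11: 3, 84: 2}  # book id -> chunker version used at ingest
states = text_states(stored, [11, 84, 1342])
```

Under this model, a book with has_text set but has_current_text unset is one that would need re-chunking, and a book with neither has not been downloaded yet.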

SearchPage(items: list[SearchResult], total_results: int) dataclass

One CLI search page plus exact total-hit metadata.

SearchResult(chunk_id: int, book_id: int, title: str, authors: str, language: str, subjects: str, div1: str, div2: str, div3: str, div4: str, position: int, content: str, kind: str, char_count: int, score: float) dataclass

A single search hit: one chunk with its book metadata.

TextState(has_text: bool, has_current_text: bool) dataclass

Presence/currentness snapshot for one stored book.