gutenbit.html_chunker

HTML chunker for Project Gutenberg books.

Uses the table of contents <a class="pginternal"> links as the primary structural map. When a TOC is present but coarse, body headings can refine it without replacing TOC-derived hierarchy.
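As an illustration of the TOC step, the sketch below collects `<a class="pginternal">` links with the standard-library `html.parser`. The names `TocLinkParser` and `toc_links` are hypothetical and are not part of `gutenbit`; the actual module may parse the document differently.

```python
from html.parser import HTMLParser
from typing import List, Optional, Tuple


class TocLinkParser(HTMLParser):
    """Collect (href, label) pairs from <a class="pginternal"> elements."""

    def __init__(self) -> None:
        super().__init__()
        self.links: List[Tuple[str, str]] = []
        self._href: Optional[str] = None  # href of the open TOC link, if any
        self._text: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        classes = (attr.get("class") or "").split()
        if "pginternal" in classes and attr.get("href"):
            self._href = attr["href"]
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside an open TOC link.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse runs of whitespace in the link label.
            label = " ".join("".join(self._text).split())
            self.links.append((self._href, label))
            self._href = None


def toc_links(html: str) -> List[Tuple[str, str]]:
    """Return TOC links in document order (hypothetical helper)."""
    parser = TocLinkParser()
    parser.feed(html)
    parser.close()
    return parser.links
```

Each `href` (for example `#ch1`) points at an anchor in the body, which is what makes the TOC usable as a structural map.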

Corpus boundaries are defined by Gutenberg's explicit text delimiters: *** START OF (THE|THIS) PROJECT GUTENBERG EBOOK ... *** through *** END OF (THE|THIS) PROJECT GUTENBERG EBOOK ... ***.
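The delimiter rule can be sketched with two regular expressions. This is a minimal illustration on a plain string, assuming the delimiter shape quoted above; `START_RE`, `END_RE`, and `corpus_span` are hypothetical names, and the fallback to the whole input when a delimiter is missing is an assumption rather than documented behaviour.

```python
import re

# Assumed patterns, built from the delimiter shape quoted in the text above.
START_RE = re.compile(
    r"\*\*\*\s*START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*?\*\*\*",
    re.IGNORECASE | re.DOTALL,
)
END_RE = re.compile(
    r"\*\*\*\s*END OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*?\*\*\*",
    re.IGNORECASE | re.DOTALL,
)


def corpus_span(text: str) -> str:
    """Return the text between the START and END delimiters.

    Assumption: if a delimiter is missing, fall back to the start or
    end of the input instead of raising.
    """
    start = START_RE.search(text)
    begin = start.end() if start else 0
    end = END_RE.search(text, begin)
    finish = end.start() if end else len(text)
    return text[begin:finish]
```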

Each <p> element becomes its own chunk; paragraphs are never accumulated or merged.
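The one-paragraph-per-chunk rule can be pictured with a small standard-library parser. This is only a sketch: `ParagraphCollector` and `paragraphs` are hypothetical names, and skipping whitespace-only paragraphs is an assumption, not something the module documents.

```python
from html.parser import HTMLParser
from typing import List, Optional


class ParagraphCollector(HTMLParser):
    """Emit the text of each <p> element as a separate item."""

    def __init__(self) -> None:
        super().__init__()
        self.paragraphs: List[str] = []
        self._buffer: Optional[List[str]] = None  # None while outside a <p>

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._flush()  # an unclosed <p> is ended by the next <p>
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "p":
            self._flush()

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def _flush(self) -> None:
        if self._buffer is not None:
            text = " ".join("".join(self._buffer).split())
            if text:  # assumption: drop whitespace-only paragraphs
                self.paragraphs.append(text)
            self._buffer = None

    def close(self) -> None:
        super().close()
        self._flush()  # keep a trailing paragraph that was never closed


def paragraphs(html: str) -> List[str]:
    """Return one string per <p>, in document order, with no merging."""
    collector = ParagraphCollector()
    collector.feed(html)
    collector.close()
    return collector.paragraphs
```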

Chunk(position: int, div1: str, div2: str, div3: str, div4: str, content: str, kind: str) dataclass

A discrete block extracted from a book, labelled by kind.

Structural divisions (div1–div4) are compacted so the shallowest heading level always fills div1 first. For a chapter-only book, chapters go in div1; for a book with BOOK + CHAPTER, BOOK fills div1 and CHAPTER fills div2.
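One way to read the compaction rule is as a per-book mapping from the heading levels that actually occur to the slots div1 through div4. The sketch below assumes a hypothetical ranking of heading keywords; neither `HYPOTHETICAL_RANK` nor `assign_div_slots` comes from the module itself.

```python
from typing import Dict, Set

# Hypothetical ranking of heading keywords, shallowest first.
HYPOTHETICAL_RANK = {"BOOK": 0, "PART": 1, "CHAPTER": 2, "SECTION": 3}


def assign_div_slots(keywords_present: Set[str]) -> Dict[str, str]:
    """Map each heading keyword used in a book to a compacted div slot.

    The shallowest keyword present always lands in div1, the next in
    div2, and so on, regardless of which levels the book skips.
    """
    ordered = sorted(keywords_present, key=HYPOTHETICAL_RANK.__getitem__)
    return {
        keyword: f"div{slot}"
        for slot, keyword in enumerate(ordered, start=1)
    }
```

Under these assumptions, a chapter-only book maps CHAPTER to div1, while a book with both BOOK and CHAPTER maps BOOK to div1 and CHAPTER to div2, matching the two cases described above.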

Kinds: "heading", "text"
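To make the field layout concrete, the following stand-in dataclass mirrors the documented signature and filters a few chunks by kind. It is a local copy for illustration, not the library class, and the sample values (including which div fields a heading chunk carries) are invented.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    """Stand-in mirroring the documented fields; not the library class."""

    position: int
    div1: str
    div2: str
    div3: str
    div4: str
    content: str
    kind: str  # "heading" or "text"


# Invented sample data for a book with BOOK + CHAPTER structure.
chunks = [
    Chunk(0, "BOOK I", "", "", "", "BOOK I", "heading"),
    Chunk(1, "BOOK I", "CHAPTER I", "", "", "CHAPTER I", "heading"),
    Chunk(2, "BOOK I", "CHAPTER I", "", "", "It was a dark night.", "text"),
]

# Keep only body paragraphs, dropping heading chunks.
text_only = [c.content for c in chunks if c.kind == "text"]
```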

chunk_html(html: str) -> list[Chunk]

Split an HTML book into labelled chunks using TOC plus body heading cues.

Each <p> element becomes its own chunk. Returns chunks in document order.