gutenbit.html_chunker
HTML chunker for Project Gutenberg books.
Uses the table of contents <a class="pginternal"> links as the primary
structural map. When a TOC is present but coarse, body headings can refine it
without replacing TOC-derived hierarchy.
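As a rough illustration of how the TOC links could be collected, the sketch below uses only the standard library's `html.parser`. The class `TocLinkParser` and the helper `toc_links` are hypothetical names, not part of gutenbit, and the assumption that TOC entries are same-page `#fragment` links is mine.

```python
from html.parser import HTMLParser


class TocLinkParser(HTMLParser):
    """Collect (fragment, label) pairs from <a class="pginternal"> links.

    Hypothetical helper for illustration; gutenbit's own parser may differ.
    """

    def __init__(self):
        super().__init__()
        self.links = []      # collected (fragment, label) pairs
        self._href = None    # fragment of the link currently open, if any
        self._text = []      # text pieces seen inside the open link

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        classes = (attr.get("class") or "").split()
        href = attr.get("href") or ""
        # Assumption: TOC entries point at anchors in the same file.
        if "pginternal" in classes and href.startswith("#"):
            self._href = href[1:]
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            label = " ".join("".join(self._text).split())
            self.links.append((self._href, label))
            self._href = None


def toc_links(html):
    """Return TOC links in document order."""
    parser = TocLinkParser()
    parser.feed(html)
    parser.close()
    return parser.links
```

Each fragment can then be matched against `id` or `name` attributes in the body to locate the section it points to.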
Corpus boundaries are defined by Gutenberg's explicit text delimiters:
*** START OF (THE|THIS) PROJECT GUTENBERG EBOOK ... *** through
*** END OF (THE|THIS) PROJECT GUTENBERG EBOOK ... ***.
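The delimiter shape above maps naturally onto a pair of regular expressions. The following is a minimal sketch, assuming the markers appear as plain text; the names `START_RE`, `END_RE`, and `slice_corpus` are hypothetical, and the fallback behavior when a marker is missing is an assumption rather than documented gutenbit behavior.

```python
import re

# Patterns follow the delimiter shape quoted above. The exact expressions
# used by gutenbit are not shown here, so treat these as assumptions.
START_RE = re.compile(
    r"\*\*\*\s*START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK[^*]*\*\*\*"
)
END_RE = re.compile(
    r"\*\*\*\s*END OF (?:THE|THIS) PROJECT GUTENBERG EBOOK[^*]*\*\*\*"
)


def slice_corpus(text):
    """Return the text between the START and END markers.

    Assumption: if a marker is missing, fall back to the corresponding
    end of the input instead of raising an error.
    """
    start = START_RE.search(text)
    begin = start.end() if start else 0
    end = END_RE.search(text, begin)
    stop = end.start() if end else len(text)
    return text[begin:stop]
```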
Each <p> element becomes its own chunk; paragraphs are never accumulated or merged.
Chunk(position: int, div1: str, div2: str, div3: str, div4: str, content: str, kind: str)
dataclass
A discrete block extracted from a book, labelled by kind.
Structural divisions (div1–div4) are compacted so that the shallowest heading level always fills div1 first. In a chapter-only book, chapters go in div1; in a book with BOOK and CHAPTER headings, BOOK fills div1 and CHAPTER fills div2.
Kinds: "heading", "text"
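To make the compaction rule concrete, here is a small sketch. The `Chunk` definition mirrors the documented signature, but whether the real dataclass is frozen or has defaults is not stated, so this is a stand-in. The helper `compact_divs` is hypothetical and only illustrates the idea of shifting used levels toward div1.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    """Stand-in mirroring the documented field order; not the real class."""
    position: int
    div1: str
    div2: str
    div3: str
    div4: str
    content: str
    kind: str


def compact_divs(labels):
    """Shift non-empty heading labels left and pad to four slots.

    Hypothetical helper: labels are ordered shallowest to deepest, and
    empty strings mark levels the book does not use.
    """
    kept = [label for label in labels if label][:4]
    kept += [""] * (4 - len(kept))
    return tuple(kept)


# Chapter-only book: the chapter label lands in div1.
chapter_only = Chunk(0, *compact_divs(["", "CHAPTER I"]),
                     content="CHAPTER I", kind="heading")

# BOOK + CHAPTER book: BOOK fills div1, CHAPTER fills div2.
book_chapter = Chunk(1, *compact_divs(["BOOK ONE", "CHAPTER I"]),
                     content="It was a dark night.", kind="text")
```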
chunk_html(html: str) -> list[Chunk]
Split an HTML book into labelled chunks using TOC plus body heading cues.
Each <p> element becomes its own chunk. Returns chunks in document order.
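The one-chunk-per-paragraph behavior can be approximated with the standard library alone. The sketch below is not gutenbit's implementation: `BlockParser` and `split_blocks` are hypothetical names, the output is a list of plain `(position, kind, content)` tuples without div labels, and it assumes well-formed markup in which every <p> and heading tag is explicitly closed.

```python
from html.parser import HTMLParser

HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class BlockParser(HTMLParser):
    """Emit one block per <p> or heading element, in document order."""

    def __init__(self):
        super().__init__()
        self.blocks = []   # (position, kind, content) tuples
        self._kind = None  # kind of the block currently open, if any
        self._buf = []     # text pieces seen inside the open block

    def handle_starttag(self, tag, attrs):
        if self._kind is not None:
            return  # inline tag inside an open block; keep collecting text
        if tag == "p":
            self._kind = "text"
        elif tag in HEADING_TAGS:
            self._kind = "heading"
        if self._kind is not None:
            self._buf = []

    def handle_data(self, data):
        if self._kind is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        closes_text = tag == "p" and self._kind == "text"
        closes_heading = tag in HEADING_TAGS and self._kind == "heading"
        if closes_text or closes_heading:
            content = " ".join("".join(self._buf).split())
            if content:  # skip empty elements
                self.blocks.append((len(self.blocks), self._kind, content))
            self._kind = None


def split_blocks(html):
    """Return blocks in document order; paragraphs are never merged."""
    parser = BlockParser()
    parser.feed(html)
    parser.close()
    return parser.blocks
```

Positions here simply count emitted blocks from zero; how the real `position` field is numbered is not specified in this reference.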