
# Hyntax

Straightforward HTML parser for JavaScript. [Live Demo](https://astexplorer.net/#/gist/6bf7f78077333cff124e619aebfb5b42/latest).

- **Simple.** The API is straightforward and the output is clear.
- **Forgiving.** Like a browser, it parses invalid HTML instead of failing.
- **Supports streaming.** Can process HTML while it's still being loaded.
- **No dependencies.**

## Table Of Contents

- [Usage](#usage)
- [TypeScript Typings](#typescript-typings)
- [Streaming](#streaming)
- [Tokens](#tokens)
- [AST Format](#ast-format)
- [API Reference](#api-reference)
- [Types Reference](#types-reference)

## Usage

```bash
npm install hyntax
```

```javascript
const { tokenize, constructTree } = require('hyntax')
const util = require('util')

// Sample input; any HTML string works here
const inputHTML = `
<html>
  <body>
    <input type="text" placeholder="Don't type">
    <button>Don't press</button>
  </body>
</html>
`

// Split the HTML string into tokens, then build the AST from them
const { tokens } = tokenize(inputHTML)
const { ast } = constructTree(tokens)

console.log(JSON.stringify(tokens, null, 2))
console.log(util.inspect(ast, { showHidden: false, depth: null }))
```

## TypeScript Typings

Hyntax is written in JavaScript but ships with [integrated TypeScript typings](./index.d.ts) to help you navigate its data structures. There is also a [Types Reference](#types-reference) which covers the most common types.

## Streaming

Use the `StreamTokenizer` and `StreamTreeConstructor` classes to parse HTML chunk by chunk while it's still being loaded from the network or read from disk.

```javascript
const { StreamTokenizer, StreamTreeConstructor } = require('hyntax')
const http = require('http')
const util = require('util')

http.get('http://info.cern.ch', (res) => {
  const streamTokenizer = new StreamTokenizer()
  const streamTreeConstructor = new StreamTreeConstructor()

  let resultTokens = []
  let resultAst

  // Response chunks flow into the tokenizer, tokens flow into the tree constructor
  res.pipe(streamTokenizer).pipe(streamTreeConstructor)

  streamTokenizer
    .on('data', (tokens) => {
      resultTokens = resultTokens.concat(tokens)
    })
    .on('end', () => {
      console.log(JSON.stringify(resultTokens, null, 2))
    })

  streamTreeConstructor
    .on('data', (ast) => {
      // Keep the most recently emitted AST
      resultAst = ast
    })
    .on('end', () => {
      console.log(util.inspect(resultAst, { showHidden: false, depth: null }))
    })
}).on('error', (err) => {
  throw err
})
```

## Tokens

Here are all the kinds of tokens that Hyntax extracts from an HTML string.

![Overview of all possible tokens](./tokens-list.png)

Each token conforms to the [Tokenizer.Token](#tokenizertoken) interface.

## AST Format

The resulting syntax tree always has one top-level [Document Node](#ast-node-types), with optional child nodes nested within.

```javascript
{
  nodeType: TreeConstructor.NodeTypes.Document,
  content: {
    children: [
      {
        nodeType: TreeConstructor.NodeTypes.AnyNodeType,
        content: {…}
      },
      {
        nodeType: TreeConstructor.NodeTypes.AnyNodeType,
        content: {…}
      }
    ]
  }
}
```

The content of each node is specific to the node's type. All of them are described in the [AST Node Types](#ast-node-types) reference.

## API Reference

### Tokenizer

Hyntax has its tokenizer as a separate module. You can use the generated tokens on their own or pass them on to the tree constructor to build an AST.

#### Interface

```typescript
tokenize(html: string): Tokenizer.Result
```

#### Arguments

- `html`
  HTML string to process.
  Required.
  Type: string.

#### Returns

[Tokenizer.Result](#tokenizerresult)

### Tree Constructor

Once you have an array of tokens, you can pass it to the tree constructor to build an AST.

#### Interface

```typescript
constructTree(tokens: Tokenizer.AnyToken[]): TreeConstructor.Result
```

#### Arguments

- `tokens`
  Array of tokens received from the tokenizer.
  Required.
  Type: [Tokenizer.AnyToken[]](#tokenizeranytoken)

#### Returns

[TreeConstructor.Result](#treeconstructorresult)

## Types Reference

#### Tokenizer.Result

```typescript
interface Result {
  state: Tokenizer.State
  tokens: Tokenizer.AnyToken[]
}
```

- `state`
  The current state of the tokenizer. It can be persisted and passed to the next tokenizer call if the input is coming in chunks.
- `tokens`
  Array of resulting tokens.
  Type: [Tokenizer.AnyToken[]](#tokenizeranytoken)

#### TreeConstructor.Result

```typescript
interface Result {
  state: State
  ast: AST
}
```

- `state`
  The current state of the tree constructor. It can be persisted and passed to the next tree constructor call if the tokens are coming in chunks.
- `ast`
  Resulting AST.
  Type: [TreeConstructor.AST](#treeconstructorast)

#### Tokenizer.Token

Generic Token. Other interfaces use it to create a specific Token type.

```typescript
interface Token<T extends TokenTypes.AnyTokenType> {
  type: T
  content: string
  startPosition: number
  endPosition: number
}
```

- `type`
  One of the [Token types](#tokenizertokentypesanytokentype).
- `content`
  Piece of the original HTML string which was recognized as a token.
- `startPosition`
  Index of the character in the input HTML string where the token starts.
- `endPosition`
  Index of the character in the input HTML string where the token ends.

#### Tokenizer.TokenTypes.AnyTokenType

Shortcut type of all possible tokens.

```typescript
type AnyTokenType =
  | Text
  | OpenTagStart
  | AttributeKey
  | AttributeAssigment
  | AttributeValueWrapperStart
  | AttributeValue
  | AttributeValueWrapperEnd
  | OpenTagEnd
  | CloseTag
  | OpenTagStartScript
  | ScriptTagContent
  | OpenTagEndScript
  | CloseTagScript
  | OpenTagStartStyle
  | StyleTagContent
  | OpenTagEndStyle
  | CloseTagStyle
  | DoctypeStart
  | DoctypeEnd
  | DoctypeAttributeWrapperStart
  | DoctypeAttribute
  | DoctypeAttributeWrapperEnd
  | CommentStart
  | CommentContent
  | CommentEnd
```

#### Tokenizer.AnyToken

Shortcut to reference any possible token.

```typescript
type AnyToken = Token<TokenTypes.AnyTokenType>
```

#### TreeConstructor.AST

Just an alias for DocumentNode. An AST always has one top-level DocumentNode. See [AST Node Types](#ast-node-types).

```typescript
type AST = TreeConstructor.DocumentNode
```

### AST Node Types

There are 7 possible types of Node. Each type has its own specific content.

```typescript
type DocumentNode = Node<NodeTypes.Document, NodeContents.Document>
```

```typescript
type DoctypeNode = Node<NodeTypes.Doctype, NodeContents.Doctype>
```

```typescript
type TextNode = Node<NodeTypes.Text, NodeContents.Text>
```

```typescript
type TagNode = Node<NodeTypes.Tag, NodeContents.Tag>
```

```typescript
type CommentNode = Node<NodeTypes.Comment, NodeContents.Comment>
```

```typescript
type ScriptNode = Node<NodeTypes.Script, NodeContents.Script>
```

```typescript
type StyleNode = Node<NodeTypes.Style, NodeContents.Style>
```

Interfaces for each content type:

- [Document](#treeconstructornodecontentsdocument)
- [Doctype](#treeconstructornodecontentsdoctype)
- [Text](#treeconstructornodecontentstext)
- [Tag](#treeconstructornodecontentstag)
- [Comment](#treeconstructornodecontentscomment)
- [Script](#treeconstructornodecontentsscript)
- [Style](#treeconstructornodecontentsstyle)

#### TreeConstructor.Node

Generic Node. Other interfaces use it to create specific Nodes by providing the type of the Node and the type of the content inside the Node.

```typescript
interface Node<T extends NodeTypes.AnyNodeType, C extends NodeContents.AnyNodeContent> {
  nodeType: T
  content: C
}
```

#### TreeConstructor.NodeTypes.AnyNodeType

Shortcut type of all possible Node types.

```typescript
type AnyNodeType =
  | Document
  | Doctype
  | Tag
  | Text
  | Comment
  | Script
  | Style
```

### Node Content Types

#### TreeConstructor.NodeContents.AnyNodeContent

Shortcut type of all possible types of content inside a Node.

```typescript
type AnyNodeContent =
  | Document
  | Doctype
  | Text
  | Tag
  | Comment
  | Script
  | Style
```

#### TreeConstructor.NodeContents.Document

```typescript
interface Document {
  children: AnyNode[]
}
```

#### TreeConstructor.NodeContents.Doctype

```typescript
interface Doctype {
  start: Tokenizer.Token
  attributes?: DoctypeAttribute[]
  end: Tokenizer.Token
}
```

#### TreeConstructor.NodeContents.Text

```typescript
interface Text {
  value: Tokenizer.Token
}
```

#### TreeConstructor.NodeContents.Tag

```typescript
interface Tag {
  name: string
  selfClosing: boolean
  openStart: Tokenizer.Token
  attributes?: TagAttribute[]
  openEnd: Tokenizer.Token
  children?: AnyNode[]
  close?: Tokenizer.Token
}
```

#### TreeConstructor.NodeContents.Comment

```typescript
interface Comment {
  start: Tokenizer.Token
  value: Tokenizer.Token
  end: Tokenizer.Token
}
```

#### TreeConstructor.NodeContents.Script

```typescript
interface Script {
  openStart: Tokenizer.Token
  attributes?: TagAttribute[]
  openEnd: Tokenizer.Token
  value: Tokenizer.Token
  close: Tokenizer.Token
}
```

#### TreeConstructor.NodeContents.Style

```typescript
interface Style {
  openStart: Tokenizer.Token
  attributes?: TagAttribute[]
  openEnd: Tokenizer.Token
  value: Tokenizer.Token
  close: Tokenizer.Token
}
```

#### TreeConstructor.DoctypeAttribute

```typescript
interface DoctypeAttribute {
  startWrapper?: Tokenizer.Token
  value: Tokenizer.Token
  endWrapper?: Tokenizer.Token
}
```

#### TreeConstructor.TagAttribute

```typescript
interface TagAttribute {
  key?: Tokenizer.Token
  startWrapper?: Tokenizer.Token
  value?: Tokenizer.Token
  endWrapper?: Tokenizer.Token
}
```
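
### Example: Walking the AST

The node content types above are enough to traverse a tree without knowing every node type in advance. The sketch below is one possible way to do it, not part of the Hyntax API: `collectTagNames` is a hypothetical helper, and the sample tree is hand-built to match the documented `Document` and `Tag` content shapes, with token fields such as `openStart` omitted for brevity. The `nodeType` strings in the sample are placeholders chosen for illustration; in a real tree they come from `TreeConstructor.NodeTypes`.

```javascript
// Hypothetical helper: collects tag names in document order.
// It relies only on the documented content shapes:
// Tag content has `name` and optional `children`,
// Document content has `children`.
function collectTagNames (node, names = []) {
  const content = node.content || {}

  // Only Tag content carries a string `name`
  if (typeof content.name === 'string') {
    names.push(content.name)
  }

  // Document and Tag content may carry child nodes
  for (const child of content.children || []) {
    collectTagNames(child, names)
  }

  return names
}

// Hand-built sample tree. The nodeType values are assumed placeholders,
// and token fields (openStart, openEnd, close) are left out for brevity.
const sampleAst = {
  nodeType: 'document',
  content: {
    children: [
      {
        nodeType: 'tag',
        content: {
          name: 'body',
          selfClosing: false,
          children: [
            { nodeType: 'tag', content: { name: 'input', selfClosing: true } },
            { nodeType: 'tag', content: { name: 'button', selfClosing: false, children: [] } }
          ]
        }
      }
    ]
  }
}

console.log(collectTagNames(sampleAst)) // [ 'body', 'input', 'button' ]
```

With a real tree, the same helper could be called on the `ast` returned by `constructTree`. Checking the shape of `content` rather than the value of `nodeType` keeps the sketch independent of the concrete node type constants.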