# Hyntax
Straightforward HTML parser for JavaScript. [Live Demo](https://astexplorer.net/#/gist/6bf7f78077333cff124e619aebfb5b42/latest).
- **Simple.** API is straightforward, output is clear.
- **Forgiving.** Just like a browser, it parses invalid HTML without failing.
- **Supports streaming.** Can process HTML while it's still being loaded.
- **No dependencies.**
## Table Of Contents
- [Usage](#usage)
- [TypeScript Typings](#typescript-typings)
- [Streaming](#streaming)
- [Tokens](#tokens)
- [AST Format](#ast-format)
- [API Reference](#api-reference)
- [Types Reference](#types-reference)
## Usage
```bash
npm install hyntax
```
```javascript
const { tokenize, constructTree } = require('hyntax')
const util = require('util')
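// Any HTML string works here; this markup is only a sample.
const inputHTML = `
<html>
  <body>
    <input type="text" placeholder="Don't type">
    <button>Don't press</button>
  </body>
</html>
`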
const { tokens } = tokenize(inputHTML)
const { ast } = constructTree(tokens)
console.log(JSON.stringify(tokens, null, 2))
console.log(util.inspect(ast, { showHidden: false, depth: null }))
```
## TypeScript Typings
Hyntax is written in JavaScript but ships with [integrated TypeScript typings](./index.d.ts) to help you navigate its data structures. There is also a [Types Reference](#types-reference) that covers the most common types.
## Streaming
Use the `StreamTokenizer` and `StreamTreeConstructor` classes to parse HTML chunk by chunk while it is still being loaded from the network or read from disk.
```javascript
const { StreamTokenizer, StreamTreeConstructor } = require('hyntax')
const http = require('http')
const util = require('util')
http.get('http://info.cern.ch', (res) => {
  const streamTokenizer = new StreamTokenizer()
  const streamTreeConstructor = new StreamTreeConstructor()

  let resultTokens = []
  let resultAst

  res.pipe(streamTokenizer).pipe(streamTreeConstructor)

  streamTokenizer
    .on('data', (tokens) => {
      resultTokens = resultTokens.concat(tokens)
    })
    .on('end', () => {
      console.log(JSON.stringify(resultTokens, null, 2))
    })

  streamTreeConstructor
    .on('data', (ast) => {
      resultAst = ast
    })
    .on('end', () => {
      console.log(util.inspect(resultAst, { showHidden: false, depth: null }))
    })
}).on('error', (err) => {
  throw err
})
```
## Tokens
The kinds of tokens that Hyntax extracts from an HTML string are listed in [Tokenizer.TokenTypes.AnyTokenType](#TokenizerTokenTypesAnyTokenType).

Each token conforms to the [Tokenizer.Token](#TokenizerToken) interface.
## AST Format
The resulting syntax tree always has one top-level [Document Node](#ast-node-types), with optional child nodes nested within.
```javascript
{
  nodeType: TreeConstructor.NodeTypes.Document,
  content: {
    children: [
      {
        nodeType: TreeConstructor.NodeTypes.AnyNodeType,
        content: {…}
      },
      {
        nodeType: TreeConstructor.NodeTypes.AnyNodeType,
        content: {…}
      }
    ]
  }
}
```
The content of each node is specific to the node's type. All node types are described in the [AST Node Types](#ast-node-types) reference.
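
Because every node exposes `nodeType` and `content`, and nested nodes live in `content.children`, the tree can be traversed with a short recursive function. The sketch below is not part of Hyntax: `walk` and `sampleAst` are hypothetical names, the tree is written by hand, and the `nodeType` strings are placeholders standing in for the `TreeConstructor.NodeTypes` values.

```javascript
// Hypothetical helper, not part of Hyntax: depth-first traversal of an AST.
function walk (node, visit, depth = 0) {
  visit(node, depth)

  // Per the content interfaces, only Document and Tag nodes carry children,
  // and the field may be absent.
  const children = (node.content && node.content.children) || []

  for (const child of children) {
    walk(child, visit, depth + 1)
  }
}

// Hand-written tree in the shape shown above. The nodeType strings are
// placeholders, not necessarily the values Hyntax uses.
const sampleAst = {
  nodeType: 'document',
  content: {
    children: [
      { nodeType: 'doctype', content: {} },
      {
        nodeType: 'tag',
        content: {
          children: [
            { nodeType: 'text', content: {} }
          ]
        }
      }
    ]
  }
}

// Record one indented line per visited node.
const visited = []
walk(sampleAst, (node, depth) => {
  visited.push(`${'  '.repeat(depth)}${node.nodeType}`)
})

console.log(visited.join('\n'))
```

The same `walk` function should work on the `ast` returned by `constructTree`, provided the tree follows the structure described in this section.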
## API Reference
### Tokenizer
Hyntax provides its tokenizer as a separate module. You can use the generated tokens on their own or pass them on to the tree constructor to build an AST.
#### Interface
```typescript
tokenize(html: string): Tokenizer.Result
```
#### Arguments
- `html`
  HTML string to process.
  Required.
  Type: string.
#### Returns [Tokenizer.Result](#TokenizerResult)
### Tree Constructor
Once you have an array of tokens, you can pass it to the tree constructor to build an AST.
#### Interface
```typescript
constructTree(tokens: Tokenizer.AnyToken[]): TreeConstructor.Result
```
#### Arguments
- `tokens`
  Array of tokens received from the tokenizer.
  Required.
  Type: [Tokenizer.AnyToken[]](#tokenizeranytoken)
#### Returns [TreeConstructor.Result](#TreeConstructorResult)
## Types Reference
#### Tokenizer.Result
```typescript
interface Result {
  state: Tokenizer.State
  tokens: Tokenizer.AnyToken[]
}
```
- `state`
  The current state of the tokenizer. It can be persisted and passed to the next tokenizer call if the input arrives in chunks.
- `tokens`
  Array of resulting tokens.
  Type: [Tokenizer.AnyToken[]](#tokenizeranytoken)
#### TreeConstructor.Result
```typescript
interface Result {
  state: State
  ast: AST
}
```
- `state`
  The current state of the tree constructor. It can be persisted and passed to the next tree constructor call when tokens arrive in chunks.
- `ast`
  Resulting AST.
  Type: [TreeConstructor.AST](#treeconstructorast)
#### Tokenizer.Token
Generic Token. Other interfaces use it to create a specific Token type.
```typescript
interface Token<T extends TokenTypes.AnyTokenType> {
  type: T
  content: string
  startPosition: number
  endPosition: number
}
```
- `type`
  One of the [Token types](#TokenizerTokenTypesAnyTokenType).
- `content`
  Piece of the original HTML string which was recognized as a token.
- `startPosition`
  Index of a character in the input HTML string where the token starts.
- `endPosition`
  Index of a character in the input HTML string where the token ends.
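
The position fields make it possible to map a token back to the input string. The sketch below is a hypothetical helper, not part of Hyntax, and it rests on one assumption that the descriptions above suggest but do not state outright: `endPosition` is inclusive, meaning it points at the token's last character rather than one past it. The token is written by hand and its `type` string is a placeholder.

```javascript
// Hypothetical helper, not part of Hyntax.
// Assumes endPosition is inclusive (index of the token's last character).
function sliceToken (html, token) {
  return html.slice(token.startPosition, token.endPosition + 1)
}

const html = '<p>Hi</p>'

// Hand-written token in the Tokenizer.Token shape.
// The type string is a placeholder, not necessarily the value Hyntax uses.
const textToken = {
  type: 'text',
  content: 'Hi',
  startPosition: 3,
  endPosition: 4
}

// Under the inclusive assumption, the slice equals the token's content.
console.log(sliceToken(html, textToken) === textToken.content)
```

If the comparison fails on real tokenizer output, the inclusive assumption does not hold and the `+ 1` should be dropped.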
#### Tokenizer.TokenTypes.AnyTokenType
Shortcut type of all possible tokens.
```typescript
type AnyTokenType =
  | Text
  | OpenTagStart
  | AttributeKey
  | AttributeAssigment
  | AttributeValueWrapperStart
  | AttributeValue
  | AttributeValueWrapperEnd
  | OpenTagEnd
  | CloseTag
  | OpenTagStartScript
  | ScriptTagContent
  | OpenTagEndScript
  | CloseTagScript
  | OpenTagStartStyle
  | StyleTagContent
  | OpenTagEndStyle
  | CloseTagStyle
  | DoctypeStart
  | DoctypeEnd
  | DoctypeAttributeWrapperStart
  | DoctypeAttribute
  | DoctypeAttributeWrapperEnd
  | CommentStart
  | CommentContent
  | CommentEnd
```
#### Tokenizer.AnyToken
Shortcut to reference any possible token.
```typescript
type AnyToken = Token<TokenTypes.AnyTokenType>
```
#### TreeConstructor.AST
An alias for DocumentNode. The AST always has one top-level DocumentNode. See [AST Node Types](#ast-node-types).
```typescript
type AST = TreeConstructor.DocumentNode
```
### AST Node Types
There are 7 possible types of Node. Each type has a specific content.
```typescript
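type DocumentNode = Node<NodeTypes.Document, NodeContents.Document>
```
```typescript
type DoctypeNode = Node<NodeTypes.Doctype, NodeContents.Doctype>
```
```typescript
type TextNode = Node<NodeTypes.Text, NodeContents.Text>
```
```typescript
type TagNode = Node<NodeTypes.Tag, NodeContents.Tag>
```
```typescript
type CommentNode = Node<NodeTypes.Comment, NodeContents.Comment>
```
```typescript
type ScriptNode = Node<NodeTypes.Script, NodeContents.Script>
```
```typescript
type StyleNode = Node<NodeTypes.Style, NodeContents.Style>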
```
Interfaces for each content type:
- [Document](#TreeConstructorNodeContentsDocument)
- [Doctype](#TreeConstructorNodeContentsDoctype)
- [Text](#TreeConstructorNodeContentsText)
- [Tag](#TreeConstructorNodeContentsTag)
- [Comment](#TreeConstructorNodeContentsComment)
- [Script](#TreeConstructorNodeContentsScript)
- [Style](#TreeConstructorNodeContentsStyle)
#### TreeConstructor.Node
Generic Node. Other interfaces use it to create specific Nodes by providing the type of the Node and the type of the content inside it.
```typescript
interface Node<T extends NodeTypes.AnyNodeType, C extends NodeContents.AnyNodeContent> {
  nodeType: T
  content: C
}
```
#### TreeConstructor.NodeTypes.AnyNodeType
Shortcut type of all possible Node types.
```typescript
type AnyNodeType =
  | Document
  | Doctype
  | Tag
  | Text
  | Comment
  | Script
  | Style
```
### Node Content Types
#### TreeConstructor.NodeContents.AnyNodeContent
Shortcut type of all possible types of content inside a Node.
```typescript
type AnyNodeContent =
  | Document
  | Doctype
  | Text
  | Tag
  | Comment
  | Script
  | Style
```
#### TreeConstructor.NodeContents.Document
```typescript
interface Document {
  children: AnyNode[]
}
```
#### TreeConstructor.NodeContents.Doctype
```typescript
interface Doctype {
  start: Tokenizer.Token
  attributes?: DoctypeAttribute[]
  end: Tokenizer.Token
}
```
#### TreeConstructor.NodeContents.Text
```typescript
interface Text {
  value: Tokenizer.Token
}
```
#### TreeConstructor.NodeContents.Tag
```typescript
interface Tag {
  name: string
  selfClosing: boolean
  openStart: Tokenizer.Token
  attributes?: TagAttribute[]
  openEnd: Tokenizer.Token
  children?: AnyNode[]
  close?: Tokenizer.Token
}
```
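
Since a Tag's content carries both `name` and optional `children`, finding every tag with a given name is a recursive search. The sketch below is a hypothetical helper, not part of Hyntax. The value used to recognize Tag nodes is passed in as `tagType`, because the concrete value behind `TreeConstructor.NodeTypes.Tag` is not shown in this document; the `'tag'`, `'text'`, and `'document'` strings in the sample tree are placeholders, and the token fields are left out for brevity.

```javascript
// Hypothetical helper, not part of Hyntax: collect Tag nodes with a given name.
// tagType is whatever value the library uses for TreeConstructor.NodeTypes.Tag.
function findTags (node, name, tagType, found = []) {
  if (node.nodeType === tagType && node.content.name === name) {
    found.push(node)
  }

  // children is optional on Tag content and absent on most other node types.
  const children = node.content.children || []

  for (const child of children) {
    findTags(child, name, tagType, found)
  }

  return found
}

// Hand-written tree; nodeType strings are placeholders and token fields
// (openStart, openEnd, close) are omitted to keep the sample short.
const tree = {
  nodeType: 'document',
  content: {
    children: [
      {
        nodeType: 'tag',
        content: {
          name: 'div',
          selfClosing: false,
          children: [
            { nodeType: 'tag', content: { name: 'a', selfClosing: false } },
            { nodeType: 'text', content: {} },
            {
              nodeType: 'tag',
              content: {
                name: 'span',
                selfClosing: false,
                children: [
                  { nodeType: 'tag', content: { name: 'a', selfClosing: false } }
                ]
              }
            }
          ]
        }
      }
    ]
  }
}

const links = findTags(tree, 'a', 'tag')
console.log(links.length)
```

The name comparison is exact; whether Hyntax normalizes the case of tag names is not covered here, so lowercase both sides if that matters for your input.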
#### TreeConstructor.NodeContents.Comment
```typescript
interface Comment {
  start: Tokenizer.Token
  value: Tokenizer.Token
  end: Tokenizer.Token
}
```
#### TreeConstructor.NodeContents.Script
```typescript
interface Script {
  openStart: Tokenizer.Token
  attributes?: TagAttribute[]
  openEnd: Tokenizer.Token
  value: Tokenizer.Token
  close: Tokenizer.Token
}
```
#### TreeConstructor.NodeContents.Style
```typescript
interface Style {
  openStart: Tokenizer.Token
  attributes?: TagAttribute[]
  openEnd: Tokenizer.Token
  value: Tokenizer.Token
  close: Tokenizer.Token
}
```
#### TreeConstructor.DoctypeAttribute
```typescript
interface DoctypeAttribute {
  startWrapper?: Tokenizer.Token
  value: Tokenizer.Token
  endWrapper?: Tokenizer.Token
}
```
#### TreeConstructor.TagAttribute
```typescript
interface TagAttribute {
  key?: Tokenizer.Token
  startWrapper?: Tokenizer.Token
  value?: Tokenizer.Token
  endWrapper?: Tokenizer.Token
}
```
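
Every field of `TagAttribute` is optional, so code that reads attributes has to handle a missing `key` or `value`. As a minimal sketch, the hypothetical helper below (not part of Hyntax) flattens an `attributes` array into a plain object using each token's `content`. The sample tokens are written by hand, and the `token` factory fills in only `content`, leaving out `type`, `startPosition`, and `endPosition`.

```javascript
// Hypothetical helper, not part of Hyntax: flatten TagAttribute[] into an object.
function attributesToObject (attributes = []) {
  const result = {}

  for (const attribute of attributes) {
    // key is optional; skip attributes that have none.
    if (!attribute.key) continue

    // value is optional too, e.g. for a bare attribute such as `disabled`.
    result[attribute.key.content] = attribute.value
      ? attribute.value.content
      : null
  }

  return result
}

// Illustration-only token factory: real tokens also carry type,
// startPosition and endPosition.
const token = (content) => ({ content })

// Hand-written attributes for something like <input type="text" disabled>.
const attributes = [
  {
    key: token('type'),
    startWrapper: token('"'),
    value: token('text'),
    endWrapper: token('"')
  },
  { key: token('disabled') }
]

const plain = attributesToObject(attributes)
console.log(plain)
```

Mapping a missing value to `null` is a design choice of this sketch; an empty string or `true` may suit other use cases better.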