Files
rappaurio-sae501_502/app/node_modules/hyntax/README.md
2023-09-25 09:41:55 +02:00

464 lines
10 KiB
Markdown

<p align="center">
<img src="./logo.png" alt="Hyntax project logo — lego bricks in the shape of a capital letter H" width="150">
</p>
# Hyntax
Straightforward HTML parser for JavaScript. [Live Demo](https://astexplorer.net/#/gist/6bf7f78077333cff124e619aebfb5b42/latest).
- **Simple.** API is straightforward, output is clear.
- **Forgiving.** Just like a browser, normally parses invalid HTML.
- **Supports streaming.** Can process HTML while it's still being loaded.
- **No dependencies.**
## Table Of Contents
- [Usage](#usage)
- [TypeScript Typings](#typescript-typings)
- [Streaming](#streaming)
- [Tokens](#tokens)
- [AST Format](#ast-format)
- [API Reference](#api-reference)
- [Types Reference](#types-reference)
## Usage
```bash
npm install hyntax
```
```javascript
const { tokenize, constructTree } = require('hyntax')
const util = require('util')
const inputHTML = `
<html>
<body>
<input type="text" placeholder="Don't type">
<button>Don't press</button>
</body>
</html>
`
const { tokens } = tokenize(inputHTML)
const { ast } = constructTree(tokens)
console.log(JSON.stringify(tokens, null, 2))
console.log(util.inspect(ast, { showHidden: false, depth: null }))
```
## TypeScript Typings
Hyntax is written in JavaScript but has [integrated TypeScript typings](./index.d.ts) to help you navigate around its data structures. There is also [Types Reference](#types-reference) which covers most common types.
## Streaming
Use `StreamTokenizer` and `StreamTreeConstructor` classes to parse HTML chunk by chunk while it's still being loaded from the network or read from the disk.
```javascript
const { StreamTokenizer, StreamTreeConstructor } = require('hyntax')
const http = require('http')
const util = require('util')
http.get('http://info.cern.ch', (res) => {
const streamTokenizer = new StreamTokenizer()
const streamTreeConstructor = new StreamTreeConstructor()
let resultTokens = []
let resultAst
res.pipe(streamTokenizer).pipe(streamTreeConstructor)
streamTokenizer
.on('data', (tokens) => {
resultTokens = resultTokens.concat(tokens)
})
.on('end', () => {
console.log(JSON.stringify(resultTokens, null, 2))
})
streamTreeConstructor
.on('data', (ast) => {
resultAst = ast
})
.on('end', () => {
console.log(util.inspect(resultAst, { showHidden: false, depth: null }))
})
}).on('error', (err) => {
throw err;
})
```
## Tokens
Here are all kinds of tokens which Hyntax will extract out of HTML string.
![Overview of all possible tokens](./tokens-list.png)
Each token conforms to [Tokenizer.Token](#TokenizerToken) interface.
## AST Format
Resulting syntax tree will have at least one top-level [Document Node](#ast-node-types) with optional children nodes nested within.
<!-- You can play around with the [AST Explorer](https://astexplorer.net) to see how AST looks like. -->
```javascript
{
nodeType: TreeConstructor.NodeTypes.Document,
content: {
children: [
{
nodeType: TreeConstructor.NodeTypes.AnyNodeType,
content: {}
},
{
nodeType: TreeConstructor.NodeTypes.AnyNodeType,
content: {}
}
]
}
}
```
Content of each node is specific to node's type, all of them are described in [AST Node Types](#ast-node-types) reference.
## API Reference
### Tokenizer
Hyntax has its tokenizer as a separate module. You can use generated tokens on their own or pass them further to a tree constructor to build an AST.
#### Interface
```typescript
tokenize(html: String): Tokenizer.Result
```
#### Arguments
- `html`
HTML string to process
Required.
Type: string.
#### Returns [Tokenizer.Result](#TokenizerResult)
### Tree Constructor
After you've got an array of tokens, you can pass them into tree constructor to build an AST.
#### Interface
```typescript
constructTree(tokens: Tokenizer.AnyToken[]): TreeConstructor.Result
```
#### Arguments
- `tokens`
Array of tokens received from the tokenizer.
Required.
Type: [Tokenizer.AnyToken[]](#tokenizeranytoken)
#### Returns [TreeConstructor.Result](#TreeConstructorResult)
## Types Reference
#### Tokenizer.Result
```typescript
interface Result {
state: Tokenizer.State
tokens: Tokenizer.AnyToken[]
}
```
- `state`
The current state of tokenizer. It can be persisted and passed to the next tokenizer call if the input is coming in chunks.
- `tokens`
Array of resulting tokens.
Type: [Tokenizer.AnyToken[]](#tokenizeranytoken)
#### TreeConstructor.Result
```typescript
interface Result {
state: State
ast: AST
}
```
- `state`
The current state of the tree constructor. Can be persisted and passed to the next tree constructor call in case when tokens are coming in chunks.
- `ast`
Resulting AST.
Type: [TreeConstructor.AST](#treeconstructorast)
#### Tokenizer.Token
Generic Token, other interfaces use it to create a specific Token type.
```typescript
interface Token<T extends TokenTypes.AnyTokenType> {
type: T
content: string
startPosition: number
endPosition: number
}
```
- `type`
One of the [Token types](#TokenizerTokenTypesAnyTokenType).
- `content `
Piece of original HTML string which was recognized as a token.
- `startPosition `
Index of a character in the input HTML string where the token starts.
- `endPosition`
Index of a character in the input HTML string where the token ends.
#### Tokenizer.TokenTypes.AnyTokenType
Shortcut type of all possible tokens.
```typescript
type AnyTokenType =
| Text
| OpenTagStart
| AttributeKey
| AttributeAssigment
| AttributeValueWrapperStart
| AttributeValue
| AttributeValueWrapperEnd
| OpenTagEnd
| CloseTag
| OpenTagStartScript
| ScriptTagContent
| OpenTagEndScript
| CloseTagScript
| OpenTagStartStyle
| StyleTagContent
| OpenTagEndStyle
| CloseTagStyle
| DoctypeStart
| DoctypeEnd
| DoctypeAttributeWrapperStart
| DoctypeAttribute
| DoctypeAttributeWrapperEnd
| CommentStart
| CommentContent
| CommentEnd
```
#### Tokenizer.AnyToken
Shortcut to reference any possible token.
```typescript
type AnyToken = Token<TokenTypes.AnyTokenType>
```
#### TreeConstructor.AST
Just an alias to DocumentNode. AST always has one top-level DocumentNode. See [AST Node Types](#ast-node-types)
```typescript
type AST = TreeConstructor.DocumentNode
```
### AST Node Types
There are 7 possible types of Node. Each type has a specific content.
```typescript
type DocumentNode = Node<NodeTypes.Document, NodeContents.Document>
```
```typescript
type DoctypeNode = Node<NodeTypes.Doctype, NodeContents.Doctype>
```
```typescript
type TextNode = Node<NodeTypes.Text, NodeContents.Text>
```
```typescript
type TagNode = Node<NodeTypes.Tag, NodeContents.Tag>
```
```typescript
type CommentNode = Node<NodeTypes.Comment, NodeContents.Comment>
```
```typescript
type ScriptNode = Node<NodeTypes.Script, NodeContents.Script>
```
```typescript
type StyleNode = Node<NodeTypes.Style, NodeContents.Style>
```
Interfaces for each content type:
- [Document](#TreeConstructorNodeContentsDocument)
- [Doctype](#TreeConstructorNodeContentsDoctype)
- [Text](#TreeConstructorNodeContentsText)
- [Tag](#TreeConstructorNodeContentsTag)
- [Comment](#TreeConstructorNodeContentsComment)
- [Script](#TreeConstructorNodeContentsScript)
- [Style](#TreeConstructorNodeContentsStyle)
#### TreeConstructor.Node
Generic Node, other interfaces use it to create specific Nodes by providing type of Node and type of the content inside the Node.
```typescript
interface Node<T extends NodeTypes.AnyNodeType, C extends NodeContents.AnyNodeContent> {
nodeType: T
content: C
}
```
#### TreeConstructor.NodeTypes.AnyNodeType
Shortcut type of all possible Node types.
```typescript
type AnyNodeType =
| Document
| Doctype
| Tag
| Text
| Comment
| Script
| Style
```
### Node Content Types
#### TreeConstructor.NodeTypes.AnyNodeContent
Shortcut type of all possible types of content inside a Node.
```typescript
type AnyNodeContent =
| Document
| Doctype
| Text
| Tag
| Comment
| Script
| Style
```
#### TreeConstructor.NodeContents.Document
```typescript
interface Document {
children: AnyNode[]
}
```
#### TreeConstructor.NodeContents.Doctype
```typescript
interface Doctype {
start: Tokenizer.Token<Tokenizer.TokenTypes.DoctypeStart>
attributes?: DoctypeAttribute[]
end: Tokenizer.Token<Tokenizer.TokenTypes.DoctypeEnd>
}
```
#### TreeConstructor.NodeContents.Text
```typescript
interface Text {
value: Tokenizer.Token<Tokenizer.TokenTypes.Text>
}
```
#### TreeConstructor.NodeContents.Tag
```typescript
interface Tag {
name: string
selfClosing: boolean
openStart: Tokenizer.Token<Tokenizer.TokenTypes.OpenTagStart>
attributes?: TagAttribute[]
openEnd: Tokenizer.Token<Tokenizer.TokenTypes.OpenTagEnd>
children?: AnyNode[]
close?: Tokenizer.Token<Tokenizer.TokenTypes.CloseTag>
}
```
#### TreeConstructor.NodeContents.Comment
```typescript
interface Comment {
start: Tokenizer.Token<Tokenizer.TokenTypes.CommentStart>
value: Tokenizer.Token<Tokenizer.TokenTypes.CommentContent>
end: Tokenizer.Token<Tokenizer.TokenTypes.CommentEnd>
}
```
#### TreeConstructor.NodeContents.Script
```typescript
interface Script {
openStart: Tokenizer.Token<Tokenizer.TokenTypes.OpenTagStartScript>
attributes?: TagAttribute[]
openEnd: Tokenizer.Token<Tokenizer.TokenTypes.OpenTagEndScript>
value: Tokenizer.Token<Tokenizer.TokenTypes.ScriptTagContent>
close: Tokenizer.Token<Tokenizer.TokenTypes.CloseTagScript>
}
```
#### TreeConstructor.NodeContents.Style
```typescript
interface Style {
openStart: Tokenizer.Token<Tokenizer.TokenTypes.OpenTagStartStyle>,
attributes?: TagAttribute[],
openEnd: Tokenizer.Token<Tokenizer.TokenTypes.OpenTagEndStyle>,
value: Tokenizer.Token<Tokenizer.TokenTypes.StyleTagContent>,
close: Tokenizer.Token<Tokenizer.TokenTypes.CloseTagStyle>
}
```
#### TreeConstructor.DoctypeAttribute
```typescript
interface DoctypeAttribute {
startWrapper?: Tokenizer.Token<Tokenizer.TokenTypes.DoctypeAttributeWrapperStart>,
value: Tokenizer.Token<Tokenizer.TokenTypes.DoctypeAttribute>,
endWrapper?: Tokenizer.Token<Tokenizer.TokenTypes.DoctypeAttributeWrapperEnd>
}
```
#### TreeConstructor.TagAttribute
```typescript
interface TagAttribute {
key?: Tokenizer.Token<Tokenizer.TokenTypes.AttributeKey>,
startWrapper?: Tokenizer.Token<Tokenizer.TokenTypes.AttributeValueWrapperStart>,
value?: Tokenizer.Token<Tokenizer.TokenTypes.AttributeValue>,
endWrapper?: Tokenizer.Token<Tokenizer.TokenTypes.AttributeValueWrapperEnd>
}
```