Tokenizing Mechanics

Tokenization for all input. Translates a string into an iterable TexSoup.utils.Buffer, yielding one token at a time.
Tokenizer
TexSoup.tokens.tokenize(text)

    Generator for LaTeX tokens on text, ignoring comments.

    Parameters: text (Union[str,iterator,Buffer]) – LaTeX to process

    >>> print(*tokenize(categorize(r'\\%}')))
    \\ %}
    >>> print(*tokenize(categorize(r'\textbf{hello \\%}')))
    \ textbf { hello \\ %}
    >>> print(*tokenize(categorize(r'\textbf{Do play \textit{nice}.}')))
    \ textbf { Do play \ textit { nice } . }
    >>> print(*tokenize(categorize(r'\begin{tabular} 0 & 1 \\ 2 & 0 \end{tabular}')))
    \ begin { tabular } 0 & 1 \\ 2 & 0 \ end { tabular }
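For orientation, here is a minimal driver sketch. It assumes categorize is importable from TexSoup.category, as the doctests above imply; it converts raw text into a Buffer of categorized characters before tokenization.

    # Minimal sketch: drive the tokenizer over a LaTeX snippet.
    # Assumption: `categorize` is exposed by TexSoup.category; adjust
    # the import to wherever your TexSoup version provides it.
    from TexSoup.category import categorize
    from TexSoup.tokens import tokenize

    for token in tokenize(categorize(r'\textbf{hello \\%}')):
        print(repr(token))  # each token prints as a string-like value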
TexSoup.tokens.next_token(text, prev=None)

    Returns the next possible token, advancing the iterator to the next position to start processing from.

    Parameters: text (Union[str,iterator,Buffer]) – LaTeX to process
    Returns: str – the token

    >>> b = categorize(r'\textbf{Do play\textit{nice}.} $$\min_w \|w\|_2^2$$')
    >>> print(next_token(b), next_token(b), next_token(b), next_token(b))
    \ textbf { Do play
    >>> print(next_token(b), next_token(b), next_token(b), next_token(b))
    \ textit { nice
    >>> print(next_token(b))
    }
    >>> print(next_token(categorize('.}')))
    .
    >>> next_token(b)
    '.'
    >>> next_token(b)
    '}'
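Because next_token consumes from the same Buffer that the doctests build with categorize, it can be interleaved with Buffer inspection such as peek. A small sketch, under the same assumed import path as above:

    from TexSoup.category import categorize  # assumed import path
    from TexSoup.tokens import next_token

    b = categorize(r'\textbf{hi}')
    print(next_token(b))  # the escape '\' comes back as its own token
    print(b.peek())       # the buffer now sits at the command name
    print(next_token(b))  # the command name token follows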
Escape Tokenizer
TexSoup.tokens.tokenize_escaped_symbols(text, prev=None)

    Process an escaped symbol or a known punctuation command.

    Parameters: text (Buffer) – iterator over line, with current position

    >>> tokenize_escaped_symbols(categorize(r'\\'))
    '\\\\'
    >>> tokenize_escaped_symbols(categorize(r'\\%'))
    '\\\\'
    >>> tokenize_escaped_symbols(categorize(r'\}'))
    '\\}'
    >>> tokenize_escaped_symbols(categorize(r'\%'))
    '\\%'
    >>> tokenize_escaped_symbols(categorize(r'\ %'))  # not even one spacer is allowed
TexSoup.tokens.tokenize_line_comment(text, prev=None)

    Process a line comment.

    Parameters: text (Buffer) – iterator over line, with current position

    >>> tokenize_line_comment(categorize('%hello world\\'))
    '%hello world\\'
    >>> tokenize_line_comment(categorize('hello %world'))
    >>> tokenize_line_comment(categorize('%}hello world'))
    '%}hello world'
    >>> tokenize_line_comment(categorize('%} '))
    '%} '
    >>> tokenize_line_comment(categorize('%hello\n world'))
    '%hello'
    >>> b = categorize(r'\\%')
    >>> _ = next(b), next(b)
    >>> tokenize_line_comment(b)
    '%'
    >>> tokenize_line_comment(categorize(r'\%'))
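To see how a comment interacts with surrounding text, it can help to run the full tokenizer over a line containing an unescaped %. A hedged sketch; exact token boundaries may vary by version:

    from TexSoup.category import categorize  # assumed import path
    from TexSoup.tokens import tokenize

    # Everything from the unescaped % to the end of the line should
    # come back as one comment token; \% remains ordinary text.
    print(list(tokenize(categorize('hello %world\nbye'))))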
Math Tokenizer
TexSoup.tokens.tokenize_math_sym_switch(text, prev=None)

    Group characters in math switches.

    Parameters: text (Buffer) – iterator over line, with current position

    >>> tokenize_math_sym_switch(categorize(r'$\min_x$ \command'))
    '$'
    >>> tokenize_math_sym_switch(categorize(r'$$\min_x$$ \command'))
    '$$'
TexSoup.tokens.tokenize_math_asym_switch(text, prev=None)

    Group characters in begin-end-style math switches.

    Parameters: text (Buffer) – iterator over line, with current position

    >>> tokenize_math_asym_switch(categorize(r'\[asf'))
    '\\['
    >>> tokenize_math_asym_switch(categorize(r'\] sdf'))
    '\\]'
    >>> tokenize_math_asym_switch(categorize(r'[]'))
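Both switch styles can also be observed in context by tokenizing complete snippets. A sketch with outputs left unasserted, since the grouping of the math body may differ by version:

    from TexSoup.category import categorize  # assumed import path
    from TexSoup.tokens import tokenize

    # '$...$' and '$$...$$' delimiters are handled by the symmetric
    # switch tokenizer; '\[' and '\]' by the asymmetric one.
    print(*tokenize(categorize(r'$x^2$')))
    print(*tokenize(categorize(r'\[ x^2 \]')))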
TexSoup.tokens.tokenize_punctuation_command_name(text, prev=None)

    Process a command that augments or modifies punctuation.

    This is important to the tokenization of a string, as opening or closing punctuation that belongs to such a command is not supposed to match: for example, the ( in \left( must not be paired with a later ).

    Parameters: text (Buffer) – iterator over text, with current position
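There is no doctest for this entry, but the canonical case is \left( ... \right). A hedged sketch, assuming \left and \right are among the recognized punctuation commands:

    from TexSoup.category import categorize  # assumed import path
    from TexSoup.tokens import tokenize

    # The '(' should travel with \left rather than being matched as
    # opening punctuation against the later ')'.
    print(*tokenize(categorize(r'\left( x \right)')))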
Command Tokenizer
TexSoup.tokens.tokenize_command_name(text, prev=None)

    Extract the most restrictive possible subset for the command name. The parser can later join allowed spacers and macros to assemble the final command name and arguments.

    >>> b = categorize(r'\bf{')
    >>> _ = next(b)
    >>> tokenize_command_name(b)
    'bf'
    >>> b = categorize(r'\bf,')
    >>> _ = next(b)
    >>> tokenize_command_name(b)
    'bf'
    >>> b = categorize(r'\bf*{')
    >>> _ = next(b)
    >>> tokenize_command_name(b)
    'bf*'
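The same splitting is visible in the full token stream; for instance, a starred command keeps its star inside the name token. A sketch, output left unasserted:

    from TexSoup.category import categorize  # assumed import path
    from TexSoup.tokens import tokenize

    # Expect roughly: \ bf* { x }
    print(*tokenize(categorize(r'\bf*{x}')))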
Text Tokenizer
TexSoup.tokens.tokenize_symbols(text, prev=None)

    Process singleton symbols as standalone tokens.

    Parameters: text (Buffer) – iterator over line, with current position. The escape character is isolated if it is not part of an escaped character.

    >>> next(tokenize(categorize(r'\begin turing')))
    '\\'
    >>> next(tokenize(categorize(r'\bf {turing}')))
    '\\'
    >>> next(tokenize(categorize(r'{]}'))).category
    <TokenCode.GroupBegin: 23>
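As the last doctest shows, tokens carry a .category attribute (a TokenCode) that downstream parsing uses to tell, say, group delimiters from plain text. A sketch:

    from TexSoup.category import categorize  # assumed import path
    from TexSoup.tokens import tokenize

    for tok in tokenize(categorize(r'{]}')):
        # Each token pairs its text with a TokenCode category,
        # e.g. GroupBegin for '{'.
        print(repr(tok), tok.category)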
TexSoup.tokens.tokenize_string(text, prev=None)

    Process a string of text.

    Parameters: text (Buffer) – iterator over line, with current position

    >>> tokenize_string(categorize('hello'))
    'hello'
    >>> b = categorize(r'hello again\command')
    >>> tokenize_string(b)
    'hello again'
    >>> print(b.peek())
    \
    >>> print(tokenize_string(categorize(r'0 & 1\\\command')))
    0 & 1