Tokenizing Mechanics

Tokenization for all input. Translates a string into an iterable TexSoup.utils.Buffer, yielding one token at a time.
Tokenizer
TexSoup.tokens.tokenize(text)

    Generator for LaTeX tokens on text, ignoring comments.

    Parameters: text (Union[str,iterator,Buffer]) – LaTeX to process

    >>> print(*tokenize(categorize(r'\\%}')))
    \\ %}
    >>> print(*tokenize(categorize(r'\textbf{hello \\%}')))
    \ textbf { hello \\ %}
    >>> print(*tokenize(categorize(r'\textbf{Do play \textit{nice}.}')))
    \ textbf { Do play \ textit { nice } . }
    >>> print(*tokenize(categorize(r'\begin{tabular} 0 & 1 \\ 2 & 0 \end{tabular}')))
    \ begin { tabular } 0 & 1 \\ 2 & 0 \ end { tabular }
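For orientation, here is a minimal driver sketch. It assumes categorize is importable from TexSoup.category, as the doctests above imply; it converts raw text into a Buffer of categorized characters before tokenization.

    # Minimal sketch: drive the tokenizer over a LaTeX snippet.
    # Assumption: `categorize` is exposed by TexSoup.category; adjust
    # the import to wherever your TexSoup version provides it.
    from TexSoup.category import categorize
    from TexSoup.tokens import tokenize

    for token in tokenize(categorize(r'\textbf{hello \\%}')):
        print(repr(token))  # each token prints as a string-like value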
TexSoup.tokens.next_token(text, prev=None)

    Returns the next possible token, advancing the iterator to the next position to start processing from.

    Parameters: text (Union[str,iterator,Buffer]) – LaTeX to process
    Returns: str – the token

    >>> b = categorize(r'\textbf{Do play\textit{nice}.} $$\min_w \|w\|_2^2$$')
    >>> print(next_token(b), next_token(b), next_token(b), next_token(b))
    \ textbf { Do play
    >>> print(next_token(b), next_token(b), next_token(b), next_token(b))
    \ textit { nice
    >>> print(next_token(b))
    }
    >>> print(next_token(categorize('.}')))
    .
    >>> next_token(b)
    '.'
    >>> next_token(b)
    '}'
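Because next_token consumes from the same Buffer that the doctests build with categorize, it can be interleaved with Buffer inspection such as peek. A small sketch, under the same assumed import path as above:

    from TexSoup.category import categorize  # assumed import path
    from TexSoup.tokens import next_token

    b = categorize(r'\textbf{hi}')
    print(next_token(b))  # the escape '\' comes back as its own token
    print(b.peek())       # the buffer now sits at the command name
    print(next_token(b))  # the command name token follows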
Escape Tokenizer
TexSoup.tokens.tokenize_escaped_symbols(text, prev=None)

    Process an escaped symbol or a known punctuation command.

    Parameters: text (Buffer) – iterator over line, with current position

    >>> tokenize_escaped_symbols(categorize(r'\\'))
    '\\\\'
    >>> tokenize_escaped_symbols(categorize(r'\\%'))
    '\\\\'
    >>> tokenize_escaped_symbols(categorize(r'\}'))
    '\\}'
    >>> tokenize_escaped_symbols(categorize(r'\%'))
    '\\%'
    >>> tokenize_escaped_symbols(categorize(r'\ %'))  # not even one spacer is allowed
TexSoup.tokens.tokenize_line_comment(text, prev=None)

    Process a line comment.

    Parameters: text (Buffer) – iterator over line, with current position

    >>> tokenize_line_comment(categorize('%hello world\\'))
    '%hello world\\'
    >>> tokenize_line_comment(categorize('hello %world'))
    >>> tokenize_line_comment(categorize('%}hello world'))
    '%}hello world'
    >>> tokenize_line_comment(categorize('%} '))
    '%} '
    >>> tokenize_line_comment(categorize('%hello\n world'))
    '%hello'
    >>> b = categorize(r'\\%')
    >>> _ = next(b), next(b)
    >>> tokenize_line_comment(b)
    '%'
    >>> tokenize_line_comment(categorize(r'\%'))
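To see how a comment interacts with surrounding text, it can help to run the full tokenizer over a line containing an unescaped %. A hedged sketch; exact token boundaries may vary by version:

    from TexSoup.category import categorize  # assumed import path
    from TexSoup.tokens import tokenize

    # Everything from the unescaped % to the end of the line should
    # come back as one comment token; \% remains ordinary text.
    print(list(tokenize(categorize('hello %world\nbye'))))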
Math Tokenizer
TexSoup.tokens.tokenize_math_sym_switch(text, prev=None)

    Group characters in math switches.

    Parameters: text (Buffer) – iterator over line, with current position

    >>> tokenize_math_sym_switch(categorize(r'$\min_x$ \command'))
    '$'
    >>> tokenize_math_sym_switch(categorize(r'$$\min_x$$ \command'))
    '$$'
TexSoup.tokens.tokenize_math_asym_switch(text, prev=None)

    Group characters in begin-end-style math switches.

    Parameters: text (Buffer) – iterator over line, with current position

    >>> tokenize_math_asym_switch(categorize(r'\[asf'))
    '\\['
    >>> tokenize_math_asym_switch(categorize(r'\] sdf'))
    '\\]'
    >>> tokenize_math_asym_switch(categorize(r'[]'))
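Both switch styles can also be observed in context by tokenizing complete snippets. A sketch with outputs left unasserted, since the grouping of the math body may differ by version:

    from TexSoup.category import categorize  # assumed import path
    from TexSoup.tokens import tokenize

    # '$...$' and '$$...$$' delimiters are handled by the symmetric
    # switch tokenizer; '\[' and '\]' by the asymmetric one.
    print(*tokenize(categorize(r'$x^2$')))
    print(*tokenize(categorize(r'\[ x^2 \]')))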
TexSoup.tokens.tokenize_punctuation_command_name(text, prev=None)

    Process a command that augments or modifies punctuation.

    This is important to the tokenization of a string, as opening or closing punctuation that belongs to such a command is not supposed to match: for example, the ( in \left( must not be paired with a later ).

    Parameters: text (Buffer) – iterator over text, with current position
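There is no doctest for this entry, but the canonical case is \left( ... \right). A hedged sketch, assuming \left and \right are among the recognized punctuation commands:

    from TexSoup.category import categorize  # assumed import path
    from TexSoup.tokens import tokenize

    # The '(' should travel with \left rather than being matched as
    # opening punctuation against the later ')'.
    print(*tokenize(categorize(r'\left( x \right)')))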
Command Tokenizer
TexSoup.tokens.tokenize_command_name(text, prev=None)

    Extract the most restrictive possible subset for the command name. The parser can later join allowed spacers and macros to assemble the final command name and arguments.

    >>> b = categorize(r'\bf{')
    >>> _ = next(b)
    >>> tokenize_command_name(b)
    'bf'
    >>> b = categorize(r'\bf,')
    >>> _ = next(b)
    >>> tokenize_command_name(b)
    'bf'
    >>> b = categorize(r'\bf*{')
    >>> _ = next(b)
    >>> tokenize_command_name(b)
    'bf*'
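The same splitting is visible in the full token stream; for instance, a starred command keeps its star inside the name token. A sketch, output left unasserted:

    from TexSoup.category import categorize  # assumed import path
    from TexSoup.tokens import tokenize

    # Expect roughly: \ bf* { x }
    print(*tokenize(categorize(r'\bf*{x}')))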
Text Tokenizer
TexSoup.tokens.tokenize_symbols(text, prev=None)

    Process singleton symbols as standalone tokens.

    Parameters: text (Buffer) – iterator over line, with current position. The escape character is isolated if it is not part of an escaped character.

    >>> next(tokenize(categorize(r'\begin turing')))
    '\\'
    >>> next(tokenize(categorize(r'\bf {turing}')))
    '\\'
    >>> next(tokenize(categorize(r'{]}'))).category
    <TokenCode.GroupBegin: 23>
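As the last doctest shows, tokens carry a .category attribute (a TokenCode) that downstream parsing uses to tell, say, group delimiters from plain text. A sketch:

    from TexSoup.category import categorize  # assumed import path
    from TexSoup.tokens import tokenize

    for tok in tokenize(categorize(r'{]}')):
        # Each token pairs its text with a TokenCode category,
        # e.g. GroupBegin for '{'.
        print(repr(tok), tok.category)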
TexSoup.tokens.tokenize_string(text, prev=None)

    Process a string of text.

    Parameters: text (Buffer) – iterator over line, with current position

    >>> tokenize_string(categorize('hello'))
    'hello'
    >>> b = categorize(r'hello again\command')
    >>> tokenize_string(b)
    'hello again'
    >>> print(b.peek())
    \
    >>> print(tokenize_string(categorize(r'0 & 1\\\command')))
    0 & 1