Parsing Mechanics

The parsing functions documented here are internal and subject to change, so they should not be called directly by user code.

Parser

TexSoup.reader.read_tex(buf, skip_envs=(), tolerance=0)

Parse all expressions in buffer.

Parameters:
	buf (Buffer) – a buffer of tokens
	skip_envs (Tuple[str]) – environments to skip parsing
	tolerance (int) – error tolerance level (only supports 0 or 1)
Returns:	iterable over parsed expressions
Return type:	Iterable[TexExpr]

TexSoup.reader.read_expr(src, skip_envs=(), tolerance=0, mode='mode:non-math')

Environment Parser

TexSoup.reader.read_item(src, tolerance=0)

Read the item content. Assumes escape has just been parsed.

There can be any number of whitespace characters between \item and the first non-whitespace character. Any amount of whitespace between subsequent characters is also allowed.

\item can also take an argument.

Parameters:
	src (Buffer) – a buffer of tokens
	tolerance (int) – error tolerance level (only supports 0 or 1)
Returns:	contents of the item and any item arguments

>>> from TexSoup.category import categorize
>>> from TexSoup.tokens import tokenize
>>> from TexSoup.reader import read_item
>>> def read_item_from(string, skip=2):
...     buf = tokenize(categorize(string))
...     _ = buf.forward(skip)
...     return read_item(buf)
>>> read_item_from(r'\item aaa {bbb} ccc\end{itemize}')
[' aaa ', BraceGroup('bbb'), ' ccc']
>>> read_item_from(r'\item aaa \textbf{itemize}\item no')
[' aaa ', TexCmd('textbf', [BraceGroup('itemize')])]
>>> read_item_from(r'\item WITCH [nuuu] DOCTORRRR 👩🏻‍⚕️')
[' WITCH ', '[', 'nuuu', ']', ' DOCTORRRR 👩🏻‍⚕️']
>>> read_item_from(r'''\begin{itemize}
... \item
... \item first item
... \end{itemize}''', skip=8)
['\n']
>>> read_item_from(r'''\def\itemeqn{\item}''', skip=7)
[]
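The stopping rule shown in these examples (an item's content runs until the next \item or \end) can be sketched on plain strings without TexSoup. This is only an illustration: item_body is a hypothetical helper, not part of the library, and it works on raw text with a regular expression rather than on a token buffer, so it does not parse nested commands or groups.

```python
import re


def item_body(text):
    """Return the raw text between the first \\item and the next \\item or \\end.

    Hypothetical helper for illustration only; returns None if there is no \\item.
    """
    start = text.find(r'\item')
    if start < 0:
        return None
    rest = text[start + len(r'\item'):]
    # Stop at the next \item or \end, whichever comes first.
    stop = re.search(r'\\(item|end)\b', rest)
    return rest if stop is None else rest[:stop.start()]


print(repr(item_body(r'\item aaa {bbb} ccc\end{itemize}')))
print(repr(item_body(r'\item one \item two')))
```

Unlike read_item, this sketch returns one unparsed string instead of a list of text and group objects.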

TexSoup.reader.unclosed_env_handler(src, expr, end)

Handle unclosed environments.

Currently raises an end-of-file error. In the future, this can be the hub for unclosed-environment fault tolerance.

Parameters:
	src (Buffer) – a buffer of tokens
	expr (TexExpr) – expression for the environment
	end (str) – actual end token (as opposed to expected)

TexSoup.reader.read_math_env(src, expr, tolerance=0)

Read the environment from buffer.

Advances the buffer until right after the end of the environment. Adds parsed content to the expression automatically.

Parameters:
	src (Buffer) – a buffer of tokens
	expr (TexExpr) – expression for the environment
	tolerance (int) – error tolerance level (only supports 0 or 1)
Return type:	TexExpr

>>> from TexSoup.category import categorize
>>> from TexSoup.tokens import tokenize
>>> from TexSoup.reader import read_math_env, TexMathModeEnv
>>> buf = tokenize(categorize(r'\min_x \|Xw-y\|_2^2'))
>>> read_math_env(buf, TexMathModeEnv())
Traceback (most recent call last):
    ...
EOFError: [Line: 0, Offset: 7] "$" env expecting $. Reached end of file.

TexSoup.reader.read_skip_env(src, expr)

Read the environment from buffer, WITHOUT parsing contents.

Advances the buffer until right after the end of the environment. Adds UNparsed content to the expression automatically.

Parameters:
	src (Buffer) – a buffer of tokens
	expr (TexExpr) – expression for the environment
Return type:	TexExpr

>>> from TexSoup.category import categorize
>>> from TexSoup.tokens import tokenize
>>> from TexSoup.reader import read_skip_env, TexNamedEnv
>>> buf = tokenize(categorize(r' \textbf{aa \end{foobar}ha'))
>>> read_skip_env(buf, TexNamedEnv('foobar'))
TexNamedEnv('foobar', [' \\textbf{aa '], [])
>>> buf = tokenize(categorize(r' \textbf{aa ha'))
>>> read_skip_env(buf, TexNamedEnv('foobar'))  # doctest: +ELLIPSIS
Traceback (most recent call last):
    ...
EOFError: ...
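The idea of skipping an environment (keep everything up to the matching \end as raw text, and fail if the environment is never closed) can be sketched on plain strings. The function skip_env below is a hypothetical stand-in written for this page, not TexSoup's implementation, and it assumes the closing \end{name} appears verbatim with no whitespace inside it.

```python
def skip_env(text, name):
    """Split text at the first \\end{name}; return (raw_contents, remainder).

    Hypothetical helper for illustration only.
    """
    closer = '\\end{' + name + '}'
    idx = text.find(closer)
    if idx < 0:
        # Mirrors the documented behaviour: an unclosed environment is an error.
        raise EOFError('expected ' + closer + ' before end of input')
    return text[:idx], text[idx + len(closer):]


contents, rest = skip_env(r' \textbf{aa \end{foobar}ha', 'foobar')
print(repr(contents), repr(rest))
```

The unbalanced brace in the contents is harmless here because nothing inside the environment is parsed.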

TexSoup.reader.read_env(src, expr, tolerance=0, mode='mode:non-math')

Read the environment from buffer.

Advances the buffer until right after the end of the environment. Adds parsed content to the expression automatically.

Parameters:
	src (Buffer) – a buffer of tokens
	expr (TexExpr) – expression for the environment
	tolerance (int) – error tolerance level (only supports 0 or 1)
	mode (str) – math or not math mode
Return type:	TexExpr

>>> from TexSoup.category import categorize
>>> from TexSoup.tokens import tokenize
>>> from TexSoup.reader import read_env, TexNamedEnv
>>> buf = tokenize(categorize(' tingtang \\end\n{foobar}walla'))
>>> read_env(buf, TexNamedEnv('foobar'))
TexNamedEnv('foobar', [' tingtang '], [])
>>> buf = tokenize(categorize(' tingtang \\end\n\n{foobar}walla'))
>>> read_env(buf, TexNamedEnv('foobar'))  # doctest: +ELLIPSIS
Traceback (most recent call last):
    ...
EOFError: [Line: 0, Offset: 1] ...
>>> buf = tokenize(categorize(' tingtang \\end\n\n{nope}walla'))
>>> read_env(buf, TexNamedEnv('foobar'), tolerance=1)  # error tolerance
TexNamedEnv('foobar', [' tingtang '], [])

Argument Parser

TexSoup.reader.read_args(src, n_required=-1, n_optional=-1, args=None, tolerance=0, mode='mode:non-math')

Read all arguments from buffer.

This function assumes that the command name has already been parsed. LaTeX itself allows a command at most 9 arguments in total, optional and required combined. If n_optional is not set, all valid bracket groups are captured. If n_required is not set, all valid brace groups are captured.

Parameters:
	src (Buffer) – a buffer of tokens
	args (TexArgs) – existing arguments to extend
	n_required (int) – Number of required arguments. If < 0, all valid brace groups will be captured.
	n_optional (int) – Number of optional arguments. If < 0, all valid bracket groups will be captured.
	tolerance (int) – error tolerance level (only supports 0 or 1)
	mode (str) – math or not math mode
Returns:	parsed arguments
Return type:	TexArgs

>>> from TexSoup.category import categorize
>>> from TexSoup.tokens import tokenize
>>> from TexSoup.reader import read_args
>>> test = lambda s, *a, **k: read_args(tokenize(categorize(s)), *a, **k)
>>> test('[walla]{walla}{ba]ng}')  # 'regular' arg parse
[BracketGroup('walla'), BraceGroup('walla'), BraceGroup('ba', ']', 'ng')]
>>> test('\t[wa]\n{lla}\n\n{b[ing}')  # interspersed spacers + 2 newlines
[BracketGroup('wa'), BraceGroup('lla')]
>>> test('\t[\t{a]}bs', 2, 0)  # use char as arg, since no opt args
[BraceGroup('['), BraceGroup('a', ']')]
>>> test('\n[hue]\t[\t{a]}', 2, 1)  # check stop opt arg capture
[BracketGroup('hue'), BraceGroup('['), BraceGroup('a', ']')]
>>> test('\t\\item')
[]
>>> test('   \t    \n\t \n{bingbang}')
[]
>>> test('[tempt]{ing}[WITCH]{doctorrrr}', 0, 0)
[]
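The role of the two counters can be sketched with a much simpler reader that works on plain strings. A negative count means "capture every group of that kind", and a count of zero means "capture none". The function read_groups is hypothetical and deliberately limited: it handles no nesting, no spacers, and none of the fallback rules that read_args applies.

```python
def read_groups(text, n_required=-1, n_optional=-1):
    """Collect leading [..] and {..} groups from text (no nesting, no spacers).

    Hypothetical helper for illustration only.
    """
    closers = {'[': ']', '{': '}'}
    args, pos = [], 0
    while pos < len(text) and text[pos] in closers:
        opener = text[pos]
        remaining = n_optional if opener == '[' else n_required
        if remaining == 0:
            break  # quota for this kind of group is used up
        end = text.find(closers[opener], pos)
        if end < 0:
            break  # unclosed group: stop rather than guess
        args.append(text[pos:end + 1])
        pos = end + 1
        # Negative counters keep decreasing and therefore never reach zero.
        if opener == '[':
            n_optional -= 1
        else:
            n_required -= 1
    return args


print(read_groups('[walla]{walla}{ba]ng}'))
print(read_groups('[tempt]{ing}[WITCH]{doctorrrr}', 0, 0))
```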

TexSoup.reader.read_arg_optional(src, args, n_optional=-1, tolerance=0, mode='mode:non-math')

Read next optional argument from buffer.

If the command has remaining optional arguments, look for:

1. A spacer. Skip the spacer if it exists.
2. A bracket delimiter. If the optional argument is bracket-delimited, the contents of the bracket group are used as the argument.

Parameters:
	src (Buffer) – a buffer of tokens
	args (TexArgs) – existing arguments to extend
	n_optional (int) – Number of optional arguments. If < 0, all valid bracket groups will be captured.
	tolerance (int) – error tolerance level (only supports 0 or 1)
	mode (str) – math or not math mode
Returns:	number of remaining optional arguments
Return type:	int

TexSoup.reader.read_arg_required(src, args, n_required=-1, tolerance=0, mode='mode:non-math')

Read next required argument from buffer.

If the command has remaining required arguments, look for:

1. A spacer. Skip the spacer if it exists.
2. A curly-brace delimiter. If the required argument is brace-delimited, the contents of the brace group are used as the argument.
3. Spacer or not, if a brace group is not found, simply use the next character.

Parameters:
	src (Buffer) – a buffer of tokens
	args (TexArgs) – existing arguments to extend
	n_required (int) – Number of required arguments. If < 0, all valid brace groups will be captured.
	tolerance (int) – error tolerance level (only supports 0 or 1)
	mode (str) – math or not math mode
Returns:	number of remaining required arguments
Return type:	int

>>> from TexSoup.category import categorize
>>> from TexSoup.tokens import tokenize
>>> from TexSoup.reader import read_arg_required, TexArgs, TC
>>> buf = tokenize(categorize('{wal]la}\n{ba ng}\n'))
>>> args = TexArgs()
>>> read_arg_required(buf, args)  # 'regular' arg parse
-3
>>> args
[BraceGroup('wal', ']', 'la'), BraceGroup('ba ng')]
>>> buf.hasNext() and buf.peek().category == TC.MergedSpacer
True
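The lookup order described for required arguments (skip a spacer, prefer a brace group, otherwise fall back to a single character) can be sketched on plain strings. The function next_required is a hypothetical helper, not TexSoup code. It treats any run of whitespace as a spacer, which is looser than the doctests on this page suggest, and it does not handle nested braces.

```python
def next_required(text):
    """Return (argument, remainder) for one required argument.

    Hypothetical helper for illustration only.
    """
    stripped = text.lstrip()  # skip a spacer if one is present
    if not stripped:
        raise EOFError('no argument left to read')
    if stripped.startswith('{'):
        # Brace-delimited: the group's contents are the argument.
        end = stripped.find('}')
        if end < 0:
            raise EOFError('unclosed brace group')
        return stripped[1:end], stripped[end + 1:]
    # No brace group found: the next character is the argument.
    return stripped[0], stripped[1:]


print(next_required(' {lla}x'))
print(next_required('abc'))
```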

TexSoup.reader.read_arg(src, c, tolerance=0, mode='mode:non-math')

Read the argument from buffer.

Advances buffer until right before the end of the argument.

Parameters:
	src (Buffer) – a buffer of tokens
	c (str) – argument token (starting token)
	tolerance (int) – error tolerance level (only supports 0 or 1)
	mode (str) – math or not math mode
Returns:	the parsed argument
Return type:	TexGroup

>>> from TexSoup.category import categorize
>>> from TexSoup.tokens import tokenize
>>> from TexSoup.reader import read_arg
>>> s = r'''{\item\abovedisplayskip=2pt\abovedisplayshortskip=0pt~\vspace*{-\baselineskip}}'''
>>> buf = tokenize(categorize(s))
>>> read_arg(buf, next(buf))
BraceGroup(TexCmd('item'))
>>> buf = tokenize(categorize(r'{\incomplete! [complete]'))
>>> read_arg(buf, next(buf), tolerance=1)
BraceGroup(TexCmd('incomplete'), '! ', '[', 'complete', ']')
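The effect of the tolerance level on an unclosed group can be sketched with a small brace matcher over plain strings. This is an assumption-laden illustration rather than TexSoup's logic: read_brace_group is a hypothetical helper that tracks nesting depth by counting characters and, when tolerance is above zero, treats the end of input as the closing brace.

```python
def read_brace_group(text, tolerance=0):
    """Return the contents of the brace group that opens at text[0].

    Hypothetical helper for illustration only.
    """
    if not text.startswith('{'):
        raise ValueError('expected an opening brace')
    depth = 0
    for i, ch in enumerate(text):
        if ch == '{':
            depth += 1
        elif ch == '}':
            depth -= 1
            if depth == 0:
                return text[1:i]  # matching close found
    if tolerance > 0:
        return text[1:]  # lenient: treat end of input as the closing brace
    raise EOFError('brace group was never closed')


print(read_brace_group('{a{b}c}d'))
print(read_brace_group('{open', tolerance=1))
```

With the default tolerance of 0, the second call would raise EOFError instead of returning the partial contents.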

Command Parser

TexSoup.reader.read_command(buf, n_required_args=-1, n_optional_args=-1, skip=0, tolerance=0, mode='mode:non-math')

Parses command and all arguments. Assumes escape has just been parsed.

No whitespace is allowed between the escape character and the command name. For example, \ textbf is parsed as a backslash command followed by the text textbf; only \textbf is the bold command.

>>> from TexSoup.category import categorize
>>> from TexSoup.tokens import tokenize
>>> from TexSoup.reader import read_command, Buffer
>>> buf = Buffer(tokenize(categorize('\\sect  \t    \n\t{wallawalla}')))
>>> next(buf)
'\\'
>>> read_command(buf)
('sect', [BraceGroup('wallawalla')])
>>> buf = Buffer(tokenize(categorize('\\sect  \t   \n\t \n{bingbang}')))
>>> _ = next(buf)
>>> read_command(buf)
('sect', [])
>>> buf = Buffer(tokenize(categorize('\\sect{ooheeeee}')))
>>> _ = next(buf)
>>> read_command(buf)
('sect', [BraceGroup('ooheeeee')])
>>> buf = Buffer(tokenize(categorize(r'\item aaa {bbb} ccc\end{itemize}')))
>>> read_command(buf, skip=1)
('item', [])
>>> buf.peek()
' aaa '

# >>> buf = Buffer(tokenize(categorize('\\sect abcd')))
# >>> _ = next(buf)
# >>> read_command(buf)
# ('sect', ('a',))
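The rule that the command name must follow the escape character directly can be sketched with a regular expression. The function command_name is a hypothetical helper and rests on one assumption that TexSoup may not share: a command name consists of ASCII letters only.

```python
import re


def command_name(text):
    """Return the command name directly after a leading backslash, or None.

    Hypothetical helper for illustration only; assumes names are ASCII letters.
    """
    match = re.match(r'\\([A-Za-z]+)', text)
    return match.group(1) if match else None


print(command_name(r'\textbf{x}'))
print(command_name(r'\ textbf'))
```

The second call finds no name because a space separates the backslash from textbf.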