Parsing Mechanics

These parsing functions are internal and subject to change; do not call them directly from public code.

Parser

TexSoup.reader.read_tex(buf, skip_envs=(), tolerance=0)

Parse all expressions in buffer

Parameters:
  • buf (Buffer) – a buffer of tokens
  • skip_envs (Tuple[str]) – environments to skip parsing
  • tolerance (int) – error tolerance level (only supports 0 or 1)
Returns:

iterable over parsed expressions

Return type:

Iterable[TexExpr]

TexSoup.reader.read_expr(src, skip_envs=(), tolerance=0, mode='mode:non-math')

Read next expression from buffer

Parameters:
  • src (Buffer) – a buffer of tokens
  • skip_envs (Tuple[str]) – environments to skip parsing
  • tolerance (int) – error tolerance level (only supports 0 or 1)
  • mode (str) – parsing mode: math or non-math
Returns:

parsed expression

Return type:

TexExpr or Token

TexSoup.reader.read_spacer(buf)

Extracts the next spacer, if there is one, before non-whitespace

A spacer is a contiguous run of whitespace containing at most one line break.

>>> from TexSoup.category import categorize
>>> from TexSoup.tokens import tokenize
>>> read_spacer(Buffer(tokenize(categorize('   \t    \n'))))
'   \t    \n'
>>> read_spacer(Buffer(tokenize(categorize('   \t    \n\t \n  \t\n'))))
'   \t    \n\t '
>>> read_spacer(Buffer(tokenize(categorize('{'))))
''
>>> read_spacer(Buffer(tokenize(categorize('   \t    \na'))))
''
>>> read_spacer(Buffer(tokenize(categorize('   \t    \n\t \n  \t\na'))))
'   \t    \n\t '
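
The spacer definition can be illustrated with a minimal sketch. The function leading_spacer is hypothetical and is not part of TexSoup: it works on a raw string rather than on TexSoup tokens, so cases whose result depends on tokenization (such as whitespace immediately followed by text) may differ from the doctests above.

```python
def leading_spacer(text: str) -> str:
    """Return the leading whitespace of `text`, stopping before a second line break."""
    line_breaks = 0
    end = 0
    for ch in text:
        if not ch.isspace():
            break
        if ch == "\n":
            line_breaks += 1
            if line_breaks > 1:
                break  # a second line break is never part of the spacer
        end += 1
    return text[:end]


print(repr(leading_spacer("   \t    \n")))
print(repr(leading_spacer("   \t    \n\t \n  \t\n")))
print(repr(leading_spacer("{")))
```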

TexSoup.reader.make_read_peek(f)

Make any reader into a peek function.

The wrapped function still parses the next sequence of tokens in the buffer but rolls back the buffer position afterwards.

>>> from TexSoup.category import categorize
>>> from TexSoup.tokens import tokenize
>>> def read(buf):
...     buf.forward(3)
>>> buf = Buffer(tokenize(categorize(r'\item testing \textbf{hah}')))
>>> buf.position
0
>>> make_read_peek(read)(buf)
>>> buf.position
0
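
The rollback idea can be sketched without TexSoup. ToyBuffer and make_peek below are hypothetical stand-ins, not the library's classes; restoring the position even when the reader raises is a choice made for this sketch and may not match TexSoup's behavior.

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")


class ToyBuffer:
    """Hypothetical stand-in for a token buffer with a movable position."""

    def __init__(self, items: List[str]) -> None:
        self.items = items
        self.position = 0

    def forward(self, n: int = 1) -> List[str]:
        """Consume and return the next n items."""
        start = self.position
        self.position = min(start + n, len(self.items))
        return self.items[start:self.position]


def make_peek(reader: Callable[[ToyBuffer], T]) -> Callable[[ToyBuffer], T]:
    """Wrap a reader so it returns its result but restores the buffer position."""

    def peek(buf: ToyBuffer) -> T:
        start = buf.position
        try:
            return reader(buf)
        finally:
            buf.position = start  # roll back, even if the reader raised

    return peek


buf = ToyBuffer(["\\", "item", " testing ", "\\", "textbf"])
peeked = make_peek(lambda b: b.forward(3))(buf)
print(peeked, buf.position)
```

The wrapped reader still does the full parsing work; only the position change is undone, so a peek costs as much as a read.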

Environment Parser

TexSoup.reader.read_item(src, tolerance=0)

Read the item content. Assumes escape has just been parsed.

There can be any number of whitespace characters between \item and the first non-whitespace character. Any amount of whitespace between subsequent characters is also allowed.

\item can also take an argument.

Parameters:
  • src (Buffer) – a buffer of tokens
  • tolerance (int) – error tolerance level (only supports 0 or 1)
Returns:

contents of the item and any item arguments

>>> from TexSoup.category import categorize
>>> from TexSoup.tokens import tokenize
>>> def read_item_from(string, skip=2):
...     buf = tokenize(categorize(string))
...     _ = buf.forward(skip)
...     return read_item(buf)
>>> read_item_from(r'\item aaa {bbb} ccc\end{itemize}')
[' aaa ', BraceGroup('bbb'), ' ccc']
>>> read_item_from(r'\item aaa \textbf{itemize}\item no')
[' aaa ', TexCmd('textbf', [BraceGroup('itemize')])]
>>> read_item_from(r'\item WITCH [nuuu] DOCTORRRR 👩🏻‍⚕️')
[' WITCH ', '[', 'nuuu', ']', ' DOCTORRRR 👩🏻‍⚕️']
>>> read_item_from(r'''\begin{itemize}
... \item
... \item first item
... \end{itemize}''', skip=8)
['\n']
>>> read_item_from(r'''\def\itemeqn{\item}''', skip=7)
[]

TexSoup.reader.unclosed_env_handler(src, expr, end)

Handle unclosed environments.

Currently raises an end-of-file error. In the future, this can be the hub for unclosed-environment fault tolerance.

Parameters:
  • src (Buffer) – a buffer of tokens
  • expr (TexExpr) – expression for the environment
  • end (str) – actual end token (as opposed to expected)

TexSoup.reader.read_math_env(src, expr, tolerance=0)

Read the environment from buffer.

Advances the buffer until right after the end of the environment. Adds parsed content to the expression automatically.

Parameters:
  • src (Buffer) – a buffer of tokens
  • expr (TexExpr) – expression for the environment
Return type:

TexExpr

>>> from TexSoup.category import categorize
>>> from TexSoup.tokens import tokenize
>>> buf = tokenize(categorize(r'\min_x \|Xw-y\|_2^2'))
>>> read_math_env(buf, TexMathModeEnv())
Traceback (most recent call last):
    ...
EOFError: [Line: 0, Offset: 7] "$" env expecting $. Reached end of file.

TexSoup.reader.read_skip_env(src, expr)

Read the environment from buffer, WITHOUT parsing contents

Advances the buffer until right after the end of the environment. Adds UNparsed content to the expression automatically.

Parameters:
  • src (Buffer) – a buffer of tokens
  • expr (TexExpr) – expression for the environment
Return type:

TexExpr

>>> from TexSoup.category import categorize
>>> from TexSoup.tokens import tokenize
>>> buf = tokenize(categorize(r' \textbf{aa \end{foobar}ha'))
>>> read_skip_env(buf, TexNamedEnv('foobar'))
TexNamedEnv('foobar', [' \\textbf{aa '], [])
>>> buf = tokenize(categorize(r' \textbf{aa ha'))
>>> read_skip_env(buf, TexNamedEnv('foobar'))  
Traceback (most recent call last):
    ...
EOFError: ...
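
Skipping an environment without parsing it amounts to scanning for the closing \end{...} and keeping everything before it verbatim. The sketch below shows that idea on a raw string. The helper skip_env is hypothetical and does not use TexSoup's buffer or expression types.

```python
from typing import Tuple


def skip_env(text: str, name: str) -> Tuple[str, str]:
    """Split `text` at the first \\end{name}; return (unparsed contents, remainder)."""
    closing = "\\end{" + name + "}"
    index = text.find(closing)
    if index < 0:
        # Mirrors the unclosed-environment case: no closing marker before the end.
        raise EOFError(f"{name!r} env expecting {closing}. Reached end of input.")
    return text[:index], text[index + len(closing):]


contents, rest = skip_env(r" \textbf{aa \end{foobar}ha", "foobar")
print(repr(contents), repr(rest))
```

Because the contents are never parsed, the unbalanced brace in \textbf{aa does not cause an error here.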

TexSoup.reader.read_env(src, expr, tolerance=0, mode='mode:non-math')

Read the environment from buffer.

Advances the buffer until right after the end of the environment. Adds parsed content to the expression automatically.

Parameters:
  • src (Buffer) – a buffer of tokens
  • expr (TexExpr) – expression for the environment
  • tolerance (int) – error tolerance level (only supports 0 or 1)
  • mode (str) – parsing mode: math or non-math
Return type:

TexExpr

>>> from TexSoup.category import categorize
>>> from TexSoup.tokens import tokenize
>>> buf = tokenize(categorize(' tingtang \\end\n{foobar}walla'))
>>> read_env(buf, TexNamedEnv('foobar'))
TexNamedEnv('foobar', [' tingtang '], [])
>>> buf = tokenize(categorize(' tingtang \\end\n\n{foobar}walla'))
>>> read_env(buf, TexNamedEnv('foobar')) 
Traceback (most recent call last):
    ...
EOFError: [Line: 0, Offset: 1] ...
>>> buf = tokenize(categorize(' tingtang \\end\n\n{nope}walla'))
>>> read_env(buf, TexNamedEnv('foobar'), tolerance=1)  # error tolerance
TexNamedEnv('foobar', [' tingtang '], [])

Argument Parser

TexSoup.reader.read_args(src, n_required=-1, n_optional=-1, args=None, tolerance=0, mode='mode:non-math')

Read all arguments from buffer.

This function assumes that the command name has already been parsed. By default, LaTeX allows only up to 9 arguments of both types, optional and required. If n_optional is not set, all valid bracket groups are captured. If n_required is not set, all valid brace groups are captured.

Parameters:
  • src (Buffer) – a buffer of tokens
  • args (TexArgs) – existing arguments to extend
  • n_required (int) – Number of required arguments. If < 0, all valid brace groups will be captured.
  • n_optional (int) – Number of optional arguments. If < 0, all valid bracket groups will be captured.
  • tolerance (int) – error tolerance level (only supports 0 or 1)
  • mode (str) – parsing mode: math or non-math
Returns:

parsed arguments

Return type:

TexArgs

>>> from TexSoup.category import categorize
>>> from TexSoup.tokens import tokenize
>>> test = lambda s, *a, **k: read_args(tokenize(categorize(s)), *a, **k)
>>> test('[walla]{walla}{ba]ng}')  # 'regular' arg parse
[BracketGroup('walla'), BraceGroup('walla'), BraceGroup('ba', ']', 'ng')]
>>> test('\t[wa]\n{lla}\n\n{b[ing}')  # interspersed spacers + 2 newlines
[BracketGroup('wa'), BraceGroup('lla')]
>>> test('\t[\t{a]}bs', 2, 0)  # use char as arg, since no opt args
[BraceGroup('['), BraceGroup('a', ']')]
>>> test('\n[hue]\t[\t{a]}', 2, 1)  # check stop opt arg capture
[BracketGroup('hue'), BraceGroup('['), BraceGroup('a', ']')]
>>> test('\t\\item')
[]
>>> test('   \t    \n\t \n{bingbang}')
[]
>>> test('[tempt]{ing}[WITCH]{doctorrrr}', 0, 0)
[]

TexSoup.reader.read_arg_optional(src, args, n_optional=-1, tolerance=0, mode='mode:non-math')

Read next optional argument from buffer.

If the command has remaining optional arguments, look for:

  1. A spacer. Skip the spacer if it exists.
  2. A bracket delimiter. If the optional argument is bracket-delimited, the contents of the bracket group are used as the argument.
Parameters:
  • src (Buffer) – a buffer of tokens
  • args (TexArgs) – existing arguments to extend
  • n_optional (int) – Number of optional arguments. If < 0, all valid bracket groups will be captured.
  • tolerance (int) – error tolerance level (only supports 0 or 1)
  • mode (str) – parsing mode: math or non-math
Returns:

number of remaining optional arguments

Return type:

int

TexSoup.reader.read_arg_required(src, args, n_required=-1, tolerance=0, mode='mode:non-math')

Read next required argument from buffer.

If the command has remaining required arguments, look for:

  1. A spacer. Skip the spacer if it exists.
  2. A curly-brace delimiter. If the required argument is brace-delimited, the contents of the brace group are used as the argument.
  3. Spacer or not, if a brace group is not found, simply use the next character.
Parameters:
  • src (Buffer) – a buffer of tokens
  • args (TexArgs) – existing arguments to extend
  • n_required (int) – Number of required arguments. If < 0, all valid brace groups will be captured.
  • tolerance (int) – error tolerance level (only supports 0 or 1)
  • mode (str) – parsing mode: math or non-math
Returns:

number of remaining required arguments

Return type:

int

>>> from TexSoup.category import categorize
>>> from TexSoup.tokens import tokenize
>>> buf = tokenize(categorize('{wal]la}\n{ba ng}\n'))
>>> args = TexArgs()
>>> read_arg_required(buf, args)  # 'regular' arg parse
-3
>>> args
[BraceGroup('wal', ']', 'la'), BraceGroup('ba ng')]
>>> buf.hasNext() and buf.peek().category == TC.MergedSpacer
True
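
The three lookup steps for required arguments can be sketched on a raw string. All names below are hypothetical and the code is not TexSoup's implementation: it ignores escaped braces, treats a blank line as the end of argument scanning, and assumes the next-character fallback applies only while a positive count remains.

```python
from typing import List, Tuple


def skip_spacer(text: str, pos: int) -> int:
    """Return the position after a spacer, or -1 if a blank line is hit."""
    line_breaks = 0
    while pos < len(text) and text[pos].isspace():
        if text[pos] == "\n":
            line_breaks += 1
            if line_breaks > 1:
                return -1  # a blank line ends argument scanning in this sketch
        pos += 1
    return pos


def read_brace_group(text: str, pos: int) -> Tuple[str, int]:
    """Read a balanced {...} group starting at text[pos]; return (contents, end)."""
    depth = 0
    for index in range(pos, len(text)):
        if text[index] == "{":
            depth += 1
        elif text[index] == "}":
            depth -= 1
            if depth == 0:
                return text[pos + 1:index], index + 1
    raise EOFError("brace group expecting }. Reached end of input.")


def read_required(text: str, n_required: int = -1) -> Tuple[List[str], str, int]:
    """Return (arguments, unread remainder, remaining count)."""
    args: List[str] = []
    pos = 0
    while n_required != 0:
        cursor = skip_spacer(text, pos)  # step 1: skip a spacer if present
        if cursor < 0 or cursor >= len(text):
            break
        if text[cursor] == "{":  # step 2: brace-delimited argument
            arg, pos = read_brace_group(text, cursor)
        elif n_required > 0:  # step 3: fall back to the next character
            arg, pos = text[cursor], cursor + 1
        else:
            break  # capturing all groups stops at the first non-group
        args.append(arg)
        n_required -= 1
    return args, text[pos:], n_required


print(read_required("{wal]la}\n{ba ng}\n"))
print(read_required("\t[\t{a]}bs", 2))
```

As in the doctest above, the count is decremented once per captured argument, so starting from -1 and capturing two groups leaves -3, and the trailing spacer stays unread.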

TexSoup.reader.read_arg(src, c, tolerance=0, mode='mode:non-math')

Read the argument from buffer.

Advances buffer until right before the end of the argument.

Parameters:
  • src (Buffer) – a buffer of tokens
  • c (str) – argument token (starting token)
  • tolerance (int) – error tolerance level (only supports 0 or 1)
  • mode (str) – parsing mode: math or non-math
Returns:

the parsed argument

Return type:

TexGroup

>>> from TexSoup.category import categorize
>>> from TexSoup.tokens import tokenize
>>> s = r'''{\item\abovedisplayskip=2pt\abovedisplayshortskip=0pt~\vspace*{-\baselineskip}}'''
>>> buf = tokenize(categorize(s))
>>> read_arg(buf, next(buf))
BraceGroup(TexCmd('item'))
>>> buf = tokenize(categorize(r'{\incomplete! [complete]'))
>>> read_arg(buf, next(buf), tolerance=1)
BraceGroup(TexCmd('incomplete'), '! ', '[', 'complete', ']')

Command Parser

TexSoup.reader.read_command(buf, n_required_args=-1, n_optional_args=-1, skip=0, tolerance=0, mode='mode:non-math')

Parses command and all arguments. Assumes escape has just been parsed.

No whitespace is allowed between the escape and the command name. For example, \ textbf is a backslash command followed by the text textbf; only \textbf is the bold command.

>>> from TexSoup.category import categorize
>>> from TexSoup.tokens import tokenize
>>> buf = Buffer(tokenize(categorize('\\sect  \t    \n\t{wallawalla}')))
>>> next(buf)
'\\'
>>> read_command(buf)
('sect', [BraceGroup('wallawalla')])
>>> buf = Buffer(tokenize(categorize('\\sect  \t   \n\t \n{bingbang}')))
>>> _ = next(buf)
>>> read_command(buf)
('sect', [])
>>> buf = Buffer(tokenize(categorize('\\sect{ooheeeee}')))
>>> _ = next(buf)
>>> read_command(buf)
('sect', [BraceGroup('ooheeeee')])
>>> buf = Buffer(tokenize(categorize(r'\item aaa {bbb} ccc\end{itemize}')))
>>> read_command(buf, skip=1)
('item', [])
>>> buf.peek()
' aaa '
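
The rule that the command name must directly follow the escape can be sketched with a small helper. read_command_name is hypothetical and follows the general TeX convention (a name is either a run of letters or the single character after the escape). Whether TexSoup names the resulting command the same way is not shown here.

```python
import re
from typing import Tuple

# TeX convention: a control word is a run of letters; otherwise the name is
# the single character that follows the escape (a control symbol).
_LETTERS = re.compile(r"[A-Za-z]+")


def read_command_name(after_escape: str) -> Tuple[str, str]:
    """`after_escape` is the input right after the backslash; return (name, remainder)."""
    match = _LETTERS.match(after_escape)
    if match:
        return match.group(0), after_escape[match.end():]
    # No letter directly after the escape: the name is one character only.
    return after_escape[:1], after_escape[1:]


print(read_command_name("textbf{hah}"))
print(read_command_name(" textbf"))
```

In the second call the space itself becomes the command name, and textbf is left behind as plain text.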

# >>> buf = Buffer(tokenize(categorize('\sect abcd')))
# >>> _ = next(buf)
# >>> read_command(buf)
# ('sect', ('a',))