Parsing Mechanics
The parsing functions documented here are internal. Do not call them directly from user code, as they are subject to change.
Parser
TexSoup.reader.read_tex(buf, skip_envs=(), tolerance=0)
Parse all expressions in buffer.
Returns: iterable over parsed expressions
Return type: Iterable[TexExpr]
TexSoup.reader.read_expr(src, skip_envs=(), tolerance=0, mode='mode:non-math')
Read next expression from buffer.
Returns: parsed expression
TexSoup.reader.read_spacer(buf)
Extracts the next spacer, if there is one, before non-whitespace.
Define a spacer to be a contiguous string of only whitespace, with at most one line break.
>>> from TexSoup.category import categorize
>>> from TexSoup.tokens import tokenize
>>> read_spacer(Buffer(tokenize(categorize(' \t \n'))))
' \t \n'
>>> read_spacer(Buffer(tokenize(categorize(' \t \n\t \n \t\n'))))
' \t \n\t '
>>> read_spacer(Buffer(tokenize(categorize('{'))))
''
>>> read_spacer(Buffer(tokenize(categorize(' \t \na'))))
''
>>> read_spacer(Buffer(tokenize(categorize(' \t \n\t \n \t\na'))))
' \t \n\t '
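The spacer definition can also be written as a predicate on plain strings. This is a minimal sketch that is independent of TexSoup's token buffer: is_spacer is a hypothetical helper, not part of the library, and it does not reproduce how read_spacer consumes tokens. Treating the empty string as "not a spacer" is an assumption made here.

```python
def is_spacer(text):
    """Return True if text is only whitespace with at most one line break.

    Hypothetical helper illustrating the definition; not part of TexSoup.
    """
    # Assumption: an empty string is not a spacer, since nothing is extracted.
    if not text:
        return False
    # str.isspace() is True only when every character is whitespace.
    return text.isspace() and text.count('\n') <= 1


print(is_spacer(' \t \n'))       # True: one line break
print(is_spacer(' \t \n\t \n'))  # False: two line breaks
print(is_spacer(' a '))          # False: contains non-whitespace
```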
TexSoup.reader.make_read_peek(f)
Make any reader into a peek function.
The wrapped function still parses the next sequence of tokens in the buffer but rolls back the buffer position afterwards.
>>> from TexSoup.category import categorize
>>> from TexSoup.tokens import tokenize
>>> def read(buf):
...     buf.forward(3)
>>> buf = Buffer(tokenize(categorize(r'\item testing \textbf{hah}')))
>>> buf.position
0
>>> make_read_peek(read)(buf)
>>> buf.position
0
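The roll-back behaviour can be sketched without TexSoup. In the example below, ListBuffer and make_peek are hypothetical stand-ins for the library's buffer and for make_read_peek; the only assumption carried over from the text is that the buffer exposes a writable position attribute. Restoring the position in a finally clause is a choice made for this sketch, not a statement about the library's implementation.

```python
import functools


class ListBuffer:
    """Hypothetical stand-in for a token buffer with a movable position."""

    def __init__(self, items):
        self.items = list(items)
        self.position = 0

    def forward(self, n=1):
        """Consume and return the next n items, advancing the position."""
        chunk = self.items[self.position:self.position + n]
        self.position += n
        return chunk


def make_peek(reader):
    """Wrap reader so it runs normally, then the position is restored."""
    @functools.wraps(reader)
    def peek(buf, *args, **kwargs):
        start = buf.position
        try:
            return reader(buf, *args, **kwargs)
        finally:
            # Roll back even if the reader raised partway through.
            buf.position = start
    return peek


def read_three(buf):
    return buf.forward(3)


buf = ListBuffer('abcdef')
print(make_peek(read_three)(buf))  # ['a', 'b', 'c']
print(buf.position)                # 0
```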
Environment Parser
TexSoup.reader.read_item(src, tolerance=0)
Read the item content. Assumes escape has just been parsed.
There can be any number of whitespace characters between \item and the first non-whitespace character. Any amount of whitespace between subsequent characters is also allowed.
\item can also take an argument.
Returns: contents of the item and any item arguments
>>> from TexSoup.category import categorize
>>> from TexSoup.tokens import tokenize
>>> def read_item_from(string, skip=2):
...     buf = tokenize(categorize(string))
...     _ = buf.forward(skip)
...     return read_item(buf)
>>> read_item_from(r'\item aaa {bbb} ccc\end{itemize}')
[' aaa ', BraceGroup('bbb'), ' ccc']
>>> read_item_from(r'\item aaa \textbf{itemize}\item no')
[' aaa ', TexCmd('textbf', [BraceGroup('itemize')])]
>>> read_item_from(r'\item WITCH [nuuu] DOCTORRRR 👩🏻⚕️')
[' WITCH ', '[', 'nuuu', ']', ' DOCTORRRR 👩🏻⚕️']
>>> read_item_from(r'''\begin{itemize}
... \item
... \item first item
... \end{itemize}''', skip=8)
['\n']
>>> read_item_from(r'''\def\itemeqn{\item}''', skip=7)
[]
TexSoup.reader.unclosed_env_handler(src, expr, end)
Handle unclosed environments.
Currently raises an end-of-file error. In the future, this can be the hub for unclosed-environment fault tolerance.
TexSoup.reader.read_math_env(src, expr, tolerance=0)
Read the environment from buffer.
Advances the buffer until right after the end of the environment. Adds parsed content to the expression automatically.
>>> from TexSoup.category import categorize
>>> from TexSoup.tokens import tokenize
>>> buf = tokenize(categorize(r'\min_x \|Xw-y\|_2^2'))
>>> read_math_env(buf, TexMathModeEnv())
Traceback (most recent call last):
    ...
EOFError: [Line: 0, Offset: 7] "$" env expecting $. Reached end of file.
TexSoup.reader.read_skip_env(src, expr)
Read the environment from buffer, WITHOUT parsing contents.
Advances the buffer until right after the end of the environment. Adds UNparsed content to the expression automatically.
>>> from TexSoup.category import categorize
>>> from TexSoup.tokens import tokenize
>>> buf = tokenize(categorize(r' \textbf{aa \end{foobar}ha'))
>>> read_skip_env(buf, TexNamedEnv('foobar'))
TexNamedEnv('foobar', [' \\textbf{aa '], [])
>>> buf = tokenize(categorize(r' \textbf{aa ha'))
>>> read_skip_env(buf, TexNamedEnv('foobar'))
Traceback (most recent call last):
    ...
EOFError: ...
TexSoup.reader.read_env(src, expr, tolerance=0, mode='mode:non-math')
Read the environment from buffer.
Advances the buffer until right after the end of the environment. Adds parsed content to the expression automatically.
>>> from TexSoup.category import categorize
>>> from TexSoup.tokens import tokenize
>>> buf = tokenize(categorize(' tingtang \\end\n{foobar}walla'))
>>> read_env(buf, TexNamedEnv('foobar'))
TexNamedEnv('foobar', [' tingtang '], [])
>>> buf = tokenize(categorize(' tingtang \\end\n\n{foobar}walla'))
>>> read_env(buf, TexNamedEnv('foobar'))
Traceback (most recent call last):
    ...
EOFError: [Line: 0, Offset: 1] ...
>>> buf = tokenize(categorize(' tingtang \\end\n\n{nope}walla'))
>>> read_env(buf, TexNamedEnv('foobar'), tolerance=1)  # error tolerance
TexNamedEnv('foobar', [' tingtang '], [])
Argument Parser
TexSoup.reader.read_args(src, n_required=-1, n_optional=-1, args=None, tolerance=0, mode='mode:non-math')
Read all arguments from buffer.
This function assumes that the command name has already been parsed. By default, LaTeX allows only up to 9 arguments of both types, optional and required. If n_optional is not set, all valid bracket groups are captured. If n_required is not set, all valid brace groups are captured.
Parameters:
- src (Buffer) – a buffer of tokens
- args (TexArgs) – existing arguments to extend
- n_required (int) – Number of required arguments. If < 0, all valid brace groups will be captured.
- n_optional (int) – Number of optional arguments. If < 0, all valid bracket groups will be captured.
- tolerance (int) – error tolerance level (only supports 0 or 1)
- mode (str) – math or not math mode
Returns: parsed arguments
>>> from TexSoup.category import categorize
>>> from TexSoup.tokens import tokenize
>>> test = lambda s, *a, **k: read_args(tokenize(categorize(s)), *a, **k)
>>> test('[walla]{walla}{ba]ng}')  # 'regular' arg parse
[BracketGroup('walla'), BraceGroup('walla'), BraceGroup('ba', ']', 'ng')]
>>> test('\t[wa]\n{lla}\n\n{b[ing}')  # interspersed spacers + 2 newlines
[BracketGroup('wa'), BraceGroup('lla')]
>>> test('\t[\t{a]}bs', 2, 0)  # use char as arg, since no opt args
[BraceGroup('['), BraceGroup('a', ']')]
>>> test('\n[hue]\t[\t{a]}', 2, 1)  # check stop opt arg capture
[BracketGroup('hue'), BraceGroup('['), BraceGroup('a', ']')]
>>> test('\t\\item')
[]
>>> test(' \t \n\t \n{bingbang}')
[]
>>> test('[tempt]{ing}[WITCH]{doctorrrr}', 0, 0)
[]
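The "negative count means capture every group" convention for n_required and n_optional can be illustrated on plain strings. The sketch below is heavily simplified and is not the library's algorithm: read_groups and read_simple_args are hypothetical helpers, and nested groups, spacers, and the character-as-argument fallback are all ignored.

```python
def read_groups(text, pos, open_ch, close_ch, limit=-1):
    """Read consecutive open_ch...close_ch groups starting at text[pos].

    limit < 0 means "capture every group found"; otherwise stop after
    limit groups. Returns (groups, new_pos). Hypothetical helper.
    """
    groups = []
    while limit != 0 and pos < len(text) and text[pos] == open_ch:
        end = text.find(close_ch, pos + 1)
        if end == -1:
            break  # unclosed group: stop rather than guess
        groups.append(text[pos + 1:end])
        pos = end + 1
        limit -= 1  # a negative limit never reaches 0, so it never stops us
    return groups, pos


def read_simple_args(text, n_required=-1, n_optional=-1):
    """Read bracket groups first, then brace groups, from a plain string."""
    optional, pos = read_groups(text, 0, '[', ']', n_optional)
    required, pos = read_groups(text, pos, '{', '}', n_required)
    return optional, required


print(read_simple_args('[walla]{walla}{bang}'))  # (['walla'], ['walla', 'bang'])
print(read_simple_args('[a]{c}{d}', 1, 1))       # (['a'], ['c'])
print(read_simple_args('[a]{c}', 0, 0))          # ([], [])
```

Decrementing the same counter for both the bounded and unbounded cases keeps the loop condition to a single comparison against zero.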
TexSoup.reader.read_arg_optional(src, args, n_optional=-1, tolerance=0, mode='mode:non-math')
Read next optional argument from buffer.
If the command has remaining optional arguments, look for:
- A spacer. Skip the spacer if it exists.
- A bracket delimiter. If the optional argument is bracket-delimited, the contents of the bracket group are used as the argument.
Returns: number of remaining optional arguments
TexSoup.reader.read_arg_required(src, args, n_required=-1, tolerance=0, mode='mode:non-math')
Read next required argument from buffer.
If the command has remaining required arguments, look for:
- A spacer. Skip the spacer if it exists.
- A curly-brace delimiter. If the required argument is brace-delimited, the contents of the brace group are used as the argument.
- Spacer or not, if a brace group is not found, simply use the next character.
Returns: number of remaining required arguments
>>> from TexSoup.category import categorize
>>> from TexSoup.tokens import tokenize
>>> buf = tokenize(categorize('{wal]la}\n{ba ng}\n'))
>>> args = TexArgs()
>>> read_arg_required(buf, args)  # 'regular' arg parse
-3
>>> args
[BraceGroup('wal', ']', 'la'), BraceGroup('ba ng')]
>>> buf.hasNext() and buf.peek().category == TC.MergedSpacer
True
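The three lookup steps listed for a required argument can be sketched on a plain string. This is an illustration under stated assumptions, not the library's code: read_required_arg is a hypothetical helper, nesting is ignored, and the function returns None without consuming input when the leading whitespace contains more than one line break.

```python
def read_required_arg(text, pos=0):
    """Return (argument, new_pos) for one required argument, or (None, pos).

    Hypothetical helper; works on a plain string rather than a token buffer.
    """
    start = pos
    # Step 1: skip a spacer (whitespace with at most one line break).
    newlines = 0
    while pos < len(text) and text[pos].isspace():
        if text[pos] == '\n':
            newlines += 1
            if newlines > 1:
                return None, start  # more than a spacer: no argument here
        pos += 1
    if pos >= len(text):
        return None, start
    # Step 2: a brace group supplies its contents as the argument.
    if text[pos] == '{':
        end = text.find('}', pos + 1)
        if end == -1:
            return None, start  # unclosed group
        return text[pos + 1:end], end + 1
    # Step 3: otherwise fall back to the next single character.
    return text[pos], pos + 1


print(read_required_arg(' {ab}c'))  # ('ab', 5)
print(read_required_arg(' xy'))     # ('x', 2)
print(read_required_arg('\n\n{a}')) # (None, 0)
```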
TexSoup.reader.read_arg(src, c, tolerance=0, mode='mode:non-math')
Read the argument from buffer.
Advances buffer until right before the end of the argument.
Returns: the parsed argument
>>> from TexSoup.category import categorize
>>> from TexSoup.tokens import tokenize
>>> s = r'''{\item\abovedisplayskip=2pt\abovedisplayshortskip=0pt~\vspace*{-\baselineskip}}'''
>>> buf = tokenize(categorize(s))
>>> read_arg(buf, next(buf))
BraceGroup(TexCmd('item'))
>>> buf = tokenize(categorize(r'{\incomplete! [complete]'))
>>> read_arg(buf, next(buf), tolerance=1)
BraceGroup(TexCmd('incomplete'), '! ', '[', 'complete', ']')
Command Parser
TexSoup.reader.read_command(buf, n_required_args=-1, n_optional_args=-1, skip=0, tolerance=0, mode='mode:non-math')
Parses command and all arguments. Assumes escape has just been parsed.
No whitespace is allowed between escape and command name. For example, \ textbf is a backslash command followed by the text textbf. Only \textbf is the bold command.
>>> from TexSoup.category import categorize
>>> from TexSoup.tokens import tokenize
>>> buf = Buffer(tokenize(categorize('\\sect \t \n\t{wallawalla}')))
>>> next(buf)
'\\'
>>> read_command(buf)
('sect', [BraceGroup('wallawalla')])
>>> buf = Buffer(tokenize(categorize('\\sect \t \n\t \n{bingbang}')))
>>> _ = next(buf)
>>> read_command(buf)
('sect', [])
>>> buf = Buffer(tokenize(categorize('\\sect{ooheeeee}')))
>>> _ = next(buf)
>>> read_command(buf)
('sect', [BraceGroup('ooheeeee')])
>>> buf = Buffer(tokenize(categorize(r'\item aaa {bbb} ccc\end{itemize}')))
>>> read_command(buf, skip=1)
('item', [])
>>> buf.peek()
' aaa '
# >>> buf = Buffer(tokenize(categorize('\\sect abcd')))
# >>> _ = next(buf)
# >>> read_command(buf)
# ('sect', ('a',))
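The rule that no whitespace may separate the escape from the command name can be shown with a regular expression on plain strings. This is only a sketch: command_name and the COMMAND pattern are hypothetical, and restricting names to ASCII letters is an assumption made here rather than the tokenizer's actual rule.

```python
import re

# Assumption: a command name is a backslash immediately followed by letters.
COMMAND = re.compile(r'\\([A-Za-z]+)')


def command_name(text):
    """Return the command name at the start of text, or None."""
    match = COMMAND.match(text)
    return match.group(1) if match else None


print(command_name(r'\textbf{hi}'))  # textbf
print(command_name(r'\ textbf'))     # None: whitespace follows the escape
```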