Making a Soup

To parse a \(\LaTeX\) document, pass an open filehandle or a string into the TexSoup constructor:

>>> from TexSoup import TexSoup
>>> with open("main.tex") as f:
...     soup = TexSoup(f)
>>> soup2 = TexSoup(r'\begin{document}Hello world!\end{document}')

Alternatively, use read to compute only the underlying data structure:

>>> from TexSoup import read
>>> soup3, _ = read(r'\begin{document}Hello world!\end{document}')
>>> soup3
[TexNamedEnv('document', ['Hello world!'], [])]

You can also ask TexSoup to tolerate \(\LaTeX\) errors by passing tolerance, in which case it makes a best-effort guess:

>>> soup4 = TexSoup(r'\begin{itemize}\item hullo\end{enumerate}', tolerance=1)
>>> soup4
\begin{itemize}\item hullo\end{itemize}\end{enumerate}

To output the soup, call str() on a TexSoup object or on any nested data structure:

>>> soup4
\begin{itemize}\item hullo\end{itemize}\end{enumerate}
>>> str(soup4)
'\\begin{itemize}\\item hullo\\end{itemize}\\end{enumerate}'
>>> soup4.item
\item hullo
>>> str(soup4.item)
'\\item hullo'
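
The doubled backslashes in the str() results above are a display effect of the interactive prompt, which echoes the repr of a string; the string itself holds single backslashes. A small standard-library illustration, using a plain string as a stand-in for the result of str(soup4.item) (no TexSoup call is made here):

```python
# Plain string standing in for str(soup4.item); assumed value, taken from the example above.
latex = r'\item hullo'

shown_by_repl = repr(latex)   # what the >>> prompt echoes for a str value
shown_by_print = latex        # what print() writes to the terminal

print(shown_by_repl)   # '\\item hullo'
print(shown_by_print)  # \item hullo
```

When you want the \(\LaTeX\) source exactly as written, print the string or write it to a file rather than copying the echoed repr.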

Kinds of Objects

TexSoup translates a \(\LaTeX\) document into a tree of Python objects. There are only three kinds of objects: commands, environments, and text.
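
To make the tree idea concrete, here is a minimal sketch built only on the standard library. The classes Text, Cmd, and Env are hypothetical stand-ins invented for this illustration; they are not TexSoup's classes and do not reflect its implementation. The sketch only shows why a change to a node shows up when the tree is rendered back to \(\LaTeX\).

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Text:
    """Toy stand-in for a run of plain text."""
    text: str

    def __str__(self) -> str:
        return self.text


@dataclass
class Cmd:
    """Toy stand-in for a command such as \\textbf{...}."""
    name: str
    args: List["Node"] = field(default_factory=list)

    def __str__(self) -> str:
        # Each argument is rendered inside its own pair of braces.
        return "\\" + self.name + "".join("{" + str(a) + "}" for a in self.args)


@dataclass
class Env:
    """Toy stand-in for a \\begin{name}...\\end{name} environment."""
    name: str
    contents: List["Node"] = field(default_factory=list)

    def __str__(self) -> str:
        body = "".join(str(c) for c in self.contents)
        return "\\begin{" + self.name + "}" + body + "\\end{" + self.name + "}"


Node = Union[Text, Cmd, Env]

tree = Env("document", [Text("I am "), Cmd("textbf", [Text("bold")])])
before = str(tree)

# Output is rebuilt from the tree each time, so renaming a node changes it.
tree.contents[1].name = "textit"
after = str(tree)

print(before)  # \begin{document}I am \textbf{bold}\end{document}
print(after)   # \begin{document}I am \textit{bold}\end{document}
```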


A TexCmd corresponds to a command in the original document:

>>> soup = TexSoup(r'I am \textbf{\large Large and bold}')
>>> cmd = soup.textbf
>>> cmd
\textbf{\large Large and bold}

You can access the underlying data structures using .expr.

>>> cmd.expr
TexCmd('textbf', [BraceGroup(TexCmd('large'), ' Large and bold')])

Every command has a name:

>>> cmd.name
'textbf'

You can change the command’s name too. This change will be reflected when you convert the TexSoup back to \(\LaTeX\):

>>> cmd.name = 'textit'
>>> cmd
\textit{\large Large and bold}
>>> soup
I am \textit{\large Large and bold}

Commands may have any number of arguments, stored in .args as a list. Our command has just one argument:

>>> len(cmd.args)
1
>>> str(cmd.args[0])
'{\\large Large and bold}'

You can add, remove, and modify arguments, treating .args as a list:

>>> cmd.args.append('{moar}')  # add arguments
>>> str(cmd.args)
'{\\large Large and bold}{moar}'
>>> cmd.args.remove(r'{\large Large and bold}')  # remove arguments
>>> str(cmd.args)
'{moar}'
>>> cmd.args[0].string = 'floating'  # modify arguments
>>> str(cmd.args)
'{floating}'

All arguments are represented using TexSoup’s underlying data structures:

>>> cmd.args
[BraceGroup('floating')]

The above commands all apply to optional arguments as well. Note that all changes are reflected when we convert the soup back to \(\LaTeX\):

>>> cmd.args.append('[optional]')  # add optional arg
>>> str(cmd.args)
'{floating}[optional]'
>>> cmd.args.remove('[optional]')  # remove optional arg
>>> str(cmd.args)
'{floating}'
>>> soup
I am \textit{floating}
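
In the examples above, the delimiters decide the kind of argument: braces mark a required argument and square brackets an optional one. The following standard-library sketch shows that rule in isolation. classify_arg is a hypothetical helper written for this illustration; it is not part of TexSoup and says nothing about how TexSoup parses arguments internally.

```python
from typing import Tuple


def classify_arg(arg: str) -> Tuple[str, str]:
    """Return (kind, content) for a delimited argument string.

    Hypothetical helper for illustration only; not part of TexSoup.
    """
    if len(arg) >= 2 and arg[0] == "{" and arg[-1] == "}":
        return ("required", arg[1:-1])
    if len(arg) >= 2 and arg[0] == "[" and arg[-1] == "]":
        return ("optional", arg[1:-1])
    # Anything without matching outer delimiters is rejected.
    raise ValueError(f"not a delimited argument: {arg!r}")


args = [classify_arg("{floating}"), classify_arg("[optional]")]
print(args)  # [('required', 'floating'), ('optional', 'optional')]
```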


A TexText represents floating bits of text:

>>> soup
I am \textit{floating}
>>> text = next(soup.contents)
>>> text
'I am '
>>> type(text)
<class 'TexSoup.data.TexText'>

You can set the .text attribute. As before, the change is reflected when you convert the data structure back into \(\LaTeX\):

>>> text.text = 'I am not '
>>> soup
I am not \textit{floating}


Environments, or TexEnv, are split into three types:

  1. TexNamedEnv: The typical environments you think of, with a begin and an end, such as \begin{itemize}...\end{itemize}.
  2. TexUnNamedEnv: Special environments such as math \(...\). All math environments fall in this category.
  3. TexGroup: Unnamed environments with single-character delimiters, like {...}.

You can access environments by name:

>>> soup = TexSoup(r'Haha \begin{itemize}[label=\alph]\item Huehue\end{itemize}')
>>> env = soup.itemize
>>> env
\begin{itemize}[label=\alph]\item Huehue\end{itemize}

Every environment’s name can be accessed and modified using .name:

>>> env.name = 'enumerate'
>>> env
\begin{enumerate}[label=\alph]\item Huehue\end{enumerate}
>>> soup
Haha \begin{enumerate}[label=\alph]\item Huehue\end{enumerate}

As with commands, environments store their arguments in the list .args:

>>> str(env.args)
'[label=\\alph]'

Each environment holds a variable amount of content, accessible via .contents:

>>> list(env.contents)
[\item Huehue]