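I have been trying to use pypandoc to convert the HTML string question_text_html in the code below (a math question written in HTML) into a LaTeX string. The converted string keeps including irrelevant pieces such as "\protect\hypertarget{MJX-...}".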
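import pypandoc
from selenium import webdriver

# Assumption: Firefox; the original snippet used `driver` without creating it
driver = webdriver.Firefox()
driver.get("https://nigerianscholars.com/past-questions/mathematics/?show_answers=yes")
# Selenium 3 API; Selenium 4 replaces these calls with find_elements(By.CLASS_NAME, ...)
question_blocks = driver.find_elements_by_class_name('question_block')
for question_block in question_blocks:
    question_text = question_block.find_element_by_class_name('question_text')
    # innerHTML includes the spans MathJax generated while rendering
    question_text_html = question_text.get_attribute('innerHTML')
    question_latex = pypandoc.convert_text(question_text_html, 'tex', format='html')
    print(f'Question Html is {question_text_html}')
    print(f'Question latex is {question_latex}')

It typically gives: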
Question Html is <html><body><p class="q_question">Differentiate <span class="MathJax_Preview" style="color: inherit;"></span><span class="mjx-chtml MathJax_CHTML" data-mathml='<math xmlns="http://www.w3.org/1998/Math/MathML"><mo stretchy="false">(</mo><mn>2</mn><mi>x</mi><mo>+</mo><mn>5</mn><msup><mo stretchy="false">)</mo><mn>2</mn></msup><mo stretchy="false">(</mo><mi>x</mi><mo>&#x2212;</mo><mn>4</mn><mo stretchy="false">)</mo></math>' id="MathJax-Element-1-Frame" role="presentation" style="font-size: 114%; position: relative;" tabindex="0"><span aria-hidden="true" class="mjx-math" id="MJXc-Node-1"><span class="mjx-mrow" id="MJXc-Node-2"><span class="mjx-mo" id="MJXc-Node-3"><span class="mjx-char MJXc-TeX-main-R" style="padding-top: 0.485em; padding-bottom: 0.572em;">(</span></span><span class="mjx-mn" id="MJXc-Node-4"><span class="mjx-char MJXc-TeX-main-R" style="padding-top: 0.397em; padding-bottom: 0.353em;">2</span></span><span class="mjx-mi" id="MJXc-Node-5"><span class="mjx-char MJXc-TeX-math-I" style="padding-top: 0.221em; padding-bottom: 0.309em;">x</span></span><span class="mjx-mo MJXc-space2" id="MJXc-Node-6"><span class="mjx-char MJXc-TeX-main-R" style="padding-top: 0.309em; padding-bottom: 0.441em;">+</span></span><span class="mjx-mn MJXc-space2" id="MJXc-Node-7"><span class="mjx-char MJXc-TeX-main-R" style="padding-top: 0.397em; padding-bottom: 0.353em;">5</span></span><span class="mjx-msubsup" id="MJXc-Node-8"><span class="mjx-base"><span class="mjx-mo" id="MJXc-Node-9"><span class="mjx-char MJXc-TeX-main-R" style="padding-top: 0.485em; padding-bottom: 0.572em;">)</span></span></span><span class="mjx-sup" style="font-size: 70.7%; vertical-align: 0.513em; padding-left: 0px; padding-right: 0.071em;"><span class="mjx-mn" id="MJXc-Node-10" style=""><span class="mjx-char MJXc-TeX-main-R" style="padding-top: 0.397em; padding-bottom: 0.353em;">2</span></span></span></span><span class="mjx-mo" id="MJXc-Node-11"><span class="mjx-char MJXc-TeX-main-R" 
style="padding-top: 0.485em; padding-bottom: 0.572em;">(</span></span><span class="mjx-mi" id="MJXc-Node-12"><span class="mjx-char MJXc-TeX-math-I" style="padding-top: 0.221em; padding-bottom: 0.309em;">x</span></span><span class="mjx-mo MJXc-space2" id="MJXc-Node-13"><span class="mjx-char MJXc-TeX-main-R" style="padding-top: 0.309em; padding-bottom: 0.441em;">−</span></span><span class="mjx-mn MJXc-space2" id="MJXc-Node-14"><span class="mjx-char MJXc-TeX-main-R" style="padding-top: 0.397em; padding-bottom: 0.353em;">4</span></span><span class="mjx-mo" id="MJXc-Node-15"><span class="mjx-char MJXc-TeX-main-R" style="padding-top: 0.485em; padding-bottom: 0.572em;">)</span></span></span></span><span class="MJX_Assistive_MathML" role="presentation"><math xmlns="http://www.w3.org/1998/Math/MathML"><mo stretchy="false">(</mo><mn>2</mn><mi>x</mi><mo>+</mo><mn>5</mn><msup><mo stretchy="false">)</mo><mn>2</mn></msup><mo stretchy="false">(</mo><mi>x</mi><mo>−</mo><mn>4</mn><mo stretchy="false">)</mo></math></span></span><script id="MathJax-Element-1" type="math/tex">(2x+5)^2(x-4)</script> with respect to x.</p></body></html>
Question latex is Differentiate
{}\protect\hypertarget{MathJax-Element-1-Frame}{}{\protect\hypertarget{MJXc-Node-1}{}{\protect\hypertarget{MJXc-Node-2}{}{\protect\hypertarget{MJXc-Node-3}{}{{(}}\protect\hypertarget{MJXc-Node-4}{}{{2}}\protect\hypertarget{MJXc-Node-5}{}{{x}}\protect\hypertarget{MJXc-Node-6}{}{{+}}\protect\hypertarget{MJXc-Node-7}{}{{5}}\protect\hypertarget{MJXc-Node-8}{}{{\protect\hypertarget{MJXc-Node-9}{}{{)}}}{\protect\hypertarget{MJXc-Node-10}{}{{2}}}}\protect\hypertarget{MJXc-Node-11}{}{{(}}\protect\hypertarget{MJXc-Node-12}{}{{x}}\protect\hypertarget{MJXc-Node-13}{}{{−}}\protect\hypertarget{MJXc-Node-14}{}{{4}}\protect\hypertarget{MJXc-Node-15}{}{{)}}}}{\((2x + 5)^{2}(x - 4)\)}}\((2x+5)^2(x-4)\)
with respect to x.

How can I remove all the "\protect\hypertarget{MJXc-Node-10}" pieces from the LaTeX, leaving only
Differentiate {\((2x + 5)^{2}(x - 4)\)}}\((2x+5)^2(x-4)\)
with respect to x.

Posted on 2021-01-11 04:34:53
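With MathJax, the equation originally exists in TeX notation. The spans are created by MathJax to lay out the formula. Right now you let MathJax render the equation first, grab the rendered result, and then try to convert it back into the original TeX. Reading the TeX source directly is more straightforward and avoids the detour through JavaScript rendering.
To do that, you only need to disable JavaScript in Selenium. For the Firefox driver, for example, this should do it: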
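from selenium.webdriver.firefox.options import Options
from selenium import webdriver

opts = Options()
# Turn off JavaScript so MathJax never runs and the page keeps its raw TeX source
opts.preferences.update({
    "javascript.enabled": False,
})
driver = webdriver.Firefox(options=opts)

Alternatively, if for some reason you need to work with the rendered version with JavaScript enabled, you can try grabbing the content of the script element inside the <p>. It contains the full equation, but without the TeX math delimiters: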
<p class="q_question">...<script type="math/tex">(2x+5)^2(x-4)</script>...</p>

That way you don't have to remove the spans. You then need to wrap the content in TeX's math delimiters \(...\).
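The script-element approach could be sketched roughly as follows. This is only an illustration built on the standard library's html.parser, not something from the original answer: the names MathJaxToTeX, RENDERED_CLASSES and mathjax_html_to_tex are made up for this example, and the set of class names to skip is an assumption based on the HTML shown in the question. It drops the spans MathJax rendered and wraps the contents of each math/tex script in \(...\) (or \[...\] for display mode).

```python
from html.parser import HTMLParser

# Assumption: these classes mark MathJax's rendered output in the question's HTML.
RENDERED_CLASSES = {"MathJax_Preview", "MathJax_CHTML", "MathJax"}


class MathJaxToTeX(HTMLParser):
    """Collect text, skip rendered MathJax spans, wrap math/tex scripts in TeX delimiters."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0         # > 0 while inside a rendered MathJax span
        self.closer = None          # closing delimiter of the open math/tex script
        self.ignore_script = False  # True inside a non-math script

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self.skip_depth:
            if tag == "span":
                self.skip_depth += 1  # track nesting so we know where the skip ends
            return
        classes = set((attrs.get("class") or "").split())
        if tag == "span" and classes & RENDERED_CLASSES:
            self.skip_depth = 1
        elif tag == "script":
            script_type = attrs.get("type") or ""
            if script_type.startswith("math/tex"):
                display = "mode=display" in script_type
                self.parts.append("\\[" if display else "\\(")
                self.closer = "\\]" if display else "\\)"
            else:
                self.ignore_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            if self.closer:
                self.parts.append(self.closer)
                self.closer = None
            self.ignore_script = False
        elif tag == "span" and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth and not self.ignore_script:
            self.parts.append(data)


def mathjax_html_to_tex(html):
    """Return the text of `html` with each MathJax formula replaced by its TeX source."""
    parser = MathJaxToTeX()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


# Shortened version of the HTML from the question.
sample = (
    '<p class="q_question">Differentiate '
    '<span class="MathJax_Preview"></span>'
    '<span class="mjx-chtml MathJax_CHTML" id="MathJax-Element-1-Frame">'
    '<span class="mjx-math" id="MJXc-Node-1"><span class="mjx-char">(</span></span>'
    '<span class="MJX_Assistive_MathML"><math><mo>(</mo></math></span></span>'
    '<script id="MathJax-Element-1" type="math/tex">(2x+5)^2(x-4)</script>'
    ' with respect to x.</p>'
)
print(mathjax_html_to_tex(sample))
```

In the scraping loop from the question, the value of question_text.get_attribute('innerHTML') could be passed to mathjax_html_to_tex in place of the pypandoc call, assuming the questions contain no other markup that needs converting.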
https://stackoverflow.com/questions/65619853