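Can anyone suggest how to give tika a larger heap size (around 1 GByte) when using tika-python on Windows?
When processing very large Microsoft Word files, I get "status: 500" errors from tika. If I run tika from the Windows command line as shown below, the errors go away:

C:>java -Xmx1G -jar tika-app-2.1.0.jar

-Xmx1G sets the maximum heap size to 1 GByte (much larger than the default).
I have seen several answers for other languages, but none for Python with tika-python.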
I have tried:

os.environ["TIKA_JAVA_ARGS"] = "-Xmx1G"
from tika import parser as tika_parser

and also:

def main():
    global MODEL_LIST
    os.environ["TIKA_JAVA_ARGS"] = "-Xmx1G"
    start_time = time.time()
    ... rest of code ...

and on the Windows command line:

C:\<path>\findEm>set TIKA_JAVA_ARGS="-Xmx1G"
C:\<path>\findEm>python3 findEmv1.52.py

All three approaches produce the same error, shown below:

2021-10-19 14:43:55,782 [MainThread ] [WARNI] Tika server returned status: 500

I think the main problem is that the Java tika process is already running when I try to change the maximum heap size. Somehow I need to shut it down, set the maximum heap size, and restart the Java tika server. How do I do that?
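Posted on 2021-10-22 23:06:57
Your suspicion about the already-running process is correct. If tika is left running in the background, the script does not restart the java process with the new flags when it starts, so the heap is never increased.
As for fixing this, it can be done entirely in Python with the help of psutil: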
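
from typing import Optional

import psutil
from tika import tika as tika_server
from tika import parser


def get_tika_process() -> Optional[psutil.Process]:
    # Look for a java process whose command line mentions tika.
    for process in psutil.process_iter(["name", "cmdline"]):
        try:
            if "java" in process.name():
                for part in process.cmdline():
                    if "tika" in part:
                        return process
        except (psutil.NoSuchProcess, psutil.AccessDenied):
            # The process exited or is not readable by this user; skip it.
            continue
    return None


# The := operator requires Python 3.8 or later.
if existing_tika_process := get_tika_process():
    print("Found tika process:", existing_tika_process)
    print("Existing process args:", existing_tika_process.cmdline())
    # Stop the old server so the next parse call starts a fresh one.
    existing_tika_process.terminate()
    terminate_result = existing_tika_process.wait(10)
    print(f"Terminated tika; exit code {terminate_result}")
else:
    print("No existing tika process found")

tika_server.TikaJavaArgs += "-Xmx1G"  # See note {1}
# The first parse call starts the server, now with the new Java args.
parsed = parser.from_file("spam.txt")
print("Tika server started")
new_tika_process = get_tika_process()
if new_tika_process:
    print("New process args:", new_tika_process.cmdline())
print(parsed["metadata"])
print(parsed["content"])

{1} The environment variables are parsed when tika is imported, so I append directly to tika_server.TikaJavaArgs instead. If you delay the import (as in the first attempt in the question), you can set the environment variable instead.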
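
As a rough illustration of the alternative mentioned in note {1}, the sketch below sets TIKA_JAVA_ARGS before tika is imported. The helper name set_tika_heap is hypothetical (it is not part of tika-python), and the approach only helps if no old server is still running, as discussed above.

```python
import os
import sys


def set_tika_heap(max_heap: str = "1G") -> bool:
    """Add -Xmx<max_heap> to TIKA_JAVA_ARGS (hypothetical helper).

    Returns True if tika.tika has not been imported yet, meaning the
    environment variable can still be picked up at import time.
    """
    flag = f"-Xmx{max_heap}"
    existing = os.environ.get("TIKA_JAVA_ARGS", "")
    # Avoid adding the same flag twice on repeated calls.
    if flag not in existing.split():
        os.environ["TIKA_JAVA_ARGS"] = f"{existing} {flag}".strip()
    return "tika.tika" not in sys.modules


# Intended use: call this first, and import tika only afterwards.
#   if not set_tika_heap("1G"):
#       print("tika was already imported; the env var will be ignored")
#   from tika import parser
```

If the function returns False, appending to tika_server.TikaJavaArgs directly, as in the answer's script, is the remaining option.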
Result:
(venv) PS E:\DevProjects\stack-exchange-answers\69637621> python .\main.py
No existing tika process found
2021-10-22 22:50:04,476 [MainThread ] [WARNI] Failed to see startup log message; retrying...
Tika server started
New process args: ['java', '-cp', 'C:\\Users\\user\\AppData\\Local\\Temp\\tika-server.jar', 'org.apache.tika.server.TikaServerCli', '--port', '9998', '--host', '0.0.0.0']
{'Content-Encoding': 'windows-1252', 'Content-Type': 'text/plain; charset=windows-1252', 'X-Parsed-By': ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser'], 'X-TIKA:content_handler': 'ToTextContentHandler', 'X-TIKA:embedded_depth': '0', 'X-TIKA:parse_time_millis': '54', 'resourceName': "b'spam.txt'"}
<blank lines removed>
Spam
Spam
More Spam!
(venv) PS E:\DevProjects\stack-exchange-answers\69637621> python .\main.py
Found tika process: psutil.Process(pid=11244, name='java.exe', status='running', started='22:50:04')
Existing process args: ['java', '-cp', 'C:\\Users\\user\\AppData\\Local\\Temp\\tika-server.jar', 'org.apache.tika.server.TikaServerCli', '--port', '9998', '--host', '0.0.0.0']
Terminated tika; exit code 15
2021-10-22 22:54:40,016 [MainThread ] [WARNI] Failed to see startup log message; retrying...
Tika server started
New process args: ['java', '-Xmx1G', '-cp', 'C:\\Users\\user\\AppData\\Local\\Temp\\tika-server.jar', 'org.apache.tika.server.TikaServerCli', '--port', '9998', '--host', '0.0.0.0']
{'Content-Encoding': 'windows-1252', 'Content-Type': 'text/plain; charset=windows-1252', 'X-Parsed-By': ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser'], 'X-TIKA:content_handler': 'ToTextContentHandler', 'X-TIKA:embedded_depth': '0', 'X-TIKA:parse_time_millis': '55', 'resourceName': "b'spam.txt'"}
<blank lines removed>
Spam
Spam
More Spam!
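(venv) PS E:\DevProjects\stack-exchange-answers\69637621>

You can certainly improve on this (for example, by checking whether the running server's args already match and skipping the termination if they do), but it should at least get you running again.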
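
One possible way to do the args check is sketched below. It inspects a command line like the one returned by process.cmdline() and decides whether the server needs restarting. The names parse_xmx and needs_restart are made up for this illustration, and the code assumes that the last -Xmx flag on the command line is the effective one.

```python
import re
from typing import List, Optional

# Multipliers for the size suffixes accepted after -Xmx (no suffix means bytes).
_UNITS = {"": 1, "k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}


def parse_xmx(cmdline: List[str]) -> Optional[int]:
    """Return the max heap in bytes from the last -Xmx flag, or None if absent."""
    heap = None
    for arg in cmdline:
        match = re.fullmatch(r"-Xmx(\d+)([kKmMgG]?)", arg)
        if match:
            heap = int(match.group(1)) * _UNITS[match.group(2).lower()]
    return heap


def needs_restart(cmdline: List[str], wanted: str = "-Xmx1G") -> bool:
    """True if the running server has no heap limit set or one below `wanted`."""
    required = parse_xmx([wanted])
    if required is None:
        raise ValueError(f"not an -Xmx flag: {wanted!r}")
    current = parse_xmx(cmdline)
    return current is None or current < required
```

In the script above, the terminate() call could then be guarded with something like `if needs_restart(existing_tika_process.cmdline()):`.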
Also, consider adding a call to tika.tika.killServer() at the end of your script so that the server is stopped when you are done.
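
A minimal sketch of that cleanup, assuming you want the server stopped even when parsing raises an exception. The context manager stop_server_on_exit is a hypothetical wrapper, not something tika-python provides; by default it calls tika.tika.killServer(), which, as far as I know, only stops a server started by the current Python process.

```python
from contextlib import contextmanager
from typing import Callable, Iterator, Optional


@contextmanager
def stop_server_on_exit(kill: Optional[Callable[[], None]] = None) -> Iterator[None]:
    """Run the body, then stop the Tika server even if the body raised."""
    if kill is None:
        # Imported lazily so Java args can be configured before tika loads.
        from tika import tika as tika_server
        kill = tika_server.killServer
    try:
        yield
    finally:
        kill()


# Intended use (requires tika-python to be installed):
#   from tika import parser
#   with stop_server_on_exit():
#       parsed = parser.from_file("spam.txt")
#       print(parsed["content"])
```

The kill parameter exists mainly so the cleanup logic can be exercised without a real server.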
https://stackoverflow.com/questions/69637621