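I'm prototyping a Rails application that uploads documents to FSCrawler (running its REST interface), which then incorporates them into an Elasticsearch index. Using their example, this works: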
response = `curl -F "file=@#{params[:document][:upload].tempfile.path}" "http://127.0.0.1:8080/fscrawler/_upload?debug=true"`

The file is uploaded and its content is indexed. Here is an example of what I get back:

"{\n \"ok\" : true,\n \"filename\" : \"RackMultipart20200130-91061-16swulg.pdf\",\n \"url\" : \"http://127.0.0.1:9200/local/_doc/d661edecf3e28572676e97a6f0d1d\",\n \"doc\" : {\n \"content\" : \"\\n \\n \\n\\nBasically, what you need to know is that Dante is all IP-based, and makes use of common IT standards. Each Dante device behaves \\n\\nmuch like any other network device you would already find on your network. \\n\\nIn order to make integration into an existing network easy, here are some of the things that Dante does: \\n\\n▪ Dante...

When I run curl from the command line, I get everything, including a correctly set "filename". If I use it as above in a Rails controller, then as you can see the filename is set to the Tempfile's file name. That is not a workable solution. Trying params[:document][:upload].tempfile (without .path), or just params[:document][:upload], fails completely.
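I'm trying to do this "the right way", but every attempt with a proper HTTP client fails. I can't figure out how to make an HTTP call that submits the file to FSCrawler the way curl (on the command line) does.

In this example I'm just trying to send the file using the Tempfile object. For some reason FSCrawler gives the error shown in the comments; it picks up some metadata, but no content is indexed: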
## Failed to extract [100000] characters of text for ...
## org.apache.tika.exception.ZeroByteFileException: InputStream must have > 0 bytes
uri = URI("http://127.0.0.1:8080/fscrawler/_upload?debug=true")
request = Net::HTTP::Post.new(uri)
form_data = [['file', params[:document][:upload].tempfile,
{ filename: params[:document][:upload].original_filename,
content_type: params[:document][:upload].content_type }]]
request.set_form form_data, 'multipart/form-data'
response = Net::HTTP.start(uri.hostname, uri.port) do |http|
http.request(request)
end

If I change the above to use params[:document][:upload].tempfile.path, I no longer get the InputStream error, but I (still) don't get any content indexed. Here is an example of what I get:

{"_index":"local","_type":"_doc","_id":"72c9ecf2a83440994eb87d28786e6","_version":3,"_seq_no":26,"_primary_term":1,"found":true,"_source":{"content":"/var/folders/bn/pcc1h8p16tl534pw__fdz2sw0000gn/T/RackMultipart20200130-91061-134tcxn.pdf\n","meta":{},"file":{"extension":"pdf","content_type":"text/plain; charset=ISO-8859-1","indexing_date":"2020-01-30T15:33:45.481+0000","filename":"Similarity in Postgres and Rails using Trigrams · pganalyze.pdf"},"path":{"virtual":"Similarity in Postgres and Rails using Trigrams · pganalyze.pdf","real":"Similarity in Postgres and Rails using Trigrams · pganalyze.pdf"}}}

If I try RestClient and send the file by referencing the Tempfile's actual path, I get the following error message and nothing else:
## Unsupported media type
response = RestClient.post 'http://127.0.0.1:8080/fscrawler/_upload?debug=true',
file: params[:document][:upload].tempfile.path,
content_type: params[:document][:upload].content_type

If I .read() the file and submit that, I break the FSCrawler form:
## Internal server error
request = RestClient::Request.new(
:method => :post,
:url => 'http://127.0.0.1:8080/fscrawler/_upload?debug=true',
:payload => {
:multipart => true,
:file => File.read(params[:document][:upload].tempfile),
:content_type => params[:document][:upload].content_type
})
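response = request.execute

Obviously I've been trying everything I can think of, but I can't replicate what curl does with any of the known Ruby-based clients. I'm at a complete loss as to how to get Ruby to submit the data to FSCrawler in a way that gets the document content indexed correctly. I've been at this far longer than I'd like to admit. What am I missing here?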
Answered 2020-01-30 21:00:02
connection = Faraday.new('http://127.0.0.1:8080') do |f|
f.request :multipart
f.request :url_encoded
f.adapter :net_http
end
file = Faraday::UploadIO.new(
params[:document][:upload].tempfile.path,
params[:document][:upload].content_type,
params[:document][:upload].original_filename
)
payload = { :file => file }
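response = connection.post('/fscrawler/_upload', payload)

Using Fiddler helped me see the results of my attempts as I got closer and closer to the curl request. This snippet posts a request almost exactly like curl's. To route the call through a proxy, I just needed to add , proxy: 'http://localhost:8866' to the end of the connection setup (the Faraday.new arguments).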
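For anyone comparing requests in a proxy the same way, it may help to see roughly what a multipart/form-data file part looks like on the wire: the part carries its own filename and content type in headers, followed by the raw file bytes. The sketch below builds such a body by hand using only the standard library. The helper name build_multipart_body and the sample filename and data are hypothetical, chosen for illustration; this is not how curl or Faraday build the body internally, only the general shape of what they send.

```ruby
require 'securerandom'

# Illustrative helper (hypothetical name): builds the raw body of a
# multipart/form-data request that contains a single file part.
def build_multipart_body(field:, filename:, content_type:, data:, boundary:)
  parts = [
    "--#{boundary}\r\n",
    # The filename sent here is what the server sees as the upload's name.
    %(Content-Disposition: form-data; name="#{field}"; filename="#{filename}"\r\n),
    "Content-Type: #{content_type}\r\n",
    "\r\n",
    data,                      # the file's bytes, not its path
    "\r\n--#{boundary}--\r\n"  # closing boundary
  ]
  # Treat every piece as binary so arbitrary file bytes concatenate safely.
  parts.map(&:b).join
end

boundary = SecureRandom.hex(16)
body = build_multipart_body(
  field: 'file',
  filename: 'report.pdf',            # sample value, stands in for original_filename
  content_type: 'application/pdf',
  data: '%PDF-1.4 sample bytes',     # sample value, stands in for the file contents
  boundary: boundary
)

puts body
```

With curl -F "file=@/some/path", the filename in that header defaults to the base name of the local path, which would explain why the Rails-side curl call reported the Tempfile's name. curl also accepts a ;filename= suffix on the -F value to override it.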
https://stackoverflow.com/questions/59989742