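I am using Solr 4.6.0 and index about 10,000 elements at a time, and import performance is poor: importing these 10,000 documents takes about 10 minutes. I know this depends heavily on the server hardware, but I would still like to know how performance could be improved, and which measures (joins, etc.) are actually useful in real-world situations. I would also much appreciate precise examples rather than just links to the official documentation.
Here is the data-config.xml: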
<dataConfig>
  <dataSource name="mysql" type="JdbcDataSource"
              driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://xxxx"
              batchSize="-1"
              user="xxxx" password="xxxx" />
  <document name="publications">
    <entity name="publication" transformer="RegexTransformer" pk="id" query="
        SELECT
          sm_publications.id AS p_id,
          CONCAT(sm_publications.title, ' ', sm_publications.abstract) AS p_text,
          sm_publications.year AS p_year,
          sm_publications.doi AS p_doi,
          sm_conferences.full_name AS c_fullname,
          sm_journals.full_name AS j_fullname,
          GROUP_CONCAT(DISTINCT sm_query_publications.query_id SEPARATOR '_-_-_-_-_') AS q_id
        FROM sm_publications
        LEFT JOIN sm_conferences ON sm_conferences.id = sm_publications.conference_id
        LEFT JOIN sm_journals ON sm_journals.id = sm_publications.journal_id
        INNER JOIN sm_query_publications ON sm_query_publications.publication_id = sm_publications.id
        WHERE '${dataimporter.request.clean}' != 'false' OR
          sm_publications.modified > '${dataimporter.last_index_time}' GROUP BY sm_publications.id">
      <field column="p_id" name="id" />
      <field column="p_text" name="text" />
      <field column="p_text" name="text_tv" />
      <field column="p_year" name="year" />
      <field column="p_doi" name="doi" />
      <field column="c_fullname" name="conference" />
      <field column="j_fullname" name="journal" />
      <field column="q_id" name="queries" splitBy="_-_-_-_-_" />
      <entity name="publication_authors" query="
          SELECT
            CONCAT(
              IF(sm_authors.first_name != '',sm_authors.first_name,''),
              IF(sm_authors.middle_name != '',CONCAT(' ',sm_authors.middle_name),''),
              IF(sm_authors.last_name != '',CONCAT(' ',sm_authors.last_name),'')
            ) AS a_name,
            sm_affiliations.display_name AS aa_display_name,
            CONCAT(sm_affiliations.latitude, ',', sm_affiliations.longitude) AS aa_geo,
            sm_affiliations.country_name AS aa_country_name
          FROM sm_publication_authors
          INNER JOIN sm_authors ON sm_authors.id = sm_publication_authors.author_id
          LEFT JOIN sm_affiliations ON sm_affiliations.id = sm_authors.affiliation_id
          WHERE sm_publication_authors.publication_id = '${publication.p_id}'">
        <field column="a_name" name="authors" />
        <field column="aa_display_name" name="affiliations" />
        <field column="aa_geo" name="geo" />
        <field column="aa_country_name" name="countries" />
      </entity>
      <entity name="publication_keywords" query="
          SELECT sm_keywords.name FROM sm_publication_keywords
          INNER JOIN sm_keywords ON sm_keywords.id = sm_publication_keywords.keyword_id
          WHERE sm_publication_keywords.publication_id = '${publication.p_id}'">
        <field column="name" name="keywords" />
      </entity>
    </entity>
  </document>
</dataConfig>

Posted on 2014-02-18 14:01:56
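By query caching, I meant CachedSqlEntityProcessor. I prefer the merged solution, as in your other question about duplicate entries. But CachedSqlEntityProcessor will help too, if the same p_id is repeated over and over in the result set of the main query for publication_authors and you are not too concerned about the extra memory usage.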
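To make the "merged" option concrete, here is a minimal sketch that folds the publication_keywords sub-entity from the question into the main query, so no extra query is sent per publication. It reuses the GROUP_CONCAT plus splitBy pattern the question already applies to q_id. The table aliases (p, pk, k) and the column alias k_names are made up for illustration, and the other columns and the delta WHERE clause from the original query are left out for brevity.

```xml
<!-- Sketch only: keywords merged into the parent query instead of a nested entity. -->
<entity name="publication" transformer="RegexTransformer" query="
    SELECT
      p.id AS p_id,
      GROUP_CONCAT(DISTINCT k.name SEPARATOR '_-_-_-_-_') AS k_names
    FROM sm_publications p
    LEFT JOIN sm_publication_keywords pk ON pk.publication_id = p.id
    LEFT JOIN sm_keywords k ON k.id = pk.keyword_id
    GROUP BY p.id">
  <field column="p_id" name="id" />
  <!-- RegexTransformer splits the concatenated string back into a multi-valued field -->
  <field column="k_names" name="keywords" splitBy="_-_-_-_-_" />
</entity>
```

One thing to watch with this approach: MySQL truncates GROUP_CONCAT results at group_concat_max_len (1024 by default), so publications with many or long values may need that session variable raised.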
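Update: It looks like you have already solved your other two questions, so perhaps you can go either way. In any case, I will post the short example/pointer you asked for, in case others find it handy.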
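<entity name="x" query="select * from x">
  <entity name="y" query="select * from y" processor="CachedSqlEntityProcessor" where="xid=x.id">
  </entity>
</entity>

This example is taken from the wiki. It will still run the query "select * from y where xid=id" for each id returned by the main query "select * from x", but it will not send the same query repeatedly.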
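Applied to the configuration in the question, the idea could look roughly like the two alternatives below for publication_keywords (use one or the other, not both). This is a sketch, not a tested config. As far as I understand the DataImportHandler wiki, the per-id behaviour described above corresponds to variant A, where the parent key is interpolated into the query and identical queries are answered from the cache. With the where attribute (variant B), the child query is run once, all of its rows are cached, and each parent row is served by a lookup on the key. The publication_id column alias in variant B is an assumption added so that the cache key is part of the result set.

```xml
<!-- Variant A: one query per distinct p_id; repeated queries are served from the cache -->
<entity name="publication_keywords" processor="CachedSqlEntityProcessor" query="
    SELECT sm_keywords.name FROM sm_publication_keywords
    INNER JOIN sm_keywords ON sm_keywords.id = sm_publication_keywords.keyword_id
    WHERE sm_publication_keywords.publication_id = '${publication.p_id}'">
  <field column="name" name="keywords" />
</entity>

<!-- Variant B: a single unfiltered query, rows cached and looked up by publication_id -->
<entity name="publication_keywords" processor="CachedSqlEntityProcessor"
        where="publication_id=publication.p_id" query="
    SELECT sm_publication_keywords.publication_id AS publication_id,
           sm_keywords.name AS name
    FROM sm_publication_keywords
    INNER JOIN sm_keywords ON sm_keywords.id = sm_publication_keywords.keyword_id">
  <field column="name" name="keywords" />
</entity>
```

Variant B trades memory for fewer round trips: the whole join result is held in the cache, and it is fetched in full even during a delta import that touches only a few publications.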
https://stackoverflow.com/questions/21841186