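I'm not very familiar with advanced JavaScript and am looking for some guidance. I'd like to use puppeteer-cluster to store web page content in a DB. Here is an example: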
const { Cluster } = require('puppeteer-cluster');
(async () => {
  const cluster = await Cluster.launch({
    concurrency: Cluster.CONCURRENCY_CONTEXT,
    maxConcurrency: 2,
  });
  await cluster.task(async ({ page, data: url }) => {
    await page.goto(url);
    const screen = await page.content();
    // Store content, do something else
  });
  cluster.queue('http://www.google.com/');
  cluster.queue('http://www.wikipedia.org/');
  // many more pages
  await cluster.idle();
  await cluster.close();
})();
It looks like I may have to use the pg addon to connect to the DB. What is the recommended approach?
This is my table:
+----+-----------------------------------------------------+---------+
| id | url | content |
+----+-----------------------------------------------------+---------+
| 1 | https://www.npmjs.com/package/pg | |
+----+-----------------------------------------------------+---------+
| 2 | https://github.com/thomasdondorf/puppeteer-cluster/ | |
+----+-----------------------------------------------------+---------+
I believe I have to pull the data into an array (id & url) and, each time content is received, store it in the DB (by id & content).
Posted on 2019-03-20 18:54:21
You should create the database connection outside the task function:
const { Client } = require('pg');
const client = new Client(/* ... */);
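await client.connect();
Then query the data and queue it (including the ID, so that you can save the result to the database later):
// client.query() resolves to a result object; the row array is in its `rows` property
const { rows } = await client.query('SELECT id, url FROM your_table WHERE ...');
rows.forEach(row => cluster.queue({ id: row.id, url: row.url }));
Then, at the end of the task function, update the table row: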
await cluster.task(async ({ page, data: { id, url } }) => {
  // ... run puppeteer and save results in content variable
  await client.query('UPDATE your_table SET content=$1 WHERE id=$2', [content, id]);
});
Altogether, your code should look like this (note that I have not tested the code myself):
const { Cluster } = require('puppeteer-cluster');
const { Client } = require('pg');
(async () => {
  // One shared database connection, created outside the task function
  const client = new Client(/* ... */);
  await client.connect();
  const cluster = await Cluster.launch({
    concurrency: Cluster.CONCURRENCY_CONTEXT,
    maxConcurrency: 2,
  });
  await cluster.task(async ({ page, data: { id, url } }) => {
    await page.goto(url);
    const content = await page.content();
    // Save the page content to the row this job came from
    await client.query('UPDATE your_table SET content=$1 WHERE id=$2', [content, id]);
  });
  // The row array is in the `rows` property of the query result
  const { rows } = await client.query('SELECT id, url FROM your_table');
  rows.forEach(row => cluster.queue({ id: row.id, url: row.url }));
  await cluster.idle();
  await cluster.close();
  // Close the database connection so the process can exit
  await client.end();
})();
https://stackoverflow.com/questions/55266504
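One detail worth spelling out: in pg, client.query() resolves to a result object, and the array of rows lives in its rows property. Below is a minimal sketch of turning such a result into the payloads passed to cluster.queue(). The names toJobs and fakeResult are hypothetical and introduced here only for illustration; fakeResult stands in for a real query result so the snippet runs without a database.

```javascript
// Hypothetical helper: turn a pg query result into payloads for cluster.queue().
function toJobs(result) {
  // result.rows is an array of plain objects keyed by column name
  return result.rows.map(({ id, url }) => ({ id, url }));
}

// Stand-in for what `await client.query('SELECT id, url FROM your_table')` resolves to
const fakeResult = {
  rowCount: 2,
  rows: [
    { id: 1, url: 'https://www.npmjs.com/package/pg' },
    { id: 2, url: 'https://github.com/thomasdondorf/puppeteer-cluster/' },
  ],
};

const jobs = toJobs(fakeResult);
console.log(jobs);
```

With a real connection, the same helper would presumably be used as toJobs(await client.query(...)).forEach(job => cluster.queue(job)), so that each task receives both the id and the url.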