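When I scrape a web page, its structure keeps changing; because the page is dynamic, my crawler stops working. Is there a mechanism to detect changes in the page structure before running the full crawler, so I can tell whether the structure has changed?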
Posted on 2020-09-28 21:49:55
If you can run your own JavaScript code in the web page, you can use a MutationObserver to watch for changes made to the DOM tree.
Something like:

    // Resolves once no nodes have been added or removed for `timeout` milliseconds.
    function waitForDomStability(timeout: number): Promise<void> {
      return new Promise((resolve) => {
        const waitResolve = (obs: MutationObserver) => {
          obs.disconnect();
          resolve();
        };

        let timeoutId: number;
        const observer = new MutationObserver((mutationList, obs) => {
          for (let i = 0; i < mutationList.length; i += 1) {
            // we only care about nodes being added or removed
            if (mutationList[i].type === 'childList') {
              // restart the countdown timer
              window.clearTimeout(timeoutId);
              timeoutId = window.setTimeout(waitResolve, timeout, obs);
              break;
            }
          }
        });

        // resolve after `timeout` ms if the DOM never changes at all
        timeoutId = window.setTimeout(waitResolve, timeout, observer);

        // start observing document.body
        observer.observe(document.body, { attributes: true, childList: true, subtree: true });
      });
    }

I use this approach in the open-source scraping extension get-set-fetch. For the complete code, see /packages/background/src/ts/plugins/builtin/FetchPlugin.ts in the repository.
Posted on 2020-09-29 23:34:19
You can certainly use a "snapshot" to compare two versions of the same page. To do this, I implemented something similar to Java's String hashCode.
The code in JavaScript:

    /*
      Returns a DOM element snapshot as the hash code of its innerText.
      Starting point is Java's String hashCode: s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]
      Keep everything fast: work with a 32-bit hash only and avoid explicit exponentiation
      by updating the hash incrementally: hash = hash*31 + s[i]
    */
    function getSnapshot() {
      const snapshotSelector = 'body';
      const nodeToBeHashed = document.querySelector(snapshotSelector);
      if (!nodeToBeHashed) return 0;

      const { innerText } = nodeToBeHashed;

      let hash = 0;
      if (innerText.length === 0) {
        return hash;
      }

      for (let i = 0; i < innerText.length; i += 1) {
        // an integer between 0 and 65535 representing the UTF-16 code unit
        const charCode = innerText.charCodeAt(i);

        // multiply by 31 and add current charCode
        hash = ((hash << 5) - hash) + charCode;

        // truncate to 32 bits, as bitwise operators treat their operands as 32-bit integers
        hash |= 0;
      }

      return hash;
    }

If you cannot run JavaScript code in the page, you can instead hash the entire HTML response, in whatever language you prefer.
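
As a rough sketch of that last suggestion, the same 32-bit hash can be applied to a raw HTML string outside the browser and compared against a hash saved from an earlier run. The names `hashString` and `hasChanged` are made up for this illustration and are not part of get-set-fetch. Note that hashing the full response flags any difference, including pure content changes, so it is only a coarse signal that the structure may have changed.

```typescript
// Hypothetical helper: 32-bit hash of any string, using the same scheme as getSnapshot.
function hashString(text: string): number {
  let hash = 0;
  for (let i = 0; i < text.length; i += 1) {
    // hash * 31 + UTF-16 code unit, truncated to a signed 32-bit integer
    hash = (((hash << 5) - hash) + text.charCodeAt(i)) | 0;
  }
  return hash;
}

// Hypothetical helper: true when the fetched HTML no longer matches the stored baseline hash.
function hasChanged(baselineHash: number, html: string): boolean {
  return hashString(html) !== baselineHash;
}

// Usage: store the baseline once, then check it before each full crawl.
const baselineHash = hashString('<ul><li>item</li></ul>');
const unchanged = !hasChanged(baselineHash, '<ul><li>item</li></ul>');
console.log(unchanged); // true: identical input always yields the same hash
```

A crawler could run this check on the response of a cheap initial request and skip or flag the full run when `hasChanged` returns true.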
https://stackoverflow.com/questions/64101394