Q: Invoke-WebRequest script does not work as expected

Stack Overflow user

Asked 2019-07-18 12:01:22

Answers: 1, Views: 269, Followers: 0, Votes: 0

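I have a PowerShell script (from Mr. Theo) that fetches the page titles for the URLs in a large text file (6000 lines). My problem is that some lines in the URL file are special and the script does not work on them: it never finishes. The structure of the file (input.txt) is: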

2018-11-23 17:10:20;$https://www.super.cz
2018-11-24 21:09:34;$https://www.seznam.cz
2018-11-25 11:20:23;$https://www.kara.cz/muzi
2018-11-26 21:11:00;$https://atlas.centrum.cz (problem row)
2018-11-27 21:09:34;$https://www.seznam.cz

The desired file structure is:

2018-11-23 17:10:20;$https://www.super.cz;$Super.cz
2018-11-24 21:09:34;$https://www.seznam.cz;$Seznam - najdu tam, co neznám
2018-11-25 11:20:23;$https://www.kara.cz/muzi;$Kara - Online obchod Kara
2018-11-27 21:09:34;$https://www.seznam.cz;$Seznam - najdu tam, co neznám

Or, alternatively, this structure:

2018-11-23 17:10:20;$https://www.super.cz;$Super.cz
2018-11-24 21:09:34;$https://www.seznam.cz;$Seznam - najdu tam, co neznám
2018-11-25 11:20:23;$https://www.kara.cz/muzi;$Kara - Online obchod Kara
2018-11-26 21:11:00;$https://atlas.centrum.cz;$ (problem row without title)
2018-11-27 21:09:34;$https://www.seznam.cz;$Seznam - najdu tam, co neznám

Can I export the failing rows or remove them? Could you help me update the script?

$inputFile  = 'C:\Users\user\Desktop\OSTROTA\input.txt'
$outputFile = 'C:\Users\user\Desktop\OSTROTA\urls_title.txt'

# Read the headerless textfile and replace all `;$` into a single ';'
# so we can use ConvertFrom-Csv.
# Collect the output for each delimited output string in a variable
$result = (Get-Content -Path $inputFile) -replace ';\$', ';' | 
    ConvertFrom-Csv -Delimiter ';' -Header date, url | 
    ForEach-Object {
        # put the url and date in variables so we can use them inside the catch block if needed
        $url  = $_.url
        $date = $_.date
        try {
            $page = Invoke-WebRequest -Uri $_.url -Method Get -ErrorAction Stop
            # output a string, delimited by ';$' 
            '{0};${1};${2}' -f $_.date, $_.url, $page.ParsedHtml.title
        }
        catch {
            Write-Warning "An error occurred on Url '$url'.`r`n$($_.Exception.Message)"
            # output the line with the title empty
            '{0};${1};$' -f $date, $url
        }
    }

# show output on screen
$result

# write output to a new headerless text file
$result | Set-Content $outputFile -Force

powershell

1 Answer

Stack Overflow user

Answered 2019-07-18 12:25:47

You have to modify this line inside catch { ... }:

'{0};${1};$' -f $date, $url

If you do not want the row to appear at all, just comment it out:

#'{0};${1};$' -f $date, $url

If you want to add a custom message, pass it after $url and add ${2} to the format string:

'{0};${1};${2}' -f $date, $url, ' (problem row without title)'
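
If you would rather export the failing rows than drop or annotate them, one option is to collect them separately inside catch. The following is only a minimal sketch: the function name Split-UrlRows, its parameters, and the idea of passing the title lookup in as a script block are assumptions made for illustration and are not part of the original script.

```powershell
# Sketch only: Split-UrlRows and its parameter names are hypothetical.
function Split-UrlRows {
    param(
        [string[]]    $Line,          # raw input lines in the form 'date;$url'
        [scriptblock] $TitleResolver  # takes a url, returns a title, throws on failure
    )
    $good = [System.Collections.Generic.List[string]]::new()
    $bad  = [System.Collections.Generic.List[string]]::new()

    foreach ($row in $Line) {
        # split 'date;$url' on the literal ';$' delimiter (at most two parts)
        $date, $url = $row -split ';\$', 2
        try {
            $title = & $TitleResolver $url
            $good.Add(('{0};${1};${2}' -f $date, $url, $title))
        }
        catch {
            # keep the failing row unchanged so it can be exported or retried later
            $bad.Add($row)
        }
    }
    # return both lists so the caller decides what to write where
    [pscustomobject]@{ Good = $good; Bad = $bad }
}
```

In the real script the resolver would wrap the Invoke-WebRequest call with -ErrorAction Stop, and the Good and Bad lists could each be piped to Set-Content with their own output file.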

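As @LotPings mentioned in the comments, some sites may need the cmdlet's -UseBasicParsing parameter, otherwise the request hangs.

In that case the ParsedHtml property will be empty, so you need another way to extract the title. An example using a regular expression: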

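try {
    $page = Invoke-WebRequest -Uri $_.url -Method Get -ErrorAction Stop -UseBasicParsing
    # -match fills $Matches only on success, so reset the title first to avoid a stale value
    $title = ''
    if ($page.Content -match '(?is)<title[^>]*>(.*?)</title>') { $title = $Matches[1].Trim() }
    # output a string, delimited by ';$'
    '{0};${1};${2}' -f $_.date, $_.url, $title
}
# the catch { ... } block from the original script follows unchanged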

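Here you search for <title>Something something</title> and use a capture group to extract Something something from it. The whole match is stored in $Matches[0], but you do not need it, so you take the next entry, $Matches[1], which holds the text matched by the capture group.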
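
The capture-group extraction can also be tried on its own, without any network access. This is a minimal sketch, assuming a hypothetical helper name Get-HtmlTitle; it is one way to write the match, not the only one, and a regular expression is not a full HTML parser.

```powershell
# Sketch only: Get-HtmlTitle is a hypothetical helper name.
function Get-HtmlTitle {
    param([string] $Html)
    # (?is): case-insensitive, and '.' also matches newlines;
    # [^>]* tolerates attributes on the tag; (.*?) is lazy, so it stops at the first </title>
    if ($Html -match '(?is)<title[^>]*>(.*?)</title>') {
        # $Matches[0] is the whole match, $Matches[1] is the first capture group
        return ($Matches[1] -replace '\s+', ' ').Trim()
    }
    # no title found: return an empty string rather than a stale $Matches value
    return ''
}

Get-HtmlTitle '<html><head><title>Kara - Online obchod Kara</title></head></html>'
```

Returning an empty string when nothing matches keeps the output line in the "problem row without title" shape shown in the question.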

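Explanation: the catch { ... } block is being used because you explicitly tell the script to enter it on any error: -ErrorAction Stop. That parameter forces every error to trigger the catch { ... } block. In this case, the error is:

Invalid URI: The hostname could not be parsed.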
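
The effect of -ErrorAction Stop can be seen offline with any cmdlet that normally reports a non-terminating error. A minimal sketch, assuming a made-up file name that does not exist in the temp folder; Get-Item stands in for Invoke-WebRequest here only so that no network is needed.

```powershell
# Sketch only: the file name is a made-up example and is assumed not to exist.
$missing = Join-Path ([System.IO.Path]::GetTempPath()) 'no-such-file-for-demo.txt'

# With any action other than Stop (SilentlyContinue here, so nothing is printed)
# the error stays non-terminating and the catch block is never entered.
$caughtWithoutStop = $false
try   { Get-Item -Path $missing -ErrorAction SilentlyContinue }
catch { $caughtWithoutStop = $true }

# With -ErrorAction Stop the same error becomes terminating and reaches catch.
$caughtWithStop = $false
try   { Get-Item -Path $missing -ErrorAction Stop }
catch { $caughtWithStop = $true }

"Caught without Stop: $caughtWithoutStop; caught with Stop: $caughtWithStop"
```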

Votes: 0

Page content originally provided by Stack Overflow. Translation support was provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/57093988
