
Finding and removing duplicates from a text file using VB.NET

Stack Overflow user
Asked 2012-09-16 12:51:08
1 answer, 1.4K views, 0 followers, score 1

I have a huge text file that contains a large number of duplicate entries. The duplicates look like this:

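Total Posts 16

Pin Code = GFDHG

Title = Shop Signs/Projecting Signs/Industrial Signs/Restaurant Signs/Menu Boards and Boxes in London

Date = 12-09-2012

Tracking Key # 85265E712050-15207427406854753

Total Posts 16

Pin Code = GFDHG

Title = Shop Signs/Projecting Signs/Industrial Signs/Restaurant Signs/Menu Boards and Boxes in London

Date = 12-09-2012

Tracking Key # 85265E712050-15207427406854753

Total Posts 2894

Pin Code = GFDHG

Title = Shop Signs/Projecting Signs/Industrial Signs/Restaurant Signs/Menu Boards and Boxes in London

Date = 15-09-2012

Tracking Key # 85265E712050-152797637654753

Total Posts 2894

Pin Code = GFDHG

Title = Shop Signs/Projecting Signs/Industrial Signs/Restaurant Signs/Menu Boards and Boxes in London

Date = 15-09-2012

Tracking Key # 85265E712050-152797637654753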

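And so on; the file contains about 4000 posts in total. I want my program to compare each "Total Posts" entry in the file against all the other "Total Posts" entries and, wherever it finds a duplicate, programmatically delete that duplicate line together with the 7 lines that follow it. Thanks.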

1 Answer

Stack Overflow user

Answered 2015-03-03 02:45:14

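Assuming the format is consistent (i.e., every entry recorded in the file takes up exactly 6 lines of text), you can remove the duplicates from the file with something like the following: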

Sub DupClean(ByVal fpath As String) 'fpath is the FULL file path, i.e. C:\Users\username\Documents\filename.txt
    Const BlockSize As Integer = 6 'number of lines that make up one record
    Dim TxtLines As String() = Nothing
    Dim CleanLines As New List(Of String)
    Dim Seen As New HashSet(Of String) 'records already kept, so each duplicate check is a fast lookup
    Dim i As Integer = 0
    'Write to "<name>_clean.txt" next to the original; to overwrite the original file instead, set outPath = fpath
    Dim outPath As String = System.IO.Path.Combine(System.IO.Path.GetDirectoryName(fpath), _
                                                   System.IO.Path.GetFileNameWithoutExtension(fpath) & "_clean.txt")

    Try
        'Read in the text, one array element per line (handles both LF and CRLF line endings)
        TxtLines = System.IO.File.ReadAllLines(fpath, System.Text.Encoding.UTF8)
    Catch ex As Exception
        MsgBox("Encountered an error while reading in the text file contents. Details: " & ex.Message, vbOKOnly, "Read Error")
        Return
    End Try

    Try
        'Now we iterate through blocks of 6 lines
        Do While i < TxtLines.Length
            'Take the next 6 lines, or fewer if the file ends in the middle of a block
            Dim count As Integer = Math.Min(BlockSize, TxtLines.Length - i)
            Dim CText As String = String.Join(vbLf, TxtLines, i, count)

            'Seen.Add returns False if this exact block was already kept
            If Seen.Add(CText) Then
                For j As Integer = i To i + count - 1
                    CleanLines.Add(TxtLines(j))
                Next
            End If 'else the block is a duplicate and we don't need to do anything

            i = i + BlockSize
        Loop
    Catch ex As Exception
        MsgBox("Encountered an error while removing duplicates from the text. The application was on line " & i & " when the following error was thrown: " & ex.Message, _
               vbOKOnly, "Comparison Error")
        Return
    End Try

    Try
        'Write out the clean text; WriteAllLines opens, flushes and closes the file for us
        System.IO.File.WriteAllLines(outPath, CleanLines.ToArray())
    Catch ex As Exception
        MsgBox("Encountered an error writing the cleaned text. Details: " & ex.Message, vbOKOnly, "Write Error")
    End Try
End Sub

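If the format is not consistent, you will need to get a bit fancier and define rules that decide which lines get added to CText on each pass through the loop, but without more context I can't tell you what those rules might be.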
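As a rough illustration of one such rule, the sketch below treats every line that begins with "Total Posts" as the start of a new record, so records may vary in length and may contain blank lines. The marker text is only a guess based on the sample in the question, and the names `RecordDedup`, `RecordHeader`, `RemoveDuplicateRecords` and `FlushRecord` are hypothetical; adjust them to the real file before relying on this.

```vbnet
Imports System
Imports System.Collections.Generic
Imports Microsoft.VisualBasic

Module RecordDedup
    ' Assumption: every record starts with a line beginning with this text.
    Private Const RecordHeader As String = "Total Posts"

    ' Returns the input lines with repeated records removed.
    ' The first copy of each record is kept, in its original position.
    Function RemoveDuplicateRecords(ByVal lines As IEnumerable(Of String)) As List(Of String)
        Dim seen As New HashSet(Of String)()
        Dim output As New List(Of String)()
        Dim current As New List(Of String)()

        For Each line As String In lines
            ' A header line closes the record collected so far.
            If line.StartsWith(RecordHeader, StringComparison.Ordinal) AndAlso current.Count > 0 Then
                FlushRecord(current, seen, output)
            End If
            current.Add(line)
        Next

        ' Do not forget the last record in the file.
        FlushRecord(current, seen, output)
        Return output
    End Function

    ' Appends the collected record to the output unless an identical one was kept before.
    Private Sub FlushRecord(ByVal current As List(Of String), ByVal seen As HashSet(Of String), _
                            ByVal output As List(Of String))
        If current.Count = 0 Then Return

        ' Compare records on their non-blank, trimmed lines only,
        ' so differences in spacing do not hide a duplicate.
        Dim keyParts As New List(Of String)()
        For Each part As String In current
            If part.Trim().Length > 0 Then keyParts.Add(part.Trim())
        Next
        Dim key As String = String.Join(vbLf, keyParts.ToArray())

        If seen.Add(key) Then output.AddRange(current)
        current.Clear()
    End Sub
End Module
```

It could be wired up with `System.IO.File.WriteAllLines(outPath, RemoveDuplicateRecords(System.IO.File.ReadAllLines(inPath)).ToArray())`, where `inPath` and `outPath` are placeholder paths. Because each record is looked up in a `HashSet`, the cost stays roughly linear in the number of lines, which matters for a file with thousands of entries.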

Score: 0
Original content provided by Stack Overflow.
Original link: https://stackoverflow.com/questions/12444475
