我在VBA中构建了一个简单的webscraper,它从Google专利中提取一个表,并将innerHTML存储在一个.txt文件中(名为1234567.txt),用于大约23,000项专利。现在我想分析每个文件的内容。为此,我希望将txt文件导入到VBA中,以便能够执行一些字符串搜索,但这似乎非常困难。我阅读了大约20种将.txt文件导入VBA的解决方案,但没有一种解决方案适用于我的文件,如下所示:
US6824791 B2 ' There is a shift + Enter here
<TD class="patent-data-table-td citation-patent"><A href="/patents/US7767249">US7767249</A></TD>
<TD class="patent-data-table-td patent-date-value">Jul 25, 2005</TD>
<TD class="patent-data-table-td patent-date-value">Aug 3, 2010</TD>
<TD class="patent-data-table-td ">Hewlett-Packard Development Company, L.P.</TD>
<TD class="patent-data-table-td ">Preparation of nanoparticles</TD></TR>
<TR>
<TD class="patent-data-table-td citation-patent"><A href="/patents/US7935853">US7935853</A><SPAN class=patent-tooltip-anchor aria-label="Cited by examiner" data-tooltip-text="Cited by examiner" data-tooltip="Cited by examiner" a="null"> *</SPAN></TD>
<TD class="patent-data-table-td patent-date-value">Oct 8, 2009</TD>
<TD class="patent-data-table-td patent-date-value">May 3, 2011</TD>
<TD class="patent-data-table-td ">Bobelium S.L.</TD>
<TD class="patent-data-table-td ">Micronized composition of a 2,4-disubstituted phenol derivative</TD></TR>
<TR>
<TD class="patent-data-table-td citation-patent"><A href="/patents/US8524829">US8524829</A></TD>
<TD class="patent-data-table-td patent-date-value">Jun 17, 2008</TD>
<TD class="patent-data-table-td patent-date-value">Sep 3, 2013</TD>
<TD class="patent-data-table-td ">Brown University Research Foundation</TD>
<TD class="patent-data-table-td ">Methods for micronization of hydrophobic drugs</TD></TR>
<TR>
<TD class="patent-data-table-td citation-patent"><A href="/patents/EP2422804A1?cl=en">EP2422804A1</A></TD>
<TD class="patent-data-table-td patent-date-value">Jun 16, 2005</TD>
<TD class="patent-data-table-td patent-date-value">Feb 29, 2012</TD>
<TD class="patent-data-table-td ">Amano Enzyme USA., Ltd.</TD>
<TD class="patent-data-table-td ">Controlled release formulations of enzymes, microorganisms, and antibodies with mucoadhesive polymers</TD></TR></TBODY></TABLE>
' There is a shift + Enter here因此,尽管这些文件结构整洁、重复,但将它们作为一个字符串导入似乎非常困难。我基本上想遍历这个文件,提取专利号和提到的两个日期(使用Mid和InStr),并将它们放在三个不同的列中。这是我认为是最好的,但我渴望听到更聪明的建议!
可能需要知道每个.txt文件都有不同的长度(行数),但我可以用99.9%的确定性(1 (title) + 6 * total number of citations (which I know) - 1 (last <TR> is missing)来估计确切的行数
提前感谢
西蒙
编辑:我试过的一些例子。这些代码是从网上来源获取的。我不太清楚它们应该如何工作,很可能我使用它们不正确。
`Sub Text2Excel_Click()
Dim sourcestring as String
sourcestring = GetText("C\users\...\test.txt")
sourcestring = OpenTextFileToString("C\users\...\test.txt")
Function GetText(sFile As String) As String
Dim sText As String
Dim nSourceFile As Integer
''Close any open text files
Close
''Get the number of the next free text file
nSourceFile = FreeFile
''Write the entire file to sText
Open sFile For Input As #nSourceFile
sText = Input$(LOF(1), 1)
Close
GetText = sText
End Function
Function OpenTextFileToString(ByVal strFile As String) As String
' RB Smissaert - Author
Dim hFile As Long
hFile = FreeFile
Open strFile For Input As #hFile
OpenTextFileToString = Input$(LOF(hFile), hFile)
Close #hFile
End Function我还尝试了以下建议:但这给了我运行时错误424“对象所需”。
发布于 2014-05-28 13:11:51
如果您只是尝试将文本文件放入vba中的1字符串中,则可以使用以下函数。
注意:这要求您添加对的引用。
Public Function ReadFileAsString(ByVal filePath As String) As String
Dim fso As FileSystemObject
Set fso = New FileSystemObject
Set txtstream = fso.OpenTextFile(filePath, ForReading, False)
Dim sourceString As String
Do While Not txtstream.AtEndOfStream
sourceString = sourceString + txtstream.ReadLine
Loop
ReadFileAsString = sourceString
txtstream.Close
Set fso = Nothing
Set txtstream = Nothing
End Function然后,您可以对上述函数返回的字符串执行InStr()和Mid()等操作。要查看文本文件,请尝试:
Public Sub Test()
Debug.Print ReadFileAsString("C:\Users\ausername\Desktop\test.txt")
End Subhttps://stackoverflow.com/questions/23894822
复制相似问题