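I am trying to read a text log file into pandas with read_csv. I need to read the line that comes before each ---- separator in the file. I defined column names only to get column-based data and make this easier, but I have not found a way to achieve it.
My raw log data: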
myserer143
-------------------------------
Stopping Symantec Management Agent for UNIX, Linux and Mac: [ OK ]
This will remove the Symantec Management Agent for UNIX, Linux and Mac software from your system.
Are you sure you want to continue [Yy/Nn]?
Uninstalling dependant solutions...
Unregistering the Altiris Base Task Handlers for UNIX, Linux and Mac sub-agent...
Unregistering the Script Task Plugin...
Unregistering the Power Control Task Plugin...
Unregistering the Service Control Task Plugin...
Unregistering the Web Service Task Plugin...
Unregistering the Reset Task Agent Task Plugin...
Unregistering the Agent Control Task Plugin...
Unregistering solution...
Unregistering the SMF cli plug-in...
Unregistering the Software Management Framework Agent sub-agent...
Removing wrapper scripts and links for applications...
Unregistering the Software Management Framework Agent Plugins...
Removing wrapper scripts and links for applications...
Unregistering solution...
Unregistering the CTA cli plug-in...
Unregistering the Client Task Scheduling sub-agent...
Unregistering the Altiris Client Task Agent for UNIX, Linux and Mac sub-agent...
Remove the wrapper script and link for the Task Util application...
Unregistering the Altiris Client Task Agent for UNIX, Linux and Mac Plugin...
Unregistering the Client Task Scheduling Plugin...
Unregistering the Alert User Task Plugin...
Unregistering the shared library...
Unregistering solution...
Unregistering the Inventory Rule Agent...
Removing wrapper scripts and links for applications...
Unregistering the Inventory Rule Agent Plugin...
Removing wrapper scripts and links for applications...
Unregistering solution...
Uninstalling dependant solutions finished.
Removing Symantec Management Agent for UNIX, Linux and Mac package from the system...
Removing wrapper scripts and links for applications...
Stopping Symantec Management Agent for UNIX, Linux and Mac: [ OK ]
Remove non packaged files.
Symantec Management Agent for UNIX, Linux and Mac Configuration utility.
Removing aex-* links in /usr/bin
Removing RC init links and scripts
Cleaning up after final package removal.
Removal finished.
Uninstallation has finished.
dbserer144
-------------------------------
Stopping Symantec Management Agent for UNIX, Linux and Mac: [ OK ]
This will remove the Symantec Management Agent for UNIX, Linux and Mac software from your system.
Are you sure you want to continue [Yy/Nn]?
Uninstalling dependant solutions...
Unregistering the Altiris Base Task Handlers for UNIX, Linux and Mac sub-agent...
Unregistering the Script Task Plugin...
Unregistering the Power Control Task Plugin...
Unregistering the Service Control Task Plugin...
Unregistering the Web Service Task Plugin...
Unregistering the Reset Task Agent Task Plugin...
Unregistering the Agent Control Task Plugin...
Unregistering solution...
Unregistering the SMF cli plug-in...
Unregistering the Software Management Framework Agent sub-agent...
Removing wrapper scripts and links for applications...
Unregistering the Software Management Framework Agent Plugins...
Removing wrapper scripts and links for applications...
Unregistering solution...
Unregistering the CTA cli plug-in...
Unregistering the Client Task Scheduling sub-agent...
Unregistering the Altiris Client Task Agent for UNIX, Linux and Mac sub-agent...
Remove the wrapper script and link for the Task Util application...
Unregistering the Altiris Client Task Agent for UNIX, Linux and Mac Plugin...
Unregistering the Client Task Scheduling Plugin...
Unregistering the Alert User Task Plugin...
Unregistering the shared library...
Unregistering solution...
Unregistering the Inventory Rule Agent...
Removing wrapper scripts and links for applications...
Unregistering the Inventory Rule Agent Plugin...
Removing wrapper scripts and links for applications...
Unregistering solution...
Uninstalling dependant solutions finished.
Removing Symantec Management Agent for UNIX, Linux and Mac package from the system...
Removing wrapper scripts and links for applications...
Stopping Symantec Management Agent for UNIX, Linux and Mac: [ OK ]
Remove non packaged files.
Symantec Management Agent for UNIX, Linux and Mac Configuration utility.
Removing aex-* links in /usr/bin
Removing RC init links and scripts
Cleaning up after final package removal.
Removal finished.
Uninstallation has finished.
The DataFrame looks like this:
>>> data = pd.read_csv("alt_1.logs", sep='delimiter', names=["a", "b", "c"], engine="python")
>>> data
a b c
0 myserer143 NaN NaN
1 ------------------------------- NaN NaN
2 Stopping Symantec Management Agent for UNIX, L... NaN NaN
3 This will remove the Symantec Management Agent... NaN NaN
4 Are you sure you want to continue [Yy/Nn]? NaN NaN
5 Uninstalling dependant solutions... NaN NaN
6 Unregistering the Altiris Base Task Handlers f... NaN NaN
7 Unregistering the Script Task Plugin... NaN NaN
8 Unregistering the Power Control Task Plugin... NaN NaN
9 Unregistering the Service Control Task Plugin... NaN NaN
Expected result:
myserer143
dbserer144
or, if that is feasible:
myserer143 Uninstallation has finished
dbserer144 Uninstallation has finished
Posted on 2018-11-30 06:33:49
Use shift together with startswith to build boolean masks, then filter with boolean indexing.
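To see what the two shifted masks select, here is a minimal sketch on a toy Series. The data is hypothetical and much shorter than the real log; only `shift` and `str.startswith` are taken from the answer.

```python
import pandas as pd

# Hypothetical stand-in for the log: two blocks, each "name, separator, body..."
s = pd.Series(["srv1", "----", "step", "done.", "srv2", "----", "step", "done."])

# shift(-1) aligns each row with the row below it, so m1 is True on the
# row just above a separator (the server name).
m1 = s.shift(-1).str.startswith("----", na=False)
# shift(-2) looks two rows ahead, so m2 is True on the last line of the
# previous block (the row just above the next server name).
m2 = s.shift(-2).str.startswith("----", na=False)

picked = s[m1 | m2]
print(picked)
```

The final line of the file is never followed by a separator, so neither mask selects it. That is why the last row has to be added separately.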
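import pandas as pd

# The string 'delimiter' does not occur in the log, so each line is read whole into column "a"
data = pd.read_csv("alt_1.logs", sep='delimiter', names=["a"], engine="python")
# True on the row right above a separator line (the server name)
m1 = data['a'].shift(-1).str.startswith('----', na=False)
# True two rows above a separator (the last line of the previous block)
m2 = data['a'].shift(-2).str.startswith('----', na=False)
Filter the rows and add the last row of the file. DataFrame.append was removed in pandas 2.0, so pd.concat is used here:
data = pd.concat([data[m1 | m2], data.iloc[[-1]]])
print(data)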
                               a
0                     myserer143
44  Uninstallation has finished.
45                    dbserer144
89  Uninstallation has finished.
Reshape the values and join the text together:
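df = pd.DataFrame(data.values.reshape(-1, 2)).apply(' '.join, axis=1).to_frame('data')
print(df)
                                      data
0  myserer143 Uninstallation has finished.
1  dbserer144 Uninstallation has finished.
Edit:
For better performance, or to handle large files, you can loop over the lines into a list, build a list of dictionaries from it, and create the DataFrame from that. Finally, shift the column and fill in the last value: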
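import pandas as pd

L = []
# Read the question's log file line by line, keeping only non-empty lines
with open('alt_1.logs', 'r') as f:
    for line in f:
        line = line.strip()
        if line:
            L.append(line)
# Prepend the last line so that L[i-2] is valid for the first separator
L = L[-1:] + L
# For each separator: 'a' is the server name above it, 'b' is the line before that
out = [{'a': L[i-1], 'b': L[i-2]} for i, x in enumerate(L) if x.startswith('---')]
print(out)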
[{'a': 'myserer143', 'b': 'Uninstallation has finished.'},
{'a': 'dbserer144', 'b': 'Uninstallation has finished.'}]
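df = pd.DataFrame(out)
# 'b' holds the last line of the previous block, so shift it up one row;
# the last row is filled from the original row 0, which holds the file's final line
df['b'] = df['b'].shift(-1).fillna(df.loc[0, 'b'])
df = df.apply(' '.join, axis=1).to_frame('data')
print(df)
                                      data
0  myserer143 Uninstallation has finished.
1  dbserer144 Uninstallation has finished.
Posted on 2018-12-01 03:10:07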
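Since the data contains many lines you do not need, I think it is better to prepare the data before loading it into a DataFrame.
In this file, each section of required information is always separated by a '-------...' delimiter, so it makes sense to scan the lines in a generator and load only the 2 lines before each delimiter.
We start with the first 2 lines, then iterate over the file to collect the required information.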
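from itertools import tee, islice, zip_longest

results = []
n = 2  # number of lines to take before each delimiter
with open('sample.txt', 'r') as f:
    first = next(f)  # first server name
    delim = next(f)  # the '-----' delimiter line
    results.append(first)
    peek, lines = tee(f)
    for idx, val in enumerate(lines):
        if val == delim:
            # the n lines before this delimiter: the previous block's
            # last line and the next server name
            for prev in islice(peek.__copy__(), idx - n, idx):
                results.append(prev)
        last = idx
    # finally, take the last line of the file
    for line in islice(peek.__copy__(), last, last + 1):
        results.append(line)
results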
>> ['myserer143\n',
'Uninstallation has finished.\n',
'dbserer144\n',
'Uninstallation has finished.\n',
'dbserer144\n',
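'Uninstallation has finished.']
At this point the returned list holds only the required lines, obtained by taking the lines at a fixed offset before each delimiter plus the last line of the file, so nothing unused is carried into the DataFrame.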
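As an aside, `itertools.tee` may buffer lines that one of its iterators has consumed and the other has not. A minimal single-pass sketch of the same idea, remembering only the two previous lines, could look like the following. This is a different technique from the tee/islice code above; the function name `first_and_last`, the `sep_prefix` parameter, and the sample text are hypothetical.

```python
import io

def first_and_last(lines, sep_prefix="----"):
    """Return (header, last_line) pairs, one per block.

    A block is a header line, a separator line starting with sep_prefix,
    and then any number of body lines. Assumes every block has a body.
    """
    pairs = []
    header = None        # header of the block currently being read
    prev = prev2 = None  # the two most recent non-empty lines
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(sep_prefix):
            # prev is the new block's header; prev2 closed the previous block
            if header is not None:
                pairs.append((header, prev2))
            header = prev
        prev2, prev = prev, line
    if header is not None:
        # the final block ends at the end of the input
        pairs.append((header, prev))
    return pairs

# Hypothetical sample in the same shape as the log
sample = "srv1\n----\nstep\ndone1\n\nsrv2\n----\nstep\ndone2\n"
pairs = first_and_last(io.StringIO(sample))
print(pairs)
```

Because this variant already returns pairs, they could be passed straight to `pd.DataFrame(pairs, columns=['Name', 'Status'])` without a grouping step. An open file object can be passed in place of the `io.StringIO` wrapper.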
You can then group the results list into pairs with the grouper recipe from the itertools documentation and load the pairs into a DataFrame.
import pandas as pd

def grouper(iterable, n, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    # grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)

# strip the trailing newlines, then pair each name with its status line
results = [i.strip() for i in results]
data = list(grouper(results, n))
df = pd.DataFrame(data, columns=['Name', 'Status'])
df
>>
         Name                        Status
0  myserer143  Uninstallation has finished.
1  dbserer144  Uninstallation has finished.
2  dbserer144  Uninstallation has finished.
https://stackoverflow.com/questions/53551750