我有一个csv文件格式的excel电子表格。它有8列,每列对应一个唯一的参数。感兴趣的参数在第8列。excel描述了电力系统的行为,第8列描述了如果发生故障,级联事件将是什么。我已经设法使用Pandas来提取数据,我也将提供代码。下面是excel电子表格中的一行。
143是包含在第一列中的模拟编号。
第16- 19行在第二列。
0.7是第三列。
0是第四列。
0.5是第五列。
0.5是第六列。
1是第7列,其值为1或0,这取决于是否存在故障-如果元素包含故障,则为1。
[('G 05','Over-Speed','1.65'),('Load 23A_UF','11.87'),('Load 21A_UF','11.87'),('Load 24A_UF','11.88'),('Load 16A_UF',‘11.88’).is第八列并包含故障原因。
在第八列中,有一个包含在括号中的列表。括号包含三个值,第一个是导致故障的元素,第二个是原因-在某些情况下没有原因,因此它是空的,第三个是故障发生的时间。我已经使用pandas从第8列中收集了信息。
以下是完整布局的示例
143第16 - 19行0.7 0 0.5 0.5 1 ('G 05','Over-Speed','1.65'),(‘加载23A_UF','11.87'),(’加载21A_UF','11.87'),(‘加载24A_UF','11.88'),(’加载16A_UF','11.88'),(‘加载25A_UF','11.88'),(‘加载26A_UF','11.88'),(’加载27A_UF','11.88'),(‘加载28A_UF','11.88'),(’加载29A_UF','11.88'),(‘加载18A_UF','11.88'),(’加载15A_UF','11.88'),(‘加载03A_UF','11.88'),(’加载04A_UF','11.88'),(‘加载12A_UF','11.88'),(’加载08A_UF','11.88'),(‘加载07A_UF','11.88')
第八列已经这样处理了.
def list_dir(self, pattern = "*"):
import glob
return [Path(x) for x in glob.glob(str(self/pattern))]
Path.ls = list_dir
CSVFiles = Path('CSVFiles')
def CSV_to_Panda_Dataframe(file):
df = pd.read_csv(file)
return df
df = CSV_to_Panda_Dataframe(CSVFiles/"results_summary.csv")
df.head()
df.dropna(axis=0, how='any', thresh=None, subset=None, inplace=True)
df.head()
for row in df.itertuples():
print(row._8)
for row in df.itertuples():
if row._8 == "[]":
continue
else:
print("Simulation #-", row._1, ": ", row._8)然后显示第八列中的所有信息,如下所示...
Simulation #- 7.0 : [('G 05', 'Over-Speed', '1.70')]
Simulation #- 20.0 : [('G 01', 'Out of Step', '2.58')]
Simulation #- 31.0 : [('G 02', 'UV', '2.73'), ('G 01', 'Out of Step', '3.16')]
Simulation #- 41.0 : [('G 05', 'Over-Speed', '1.63'), ('Load 23A_UF', '11.37'), ('Load 21A_UF', '11.38'), ('Load 08A_UF', '11.38'), ('Load 07A_UF', '11.38'), ('Load 12A_UF', '11.38'), ('Load 24A_UF', '11.38'), ('Load 15A_UF', '11.38'), ('Load 16A_UF', '11.38'), ('Load 04A_UF', '11.38'), ('Load 03A_UF', '11.38'), ('Load 18A_UF', '11.38'), ('Load 25A_UF', '11.38'), ('Load 27A_UF', '11.39'), ('Load 26A_UF', '11.39')]
Simulation #- 75.0 : [('G 05', 'Over-Speed', '1.65'), ('Load 25A_UF', '12.87'), ('Load 03A_UF', '12.87'), ('Load 26A_UF', '12.87'), ('Load 27A_UF', '12.87'), ('Load 18A_UF', '12.87'), ('Load 16A_UF', '12.87'), ('Load 24A_UF', '12.87'), ('Load 23A_UF', '12.87'), ('Load 21A_UF', '12.87'), ('Load 28A_UF', '12.87'), ('Load 15A_UF', '12.87'), ('Load 29A_UF', '12.87'), ('Load 04A_UF', '12.87'), ('Load 08A_UF', '12.87'), ('Load 07A_UF', '12.87'), ('Load 12A_UF', '12.87')]
Simulation #- 103.0 : [('NSG_2', 'OverVoltage', '2.07')]
Simulation #- 105.0 : [('NSG_2', 'OverVoltage', '2.11')]
Simulation #- 106.0 : [('NSG_2', 'OverVoltage', '2.09')]
Simulation #- 109.0 : [('G 05', 'Over-Speed', '1.63'), ('Load 29A_UF', '10.65'), ('Load 28A_UF', '10.66'), ('Load 26A_UF', '10.67'), ('Load 08A_UF', '10.79'), ('Load 12A_UF', '10.79'), ('Load 07A_UF', '10.79'), ('Load 23A_UF', '10.80'), ('Load 04A_UF', '10.80'), ('Load 21A_UF', '10.81')]
Simulation #- 110.0 : [('NSG_2', 'OverVoltage', '2.04')]
Simulation #- 111.0 : [('NSG_2', 'OverVoltage', '2.00')]然后,我尝试从[]之间提取每个原因,这样...这不是最好的方法,因为我没有遍历列表。
import re
ini_list = "[('G 05', 'Over-Speed', '1.63'), ('Load 23A_UF', '11.37'), ('Load 21A_UF', '11.38'), ('Load 08A_UF', '11.38'), ('Load 07A_UF', '11.38'), ('Load 12A_UF', '11.38'), ('Load 24A_UF', '11.38'), ('Load 15A_UF', '11.38'), ('Load 16A_UF', '11.38'), ('Load 04A_UF', '11.38'), ('Load 03A_UF', '11.38'), ('Load 18A_UF', '11.38'), ('Load 25A_UF', '11.38'), ('Load 27A_UF', '11.39'), ('Load 26A_UF', '11.39')]"
#extract all tuples by regular expression
result = re.findall(r'\((.*?)\)',ini_list)
target_list = []
for tup in result:
item = tup.split(',')
if len(item) == 3:
target_list.append((item[0], item[1], item[2]))
else:
target_list.append((item[0], None, item[1]))
reason = item[0]
re_reason = reason.strip('()')
#res_reason = reason.strip('(')[1]
print(re_reason)然后,我得到每个括号的第一个元素的结果,如下所示……
'G 05‘'Load 23A_UF’'Load 21A_UF‘'Load 08A_UF’'Load 07A_UF‘'Load 12A_UF’'Load 24A_UF‘'Load 15A_UF’'Load 16A_UF‘'Load 04A_UF’
不确定如何只获取故障来进行比较
我想做一些基本的统计分析,确定最常见的故障以及发生了多少故障。例如:计算运行了多少次模拟,然后计算出所有模拟中发生的故障的总百分比,进一步确定这些故障中最常见的故障以及发生故障时最常见的元素(括号内列表中的第一项)。在所有故障中,由于超速、过压、欠压、低速等引起的故障占总故障的百分比是多少?如果有故障,列7将包含1。
我试着获得一些统计数据,如下所示,但我对python非常陌生,并且非常努力,甚至不确定我所做的是否是最好的方法。我真的想要一个整洁的方式来显示故障和发生故障的百分比。另一件要注意的事情是,并不是所有的模拟都有故障,并且发生故障的数量会发生变化。一些实例有2个,而另一些实例可以有多达8个。此外,括号中包含的故障并不总是包含三个数据元素,某些情况下有两个,并且省略了跳闸的原因。然而,布局是('element','reason','time'),但有些情况是以('element','time')的形式。看到似乎比其他元素更容易出错的元素的总百分比将是很棒的。
directory = r'C:\Users\ywb18201\Desktop\Paper\Results 2019new\\'
#directory = r'C:\Users\ywb18201\Desktop\Paper\Results All\\'
k,l=0,0
total_cases=23958
cascades=[]
num=[]
times=0
#filename=r'\results_tap_load=1.1_wind=0.0_0.0_0.0.csv'
for filename in os.listdir(directory):
with open(directory+filename) as csvfile:
csv_reader = csv.reader(csvfile, delimiter=',')
for row in csv_reader:
casc_seq = []
if len(row)>2 and '(' in row[0] and 'Load' not in row[2]:
k = 2
if 'Load' in row[2] or 'Line' in row[2]:
k = 3
for i in range(k,len(row)):
casc_dec=row[i].split(',')
if len(casc_dec)>2:
casc=casc_dec[0]+'-'+casc_dec[1]
else:
casc=casc_dec[0]
if str(casc)[2:]=='':
print(str(casc)[2:])
print(filename)
if str(casc)[2:] not in casc_seq:
casc_seq.append(str(casc)[2:])
# for i,item in enumerate(casc_seq):
# if 'Load' in item:
# casc_seq[i]='Load'
if len(casc_seq)!=0:
if casc_seq not in cascades:
cascades.append(casc_seq)
num.append(1)
times+=1
else:
num[cascades.index(casc_seq)]+=1
type=[]
counterv,counterf,countervf=0,0,0
for item in cascades:
f1,f2=0,0
for casc in item:
if 'UV' in casc or 'Voltage' in casc:
f1=1
elif 'UF' in casc or 'Frequency' in casc or 'Speed' in casc:
f2=1
if f1==1:
if f2==1:
type.append('voltage and frequency')
countervf+=1
else:
type.append('voltage')
counterv+=1
else:
type.append('frequency')
counterf+=1
firstv=0
firstf=0
for item in cascades:
if 'UV' in item[0] or 'Voltage' in item[0]:
firstv+=1
elif 'UF' in item[0] or 'Frequency' in item[0] or 'Speed' in item[0]:
firstf += 1
# for i,item in enumerate(cascades):
# print(num[i],item,type[i])
print(times)
# print(sum(num))
# print(max(num),cascades[num.index(max(num))])
# cascadesnew=[]
# numnew=[]
# timesnew=0
# for item in cascades:
# items = []
# for casc in item:
# if 'Load' not in casc:
# items.append(casc)
# if items in cascadesnew:
# numnew[cascadesnew.index(items)]+=1
# else:
# cascadesnew.append(items)
# numnew.append(1)
# timesnew+=1
#
# numnew, cascadesnew = (list(t) for t in zip(*sorted(zip(numnew, cascadesnew))))
#
# for i,item in enumerate(cascadesnew):
# print(numnew[i],item)
# print(timesnew)
# print(sum(numnew))
flat_list = [item for sublist in cascades for item in sublist]
flat1=[]
for item in flat_list:
if 'Line' not in item:
if 'NSG_3' in item:
flat1.append('NSG_3-UnderVoltage')
elif 'NSG_1' in item:
flat1.append('NSG_1-UnderVoltage')
elif 'G2_UV' in item:
flat1.append('G2_UV')
elif 'G1_UV' in item:
flat1.append('G1_UV')
else:
flat1.append(item)
flat_list=flat1
B=Counter(flat_list)
res=B.most_common()
print(res)
labels, values = zip(*res)
values=[161*item/1357-5 for item in values]
indexes = np.arange(len(labels))
width = 0.5
plt.rcParams.update({'font.size': 22})
bar=plt.bar(indexes, values, width)
plt.rc('xtick', labelsize=6)
plt.xticks(indexes , labels, rotation=90)
for rect in bar:
height = rect.get_height()
plt.text(rect.get_x() + rect.get_width()/2.0, height, '%d' % int(height), ha='center', va='bottom')
#plt.tight_layout()
plt.ylabel('Number of patterns')
plt.xlabel('Protection device')
plt.show()
# labels=['Voltage and Frequency','Voltage','Frequency']
# values = [countervf,counterv,counterf]
# indexes = np.arange(len(labels))
# width = 0.5
# plt.title('Number of Voltage or Frequency related sequences')
# bar=plt.bar(indexes, values, width)
# #plt.rc('xtick', labelsize=8)
# plt.xticks(indexes , labels, )
# for rect in bar:
# height = rect.get_height()
# plt.text(rect.get_x() + rect.get_width()/2.0, height, '%d' % int(height), ha='center', va='bottom')
# plt.tight_layout()
# plt.show()
#
#
# labels=['Voltage','Frequency']
# values = [firstv,firstf]
# indexes = np.arange(len(labels))
# width = 0.5
# plt.title('Reason for the first trip in sequence')
# bar=plt.bar(indexes, values, width)
# plt.rc('xtick', labelsize=8)
# plt.xticks(indexes , labels, )
# for rect in bar:
# height = rect.get_height()
# plt.text(rect.get_x() + rect.get_width()/2.0, height, '%d' % int(height), ha='center', va='bottom')
# plt.tight_layout()
# plt.show()我还试图通过回收类似项目中的旧代码来进行各种绘图,以获取错误的数量及其相关信息。
import os
import csv
import matplotlib.pyplot as plt
import numpy as np
from collections import Counter
directory = r'C:\Users\ywb18201\Desktop\Paper\Results 20191.3new\\'
k,l=0,0
total_cases=23958
windv=[]
cascades=[]
num=1331*[0]
times=0
lines=[]
ftimes=[]
vtimes=[]
for filename in os.listdir(directory):
windsum=filename.split('=')[2]
#a=windsum.split('_')[0]*49.5+windsum.split('_')[1]*38.4+windsum.split('_')[2][:-4]*25.6
a=float(windsum.split('_')[0])*49.5+float(windsum.split('_')[1])*38.4+float(windsum.split('_')[2][:-4])*25.6
#print('%.1f'%a)
#windv.append('%.2f'%a)
windv.append(a/113.5*100)
with open(directory+filename) as csvfile:
casc_len=0
countv,countf=0,0
csv_reader = csv.reader(csvfile, delimiter=',')
for row in csv_reader:
casc_seq = []
if len(row) > 2 and '(' in row[0] and 'Load' not in row[2]:
k = 2
if 'Load' in row[2] or 'Line' in row[2]:
k = 3
if len(row)-k>0:
firstline=row[0]
lines.append(firstline[2:-13])
for i in range(k, len(row)):
casc_dec = row[i].split(',')
if len(casc_dec) > 2:
casc = casc_dec[0] + '-' + casc_dec[1]
else:
casc = casc_dec[0]
if str(casc)[2:] == '':
print(str(casc)[2:])
print(filename)
if str(casc)[2:] not in casc_seq:
casc_seq.append(str(casc)[2:])
if 'UnderVoltage' in casc or 'UV' in casc:
countv+=1
elif 'UF' in casc or 'Machine' in casc:
countf+=1
casc_len=casc_len+len(casc_seq)
#print(casc_len- countv-countf)
cascades.append(casc_len)
vtimes.append(countv)
ftimes.append(countf)
print(cascades)
print([x + y for x, y in zip(vtimes, ftimes)])
#print(ftimes)
# B=Counter(lines)
# res=B.most_common()
#
# labels, values = zip(*res)
# indexes = np.arange(len(labels))
# width = 0.5
# plt.rcParams.update({'font.size': 24})
# bar=plt.bar(indexes, values, width)
# plt.rc('xtick', labelsize=6)
# plt.xticks(indexes , labels)
# for rect in bar:
# height = rect.get_height()
# plt.text(rect.get_x() + rect.get_width()/2.0, height, '%d' % int(height), ha='center', va='bottom')
# plt.tight_layout()
# plt.ylabel('Number of Cascading Events')
# plt.xlabel('Fault Location')
# plt.show()
windvf,windvv=windv,windv
windv, cascades = (list(t) for t in zip(*sorted(zip(windv, cascades))))
windvf, ftimes = (list(t) for t in zip(*sorted(zip(windvf, ftimes))))
windvv, vtimes = (list(t) for t in zip(*sorted(zip(windvv, vtimes))))
labels=windv
values = np.arange(len(labels))
#plt.plot(values,cascades,'bo',markersize=2)
plt.rcParams.update({'font.size': 22})
p1=plt.bar(values,cascades,width=1, label="130% Loading")
#plt.xticks(values, labels)
#plt.xticks(np.arange(0, 1332, 100))
plt.xticks(np.arange(min(values), max(values)+1, 133),[0,10,20,30,40,50,60,70,80,90,100])
plt.ylabel('Number of Protection Devices that tripped')
plt.xlabel('Wind Generation Output Percentage (%)')
plt.legend()
plt.show()
ind = np.arange(len(windv)) # the x locations for the groups
width = 1 # the width of the bars: can also be len(x) sequence
plt.rcParams.update({'font.size': 22})
p1 = plt.bar(ind, vtimes, width)
p2 = plt.bar(ind, ftimes, width, bottom=vtimes)
#p4 = plt.bar(ind, cascades, width=1)
plt.xticks(np.arange(min(ind), max(ind)+1, 133),[0,10,20,30,40,50,60,70,80,90,100])
#plt.yticks([])
plt.legend((p1[0], p2[0]), ('Voltage Related', 'Frequency Related'))
plt.ylabel('Number of Protection Devices that tripped')
plt.xlabel('Wind Generation Output Percentage (%)')
plt.show()我意识到这是一项相当复杂的任务,并且数据集的设置格式非常复杂,但是数据集不能更改。任何帮助都将不胜感激。提前感谢
发布于 2020-06-20 00:22:23
我认为你应该认真考虑使用pandas进行数据分析。
这里有一些我用来学习熊猫的教程。
https://www.youtube.com/watch?v=ZyhVh-qRZPA&list=PL-osiE80TeTsWmV9i9c58mdDCSskIFdDS
您可以将第8列(这似乎是最麻烦的)分成更多列,如果有一些丢失的数据(正如您的问题中所说的),您可以随时在pandas中使用一些工具来测量丢失的数据。
我提供了一个宽泛的答案,因为你的问题很宽泛。我的建议是,你可以尝试一下熊猫,如果你遇到困难,你可以随时回到这里,问一个特定的问题来解决。
https://stackoverflow.com/questions/62473855
复制相似问题