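I'm using Python. I read in a file and want to remove the duplicated copies of the same question, but the duplicates keep getting read in under modified names.
For example: there are two question1 columns, and they are read as question1 and question1.1.
So when I use .drop_duplicates(), it does nothing. What is the problem here?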
import pandas as pd

file = 'survey.csv'
responses = pd.read_csv(file,header=1)
responses.head()
responses.drop_duplicates()

Here is a sample of the .csv file:
> ,,,X,,,,,,,,,,,,,,,,
> Timestamp,Email Address,,"Know about basic linear algebra and matrices operations (multiplication, add, transpose)?",Know how to
> apply differentiation and the chain rule?,Know how to apply
> differentiation and the chain rule?,"Know what is a probability
> distribution and density function, and how to sample it?","Know what
> is a probability distribution and density function, and how to sample
> it?",Know the difference between classification and regression?,Know
> the difference between training and testing data?,Know the difference
> between training and testing data?,Know what is a training loop and
> what is an epoch?,Know what is a batch?,Know what is
> regularization?,Know what is overfitting and underfitting?,Know what
> is a feature vector?,,,,
> ,,,,,,,,,,,,,,,,,,,
> 10/14/2021 17:15:05,y.sedki@gmail.com,,Yes,Yes,Yes,Yes,Yes,No,Yes,Yes,Yes,Yes,Yes,Yes,Yes,,,,
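> 10/14/2021 17:15:39,k.abdulaal@hotmail.com,,Yes,Yes,Yes,Yes,Yes,No,Yes,Yes,Yes,Yes,Yes,Yes,Yes,,,,

However, the output after running the code above is:
Know how to apply differentiation and the chain rule?    Know how to apply differentiation and the chain rule?.1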
Posted on 2021-11-04 23:50:22
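I think you should consider simply specifying which columns you know are duplicates and dropping those specifically. I don't know Pandas, but I imagine you could pick out the columns within a row, something like the following to drop the fourth column (if that were the duplicate):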
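# Plain-Python sketch: assumes each row is a list of values, not a DataFrame row
row1 = responses[1]
values_I_care_about = row1[0:3] + row1[4:]  # skips index 3, the fourth column

You can also use the DictReader class from Python's csv module to quickly de-dupe your data by column: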
main.py
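import csv
import sys

with open('sample.csv', 'r', newline='') as f:
    reader = csv.DictReader(f)
    # Each row is a dict keyed by header name, so repeated headers collapse into one key
    row = next(reader)
    writer = csv.DictWriter(sys.stdout, fieldnames=row.keys())
    writer.writeheader()

I ran it on this sample data (copied your header, removed the line breaks, and added a dummy row with a single value of 1):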
Timestamp,Email Address,,"Know about basic linear algebra and matrices operations (multiplication, add, transpose)?",Know how to apply differentiation and the chain rule?,Know how to apply differentiation and the chain rule?,"Know what is a probability distribution and density function, and how to sample it?","Know what is a probability distribution and density function, and how to sample it?",Know the difference between classification and regression?,Know the difference between training and testing data?,Know the difference between training and testing data?,Know what is a training loop and what is an epoch?,Know what is a batch?,Know what is regularization?,Know what is overfitting and underfitting?,Know what is a feature vector?
1

I also used my favorite CSV command-line tool, GoCSV, to check the headers:
% python3 main.py | gocsv headers
1: Timestamp
2: Email Address
3:
4: Know about basic linear algebra and matrices operations (multiplication, add, transpose)?
5: Know how to apply differentiation and the chain rule?
6: Know what is a probability distribution and density function, and how to sample it?
7: Know the difference between classification and regression?
8: Know the difference between training and testing data?
9: Know what is a training loop and what is an epoch?
10: Know what is a batch?
11: Know what is regularization?
12: Know what is overfitting and underfitting?
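13: Know what is a feature vector?

Because DictReader reads the column/header names as dict keys, and a dict cannot have duplicate keys, there are no duplicate columns/headers. But you lose control over which of the duplicates gets dropped.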
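If you want to decide which copy survives, one option is to work with column positions instead of dict keys. The following is only a minimal sketch using the standard csv module. The helper names (first_occurrence_indexes, drop_duplicate_columns) and the tiny in-memory sample are made up for illustration, and it assumes that keeping the first copy of each repeated header is what you want. For context, pandas' drop_duplicates() compares rows rather than columns, and read_csv renames repeated headers with a .1 suffix, which is why the call in the question appears to do nothing.

```python
import csv
import io


def first_occurrence_indexes(header):
    """Return the positions where each header name appears for the first time."""
    seen = set()
    keep = []
    for index, name in enumerate(header):
        if name not in seen:
            seen.add(name)
            keep.append(index)
    return keep


def drop_duplicate_columns(rows):
    """Yield rows with repeated-header columns removed, keeping the first copy.

    Note: columns with an empty header all share the name "" and are
    therefore collapsed into one column as well.
    """
    rows = iter(rows)
    header = next(rows)
    keep = first_occurrence_indexes(header)
    yield [header[i] for i in keep]
    for row in rows:
        # Pad with "" when a data row is shorter than the header
        yield [row[i] if i < len(row) else "" for i in keep]


# Hypothetical sample: the header "Q1" appears twice
sample = io.StringIO(
    "Timestamp,Q1,Q1,Q2\n"
    "10/14/2021 17:15:05,Yes,No,Yes\n"
)

for row in drop_duplicate_columns(csv.reader(sample)):
    print(row)
# ['Timestamp', 'Q1', 'Q2']
# ['10/14/2021 17:15:05', 'Yes', 'Yes']
```

Here the first Q1 value ("Yes") is kept. With DictReader the later value ("No") would overwrite it, because the last duplicate key wins when the row dict is built. To keep the last copy instead, you could iterate over the header in reverse when collecting the indexes.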
https://stackoverflow.com/questions/69845202