
Why does pandas read duplicates in a .csv file and then rename them?

Stack Overflow user
Asked 2021-11-04 20:03:36
Answers: 1 · Views: 123 · Followers: 0 · Votes: 0

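I am using Python. I read in a file and want to remove the duplicated questions, but pandas keeps reading the duplicates in under modified names.

For example: if there are two columns named question1, it reads them as question1 and question1.1.

So when I call .drop_duplicates(), it does nothing. What is the problem here?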

import pandas as pd

file = 'survey.csv'
# header=1 uses the second line of the file as the column names
responses = pd.read_csv(file, header=1)
responses.head()
responses.drop_duplicates()

Here is a sample of the .csv file:

>         ,,,X,,,,,,,,,,,,,,,,
>     Timestamp,Email Address,,"Know about basic linear algebra and matrices operations (multiplication, add, transpose)?",Know how to
> apply differentiation and the chain rule?,Know how to apply
> differentiation and the chain rule?,"Know what is a probability
> distribution and density function, and how to sample it?","Know what
> is a probability distribution and density function, and how to sample
> it?",Know the difference between classification and regression?,Know
> the difference between training and testing data?,Know the difference
> between training and testing data?,Know what is a training loop and
> what is an epoch?,Know what is a batch?,Know what is
> regularization?,Know what is overfitting and underfitting?,Know what
> is a feature vector?,,,,
>     ,,,,,,,,,,,,,,,,,,,
>     10/14/2021 17:15:05,y.sedki@gmail.com,,Yes,Yes,Yes,Yes,Yes,No,Yes,Yes,Yes,Yes,Yes,Yes,Yes,,,,
>     10/14/2021 17:15:39,k.abdulaal@hotmail.com,,Yes,Yes,Yes,Yes,Yes,No,Yes,Yes,Yes,Yes,Yes,Yes,Yes,,,,

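However, the output after running the code above is:

Know how to apply differentiation and the chain rule?    Know how to apply differentiation and the chain rule?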


1 Answer

Stack Overflow user

Answered 2021-11-04 23:50:22

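I think you should consider simply identifying which columns you know are duplicates and leaving them out explicitly. I don't know Pandas, but I imagine you can select the columns you want in one line, something like the following, which drops the fourth column (if that is the duplicate):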

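# Take one row as a plain list (responses[1] would look up a column labelled 1)
row1 = responses.iloc[1].tolist()
# Keep every value except the fourth one (index 3)
values_I_care_about = row1[0:3] + row1[4:]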
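
For readers who do use pandas, here is a minimal sketch of the same positional idea. The tiny in-memory CSV and the names `csv_text` and `deduped` are made up for illustration and are not from the question. It assumes the usual `read_csv` behaviour of keeping column labels unique by appending `.1`, `.2`, and so on to repeated headers.

```python
import io

import pandas as pd

# Hypothetical miniature of the survey file: the header "q2" appears twice.
csv_text = "Timestamp,q1,q2,q2\n10/14/2021,Yes,No,No\n"

responses = pd.read_csv(io.StringIO(csv_text))

# The repeated header has been renamed, so the labels are now all distinct.
print(list(responses.columns))

# Drop the fourth column (index 3) by position instead of by name.
deduped = responses.drop(columns=responses.columns[3])
print(list(deduped.columns))
```

Note that `drop_duplicates()` compares rows, not columns, which would explain why it appears to do nothing in the question: no two survey rows are identical.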

You can also use the DictReader class from Python's csv module to quickly de-duplicate the data by column:

main.py

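import csv
import sys

with open('sample.csv', 'r', newline='') as f:
    reader = csv.DictReader(f)
    # Reading one data row yields a dict keyed by header name,
    # so repeated header names collapse into a single key
    row = next(reader)

# Write only the de-duplicated header line to stdout
writer = csv.DictWriter(sys.stdout, fieldnames=row.keys())
writer.writeheader()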
I ran it on this sample data (I copied the header, removed the line breaks, and added a dummy row with a single value, 1):

Timestamp,Email Address,,"Know about basic linear algebra and matrices operations (multiplication, add, transpose)?",Know how to apply differentiation and the chain rule?,Know how to apply differentiation and the chain rule?,"Know what is a probability distribution and density function, and how to sample it?","Know what is a probability distribution and density function, and how to sample it?",Know the difference between classification and regression?,Know the difference between training and testing data?,Know the difference between training and testing data?,Know what is a training loop and what is an epoch?,Know what is a batch?,Know what is regularization?,Know what is overfitting and underfitting?,Know what is a feature vector?
1

I also used my favorite CSV command-line tool, GoCSV, to check the headers:

% python3 main.py | gocsv headers 
1: Timestamp
2: Email Address
3: 
4: Know about basic linear algebra and matrices operations (multiplication, add, transpose)?
5: Know how to apply differentiation and the chain rule?
6: Know what is a probability distribution and density function, and how to sample it?
7: Know the difference between classification and regression?
8: Know the difference between training and testing data?
9: Know what is a training loop and what is an epoch?
10: Know what is a batch?
11: Know what is regularization?
12: Know what is overfitting and underfitting?
13: Know what is a feature vector?

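Because DictReader reads each row into a dict keyed by column/header name, and a dict cannot have duplicate keys, there are no duplicate columns/headers in the result. But you lose control over which of the duplicates gets dropped.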
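
To make that trade-off concrete, here is a small sketch using only the standard library. The two-line CSV and the helper `keep_first_columns` are hypothetical, written just for this illustration. It shows that with DictReader the later duplicate column overwrites the earlier one, and one possible way to keep the first occurrence instead by working with plain `csv.reader` rows.

```python
import csv
import io

# Hypothetical two-line file in which the header "q" appears twice.
csv_text = "id,q,q\n1,first,second\n"

# DictReader builds a dict per row, so the later "q" value replaces the earlier one.
row = next(csv.DictReader(io.StringIO(csv_text)))
print(row)


def keep_first_columns(text):
    """Return all rows, keeping only the first occurrence of each header name."""
    rows = list(csv.reader(io.StringIO(text)))
    header = rows[0]
    seen = set()
    keep = []  # positions of the columns to keep
    for i, name in enumerate(header):
        if name not in seen:
            seen.add(name)
            keep.append(i)
    # Skip positions that a short (ragged) row does not have.
    return [[r[i] for i in keep if i < len(r)] for r in rows]


first_kept = keep_first_columns(csv_text)
print(first_kept)
```

Choosing by position this way is only one option; the point is that the choice becomes explicit rather than a side effect of dict key collisions.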

Votes: 0
Original page content provided by Stack Overflow.
Original link:

https://stackoverflow.com/questions/69845202
