You don't need Python to split a CSV file. Use your shell:

<pre><code>$ split -l 100 data.csv
</code></pre>

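This splits <code>data.csv</code> into chunks of 100 lines each. By default the pieces are named <code>xaa</code>, <code>xab</code>, and so on; note that only the first piece contains the header row.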

I had to do a similar task, and used the pandas package:

<pre><code>import pandas as pd

for i, chunk in enumerate(pd.read_csv('bigfile.csv', chunksize=500000)):
    chunk.to_csv('chunk{}.csv'.format(i), index=False)
</code></pre>

Here is a little Python script I used to split a file <code>data.csv</code> into several CSV part files. The number of part files is controlled by <code>chunk_size</code> (the number of lines per part file).
The header line (column names) of the original file is copied into every part CSV file.
It works for big files because it reads one line at a time (with <code>readline()</code> for the header and by iterating over the file for the rest) instead of loading the complete file into memory at once.
<pre><code>#!/usr/bin/env python3

def main():
    chunk_size = 9998  # lines

    def write_chunk(part, lines):
        with open('data_part_' + str(part) + '.csv', 'w') as f_out:
            f_out.write(header)
            f_out.writelines(lines)

    with open('data.csv', 'r') as f:
        count = 0
        header = f.readline()
        lines = []
        for line in f:
            count += 1
            lines.append(line)
            if count % chunk_size == 0:
                write_chunk(count // chunk_size, lines)
                lines = []
        # write remainder
        if len(lines) &gt; 0:
            write_chunk((count // chunk_size) + 1, lines)

if __name__ == '__main__':
    main()
</code></pre>

Maybe something like this?

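<pre><code>#!/usr/local/cpython-3.3/bin/python

import csv

divisor = 10  # rows per output file

outfileno = 1
outfile = None

# newline='' lets the csv module handle line endings itself
with open('big.csv', 'r', newline='') as infile:
    for index, row in enumerate(csv.reader(infile)):
        if index % divisor == 0:
            # start a new output file every `divisor` rows
            if outfile is not None:
                outfile.close()
            outfilename = 'big-{}.csv'.format(outfileno)
            outfile = open(outfilename, 'w', newline='')
            outfileno += 1
            writer = csv.writer(outfile)
        writer.writerow(row)

# close the last output file, if one was opened
if outfile is not None:
    outfile.close()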
</code></pre>

See the <a href="http://docs.python.org/2/tutorial/inputoutput.html#methods-of-file-objects" rel="nofollow">Python docs</a> on <code>file</code> objects (the objects returned by <code>open(filename)</code>): you can choose to <code>read</code> a specified number of bytes, or use <code>readline</code> to work through the file one line at a time.
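
As a rough sketch of the <code>readline</code> idea, the function below copies the header and then whole lines into a sample file until a byte budget is reached. The function name <code>take_sample</code>, the file names, and the 250 MB default are illustrative assumptions, not something taken from the linked docs. It also assumes that no quoted field contains an embedded newline.

```python
def take_sample(src_path, dst_path, max_bytes=250 * 1024 * 1024):
    """Copy the header plus complete lines from src_path until about max_bytes are written.

    Returns the number of bytes written. (Hypothetical helper for illustration.)
    """
    written = 0
    # Binary mode: line lengths are exact byte counts and no decoding is needed.
    with open(src_path, 'rb') as src, open(dst_path, 'wb') as dst:
        header = src.readline()
        dst.write(header)
        written += len(header)
        while True:
            line = src.readline()  # one line at a time, so memory use stays small
            # Stop at end of file or when the next line would exceed the budget.
            if not line or written + len(line) > max_bytes:
                break
            dst.write(line)
            written += len(line)
    return written
```

Calling something like <code>take_sample('data.csv', 'sample.csv')</code> would then leave a sample of at most roughly 250 MB that always ends on a line boundary.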

I agree with @jonrsharpe: <code>readline</code> should be able to read one line at a time, even for big files.

If you are dealing with big CSV files, might I suggest using <a href="http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html" rel="nofollow noreferrer">pandas.read_csv</a>? I often use it for the same purpose and always find it awesome (and fast). It takes a bit of time to get used to the idea of DataFrames, but once you get over that, it speeds up large operations like yours massively.

Hope it helps.
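
To grab just a small piece for exploration, one option is the <code>nrows</code> parameter of <code>pandas.read_csv</code>, which stops parsing after a given number of data rows. This is a minimal sketch, not a definitive recipe: the helper name <code>read_head</code>, the file names, and the default row count are assumptions, and how many rows add up to about 250 MB depends on how wide the rows are.

```python
import pandas as pd


def read_head(path, n_rows=1_000_000):
    """Load only the first n_rows data rows of a CSV file into a DataFrame.

    Hypothetical helper for illustration; the header row is read in addition to n_rows.
    """
    # nrows limits how many rows are parsed, so the rest of the file is not loaded.
    return pd.read_csv(path, nrows=n_rows)
```

The result could then be saved for later use with something like <code>read_head('data.csv').to_csv('sample.csv', index=False)</code>.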

This graph shows the runtime difference of the different approaches outlined by other posters (on an 8 core machine when splitting a 2.9 GB file with 11.8 million rows of data into ~290 files).
<a href="https://i.stack.imgur.com/ifNjO.png" rel="nofollow noreferrer"><img src="https://i.stack.imgur.com/ifNjO.png" alt="enter image description here" /></a>
The shell approach is from Thomas Orozco, the Python approach is from Roberto, the Pandas approach is from Quentin Febvre, and here's the Dask snippet:
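<pre class="lang-py prettyprint-override"><code>import dask.dataframe as dd
# `dtypes` is assumed to be a column-name to dtype mapping defined elsewhere
ddf = dd.read_csv(&quot;../nyc-parking-tickets/Parking_Violations_Issued_-_Fiscal_Year_2015.csv&quot;, blocksize=10000000, dtype=dtypes)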
ddf.to_csv(&quot;../tmp/split_csv_dask&quot;)
</code></pre>
I'd recommend Dask for splitting files, even though it's not the fastest, because it's the most flexible solution (you can write out different file formats, perform processing operations before writing, easily modify compression formats, etc.). The Pandas approach is almost as flexible, but cannot perform processing on the entire dataset (like sorting the entire dataset before writing).
Bash / native Python filesystem operations are clearly quicker, but speed is not what I'm typically looking for when I have a large CSV. I'm typically interested in splitting large CSVs into smaller Parquet files for performant, production data analyses. I don't usually care if the actual splitting takes a couple of minutes more. I'm more interested in splitting accurately.
I wrote a blog post that discusses this in more detail. You can probably Google around and find the post.

If you want to split at approximate byte boundaries, and the newest data points are at the bottom of the file and should end up in the first output file:
<pre><code>from pathlib import Path

TEN_MB = 10000000
FIVE_MB = 5000000

def split_file_into_chunks(path, chunk_size=TEN_MB):
    path = str(path)
    output_prefix = path.rpartition('.')[0]
    output_ext = path.rpartition('.')[-1]

    with open(path, 'rb') as f:
        seek_positions = []
        for x, line in enumerate(f):
            if not x:
                header = line
            seek_positions.append(f.tell())

        part = 0
        last_seek_pos = seek_positions[-1]
        for seek_pos in reversed(seek_positions):
            if last_seek_pos-seek_pos &gt;= chunk_size:
                with open(f'{output_prefix}.arch.{part}.{output_ext}', 'wb') as f_out:
                    f.seek(seek_pos)
                    f_out.write(header)
                    f_out.write(f.read(last_seek_pos-seek_pos))

                last_seek_pos = seek_pos
                part += 1

        with open(f'{output_prefix}.arch.{part}.{output_ext}', 'wb') as f_out:
            f.seek(0)
            f_out.write(f.read(last_seek_pos))

    Path(path).rename(path+'~')
    Path(f'{output_prefix}.arch.0.{output_ext}').rename(path)
    Path(path+'~').unlink()
</code></pre>

Here is my code, which might help:
<pre><code>import os
import uuid

import pandas as pd


class FileSettings(object):
    def __init__(self, file_name, row_size=100):
        self.file_name = file_name
        self.row_size = row_size


class FileSplitter(object):

    def __init__(self, file_settings):
        if not isinstance(file_settings, FileSettings):
            raise TypeError(&quot;Please pass a FileSettings instance&quot;)

        self.file_settings = file_settings
        # With chunksize set, read_csv returns an iterator of DataFrames
        self.chunks = pd.read_csv(self.file_settings.file_name,
                                  chunksize=self.file_settings.row_size)

    def run(self, directory=&quot;temp&quot;):
        os.makedirs(directory, exist_ok=True)

        # Base name of the input file, without directory or extension
        base_name = os.path.splitext(
            os.path.basename(self.file_settings.file_name))[0]

        # Write each chunk to its own uniquely named CSV file
        for counter, chunk in enumerate(self.chunks):
            file_name = &quot;{}/{}_{}_row_{}_{}.csv&quot;.format(
                directory, base_name, counter,
                self.file_settings.row_size, uuid.uuid4()
            )
            chunk.to_csv(file_name)

        return True


def main():
    helper = FileSplitter(FileSettings(
        file_name='sample1.csv',
        row_size=10
    ))
    helper.run()


if __name__ == '__main__':
    main()
</code></pre>

I have a 7GB <code>csv</code> file which I'd like to split into smaller chunks, so it is readable and faster for analysis in Python on a notebook. I would like to grab a small set from it, maybe 250MB, so how can I do this?

How can I split a large CSV file (7GB) in Python?
