Q: Loading large files into HDFS using Flume (spooling directory)

Stack Overflow user

Asked 2014-03-20 16:16:29

1 answer, 13.2K views, 0 followers, 4 votes

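We copied a 150 MB CSV file into Flume's spool directory. When it is loaded into HDFS, the file is split into smaller files of about 80 KB each. Is there a way to load the file with Flume without it being split into smaller files? Many small files generate more metadata in the NameNode, so we need to avoid this.

My flume-ng configuration is shown below: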

# Initialize agent's source, channel and sink
agent.sources = TwitterExampleDir
agent.channels = memoryChannel
agent.sinks = flumeHDFS

# Setting the source to spool directory where the file exists
agent.sources.TwitterExampleDir.type = spooldir
agent.sources.TwitterExampleDir.spoolDir = /usr/local/flume/live

# Setting the channel to memory
agent.channels.memoryChannel.type = memory
# Max number of events stored in the memory channel
agent.channels.memoryChannel.capacity = 10000
# agent.channels.memoryChannel.batchSize = 15000
agent.channels.memoryChannel.transactioncapacity = 1000000

# Setting the sink to HDFS
agent.sinks.flumeHDFS.type = hdfs
agent.sinks.flumeHDFS.hdfs.path = hdfs://info3s7:54310/spool5
agent.sinks.flumeHDFS.hdfs.fileType = DataStream

# Write format can be text or writable
agent.sinks.flumeHDFS.hdfs.writeFormat = Text

# use a single csv file at a time
agent.sinks.flumeHDFS.hdfs.maxOpenFiles = 1

# rollover file based on maximum size of 10 MB
agent.sinks.flumeHDFS.hdfs.rollCount=0
agent.sinks.flumeHDFS.hdfs.rollInterval=2000
agent.sinks.flumeHDFS.hdfs.rollSize = 0
agent.sinks.flumeHDFS.hdfs.batchSize =1000000

# never rollover based on the number of events
agent.sinks.flumeHDFS.hdfs.rollCount = 0

# rollover file based on max time of 1 min
#agent.sinks.flumeHDFS.hdfs.rollInterval = 0
# agent.sinks.flumeHDFS.hdfs.idleTimeout = 600

# Connect source and sink with channel
agent.sources.TwitterExampleDir.channels = memoryChannel
agent.sinks.flumeHDFS.channel = memoryChannel

flume

1 Answer

Stack Overflow user

Accepted answer

Posted 2014-03-30 00:09:24

What you want is:

# rollover file based on maximum size of 10 MB
agent.sinks.flumeHDFS.hdfs.rollCount = 0
agent.sinks.flumeHDFS.hdfs.rollInterval = 0
agent.sinks.flumeHDFS.hdfs.rollSize = 10000000
agent.sinks.flumeHDFS.hdfs.batchSize = 10000

From the Flume documentation:

hdfs.rollSize: File size to trigger roll, in bytes (0: never roll based on file size)

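In your example you set rollInterval to 2000, which rolls the file over after 2000 seconds and results in smaller files. Setting rollInterval to 0 disables time-based rolling.

Also note that batchSize is the number of events written before the file is flushed to HDFS, not necessarily the number of events written before the file is closed and a new one is created. Set it small enough that writing a large batch does not time out, but large enough to avoid the overhead of many requests to HDFS.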
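To make the interplay between these settings concrete, here is a minimal sketch of the channel and sink sections from the question with the roll settings above applied. The channel sizes and the idleTimeout value are illustrative assumptions, not tuned or tested values. It also assumes the lowercase `transactioncapacity` key in the question is not picked up, since Flume property names are case-sensitive as far as I know.

```properties
# Sketch only: the question's agent with the answer's roll settings applied.
# Assumption: the channel sizes below are illustrative, not tuned values.
agent.channels.memoryChannel.type = memory
agent.channels.memoryChannel.capacity = 100000
# Spelled transactionCapacity (camel case); it must not exceed capacity.
agent.channels.memoryChannel.transactionCapacity = 10000

agent.sinks.flumeHDFS.type = hdfs
agent.sinks.flumeHDFS.hdfs.path = hdfs://info3s7:54310/spool5
agent.sinks.flumeHDFS.hdfs.fileType = DataStream
agent.sinks.flumeHDFS.hdfs.writeFormat = Text

# Roll on size only: 0 disables count-based and time-based rolling.
agent.sinks.flumeHDFS.hdfs.rollCount = 0
agent.sinks.flumeHDFS.hdfs.rollInterval = 0
agent.sinks.flumeHDFS.hdfs.rollSize = 10000000

# Keep the sink batch no larger than the channel's transactionCapacity.
agent.sinks.flumeHDFS.hdfs.batchSize = 10000

# Assumption: close a file after 600 idle seconds so the last file
# is not left open once the input has been fully written.
agent.sinks.flumeHDFS.hdfs.idleTimeout = 600
```

With rollSize at 10000000 bytes, a 150 MB input would still be written as roughly 15 or 16 files. If a single output file is the goal, rollSize would presumably need to be larger than the input file, or set to 0 with idleTimeout left to close the file.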

Votes: 8

Original content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/22527089
