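I have a file in which each record is split across multiple lines. The only way to recognize the end of a record is that the next record starts with ABC. A sample is shown below. The file can be 5-10 GB, and I'm looking for efficient Java logic that only splits the file (ideally without reading every line). The split logic must make sure that each new file starts with a new record, which in this case means a line starting with "ABC".
Added more detail: I only want to split the file, and after splitting, the last record in each output file must end properly within that file.
Can anyone suggest an approach?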
HDR
ABCline1goesonforrecord1 //first record
line2goesonForRecord1
line3goesonForRecord1
line4goesonForRecord1
ABCline2goesOnForRecord2 //second record
line2goesonForRecord2
line3goesonForRecord2
line4goesonForRecord2
line5goesonForRecord2
ABCline2goesOnForRecord3 //third record
line2goesonForRecord3
line3goesonForRecord3
line4goesonForRecord3
TRL

Posted on 2020-12-01 04:36:44
So, here is the code you need. I tested it on a 10 GB file, and splitting the file took 64 seconds.
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.concurrent.TimeUnit;

public class FileSplitter {

    private final Path filePath;
    private BufferedWriter writer;
    private int fileCounter = 1;

    public static void main(String[] args) throws Exception {
        long startTime = System.nanoTime();
        new FileSplitter(Path.of("/tmp/bigfile.txt")).split();
        System.out.println("Time to split " + TimeUnit.NANOSECONDS.toSeconds(System.nanoTime() - startTime));
    }

    // Generates a large test file. Not called from main; run it once beforehand if needed.
    private static void generateBigFile() throws Exception {
        try (var writer = Files.newBufferedWriter(Path.of("/tmp/bigfile.txt"), StandardOpenOption.CREATE, StandardOpenOption.TRUNCATE_EXISTING)) {
            for (int i = 0; i < 100_000; i++) {
                writer.write(String.format("ABCline1goesonforrecord%d\n", i + 1));
                for (int j = 0; j < 10_000; j++) {
                    writer.write(String.format("line%dgoesonForRecord%d\n", j + 2, i + 1));
                }
            }
        }
    }

    public FileSplitter(Path filePath) {
        this.filePath = filePath;
    }

    // Streams the file line by line and starts a new part file at every line that begins with "ABC".
    // Lines before the first "ABC" line (such as HDR) are dropped; a trailing TRL line ends up in the last part.
    void split() throws IOException {
        try (var stream = Files.lines(filePath, StandardCharsets.UTF_8)) {
            stream.forEach(line -> {
                if (line.startsWith("ABC")) {
                    closeWriter();
                    openWriter();
                }
                writeLine(line);
            });
        } finally {
            closeWriter();
        }
    }

    private void writeLine(String line) {
        if (writer != null) {
            try {
                writer.write(line);
                writer.write("\n");
            } catch (IOException e) {
                throw new UncheckedIOException("Failed to write line to file part", e);
            }
        }
    }

    private void openWriter() {
        if (this.writer == null) {
            // "bigfile.txt" becomes "bigfile_part1.txt", "bigfile_part2.txt", ...
            var filePartName = filePath.getFileName().toString().replace(".", "_part" + fileCounter + ".");
            try {
                var outputDir = Path.of("/tmp/split");
                Files.createDirectories(outputDir); // the output directory must exist before writing
                writer = Files.newBufferedWriter(outputDir.resolve(filePartName), StandardOpenOption.CREATE, StandardOpenOption.TRUNCATE_EXISTING);
            } catch (IOException e) {
                throw new UncheckedIOException("Failed to open file part", e);
            }
            fileCounter++;
        }
    }

    private void closeWriter() {
        if (writer != null) {
            try {
                writer.close(); // close() also flushes
                writer = null;
            } catch (IOException e) {
                throw new UncheckedIOException("Failed to close writer", e);
            }
        }
    }
}

By the way, the Scanner-based solution would also work.
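As for not reading every line, I don't see why you would want to avoid it. If you choose not to read every line (which is possible), then first, you will overcomplicate the solution, and second, I'm fairly sure you will lose performance, because you will have to build the record-boundary logic into the split itself.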
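
For reference, this is roughly what the "don't read every line" variant could look like. It is only a sketch under a few assumptions: records start with the bytes "ABC" right after a line feed (as in the sample), the class name OffsetSplitter, its method names, the part file naming, and the targetBytes parameter are all made up for illustration, and it has not been timed against a multi-gigabyte file. The idea is to jump ahead by about targetBytes, scan forward only until the next record start, and copy the byte range with FileChannel.transferTo.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the answer above.
final class OffsetSplitter {

    // A record start is "ABC" immediately after a line feed (assumption taken from the sample).
    private static final byte[] MARKER = "\nABC".getBytes(StandardCharsets.US_ASCII);

    /** Returns the offset of the first record start at or after {@code from}, or the file size if there is none. */
    public static long nextRecordStart(FileChannel ch, long from) throws IOException {
        long size = ch.size();
        ByteBuffer buf = ByteBuffer.allocate(64 * 1024);
        long pos = Math.max(0, from - 1);   // step back one byte so the preceding '\n' is seen
        int matched = (from == 0) ? 1 : 0;  // the start of the file counts as the start of a line
        while (pos < size) {
            buf.clear();
            int n = ch.read(buf, pos);
            if (n <= 0) {
                break;
            }
            for (int i = 0; i < n; i++) {
                byte b = buf.get(i);
                if (b == MARKER[matched]) {
                    matched++;
                    if (matched == MARKER.length) {
                        return pos + i - 2; // offset of the 'A' in "ABC"
                    }
                } else {
                    matched = (b == MARKER[0]) ? 1 : 0;
                }
            }
            pos += n;
        }
        return size;
    }

    /** Splits {@code input} into parts of at least {@code targetBytes}, each cut at a record boundary. */
    public static List<Path> split(Path input, Path outDir, long targetBytes) throws IOException {
        if (targetBytes <= 0) {
            throw new IllegalArgumentException("targetBytes must be positive");
        }
        List<Path> parts = new ArrayList<>();
        Files.createDirectories(outDir);
        try (FileChannel in = FileChannel.open(input, StandardOpenOption.READ)) {
            long size = in.size();
            long start = nextRecordStart(in, 0); // skips anything before the first record, e.g. HDR
            while (start < size) {
                long end = nextRecordStart(in, start + targetBytes);
                Path part = outDir.resolve("part" + (parts.size() + 1) + ".txt");
                try (FileChannel out = FileChannel.open(part, StandardOpenOption.CREATE,
                        StandardOpenOption.WRITE, StandardOpenOption.TRUNCATE_EXISTING)) {
                    long copied = 0;
                    while (copied < end - start) {
                        copied += in.transferTo(start + copied, end - start - copied, out);
                    }
                }
                parts.add(part);
                start = end;
            }
        }
        return parts;
    }
}
```

A call such as OffsetSplitter.split(Path.of("/tmp/bigfile.txt"), Path.of("/tmp/split"), 1L << 30) would aim for parts of about 1 GB each. As with the line-based version, a trailing TRL line ends up in the last part. Note how much more bookkeeping this needs compared with simply reading the lines.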

Posted on 2020-12-01 03:28:58
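I haven't tested it, but something like this should work. You are not loading the whole file into memory, only reading one line at a time, so it shouldn't be too bad.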
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedList;
import java.util.Scanner;

public class RecordSplitter {

    // Input layout: the sample from the question (HDR, records starting with "ABC", TRL).
    public void splitRecords(String filename) {
        try {
            // You do not want to edit the existing file in case things go wrong.
            // One way is to first collect the line index where each new record starts.
            LinkedList<Long> startOfRecordIndexes = new LinkedList<>();
            long index = 0;
            try (Scanner scanFile = new Scanner(new File(filename), "UTF-8")) {
                while (scanFile.hasNextLine()) {
                    if (scanFile.nextLine().startsWith("ABC")) {
                        startOfRecordIndexes.add(index);
                    }
                    index++;
                }
            }
            // Once you have the starting index of every record, read the file again and write
            // the records out. Scanner.reset() does not rewind the input, so the file is reopened.
            index = 0;
            int part = 1;
            BufferedWriter writer = null;
            try (Scanner scanFile = new Scanner(new File(filename), "UTF-8")) {
                while (scanFile.hasNextLine()) {
                    String line = scanFile.nextLine();
                    if (!startOfRecordIndexes.isEmpty() && index == startOfRecordIndexes.peek()) {
                        if (writer != null) {
                            writer.write("TRL");
                            writer.newLine();
                            writer.close();
                        }
                        // Placeholder naming scheme; use whatever unique file name suits you.
                        writer = new BufferedWriter(new OutputStreamWriter(
                                new FileOutputStream(filename + ".part" + part++), StandardCharsets.UTF_8));
                        writer.write("HDR");
                        writer.newLine();
                        startOfRecordIndexes.remove();
                    }
                    // Lines before the first record (HDR) are skipped. The input's own TRL line is
                    // skipped too, because a TRL is appended to every part (assumes the trailer is
                    // exactly the line "TRL", as in the sample).
                    if (writer != null && !line.equals("TRL")) {
                        writer.write(line);
                        writer.newLine();
                    }
                    index++;
                }
            }
            // Close the last record
            if (writer != null) {
                writer.write("TRL");
                writer.newLine();
                writer.close();
            }
        } catch (IOException e) {
            // deal with the exception as appropriate; here it is rethrown unchecked
            throw new UncheckedIOException("Failed to split " + filename, e);
        }
    }
}

https://stackoverflow.com/questions/65079254