Handling PatternSyntaxException and scanning text

Stack Overflow user

Asked 2015-11-01 23:10:44

2 answers, 73 views, 0 followers, 0 votes

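I want to find names from a huge list of about a million names in a collection of text documents. I first build a pattern from the names in the list: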

    BufferedReader TSVFile = new BufferedReader(new FileReader("names.tsv"));

    String dataRow = TSVFile.readLine();
    dataRow = TSVFile.readLine();// skip first line (header)

    String combined = "";
    while (dataRow != null) {
        String[] dataArray = dataRow.split("\t");
        String name = dataArray[1];
        combined += name.replace("\"", "") + "|";

        dataRow = TSVFile.readLine(); // Read next line of data.
    }
    TSVFile.close();
    Pattern all = Pattern.compile(combined);

After doing this, I got a PatternSyntaxException because some names contain a '+' or other regex metacharacters. I tried to work around it by skipping those few names:

    if (name.contains("\"")) {
        // ignore this name
    }

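That did not work properly, and it is also messy: you have to escape everything by hand and rerun the program many times, which wastes time. Then I tried the quote method: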

    Pattern all = Pattern.compile(Pattern.quote(combined));

But now I cannot find any matches in the text documents, even when I also use quote on them. How can I solve this?

java

regex

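2 Answers

Stack Overflow user

Answered 2015-11-02 04:27:58

I agree with @dragon66's comment: you should not quote the pipe "|". Your code would then look like the following, using Pattern.quote() on each name: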

BufferedReader TSVFile = new BufferedReader(new FileReader("names.tsv"));

String dataRow = TSVFile.readLine();
dataRow = TSVFile.readLine();// skip first line (header)

String combined = "";
while (dataRow != null) {
    String[] dataArray = dataRow.split("\t");
    String name = dataArray[1];
    combined += Pattern.quote(name.replace("\"", "")) + "|"; //line changed

    dataRow = TSVFile.readLine(); // Read next line of data.
}
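TSVFile.close();
if (combined.endsWith("|")) {
    // drop the trailing "|" so the pattern has no empty alternative
    combined = combined.substring(0, combined.length() - 1);
}
Pattern all = Pattern.compile(combined);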

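Also, I suggest checking whether your problem domain needs optimization: replacing String combined = ""; with the mutable StringBuilder class avoids creating unnecessary new strings inside the loop, because String is immutable and every += allocates a new one.

Votes: 0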
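
To illustrate both points, here is a minimal sketch, not the answerer's code. It assumes the names are already loaded into a list; the class name QuotedAlternation, the helper names, and the sample names are made up for illustration. It shows that quoting the whole joined string turns every "|" into a literal character, while quoting each name separately keeps the alternation, and it uses java.util.StringJoiner in place of repeated string concatenation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringJoiner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuotedAlternation {

    // Quotes each name separately and joins the results with an unquoted "|".
    static Pattern compileNames(List<String> names) {
        StringJoiner alternation = new StringJoiner("|");
        for (String name : names) {
            alternation.add(Pattern.quote(name));
        }
        return Pattern.compile(alternation.toString());
    }

    // Collects every substring of text that the pattern matches.
    static List<String> findAll(Pattern pattern, String text) {
        List<String> hits = new ArrayList<>();
        Matcher matcher = pattern.matcher(text);
        while (matcher.find()) {
            hits.add(matcher.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        // Hypothetical sample data; the real names would come from names.tsv.
        List<String> names = List.of("C++", "Anna", "J. Doe");
        String text = "Anna met J. Doe to talk about C++.";

        // Quoting the joined string makes "|" a literal, so the pattern only
        // matches the exact text "C++|Anna|J. Doe".
        Pattern wholeQuoted = Pattern.compile(Pattern.quote(String.join("|", names)));
        System.out.println(findAll(wholeQuoted, text)); // prints []

        // Quoting name by name keeps "|" as the alternation operator.
        System.out.println(findAll(compileNames(names), text)); // prints [Anna, J. Doe, C++]
    }
}
```

A joiner also never leaves a trailing "|" behind. A trailing "|" would add an empty alternative, and find() would then report empty matches wherever no name matches.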

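Stack Overflow user

Answered 2015-11-07 20:20:15

guilhermerama has already provided the bug fix for your code.

I will add some performance improvements. As I pointed out, Java's regex library does not scale, and it is even slower when used for searching.

Multi-string search algorithms can do better. For example, using StringsAndChars String Search: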

//setting up a test file
Iterable<String> lines = createLines();
Files.write(Paths.get("names.tsv"), lines , CREATE, WRITE, TRUNCATE_EXISTING);

// read the pattern from the file
BufferedReader TSVFile = new BufferedReader(new FileReader("names.tsv"));

Set<String> combined = new LinkedHashSet<>();

String dataRow = TSVFile.readLine();
dataRow = TSVFile.readLine();// skip first line (header)

while (dataRow != null) {
  String[] dataArray = dataRow.split("\t");
  String name = dataArray[1];
  combined.add(name);

  dataRow = TSVFile.readLine(); // Read next line of data.
}

TSVFile.close();

// search the pattern in a small text
StringSearchAlgorithm stringSearch = new AhoCorasick(new ArrayList<>(combined));
StringFinder finder = stringSearch.createFinder(new StringCharProvider("test " + name(38) + "\n or " + name(799) + " : " + name(99999), 0));
System.out.println(finder.findAll());

The result will be:

[5:10(00038), 15:20(00799), 23:28(99999)]

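The search (finder.findAll()) takes less than 1 millisecond on my computer. Doing the same with java.util.regex took around 20 milliseconds.

You can tune this performance further by using the other algorithms provided by RexLex.

The setup needs the following code: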
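
The answer does not show its java.util.regex measurement, so the following is only a guess at what such a baseline could look like, using JDK classes alone. The class name RegexNameSearch, the helpers, and the "start:end(text)" formatting are assumptions chosen to mirror the output above. Timings depend on the machine and the JVM, so the code prints whatever it measures.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringJoiner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexNameSearch {

    // Same naming scheme as the answer's name(int): zero-padded to five digits.
    static String name(int i) {
        return String.format("%05d", i);
    }

    // Builds one alternation over name(0) .. name(count - 1), quoting each name.
    static Pattern compileNames(int count) {
        StringJoiner alternation = new StringJoiner("|");
        for (int i = 0; i < count; i++) {
            alternation.add(Pattern.quote(name(i)));
        }
        return Pattern.compile(alternation.toString());
    }

    // Reports each match as "start:end(text)", imitating the output shown above.
    static List<String> findAll(Pattern pattern, String text) {
        List<String> hits = new ArrayList<>();
        Matcher matcher = pattern.matcher(text);
        while (matcher.find()) {
            hits.add(matcher.start() + ":" + matcher.end() + "(" + matcher.group() + ")");
        }
        return hits;
    }

    public static void main(String[] args) {
        Pattern all = compileNames(100_000);
        String text = "test " + name(38) + "\n or " + name(799) + " : " + name(99999);

        long start = System.nanoTime();
        List<String> hits = findAll(all, text);
        long micros = (System.nanoTime() - start) / 1_000;

        System.out.println(hits); // prints [5:10(00038), 15:20(00799), 23:28(99999)]
        System.out.println("regex search took " + micros + " microseconds");
    }
}
```

A single run like this is not a rigorous benchmark. It includes no JIT warm-up, so the measured time is only a rough indication.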

private static Iterable<String> createLines() {
    List<String> list = new ArrayList<>();
    for (int i = 0; i < 100000; i++)  {
        list.add(i + "\t" + name(i));
    }
    return list;
}

private static String name(int i) {
    String s = String.valueOf(i);
    while (s.length() < 5)  {
        s = '0' + s;
    }
    return s;
}

Votes: 0

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/33463661
