Regex to trim special characters from a given string

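Stack Overflow user

Asked 2018-07-16 14:44:41

3 answers · 103 views · 0 followers · 0 votes

I extracted data from a source, and it is now a set of tokens. These tokens contain junk or special characters at the beginning or the end. For example, I have the following set: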

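manufactured traffic
(devices)
(traffic-calming)
traffic-
synthetic,
artificial turf)

This data should become, respectively:

manufactured traffic
devices
traffic-calming
traffic
synthetic
artificial turf

To sanitize this set of strings, I implemented the method below, which works correctly (tested on regex101.com):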

public Filter filterSpecialCharacters() {
    String regex = "^([^a-z0-9A-Z]*)([a-z0-9A-Z])(.*)([a-z0-9A-Z])([^a-z0-9A-Z]*)$";
    set = set
        .stream()
        .map(str -> str.replaceAll(regex, "$2$3$4"))
        .collect(Collectors.toSet());
    return this;
}

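However, I am still not satisfied with the regex I used, because I have a large amount of data. I would like to know whether there is a better option.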

java

regex

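3 Answers

Stack Overflow user

Accepted answer

Answered 2018-07-17 06:17:07

Don't use a regular expression for this kind of simple trim. Walk the string and trim it yourself. The code is longer, but it is definitely faster than regex.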

public static List<String> filterSpecialCharacters(List<String> input) {
    Iterator<String> it = input.iterator();
    List<String> output = new ArrayList<String>();
    // For all strings in the List
    while (it.hasNext()) {
        String s = it.next();
        int endIndex = s.length() - 1;
        // Get the last index of alpha numeric char
        for (int i = endIndex; i >= 0; i--) {
            if (isAlphaNumeric(s.charAt(i))) {
                endIndex = i;
                break;
            }
        }
        StringBuilder out = new StringBuilder();
        boolean startCopying = false;
        // Parse the string till the last index of alpha numeric char
        for (int i = 0; i <= endIndex; i++) {
            // Ignore the leading occurrences non alpha-num chars
            if (!startCopying && !isAlphaNumeric(s.charAt(i))) {
                continue;
            }
            // Start copying to output buffer after(including) the first occurrence of alpha-num char 
            else {
                startCopying = true;
                out.append(s.charAt(i));
            }
        }
        // Add the trimmed string to the output list.
        output.add(out.toString());
    }

    return output;
}

// Update this method with the characters that you don't want to trim
private static boolean isAlphaNumeric(char c) {
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || (c >= '0' && c <= '9');
}

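Please test this code and see whether it meets your requirements. I found it to be almost 10 times faster than the regex trims used in the other answers. Also, if performance matters to you, I suggest iterating over the Set with an Iterator rather than using the stream/map/collect functions.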
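The same idea can be written more compactly by locating the first and last alphanumeric positions and taking a single substring, which avoids the per-character StringBuilder appends. This is only a sketch under the same ASCII-only assumption as isAlphaNumeric above; the class name IndexTrim and the method name trim are made up for illustration, and no timing claim is made for it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class IndexTrim {
    // Same ASCII-only definition as isAlphaNumeric in the answer above.
    static boolean isAlphaNumeric(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || (c >= '0' && c <= '9');
    }

    // Returns s without its leading and trailing non-alphanumeric characters.
    // A string with no alphanumeric character at all becomes "".
    static String trim(String s) {
        int start = 0;
        int end = s.length(); // exclusive upper bound
        while (start < end && !isAlphaNumeric(s.charAt(start))) {
            start++;
        }
        while (end > start && !isAlphaNumeric(s.charAt(end - 1))) {
            end--;
        }
        return s.substring(start, end);
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList(
                "manufactured traffic", "(devices)", "(traffic-calming)",
                "traffic-", "synthetic,", "artificial turf)");
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            out.add(trim(token));
        }
        // prints [manufactured traffic, devices, traffic-calming, traffic, synthetic, artificial turf]
        System.out.println(out);
    }
}
```

Characters inside the token, such as the dash in traffic-calming, are left alone because only the two ends are scanned.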

Votes: 1

Stack Overflow user

Answered 2018-07-16 14:52:23

I would use \p{Punct}, which matches all of these punctuation characters: !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~

String regex = "^\\p{Punct}*([a-z0-9A-Z -]*)\\p{Punct}*$";
set = set.stream()
        .map(str -> str.replaceAll(regex, "$1"))
        .collect(Collectors.toSet());

=>[synthetic, devices, traffic-calming, manufactured traffic , artificial turf]

Take a look at the summary of regular-expression constructs in the Pattern documentation.
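One detail of the \p{Punct} pattern is worth checking against your own data: the capture group's character class contains a space and a dash, so a trailing dash is captured rather than trimmed. The sketch below applies the same regex to three of the question's tokens; the class name PunctTrim and the method name trim are hypothetical.

```java
import java.util.regex.Pattern;

final class PunctTrim {
    // The regex from this answer, compiled once.
    static final Pattern PUNCT_EDGES =
            Pattern.compile("^\\p{Punct}*([a-z0-9A-Z -]*)\\p{Punct}*$");

    static String trim(String s) {
        return PUNCT_EDGES.matcher(s).replaceAll("$1");
    }

    public static void main(String[] args) {
        System.out.println(trim("(devices)"));  // prints devices
        System.out.println(trim("synthetic,")); // prints synthetic
        // The dash belongs to the group's character class, so it is kept.
        System.out.println(trim("traffic-"));   // prints traffic-
    }
}
```

If trailing dashes should be removed as well, the two-step replaceFirst variant in this answer handles that case.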

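Or, as @Ted Hopp mentioned in the comments, you can use two map calls: one removes the special characters from the beginning, and the other removes them from the end: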

set = set.stream()
        .map(str -> str.replaceFirst("^[^a-z0-9A-Z]*", ""))
        .map(str -> str.replaceFirst("[^a-z0-9A-Z]*$", ""))
        .collect(Collectors.toSet());
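Since String.replaceFirst and String.replaceAll compile their regex on every call, a large data set may benefit from compiling the pattern once and reusing it. The following is a sketch of that variation, not part of the original answer: it merges the two replacements into one alternation, and the names PrecompiledTrim, EDGES, and trimAll are assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

final class PrecompiledTrim {
    // Matches a run of non-alphanumerics at the start OR at the end of the input.
    static final Pattern EDGES = Pattern.compile("^[^a-zA-Z0-9]+|[^a-zA-Z0-9]+$");

    static String trim(String s) {
        return EDGES.matcher(s).replaceAll("");
    }

    // LinkedHashSet is used only to keep the printed order predictable.
    static Set<String> trimAll(Set<String> tokens) {
        return tokens.stream()
                .map(PrecompiledTrim::trim)
                .collect(Collectors.toCollection(LinkedHashSet::new));
    }

    public static void main(String[] args) {
        Set<String> tokens = new LinkedHashSet<>(Arrays.asList(
                "(devices)", "(traffic-calming)", "traffic-", "synthetic,"));
        // prints [devices, traffic-calming, traffic, synthetic]
        System.out.println(trimAll(tokens));
    }
}
```

Whether this is measurably faster on your data is something to benchmark rather than assume.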

Votes: 2

Stack Overflow user

Answered 2018-07-16 17:27:23

You can do this with a single passive regex that works the same way every time.

Globally find: (?m)^[^a-z0-9A-Z\r\n]*(.*?)[^a-z0-9A-Z\r\n]*$

Replace with: $1

https://regex101.com/r/tGFbLm/1

 (?m)                          # Multi-line mode
 ^                             # BOL
 [^a-z0-9A-Z\r\n]*     
 ( .*? )                       # (1), Passive content to write back
 [^a-z0-9A-Z\r\n]* 
 $                             # EOL
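In Java, the find-and-replace above could be applied to a whole block of newline-separated tokens in one pass. This is a minimal sketch assuming the tokens arrive as one multi-line string; the class name MultilineTrim and the method name trimLines are hypothetical.

```java
import java.util.regex.Pattern;

final class MultilineTrim {
    // (?m) makes ^ and $ match at every line boundary, not only at the input's ends.
    static final Pattern LINE_EDGES =
            Pattern.compile("(?m)^[^a-z0-9A-Z\\r\\n]*(.*?)[^a-z0-9A-Z\\r\\n]*$");

    // Replaces each line with its captured middle part (group 1).
    static String trimLines(String text) {
        return LINE_EDGES.matcher(text).replaceAll("$1");
    }

    public static void main(String[] args) {
        String text = "(devices)\ntraffic-\nsynthetic,";
        // prints three lines: devices, traffic, synthetic
        System.out.println(trimLines(text));
    }
}
```

Excluding \r and \n from the two character classes keeps the junk-matching runs from crossing into a neighbouring line.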

Votes: 1

Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/51364451
