
Q: (De-)compress a US-ASCII string to/from a byte array (7 bits/character)

Stack Overflow user

Asked 2019-07-04 09:31:55

Answers: 4, Views: 913, Followers: 0, Score: 1

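It is well known that ASCII encodes characters in 7 bits, so I expected the number of bytes used to represent a text to always be smaller than the number of letters in it.

For example: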

    StringBuilder text = new StringBuilder();
    IntStream.range(0, 160).forEach(x -> text.append("a")); // generate a 160-character text
    int letters = text.length();
    int bytes = text.toString().getBytes(StandardCharsets.US_ASCII).length;
    System.out.println(letters); // expected  160,  actual 160
    System.out.println(bytes); //   expected  140,  actual 160

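The result is always letters = bytes, but I expected letters > bytes.

The underlying problem: in the SMPP protocol the SMS body must be <= 140 bytes. With a 7-bit ASCII encoding, 160 letters (= 140*8/7) should fit, so I want to encode the text as 7-bit ASCII. We are using the JSMPP library.

Can anyone explain this to me and point me in the right direction? Thanks.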

java

ascii

smpp

jsmpp

4 Answers

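Stack Overflow user

Accepted answer

Posted 2019-07-05 06:04:46

Here is a quick and dirty solution without any libraries, i.e. using only what the JRE provides. It is not optimised for efficiency and does not check whether the message really is US-ASCII; it just assumes so. It is only a proof of concept: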

package de.scrum_master.stackoverflow;

import java.util.BitSet;

public class ASCIIConverter {
  public byte[] compress(String message) {
    BitSet bits = new BitSet(message.length() * 7);
    int currentBit = 0;
    for (char character : message.toCharArray()) {
      for (int bitInCharacter = 0; bitInCharacter < 7; bitInCharacter++) {
        if ((character & 1 << bitInCharacter) > 0)
          bits.set(currentBit);
        currentBit++;
      }
    }
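    // BitSet.toByteArray() drops trailing zero bytes, so pad to the full packed length
    return java.util.Arrays.copyOf(bits.toByteArray(), (message.length() * 7 + 7) / 8);
  }

  public String decompress(byte[] compressedMessage) {
    BitSet bits = BitSet.valueOf(compressedMessage);
    // 7 bytes hold 8 characters, but a 7-character tail also fills 7 bytes, so a
    // decoded trailing NUL in a full block is treated as padding and dropped below
    int numBits = 8 * compressedMessage.length - compressedMessage.length % 7;
    StringBuilder decompressedMessage = new StringBuilder(numBits / 7);
    for (int currentBit = 0; currentBit < numBits; currentBit += 7) {
      // toByteArray() is empty when all 7 bits are zero (the NUL character)
      byte[] characterBits = bits.get(currentBit, currentBit + 7).toByteArray();
      decompressedMessage.append(characterBits.length == 0 ? '\0' : (char) characterBits[0]);
    }
    int length = decompressedMessage.length();
    if (length > 0 && length % 8 == 0 && decompressedMessage.charAt(length - 1) == '\0')
      decompressedMessage.setLength(length - 1);
    return decompressedMessage.toString();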
  }

  public static void main(String[] args) {
    String[] messages = {
      "Hello world!",
      "This is my message.\n\tAnd this is indented!",
      " !\"#$%&'()*+,-./0123456789:;<=>?\n"
        + "@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_\n"
        + "`abcdefghijklmnopqrstuvwxyz{|}~",
      "1234567890123456789012345678901234567890"
        + "1234567890123456789012345678901234567890"
        + "1234567890123456789012345678901234567890"
        + "1234567890123456789012345678901234567890"
    };

    ASCIIConverter asciiConverter = new ASCIIConverter();
    for (String message : messages) {
      System.out.println(message);
      System.out.println("--------------------------------");
      byte[] compressedMessage = asciiConverter.compress(message);
      System.out.println("Number of ASCII characters = " + message.length());
      System.out.println("Number of compressed bytes = " + compressedMessage.length);
      System.out.println("--------------------------------");
      System.out.println(asciiConverter.decompress(compressedMessage));
      System.out.println("\n");
    }
  }
}

The console log looks like this:

Hello world!
--------------------------------
Number of ASCII characters = 12
Number of compressed bytes = 11
--------------------------------
Hello world!


This is my message.
    And this is indented!
--------------------------------
Number of ASCII characters = 42
Number of compressed bytes = 37
--------------------------------
This is my message.
    And this is indented!


 !"#$%&'()*+,-./0123456789:;<=>?
@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_
`abcdefghijklmnopqrstuvwxyz{|}~
--------------------------------
Number of ASCII characters = 97
Number of compressed bytes = 85
--------------------------------
 !"#$%&'()*+,-./0123456789:;<=>?
@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_
`abcdefghijklmnopqrstuvwxyz{|}~


1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890
--------------------------------
Number of ASCII characters = 160
Number of compressed bytes = 140
--------------------------------
1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890
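The same 7-bit packing can also be sketched with plain shift operations instead of a `BitSet`. This is only an illustration under stated assumptions: the class name `SeptetPacker` and the methods `pack` and `unpack` are hypothetical, not part of JSMPP or the JRE, and the input is assumed to be plain US-ASCII (SMS itself normally uses the GSM 7-bit default alphabet, which this sketch does not map to). Unlike the proof of concept above, `unpack` takes the character count explicitly, because 7 and 8 characters both occupy 7 packed bytes.

```java
class SeptetPacker {
  // Packs 7-bit characters LSB-first; rejects anything that needs more than 7 bits.
  static byte[] pack(String message) {
    byte[] packed = new byte[(message.length() * 7 + 7) / 8];
    int bitPos = 0;
    for (int i = 0; i < message.length(); i++) {
      int c = message.charAt(i);
      if (c > 0x7F)
        throw new IllegalArgumentException("Not US-ASCII at index " + i);
      int byteIndex = bitPos / 8;
      int shift = bitPos % 8;
      packed[byteIndex] |= (byte) (c << shift);
      if (shift > 1) // the 7 bits spill over into the next byte
        packed[byteIndex + 1] |= (byte) (c >> (8 - shift));
      bitPos += 7;
    }
    return packed;
  }

  // Reverses pack(); charCount is the number of characters originally packed.
  static String unpack(byte[] packed, int charCount) {
    StringBuilder message = new StringBuilder(charCount);
    int bitPos = 0;
    for (int i = 0; i < charCount; i++) {
      int byteIndex = bitPos / 8;
      int shift = bitPos % 8;
      int value = (packed[byteIndex] & 0xFF) >> shift;
      if (shift > 1) // fetch the remaining high bits from the next byte
        value |= (packed[byteIndex + 1] & 0xFF) << (8 - shift);
      message.append((char) (value & 0x7F));
      bitPos += 7;
    }
    return message.toString();
  }

  public static void main(String[] args) {
    String message = "1234567"; // 7 characters = 49 bits
    byte[] packed = pack(message);
    System.out.println(packed.length);                     // prints "7"
    System.out.println(unpack(packed, message.length()));  // prints "1234567"
  }
}
```

Passing the length separately is a design choice; a real protocol would typically carry the character count in a header field next to the packed payload.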

Score: 1

Stack Overflow user

Posted 2019-07-04 09:52:19

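(160*8 - 160*7)/8 = 20, so you expect to end up using 20 fewer bytes. However, each character is stored in a unit of fixed minimum size (a byte), so even if not all of its bits are used, the next value cannot simply be appended to it. You are therefore still using 8-bit bytes for the ASCII codes, which is why you get the same number. For example, lowercase "a" is 97 in ASCII:

01100001

Note that the leading zero is still there even though it is not used. You cannot just use it to store part of another value.

In conclusion, in plain ASCII the number of letters always equals the number of bytes.

(Or imagine putting objects of size 7 into boxes of size 8. You cannot split the objects into pieces, so the number of boxes must equal the number of objects, at least in this case.)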
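The unused leading zero can be made visible with a few lines of Java. This is a minimal sketch; the class name `AsciiBits` and the helper `toBinary` are hypothetical names chosen for illustration.

```java
import java.nio.charset.StandardCharsets;

class AsciiBits {
  // Returns the byte as an 8-character binary string, keeping leading zeros.
  static String toBinary(byte b) {
    String bits = Integer.toBinaryString(b & 0xFF);
    return "00000000".substring(bits.length()) + bits;
  }

  public static void main(String[] args) {
    byte[] encoded = "a".getBytes(StandardCharsets.US_ASCII);
    System.out.println(encoded.length);       // prints "1"
    System.out.println(toBinary(encoded[0])); // prints "01100001"
  }
}
```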

Score: 2

Stack Overflow user

Posted 2019-07-04 09:50:49

The byte length differs depending on the encoding. See the example below.

String text = "0123456789";
byte[] b1 = text.getBytes(StandardCharsets.US_ASCII);
System.out.println(b1.length);
// prints "10"

byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
System.out.println(utf8.length); 
// prints "10"

byte[] utf16= text.getBytes(StandardCharsets.UTF_16);
System.out.println(utf16.length); 
// prints "22"

// ISO-8859-1 (Latin-1) also uses one byte per character
byte[] latin1 = text.getBytes(StandardCharsets.ISO_8859_1);
System.out.println(latin1.length);
// prints "10"
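With digits only, the single-byte and UTF-8 lengths are identical, so the difference between encodings is easier to see with a character outside US-ASCII. The following is a small sketch along the same lines; the class name `EncodedLengths` and the helper `encodedLength` are hypothetical. Note that `String.getBytes(Charset)` replaces unmappable characters with the charset's replacement byte, so US-ASCII still yields one byte per character.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodedLengths {
  // Number of bytes the text occupies in the given charset.
  static int encodedLength(String text, Charset charset) {
    return text.getBytes(charset).length;
  }

  public static void main(String[] args) {
    String text = "h\u00e9llo"; // "héllo": 5 characters, one of them outside US-ASCII
    System.out.println(encodedLength(text, StandardCharsets.US_ASCII));   // prints "5"
    System.out.println(encodedLength(text, StandardCharsets.ISO_8859_1)); // prints "5"
    System.out.println(encodedLength(text, StandardCharsets.UTF_8));      // prints "6"
    System.out.println(encodedLength(text, StandardCharsets.UTF_16));     // prints "12" (2-byte BOM + 5 * 2)
  }
}
```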

Score: 0

Original page content provided by Stack Overflow; translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/56884877
