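As we all know, ASCII uses 7 bits to encode a character, so I expected the number of bytes used to represent a text to always be smaller than the number of letters in it.
For example:

    StringBuilder text = new StringBuilder();
    IntStream.range(0, 160).forEach(x -> text.append("a")); // build a 160-character string
    int letters = text.length();
    int bytes = text.toString().getBytes(StandardCharsets.US_ASCII).length;
    System.out.println(letters); // expected 160, actual 160
    System.out.println(bytes); // expected 140, actual 160

It is always letters = bytes, but I expected letters > bytes.
The main problem: in the SMPP protocol the SMS body must be <= 140 bytes. With 7-bit ASCII encoding that would allow 160 letters (= 140*8/7), so I would like to encode the text as 7-bit based ASCII. We are using the JSMPP library.
Can anyone explain this and point me in the right direction? Thanks.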
Posted on 2019-07-05 06:04:46
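Here is a quick and dirty solution without any libraries, i.e. using only what the JRE provides. It is not optimised for efficiency and it does not check whether the message really is US-ASCII; it simply assumes so. It is only a proof of concept: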
    package de.scrum_master.stackoverflow;

    import java.util.Arrays;
    import java.util.BitSet;

    public class ASCIIConverter {
        // Packs the low 7 bits of each character back to back, least significant bit first.
        public byte[] compress(String message) {
            BitSet bits = new BitSet(message.length() * 7);
            int currentBit = 0;
            for (char character : message.toCharArray()) {
                for (int bitInCharacter = 0; bitInCharacter < 7; bitInCharacter++) {
                    if ((character & 1 << bitInCharacter) > 0)
                        bits.set(currentBit);
                    currentBit++;
                }
            }
            // BitSet.toByteArray() drops trailing zero bytes, so pad back to ceil(7n / 8) bytes.
            return Arrays.copyOf(bits.toByteArray(), (message.length() * 7 + 7) / 8);
        }

        // Reads the bits back in groups of 7.
        // Caveat: a payload of 7, 14, 21, ... bytes is ambiguous (e.g. 7 or 8 characters);
        // it is decoded as the longer variant, so such a message gains a trailing NUL.
        public String decompress(byte[] compressedMessage) {
            BitSet bits = BitSet.valueOf(compressedMessage);
            int numCharacters = 8 * compressedMessage.length / 7;
            StringBuilder decompressedMessage = new StringBuilder(numCharacters);
            for (int i = 0; i < numCharacters; i++) {
                int character = 0;
                for (int bitInCharacter = 0; bitInCharacter < 7; bitInCharacter++) {
                    if (bits.get(i * 7 + bitInCharacter))
                        character |= 1 << bitInCharacter;
                }
                decompressedMessage.append((char) character);
            }
            return decompressedMessage.toString();
        }

        public static void main(String[] args) {
            String[] messages = {
                "Hello world!",
                "This is my message.\n\tAnd this is indented!",
                " !\"#$%&'()*+,-./0123456789:;<=>?\n"
                    + "@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_\n"
                    + "`abcdefghijklmnopqrstuvwxyz{|}~",
                "1234567890123456789012345678901234567890"
                    + "1234567890123456789012345678901234567890"
                    + "1234567890123456789012345678901234567890"
                    + "1234567890123456789012345678901234567890"
            };

            ASCIIConverter asciiConverter = new ASCIIConverter();
            for (String message : messages) {
                System.out.println(message);
                System.out.println("--------------------------------");
                byte[] compressedMessage = asciiConverter.compress(message);
                System.out.println("Number of ASCII characters = " + message.length());
                System.out.println("Number of compressed bytes = " + compressedMessage.length);
                System.out.println("--------------------------------");
                System.out.println(asciiConverter.decompress(compressedMessage));
                System.out.println("\n");
            }
        }
    }

The console log looks like this:

    Hello world!
    --------------------------------
    Number of ASCII characters = 12
    Number of compressed bytes = 11
    --------------------------------
    Hello world!


    This is my message.
            And this is indented!
    --------------------------------
    Number of ASCII characters = 42
    Number of compressed bytes = 37
    --------------------------------
    This is my message.
            And this is indented!


     !"#$%&'()*+,-./0123456789:;<=>?
    @ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_
    `abcdefghijklmnopqrstuvwxyz{|}~
    --------------------------------
    Number of ASCII characters = 97
    Number of compressed bytes = 85
    --------------------------------
     !"#$%&'()*+,-./0123456789:;<=>?
    @ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_
    `abcdefghijklmnopqrstuvwxyz{|}~


    1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890
    --------------------------------
    Number of ASCII characters = 160
    Number of compressed bytes = 140
    --------------------------------
    1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890

Posted on 2019-07-04 09:52:19
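(160*8 - 160*7)/8 = 20, so you expect to end up using 20 fewer bytes. However, a byte is the smallest unit a value can occupy here: even when not all of its bits are used, the leftover bit is not joined onto the next value, so every ASCII code still takes a full 8-bit byte. That is why you get the same number. For example, lowercase "a" is 97 in ASCII:

    01100001

Note that the leading zero is still there even though it is not used. getBytes will not use it to store part of another character.
In conclusion, with pure ASCII the number of letters must always equal the number of bytes.
(Or imagine putting objects of size 7 into boxes of size 8. You cannot split the objects into pieces, so the number of boxes must equal the number of objects, at least in this case.)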
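
The point above can be checked with a small sketch. The class and method names (`AsciiByteLayout`, `toBits`, `packedLength`) are made up for illustration and are not part of the original answer. It prints the full byte that US-ASCII produces for "a" and recomputes the 20-byte difference that would only appear if the 7-bit codes were packed back to back.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
class AsciiByteLayout {
    // Renders one byte as eight binary digits, keeping leading zeros.
    static String toBits(byte b) {
        return String.format("%8s", Integer.toBinaryString(b & 0xFF)).replace(' ', '0');
    }

    // Bytes needed if n 7-bit codes were packed back to back (rounded up).
    static int packedLength(int n) {
        return (n * 7 + 7) / 8;
    }

    public static void main(String[] args) {
        byte[] a = "a".getBytes(StandardCharsets.US_ASCII);
        // One whole byte per character, with the top bit left at zero.
        System.out.println(a.length + " byte: " + toBits(a[0])); // 1 byte: 01100001

        int letters = 160;
        // Difference between one byte per letter and 7-bit packing.
        System.out.println(letters - packedLength(letters)); // 20
    }
}
```

So the saving exists only if something explicitly repacks the bits; the charset encoder itself never does.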
Posted on 2019-07-04 09:50:49
The byte length varies with the encoding. See the example below.

    String text = "0123456789";
    byte[] b1 = text.getBytes(StandardCharsets.US_ASCII);
    System.out.println(b1.length);
    // prints "10"
    byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
    System.out.println(utf8.length);
    // prints "10"
    byte[] utf16 = text.getBytes(StandardCharsets.UTF_16);
    System.out.println(utf16.length);
    // prints "22" (a 2-byte byte-order mark plus 2 bytes per character)
    byte[] latin1 = text.getBytes(StandardCharsets.ISO_8859_1);
    System.out.println(latin1.length);
    // prints "10"

https://stackoverflow.com/questions/56884877
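
As a complement to the encoding comparison above, here is a minimal sketch of what happens once the text is no longer pure ASCII. It may be useful because the 7-bit packing approach in this thread assumes US-ASCII input without checking it. The class name `EncodingLengths` and the helper `isAscii` are assumptions introduced for this example; the charset calls are standard JDK methods.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
class EncodingLengths {
    // True when every character of the text is representable in US-ASCII.
    static boolean isAscii(String text) {
        return StandardCharsets.US_ASCII.newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        String text = "\u00e9"; // e with acute accent, not part of US-ASCII
        System.out.println(text.getBytes(StandardCharsets.ISO_8859_1).length); // 1
        System.out.println(text.getBytes(StandardCharsets.UTF_8).length);      // 2
        // getBytes silently substitutes '?' (63) for characters US-ASCII cannot represent.
        System.out.println(text.getBytes(StandardCharsets.US_ASCII)[0]);       // 63
        System.out.println(isAscii(text));          // false
        System.out.println(isAscii("0123456789"));  // true
    }
}
```

A check like `isAscii` before packing would avoid silently sending question marks in place of unsupported characters.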