Someone picked my brain the other day looking for a technique to compress language files.
After walking away to think about it… my method was to re-order the character codes so that the letters are assigned by their frequency, followed by the most common words by their frequency.
For instance, a lowercase e is stored in ASCII as a 1-byte value:
ASCII e = 0x65 = 0b1100101 = 7 bits
vs
APK e = 0x1 = 1 bit
… so this method stores an e in 1 bit. This is similar to Huffman coding, with the addition of whole words being included in the code.
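Here is a minimal Python sketch of the letter part of the idea, assuming the letter order from the table at the end of this post. The names `APK_ORDER`, `apk_code`, and `bits_used` are my own placeholders for illustration, and "bits used" here just means the length of the code written in binary with no leading zeros.

```python
# Space first, then letters from most to least frequent (order taken from the table).
APK_ORDER = [" "] + list("etaoinshrdlcumwfgypbvkjxqz")

def apk_code(symbol: str) -> int:
    """Return the APK code of a symbol: its position in the frequency order."""
    return APK_ORDER.index(symbol)

def bits_used(code: int) -> int:
    """Bits needed to write the code in binary without leading zeros (minimum 1)."""
    return max(1, code.bit_length())

for ch in "etz":
    code = apk_code(ch)
    print(f"{ch!r}: ASCII {ord(ch):#x} = {ord(ch):07b} | "
          f"APK {code:#x} = {code:b} ({bits_used(code)} bits)")
```

The loop prints the ASCII value next to the APK value for a very common letter, a common letter, and the rarest letter, so the spread from 1 bit to 5 bits is visible.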
For example:
“because” is the 94th most used word in the English language, and in this method it is stored in 7 bits (APK order 120 = 0x78 = 0b1111000) instead of the 7 bytes it takes as ASCII text.
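The arithmetic behind that example can be sketched as follows. It assumes the word codes start right after the space and the 26 letters, as in the table at the end of this post; `word_apk_order` and `NUM_SINGLE_SYMBOLS` are hypothetical names used only for this illustration.

```python
NUM_SINGLE_SYMBOLS = 27  # the space plus 26 letters occupy APK orders 0..26

def word_apk_order(word_rank: int) -> int:
    """Map a 1-based word-frequency rank to its APK order (rank 1, "the", becomes 27)."""
    return NUM_SINGLE_SYMBOLS - 1 + word_rank

def bits_used(code: int) -> int:
    """Bits needed to write the code in binary without leading zeros (minimum 1)."""
    return max(1, code.bit_length())

order = word_apk_order(94)        # "because" is word rank 94
ascii_bits = len("because") * 8   # one byte per character in ASCII
print(f"APK order {order} = {order:#x} = {order:b}, "
      f"{bits_used(order)} bits vs {ascii_bits} ASCII bits")
# prints: APK order 120 = 0x78 = 1111000, 7 bits vs 56 ASCII bits
```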
I don’t know if this has been done before… but I would imagine it could compress language files substantially.
I have also thought about a third addition: codes for the most commonly used two- or three-letter combinations.
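To find candidates for that third addition, one option would be to count letter combinations in a sample of text. The sketch below is only an assumption about how that could be done; `top_ngrams` is a hypothetical helper, and the sample sentence is made up for illustration rather than taken from any real frequency study.

```python
from collections import Counter

def top_ngrams(text, n, k):
    """Return the k most common n-letter combinations found inside words."""
    counts = Counter()
    for word in text.lower().split():
        # Keep letters only, so punctuation does not end up inside a combination.
        letters = "".join(ch for ch in word if ch.isalpha())
        for i in range(len(letters) - n + 1):
            counts[letters[i:i + n]] += 1
    return counts.most_common(k)

sample = "the other day there was another thing they thought about"
print(top_ngrams(sample, 2, 3))  # most common two-letter combinations
print(top_ngrams(sample, 3, 3))  # most common three-letter combinations
```

Run over a large body of text, the top results would be the combinations to append to the code table after the words.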
APK ORDER | APK | LET FREQ | WORD FREQ | APK BIN | APK HEX | APK BITS USED |
0 | space | | | 0 | 0 | 1 |
1 | e | 12.70% | | 1 | 1 | 1 |
2 | t | 9.06% | | 10 | 2 | 2 |
3 | a | 8.17% | | 11 | 3 | 2 |
4 | o | 7.51% | | 100 | 4 | 3 |
5 | i | 6.97% | | 101 | 5 | 3 |
6 | n | 6.75% | | 110 | 6 | 3 |
7 | s | 6.33% | | 111 | 7 | 3 |
8 | h | 6.09% | | 1000 | 8 | 4 |
9 | r | 5.99% | | 1001 | 9 | 4 |
10 | d | 4.25% | | 1010 | A | 4 |
11 | l | 4.03% | | 1011 | B | 4 |
12 | c | 2.78% | | 1100 | C | 4 |
13 | u | 2.76% | | 1101 | D | 4 |
14 | m | 2.41% | | 1110 | E | 4 |
15 | w | 2.36% | | 1111 | F | 4 |
16 | f | 2.23% | | 10000 | 10 | 5 |
17 | g | 2.02% | | 10001 | 11 | 5 |
18 | y | 1.97% | | 10010 | 12 | 5 |
19 | p | 1.93% | | 10011 | 13 | 5 |
20 | b | 1.49% | | 10100 | 14 | 5 |
21 | v | 0.98% | | 10101 | 15 | 5 |
22 | k | 0.77% | | 10110 | 16 | 5 |
23 | j | 0.15% | | 10111 | 17 | 5 |
24 | x | 0.15% | | 11000 | 18 | 5 |
25 | q | 0.10% | | 11001 | 19 | 5 |
26 | z | 0.07% | | 11010 | 1A | 5 |
27 | the | | 1 | 11011 | 1B | 5 |
28 | be | | 2 | 11100 | 1C | 5 |
29 | to | | 3 | 11101 | 1D | 5 |
30 | of | | 4 | 11110 | 1E | 5 |
31 | and | | 5 | 11111 | 1F | 5 |
32 | a | | 6 | 100000 | 20 | 6 |
33 | in | | 7 | 100001 | 21 | 6 |
34 | that | | 8 | 100010 | 22 | 6 |
35 | have | | 9 | 100011 | 23 | 6 |
36 | I | | 10 | 100100 | 24 | 6 |
37 | it | | 11 | 100101 | 25 | 6 |
38 | for | | 12 | 100110 | 26 | 6 |
39 | not | | 13 | 100111 | 27 | 6 |
40 | on | | 14 | 101000 | 28 | 6 |
41 | with | | 15 | 101001 | 29 | 6 |
42 | he | | 16 | 101010 | 2A | 6 |
43 | as | | 17 | 101011 | 2B | 6 |
44 | you | | 18 | 101100 | 2C | 6 |
45 | do | | 19 | 101101 | 2D | 6 |
46 | at | | 20 | 101110 | 2E | 6 |
47 | this | | 21 | 101111 | 2F | 6 |
48 | but | | 22 | 110000 | 30 | 6 |
49 | his | | 23 | 110001 | 31 | 6 |
50 | by | | 24 | 110010 | 32 | 6 |
51 | from | | 25 | 110011 | 33 | 6 |
52 | they | | 26 | 110100 | 34 | 6 |
53 | we | | 27 | 110101 | 35 | 6 |
54 | say | | 28 | 110110 | 36 | 6 |
55 | her | | 29 | 110111 | 37 | 6 |
56 | she | | 30 | 111000 | 38 | 6 |
57 | or | | 31 | 111001 | 39 | 6 |
58 | an | | 32 | 111010 | 3A | 6 |
59 | will | | 33 | 111011 | 3B | 6 |
60 | my | | 34 | 111100 | 3C | 6 |
61 | one | | 35 | 111101 | 3D | 6 |
62 | all | | 36 | 111110 | 3E | 6 |
63 | would | | 37 | 111111 | 3F | 6 |
64 | there | | 38 | 1000000 | 40 | 7 |
65 | their | | 39 | 1000001 | 41 | 7 |
66 | what | | 40 | 1000010 | 42 | 7 |
67 | so | | 41 | 1000011 | 43 | 7 |
68 | up | | 42 | 1000100 | 44 | 7 |
69 | out | | 43 | 1000101 | 45 | 7 |
70 | if | | 44 | 1000110 | 46 | 7 |
71 | about | | 45 | 1000111 | 47 | 7 |
72 | who | | 46 | 1001000 | 48 | 7 |
73 | get | | 47 | 1001001 | 49 | 7 |
74 | which | | 48 | 1001010 | 4A | 7 |
75 | go | | 49 | 1001011 | 4B | 7 |
76 | me | | 50 | 1001100 | 4C | 7 |
77 | when | | 51 | 1001101 | 4D | 7 |
78 | make | | 52 | 1001110 | 4E | 7 |
79 | can | | 53 | 1001111 | 4F | 7 |
80 | like | | 54 | 1010000 | 50 | 7 |
81 | time | | 55 | 1010001 | 51 | 7 |
82 | no | | 56 | 1010010 | 52 | 7 |
83 | just | | 57 | 1010011 | 53 | 7 |
84 | him | | 58 | 1010100 | 54 | 7 |
85 | know | | 59 | 1010101 | 55 | 7 |
86 | take | | 60 | 1010110 | 56 | 7 |
87 | people | | 61 | 1010111 | 57 | 7 |
88 | into | | 62 | 1011000 | 58 | 7 |
89 | year | | 63 | 1011001 | 59 | 7 |
90 | your | | 64 | 1011010 | 5A | 7 |
91 | good | | 65 | 1011011 | 5B | 7 |
92 | some | | 66 | 1011100 | 5C | 7 |
93 | could | | 67 | 1011101 | 5D | 7 |
94 | them | | 68 | 1011110 | 5E | 7 |
95 | see | | 69 | 1011111 | 5F | 7 |
96 | other | | 70 | 1100000 | 60 | 7 |
97 | than | | 71 | 1100001 | 61 | 7 |
98 | then | | 72 | 1100010 | 62 | 7 |
99 | now | | 73 | 1100011 | 63 | 7 |
100 | look | | 74 | 1100100 | 64 | 7 |
101 | only | | 75 | 1100101 | 65 | 7 |
102 | come | | 76 | 1100110 | 66 | 7 |
103 | its | | 77 | 1100111 | 67 | 7 |
104 | over | | 78 | 1101000 | 68 | 7 |
105 | think | | 79 | 1101001 | 69 | 7 |
106 | also | | 80 | 1101010 | 6A | 7 |
107 | back | | 81 | 1101011 | 6B | 7 |
108 | after | | 82 | 1101100 | 6C | 7 |
109 | use | | 83 | 1101101 | 6D | 7 |
110 | two | | 84 | 1101110 | 6E | 7 |
111 | how | | 85 | 1101111 | 6F | 7 |
112 | our | | 86 | 1110000 | 70 | 7 |
113 | work | | 87 | 1110001 | 71 | 7 |
114 | first | | 88 | 1110010 | 72 | 7 |
115 | well | | 89 | 1110011 | 73 | 7 |
116 | way | | 90 | 1110100 | 74 | 7 |
117 | even | | 91 | 1110101 | 75 | 7 |
118 | new | | 92 | 1110110 | 76 | 7 |
119 | want | | 93 | 1110111 | 77 | 7 |
120 | because | | 94 | 1111000 | 78 | 7 |
121 | any | | 95 | 1111001 | 79 | 7 |
122 | these | | 96 | 1111010 | 7A | 7 |
123 | give | | 97 | 1111011 | 7B | 7 |
124 | day | | 98 | 1111100 | 7C | 7 |
125 | most | | 99 | 1111101 | 7D | 7 |
126 | us | | 100 | 1111110 | 7E | 7 |
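To get a rough feel for the savings, here is a small Python sketch that adds up the APK bits for a piece of text using the table above. It makes several assumptions: only the first nine words are included (the list is truncated for brevity), the input contains only lowercase letters and single spaces, and when a word has its own code the cheaper of "word code" and "spelled out letter by letter" is used. The names `LETTER_CODES`, `WORD_CODES`, `token_bits`, and `apk_bits` are placeholders of mine. It counts code bits only. Since the codes have different lengths and one code can be the start of another (`1` for e, `10` for t), a real file format would also need some way to mark where each code ends, which this sketch does not attempt.

```python
import re

# APK orders 0..26: the space, then letters by frequency (from the table).
LETTER_CODES = {ch: order for order, ch in enumerate(" etaoinshrdlcumwfgypbvkjxqz")}
# APK orders 27..35: the first nine words from the table (truncated for brevity).
WORD_CODES = {word: 27 + i for i, word in enumerate(
    ["the", "be", "to", "of", "and", "a", "in", "that", "have"])}

def bits_used(code: int) -> int:
    """Bits needed to write the code in binary without leading zeros (minimum 1)."""
    return max(1, code.bit_length())

def token_bits(token: str) -> int:
    """Bits for one token: its word code if that is cheaper, else letter by letter."""
    spelled = sum(bits_used(LETTER_CODES[ch]) for ch in token)
    if token in WORD_CODES:
        return min(spelled, bits_used(WORD_CODES[token]))
    return spelled

def apk_bits(text: str) -> int:
    """Total code bits for text made of lowercase letters and spaces."""
    tokens = re.findall(r"[a-z]+| ", text)  # runs of letters, or a single space
    return sum(token_bits(t) for t in tokens)

text = "the cat and the hat"
print(apk_bits(text), "APK bits vs", len(text) * 8, "ASCII bits")
# prints: 35 APK bits vs 152 ASCII bits
```

One thing this shows is that the word "a" is cheaper as the letter a (2 bits) than as word number 6 (6 bits), which is why the sketch takes the minimum of the two.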