Someone picked my brain the other day looking for a technique to compress language files.
After walking away to think about it, my method was to reassign the character codes so that the letters are numbered in order of their frequency, followed by the most common words in order of their frequency.
Lowercase e is normally stored as an ASCII value using 1 byte:
ASCII e = 0x65 = 0b1100101 = 7 bits
vs
APK e = 0x1 = 0b1 = 1 bit
So this method stores an e in 1 bit. This is similar to Huffman coding, with the addition of whole words being included in the code.
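To make the comparison concrete, here is a minimal Python sketch. The helper name `apk_code` is my own and not part of any library; it simply writes an APK order in binary with no leading zeros, as the table does. It also shows one place where this differs from a Huffman code: Huffman codes are prefix-free, while these codes are not, so a decoder would need some extra way to tell where one code ends and the next begins.

```python
def apk_code(order: int) -> str:
    """Binary digits of an APK order, no leading zeros (hypothetical helper)."""
    return format(order, "b")

ascii_e = ord("e")
print(f"ASCII e = {ascii_e:#x} = {ascii_e:#b} ({ascii_e.bit_length()} significant bits)")
print(f"APK   e = {apk_code(1)} ({len(apk_code(1))} bit)")

# Orders taken from the table: e = 1, a = 3.
ee = apk_code(1) + apk_code(1)   # the two letters "e", "e"
a = apk_code(3)                  # the single letter "a"
print(ee, a, ee == a)            # both are "11", so the raw bit string is ambiguous
```

A real file format built on this idea would probably need a length marker or separator per code, and that overhead would eat into the savings.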
For example:
“because” is the 94th most used word in the English language, and in this method it is stored in 7 bits.
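The arithmetic behind that example can be sketched in a few lines of Python. The function name `word_entry` is made up for illustration. It assumes, as in the table, that orders 0 to 26 are taken by the space and the 26 letters, so the word with frequency rank 1 gets order 27.

```python
N_LETTER_CODES = 27  # orders 0..26: space plus the 26 letters

def word_entry(rank: int) -> dict:
    """APK order, binary, hex and bit count for a word of the given frequency rank."""
    order = N_LETTER_CODES - 1 + rank   # rank 1 ("the") -> order 27
    return {
        "order": order,
        "bin": format(order, "b"),
        "hex": format(order, "X"),
        "bits": order.bit_length(),
    }

print(word_entry(94))
# {'order': 120, 'bin': '1111000', 'hex': '78', 'bits': 7}

# Which word ranks fit in 5, 6 and 7 bits?
for bits in (5, 6, 7):
    ranks = [r for r in range(1, 101) if word_entry(r)["bits"] == bits]
    print(bits, "bits: ranks", ranks[0], "to", ranks[-1])
```

With this layout only the top 5 words fit in 5 bits, ranks 6 to 37 take 6 bits, and ranks 38 to 100 take 7 bits.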
I don’t know if this has been done before, but I would imagine it could compress language files substantially.
I have also thought about a third addition: codes for the most commonly used two- or three-letter combinations.
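One way to find candidate letter combinations is simply to count them in a body of text. The sketch below is only an illustration: `ngram_counts` is a hypothetical helper, and the sample sentence is made up. A real ranking would need a large corpus.

```python
from collections import Counter

def ngram_counts(text: str, n: int) -> Counter:
    """Count n-letter combinations inside each word (never across spaces)."""
    counts = Counter()
    for word in text.lower().split():
        for i in range(len(word) - n + 1):
            counts[word[i:i + n]] += 1
    return counts

sample = "the other day they thought that this thing was there"  # made-up sample text

print(ngram_counts(sample, 2).most_common(5))   # most frequent 2-letter combinations
print(ngram_counts(sample, 3).most_common(5))   # most frequent 3-letter combinations
```

The top entries from a count like this could then be slotted into the code table after the letters, in the same way as the whole words.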
| APK ORDER | APK | LET FREQ | WORD FREQ | APK BIN | APK HEX | APK BITS USED |
|---|---|---|---|---|---|---|
| 0 | space | | | 0 | 0 | 1 |
| 1 | e | 12.70% | | 1 | 1 | 1 |
| 2 | t | 9.06% | | 10 | 2 | 2 |
| 3 | a | 8.17% | | 11 | 3 | 2 |
| 4 | o | 7.51% | | 100 | 4 | 3 |
| 5 | i | 6.97% | | 101 | 5 | 3 |
| 6 | n | 6.75% | | 110 | 6 | 3 |
| 7 | s | 6.33% | | 111 | 7 | 3 |
| 8 | h | 6.09% | | 1000 | 8 | 4 |
| 9 | r | 5.99% | | 1001 | 9 | 4 |
| 10 | d | 4.25% | | 1010 | A | 4 |
| 11 | l | 4.03% | | 1011 | B | 4 |
| 12 | c | 2.78% | | 1100 | C | 4 |
| 13 | u | 2.76% | | 1101 | D | 4 |
| 14 | m | 2.41% | | 1110 | E | 4 |
| 15 | w | 2.36% | | 1111 | F | 4 |
| 16 | f | 2.23% | | 10000 | 10 | 5 |
| 17 | g | 2.02% | | 10001 | 11 | 5 |
| 18 | y | 1.97% | | 10010 | 12 | 5 |
| 19 | p | 1.93% | | 10011 | 13 | 5 |
| 20 | b | 1.49% | | 10100 | 14 | 5 |
| 21 | v | 0.98% | | 10101 | 15 | 5 |
| 22 | k | 0.77% | | 10110 | 16 | 5 |
| 23 | j | 0.15% | | 10111 | 17 | 5 |
| 24 | x | 0.15% | | 11000 | 18 | 5 |
| 25 | q | 0.10% | | 11001 | 19 | 5 |
| 26 | z | 0.07% | | 11010 | 1A | 5 |
| 27 | the | | 1 | 11011 | 1B | 5 |
| 28 | be | | 2 | 11100 | 1C | 5 |
| 29 | to | | 3 | 11101 | 1D | 5 |
| 30 | of | | 4 | 11110 | 1E | 5 |
| 31 | and | | 5 | 11111 | 1F | 5 |
| 32 | a | | 6 | 100000 | 20 | 6 |
| 33 | in | | 7 | 100001 | 21 | 6 |
| 34 | that | | 8 | 100010 | 22 | 6 |
| 35 | have | | 9 | 100011 | 23 | 6 |
| 36 | I | | 10 | 100100 | 24 | 6 |
| 37 | it | | 11 | 100101 | 25 | 6 |
| 38 | for | | 12 | 100110 | 26 | 6 |
| 39 | not | | 13 | 100111 | 27 | 6 |
| 40 | on | | 14 | 101000 | 28 | 6 |
| 41 | with | | 15 | 101001 | 29 | 6 |
| 42 | he | | 16 | 101010 | 2A | 6 |
| 43 | as | | 17 | 101011 | 2B | 6 |
| 44 | you | | 18 | 101100 | 2C | 6 |
| 45 | do | | 19 | 101101 | 2D | 6 |
| 46 | at | | 20 | 101110 | 2E | 6 |
| 47 | this | | 21 | 101111 | 2F | 6 |
| 48 | but | | 22 | 110000 | 30 | 6 |
| 49 | his | | 23 | 110001 | 31 | 6 |
| 50 | by | | 24 | 110010 | 32 | 6 |
| 51 | from | | 25 | 110011 | 33 | 6 |
| 52 | they | | 26 | 110100 | 34 | 6 |
| 53 | we | | 27 | 110101 | 35 | 6 |
| 54 | say | | 28 | 110110 | 36 | 6 |
| 55 | her | | 29 | 110111 | 37 | 6 |
| 56 | she | | 30 | 111000 | 38 | 6 |
| 57 | or | | 31 | 111001 | 39 | 6 |
| 58 | an | | 32 | 111010 | 3A | 6 |
| 59 | will | | 33 | 111011 | 3B | 6 |
| 60 | my | | 34 | 111100 | 3C | 6 |
| 61 | one | | 35 | 111101 | 3D | 6 |
| 62 | all | | 36 | 111110 | 3E | 6 |
| 63 | would | | 37 | 111111 | 3F | 6 |
| 64 | there | | 38 | 1000000 | 40 | 7 |
| 65 | their | | 39 | 1000001 | 41 | 7 |
| 66 | what | | 40 | 1000010 | 42 | 7 |
| 67 | so | | 41 | 1000011 | 43 | 7 |
| 68 | up | | 42 | 1000100 | 44 | 7 |
| 69 | out | | 43 | 1000101 | 45 | 7 |
| 70 | if | | 44 | 1000110 | 46 | 7 |
| 71 | about | | 45 | 1000111 | 47 | 7 |
| 72 | who | | 46 | 1001000 | 48 | 7 |
| 73 | get | | 47 | 1001001 | 49 | 7 |
| 74 | which | | 48 | 1001010 | 4A | 7 |
| 75 | go | | 49 | 1001011 | 4B | 7 |
| 76 | me | | 50 | 1001100 | 4C | 7 |
| 77 | when | | 51 | 1001101 | 4D | 7 |
| 78 | make | | 52 | 1001110 | 4E | 7 |
| 79 | can | | 53 | 1001111 | 4F | 7 |
| 80 | like | | 54 | 1010000 | 50 | 7 |
| 81 | time | | 55 | 1010001 | 51 | 7 |
| 82 | no | | 56 | 1010010 | 52 | 7 |
| 83 | just | | 57 | 1010011 | 53 | 7 |
| 84 | him | | 58 | 1010100 | 54 | 7 |
| 85 | know | | 59 | 1010101 | 55 | 7 |
| 86 | take | | 60 | 1010110 | 56 | 7 |
| 87 | people | | 61 | 1010111 | 57 | 7 |
| 88 | into | | 62 | 1011000 | 58 | 7 |
| 89 | year | | 63 | 1011001 | 59 | 7 |
| 90 | your | | 64 | 1011010 | 5A | 7 |
| 91 | good | | 65 | 1011011 | 5B | 7 |
| 92 | some | | 66 | 1011100 | 5C | 7 |
| 93 | could | | 67 | 1011101 | 5D | 7 |
| 94 | them | | 68 | 1011110 | 5E | 7 |
| 95 | see | | 69 | 1011111 | 5F | 7 |
| 96 | other | | 70 | 1100000 | 60 | 7 |
| 97 | than | | 71 | 1100001 | 61 | 7 |
| 98 | then | | 72 | 1100010 | 62 | 7 |
| 99 | now | | 73 | 1100011 | 63 | 7 |
| 100 | look | | 74 | 1100100 | 64 | 7 |
| 101 | only | | 75 | 1100101 | 65 | 7 |
| 102 | come | | 76 | 1100110 | 66 | 7 |
| 103 | its | | 77 | 1100111 | 67 | 7 |
| 104 | over | | 78 | 1101000 | 68 | 7 |
| 105 | think | | 79 | 1101001 | 69 | 7 |
| 106 | also | | 80 | 1101010 | 6A | 7 |
| 107 | back | | 81 | 1101011 | 6B | 7 |
| 108 | after | | 82 | 1101100 | 6C | 7 |
| 109 | use | | 83 | 1101101 | 6D | 7 |
| 110 | two | | 84 | 1101110 | 6E | 7 |
| 111 | how | | 85 | 1101111 | 6F | 7 |
| 112 | our | | 86 | 1110000 | 70 | 7 |
| 113 | work | | 87 | 1110001 | 71 | 7 |
| 114 | first | | 88 | 1110010 | 72 | 7 |
| 115 | well | | 89 | 1110011 | 73 | 7 |
| 116 | way | | 90 | 1110100 | 74 | 7 |
| 117 | even | | 91 | 1110101 | 75 | 7 |
| 118 | new | | 92 | 1110110 | 76 | 7 |
| 119 | want | | 93 | 1110111 | 77 | 7 |
| 120 | because | | 94 | 1111000 | 78 | 7 |
| 121 | any | | 95 | 1111001 | 79 | 7 |
| 122 | these | | 96 | 1111010 | 7A | 7 |
| 123 | give | | 97 | 1111011 | 7B | 7 |
| 124 | day | | 98 | 1111100 | 7C | 7 |
| 125 | most | | 99 | 1111101 | 7D | 7 |
| 126 | us | | 100 | 1111110 | 7E | 7 |
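To get a rough feel for the savings, here is a small Python sketch that encodes a sentence using the letter orders and the first 37 word orders from the table. The names (`LETTER_ORDER`, `WORD_ORDER`, `encode`) and the sample sentence are my own for illustration. It only handles lowercase letters and spaces, picks the word code only when it is shorter than spelling the word out, and counts only the code bits themselves. It ignores whatever a decoder would need in order to find the boundaries between codes, so the real saving would be smaller.

```python
LETTERS = " etaoinshrdlcumwfgypbvkjxqz"          # APK orders 0..26
WORDS = ("the be to of and a in that have I it for not on with he as you do at "
         "this but his by from they we say her she or an will my one all would").split()

LETTER_ORDER = {ch: i for i, ch in enumerate(LETTERS)}
WORD_ORDER = {w: len(LETTERS) + i for i, w in enumerate(WORDS)}   # orders 27..63

def code(order: int) -> str:
    """Binary digits of an order, no leading zeros (order 0 is '0')."""
    return format(order, "b")

def encode(text: str) -> list:
    """Return (symbol, bit string) pairs; raises KeyError on unsupported characters."""
    out = []
    for i, word in enumerate(text.split(" ")):
        if i:
            out.append((" ", code(LETTER_ORDER[" "])))
        spelled = [(ch, code(LETTER_ORDER[ch])) for ch in word]
        spelled_bits = sum(len(bits) for _, bits in spelled)
        if word in WORD_ORDER and len(code(WORD_ORDER[word])) < spelled_bits:
            out.append((word, code(WORD_ORDER[word])))   # whole-word code is shorter
        else:
            out.extend(spelled)                          # fall back to letter codes
    return out

sample = "the cat and the hat"
encoded = encode(sample)
apk_bits = sum(len(bits) for _, bits in encoded)
ascii_bits = 8 * len(sample)
print(encoded)
print(f"APK: {apk_bits} bits, ASCII: {ascii_bits} bits")   # APK: 35 bits, ASCII: 152 bits
```

One side effect worth noticing: for a one-letter word like "a", the letter code (2 bits) is shorter than the word code (6 bits), which is why the sketch compares the two before choosing.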