On Compressing the English Language

Someone picked my brain the other day looking for a technique to compress language files.

After walking away to think about it… my method was to renumber the codes so that the letters, followed by the most common words, are listed in order of frequency. The most frequent symbols get the smallest numbers and therefore the shortest codes.
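As a rough sketch of the idea (not a finished file format), the snippet below numbers a few symbols in frequency order and writes each number in binary with no leading zeros. The `SYMBOLS` list and the `build_codes` name are made up for this illustration, and only the first eight symbols are included.

```python
# Symbols listed most frequent first (only the first eight, for illustration).
SYMBOLS = [" ", "e", "t", "a", "o", "i", "n", "s"]


def build_codes(symbols):
    """Map each symbol to its position in the list, written in binary without padding."""
    return {sym: format(order, "b") for order, sym in enumerate(symbols)}


codes = build_codes(SYMBOLS)
for sym, code in codes.items():
    # Show the symbol, its code, and how many bits the code uses.
    print(repr(sym), code, len(code))
```

The more frequent a symbol is, the earlier it sits in the list and the fewer bits its number needs.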

Where lowercase e is stored as an ASCII value using 1 byte:
ASCII e = 0x65 = 0b1100101 = 7 bits
vs
APK e = 0x1 = 0b1 = 1 bit
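A quick way to check the two values in Python. Keep in mind that ASCII defines 7-bit values, but files normally spend a full 8-bit byte on each character.

```python
# ASCII value of lowercase e and the number of significant bits it needs.
ascii_e = ord("e")
print(hex(ascii_e), bin(ascii_e), ascii_e.bit_length())  # 0x65 0b1100101 7

# Value assigned to e in the frequency-ordered scheme.
apk_e = 1
print(hex(apk_e), bin(apk_e), apk_e.bit_length())  # 0x1 0b1 1
```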

… this method stores an e in 1 bit. This is similar to Huffman coding, with the addition of whole words being included in the code.
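One difference from Huffman is worth spelling out. A Huffman code is prefix-free (no code is the start of another), whereas here the code for e (1) is also the start of the code for t (10), so a decoder would need some extra way to tell where each code ends. The sketch below builds a Huffman code for the eight most frequent letters and checks both codes for that property. The function names are hypothetical, chosen only for this example.

```python
import heapq
from itertools import count

# Frequencies (percent) and codes for the eight most frequent letters in the table.
FREQS = {"e": 12.70, "t": 9.06, "a": 8.17, "o": 7.51,
         "i": 6.97, "n": 6.75, "s": 6.33, "h": 6.09}
APK = {"e": "1", "t": "10", "a": "11", "o": "100",
       "i": "101", "n": "110", "s": "111", "h": "1000"}


def huffman_codes(freqs):
    """Build a Huffman code by repeatedly merging the two least frequent groups."""
    tie = count()  # tie-breaker so the heap never has to compare dicts
    heap = [(f, next(tie), {sym: ""}) for sym, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (f1 + f2, next(tie), merged))
    return heap[0][2]


def is_prefix_free(codes):
    """True if no code is the start of another (sorted neighbours are enough to check)."""
    values = sorted(codes.values())
    return all(not b.startswith(a) for a, b in zip(values, values[1:]))


print(is_prefix_free(huffman_codes(FREQS)))  # True
print(is_prefix_free(APK))                   # False
```

The shorter codes in this method come at the price of that ambiguity, so a real encoder would have to add length markers or separators of some kind.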

For example:

“because” is the 94th most used word in the English language, and in this method it is stored in 7 bits.
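To put a number on that, here is a small sketch comparing three ways of storing “because”: as one whole-word code, spelled out letter by letter with the letter codes, and as plain one-byte-per-character ASCII. The `apk_bits` helper is a name invented for this example, and the count ignores whatever extra bits a decoder would need to find the code boundaries.

```python
# Table order: space first, then the letters by frequency.
LETTERS = " etaoinshrdlcumwfgypbvkjxqz"


def apk_bits(order):
    """Bits needed to write a table position in binary (position 0 still takes 1 bit)."""
    return max(1, order.bit_length())


word = "because"
whole_word_bits = apk_bits(120)  # "because" sits at position 120 in the table
letter_bits = sum(apk_bits(LETTERS.index(ch)) for ch in word)  # letter by letter
ascii_bits = 8 * len(word)  # one byte per character

print(whole_word_bits, letter_bits, ascii_bits)  # 7 20 56
```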

I don’t know if this has been done before… but I would imagine it could compress language files substantially.

I have thought about a third addition: codes for the most commonly used two- or three-letter combinations.
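Finding those combinations would just be a counting exercise. A minimal sketch, assuming the combinations are counted inside words only (never across a space) and using a made-up sample sentence rather than a real corpus:

```python
from collections import Counter


def top_ngrams(text, n, k):
    """Return the k most common n-letter combinations found inside the words of text."""
    cleaned = "".join(ch for ch in text.lower() if ch.isalpha() or ch == " ")
    grams = Counter()
    for word in cleaned.split():
        # Slide a window of n letters across each word.
        for i in range(len(word) - n + 1):
            grams[word[i:i + n]] += 1
    return grams.most_common(k)


sample = "the other day they gathered there then the weather"
top = top_ngrams(sample, 2, 3)
print(top)
```

With a large enough body of text, the top entries of such a count could be appended to the table after the words.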

APK ORDER  SYMBOL   LET FREQ  WORD FREQ  APK BIN  APK HEX  APK BITS USED
0          space    -         -          0        0        1
1          e        12.70%    -          1        1        1
2          t        9.06%     -          10       2        2
3          a        8.17%     -          11       3        2
4          o        7.51%     -          100      4        3
5          i        6.97%     -          101      5        3
6          n        6.75%     -          110      6        3
7          s        6.33%     -          111      7        3
8          h        6.09%     -          1000     8        4
9          r        5.99%     -          1001     9        4
10         d        4.25%     -          1010     A        4
11         l        4.03%     -          1011     B        4
12         c        2.78%     -          1100     C        4
13         u        2.76%     -          1101     D        4
14         m        2.41%     -          1110     E        4
15         w        2.36%     -          1111     F        4
16         f        2.23%     -          10000    10       5
17         g        2.02%     -          10001    11       5
18         y        1.97%     -          10010    12       5
19         p        1.93%     -          10011    13       5
20         b        1.49%     -          10100    14       5
21         v        0.98%     -          10101    15       5
22         k        0.77%     -          10110    16       5
23         j        0.15%     -          10111    17       5
24         x        0.15%     -          11000    18       5
25         q        0.10%     -          11001    19       5
26         z        0.07%     -          11010    1A       5
27         the      -         1          11011    1B       5
28         be       -         2          11100    1C       5
29         to       -         3          11101    1D       5
30         of       -         4          11110    1E       5
31         and      -         5          11111    1F       5
32         a        -         6          100000   20       6
33         in       -         7          100001   21       6
34         that     -         8          100010   22       6
35         have     -         9          100011   23       6
36         I        -         10         100100   24       6
37         it       -         11         100101   25       6
38         for      -         12         100110   26       6
39         not      -         13         100111   27       6
40         on       -         14         101000   28       6
41         with     -         15         101001   29       6
42         he       -         16         101010   2A       6
43         as       -         17         101011   2B       6
44         you      -         18         101100   2C       6
45         do       -         19         101101   2D       6
46         at       -         20         101110   2E       6
47         this     -         21         101111   2F       6
48         but      -         22         110000   30       6
49         his      -         23         110001   31       6
50         by       -         24         110010   32       6
51         from     -         25         110011   33       6
52         they     -         26         110100   34       6
53         we       -         27         110101   35       6
54         say      -         28         110110   36       6
55         her      -         29         110111   37       6
56         she      -         30         111000   38       6
57         or       -         31         111001   39       6
58         an       -         32         111010   3A       6
59         will     -         33         111011   3B       6
60         my       -         34         111100   3C       6
61         one      -         35         111101   3D       6
62         all      -         36         111110   3E       6
63         would    -         37         111111   3F       6
64         there    -         38         1000000  40       7
65         their    -         39         1000001  41       7
66         what     -         40         1000010  42       7
67         so       -         41         1000011  43       7
68         up       -         42         1000100  44       7
69         out      -         43         1000101  45       7
70         if       -         44         1000110  46       7
71         about    -         45         1000111  47       7
72         who      -         46         1001000  48       7
73         get      -         47         1001001  49       7
74         which    -         48         1001010  4A       7
75         go       -         49         1001011  4B       7
76         me       -         50         1001100  4C       7
77         when     -         51         1001101  4D       7
78         make     -         52         1001110  4E       7
79         can      -         53         1001111  4F       7
80         like     -         54         1010000  50       7
81         time     -         55         1010001  51       7
82         no       -         56         1010010  52       7
83         just     -         57         1010011  53       7
84         him      -         58         1010100  54       7
85         know     -         59         1010101  55       7
86         take     -         60         1010110  56       7
87         people   -         61         1010111  57       7
88         into     -         62         1011000  58       7
89         year     -         63         1011001  59       7
90         your     -         64         1011010  5A       7
91         good     -         65         1011011  5B       7
92         some     -         66         1011100  5C       7
93         could    -         67         1011101  5D       7
94         them     -         68         1011110  5E       7
95         see      -         69         1011111  5F       7
96         other    -         70         1100000  60       7
97         than     -         71         1100001  61       7
98         then     -         72         1100010  62       7
99         now      -         73         1100011  63       7
100        look     -         74         1100100  64       7
101        only     -         75         1100101  65       7
102        come     -         76         1100110  66       7
103        its      -         77         1100111  67       7
104        over     -         78         1101000  68       7
105        think    -         79         1101001  69       7
106        also     -         80         1101010  6A       7
107        back     -         81         1101011  6B       7
108        after    -         82         1101100  6C       7
109        use      -         83         1101101  6D       7
110        two      -         84         1101110  6E       7
111        how      -         85         1101111  6F       7
112        our      -         86         1110000  70       7
113        work     -         87         1110001  71       7
114        first    -         88         1110010  72       7
115        well     -         89         1110011  73       7
116        way      -         90         1110100  74       7
117        even     -         91         1110101  75       7
118        new      -         92         1110110  76       7
119        want     -         93         1110111  77       7
120        because  -         94         1111000  78       7
121        any      -         95         1111001  79       7
122        these    -         96         1111010  7A       7
123        give     -         97         1111011  7B       7
124        day      -         98         1111100  7C       7
125        most     -         99         1111101  7D       7
126        us       -         100        1111110  7E       7
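As a rough way to gauge the letter codes on their own, the sketch below weights each letter's bit count by its frequency from the table. It is only an estimate: it ignores spaces, whole-word codes, and any extra bits a decoder would need to find where each code ends. The `apk_bits` helper is a name invented for this example.

```python
# Letter frequencies (percent), in the same order as the table (positions 1 to 26).
LETTER_FREQ = [
    ("e", 12.70), ("t", 9.06), ("a", 8.17), ("o", 7.51), ("i", 6.97),
    ("n", 6.75), ("s", 6.33), ("h", 6.09), ("r", 5.99), ("d", 4.25),
    ("l", 4.03), ("c", 2.78), ("u", 2.76), ("m", 2.41), ("w", 2.36),
    ("f", 2.23), ("g", 2.02), ("y", 1.97), ("p", 1.93), ("b", 1.49),
    ("v", 0.98), ("k", 0.77), ("j", 0.15), ("x", 0.15), ("q", 0.10),
    ("z", 0.07),
]


def apk_bits(order):
    """Bits needed to write a table position in binary (position 0 still takes 1 bit)."""
    return max(1, order.bit_length())


total = sum(freq for _, freq in LETTER_FREQ)
weighted = sum(freq * apk_bits(order)
               for order, (_, freq) in enumerate(LETTER_FREQ, start=1))
average_bits = weighted / total

print(f"average bits per letter: {average_bits:.2f}")
```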