How much space do unnecessary citations and replies occupy in Gmail? Let’s analyze it, first using the Big-O notation, and then using concrete numbers.
I’ve been with Codility for over seven years. When I joined there were just eight of us — and now we’re approaching 120+ employees. Through this time, I’ve watched how a startup organically grows and becomes a small company.
One of our traditions that has evolved is sending an introductory email announcing new employees. Initially, an email was sent to each and every one of us, but very quickly an alias [email protected]… was created. Most of us replied to such emails to welcome the new team member, using the function ‘reply to all.’ The result was a long email thread with each message citing all of the previous messages.
Luckily Gmail collapses such citations to a single ellipsis, that can be expanded on demand. (I thank you and curse you Google for this feature! Thank you, because it really helps in reading just the essential part of the email and ignoring unnecessary citations. Curse you, because it makes people carelessly cite everything there is, no matter whether they really refer to something that was written before or not.)
As the company grew, getting a new email with every person replying to a ‘Welcome‘ message became a nuisance. So we decided to use the [email protected]… alias only in BCC, not in CC. And in most cases, we actually follow this convention.
I couldn’t help but wonder how much space do all of these unnecessary citations occupy? Let’s analyze it, first using the Big-O notation, and then using concrete numbers.
Say there are N employees in the company. With every new employee, there is a ‘Welcome‘ email. For simplicity, let’s assume that half of the employees reply to such an email (replying to the last message in the thread and citing all the previous messages). Here are the two scenarios for comparison:
- (A) the ‘Welcome‘ emails are sent to BCC: [email protected]…
- (B) the ‘Welcome‘ emails are sent to CC: [email protected] or just TO: [email protected]…
So, what’s the difference in the amount of disk space that such messages occupy, in these two scenarios?
Scenario A is simpler: There are N+1 mailboxes (including the newcomer’s one). The message lands in each of them. N/2 people reply and the replies land in their own, and one or two other mailboxes. This means circa 2*N, that is O(N) messages in total. As new people join the company over time, this accumulates to O(N^2) messages in total.
In Scenario B, the announcement message lands in N+1 mailboxes. N/2 people reply to all, which makes the total of N*(N+1)/2 = O(N^2) messages in mailboxes. However, the messages get longer and longer. As we reply to the last message in the thread, all of the previous replies are cited. On average, half of the replies are cited, so each reply contains O(N) messages cited. This means that one ‘Welcome‘ message triggers O(N^3) cited messages being stored in the mailboxes. Over the history of the company, this cumulates to O(N^4) total cited messages being stored in the mailboxes.
How bad is it? Its O(N^2) times worse than Scenario A, and O(N^2) is a lot!
Really? What if I don’t believe this O(N^2) mumbo-jumbo? A lot of almost nothing can still be negligible. We can’t take the Big-O shortcuts — we have to do the math.
Concrete Numbers Analysis
(Simplifying assumptions: we only hire, we send the same announcement for all new hires, half of the currently employed people respond with “welcome aboard” messages, when replying we “reply to all”, the welcome message occupies circa 16 KB, each reply adds 1 KB to the message size.)
Let’s say there are currently N employees.
Scenario A is again simpler: there are N+1 mailboxes (including the newcomer’s one). The message lands in each of them. N/2 people reply and the reply lands in their own, and one or two other mailboxes. This means circa 2*N messages in total, half of them 16 KB and half of them 17 KB large. Over time this cumulates to circa N^2 messages in total, each 16.5 KB on average.
- For a 100 people company this is peanuts: 100*100*16.5KB = 165MB.
- For a 1,000 people company this is negligible: 1,000*1,000*16.5KB = 16.5GB.
- For a 10,000 people company this still fits in a single hard drive: 10,000 * 10,000 * 16.5KB = 1,65 TB.
Nothing to worry about. So how much worse is Scenario B?
An announcement message lands in N+1 mailboxes. N/2 people reply to all, which makes the total of N*(N+1)/2 messages in mailboxes. However the messages get longer and longer. Each reply adds 1 KB. So the first reply takes 17KB, next 18KB and so on. On average, half of the earlier replies are cited, so each reply takes 16 KB + N/4 KB on average. All the messages in the mailboxes related to one announcement take N*(N+1)/2 * (16 + N/4) KB ≈ 8*N^2 + N^3/8 KB. Over the history of the company it cumulates to 8/3*N^3 + N^4/32 KB.
- For a 100 people company it’s 2.7 GB for the message and 3.1GB for citations, 5.7 GB in total (negligible).
- For a 1,000 people company it’s 2.7 TB for the message and 31.3 TB for citations, 34 TB in total — well, it’s just 10-20 hard drives, what is it for Google?
- For a 10,000 people company it’s 2.7 PB for the message and 312.5 PB for citations, 315 PB in total. This is at least 50 tons of about 100,000 hard drives (not counting the servers). Maybe Google can still handle it, but it’s definitely a significant load. Moreover, 99% of this load are just cited replies …
You can also notice, that in these calculations N^4/32 KB quickly dominates 8/3*N^3. This is simply because O(N^4) always dominates O(N^3) (for large enough N). It’s the O(N^4) that makes the final result PetaBytes not GigaBytes.
This blog originally appeared on Cubical Thoughts.