

Issue #112 - April 25, 2021

Here are the top threads of the week, happy reading!

Top comment by jonahbenton

Not a coincidence. Not "bank purchase history" shared with Google - in most cases banks and credit cards don't know item-level detail.

Lots of ways this data flow could happen, at least in the US. Happy to go through the specific details I have seen if you want to share more about this, but two high-level points:

1. Remember that when you purchase something, the data about the purchase belongs BOTH to you AND to the entity from whom you made the purchase. Most of those entities have data-sharing agreements of various kinds for all sorts of legitimate business reasons.

2. It isn't Google who knows about the purchase, and even the advertiser doesn't "know" you made a purchase. Advertising is zillions of two-sided marketplaces, with an enormous ecosystem of data packagers and conveyers and linkers, and lots of concern about the recency and freshness of data. Your purchase landed some key about you in a bucket that was mixed and repackaged with many other keys, which the advertiser knows as "keys recently interested in Voltaren." Some of those keys are related to people who bought it, or who searched for it, or more indirectly who lingered while reading a page with an ad for it... and in most cases they are very short-lived. So give it a few weeks and many of those buckets of keys will have been completely remade.
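A toy sketch of that flow, under loose assumptions (the names and structure here are purely illustrative, not how any real ad-tech pipeline is built): a purchase, a search, or a lingering page view drops an opaque key into a short-lived interest bucket, and the advertiser only ever sees the bucket, never the event behind it.

```python
# Illustrative model only: opaque keys land in short-lived "interest" buckets.
import hashlib
import time

BUCKET_TTL_SECONDS = 14 * 24 * 3600  # "give it a few weeks" and the key ages out

def anonymous_key(source_event: str) -> str:
    # The advertiser never sees the raw event, only an opaque key.
    return hashlib.sha256(source_event.encode()).hexdigest()[:16]

class InterestBucket:
    def __init__(self, label: str):
        self.label = label   # e.g. "keys recently interested in Voltaren"
        self._keys = {}      # key -> timestamp it was added

    def add(self, key: str):
        self._keys[key] = time.time()

    def active_keys(self):
        # Only recently added keys are still part of the audience.
        cutoff = time.time() - BUCKET_TTL_SECONDS
        return {k for k, added in self._keys.items() if added >= cutoff}

# A purchase, a search, or lingering on a page all land the same kind of key
# in the same kind of bucket - the advertiser can't tell which it was.
bucket = InterestBucket("keys recently interested in Voltaren")
bucket.add(anonymous_key("card-txn-8871"))
bucket.add(anonymous_key("search-session-42"))
print(len(bucket.active_keys()))  # 2
```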

Top comment by jiggawatts

You've discovered what many other people have: The cloud is the new time-share mainframe.

Programming in the 1960s to 80s was like this too. You'd develop some program in isolation, unable to properly run it. You "submit" it to the system, and it would be scheduled to run along with other workloads. You'd get a printout of the results back hours later, or even tomorrow. Rinse and repeat.

This work loop was incredibly inefficient, and it was replaced by development that happened entirely locally on a workstation. This dramatically tightened the edit-compile-debug loop, down to seconds or at most minutes. Productivity skyrocketed, and most enterprises shifted the majority of their workload away from mainframes.

Now, in the 2020s, mainframes are back! They're just called "the cloud" now, but not much of their essential nature has changed other than the vendor name.

The cloud, just like mainframes:

- Does not provide all-local workstations. The only full-fidelity platform is the shared server.

- Is closed source. Only Amazon provides AWS. Only Microsoft provides Azure. Only Google provides GCP. You can't peer into their source code; it is all proprietary and even secret.

- Has a poor debugging experience. Shared platforms generally can't allow "invasive" debugging for security reasons. Their sheer size and complexity mean that your visibility will always be limited. You'll never be able to get a stack trace that crosses into the internal calls of platform services like S3 or Lambda. Contrast this with typical local debugging, where you can even trace into the OS kernel if you so choose.

- Is generally based on the "print the logs out" feedback mechanism, with all the usual issues of mainframes, such as hours-long delays.

Top comment by pmlnr

Non-cloud:

HPE sells their Apollo 4000[^1] line, which takes 60x3.5" drives - with 16TB drives, that's 960TB per machine, so one rack of 10 of these is 9.6PB, which nearly covers your 10PB needs. (We have some racks like this.) They are not cheap. (Note: Quanta makes servers that can take 108x3.5" drives, but they need special deep racks.)
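A quick back-of-the-envelope check of those figures (a sketch only; real usable capacity depends on replication and filesystem overhead, and the 3x factor below assumes HDFS's default replication):

```python
# Raw capacity math for the Apollo-style setup described above.
drives_per_machine = 60
drive_tb = 16
machines_per_rack = 10

tb_per_machine = drives_per_machine * drive_tb            # 960 TB per machine
pb_per_rack = tb_per_machine * machines_per_rack / 1000   # 9.6 PB raw per rack

# Three racks (as in EDIT2 below) come to ~28.8 PB raw; with HDFS's default
# 3x replication that is roughly 9.6 PB usable, i.e. close to the 10 PB target.
raw_three_racks_pb = 3 * pb_per_rack
usable_pb = raw_three_racks_pb / 3
print(tb_per_machine, pb_per_rack, raw_three_racks_pb, usable_pb)
```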

The problem here would be the "filesystem" (read: the distributed service): I don't have much experience with Ceph, and ZFS across multiple machines is nasty as far as I'm aware, but I could be wrong. HDFS would work, but the latency can be completely random there.

[^1]: https://www.hpe.com/uk/en/storage/apollo-4000.html

So unless you are desperate to save money in the long run, stick to the cloud, and let someone else sweat about the filesystem-level issues :)

EDIT: btw, we let the dead drives "rot": replacing them would cost more, and the failure rate is not that bad, so they stay in the machine, and we disable them in fstabs, configs, etc.

EDIT2: at 10PB HDFS would be happy; buy 3 racks of those Apollos, and you're done. We first started struggling at 1000+ nodes; now, with 2400 nodes, nearly 250PB raw capacity, and literally a billion filesystem objects, we are slow as f*, so plan carefully.

Top comment by rubyn00bie

Learning to let go of my ego and trust...

Understanding that delegating tasks, instead of doing them myself, was of the utmost importance the more people I managed... especially for greenfield work. Literally: do not steal the fun. This also means inherently trusting people, and if you can't do that you shouldn't be working with them. Two important notes about that:

1. If you just thought of someone you can't trust instead of thinking about how you can give more trust to people, you just had your ego stand in the way of mutual success. You are a shitty manager, and today is hopefully the first day of you recognizing that and learning to trust.

2. When people don't deliver what you expected, it's because you did a shitty job of communicating it to them. What seems obvious to you after 45 minutes in a meeting with three other people already prepared for the topic will most of the time seem obvious to no one else. If it does, I can almost promise their vision of it is totally different from yours. Learn to work through defining the problem (which includes asking "does this problem even exist?") and then guide solutions (we have x days, engineering hours, etc. available) to ensure they meet the needs of the business. If no one but you delivers things correctly, you're a shitty manager, and today's hopefully the first day of you recognizing you need to learn to communicate and trust.

Not many people are lucky enough to be told so plainly that it's their ego, but it's your ego that causes your team to fail. Maybe it's your boss's ego that's causing you to fail... I was told plainly, to my face, not to let my ego get in the way of the goal... and yeah, it punched me in the gut too, so if you're hurting, or in denial, know it's okay. We all have to grow, and it's worth it.

Top comment by gru

Python Poetry [1] is very sexy.

Oh, and Starship prompt [2] too.

[1] https://python-poetry.org/

[2] https://starship.rs/

Top comment by UglyToad

So I've been mulling this stupid thought for a while (and, disclaimer, it's extremely useful for these outage stories to make it to the front page to help everyone out there who is getting paged with P1s).

But, does it really matter?

I read people reacting strongly to these outages, suggesting that due diligence wasn't done before using a 3rd party for this or that. Or that a system engineered to reach anything less than 100% uptime is professional negligence.

However, off the top of my head, we've had AWS outages, Gmail outages, Azure outages, DNS outages, GitHub outages, whatever else. All these hugely profitable companies are messing this stuff up constantly. Why are any of us going to do any better, and why do a few hours of downtime ultimately matter?

I think it's partly that I live somewhere where a volcano on the next island over can shut down connections to the outside world for almost a week. Life doesn't have an SLA; systems should aim for reasonable uptime, but at the end of the day the systems come back online at some point and we all move on. Just catch up on emails or something. I dislike the culture of demanding hyper-perfection and insisting that we should be prepared to do unhealthy shift patterns to avoid a moment of downtime in UTC-11 or something.

My view is increasingly that these outages are healthy, since they force us to confront the fallibility of the systems we build and accept that chaos wins out in the end, even if just for a few hours.

Top comment by rockwotj

I'm not sure I would call this relatively easy even without joins. Without joins, Google Cloud Firestore does exactly what you're describing. The initial query runs against the DB, then the client gets a filtered stream of updates to only that query. It's distributed and scales logarithmically with the number of queries, as it doesn't need to keep the query results in memory/materialized.

The fun part of this problem is that it's really inverted from most traditional database literature out there. Usually the problem is that you have a query and need to find the matching rows. To make updates efficient you need to do this the other way around: given a row (that has changed), find all the matching queries.
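A minimal sketch of that inversion, assuming simple equality filters only (the class and names here are illustrative, not Firestore's actual implementation): subscribed queries are indexed by their predicates, so a changed row can look up which subscriptions it might affect instead of re-running every query.

```python
# Illustrative sketch: index subscribed queries by (field, value) predicates,
# then route a changed row to the queries it satisfies.
from collections import defaultdict

class QueryIndex:
    def __init__(self):
        self._index = defaultdict(set)  # (field, value) -> query ids with that predicate
        self._queries = {}              # query id -> full predicate dict

    def subscribe(self, query_id, predicates):
        """predicates: dict of field -> required value, e.g. {"status": "open"}."""
        self._queries[query_id] = predicates
        for field, value in predicates.items():
            self._index[(field, value)].add(query_id)

    def matching_queries(self, row):
        """Given a changed row (dict), return the ids of subscribed queries it satisfies."""
        candidates = set()
        for field, value in row.items():
            candidates |= self._index.get((field, value), set())
        # The index only says at least one predicate matched, so verify all of them.
        return {
            qid for qid in candidates
            if all(row.get(f) == v for f, v in self._queries[qid].items())
        }

# Usage: a row update fans out only to the queries that still match it.
idx = QueryIndex()
idx.subscribe("q1", {"status": "open"})
idx.subscribe("q2", {"status": "open", "owner": "alice"})
print(idx.matching_queries({"status": "open", "owner": "bob"}))  # {'q1'}
```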

With joins (or any data-dependent query where you can't tell a row is in the result set without looking at other data) you need to keep the query results materialized; otherwise you can't have enough information without going back to disk or keeping everything in memory, which isn't really feasible in most cases.

Source: I worked on Google Cloud Firestore from its launch until 2020, and I was the one responsible for the current implementation and data structures of how changes get broadcast to subscribed queries.

Top comment by apohn

>All of them were on combinatorics/algorithms

I have a PhD. I have had titles like "Principal Data Scientist", "Senior Manager, Data Science", etc. Most of my colleagues think I am good at what I do. I would absolutely fail an interview where they ask these kinds of questions. I would need to revisit my probability and other textbooks for at least a few months to be able to pass these kinds of questions.

Being able to pass these types of interviews is a learned skill. I would recommend you look at it that way and disconnect it from "intelligence" or from the value you can bring as a data scientist.

As for the slow thinking, most jobs allow you to think slowly and work through things. You can make mistakes and fix them. There is no job where somebody says "We will release the hungry tigers on a plane full of children unless you solve an algorithm problem in 60 seconds."

>In my country, there aren’t many DS internships, so if I fail a few more interviews, not sure if I can find new opportunities soon.

I think this is the root of the issue. If there are only a few internships and a lot of applicants, the employers can be as demanding or as harsh as they want. So it's really a reflection of the job situation, not of you personally.

Top comment by xondono

My take from experience and from what I've seen with friends:

- Consulting is great for people with good people skills, but it can turn into hell if you don't have them.

- It's very easy for experienced workers in some shops to take advantage of the "new meat" and basically shove their work onto you. Be on the lookout for that situation. If it happens, run as fast as you can.

- Find out who the top performers are and get as close to them as you can. Some of them will just be political hacks, but others are fountains of experience, and learning from them will provide you with invaluable insight into their fields.

- Be ready to ship crappy products. Consultancy is about doing things fast and keeping costs down. Nobody expects perfection, although you'll hear business-speak like "excellence" repeated constantly. Your bosses know it, your clients know it. If you are the kind of person who has trouble living with that (i.e. a perfectionist), you'll be way happier in product orgs.

- Insist on meeting the client. Engineering consultancy is 20% about making the thing and 80% about understanding the client's needs and managing their expectations. All the big failures I've seen in consulting come from having a middleman between the guy building and the guy talking to the client. You don't need to be there all the time, but enough to not be playing telephone with others about what the client wants.