Issue #282 - August 4, 2024
If you are looking for work, check out this month's Who is hiring? and Who wants to be hired? threads.
Here are the top threads of the week, happy reading!
1. Ask HN: What are you using to parse PDFs for RAG?
Top comment by whakim
We have been using different things for text, images, and tables. I think it's worth pointing out that PDFs are extremely messy under-the-hood so expecting perfect output is a fool's errand; transformers are extremely powerful and can often do surprisingly well even when you've accidentally mashed a set of footnotes into the middle of a paragraph or something.
For text, unstructured seems to work quite well and does a good job of quickly processing easy documents while falling back to OCR when required. It is also quite flexible with regard to chunking and categorization, which is important when you start thinking about your embedding step. OTOH it can definitely be computationally expensive to process long documents which require OCR.
For images, we've used PyMuPDF. The main weakness we've found is that it doesn't seem to have a good story for dealing with vector images - it seems to output its own proprietary vector type. If anyone knows how to get it to output SVG that'd obviously be amazing.
For tables, we've used Camelot. Tables are pretty hard though; most libraries are totally fine for simple tables, but there are a ton of wild tables in PDFs out there which are barely human-readable to begin with.
For tables and images specifically, I'd think about what exactly you want to do with the output. Are you trying to summarize these things (using something like GPT-4 Vision?) Are you trying to present them alongside your usual RAG output? This may inform your methodology.
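To make the chunking and categorization point concrete, here is a minimal sketch in plain Python. It is not the unstructured API: the `Element` class, the category labels, and the `chunk_by_title` helper are hypothetical stand-ins for whatever your parser emits. It groups categorized elements into chunks, starting a new chunk at each title or when a size limit would be exceeded.

```python
from dataclasses import dataclass


@dataclass
class Element:
    """Hypothetical parsed PDF element; real parsers have their own types."""
    category: str  # assumed labels, e.g. "Title", "NarrativeText", "Table"
    text: str


def chunk_by_title(elements, max_chars=200):
    """Group elements into chunks for embedding.

    A new chunk starts at every "Title" element, or when adding the next
    element would push the current chunk past max_chars.
    """
    chunks, current, size = [], [], 0
    for el in elements:
        too_big = size + len(el.text) > max_chars
        if current and (el.category == "Title" or too_big):
            chunks.append(current)
            current, size = [], 0
        current.append(el)
        size += len(el.text)
    if current:
        chunks.append(current)
    return chunks


# Toy input, invented for illustration only.
elements = [
    Element("Title", "Introduction"),
    Element("NarrativeText", "PDFs are messy under the hood."),
    Element("Title", "Methods"),
    Element("NarrativeText", "We fall back to OCR when needed."),
    Element("Table", "year | count"),
]

chunks = chunk_by_title(elements)
for chunk in chunks:
    print([el.category for el in chunk])
```

Keeping the category on each element means tables and images can later be routed to a different step (summarization, side-by-side display) than ordinary text.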
2. Ask HN: What is the best software to visualize a graph with a billion nodes?
Top comment by bane
Visualizing large graphs is a natural desire for people with lots of connected data. But after a fairly small size, there's almost no utility in visualizing graphs. It's much more useful to compute various measures on the graph, and then query the graph using some combination of node/edge values and these computed values. You might subset out the nodes and edges of particular interest if you really want to see them -- or don't visualize at all and just inspect the graph nodes and edges very locally with some kind of tabular data viewer.
It used to be thought that visualizing super large graphs would reveal some kind of macro-scale structural insight, but it turns out that the visual structure ends up becoming dominated by the graph layout algorithm and the need to squash often inherently high-dimensional structures into 2 or 3 dimensions. You end up basically seeing patterns in the artifacts of the algorithm instead of any real structure.
There's a similar, but unrelated desire to overlay sequenced transaction data (like transportation logs) on a geographical map as a kind of visualization, which also almost never reveals any interesting insights. The better technique is almost always a different abstraction like a sequence diagram with the lanes being aggregated locations.
There's a bunch of these kinds of pitfalls in visualization that people who work in the space inevitably end up grinding against for a while before realizing it's pointless or there's a better abstraction.
(source: I used to run an infoviz startup for a few years that dealt with this exact topic)
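The "compute measures, then query" approach from this comment can be sketched in a few lines of plain Python. This is only an illustration on an invented toy edge list; `degree_table` and `neighborhood` are hypothetical helper names, and a graph with a billion nodes would need a graph database or a distributed engine rather than in-memory dictionaries.

```python
from collections import defaultdict


def degree_table(edges):
    """Count how many edges touch each node (one simple per-node measure)."""
    degree = defaultdict(int)
    for src, dst in edges:
        degree[src] += 1
        degree[dst] += 1
    return dict(degree)


def neighborhood(edges, center, hops=1):
    """Return the set of nodes within `hops` edges of `center`."""
    adjacency = defaultdict(set)
    for src, dst in edges:
        adjacency[src].add(dst)
        adjacency[dst].add(src)
    seen = {center}
    frontier = {center}
    for _ in range(hops):
        # Expand one hop outward, skipping nodes already visited.
        frontier = {n for node in frontier for n in adjacency[node]} - seen
        seen |= frontier
    return seen


# Toy undirected edge list, invented for illustration.
edges = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c"), ("d", "e"), ("e", "f")]

degree = degree_table(edges)
# Query on the computed measure instead of drawing the whole graph.
hubs = sorted(node for node, d in degree.items() if d >= 3)
# Inspect one node's local surroundings, as in a tabular viewer.
local = sorted(neighborhood(edges, "e", hops=1))
print(hubs, local)
```

The same pattern extends to other measures (centrality, component size): compute them once, store them alongside the nodes, and filter on them to pick the small subgraph worth looking at.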
3. Ask HN: Weirdest Computer Architecture?
Top comment by runjake
Here are some architectures that might interest you. Note these are links that lead to rabbit holes.
1. Transmeta: https://en.wikipedia.org/wiki/Transmeta
2. Cell processor: https://en.wikipedia.org/wiki/Cell_(processor)
3. VAX: https://en.wikipedia.org/wiki/VAX (Was unusual for its time, but many concepts have since been adopted)
4. IBM z/Architecture: https://en.wikipedia.org/wiki/Z/Architecture (This stuff is completely unlike conventional computing, particularly the "self-healing" features.)
5. IBM TrueNorth processor: https://open-neuromorphic.org/blog/truenorth-deep-dive-ibm-n... (Cognitive/neuromorphic computing)
4. Ask HN: What's an appropriate compensation counter offer in London 2024?
Top comment by dsr_
> Should I ask for a lower salary to avoid tax brackets but more equity in return? There is no vesting schedule, so it is only realised on sale, and if you leave, it's at the company's discretion whether you keep your share.
If you want equity -- which means that you feel really good about this company's chances -- you want actual non-revocable ownership now. Be entered on the list of shareholders.
"if you leave it's at the company's discretion" means that it doesn't exist unless you are still employed there on the day of a sale. That's not equity, that's not a lottery ticket -- that's the promise that maybe someday there could be lottery tickets. Ask for the actual lottery ticket now, or discount it right down to zero no matter how much they offer you.
5. Ask HN: How are you finding the job market in July 2024?
Top comment by sleight42
I've been a software dev for almost three decades and a manager for the last few years of it. I've been looking for work for over 6 months. Granted, I'm trying to move back to SWE work but have been somewhat open to EM work.
Finding work is harder than 2008 and even harder than 2001. It's as bad as tech has ever seen.
For EMs, middle management is always the first to get culled. That was me. And there are only about 10% as many EM roles as engineer roles, or fewer.
No one has said it but my management years away from daily coding likely count against me.
What's more, I'm seeing far fewer Staff+ roles now than in the past several years. Anecdotally, from a recent employer, I had the impression that the company was attempting to hire fewer Staff+ to (1) outsource more, (2) hire more people at lower levels, or (3) both 1 and 2 together.
The net effect in the US is seeing many more "senior" roles (we love our title inflation in Tech where "senior" tends to mean "has more than 2 years experience") or extremely specialized roles where exceptionally few people would have the skill set at the required experience level.
The VCs and the bigger firms evidently were heavily leveraged in low interest rate loans. Raising rates meant less money to play with and lower profits. Employees are the biggest cost center so that's where companies cut.
Without significantly lower interest rates, I suspect Tech is going to experience something of a depression in terms of unemployed/under-employed software developers.
6. Ask HN: Junior dev and I don't want to compete in this job market. Any advice?
Top comment by addaon
I think you might be over-indexing on the process of getting a job, and under-indexing on the process of /doing/ a job. Yes, getting a job sucks -- for a few weeks or months as you work through the process. But once you have that job, it will take up more than a third of your waking hours for, ideally, years. What can you imagine doing for 2000 hours of the next 8760? What would you /enjoy/ doing? That, more than the concerns over the hiring gauntlet, should be your motivation to choose a direction -- and should motivate pushing through the hiring process.
7. Ask HN: How do you choose a hostname for personal devices?
Top comment by teddyh
RFC 1178 has you covered: <https://www.rfc-editor.org/rfc/rfc1178.html>
8. Ask HN: How to go about reverse engineering and deformulating a beverage at home
Top comment by mannyv
I'd start by buying a few. Open one and let it decarbonate. It's much easier to discern flavors in uncarbonated drinks.
As an example, Coca-cola has cinnamon in it, which is almost impossible to taste when it's carbonated. It pops out when Coke goes flat.
Most sodas will have a citrus component. Japan has odd ones like Yuzu, so try to pick up some essential citrus oils that aren't normal in your home country and are plentiful there.
Then really, just put it onto stuff that you know the flavor of and taste/smell it to see if you can tell what's been added. Dip some white bread into it and see if anything comes out, etc. Don't be afraid to swirl it around in your mouth (like wine) or just breathe it in.
FYI, I was just in Japan and realized that their Sprite has a lot more lime than in other countries, which I didn't really like. Normally Sprite is great in hot weather, but the lime just didn't work in the heat IMO.
9. Ask HN: What Are You Working On? (August 2024)
Top comment by dang
I think I might have posted this before, but: I'm open to having a regular "What are you working on?" thread. I even reserved https://news.ycombinator.com/user?id=whoisworking for that.
But it shouldn't be on the first of the month, colliding with Who Is Hiring threads. How about, hmm, the third Saturday of each month? or maybe Friday?
10. Ask HN: Business logic is slowing us down
Top comment by llamaLord
A lot of this comes down to poor product management.
A shitty PM sees problems like "when a guest at a hotel room completes {specific action X}, they want to receive a notification saying {specific message Y}".
A half decent PM will stop, step back, and ask "is there a meta-pattern here that I'm just not seeing yet", and realise the actual requirement is something like:
"When a guest at a hotel is the subject of any meaningful business event related to their booking, the management of the hotel want to be able to send them some kind of configurable, context-bound notification containing information about that event (ideally with some simple conditional logic to skip scenarios where the notification isn't relevant)."
As an engineer, you're going to look at those two requirements fundamentally differently, and sure, the second one is maybe 3x harder to implement initially... But once you're 20 different "bespoke notification events" deep, you're gonna bloody wish you built the second option from day one.
Src: am Snr PM who sees teams get screwed by this type of bad PM thinking all the time.
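As a rough illustration of the second, generalized requirement, here is a minimal Python sketch of configurable, event-driven notifications. Everything in it is an assumption made for the example (the `NotificationRule` class, the event field names, the templates); it is one possible shape, not how any particular hotel system works.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class NotificationRule:
    """One configurable rule: which event it handles and what to send."""
    event_type: str
    template: str  # filled in from the event's fields
    condition: Optional[Callable[[dict], bool]] = None  # optional skip logic


def render_notifications(event, rules):
    """Return the messages for every rule that matches the event."""
    messages = []
    for rule in rules:
        if rule.event_type != event["type"]:
            continue
        if rule.condition is not None and not rule.condition(event):
            continue  # the rule applies to this event type but is not relevant
        messages.append(rule.template.format(**event))
    return messages


# Hypothetical configuration: adding a new notification is a new rule,
# not a new code path.
rules = [
    NotificationRule(
        "late_checkout_approved",
        "Hi {guest}, late checkout for room {room} is confirmed until {time}.",
    ),
    NotificationRule(
        "room_ready",
        "Hi {guest}, room {room} is ready.",
        condition=lambda e: not e.get("already_checked_in", False),
    ),
]

event = {"type": "late_checkout_approved", "guest": "Ada", "room": 12, "time": "14:00"}
messages = render_notifications(event, rules)
print(messages)
```

The trade-off matches the comment: the rule engine costs more up front than one hard-coded message, but the twentieth notification becomes a configuration entry rather than another bespoke feature.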