NOTE: Architecture is in active evolution: events through Kafka, bytes via a Media Gateway into MinIO, analytics in ClickHouse, and a thin Read API for the GUI. Ingest writes WARCs; summaries are sidecar objects keyed by the exact text hash. This post tracks the big ideas; fine-grained details (Kafka topics/schemas, scoring features, RBAC, DLQ) will land as they stabilize.
BillsTechDeck
Many times in life we must do something not because it is easy, but because it is hard. I am in one of those spaces. My dream: building a program to fetch and correlate tech news. Interested in next-gen augmented reality? BillsTechDeck can help you find information on it! The world is wide open for the kinds of gadgets and tech announcements you can pull from a correlated source, letting you evaluate tech trends and get an overall picture that might elude you even after vigorous googling. It’s a huge breather to be able to see the “big picture”.
It’s a problem I ran into while looking into information on the Vision Pro from Apple, or when I was awaiting new Switch 2 news. Too many pieces to the jigsaw, and it was frustrating.
Let’s start with an overview of the system by showing a (rough) flow diagram. Some considerations: all arrows pointing back to the message queue are flowing to different Kafka queues. Also, consider every subsystem to be in a Docker container, orchestrated by Kubernetes, and in a CI/CD pipeline (I didn’t include that in the graph because it would be too busy).
Basically I want news, trends, sources, analysis, and summaries in one place, building a coherent volume of data from which I can understand the zeitgeist of tech news, gadgets and trends.
My attempt at something coherent:
Now, I’m just a hobbyist. So I make no claim that I know anything. I’m having fun, which is the true joy. Let’s take a high-level dive into this system.
I’ve redone this sooooo much
Let’s break down the steps (leaving out Kafka-related steps):
- The Harvester takes input from the IngressOrca, which is based on info from the FeedbackService. It utilizes MinIO as a way to piece together WARCs and to handle rich media in a distributed way. Media gets stored in the MinIO cluster (MediaCluster), which operates on SHA-256 and SHA-1 keys (SHA-1 in the case of WARCs, though those get SHA-256 hash keys too for system congruency)
- The Sanitizer gets jobs queued up from the Harvester and draws media from the MediaGateway to sanitize. If it’s dirty, we still keep it to perform forensics in a controlled environment
- OCR gets run on rich media based on jobs from Kafka
- spaCy (NER) gets run on everything. spaCy submits a coupled job to the spaCy sanity checker; if the output is sane, it gets sent to a scoring service that decides whether an automatic Phi-4 summary is warranted, and if not warranted it just gets sent to the Correlation Engine. If insane, the data gets sent to an InsanityHandler (not shown for brevity) to be used for analytics or get human checking.
- Phi4 gets run on specific pieces of data either warranted by the scoring service or user initiated.
- The Correlation Engine gets run on everything
- Every subsystem will have robust auditing and submits to log queues to be handled by the LogHandler and stored in the LogSilo/Elasticsearch
- The subsystems/handlers needing access to media will interact with the MinIO cluster through RESTful calls to a MediaGateway. The MediaCluster (MinIO cluster) only talks to the MediaGateway
- Different handlers speak to the GUIHandler which displays to the GUI. The GUIHandler can submit ControlEvents (asking for phi4 summarization, tweaking stuff)
- The FeedbackService talks to the HistoricalHandler to pull from the HistoricalSilo and train a model that gives better information to the IngressOrca (orchestrator), so it can better choose when, where and how to pull information and cut down on resources wasted on bunk data
- All data gets stored in a MinIO cluster (MediaCluster) and accessed through RESTful calls
- Calls to the GUIHandler are RESTful as well
- All subsystems are containerized and orchestrated by Kubernetes
I'm sure I left out some detail, but that's the gist.
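To make the keying and job flow a little more concrete, here is a minimal sketch in Python. The bucket and field names are hypothetical placeholders, not the final schema; the only real constraint it illustrates is keying artifacts by the SHA-256 of their exact bytes.

```python
import hashlib
import json
from datetime import datetime, timezone

def make_artifact_key(payload: bytes) -> str:
    """Key every artifact by the SHA-256 of its exact bytes, matching the MinIO keying scheme."""
    return hashlib.sha256(payload).hexdigest()

def make_sanitize_job(payload: bytes, source_url: str) -> dict:
    """Build a hypothetical job message the Harvester could queue up for the Sanitizer."""
    return {
        "artifact_sha256": make_artifact_key(payload),
        "source_url": source_url,
        "harvested_at": datetime.now(timezone.utc).isoformat(),
        "media_type": "text/html",  # placeholder; the real value comes from the fetch
        "bucket": "raw-unvetted",   # hypothetical quarantine bucket name
    }

if __name__ == "__main__":
    job = make_sanitize_job(b"<html>Steam Deck 2 rumors...</html>", "https://example.com/article")
    print(json.dumps(job, indent=2))
```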
The Harvester: gathering data
How do we gather our news about the slick Pixel Fold 3?
We need to pull it from online! Various sources require various data harvesting methods. Big problems we run into are bot detection, DDoS filtering, CAPTCHAs, and malformed information. All sites have specific structures too (how complicated). Luckily we have an incredible ecosystem behind data harvesting.
Python is an incredible language to use for this purpose and has a vibrant community working diligently to help hobbyists like me grab the crucial information on the Steam Deck 2 tech specs and other chatter about it (how incredible). There are a lot of considerations. So we have to take things in steps:
- Recon
- What is the site structure?
- What is the site’s flow?
- What tricks are companies like Akamai pulling to impede my ability to get my precious tech snippets?
- What values change and where? When does my cookie become invalid depending on an abnormal flow?
- Does the JavaScript try to fool me? Is it dynamic, obfuscated, or does it check for tampering?
- Are my user agents okay and when do I rotate them?
- How do I handle headers?
- How do I handle TLS Fingerprinting?
- This list is getting long so I’ll just add “heuristics”
This is a very involved process and requires a good amount of attention. So targets for my tech news have to be curated and scoped in general. Using tools like Caido and mitmproxy to gather valuable information about sites and their heuristics is important.
- CAPTCHAs
- Traditional CAPTCHAs: image recognition tasks
- ReCAPTCHA: machine learning looking at user behavior to determine bot behavior
- Invisible CAPTCHAs: pesky things that run in the background, set up by grumpy site admins looking to stop me
While a smaller list, these are definitely huge hurdles, and it is by no means exhaustive. All of these problems require complex solutions. Complex solutions that have to morph constantly.
I could go on, but I’ll just add things like using reputable residential proxies, mobile proxies, rate limiting, acknowledging and mitigating device fingerprinting, and lastly honeypots. The last thing I need is to waste resources on useless honeypots!
So we need to have different approach levels:
Graduated Response Crawling Strategy: “test with a pellet gun, escalate to ordnance if it’s fubar’d.”
- Level 1: Pellet Gun (aiohttp, Scrapy)
  - Use for static pages, public APIs, or weakly protected endpoints. Low noise, low cost.
- Level 2: Scoped Rifle (Playwright + stealth plugins)
  - Use for JS-rendered sites, light bot defenses, simple CAPTCHAs. Mimics real users, simulates browser behavior.
- Level 3: Ordnance (Crawl4AI/Nodriver, heavy CAPTCHA solving, mobile proxies)
  - Use when you hit: invisible CAPTCHAs, anti-bot JavaScript puzzles, DOM obfuscation, or flow-control defenses. Heavy but necessary for hard targets.
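To make the escalation concrete, here’s a minimal Python sketch of Levels 1 and 2, assuming aiohttp and Playwright are installed. The escalation trigger is deliberately dumb (any failure); the real logic would look at status codes, bot-wall markers, and the HistoricalSilo.

```python
import asyncio
import aiohttp
from playwright.async_api import async_playwright

async def fetch_level1(url: str) -> str | None:
    """Level 1: plain async HTTP. Cheap and quiet; good enough for static pages."""
    try:
        async with aiohttp.ClientSession() as session:
            async with session.get(url, timeout=aiohttp.ClientTimeout(total=15)) as resp:
                if resp.status == 200:
                    return await resp.text()
    except (aiohttp.ClientError, asyncio.TimeoutError):
        pass
    return None  # signal the caller to escalate

async def fetch_level2(url: str) -> str:
    """Level 2: a real browser via Playwright for JS-rendered or lightly defended pages."""
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()
        await page.goto(url, wait_until="networkidle")
        html = await page.content()
        await browser.close()
        return html

async def fetch(url: str) -> str:
    """Try the pellet gun first; only reach for heavier tooling when it fails."""
    html = await fetch_level1(url)
    if html is not None:
        return html
    return await fetch_level2(url)  # Level 3 (Crawl4AI/Nodriver, mobile proxies) would slot in after this

if __name__ == "__main__":
    print(asyncio.run(fetch("https://example.com"))[:200])
```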
Why This Matters
- Efficiency: Don’t burn Playwright cycles when curl works.
- Stealth: Avoid raising alarms unnecessarily.
- Longevity: Run for months without bans, not weeks.
Though now we introduce complexity, which is fine. At the start we will have very simple rules. As the system grows and the HistoricalSilo becomes more robust, we can make better calls to better places, since we’d have historical data and patterns to guide us.
This part of the system is arguably the most essential and will be one requiring constant updating because of the cat and mouse game between weenies running sites keeping me from my sweet, sweet Samsung news.
I’ve come up with a plan to be able to ingest, ingest, ingest and verify before I really have to worry about pulling real-time data. The current plan is to pull data from archive.org (at a throttled rate and politely, of course). Going this way, I rewrote the Internet Archive Python wrapper to be async and non-blocking.
If I just started pulling lots of timely data, I wouldn’t have a good assurance my correlations mean anything. Historical data gives me far more assurance and allows me to verify information with 20/20 hindsight.
This approach allows me to ingest and focus on the rest of the system without having to build a crawler that will require a lot of changes. I feel building a crawler would eat too much time at the start and leave the rest of the system derelict.
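My rewritten wrapper isn’t shown here, but as a rough illustration of the kind of throttled, polite, async pulling I mean, here’s a minimal sketch against the public Wayback CDX API using plain aiohttp. The user agent and sleep interval are placeholders.

```python
import asyncio
import aiohttp

CDX_API = "https://web.archive.org/cdx/search/cdx"

async def list_snapshots(session: aiohttp.ClientSession, url: str, limit: int = 5) -> list[list[str]]:
    """Ask the Wayback CDX API for a few captures of a URL (output=json returns a header row plus data rows)."""
    params = {"url": url, "output": "json", "limit": str(limit)}
    async with session.get(CDX_API, params=params) as resp:
        rows = await resp.json(content_type=None)
    return rows[1:]  # drop the header row

async def fetch_capture(session: aiohttp.ClientSession, timestamp: str, original: str) -> bytes:
    """Fetch one archived capture; the caller is responsible for pacing requests."""
    async with session.get(f"https://web.archive.org/web/{timestamp}/{original}") as resp:
        return await resp.read()

async def main() -> None:
    headers = {"User-Agent": "BillsTechDeck/0.1 (hobby project)"}  # placeholder UA
    async with aiohttp.ClientSession(headers=headers) as session:
        for row in await list_snapshots(session, "example.com"):
            timestamp, original = row[1], row[2]
            body = await fetch_capture(session, timestamp, original)
            print(timestamp, original, len(body))
            await asyncio.sleep(2)  # throttle: be polite to archive.org

if __name__ == "__main__":
    asyncio.run(main())
```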
No data should be trusted: the art of people looking to poison your system
What’s the problem with taking data from the internet?
Well, anyone who has been on the internet for any length of time knows about the dirty trolls. Actors who are out to hose you and your noble goal of getting the new smartphone information. Because people want to pwn you, you have to assume the worst.
Let’s highlight some concerns (not an exhaustive list, just a taste)
- Malice in action
- Javascript Payloads (XSS, Embedded goodness, etc)
- Worry about data exfiltration
- Browser Exploits
- Redirection and Phishing
- PDF Macros and Embedded Objects
- Can do spooky things like “remote code execution”
- Info disclosures
- Initiate connections to scary C2’s
- Handling various filetypes
- Office Document macros
- EXE/DLL (less of a concern since they’d be filtered)
- Malicious archive files that contain executables and path traversals
- Image/Media files: hiding steganography or utilizing dirty dirty codecs
- Data Integrity
- Tampered data
- Spoofed sources
- People looking to poison my system with generally bad data
So how do we deal with this? Some things I left off this list (like servers trying to DDoS my harvester by serving up tons of unnecessary data to hurt my feelings).
We first off want to isolate and contain all data we haven’t vetted. A separate black box that either resides on a different network segment or is air-gapped. While VLAN hopping does occur, that risk has to be weighed against the caveats that come with air-gapping (which I won’t bore anyone with).
One level is running YARA rules on a file. Which is fine, and a great starting point. We have tools for macro analysis. We have PDF analysis tools. We can verify files are what they claim to be (making sure the dirty trolls aren’t hiding exe’s). We have static code analysis. We check hashes against threat feeds.
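As a flavor of that first static layer, here’s a minimal Python sketch using yara-python plus a hash lookup. The rule and the “known bad” hash set are toy placeholders; real rules and hashes would come from curated feeds.

```python
import hashlib
import yara  # pip install yara-python

# A toy rule; real rule sets would come from curated feeds.
RULES = yara.compile(source=r"""
rule suspicious_payload
{
    strings:
        $mz = "MZ"                      // PE magic at offset 0 suggests a smuggled executable
        $ps = "powershell -enc" nocase  // encoded PowerShell is rarely good news
    condition:
        $mz at 0 or $ps
}
""")

KNOWN_BAD_SHA256 = {"0" * 64}  # placeholder stand-in for hashes pulled from a threat feed

def triage(payload: bytes) -> dict:
    """Cheap static checks: YARA matches plus a hash lookup. Dirty data still gets kept, just quarantined."""
    digest = hashlib.sha256(payload).hexdigest()
    matches = RULES.match(data=payload)
    return {
        "sha256": digest,
        "yara_hits": [m.rule for m in matches],
        "known_bad_hash": digest in KNOWN_BAD_SHA256,
        "quarantine": bool(matches) or digest in KNOWN_BAD_SHA256,
    }

if __name__ == "__main__":
    print(triage(b"MZ\x90\x00 definitely not an article"))
```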
We also have Cuckoo at the other extreme. It provides dynamic analysis, behavioral reporting, threat detection… but it comes with significant caveats, so it won’t be implemented until we get past the Internet Archive phase.
Lastly, we have to worry about data poisoning. I don’t have a clear path on how to handle this. There is a breadth of research papers I am going to go through to better understand the problem and approaches.
No one said safety is easy. I write this not as a definitive account of what I’m doing, but more to highlight the staggering number of ways bad hombres can compromise me and my system.
I can only make it as complicated for them as I can.
With that in mind, I am designing this part with Rust. Performance, memory safety and I just like it a lot. This will be a Tokio job. Media will be fetched and posted to the MediaGateway to interact with the MediaCluster (MinIO cluster)
In conclusion:
For the majority of time, bad actors are looking for low hanging fruit. The further I can put the sweet, sweet apples up the tree and minimize my attack surface the better.
If the data is skanky we quarantine it so we can analyze it. We document it and store the analytics revolving around it in the HistoricalSilo.
Phi4-medium: summarizing for busy people like me
LLMs come with a lot of challenges, resource-wise and content-wise. However, they also have the ability to give us cogent summaries of potentially lengthy pieces of information. That’s why I’m using Phi4-medium (I needed something more robust).
Why would I choose this?
- Goldilocks size and performance
- Medium is bigger than mini. Medium has 14 billion parameters.
- Competitive enough with larger models but more efficient
- Optimized for my use cases
- Suitable for local deployments
- Cost effective (since I’m a lowly cabbage farmer)
- Flexibility in deployments
I need something local and powerful, and it fits the bill. Having it be its own Docker image makes it easy. Another positive is my ability to fine tune it (for my greedy need for information on the new iPhone).
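For a sense of how the summarization service might be called, here’s a minimal sketch that assumes the model is served locally behind an OpenAI-compatible endpoint (as vLLM or similar runtimes can provide). The URL, model name, and prompt are placeholders, not the final setup.

```python
import requests

# Placeholder endpoint: assumes Phi-4 is served locally behind an OpenAI-compatible API.
PHI4_URL = "http://localhost:8000/v1/chat/completions"

def summarize(text: str, max_words: int = 150) -> str:
    """Ask the local model for a bounded-length summary of one sanitized article."""
    payload = {
        "model": "phi-4",  # placeholder model name
        "messages": [
            {"role": "system", "content": "You summarize tech news accurately and concisely."},
            {"role": "user", "content": f"Summarize in at most {max_words} words:\n\n{text}"},
        ],
        "temperature": 0.2,  # keep it boring; creativity invites hallucination
    }
    resp = requests.post(PHI4_URL, json=payload, timeout=120)
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

if __name__ == "__main__":
    print(summarize("Apple announced the Vision Pro, a mixed reality headset priced at $3,499..."))
```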
Caveats!
- Hallucination
- My own guys are working against me! *sigh* Tis the cost of doing business. For this I may have to implement a RAG system.
- English
- I’m pigeonholing myself into consuming English. In the end this is not an overall large deal since I’m not multilingual. Though it adds complexity should I want to expand data sources to places I can’t understand
So what does a headstrong cabbage farmer like me do?
Sanity checks.
- Things like volume yields
- Meaning: Checks if the summary’s length is reasonable.
- Did Phi-4 produce a 150-word summary as requested, or did it return a single sentence or a 10-page novel?
- Cardinality or categorical value checks.
- Meaning: Checks if the entities (people, places, etc.) in the summary are a valid subset of the entities in the original article. Primary defense against hallucination.
- Does the summary mention ‘Germany’ when the source text only ever mentioned ‘France’?
- Completeness and fill rate checks.
- Meaning: Checks for the omission of critical information.
- The original article mentioned three key companies, but the summary only includes one. Is the summary missing vital information?
- Uniqueness checks
- Meaning: Checks for repetitive or redundant content within the summary.
- Did the model get stuck in a loop and repeat the same sentence three times?
- Range checks.
- Meaning: Checks if numerical data in the summary is factually correct based on the source.
- The source text says profits were ‘$5 million,’ but the summary says ‘$5 billion.’ Is this a catastrophic numerical error?
- Presence checks
- Meaning: The most basic check: did the service return anything at all?
- Did the Phi-4 service time out or return an empty string instead of a summary?
- Data type validation checks.
- Meaning: Checks if the summary adheres to the requested structure.
- I asked for a JSON object with a ‘title’ and ‘key_points’ array. Is the output valid JSON with those exact keys?
- Consistency checks
- Meaning: The deepest check for factual grounding and logical contradiction.
- The source text says ‘the project was cancelled,’ but the summary implies it’s ongoing. Does the summary contradict the facts of the original article?
This list can quickly become like Benjamin Buford Blue naming uses for shrimp so I’ll top it off there.
This will be auto-run based on the scoring service or manually requested by moi.
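To show what a couple of these checks could look like in practice, here’s a toy Python sketch. The entity sets would really come from the spaCy stage and the thresholds are made up; it’s illustrative, not the production checker.

```python
def length_check(summary: str, target_words: int = 150, tolerance: float = 0.5) -> bool:
    """Volume yield: the summary should land within a reasonable band of the requested length."""
    n = len(summary.split())
    return target_words * (1 - tolerance) <= n <= target_words * (1 + tolerance)

def entity_subset_check(summary_entities: set[str], source_entities: set[str]) -> set[str]:
    """Cardinality check: summary entities must be a subset of source entities.
    Returns the offenders (possible hallucinations); an empty set means the check passed."""
    return summary_entities - source_entities

def uniqueness_check(summary: str) -> bool:
    """Uniqueness: fail if the model looped and repeated the same sentence."""
    sentences = [s.strip().lower() for s in summary.split(".") if s.strip()]
    return len(sentences) == len(set(sentences))

if __name__ == "__main__":
    source = {"Apple", "Vision Pro", "France"}
    summary = {"Apple", "Vision Pro", "Germany"}   # 'Germany' never appeared in the source
    print(entity_subset_check(summary, source))    # {'Germany'} -> flag for review
```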
Grabbing Entities with spaCy: grabbing the pertinent things
We are at the spaCy section.
Which model do I choose? spaCy offers a variety of pretrained models, all with their own uses. They are trained on general web content, so out of the box they won’t recognize tech jargon. I will likely need to fine tune a custom NER model and add custom components. At the start I will need to annotate data to train my model (there are open source tools to somewhat automate this process). This will also encompass training it to recognize custom entity types.
I will need to be fluent in rule-based matching (Matcher and EntityRuler). I will need to go in and do entity linking and disambiguation (e.g. “Apple” the company vs. “apple” the fruit). With that comes the possibility of building a custom entity linking component or integrating an external tool (hopefully not).
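As a small taste of the rule-based side, here’s a minimal spaCy sketch that seeds the pipeline with a few tech terms the stock model won’t know. The patterns and labels are examples, not my final entity inventory, and it assumes en_core_web_sm is installed.

```python
import spacy

# Assumes the small English model is installed: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

# Seed the pipeline with tech jargon the stock model won't recognize.
ruler = nlp.add_pipe("entity_ruler", before="ner")
ruler.add_patterns([
    {"label": "PRODUCT", "pattern": "Vision Pro"},
    {"label": "PRODUCT", "pattern": [{"LOWER": "steam"}, {"LOWER": "deck"}, {"LIKE_NUM": True}]},
    {"label": "ORG", "pattern": "BillsTechDeck"},
])

doc = nlp("Apple's Vision Pro launch overshadowed chatter about the Steam Deck 2.")
for ent in doc.ents:
    print(ent.text, ent.label_)
```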
Since I’m only worried about English at the moment, I am blessed to be ignorant of language detection.
Past that I will need to consider performance matters like batch processing and component disabling. When a component is not in use, turn it off!
Considering possible parallel processes running alongside Phi-4, I’ll have to weigh CPU-based models against GPU-based models, and also plan for considerable RAM utilization.
There’s pre-processing, post-processing and possibly integrating external logic and models. The use of custom attributes will be a must. I will have to plan for out-of-domain text, which I will inevitably run into and which is crucial for me to know how to handle.
Lastly, and almost most importantly:
Sanity checks.
- Schema validation
- Verifying correct data types
- Paying close attention to the behavior around critical fields
- Defining expected data types
- Establishing acceptable ranges with things like dates and word counts
- Define allowed values
- Define completeness thresholds
- Consideration of cross field consistency rules
A lot of the above mentioned sanity check stuff applies here, but in a more granular sense dealing with entities. The list goes on, and again, it becomes listing uses for shrimp to Forrest Gump.
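To sketch what schema validation might look like for one extracted entity record, here’s a small example using pydantic. The field names, labels, and thresholds are hypothetical; the point is just that records get checked for types, ranges, and allowed values before anything downstream touches them.

```python
from datetime import datetime
from pydantic import BaseModel, Field, ValidationError  # pip install pydantic

ALLOWED_LABELS = {"ORG", "PRODUCT", "PERSON", "GPE", "EVENT"}  # illustrative allow-list

class EntityRecord(BaseModel):
    """Hypothetical shape of one extracted entity; field names are placeholders."""
    artifact_sha256: str = Field(min_length=64, max_length=64)
    text: str = Field(min_length=1)
    label: str
    confidence: float = Field(ge=0.0, le=1.0)
    extracted_at: datetime

def validate_record(raw: dict) -> EntityRecord | None:
    """Schema, range, and allowed-value checks before anything reaches the Correlation Engine."""
    try:
        record = EntityRecord(**raw)
    except ValidationError as err:
        print("rejected:", err.errors())
        return None
    if record.label not in ALLOWED_LABELS:
        print("rejected: unexpected label", record.label)
        return None
    return record

if __name__ == "__main__":
    validate_record({"artifact_sha256": "a" * 64, "text": "Pixel Fold 3", "label": "PRODUCT",
                     "confidence": 0.93, "extracted_at": "2025-01-01T00:00:00Z"})
```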
I feel okay about the completeness of this section.
Data correlation: making sense of things
Data correlation in this system is incredibly important. I need a language that can provide me some memory guarantees as well as stop me from making newbie mistakes. I drifted towards C++ at first. I thought it through and arrived back at Rust. I’m simply not an experienced C++ programmer and would likely implement things that would hose my system.
Basically, Rust takes entities from spaCy and connects the dots. It will utilize ClickHouse to write/read/store pertinent things. I needed some real granularity and functionality for statistics in correlation. An earlier draft incorporated RocksDB, which wasn’t robust enough with recent developments.
So stats will be important (yay!).
An idiomatic way of coding is key and I’ll need to be very deliberate with what I do, why I do it and how I implement things. I’m going to be using Tokio for this part since I will have a lot of I/O processes talking to ClickHouse.
We basically take all entities, run rich analysis on them, and compare it to historical data.
I consider the following things:
- Is this relationship statistically significant?
- Is this correlation more than just “chance”?
- Is this significance worth creating a graph relationship with?
- Is there factual backing to put emphasis on this specific relationship?
So I’d need to do things like establish a p-value for connections. It’d also be a good idea to establish Pointwise Mutual Information (PMI), a measure that scores how much more likely two entities are to appear together than by random chance. High positive scores point to a meaningful association, while negative scores tell me two entities show up together less often than chance would predict, which is informative in its own way.
Using stats is essential for filtering out noise. For instance, the entities ‘Apple’ and ‘iPhone’ will appear together thousands of times, but this connection is obvious and not particularly insightful. Statistics help us prove that a rarer connection, like a specific tech company and a government agency, is far more significant even if it only appears a few times. Also, thinking of the White House: it’s not significant because it’s a white building.
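The engine itself will be Rust, but the statistic is easier to show in a few lines of Python. This is a minimal sketch of PMI over per-document entity sets; the toy documents exist only to show that a rarer pairing can outscore the obvious Apple/iPhone pairing.

```python
import math
from collections import Counter
from itertools import combinations

def pmi_scores(docs: list[set[str]]) -> dict[tuple[str, str], float]:
    """PMI(x, y) = log2( p(x, y) / (p(x) * p(y)) ), estimated from per-document entity sets."""
    n = len(docs)
    entity_counts = Counter(e for doc in docs for e in doc)
    pair_counts = Counter(tuple(sorted(p)) for doc in docs for p in combinations(doc, 2))
    scores = {}
    for (x, y), c_xy in pair_counts.items():
        p_xy = c_xy / n
        p_x, p_y = entity_counts[x] / n, entity_counts[y] / n
        scores[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return scores

if __name__ == "__main__":
    docs = [
        {"Apple", "iPhone"}, {"Apple", "iPhone"}, {"Apple", "Vision Pro"},
        {"SmallCo", "FCC"}, {"SmallCo", "FCC"}, {"Apple", "iPhone", "FCC"},
    ]
    # The rare SmallCo/FCC pair scores higher than the ubiquitous Apple/iPhone pair.
    for pair, score in sorted(pmi_scores(docs).items(), key=lambda kv: -kv[1]):
        print(pair, round(score, 2))
```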
Past getting into some concepts I feel out of the scope of this overview, I’ll leave it at that.
Data: the backbone
So what do I do with all this data about hot new tech items?
I hoard it.
I will have multiple databases (PostgreSQL, ClickHouse, Neo4j, MinIO)
All data operations will be fed through data handlers. One will handle Neo4j operations, one PostgreSQL, which will be used to store artifact data (basically a metadata registry), and two will handle ClickHouse (HistoricalSilo and CorrelationSilo). It’s a lot, but each DB has its own strength, and I believe a simple “SQL Server for everything” would have significant drawbacks.
Data structures, good tables and primary keying will be paramount in ClickHouse (complex queries among other things). The ArtifactSilo will be significantly easier, though it will definitely require a lot of care. It will be a source of much contemplation, tears and frustration. A good design will pay off in spades. I’m approaching this later since I feel I’ll have a much better idea of what I need the further into the system I get.
Neo4j is another beast. I feel as long as my correlator isn’t phoning it in, it should be relatively painless (famous last words). My feeling is that I essentially want to try and make it as dumb as possible. I want to be able to point to my correlation engine and understand the “why?” If I started adding layers of complexity and correlation logic, the data becomes more coupled and that detracts from the value of my correlation engine.
The HistoricalSilo will be a ClickHouse DB holding a lot of granular data, things like:
- Where we got good data
- What search queries yielded the best data
- What harvesting methods worked the best for which data source
- Where/when and potentially why we got dirty data
- Analytics about that dirty data
There’s most likely much more, and I will find them when I get to that point.
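To make the HistoricalSilo idea a little more tangible, here’s a minimal sketch using the clickhouse-connect Python client. The database, table, columns, and ordering key are all hypothetical and up for revision; it just shows the kind of per-harvest-run facts the FeedbackService could learn from.

```python
from datetime import datetime, timezone
import clickhouse_connect  # pip install clickhouse-connect

# Hypothetical HistoricalSilo table; names, types, and ordering key are not final.
DDL = """
CREATE TABLE IF NOT EXISTS historical_silo.harvest_runs
(
    run_at          DateTime,
    source          String,
    search_query    String,
    harvest_method  LowCardinality(String),
    artifacts_total UInt32,
    artifacts_dirty UInt32,
    dirty_reason    String
)
ENGINE = MergeTree
ORDER BY (source, run_at)
"""

client = clickhouse_connect.get_client(host="localhost", username="default", password="")
client.command("CREATE DATABASE IF NOT EXISTS historical_silo")
client.command(DDL)

# One row per harvest run, so the FeedbackService can learn which sources and methods pay off.
client.insert(
    "historical_silo.harvest_runs",
    [[datetime.now(timezone.utc), "archive.org", "switch 2 specs", "level1_aiohttp", 120, 3, "yara_hit"]],
    column_names=["run_at", "source", "search_query", "harvest_method",
                  "artifacts_total", "artifacts_dirty", "dirty_reason"],
)
```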
The MinIO cluster will be way less painful to implement than the others. I still do need to make sure everything is belt and suspenders.
The databases will be an intense experience. There will be a ton more. Bit by bit though.
GUI: webapp time!
The GUI will be a webapp. Initially I was going to make this a desktop app. I realized, though, that eventually I want more people to use it. PySide6 wouldn’t be a great option for that.
Using a webapp, I get access to such a wide variety of libraries. I have incredible access to resources that may not be available if I used a desktop GUI. When I initially settled on PySide6, my goals were a lot different. I honestly just didn’t want to write a GUI in Python. I have no good reason as to why. It’s perfectly capable. It was just a personal preference.
Having that nagging feeling in my gut, I searched for other GUI options. I found a LOT of GUI projects were abandoned. To add to that, finding good examples of what people built with the GUI libraries was difficult if not impossible. I could definitely have just pushed forward, but I didn’t want to pick something, put in the work, and then come to the realization that my vision isn’t possible with a certain GUI.
So I went with a webapp. There is a lot of benefit to it, but now I’ll have to be really on top of security. However, I won’t have to worry about that complexity until I believe I’m ready to show my project, and just maybe by then I could find some cool dudes to code with.
Basically, the GUI talks to the GUIHandler, which talks to the LogHandler, ArtifactHandler and HistoricalHandler, and can issue ControlEvents like running certain jobs. ControlEvents will have to be locked down and deliberate in how they put jobs into Kafka.
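As an illustration of what a ControlEvent might look like on the wire, here’s a minimal sketch using kafka-python. The topic name, event fields, and broker address are placeholders; the real schema isn’t pinned down yet.

```python
import json
import uuid
from datetime import datetime, timezone
from kafka import KafkaProducer  # pip install kafka-python

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",  # placeholder broker address
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def request_summary(artifact_sha256: str, requested_by: str) -> None:
    """GUIHandler-style ControlEvent: ask for a Phi-4 summary of one artifact."""
    event = {
        "event_id": str(uuid.uuid4()),
        "event_type": "REQUEST_PHI4_SUMMARY",   # hypothetical event type
        "artifact_sha256": artifact_sha256,
        "requested_by": requested_by,           # audited, so no anonymous control events
        "requested_at": datetime.now(timezone.utc).isoformat(),
    }
    producer.send("control-events", event)      # hypothetical topic name
    producer.flush()

request_summary("a" * 64, "bill")
```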
We will have to be able to serve all types of rich media.
It feels more prudent to just do a webapp.
Final words: last considerations
I didn’t cover everything. This post is now closing in on 4.5k words. One thing I want to add is my choice of Kafka. Kafka for this project right now is indeed overkill. It wasn’t my initial choice. However, I ran into a snag during development when my initial choice became untenable. So, Kafka is where I landed.
An added bonus is that it looks good on a resume. If I ever decide to try and be a developer.
I won’t.
But, it would look nice.
There is a ton of work ahead of me to be able to breathe life into my love of tech trends.
Do I need to do any of this?
No.
I just think it’s incredible fun.
All architecture and flow choices are subject to change. Beyond small illustrative sketches like the ones above, I will not be providing code on this blog (I’ll save your eyes).
There are tradeoffs everywhere.
- When do I scale Kafka?
- Do I implement a resource orchestrator so I’m not burning my rig?
- How granular do I get with defining “valuable” data?
- What do I do within the system to purge useless data?
- Will I need late night sessions burning darts?
- What do I do if compromised?
- How do I offset data poisoning?
However daunting, I have a secret weapon: time and no boss to ride me about failing.
This will take years.
And that’s okay.
This project may be outwardly insane and ambitious to the reader.
I’m self aware enough to acknowledge that.
Though I want to say that I’m incredibly interested in all the domains of knowledge within the system itself. It’s a long marathon, not a 100-meter sprint.
I want to leave on a lesson learned from Mr. Spruce. Spruce is a man who changed the UPS headquarters address to his own, an apartment in Chicago. This went on for months, during which Mr. Spruce was able to deposit ~$65k that was earmarked for UPS into his account.
How does this fit?
A lesson I learned from this story is audacity. To have a complete disregard for a logical ceiling in what is possible. Mr. Spruce did not worry himself with questions about whether or not he could indeed change the address of the world’s largest logistics company to his own apartment. He just did.
While I feel like I can definitely shed Mr. Spruce’s lack of impulse control and absence of foresight, I can internalize the audacity to try. Having a complete and utter disregard for what a consensus may deem “feasible” I am able to embark on a journey of learning untethered by a tradition steeped in reason that unequivocally says “you can’t”.
Maybe I can’t. I’d rather fail large than to not try. For that, I must embody Mr. Spruce’s approach of totally not giving a fuck.
If you stumbled across this blog, I hope you may have learned something.
Much love,
Bill “Wizard” Anderson