NOTE: Architecture is in active evolution: events through Kafka, bytes via a Media Gateway into MinIO, analytics in ClickHouse, and a thin Read API for the GUI. Ingest writes WARCs; summaries are sidecar objects keyed by the exact text hash. This post tracks the big ideas; fine-grained details (Kafka topics/schemas, scoring features, RBAC, DLQ) will land as they stabilize.
BillsTechDeck
Many times in life we must do something not because it is easy, but because it is hard. I am in one of those spaces. My dream: building a program to fetch and correlate tech news. Interested in next-gen augmented reality? BillsTechDeck can help you find information on it! The world is wide open for the kinds of gadgets and tech announcements you can pull from a correlated source, letting you evaluate tech trends and get an overall picture that might elude you even after vigorous googling. It’s a huge breather to be able to see the “big picture”.
It’s a problem I ran into while looking into information on the Vision Pro from Apple, or when I was awaiting new Switch 2 news. Too many pieces to the jigsaw, and it was frustrating.
Let’s start with an overview of the system by showing a (rough) flow diagram. Some considerations: all arrows pointing back to the message queue are flowing to different Kafka queues. Also, consider every subsystem to be in a Docker container, orchestrated by Kubernetes, and in a CI/CD pipeline (I didn’t include that in the graph because it would be too busy).
Basically I want news, trends, sources, analysis, and summaries in one place, building a coherent volume of data from which I can understand the zeitgeist of tech news, gadgets and trends.
My attempt at something coherent:
Now, I’m just a hobbyist. So I make no claim that I know anything. I’m having fun, which is the true joy. Let’s take a high-level dive into this system.
I’ve redone this sooooo much
Let’s break down the steps (leaving out Kafka-related steps):
- The Harvester takes input from the IngressOrca, which is based on info from the FeedbackService. It utilizes MinIO as a way to piece together WARCs and to handle rich media in a distributed way. Media gets stored in the MinIO cluster (MediaCluster), which operates on SHA-256 and SHA-1 keys (SHA-1 in the case of WARCs, though those get SHA-256 hash keys too for system congruency)
- The Sanitizer gets jobs queued up from the Harvester and draws media from the MediaGateway to sanitize. If it’s dirty, we still keep it to perform forensics in a controlled environment
- OCR gets run on rich media based on jobs from Kafka
- spaCy (NER) gets run on everything. spaCy submits a coupled job to the spaCy sanity checker; if the output is sane, it gets sent to a scoring service that decides whether an automatic Phi-4 summary is warranted, and if not warranted it just gets sent to the Correlation Engine. If insane, the data gets sent to an InsanityHandler (not shown for brevity) to be used for analytics or get human checking.
- Phi4 gets run on specific pieces of data either warranted by the scoring service or user initiated.
- The Correlation Engine gets run on everything
- Every subsystem will have robust auditing and submits to log queues to be handled by the LogHandler and stored in the LogSilo/Elasticsearch
- The subsystems/handlers needing access to media will interact with the MinIO cluster through RESTful calls to a MediaGateway. The MediaCluster (MinIO cluster) only talks to the MediaGateway
- Different handlers speak to the GUIHandler which displays to the GUI. The GUIHandler can submit ControlEvents (asking for phi4 summarization, tweaking stuff)
- The FeedbackService talks to the HistoricalHandler to pull from the HistoricalSilo and train a model that gives better information to the IngressOrca (orchestrator), so it can better choose when, where and how to pull information and cut down on resources wasted on bunk data
- All data gets stored in a MinIO cluster (MediaCluster) and accessed through RESTful calls
- Calls to the GUIHandler are RESTful as well
- All subsystems are containerized and orchestrated by Kubernetes
I'm sure I left out some detail, but that's the gist.
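To make the keying and job flow a little more concrete, here is a minimal sketch in Python. The bucket and field names are hypothetical placeholders, not the final schema; the only real constraint it illustrates is keying artifacts by the SHA-256 of their exact bytes.

```python
import hashlib
import json
from datetime import datetime, timezone

def make_artifact_key(payload: bytes) -> str:
    """Key every artifact by the SHA-256 of its exact bytes, matching the MinIO keying scheme."""
    return hashlib.sha256(payload).hexdigest()

def make_sanitize_job(payload: bytes, source_url: str) -> dict:
    """Build a hypothetical job message the Harvester could queue up for the Sanitizer."""
    return {
        "artifact_sha256": make_artifact_key(payload),
        "source_url": source_url,
        "harvested_at": datetime.now(timezone.utc).isoformat(),
        "media_type": "text/html",  # placeholder; the real value comes from the fetch
        "bucket": "raw-unvetted",   # hypothetical quarantine bucket name
    }

if __name__ == "__main__":
    job = make_sanitize_job(b"<html>Steam Deck 2 rumors...</html>", "https://example.com/article")
    print(json.dumps(job, indent=2))
```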
The Harvester: gathering data
How do we gather our news about the slick Pixel Fold 3?
We need to pull it from online! Various sources require various data harvesting methods. Big problems we run into are bot detection, DDoS filtering, CAPTCHAs, and malformed information. All sites have specific structures too (how complicated). Luckily we have an incredible ecosystem behind data harvesting.
Python is an incredible language to use for this purpose and has a vibrant community working diligently to help hobbyists like me grab the crucial information on the Steam Deck 2 tech specs and other chatter about it (how incredible). There are a lot of considerations. So we have to take things in steps:
- Recon
- What is the site structure?
- What is the site’s flow?
- What tricks are companies like Akamai pulling to impede my ability to get my precious tech snippets?
- What values change and where? When does my cookie become invalid depending on an abnormal flow?
- Does the JavaScript try to fool me? Is it dynamic, obfuscated, or does it check for tampering?
- Are my user agents okay and when do I rotate them?
- How do I handle headers?
- How do I handle TLS Fingerprinting?
- This list is getting long so I’ll just add “heuristics”
This is a very involved process and requires a good amount of attention. So targets for my tech news have to be curated and scoped in general. Using tools like Caido and mitmproxy to gather valuable information about sites and their heuristics is important.
- CAPTCHAs
- Traditional CAPTCHAs: image recognition tasks
- ReCAPTCHA: machine learning looking at user behavior to determine bot behavior
- Invisible CAPTCHAs: pesky things that run in the background, set up by grumpy site admins looking to stop me
While a smaller list, these are definitely huge hurdles, and it is by no means exhaustive. All of these problems require complex solutions. Complex solutions that have to morph constantly.
I could go on, but I’ll just add things like using reputable residential proxies, mobile proxies, rate limiting, acknowledging and mitigating device fingerprinting, and lastly honeypots. The last thing I need is to waste resources on useless honeypots!
So we need to have different approach levels:
Graduated Response Crawling Strategy: “test with a pellet gun, escalate to ordnance if it’s fubar’d.”
- Level 1: Pellet Gun (aiohttp, Scrapy)
  - Use for static pages, public APIs, or weakly protected endpoints. Low noise, low cost.
- Level 2: Scoped Rifle (Playwright + stealth plugins)
  - Use for JS-rendered sites, light bot defenses, simple CAPTCHAs. Mimics real users, simulates browser behavior.
- Level 3: Ordnance (Crawl4AI/Nodriver, heavy CAPTCHA solving, mobile proxies)
  - Use when you hit: invisible CAPTCHAs, anti-bot JavaScript puzzles, DOM obfuscation, or flow-control defenses. Heavy but necessary for hard targets.
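To make the escalation concrete, here’s a minimal Python sketch of Levels 1 and 2, assuming aiohttp and Playwright are installed. The escalation trigger is deliberately dumb (any failure); the real logic would look at status codes, bot-wall markers, and the HistoricalSilo.

```python
import asyncio
import aiohttp
from playwright.async_api import async_playwright

async def fetch_level1(url: str) -> str | None:
    """Level 1: plain async HTTP. Cheap and quiet; good enough for static pages."""
    try:
        async with aiohttp.ClientSession() as session:
            async with session.get(url, timeout=aiohttp.ClientTimeout(total=15)) as resp:
                if resp.status == 200:
                    return await resp.text()
    except (aiohttp.ClientError, asyncio.TimeoutError):
        pass
    return None  # signal the caller to escalate

async def fetch_level2(url: str) -> str:
    """Level 2: a real browser via Playwright for JS-rendered or lightly defended pages."""
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()
        await page.goto(url, wait_until="networkidle")
        html = await page.content()
        await browser.close()
        return html

async def fetch(url: str) -> str:
    """Try the pellet gun first; only reach for heavier tooling when it fails."""
    html = await fetch_level1(url)
    if html is not None:
        return html
    return await fetch_level2(url)  # Level 3 (Crawl4AI/Nodriver, mobile proxies) would slot in after this

if __name__ == "__main__":
    print(asyncio.run(fetch("https://example.com"))[:200])
```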
Why This Matters
- Efficiency: Don’t burn Playwright cycles when curl works.
- Stealth: Avoid raising alarms unnecessarily.
- Longevity: Run for months without bans, not weeks.
Though now we introduce complexity, which is fine. At the start we will have very simple rules. As the system grows and the HistoricalSilo becomes more robust, we can make better calls to better places, since we’d have historical data and patterns to guide us.
This part of the system is arguably the most essential and will be one requiring constant updating because of the cat and mouse game between weenies running sites keeping me from my sweet, sweet Samsung news.
I’ve come up with a plan to be able to ingest, ingest, ingest and verify before I really have to worry about pulling real-time data. The current plan is to pull data from archive.org (at a throttled rate and politely, of course). Going this way, I rewrote the Internet Archive Python wrapper to be async and non-blocking.
If I just started pulling lots of timely data, I wouldn’t have a good assurance my correlations mean anything. Historical data gives me far more assurance and allows me to verify information with 20/20 hindsight.
This approach allows me to ingest and focus on the rest of the system without having to build a crawler that will require a lot of changes. I feel building a crawler would eat too much time at the start and leave the rest of the system derelict.
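My rewritten wrapper isn’t shown here, but as a rough illustration of the kind of throttled, polite, async pulling I mean, here’s a minimal sketch against the public Wayback CDX API using plain aiohttp. The user agent and sleep interval are placeholders.

```python
import asyncio
import aiohttp

CDX_API = "https://web.archive.org/cdx/search/cdx"

async def list_snapshots(session: aiohttp.ClientSession, url: str, limit: int = 5) -> list[list[str]]:
    """Ask the Wayback CDX API for a few captures of a URL (output=json returns a header row plus data rows)."""
    params = {"url": url, "output": "json", "limit": str(limit)}
    async with session.get(CDX_API, params=params) as resp:
        rows = await resp.json(content_type=None)
    return rows[1:]  # drop the header row

async def fetch_capture(session: aiohttp.ClientSession, timestamp: str, original: str) -> bytes:
    """Fetch one archived capture; the caller is responsible for pacing requests."""
    async with session.get(f"https://web.archive.org/web/{timestamp}/{original}") as resp:
        return await resp.read()

async def main() -> None:
    headers = {"User-Agent": "BillsTechDeck/0.1 (hobby project)"}  # placeholder UA
    async with aiohttp.ClientSession(headers=headers) as session:
        for row in await list_snapshots(session, "example.com"):
            timestamp, original = row[1], row[2]
            body = await fetch_capture(session, timestamp, original)
            print(timestamp, original, len(body))
            await asyncio.sleep(2)  # throttle: be polite to archive.org

if __name__ == "__main__":
    asyncio.run(main())
```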
No data should be trusted: the art of people looking to poison your system
What’s the problem with taking data from the internet?
Well, anyone who has been on the internet for any length of time knows about the dirty trolls. Actors who are out to hose you and your noble goal of getting the new smartphone information. Because people want to pwn you, you have to assume the worst.
Let’s highlight some concerns (not an exhaustive list, just a taste)
- Malice in action
- Javascript Payloads (XSS, Embedded goodness, etc)
- Worry about data exfiltration
- Browser Exploits
- Redirection and Phishing
- PDF Macros and Embedded Objects
- Can do spooky things like “remote code execution”
- Info disclosures
- Initiate connections to scary C2’s
- Handling various filetypes
- Office Document macros
- EXE/DLL (less of a concern since they’d be filtered)
- Malicious archive files that contain executables and path traversals
- Image/Media files: hiding steganography or utilizing dirty dirty codecs
- Data Integrity
- Tampered data
- Spoofed sources
- People looking to poison my system with generally bad data
So how do we deal with this? Some things I left off this list (like servers trying to DDoS my harvester by serving up tons of unnecessary data to hurt my feelings).
We first off want to isolate and contain all data we haven’t vetted. A separate black box that either resides on a different network segment or is air-gapped. While VLAN hopping does occur, that risk has to be weighed against the caveats that come with air-gapping (which I won’t bore anyone with).
One level is running YARA rules on a file. Which is fine, and a great starting point. We have tools for macro analysis. We have PDF analysis tools. We can verify files are what they claim to be (making sure the dirty trolls aren’t hiding exe’s). We have static code analysis. We check hashes against threat feeds.
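As a flavor of that first static layer, here’s a minimal Python sketch using yara-python plus a hash lookup. The rule and the “known bad” hash set are toy placeholders; real rules and hashes would come from curated feeds.

```python
import hashlib
import yara  # pip install yara-python

# A toy rule; real rule sets would come from curated feeds.
RULES = yara.compile(source=r"""
rule suspicious_payload
{
    strings:
        $mz = "MZ"                      // PE magic at offset 0 suggests a smuggled executable
        $ps = "powershell -enc" nocase  // encoded PowerShell is rarely good news
    condition:
        $mz at 0 or $ps
}
""")

KNOWN_BAD_SHA256 = {"0" * 64}  # placeholder stand-in for hashes pulled from a threat feed

def triage(payload: bytes) -> dict:
    """Cheap static checks: YARA matches plus a hash lookup. Dirty data still gets kept, just quarantined."""
    digest = hashlib.sha256(payload).hexdigest()
    matches = RULES.match(data=payload)
    return {
        "sha256": digest,
        "yara_hits": [m.rule for m in matches],
        "known_bad_hash": digest in KNOWN_BAD_SHA256,
        "quarantine": bool(matches) or digest in KNOWN_BAD_SHA256,
    }

if __name__ == "__main__":
    print(triage(b"MZ\x90\x00 definitely not an article"))
```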
We also have Cuckoo at the other extreme. It provides dynamic analysis, behavioral reporting, threat detection… but it comes with significant caveats, so it won’t be implemented until we get past the Internet Archive phase.
Lastly, we have to worry about data poisoning. I don’t have a clear path on how to handle this. There is a breadth of research papers I am going to go through to better understand the problem and approaches.
No one said safety is easy. I write this not as a definitive account of what I’m doing, but more to highlight the staggering number of ways bad hombres can compromise me and my system.
I can only make it as complicated for them as I can.
With that in mind, I am designing this part with Rust. Performance, memory safety and I just like it a lot. This will be a Tokio job. Media will be fetched and posted to the MediaGateway to interact with the MediaCluster (MinIO cluster)
In conclusion:
For the majority of time, bad actors are looking for low hanging fruit. The further I can put the sweet, sweet apples up the tree and minimize my attack surface the better.
If the data is skanky we quarantine it so we can analyze it. We document it and store the analytics revolving around it in the HistoricalSilo.
Phi4-medium: summarizing for busy people like me
LLMs come with a lot of challenges, resource-wise and content-wise. However, they also have the ability to give us cogent summaries of potentially lengthy pieces of information. That’s why I’m using Phi4-medium (I needed something more robust).
Why would I choose this?
- Goldilocks size and performance
- Medium is bigger than mini. Medium has 14 billion parameters.
- Competitive enough with larger models but more efficient
- Optimized for my use cases
- Suitable for local deployments
- Cost effective (since I’m a lowly cabbage farmer)
- Flexibility in deployments
I need something local and powerful, and it fits the bill. Having it be its own Docker image makes it easy. Another positive is my ability to fine tune it (for my greedy need for information on the new iPhone).
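For a sense of how the summarization service might be called, here’s a minimal sketch that assumes the model is served locally behind an OpenAI-compatible endpoint (as vLLM or similar runtimes can provide). The URL, model name, and prompt are placeholders, not the final setup.

```python
import requests

# Placeholder endpoint: assumes Phi-4 is served locally behind an OpenAI-compatible API.
PHI4_URL = "http://localhost:8000/v1/chat/completions"

def summarize(text: str, max_words: int = 150) -> str:
    """Ask the local model for a bounded-length summary of one sanitized article."""
    payload = {
        "model": "phi-4",  # placeholder model name
        "messages": [
            {"role": "system", "content": "You summarize tech news accurately and concisely."},
            {"role": "user", "content": f"Summarize in at most {max_words} words:\n\n{text}"},
        ],
        "temperature": 0.2,  # keep it boring; creativity invites hallucination
    }
    resp = requests.post(PHI4_URL, json=payload, timeout=120)
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

if __name__ == "__main__":
    print(summarize("Apple announced the Vision Pro, a mixed reality headset priced at $3,499..."))
```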
Caveats!
- Hallucination
- My own guys are working against me! *sigh* Tis the cost of doing business. For this I may have to implement a RAG system.
- English
- I’m pigeonholing myself into consuming English. In the end this is not an overall large deal since I’m not multilingual. Though it adds complexity should I want to expand data sources to places I can’t understand
So what does a headstrong cabbage farmer like me do?
Sanity checks.
- Things like volume yields
- Meaning: Checks if the summary’s length is reasonable.
- Did Phi-4 produce a 150-word summary as requested, or did it return a single sentence or a 10-page novel?
- Cardinality or categorical value checks.
- Meaning: Checks if the entities (people, places, etc.) in the summary are a valid subset of the entities in the original article. Primary defense against hallucination.
- Does the summary mention ‘Germany’ when the source text only ever mentioned ‘France’?
- Completeness and fill rate checks.
- Meaning: Checks for the omission of critical information.
- The original article mentioned three key companies, but the summary only includes one. Is the summary missing vital information?
- Uniqueness checks
- Meaning: Checks for repetitive or redundant content within the summary.
- Did the model get stuck in a loop and repeat the same sentence three times?
- Range checks.
- Meaning: Checks if numerical data in the summary is factually correct based on the source.
- The source text says profits were ‘$5 million,’ but the summary says ‘$5 billion.’ Is this a catastrophic numerical error?
- Presence checks
- Meaning: The most basic check: did the service return anything at all?
- Did the Phi-4 service time out or return an empty string instead of a summary?
- Data type validation checks.
- Meaning: Checks if the summary adheres to the requested structure.
- I asked for a JSON object with a ‘title’ and ‘key_points’ array. Is the output valid JSON with those exact keys?
- Consistency checks
- Meaning: The deepest check for factual grounding and logical contradiction.
- The source text says ‘the project was cancelled,’ but the summary implies it’s ongoing. Does the summary contradict the facts of the original article?
This list can quickly become like Benjamin Buford Blue naming uses for shrimp so I’ll top it off there.
This will be auto-run based on the scoring service or manually requested by moi.
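To show what a couple of these checks could look like in practice, here’s a toy Python sketch. The entity sets would really come from the spaCy stage and the thresholds are made up; it’s illustrative, not the production checker.

```python
def length_check(summary: str, target_words: int = 150, tolerance: float = 0.5) -> bool:
    """Volume yield: the summary should land within a reasonable band of the requested length."""
    n = len(summary.split())
    return target_words * (1 - tolerance) <= n <= target_words * (1 + tolerance)

def entity_subset_check(summary_entities: set[str], source_entities: set[str]) -> set[str]:
    """Cardinality check: summary entities must be a subset of source entities.
    Returns the offenders (possible hallucinations); an empty set means the check passed."""
    return summary_entities - source_entities

def uniqueness_check(summary: str) -> bool:
    """Uniqueness: fail if the model looped and repeated the same sentence."""
    sentences = [s.strip().lower() for s in summary.split(".") if s.strip()]
    return len(sentences) == len(set(sentences))

if __name__ == "__main__":
    source = {"Apple", "Vision Pro", "France"}
    summary = {"Apple", "Vision Pro", "Germany"}   # 'Germany' never appeared in the source
    print(entity_subset_check(summary, source))    # {'Germany'} -> flag for review
```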
Grabbing Entities with spaCy: grabbing the pertinent things
We are at the spaCy section.
Which model do I choose? spaCy offers a variety of pretrained models, all with their own uses. They are trained on general web content, so out of the box they won’t recognize tech jargon. I will likely need to fine tune a custom NER model and add custom components. At the start I will need to annotate data to train my model (there are open source tools to somewhat automate this process). This will also encompass training it to recognize custom entity types.
I will need to be fluent in rule-based matching (Matcher and EntityRuler). I will need to go in and do entity linking and disambiguation (e.g. “Apple” the company vs. “apple” the fruit). With that comes the possibility of building a custom entity linking component or integrating an external tool (hopefully not).
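As a small taste of the rule-based side, here’s a minimal spaCy sketch that seeds the pipeline with a few tech terms the stock model won’t know. The patterns and labels are examples, not my final entity inventory, and it assumes en_core_web_sm is installed.

```python
import spacy

# Assumes the small English model is installed: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

# Seed the pipeline with tech jargon the stock model won't recognize.
ruler = nlp.add_pipe("entity_ruler", before="ner")
ruler.add_patterns([
    {"label": "PRODUCT", "pattern": "Vision Pro"},
    {"label": "PRODUCT", "pattern": [{"LOWER": "steam"}, {"LOWER": "deck"}, {"LIKE_NUM": True}]},
    {"label": "ORG", "pattern": "BillsTechDeck"},
])

doc = nlp("Apple's Vision Pro launch overshadowed chatter about the Steam Deck 2.")
for ent in doc.ents:
    print(ent.text, ent.label_)
```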
Since I’m only worried about English at the moment, I am blessed to be ignorant of language detection.
Past that I will need to consider performance matters like batch processing and component disabling. When a component is not in use, turn it off!
Considering possible parallel processes running alongside Phi-4, I’ll have to weigh CPU-based models against GPU-based models, and also plan for considerable RAM utilization.
There’s pre-processing, post-processing and possibly integrating external logic and models. The use of custom attributes will be a must. I will have to plan for out-of-domain text, which I will inevitably run into and which is crucial for me to know how to handle.
Lastly, and almost most importantly:
Sanity checks.
- Schema validation
- Verifying correct data types
- Paying close attention to the behavior around critical fields
- Defining expected data types
- Establishing acceptable ranges with things like dates and word counts
- Define allowed values
- Define completeness thresholds
- Consideration of cross field consistency rules
A lot of the above mentioned sanity check stuff applies here, but in a more granular sense dealing with entities. The list goes on, and again, it becomes listing uses for shrimp to Forrest Gump.
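To sketch what schema validation might look like for one extracted entity record, here’s a small example using pydantic. The field names, labels, and thresholds are hypothetical; the point is just that records get checked for types, ranges, and allowed values before anything downstream touches them.

```python
from datetime import datetime
from pydantic import BaseModel, Field, ValidationError  # pip install pydantic

ALLOWED_LABELS = {"ORG", "PRODUCT", "PERSON", "GPE", "EVENT"}  # illustrative allow-list

class EntityRecord(BaseModel):
    """Hypothetical shape of one extracted entity; field names are placeholders."""
    artifact_sha256: str = Field(min_length=64, max_length=64)
    text: str = Field(min_length=1)
    label: str
    confidence: float = Field(ge=0.0, le=1.0)
    extracted_at: datetime

def validate_record(raw: dict) -> EntityRecord | None:
    """Schema, range, and allowed-value checks before anything reaches the Correlation Engine."""
    try:
        record = EntityRecord(**raw)
    except ValidationError as err:
        print("rejected:", err.errors())
        return None
    if record.label not in ALLOWED_LABELS:
        print("rejected: unexpected label", record.label)
        return None
    return record

if __name__ == "__main__":
    validate_record({"artifact_sha256": "a" * 64, "text": "Pixel Fold 3", "label": "PRODUCT",
                     "confidence": 0.93, "extracted_at": "2025-01-01T00:00:00Z"})
```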
I feel okay about the completeness of this section.
Data correlation: making sense of things
Data correlation in this system is incredibly important. I need a language that can provide me some memory guarantees as well as stop me from making newbie mistakes. I drifted towards C++ at first. I thought it through and arrived back at Rust. I’m simply not an experienced C++ programmer and would likely implement things that would hose my system.
Basically, Rust takes entities from spaCy and connects the dots. It will utilize ClickHouse to write/read/store pertinent things. I needed some real granularity and functionality for statistics in correlation. An earlier draft incorporated RocksDB, which wasn’t robust enough with recent developments.
So stats will be important (yay!).
An idiomatic way of coding is key and I’ll need to be very deliberate with what I do, why I do it and how I implement things. I’m going to be using Tokio for this part since I will have a lot of I/O processes talking to ClickHouse.
We basically take all entities, run rich analysis on them, and compare it to historical data.
I consider the following things:
- Is this relationship statistically significant?
- Is this correlation more than just “chance”?
- Is this significance worth creating a graph relationship with?
- Is there factual backing to put emphasis on this specific relationship?
So I’d need to do things like establish a p-value for connections. It’d also be a good idea to establish Pointwise Mutual Information (PMI), a measure that scores how much more likely two entities are to appear together than by random chance. High positive scores point to a meaningful association, while negative scores tell me two entities show up together less often than chance would predict, which is informative in its own way.
Using stats is essential for filtering out noise. For instance, the entities ‘Apple’ and ‘iPhone’ will appear together thousands of times, but this connection is obvious and not particularly insightful. Statistics help us prove that a rarer connection, like a specific tech company and a government agency, is far more significant even if it only appears a few times. Also, thinking of the White House: it’s not significant because it’s a white building.
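The engine itself will be Rust, but the statistic is easier to show in a few lines of Python. This is a minimal sketch of PMI over per-document entity sets; the toy documents exist only to show that a rarer pairing can outscore the obvious Apple/iPhone pairing.

```python
import math
from collections import Counter
from itertools import combinations

def pmi_scores(docs: list[set[str]]) -> dict[tuple[str, str], float]:
    """PMI(x, y) = log2( p(x, y) / (p(x) * p(y)) ), estimated from per-document entity sets."""
    n = len(docs)
    entity_counts = Counter(e for doc in docs for e in doc)
    pair_counts = Counter(tuple(sorted(p)) for doc in docs for p in combinations(doc, 2))
    scores = {}
    for (x, y), c_xy in pair_counts.items():
        p_xy = c_xy / n
        p_x, p_y = entity_counts[x] / n, entity_counts[y] / n
        scores[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return scores

if __name__ == "__main__":
    docs = [
        {"Apple", "iPhone"}, {"Apple", "iPhone"}, {"Apple", "Vision Pro"},
        {"SmallCo", "FCC"}, {"SmallCo", "FCC"}, {"Apple", "iPhone", "FCC"},
    ]
    # The rare SmallCo/FCC pair scores higher than the ubiquitous Apple/iPhone pair.
    for pair, score in sorted(pmi_scores(docs).items(), key=lambda kv: -kv[1]):
        print(pair, round(score, 2))
```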
Past getting into some concepts I feel out of the scope of this overview, I’ll leave it at that.
Data: the backbone
So what do I do with all this data about hot new tech items?
I hoard it.
I will have multiple databases (PostgreSQL, ClickHouse, Neo4j, MinIO)
All data operations will be fed through data handlers. One will handle Neo4j operations, one PostgreSQL, which will be used to store artifact data (basically a metadata registry), and two will handle ClickHouse (HistoricalSilo and CorrelationSilo). It’s a lot, but each DB has its own strength, and I believe a simple “SQL Server for everything” would have significant drawbacks.
Data structures, good tables and primary keying will be paramount in ClickHouse (complex queries among other things). The ArtifactSilo will be significantly easier, though it will definitely require a lot of care. It will be a source of much contemplation, tears and frustration. A good design will pay off in spades. I’m approaching this later since I feel I’ll have a much better idea of what I need the further into the system I get.
Neo4j is another beast. I feel as long as my correlator isn’t phoning it in, it should be relatively painless (famous last words). My feeling is that I essentially want to try and make it as dumb as possible. I want to be able to point to my correlation engine and understand the “why?” If I started adding layers of complexity and correlation logic, the data becomes more coupled and that detracts from the value of my correlation engine.
The HistoricalSilo will be a ClickHouse DB holding a lot of granular data, things like:
- Where we got good data
- What search queries yielded the best data
- What harvesting methods worked the best for which data source
- Where/when and potentially why we got dirty data
- Analytics about that dirty data
There’s most likely much more, and I will find them when I get to that point.
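To make the HistoricalSilo idea a little more tangible, here’s a minimal sketch using the clickhouse-connect Python client. The database, table, columns, and ordering key are all hypothetical and up for revision; it just shows the kind of per-harvest-run facts the FeedbackService could learn from.

```python
from datetime import datetime, timezone
import clickhouse_connect  # pip install clickhouse-connect

# Hypothetical HistoricalSilo table; names, types, and ordering key are not final.
DDL = """
CREATE TABLE IF NOT EXISTS historical_silo.harvest_runs
(
    run_at          DateTime,
    source          String,
    search_query    String,
    harvest_method  LowCardinality(String),
    artifacts_total UInt32,
    artifacts_dirty UInt32,
    dirty_reason    String
)
ENGINE = MergeTree
ORDER BY (source, run_at)
"""

client = clickhouse_connect.get_client(host="localhost", username="default", password="")
client.command("CREATE DATABASE IF NOT EXISTS historical_silo")
client.command(DDL)

# One row per harvest run, so the FeedbackService can learn which sources and methods pay off.
client.insert(
    "historical_silo.harvest_runs",
    [[datetime.now(timezone.utc), "archive.org", "switch 2 specs", "level1_aiohttp", 120, 3, "yara_hit"]],
    column_names=["run_at", "source", "search_query", "harvest_method",
                  "artifacts_total", "artifacts_dirty", "dirty_reason"],
)
```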
The MinIO cluster will be way less painful to implement than the others. I still do need to make sure everything is belt and suspenders.
The databases will be an intense experience. There will be a ton more. Bit by bit though.
GUI: webapp time!
The GUI will be a webapp. Initially I was going to make this a desktop app. I realized, though, that eventually I want more people to use it. PySide6 wouldn’t be a great option for that.
Using a webapp, I get access to such a wide variety of libraries. I have incredible access to resources that may not be available if I used a desktop GUI. When I initially settled on PySide6, my goals were a lot different. I honestly just didn’t want to write a GUI in Python. I have no good reason as to why. It’s perfectly capable. It was just a personal preference.
Having that nagging feeling in my gut, I searched for other GUI options. I found a LOT of GUI projects were abandoned. To add to that, finding good examples of what people built with the GUI libraries was difficult if not impossible. I could definitely have just pushed forward, but I didn’t want to pick something, put in the work, and then come to the realization that my vision isn’t possible with a certain GUI.
So I went with a webapp. There is a lot of benefit to it, but now I’ll have to be really on top of security. However, I won’t have to worry about that complexity until I believe I’m ready to show my project, and just maybe by then I could find some cool dudes to code with.
Basically, the GUI talks to the GUIHandler, which talks to the LogHandler, ArtifactHandler and HistoricalHandler, and can issue ControlEvents like running certain jobs. ControlEvents will have to be locked down and deliberate in how they put jobs into Kafka.
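As an illustration of what a ControlEvent might look like on the wire, here’s a minimal sketch using kafka-python. The topic name, event fields, and broker address are placeholders; the real schema isn’t pinned down yet.

```python
import json
import uuid
from datetime import datetime, timezone
from kafka import KafkaProducer  # pip install kafka-python

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",  # placeholder broker address
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def request_summary(artifact_sha256: str, requested_by: str) -> None:
    """GUIHandler-style ControlEvent: ask for a Phi-4 summary of one artifact."""
    event = {
        "event_id": str(uuid.uuid4()),
        "event_type": "REQUEST_PHI4_SUMMARY",   # hypothetical event type
        "artifact_sha256": artifact_sha256,
        "requested_by": requested_by,           # audited, so no anonymous control events
        "requested_at": datetime.now(timezone.utc).isoformat(),
    }
    producer.send("control-events", event)      # hypothetical topic name
    producer.flush()

request_summary("a" * 64, "bill")
```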
We will have to be able to serve all types of rich media.
It feels more prudent to just do a webapp.
Final words: last considerations
I didn’t cover everything. This post is now closing in on 4.5k words. One thing I want to add is my choice of Kafka. Kafka for this project right now is indeed overkill. It wasn’t my initial choice. However, I ran into a snag during development when my initial choice became untenable. So, Kafka is where I landed.
An added bonus is that it looks good on a resume. If I ever decide to try and be a developer.
I won’t.
But, it would look nice.
There is a ton of work ahead of me to be able to breathe life into my love of tech trends.
Do I need to do any of this?
No.
I just think it’s incredible fun.
All architecture and flow choices are subject to change. Beyond small illustrative sketches like the ones above, I will not be providing code on this blog (I’ll save your eyes).
There are tradeoffs everywhere.
- When do I scale Kafka?
- Do I implement a resource orchestrator so I’m not burning my rig?
- How granular do I get with defining “valuable” data?
- What do I do within the system to purge useless data?
- Will I need late night sessions burning darts?
- What do I do if compromised?
- How do I offset data poisoning?
However daunting, I have a secret weapon: time and no boss to ride me about failing.
This will take years.
And that’s okay.
This project may be outwardly insane and ambitious to the reader.
I’m self aware enough to acknowledge that.
Though I want to say that I’m incredibly interested in all the domains of knowledge within the system itself. It’s a long marathon, not a 100-meter sprint.
I want to leave on a lesson learned from Mr. Spruce. Spruce is a man who changed the UPS headquarters address to his own, an apartment in Chicago. This went on for months, during which Mr. Spruce was able to deposit ~$65k that was earmarked for UPS into his account.
How does this fit?
A lesson I learned from this story is audacity. To have a complete disregard for a logical ceiling in what is possible. Mr. Spruce did not worry himself with questions about whether or not he could indeed change the address of the world’s largest logistics company to his own apartment. He just did.
While I feel like I can definitely shed Mr. Spruce’s lack of impulse control and absence of foresight, I can internalize the audacity to try. Having a complete and utter disregard for what a consensus may deem “feasible” I am able to embark on a journey of learning untethered by a tradition steeped in reason that unequivocally says “you can’t”.
Maybe I can’t. I’d rather fail large than to not try. For that, I must embody Mr. Spruce’s approach of totally not giving a fuck.
If you stumbled across this blog, I hope you may have learned something.
Much love,
Bill “Wizard” Anderson