Gen AI chatbots have changed how we should analyze user intent. Before AI chatbots, we relied more on structured interactions—clicks, impressions, page views. Now, we’re dealing with free-form conversations.
This shift in how intent is expressed creates several challenges, outlined below:
- PII (Personally Identifiable Information) Everywhere: Many financial and healthcare conversations with chatbots contain PII such as SSNs and medical diagnoses.
- Fragmented Signals: User intent now unfolds over multi-turn conversations instead of through single events like clicks and impressions.
Recommendation systems previously assumed structured inputs; with LLMs, they need actual conversation signals to stay useful and to train the models.
A System for Ingesting Chatbot Data
- A real-time PII processor using both regex rules and contextual NLP in the ingest pipeline
- A privacy-aware data warehouse supporting analytics and legal compliance with data encryption
- Conversation metrics that improve models without requiring raw data access
Building a Better Framework
Data Ingestion
Our system ingests chat data from client applications through a high-throughput pipeline. Each message is modeled as:
from datetime import datetime
from typing import Dict, List
from uuid import UUID

from pydantic import BaseModel

class SecureMessage(BaseModel):
    chat_id: UUID                  # Conversation session
    request_id: UUID               # User question identifier
    response_id: UUID              # LLM response identifier
    timestamp: datetime            # Event time
    encrypted_pii: bytes           # GPG-encrypted raw text
    clean_text: str                # De-identified content
    metadata: Dict[str, float]     # Non-PII features (sentiment, intent)
    vector_embedding: List[float]  # Semantic representation (768-dim)
    session_context: Dict          # Device, region, user segment
The PII detection system in the ingestion pipeline does the heavy lifting. It runs four stages (a minimal sketch of the pattern-matching stage follows the list):
- Pattern Matching: More than 150 regex patterns catch common PII formats; the pattern list is config-driven, so it grows as we find new PII formats.
- Named Entity Recognition: A fine-tuned BERT model from Hugging Face scores chat conversations for PII entities that fixed patterns miss.
- Contextual Analysis: Identifies implicit PII that only becomes sensitive in context and would slip past rule-based matching.
- False Positive Reduction: Critical, because over-aggressive redaction destroys analytical utility; detections are validated before any text is redacted.
All detected PII is secured with envelope encryption using rotating AES-256 data keys, with master keys stored in GSM or another cloud secret manager under strict access controls.
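The envelope pattern looks roughly like this. This is a sketch assuming the cryptography library's AES-GCM primitives; in production the wrap step would be a call against the secret manager / KMS rather than a locally held master key:

import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

def encrypt_pii(plaintext: bytes, master_key: bytes) -> dict:
    """Envelope-encrypt one record: fresh AES-256 data key, wrapped by the master key."""
    data_key = AESGCM.generate_key(bit_length=256)
    nonce = os.urandom(12)
    ciphertext = AESGCM(data_key).encrypt(nonce, plaintext, None)
    # Wrap the data key; with a real KMS this is an encrypt call against the
    # master key held in the secret manager, which is never exported.
    wrap_nonce = os.urandom(12)
    wrapped_key = AESGCM(master_key).encrypt(wrap_nonce, data_key, None)
    return {"ciphertext": ciphertext, "nonce": nonce,
            "wrapped_key": wrapped_key, "wrap_nonce": wrap_nonce}

def decrypt_pii(record: dict, master_key: bytes) -> bytes:
    """Unwrap the data key, then decrypt the PII payload."""
    data_key = AESGCM(master_key).decrypt(record["wrap_nonce"], record["wrapped_key"], None)
    return AESGCM(data_key).decrypt(record["nonce"], record["ciphertext"], None)

One benefit of the envelope approach: rotating keys means re-wrapping small data keys rather than re-encrypting payloads, and erasure (crypto-shredding, covered below) means deleting keys rather than rewriting storage.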
Multi-Temperature Storage
Not all data needs the same treatment, so we take a tiered approach to storage:
| Tier | Technology | Retention | Use Case | Access Pattern |
|---|---|---|---|---|
| Hot | Redis + Elasticsearch | 7 days | Real-time A/B testing | High-throughput, low latency |
| Warm | Parquet on Cloud Storage | 90 days | Model fine-tuning | Batch processing, ML pipelines |
| Cold | Compressed Parquet + Glacier | 5+ years | Legal/regulatory audits | Infrequent, compliance-driven |
Data should be partitioned by time, geography, and conversation topic—optimized for both analytical queries and targeted lookups. Access controls enforce least privilege principles with just-in-time access provisioning and full audit logging.
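As an illustration of the warm-tier layout, here is a sketch of writing de-identified records as Parquet partitioned by day, region, and topic using pyarrow; the column names, values, and path are assumptions for the example:

import pyarrow as pa
import pyarrow.parquet as pq

# De-identified conversation records (PII already removed upstream).
records = pa.table({
    "chat_id": ["c1", "c2"],
    "event_date": ["2024-05-01", "2024-05-01"],
    "region": ["us-east", "eu-west"],
    "topic": ["billing", "claims"],
    "clean_text": ["How do I update my plan?", "Status of my claim?"],
    "sentiment": [0.62, 0.18],
})

# Partition columns become directory levels, so analytical queries and
# targeted lookups can prune by date, geography, and topic.
pq.write_to_dataset(
    records,
    root_path="warm_tier/conversations",  # e.g. a cloud storage bucket
    partition_cols=["event_date", "region", "topic"],
)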
Overcoming Technical Hurdles
Building this system has its challenges:
- Scaling Throughput: Kafka consumers are scaled out to hold roughly 100 ms end-to-end latency so models can be powered with real-time data
- Accurate PII Detection: Combining regex rules with contextual NLP kept detection accurate enough to ensure privacy
- Maintaining Data Utility: Semantic preservation techniques (replacing real addresses with similar fictional ones, sketched below) retained 95% analytical utility with zero PII exposure
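A minimal sketch of that address replacement, assuming the Faker library as the generator of fictional values; the per-conversation mapping is illustrative:

from faker import Faker

fake = Faker()
_replacements: dict = {}  # keep the mapping stable so repeated mentions stay consistent

def pseudonymize_address(real_address: str) -> str:
    """Swap a real address for a fictional one of the same general shape."""
    if real_address not in _replacements:
        _replacements[real_address] = fake.street_address()
    return _replacements[real_address]

The point is that downstream parsers and feature extractors still see well-formed addresses, while the link to the real person is broken.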
Measuring What Matters
Hallucination Detection That Actually Works
We calculate a Hallucination Score (H) as:
H = 1 – (sim(R, S) / max(sim(R, D)))
Where:
- R = LLM response
- S = Source documents/knowledge
- D = Knowledge base
- sim() = Cosine similarity between embeddings
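A minimal sketch of that calculation, interpreting max(sim(R, D)) as the maximum similarity between the response and any single knowledge-base document; the embeddings here are placeholders for the 768-dim vectors stored on each message:

import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def hallucination_score(response_emb: np.ndarray,
                        source_emb: np.ndarray,
                        kb_embs: list) -> float:
    """H = 1 - sim(R, S) / max_d sim(R, d); higher H suggests a less grounded response."""
    best_kb = max(cosine(response_emb, d) for d in kb_embs)
    if best_kb <= 0:
        return 1.0  # nothing in the knowledge base resembles the response
    return 1.0 - cosine(response_emb, source_emb) / best_kb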
Conversation Quality Metrics
Our framework tracks the following (two of these are sketched after the list):
- Engagement Depth: Turn count vs. benchmark for intent type
- Resolution Efficiency: Path length to successful resolution
- User Satisfaction: Both explicit feedback and implicit signals (repeats, abandonment)
- Response Relevance: Coherence between turns and contextual adherence
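As an illustration, here is a sketch of engagement depth and resolution efficiency; the Turn structure, field names, and benchmark inputs are assumptions for the example:

from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Turn:
    role: str        # "user" or "assistant"
    resolved: bool   # whether this turn resolved the user's intent

def engagement_depth(turns: List[Turn], benchmark_turns: float) -> float:
    """Turn count relative to the benchmark for this intent type (>1 means deeper than usual)."""
    user_turns = sum(1 for t in turns if t.role == "user")
    return user_turns / benchmark_turns

def resolution_efficiency(turns: List[Turn]) -> Optional[int]:
    """Path length: number of turns until the first successful resolution, if any."""
    for i, turn in enumerate(turns, start=1):
        if turn.resolved:
            return i
    return None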
Compliance on Autopilot
Privacy regulations shouldn’t require manual processes. Our system automates:
- GDPR Workflow: From user request to crypto-shredding across all storage tiers (sketched after this list)
- CCPA Handling: Automated inventory and report generation
- Retention Policies: Time-based purging with justification workflows
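Crypto-shredding works because each record's PII is encrypted under its own data key: deleting the wrapped key makes the ciphertext unrecoverable in every tier without rewriting those tiers. A minimal sketch, assuming a hypothetical key_store keyed by user:

from typing import Dict

def crypto_shred(user_id: str, key_store: Dict[str, bytes]) -> int:
    """Delete all wrapped data keys for a user's records; returns how many were shredded."""
    doomed = [k for k in key_store if k.startswith(f"{user_id}/")]
    for key_id in doomed:
        del key_store[key_id]
    return len(doomed)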
Making AI/ML Better
The framework generates de-identified features:
- Conversation-level aggregates (length, topic shifts, sentiment)
- Turn-level metrics (response time, token efficiency)
- User-satisfaction correlates, without the need to identify individual users
Privacy You Can Count On
Our framework delivers both cryptographic and statistical privacy guarantees:
- Cryptographic: AES-256 encryption with 30-day key rotation
- Statistical: (ε,δ)-differential privacy with ε=2.1 and δ=10^-5 (noise addition sketched after this list)
- Anonymity: k-anonymity with k≥10 for all demographic aggregates
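As an illustration of the statistical guarantee, here is a sketch of adding noise to a single count before release. It uses the Laplace mechanism for pure ε-DP on a count (sensitivity 1); the framework's overall (ε, δ) figures would come from privacy accounting across all released aggregates, and the function name and defaults are assumptions:

import numpy as np

def dp_count(true_count: int, epsilon: float = 2.1, sensitivity: float = 1.0) -> float:
    """Release a count with Laplace noise scaled to sensitivity / epsilon."""
    noise = np.random.laplace(loc=0.0, scale=sensitivity / epsilon)
    return true_count + noise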
The Road Ahead
We’re continuing to improve the framework with:
- Support for multimodal conversations (text, voice, image)
- Integration with homomorphic encryption
- Federated fine-tuning capabilities
- Enhanced PII detection for specialized domains