PropRaven Marketplace v1 — Methodology
How each canonical entity and intelligence signal is produced. Customers building decisioning systems should read this before relying on any individual field.
Stable IDs (PRPV_*)
PropRaven IDs are deterministic 16-char hex hashes of canonicalized natural keys.
prpv_id = 'PRPV_' + entity + '_' + UPPER(SUBSTR(SHA2(canonical_key, 256), 1, 16))
Where canonical_key joins the natural-key fields with CHR(31) (ASCII unit separator), and each field is UPPER(TRIM(...))-ed (NULLs become CHR(0)).
Properties:
- Idempotent: same input → same ID, every time
- Whitespace + case tolerant:
' 37 'and'37'produce the same ID - 64 bits of entropy → P(collision in 256M rows) ≈ 1e-3 in any single entity (acceptable for v1)
What this means for you:
- Your queries can JOIN on PRPV_* IDs with confidence
- Source-system APN format changes do NOT change the PRPV ID (unless the field actually changes)
- Within a major version, IDs are forever-stable — your warehouse can store them
Owner entity resolution
Every parcel has at most one canonical prpv_owner_id via the EDGE_OWNER_PARCEL table.
Pipeline:
- Source cluster ID —
PROPZILLA.CURATED.PARCEL_ENRICHED.OWNER_ENTITY_IDcarries an internal cluster ID (er-prefix for entities,er-i-for individuals) covering 99.7% of parcels. - Aggregate to one row per cluster: most-frequent name, classified entity_type, parcel_count rollup.
- Name-classifier override — owners whose name contains
LLC,CORP,TRUST,COUNTY,CITY OF, etc. get theirentity_typereclassified accordingly. This corrects PE's tendency to mark obvious-corporate names as INDIVIDUAL. - Canonical merge (v1.1) — corporate entities (LLC/CORP/LP/GOVT/NONPROFIT) with the same normalized name + entity_type collapse into a single
prpv_owner_id. The one with the most parcels wins. INDIVIDUAL and TRUST entities are NOT merged (too prone to first-name-only false positives like "John" matching across families). - Placeholder drop — owners whose name is empty, "Not Available", "Current Owner", "Multiple Owners", "Unknown", "[Limited Access]" are mapped to NULL canonical and their edges dropped. ~22K placeholder clusters removed.
Confidence scoring:
1.00— Tier A: known-institutional list match (e.g., "Invitation Homes", "Blackstone")0.95— Tier B: name-classified entity type (regex-matched corporate suffix)0.85— Tier C: er-i-* individual cluster- Edges inherit owner confidence
Limitations:
- Multi-grantee names like "URBAN SITES OF AMERICA LLC; WILLIAMS MARY; WILLIAMS-DAUGHTRY EBETTE" are treated as one entity. Splitting at
;and re-resolving is a v1.2 improvement. - We do NOT resolve beneficial ownership (LLC → controlling natural person). That requires Secretary of State filings + FinCEN BOI access — slated for the Person entity in v1.1.
Address comparison (for is_owner_occupied / is_absentee)
PE provides two addresses that are always semantically meaningful but format-incompatible:
- Situs:
'1230 DRESDEN DR WEST CHARLOTTE NC'(street + city + state) - Mailing:
'1230 DRESDEN DR WEST 28205'(street + zip5)
Naïve string comparison fails. Our methodology:
- Strip non-alphanumeric, upper-case both
- Take leading 16 chars (≈ street_number + 1-2 words of street name)
- If prefixes equal → owner-occupied
- If situs_zip5 + mail_zip5 + LEFT(prefix,8) all match → owner-occupied (weak match)
- Otherwise → absentee
Absentee classification:
OUT_OF_STATE— mailing state ≠ situs state (highest signal — likely investor)LOCAL_ABSENTEE— mailing in same ZIP but different street (often a property manager or relative)OUT_OF_COUNTY— mailing in same state but different countyPO_BOX— mailing is a P.O. box (deliberate obfuscation)
Confidence scales with how many corroborating fields agree (street + zip + state).
Institutional buyer classification
Three tiers, in order of confidence:
- Tier A (conf 1.00): name matches a curated list of known institutional buyers — Invitation Homes, AMH, Tricon, Blackstone, etc. Twenty-one canonical patterns shipped in v1.
- Tier B (conf 0.90): entity_type ∈ (LLC/CORP/LP) AND parcel_count >= 10 (multi-property corporate ownership).
- Tier C (conf 0.75): entity_type ∈ (LLC/CORP/LP) AND parcel_count >= 5.
Customer queries should default to Tier A+B (confidence_score >= 0.90). Tier C is for screen-out / lead-gen workflows tolerant of noise.
The curated list will grow. Customers requesting additions can suggest patterns; we ship them in the next release.
Recent transfer recency
Source: PROPZILLA.CURATED.PARCEL_ENRICHED.LAST_SALE_DATE + LAST_SALE_PRICE.
Buckets: LAST_90 / LAST_180 / LAST_365 / OVER_365 / NEVER.
Known data-quality issue: PE has ~313K parcels (Mecklenburg alone) with LAST_SALE_DATE set to dates in 2026 — these are upstream backfill artifacts, not real recent sales. Customers should filter confidence_score >= 0.95 (which requires both date AND price) for clean signal. Working with upstream to clean for v1.1.
Geometry storage
Parcel boundaries stored as GEOGRAPHY type (Snowflake's WGS84 spatial type). Coverage:
- 167.8M of 234M+ parcels have polygons (~72%)
has_polygonboolean lets customers fast-filter- Centroid lat/lon also exposed for non-spatial queries
Why INSURANCE covers 234M+ parcels
The Core / Ontology / Owner-Graph / Insurance bundles all share a single canonical parcel set — every distinct (STATE_FIPS, COUNTY_FIPS, PARCEL_ID) key in CURATED.PARCEL_ENRICHED. As of 2026-05-17 that's 234M+ parcels.
Prior to the May 2026 merge, this layer was split across two physical tables (PARCEL_ENRICHED and PARCEL_ENRICHED_CANONICAL) with different row counts and an explicit "Core vs Insurance wider grain" semantic. That split has been retired — every share now reads from the merged PARCEL_ENRICHED with one row per parcel, every value carrying its source attribution via the new DATA_SOURCES_ALL array column.
Caveat: ~2.22M parcels carry NULL COUNTY_FIPS where source address / lat-lon / ZIP / city was insufficient to derive a county. They're real parcels (have address, owner, sale history, etc.) but cannot be joined to county-level overlays. Use WHERE COUNTY_FIPS IS NOT NULL to filter them out when running county-grouped analytics.
v1.2 extensions: PARCEL_EXTENSIONS (new 2026-05-17)
37 additional columns are exposed via a 1:1 join table on prpv_parcel_id:
- TIGER census hierarchy — block, block group, tract, place, county subdivision, school district, urban area GEOIDs + names + urban/rural classification (91% / 90% / 89% / 58% / 89% / 85% / 55% / 100% coverage respectively).
- Parsed address components — street number / pre-direction / street name / street type / post-direction / unit type / unit number / address slug / ZIP+4 (67% / 15% / 83% / 66% / 5% / 1% / 1% / 90% / 3% coverage).
- AVM bands —
AVM_LOWER/AVM_UPPER/AVM_AS_OF(86%) andRENTAL_AVM_LOWER/_UPPER/_AS_OF(68%). - Distress flags —
IS_REO(17%),IS_DISTRESSED(10%),IS_MULTI_PARCEL_SALE(6%),IS_PARTIAL_INTEREST(10%). - Distance metrics —
DIST_TO_INTERSTATE_M(74%),DIST_TO_COASTLINE_M(38%). - Provenance —
TIGER_VINTAGE,GEO_PRECISION(rooftop / interpolated / centroid).
Access pattern by share:
| Share | Access |
|---|---|
PROPRAVEN_PARCEL_INTELLIGENCE (marquee) | Already a column on V_MARKETPLACE_NATIONAL. No join needed. |
PROPRAVEN_CRE, _INSURANCE, _GOVTECH, _SFR, _LENDING | SELECT ... FROM <SCHEMA>.PARCEL p LEFT JOIN <SCHEMA>.PARCEL_EXTENSIONS ext ON ext.prpv_parcel_id = p.prpv_parcel_id |
PROPRAVEN_ONTOLOGY | Future: extensions will land in the ontology entity model in v1.3. |
This is a backward-compatible additive change — v1.0 / v1.1 queries are unaffected.
Known coverage gaps (v1 accepted)
| Field | Coverage | Why | Workaround |
|---|---|---|---|
SILVER.OWNER.name_raw | 74% nationally | County feeds in NJ, ME, NH, RI, CA Prop 13 partially redact owner names at the publishing layer. Cannot enrich — the data isn't published upstream. | Filter to name_raw IS NOT NULL for owner-keyed analytics; expect lower coverage in affected states. |
SILVER.TAX_RECORD.assessed_value_total = 0 | ~6.56M parcels in MD + ND | MD and ND county feeds publish TAV=0 for tax-exempt parcels (government, religious, nonprofit). | Filter assessed_value_total > 0 to exclude exempt parcels (usually the desired behavior). |
SILVER.PERMIT → SILVER.PARCEL FK | 73.9% resolves | Permit-to-parcel resolution depends on address / lat-lon match quality. The 26% unmatched permits are mostly NJ/NH and a few large jurisdictions with non-normalized situs strings — a permit-matcher upgrade is on the v1.2 roadmap. | For full permit-to-parcel reach, use the legacy PROPRAVEN_PARCEL_INTELLIGENCE share (over PROPZILLA.CURATED). |
SILVER.PROPERTY.property_type_class normalization | MISMO codes verbatim | The normalized abbreviations (SFR / MF / CRE) ship in v1.1. Current values: SingleFamily, VacantLand, Commercial, Condominium, etc. | Filter on the MISMO values for now (see sample queries Q1, Q16). |
What we DO NOT include
Per v1 PII boundary (locked in plan §13):
- No DOB, SSN, email, phone, or device IDs on any individual
- No image data (parcel photos)
- No proprietary AVM (we expose source-county assessed values; customers run their own AVMs)
- No personal addresses for natural persons (only for entities)
Revisit each in v2 with legal review.
Data freshness SLA (v1)
| Tier | Refresh cadence |
|---|---|
| Ontology (PARCEL/PROPERTY/TAX_RECORD/PERMIT/OWNER/GEOGRAPHY/TRANSACTION/LOAN/ADDRESS + edges + signals) | MONTHLY |
| LENDING / SFR / CRE / GOVTECH bundles | MONTHLY |
| INSURANCE bundle (FEMA NRI annual; NFIP / NOAA quarterly) | QUARTERLY |
| BANKING_MARKETPLACE (FDIC SOD annual; analytics monthly) | MONTHLY |
Source data inherits its publishing cadence — county assessor rolls publish monthly to annually, federal data (FEMA / Census / FDIC) annually. The _LOADED_AT and _SOURCE_AS_OF columns expose actual freshness per row.
META.SOURCE_LINEAGE.last_refreshed_at shows when each entity last updated.
Versioning
Every row carries _resolution_version (e.g., v1.0.0-national). Pin queries to a version for reproducibility:
SELECT * FROM SILVER.OWNER WHERE _resolution_version = 'v1.0.0-national';
Major versions (v1 → v2) are breaking. Old major lives in a parallel listing for 90 days. Minor versions (v1.1) are additive — new entities, new columns, refined methodology — and don't break existing queries.