PropRaven Marketplace v1 — Methodology

How each canonical entity and intelligence signal is produced. Customers building decisioning systems should read this before relying on any individual field.


Stable IDs (PRPV_*)

PropRaven IDs are deterministic 16-char hex hashes of canonicalized natural keys.

prpv_id = 'PRPV_' + entity + '_' + UPPER(SUBSTR(SHA2(canonical_key, 256), 1, 16))

Where canonical_key joins the natural-key fields with CHR(31) (ASCII unit separator), and each field is UPPER(TRIM(...))-ed (NULLs become CHR(0)).

Properties:

  • Idempotent: same input → same ID, every time
  • Whitespace + case tolerant: ' 37 ' and '37' produce the same ID
  • 64 bits of entropy → P(collision in 256M rows) ≈ 1e-3 in any single entity (acceptable for v1)

What this means for you:

  • Your queries can JOIN on PRPV_* IDs with confidence
  • Source-system APN format changes do NOT change the PRPV ID (unless the field actually changes)
  • Within a major version, IDs are forever-stable — your warehouse can store them

Owner entity resolution

Every parcel has at most one canonical prpv_owner_id via the EDGE_OWNER_PARCEL table.

Pipeline:

  1. Source cluster IDPROPZILLA.CURATED.PARCEL_ENRICHED.OWNER_ENTITY_ID carries an internal cluster ID (er- prefix for entities, er-i- for individuals) covering 99.7% of parcels.
  2. Aggregate to one row per cluster: most-frequent name, classified entity_type, parcel_count rollup.
  3. Name-classifier override — owners whose name contains LLC, CORP, TRUST, COUNTY, CITY OF, etc. get their entity_type reclassified accordingly. This corrects PE's tendency to mark obvious-corporate names as INDIVIDUAL.
  4. Canonical merge (v1.1) — corporate entities (LLC/CORP/LP/GOVT/NONPROFIT) with the same normalized name + entity_type collapse into a single prpv_owner_id. The one with the most parcels wins. INDIVIDUAL and TRUST entities are NOT merged (too prone to first-name-only false positives like "John" matching across families).
  5. Placeholder drop — owners whose name is empty, "Not Available", "Current Owner", "Multiple Owners", "Unknown", "[Limited Access]" are mapped to NULL canonical and their edges dropped. ~22K placeholder clusters removed.

Confidence scoring:

  • 1.00 — Tier A: known-institutional list match (e.g., "Invitation Homes", "Blackstone")
  • 0.95 — Tier B: name-classified entity type (regex-matched corporate suffix)
  • 0.85 — Tier C: er-i-* individual cluster
  • Edges inherit owner confidence

Limitations:

  • Multi-grantee names like "URBAN SITES OF AMERICA LLC; WILLIAMS MARY; WILLIAMS-DAUGHTRY EBETTE" are treated as one entity. Splitting at ; and re-resolving is a v1.2 improvement.
  • We do NOT resolve beneficial ownership (LLC → controlling natural person). That requires Secretary of State filings + FinCEN BOI access — slated for the Person entity in v1.1.

Address comparison (for is_owner_occupied / is_absentee)

PE provides two addresses that are always semantically meaningful but format-incompatible:

  • Situs: '1230 DRESDEN DR WEST CHARLOTTE NC' (street + city + state)
  • Mailing: '1230 DRESDEN DR WEST 28205' (street + zip5)

Naïve string comparison fails. Our methodology:

  1. Strip non-alphanumeric, upper-case both
  2. Take leading 16 chars (≈ street_number + 1-2 words of street name)
  3. If prefixes equal → owner-occupied
  4. If situs_zip5 + mail_zip5 + LEFT(prefix,8) all match → owner-occupied (weak match)
  5. Otherwise → absentee

Absentee classification:

  • OUT_OF_STATE — mailing state ≠ situs state (highest signal — likely investor)
  • LOCAL_ABSENTEE — mailing in same ZIP but different street (often a property manager or relative)
  • OUT_OF_COUNTY — mailing in same state but different county
  • PO_BOX — mailing is a P.O. box (deliberate obfuscation)

Confidence scales with how many corroborating fields agree (street + zip + state).


Institutional buyer classification

Three tiers, in order of confidence:

  • Tier A (conf 1.00): name matches a curated list of known institutional buyers — Invitation Homes, AMH, Tricon, Blackstone, etc. Twenty-one canonical patterns shipped in v1.
  • Tier B (conf 0.90): entity_type ∈ (LLC/CORP/LP) AND parcel_count >= 10 (multi-property corporate ownership).
  • Tier C (conf 0.75): entity_type ∈ (LLC/CORP/LP) AND parcel_count >= 5.

Customer queries should default to Tier A+B (confidence_score >= 0.90). Tier C is for screen-out / lead-gen workflows tolerant of noise.

The curated list will grow. Customers requesting additions can suggest patterns; we ship them in the next release.


Recent transfer recency

Source: PROPZILLA.CURATED.PARCEL_ENRICHED.LAST_SALE_DATE + LAST_SALE_PRICE.

Buckets: LAST_90 / LAST_180 / LAST_365 / OVER_365 / NEVER.

Known data-quality issue: PE has ~313K parcels (Mecklenburg alone) with LAST_SALE_DATE set to dates in 2026 — these are upstream backfill artifacts, not real recent sales. Customers should filter confidence_score >= 0.95 (which requires both date AND price) for clean signal. Working with upstream to clean for v1.1.


Geometry storage

Parcel boundaries stored as GEOGRAPHY type (Snowflake's WGS84 spatial type). Coverage:

  • 167.8M of 234M+ parcels have polygons (~72%)
  • has_polygon boolean lets customers fast-filter
  • Centroid lat/lon also exposed for non-spatial queries

Why INSURANCE covers 234M+ parcels

The Core / Ontology / Owner-Graph / Insurance bundles all share a single canonical parcel set — every distinct (STATE_FIPS, COUNTY_FIPS, PARCEL_ID) key in CURATED.PARCEL_ENRICHED. As of 2026-05-17 that's 234M+ parcels.

Prior to the May 2026 merge, this layer was split across two physical tables (PARCEL_ENRICHED and PARCEL_ENRICHED_CANONICAL) with different row counts and an explicit "Core vs Insurance wider grain" semantic. That split has been retired — every share now reads from the merged PARCEL_ENRICHED with one row per parcel, every value carrying its source attribution via the new DATA_SOURCES_ALL array column.

Caveat: ~2.22M parcels carry NULL COUNTY_FIPS where source address / lat-lon / ZIP / city was insufficient to derive a county. They're real parcels (have address, owner, sale history, etc.) but cannot be joined to county-level overlays. Use WHERE COUNTY_FIPS IS NOT NULL to filter them out when running county-grouped analytics.

v1.2 extensions: PARCEL_EXTENSIONS (new 2026-05-17)

37 additional columns are exposed via a 1:1 join table on prpv_parcel_id:

  • TIGER census hierarchy — block, block group, tract, place, county subdivision, school district, urban area GEOIDs + names + urban/rural classification (91% / 90% / 89% / 58% / 89% / 85% / 55% / 100% coverage respectively).
  • Parsed address components — street number / pre-direction / street name / street type / post-direction / unit type / unit number / address slug / ZIP+4 (67% / 15% / 83% / 66% / 5% / 1% / 1% / 90% / 3% coverage).
  • AVM bandsAVM_LOWER / AVM_UPPER / AVM_AS_OF (86%) and RENTAL_AVM_LOWER / _UPPER / _AS_OF (68%).
  • Distress flagsIS_REO (17%), IS_DISTRESSED (10%), IS_MULTI_PARCEL_SALE (6%), IS_PARTIAL_INTEREST (10%).
  • Distance metricsDIST_TO_INTERSTATE_M (74%), DIST_TO_COASTLINE_M (38%).
  • ProvenanceTIGER_VINTAGE, GEO_PRECISION (rooftop / interpolated / centroid).

Access pattern by share:

ShareAccess
PROPRAVEN_PARCEL_INTELLIGENCE (marquee)Already a column on V_MARKETPLACE_NATIONAL. No join needed.
PROPRAVEN_CRE, _INSURANCE, _GOVTECH, _SFR, _LENDINGSELECT ... FROM <SCHEMA>.PARCEL p LEFT JOIN <SCHEMA>.PARCEL_EXTENSIONS ext ON ext.prpv_parcel_id = p.prpv_parcel_id
PROPRAVEN_ONTOLOGYFuture: extensions will land in the ontology entity model in v1.3.

This is a backward-compatible additive change — v1.0 / v1.1 queries are unaffected.


Known coverage gaps (v1 accepted)

FieldCoverageWhyWorkaround
SILVER.OWNER.name_raw74% nationallyCounty feeds in NJ, ME, NH, RI, CA Prop 13 partially redact owner names at the publishing layer. Cannot enrich — the data isn't published upstream.Filter to name_raw IS NOT NULL for owner-keyed analytics; expect lower coverage in affected states.
SILVER.TAX_RECORD.assessed_value_total = 0~6.56M parcels in MD + NDMD and ND county feeds publish TAV=0 for tax-exempt parcels (government, religious, nonprofit).Filter assessed_value_total > 0 to exclude exempt parcels (usually the desired behavior).
SILVER.PERMIT → SILVER.PARCEL FK73.9% resolvesPermit-to-parcel resolution depends on address / lat-lon match quality. The 26% unmatched permits are mostly NJ/NH and a few large jurisdictions with non-normalized situs strings — a permit-matcher upgrade is on the v1.2 roadmap.For full permit-to-parcel reach, use the legacy PROPRAVEN_PARCEL_INTELLIGENCE share (over PROPZILLA.CURATED).
SILVER.PROPERTY.property_type_class normalizationMISMO codes verbatimThe normalized abbreviations (SFR / MF / CRE) ship in v1.1. Current values: SingleFamily, VacantLand, Commercial, Condominium, etc.Filter on the MISMO values for now (see sample queries Q1, Q16).

What we DO NOT include

Per v1 PII boundary (locked in plan §13):

  • No DOB, SSN, email, phone, or device IDs on any individual
  • No image data (parcel photos)
  • No proprietary AVM (we expose source-county assessed values; customers run their own AVMs)
  • No personal addresses for natural persons (only for entities)

Revisit each in v2 with legal review.


Data freshness SLA (v1)

TierRefresh cadence
Ontology (PARCEL/PROPERTY/TAX_RECORD/PERMIT/OWNER/GEOGRAPHY/TRANSACTION/LOAN/ADDRESS + edges + signals)MONTHLY
LENDING / SFR / CRE / GOVTECH bundlesMONTHLY
INSURANCE bundle (FEMA NRI annual; NFIP / NOAA quarterly)QUARTERLY
BANKING_MARKETPLACE (FDIC SOD annual; analytics monthly)MONTHLY

Source data inherits its publishing cadence — county assessor rolls publish monthly to annually, federal data (FEMA / Census / FDIC) annually. The _LOADED_AT and _SOURCE_AS_OF columns expose actual freshness per row.

META.SOURCE_LINEAGE.last_refreshed_at shows when each entity last updated.


Versioning

Every row carries _resolution_version (e.g., v1.0.0-national). Pin queries to a version for reproducibility:

SELECT * FROM SILVER.OWNER WHERE _resolution_version = 'v1.0.0-national';

Major versions (v1v2) are breaking. Old major lives in a parallel listing for 90 days. Minor versions (v1.1) are additive — new entities, new columns, refined methodology — and don't break existing queries.