names, addresses, phone numbers

Mckenna, Mike mimckenna at paypal.com
Thu Apr 21 20:11:07 CDT 2016


That would be me and Erwin Hom (erwin.hom at gmail.com), who spoke at IUC.

My latest work is coalescing the schemas used by Google address validation metadata<https://github.com/googlei18n/libaddressinput/wiki/AddressValidationMetadata>, HTML5.1 autofill fields<https://www.w3.org/TR/html51/sec-forms.html#autofill-field>, AddressDoctor<https://www.informatica.com/content/dam/informatica-com/global/amer/us/collateral/other/addressdoctor-cloud-2_user-guide.pdf>, and some geocoding standards into a portable address format. The goal is a format that works well across countries, can be adapted easily to common open source libraries like the Google code, and uses generic terms, as HTML does, so as not to confuse state/province/prefecture/country or suburb/district/ward/neighborhood with colloquially named fields using US nomenclature.

The meta-metadata I need for each country (my next project) is:

  *   The input format – what order is expected on input, what fields are required and regex if known
     *   Variations by local or international script(s)
     *   Order change if address lookup using postcode is used
  *   The output or display format – what order, punctuation, and case-mapping
     *   Variations for multi-line, single-line
     *   Variations for local or international address
  *   Cross-walk mapping between the portable address schema and HTML5.1, i18napis, hcard, AddressDoctor
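
As a rough illustration of the per-country metadata listed above, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the class name, the attribute names, the placeholder country "XX", and the use of HTML autofill tokens as portable field keys. It is not the actual schema, only one way the input format, display format, and cross-walk could sit together.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class AddressFormatSpec:
    """Hypothetical per-country address metadata (all names are illustrative)."""
    country: str
    input_order: List[str]            # order expected on input
    required: Set[str]                # fields that must be present
    display_lines: List[List[str]]    # multi-line output layout, one list per line
    patterns: Dict[str, str] = field(default_factory=dict)   # field -> regex, if known
    uppercase: Set[str] = field(default_factory=set)         # case-mapping on output
    crosswalk: Dict[str, Dict[str, str]] = field(default_factory=dict)  # other schema -> field map


def missing_fields(spec: AddressFormatSpec, address: Dict[str, str]) -> List[str]:
    """Return required fields that are absent or empty, in input order."""
    return [f for f in spec.input_order if f in spec.required and not address.get(f)]


def format_multiline(spec: AddressFormatSpec, address: Dict[str, str]) -> str:
    """Render the address using the multi-line display layout, skipping empty fields."""
    lines = []
    for layout in spec.display_lines:
        parts = []
        for name in layout:
            value = address.get(name, "").strip()
            if value:
                parts.append(value.upper() if name in spec.uppercase else value)
        if parts:
            lines.append(" ".join(parts))
    return "\n".join(lines)


# Made-up example for a placeholder country; not real postal rules.
EXAMPLE = AddressFormatSpec(
    country="XX",
    input_order=["street-address", "address-level2", "address-level1", "postal-code"],
    required={"street-address", "address-level2", "postal-code"},
    display_lines=[["street-address"], ["address-level2", "address-level1", "postal-code"]],
    patterns={"postal-code": r"\d{5}"},
    uppercase={"address-level2"},
    crosswalk={"hcard": {"address-level2": "locality", "address-level1": "region"}},
)
```

Single-line output and local versus international variants would need further layouts in the same structure; they are left out here to keep the sketch short.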

Name is pretty straightforward, and what we do is nowhere near as complete or elegant as what Edwin has in his code. We do add character-range regexes, though, because for us valid legal names have to be composed of characters that are allowed in identity or financial documents.  Organization name gets more punctuation, but the character range is limited like personal name.  The range limitation is an extension of the CLDR exemplar characters and, combined with normalization, helps reduce spoofing and confusables.  For an interesting read on names, take a look at the name restrictions for the UK Deed Poll<http://www.deedpoll.org.uk/AreThereAnyRestrictionsOnNames.html>.
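
To make the character-range idea concrete, here is a minimal Python sketch of normalizing a name and then checking it against an allowed range. The specific range (Basic Latin, Latin-1 and Latin Extended-A letters, plus space, hyphen, apostrophe and period), the choice of NFC, and the function name are all assumptions for illustration; the real ranges would come from the per-locale exemplar sets and whatever the relevant documents allow.

```python
import re
import unicodedata

# Hypothetical allowed letters: A-Z, a-z, and accented Latin letters up to U+017F
# (the multiplication and division signs U+00D7 and U+00F7 are excluded).
_LETTER = "A-Za-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u017F"

# A name must start with a letter; later characters may also be
# space, apostrophe, hyphen, or period.
_NAME_RE = re.compile("[" + _LETTER + "][" + _LETTER + " '\\-.]*")


def is_valid_name(raw: str) -> bool:
    """Normalize to NFC, then require every character to be in the allowed range."""
    name = unicodedata.normalize("NFC", raw).strip()
    return bool(name) and _NAME_RE.fullmatch(name) is not None
```

Normalizing first means a decomposed "e" plus combining diaeresis is accepted as the single letter "ë", while a look-alike such as Cyrillic "і" (U+0456) inside an otherwise Latin name falls outside the range and is rejected.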

For phone, we just punted and use the Google phone lib.  The big help there is the phone validation.  Edwin is correct that the formats do not change much, but we like that for display, the Google lib chooses the correct format, e.g. for the many prefix formats for Germany.

Mike McKenna
Internationalization Technology Product Owner
+1-408-967-3631 (desk), +1-510-332-7820 (mobile)
PayPal
2211 N. First Street, San Jose CA 95131 - USA


From: CLDR-Users <cldr-users-bounces at unicode.org> on behalf of Cameron Dutro <cameron at lumoslabs.com>
Date: Thursday, April 21, 2016 at 5:02 PM
To: Edwin Hoogerbeets <ehoogerbeets at gmail.com>
Cc: "cldr-users at unicode.org" <cldr-users at unicode.org>, Chris Leonard <cjl at sugarlabs.org>
Subject: Re: names, addresses, phone numbers

I remember some fine folks from PayPal talking about something like this at IUC a few years ago. Does anyone remember who spoke and perhaps how to get in touch with them?

-Cameron

On Thu, Apr 21, 2016 at 4:34 PM, Edwin Hoogerbeets <ehoogerbeets at gmail.com> wrote:
Chris, you can see the data at:

https://sourceforge.net/p/i18nlib/code/HEAD/tree/trunk/js/data/locale/

Under there is https://sourceforge.net/p/i18nlib/code/HEAD/tree/trunk/js/data/locale/und/<countrycode> directories which contain the phone files for 22 countries. The phone files are phonefmt.json for the progressive formats designed to be used to format partial and full numbers while dialing digits in a phone UI, numplan.json for the basic numbering plan information, states.json which is a generated trie used for parsing area codes, and area.json which maps area codes to geolocations. A special case is the North American Numbering Plan (NANP) countries (Canada, US, Bermuda, and many Caribbean nations) which are all configured together in the https://sourceforge.net/p/i18nlib/code/HEAD/tree/trunk/js/data/locale/und/US directory for convenience.
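
To illustrate how a trie can be used for parsing area codes, here is a minimal Python sketch of longest-prefix matching. It does not reflect the actual layout of states.json or area.json; the function names, the "$" end marker, and the sample codes and labels are invented for the example.

```python
from typing import Dict, Optional, Tuple

END = "$"  # hypothetical marker key for "a complete area code ends here"


def build_trie(codes: Dict[str, str]) -> dict:
    """Build a nested-dict trie mapping area-code digits to a label."""
    root: dict = {}
    for code, label in codes.items():
        node = root
        for digit in code:
            node = node.setdefault(digit, {})
        node[END] = label
    return root


def match_area_code(trie: dict, number: str) -> Optional[Tuple[str, str]]:
    """Return (area_code, label) for the longest matching prefix, or None."""
    node = trie
    best = None
    for i, digit in enumerate(number):
        node = node.get(digit)
        if node is None:
            break
        if END in node:
            best = (number[: i + 1], node[END])
    return best


# Invented sample data, not real numbering-plan information.
SAMPLE_TRIE = build_trie({"2": "zone A", "22": "zone B", "221": "zone C"})
```

With this sample data, `match_area_code(SAMPLE_TRIE, "2215550")` picks the three-digit code "221" over the shorter prefixes, which is the behaviour needed when area codes have variable length.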

Mike M, I can imagine that the area codes and geolocations change very regularly, but the formats do not. "(XXX) XXX-XXXX" has been the de facto standard American format for many, many years for example. Ilib contains multiple styles of format as well, since the format is often a matter of user preference instead of government mandate. See https://sourceforge.net/p/i18nlib/code/HEAD/tree/trunk/js/data/locale/und/DE/phonefmt.json for a country with 5 different possible styles.
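
As an illustration of progressive formatting, the following Python sketch formats a partial or full number toward the "(XXX) XXX-XXXX" style as digits are entered. The function name and the intermediate shapes (plain digits up to three, "XXX-XXXX" up to seven) are assumptions for the example and are not taken from phonefmt.json.

```python
def format_partial_us(entered: str) -> str:
    """Format up to 10 dialed digits progressively; non-digits are ignored."""
    digits = "".join(ch for ch in entered if ch in "0123456789")[:10]
    if len(digits) <= 3:
        return digits                                   # e.g. "408"
    if len(digits) <= 7:
        return digits[:3] + "-" + digits[3:]            # local style, e.g. "555-1234"
    # Eight or more digits: treat the first three as the area code.
    return "(" + digits[:3] + ") " + digits[3:6] + "-" + digits[6:]
```

Calling it after each keypress gives "408", then "408-5551", then "(408) 555-12", and finally "(408) 555-1234" once all ten digits are in.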

Also under https://sourceforge.net/p/i18nlib/code/HEAD/tree/trunk/js/data/locale/und/<countrycode> are the address.json files. These are meta-information plus a list of regular expressions and hard-coded lists used to parse the addresses. The parsing isn't right all the time (the US one has problems with two-word localities like "San Francisco", for example), but it gets reasonably close, and pretty much every country in the world is covered.

Under 55 of the locale dirs are the name.json files which configure the name formats and settings for those languages. The top level contains a western-centric fall-back file used when the language doesn't have its own parser: https://sourceforge.net/p/i18nlib/code/HEAD/tree/trunk/js/data/locale/name.json. An example of Asian formats: https://sourceforge.net/p/i18nlib/code/HEAD/tree/trunk/js/data/locale/ja/name.json

Almost all of the phone data was gleaned either from the documents on the International Telecommunication Union site, which has the officially published numbering plan documents for many countries, or from Wikipedia, which has information about the formats. The address and name information is gleaned almost exclusively from Wikipedia.

Edwin



On 04/20/2016 11:27 PM, Chris Leonard wrote:
On Thu, Apr 21, 2016 at 1:34 AM, Edwin Hoogerbeets
<ehoogerbeets at gmail.com> wrote:
I heard talk 2 or 3 years ago about a proposal to add name, address, and
phone number formats to CLDR. What ever happened to those efforts? I don't
really see data in CLDR 29 about those.

In my i18n library for JS called "ilib", I have data about the address
formats for practically every country in the world, as well as the phone
formats and name formats for many countries. I would love to contribute this
data to CLDR and then later leverage other people's local knowledge to fill
in the gaps where my data is lacking...

Can someone direct me to the folks who are working on these? Thanks,



Dear Edwin.


I'd be interested in comparing your data to that in the glibc locales.

Is there a link to your repo you can provide?

cjl

_______________________________________________
CLDR-Users mailing list
CLDR-Users at unicode.org
http://unicode.org/mailman/listinfo/cldr-users
