NB: I put Mr. Linger on this mail because he expressed an interest
in learning about data. What better place to learn than here at the
Imprecision Laudro-data-mat.

It was good to see today that you guys are still alive. I can't
describe how long it takes to recover from surgery like this, nor
my surprise in the discovery. Like Humpty Dumpty, I have had a lot
of stuff done over the years to put me back together, but when you
can feel the hardware with your fingertips through your own skin, it
is astonishing to realize that one is nothing more than a biological
machine. And I'm not even entirely biological any longer. Think Borg.

But first, is the following a usability error? Take a look at
http://www.companyspotlight.com/31445/NORDEA-BANK-AB--FDR I clicked the
"Featured in Financial Services" button, and it took me to Aberdeen. I
gather that the purpose is to direct people to the microsite, but
the button needs to say something like "See the entire Financial
Services MicroSite" because it reads like you Nordea Bank is the
featured company.

...

As I have worked on this task, I have tried to keep in mind that access
to this database represents what we "sell," at least seen from the
standpoint of the end-user. And this pre-holiday afternoon seems like
a good time to give you a description of the Grand Unified Matching
Scheme, or GUMS. It's a fine name, don't you think? As in "Hey,
these data are gumming up the works." But seriosly, I believe that we
can achieve an error rate ≤ 1%. There is a proposed architectural
change to the backend of Company Spotlight, so please be sure to read
the message.

- The new Factset feed is working well, and the data seem to be
reliable, both individually and collectively. In other words, spot
checks of a datum show it to be correct, and taken as a whole I don't
find conflicts like duplicate keys like I do in MorningStar.

- The MorningStar feed is also working well. I don't know what they
have done to their data, but it's just a problem we will have to
work around.

- The SECs feed is rich, but their concept of sequence numbers make
it a little hard to be sure things are in the right order, and they
definitely have duplicate keys.

- The new vendor ( something Alpha?) we don't yet know about.

I have put a double star (**) by the steps below about which I would
like to get management and technical confirmation. I would like to
confirm for everyone that none of the following has anything to do with
"clients."

Part I:

The Factset, SEC and MorningStar data go into raw tables that save
everything we can get.

I have taken the logic to generate slugs (partial names) into
a separate PHP module. It is not terribly CPU intensive and the
algorithm needs to be isolated, so I generate all the slugs before
we do anything else.

(**) Remove the inactive securities from the Factset data. This has
the side effect of removing the company if there are no securities.

(**) Remove documents older than 3 financial years (currently "before
2009") from the MorningStar data. Then, remove the companies that
are left empty of documents.

(**) Remove documents older than 3 financial years from the SEC data.

(**) And before we move on, we make sure that we remove all the
clients by querying some Company Spotlight database that is thought
to have an up to scratch list of the current clients.

Part II:

We match info to try to associate the three key /company/ identifiers:
FSID, MSID, CIK (for the SEC).

ISIN => some Factset, some MorningStar
Ticker/Exch => some Factset, some MorningStar
Names => all 
Country => all

There are quite a few possible matches:

Full name -- very good bet to be identical across the systems,
particularly in USA and Canada.

Partial name (slug) -- very good bet to be correct when talking Western
Europe. And this is the part where we can apply computer science and
get some returns on our investment.

Country -- This is a little harder to say. Other than Factset, no
one says what this means. There are a lot of companies overseas that
have to file something with the SEC, which is why I now have so many
more matches.

ISIN -- We once thought this was very good, until Sir Rob the Debugger
found the Expedia problem. Worse still, when we get wrong ISINs in MS,
there is no reason to think it is actually wrong, although a match
does strengthen the evidence.

 Ticker/Exchange -- the problems here are duplicate keys (the tickers)
 and missing data (the exchanges).

 The result of all these wash cycles is neatly stacked laundry in
 the xref table that associates FSID - MSID - CIK ... and the CSID
 (Company Spotlight ID). I intend to keep the coefficient of confidence
 in the correlation (and C^3 for you, Rudy) score in there. (**) This
 gives us an opportunity to have a "dial" or "knob" that we can turn
 to experiment with the results because it can change the threshold.

 Part III:

 We also have the map that is a conveniently denormalized
 representation of the Factset data of interest to us. (**) To it,
 we apply the prioritization rules. They are essentially a rationing
 system by which prevent ourselves from over-running the contractual
 limitations. It is also the place where we will need to re-apply the
 click counting of the content to determine the most popular companies.

 Just as a thought experiment, consider how much things would change if
 we shifted the priorities to Australia, Japan and China. In the first
 case, a lot of those companies file in the United States, and in the
 latter two we shift away from Roman alphabet languages in increase the
 reliance on MorningStar's data. [[ BTW: a counterintuitive finding on
 names is that both Australia and China have some of the hardest name
 matching problems. With China, the problem is deciding how to spell
 the name, and with Australia it is deciding the order of the words. ]]

 As I have discovered during the past few days, the situation is
 going to get complicated with the MorningStar v. SEC data. (**) I
 think we should consider a system in which a kit could contain data
 from each source, perhaps using the year as a dividing line. SEC gets
 data before MorningStar does, and SEC often has data that antedates
 the MorningStar data. Many MorningStar companies have sparse data,
 so I can envision a collection that has SEC docs from 2009, and nice
 looking PDFs of the annual reports from 2010 and 2011.

 Part IV:

 Paul, you have been working on the Code Igniter embedded model. Rudy,
 you know the business logic of application as well as anyone. Seth,
 you know the client. If the application is naïve about the tables'
 names, I suggest that we use the model layer to point to the new
 tables instead of trying to reorganize them with the importation
 procedure. This problem of deleting something and adding something
 else in the spotlight database doesn't make a lot of sense. Nor does
 marking all the records in the gkfstatus column, nor does the crazy
 way that the Potentialists decided to give the securities still more
 IDs that we don't need.

 To steal a line from Sir Salman Rushdie's very funny novel /The Moor's
 Last Sigh/, the kit-ification can be done with more finesse if we
 are not trying to delete and replace. Additionally, we just keep
 everything in the database and use business rules rather than the
 database content to drive what the users can and do see. My point
 is that we could adjust the business rules to see what happens
 (ex: SEC v. MorningStar docs) rather than having to reload the
 database. Every time we reload the database with fresh data we have
 the issue of its not being quite like it was, and the differences
 are more common down at the border of show/don't show than they are
 up at the Apple and Exxon end of the continuum.

 Part V:

 Oh yes, the xanax. The difference between MorningStar data and
 the Higgs particle is that with the right tools you can identify a
 Higgs particle.