NB: I put Mr. Linger on this mail because he expressed an interest in learning about data. What better place to learn than here at the Imprecision Laudro-data-mat. It was good to see today that you guys are still alive. I can't describe how long it takes to recover from surgery like this, nor my surprise in the discovery. Like Humpty Dumpty, I have had a lot of stuff done over the years to put me back together, but when you can feel the hardware with your fingertips through your own skin, it is astonishing to realize that one is nothing more than a biological machine. And I'm not even entirely biological any longer. Think Borg. But first, is the following a usability error? Take a look at http://www.companyspotlight.com/31445/NORDEA-BANK-AB--FDR I clicked the "Featured in Financial Services" button, and it took me to Aberdeen. I gather that the purpose is to direct people to the microsite, but the button needs to say something like "See the entire Financial Services MicroSite" because it reads like you Nordea Bank is the featured company. ... As I have worked on this task, I have tried to keep in mind that access to this database represents what we "sell," at least seen from the standpoint of the end-user. And this pre-holiday afternoon seems like a good time to give you a description of the Grand Unified Matching Scheme, or GUMS. It's a fine name, don't you think? As in "Hey, these data are gumming up the works." But seriosly, I believe that we can achieve an error rate ≤ 1%. There is a proposed architectural change to the backend of Company Spotlight, so please be sure to read the message. - The new Factset feed is working well, and the data seem to be reliable, both individually and collectively. In other words, spot checks of a datum show it to be correct, and taken as a whole I don't find conflicts like duplicate keys like I do in MorningStar. - The MorningStar feed is also working well. I don't know what they have done to their data, but it's just a problem we will have to work around. - The SECs feed is rich, but their concept of sequence numbers make it a little hard to be sure things are in the right order, and they definitely have duplicate keys. - The new vendor ( something Alpha?) we don't yet know about. I have put a double star (**) by the steps below about which I would like to get management and technical confirmation. I would like to confirm for everyone that none of the following has anything to do with "clients." Part I: The Factset, SEC and MorningStar data go into raw tables that save everything we can get. I have taken the logic to generate slugs (partial names) into a separate PHP module. It is not terribly CPU intensive and the algorithm needs to be isolated, so I generate all the slugs before we do anything else. (**) Remove the inactive securities from the Factset data. This has the side effect of removing the company if there are no securities. (**) Remove documents older than 3 financial years (currently "before 2009") from the MorningStar data. Then, remove the companies that are left empty of documents. (**) Remove documents older than 3 financial years from the SEC data. (**) And before we move on, we make sure that we remove all the clients by querying some Company Spotlight database that is thought to have an up to scratch list of the current clients. Part II: We match info to try to associate the three key /company/ identifiers: FSID, MSID, CIK (for the SEC). ISIN => some Factset, some MorningStar Ticker/Exch => some Factset, some MorningStar Names => all Country => all There are quite a few possible matches: Full name -- very good bet to be identical across the systems, particularly in USA and Canada. Partial name (slug) -- very good bet to be correct when talking Western Europe. And this is the part where we can apply computer science and get some returns on our investment. Country -- This is a little harder to say. Other than Factset, no one says what this means. There are a lot of companies overseas that have to file something with the SEC, which is why I now have so many more matches. ISIN -- We once thought this was very good, until Sir Rob the Debugger found the Expedia problem. Worse still, when we get wrong ISINs in MS, there is no reason to think it is actually wrong, although a match does strengthen the evidence. Ticker/Exchange -- the problems here are duplicate keys (the tickers) and missing data (the exchanges). The result of all these wash cycles is neatly stacked laundry in the xref table that associates FSID - MSID - CIK ... and the CSID (Company Spotlight ID). I intend to keep the coefficient of confidence in the correlation (and C^3 for you, Rudy) score in there. (**) This gives us an opportunity to have a "dial" or "knob" that we can turn to experiment with the results because it can change the threshold. Part III: We also have the map that is a conveniently denormalized representation of the Factset data of interest to us. (**) To it, we apply the prioritization rules. They are essentially a rationing system by which prevent ourselves from over-running the contractual limitations. It is also the place where we will need to re-apply the click counting of the content to determine the most popular companies. Just as a thought experiment, consider how much things would change if we shifted the priorities to Australia, Japan and China. In the first case, a lot of those companies file in the United States, and in the latter two we shift away from Roman alphabet languages in increase the reliance on MorningStar's data. [[ BTW: a counterintuitive finding on names is that both Australia and China have some of the hardest name matching problems. With China, the problem is deciding how to spell the name, and with Australia it is deciding the order of the words. ]] As I have discovered during the past few days, the situation is going to get complicated with the MorningStar v. SEC data. (**) I think we should consider a system in which a kit could contain data from each source, perhaps using the year as a dividing line. SEC gets data before MorningStar does, and SEC often has data that antedates the MorningStar data. Many MorningStar companies have sparse data, so I can envision a collection that has SEC docs from 2009, and nice looking PDFs of the annual reports from 2010 and 2011. Part IV: Paul, you have been working on the Code Igniter embedded model. Rudy, you know the business logic of application as well as anyone. Seth, you know the client. If the application is naïve about the tables' names, I suggest that we use the model layer to point to the new tables instead of trying to reorganize them with the importation procedure. This problem of deleting something and adding something else in the spotlight database doesn't make a lot of sense. Nor does marking all the records in the gkfstatus column, nor does the crazy way that the Potentialists decided to give the securities still more IDs that we don't need. To steal a line from Sir Salman Rushdie's very funny novel /The Moor's Last Sigh/, the kit-ification can be done with more finesse if we are not trying to delete and replace. Additionally, we just keep everything in the database and use business rules rather than the database content to drive what the users can and do see. My point is that we could adjust the business rules to see what happens (ex: SEC v. MorningStar docs) rather than having to reload the database. Every time we reload the database with fresh data we have the issue of its not being quite like it was, and the differences are more common down at the border of show/don't show than they are up at the Apple and Exxon end of the continuum. Part V: Oh yes, the xanax. The difference between MorningStar data and the Higgs particle is that with the right tools you can identify a Higgs particle.