Trying to mine names and addresses (tricky)



MikeMetta
November 24th, 2009, 10:48 AM
I have converted some PDF files into two-column text with Monarch, but the data is inconsistent. Here is a representative sample. Each PDF had a graphic in the upper left-hand corner, which is why there is a blank area there:


                                                                  Smith Company
                                                                  10 Company Street, Sacramento CA 95630-6798
                                                                  Telephone: (555) 555-0000
                                                                  URL: www.website1.com


                                                                  J and K Company
                                                                  123456 New Road, Los Altos Hills CA 94022-4599
                                                                  Telephone: (555) 555-2468
                                                                  URL: www.website2.com


East Company                                                      XYZ Company
Branch campus of North Orange County Community Company District   Subsidiary of Johnson and James Company District Office located in
located in Anaheim, CA                                            Fresno, CA
9999 East Street, Cypress CA 90630-5897                           1111 East XYZ Avenue, Fresno CA 93741-0002
Telephone: (555) 555-2222                                         Telephone: (555) 555-1357
URL: www.anotherwebsite.com                                       URL: www.website3.com


West Company                                                      ABC123 Company
112233 West Boulevard, Cupertino CA                               Subsidiary of Smith and Jones Company District Office located in
95014-5793                                                        Anaheim, CA
Telephone: (555) 555-4444                                         98765 Another Avenue, Fullerton CA 92832-2095
URL: www.websitehere.com                                          Telephone: (555) 555-7777
                                                                  URL: www.website4.com


Each entry may or may not have 2 extra descriptive lines of information after the company name.

There is always a telephone number and a URL for each entry.

And sometimes the zip code will appear on its own line :(

This is just one part of one of many PDF pages (separate files), but as far as I know there are no other anomalies in the data besides the ones listed above.

Could you help me create model(s) to extract the data, please? Thank you :D

OllyInMunich
November 24th, 2009, 01:20 PM
Hello Mike,

I fear you might need two passes at this one.

First, create a two-column multi-column region (MCR) that fits the PDF scaling, or use the trick for handling variable column widths that is written up elsewhere.

Then trap using a blank in an empty column to capture every single line as one row of data.

Then use the Page(), Line() and Column() functions to put the data in the correct order, and re-export it as a single-column, fixed-width text report, possibly using a summary with Page() as a key value.

Then use standard address block trapping.
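
For anyone trying the same idea outside Monarch, the reordering step above (order by page, then column, then line) can be roughly sketched in Python. This is only a sketch under assumptions: `to_single_column` and `split_at` are hypothetical names, and it assumes the converted text is fixed width with a known character position where the right-hand column starts.

```python
def to_single_column(pages, split_at):
    """Flatten two-column pages into one column of lines.

    pages    -- list of strings, one per page of converted text
    split_at -- assumed character position where the right column begins
    """
    cells = []
    for page_no, page in enumerate(pages, start=1):
        for line_no, line in enumerate(page.splitlines(), start=1):
            # Column 1 is everything left of the split, column 2 the rest.
            cells.append((page_no, 1, line_no, line[:split_at].rstrip()))
            cells.append((page_no, 2, line_no, line[split_at:].strip()))
    # Order by page, then column, then line, so the right column
    # ends up underneath the left column of the same page.
    cells.sort(key=lambda cell: cell[:3])
    return [text for _, _, _, text in cells]


if __name__ == "__main__":
    page = "A Co".ljust(30) + "B Co\n" + "Telephone: 1".ljust(30) + "Telephone: 2"
    for text in to_single_column([page], split_at=30):
        print(text)
```

The output is one long column per page, which is the shape that ordinary address block trapping expects.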

HTH

Olly

Grant Perkins
November 24th, 2009, 07:58 PM
Hi Mike and welcome to the forum.

Taking Olly's lead here and using your sample layout, I figure you need a maximum of 6 lines to create a Detail template that will capture all fields in a single template using the MCR concept.
The smallest record will, presumably, be 4 lines, and there are 2 blank lines between records. If that is consistent - big if? - then this idea may work ... or it may not.

Set up the MCR stuff and then select a six-line sample ending with a URL line (for reference). Create a trap using the word "Telephone:" and tell Monarch that it is on the 5th line of the 6-line detail block.

'Paint' the fields for Phone number and URL on lines 5 and 6. Be sure the URL field is wide enough for all possibilities!

In line 1 paint a full width field (for the column width) for the address. Right click the field and go to the advanced properties and set it to end after 4 lines. OK that, and head for the table to see what it gives you.
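
As a rough illustration of the same trap logic outside Monarch, here is a Python sketch. It assumes the text has already been flattened to a single column, that every record has a "Telephone:" line followed by a "URL:" line, and that at most 4 non-blank lines sit above the telephone line. The function name `extract_records` and the field names are hypothetical, and treating the first line as the name and the last line as the address is my assumption from the sample.

```python
import re

# A line holding only a ZIP or ZIP+4 code, e.g. "95014-5793".
ZIP_ONLY = re.compile(r"^\d{5}(-\d{4})?$")


def extract_records(lines, max_lookback=4):
    """Build one record per 'Telephone:' line found in single-column text."""
    lines = [line.strip() for line in lines]
    records = []
    for i, line in enumerate(lines):
        if not line.startswith("Telephone:"):
            continue
        # Walk upwards from the trap line, stopping at a blank line
        # or after max_lookback lines (the 4-line address field).
        block = []
        j = i - 1
        while j >= 0 and len(block) < max_lookback and lines[j]:
            block.insert(0, lines[j])
            j -= 1
        # Rejoin a zip code that wrapped onto its own line.
        merged = []
        for text in block:
            if merged and ZIP_ONLY.match(text):
                merged[-1] += " " + text
            else:
                merged.append(text)
        url = ""
        if i + 1 < len(lines) and lines[i + 1].startswith("URL:"):
            url = lines[i + 1][len("URL:"):].strip()
        records.append({
            "name": merged[0] if merged else "",
            "description": " ".join(merged[1:-1]),
            "address": merged[-1] if len(merged) > 1 else "",
            "phone": line[len("Telephone:"):].strip(),
            "url": url,
        })
    return records
```

Anchoring on the telephone line and looking backwards mirrors the template above: the trap sits on line 5, and whatever lies in the 4 lines before it is treated as the name and address block.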

Using the sample, I had a slight problem with the record at the top of the second column because there are not enough lines available to fit the template. (Think of the second column being shifted to sit under the first column and read as one long page.) Adding 2 or 3 blank rows to the top or bottom of the report fixed that. You may not have the problem with your extract from PDF. If you do ... hmm. You could create the text from the conversion and edit in a couple of extra lines (or automate that and concatenate), but whether that is a practical solution for you I don't know. It depends on how many files you have and how often the process is required.
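
If you did want to automate the padding, something along these lines might do as a starting point. It is a small Python sketch, not a Monarch feature; `pad_report` is a hypothetical helper name, and the default of 3 blank lines is simply the number that worked on the sample.

```python
def pad_report(text, blank_lines=3):
    """Return the report text with blank lines added at the top and bottom."""
    lines = text.splitlines()
    padding = [""] * blank_lines
    # Rejoin with a trailing newline so the file ends cleanly.
    return "\n".join(padding + lines + padding) + "\n"


if __name__ == "__main__":
    print(repr(pad_report("Smith Company\nTelephone: (555) 555-0000\n", 2)))
```

Running this over each converted text file before opening it in Monarch would give the first record in every column enough room above it to fit the template.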

I'll leave it at that point for now, since if that does not work for you the rest of the process, such as it is, will be no help either.

HTH.



Grant

Data Kruncher
November 27th, 2009, 11:17 PM
Greetings all.

I have a solution to this challenge that differs enough from those posted here so far to be worth suggesting as an alternative.

As some components of the solution reinforce topics recently posted on ExcelWithMonarch.com, I've posted the details of this proposed solution as part of the "30 Days to Become a Better Monarch Modeler" (http://excelwithmonarch.com/calcfield/a-challenging-reinforcement) series.

HTH,
Kruncher