MikeMetta
November 24th, 2009, 10:48 AM
I have converted some PDF files into two-column text with Monarch, but the data is inconsistent. Here is a representative sample. Each PDF had a graphic in the upper left-hand corner, which is why there is a blank area there:
Smith Company
10 Company Street, Sacramento CA 95630-6798
Telephone: (555) 555-0000
URL: www.website1.com
J and K Company
123456 New Road, Los Altos Hills CA 94022-4599
Telephone: (555) 555-2468
URL: www.website2.com
East Company XYZ Company
Branch campus of North Orange County Community Company District Subsidiary of Johnson and James Company District Office located in
located in Anaheim, CA Fresno, CA
9999 East Street, Cypress CA 90630-5897 1111 East XYZ Avenue, Fresno CA 93741-0002
Telephone: (555) 555-2222 Telephone: (555) 555-1357
URL: www.anotherwebsite.com URL: www.website3.com
West Company ABC123 Company
112233 West Boulevard, Cupertino CA Subsidiary of Smith and Jones Company District Office located in
95014-5793 Anaheim, CA
Telephone: (555) 555-4444 98765 Another Avenue, Fullerton CA 92832-2095
URL: www.websitehere.com Telephone: (555) 555-7777
URL: www.website4.com
Each entry may or may not have two extra descriptive lines of information after the company name.
There is always a telephone number and a URL for each entry.
And sometimes the zip code will appear on its own line :(
This is just one part of one of many PDF pages (separate files), but as far as I know there are no other anomalies in the data besides the ones listed above.
Could you please help me create model(s) to extract the data? Thank you :D
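
To make the record rules above concrete, here is a rough Python sketch of the logic (this is not a Monarch model, just an illustration). It assumes the two columns have already been separated into a single stream of lines, one entry after another. The names `parse_entries` and `build_entry` and the dictionary keys are made up for this example.

```python
import re

# A line that holds nothing but a ZIP+4 code, e.g. "95014-5793".
ZIP_ONLY_RE = re.compile(r"^\d{5}-\d{4}$")


def build_entry(block):
    """Turn the lines of one entry into a dict.

    Assumes the block starts with the company name and ends with a
    "Telephone:" line followed by a "URL:" line, as stated in the post.
    """
    name = block[0]
    phone = block[-2].split(":", 1)[1].strip()
    url = block[-1].split(":", 1)[1].strip()

    # Everything between the name and the telephone line:
    # optional descriptive lines, then the address.
    middle = block[1:-2]

    # If the zip code wrapped onto its own line, rejoin it to the address.
    if len(middle) >= 2 and ZIP_ONLY_RE.match(middle[-1]):
        middle = middle[:-2] + [middle[-2] + " " + middle[-1]]

    address = middle[-1] if middle else ""
    description = " ".join(middle[:-1])
    return {
        "name": name,
        "description": description,
        "address": address,
        "telephone": phone,
        "url": url,
    }


def parse_entries(lines):
    """Group lines into entries; every entry ends at its URL line."""
    entries = []
    block = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        block.append(line)
        if line.startswith("URL:"):
            entries.append(build_entry(block))
            block = []
    return entries
```

The sketch relies on the URL line always being the last line of an entry, so it works backwards from there: URL, telephone, address (with a wrapped zip code rejoined), and whatever is left between the name and the address is treated as the descriptive lines.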