fantafan
Active member
Note: The 2nd novel is too new and my process is broken, so I had to manually extract the text, which was a giant pain. That also explains the formatting and maybe the occasional mistakes.
What file format was the original? The original Amazon format?
Yes, Amazon Kindle. The newest releases won't download with Kindle 2.4.0, which is the version you need to use with ePubOr at least, and ePubOr is what I've used to crack the encryption. No solution as of yet.
For anyone that doesn't know, EPUBs are zip files with a different file extension. If you rename the file from book.epub to book.zip, you can open it in any zip program. Inside you will find some HTML files, either in the root folder or a sub-folder, and one or more of those will contain the story text. Extract those HTML files into a Windows folder and open them in a web browser. Then you can copy the text using Ctrl-A, Ctrl-C.
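If you'd rather script that than rename and unzip by hand, here's a rough Python sketch using only the standard library. It assumes a DRM-free EPUB whose chapter files end in .html, .xhtml or .htm and are UTF-8 encoded; the names epub_text and _TextCollector are just mine for illustration.

```python
import zipfile
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects every non-empty text node (this includes <title> text)."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())


def epub_text(epub_path):
    """Return {member_name: plain_text} for each HTML file inside the EPUB."""
    result = {}
    # An EPUB is a zip archive, so zipfile can open it without renaming.
    with zipfile.ZipFile(epub_path) as archive:
        for name in archive.namelist():
            if name.lower().endswith((".html", ".xhtml", ".htm")):
                parser = _TextCollector()
                # Assumption: files are UTF-8; bad bytes are replaced, not fatal.
                parser.feed(archive.read(name).decode("utf-8", errors="replace"))
                parser.close()
                result[name] = "\n".join(parser.chunks)
    return result
```

Calling epub_text("book.epub") gives you one text blob per chapter file, keyed by its path inside the archive. Formatting like italics is thrown away, same as with copy and paste into a plain text editor.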
Hope one comes soon. I was generally okay paying for books off Amazon and then converting them, but I'm less likely to buy now with the extreme DRM. My best solution for converting them now is OCR: take screenshots and copy out the text from each page. Very slow going, so I'm only doing this for my favourite authors.
Yeah the change was on ~ May 1st.
Yup. I'm using SnagIt for now, and it's pretty painful. No screenshot required, just drag a square and it OCRs the text and puts it in the clipboard. Does a really good job. Paid program.
Microsoft has a PowerToy named Text Extractor that OCRs screen grabs too. You'd have to test it to see how reliable it is compared to SnagIt.
I just tried the Text Extractor widget. Turns out I'd already installed it with the PowerToys package I got.
I'd like an efficient way to programmatically bypass their "click for more" nonsense and scrape StoriesOnline.net.
It does not recognize any formatting of the text (e.g., italics), so you would lose that.
Second, it puts a paragraph mark at the end of each line as it appears on the screen, with no extra paragraph marks to indicate a new paragraph in the original text. So it's going to take a bit of editing to fix that.
I've seen similar issues with some text files from ASSTR, when it was up, in the past, and I haven't figured out a simple way to work around that.
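One possible workaround for the every-line-is-a-paragraph problem, sketched in Python. This is only a heuristic and an assumption on my part, not something any of the tools above do: a line is treated as the end of a paragraph when it ends in sentence punctuation and is clearly shorter than the widest line. The function name unwrap_lines and the short_ratio cutoff are made up for the example, and it will guess wrong when a paragraph happens to end on a full-width line.

```python
def unwrap_lines(text, short_ratio=0.8):
    """Join hard-wrapped lines into paragraphs using a length heuristic."""
    lines = [ln.rstrip() for ln in text.splitlines() if ln.strip()]
    if not lines:
        return []
    # The widest line approximates the wrap width of the page.
    width = max(len(ln) for ln in lines)
    paragraphs, current = [], []
    for ln in lines:
        current.append(ln)
        ends_sentence = ln.endswith((".", "!", "?", '"', "\u201d"))
        is_short = len(ln) < short_ratio * width
        # Heuristic: a short line that ends a sentence closes the paragraph.
        if ends_sentence and is_short:
            paragraphs.append(" ".join(current))
            current = []
    if current:
        paragraphs.append(" ".join(current))
    return paragraphs
```

Tune short_ratio to taste: a lower value merges more lines together, a higher one splits more eagerly. You'd still want to skim the result for paragraphs that got glued together.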
Notepad++ supports most regex features, and you can ask Microsoft's Copilot AI in MS Edge for help with some pretty funky search-and-replace schemes. Use that to create a macro if you find you have to do the repair frequently.
H;
${x;
s:\r::g;
#hyphenated endings...
s:([a-zA-Z])-\n([a-zA-Z]):\1\2:g;
#Everything else...
s:([^\n])\n([^\n]):\1 \2:g;
#k basic formatting to fix... Modify to preference.
s:-?—-?: — :g;
s: +: :g;
s:[“”]:":g;
# s:[“”]:":g;
# s:“:":g;
# s:”\B:":g;
s:[‘’]:'"'"':g;
# s:’:'"'"':g;
s:©:\&copy;:g
#add html formatting
s:\n([^\n]+):\n<p>\1</p>:g
s:\n<p>-----</p>:\n<hr/>:g
s:\n<p>(C[Hh][Aa][Pp][Tt][Ee][Rr] [a-zA-Z_0-9.\: ?'"'"'-]+|P[Rr][Oo][a-zA-Z]+|E[Pp][Ii][Ll][a-zA-Z]+|ABOUT THE AUTHOR)</p>:\n<center><h3>\1</h3></center>:g
#close emphasis tags: <em>test<em> to <em>test</em> (then I don't have to close them manually)
s:<em>([^\n<>]+)<em>:<em>\1</em>:g
s:<strong>([^\n<>]+)<strong>:<strong>\1</strong>:g
#forceful header spots instead of chapters
s:<H([1-9])>([^\n<>]+):<h\1>\2</h\1>:g
s:<p><h:<center><h:g
s:(</h[1-9]>)</p>:\1</center>:g
#first line is likely title and author... prep it.
s:^\n+::
s:^<p>([^\n]+)</p>\n:<center><h1>\1</h1></center>\n:
#first letter dropcap classes...
s:<FL>(["'"'"']?\w):<div class="first-letter">\1</div>:g
#super/subscript, extends to the end of the word.
s:<(su[bp])>(\w+):<\1>\2</\1>:g
#email and link hyperlink
s:\b(https?\://[^<> \t\r\n]+):<a href="\1">\1</a>:g
s:\b(\w+@\w+\.\w{3}):<a href="mailto\:\1">\1</a>:g
#common fixes
s:\blll\b:I'"'"'ll:g
s:\blI\b:I:g
s:\bl('"'"')(ll|[dm])\b:I\1\2:g
s:\bhomy\b:horny:g
p
}
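For anyone who doesn't read sed, here's a rough Python take on just the first few steps of the script above: strip carriage returns, rejoin words hyphenated across a line break, join wrapped lines, and wrap each paragraph in <p> tags. It's a sketch, not a line-for-line port. It assumes paragraphs are separated by blank lines, it skips the quote, header and typo fixes, and the name reflow_to_html is mine.

```python
import re


def reflow_to_html(text):
    """Unwrap blank-line-separated paragraphs and wrap them in <p> tags."""
    text = text.replace("\r", "")
    # Rejoin words split by a hyphen at the end of a line.
    text = re.sub(r"([A-Za-z])-\n([A-Za-z])", r"\1\2", text)
    # A lone newline is a soft wrap; doubled newlines (blank lines) are kept.
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    text = re.sub(r" +", " ", text)
    out = []
    for para in text.split("\n"):
        para = para.strip()
        if not para:
            continue
        # A line of five dashes becomes a horizontal rule, as in the sed script.
        out.append("<hr/>" if para == "-----" else f"<p>{para}</p>")
    return "\n".join(out)
```

Like the sed version, this does not escape HTML in the input, so inline tags you typed by hand (such as <em>) pass through untouched.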
It's "click for more" ... "click for more"
Python:
import pyautogui
import time
import os

# --- Configuration (You might adjust these defaults) ---
DEFAULT_OUTPUT_FOLDER = "ocr_screenshots"
DEFAULT_PAGE_LOAD_DELAY = 2  # Seconds to wait for the next page to load
DEFAULT_ACTION_DELAY = 0.5  # Seconds to wait between mouse actions


# --- Helper function to get coordinates ---
def get_coordinates(prompt_message):
    """
    Prompts the user to move their mouse to a position and press Enter
    to capture the coordinates.
    """
    input(f"{prompt_message}. Move your mouse to the desired position and press Enter...")
    x, y = pyautogui.position()
    print(f"Captured coordinates: X={x}, Y={y}\n")
    return x, y


# --- Main Script ---
def main():
    print("Welcome to the Automated Screenshotter for OCR!")
    print("----------------------------------------------")
    # pyautogui.FAILSAFE = True  # Optional: Move mouse to top-left corner to abort

    # 1. Get user input
    num_pages = int(input("How many pages do you want to capture? "))
    output_folder = input(f"Enter output folder name (default: {DEFAULT_OUTPUT_FOLDER}): ") or DEFAULT_OUTPUT_FOLDER
    page_load_delay = float(input(f"Enter delay after clicking 'Next' (seconds, default: {DEFAULT_PAGE_LOAD_DELAY}): ") or DEFAULT_PAGE_LOAD_DELAY)
    action_delay = float(input(f"Enter delay between mouse actions (seconds, default: {DEFAULT_ACTION_DELAY}): ") or DEFAULT_ACTION_DELAY)

    print("\n--- Coordinate Setup ---")
    print("You will now be asked to identify key screen locations.")
    print("Make sure the old program window is visible and in the position it will be during capture.")

    # 2. Get screenshot region coordinates
    print("\nStep 1: Define Screenshot Area")
    x1, y1 = get_coordinates("Identify the TOP-LEFT corner of the content area to screenshot")
    x2, y2 = get_coordinates("Identify the BOTTOM-RIGHT corner of the content area to screenshot")

    # Calculate width and height for the screenshot region
    screenshot_width = x2 - x1
    screenshot_height = y2 - y1
    if screenshot_width <= 0 or screenshot_height <= 0:
        print("Error: Screenshot region coordinates are invalid (e.g., top-left is to the right of bottom-right).")
        return

    # 3. Get "Next Page" button coordinates
    print("\nStep 2: Identify 'Next Page' Button")
    button_x, button_y = get_coordinates("Identify the CENTER of the 'Next Page' button")

    # 4. Create output folder if it doesn't exist
    os.makedirs(output_folder, exist_ok=True)
    print(f"\nScreenshots will be saved in: '{os.path.abspath(output_folder)}'")

    print("\n--- Starting Capture Process ---")
    print("You have 5 seconds to switch to the old program and ensure the first page is visible.")
    print("DO NOT move the mouse or use the keyboard during the capture process.")
    time.sleep(5)

    for i in range(num_pages):
        current_page_num = i + 1
        print(f"Processing page {current_page_num} of {num_pages}...")

        # a. Take screenshot
        try:
            # Region is defined by (left, top, width, height)
            screenshot = pyautogui.screenshot(region=(x1, y1, screenshot_width, screenshot_height))
            time.sleep(action_delay)  # Brief pause
        except Exception as e:
            print(f"Error taking screenshot for page {current_page_num}: {e}")
            print("Aborting.")
            return

        # b. Save screenshot
        filename = os.path.join(output_folder, f"page_{current_page_num:04d}.png")  # e.g., page_0001.png
        try:
            screenshot.save(filename)
            print(f" Saved: {filename}")
        except Exception as e:
            print(f"Error saving screenshot {filename}: {e}")
            # Decide if you want to continue or abort
            # continue

        # c. Click "Next Page" button (if not the last page)
        if current_page_num < num_pages:
            try:
                pyautogui.moveTo(button_x, button_y, duration=0.2)  # Move mouse smoothly
                time.sleep(action_delay)
                pyautogui.click()
                print(" Clicked 'Next Page' button.")
            except Exception as e:
                print(f"Error clicking 'Next Page' button: {e}")
                print("Aborting.")
                return

            # d. Wait for the next page to load
            print(f" Waiting {page_load_delay} seconds for page to load...")
            time.sleep(page_load_delay)
        else:
            print(" Last page processed.")

    print("\n--- Capture Complete! ---")
    print(f"All {num_pages} pages captured and saved to '{output_folder}'.")


if __name__ == "__main__":
    try:
        main()
    except KeyboardInterrupt:
        print("\nProcess interrupted by user. Exiting.")
    except Exception as e:
        print(f"\nAn unexpected error occurred: {e}")